While AI agents and LLMs have become increasingly accessible through simple API calls, systematically evaluating and maintaining their quality and performance has remained complex—until now.

This guide explains how developers can leverage the combined power of CircleCI’s CI/CD platform and Contextual AI’s LMUnit model—an LLM optimized for evaluating agents’ and other LLMs’ responses with natural language unit tests—to build reliable AI systems. We’ll demonstrate a practical approach that brings the rigor of software engineering testing practices to agent and LLM development, enabling you to implement automated CI/CD testing, detect regressions before production, and create feedback loops that continuously improve your AI systems. 

Whether you’re an AI engineer building a new AI feature or a team looking to ensure consistent performance across model updates, this workflow will help you deploy agents and LLMs with confidence. Join us as we walk through an example that shows how these tools can transform vibes-based testing into a systematic quality assurance process for the AI era.

Natural Language Unit Tests with the LMUnit evaluation model

Current approaches to AI evaluation have significant limitations: custom NLP metrics only capture surface-level similarities, human evaluations are inconsistent and expensive, and frontier models used as judges can be costly while still exhibiting biases. 

To address these challenges and bring the structure and rigor of traditional software engineering testing to AI evaluation, Contextual AI introduced the natural language unit tests paradigm with LMUnit, a language model specialized for agent and LLM evaluation. In this approach, developers and stakeholders write specific, testable, natural language statements about desired response properties for each query in their evaluation set. 

For example, for the query “What are the health benefits of regular exercise?”, some appropriate unit tests are:

  • “Does the response mention both physical and mental health benefits of exercise?”
  • “Is the response succinct without omitting essential information?”
  • “Does the response avoid making specific medical recommendations that should come from a healthcare professional?” 

LMUnit is a state-of-the-art, specialized model optimized for evaluating these unit tests. It outperforms general-purpose models on a variety of evaluation tasks while providing greater interpretability at a lower cost. The model takes a query, a response, and a natural language unit test, then produces a score from 1 to 5 indicating how well the response satisfies the unit test.
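
To make the model’s input/output contract concrete, here is a minimal sketch of scoring a single (query, response, unit test) triple over HTTP from Python. The endpoint path, payload field names, and environment variable name shown here are assumptions for illustration only; see Contextual AI’s Getting Started page for the exact API details.

```python
# Minimal sketch of one LMUnit call. The endpoint path and payload field names
# are assumptions for illustration; check Contextual AI's API docs for the
# exact contract.
import os

import requests

API_KEY = os.environ["CONTEXTUAL_API_KEY"]  # example variable name

payload = {
    "query": "What are the health benefits of regular exercise?",
    "response": (
        "Regular exercise improves cardiovascular health and muscle strength, "
        "and it can also reduce stress and symptoms of anxiety."
    ),
    "unit_test": (
        "Does the response mention both physical and mental health benefits "
        "of exercise?"
    ),
}

resp = requests.post(
    "https://api.contextual.ai/v1/lmunit",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expected to include a score in the 1-5 range
```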

This approach enables fine-grained assessment of specific qualities, custom evaluation criteria aligned with application needs, actionable feedback that pinpoints areas for improvement, and consistent standards that can be tracked over time.

Why you should automate your unit test evaluation

Many people start out testing their prompts on “vibes”:

  1. Write a prompt
  2. Submit it to the model
  3. Check the results and see if they “feel” right
  4. Repeat until you get a good response most of the time

This “vibes-based” approach works initially but becomes limited as workloads scale. Models change over time, rendering previously effective prompting strategies less useful. New prompts may introduce regressions for cases that used to work, and subjective assessments make it difficult to track improvements or share standards across team members.

Natural language unit tests with LMUnit provide a systematic alternative that aligns with established software engineering best practices. By defining clear, testable criteria for what makes a good response, teams can move from subjective assessments to objective, reproducible evaluations that integrate into CI/CD workflows. This creates a safety net that catches regressions before they reach production, enables controlled experimentation, and provides quantitative feedback on whether changes actually improve the system. A comprehensive, consistently executed test suite lets developers confidently modify prompts and switch underlying models while ensuring the application continues to serve users effectively, bringing agentic and LLM applications closer to the reliability standards expected of traditional software.

How to automatically evaluate natural language unit tests with LMUnit in your CircleCI pipeline

We’ve developed a production example of how you can evaluate natural language unit tests with LMUnit in a CircleCI pipeline. In this section, we’ll walk you through how to build it. 

Setup

Sign up for a CircleCI account and connect a GitHub repository following our Quickstart guide: https://circleci.com/docs/getting-started/

Sign up for a Contextual AI account, visit the Getting Started page, and get an API key for the LMUnit Component API. 

Storing the LMUnit API token

Create a Context

CircleCI provides a secret store feature called Contexts where you can save sensitive credentials in an encrypted database and then reference them in your build pipelines.

  1. Go to Organization Settings
  2. Select Contexts
  3. Create a new Context
  4. Name your context lmunit-quickstart. We’ll use this name later in your pipeline.

Add your API key
  1. In the lmunit-quickstart context, add an environment variable holding your Contextual AI API key (for example, CONTEXTUAL_API_KEY). Your pipeline jobs will read the key from this variable.
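
Inside a pipeline job, the value stored in the context is exposed as an environment variable, so your evaluation script can read it at runtime instead of hard-coding it. A minimal sketch, assuming the example variable name CONTEXTUAL_API_KEY from the step above:

```python
import os

# The variable name is whatever you chose when adding the key to the
# lmunit-quickstart context; CONTEXTUAL_API_KEY is only an example.
api_key = os.environ.get("CONTEXTUAL_API_KEY")
if not api_key:
    raise RuntimeError("CONTEXTUAL_API_KEY is not set; check your CircleCI context.")
```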

Creating an evaluation set

An evaluation set consists of queries, unit tests, and optional knowledge chunks. We suggest organizing these in a JSONL file so you can update it over time and keep it accessible to non-technical stakeholders. We provide an example in the repo; a minimal sketch of one possible format follows the list below.

  • Queries: We suggest that you take real user queries or create queries that are representative of real cases you want your agent or LLM to handle. 
  • Unit tests: Natural language statements describing desired qualities of the response. Unit tests can either apply globally to all queries in your evaluation set (e.g., “Is the response free from harmful content?”) or apply only to specific queries (e.g., “Does the response mention cardiovascular benefits for this exercise question?”). You can write unit tests either manually or with a separate LLM. 
  • Knowledge: If your use case involves using retrieved external knowledge in addition to an LLM’s parametric knowledge, then you may choose to include the relevant external knowledge in your evaluation set. 
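
The exact schema is up to you, and the example repo may organize its file differently. As one possible sketch, each JSONL line could hold a query, its query-specific unit tests, and optional knowledge chunks:

```python
import json

# Illustrative schema only -- field names are not prescribed by LMUnit.
# Global unit tests (e.g., "Is the response free from harmful content?") can
# live in a separate file and be merged in by your evaluation script.
eval_set = [
    {
        "query": "What are the health benefits of regular exercise?",
        "unit_tests": [
            "Does the response mention both physical and mental health benefits of exercise?",
            "Does the response avoid making specific medical recommendations that should come from a healthcare professional?",
        ],
        "knowledge": [],  # optional retrieved chunks for RAG-style use cases
    },
]

# Write one JSON object per line so the file stays easy to diff and append to.
with open("eval_set.jsonl", "w") as f:
    for example in eval_set:
        f.write(json.dumps(example) + "\n")
```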

Testing the evaluation set

After you develop your evaluation set, you need to write a script that generates responses to the queries and evaluates these responses against the unit tests. We provide a script in the example repo.

LMUnit evaluates each unit test on a scale from 1 to 5. The discrete scores 1, 2, 3, 4, and 5 roughly correspond to “Strongly fails,” “Fails,” “Neutral,” “Passes,” and “Strongly passes,” respectively. This granular scoring allows for nuanced evaluation beyond simple pass/fail tests, providing more actionable insights into model performance.

You can apply a different threshold to each unit test or a single threshold across all unit tests. After you’ve defined these thresholds, you need to check each (query, response, unit test) combination in your evaluation set and ensure that LMUnit’s score meets the threshold you defined; one way to structure that check is sketched below.
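
As a sketch of what that check might look like (this is not the exact script from the example repo), you could structure the evaluation as a parametrized pytest suite: each (query, response, unit test) combination becomes one test case that fails when LMUnit’s score falls below the threshold. The generate_response and score_unit_test helpers are hypothetical placeholders for your application’s generation call and a small wrapper around an LMUnit API call like the one sketched earlier.

```python
# test_lmunit.py -- illustrative sketch, not the exact script from the example repo.
import json

import pytest

from my_app import generate_response       # placeholder: your app's generation call
from lmunit_client import score_unit_test  # placeholder: wraps the LMUnit API call

DEFAULT_THRESHOLD = 4.0  # example value; you can also set thresholds per unit test


def load_cases(path="eval_set.jsonl"):
    """Generate a response for each query and pair it with every unit test."""
    cases = []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            # For large evaluation sets you may prefer to generate and cache
            # responses in a separate step rather than at test-collection time.
            response = generate_response(example["query"])
            for unit_test in example["unit_tests"]:
                cases.append((example["query"], response, unit_test))
    return cases


@pytest.mark.parametrize("query,response,unit_test", load_cases())
def test_response_satisfies_unit_test(query, response, unit_test):
    score = score_unit_test(query=query, response=response, unit_test=unit_test)
    assert score >= DEFAULT_THRESHOLD, (
        f"LMUnit scored {score} (< {DEFAULT_THRESHOLD}) for unit test: {unit_test!r}"
    )
```

Running this with pytest --junitxml=test-results/junit.xml produces the JUnit-formatted results that CircleCI can store and display, which we’ll wire up in the pipeline below.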

Creating the pipeline

CircleCI uses a YAML file to define the build pipeline for your application. We’ll set up a pipeline that runs LMUnit tests every time you push to your repository. 
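
Here is a hedged sketch of what such a .circleci/config.yml could look like, assuming the pytest-style suite above and a requirements.txt in the repo; the exact jobs and commands in the example repo may differ:

```yaml
# .circleci/config.yml -- illustrative sketch; your project's config may differ.
version: 2.1

jobs:
  run-lmunit-tests:
    docker:
      - image: cimg/python:3.12
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-{{ checksum "requirements.txt" }}
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - save_cache:
          key: deps-{{ checksum "requirements.txt" }}
          paths:
            - ~/.cache/pip
      - run:
          name: Run LMUnit tests
          command: pytest --junitxml=test-results/junit.xml
      - store_test_results:
          path: test-results

workflows:
  test:
    jobs:
      - run-lmunit-tests:
          context: lmunit-quickstart  # makes the stored API key available to the job
```

The store_test_results step is what surfaces the JUnit XML in the job’s Tests tab and in Test Insights, and attaching the lmunit-quickstart context is what injects your Contextual AI API key as an environment variable.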

Running the pipeline

  1. Open your LMUnit tests in your editor
  2. Change one of the tests
  3. Save your changes
  4. Commit and push to your repository
  5. Open your CircleCI dashboard to see the test results

You can view an overview of your entire workflow; in this example, a single job runs your LMUnit tests. In a production application, you may also have jobs that handle deploying your application once the tests pass.

CircleCI will handle installing and caching dependencies to keep runtimes fast over time, as well as running your tests and storing the test results. 

Test results are stored as JUnit-formatted XML and can be seen in the Tests tab on your job. In this example, all tests pass. You can also view your test suite’s performance over time using CircleCI’s Test Insights feature.

Finally, CircleCI will send the results back to your VCS as a status check to prevent merging breaking changes to your AI application.

Improving your pipeline over time

Once you’ve established your initial testing framework with CircleCI and LMUnit, the next challenge is maintaining and evolving your evaluation approach as your application grows. An effective evaluation strategy must adapt to new user behaviors, changing requirements, and emerging edge cases.

As your AI system interacts with more users, you’ll discover new query patterns that should be incorporated into your testing:

  1. Monitor user interactions: Set up logging to capture real user queries and AI system responses, particularly noting cases where users express dissatisfaction or rephrase their questions.
  2. Identify coverage gaps: Regularly analyze your evaluation set to identify underrepresented topics, edge cases, or emergent user behaviors not covered by existing tests.
  3. Add new test cases: When you identify valuable new test cases, simply add them to your test JSONL file.
  4. Version your test sets: As your test suite grows, consider versioning your test sets to track how evaluation criteria evolve over time.

Beyond adding new queries, you’ll also need to refine your unit tests as you learn more about what makes responses effective for your specific use case:

  1. Analyze user feedback: Pay special attention to user feedback that highlights quality dimensions you hadn’t previously considered.
  2. Add domain-specific unit tests: As you gain expertise in your application domain, create more specialized unit tests that reflect deeper domain knowledge. 
  3. Adjust thresholds based on experience: You may find that some criteria need stricter thresholds than others. Update your threshold values based on what you learn about their impact on user satisfaction.

By continuously refining both your evaluation set queries and unit tests, you create a virtuous cycle of improvement. Your agents and LLMs become increasingly aligned with user needs, while your testing framework provides the safety net needed to confidently implement changes and updates. This ongoing process transforms AI development from subjective guesswork into a rigorous engineering discipline, enabling you to deliver consistent, high-quality AI experiences that truly serve your users’ needs.

Getting Started

You can sign up for a Contextual AI account and a CircleCI account for free. 

To use LMUnit, visit Contextual AI’s Getting Started page and get an API key for the LMUnit Component API. The page also includes code examples and a UI playground. If you have any questions on LMUnit, email lmunit-feedback@contextual.ai. To request custom rate limits or enterprise pricing, please contact us.

To use CircleCI, set up a project and connect it to a GitHub repository. You can then add an API key for LMUnit and use it in your pipelines. For any questions on using CircleCI for building AI applications, please email ai-feedback@circleci.com.