
Making ACE Work in Production: Two Lessons from the Real World

10 min read

Zhipeng Ye, Member of Technical Staff
Aravind Mohan, Head of Data Science
Casey Fitzpatrick, Member of Technical Staff

Production agents drift. Context that worked well last month can quietly degrade performance today.

Agentic Context Engineering (ACE) was designed to solve this: an algorithm that continuously refines an agent's context based on real interactions, so performance improves over time rather than erodes. But getting ACE to work reliably in production is a different challenge than getting it to work in a lab.

We deployed ACE against real enterprise workloads — agentic search tasks across engineering and finance domains — and identified two practical blockers that determine whether ACE improves your agent or regresses it: feedback quality and data efficiency. Both have solutions that aren't obvious from the original algorithm, and both are directly actionable.

The core finding: how you evaluate your agent matters as much as the algorithm optimizing it. And when data is scarce, the right prior context can make or break ACE's ability to learn.

Background: What ACE Does

ACE is a self-evolving optimization framework that adjusts an agent's prompt to improve retrieval and generation quality over time. It does this through a three-step loop:

  • Generator runs the current task using the existing playbook
  • Reflector reviews the outcome and identifies what helped or hurt
  • Curator/Mutator integrates updates into the playbook, which is added to the system prompt

The playbook evolves iteratively — each cycle either commits the update or rolls it back based on a self-evaluation step that checks whether the new version actually outperforms the old one.
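The loop above can be sketched in a few lines of Python. This is a minimal illustration; the function names (`run_generator`, `run_reflector`, `apply_update`, `evaluate`) are hypothetical stand-ins, not the actual ACE implementation:

```python
def ace_step(playbook, task, run_generator, run_reflector, apply_update, evaluate):
    """Run one ACE iteration and keep the playbook update only if it helps."""
    trace = run_generator(playbook, task)           # 1. Generator runs the task
    reflection = run_reflector(playbook, trace)     # 2. Reflector labels what helped/hurt
    candidate = apply_update(playbook, reflection)  # 3. Curator/Mutator edits the playbook
    # Self-evaluation gate: commit the new playbook only if it outperforms the old one.
    if evaluate(candidate, task) > evaluate(playbook, task):
        return candidate
    return playbook
```

The commit-or-rollback gate is what keeps a bad reflection from permanently degrading the playbook.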


For a deeper treatment of the algorithm, see the original ACE paper. This post is about what we learned running it in production.

Why Production Is Harder

ACE learns by analyzing past interaction traces and deciding which parts of the playbook helped or hurt performance. This is called the credit assignment problem: without gradient-based optimization, the system has to reason — using an LLM — about which guidance led to which outcomes.

Two things make the credit assignment process hard in production:

  1. Feedback is usually weak. Most production systems don't have labeled ground truth. Without a strong feedback signal, the reflector can preserve ineffective rules, discard useful ones, or overfit to noise in a small sample.
  2. Data is scarce early on. Customers expect optimization to show value quickly. But ACE's default configuration struggles to learn effectively from limited interaction data — especially when onboarding a new agent or use case.

These aren't edge cases. They're the default state of enterprise deployment.


Finding 1: Detailed LLM Self-Eval Is a Strong Alternative to Human-Labeled Ground Truth

To understand how feedback quality shapes learning, we ran a controlled ablation study holding everything constant — model, training data, mutator, and optimization hyperparameters — while varying only the feedback strategy across four conditions:

| Condition | Feedback Strategy | Description |
| --- | --- | --- |
| A1 | Single Metric | Cosine similarity between generated and gold answer embeddings |
| A2 | Multi Metric | Cosine similarity plus token-level precision, recall, and F1 between generated and gold answers, provided as separate raw scores |
| A3 | LLM Equivalence | A binary LLM judge determining semantic equivalence to the gold answer |
| A4 | LLM Self-Eval | LLM evaluates relevance, groundedness, completeness, and clarity without seeing the gold answer |

All conditions were evaluated on the same held-out test set using an independent multi-metric judge (WeightedLMUnitScore), so final results are directly comparable.
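For concreteness, the reference-based signals used in A1 and A2 can be sketched as follows. This is a minimal illustration of the metric shapes, not the exact metrics code used in the ablation:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (condition A1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def token_prf(generated, gold):
    """Token-level precision, recall, and F1 vs. the gold answer (part of A2)."""
    gen, ref = Counter(generated.split()), Counter(gold.split())
    overlap = sum((gen & ref).values())
    p = overlap / max(sum(gen.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A2 passes each of these numbers to the reflector as a separate raw score rather than collapsing them into one.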

The most surprising result: A4 outperformed A1, A2, and A3 — including all approaches that compared directly against ground truth.

A strong LLM judge can infer quality from the question–context–response triple alone, evaluating groundedness, completeness, and coherence without a reference answer. Presenting these multi-criteria scores as feedback helped ACE assign credit more effectively.

This has immediate practical consequences. Ground truth labels are expensive to collect and hard to maintain as documents and tasks evolve. When ground truth is too expensive to acquire, LLM self-eval across multiple criteria is a strong alternative feedback strategy, and in our experiments it yielded more reliable optimization in the process.

Equally important: the wrong feedback function doesn't just slow learning — it can actively degrade context quality below the starting baseline. Feedback design is not a tuning detail. It's a core architectural decision.

Recommendation: For production deployments, use LLM-based self-evaluation with or without ground truth as your default feedback strategy. It produces richer, more diverse learning signals than binary or embedding-based comparisons against gold labels.
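One way to implement such a reference-free judge is a rubric prompt over the question–context–response triple. The template below is a hypothetical sketch; the judge model and the API call that scores it are whatever your stack provides and are not shown here:

```python
# Criteria mirror condition A4 from the ablation above.
CRITERIA = ["relevance", "groundedness", "completeness", "clarity"]

def build_self_eval_prompt(question, context, response):
    """Build a reference-free grading prompt for an LLM judge."""
    rubric = "\n".join(
        f"- {c}: score 1-5 with a one-sentence justification" for c in CRITERIA
    )
    return (
        "You are grading an agent's answer. No reference answer is available;\n"
        "judge only from the question, retrieved context, and response.\n\n"
        f"Question:\n{question}\n\n"
        f"Context:\n{context}\n\n"
        f"Response:\n{response}\n\n"
        f"Score each criterion:\n{rubric}\n"
        "Return one line per criterion as `name: score - justification`."
    )
```

Returning per-criterion scores with justifications, rather than a single pass/fail verdict, is what gives the reflector enough signal to decide which playbook rules to keep.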


Finding 2: Prior Context Alleviates the Cold Start Problem

ACE learns from interaction traces. But when a customer deploys a new agent or a new use case, there are few traces to learn from. In our production setting, this cold start problem was common — and waiting to accumulate sufficient data is rarely realistic when users are already forming impressions.

We tested whether providing ACE with richer prior context about the agent — its primary use case, the types of data it accesses, and examples of good and bad responses — could improve learning efficiency when only 5 interaction traces were available. We used this context to generate initial playbooks that seed ACE's optimization, so the algorithm starts from our prior knowledge rather than inferring it from scratch.

The result: agents initialized with prior context produced a 7% additional improvement over agents optimized from traces alone.

When the reflector and mutator understand the agent's purpose and data environment before optimization begins, they generate more targeted playbook updates — identifying relevant patterns faster rather than inferring the domain from scratch across limited traces.

The quality of ACE's initial brief matters. Documenting an agent's intended role, data types, and known failure modes before running optimization directly improves how quickly the playbook reaches production quality.

Recommendation: Before running ACE, provide explicit agent context: the agent's primary purpose, the type of data it accesses, known failure modes, and at least one example of a good and a poor response. This prior substantially improves optimization quality under limited data conditions.
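As an illustration, that brief can be assembled into a seed playbook string injected into the system prompt. The field names here are illustrative, not a fixed ACE schema:

```python
def seed_playbook(purpose, data_sources, failure_modes, good_example, bad_example):
    """Assemble an agent brief into a plain-text seed playbook."""
    return "\n".join([
        f"Agent purpose: {purpose}",
        f"Data accessed: {', '.join(data_sources)}",
        "Known failure modes:",
        *[f"- {fm}" for fm in failure_modes],
        f"Example of a good response: {good_example}",
        f"Example of a poor response: {bad_example}",
    ])
```

The point is not the format but the content: the reflector and mutator see this text on iteration one, before any traces exist.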


What This Means for Agent Development

Building effective agents requires more than adding information to a prompt. It requires systematically shaping, evaluating, and refining the context with both engineering and applied research.

Our results show that the choices surrounding ACE — not just the algorithm itself — determine whether an agent improves or regresses:

  • The feedback function used during optimization can determine whether the playbook improves or degrades
  • The prior context provided before optimization begins directly affects how quickly ACE reaches useful performance
  • A strong lab result does not guarantee production success. The algorithm still needs to be optimized for the business goal and task setup.

The most effective agents aren't the ones with the most context — they're the ones whose context is continuously tested, refined, and adapted as production conditions change.

A Note on Evaluation

We measured performance using WeightedLMUnitScore, a multi-faceted LLM-based evaluation framework we developed to replace binary pass/fail judgments with continuous, multi-dimensional scoring across four criteria:

  • Accuracy (factual correctness vs. ground truth): 40%
  • Quality (overall helpfulness): 30%
  • Equivalence (semantic similarity vs. ground truth): 20%
  • Readability (structure and formatting): 10%
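Assuming each criterion is already scored on a common 0–1 scale, the weighted combination is a straightforward sum; the weights below come from the breakdown above:

```python
WEIGHTS = {"accuracy": 0.40, "quality": 0.30, "equivalence": 0.20, "readability": 0.10}

def weighted_lm_unit_score(scores):
    """Combine per-criterion 0-1 scores into one continuous score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

The continuous output is what lets it replace binary pass/fail judgments during final evaluation.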

For more on the methodology, see our earlier post on diagnostic metrics for agent evaluation at scale.


Learn More

For more on the underlying algorithm, see the ACE paper. Questions or feedback are welcome at research@contextual.ai.

Appendix: Experiment Configurations

Experiment 1 - Static QA Baseline Result (HotpotQA)

This publicly available dataset can be found on HuggingFace.

| Parameter | Value |
| --- | --- |
| Train set | 500 |
| Validation set | 30 |
| Test set | 100 |
| Batch size | 6 |
| Max iterations | 10 |
| Early stopping | 5 |
| ACE mutator/reflector model | gpt-4o |
| ACE generator model | gpt-4o |
| Feedback function | exact match |
| Final eval function | exact match |

Experiment 2 - Agentic Search Baseline Result (Domain Dataset)

| Parameter | Value |
| --- | --- |
| Train set | 100 |
| Validation set | 10 |
| Test set | 30 |
| Batch size | 6 |
| Max iterations | 5 |
| Early stopping | 2 |
| ACE mutator/reflector model | claude-sonnet-4-6 |
| ACE generator model | claude-sonnet-4-6 |
| Feedback function | cosine similarity score |
| Final eval function | WeightedLMUnitScore, gpt-5.2 |

Experiment 3 - Feedback Strategy Ablation Experiment Config

| Parameter | Value |
| --- | --- |
| Train set | 100 |
| Validation set | 10 |
| Test set | 30 |
| Batch size | 6 |
| Max iterations | 5 |
| Early stopping | 2 |
| ACE mutator/reflector model | claude-sonnet-4-6 |
| ACE generator model | claude-sonnet-4-6 |
| Feedback function | varies by condition |
| Final eval function | WeightedLMUnitScore, gpt-5.2 |