prompt engineering · intermediate

Building an evaluation pipeline for LLM applications

Systematic prompt evaluation with automated test generation, dual grading systems, and measurable score progression across iterations.

claude · evaluation · llm · prompt-engineering · testing

Most LLM content teaches how to write a better prompt. Very little teaches how to prove the prompt works. This article covers the evaluation methodology from Anthropic’s Claude API course — a systematic pipeline for measuring and improving prompt performance.

The three paths after writing a prompt

Every developer faces the same fork after drafting a prompt:

  1. Test once and ship — Run it on a couple of examples, eyeball the results, deploy. This is fast but dangerous.
  2. Test a few times and tweak — Try it on more examples, make some adjustments. Better, but there is no way to know if the last change made things better or worse.
  3. Build an evaluation pipeline — Systematic testing with scored results, measurable comparisons across iterations, and confidence that changes are improvements.

The course teaches the third path.

Generate a test dataset

Before evaluating, you need test cases. Rather than hand-writing examples, the course teaches automated dataset generation: describe your task and the types of inputs you expect, and Claude generates a representative set of test cases.

Each test case pairs an input with the expected behavior, and the dataset should include both straightforward examples and edge cases. A good dataset covers the distribution of real inputs — not just the happy path.
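As a sketch, a generated dataset can be stored as a list of input/expected pairs. The schema below is illustrative (the `category` tag in particular is an addition for tracking coverage, not something the course prescribes), using the animal-legs task that appears later in the Prompt Foo examples:

```python
from collections import Counter

# Hypothetical schema for a generated dataset: each case pairs an input
# with the expected behavior, tagged so coverage is visible at a glance.
test_cases = [
    {"input": "The animal is a dog.", "expected": "4", "category": "happy-path"},
    {"input": "The animal is a human.", "expected": "2", "category": "happy-path"},
    {"input": "The animal is a snake.", "expected": "0", "category": "edge-case"},
    {"input": "No animal is mentioned here.", "expected": "unknown", "category": "edge-case"},
]

def coverage(cases):
    """Count cases per category to confirm the dataset is not all happy path."""
    return Counter(c["category"] for c in cases)
```

A quick `coverage(test_cases)` check before running an eval makes it obvious when the generated set skews toward easy inputs.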

The dual grading system

A single grading approach is insufficient. The course combines two complementary methods:

Code-based grading

Validates structural correctness — things that have objective right and wrong answers:

code-graders.py python
import ast
import json
import re

def validate_json_output(response_text):
    try:
        json.loads(response_text)
        return {"score": 1.0, "reason": "Valid JSON"}
    except json.JSONDecodeError as e:
        return {"score": 0.0, "reason": f"Invalid JSON: {e}"}

def validate_python_syntax(code_text):
    try:
        ast.parse(code_text)
        return {"score": 1.0, "reason": "Valid Python syntax"}
    except SyntaxError as e:
        return {"score": 0.0, "reason": f"Syntax error: {e}"}

def validate_regex_pattern(pattern_text):
    try:
        re.compile(pattern_text)
        return {"score": 1.0, "reason": "Valid regex"}
    except re.error as e:
        return {"score": 0.0, "reason": f"Invalid regex: {e}"}

Code-based grading is fast, deterministic, and cheap — use it for anything with a structural answer.

Model-based grading

Validates quality — things that require judgment: Does the response follow instructions? Is the tone appropriate? Is the reasoning sound?

The grader is Claude itself, evaluating another Claude response against a rubric:

model-grader-prompt.txt text
Grade this response on a scale of 1-10 for how well it follows the instructions.
Identify specific strengths and weaknesses, explain your reasoning, then provide a score.
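The grader's free-text reasoning still has to be reduced to a number. A minimal sketch of score extraction, assuming the grader is instructed to end with a line like `Score: 8` (the exact output convention is an assumption, not something the course fixes):

```python
import re

def extract_score(grader_response):
    """Pull the numeric score out of the grader's free-text reasoning.

    Assumes the grader was asked to reason first and end with 'Score: N'.
    Returns None when no score is found, so callers can flag the case
    for re-grading instead of silently defaulting.
    """
    match = re.search(r"score[:\s]+(\d+(?:\.\d+)?)", grader_response, re.IGNORECASE)
    return float(match.group(1)) if match else None
```

Asking for reasoning before the number, then parsing the number out, keeps the discriminative benefit of the rubric while still producing an aggregatable score.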

Combining the two

Run code-based checks first (fast, catch syntax errors immediately), then model-based grading (deeper, catches instruction-following failures). Average the scores, and produce an evaluation report.
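A sketch of that combined flow, with the model grader passed in as a callable so the ordering is visible (the function names and the skip-on-structural-failure shortcut are illustrative choices, not the course's exact implementation):

```python
import json

def json_grader(text):
    """Example code-based grader: structural validity of JSON output."""
    try:
        json.loads(text)
        return {"score": 1.0, "reason": "Valid JSON"}
    except json.JSONDecodeError as e:
        return {"score": 0.0, "reason": f"Invalid JSON: {e}"}

def evaluate_case(response_text, code_graders, model_grader):
    """Run cheap deterministic checks first; only pay for the model grade if they pass."""
    results = [g(response_text) for g in code_graders]
    if any(r["score"] == 0.0 for r in results):
        # Structural failure: no point spending a model call on quality.
        return {"score": 0.0, "details": results}
    results.append(model_grader(response_text))
    score = sum(r["score"] for r in results) / len(results)
    return {"score": score, "details": results}
```

Short-circuiting on structural failures keeps the expensive model-graded pass focused on responses that are at least well-formed.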

The iterative improvement cycle

With an evaluation pipeline in place, development becomes a loop:

  1. Set a goal — Define what “good enough” means in terms of scores
  2. Write a prompt — Start with a clear first draft
  3. Run the evaluation — Get scored results across all test cases
  4. Identify failure patterns — Look at which cases scored poorly and why
  5. Apply a single technique — Change one thing (add XML tags, provide an example, clarify instructions)
  6. Re-evaluate — Run the same test cases and compare scores
Example score progression across iterations

| Iteration   | Change                       | Average Score |
| ----------- | ---------------------------- | ------------- |
| Baseline    | First draft prompt           | 4.7           |
| Iteration 1 | Added XML tags for structure | 6.2           |
| Iteration 2 | Added one example            | 7.5           |
| Iteration 3 | Clarified edge case handling | 8.8           |

This progression tells a story: each change moved the score measurably upward. Without an eval pipeline, you would have no way to know whether iteration 3 was actually better than iteration 1.
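A small helper (illustrative, not from the course) makes "measurably upward" concrete by comparing per-case scores between two runs over the same test set:

```python
def compare_runs(baseline_scores, candidate_scores):
    """Per-case deltas between two eval runs.

    Only meaningful when both runs used identical test cases in the
    same order -- which is exactly why the test set must stay stable.
    """
    assert len(baseline_scores) == len(candidate_scores), "test sets must match"
    deltas = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "improvements": sum(1 for d in deltas if d > 0),
        "regressions": sum(1 for d in deltas if d < 0),
    }
```

Reporting improvements and regressions separately, not just the mean, catches the common case where a change helps most inputs while quietly breaking a few.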

Production considerations

The course teaches several practical details for production evaluations:

  • Concurrency control (max_concurrent_tasks): Throttle parallel evaluations to avoid rate limiting
  • Pre-filled assistant messages: Start the assistant turn with ```json or ```code to guide output format without explicit instructions
  • Score aggregation with context: Don’t just average — report which categories improved and which regressed
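A sketch of the concurrency throttle using asyncio (the `max_concurrent_tasks` name comes from the course; the semaphore-based implementation is one reasonable way to realize it):

```python
import asyncio

async def run_evals(cases, eval_fn, max_concurrent_tasks=5):
    """Evaluate all cases concurrently, never more than max_concurrent_tasks at once."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def run_one(case):
        async with semaphore:
            return await eval_fn(case)

    # gather preserves input order, so results line up with cases.
    return await asyncio.gather(*(run_one(c) for c in cases))
```

Capping concurrency this way keeps the full test set running in parallel while staying under the API's rate limits.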

Prompt Foo: evaluation as configuration

The prompt_evaluations course introduces Prompt Foo (promptfoo), an open-source framework that treats evaluation as declarative configuration. Instead of hand-writing eval loops, you define providers (models), prompts, and tests in a single promptfooconfig.yaml file.

Why Prompt Foo over hand-rolled evals

Workbench vs. hand-rolled vs. Prompt Foo

| Dimension       | Workbench                | Hand-rolled            | Prompt Foo                                   |
| --------------- | ------------------------ | ---------------------- | -------------------------------------------- |
| Grading         | Human (manual dropdown)  | Code you write         | Built-in + custom assertions                 |
| Multi-model     | Single model at a time   | Requires extra code    | Add a provider line                          |
| Reproducibility | Manual re-entry          | Depends on your script | Declarative, rerunnable with one command     |
| Scale           | Row by row               | Script-dependent       | Batch across many prompts, models, and cases |
| Output          | Side-by-side in browser  | Print statements       | Web dashboard with charts                    |

The YAML pipeline

All evaluation logic lives in one file:

promptfooconfig.yaml yaml
providers:
  - anthropic:messages:claude-3-haiku-20240307
  - anthropic:messages:claude-3-5-sonnet-20240620

prompts:
  - prompts.py:simple_prompt
  - prompts.py:better_prompt
  - prompts.py:chain_of_thought_prompt

tests:
  - vars:
      animal_statement: "The animal is a human."
    assert:
      - type: equals
        value: "2"

Prompts live in Python files as functions that receive test variables and return prompt strings. Tests can be inlined in YAML, loaded from CSV, or loaded from text files via file:// scheme. Running npx promptfoo@latest eval executes the entire matrix, and npx promptfoo@latest view opens a score dashboard.
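A sketch of what such a `prompts.py` might contain, matching the `prompts.py:simple_prompt` references in the YAML above. Prompt Foo passes test variables in via a context dict; the exact signature may vary by promptfoo version, so treat this as illustrative:

```python
# prompts.py -- referenced from promptfooconfig.yaml as prompts.py:simple_prompt, etc.

def simple_prompt(context):
    animal_statement = context["vars"]["animal_statement"]
    return (
        "How many legs does the animal in this statement have? "
        "Respond with a number only.\n\n" + animal_statement
    )

def chain_of_thought_prompt(context):
    animal_statement = context["vars"]["animal_statement"]
    return (
        "Think step by step inside <thinking> tags, then give the number "
        "of legs inside <answer> tags.\n\n" + animal_statement
    )
```

Because the prompts are plain functions, the same file can be imported and unit-tested outside Prompt Foo.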

Built-in assertions

Prompt Foo ships with assertions that cover common evaluation needs without writing grading code:

  • equals, contains, contains-all, contains-any
  • regex — output matches pattern
  • contains-json, contains-sql, contains-xml — structural validation
  • llm-rubric — model-graded assertion (covered below)

Output transformation

When chain-of-thought prompts produce <thinking> and <answer> sections, a transform function extracts only the answer before assertions evaluate:

transform.py python
import re

def get_transform(output, context):
    match = re.search(r'<answer>(.*?)</answer>', output, re.DOTALL)
    return match.group(1) if match else output

Multi-model comparison

Adding a second provider is a one-line YAML change. Prompt Foo runs all prompts through all models, producing a matrix. In the course, Claude 3.5 Sonnet scored 100% on prompts that Haiku scored 0% on — a signal for model selection versus prompt engineering effort.

The three-tier grading strategy

The courses reveal a progression for production eval systems:

  1. Start with human grading (Workbench) — prototype and understand failure modes
  2. Move to code-graded assertions — whenever criteria are objective and programmable
  3. Use model-graded assertions — when criteria are subjective (tone, appropriateness, refusal behavior)

When the criteria allow it, prefer code-graded assertions: they are the simplest and least expensive to run.

Built-in model grading with llm-rubric

For subjective criteria, Prompt Foo’s llm-rubric assertion runs the criterion through a judge model:

llm-rubric.yaml yaml
defaultTest:
  assert:
    - type: llm-rubric
      provider: anthropic:messages:claude-3-opus-20240229
      value: 'Refuses to answer and redirects to academic topics'
    - type: llm-rubric
      provider: anthropic:messages:claude-3-opus-20240229
      value: 'Is not apologetic'

Each llm-rubric assertion runs a separate LLM call using the judge model (Opus, in this case), evaluating the output against the natural-language criterion. Multiple assertions test orthogonal quality dimensions simultaneously — refusal behavior and tone scored independently.

Three-tier cost model

The courses demonstrate deliberate cost separation: use Haiku for iterative development (cheap, fast), Sonnet as the application model, and Opus as the grading/judge model. The model generating outputs should not be the same instance grading them.

Outcomes: the grade-and-revise loop

Beyond one-shot evaluation, the Managed Agents platform provides an automated grade-and-revise loop called Outcomes. You write a rubric that defines what “done” looks like, and the platform provisions an independent grader in its own context window. After each writer turn, the grader evaluates the artifact against the rubric and either passes it or hands back per-criterion gaps. The writer revises and the loop runs again — up to a configurable iteration cap.

The critical design decision: the grader cannot see the writer’s reasoning. It opens with a fresh context window containing only the rubric and the artifact. The platform does not let the loop continue until the grader produces a verdict on every criterion. This separation is what makes the loop work — a writer that knows the criteria will claim it passed whenever it believes it did, but it won’t independently refetch URLs or verify quotes against sources.

rubric-design.py python
# The task tells the writer what to make. The rubric tells the grader how to check.
TASK = """Write a brief on the unit economics of public DC fast charging.
Cover capex range, demand charges, utilization breakeven, subsidy programs,
named-operator economics, a contrarian source, and hardware vs install cost split."""

RUBRIC = """
COVERAGE CHECKLIST. Each item has a specific, checkable bar:
  1. Capex range: a dollar range for installed cost per DCFC stall or station.
  2. Demand charges: quantified impact on opex (a $/kW figure or a % of operating cost).
  3. Utilization breakeven: a breakeven or target utilization threshold (% or kWh/day).
  4. Subsidy programs: NEVI or another public funding program, named.
  5. Named operator: GAAP net income/loss from a 10-K or 10-Q, cited to sec.gov
     itself — not a press release, earnings-call recap, or news article.
  6. Contrarian source: at least one cited source arguing the economics are unfavorable.
  7. Cost split: a hardware vs soft-cost breakdown or ratio.

CITATION CHECK. For every [n] entry in the Sources section:
  a. LIVE: Fetch the URL. Mark LIVE only if the page loads directly.
     Mark DEAD if 404, login-walled, paywalled, bot-blocked, or JS-only.
     Do NOT corroborate via mirrors, reposts, or search snippets.
  b. VERBATIM: Search the fetched page for the quoted string.
     Mark QUOTE_MATCH if found; NOT_FOUND otherwise.
  c. SUPPORTS CLAIM: Mark SUPPORTS_CLAIM if the quote backs the claim;
     UNSUPPORTED if tangential, contradictory, or merely a general statement.
"""

Making the grader earn “satisfied”

The default failure mode of any rubric is a grader that approves everything. A criterion that says “check that the brief covers demand charges” lets the grader skim for a paragraph mentioning them and approve — without opening a single source. Fixing this requires specificity:

Rubric design principles

| Principle                     | Weak version                 | Strong version                                                                                          |
| ----------------------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------- |
| Make each criterion checkable | "Cover demand charges"       | "State a $/kW figure or a % of operating cost"                                                           |
| Require evidence              | "Verify citations are real"  | "Fetch each URL, search for the quoted string, confirm the quote supports the claim"                     |
| Anticipate shortcuts          | "Cite primary sources"       | "Do NOT corroborate via mirrors, reposts, or search snippets. The cited URL itself must fetch."          |
| Define what to ignore         | (not specified)              | "Do not flag pre-existing issues, style nits, or scope creep. Self-check each finding before raising it." |

In a real run from the cookbook, the grader caught a press release cited where a 10-K was required — the URL was on sec.gov but the specific page was an 8-K Exhibit 99.1 (an earnings press release filing), not a 10-K annual report. The rubric’s distinction between a filing and a press release on the SEC website is exactly the kind of check that a writer evaluating its own work would never make.

Why not just put the rubric in the system prompt?

A rubric in the system prompt helps the writer aim better, but it doesn’t create independence. The writer grades its own work and claims it passed. The grader has no choice but to run the actual checks — fetch URLs, search for quotes, verify claims — because the rubric demands evidence. The platform provisions a fresh grader after every writer turn, so the grader can’t be worn down or persuaded. You cannot get that separation from a single prompt.

Server-side prompt versioning: eval-gated promotion

A separate pattern from the Managed Agents platform: prompts that live server-side with immutable versions. Every agents.update produces a new version — v1, v2, v3 — all sharing the same agent ID. Sessions pin to a specific version, and changing the prompt means updating a version number rather than deploying code.

This enables an eval-gated promotion workflow:

prompt-promotion.py python
# Create v1 and measure baseline
agent = client.beta.agents.create(name="ticket-triage", model=MODEL, system=V1_PROMPT)
v1_scores = evaluate(version=1, test_cases=test_set)
# v1: billing 4/5, auth 5/5, api-platform 5/5, dashboard 5/5

# Ship v2 with a new routing rule
agent = client.beta.agents.update(AGENT_ID, version=agent.version, system=V2_PROMPT)
v2_scores = evaluate(version=agent.version, test_cases=test_set)
# v2: billing 2/5 <-- regressed

# Roll back by pointing callers at v1 — no deploy needed
# v2 stays on the server for further iteration; production stays on v1

The key practice: production callers always pin to an explicit version, and the pinned version number is what goes through change control. Anyone can create v2, v3, or v10 — those versions sit on the server with no traffic. Promotion means updating the config that tells production callers which version to pass, and that update goes through your normal review process. Creating versions is cheap and exploratory; promotion is deliberate and reviewed.
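A sketch of the gate itself: production reads the promoted version from reviewed config, and promotion refuses a version whose eval scores regressed. The config shape, function name, and threshold are all illustrative assumptions, not platform API:

```python
# Illustrative: the pinned version lives in reviewed config, not in code.
PINNED_VERSIONS = {"ticket-triage": 1}  # changed only through change control

def promote(agent_name, version, scores, threshold=4.5):
    """Gate promotion on eval results; refuse to pin a regressed version."""
    avg = sum(scores.values()) / len(scores)
    if avg < threshold:
        raise ValueError(
            f"{agent_name} v{version} scored {avg:.2f}, below threshold {threshold}"
        )
    PINNED_VERSIONS[agent_name] = version
    return PINNED_VERSIONS[agent_name]
```

With this shape, the v2 billing regression from the example above never reaches production: the failed `promote` call leaves the pin on v1 while v2 stays on the server for iteration.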

Takeaways

Evaluation needs a pipeline

A useful eval combines a test dataset, deterministic code checks, model-based grading, and score aggregation.

Model graders need reasoning space

Asking for strengths, weaknesses, and reasoning produces more discriminative scores than asking for a number alone.

Iteration requires stable cases

Changing one prompt element at a time only teaches something when each run uses the same test cases.

Synthetic datasets speed up coverage

Claude can generate representative test cases from the task description, including straightforward examples and edge cases.