codelab · intermediate

Iterative prompt engineering for structured data extraction

A test-driven prompt engineering system for extracting structured review data from HTML with fixture-backed validation and iterative model calibration.

data-extraction · llm · prompt-engineering · testing

Overview

This project uses a test-driven approach to prompt engineering. Given 20 HTML review pages, the goal is to write system prompts that extract reviewer data into precise formats, then verify correctness against a ground-truth fixture.

Extraction tasks
  1. Extract reviewer and rating as a two-column CSV.
  2. Extract reviewer, rating, location, and sentiment as a four-column CSV.
  3. Extract current and previous reviews into season-grouped JSON.

The fixture includes edge cases: missing ratings, zero ratings, decimals (3.5, 0.5, 4.5), unusual location formats, and borderline sentiments.
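To make these edge cases concrete, here is what one ground-truth record might look like; every field name and value below is invented for illustration rather than copied from the real fixture.

```python
# A hypothetical ground-truth entry illustrating the edge cases above. The field
# names and values are invented for illustration, not copied from the real fixture.
fixture_entry = {
    "page": "review_07.html",
    "reviewer": "Maya Patel",      # normalized to FirstName LastName
    "rating": 0.5,                 # decimals and near-zero values must survive intact
    "location": "Portland, OR",    # a comma inside the value forces quoting in CSV
    "sentiment": "mixed",          # borderline case that needed a positive definition
}
```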

A recurring challenge was enforcing output format. LLMs tend to wrap output in markdown fences, invent extra columns, skip quoting, or normalize values. Each stage required increasingly explicit format rules to override these habits. Format enforcement turned out to be as important as the extraction logic itself.


Development Setup

The system forms a closed feedback loop: write the prompt, run it, validate the output, analyze failures, and refine.

  1. Write: Author the next prompt revision in the system prompt file
  2. Run: Execute the full fixture set against the chosen model
  3. Validate: Check structure, rows, labels, and edge-value handling
  4. Analyze: Look for recurring failure patterns instead of one-off misses
  5. Refine: Change one layer at a time and rerun the loop
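A minimal sketch of the Run step of this loop, assuming the review pages live under `pages/`, the prompt in a `.system.md` file, and model access through the boto3 Bedrock Converse API; paths, file names, and the model ID are placeholders rather than the project's actual layout.

```python
# Minimal sketch of the run half of the loop; file layout, MODEL_ID, and the
# validate/analyze hand-off are assumptions, not the project's actual code.
from pathlib import Path

import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "<bedrock-model-id>"  # placeholder for a Claude Sonnet or Haiku model ID

def run_page(system_prompt: str, html: str) -> str:
    """Send one review page through the Converse API and return the raw reply text."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": html}]}],
        inferenceConfig={"temperature": 0.3},
    )
    return response["output"]["message"]["content"][0]["text"]

system_prompt = Path("prompts/basic_table.system.md").read_text()
outputs = {page.name: run_page(system_prompt, page.read_text())
           for page in sorted(Path("pages").glob("*.html"))}

# The Validate and Analyze steps then compare each output against the
# ground-truth fixture entry for that page (structure, rows, labels, edge values).
```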

Two nested feedback loops drive the work:

  • Inner loop (prompt refinement) — edit, run, validate, analyze, refine. The analysis step is LLM-powered: Sonnet receives each failed page’s HTML alongside the expected and actual values, then suggests specific prompt edits.
  • Outer loop (fixture calibration) — when both models consistently produce the same “wrong” answer, the ground truth itself is re-examined.

Prompt Evolution: Stage by Stage

The three prompts build on each other. Each stage inherits the previous stage’s rules and adds one layer of complexity.

Consistent core: <role> and <task>

The <role> stays nearly identical across all stages — a one-line persona anchoring the model as a data extraction expert. The <task> is the part that changes. Here are all three variants:

Stage 1 — Basic Table

Two fields, one CSV format. The simplest version establishes the current-review scope and the reviewer/rating extraction contract.

<role>
You are a data extraction expert extracting review data from HTML pages.
</role>

<task>
Extract the main current review from each page. Return raw CSV with columns `reviewer,rating`.
</task>

"Main current review" tells the model to skip any previous-review content on the page.

Stage 2 — Extended Table

The same role carries forward, while the task expands the required field set to prevent invented or dropped columns.

<role>
You are a data extraction expert extracting review data from HTML pages.
</role>

<task>
Extract the main current review from each page. Return raw CSV with columns `reviewer,rating,location,sentiment`.
</task>

Listing all four columns explicitly made the output contract harder to reinterpret.
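A hypothetical compliant Stage 2 response, again with made-up values; the quoted location shows why RFC 4180 quoting matters once a comma can appear inside a field, and the second row shows the "null" token for missing rating and location.

```python
# Illustrative Stage 2 rows (made-up values): quoted fields, a comma inside the
# location value, and the "null" token for missing rating and location.
import csv, io

sample_output = '''reviewer,rating,location,sentiment
"Maya Patel","4.5","Portland, OR","positive"
"Jordan Lee","null","null","neutral"'''

rows = list(csv.reader(io.StringIO(sample_output)))
assert rows[0] == ["reviewer", "rating", "location", "sentiment"]
assert all(len(row) == 4 for row in rows), "expected exactly four columns"
```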

Stage 3 — JSON

The role now names structured JSON up front, and the task shifts from a column list to a schema-driven structure.

<role>
You are a data extraction expert extracting review data from HTML pages into structured JSON.
</role>

<task>
Extract the main current review from each page. Return JSON grouped by season under a `reviews` key.
</task>

The JSON output shape is defined by the schema section, not by CSV column names.
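A hypothetical Stage 3 response matching the season-grouped shape defined later in the <schema> section; the inner field names are assumptions rather than the fixture's exact keys.

```python
# Illustrative Stage 3 shape (made-up values). The "current"/"previous" keys are
# assumptions about the review object, not the fixture's exact field names.
import json

sample_output = '''{
  "reviews": {
    "Fall":   [{"current": {"reviewer": "Maya Patel", "rating": 4.5}, "previous": null}],
    "Winter": [],
    "Spring": [],
    "Summer": []
  }
}'''

assert sample_output.lstrip().startswith("{"), "output must start directly with {"
data = json.loads(sample_output)
assert set(data) == {"reviews"}
assert all(isinstance(bucket, list) for bucket in data["reviews"].values())
```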

Rules and format contracts

Each prompt has two main working sections: <rules> (what to extract and how) and <format_contract> (how to shape the output). Rules are additive — each stage carries forward every rule from the previous stage, then adds new ones for the new fields. Format contracts, by contrast, are rewritten at each stage because the output shape changes.

The rule patterns that emerged follow a few recurring styles:

# Guarding against inference
- Do not infer [value] from [visual signal]

# Requiring explicit values
- [field]: explicit numeric value only; preserve decimals and 0

# Normalization
- Normalize [field] to [TargetFormat]

# Defining ambiguous labels with positive statements
- Use [label] when [specific observable condition]

# Scoping extraction
- Ignore [content type] / Include [content type] when present

For example, the star-icon rule follows the “do not infer” pattern, the rating rule follows “explicit values only,” and the name rule follows “normalize X to Y.” Sentiment labels follow the “positive definition” pattern — each label says what it is, not what it isn’t.

Format contracts serve a different purpose: they override the model’s default formatting habits. Representative rules include:

# Setting the output mode (first line of every format contract)
- Output raw [format] only — no prose, markdown, fences, or extra columns.

# Forcing the first token (prevents markdown fencing)
- Start directly with [the first expected token], e.g. `{` for JSON.

# Mandating quoting (prevents comma-in-value breakage)
- Double-quote every field value in every data row (RFC 4180).

# Declaring the missing-value token
- Use "null" for missing [field].

The format contract turned out to be the section that required the most iteration. Quoting rules, anti-fencing instructions, and missing-value tokens each went through at least one revision cycle before stabilizing.
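As a rough illustration, a validator for the two most-iterated format rules might look like the sketch below, assuming CSV output; it is a heuristic sanity check, not a full RFC 4180 parser.

```python
# Heuristic check for the two format-contract rules that needed the most
# iteration: no markdown fence at the start, and every field in every data row
# double-quoted. A sanity check only, not a full RFC 4180 parser.
def check_format_contract(output: str) -> list[str]:
    problems = []
    text = output.strip()
    if text.startswith("```"):
        problems.append("output is wrapped in a markdown fence")
    for i, line in enumerate(text.splitlines()[1:], start=2):  # skip the header row
        if not (line.startswith('"') and line.endswith('"')):
            problems.append(f"row {i}: fields are not double-quoted")
    return problems
```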

Rules: additive layering

Stage 1 — Foundation
  • Do not infer from visual elements.
  • Use explicit values only; preserve decimals and 0; use null when missing.
  • Normalize names to FirstName LastName.
  • Return one row per page.
  • Ignore previous-review content.

Five rules target the fixture's first edge cases: star-icon traps, zero-as-null confusion, and decimal rounding.

Stage 2 — Sentiment and location

Inherits every Stage 1 rule.

  • Add five sentiment labels with positive definitions.
  • Preserve location as exact source text or null.

Sentiment went through three revision cycles and was the hardest section to stabilize.

Stage 3 — JSON, season, and schema

Inherits every Stage 1 and Stage 2 rule.

  • Infer season from date, text, or location.
  • Include previous review when present.
  • Add a schema section with an example.

The format contract is rewritten for JSON; the schema becomes both documentation and a structural template.

Side-by-side prompt comparison

Prompt sections across stages

| Section | Stage 1 — Basic Table | Stage 2 — Extended Table | Stage 3 — JSON |
|---------|-----------------------|--------------------------|----------------|
| Role | Data extraction expert for HTML review pages. | Inherited from Stage 1. | Stage 1 role plus a structured-JSON target. |
| Task | Return a two-column CSV for the current review. | Return a four-column CSV for the current review. | Return season-grouped JSON under a reviews key. |
| Rules | Five foundation rules: no visual inference, explicit values, normalization, null handling, ignore previous review. | Carry Stage 1 forward and add sentiment definitions plus exact location preservation. | Carry Stage 2 forward and add season inference plus previous-review inclusion. |
| Format contract | Raw CSV, header first, RFC 4180 quoting, exactly 2 fields. | Raw CSV, escaped quotes, exactly 4 fields, no blank values. | Raw JSON, start directly with `{`, season arrays, all fields required. |
| Schema | Not used. | Not used. | Full JSON example with nested review objects. |

Format contract evolution

Format contract across stages

| Aspect | Stage 1 — Basic Table | Stage 2 — Extended Table | Stage 3 — JSON |
|--------|-----------------------|--------------------------|----------------|
| Output shape | Raw CSV with 2 fields. | Raw CSV with 4 fields. | Raw JSON grouped by season. |
| Quoting | Quote every field value with RFC 4180 rules. | Keep Stage 1 quoting and escape embedded quotes. | Handled by native JSON string encoding. |
| Missing values | Use "null" for a missing rating. | Use "null" for missing rating and location. | Use native JSON null values. |
| Anti-prose rule | No prose, markdown, or extra columns. | Keep Stage 1 and add “no blank fields.” | Start directly with `{`. |

Prompt structure

Every prompt follows the same XML scaffold. The sections below show the general shape, with inherited content summarized rather than repeated.

<role>
  You are a data extraction expert extracting review data from HTML pages.
  ← Stage 3 adds: "into structured JSON"
</role>

<task>
  Extract the main current review from each page. Return [format] with [fields/structure].
</task>

<rules>
  # Foundation rules (Stage 1, inherited by all)
  - Do not infer [value] from [visual signal].
  - [field]: explicit numeric value only; preserve decimals and 0; use null when missing.
  - Normalize [field] to [TargetFormat].
  - One [row/object] per page.
  - Ignore previous-review content.

  # Sentiment rules (Stage 2+, inherited by Stage 3)
  - [field]: one of [label_1], [label_2], … [label_n].
  - Use [label] when [specific observable condition].

  # Season + previous review (Stage 3 only)
  - [field]: one of [value_1], … [value_n]. Infer from [source_1], then [source_2].
  - Include previous review when present; use null otherwise.
</rules>

<format_contract>
  - Output raw [format] only — no prose, markdown, fences.
  - [Format-specific quoting, header, or structural rules.]
  - Use [token] for missing values.
</format_contract>

<schema>              ← Stage 3 only
  { "reviews": { "Fall": [ { … } ], "Winter": [], … } }
</schema>

How Stage 3 reinforces earlier rules

  • Season grouping checks date accuracy. A wrong date lands the review in the wrong season bucket, failing the season_grouping check (sketched after this list).
  • The schema constrains field types. Rating must be a number, stats must be integers, and the review object needs both current and previous keys.
  • Previous-review extraction tests the scoping rule. Stages 1–2 say “ignore previous content.” Stage 3 says “include it in review.previous.” The model must distinguish current from previous data precisely.
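A sketch of what the season_grouping check could look like, assuming the fixture stores a review date per page and that seasons follow the usual three-month buckets; the month mapping and the current/reviewer keys are assumptions, not the project's actual validator.

```python
# Sketch of a season_grouping check. The month-to-season mapping and the
# "current"/"reviewer" keys are assumptions, not the project's real validator.
from datetime import date

SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
            3: "Spring", 4: "Spring", 5: "Spring",
            6: "Summer", 7: "Summer", 8: "Summer",
            9: "Fall", 10: "Fall", 11: "Fall"}

def check_season_grouping(output: dict, review_date: date, reviewer: str) -> bool:
    """True if the reviewer's current review landed in the bucket its date implies."""
    bucket = output.get("reviews", {}).get(SEASONS[review_date.month], [])
    return any(entry.get("current", {}).get("reviewer") == reviewer for entry in bucket)
```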

Iteration Process

Each iteration follows a five-step cycle: change one thing, run it, measure the result, analyze failures, then decide what to do next.

Iteration cycle
| Step | What happens | Artifact |
|------|--------------|----------|
| Change | Edit one rule or format line in the prompt | Updated `.system.md` |
| Run | Send all 20 pages + prompt to Bedrock | `output.txt` |
| Measure | Parse output, run checks against fixture | `validation.json` |
| Analyze | Sonnet examines each failure with expected and actual values, then suggests fixes | `suggested_improvements.json` |
| Decide | Fix the prompt, update the fixture, or try a different model | Next iteration |
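The Analyze step could be wired up roughly as follows, assuming each failure record carries the page HTML plus the expected and actual values; the prompt wording, record fields, and model ID are illustrative placeholders, not the project's exact implementation.

```python
# Sketch of the LLM-powered Analyze step. The failure-record fields, the prompt
# wording, and REVIEW_MODEL_ID are placeholders, not the project's actual code.
import json

import boto3

bedrock = boto3.client("bedrock-runtime")
REVIEW_MODEL_ID = "<bedrock-sonnet-model-id>"  # placeholder

def suggest_prompt_edit(system_prompt: str, failure: dict) -> str:
    """Ask the reviewing model for one specific prompt edit that fixes a failure."""
    report = (
        "System prompt under test:\n" + system_prompt + "\n\n"
        "Expected: " + json.dumps(failure["expected"]) + "\n"
        "Actual:   " + json.dumps(failure["actual"]) + "\n\n"
        "Page HTML:\n" + failure["html"] + "\n\n"
        "Suggest one specific edit to the system prompt that fixes this failure "
        "without breaking the other rules."
    )
    response = bedrock.converse(
        modelId=REVIEW_MODEL_ID,
        system=[{"text": "You review data-extraction prompts and suggest precise edits."}],
        messages=[{"role": "user", "content": [{"text": report}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```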

Every run saves a timestamped bundle to runs/, so any two iterations can be compared directly.
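Comparing two bundles can then be as simple as diffing their validation files; the layout assumed below is inferred from the artifacts named in the table, not taken from the repository.

```python
# Sketch of a run-to-run comparison, assuming each bundle under runs/ contains
# the validation.json named above; the file layout and record shape are assumptions.
import json
from pathlib import Path

def diff_runs(run_a: str, run_b: str) -> dict:
    """Return the pages whose validation results changed between two runs."""
    a = json.loads((Path("runs") / run_a / "validation.json").read_text())
    b = json.loads((Path("runs") / run_b / "validation.json").read_text())
    return {
        page: {"before": a.get(page), "after": b.get(page)}
        for page in sorted(set(a) | set(b))
        if a.get(page) != b.get(page)
    }
```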


Iteration Log: What Changed and Why


Model Comparison

The same prompts were tested on two Claude model tiers via Amazon Bedrock.

Results (after fixture calibration)

Best-run summary: both models reached 20/20 on Basic Table, Sonnet 4.6 reached 20/20 on Extended Table, and Haiku 4.5 reached 20/20 on JSON. The remaining drift was limited to 0-2 nondeterministic sentiment mismatches per run.

Per-task results
| Task | Sonnet 4.6 | Haiku 4.5 |
|------|------------|-----------|
| Basic Table | 20/20 | 20/20 |
| Extended Table | 20/20 | 19–20/20 |
| JSON | 18–20/20 | 20/20 |

Both models achieve full marks on their best runs. The 0–2 remaining mismatches per run are nondeterministic — borderline sentiment cases that flip at temperature 0.3. This is the noise floor, and it shows which classifications are robust versus which sit on genuine semantic boundaries.

Model-specific behaviors

Behavioral differences
| Behavior | Sonnet 4.6 | Haiku 4.5 |
|----------|------------|-----------|
| Format compliance | Follows anti-fence and quoting rules reliably | Needed the explicit "start directly with `{`" rule to stop markdown fencing |
| Sentiment boundaries | Occasionally shifts mixed → positive | Occasionally shifts positive → mixed |
| Structural accuracy | Consistent | Same after Haiku 4.5 upgrade |

Ground-Truth Calibration

Fixture corrections
| Reviewer | Original | Corrected | Evidence |
|----------|----------|-----------|----------|
| Theo Carter | unclear | mixed | Uses contrast language (“but”, “though I would hesitate”). Both models, every run. |
| Colin Ward | unclear | neutral | Purely observational (“packaging arrived intact”). Both models, every run. |
| Henry Brooks | positive | mixed | Explicit contrast (“shaving off a half point only because…”). Both models, every run. |

Final Results and Takeaways

Validation checks

Final validation matrix
| Check | Task 1 | Task 2 | Task 3 |
|-------|--------|--------|--------|
| format_compliance | PASS | PASS | PASS |
| required_fields_match | PASS | PASS | PASS |
| row_coverage | PASS | PASS | PASS |
| no_hallucination | PASS | PASS | PASS |
| missing_and_edge_values | PASS | PASS | PASS |
| rating_missing_token | PASS | PASS | |
| location_missing_token | | PASS | |
| column_mapping | | PASS | |
| season_grouping | | | PASS |

Key takeaways

Build prompts incrementally

Start simple and add one layer at a time. Each new failure traces to the latest change, not to accumulated complexity.

Edge cases drive rule quality

Null ratings, zero ratings, decimals, and unusual formats forced rules that would otherwise be skipped. Happy-path-only test data produces prompts that fail in production.

Format contracts matter as much as extraction rules

Quoting, fencing, and output shape caused more iterations than the actual data extraction.

Define labels positively

"Neutral = facts or mild opinions" is more stable than defining labels by exclusion. Positive definitions hold up better across model versions and temperatures.

Positive instructions beat prohibitions

"Start with `" works better than "do not wrap in markdown fences."

Ground truth needs calibration too

When two models consistently agree against the fixture, re-examine the label. The fixture is part of the system under test.

Nondeterminism is informative

Borderline cases that flip between runs reveal which classifications are robust and which sit on real semantic boundaries.

Use consistent XML scaffolding

Sections like <role>, <task>, <rules>, and <format_contract> give the model clear boundaries. Adding <schema> in Stage 3 served as both documentation and a structural template.