Iterative prompt engineering for structured data extraction
A test-driven prompt engineering system for extracting structured review data from HTML with fixture-backed validation and iterative model calibration.
| | |
|---|---|
| GitHub URL | ravikanchikare/prompt-eng-code-lab |
| Models | Claude Sonnet 4.6 · Haiku 4.5 |
| Platform | Amazon Bedrock · promptfoo |
| Pages | 20 HTML review pages |
| Tasks | 3 extraction stages |
Overview
This project uses a test-driven approach to prompt engineering. Given 20 HTML review pages, the goal is to write system prompts that extract reviewer data into precise formats, then verify correctness against a ground-truth fixture.
- Extract reviewer and rating as a two-column CSV.
- Extract reviewer, rating, location, and sentiment as a four-column CSV.
- Extract current and previous reviews into season-grouped JSON.
The fixture includes edge cases: missing ratings, zero ratings, decimals (3.5, 0.5, 4.5), unusual location formats, and boundary-line sentiments.
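For orientation, the three target output shapes look like this (headers and top-level structure only, mirroring the prompts and schema shown later; the values themselves come from the fixture):

```
Stage 1 (Basic Table):    reviewer,rating
Stage 2 (Extended Table): reviewer,rating,location,sentiment
Stage 3 (JSON):           { "reviews": { "Fall": [ { … } ], "Winter": [], … } }
```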
A recurring challenge was enforcing output format. LLMs tend to wrap output in markdown fences, invent extra columns, skip quoting, or normalize values. Each stage required increasingly explicit format rules to override these habits. Format enforcement turned out to be as important as the extraction logic itself.
Development Setup
The system forms a closed feedback loop: write the prompt, run it, validate the output, analyze failures, and refine.
Two nested feedback loops drive the work:
- Inner loop (prompt refinement) — edit, run, validate, analyze, refine. The analysis step is LLM-powered: Sonnet receives each failed page’s HTML alongside the expected and actual values, then suggests specific prompt edits.
- Outer loop (fixture calibration) — when both models consistently produce the same “wrong” answer, the ground truth itself is re-examined.
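A minimal sketch of the inner loop in Python. `run_prompt`, `validate`, and `suggest_fixes` are hypothetical stand-ins for the repo's actual helpers, named here only to show the flow:

```python
from pathlib import Path

def inner_loop(prompt_path: Path, pages: list[Path], fixture: dict) -> dict:
    """One pass of edit -> run -> validate -> analyze (hypothetical helper names)."""
    system_prompt = prompt_path.read_text()
    failures, suggestions = {}, []

    for page in pages:
        html = page.read_text()
        actual = run_prompt(system_prompt, html)        # send page + prompt to the model
        expected = fixture[page.name]
        errors = validate(expected, actual)             # compare against ground truth
        if errors:
            failures[page.name] = errors
            # Analysis step: Sonnet sees the failing page's HTML plus the
            # expected and actual values, and proposes specific prompt edits.
            suggestions.append(suggest_fixes(html, expected, actual))

    return {"failures": failures, "suggestions": suggestions}
```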
Prompt Evolution: Stage by Stage
The three prompts build on each other. Each stage inherits the previous stage’s rules and adds one layer of complexity.
Consistent core: <role> and <task>
The <role> stays nearly identical across all stages — a one-line persona anchoring the model as a data extraction expert. The <task> is the part that changes. Here are all three variants:
Basic Table
Two fields, one CSV format. The simplest version establishes the current-review scope and the reviewer/rating extraction contract.
<role>
You are a data extraction expert extracting review data from HTML pages.
</role>
<task>
Extract the main current review from each page. Return raw CSV with columns `reviewer,rating`.
</task> "Main current review" tells the model to skip any previous-review content on the page.
Extended Table
The same role carries forward, while the task expands the required field set to prevent invented or dropped columns.
<role>
You are a data extraction expert extracting review data from HTML pages.
</role>
<task>
Extract the main current review from each page. Return raw CSV with columns `reviewer,rating,location,sentiment`.
</task>
Listing all four columns explicitly made the output contract harder to reinterpret.
JSON
The role now names structured JSON up front, and the task shifts from a column list to a schema-driven structure.
<role>
You are a data extraction expert extracting review data from HTML pages into structured JSON.
</role>
<task>
Extract the main current review from each page. Return JSON grouped by season under a `reviews` key.
</task>
The JSON output shape is defined by the schema section, not by CSV column names.
Rules and format contracts
Each prompt has two main working sections: <rules> (what to extract and how) and <format_contract> (how to shape the output). Rules are additive — each stage carries forward every rule from the previous stage, then adds new ones for the new fields. Format contracts, by contrast, are rewritten at each stage because the output shape changes.
The rule patterns that emerged follow a few recurring styles:
# Guarding against inference
- Do not infer [value] from [visual signal]
# Requiring explicit values
- [field]: explicit numeric value only; preserve decimals and 0
# Normalization
- Normalize [field] to [TargetFormat]
# Defining ambiguous labels with positive statements
- Use [label] when [specific observable condition]
# Scoping extraction
- Ignore [content type] / Include [content type] when present
For example, the star-icon rule follows the “do not infer” pattern, the rating rule follows “explicit values only,” and the name rule follows “normalize X to Y.” Sentiment labels follow the “positive definition” pattern — each label says what it is, not what it isn’t.
Format contracts serve a different purpose: they override the model’s default formatting habits. Representative rules include:
# Setting the output mode (first line of every format contract)
- Output raw [format] only — no prose, markdown, fences, or extra columns.
# Forcing the first token (prevents markdown fencing)
- Start directly with `[first token]`.
# Mandating quoting (prevents comma-in-value breakage)
- Double-quote every field value in every data row (RFC 4180).
# Declaring the missing-value token
- Use "null" for missing [field].
The format contract turned out to be the section that required the most iteration. Quoting rules, anti-fencing instructions, and missing-value tokens each went through at least one revision cycle before stabilizing.
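To make the intent of these rules concrete, a minimal format check might look like the sketch below. It assumes the Stage 2 column set and is not the repo's actual validator:

```python
import csv
import io

EXPECTED_HEADER = ["reviewer", "rating", "location", "sentiment"]  # Stage 2 columns

def check_format(output: str) -> list[str]:
    """Flag fenced output, a wrong header, wrong column counts, and blank fields."""
    errors = []
    if output.lstrip().startswith("`"):          # a leading backtick means a markdown fence
        errors.append("output is wrapped in a markdown fence")

    rows = list(csv.reader(io.StringIO(output)))  # csv handles RFC 4180 quoting on parse
    if not rows or rows[0] != EXPECTED_HEADER:
        errors.append("missing or incorrect header row")

    for row in rows[1:]:
        if len(row) != len(EXPECTED_HEADER):
            errors.append(f"expected {len(EXPECTED_HEADER)} fields, got {len(row)}: {row}")
        elif "" in row:
            errors.append(f"blank field where the null token was expected: {row}")
    return errors
```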
Rules: additive layering
Foundation
- Do not infer from visual elements.
- Use explicit values only; preserve decimals and 0; use null when missing.
- Normalize names to FirstName LastName.
- Return one row per page.
- Ignore previous-review content.
Five rules target the fixture's first edge cases: star-icon traps, zero-as-null confusion, and decimal rounding.
Sentiment and location
Inherits every Stage 1 rule.
- Add five sentiment labels with positive definitions.
- Preserve location as exact source text or null.
Sentiment went through three revision cycles and was the hardest section to stabilize.
JSON, season, and schema
Inherits every Stage 1 and Stage 2 rule.
- Infer season from date, text, or location.
- Include previous review when present.
- Add a schema section with an example.
The format contract is rewritten for JSON; the schema becomes both documentation and a structural template.
Side-by-side prompt comparison
| Section | Stage 1 (Basic Table) | Stage 2 (Extended Table) | Stage 3 (JSON) |
|---|---|---|---|
| Role | Data extraction expert for HTML review pages. | Inherited from Stage 1. | Stage 1 role plus a structured-JSON target. |
| Task | Return a two-column CSV for the current review. | Return a four-column CSV for the current review. | Return season-grouped JSON under a `reviews` key. |
| Rules | Five foundation rules: no visual inference, explicit values, normalization, null handling, ignore previous review. | Carry Stage 1 forward and add sentiment definitions plus exact location preservation. | Carry Stage 2 forward and add season inference plus previous-review inclusion. |
| Format contract | Raw CSV, header first, RFC 4180 quoting, exactly 2 fields. | Raw CSV, escaped quotes, exactly 4 fields, no blank values. | Raw JSON, start directly with `{`, season arrays, all fields required. |
| Schema | Not used. | Not used. | Full JSON example with nested review objects. |
Format contract evolution
| Aspect | Stage 1 (Basic Table) | Stage 2 (Extended Table) | Stage 3 (JSON) |
|---|---|---|---|
| Output shape | Raw CSV with 2 fields. | Raw CSV with 4 fields. | Raw JSON grouped by season. |
| Quoting | Quote every field value with RFC 4180 rules. | Keep Stage 1 quoting and escape embedded quotes. | Handled by native JSON string encoding. |
| Missing values | Use "null" for a missing rating. | Use "null" for missing rating and location. | Use native JSON null values. |
| Anti-prose rule | No prose, markdown, or extra columns. | Keep Stage 1 and add “no blank fields.” | Start directly with `{`. |
Prompt structure
Every prompt follows the same XML scaffold. The sections below show the general shape, with inherited content summarized rather than repeated.
<role>
You are a data extraction expert extracting review data from HTML pages.
← Stage 3 adds: "into structured JSON"
</role>
<task>
Extract the main current review from each page. Return [format] with [fields/structure].
</task>
<rules>
# Foundation rules (Stage 1, inherited by all)
- Do not infer [value] from [visual signal].
- [field]: explicit numeric value only; preserve decimals and 0; use null when missing.
- Normalize [field] to [TargetFormat].
- One [row/object] per page.
- Ignore previous-review content.
# Sentiment rules (Stage 2+, inherited by Stage 3)
- [field]: one of [label_1], [label_2], … [label_n].
- Use [label] when [specific observable condition].
# Season + previous review (Stage 3 only)
- [field]: one of [value_1], … [value_n]. Infer from [source_1], then [source_2].
- Include previous review when present; use null otherwise.
</rules>
<format_contract>
- Output raw [format] only — no prose, markdown, fences.
- [Format-specific quoting, header, or structural rules.]
- Use [token] for missing values.
</format_contract>
<schema> ← Stage 3 only
{ "reviews": { "Fall": [ { … } ], "Winter": [], … } }
</schema>
How Stage 3 reinforces earlier rules
- Season grouping checks date accuracy. A wrong date lands the review in the wrong season bucket, failing the `season_grouping` check.
- The schema constrains field types. Rating must be a number, stats must be integers, and the review object needs both `current` and `previous` keys.
- Previous-review extraction tests the scoping rule. Stages 1–2 say “ignore previous content.” Stage 3 says “include it in `review.previous`.” The model must distinguish current from previous data precisely.
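A sketch of what a `season_grouping`-style check could look like, assuming the fixture mirrors the output schema above (the repo's actual check may be stricter):

```python
def check_season_grouping(actual: dict, expected: dict) -> list[str]:
    """Every review must land in the same season bucket as the ground truth."""
    errors = []
    for season, expected_reviews in expected.get("reviews", {}).items():
        # Assumes each review object carries current.reviewer, per the schema above.
        expected_names = {r["current"]["reviewer"] for r in expected_reviews}
        actual_names = {
            r["current"]["reviewer"]
            for r in actual.get("reviews", {}).get(season, [])
        }
        if expected_names != actual_names:
            errors.append(
                f"{season}: expected {sorted(expected_names)}, got {sorted(actual_names)}"
            )
    return errors
```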
Iteration Process
Each iteration follows a five-step cycle: change one thing, run it, measure the result, analyze failures, then decide what to do next.
| Step | What Happens | Artifact |
|---|---|---|
| Change | Edit one rule or format line in the prompt | Updated .system.md |
| Run | Send all 20 pages + prompt to Bedrock | output.txt |
| Measure | Parse output, run checks against fixture | validation.json |
| Analyze | Sonnet examines each failure with expected and actual values, then suggests fixes | suggested_improvements.json |
| Decide | Fix the prompt, update the fixture, or try a different model | Next iteration |
Every run saves a timestamped bundle to runs/, so any two iterations can be compared directly.
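A minimal sketch of the run step against Amazon Bedrock using the Converse API, saving a timestamped bundle under runs/. The model ID is a placeholder and the script layout is assumed, not the repo's actual code:

```python
import datetime
import json
from pathlib import Path

import boto3

bedrock = boto3.client("bedrock-runtime")

def run_page(system_prompt: str, page_html: str, model_id: str) -> str:
    """Send one HTML page plus the system prompt to a Bedrock Claude model."""
    response = bedrock.converse(
        modelId=model_id,  # placeholder: use the Sonnet or Haiku ID for the run
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": page_html}]}],
        inferenceConfig={"temperature": 0.3, "maxTokens": 2048},
    )
    return response["output"]["message"]["content"][0]["text"]

def save_run(raw_outputs: dict[str, str], validation: dict) -> Path:
    """Write output.txt and validation.json into a timestamped runs/ directory."""
    run_dir = Path("runs") / datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "output.txt").write_text("\n\n".join(raw_outputs.values()))
    (run_dir / "validation.json").write_text(json.dumps(validation, indent=2))
    return run_dir
```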
Iteration Log: What Changed and Why
Model Comparison
The same prompts were tested on two Claude model tiers via Amazon Bedrock.
Results (after fixture calibration)
Best-run summary: both models reached 20/20 on Basic Table, Sonnet 4.6 reached 20/20 on Extended Table, and Haiku 4.5 reached 20/20 on JSON. The remaining drift was limited to 0–2 nondeterministic sentiment mismatches per run.
| Task | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|
| Basic Table | 20/20 | 20/20 |
| Extended Table | 20/20 | 19–20/20 |
| JSON | 18–20/20 | 20/20 |
Both models achieve full marks on their best runs. The 0–2 remaining mismatches per run are nondeterministic — borderline sentiment cases that flip at temperature 0.3. This is the noise floor, and it shows which classifications are robust versus which sit on genuine semantic boundaries.
Model-specific behaviors
| Behavior | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|
| Format compliance | Follows anti-fence and quoting rules reliably | Needed the explicit first-token rule to stop markdown fencing |
| Sentiment boundaries | Occasionally shifts mixed → positive | Occasionally shifts positive → mixed |
| Structural accuracy | Consistent | Same after Haiku 4.5 upgrade |
Ground-Truth Calibration
| Reviewer | Original | Corrected | Evidence |
|---|---|---|---|
| Theo Carter | unclear | mixed | Uses contrast language (“but”, “though I would hesitate”). Both models, every run. |
| Colin Ward | unclear | neutral | Purely observational (“packaging arrived intact”). Both models, every run. |
| Henry Brooks | positive | mixed | Explicit contrast (“shaving off a half point only because…”). Both models, every run. |
Final Results and Takeaways
Validation checks
| Check | Task 1 | Task 2 | Task 3 |
|---|---|---|---|
| format_compliance | PASS | PASS | PASS |
| required_fields_match | PASS | PASS | PASS |
| row_coverage | PASS | PASS | PASS |
| no_hallucination | PASS | PASS | PASS |
| missing_and_edge_values | PASS | PASS | PASS |
| rating_missing_token | PASS | PASS | — |
| location_missing_token | — | PASS | — |
| column_mapping | — | PASS | — |
| season_grouping | — | — | PASS |
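As an example of the missing-and-edge-values checks, a rating comparison could look like the sketch below. Check names mirror the table above; the actual implementation may differ:

```python
def check_rating(expected: str, actual: str) -> str | None:
    """Missing ratings must use the literal 'null' token; 0 and decimals must survive."""
    if expected == "null":                       # rating genuinely absent on the page
        return None if actual == "null" else f"expected null token, got {actual!r}"
    if actual in ("", "null"):
        return f"rating dropped: expected {expected!r}"
    # Compare numerically so 0, 0.5, and 3.5 are preserved rather than rounded.
    try:
        ok = float(actual) == float(expected)
    except ValueError:
        return f"non-numeric rating: {actual!r}"
    return None if ok else f"expected {expected!r}, got {actual!r}"
```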
Key takeaways
Build prompts incrementally
Start simple and add one layer at a time. Each new failure traces to the latest change, not to accumulated complexity.
Edge cases drive rule quality
Null ratings, zero ratings, decimals, and unusual formats forced rules that would otherwise be skipped. Happy-path-only test data produces prompts that fail in production.
Format contracts matter as much as extraction rules
Quoting, fencing, and output shape caused more iterations than the actual data extraction.
Define labels positively
"Neutral = facts or mild opinions" is more stable than defining labels by exclusion. Positive definitions hold up better across model versions and temperatures.
Positive instructions beat prohibitions
"Start with `" works better than "do not wrap in markdown fences."
Ground truth needs calibration too
When two models consistently agree against the fixture, re-examine the label. The fixture is part of the system under test.
Nondeterminism is informative
Borderline cases that flip between runs reveal which classifications are robust and which sit on real semantic boundaries.
Use consistent XML scaffolding
Sections like <role>, <task>, <rules>, and <format_contract> give the model clear boundaries. Adding <schema> in Stage 3 served as both documentation and a structural template.