Iterative prompt engineering for structured data extraction
A test-driven prompt engineering system for extracting structured review data from HTML with fixture-backed validation and iterative model calibration.
| | |
|---|---|
| GitHub URL | ravikanchikare/prompt-eng-code-lab |
| Models | Claude Sonnet 4.6 · Haiku 4.5 |
| Platform | Amazon Bedrock · promptfoo |
| Pages | 20 HTML review pages |
| Tasks | 3 extraction stages |
Overview
This project uses a test-driven approach to prompt engineering. Given 20 HTML review pages, the goal is to write system prompts that extract reviewer data into precise formats, then verify correctness against a ground-truth fixture.
- Extract reviewer and rating as a two-column CSV.
- Extract reviewer, rating, location, and sentiment as a four-column CSV.
- Extract current and previous reviews into season-grouped JSON.
The fixture includes edge cases: missing ratings, zero ratings, decimals (3.5, 0.5, 4.5), unusual location formats, and boundary-line sentiments.
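For orientation, the three target output shapes look like this (headers and top-level structure only, mirroring the prompts and schema shown later; the values themselves come from the fixture):

```
Stage 1 (Basic Table):    reviewer,rating
Stage 2 (Extended Table): reviewer,rating,location,sentiment
Stage 3 (JSON):           { "reviews": { "Fall": [ { … } ], "Winter": [], … } }
```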
A recurring challenge was enforcing output format. LLMs tend to wrap output in markdown fences, invent extra columns, skip quoting, or normalize values. Each stage required increasingly explicit format rules to override these habits. Format enforcement turned out to be as important as the extraction logic itself.
Development Setup
The system forms a closed feedback loop: write the prompt, run it, validate the output, analyze failures, and refine.
Two nested feedback loops drive the work:
- Inner loop (prompt refinement) — edit, run, validate, analyze, refine. The analysis step is LLM-powered: Sonnet receives each failed page’s HTML alongside the expected and actual values, then suggests specific prompt edits.
- Outer loop (fixture calibration) — when both models consistently produce the same “wrong” answer, the ground truth itself is re-examined.
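A minimal sketch of the inner loop in Python. `run_prompt`, `validate`, and `suggest_fixes` are hypothetical stand-ins for the repo's actual helpers, named here only to show the flow:

```python
from pathlib import Path

def inner_loop(prompt_path: Path, pages: list[Path], fixture: dict) -> dict:
    """One pass of edit -> run -> validate -> analyze (hypothetical helper names)."""
    system_prompt = prompt_path.read_text()
    failures, suggestions = {}, []

    for page in pages:
        html = page.read_text()
        actual = run_prompt(system_prompt, html)        # send page + prompt to the model
        expected = fixture[page.name]
        errors = validate(expected, actual)             # compare against ground truth
        if errors:
            failures[page.name] = errors
            # Analysis step: Sonnet sees the failing page's HTML plus the
            # expected and actual values, and proposes specific prompt edits.
            suggestions.append(suggest_fixes(html, expected, actual))

    return {"failures": failures, "suggestions": suggestions}
```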
Prompt Evolution: Stage by Stage
The three prompts build on each other. Each stage inherits the previous stage’s rules and adds one layer of complexity.
Consistent core: <role> and <task>
The <role> stays nearly identical across all stages — a one-line persona anchoring the model as a data extraction expert. The <task> is the part that changes. Here are all three variants:
Basic Table
Two fields, one CSV format. The simplest version establishes the current-review scope and the reviewer/rating extraction contract.
<role>
You are a data extraction expert extracting review data from HTML pages.
</role>
<task>
Extract the main current review from each page. Return raw CSV with columns `reviewer,rating`.
</task> "Main current review" tells the model to skip any previous-review content on the page.
Extended Table
The same role carries forward, while the task expands the required field set to prevent invented or dropped columns.
<role>
You are a data extraction expert extracting review data from HTML pages.
</role>
<task>
Extract the main current review from each page. Return raw CSV with columns `reviewer,rating,location,sentiment`.
</task>
Listing all four columns explicitly made the output contract harder to reinterpret.
JSON
The role now names structured JSON up front, and the task shifts from a column list to a schema-driven structure.
<role>
You are a data extraction expert extracting review data from HTML pages into structured JSON.
</role>
<task>
Extract the main current review from each page. Return JSON grouped by season under a `reviews` key.
</task>
The JSON output shape is defined by the schema section, not by CSV column names.
Rules and format contracts
Each prompt has two main working sections: <rules> (what to extract and how) and <format_contract> (how to shape the output). Rules are additive — each stage carries forward every rule from the previous stage, then adds new ones for the new fields. Format contracts, by contrast, are rewritten at each stage because the output shape changes.
The rule patterns that emerged follow a few recurring styles:
# Guarding against inference
- Do not infer [value] from [visual signal]
# Requiring explicit values
- [field]: explicit numeric value only; preserve decimals and 0
# Normalization
- Normalize [field] to [TargetFormat]
# Defining ambiguous labels with positive statements
- Use [label] when [specific observable condition]
# Scoping extraction
- Ignore [content type] / Include [content type] when present
For example, the star-icon rule follows the “do not infer” pattern, the rating rule follows “explicit values only,” and the name rule follows “normalize X to Y.” Sentiment labels follow the “positive definition” pattern — each label says what it is, not what it isn’t.
Format contracts serve a different purpose: they override the model’s default formatting habits. Representative rules include:
# Setting the output mode (first line of every format contract)
- Output raw [format] only — no prose, markdown, fences, or extra columns.
# Forcing the first token (prevents markdown fencing)
- Start directly with `[first token]`.
# Mandating quoting (prevents comma-in-value breakage)
- Double-quote every field value in every data row (RFC 4180).
# Declaring the missing-value token
- Use "null" for missing [field].
The format contract turned out to be the section that required the most iteration. Quoting rules, anti-fencing instructions, and missing-value tokens each went through at least one revision cycle before stabilizing.
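To make the intent of these rules concrete, a minimal format check might look like the sketch below. It assumes the Stage 2 column set and is not the repo's actual validator:

```python
import csv
import io

EXPECTED_HEADER = ["reviewer", "rating", "location", "sentiment"]  # Stage 2 columns

def check_format(output: str) -> list[str]:
    """Flag fenced output, a wrong header, wrong column counts, and blank fields."""
    errors = []
    if output.lstrip().startswith("`"):          # a leading backtick means a markdown fence
        errors.append("output is wrapped in a markdown fence")

    rows = list(csv.reader(io.StringIO(output)))  # csv handles RFC 4180 quoting on parse
    if not rows or rows[0] != EXPECTED_HEADER:
        errors.append("missing or incorrect header row")

    for row in rows[1:]:
        if len(row) != len(EXPECTED_HEADER):
            errors.append(f"expected {len(EXPECTED_HEADER)} fields, got {len(row)}: {row}")
        elif "" in row:
            errors.append(f"blank field where the null token was expected: {row}")
    return errors
```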
Rules: additive layering
Foundation
- Do not infer from visual elements.
- Use explicit values only; preserve decimals and 0; use null when missing.
- Normalize names to FirstName LastName.
- Return one row per page.
- Ignore previous-review content.
Five rules target the fixture's first edge cases: star-icon traps, zero-as-null confusion, and decimal rounding.
Sentiment and location
Inherits every Stage 1 rule.
- Add five sentiment labels with positive definitions.
- Preserve location as exact source text or null.
Sentiment went through three revision cycles and was the hardest section to stabilize.
JSON, season, and schema
Inherits every Stage 1 and Stage 2 rule.
- Infer season from date, text, or location.
- Include previous review when present.
- Add a schema section with an example.
The format contract is rewritten for JSON; the schema becomes both documentation and a structural template.
Side-by-side prompt comparison
| Section | Stage 1 (Basic Table) | Stage 2 (Extended Table) | Stage 3 (JSON) |
|---|---|---|---|
| Role | Data extraction expert for HTML review pages. | Inherited from Stage 1. | Stage 1 role plus a structured-JSON target. |
| Task | Return a two-column CSV for the current review. | Return a four-column CSV for the current review. | Return season-grouped JSON under a `reviews` key. |
| Rules | Five foundation rules: no visual inference, explicit values, normalization, null handling, ignore previous review. | Carry Stage 1 forward and add sentiment definitions plus exact location preservation. | Carry Stage 2 forward and add season inference plus previous-review inclusion. |
| Format contract | Raw CSV, header first, RFC 4180 quoting, exactly 2 fields. | Raw CSV, escaped quotes, exactly 4 fields, no blank values. | Raw JSON, start directly with `{`, season arrays, all fields required. |
| Schema | Not used. | Not used. | Full JSON example with nested review objects. |
Format contract evolution
| Aspect | Stage 1 (Basic Table) | Stage 2 (Extended Table) | Stage 3 (JSON) |
|---|---|---|---|
| Output shape | Raw CSV with 2 fields. | Raw CSV with 4 fields. | Raw JSON grouped by season. |
| Quoting | Quote every field value with RFC 4180 rules. | Keep Stage 1 quoting and escape embedded quotes. | Handled by native JSON string encoding. |
| Missing values | Use "null" for a missing rating. | Use "null" for missing rating and location. | Use native JSON null values. |
| Anti-prose rule | No prose, markdown, or extra columns. | Keep Stage 1 and add “no blank fields.” | Start directly with `{`. |
Prompt structure
Every prompt follows the same XML scaffold. The sections below show the general shape, with inherited content summarized rather than repeated.
<role>
You are a data extraction expert extracting review data from HTML pages.
← Stage 3 adds: "into structured JSON"
</role>
<task>
Extract the main current review from each page. Return [format] with [fields/structure].
</task>
<rules>
# Foundation rules (Stage 1, inherited by all)
- Do not infer [value] from [visual signal].
- [field]: explicit numeric value only; preserve decimals and 0; use null when missing.
- Normalize [field] to [TargetFormat].
- One [row/object] per page.
- Ignore previous-review content.
# Sentiment rules (Stage 2+, inherited by Stage 3)
- [field]: one of [label_1], [label_2], … [label_n].
- Use [label] when [specific observable condition].
# Season + previous review (Stage 3 only)
- [field]: one of [value_1], … [value_n]. Infer from [source_1], then [source_2].
- Include previous review when present; use null otherwise.
</rules>
<format_contract>
- Output raw [format] only — no prose, markdown, fences.
- [Format-specific quoting, header, or structural rules.]
- Use [token] for missing values.
</format_contract>
<schema> ← Stage 3 only
{ "reviews": { "Fall": [ { … } ], "Winter": [], … } }
</schema>
How Stage 3 reinforces earlier rules
- Season grouping checks date accuracy. A wrong date lands the review in the wrong season bucket, failing the `season_grouping` check.
- The schema constrains field types. Rating must be a number, stats must be integers, and the review object needs both `current` and `previous` keys.
- Previous-review extraction tests the scoping rule. Stages 1–2 say “ignore previous content.” Stage 3 says “include it in `review.previous`.” The model must distinguish current from previous data precisely.
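A sketch of what a `season_grouping`-style check could look like, assuming the fixture mirrors the output schema above (the repo's actual check may be stricter):

```python
def check_season_grouping(actual: dict, expected: dict) -> list[str]:
    """Every review must land in the same season bucket as the ground truth."""
    errors = []
    for season, expected_reviews in expected.get("reviews", {}).items():
        # Assumes each review object carries current.reviewer, per the schema above.
        expected_names = {r["current"]["reviewer"] for r in expected_reviews}
        actual_names = {
            r["current"]["reviewer"]
            for r in actual.get("reviews", {}).get(season, [])
        }
        if expected_names != actual_names:
            errors.append(
                f"{season}: expected {sorted(expected_names)}, got {sorted(actual_names)}"
            )
    return errors
```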
Iteration Process
Each iteration follows a five-step cycle: change one thing, run it, measure the result, analyze failures, then decide what to do next.
| Step | What Happens | Artifact |
|---|---|---|
| Change | Edit one rule or format line in the prompt | Updated .system.md |
| Run | Send all 20 pages + prompt to Bedrock | output.txt |
| Measure | Parse output, run checks against fixture | validation.json |
| Analyze | Sonnet examines each failure with expected and actual values, then suggests fixes | suggested_improvements.json |
| Decide | Fix the prompt, update the fixture, or try a different model | Next iteration |
Every run saves a timestamped bundle to runs/, so any two iterations can be compared directly.
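A minimal sketch of the run step against Amazon Bedrock using the Converse API, saving a timestamped bundle under runs/. The model ID is a placeholder and the script layout is assumed, not the repo's actual code:

```python
import datetime
import json
from pathlib import Path

import boto3

bedrock = boto3.client("bedrock-runtime")

def run_page(system_prompt: str, page_html: str, model_id: str) -> str:
    """Send one HTML page plus the system prompt to a Bedrock Claude model."""
    response = bedrock.converse(
        modelId=model_id,  # placeholder: use the Sonnet or Haiku ID for the run
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": page_html}]}],
        inferenceConfig={"temperature": 0.3, "maxTokens": 2048},
    )
    return response["output"]["message"]["content"][0]["text"]

def save_run(raw_outputs: dict[str, str], validation: dict) -> Path:
    """Write output.txt and validation.json into a timestamped runs/ directory."""
    run_dir = Path("runs") / datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "output.txt").write_text("\n\n".join(raw_outputs.values()))
    (run_dir / "validation.json").write_text(json.dumps(validation, indent=2))
    return run_dir
```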
Iteration Log: What Changed and Why
Model Comparison
The same prompts were tested on two Claude model tiers via Amazon Bedrock.
Results (after fixture calibration)
Best-run summary: both models reached 20/20 on Basic Table, Sonnet 4.6 reached 20/20 on Extended Table, and Haiku 4.5 reached 20/20 on JSON. The remaining drift was limited to 0–2 nondeterministic sentiment mismatches per run.
| Task | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|
| Basic Table | 20/20 | 20/20 |
| Extended Table | 20/20 | 19–20/20 |
| JSON | 18–20/20 | 20/20 |
Both models achieve full marks on their best runs. The 0–2 remaining mismatches per run are nondeterministic — borderline sentiment cases that flip at temperature 0.3. This is the noise floor, and it shows which classifications are robust versus which sit on genuine semantic boundaries.
Model-specific behaviors
| Behavior | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|
| Format compliance | Follows anti-fence and quoting rules reliably | Needed the explicit first-token rule to stop markdown fencing |
| Sentiment boundaries | Occasionally shifts mixed → positive | Occasionally shifts positive → mixed |
| Structural accuracy | Consistent | Same after Haiku 4.5 upgrade |
Ground-Truth Calibration
| Reviewer | Original | Corrected | Evidence |
|---|---|---|---|
| Theo Carter | unclear | mixed | Uses contrast language (“but”, “though I would hesitate”). Both models, every run. |
| Colin Ward | unclear | neutral | Purely observational (“packaging arrived intact”). Both models, every run. |
| Henry Brooks | positive | mixed | Explicit contrast (“shaving off a half point only because…”). Both models, every run. |
Final Results and Takeaways
Validation checks
| Check | Task 1 | Task 2 | Task 3 |
|---|---|---|---|
| format_compliance | PASS | PASS | PASS |
| required_fields_match | PASS | PASS | PASS |
| row_coverage | PASS | PASS | PASS |
| no_hallucination | PASS | PASS | PASS |
| missing_and_edge_values | PASS | PASS | PASS |
| rating_missing_token | PASS | PASS | — |
| location_missing_token | — | PASS | — |
| column_mapping | — | PASS | — |
| season_grouping | — | — | PASS |
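As an example of the missing-and-edge-values checks, a rating comparison could look like the sketch below. Check names mirror the table above; the actual implementation may differ:

```python
def check_rating(expected: str, actual: str) -> str | None:
    """Missing ratings must use the literal 'null' token; 0 and decimals must survive."""
    if expected == "null":                       # rating genuinely absent on the page
        return None if actual == "null" else f"expected null token, got {actual!r}"
    if actual in ("", "null"):
        return f"rating dropped: expected {expected!r}"
    # Compare numerically so 0, 0.5, and 3.5 are preserved rather than rounded.
    try:
        ok = float(actual) == float(expected)
    except ValueError:
        return f"non-numeric rating: {actual!r}"
    return None if ok else f"expected {expected!r}, got {actual!r}"
```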
Key takeaways
Build prompts incrementally
Start simple and add one layer at a time. Each new failure traces to the latest change, not to accumulated complexity.
Edge cases drive rule quality
Null ratings, zero ratings, decimals, and unusual formats forced rules that would otherwise be skipped. Happy-path-only test data produces prompts that fail in production.
Format contracts matter as much as extraction rules
Quoting, fencing, and output shape caused more iterations than the actual data extraction.
Define labels positively
"Neutral = facts or mild opinions" is more stable than defining labels by exclusion. Positive definitions hold up better across model versions and temperatures.
Positive instructions beat prohibitions
"Start with `" works better than "do not wrap in markdown fences."
Ground truth needs calibration too
When two models consistently agree against the fixture, re-examine the label. The fixture is part of the system under test.
Nondeterminism is informative
Borderline cases that flip between runs reveal which classifications are robust and which sit on real semantic boundaries.
Use consistent XML scaffolding
Sections like <role>, <task>, <rules>, and <format_contract> give the model clear boundaries. Adding <schema> in Stage 3 served as both documentation and a structural template.