Same-flyer reuse to skip duplicate extraction
- Baseline: 55%
- Final: 88%
- Delta: +33 pts
- Variants: 3
What we set out to improve
Detect when a flyer has already been forwarded and reuse its prior extraction instead of re-running structured extraction on the near-duplicate.
Kept. Combining a content hash with a layout-similarity check caught re-forwarded flyers at 0.88 precision, skipping redundant extraction at a low resource cost. The heuristic was promoted to the family-documents knowledge base.
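As a rough illustration of how a content hash and a layout-similarity check might be combined, here is a minimal sketch. The report does not specify how either signal is computed, so everything here is an assumption: the `Flyer` type, the whitespace and case normalization before hashing, the layout signature as a set of coarse `(kind, column, row)` cells, the Jaccard measure, and the 0.8 threshold are all hypothetical.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Flyer:
    """Hypothetical flyer record; not the pipeline's actual type."""
    text: str
    # Assumed layout signature: one (kind, column, row) cell per block on a coarse grid.
    blocks: frozenset


def content_hash(text: str) -> str:
    # Collapse whitespace and lowercase so forwarding noise does not change the hash.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def layout_similarity(a: frozenset, b: frozenset) -> float:
    # Jaccard similarity between two layout signatures (assumed measure).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_reforward(new: Flyer, prior: Flyer, threshold: float = 0.8) -> bool:
    # Both signals must agree: same normalized text and a similar enough layout.
    same_text = content_hash(new.text) == content_hash(prior.text)
    return same_text and layout_similarity(new.blocks, prior.blocks) >= threshold


layout = frozenset({("title", 0, 0), ("body", 0, 1), ("image", 1, 1)})
original = Flyer("Spring Fair\nSaturday 10am", layout)
forwarded = Flyer("  spring fair   saturday 10AM ", layout)
print(is_reforward(forwarded, original))
```

Requiring both signals to agree is one plausible reading of why the combined variant scored higher than the hash-only variant; the report itself does not say how the two checks are combined.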
Variants we tried
Each variant and its coarse objective metric. The kept variant is marked; bars are relative to the best run.
- 1. Baseline: re-extract every forward · Medium · 55%
- 2. Variant A: content-hash match only · Low · 74%
- 3. Variant B: content-hash + layout similarity (Winner) · Low · 88%
Stages
- baseline: Succeeded · 1.6s
- variant run: Succeeded · 5.4s
- eval: Succeeded · 950ms
- promote: Succeeded · 240ms
Artifacts and what shipped
Redaction-safe artifact previews, diffs, metric tables, and prompt variants with sensitive text removed.
- Metric table: Duplicate-detection precision by variant (0.55 → 0.88)
- Diff summary: Pipeline diff: add a reuse gate before extraction
- KB write: Promoted the reuse heuristic to the family-docs KB
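The reuse gate named in the diff summary could look roughly like the sketch below. This is not the shipped pipeline diff: `extract_with_reuse`, the in-memory `history` list, and the `is_duplicate` and `extract` callables are hypothetical names chosen for illustration. The gate checks prior flyers first and only runs extraction when no earlier flyer matches.

```python
from typing import Any, Callable, List, Tuple


def extract_with_reuse(
    doc: Any,
    history: List[Tuple[Any, dict]],
    is_duplicate: Callable[[Any, Any], bool],
    extract: Callable[[Any], dict],
) -> Tuple[dict, bool]:
    """Return (extraction, reused). All names here are illustrative assumptions."""
    # Gate: look for a previously seen flyer that the detector accepts as a duplicate.
    for prior_doc, prior_result in history:
        if is_duplicate(doc, prior_doc):
            return prior_result, True
    # No match: run the (expensive) structured extraction and remember the result.
    result = extract(doc)
    history.append((doc, result))
    return result, False


calls = []


def fake_extract(doc):
    # Stand-in for structured extraction; records each call so reuse is visible.
    calls.append(doc)
    return {"title": doc}


history = []
first, reused_first = extract_with_reuse("flyer", history, lambda a, b: a == b, fake_extract)
second, reused_second = extract_with_reuse("flyer", history, lambda a, b: a == b, fake_extract)
print(reused_first, reused_second, len(calls))
```

Passing the duplicate detector in as a callable keeps the gate independent of how duplicates are detected, so a hash-only check and a hash-plus-layout check can be swapped without touching the gate.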