AIDiveForge
🧬

HTML-to-Dataset Extractor

Data Extraction & Parsing · by AIDiveForge · Apr 20, 2026 · Intermediate

Turn a repetitive HTML listing page (directory, catalog, search result) into a flat CSV with a self-verifying schema — no per-site selectors to maintain.

🧠 Why it works

Brittle CSS selectors break whenever a site redesigns, and hand-maintained scrapers tend to die from that maintenance burden. Letting the model infer the record shape on each run trades a higher per-run cost for no selector maintenance, which is the right trade when you're extracting occasionally rather than continuously.

⚙️ How it works

  1. Fetch the HTML (or load from disk), strip script / style / nav.
  2. Show the LLM 3 randomly-sampled inner blocks and ask for a JSON schema of the record type.
  3. Apply the schema to every matched block and emit rows.
  4. Verification pass: re-extract 10 percent with a zero-shot prompt (no schema shown) and diff against the first pass; fail the run if field-level agreement is below 95 percent.
  5. Emit CSV plus the inferred schema as a manifest file.
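
Steps 1 and 2 can be sketched with the Python standard library alone. This is a minimal illustration, not the skill's actual implementation: the names `NoiseStripper`, `strip_noise`, and `sample_blocks` are hypothetical, and how the repeated "inner blocks" are located on the page is left out because the description above does not specify it.

```python
import html
import random
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav"}  # the elements named in step 1


class NoiseStripper(HTMLParser):
    """Re-emits markup while dropping everything inside SKIP_TAGS.

    Comments and doctype declarations are dropped too, because their
    handlers are not overridden.
    """

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, with no end tag.
        if tag not in SKIP_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape, since the parser already decoded entities.
            self.out.append(html.escape(data, quote=False))


def strip_noise(html_text):
    """Return html_text without script, style, and nav elements."""
    parser = NoiseStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def sample_blocks(blocks, k=3, seed=None):
    """Pick up to k blocks at random to show the model (step 2)."""
    rng = random.Random(seed)
    return rng.sample(blocks, min(k, len(blocks)))
```

Passing a fixed `seed` makes a run reproducible, which helps when comparing inferred schemas across runs. The sampled blocks would then go into whatever prompt asks the model for the JSON schema.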

Description

Point it at a URL or a folder of saved HTML files that all follow the same template. It infers the common record shape, extracts all records, and emits a CSV. A verification step re-runs extraction on a random 10 percent sample with a different prompt to confirm field-level agreement.
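
The verification step comes down to a field-by-field comparison of two extractions of the same sampled records. The sketch below shows one way that check could look, assuming each pass yields a list of dicts in the same record order. The helper names and the whitespace and case normalization are assumptions for illustration; the 10 percent sample and 95 percent threshold come from the description above.

```python
import random


def sample_indices(n_rows, fraction=0.10, seed=None):
    """Choose which rows to re-extract: a random 10 percent, at least one row."""
    if n_rows == 0:
        return []
    k = max(1, round(n_rows * fraction))
    return sorted(random.Random(seed).sample(range(n_rows), k))


def _norm(value):
    """Assumed normalization: collapse whitespace and ignore case."""
    text = "" if value is None else str(value)
    return " ".join(text.split()).casefold()


def field_agreement(first_pass, second_pass, fields):
    """Fraction of (record, field) pairs on which the two passes agree."""
    total = matches = 0
    for a, b in zip(first_pass, second_pass):
        for field in fields:
            total += 1
            matches += _norm(a.get(field)) == _norm(b.get(field))
    if total == 0:
        raise ValueError("nothing to compare")
    return matches / total


def verify(first_pass, second_pass, fields, threshold=0.95):
    """Return the agreement score, or raise if it falls below the threshold."""
    score = field_agreement(first_pass, second_pass, fields)
    if score < threshold:
        raise ValueError(f"field-level agreement {score:.0%} is below {threshold:.0%}")
    return score
```

Here `first_pass` would hold the schema-guided rows at the sampled indices and `second_pass` the zero-shot re-extraction of the same blocks. How strictly values are normalized before comparison is a design choice: looser normalization hides formatting noise but can also mask real disagreements.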

Install this skill

A Claude skill is a skill.md file with YAML frontmatter and a markdown body. Drop the file into your tool of choice — or pick a different format if you use Cursor, Windsurf, Copilot, or something else.

mkdir -p ~/.claude/skills/html-to-dataset-extractor \
  && curl -L https://aidiveforge.com/skill/html-to-dataset-extractor.skill-md \
       -o ~/.claude/skills/html-to-dataset-extractor/skill.md

Save to ~/.claude/skills/html-to-dataset-extractor/skill.md
