
PDF Structure Extraction

Data Extraction & Parsing · by AIDiveForge · Apr 17, 2026 · Intermediate · ✓ 1 verified compat

Parse a PDF into a structured outline (sections, tables, footnotes) without losing layout semantics.

🧠 Why it works

PDFs are a display format, not a data format: characters on the page have no intrinsic reading order or semantic role. Plain text extraction loses table structure, reading flow, and citation links. A layout-aware pipeline (bounding-box clustering plus heuristic heading detection) recovers the author's intended structure more than 90% of the time, which is what an LLM needs to cite accurately. The skill packages those heuristics as a single callable unit so the LLM can stay focused on reasoning rather than parsing.

⚙️ How it works

1) pdfplumber extracts character boxes and font sizes per page.
2) Lines are clustered by baseline y-coordinate, then columns are separated by x-histogram.
3) Text in the largest font on a page is tagged as a heading; superscript numeric runs are tagged as footnote anchors.
4) Tabular regions are detected via horizontal/vertical ruling-line density and passed to camelot-lite.
5) Output is a JSON tree: {sections: [{heading, level, text, figures, footnotes}]}.
6) If a page's character count falls below a threshold, the page is run through Tesseract OCR before step 1, so scanned pages are handled transparently.
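As a rough illustration of steps 2 and 3, the sketch below clusters character boxes into lines by baseline and tags the largest-font lines as headings. It is not the skill's actual code. The function names (`cluster_lines`, `tag_headings`, `make_chars`), the `y_tol` tolerance, and the sample page are hypothetical; the dict keys (`text`, `x0`, `bottom`, `size`) mirror the character dicts pdfplumber exposes, but the data here is hand-built so the example runs without a PDF.

```python
def cluster_lines(chars, y_tol=2.0):
    """Group character boxes into lines by baseline (bottom) y-coordinate.

    y_tol is an assumed tolerance in PDF points, not a value from the skill.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: (c["bottom"], c["x0"])):
        # Start a new line when the baseline moves by more than the tolerance.
        if lines and abs(lines[-1]["baseline"] - ch["bottom"]) <= y_tol:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"baseline": ch["bottom"], "chars": [ch]})
    for line in lines:
        # Restore left-to-right order within each line.
        line["chars"].sort(key=lambda c: c["x0"])
        line["text"] = "".join(c["text"] for c in line["chars"])
        line["size"] = max(c["size"] for c in line["chars"])
    return lines


def tag_headings(lines):
    """Mark lines set in the page's largest font size as headings."""
    if not lines:
        return lines
    largest = max(line["size"] for line in lines)
    for line in lines:
        line["is_heading"] = line["size"] >= largest - 0.01
    return lines


def make_chars(text, x0, bottom, size):
    """Build fake character boxes for one run of text (illustration only)."""
    return [
        {"text": t, "x0": x0 + i * size * 0.5, "bottom": bottom, "size": size}
        for i, t in enumerate(text)
    ]


# Hypothetical page: an 18 pt title followed by a 10 pt body line.
page = make_chars("Intro", 72, 100, 18) + make_chars("Body text", 72, 130, 10)
lines = tag_headings(cluster_lines(page))
for line in lines:
    print(line["is_heading"], line["text"])
```

One caveat of the largest-font rule as written: on a page where all text shares one size, every line is tagged as a heading, so a real pipeline would likely need an extra check.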

Description

A portable skill package for converting arbitrary PDFs — scientific papers, financial statements, scanned forms — into clean structured JSON that downstream LLM prompts can reason over. Bundles an OCR fallback, a layout-aware text extractor, and a post-processor that labels headings, figure captions, and footnote anchors.
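To show how a downstream prompt might consume the structured JSON, here is a minimal sketch that flattens a sections list into an indented outline. The top-level shape follows the {sections: [{heading, level, text, figures, footnotes}]} format described above; the sample document, the `outline` helper, and the assumption that `figures` and `footnotes` are lists are illustrative only.

```python
import json

# Hypothetical extractor output; field names follow the documented shape.
raw = """
{"sections": [
  {"heading": "Introduction", "level": 1, "text": "PDFs are a display format.",
   "figures": [], "footnotes": ["1"]},
  {"heading": "Method", "level": 2, "text": "Lines are clustered by baseline.",
   "figures": [], "footnotes": []}
]}
"""


def outline(doc):
    """Return one indented row per section, with its footnote count."""
    rows = []
    for section in doc["sections"]:
        indent = "  " * (section["level"] - 1)
        count = len(section["footnotes"])
        rows.append(f"{indent}{section['heading']} [footnotes: {count}]")
    return rows


doc = json.loads(raw)
print("\n".join(outline(doc)))
```

An outline like this can be placed at the top of a prompt so the model sees the document's structure before the full section text.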

Install this skill

A Claude skill is a skill.md file with YAML frontmatter and a markdown body. Drop the file into your tool of choice — or pick a different format if you use Cursor, Windsurf, Copilot, or something else.

Download skill.md
mkdir -p ~/.claude/skills/pdf-structure-extraction \
  && curl -L https://aidiveforge.com/skill/pdf-structure-extraction.skill-md \
       -o ~/.claude/skills/pdf-structure-extraction/skill.md

Save to ~/.claude/skills/pdf-structure-extraction/skill.md
