PDF Structure Extraction
Parse a PDF into a structured outline (sections, tables, footnotes) without losing layout semantics.
🧠 Why it works
PDFs are a display format, not a data format: characters on the page have no intrinsic reading order or semantic role. Pure text extraction loses tables, reading flow, and citation links. A layout-aware pipeline (bounding-box clustering plus heuristic heading detection) recovers the author's intended structure over 90% of the time, which is what an LLM needs to cite accurately. The skill packages those heuristics as a single callable unit so the LLM can stay focused on reasoning rather than parsing.
⚙️ How it works
1) pdfplumber extracts character boxes and font sizes for each page.
2) Lines are formed by clustering characters on baseline y-coordinate, then columns are found from an x-histogram.
3) Text in the largest font on a page is tagged as a heading; superscript numeric runs are tagged as footnote anchors.
4) Tabular regions are detected via horizontal/vertical ruling-line density and passed to camelot-lite.
5) Output is a JSON tree: {sections: [{heading, level, text, figures, footnotes}]}.
6) If a page's character count falls below a threshold, the page is run through Tesseract OCR before step 1, so scanned pages are handled transparently.
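Steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, not the skill's actual code: it assumes character dicts shaped like the entries of pdfplumber's page.chars (keys "text", "x0", "bottom", "size"), uses the bottom edge of each box as a stand-in for the baseline, and skips column and superscript handling. The function names, the y_tol tolerance, and the rule that a heading font must be larger than the page's most common font size are assumptions made for the example.

```python
from collections import Counter


def cluster_lines(chars, y_tol=2.0):
    """Group character boxes into text lines by baseline proximity.

    y_tol (in points) is an assumed tolerance, not a value from the skill.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: c["bottom"]):
        # Join the current line while the bottom edge stays within tolerance
        # of the line's first character; otherwise start a new line.
        if lines and ch["bottom"] - lines[-1]["bottom"] <= y_tol:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"bottom": ch["bottom"], "chars": [ch]})
    for line in lines:
        line["chars"].sort(key=lambda c: c["x0"])  # left-to-right order
        line["text"] = "".join(c["text"] for c in line["chars"])
        line["size"] = max(c["size"] for c in line["chars"])
    return lines


def tag_headings(lines):
    """Tag lines set in the page's largest font as headings.

    Assumption: nothing is tagged when the largest font is also the most
    common one, so a page of uniform body text yields no headings.
    """
    if not lines:
        return lines
    sizes = Counter(round(c["size"], 1) for ln in lines for c in ln["chars"])
    body_size = sizes.most_common(1)[0][0]
    largest = max(sizes)
    for line in lines:
        is_largest = round(line["size"], 1) == largest
        line["role"] = "heading" if is_largest and largest > body_size else "body"
    return lines
```

With pdfplumber installed, the input would typically come from something like pdfplumber.open(path).pages[0].chars, and the result of tag_headings(cluster_lines(chars)) is a top-to-bottom list of lines, each carrying "text", "size", and "role".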
Description
A portable skill package for converting arbitrary PDFs — scientific papers, financial statements, scanned forms — into clean structured JSON that downstream LLM prompts can reason over. Bundles an OCR fallback, a layout-aware text extractor, and a post-processor that labels headings, figure captions, and footnote anchors.
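To make the output shape concrete, here is a hedged sketch of how tagged lines might be folded into the {sections: [...]} tree, together with the character-count check that triggers the OCR fallback. The names needs_ocr and build_outline, the min_chars default of 50, the fixed level of 1, and the untitled level-0 section for text that precedes the first heading are all assumptions for illustration; the skill's real thresholds and level rules are not stated here.

```python
import json


def needs_ocr(page_text, min_chars=50):
    """Return True when a page has too little extractable text.

    min_chars is an assumed threshold; a scanned page usually yields
    few or no characters from direct extraction.
    """
    return len(page_text.strip()) < min_chars


def new_section(heading, level):
    """Create an empty section node with the fields named in the output tree."""
    return {"heading": heading, "level": level, "text": "",
            "figures": [], "footnotes": []}


def build_outline(lines):
    """Fold role-tagged lines into a {"sections": [...]} tree.

    Each line is a dict with "role" ("heading" or "body") and "text".
    Every heading opens a level-1 section (an assumed simplification);
    figures and footnotes are left empty in this sketch.
    """
    sections = []
    for line in lines:
        if line["role"] == "heading":
            sections.append(new_section(line["text"], 1))
            continue
        if not sections:
            # Body text before any heading goes into an untitled section.
            sections.append(new_section(None, 0))
        current = sections[-1]
        current["text"] = (current["text"] + "\n" + line["text"]
                           if current["text"] else line["text"])
    return {"sections": sections}


if __name__ == "__main__":
    demo = [
        {"role": "heading", "text": "Introduction"},
        {"role": "body", "text": "PDFs are a display format."},
    ]
    print(json.dumps(build_outline(demo), indent=2))
```

Keeping the tree as plain dicts and lists means it serializes directly with json.dumps, which suits the stated goal of handing the result to a downstream prompt.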
Install this skill
A Claude skill is a skill.md file with YAML frontmatter and a markdown body.
Drop the file into your tool of choice — or pick a different format if you use Cursor, Windsurf, Copilot, or something else.
mkdir -p ~/.claude/skills/pdf-structure-extraction \
&& curl -L https://aidiveforge.com/skill/pdf-structure-extraction.skill-md \
-o ~/.claude/skills/pdf-structure-extraction/skill.md