---
name: pdf-structure-extraction
description: Parse a PDF into a structured outline (sections, tables, footnotes) without losing layout semantics.
title: PDF Structure Extraction
category: data-parsing
difficulty: intermediate
license: MIT
source_url: "https://github.com/anthropics/anthropic-cookbook/tree/main/skills/pdf"
icon: 📑
input: pdf
output: structured-json
phase: pre
domain: data
tags: pdf-extraction,layout-analysis,ocr,structured-json,table-detection,heading-detection,document-parsing,bounding-box-clustering,scanned-documents,semantic-preservation
best_for:
  - scientific papers and academic documents
  - financial statements and regulatory filings
  - scanned forms and legacy documents
  - multi-page reports with complex layouts
---

## Description

A portable skill package that converts arbitrary PDFs (scientific papers, financial statements, scanned forms) into clean structured JSON that downstream LLM prompts can reason over. It bundles an OCR fallback, a layout-aware text extractor, and a post-processor that labels headings, figure captions, and footnote anchors.

## Why it works

PDFs are a display format, not a data format: characters on the page have no intrinsic reading order or semantic role. Plain text extraction loses table structure, reading flow, and citation links. A layout-aware pipeline (bounding-box clustering plus heuristic heading detection) recovers the author's intended structure in 90%+ of cases, which is what an LLM needs to cite accurately. The skill packages those heuristics as a single callable unit so the LLM can stay focused on reasoning rather than parsing.
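To make the reading-order point concrete, here is a minimal sketch of baseline clustering on character boxes. It assumes each box is a dict with `text`, `x0`, and `bottom` keys, mirroring the entries in pdfplumber's `page.chars`; the function name `cluster_lines` and the `y_tol` tolerance are hypothetical and not part of the skill itself.

```python
def cluster_lines(chars, y_tol=2.0):
    """Group character boxes into text lines by baseline proximity.

    `chars` is a list of dicts with "text", "x0" and "bottom" keys
    (assumed shape, modelled on pdfplumber's page.chars entries).
    `y_tol` is an illustrative tolerance in PDF points.
    """
    lines = []
    # Sort top-to-bottom first so boxes on the same baseline become neighbours.
    for ch in sorted(chars, key=lambda c: (c["bottom"], c["x0"])):
        if lines and abs(ch["bottom"] - lines[-1]["bottom"]) <= y_tol:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"bottom": ch["bottom"], "chars": [ch]})

    # Within each line, left-to-right order is recovered from x0.
    return [
        "".join(c["text"] for c in sorted(line["chars"], key=lambda c: c["x0"]))
        for line in lines
    ]


# Boxes arrive in arbitrary stream order, as they may in a real PDF.
scrambled = [
    {"text": "b", "x0": 10, "bottom": 100.0},
    {"text": "H", "x0": 0, "bottom": 50.0},
    {"text": "a", "x0": 0, "bottom": 100.5},
    {"text": "i", "x0": 8, "bottom": 50.4},
]
recovered = cluster_lines(scrambled)
```

Joining the same boxes in stream order would give `bHai`; clustering by position yields the two lines the author wrote. A production version would also need to handle multi-column pages and raised glyphs such as superscripts, which this sketch ignores.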

## How it works

1. pdfplumber extracts character boxes and font sizes per page.
2. Lines are clustered by baseline y-coordinate, then columns by x-histogram.
3. Text in the largest font on a page is tagged as a heading; superscript numeric runs are tagged as footnote anchors.
4. Tabular regions are detected via horizontal/vertical ruling-line density and passed to camelot-lite.
5. Output is a JSON tree: `{sections: [{heading, level, text, figures, footnotes}]}`.
6. If the per-page character count is below a threshold, the page is re-run through Tesseract OCR before step 1, which catches scanned pages transparently.
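The tagging, output, and OCR-fallback steps can be sketched in plain Python. This is an illustration under stated assumptions rather than the skill's actual code: the helper names, the `min_chars` threshold, the tolerances, and the `level` values are all hypothetical, and the input dicts only borrow the `text`, `size`, `x0`, and `bottom` keys from pdfplumber's character objects. One deliberate deviation is a guard that only treats the largest font as a heading when it is larger than the most common (body) size, so a page of uniform text is not tagged as all headings.

```python
from collections import Counter


def needs_ocr(chars, min_chars=20):
    """Step 6: a page with very few extracted characters is likely scanned.
    The threshold of 20 is an assumed value, not one given by the skill."""
    return len(chars) < min_chars


def footnote_anchors(chars, body_size, baseline, raise_tol=1.0):
    """Step 3 (footnotes): collect runs of digits that are set smaller than
    body text and raised above the baseline. pdfplumber measures "bottom"
    from the top of the page, so a raised glyph has a smaller "bottom"."""
    anchors, run = [], ""
    for ch in sorted(chars, key=lambda c: c["x0"]):
        raised = ch["bottom"] < baseline - raise_tol
        if ch["text"].isdigit() and ch["size"] < body_size and raised:
            run += ch["text"]
        elif run:
            anchors.append(run)
            run = ""
    if run:
        anchors.append(run)
    return anchors


def _new_section(heading, level):
    return {"heading": heading, "level": level, "text": "",
            "figures": [], "footnotes": []}


def build_sections(lines, size_tol=0.1):
    """Steps 3 and 5: tag largest-font lines as headings and nest the
    following text under them. `lines` is a list of {"text", "size"} dicts."""
    if not lines:
        return {"sections": []}
    sizes = [ln["size"] for ln in lines]
    largest = max(sizes)
    body = Counter(sizes).most_common(1)[0][0]  # most frequent size = body text

    sections = []
    current = _new_section(None, 0)  # text before the first heading, if any
    for ln in lines:
        is_heading = abs(ln["size"] - largest) < size_tol and largest > body
        if is_heading:
            if current["heading"] is not None or current["text"]:
                sections.append(current)
            current = _new_section(ln["text"], 1)
        else:
            current["text"] = (current["text"] + " " + ln["text"]).strip()
    sections.append(current)
    return {"sections": sections}


page_lines = [
    {"text": "Introduction", "size": 16.0},
    {"text": "PDFs are a display format.", "size": 10.0},
    {"text": "Reading order is implicit.", "size": 10.0},
    {"text": "Methods", "size": 16.0},
    {"text": "We cluster boxes.", "size": 10.0},
]
tree = build_sections(page_lines)

line_chars = [
    {"text": "x", "size": 10.0, "x0": 0, "bottom": 100.0},
    {"text": "1", "size": 6.0, "x0": 5, "bottom": 95.0},
    {"text": "2", "size": 6.0, "x0": 8, "bottom": 95.0},
    {"text": "y", "size": 10.0, "x0": 12, "bottom": 100.0},
    {"text": "3", "size": 10.0, "x0": 18, "bottom": 100.0},
]
anchors = footnote_anchors(line_chars, body_size=10.0, baseline=100.0)
```

Here `tree` holds two sections, each with its body text joined under the heading, and `anchors` contains only the raised run `12`; the full-size `3` on the baseline is left as ordinary text. Table detection (step 4) is omitted because it depends on ruling-line geometry that this simplified input does not carry.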