extract-pdf-text

When to Use

Use this skill when an agent needs to extract text from PDFs. It relies on PyMuPDF (imported as `fitz`) for fast local extraction and covers text-based documents, scanned pages (via OCR), forms, and complex layouts.

Quick Reference

Topic	File
Code examples	`examples.md`
OCR setup	`ocr.md`
Troubleshooting	`troubleshooting.md`

Core Rules

1. Install PyMuPDF First

pip install PyMuPDF

Import as fitz (historical name):

import fitz  # PyMuPDF

2. Basic Text Extraction

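import fitz

# The context manager closes the document even if extraction fails
with fitz.open("document.pdf") as doc:
    # Join once instead of growing a string page by page
    text = "".join(page.get_text() for page in doc)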

3. Pick the Right Method

PDF Type	Method
Text-based	`page.get_text()` — fast, accurate
Scanned	OCR with pytesseract — slower
Mixed	Check each page, use OCR when needed
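The table above can be turned into a small per-document check. The following is a minimal sketch, assuming a page counts as scanned when it yields fewer than 50 characters (the same threshold as the `needs_ocr` check in this skill). The names `classify_pdf`, `classify_file`, and `min_chars` are illustrative and not part of PyMuPDF.

```python
def classify_pdf(page_texts, min_chars=50):
    """Label a document 'text-based', 'scanned', or 'mixed'.

    page_texts: one string per page, e.g. [p.get_text() for p in doc].
    min_chars: assumed threshold below which a page is treated as scanned.
    """
    if not page_texts:
        return "empty"
    scanned = [len(t.strip()) < min_chars for t in page_texts]
    if all(scanned):
        return "scanned"
    if not any(scanned):
        return "text-based"
    return "mixed"


def classify_file(path):
    """Open a PDF with PyMuPDF and classify it."""
    import fitz  # PyMuPDF; imported here so classify_pdf works without it

    with fitz.open(path) as doc:
        return classify_pdf([page.get_text() for page in doc])
```

A "mixed" result suggests deciding page by page rather than running OCR on the whole file.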

4. Check for Text Before OCR

def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
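Once a page is flagged, one possible fallback is to rasterize it and pass the image to Tesseract. The sketch below assumes `pytesseract`, Pillow, and the Tesseract binary are installed (see `ocr.md` for setup); the function names, the 300 dpi default, and the injectable `ocr` parameter are choices made for illustration.

```python
def ocr_page(page, dpi=300):
    """Render a PyMuPDF page to an image and run Tesseract on it."""
    # Imported lazily: both packages are only needed for scanned pages.
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # rasterize the page
    mode = "RGBA" if pix.alpha else "RGB"
    img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(img)


def page_text(page, min_chars=50, ocr=ocr_page):
    """Return (text, method), falling back to OCR for near-empty pages."""
    text = page.get_text()
    if len(text.strip()) >= min_chars:
        return text, "text"
    return ocr(page), "ocr"
```

Passing the OCR step in as a parameter keeps the text-first check usable, and testable, on machines without Tesseract.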

5. Handle Errors Gracefully

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
    raise

# Encrypted PDFs open without raising; check needs_pass, then authenticate
if doc.needs_pass and not doc.authenticate("secret"):
    print("Wrong password")
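For callers that prefer a single entry point, the open and unlock steps can be wrapped in a helper. This is a hedged sketch: `unlock` and `open_pdf` are hypothetical names, and it assumes encryption is handled through the document's `needs_pass` attribute and `authenticate()` method rather than through an exception.

```python
def unlock(doc, password=None):
    """Return True if the document is readable, authenticating if needed.

    doc: an open fitz.Document (or any object with needs_pass/authenticate).
    """
    if not doc.needs_pass:
        return True
    if password is None:
        return False
    # authenticate() returns 0 on failure and a positive value on success
    return bool(doc.authenticate(password))


def open_pdf(path, password=None):
    """Open a PDF, returning None for unreadable or locked files."""
    import fitz  # PyMuPDF

    try:
        doc = fitz.open(path)
    except fitz.FileDataError:
        return None  # invalid or corrupted file
    if not unlock(doc, password):
        doc.close()
        return None  # encrypted and no valid password supplied
    return doc
```

Returning `None` is only one convention; re-raising with a clearer message may suit pipelines that should stop on bad input.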

Extraction Traps

Trap	What Happens	Fix
OCR on text PDF	Slow + worse accuracy	Check `get_text()` first
Forget to close doc	Memory leak	Use `with` or `doc.close()`
Assume reading order	Text comes out in the wrong order	Use `sort=True` in `get_text()`
Ignore encoding	Garbled characters in saved output	PyMuPDF returns Unicode strings; write files with `encoding="utf-8"`
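Several of the fixes in the table can be combined in one routine. The sketch below is one way to do it, under the assumption that a form feed character is an acceptable page separator; `save_text` and `pdf_to_text_file` are illustrative names.

```python
def save_text(pages, out_path):
    """Write page text to a UTF-8 file in reading order.

    pages: an iterable of PyMuPDF pages, e.g. an open fitz.Document.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            # sort=True orders text from top-left to bottom-right
            out.write(page.get_text("text", sort=True))
            out.write("\f")  # form feed as page separator (assumed convention)


def pdf_to_text_file(pdf_path, out_path):
    """Extract a whole PDF to a text file."""
    import fitz  # PyMuPDF

    # The context manager closes the document even if writing fails
    with fitz.open(pdf_path) as doc:
        save_text(doc, out_path)
```

Note that `sort=True` helps with simple layouts; multi-column pages may still need block-level handling.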

Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues

This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs

Security & Privacy

All processing is local:
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system

Output Formats

Plain Text

text = page.get_text()

Structured (dict)

blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
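The span sizes exposed by the dict output can support simple layout heuristics. As an illustration only, the sketch below treats spans noticeably larger than the median font size as likely headings; the 1.2 ratio and the function names are assumptions, not PyMuPDF features.

```python
from statistics import median


def iter_spans(blocks):
    """Yield every text span from the 'blocks' list of get_text('dict')."""
    for block in blocks:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            yield from line["spans"]


def likely_headings(blocks, ratio=1.2):
    """Return span texts whose font size exceeds ratio * median size.

    The ratio is an assumed heuristic and may need tuning per document.
    """
    spans = [s for s in iter_spans(blocks) if s["text"].strip()]
    if not spans:
        return []
    body_size = median(s["size"] for s in spans)
    return [s["text"].strip() for s in spans if s["size"] > ratio * body_size]
```

With a real page this would be called as `likely_headings(page.get_text("dict")["blocks"])`.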

JSON

import json
data = page.get_text("json")
parsed = json.loads(data)

Full Example

import fitz

def extract_pdf(path):
    """Extract text from a PDF and flag scanned pages that need OCR."""
    results = []

    # The context manager closes the document even if extraction fails
    with fitz.open(path) as doc:
        for i, page in enumerate(doc, start=1):
            text = page.get_text()
            method = "text"

            # Very little text suggests a scanned page
            if len(text.strip()) < 50:
                # OCR would go here (see ocr.md)
                method = "needs_ocr"

            results.append({
                "page": i,
                "text": text,
                "method": method
            })

    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
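Because each page entry records its `method`, a caller can pick out the pages that still need an OCR pass. A small sketch, assuming the result dictionary has the shape returned by `extract_pdf` above; `pages_needing_ocr` is a hypothetical helper name.

```python
def pages_needing_ocr(result):
    """Return 1-based page numbers flagged 'needs_ocr' in an extract_pdf result."""
    return [r["page"] for r in result["content"] if r["method"] == "needs_ocr"]


# Example with a hand-built result in the same shape
sample = {
    "pages": 3,
    "content": [
        {"page": 1, "text": "Intro text " * 10, "method": "text"},
        {"page": 2, "text": "", "method": "needs_ocr"},
        {"page": 3, "text": " ", "method": "needs_ocr"},
    ],
    "word_count": 20,
}
print(pages_needing_ocr(sample))
```

The returned page numbers can then be handed to whichever OCR routine is configured in `ocr.md`.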

Feedback

Useful? clawhub star extract-pdf-text
Stay updated: clawhub sync