extract-pdf-text

v1.0.2

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.

Sourced from ClawHub, Authored by Iván

Installation

Install the skill from the SkillHub official store:

npx skills add ivangdavila/extract-pdf-text

When to Use

Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.

Quick Reference

Topic            File
Code examples    examples.md
OCR setup        ocr.md
Troubleshooting  troubleshooting.md

Core Rules

1. Install PyMuPDF First

pip install PyMuPDF

Import as fitz (historical name):

import fitz  # PyMuPDF

2. Basic Text Extraction

import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()

3. Pick the Right Method

PDF Type    Method
Text-based  page.get_text() (fast, accurate)
Scanned     OCR with pytesseract (slower)
Mixed       Check each page, use OCR when needed

4. Check for Text Before OCR

def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
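The per-page check above can drive a simple router for mixed PDFs. The following is a minimal sketch, not the skill's own implementation: `extract_page_text` and `ocr_with_tesseract` are hypothetical helper names, and the OCR path assumes Pillow, pytesseract, and the Tesseract binary are installed. The OCR function is passed in as a parameter so the routing logic can be exercised without the OCR stack.

```python
from typing import Callable

MIN_CHARS = 50  # same threshold as the needs_ocr check; tune for your documents


def ocr_with_tesseract(page) -> str:
    """Hypothetical OCR helper: rasterize a PyMuPDF page and run Tesseract on it."""
    # Imported lazily so the routing below works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    pix = page.get_pixmap(dpi=300)  # RGB without alpha by default
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(img)


def extract_page_text(page, ocr: Callable = ocr_with_tesseract,
                      min_chars: int = MIN_CHARS):
    """Return (text, method) for one page, using OCR only when needed."""
    text = page.get_text()
    if len(text.strip()) >= min_chars:
        return text, "text"  # enough embedded text, skip OCR
    return ocr(page), "ocr"  # likely a scanned page
```

Calling `extract_page_text(page)` for each page in an open document keeps OCR off the fast path, so text-based pages never pay the rasterization cost.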

5. Handle Errors Gracefully

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass instead
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")

Extraction Traps

Trap                  What Happens           Fix
OCR on text PDF       Slow + worse accuracy  Check get_text() first
Forget to close doc   Memory leak            Use a with block or doc.close()
Assume reading order  Wrong reading flow     Use sort=True in get_text()
Ignore encoding       Garbled characters     get_text() returns Unicode str; write files with encoding="utf-8"
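To illustrate the reading-order trap: PDFs can store text blocks in any order, so output that follows the storage order may read out of sequence. The sketch below only approximates the idea behind sort=True by ordering blocks top-to-bottom, then left-to-right. It is not PyMuPDF's exact implementation, and `in_reading_order` is a hypothetical helper. The sample tuples are made up, in the shape that page.get_text("blocks") returns.

```python
def in_reading_order(blocks):
    """Sort block tuples top-to-bottom, then left-to-right."""
    # Each tuple starts with (x0, y0, x1, y1, text, ...).
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# Made-up blocks in storage order: footer first, title last.
blocks = [
    (72.0, 700.0, 300.0, 712.0, "Footer", 0, 0),
    (320.0, 100.0, 540.0, 112.0, "Right column", 1, 0),
    (72.0, 100.0, 300.0, 112.0, "Left column", 2, 0),
    (72.0, 40.0, 540.0, 60.0, "Title", 3, 0),
]

ordered = [b[4] for b in in_reading_order(blocks)]
print(ordered)  # ['Title', 'Left column', 'Right column', 'Footer']
```

A purely positional sort like this interleaves lines from side-by-side columns, so multi-column pages may still need per-column handling.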

Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:

  • Gives code examples for PyMuPDF
  • Explains OCR setup when needed
  • Troubleshoots common issues

This skill NEVER:

  • Accesses files without user request
  • Sends data externally
  • Modifies original PDFs

Security & Privacy

All processing is local:

  • PyMuPDF runs entirely on your machine
  • No external API calls
  • No data leaves your system

Output Formats

Plain Text

text = page.get_text()

Structured (dict)

blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
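One possible use of the span sizes is picking out likely headings. This is a rough sketch under the assumption that headings are set in a larger font than body text, which will not hold for every document. `spans_at_least` is a hypothetical helper, and the sample data is hand-built in the same shape as page.get_text("dict")["blocks"].

```python
def spans_at_least(blocks, min_size):
    """Yield (text, size) for non-empty text spans with font size >= min_size."""
    for block in blocks:
        if block.get("type") != 0:  # skip image blocks, which have no "lines"
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                if span["size"] >= min_size and span["text"].strip():
                    yield span["text"], span["size"]


# Hand-built sample data, not output from a real PDF.
sample = [
    {"type": 0, "lines": [{"spans": [{"text": "Annual Report", "size": 18.0}]}]},
    {"type": 0, "lines": [{"spans": [{"text": "Body text.", "size": 10.5}]}]},
    {"type": 1},  # image block
]

print(list(spans_at_least(sample, 14)))  # [('Annual Report', 18.0)]
```

With a real page, pass page.get_text("dict")["blocks"] in place of `sample` and choose `min_size` from the sizes the document actually uses.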

JSON

import json
data = page.get_text("json")
parsed = json.loads(data)

Full Example

import fitz

def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []

    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"

        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"

        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })

    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")

Feedback

  • Useful? clawhub star extract-pdf-text
  • Stay updated: clawhub sync