Azure Document Intelligence OCR

Extract text and structured data from documents using the Azure Document Intelligence REST API.

Quick Start

1. Environment Setup

Set your Azure Document Intelligence credentials:

export AZURE_DOC_INTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com"
export AZURE_DOC_INTEL_KEY="your-api-key"
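
A script can check these two variables before making any request, so that a missing value fails early with a readable message. The sketch below is one way to do it; `load_credentials` is a hypothetical helper and is not taken from `scripts/ocr_extract.py`.

```python
import os

REQUIRED_VARS = ("AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY")


def load_credentials(env=None):
    """Return (endpoint, key), raising a clear error if either is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Drop a trailing slash so later path concatenation stays predictable.
    endpoint = env["AZURE_DOC_INTEL_ENDPOINT"].strip().rstrip("/")
    key = env["AZURE_DOC_INTEL_KEY"].strip()
    if not endpoint.startswith("https://"):
        raise RuntimeError("AZURE_DOC_INTEL_ENDPOINT should start with https://")
    return endpoint, key
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.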

2. Single File OCR

# Basic text extraction from PDF
python scripts/ocr_extract.py document.pdf

# Extract with layout (tables, structure)
python scripts/ocr_extract.py document.pdf --model prebuilt-layout --format markdown

# Process invoice
python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json

# OCR from URL
python scripts/ocr_extract.py --url "https://example.com/document.pdf"

# Save output to file
python scripts/ocr_extract.py document.pdf --output result.txt

# Extract specific pages
python scripts/ocr_extract.py document.pdf --pages 1-3,5
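
The `--pages` value combines single pages and inclusive ranges. As an illustration of how such a spec can be interpreted, the sketch below expands it into page numbers; `parse_pages` is a hypothetical function, and the actual script may simply forward the string to the service.

```python
def parse_pages(spec):
    """Expand a spec such as "1-3,5" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("1-3,5"))  # [1, 2, 3, 5]
```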

3. Batch Processing

# Process all documents in a folder
python scripts/batch_ocr.py ./documents/

# Custom output directory and format
python scripts/batch_ocr.py ./documents/ --output-dir ./extracted/ --format markdown

# Use layout model with 8 workers
python scripts/batch_ocr.py ./documents/ --model prebuilt-layout --workers 8

# Filter specific extensions
python scripts/batch_ocr.py ./documents/ --ext .pdf,.png
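
Conceptually, a batch run selects files by extension and then hands them to a pool of workers. The following is a minimal sketch of that pattern using the standard library, not the code of `scripts/batch_ocr.py`; `find_documents`, `run_batch`, and the `process_one` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def find_documents(folder, extensions=(".pdf", ".png")):
    """Return files directly inside `folder` whose suffix matches (case-insensitive)."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


def run_batch(paths, process_one, workers=4):
    """Apply `process_one` to each path in parallel.

    Returns two dicts: {path: result} for successes and {path: exception} for failures.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep the batch going
                errors[path] = exc
    return results, errors
```

Threads suit this workload because each task spends most of its time waiting on the network rather than using the CPU.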

Model Selection Guide

| Document Type | Recommended Model | Use Case |
| --- | --- | --- |
| General text | `prebuilt-read` | Pure text extraction, any document |
| Structured docs | `prebuilt-layout` | Tables, forms, paragraphs, figures |
| Invoices | `prebuilt-invoice` | Vendor info, line items, totals |
| Receipts | `prebuilt-receipt` | Merchant, items, totals, dates |
| IDs/Passports | `prebuilt-idDocument` | Identity documents |
| Business cards | `prebuilt-businessCard` | Contact information |
| W-2 forms | `prebuilt-tax.us.w2` | US tax documents |
| Insurance cards | `prebuilt-healthInsuranceCard.us` | Health insurance info |

See references/models.md for detailed model documentation.
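
The model ID is what distinguishes one analysis request from another at the REST level. The sketch below only builds the URL and headers and sends nothing. The path layout, the `api-version` value, and the `pages` query parameter reflect my understanding of the public REST API and are assumptions to verify against the API version your resource supports; `build_analyze_request` is a hypothetical helper, not part of the scripts.

```python
from urllib.parse import urlencode

API_VERSION = "2024-11-30"  # assumption: adjust to the version your resource supports


def build_analyze_request(endpoint, key, model_id="prebuilt-read", pages=None):
    """Return (url, headers) for an analyze call; no network traffic happens here."""
    query = {"api-version": API_VERSION}
    if pages:
        query["pages"] = pages  # e.g. "1-3,5"
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"{model_id}:analyze?{urlencode(query)}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,            # the API key from AZURE_DOC_INTEL_KEY
        "Content-Type": "application/octet-stream",  # raw file bytes in the request body
    }
    return url, headers
```

Switching from `prebuilt-read` to `prebuilt-invoice` changes only the `model_id` segment of the URL; the rest of the request stays the same.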

Supported Input Formats

PDF: .pdf (including scanned PDFs)
Images: .png, .jpg, .jpeg, .tiff, .bmp
URLs: Direct links to documents

Output Formats

text: Plain text concatenation of all extracted content
markdown: Structured output with headers and tables (best with layout model)
json: Raw API response with full extraction details
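
To show how the markdown format relates to the JSON response, here is a sketch that renders one extracted table as a markdown table. It assumes each table object carries `rowCount`, `columnCount`, and a `cells` list with `rowIndex`, `columnIndex`, and `content`, and it ignores merged cells; `table_to_markdown` is a hypothetical helper, not the scripts' own formatter.

```python
def table_to_markdown(table):
    """Render one table dict from the analysis result as a markdown table."""
    rows = [[""] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        # Flatten line breaks and escape pipes so the cell cannot break the table.
        text = cell.get("content", "").replace("\n", " ").replace("|", "\\|")
        rows[cell["rowIndex"]][cell["columnIndex"]] = text
    lines = ["| " + " | ".join(row) + " |" for row in rows]
    # Markdown treats the first row as the header and needs a separator after it.
    lines.insert(1, "|" + " --- |" * table["columnCount"])
    return "\n".join(lines)


sample = {
    "rowCount": 2,
    "columnCount": 2,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Total"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Pen"},
        {"rowIndex": 1, "columnIndex": 1, "content": "2.50"},
    ],
}
print(table_to_markdown(sample))
```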

Features

Handwriting Recognition: Extracts handwritten text alongside printed text
CJK Support: Full support for Chinese, Japanese, Korean characters
Table Extraction: Preserves table structure (use layout model)
Multi-page Processing: Handles documents with multiple pages
Concurrent Processing: Batch script supports parallel processing
URL Input: Process documents directly from URLs

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `AZURE_DOC_INTEL_ENDPOINT` | Yes | Azure Document Intelligence endpoint URL |
| `AZURE_DOC_INTEL_KEY` | Yes | API subscription key |

Error Handling

Invalid credentials: Verify that AZURE_DOC_INTEL_ENDPOINT and AZURE_DOC_INTEL_KEY are set and match your Azure resource
Unsupported format: Make sure the file extension is one of the supported input formats
Timeout: Large documents take longer to process; the maximum wait is 300 s
Rate limiting: Lower the --workers value when batch processing
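
Rate limiting and timeouts are usually handled with retries and a bounded polling loop. The sketch below shows both patterns in isolation. `RateLimited`, `call_with_backoff`, and `poll_until_done` are hypothetical names, and the status strings `"succeeded"` and `"failed"` are assumptions about what the service reports for a finished operation.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a caller raises when the service answers HTTP 429."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


def poll_until_done(get_status, timeout=300.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status` until it reports a final state or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ("succeeded", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"analysis still {status!r} after {timeout} seconds")
        sleep(interval)
```

The `sleep` and `clock` parameters exist so the loops can be tested without real waiting.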

Examples

Extract text from scanned PDF

python scripts/ocr_extract.py scanned_contract.pdf --model prebuilt-read

Process invoices with structured output

python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json --output invoice_data.json

Batch process with layout analysis

python scripts/batch_ocr.py ./reports/ --model prebuilt-layout --format markdown --workers 4

Extract specific pages from large document

python scripts/ocr_extract.py large_doc.pdf --pages 1,3-5,10 --format text
