
Azure Doc OCR

Rating: 3.9 (474 reviews)
Downloads: 2,849
Version: 1.0.0

Overview

Extract text and structured data from documents using Azure Document Intelligence (formerly Form Recognizer).

Key Features

1. Handwriting Recognition: Extracts handwritten text alongside printed text
2. CJK Support: Full support for Chinese, Japanese, Korean characters
3. Table Extraction: Preserves table structure (use layout model)
4. Multi-page Processing: Handles documents with multiple pages
5. Concurrent Processing: Batch script supports parallel processing
6. URL Input: Process documents directly from URLs

Complete Documentation

Azure Document Intelligence OCR

Extract text and structured data from documents using the Azure Document Intelligence REST API.

Quick Start

1. Environment Setup

Set your Azure Document Intelligence credentials:

bash
export AZURE_DOC_INTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com"
export AZURE_DOC_INTEL_KEY="your-api-key"

2. Single File OCR

bash
# Basic text extraction from PDF
python scripts/ocr_extract.py document.pdf

# Extract with layout (tables, structure)
python scripts/ocr_extract.py document.pdf --model prebuilt-layout --format markdown

# Process invoice
python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json

# OCR from URL
python scripts/ocr_extract.py --url "https://example.com/document.pdf"

# Save output to file
python scripts/ocr_extract.py document.pdf --output result.txt

# Extract specific pages
python scripts/ocr_extract.py document.pdf --pages 1-3,5
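
The `--pages` value combines single pages and ranges, such as `1-3,5`. As a minimal sketch of how such a spec could be expanded into page numbers, assuming 1-based pages and inclusive ranges (the `parse_pages` helper is hypothetical and not taken from `ocr_extract.py`, which may simply forward the string to the API):

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-3,5" into a sorted list of page numbers.

    Hypothetical helper: assumes 1-based page numbers and inclusive ranges.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(parse_pages("1-3,5"))     # [1, 2, 3, 5]
print(parse_pages("1,3-5,10"))  # [1, 3, 4, 5, 10]
```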

3. Batch Processing

bash
# Process all documents in a folder
python scripts/batch_ocr.py ./documents/

# Custom output directory and format
python scripts/batch_ocr.py ./documents/ --output-dir ./extracted/ --format markdown

# Use layout model with 8 workers
python scripts/batch_ocr.py ./documents/ --model prebuilt-layout --workers 8

# Filter specific extensions
python scripts/batch_ocr.py ./documents/ --ext .pdf,.png
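
The batch options above (extension filter, worker count) map naturally onto a thread pool. The following is only a sketch of that pattern, not the code in `batch_ocr.py`: `collect_files`, `run_batch`, and the stand-in `process` callable are hypothetical names, and the real script may behave differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable


def collect_files(folder: Path, extensions: Iterable[str]) -> list[Path]:
    """Return files directly inside `folder` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() in wanted
    )


def run_batch(
    files: Iterable[Path], process: Callable[[Path], str], workers: int = 4
) -> dict[Path, str]:
    """Run `process` on each file in parallel; one failure does not stop the rest."""
    results: dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, path): path for path in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the error and keep going
                results[path] = f"ERROR: {exc}"
    return results


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for name in ("a.pdf", "b.png", "notes.txt"):
            (folder / name).write_bytes(b"")
        files = collect_files(folder, [".pdf", ".png"])
        # Stand-in for the real OCR call: just report the file name.
        results = run_batch(files, lambda p: f"processed {p.name}", workers=8)
        for path in sorted(results):
            print(path.name, "->", results[path])
```

Threads suit this workload because each task spends most of its time waiting on the network. A lower worker count reduces the chance of hitting the service's rate limits.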

Model Selection Guide

| Document Type | Recommended Model | Use Case |
|---------------|-------------------|----------|
| General text | `prebuilt-read` | Pure text extraction, any document |
| Structured docs | `prebuilt-layout` | Tables, forms, paragraphs, figures |
| Invoices | `prebuilt-invoice` | Vendor info, line items, totals |
| Receipts | `prebuilt-receipt` | Merchant, items, totals, dates |
| IDs/Passports | `prebuilt-idDocument` | Identity documents |
| Business cards | `prebuilt-businessCard` | Contact information |
| W-2 forms | `prebuilt-tax.us.w2` | US tax documents |
| Insurance cards | `prebuilt-healthInsuranceCard.us` | Health insurance info |

See references/models.md for detailed model documentation.
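
The model ID chosen from the table becomes part of the REST route. As a hedged sketch of how the request URL and key header could be assembled: the route and `api-version` below follow the v4.0 (2024-11-30) REST API, and are an assumption about what the scripts target. Older resources use a `formrecognizer` route with an earlier API version. `build_analyze_request` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumption: the v4.0 GA API version; the skill's scripts may pin another one.
API_VERSION = "2024-11-30"


def build_analyze_request(
    endpoint: str, key: str, model_id: str, pages: Optional[str] = None
) -> tuple[str, dict[str, str]]:
    """Return the analyze URL and auth headers for one prebuilt model."""
    query = {"api-version": API_VERSION}
    if pages:
        query["pages"] = pages  # the service accepts specs such as "1-3,5"
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"{quote(model_id, safe='')}:analyze?{urlencode(query)}"
    )
    headers = {"Ocp-Apim-Subscription-Key": key}
    return url, headers


url, headers = build_analyze_request(
    "https://your-resource.cognitiveservices.azure.com",
    "your-api-key",
    "prebuilt-layout",
)
print(url)
```

A POST to this URL starts an asynchronous analysis. The service answers with `202 Accepted` and an `Operation-Location` header to poll for the result.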

Supported Input Formats

  • PDF: .pdf (including scanned PDFs)
  • Images: .png, .jpg, .jpeg, .tiff, .bmp
  • URLs: Direct links to documents
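
Local files and URLs reach the service differently: a URL is sent as a small JSON body, while a file is uploaded as raw bytes. The sketch below illustrates that split under the extension list above. `build_body` and `is_url` are hypothetical helpers, and the actual scripts may validate input in another way.

```python
import json
from pathlib import Path
from urllib.parse import urlparse

# Extensions taken from the list above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def is_url(source: str) -> bool:
    """True for http(s) links, False for anything that looks like a local path."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def build_body(source: str) -> tuple[str, bytes]:
    """Return (Content-Type, request body) for a URL or a local file."""
    if is_url(source):
        # The analyze endpoint accepts a JSON body with a "urlSource" field.
        return "application/json", json.dumps({"urlSource": source}).encode("utf-8")
    path = Path(source)
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file extension: {path.suffix!r}")
    return "application/octet-stream", path.read_bytes()


content_type, body = build_body("https://example.com/document.pdf")
print(content_type, body.decode("utf-8"))
```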

Output Formats

  • text: Plain text concatenation of all extracted content
  • markdown: Structured output with headers and tables (best with layout model)
  • json: Raw API response with full extraction details
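
To make the three formats concrete, here is a rough sketch of how one analyze response could be rendered each way. It assumes the response shape of the REST API (`analyzeResult.content`, plus `analyzeResult.tables` with `rowCount`, `columnCount`, and `cells`). The `render` and `table_to_markdown` functions are hypothetical, and the script's real markdown output is likely richer.

```python
import json


def table_to_markdown(table: dict) -> str:
    """Render one extracted table as a markdown table, first row as header."""
    rows, cols = table["rowCount"], table["columnCount"]
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for cell in table["cells"]:
        text = cell.get("content", "").replace("\n", " ").replace("|", "\\|")
        grid[cell["rowIndex"]][cell["columnIndex"]] = text
    lines = []
    for index, row in enumerate(grid):
        lines.append("| " + " | ".join(row) + " |")
        if index == 0:
            lines.append("|" + "|".join(["---"] * cols) + "|")
    return "\n".join(lines)


def render(result: dict, fmt: str) -> str:
    """Render an analyze response as "text", "markdown", or "json"."""
    analyze = result.get("analyzeResult", {})
    if fmt == "json":
        return json.dumps(result, indent=2, ensure_ascii=False)
    if fmt == "text":
        return analyze.get("content", "")
    if fmt == "markdown":
        parts = [analyze.get("content", "")]
        parts += [table_to_markdown(t) for t in analyze.get("tables", [])]
        return "\n\n".join(p for p in parts if p)
    raise ValueError(f"unknown format: {fmt!r}")


# Made-up sample response, for illustration only.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "content": "Order summary",
        "tables": [{
            "rowCount": 2,
            "columnCount": 2,
            "cells": [
                {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
                {"rowIndex": 0, "columnIndex": 1, "content": "Qty"},
                {"rowIndex": 1, "columnIndex": 0, "content": "Pen"},
                {"rowIndex": 1, "columnIndex": 1, "content": "2"},
            ],
        }],
    },
}
print(render(sample, "markdown"))
```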

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `AZURE_DOC_INTEL_ENDPOINT` | Yes | Azure Document Intelligence endpoint URL |
| `AZURE_DOC_INTEL_KEY` | Yes | API subscription key |
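
Since both variables are required, a script can fail early with a clear message when either is missing. This is a minimal sketch of such a check. `load_config` is a hypothetical helper, and the HTTPS check is an added assumption rather than documented behavior of the scripts.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY")


def load_config(env: Optional[Mapping[str, str]] = None) -> tuple[str, str]:
    """Return (endpoint, key), raising RuntimeError if either is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    endpoint = env["AZURE_DOC_INTEL_ENDPOINT"].strip().rstrip("/")
    # Assumption: the endpoint should always be an HTTPS URL.
    if not endpoint.startswith("https://"):
        raise RuntimeError("AZURE_DOC_INTEL_ENDPOINT must start with https://")
    return endpoint, env["AZURE_DOC_INTEL_KEY"].strip()


# Example with an explicit mapping instead of the real environment.
endpoint, key = load_config({
    "AZURE_DOC_INTEL_ENDPOINT": "https://your-resource.cognitiveservices.azure.com/",
    "AZURE_DOC_INTEL_KEY": "your-api-key",
})
print(endpoint)
```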

Error Handling

  • Invalid credentials: Check that the endpoint URL and API key are set and correct
  • Unsupported format: Ensure the file extension is one of the supported types
  • Timeout: Large documents may need more processing time (maximum 300 s)
  • Rate limiting: Reduce the number of concurrent workers for batch processing
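
Timeouts and rate limiting both arise while waiting for an analysis to finish. One way to handle them is a polling loop with a deadline and a backoff. The sketch below shows the pattern, not the scripts' actual logic: `poll_until_done`, `RateLimited`, and the injected `fetch_status` callable are hypothetical. The `succeeded` and `failed` status values match what the analyze operation reports.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Raised by fetch_status when the service answers HTTP 429."""

    def __init__(self, retry_after: float = 1.0):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def poll_until_done(
    fetch_status: Callable[[], dict],
    timeout: float = 300.0,   # mirrors the 300 s limit mentioned above
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status until it reports success, failure, or the deadline."""
    deadline = clock() + timeout
    delay = interval
    while True:
        try:
            result = fetch_status()
        except RateLimited as exc:
            # Back off: double the delay, or honor the server's hint if longer.
            delay = max(delay * 2, exc.retry_after)
        else:
            status = result.get("status")
            if status == "succeeded":
                return result
            if status == "failed":
                raise RuntimeError(f"analysis failed: {result.get('error')}")
            delay = interval  # still running: return to the normal pace
        if clock() + delay > deadline:
            raise TimeoutError(f"analysis did not finish within {timeout} s")
        sleep(delay)


# Demo with canned responses instead of real HTTP calls.
canned = [{"status": "running"}, {"status": "succeeded", "analyzeResult": {}}]
print(poll_until_done(lambda: canned.pop(0), interval=0.01)["status"])
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.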

Examples

Extract text from scanned PDF

bash
python scripts/ocr_extract.py scanned_contract.pdf --model prebuilt-read

Process invoices with structured output

bash
python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json --output invoice_data.json

Batch process with layout analysis

bash
python scripts/batch_ocr.py ./reports/ --model prebuilt-layout --format markdown --workers 4

Extract specific pages from large document

bash
python scripts/ocr_extract.py large_doc.pdf --pages 1,3-5,10 --format text

Installation

bash
openclaw install azure-doc-ocr

Tags

#devops_and-cloud #data

Quick Info

Category: Development
Model: Claude 3.5
Complexity: One-Click
Author: li-hongmin
Last Updated: 3/10/2026