✓ Verified 💻 Development ✓ Enhanced Data

Mineru Pdf Extractor

Extract PDF content to Markdown using MinerU API.

Rating
4.8 (195 reviews)
Downloads
5,312 downloads
Version
1.0.0

Overview

Extract PDF content to Markdown using MinerU API.

Complete Documentation

View Source →

MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.


📁 Skill Structure

text
mineru-pdf-extractor/
├── SKILL.md                          # English documentation
├── SKILL_zh.md                       # Chinese documentation
├── docs/                             # Documentation
│   ├── Local_File_Parsing_Guide.md   # Local PDF parsing detailed guide (English)
│   ├── Online_URL_Parsing_Guide.md   # Online PDF parsing detailed guide (English)
│   ├── MinerU_本地文档解析完整流程.md  # Local parsing complete guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md  # Online parsing complete guide (Chinese)
└── scripts/                          # Executable scripts
    ├── local_file_step1_apply_upload_url.sh    # Local parsing Step 1
    ├── local_file_step2_upload_file.sh         # Local parsing Step 2
    ├── local_file_step3_poll_result.sh         # Local parsing Step 3
    ├── local_file_step4_download.sh            # Local parsing Step 4
    ├── online_file_step1_submit_task.sh        # Online parsing Step 1
    └── online_file_step2_poll_result.sh        # Online parsing Step 2


🔧 Requirements

Required Environment Variables

Scripts automatically read MinerU Token from environment variables (choose one):

bash
# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"

Required Command-Line Tools

  • curl - For HTTP requests (usually pre-installed)
  • unzip - For extracting results (usually pre-installed)

Optional Tools

  • jq - For enhanced JSON parsing and security (recommended but not required)
  • If not installed, scripts will use fallback methods
  • Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)

Optional Configuration

bash
# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"

💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key


📄 Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

Quick Start

bash
cd scripts/

# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/

Script Descriptions

#### local_file_step1_apply_upload_url.sh

Apply for upload URL and batch_id.

Usage:

bash
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]

Parameters:

  • language: ch (Chinese), en (English), auto (auto-detect), default ch
  • layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo
Output:
text
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...


#### local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

Usage:

bash
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>


#### local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

Usage:

bash
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]

Output:

text
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip


#### local_file_step4_download.sh

Download result ZIP and extract.

Usage:

bash
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]

Output Structure:

text
extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data

Detailed Documentation

📚 Complete Guide: See docs/Local_File_Parsing_Guide.md


🌐 Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). Only 2 steps, more concise and efficient.

Quick Start

bash
cd scripts/

# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/

Script Descriptions

#### online_file_step1_submit_task.sh

Submit parsing task for online PDF.

Usage:

bash
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]

Parameters:

  • pdf_url: Complete URL of the online PDF (required)
  • language: ch (Chinese), en (English), auto (auto-detect), default ch
  • layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo
Output:
text
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx


#### online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

Usage:

bash
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]

Output Structure:

text
extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data

Detailed Documentation

📚 Complete Guide: See docs/Online_URL_Parsing_Guide.md


📊 Comparison of Two Parsing Methods

FeatureLocal PDF ParsingOnline PDF Parsing
Steps4 steps2 steps
Upload Required✅ Yes❌ No
Average Time30-60 seconds10-20 seconds
Use CaseLocal filesFiles already online (arXiv, websites, etc.)
File Size Limit200MBLimited by source server

⚙️ Advanced Usage

Batch Process Local Files

bash
for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"
    
    # Step 1
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
    upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2)
    
    # Step 2
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"
    
    # Step 3
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2)
    
    # Step 4
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done

Batch Process Online Files

bash
for url in \
  "https://arxiv.org/pdf/2410.17247.pdf" \
  "https://arxiv.org/pdf/2409.12345.pdf"; do
    echo "Processing: $url"
    
    # Step 1
    result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
    task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2)
    
    # Step 2
    filename=$(basename "$url" .pdf)
    ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done


⚠️ Notes

  • Token Configuration: Scripts prioritize MINERU_TOKEN, fall back to MINERU_API_KEY if not found
  • Token Security: Do not hard-code tokens in scripts; use environment variables
  • URL Accessibility: For online parsing, ensure the provided URL is publicly accessible
  • File Limits: Single file recommended not exceeding 200MB, maximum 600 pages
  • Network Stability: Ensure stable network when uploading large files
  • Security: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
  • Optional jq: Installing jq provides enhanced JSON parsing and additional security checks

📚 Reference Documentation

DocumentDescription
docs/Local_File_Parsing_Guide.mdDetailed curl commands and parameters for local PDF parsing
docs/Online_URL_Parsing_Guide.mdDetailed curl commands and parameters for online PDF parsing
External Resources:
  • 🏠 MinerU Official: https://mineru.net/
  • 📖 API Documentation: https://mineru.net/apiManage/docs
  • 💻 GitHub Repository: https://github.com/opendatalab/MinerU

Skill Version: 1.0.0 Release Date: 2026-02-18 Community Skill - Not affiliated with MinerU official

Installation

Terminal bash

openclaw install mineru-pdf-extractor
    
Copied!

💻Code Examples

└── online_file_step2_poll_result.sh # Online parsing Step 2

--onlinefilestep2pollresultsh--online-parsing-step-2.txt
---

## 🔧 Requirements

### Required Environment Variables

Scripts automatically read MinerU Token from environment variables (choose one):

export MINERU_API_KEY="your_api_token_here"

export-mineruapikeyyourapitokenhere.txt
### Required Command-Line Tools

- `curl` - For HTTP requests (usually pre-installed)
- `unzip` - For extracting results (usually pre-installed)

### Optional Tools

- `jq` - For enhanced JSON parsing and security (recommended but not required)
  - If not installed, scripts will use fallback methods
  - Install: `apt-get install jq` (Debian/Ubuntu) or `brew install jq` (macOS)

### Optional Configuration

export MINERU_BASE_URL="https://mineru.net/api/v4"

export-minerubaseurlhttpsminerunetapiv4.txt
> 💡 **Get Token**: Visit https://mineru.net/apiManage/docs to register and obtain an API Key

---

## 📄 Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

### Quick Start

./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/

localfilestep4downloadsh-fullzipurl-resultzip-extracted.txt
### Script Descriptions

#### local_file_step1_apply_upload_url.sh

Apply for upload URL and batch_id.

**Usage:**

./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]

localfilestep1applyuploadurlsh-pdffilepath-language-layoutmodel.txt
**Parameters:**
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect), default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate), default `doclayout_yolo`

**Output:**

UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...

uploadurlhttpsmineruoss-cn-shanghaialiyuncscom.txt
---

#### local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

**Usage:**

./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>

localfilestep2uploadfilesh-uploadurl-pdffilepath.txt
---

#### local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

**Usage:**

FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip

fullzipurlhttpscdn-mineruopenxlaborgcnpdfxxxzip.txt
---

#### local_file_step4_download.sh

Download result ZIP and extract.

**Usage:**

└── layout.json # Layout analysis data

-layoutjson--layout-analysis-data.txt
### Detailed Documentation

📚 **Complete Guide**: See `docs/Local_File_Parsing_Guide.md`

---

## 🌐 Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). Only 2 steps, more concise and efficient.

### Quick Start

./online_file_step2_poll_result.sh "$TASK_ID" extracted/

onlinefilestep2pollresultsh-taskid-extracted.txt
### Script Descriptions

#### online_file_step1_submit_task.sh

Submit parsing task for online PDF.

**Usage:**

Tags

#coding_agents-and-ides #api

Quick Info

Category Development
Model Claude 3.5
Complexity One-Click
Author a-i-r
Last Updated 3/10/2026
🚀
Optimized for
Claude 3.5
🧠

Ready to Install?

Get started with this skill in seconds

openclaw install mineru-pdf-extractor