
Links To Pdfs

Scrape documents from Notion, DocSend, and PDF links.

Rating: 4.4 (81 reviews)
Downloads: 855
Version: 1.0.0


docs-scraper

CLI tool that scrapes documents from various sources into local PDF files using browser automation.

Installation

```bash
npm install -g docs-scraper
```

Quick start

Scrape any document URL to PDF:

```bash
docs-scraper scrape https://example.com/document
```

Returns local path: `~/.docs-scraper/output/1706123456-abc123.pdf`
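Output names follow a `<unix-timestamp>-<id>.pdf` pattern. The exact naming scheme is internal to docs-scraper; a rough sketch of how such a name could be formed:

```bash
# Illustrative only: build a <timestamp>-<id>.pdf name like the one above
ts=$(date +%s)                                   # seconds since the epoch, e.g. 1706123456
id=$(od -An -N3 -tx1 /dev/urandom | tr -d ' \n') # 6 random hex characters
echo "$HOME/.docs-scraper/output/${ts}-${id}.pdf"
```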

Basic scraping

Scrape with daemon (recommended, keeps browser warm):

```bash
docs-scraper scrape <url>
```

Scrape with named profile (for authenticated sites):

```bash
docs-scraper scrape <url> -p <profile-name>
```

Scrape with pre-filled data (e.g., email for DocSend):

```bash
docs-scraper scrape <url> -D email=user@example.com
```

Direct mode (single-shot, no daemon):

```bash
docs-scraper scrape <url> --no-daemon
```

Authentication workflow

When a document requires authentication (login, email verification, passcode):

1. Initial scrape returns a job ID:

   ```bash
   docs-scraper scrape https://docsend.com/view/xxx
   # Output: Scrape blocked
   #         Job ID: abc123
   ```

2. Retry with data:

   ```bash
   docs-scraper update abc123 -D email=user@example.com
   # or with password
   docs-scraper update abc123 -D email=user@example.com -D password=1234
   ```
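The blocked-then-update flow can be scripted by parsing the job ID out of the scraper's output. A minimal sketch, assuming the `Job ID: abc123` message format shown above (`out` stands in for the captured command output):

```bash
# 'out' stands in for: out=$(docs-scraper scrape "$url" 2>&1)
out='Scrape blocked
Job ID: abc123'
# Extract whatever follows "Job ID: " on its own line
job_id=$(printf '%s\n' "$out" | sed -n 's/^Job ID: //p')
echo "$job_id"   # prints abc123
```

The extracted ID can then be passed to `docs-scraper update`.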

Profile management

Profiles store session cookies for authenticated sites.

```bash
docs-scraper profiles list     # List saved profiles
docs-scraper profiles clear    # Clear all profiles
docs-scraper scrape <url> -p myprofile  # Use a profile
```

Daemon management

The daemon keeps browser instances warm for faster scraping.

```bash
docs-scraper daemon status     # Check status
docs-scraper daemon start      # Start manually
docs-scraper daemon stop       # Stop daemon
```

Note: Daemon auto-starts when running scrape commands.

Cleanup

PDFs are stored in `~/.docs-scraper/output/`. The daemon automatically cleans up files older than 1 hour.
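The daemon's age-based cleanup can be approximated with standard `find(1)`. This is an equivalent sketch, not the tool's own code; `OUTPUT_DIR` defaults to the documented output directory:

```bash
# Delete PDFs modified more than 60 minutes ago, like the daemon's 1-hour cleanup
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/.docs-scraper/output}"
mkdir -p "$OUTPUT_DIR"
# -mmin +60 matches files last modified more than 60 minutes ago
find "$OUTPUT_DIR" -name '*.pdf' -mmin +60 -delete
```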

Manual cleanup:

```bash
docs-scraper cleanup                    # Delete all PDFs
docs-scraper cleanup --older-than 1h    # Delete PDFs older than 1 hour
```

Job management

```bash
docs-scraper jobs list         # List blocked jobs awaiting auth
```

Supported sources

  • Direct PDF links - Downloads PDF directly
  • Notion pages - Exports Notion page to PDF
  • DocSend documents - Handles DocSend viewer
  • LLM fallback - Uses Claude API for any other webpage
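The dispatch between these scrapers is internal to docs-scraper, but the URL patterns above can be sketched as a simple router (the function name and match order here are assumptions for illustration):

```bash
# Sketch of URL-to-scraper routing; not the tool's actual implementation
route() {
  case "$1" in
    *.pdf)                                echo "DirectPdfScraper"   ;;
    *docsend.com/view/*|*docsend.com/v/*) echo "DocsendScraper"     ;;
    *notion.so/*|*.notion.site/*)         echo "NotionScraper"      ;;
    *)                                    echo "LlmFallbackScraper" ;;
  esac
}

route "https://example.com/report.pdf"    # DirectPdfScraper
route "https://docsend.com/view/abc123"   # DocsendScraper
route "https://notion.so/Page-abc123"     # NotionScraper
route "https://example.com/article"       # LlmFallbackScraper
```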

Scraper Reference

Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.

DirectPdfScraper

Handles: URLs ending in `.pdf`

Data fields: None (downloads directly)

Example:

```bash
docs-scraper scrape https://example.com/document.pdf
```


DocsendScraper

Handles: `docsend.com/view/*`, `docsend.com/v/*`, and subdomains (e.g., `org-a.docsend.com`)

URL patterns:

  • Documents: https://docsend.com/view/{id} or https://docsend.com/v/{id}
  • Folders: https://docsend.com/view/s/{id}
  • Subdomains: https://{subdomain}.docsend.com/view/{id}

Data fields:

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Email address for document access |
| `password` | password | Passcode/password for protected documents |
| `name` | text | Your name (required for NDA-gated documents) |

Examples:

```bash
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"

# Retry blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
```

Notes:

  • DocSend may require any combination of email, password, and name
  • Folders are scraped as a table of contents PDF with document links
  • The scraper auto-checks NDA checkboxes when name is provided

NotionScraper

Handles: `notion.so/*`, `*.notion.site/*`

Data fields:

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Notion account email |
| `password` | password | Notion account password |

Examples:

```bash
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
```

Notes:

  • Public Notion pages don't require authentication
  • Toggle blocks are automatically expanded before PDF generation
  • Uses session profiles to persist login across scrapes

LlmFallbackScraper

Handles: Any URL not matched by other scrapers (automatic fallback)

Data fields: Dynamic - determined by Claude analyzing the page

The LLM scraper uses Claude to analyze the page HTML and detect:

  • Login forms (extracts field names dynamically)
  • Cookie banners (auto-dismisses)
  • Expandable content (auto-expands)
  • CAPTCHAs (reports as blocked)
  • Paywalls (reports as blocked)

Common dynamic fields:

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Login email (if detected) |
| `password` | password | Login password (if detected) |
| `username` | text | Username (if login uses username) |

Examples:

```bash
# Generic webpage (no auth)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
```

Notes:

  • Requires ANTHROPIC_API_KEY environment variable
  • Field names are extracted from the page's actual form fields
  • Limited to 2 login attempts before failing
  • CAPTCHAs require manual intervention

Data field summary

| Scraper | email | password | name | Other |
|---------|-------|----------|------|-------|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓* | ✓* | - | Dynamic* |

*Fields detected dynamically from page analysis

Environment setup (optional)

Only needed for LLM fallback scraper:

```bash
export ANTHROPIC_API_KEY=your_key
```

Optional browser settings:

```bash
export BROWSER_HEADLESS=true   # Set false for debugging
```

Common patterns

Archive a Notion page:

```bash
docs-scraper scrape https://notion.so/My-Page-abc123
```

Download protected DocSend:

```bash
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
```

Batch scraping with profiles:

```bash
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
```
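The same pattern scales to a whole URL list. A sketch, assuming one URL per line on stdin (`printf` stands in for a real URL file, and `echo` stands in for the real call so the loop can be dry-run):

```bash
# Batch-scrape a list of URLs with a shared profile (dry-run version)
printf '%s\n' \
  'https://site.com/doc1' \
  'https://site.com/doc2' |
while IFS= read -r url; do
  echo "docs-scraper scrape $url -p mysite"   # drop 'echo' to run for real
done
```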

Output

Success: local file path, e.g. `~/.docs-scraper/output/1706123456-abc123.pdf`
Blocked: job ID plus the required credential types

Troubleshooting

  • Timeout: restart the daemon with `docs-scraper daemon stop && docs-scraper daemon start`
  • Auth fails: run `docs-scraper jobs list` to check pending jobs
  • Disk full: run `docs-scraper cleanup` to remove old PDFs

Installation

```bash
openclaw install links-to-pdfs
```

Quick Info

Category: File Management
Model: Claude 3.5
Complexity: One-Click
Author: chrisling-dev
Last Updated: 3/10/2026