
Links To Pdfs

Scrape documents from Notion, DocSend, and PDF links.

Rating: 4.4 (81 reviews)
Downloads: 855
Version: 1.0.0


docs-scraper

CLI tool that scrapes documents from various sources into local PDF files using browser automation.

Installation

```bash
npm install -g docs-scraper
```

Quick start

Scrape any document URL to PDF:

```bash
docs-scraper scrape https://example.com/document
```

Returns local path: `~/.docs-scraper/output/1706123456-abc123.pdf`
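Output names follow a `<unix-timestamp>-<id>.pdf` pattern. The exact naming scheme is internal to docs-scraper; a rough sketch of how such a name could be formed:

```bash
# Illustrative only: build a <timestamp>-<id>.pdf name like the one above
ts=$(date +%s)                                   # seconds since the epoch, e.g. 1706123456
id=$(od -An -N3 -tx1 /dev/urandom | tr -d ' \n') # 6 random hex characters
echo "$HOME/.docs-scraper/output/${ts}-${id}.pdf"
```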

Basic scraping

Scrape with daemon (recommended, keeps browser warm):

```bash
docs-scraper scrape <url>
```

Scrape with named profile (for authenticated sites):

```bash
docs-scraper scrape <url> -p <profile-name>
```

Scrape with pre-filled data (e.g., email for DocSend):

```bash
docs-scraper scrape <url> -D email=user@example.com
```

Direct mode (single-shot, no daemon):

```bash
docs-scraper scrape <url> --no-daemon
```

Authentication workflow

When a document requires authentication (login, email verification, passcode):

1. Initial scrape returns a job ID:

   ```bash
   docs-scraper scrape https://docsend.com/view/xxx
   # Output: Scrape blocked
   #         Job ID: abc123
   ```

2. Retry with data:

   ```bash
   docs-scraper update abc123 -D email=user@example.com
   # or with password
   docs-scraper update abc123 -D email=user@example.com -D password=1234
   ```
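The blocked-then-update flow can be scripted by parsing the job ID out of the scraper's output. A minimal sketch, assuming the `Job ID: abc123` message format shown above (`out` stands in for the captured command output):

```bash
# 'out' stands in for: out=$(docs-scraper scrape "$url" 2>&1)
out='Scrape blocked
Job ID: abc123'
# Extract whatever follows "Job ID: " on its own line
job_id=$(printf '%s\n' "$out" | sed -n 's/^Job ID: //p')
echo "$job_id"   # prints abc123
```

The extracted ID can then be passed to `docs-scraper update`.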

Profile management

Profiles store session cookies for authenticated sites.

```bash
docs-scraper profiles list     # List saved profiles
docs-scraper profiles clear    # Clear all profiles
docs-scraper scrape <url> -p myprofile  # Use a profile
```

Daemon management

The daemon keeps browser instances warm for faster scraping.

```bash
docs-scraper daemon status     # Check status
docs-scraper daemon start      # Start manually
docs-scraper daemon stop       # Stop daemon
```

Note: Daemon auto-starts when running scrape commands.

Cleanup

PDFs are stored in `~/.docs-scraper/output/`. The daemon automatically cleans up files older than 1 hour.
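The daemon's age-based cleanup can be approximated with standard `find(1)`. This is an equivalent sketch, not the tool's own code; `OUTPUT_DIR` defaults to the documented output directory:

```bash
# Delete PDFs modified more than 60 minutes ago, like the daemon's 1-hour cleanup
OUTPUT_DIR="${OUTPUT_DIR:-$HOME/.docs-scraper/output}"
mkdir -p "$OUTPUT_DIR"
# -mmin +60 matches files last modified more than 60 minutes ago
find "$OUTPUT_DIR" -name '*.pdf' -mmin +60 -delete
```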

Manual cleanup:

```bash
docs-scraper cleanup                    # Delete all PDFs
docs-scraper cleanup --older-than 1h    # Delete PDFs older than 1 hour
```

Job management

```bash
docs-scraper jobs list         # List blocked jobs awaiting auth
```

Supported sources

  • Direct PDF links - Downloads PDF directly
  • Notion pages - Exports Notion page to PDF
  • DocSend documents - Handles DocSend viewer
  • LLM fallback - Uses Claude API for any other webpage
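The dispatch between these scrapers is internal to docs-scraper, but the URL patterns above can be sketched as a simple router (the function name and match order here are assumptions for illustration):

```bash
# Sketch of URL-to-scraper routing; not the tool's actual implementation
route() {
  case "$1" in
    *.pdf)                                echo "DirectPdfScraper"   ;;
    *docsend.com/view/*|*docsend.com/v/*) echo "DocsendScraper"     ;;
    *notion.so/*|*.notion.site/*)         echo "NotionScraper"      ;;
    *)                                    echo "LlmFallbackScraper" ;;
  esac
}

route "https://example.com/report.pdf"    # DirectPdfScraper
route "https://docsend.com/view/abc123"   # DocsendScraper
route "https://notion.so/Page-abc123"     # NotionScraper
route "https://example.com/article"       # LlmFallbackScraper
```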

Scraper Reference

Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.

DirectPdfScraper

Handles: URLs ending in `.pdf`

Data fields: None (downloads directly)

Example:

```bash
docs-scraper scrape https://example.com/document.pdf
```


DocsendScraper

Handles: `docsend.com/view/*`, `docsend.com/v/*`, and subdomains (e.g., `org-a.docsend.com`)

URL patterns:

  • Documents: https://docsend.com/view/{id} or https://docsend.com/v/{id}
  • Folders: https://docsend.com/view/s/{id}
  • Subdomains: https://{subdomain}.docsend.com/view/{id}

Data fields:

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Email address for document access |
| `password` | password | Passcode/password for protected documents |
| `name` | text | Your name (required for NDA-gated documents) |

Examples:

```bash
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"

# Retry blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
```

Notes:

  • DocSend may require any combination of email, password, and name
  • Folders are scraped as a table of contents PDF with document links
  • The scraper auto-checks NDA checkboxes when name is provided

NotionScraper

Handles: `notion.so/*`, `*.notion.site/*`

Data fields:

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Notion account email |
| `password` | password | Notion account password |

Examples:

```bash
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
```

Notes:

  • Public Notion pages don't require authentication
  • Toggle blocks are automatically expanded before PDF generation
  • Uses session profiles to persist login across scrapes

LlmFallbackScraper

Handles: Any URL not matched by other scrapers (automatic fallback)

Data fields: Dynamic - determined by Claude analyzing the page

The LLM scraper uses Claude to analyze the page HTML and detect:

  • Login forms (extracts field names dynamically)
  • Cookie banners (auto-dismisses)
  • Expandable content (auto-expands)
  • CAPTCHAs (reports as blocked)
  • Paywalls (reports as blocked)

Common dynamic fields:

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Login email (if detected) |
| `password` | password | Login password (if detected) |
| `username` | text | Username (if login uses username) |

Examples:

```bash
# Generic webpage (no auth)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
```

Notes:

  • Requires ANTHROPIC_API_KEY environment variable
  • Field names are extracted from the page's actual form fields
  • Limited to 2 login attempts before failing
  • CAPTCHAs require manual intervention

Data field summary

| Scraper | email | password | name | Other |
|---------|-------|----------|------|-------|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓* | ✓* | - | Dynamic* |

*Fields detected dynamically from page analysis

Environment setup (optional)

Only needed for LLM fallback scraper:

```bash
export ANTHROPIC_API_KEY=your_key
```

Optional browser settings:

```bash
export BROWSER_HEADLESS=true   # Set false for debugging
```

Common patterns

Archive a Notion page:

```bash
docs-scraper scrape https://notion.so/My-Page-abc123
```

Download protected DocSend:

```bash
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
```

Batch scraping with profiles:

```bash
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
```
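The same pattern scales to a whole URL list. A sketch, assuming one URL per line on stdin (`printf` stands in for a real URL file, and `echo` stands in for the real call so the loop can be dry-run):

```bash
# Batch-scrape a list of URLs with a shared profile (dry-run version)
printf '%s\n' \
  'https://site.com/doc1' \
  'https://site.com/doc2' |
while IFS= read -r url; do
  echo "docs-scraper scrape $url -p mysite"   # drop 'echo' to run for real
done
```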

Output

Success: local file path, e.g. `~/.docs-scraper/output/1706123456-abc123.pdf`
Blocked: job ID plus the required credential types

Troubleshooting

  • Timeout: restart the daemon with `docs-scraper daemon stop && docs-scraper daemon start`
  • Auth fails: run `docs-scraper jobs list` to check pending jobs
  • Disk full: run `docs-scraper cleanup` to remove old PDFs

Installation

```bash
openclaw install links-to-pdfs
```

Quick Info

Category: File Management
Model: Claude 3.5
Complexity: One-Click
Author: chrisling-dev
Last Updated: 3/10/2026