
Smart Web Scraper

Extract structured data from any web page.

Rating: 4.1 (432 reviews)
Downloads: 11,479
Version: 1.0.0

Overview

Extract structured data from any web page.

Complete Documentation


Smart Web Scraper

Extract structured data from web pages into clean JSON or CSV.

Quick Start

```bash
# Scrape a page, extract all text content
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com"

# Extract specific elements with CSS selector
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com/products" -s ".product-card"

# Auto-detect and extract tables
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing"

# Extract all links from a page
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com"

# Extract structured data (title, meta, headings, links)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py structure "https://example.com"

# Output as JSON
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".item" -f json

# Output as CSV
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s "table tr" -f csv

# Save to file
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".product" -f json -o products.json

# Multi-page scrape (follow pagination)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py crawl "https://example.com/page/1" --pages 5 -s ".article"
```

Commands

| Command | Args | Description |
|---------|------|-------------|
| `extract` | `<url> [-s selector] [-f format] [-o file]` | Extract content, optionally filtered by CSS selector |
| `tables` | `<url> [-f format] [-o file]` | Auto-detect and extract all HTML tables |
| `links` | `<url> [--external] [--internal]` | Extract all links (href + text) |
| `structure` | `<url>` | Extract page structure: title, meta, headings, images, links |
| `crawl` | `<url> --pages N [-s selector] [-f format] [-o file]` | Follow pagination links, extract from multiple pages |

Output Formats

| Format | Flag | Description |
|--------|------|-------------|
| Text | `-f text` | Plain text (default) |
| JSON | `-f json` | Structured JSON array |
| CSV | `-f csv` | Comma-separated values |
| Markdown | `-f md` | Markdown-formatted |
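
For instance, to capture a page's headings as a Markdown outline (the selector here is illustrative; adjust it to the page you are scraping):

```bash
# Extract h1/h2 headings and save them as Markdown
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s "h1, h2" -f md -o outline.md
```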

Examples

Extract product listings

```bash
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://shop.example.com" -s ".product" -f json
```

Output:

```json
[
  {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
  {"text": "Widget Max - $49.99", "tag": "div", "class": "product"}
]
```
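
When `-o` is omitted and the JSON goes to stdout, it can be piped into standard tools such as jq; this sketch assumes the array-of-objects shape shown above:

```bash
# Keep only the text of each matched element
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://shop.example.com" -s ".product" -f json | jq -r '.[].text'
```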

Extract pricing table

```bash
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing" -f csv
```
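
Because the `tables` command also accepts `-o` (see the Commands table above), the result can go straight to a file:

```bash
# Write every detected table to pricing.csv instead of stdout
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing" -f csv -o pricing.csv
```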

Get all external links

```bash
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com" --external
```
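
The complementary `--internal` flag (listed in the Commands table) keeps only links that stay on the same site:

```bash
# Only same-domain links
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com" --internal
```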

Rate Limiting

  • Default: 1 request per second (respectful crawling)
  • Override with --delay 0.5 (seconds between requests)
  • Respects robots.txt by default (override with --ignore-robots)
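
Putting the flags above together, a slower, polite multi-page crawl could look like this (the 2-second delay is purely illustrative, and this assumes `--delay` is accepted by `crawl` as the notes above suggest):

```bash
# Crawl five pages, waiting 2 seconds between requests
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py crawl "https://example.com/page/1" --pages 5 -s ".article" --delay 2
```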

Notes

  • Requires beautifulsoup4 and lxml (auto-installed by uv run --with)
  • Uses a standard browser User-Agent to avoid blocks
  • Handles redirects, encoding detection, and error pages gracefully
  • No JavaScript rendering (use for static HTML pages)
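
If you would rather not repeat the `--with` flags on every invocation, one alternative (a sketch, not part of the documented workflow) is to install the dependencies once into a local uv environment:

```bash
# One-time setup of a local environment with the required packages
uv venv
uv pip install beautifulsoup4 lxml

# Then call the scraper through that environment's interpreter
.venv/bin/python scripts/scraper.py extract "https://example.com"
```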

Installation

```bash
openclaw install smart-web-scraper
```


Tags

#web-and-frontend-development #data #web

Quick Info

Category: Development
Model: Claude 3.5
Complexity: One-Click
Author: mariusfit
Last Updated: 3/10/2026
Optimized for: Claude 3.5
