πŸ“š Step-by-Step Tutorial Β· Beginner Level

Your First Web Scraper

Learn to build web scrapers with natural language requests - real examples, CSS selectors, and error handling

🎯 Hands-on Β· πŸ’» Code Examples Β· πŸ“Š Real Projects Β· βœ… Best Practices

βœ“ Updated: March 2025 Β· βœ“ Beginner Friendly Β· βœ“ Free Forever

Learn to use OpenClaw’s browser automation capabilities to extract data from websites.

🎯 What You’ll Learn

How to use OpenClaw’s built-in browser tool to:

  • Navigate to websites
  • Extract data using CSS selectors
  • Handle JavaScript-rendered content
  • Save results to files
  • Create reusable scraping workflows

Real-world example: Build a news aggregator that scrapes headlines from tech news sites.


πŸ“‹ Prerequisites

  • βœ… Completed 15-Minute Quick Start
  • βœ… OpenClaw Gateway running (openclaw gateway)
  • βœ… Basic understanding of HTML/CSS selectors
  • βœ… Node.js 22+ installed

πŸ› οΈ Understanding OpenClaw’s Browser Tool

OpenClaw includes a powerful browser automation tool that you can control through natural language or direct commands. The browser tool can:

  • Navigate to any URL
  • Wait for page elements to load
  • Extract text, links, images, and structured data
  • Handle dynamic JavaScript content
  • Take screenshots
  • Fill forms and click buttons

How it works:

You β†’ OpenClaw (via chat/channel) β†’ Gateway β†’ Browser Tool β†’ Website

πŸ“ Step 1: Your First Scraping Task (5 minutes)

Start the Gateway

Make sure your OpenClaw Gateway is running:

# Start the gateway
openclaw gateway --port 18789 --verbose

# In another terminal, verify it's running
curl http://localhost:18789/health

Open WebChat UI

Navigate to:

http://localhost:18789

This opens the WebChat interface where you can interact with your AI assistant.

Your First Request

In the WebChat, type:

Can you go to https://techcrunch.com and extract the latest 5 article headlines and links?

What happens:

  1. OpenClaw understands your request
  2. Uses the browser tool to navigate to TechCrunch
  3. Waits for the page to load
  4. Extracts the article headlines and links
  5. Returns the data to you in the chat

Expected response: The assistant will return something like:

Here are the latest 5 articles from TechCrunch:

1. "AI Breakthrough in Natural Language Processing"
   Link: https://techcrunch.com/2025/03/15/ai-breakthrough-nlp/

2. "New JavaScript Framework Promises 10x Performance"
   Link: https://techcrunch.com/2025/03/15/javascript-framework/

... (and so on)

πŸ”„ Step 2: Create a Reusable Scraping Workflow (10 minutes)

Understanding OpenClaw Skills

In OpenClaw, skills are Markdown files that define reusable capabilities. Let’s create a web scraping skill.

Create Your First Skill

# Navigate to workspace skills directory
cd ~/.openclaw/workspace/skills

# Create a new skill directory
mkdir news-scraper
cd news-scraper

Create the Skill Definition

Create SKILL.md:

# News Scraper

## Description
Extracts the latest news articles from technology news websites.

## Capabilities
- Navigate to news websites
- Extract article headlines, links, and summaries
- Handle JavaScript-rendered content
- Support for multiple news sites

## Usage
Simply ask: "Scrape the latest news from [website]" or "Get headlines from [site]"

## Supported Sites
- TechCrunch (https://techcrunch.com)
- The Verge (https://theverge.com)
- Ars Technica (https://arstechnica.com)
- Hacker News (https://news.ycombinator.com)

## Output Format
Returns structured data with:
- title: Article title
- link: Article URL
- date: Publication date
- summary: Brief description

Test Your New Skill

In WebChat, try:

Use the news-scraper skill to get the latest headlines from The Verge.

OpenClaw will now use your defined skill to perform the scraping task.


🌐 Step 3: Advanced Scraping Techniques (15 minutes)

Handling JavaScript-Rendered Content

Many modern websites use JavaScript to load content. OpenClaw’s browser tool automatically handles this.

Try this request:

Go to https://news.ycombinator.com and extract the top 10 stories with their points and comment counts.

The browser tool will:

  • Wait for the page to fully load
  • Execute any JavaScript
  • Extract the data once it’s visible

Custom CSS Selectors

You can specify exactly what to extract:

Scrape https://example.com/products and extract:
- Product names (class: .product-name)
- Prices (class: .product-price)
- Ratings (class: .rating)
- Stock availability (class: .stock-status)
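Under the hood, a prompt like that maps onto ordinary CSS-selector queries. Here is a minimal Python sketch using BeautifulSoup (the `beautifulsoup4` package) on an inline HTML snippet; the class names match the hypothetical prompt above and are not from any real site:

```python
# Illustrative only: what selector-based extraction looks like,
# using BeautifulSoup on a small inline HTML snippet.
from bs4 import BeautifulSoup

html = """
<div class="product">
  <span class="product-name">Widget A</span>
  <span class="product-price">$19.99</span>
  <span class="rating">4.5</span>
  <span class="stock-status">In stock</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
product = {
    "name": soup.select_one(".product-name").get_text(strip=True),
    "price": soup.select_one(".product-price").get_text(strip=True),
    "rating": soup.select_one(".rating").get_text(strip=True),
    "stock": soup.select_one(".stock-status").get_text(strip=True),
}
print(product)
```

Knowing what these selectors do makes your natural-language prompts more precise, even though OpenClaw writes no code for you to maintain.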

Pagination

For multi-page content:

Scrape the first 3 pages of search results from https://example.com/search?q=automation
Get all results and compile them into a single list.

Waiting for Specific Elements

If a site has slow-loading content:

Go to https://slow-site.com/data
Wait for the .results-container to appear
Then extract all .data-item elements
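"Wait for an element" boils down to polling with a timeout. A generic Python sketch of the pattern (the `check` function is a stand-in for querying `.results-container` on a live page):

```python
# Generic wait-with-timeout pattern: poll a check function until
# it returns something truthy, or raise once the timeout expires.
import time

def wait_for(check, timeout=10.0, interval=0.25):
    """Poll check() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("element did not appear in time")

# Simulated slow-loading element: becomes available after ~0.5 s.
start = time.monotonic()
found = wait_for(lambda: time.monotonic() - start > 0.5)
print(found)  # True
```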

πŸ’Ύ Step 4: Saving Data to Files (5 minutes)

Save to JSON

Scrape product data from https://example.com/products and save it to a file called products.json in my downloads folder.

Save to CSV

Extract the article data and save it as a CSV file with columns: title, link, date, author.

Append to Existing Files

Scrape today's news and append it to my news-archive.json file.
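Appending to a JSON file is really read-modify-write, since a JSON array can't be appended to in place. A hedged Python sketch of the safe pattern (the archive filename follows the prompt; the item shape is illustrative):

```python
# Read the existing archive (or start fresh), extend it, and
# write it back via a temp file so a crash can't corrupt it.
import json
import os
import tempfile

def append_to_archive(path, new_items):
    try:
        with open(path) as f:
            archive = json.load(f)
    except FileNotFoundError:
        archive = []
    archive.extend(new_items)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(archive, f, indent=2)
    os.replace(tmp, path)  # atomic swap on the same filesystem
    return len(archive)

append_to_archive("news-archive.json",
                  [{"title": "Example story", "link": "https://example.com"}])
```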

πŸ”§ Step 5: Creating Automation Workflows (10 minutes)

Schedule Regular Scraping

You can set up recurring scraping tasks:

Every morning at 9 AM, scrape the front page of TechCrunch and send me a summary of the top 10 articles.

OpenClaw will use its built-in cron/scheduling capabilities to run this automatically.

Conditional Scraping

Monitor https://example.com/products
Alert me if any product price drops below $100

Multi-Site Aggregation

Scrape the latest headlines from TechCrunch, The Verge, and Ars Technica
Combine them into a single feed
Remove duplicates
Sort by publication date
Save the combined feed to tech-news-feed.json
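The combine/deduplicate/sort steps above can be sketched in a few lines of Python (the article dicts and ISO date strings are illustrative assumptions, not OpenClaw's internal format):

```python
# Merge article lists from several sites, drop duplicate titles
# (case-insensitive), and sort newest first. ISO date strings
# sort correctly as plain text.
feeds = [
    [{"title": "AI News", "date": "2025-03-15", "site": "TechCrunch"}],
    [{"title": "ai news", "date": "2025-03-15", "site": "The Verge"},
     {"title": "New GPU", "date": "2025-03-14", "site": "The Verge"}],
]

seen = set()
combined = []
for feed in feeds:
    for article in feed:
        key = article["title"].lower()
        if key not in seen:  # first site to report a story wins
            seen.add(key)
            combined.append(article)

combined.sort(key=lambda a: a["date"], reverse=True)
print([a["title"] for a in combined])  # ['AI News', 'New GPU']
```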

βš™οΈ Step 6: Troubleshooting Common Issues (10 minutes)

Issue: β€œPage not loading”

Solution: Some sites block automated browsers. Try:

Use stealth mode and rotate user agents when scraping https://protected-site.com

Issue: β€œData not found”

Solution: The page structure may have changed. Ask OpenClaw to:

Inspect the page structure at https://example.com and tell me what CSS selectors I should use.

Issue: β€œJavaScript content not loading”

Solution: Increase wait time:

Go to https://dynamic-site.com
Wait up to 10 seconds for the .main-content to load
Then extract the data

Issue: β€œRate limiting”

Solution: Slow down requests:

Scrape all products from https://example.com
Add a 2-second delay between each page
If the server starts returning errors, back off and slow down
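The "delay between pages" request amounts to a throttle that enforces a minimum gap between requests. A small Python sketch of the pattern (interval shortened here so it runs quickly; a real scrape would use ~2 seconds):

```python
# Enforce a minimum interval between requests, regardless of how
# fast the surrounding processing is.
import time

class Throttle:
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last = 0.0

    def wait(self):
        gap = time.monotonic() - self.last
        if gap < self.min_interval:
            time.sleep(self.min_interval - gap)
        self.last = time.monotonic()

throttle = Throttle(min_interval=0.1)  # use ~2.0 for real scraping
start = time.monotonic()
for page in range(3):
    throttle.wait()  # fetch_page(page) would go here
elapsed = time.monotonic() - start    # ~0.2 s: two enforced gaps
```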

🎯 Real-World Examples

Example 1: Price Monitoring

Monitor the price of "MacBook Pro 16" on Amazon
Check every hour
Alert me if the price drops below $2000
Save the price history to a file

Example 2: Content Aggregation

Create a daily tech news digest:
1. Scrape top 20 articles from TechCrunch
2. Scrape top 20 articles from The Verge
3. Scrape top 20 articles from Ars Technica
4. Remove duplicates based on article titles
5. Sort by recency
6. Generate a summary with key themes
7. Save to tech-digest-[date].json

Example 3: Social Media Monitoring

Monitor Hacker News for mentions of "OpenClaw"
Extract the title, link, points, and comment count
Alert me if any post gets more than 100 points
Save notable posts to openclaw-mentions.json

πŸ” Advanced Techniques

Working with Forms

Go to https://example.com/search
Fill in the search box with "AI automation"
Click the search button
Extract the top 10 results

Handling Authentication

For sites requiring login:

Go to https://example.com/login
Fill in username: [email protected]
Fill in password: mypassword
Click the login button
Wait for the dashboard to load
Then scrape the user profile data

⚠️ Security Note: Never share sensitive credentials. Use environment variables or secure credential management.

Screenshot Automation

Go to https://example.com
Take a screenshot of the hero section
Save it to homepage-screenshot.png

βœ… Best Practices

1. Respect robots.txt

Always check if a site allows scraping:

Check the robots.txt file for https://example.com and tell me what scraping is allowed.
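You can also check robots.txt yourself with Python's standard-library `urllib.robotparser`. The rules below are an inline example; against a live site you would call `rp.read()` on the site's `/robots.txt` instead:

```python
# Parse a robots.txt ruleset and ask whether specific URLs may
# be fetched by a given user agent.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/products"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```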

2. Rate Limiting

Be polite to servers:

When scraping, add a 1-2 second delay between requests to avoid overloading the server.

3. Error Handling

Make your scraping robust:

Try to scrape https://example.com/data
If it fails, wait 30 seconds and retry up to 3 times
If still failing, alert me with the error details
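That retry policy is a classic pattern. A Python sketch matching the prompt (wait, retry up to 3 times, then surface the error; the flaky site is simulated):

```python
# Try the scrape; on failure, wait and retry a fixed number of
# times before raising with the last error's details.
import time

def scrape_with_retry(scrape, retries=3, delay=30):
    last_error = None
    for attempt in range(retries):
        try:
            return scrape()
        except Exception as err:
            last_error = err
            time.sleep(delay)  # 30 s per the prompt; shorten for testing
    raise RuntimeError(f"scrape failed after {retries} attempts: {last_error}")

# Simulated flaky site: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return "page content"

print(scrape_with_retry(flaky, delay=0.01))  # page content
```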

4. Data Validation

Verify extracted data:

Scrape product prices from https://example.com/products
Verify that all prices are valid numbers
Alert me if any price looks suspicious (negative, zero, or extremely high)
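The validation step can be sketched directly: parse each price string and flag anything that fails (the threshold and price formats are illustrative assumptions):

```python
# Flag prices that are unparseable, non-positive, or implausibly
# high, returning (raw_value, reason) pairs for alerting.
def validate_prices(raw_prices, max_reasonable=100_000):
    suspicious = []
    for raw in raw_prices:
        try:
            value = float(raw.replace("$", "").replace(",", ""))
        except ValueError:
            suspicious.append((raw, "not a number"))
            continue
        if value <= 0:
            suspicious.append((raw, "zero or negative"))
        elif value > max_reasonable:
            suspicious.append((raw, "implausibly high"))
    return suspicious

print(validate_prices(["$19.99", "$0.00", "N/A", "$1,299.00", "$9999999"]))
```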

5. Incremental Scraping

For large sites, break it down:

Scrape https://example.com/products
Get only the first 20 products
Save to page-1.json
Then continue with the next 20
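The page-at-a-time approach is simple chunking. A Python sketch of slicing scraped items into pages of 20 and saving each page, following the filenames in the prompt (the item shape is illustrative):

```python
# Split a large item list into fixed-size pages and save each
# page to its own JSON file (page-1.json, page-2.json, ...).
import json

def save_in_pages(items, page_size=20):
    filenames = []
    for i in range(0, len(items), page_size):
        page = items[i:i + page_size]
        name = f"page-{i // page_size + 1}.json"
        with open(name, "w") as f:
            json.dump(page, f, indent=2)
        filenames.append(name)
    return filenames

files = save_in_pages([{"id": n} for n in range(45)])
print(files)  # ['page-1.json', 'page-2.json', 'page-3.json']
```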

πŸ“Š Understanding OpenClaw’s Browser Capabilities

OpenClaw’s browser tool is built on modern browser automation technologies and supports:

  • Full JavaScript execution: All JS on the page runs normally
  • CSS selectors: Use any CSS selector to target elements
  • XPath queries: For complex element selection
  • Screenshots: Visual verification of scraped content
  • Form interaction: Fill forms, click buttons, upload files
  • Multi-tab support: Work with multiple pages simultaneously
  • Cookie management: Maintain sessions across requests
  • Proxy support: Route traffic through proxies if needed



πŸ†˜ Need Help?

  • πŸ’¬ Ask OpenClaw directly: Just describe what you want to scrape in natural language
  • πŸ“– Official Documentation - Detailed browser tool reference
  • 🌟 Community Examples - Real-world usage examples
  • πŸ› GitHub Issues - Report bugs or request features

⏱️ Total Time: ~55 minutes πŸ“Š Difficulty: Beginner 🎯 Result: Successfully scraping websites with OpenClaw


πŸ’‘ Key Takeaways

  1. Natural Language Interface: You don’t need to write code - just describe what you want in plain English
  2. Built-in Browser Tool: OpenClaw includes powerful browser automation out of the box
  3. Skills are Simple: Skills are just Markdown files that describe capabilities
  4. Flexible Automation: Schedule, monitor, and automate any web scraping task
  5. Production Ready: Handle JavaScript, forms, authentication, and complex scenarios

Next: Try scraping your first website by asking OpenClaw in WebChat!

πŸŽ‰

Congratulations!

You've completed this tutorial. Ready for the next challenge?