πŸ“š Step-by-Step Tutorial Β· Beginner Level

Your First Web Scraper

Learn to build web scrapers with natural language requests - real examples, CSS selectors, and error handling

🎯 Hands-on Β· πŸ’» Code Examples Β· πŸ“Š Real Projects Β· βœ… Best Practices

βœ“ Updated: March 2025 Β· βœ“ Beginner Friendly Β· βœ“ Free Forever

Learn to use OpenClaw’s browser automation capabilities to extract data from websites.

🎯 What You’ll Learn

How to use OpenClaw’s built-in browser tool to:

  • Navigate to websites
  • Extract data using CSS selectors
  • Handle JavaScript-rendered content
  • Save results to files
  • Create reusable scraping workflows

Real-world example: Build a news aggregator that scrapes headlines from tech news sites.


πŸ“‹ Prerequisites

  • βœ… Completed 15-Minute Quick Start
  • βœ… OpenClaw Gateway running (openclaw gateway)
  • βœ… Basic understanding of HTML/CSS selectors
  • βœ… Node.js 22+ installed

πŸ› οΈ Understanding OpenClaw’s Browser Tool

OpenClaw includes a powerful browser automation tool that you can control through natural language or direct commands. The browser tool can:

  • Navigate to any URL
  • Wait for page elements to load
  • Extract text, links, images, and structured data
  • Handle dynamic JavaScript content
  • Take screenshots
  • Fill forms and click buttons

How it works:

You β†’ OpenClaw (via chat/channel) β†’ Gateway β†’ Browser Tool β†’ Website

πŸ“ Step 1: Your First Scraping Task (5 minutes)

Start the Gateway

Make sure your OpenClaw Gateway is running:

# Start the gateway
openclaw gateway --port 18789 --verbose

# In another terminal, verify it's running
curl http://localhost:18789/health

Open WebChat UI

Navigate to:

http://localhost:18789

This opens the WebChat interface where you can interact with your AI assistant.

Your First Request

In the WebChat, type:

Can you go to https://techcrunch.com and extract the latest 5 article headlines and links?

What happens:

  1. OpenClaw understands your request
  2. Uses the browser tool to navigate to TechCrunch
  3. Waits for the page to load
  4. Extracts the article headlines and links
  5. Returns the data to you in the chat

Expected response: The assistant will return something like:

Here are the latest 5 articles from TechCrunch:

1. "AI Breakthrough in Natural Language Processing"
   Link: https://techcrunch.com/2025/03/15/ai-breakthrough-nlp/

2. "New JavaScript Framework Promises 10x Performance"
   Link: https://techcrunch.com/2025/03/15/javascript-framework/

... (and so on)

πŸ”„ Step 2: Create a Reusable Scraping Workflow (10 minutes)

Understanding OpenClaw Skills

In OpenClaw, skills are Markdown files that define reusable capabilities. Let’s create a web scraping skill.

Create Your First Skill

# Navigate to workspace skills directory
cd ~/.openclaw/workspace/skills

# Create a new skill directory
mkdir news-scraper
cd news-scraper

Create the Skill Definition

Create SKILL.md:

# News Scraper

## Description
Extracts the latest news articles from technology news websites.

## Capabilities
- Navigate to news websites
- Extract article headlines, links, and summaries
- Handle JavaScript-rendered content
- Support for multiple news sites

## Usage
Simply ask: "Scrape the latest news from [website]" or "Get headlines from [site]"

## Supported Sites
- TechCrunch (https://techcrunch.com)
- The Verge (https://theverge.com)
- Ars Technica (https://arstechnica.com)
- Hacker News (https://news.ycombinator.com)

## Output Format
Returns structured data with:
- title: Article title
- link: Article URL
- date: Publication date
- summary: Brief description

Test Your New Skill

In WebChat, try:

Use the news-scraper skill to get the latest headlines from The Verge.

OpenClaw will now use your defined skill to perform the scraping task.


🌐 Step 3: Advanced Scraping Techniques (15 minutes)

Handling JavaScript-Rendered Content

Many modern websites use JavaScript to load content. OpenClaw’s browser tool automatically handles this.

Try this request:

Go to https://news.ycombinator.com and extract the top 10 stories with their points and comment counts.

The browser tool will:

  • Wait for the page to fully load
  • Execute any JavaScript
  • Extract the data once it’s visible

Custom CSS Selectors

You can specify exactly what to extract:

Scrape https://example.com/products and extract:
- Product names (class: .product-name)
- Prices (class: .product-price)
- Ratings (class: .rating)
- Stock availability (class: .stock-status)
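Under the hood, a prompt like that maps onto ordinary CSS-selector queries. Here is a minimal Python sketch using BeautifulSoup (the `beautifulsoup4` package) on an inline HTML snippet; the class names match the hypothetical prompt above and are not from any real site:

```python
# Illustrative only: what selector-based extraction looks like,
# using BeautifulSoup on a small inline HTML snippet.
from bs4 import BeautifulSoup

html = """
<div class="product">
  <span class="product-name">Widget A</span>
  <span class="product-price">$19.99</span>
  <span class="rating">4.5</span>
  <span class="stock-status">In stock</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
product = {
    "name": soup.select_one(".product-name").get_text(strip=True),
    "price": soup.select_one(".product-price").get_text(strip=True),
    "rating": soup.select_one(".rating").get_text(strip=True),
    "stock": soup.select_one(".stock-status").get_text(strip=True),
}
print(product)
```

Knowing what these selectors do makes your natural-language prompts more precise, even though OpenClaw writes no code for you to maintain.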

Pagination

For multi-page content:

Scrape the first 3 pages of search results from https://example.com/search?q=automation
Get all results and compile them into a single list.

Waiting for Specific Elements

If a site has slow-loading content:

Go to https://slow-site.com/data
Wait for the .results-container to appear
Then extract all .data-item elements
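"Wait for an element" boils down to polling with a timeout. A generic Python sketch of the pattern (the `check` function is a stand-in for querying `.results-container` on a live page):

```python
# Generic wait-with-timeout pattern: poll a check function until
# it returns something truthy, or raise once the timeout expires.
import time

def wait_for(check, timeout=10.0, interval=0.25):
    """Poll check() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("element did not appear in time")

# Simulated slow-loading element: becomes available after ~0.5 s.
start = time.monotonic()
found = wait_for(lambda: time.monotonic() - start > 0.5)
print(found)  # True
```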

πŸ’Ύ Step 4: Saving Data to Files (5 minutes)

Save to JSON

Scrape product data from https://example.com/products and save it to a file called products.json in my downloads folder.

Save to CSV

Extract the article data and save it as a CSV file with columns: title, link, date, author.

Append to Existing Files

Scrape today's news and append it to my news-archive.json file.
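Appending to a JSON file is really read-modify-write, since a JSON array can't be appended to in place. A hedged Python sketch of the safe pattern (the archive filename follows the prompt; the item shape is illustrative):

```python
# Read the existing archive (or start fresh), extend it, and
# write it back via a temp file so a crash can't corrupt it.
import json
import os
import tempfile

def append_to_archive(path, new_items):
    try:
        with open(path) as f:
            archive = json.load(f)
    except FileNotFoundError:
        archive = []
    archive.extend(new_items)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(archive, f, indent=2)
    os.replace(tmp, path)  # atomic swap on the same filesystem
    return len(archive)

append_to_archive("news-archive.json",
                  [{"title": "Example story", "link": "https://example.com"}])
```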

πŸ”§ Step 5: Creating Automation Workflows (10 minutes)

Schedule Regular Scraping

You can set up recurring scraping tasks:

Every morning at 9 AM, scrape the front page of TechCrunch and send me a summary of the top 10 articles.

OpenClaw will use its built-in cron/scheduling capabilities to run this automatically.

Conditional Scraping

Monitor https://example.com/products
Alert me if any product price drops below $100

Multi-Site Aggregation

Scrape the latest headlines from TechCrunch, The Verge, and Ars Technica
Combine them into a single feed
Remove duplicates
Sort by publication date
Save the combined feed to tech-news-feed.json
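The combine/deduplicate/sort steps above can be sketched in a few lines of Python (the article dicts and ISO date strings are illustrative assumptions, not OpenClaw's internal format):

```python
# Merge article lists from several sites, drop duplicate titles
# (case-insensitive), and sort newest first. ISO date strings
# sort correctly as plain text.
feeds = [
    [{"title": "AI News", "date": "2025-03-15", "site": "TechCrunch"}],
    [{"title": "ai news", "date": "2025-03-15", "site": "The Verge"},
     {"title": "New GPU", "date": "2025-03-14", "site": "The Verge"}],
]

seen = set()
combined = []
for feed in feeds:
    for article in feed:
        key = article["title"].lower()
        if key not in seen:  # first site to report a story wins
            seen.add(key)
            combined.append(article)

combined.sort(key=lambda a: a["date"], reverse=True)
print([a["title"] for a in combined])  # ['AI News', 'New GPU']
```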

βš™οΈ Step 6: Troubleshooting Common Issues (10 minutes)

Issue: β€œPage not loading”

Solution: Some sites block automated browsers. Try:

Use stealth mode and rotate user agents when scraping https://protected-site.com

Issue: β€œData not found”

Solution: The page structure may have changed. Ask OpenClaw to:

Inspect the page structure at https://example.com and tell me what CSS selectors I should use.

Issue: β€œJavaScript content not loading”

Solution: Increase wait time:

Go to https://dynamic-site.com
Wait up to 10 seconds for the .main-content to load
Then extract the data

Issue: β€œRate limiting”

Solution: Slow down requests:

Scrape all products from https://example.com
Add a 2-second delay between each page
If the server starts returning errors, back off and slow down
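The "delay between pages" request amounts to a throttle that enforces a minimum gap between requests. A small Python sketch of the pattern (interval shortened here so it runs quickly; a real scrape would use ~2 seconds):

```python
# Enforce a minimum interval between requests, regardless of how
# fast the surrounding processing is.
import time

class Throttle:
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self.last = 0.0

    def wait(self):
        gap = time.monotonic() - self.last
        if gap < self.min_interval:
            time.sleep(self.min_interval - gap)
        self.last = time.monotonic()

throttle = Throttle(min_interval=0.1)  # use ~2.0 for real scraping
start = time.monotonic()
for page in range(3):
    throttle.wait()  # fetch_page(page) would go here
elapsed = time.monotonic() - start    # ~0.2 s: two enforced gaps
```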

🎯 Real-World Examples

Example 1: Price Monitoring

Monitor the price of "MacBook Pro 16" on Amazon
Check every hour
Alert me if the price drops below $2000
Save the price history to a file

Example 2: Content Aggregation

Create a daily tech news digest:
1. Scrape top 20 articles from TechCrunch
2. Scrape top 20 articles from The Verge
3. Scrape top 20 articles from Ars Technica
4. Remove duplicates based on article titles
5. Sort by recency
6. Generate a summary with key themes
7. Save to tech-digest-[date].json

Example 3: Social Media Monitoring

Monitor Hacker News for mentions of "OpenClaw"
Extract the title, link, points, and comment count
Alert me if any post gets more than 100 points
Save notable posts to openclaw-mentions.json

πŸ” Advanced Techniques

Working with Forms

Go to https://example.com/search
Fill in the search box with "AI automation"
Click the search button
Extract the top 10 results

Handling Authentication

For sites requiring login:

Go to https://example.com/login
Fill in username: [email protected]
Fill in password: mypassword
Click the login button
Wait for the dashboard to load
Then scrape the user profile data

⚠️ Security Note: Never share sensitive credentials. Use environment variables or secure credential management.

Screenshot Automation

Go to https://example.com
Take a screenshot of the hero section
Save it to homepage-screenshot.png

βœ… Best Practices

1. Respect robots.txt

Always check if a site allows scraping:

Check the robots.txt file for https://example.com and tell me what scraping is allowed.
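You can also check robots.txt yourself with Python's standard-library `urllib.robotparser`. The rules below are an inline example; against a live site you would call `rp.read()` on the site's `/robots.txt` instead:

```python
# Parse a robots.txt ruleset and ask whether specific URLs may
# be fetched by a given user agent.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/products"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```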

2. Rate Limiting

Be polite to servers:

When scraping, add a 1-2 second delay between requests to avoid overloading the server.

3. Error Handling

Make your scraping robust:

Try to scrape https://example.com/data
If it fails, wait 30 seconds and retry up to 3 times
If still failing, alert me with the error details
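That retry policy is a classic pattern. A Python sketch matching the prompt (wait, retry up to 3 times, then surface the error; the flaky site is simulated):

```python
# Try the scrape; on failure, wait and retry a fixed number of
# times before raising with the last error's details.
import time

def scrape_with_retry(scrape, retries=3, delay=30):
    last_error = None
    for attempt in range(retries):
        try:
            return scrape()
        except Exception as err:
            last_error = err
            time.sleep(delay)  # 30 s per the prompt; shorten for testing
    raise RuntimeError(f"scrape failed after {retries} attempts: {last_error}")

# Simulated flaky site: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return "page content"

print(scrape_with_retry(flaky, delay=0.01))  # page content
```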

4. Data Validation

Verify extracted data:

Scrape product prices from https://example.com/products
Verify that all prices are valid numbers
Alert me if any price looks suspicious (negative, zero, or extremely high)
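The validation step can be sketched directly: parse each price string and flag anything that fails (the threshold and price formats are illustrative assumptions):

```python
# Flag prices that are unparseable, non-positive, or implausibly
# high, returning (raw_value, reason) pairs for alerting.
def validate_prices(raw_prices, max_reasonable=100_000):
    suspicious = []
    for raw in raw_prices:
        try:
            value = float(raw.replace("$", "").replace(",", ""))
        except ValueError:
            suspicious.append((raw, "not a number"))
            continue
        if value <= 0:
            suspicious.append((raw, "zero or negative"))
        elif value > max_reasonable:
            suspicious.append((raw, "implausibly high"))
    return suspicious

print(validate_prices(["$19.99", "$0.00", "N/A", "$1,299.00", "$9999999"]))
```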

5. Incremental Scraping

For large sites, break it down:

Scrape https://example.com/products
Get only the first 20 products
Save to page-1.json
Then continue with the next 20
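The page-at-a-time approach is simple chunking. A Python sketch of slicing scraped items into pages of 20 and saving each page, following the filenames in the prompt (the item shape is illustrative):

```python
# Split a large item list into fixed-size pages and save each
# page to its own JSON file (page-1.json, page-2.json, ...).
import json

def save_in_pages(items, page_size=20):
    filenames = []
    for i in range(0, len(items), page_size):
        page = items[i:i + page_size]
        name = f"page-{i // page_size + 1}.json"
        with open(name, "w") as f:
            json.dump(page, f, indent=2)
        filenames.append(name)
    return filenames

files = save_in_pages([{"id": n} for n in range(45)])
print(files)  # ['page-1.json', 'page-2.json', 'page-3.json']
```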

πŸ“Š Understanding OpenClaw’s Browser Capabilities

OpenClaw’s browser tool is built on modern browser automation technologies and supports:

  • Full JavaScript execution: All JS on the page runs normally
  • CSS selectors: Use any CSS selector to target elements
  • XPath queries: For complex element selection
  • Screenshots: Visual verification of scraped content
  • Form interaction: Fill forms, click buttons, upload files
  • Multi-tab support: Work with multiple pages simultaneously
  • Cookie management: Maintain sessions across requests
  • Proxy support: Route traffic through proxies if needed



πŸ†˜ Need Help?

  • πŸ’¬ Ask OpenClaw directly: Just describe what you want to scrape in natural language
  • πŸ“– Official Documentation - Detailed browser tool reference
  • 🌟 Community Examples - Real-world usage examples
  • πŸ› GitHub Issues - Report bugs or request features

⏱️ Total Time: ~55 minutes πŸ“Š Difficulty: Beginner 🎯 Result: Successfully scraping websites with OpenClaw


πŸ’‘ Key Takeaways

  1. Natural Language Interface: You don’t need to write code - just describe what you want in plain English
  2. Built-in Browser Tool: OpenClaw includes powerful browser automation out of the box
  3. Skills are Simple: Skills are just Markdown files that describe capabilities
  4. Flexible Automation: Schedule, monitor, and automate any web scraping task
  5. Production Ready: Handle JavaScript, forms, authentication, and complex scenarios

Next: Try scraping your first website by asking OpenClaw in WebChat!

πŸŽ‰

Congratulations!

You've completed this tutorial. Ready for the next challenge?