# Your First Web Scraper
Learn to use OpenClaw's browser automation capabilities to extract data from websites.
## What You'll Learn

How to use OpenClaw's built-in browser tool to:
- Navigate to websites
- Extract data using CSS selectors
- Handle JavaScript-rendered content
- Save results to files
- Create reusable scraping workflows
**Real-world example:** Build a news aggregator that scrapes headlines from tech news sites.
## Prerequisites

- Completed the 15-Minute Quick Start
- OpenClaw Gateway running (`openclaw gateway`)
- Basic understanding of HTML/CSS selectors
- Node.js 22+ installed
## Understanding OpenClaw's Browser Tool

OpenClaw includes a powerful browser automation tool that you can control through natural language or direct commands. The browser tool can:
- Navigate to any URL
- Wait for page elements to load
- Extract text, links, images, and structured data
- Handle dynamic JavaScript content
- Take screenshots
- Fill forms and click buttons
How it works:

```
You → OpenClaw (via chat/channel) → Gateway → Browser Tool → Website
```
## Step 1: Your First Scraping Task (5 minutes)

### Start the Gateway
Make sure your OpenClaw Gateway is running:
```bash
# Start the gateway
openclaw gateway --port 18789 --verbose

# In another terminal, verify it's running
curl http://localhost:18789/health
```
### Open WebChat UI

Navigate to:

```
http://localhost:18789
```
This opens the WebChat interface where you can interact with your AI assistant.
### Your First Request

In the WebChat, type:

```
Can you go to https://techcrunch.com and extract the latest 5 article headlines and links?
```
What happens:
- OpenClaw understands your request
- Uses the browser tool to navigate to TechCrunch
- Waits for the page to load
- Extracts the article headlines and links
- Returns the data to you in the chat
Expected response: The assistant will return something like:
```
Here are the latest 5 articles from TechCrunch:

1. "AI Breakthrough in Natural Language Processing"
   Link: https://techcrunch.com/2025/03/15/ai-breakthrough-nlp/
2. "New JavaScript Framework Promises 10x Performance"
   Link: https://techcrunch.com/2025/03/15/javascript-framework/
... (and so on)
```
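Conceptually, "extract the headlines and links" boils down to matching elements and reading their text and `href` attributes. Here is a minimal sketch of that extraction step in plain Python (standard library only), run against a static HTML snippet — the markup and class names are made up for illustration, not TechCrunch's real structure:

```python
from html.parser import HTMLParser

# Static snippet standing in for a rendered news page (hypothetical markup).
HTML = """
<div class="river">
  <a class="headline" href="https://example.com/story-1">AI Breakthrough in NLP</a>
  <a class="headline" href="https://example.com/story-2">New JS Framework Ships</a>
</div>
"""

class HeadlineParser(HTMLParser):
    """Collects (title, link) pairs from <a class="headline"> elements."""
    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # set while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "headline" in attrs.get("class", "").split():
            self._href = attrs.get("href")

    def handle_data(self, data):
        if self._href and data.strip():
            self.results.append({"title": data.strip(), "link": self._href})
            self._href = None

parser = HeadlineParser()
parser.feed(HTML)
for item in parser.results[:5]:
    print(item["title"], "->", item["link"])
```

The browser tool does the equivalent against the live, JavaScript-rendered DOM, so you never write this parser yourself.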
## Step 2: Create a Reusable Scraping Workflow (10 minutes)

### Understanding OpenClaw Skills

In OpenClaw, skills are Markdown files that define reusable capabilities. Let's create a web scraping skill.

### Create Your First Skill
```bash
# Navigate to workspace skills directory
cd ~/.openclaw/workspace/skills

# Create a new skill directory
mkdir news-scraper
cd news-scraper
```
### Create the Skill Definition

Create `SKILL.md`:

```markdown
# News Scraper

## Description

Extracts the latest news articles from technology news websites.

## Capabilities

- Navigate to news websites
- Extract article headlines, links, and summaries
- Handle JavaScript-rendered content
- Support for multiple news sites

## Usage

Simply ask: "Scrape the latest news from [website]" or "Get headlines from [site]"

## Supported Sites

- TechCrunch (https://techcrunch.com)
- The Verge (https://theverge.com)
- Ars Technica (https://arstechnica.com)
- Hacker News (https://news.ycombinator.com)

## Output Format

Returns structured data with:

- title: Article title
- link: Article URL
- date: Publication date
- summary: Brief description
```
### Test Your New Skill

In WebChat, try:

```
Use the news-scraper skill to get the latest headlines from The Verge.
```
OpenClaw will now use your defined skill to perform the scraping task.
## Step 3: Advanced Scraping Techniques (15 minutes)

### Handling JavaScript-Rendered Content

Many modern websites use JavaScript to load content. OpenClaw's browser tool handles this automatically.

Try this request:

```
Go to https://news.ycombinator.com and extract the top 10 stories with their points and comment counts.
```
The browser tool will:
- Wait for the page to fully load
- Execute any JavaScript
- Extract the data once it's visible
### Custom CSS Selectors

You can specify exactly what to extract:

```
Scrape https://example.com/products and extract:
- Product names (class: .product-name)
- Prices (class: .product-price)
- Ratings (class: .rating)
- Stock availability (class: .stock-status)
```
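Class-based prompts like this map directly onto class-attribute matching in the page. A stdlib-Python sketch of pulling text by class name from a hypothetical product page (the markup and class names are invented for the example):

```python
from html.parser import HTMLParser

HTML = """
<div class="product">
  <span class="product-name">Widget A</span>
  <span class="product-price">$19.99</span>
</div>
<div class="product">
  <span class="product-name">Widget B</span>
  <span class="product-price">$24.50</span>
</div>
"""

class ClassTextParser(HTMLParser):
    """Collects the text content of elements whose class list contains `cls`."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.texts = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.cls in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.texts.append(data.strip())
            self._capturing = False

def extract_by_class(html, cls):
    p = ClassTextParser(cls)
    p.feed(html)
    return p.texts

names = extract_by_class(HTML, "product-name")
prices = extract_by_class(HTML, "product-price")
print(list(zip(names, prices)))  # → [('Widget A', '$19.99'), ('Widget B', '$24.50')]
```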
### Pagination

For multi-page content:

```
Scrape the first 3 pages of search results from https://example.com/search?q=automation
Get all results and compile them into a single list.
```
### Waiting for Specific Elements

If a site has slow-loading content:

```
Go to https://slow-site.com/data
Wait for the .results-container to appear
Then extract all .data-item elements
```
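Behind a "wait for X to appear" instruction is typically a poll-until-timeout loop. An illustrative Python version (not OpenClaw's actual implementation; the simulated "page" stands in for a live DOM check):

```python
import time

def wait_for(check, timeout=10.0, interval=0.5):
    """Poll `check()` until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = check()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("element did not appear in time")

# Simulated page: the element "renders" on the third poll.
state = {"polls": 0}
def results_container_present():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_for(results_container_present, timeout=5, interval=0.01))  # → True
```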
## Step 4: Saving Data to Files (5 minutes)

### Save to JSON

```
Scrape product data from https://example.com/products and save it to a file called products.json in my downloads folder.
```
### Save to CSV

```
Extract the article data and save it as a CSV file with columns: title, link, date, author.
```
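If you'd rather handle the CSV step yourself, Python's standard `csv` module covers the column layout. A sketch with made-up article data, writing to an in-memory buffer (swap in `open("articles.csv", "w", newline="")` to produce a real file):

```python
import csv
import io

articles = [
    {"title": "AI Breakthrough", "link": "https://example.com/a",
     "date": "2025-03-15", "author": "J. Doe"},
    {"title": "New Framework", "link": "https://example.com/b",
     "date": "2025-03-14", "author": "A. Smith"},
]

buf = io.StringIO()  # stand-in for a real file handle
writer = csv.DictWriter(buf, fieldnames=["title", "link", "date", "author"])
writer.writeheader()
writer.writerows(articles)
print(buf.getvalue())
```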
### Append to Existing Files

```
Scrape today's news and append it to my news-archive.json file.
```
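Appending to a JSON file really means read-modify-write, since JSON has no append mode. A hedged sketch of that pattern (the helper is hypothetical, not an OpenClaw API; the demo writes into a temp directory rather than your real archive):

```python
import json
import tempfile
from pathlib import Path

def append_articles(path, new_items):
    """Load the existing JSON list (or start fresh), extend it, write it back."""
    existing = json.loads(path.read_text()) if path.exists() else []
    existing.extend(new_items)
    path.write_text(json.dumps(existing, indent=2))
    return existing

# Demo in a temp directory; in practice the path would be news-archive.json.
archive = Path(tempfile.mkdtemp()) / "news-archive.json"
append_articles(archive, [{"title": "Day 1 story"}])
merged = append_articles(archive, [{"title": "Day 2 story"}])
print([a["title"] for a in merged])  # → ['Day 1 story', 'Day 2 story']
```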
## Step 5: Creating Automation Workflows (10 minutes)

### Schedule Regular Scraping

You can set up recurring scraping tasks:

```
Every morning at 9 AM, scrape the front page of TechCrunch and send me a summary of the top 10 articles.
```
OpenClaw will use its built-in cron/scheduling capabilities to run this automatically.
### Conditional Scraping

```
Monitor https://example.com/products
Alert me if any product price drops below $100
```
### Multi-Site Aggregation

```
Scrape the latest headlines from TechCrunch, The Verge, and Ars Technica
Combine them into a single feed
Remove duplicates
Sort by publication date
Save the combined feed to tech-news-feed.json
```
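The combine/dedupe/sort steps are straightforward once each site returns a list of records. An illustrative Python pass over made-up feed data, deduplicating on normalized title (ISO-format dates sort correctly as plain strings):

```python
import json

feeds = [
    {"title": "AI Breakthrough", "link": "https://techcrunch.example/a", "date": "2025-03-15"},
    {"title": "AI Breakthrough", "link": "https://theverge.example/a", "date": "2025-03-15"},
    {"title": "Chip Shortage Eases", "link": "https://arstechnica.example/b", "date": "2025-03-14"},
]

seen = set()
combined = []
for item in feeds:
    key = item["title"].casefold()  # dedupe on case-insensitive title
    if key not in seen:
        seen.add(key)
        combined.append(item)

combined.sort(key=lambda a: a["date"], reverse=True)  # newest first
print(json.dumps(combined, indent=2))
# To persist: Path("tech-news-feed.json").write_text(json.dumps(combined, indent=2))
```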
## Step 6: Troubleshooting Common Issues (10 minutes)

### Issue: "Page not loading"

Solution: Some sites block automated browsers. Try:

```
Use stealth mode and rotate user agents when scraping https://protected-site.com
```
### Issue: "Data not found"

Solution: The page structure may have changed. Ask OpenClaw to:

```
Inspect the page structure at https://example.com and tell me what CSS selectors I should use.
```
### Issue: "JavaScript content not loading"

Solution: Increase the wait time:

```
Go to https://dynamic-site.com
Wait up to 10 seconds for the .main-content to load
Then extract the data
```
### Issue: "Rate limiting"

Solution: Slow down your requests:

```
Scrape all products from https://example.com
Add a 2-second delay between each page so we don't get rate limited
```
## Real-World Examples

### Example 1: Price Monitoring

```
Monitor the price of "MacBook Pro 16" on Amazon
Check every hour
Alert me if the price drops below $2000
Save the price history to a file
```
### Example 2: Content Aggregation

```
Create a daily tech news digest:
1. Scrape top 20 articles from TechCrunch
2. Scrape top 20 articles from The Verge
3. Scrape top 20 articles from Ars Technica
4. Remove duplicates based on article titles
5. Sort by recency
6. Generate a summary with key themes
7. Save to tech-digest-[date].json
```
### Example 3: Social Media Monitoring

```
Monitor Hacker News for mentions of "OpenClaw"
Extract the title, link, points, and comment count
Alert me if any post gets more than 100 points
Save notable posts to openclaw-mentions.json
```
## Advanced Techniques

### Working with Forms

```
Go to https://example.com/search
Fill in the search box with "AI automation"
Click the search button
Extract the top 10 results
```
### Handling Authentication

For sites requiring login:

```
Go to https://example.com/login
Fill in username: [email protected]
Fill in password: mypassword
Click the login button
Wait for the dashboard to load
Then scrape the user profile data
```
**Security note:** Never share sensitive credentials. Use environment variables or secure credential management.
### Screenshot Automation

```
Go to https://example.com
Take a screenshot of the hero section
Save it to homepage-screenshot.png
```
## Best Practices

### 1. Respect robots.txt

Always check whether a site allows scraping:

```
Check the robots.txt file for https://example.com and tell me what scraping is allowed.
```
### 2. Rate Limiting

Be polite to servers:

```
When scraping, add a 1-2 second delay between requests to avoid overloading the server.
```
### 3. Error Handling

Make your scraping robust:

```
Try to scrape https://example.com/data
If it fails, wait 30 seconds and retry up to 3 times
If still failing, alert me with the error details
```
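The retry behaviour described above is a classic retry-with-delay loop. A minimal Python sketch (illustrative, not OpenClaw's implementation; the 30-second delay is set to zero here so the demo runs instantly):

```python
import time

def scrape_with_retry(scrape, retries=3, delay=30):
    """Call `scrape()`; on failure wait `delay` seconds and retry up to `retries` times."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return scrape()
        except Exception as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"scrape failed after {retries} attempts") from last_error

# Simulated flaky source: fails twice, then succeeds.
calls = {"n": 0}
def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return ["headline"]

print(scrape_with_retry(flaky_scrape, retries=3, delay=0))  # → ['headline']
```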
### 4. Data Validation

Verify the extracted data:

```
Scrape product prices from https://example.com/products
Verify that all prices are valid numbers
Alert me if any price looks suspicious (negative, zero, or extremely high)
```
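A sanity check like this can be expressed as a small validator. Illustrative Python with made-up bounds (what counts as "extremely high" depends on your products):

```python
def suspicious(price, low=0.01, high=100_000):
    """Flag prices that are not positive finite numbers within a sane range."""
    try:
        value = float(price)
    except (TypeError, ValueError):
        return True  # not a number at all
    return not (low <= value <= high)

prices = ["19.99", "0", "-5", "1e9", "N/A", "249.00"]
flagged = [p for p in prices if suspicious(p)]
print(flagged)  # → ['0', '-5', '1e9', 'N/A']
```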
### 5. Incremental Scraping

For large sites, break the job down:

```
Scrape https://example.com/products
Get only the first 20 products
Save to page-1.json
Then continue with the next 20
```
## Understanding OpenClaw's Browser Capabilities

OpenClaw's browser tool is built on modern browser automation technologies and supports:
- Full JavaScript execution: All JS on the page runs normally
- CSS selectors: Use any CSS selector to target elements
- XPath queries: For complex element selection
- Screenshots: Visual verification of scraped content
- Form interaction: Fill forms, click buttons, upload files
- Multi-tab support: Work with multiple pages simultaneously
- Cookie management: Maintain sessions across requests
- Proxy support: Route traffic through proxies if needed
## What's Next?

- File Processing Automation - Process scraped data
- Custom Skill Development - Build your own skills
- Advanced Browser Automation - Complex browser tasks
- Skills Library - Browse community-built scraping skills
## Need Help?

- Ask OpenClaw directly: Just describe what you want to scrape in natural language
- Official Documentation - Detailed browser tool reference
- Community Examples - Real-world usage examples
- GitHub Issues - Report bugs or request features
**Total time:** 30 minutes · **Difficulty:** Beginner · **Result:** Successfully scraping websites with OpenClaw
## Key Takeaways

- Natural Language Interface: You don't need to write code - just describe what you want in plain English
- Built-in Browser Tool: OpenClaw includes powerful browser automation out of the box
- Skills are Simple: Skills are just Markdown files that describe capabilities
- Flexible Automation: Schedule, monitor, and automate any web scraping task
- Production Ready: Handle JavaScript, forms, authentication, and complex scenarios
**Next:** Try scraping your first website by asking OpenClaw in WebChat!

## Congratulations!
You've completed this tutorial. Ready for the next challenge?