Data Processing and Cleaning
Transform raw data into insights using OpenClaw's natural language data processing capabilities.
What You'll Learn
How to use OpenClaw for data processing workflows:
- Read and parse structured data (CSV, JSON, XML)
- Clean and validate data
- Transform and aggregate datasets
- Generate reports and visualizations
- Build automated data pipelines
Real-world example: Build an automated sales data analysis pipeline.
Prerequisites
- Completed File Processing Automation Basics
- Completed Chaining Multiple Operations
- OpenClaw Gateway running
- Sample datasets for testing
Understanding OpenClaw's Data Tools
OpenClaw provides powerful data processing through natural language:
- Structured data parsing: Read CSV, JSON, XML, Excel files
- Data transformation: Filter, sort, aggregate, and reshape data
- Data cleaning: Handle missing values, duplicates, and errors
- Statistical analysis: Calculate summaries, trends, and patterns
- Visualization: Generate charts and reports
- Export capabilities: Save to multiple formats
Step 1: Your First Data Processing Task (5 minutes)
Start the Gateway
openclaw gateway --port 18789 --verbose
Open WebChat UI
Navigate to:
http://localhost:18789
Basic Data Analysis
Analyze a CSV file:
Read the file sales-data.csv from my Desktop
Calculate the total sales amount
Show me the average order value
Find the highest and lowest orders
OpenClaw will:
- Parse the CSV file
- Perform the calculations
- Present the results in a clear format
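Under the hood, these calculations amount to a few standard aggregations. A minimal pandas sketch of the same analysis (the `amount` column and the sample rows are invented for illustration; the tutorial's real `sales-data.csv` schema may differ):

```python
import pandas as pd

# Hypothetical sales data; in practice this would be pd.read_csv("sales-data.csv").
df = pd.DataFrame({"order_id": [1, 2, 3, 4],
                   "amount": [120.0, 80.0, 250.0, 50.0]})

total = df["amount"].sum()      # total sales amount
average = df["amount"].mean()   # average order value
highest = df["amount"].max()    # highest order
lowest = df["amount"].min()     # lowest order
```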
Simple Data Filtering
Open orders.csv
Filter for orders where status is "completed" and amount is greater than 100
Create a new CSV file called large-completed-orders.csv
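The same filter, expressed directly in pandas as a boolean mask (column names `status` and `amount` are assumed from the prompt above; the sample rows are made up):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["completed", "pending", "completed"],
    "amount": [150, 300, 90],
})

# Keep rows that are completed AND above the amount threshold.
mask = (orders["status"] == "completed") & (orders["amount"] > 100)
large_completed = orders[mask]
# large_completed.to_csv("large-completed-orders.csv", index=False)
```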
Step 2: Data Cleaning (10 minutes)
Handling Missing Values
Read customer-data.csv
Check for missing values in all columns
For missing email addresses:
- Mark them as "missing" in the status column
For missing phone numbers:
- Fill with "N/A"
Save the cleaned data to customer-data-cleaned.csv
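In pandas terms, this cleaning pass is a missing-value check plus a fill. A sketch with an invented two-row dataset (the `email`/`phone`/`status` column names follow the prompt; real files will vary):

```python
import pandas as pd

customers = pd.DataFrame({
    "name": ["Ada", "Ben"],
    "email": ["ada@example.com", None],
    "phone": [None, "+1-555-000-1234"],
})

# Flag rows whose email is missing, then fill missing phone numbers with "N/A".
customers["status"] = customers["email"].isna().map({True: "missing", False: "ok"})
customers["phone"] = customers["phone"].fillna("N/A")
# customers.to_csv("customer-data-cleaned.csv", index=False)
```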
Removing Duplicates
Open transactions.csv
Identify duplicate rows based on transaction_id
Remove all duplicates keeping the first occurrence
Save to transactions-deduplicated.csv
Report how many duplicates were removed
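Deduplication on a key column, keeping the first occurrence, is a one-liner in pandas. A sketch with made-up transactions:

```python
import pandas as pd

tx = pd.DataFrame({"transaction_id": [1, 1, 2, 3, 3, 3],
                   "amount": [10, 10, 20, 30, 30, 30]})

before = len(tx)
# Drop duplicates by key, keeping the first occurrence of each transaction_id.
deduped = tx.drop_duplicates(subset="transaction_id", keep="first")
removed = before - len(deduped)
print(f"{removed} duplicates removed")
```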
Standardizing Data Formats
Read raw-data.csv
Standardize all date columns to ISO 8601 format (YYYY-MM-DD)
Convert all email addresses to lowercase
Trim whitespace from all text fields
Ensure phone numbers are in format: +1-XXX-XXX-XXXX
Save to standardized-data.csv
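The date and text normalizations above map to vectorized string/datetime operations. A sketch assuming the raw dates arrive as US-style `MM/DD/YYYY` (an assumption; real inputs may mix formats and need detection first):

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": ["03/15/2024", "12/01/2023"],
    "email": ["  Ada@Example.COM ", "ben@example.com"],
})

# Dates to ISO 8601 (YYYY-MM-DD); emails trimmed and lower-cased.
df["signup_date"] = pd.to_datetime(df["signup_date"],
                                   format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
df["email"] = df["email"].str.strip().str.lower()
```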
Data Validation
Validate user-data.csv:
- Email addresses must be valid format
- Ages must be between 18 and 120
- Postal codes must match their country's format
- Status must be one of: active, inactive, pending
Create a validation report listing all issues
Save invalid records to review-needed.csv
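Rules like these compile down to simple per-record checks. A stdlib sketch (the regex is deliberately loose and the field names are taken from the prompt; real email and postal-code validation is considerably stricter):

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # illustrative, not RFC-complete
VALID_STATUSES = {"active", "inactive", "pending"}

def validate(record):
    """Return a list of human-readable issues for one record."""
    issues = []
    if not EMAIL_RE.match(record.get("email", "")):
        issues.append("invalid email")
    if not 18 <= record.get("age", -1) <= 120:
        issues.append("age out of range")
    if record.get("status") not in VALID_STATUSES:
        issues.append("unknown status")
    return issues

good = validate({"email": "ada@example.com", "age": 30, "status": "active"})
bad = validate({"email": "nope", "age": 150, "status": "gone"})
```

Records where `validate` returns a non-empty list would go to `review-needed.csv`.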
Step 3: Data Transformation (12 minutes)
Pivoting and Reshaping
Read sales-by-month.csv
Pivot the data to show months as columns and products as rows
Save to sales-pivoted.csv
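This reshape corresponds to a pivot table with products as the index and months as columns. A sketch with invented long-format rows:

```python
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 200, 150, 250],
})

# Long format -> wide format: one row per product, one column per month.
pivoted = sales.pivot_table(index="product", columns="month",
                            values="revenue", aggfunc="sum")
```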
Merging Datasets
Read orders.csv and customers.csv
Join them on customer_id
Combine all matching records
Include all columns from both files
Save to orders-with-customers.csv
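"Combine all matching records" is an inner join on the shared key. A sketch (swap `how="inner"` for `how="left"` to keep orders without a matching customer):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [10, 11],
                       "customer_id": [1, 2],
                       "amount": [50, 75]})
customers = pd.DataFrame({"customer_id": [1, 2],
                          "name": ["Ada", "Ben"]})

# Inner join on customer_id keeps only matching records from both sides.
merged = orders.merge(customers, on="customer_id", how="inner")
```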
Aggregation Operations
Open sales-data.csv
Group by product_category
For each category calculate:
- Total revenue
- Number of orders
- Average order value
- Best-selling product
Save to category-summary.csv
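The per-category metrics above are a group-by with named aggregations. A sketch covering the numeric ones (best-selling product would need a second group-by over a product column, omitted here):

```python
import pandas as pd

sales = pd.DataFrame({
    "product_category": ["toys", "toys", "books"],
    "amount": [30.0, 70.0, 20.0],
})

# One aggregation pass per category: total, count, and mean.
summary = sales.groupby("product_category")["amount"].agg(
    total_revenue="sum", num_orders="count", avg_order_value="mean"
)
```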
Advanced Transformations
Read website-traffic.csv
Create new columns:
- Conversion rate = conversions / visits
- Revenue per visit = revenue / visits
- Bounce rate category (high/medium/low)
Apply these transformations to each row
Save to enriched-metrics.csv
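Derived columns like these are element-wise arithmetic plus a bucketing step. A sketch where the bounce-rate thresholds (0.3 and 0.6) are invented for illustration:

```python
import pandas as pd

traffic = pd.DataFrame({"visits": [1000, 500],
                        "conversions": [50, 10],
                        "revenue": [2500.0, 400.0],
                        "bounce_rate": [0.2, 0.7]})

traffic["conversion_rate"] = traffic["conversions"] / traffic["visits"]
traffic["revenue_per_visit"] = traffic["revenue"] / traffic["visits"]
# Bucket bounce rate into low/medium/high (thresholds are illustrative).
traffic["bounce_category"] = pd.cut(traffic["bounce_rate"],
                                    bins=[0, 0.3, 0.6, 1.0],
                                    labels=["low", "medium", "high"])
```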
Step 4: Data Analysis and Insights (10 minutes)
Statistical Analysis
Analyze sales-2025.csv:
- Calculate mean, median, mode for order amounts
- Find standard deviation
- Identify outliers (values > 2 std deviations from mean)
- Generate correlation matrix for all numeric columns
Create a statistical summary report
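The outlier rule above (more than 2 standard deviations from the mean) can be sketched with the standard library alone; the order amounts below are invented, with one obvious outlier:

```python
import statistics

amounts = [100, 105, 98, 102, 101, 99, 103, 97, 300]  # 300 is the outlier

mean = statistics.mean(amounts)
median = statistics.median(amounts)
stdev = statistics.stdev(amounts)  # sample standard deviation

# Values more than 2 standard deviations from the mean.
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
```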
Trend Analysis
Read monthly-revenue.csv
Calculate month-over-month growth
Identify trends (increasing/decreasing/stable)
Find seasonal patterns
Predict next month's revenue based on trends
Generate a trend analysis report
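Month-over-month growth is a percent-change over consecutive rows; a very reduced sketch (the revenue figures are made up, and real trend classification would use more than a simple all-positive check):

```python
import pandas as pd

revenue = pd.Series([100.0, 110.0, 121.0], index=["Jan", "Feb", "Mar"])

# Growth as a fraction: 0.10 means +10% versus the previous month.
mom_growth = revenue.pct_change()
trend = "increasing" if (mom_growth.dropna() > 0).all() else "mixed"
```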
Comparative Analysis
Compare sales between Q1 and Q2:
Read Q1-sales.csv and Q2-sales.csv
Calculate total revenue for each quarter
Find products with >20% growth
Find products with declining sales
Generate comparison report with insights
Customer Segmentation
Read customer-purchases.csv
Segment customers into tiers:
- VIP: total purchases > $5000
- Regular: total purchases > $1000
- Casual: total purchases <= $1000
For each segment calculate:
- Number of customers
- Average purchase amount
- Most popular products
Save to customer-segments.csv
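The tier thresholds translate to a simple classification function followed by a group-by. A sketch with invented customers:

```python
import pandas as pd

purchases = pd.DataFrame({"customer": ["Ada", "Ben", "Cy"],
                          "total": [6000.0, 1500.0, 400.0]})

def tier(total):
    """Assign a tier using the thresholds from the prompt above."""
    if total > 5000:
        return "VIP"
    if total > 1000:
        return "Regular"
    return "Casual"

purchases["tier"] = purchases["total"].apply(tier)
segments = purchases.groupby("tier")["total"].agg(customers="count",
                                                  avg_purchase="mean")
```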
Step 5: Building Data Pipelines (15 minutes)
Multi-Step Processing Pipeline
Build a data processing pipeline:
Step 1: Read raw-sales-data.csv
Step 2: Remove duplicate transactions
Step 3: Standardize date formats to ISO 8601
Step 4: Validate all email addresses
Step 5: Filter out invalid records
Step 6: Add calculated columns (tax, total)
Step 7: Aggregate by customer
Step 8: Generate summary statistics
Step 9: Save processed data to processed-sales.csv
Step 10: Create executive summary report
Automated Data Quality Pipeline
Set up automated data quality monitoring:
For each new CSV file in ~/inbox/:
1. Validate file structure
2. Check for required columns
3. Validate data types
4. Check for missing values
5. Identify duplicates
6. Run business rule validations
7. Generate quality score (0-100)
8. If score < 80:
- Move to review/ folder
- Alert me with issues
9. If score >= 80:
- Move to approved/ folder
- Log quality metrics
E-Commerce Analytics Pipeline
Build an e-commerce analytics pipeline:
Daily:
1. Import orders from ~/raw-orders/
2. Import customer data from CRM
3. Merge and deduplicate
4. Calculate customer lifetime value
5. Identify high-value customers
6. Analyze product performance
7. Generate daily KPI dashboard
8. Save reports to ~/analytics/daily-[date]/
9. Alert on significant changes
Step 6: Data Export and Reporting (8 minutes)
Multiple Format Export
Read analytics-data.json
Export to multiple formats:
- Excel workbook with multiple sheets
- PDF report with charts
- Interactive HTML dashboard
- CSV for data processing
- JSON for API integration
Save all to ~/reports/[date]/
Automated Report Generation
Generate weekly sales report:
1. Read all sales data from past week
2. Aggregate by product, region, salesperson
3. Calculate key metrics:
- Total revenue
- Top performers
- Growth vs last week
- Regional breakdown
4. Create visualizations:
- Bar chart for sales by product
- Line chart for weekly trend
- Pie chart for regional distribution
5. Compile into formatted markdown report
6. Save to ~/reports/weekly-[week].md
7. Email PDF version to management
Custom Data Summaries
Create custom executive summary:
Read financial-data.csv
Generate 1-page summary including:
- Key metrics at a glance
- Top 5 insights
- 3 notable trends
- Areas needing attention
- Recommendations
Format as professional document
Save to ~/executive-summary-[date].pdf
Step 7: Error Handling and Data Quality (10 minutes)
Robust Error Handling
Process data with error recovery:
Try to read large-dataset.csv
If file is too large:
- Split into chunks of 10000 rows
- Process each chunk separately
- Combine results at end
If parsing fails:
- Identify problematic rows
- Save errors to error-log.txt
- Continue with valid rows
- Alert me with summary
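The chunked-processing strategy above maps to pandas' `chunksize` streaming reader, which yields one DataFrame per chunk instead of loading the whole file. A sketch using an in-memory CSV stand-in for the large file:

```python
import io
import pandas as pd

# Stand-in for a large file on disk; with a real file, pass the path instead.
csv_text = "id,amount\n1,10\n2,20\n3,30\n4,40\n"

# Stream in fixed-size chunks, aggregate each, combine results at the end.
totals = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    totals.append(chunk["amount"].sum())

grand_total = sum(totals)
```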
Data Quality Checks
Implement comprehensive data quality checks:
For each dataset:
1. Completeness check:
- All required fields present
- No missing critical values
2. Accuracy check:
- Values in valid ranges
- Correct data types
3. Consistency check:
- Related fields match
- No contradictions
4. Timeliness check:
- Data is current
- No stale records
5. Uniqueness check:
- No duplicate keys
- Primary key integrity
Generate quality scorecard
Anomaly Detection
Detect anomalies in data:
Read metrics.csv
Identify unusual patterns:
- Statistical outliers (>3 std dev)
- Sudden spikes or drops
- Values outside expected ranges
- Missing time periods
For each anomaly:
- Log timestamp and value
- Calculate deviation from normal
- Assign severity score
- Flag for review if critical
Create anomaly report
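The statistical-outlier rule (more than 3 standard deviations) with a logged deviation per anomaly can be sketched with the standard library; the metric values below are invented, with one spike:

```python
import statistics

values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10,
          11, 9, 10, 12, 10, 11, 9, 10, 11, 50]  # 50 is the spike

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Each anomaly: (index, value, z-score) -- deviation from normal in stdev units.
anomalies = [(i, v, (v - mean) / stdev)
             for i, v in enumerate(values)
             if abs(v - mean) > 3 * stdev]
```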
Advanced Data Processing Examples
Example 1: Marketing Campaign Analysis
Analyze marketing campaign performance:
1. Import campaign data from multiple sources:
- Email platform (opens, clicks)
- Social media (engagement)
- Website analytics
2. Merge all data by campaign_id
3. Calculate ROI for each campaign
4. A/B test analysis:
- Compare conversion rates
- Statistical significance
- Winner identification
5. Channel performance comparison
6. Generate recommendations
7. Create visual dashboard
8. Save to ~/campaign-analysis/[campaign]/
Example 2: Inventory Optimization
Build inventory analysis system:
1. Read sales history (past 12 months)
2. Read current inventory levels
3. For each product:
- Calculate average monthly sales
- Identify seasonal trends
- Calculate optimal stock level
- Flag overstock items
- Flag understock items
4. Generate purchase recommendations
5. Project stockout dates
6. Calculate safety stock levels
7. Create reorder alerts
8. Generate inventory optimization report
Example 3: Financial Data Processing
Financial data reconciliation pipeline:
Daily:
1. Import transactions from:
- Bank statements
- Credit card feeds
- Payment processors
2. Categorize transactions
3. Match to accounting system
4. Identify unreconciled items
5. Flag suspicious transactions
6. Calculate cash flow metrics
7. Update financial dashboards
8. Generate daily reconciliation report
9. Alert on discrepancies
Best Practices
1. Always Backup Original Data
Before processing data:
1. Create backup of original file
2. Work on copy
3. Preserve raw data
4. Document all transformations
2. Validate Before Processing
Before processing large dataset:
- Check file size and format
- Preview first few rows
- Verify schema
- Test with small sample
- Only then process full dataset
3. Document Transformations
Keep track of all changes:
- What transformations were applied
- Why they were necessary
- What assumptions were made
- What data quality issues were found
4. Use Checkpoints
For complex pipelines:
- Save intermediate results
- Validate each step
- Create checkpoints
- Enable resume from failures
5. Monitor Data Quality
Implement ongoing monitoring:
- Set up quality alerts
- Track data quality trends
- Review error logs regularly
- Continuously improve validation
Troubleshooting Common Issues
Issue: "File too large to process"
Solution: Process in chunks
Read large-file.csv
Split into chunks of 1000 rows
Process each chunk separately
Save each chunk to output/
Combine all chunks at the end
Issue: "Memory errors"
Solution: Optimize memory usage
Process file row by row
Don't load entire file into memory
Save results incrementally
Clear processed data from memory
Issue: "Inconsistent data formats"
Solution: Standardize early
First, detect all data formats in each column
Then, standardize to single format
Document all format conversions
Validate standardization worked
Issue: "Processing too slow"
Solution: Optimize operations
Filter data early (reduce dataset size)
Use efficient operations
Avoid unnecessary loops
Process in parallel where possible
Cache intermediate results
What's Next?
- Performance Optimization - Optimize data processing performance
- Custom Skill Development - Create reusable data processing skills
- Production Deployment - Deploy data pipelines to production
Need Help?
- Ask OpenClaw: Describe your data processing needs in natural language
- Data Processing Guide - Detailed data processing documentation
- Community Examples - Real-world data processing workflows
- GitHub Issues - Report problems
Total Time: 60 minutes | Difficulty: Intermediate | Result: Building automated data processing pipelines with OpenClaw
Key Takeaways
- Natural Language Data Processing: Describe data operations in plain English
- Comprehensive Format Support: Work with CSV, JSON, XML, Excel, and more
- Automated Pipelines: Build multi-step data workflows easily
- Quality Assurance: Built-in validation and error handling
- Scalable: Process from small files to large datasets
- Production Ready: Suitable for real-world data processing tasks
Next: Try building your own data processing pipeline by describing your workflow step by step!
Congratulations!
You've completed this tutorial. Ready for the next challenge?