πŸ“š Step-by-Step Tutorial Β· Intermediate Level Β· ⏱️ 50 minutes

Data Processing and Cleaning

Transform raw data into insights using natural language: clean, validate, and analyze datasets.

βœ“ Updated: March 2025


Transform raw data into insights using OpenClaw’s natural language data processing capabilities.

🎯 What You’ll Learn

How to use OpenClaw for data processing workflows:

  • Read and parse structured data (CSV, JSON, XML)
  • Clean and validate data
  • Transform and aggregate datasets
  • Generate reports and visualizations
  • Build automated data pipelines

Real-world example: Build an automated sales data analysis pipeline.


πŸ“‹ Prerequisites

  • OpenClaw installed, with the Gateway able to run locally (see Step 1)
  • A few sample CSV files (sales, orders, customers) to practice on
πŸ› οΈ Understanding OpenClaw’s Data Tools

OpenClaw provides powerful data processing through natural language:

  • Structured data parsing: Read CSV, JSON, XML, Excel files
  • Data transformation: Filter, sort, aggregate, and reshape data
  • Data cleaning: Handle missing values, duplicates, and errors
  • Statistical analysis: Calculate summaries, trends, and patterns
  • Visualization: Generate charts and reports
  • Export capabilities: Save to multiple formats

πŸ“ Step 1: Your First Data Processing Task (5 minutes)

Start the Gateway

openclaw gateway --port 18789 --verbose

Open WebChat UI

Navigate to:

http://localhost:18789

Basic Data Analysis

Analyze a CSV file:

Read the file sales-data.csv from my Desktop
Calculate the total sales amount
Show me the average order value
Find the highest and lowest orders

OpenClaw will:

  1. Parse the CSV file
  2. Perform the calculations
  3. Present the results in a clear format

Simple Data Filtering

Open orders.csv
Filter for orders where status is "completed" and amount is greater than 100
Create a new CSV file called large-completed-orders.csv
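A filter like this combines a status match with a numeric threshold. A minimal Python sketch of the same operation (sample rows and the 100 cutoff mirror the prompt; column names are illustrative):

```python
import csv
import io

# Inline stand-in for orders.csv.
RAW = """order_id,status,amount
1,completed,250
2,pending,400
3,completed,90
4,completed,130
"""

def large_completed_orders(csv_text, min_amount=100):
    rows = csv.DictReader(io.StringIO(csv_text))
    # Keep rows that satisfy both conditions from the prompt.
    return [r for r in rows
            if r["status"] == "completed" and float(r["amount"]) > min_amount]

matches = large_completed_orders(RAW)
print([r["order_id"] for r in matches])
```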

🧹 Step 2: Data Cleaning (10 minutes)

Handling Missing Values

Read customer-data.csv
Check for missing values in all columns
For missing email addresses:
  - Mark them as "missing" in the status column
For missing phone numbers:
  - Fill with "N/A"
Save the cleaned data to customer-data-cleaned.csv
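The two fill rules above translate directly into per-row logic. A sketch in Python, assuming rows are dicts with `email`, `phone`, and `status` fields (names taken from the prompt):

```python
def fill_missing(rows):
    cleaned = []
    for row in rows:
        row = dict(row)                   # work on a copy, preserve the original
        if not row.get("email"):
            row["status"] = "missing"     # flag rows with no email address
        if not row.get("phone"):
            row["phone"] = "N/A"          # fill absent phone numbers
        cleaned.append(row)
    return cleaned

records = [
    {"name": "Ada", "email": "ada@example.com", "phone": "", "status": "ok"},
    {"name": "Bo", "email": "", "phone": "555-0101", "status": "ok"},
]
cleaned = fill_missing(records)
print(cleaned)
```

Working on copies mirrors the prompt's pattern of saving to a new file rather than overwriting the source.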

Removing Duplicates

Open transactions.csv
Identify duplicate rows based on transaction_id
Remove all duplicates keeping the first occurrence
Save to transactions-deduplicated.csv
Report how many duplicates were removed
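"Keep the first occurrence" deduplication is a single pass with a seen-set. A minimal sketch (key name from the prompt; sample rows are illustrative):

```python
def dedupe(rows, key="transaction_id"):
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:          # keep the first occurrence only
            seen.add(row[key])
            kept.append(row)
    return kept, len(rows) - len(kept)    # also report the removal count

rows = [
    {"transaction_id": "t1", "amount": 50},
    {"transaction_id": "t2", "amount": 75},
    {"transaction_id": "t1", "amount": 50},
]
kept, removed = dedupe(rows)
print(f"Removed {removed} duplicate(s), {len(kept)} rows remain")
```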

Standardizing Data Formats

Read raw-data.csv
Standardize all date columns to ISO 8601 format (YYYY-MM-DD)
Convert all email addresses to lowercase
Trim whitespace from all text fields
Ensure phone numbers are in format: +1-XXX-XXX-XXXX
Save to standardized-data.csv
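Each standardization rule above is a small, mechanical transform. A Python sketch, assuming US-style MM/DD/YYYY input dates and 10-digit US phone numbers (both assumptions; field names are illustrative):

```python
from datetime import datetime

def standardize(row):
    row = dict(row)
    # Assumes MM/DD/YYYY input; emit ISO 8601 (YYYY-MM-DD).
    row["signup_date"] = (datetime.strptime(row["signup_date"], "%m/%d/%Y")
                          .strftime("%Y-%m-%d"))
    row["email"] = row["email"].strip().lower()
    row["name"] = row["name"].strip()
    # Keep only digits, then format as +1-XXX-XXX-XXXX (assumes 10 digits).
    digits = "".join(c for c in row["phone"] if c.isdigit())
    row["phone"] = f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:10]}"
    return row

raw = {"name": "  Ada Lovelace ", "email": " Ada@Example.COM ",
       "phone": "(555) 010-4477", "signup_date": "03/05/2025"}
print(standardize(raw))
```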

Data Validation

Validate user-data.csv:
- Email addresses must be valid format
- Ages must be between 18 and 120
- Postal codes must match their country's format
- Status must be one of: active, inactive, pending
Create a validation report listing all issues
Save invalid records to review-needed.csv
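Validation like this amounts to a table of rules checked per row. A sketch covering three of the four rules (postal-code checks vary by country and are omitted; the email regex is a deliberately loose illustration):

```python
import re

# One check per rule from the prompt above.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "age": lambda v: v.isdigit() and 18 <= int(v) <= 120,
    "status": lambda v: v in {"active", "inactive", "pending"},
}

def invalid_fields(row):
    # Return the names of every field that fails its rule.
    return [field for field, check in RULES.items() if not check(row[field])]

good = {"email": "ada@example.com", "age": "36", "status": "active"}
bad = {"email": "not-an-email", "age": "150", "status": "archived"}
print(invalid_fields(good), invalid_fields(bad))
```

Rows with a non-empty issue list would go to review-needed.csv; the issue lists themselves form the validation report.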

πŸ”„ Step 3: Data Transformation (12 minutes)

Pivoting and Reshaping

Read sales-by-month.csv
Pivot the data to show months as columns and products as rows
Save to sales-pivoted.csv
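A pivot turns one column's values into keys grouped under another. A minimal sketch of products-as-rows, months-as-columns (column names are illustrative):

```python
from collections import defaultdict

def pivot(rows, index="product", columns="month", values="units"):
    # Build product -> {month: units} from long-format rows.
    table = defaultdict(dict)
    for r in rows:
        table[r[index]][r[columns]] = r[values]
    return dict(table)

rows = [
    {"product": "widget", "month": "Jan", "units": 10},
    {"product": "widget", "month": "Feb", "units": 14},
    {"product": "gadget", "month": "Jan", "units": 7},
]
pivoted = pivot(rows)
print(pivoted)
```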

Merging Datasets

Read orders.csv and customers.csv
Join them on customer_id
Combine all matching records
Include all columns from both files
Save to orders-with-customers.csv
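The join above is an inner join on customer_id: index one side by the key, then look up each row of the other. A Python sketch (sample rows are illustrative):

```python
def inner_join(left, right, key):
    index = {r[key]: r for r in right}       # lookup table on the join key
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            joined.append({**match, **row})  # columns from both files
    return joined

orders = [{"customer_id": "c1", "amount": 120},
          {"customer_id": "c9", "amount": 50}]
customers = [{"customer_id": "c1", "name": "Ada"}]
print(inner_join(orders, customers, "customer_id"))
```

Note that order c9 has no matching customer and is dropped, which is what "combine all matching records" implies.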

Aggregation Operations

Open sales-data.csv
Group by product_category
For each category calculate:
  - Total revenue
  - Number of orders
  - Average order value
  - Best-selling product
Save to category-summary.csv
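Group-then-aggregate is the core pattern here. A sketch computing the four metrics per category (here "best-selling" is approximated as the product on the largest single order, an assumption; real unit counts would need a quantity column):

```python
from collections import defaultdict

def summarize_by_category(rows):
    groups = defaultdict(list)
    for r in rows:
        groups[r["product_category"]].append(r)
    summary = {}
    for category, items in groups.items():
        revenue = sum(r["amount"] for r in items)
        summary[category] = {
            "total_revenue": revenue,
            "orders": len(items),
            "avg_order_value": revenue / len(items),
            # Approximation: product on the largest single order.
            "best_seller": max(items, key=lambda r: r["amount"])["product"],
        }
    return summary

rows = [
    {"product_category": "tools", "product": "hammer", "amount": 30.0},
    {"product_category": "tools", "product": "drill", "amount": 90.0},
    {"product_category": "paint", "product": "primer", "amount": 25.0},
]
print(summarize_by_category(rows))
```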

Advanced Transformations

Read website-traffic.csv
Create new columns:
  - Conversion rate = conversions / visits
  - Revenue per visit = revenue / visits
  - Bounce rate category (high/medium/low)
Apply these transformations to each row
Save to enriched-metrics.csv
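Derived columns are just per-row arithmetic plus a bucketing rule. A sketch (the bounce-rate thresholds are illustrative, not a standard):

```python
def enrich(row):
    row = dict(row)
    row["conversion_rate"] = row["conversions"] / row["visits"]
    row["revenue_per_visit"] = row["revenue"] / row["visits"]
    # Bucket thresholds are illustrative; adjust to your own definitions.
    bounce = row["bounce_rate"]
    row["bounce_category"] = ("high" if bounce > 0.6
                              else "medium" if bounce > 0.3
                              else "low")
    return row

row = {"visits": 200, "conversions": 10, "revenue": 500.0, "bounce_rate": 0.45}
print(enrich(row))
```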

πŸ“Š Step 4: Data Analysis and Insights (10 minutes)

Statistical Analysis

Analyze sales-2025.csv:
- Calculate mean, median, mode for order amounts
- Find standard deviation
- Identify outliers (values > 2 std deviations from mean)
- Generate correlation matrix for all numeric columns
Create a statistical summary report
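The summary statistics and the 2-standard-deviation outlier rule map straight onto the standard library. A sketch (sample values are illustrative; mode and the correlation matrix are omitted for brevity):

```python
import statistics

def stats_summary(values, z=2):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "stdev": stdev,
        # Outliers: more than z sample standard deviations from the mean.
        "outliers": [v for v in values if abs(v - mean) > z * stdev],
    }

amounts = [20, 22, 19, 21, 20, 23, 18, 95]
print(stats_summary(amounts))
```

One caveat: an extreme value inflates the standard deviation itself, so with very small samples a large threshold may catch nothing.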

Trend Analysis

Read monthly-revenue.csv
Calculate month-over-month growth
Identify trends (increasing/decreasing/stable)
Find seasonal patterns
Predict next month's revenue based on trends
Generate a trend analysis report
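Month-over-month growth and a simple trend label can be sketched in a few lines (the 1% "stable" tolerance is an illustrative choice, and real forecasting would need more than this):

```python
def mom_growth(revenues):
    # Percent change from each month to the next.
    return [(curr - prev) / prev * 100
            for prev, curr in zip(revenues, revenues[1:])]

def classify(growth, tolerance=1.0):
    # Label the series by its average growth; tolerance is illustrative.
    avg = sum(growth) / len(growth)
    if avg > tolerance:
        return "increasing"
    if avg < -tolerance:
        return "decreasing"
    return "stable"

monthly = [100.0, 110.0, 121.0]
growth = mom_growth(monthly)
print(growth, classify(growth))
```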

Comparative Analysis

Compare sales between Q1 and Q2:
Read Q1-sales.csv and Q2-sales.csv
Calculate total revenue for each quarter
Find products with >20% growth
Find products with declining sales
Generate comparison report with insights

Customer Segmentation

Read customer-purchases.csv
Segment customers into tiers:
  - VIP: total purchases > $5000
  - Regular: total purchases > $1000
  - Casual: total purchases <= $1000
For each segment calculate:
  - Number of customers
  - Average purchase amount
  - Most popular products
Save to customer-segments.csv
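The tier boundaries above become a simple threshold function applied to each customer's total (sample totals are illustrative):

```python
def tier(total_purchases):
    # Thresholds taken from the segmentation prompt above.
    if total_purchases > 5000:
        return "VIP"
    if total_purchases > 1000:
        return "Regular"
    return "Casual"

totals = {"ada": 7200.0, "bo": 2500.0, "cy": 400.0}
segments = {name: tier(total) for name, total in totals.items()}
print(segments)
```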

πŸ“ˆ Step 5: Building Data Pipelines (15 minutes)

Multi-Step Processing Pipeline

Build a data processing pipeline:

Step 1: Read raw-sales-data.csv
Step 2: Remove duplicate transactions
Step 3: Standardize date formats to ISO 8601
Step 4: Validate all email addresses
Step 5: Filter out invalid records
Step 6: Add calculated columns (tax, total)
Step 7: Aggregate by customer
Step 8: Generate summary statistics
Step 9: Save processed data to processed-sales.csv
Step 10: Create executive summary report
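Structurally, a pipeline like this is a list of row-transforming steps applied in order. A sketch with two of the steps (deduplication and calculated columns; the 8% tax rate is purely illustrative):

```python
def run_pipeline(rows, steps):
    # Each step takes a list of row dicts and returns a new list.
    for step in steps:
        rows = step(rows)
    return rows

def dedupe(rows):
    seen, kept = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            kept.append(r)
    return kept

def add_totals(rows, tax_rate=0.08):      # tax rate is illustrative
    return [{**r, "tax": r["amount"] * tax_rate,
             "total": r["amount"] * (1 + tax_rate)} for r in rows]

data = [
    {"id": 1, "amount": 100.0},
    {"id": 1, "amount": 100.0},           # duplicate transaction
    {"id": 2, "amount": 50.0},
]
result = run_pipeline(data, [dedupe, add_totals])
print(result)
```

Keeping each step as its own function is what makes checkpointing and resuming (see the best practices below) straightforward.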

Automated Data Quality Pipeline

Set up automated data quality monitoring:

For each new CSV file in ~/inbox/:
  1. Validate file structure
  2. Check for required columns
  3. Validate data types
  4. Check for missing values
  5. Identify duplicates
  6. Run business rule validations
  7. Generate quality score (0-100)
  8. If score < 80:
     - Move to review/ folder
     - Alert me with issues
  9. If score >= 80:
     - Move to approved/ folder
     - Log quality metrics
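The 0-100 quality score can be sketched as an average over individual checks. This toy version averages per-field completeness with key uniqueness (the check set, weights, and the 80 cutoff are illustrative):

```python
def quality_score(rows, required=("id", "email"), key="id"):
    checks = []
    for field in required:
        filled = sum(1 for r in rows if r.get(field))
        checks.append(filled / len(rows))        # completeness per field
    keys = [r.get(key) for r in rows]
    checks.append(len(set(keys)) / len(keys))    # key uniqueness
    return round(100 * sum(checks) / len(checks))

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "c@example.com"},         # duplicate id
]
score = quality_score(rows)
print(score, "approved" if score >= 80 else "review")
```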

E-Commerce Analytics Pipeline

Build an e-commerce analytics pipeline:

Daily:
  1. Import orders from ~/raw-orders/
  2. Import customer data from CRM
  3. Merge and deduplicate
  4. Calculate customer lifetime value
  5. Identify high-value customers
  6. Analyze product performance
  7. Generate daily KPI dashboard
  8. Save reports to ~/analytics/daily-[date]/
  9. Alert on significant changes

🎯 Step 6: Data Export and Reporting (8 minutes)

Multiple Format Export

Read analytics-data.json
Export to multiple formats:
  - Excel workbook with multiple sheets
  - PDF report with charts
  - Interactive HTML dashboard
  - CSV for data processing
  - JSON for API integration
Save all to ~/reports/[date]/

Automated Report Generation

Generate weekly sales report:

1. Read all sales data from past week
2. Aggregate by product, region, salesperson
3. Calculate key metrics:
   - Total revenue
   - Top performers
   - Growth vs last week
   - Regional breakdown
4. Create visualizations:
   - Bar chart for sales by product
   - Line chart for weekly trend
   - Pie chart for regional distribution
5. Compile into formatted markdown report
6. Save to ~/reports/weekly-[week].md
7. Email PDF version to management

Custom Data Summaries

Create custom executive summary:

Read financial-data.csv
Generate 1-page summary including:
  - Key metrics at a glance
  - Top 5 insights
  - 3 notable trends
  - Areas needing attention
  - Recommendations
Format as professional document
Save to ~/executive-summary-[date].pdf

πŸ”§ Step 7: Error Handling and Data Quality (10 minutes)

Robust Error Handling

Process data with error recovery:

Try to read large-dataset.csv
If file is too large:
  - Split into chunks of 10000 rows
  - Process each chunk separately
  - Combine results at end
If parsing fails:
  - Identify problematic rows
  - Save errors to error-log.txt
  - Continue with valid rows
  - Alert me with summary
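The chunking strategy works because `csv` readers are iterators: rows stream in one at a time, so only one chunk is ever in memory. A sketch with a tiny in-memory "file" (seven rows and a chunk size of 3 stand in for the real sizes):

```python
import csv
import io

def process_in_chunks(reader, chunk_size, handle_chunk):
    chunk, results = [], []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            results.append(handle_chunk(chunk))
            chunk = []
    if chunk:                                 # final partial chunk
        results.append(handle_chunk(chunk))
    return results

# Seven rows stand in for a file too large to load at once.
CSV_TEXT = "amount\n" + "\n".join(str(i) for i in range(1, 8))
reader = csv.DictReader(io.StringIO(CSV_TEXT))
partial_sums = process_in_chunks(reader, 3,
                                 lambda c: sum(float(r["amount"]) for r in c))
print(partial_sums, sum(partial_sums))
```

Combining per-chunk results at the end (here, summing the partial sums) is the "combine results" step from the recovery plan above.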

Data Quality Checks

Implement comprehensive data quality checks:

For each dataset:
  1. Completeness check:
     - All required fields present
     - No missing critical values
  2. Accuracy check:
     - Values in valid ranges
     - Correct data types
  3. Consistency check:
     - Related fields match
     - No contradictions
  4. Timeliness check:
     - Data is current
     - No stale records
  5. Uniqueness check:
     - No duplicate keys
     - Primary key integrity
Generate quality scorecard

Anomaly Detection

Detect anomalies in data:

Read metrics.csv
Identify unusual patterns:
  - Statistical outliers (>3 std dev)
  - Sudden spikes or drops
  - Values outside expected ranges
  - Missing time periods
For each anomaly:
  - Log timestamp and value
  - Calculate deviation from normal
  - Assign severity score
  - Flag for review if critical
Create anomaly report
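Statistical outlier detection boils down to z-scores. A sketch of the ">3 standard deviations" rule (sample readings are illustrative; note that a spike inflates the standard deviation, so very short series may need a lower threshold):

```python
import statistics

def find_anomalies(series, z=3.0):
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    # Return (index, value, z-score) for points beyond the threshold.
    return [(i, v, round((v - mean) / stdev, 2))
            for i, v in enumerate(series)
            if abs(v - mean) > z * stdev]

readings = [10, 11, 10, 12, 11, 10, 11, 10, 11, 10,
            12, 11, 10, 11, 10, 12, 11, 10, 11, 60]
print(find_anomalies(readings))
```

The index plays the role of the timestamp, and the z-score is a natural severity measure for the report.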

πŸš€ Advanced Data Processing Examples

Example 1: Marketing Campaign Analysis

Analyze marketing campaign performance:

1. Import campaign data from multiple sources:
   - Email platform (opens, clicks)
   - Social media (engagement)
   - Website analytics
2. Merge all data by campaign_id
3. Calculate ROI for each campaign
4. A/B test analysis:
   - Compare conversion rates
   - Statistical significance
   - Winner identification
5. Channel performance comparison
6. Generate recommendations
7. Create visual dashboard
8. Save to ~/campaign-analysis/[campaign]/

Example 2: Inventory Optimization

Build inventory analysis system:

1. Read sales history (past 12 months)
2. Read current inventory levels
3. For each product:
   - Calculate average monthly sales
   - Identify seasonal trends
   - Calculate optimal stock level
   - Flag overstock items
   - Flag understock items
4. Generate purchase recommendations
5. Project stockout dates
6. Calculate safety stock levels
7. Create reorder alerts
8. Generate inventory optimization report

Example 3: Financial Data Processing

Financial data reconciliation pipeline:

Daily:
1. Import transactions from:
   - Bank statements
   - Credit card feeds
   - Payment processors
2. Categorize transactions
3. Match to accounting system
4. Identify unreconciled items
5. Flag suspicious transactions
6. Calculate cash flow metrics
7. Update financial dashboards
8. Generate daily reconciliation report
9. Alert on discrepancies

πŸ’‘ Best Practices

1. Always Backup Original Data

Before processing data:
1. Create backup of original file
2. Work on copy
3. Preserve raw data
4. Document all transformations

2. Validate Before Processing

Before processing large dataset:
- Check file size and format
- Preview first few rows
- Verify schema
- Test with small sample
- Only then process full dataset

3. Document Transformations

Keep track of all changes:
- What transformations were applied
- Why they were necessary
- What assumptions were made
- What data quality issues were found

4. Use Checkpoints

For complex pipelines:
- Save intermediate results
- Validate each step
- Create checkpoints
- Enable resume from failures

5. Monitor Data Quality

Implement ongoing monitoring:
- Set up quality alerts
- Track data quality trends
- Review error logs regularly
- Continuously improve validation

πŸ” Troubleshooting Common Issues

Issue: β€œFile too large to process”

Solution: Process in chunks

Read large-file.csv
Split into chunks of 1000 rows
Process each chunk separately
Save each chunk to output/
Combine all chunks at the end

Issue: β€œMemory errors”

Solution: Optimize memory usage

Process file row by row
Don't load entire file into memory
Save results incrementally
Clear processed data from memory

Issue: β€œInconsistent data formats”

Solution: Standardize early

First, detect all data formats in each column
Then, standardize to single format
Document all format conversions
Validate standardization worked

Issue: β€œProcessing too slow”

Solution: Optimize operations

Filter data early (reduce dataset size)
Use efficient operations
Avoid unnecessary loops
Process in parallel where possible
Cache intermediate results

🎯 What’s Next?


πŸ†˜ Need Help?

  • πŸ’¬ Ask OpenClaw: Describe your data processing needs in natural language
  • πŸ“– Data Processing Guide - Detailed data processing documentation
  • 🌟 Community Examples - Real-world data processing workflows
  • πŸ› GitHub Issues - Report problems

⏱️ Total Time: 50 minutes πŸ“Š Difficulty: Intermediate 🎯 Result: Building automated data processing pipelines with OpenClaw


πŸ’‘ Key Takeaways

  1. Natural Language Data Processing: Describe data operations in plain English
  2. Comprehensive Format Support: Work with CSV, JSON, XML, Excel, and more
  3. Automated Pipelines: Build multi-step data workflows easily
  4. Quality Assurance: Built-in validation and error handling
  5. Scalable: Process from small files to large datasets
  6. Production Ready: Suitable for real-world data processing tasks

Next: Try building your own data processing pipeline by describing your workflow step by step!

πŸŽ‰

Congratulations!

You've completed this tutorial. Ready for the next challenge?