Data Cleaning Annotation Workflow
Overview
Complete workflow for time series datasets (Energy, Manufacturing, Climate) on Kaggle to Data Annotation platform.
✨ Key Features
- Find and Download Dataset
- Clean the Dataset
- Upload Raw Dataset to Platform
- Upload Cleaned File & Configure Metadata
- Assign Groups to Variables
- Final Upload
Complete Documentation
Simulacrum Data Annotation Workflow
Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).
What This Skill Does
This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:
- Find Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data
- Download: Get CSV files via browser or Kaggle CLI
- Clean: Run Python/pandas script to handle missing values, duplicates, formatting
- Upload RAW: Upload original CSV with metadata (name, domain, source URL, description)
- Configure Headers: Set column types (Time, Target, Covariate, Group) and units
- Assign Groups: Select ALL variables (target + covariates), apply ALL group tags
- Upload Cleaned: Final upload → CLEAN status
Supported Domains
- Energy: Power consumption, utilities, renewable energy, grid data
- Manufacturing: Industrial processes, steel production, emissions, equipment data
- Climate: CO2 emissions, environmental monitoring, weather correlation data
Quick Start
For the full pipeline from Kaggle to annotated dataset (a scripted sketch of the local steps 1-3 follows this list):
1. Find dataset on Kaggle
2. Download (browser or kaggle CLI)
3. Clean with scripts/clean_dataset.py
4. Upload RAW dataset to data.smlcrm.com (with metadata)
5. Click "Clean" and upload cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Upload cleaned dataset → CLEAN status
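Steps 1-3 run locally; steps 4-8 happen in the browser on data.smlcrm.com. The sketch below simply chains the two bundled scripts from the Scripts table; the directory layout and the extracted CSV filename are assumptions, not part of the skill.

```python
# Sketch: run the local half of the pipeline (download + clean).
# Assumes scripts/download_kaggle.sh and scripts/clean_dataset.py as listed
# in the Scripts table; steps 4-8 are manual browser steps on data.smlcrm.com.
import subprocess
import sys

def prepare_dataset(kaggle_slug: str, raw_csv: str, cleaned_csv: str) -> None:
    # Step 2: download via the Kaggle CLI wrapper script
    subprocess.run(["scripts/download_kaggle.sh", kaggle_slug, "data/raw"], check=True)
    # Step 3: clean the downloaded CSV and write the upload-ready file
    subprocess.run([sys.executable, "scripts/clean_dataset.py", raw_csv, "-o", cleaned_csv], check=True)

if __name__ == "__main__":
    prepare_dataset(
        "csafrit2/steel-industry-energy-consumption",
        "data/raw/Steel_industry_data.csv",        # hypothetical extracted filename
        "data/cleaned/steel_industry_clean.csv",   # hypothetical output path
    )
```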
Workflow Steps
Step 1: Find and Download Dataset
From Kaggle (Browser Method):
- Navigate to kaggle.com/datasets
- Search for relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
- Review data description, file list, and preview
- Click "Download" button
- Extract CSV file from downloaded zip
Alternative: Kaggle CLI
# Install if needed: pip install kaggle
# Configure credentials in ~/.kaggle/kaggle.json, then verify with: kaggle competitions list
scripts/download_kaggle.sh <dataset-name> [output-dir]
# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption
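If you prefer to stay in Python rather than shell out, the official kaggle package exposes the same download as the CLI. A minimal sketch; the dataset slug and output path are examples only:

```python
# Sketch: download a Kaggle dataset with the official `kaggle` package
# (pip install kaggle; credentials in ~/.kaggle/kaggle.json).
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json or KAGGLE_USERNAME/KAGGLE_KEY
api.dataset_download_files(
    "csafrit2/steel-industry-energy-consumption",  # dataset slug from Step 1
    path="data/raw",
    unzip=True,  # extract the CSV instead of keeping the zip
)
```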
Step 2: Clean the Dataset
Always run the cleaning script before upload:
python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]
What the script does:
- Strips whitespace from column names
- Removes duplicate rows
- Fills missing numeric values with median
- Fills missing categorical values with mode or 'Unknown'
- Converts timestamp columns to datetime format
- Outputs column summary for metadata configuration
Output:
- Cleaned CSV file ready for upload
- Column summary printed to console (save this for metadata config)
A minimal pandas sketch of the cleaning logic is shown below.
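The sketch assumes a generic CSV; the function name and the timestamp-detection heuristic are illustrative and not the actual scripts/clean_dataset.py:

```python
# Minimal sketch of the cleaning steps listed above (not the bundled script).
import sys
import pandas as pd

def clean(input_csv: str, output_csv: str) -> None:
    df = pd.read_csv(input_csv)
    df.columns = df.columns.str.strip()   # strip whitespace from column names
    df = df.drop_duplicates()             # remove duplicate rows

    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())            # numeric -> median
        elif "date" in col.lower() or "time" in col.lower():      # naive timestamp heuristic
            df[col] = pd.to_datetime(df[col], errors="coerce")    # convert to datetime
        else:
            mode = df[col].mode()
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "Unknown")

    df.to_csv(output_csv, index=False)
    print(df.dtypes)                        # column summary for metadata configuration
    print(df.describe(include="all").T)

if __name__ == "__main__":
    clean(sys.argv[1], sys.argv[2])
```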
Step 3: Upload Raw Dataset to Platform
- Navigate to data.smlcrm.com/dashboard
- Click "Upload Dataset" button
- Fill in metadata for the RAW dataset:
- Name: Descriptive dataset name
- Domain: Category (Energy, Manufacturing, Climate, etc.)
- Source URL: Kaggle or original source URL
- Description: Brief summary of the dataset
- Upload the original/raw CSV file (not cleaned yet)
- Click Upload
Step 4: Upload Cleaned File & Configure Metadata
- Find the RAW dataset in the list
- Click "Clean" button
- Upload the cleaned CSV file (from Step 2)
- Configure headers for each column:
| Setting | Description |
|---|---|
| Name | Column name (editable) |
| Units | Measurement units (kWh, °C, %, ratio, tCO2, etc.) |
| Type | Time / Target / Covariate / Group |
Column types:
- Time: Timestamp/datetime columns (usually required)
- Target: Variable to predict (at least one required)
- Covariate: Input features/independent variables
- Group: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)
Bulk editing:
- Select multiple rows via checkboxes
- Use "Apply" dropdown to set type for selected columns
- Set units individually or in bulk
Common units:
- Energy: kWh, MWh, MW
- Power: kVarh, kW
- Emissions: tCO2, kgCO2
- Ratios: ratio, %
- Time: seconds, minutes, hours
A small helper for drafting this type/units mapping is sketched below.
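This helper is a hypothetical convenience, not part of the platform or the bundled scripts: it prints a type/units checklist from the cleaned CSV that you can copy into the header form. The unit hints keyed on column-name substrings are assumptions to verify against the source documentation.

```python
# Sketch: print a type/units checklist from the cleaned CSV to copy into the
# header form. Unit guesses keyed on column-name substrings are illustrative.
import pandas as pd

UNIT_HINTS = {"kwh": "kWh", "kvarh": "kVarh", "tco2": "tCO2", "factor": "ratio", "nsm": "seconds"}

df = pd.read_csv("steel_industry_clean.csv")  # hypothetical cleaned file
for col in df.columns:
    if pd.api.types.is_datetime64_any_dtype(df[col]) or "date" in col.lower():
        col_type = "Time"
    elif df[col].dtype == object:
        col_type = "Group"       # categorical segments such as WeekStatus
    else:
        col_type = "Covariate"   # promote one of these to Target by hand
    units = next((u for k, u in UNIT_HINTS.items() if k in col.lower()), "-")
    print(f"{col:45s} {col_type:10s} {units}")
```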
Step 5: Assign Groups to Variables
Purpose: Group variables define how data is segmented for analysis (illustrated with a short pandas example after this list).
Exact Workflow:
- Select ALL variables by checking their checkboxes:
- Target variable(s)
- ALL covariate variables
- Apply ALL group tags to selected variables:
- Click first group tag (e.g., WeekStatus) → all selected get this group
- Click second group tag (e.g., Day_of_week) → all selected get this group
- Click third group tag (e.g., Load_Type) → all selected get this group
- Continue for all available group tags
- Result: All variables have all groups assigned (e.g., "WeekStatus × Day_of_week × Load_Type")
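What this buys you downstream: every target and covariate can be segmented by the cross-product of the group columns. A pandas illustration using the steel-example column names; the file name is hypothetical:

```python
# Sketch: what "WeekStatus x Day_of_week x Load_Type" means for analysis --
# each target/covariate can be aggregated over the cross-product of groups.
import pandas as pd

df = pd.read_csv("steel_industry_clean.csv")   # hypothetical cleaned file
segments = df.groupby(["WeekStatus", "Day_of_week", "Load_Type"])
print(segments["Usage_kWh"].mean())            # one mean per group combination
print(f"{segments.ngroups} segments in total")
```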
Step 6: Final Upload
- Click "Upload Cleaned Dataset" button
- Wait for processing
- Dataset status changes from RAW → CLEAN
- Verify data points count is correct
Example: Steel Industry Energy Dataset
Source: https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption
Metadata:
- Name: Steel Industry Energy Consumption (South Korea)
- Domain: Energy
- Data Points: 350,400
| Column | Type | Units |
|---|---|---|
| Timestamps | Time | - |
| Usage_kWh | Target | kWh |
| Lagging_Current_Reactive.Power_kVarh | Covariate | kVarh |
| Leading_Current_Reactive_Power_kVarh | Covariate | kVarh |
| CO2(tCO2) | Covariate | tCO2 |
| Lagging_Current_Power_Factor | Covariate | ratio |
| Leading_Current_Power_Factor | Covariate | ratio |
| NSM | Covariate | seconds |
| WeekStatus | Group | - |
| Day_of_week | Group | - |
| Load_Type | Group | - |
Group assignment (Step 5 applied to this dataset):
- Select: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM
- Click: WeekStatus → all selected get WeekStatus
- Click: Day_of_week → all selected get Day_of_week
- Click: Load_Type → all selected get Load_Type
- Final: All variables show "WeekStatus × Day_of_week × Load_Type"
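To sanity-check the reported count of 350,400 data points after upload, a hedged snippet: it assumes the platform counts rows × non-time columns, which is consistent with this dataset but is not documented here.

```python
# Sketch: sanity-check the platform's data-point count for the steel dataset.
# Assumes the count is rows x non-time columns; confirm against the platform.
import pandas as pd

df = pd.read_csv("steel_industry_clean.csv")   # hypothetical cleaned file
value_cols = [c for c in df.columns if c.strip().lower() not in ("date", "timestamps")]
print(len(df), "rows,", len(value_cols), "value columns")
print("expected data points:", len(df) * len(value_cols))   # compare with 350,400
```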
Reference Materials
For detailed platform configuration guidance, see references/platform_guide.md.
Troubleshooting
"Next" button disabled:
- Check at least one Time column is set
- Check at least one Target column is set
- Verify all columns have types assigned
- Columns must be marked as "Group" type first
- Proceed to next step after setting Group types
- Re-run cleaning script
- Check CSV format (comma-delimited)
- Verify no empty column names
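The file name below is an example; the bundled cleaning script should already guarantee these properties, so this is only a last look before upload.

```python
# Sketch: pre-upload sanity checks for the common failure causes listed above.
import pandas as pd

path = "steel_industry_clean.csv"               # hypothetical cleaned file
df = pd.read_csv(path)

assert df.shape[1] > 1, "only one column parsed - file may not be comma-delimited"
bad_names = [c for c in df.columns if not str(c).strip() or str(c).startswith("Unnamed")]
assert not bad_names, f"empty/unnamed column names: {bad_names}"   # re-run cleaning script
assert not df.isna().any().any(), "missing values remain"          # re-run cleaning script
assert not df.duplicated().any(), "duplicate rows remain"          # re-run cleaning script
print("basic checks passed:", df.shape)
```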
Scripts
| Script | Purpose |
|---|---|
| scripts/clean_dataset.py | Clean and prepare CSV for upload |
| scripts/download_kaggle.sh | Download datasets via Kaggle CLI |
Platform URL
Data Annotation Platform: https://data.smlcrm.com
Installation
openclaw install data-cleaning-annotation-workflow