
# PinchBench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks.

Rating 4.9 (186 reviews) · 2,493 downloads · Version 1.0.0


## PinchBench Benchmark Skill

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.

## Prerequisites

- Python 3.10+
- uv package manager
- OpenClaw instance (this agent)

## Quick Start

```bash
cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
```

## Available Tasks (23)

| Task | Category | Description |
|------|----------|-------------|
| `task_00_sanity` | Basic | Verify agent works |
| `task_01_calendar` | Productivity | Calendar event creation |
| `task_02_stock` | Research | Stock price lookup |
| `task_03_blog` | Writing | Blog post creation |
| `task_04_weather` | Coding | Weather script |
| `task_05_summary` | Analysis | Document summarization |
| `task_06_events` | Research | Conference research |
| `task_07_email` | Writing | Email drafting |
| `task_08_memory` | Memory | Context retrieval |
| `task_09_files` | Files | File structure creation |
| `task_10_workflow` | Integration | Multi-step API workflow |
| `task_11_clawdhub` | Skills | ClawHub interaction |
| `task_12_skill_search` | Skills | Skill discovery |
| `task_13_image_gen` | Creative | Image generation |
| `task_14_humanizer` | Writing | Text humanization |
| `task_15_daily_summary` | Productivity | Daily digest |
| `task_16_email_triage` | Email | Inbox triage |
| `task_17_email_search` | Email | Email search |
| `task_18_market_research` | Research | Market analysis |
| `task_19_spreadsheet_summary` | Analysis | Spreadsheet analysis |
| `task_20_eli5_pdf_summary` | Analysis | PDF simplification |
| `task_21_openclaw_comprehension` | Knowledge | OpenClaw docs comprehension |
| `task_22_second_brain` | Memory | Knowledge management |

## Command Line Options

| Option | Description |
|--------|-------------|
| `--model` | Model identifier (e.g., `anthropic/claude-sonnet-4`) |
| `--suite` | `all`, `automated-only`, or comma-separated task IDs |
| `--output-dir` | Results directory (default: `results/`) |
| `--timeout-multiplier` | Scale task timeouts for slower models |
| `--runs` | Number of runs per task for averaging |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request new API token for submissions |
| `--upload FILE` | Upload previous results JSON |

## Token Registration

To submit results to the leaderboard:

```bash
# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4
```

## Results

Results are saved as JSON in the output directory:

```bash
# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
```
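If you prefer to post-process results in Python, the same aggregation is a few lines. This is a minimal sketch, assuming only the `tasks[].grading.mean` shape that the `jq` queries above rely on; the sample data and filename are illustrative, not output of the real benchmark:

```python
import json

def overall_score(path):
    """Average the per-task grading means in a PinchBench results file."""
    with open(path) as f:
        results = json.load(f)
    means = [task["grading"]["mean"] for task in results["tasks"]]
    return sum(means) / len(means)

# Illustrative data matching the assumed schema; real files come from benchmark.py
sample = {"tasks": [
    {"task_id": "task_00_sanity", "grading": {"mean": 1.0}},
    {"task_id": "task_01_calendar", "grading": {"mean": 0.5}},
]}
with open("sample_results.json", "w") as f:
    json.dump(sample, f)

print(overall_score("sample_results.json"))  # 0.75
```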

## Adding Custom Tasks

Create a markdown file in `tasks/` following `TASK_TEMPLATE.md`. Each task needs:

- YAML frontmatter (`id`, `name`, `category`, `grading_type`, `timeout`)
- Prompt section
- Expected behavior
- Grading criteria
- Automated checks (Python grading function)
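Putting those pieces together, a task file might look like the sketch below. The frontmatter keys come from the list above, but the task id, section names, and grading function are illustrative; `TASK_TEMPLATE.md` remains the authoritative reference:

```markdown
---
id: task_23_hello
name: Hello World
category: Basic
grading_type: automated
timeout: 60
---

## Prompt
Ask the agent to write "hello world" to hello.txt in the workspace.

## Expected Behavior
The agent creates hello.txt containing the string "hello world".

## Grading
A Python grading function returning a score between 0.0 and 1.0, for example:

    def grade(workspace):
        target = workspace / "hello.txt"
        return 1.0 if target.exists() and "hello world" in target.read_text() else 0.0
```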

## Leaderboard

View results at pinchbench.com. The leaderboard shows:

- Model rankings by overall score
- Per-task breakdowns
- Historical performance trends

## Installation

```bash
openclaw install pinchbench
```


## Tags

#coding_agents-and-ides

## Quick Info

- Category: Development
- Model: Claude 3.5
- Complexity: Multi-Agent
- Author: olearycrew
- Last Updated: 3/10/2026