
AgentBench

Benchmark your OpenClaw agent across 40 real-world tasks.

Rating: 5 (325 reviews)
Downloads: 2,367
Version: 1.0.0

Overview

Benchmark your OpenClaw agent across 40 real-world tasks.

Key Features

1. Discover Tasks
2. Set Up Run Directory
3. Execute Each Task
4. Generate Report
5. Present Results
6. Clean Up

Complete Documentation


AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

Commands

When the user says any of these, follow the corresponding instructions:

  • /benchmark — Run the full benchmark suite (all 40 tasks)
  • /benchmark --fast — Run only easy and medium tasks (19 tasks)
  • /benchmark --suite <name> — Run one domain only
  • /benchmark --task <id> — Run a single task
  • /benchmark --strict — Tag the run's scoring as externally verified
  • /benchmark-list — List all tasks grouped by domain
  • /benchmark-results — Show results from previous runs
  • /benchmark-compare — Compare two runs side-by-side

Flags are combinable: /benchmark --fast --suite research

Running a Benchmark

Step 1: Discover Tasks

Read task.yaml files from the tasks/ directory in this skill:

```text
tasks/{suite-name}/{task-name}/task.yaml
```

Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, scoring weights.
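
For orientation, here is a hypothetical task.yaml. The top-level field names are exactly those listed above; the shapes of the validator entries and metric ranges are assumptions for illustration, not copied from the shipped suite:

```yaml
# Hypothetical example only. Field names per the list above; validator and
# metric parameter shapes are assumptions, not taken from a real task file.
name: Project Proposal
id: project-proposal
suite: file-creation
difficulty: medium
mode: sandboxed
user_message: "Draft a one-page project proposal based on the attached brief."
input_files:
  - brief.md
expected_outputs:
  - type: file-exists
    path: proposal.md
  - type: word-count-range
    path: proposal.md
    min: 300
    max: 600
expected_metrics:
  tool_calls: [3, 8]
  planning_ratio: [0.2, 0.5]
weights:
  l0: 0.20
  l1: 0.35
  l2: 0.20
  l3: 0.25
```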

Filter by --suite or --task if specified. If --fast is set and --task is not, filter to only tasks where difficulty is "easy" or "medium".

Profile is "fast" if --fast was specified, otherwise "full".

List discovered tasks with count and suites.

Step 2: Set Up Run Directory

Generate a run ID from the current timestamp: YYYYMMDD-HHmmss

Read suite_version from skill.json in this skill directory.

Create the results directory:

```text
agentbench-results/{run-id}/
```

Announce: Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}
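
A minimal sketch of this setup in bash, assuming GNU date and jq are available and that skill.json exposes a top-level suite_version field (PROFILE and COUNT would come from Step 1):

```bash
# Sketch only: run-id generation and directory setup for Step 2.
RUN_ID=$(date +%Y%m%d-%H%M%S)                      # e.g. 20260222-143022
SUITE_VERSION=$(jq -r '.suite_version' skill.json) # field name per this doc
mkdir -p "agentbench-results/${RUN_ID}"
echo "Starting AgentBench run ${RUN_ID} | Profile: ${PROFILE} | Suite version: ${SUITE_VERSION} | Tasks: ${COUNT}"
```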

Step 3: Execute Each Task

For each task:

1. Set up workspace:
   - Create /tmp/agentbench-task-{task-id}/ as the workspace
   - Copy input files from tasks/{suite}/{task}/inputs/ to the workspace (if inputs/ exists)
   - If the task directory contains a setup.sh, run: bash tasks/{suite}/{task}/setup.sh {workspace-path}
   - For file-unchanged validators: compute checksums of the specified files after setup, before task execution

2. Announce: Running: {task.name} [{task.suite}] (difficulty: {task.difficulty})

3. Record the start time in milliseconds: date +%s%3N

4. Execute the task yourself, directly:
   - Read the task's user_message and execute it as if a real user sent you the request
   - Work ONLY within the workspace directory
   - If input files are listed, read them from the workspace
   - Execute naturally — use the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)
   - Create any output files in the workspace directory
   - When done, write a brief execution-trace.md to the workspace covering:
     - What you understood the task to be
     - What approach you took
     - What files you created or modified
     - Any difficulties or decisions you made

5. Record the end time and compute the duration.

6. Collect metrics:
   - total_time_ms: end - start
   - tool_calls_total: how many tool calls you made during this task
   - errors: how many tool calls failed
   - planning_ratio: estimated fraction of time spent reading/thinking vs. producing output (approximate is fine)
7. Layer 0 — Automated Structural Checks (compute directly):
   After task execution, check the workspace. For each entry in expected_outputs (a sketch of two of these validators appears after this list):
   - file-exists: Check whether the file exists. 30 points if found, 0 if not.
   - content-contains: Read the file and check each required section keyword (case-insensitive). Points proportional to matches found. Pool: 40 points.
   - word-count-range: Count words. In range = 30 points; within 2x the range = 15 points; outside = 0.
   - git-log-contains: Check the git log for the expected strings. 30 points if all are found, proportional otherwise.
   - directory-structure: Check that all paths exist. 30 points if all are present, proportional for partial.
   - command-output-contains: Run the command and check that its output contains all the strings. 30 points on a match, 0 if not.
   - file-unchanged: Compare the checksum against the pre-execution checksum. 30 points if unchanged, 0 if modified.
   - link-consistency: Scan files for link-syntax consistency. 30 points if consistent, 15 if mostly consistent (>70% one style), 0 if mixed.
   Normalize the total to 0-100.
8. Layer 1 — Metrics Analysis (compute directly). If the task has expected_metrics:
   - Tool calls: within the expected range = 40 points; within 2x the range = 20; outside 2x = 0
   - Planning ratio: within the expected range = 30 points; outside but within 2x = 15; way off = 0
   - Errors: zero = 30 points; 1-2 errors = 15; 3+ errors = 0
   Normalize to 0-100. If no metrics are available, score 50.
   The token estimate is tracked for reporting but NOT scored.
9. Layer 2 — Behavioral Analysis (self-evaluate honestly, 0-100). Score based on HOW you executed:

   Instruction Adherence (30 points):
   - 30: Followed all instructions precisely
   - 20: Mostly followed, minor deviations
   - 10: Significant deviations
   - 0: Ignored or misunderstood

   Tool Appropriateness (25 points) — rule-based first:
   - Penalty: -10 for each use of exec cat instead of read to read files
   - Penalty: -10 for each use of exec echo/printf instead of write to create files
   - Penalty: -5 for each use of exec sed/awk instead of edit for file edits
   - Start at 25, apply penalties, floor at 0

   Approach Quality (25 points) — check read-before-write:
   - 25: Read all inputs before producing output
   - 15: Read most inputs, minor gaps
   - 5: Started producing output without reading context
   - 0: No clear approach

   Error Recovery (20 points):
   - 20: Clean recovery, or no errors occurred
   - 10: Partial recovery
   - 0: Failed to recover
10. Layer 3 — Output Quality (self-evaluate honestly, 0-100). Score the deliverable:
    - Completeness (25): All requirements met? Any gaps?
    - Accuracy (25): Is the content correct? Are the calculations right?
    - Formatting (25): Well-structured? Correct file format?
    - Polish (25): Would a user be satisfied?
11. Compute the composite score (a worked sketch follows this list):

    ```text
    score = (L0 × 0.20) + (L1 × 0.35) + (L2 × 0.20) + (L3 × 0.25)
    ```

    Use weights from task.yaml if specified, otherwise these defaults.

12. Save the task result to agentbench-results/{run-id}/{task-id}/:
    - scores.json: all layer scores, composite, breakdown, notes
    - metrics.json: timing, tool calls, errors, planning ratio
    - Copies of the output files

13. Display: {task.name}: {composite}/100 (L0:{l0} L1:{l1} L2:{l2} L3:{l3})
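
A minimal bash sketch of two Layer 0 validators, assuming a WORKSPACE variable pointing at the task workspace and validator parameters already parsed out of task.yaml. The function names are illustrative, and "within 2x the range" is read here as [min/2, 2×max], which is an interpretation this doc does not pin down:

```bash
# Sketch only. Assumes WORKSPACE points at the task workspace and that the
# validator parameters (paths, min/max) were already parsed from task.yaml.

# file-exists: 30 points if the file is present, 0 otherwise.
score_file_exists() {
  [ -f "$WORKSPACE/$1" ] && echo 30 || echo 0
}

# word-count-range: 30 in range, 15 "within 2x" (read here as [min/2, 2*max],
# an interpretation rather than a spec from this doc), 0 outside.
score_word_count_range() {
  local file="$1" min="$2" max="$3" words
  words=$(wc -w < "$WORKSPACE/$file")
  if [ "$words" -ge "$min" ] && [ "$words" -le "$max" ]; then
    echo 30
  elif [ "$words" -ge $((min / 2)) ] && [ "$words" -le $((max * 2)) ]; then
    echo 15
  else
    echo 0
  fi
}
```

And a sketch of the composite computation, using awk for the floating-point math; the four layer scores are made-up example values, and the weights are the defaults above:

```bash
# Worked example with made-up layer scores and the default weights.
L0=90; L1=62; L2=80; L3=70
awk -v l0="$L0" -v l1="$L1" -v l2="$L2" -v l3="$L3" \
  'BEGIN { printf "%.0f\n", l0*0.20 + l1*0.35 + l2*0.20 + l3*0.25 }'
# -> 73  (18 + 21.7 + 16 + 17.5 = 73.2, rounded)
```

awk is used here instead of bc simply to avoid depending on bc being installed; either works.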

Step 4: Generate Report

After all tasks:

  • Compute domain averages (group by suite, average composite scores)
  • Compute overall score (average of domain scores — equal domain weighting; see the jq sketch below)
  • Compute aggregate metrics
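
A hypothetical aggregation sketch using jq. It assumes each per-task scores.json records its task's suite alongside composite; the doc specifies composite, while the suite field is an assumption here:

```bash
# Slurp every per-task scores.json for the run and aggregate. Assumes each
# file carries {suite, composite}; the suite field is an assumption.
jq -s '
  group_by(.suite)
  | map({suite: .[0].suite, avg: (map(.composite) | add / length)})
  | {
      domain_scores: (map({(.suite): (.avg | round)}) | add),
      overall_score: (map(.avg) | add / length | round)
    }
' agentbench-results/"$RUN_ID"/*/scores.json
```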
Generate three files in agentbench-results/{run-id}/:

results.json — Machine-readable with this structure:

```json
{
  "run_id": "20260222-143022",
  "timestamp": "2026-02-22T14:30:22Z",
  "platform": "openclaw",
  "mode": "sandboxed",
  "profile": "full",
  "suite_version": "1.0.0",
  "scoring_method": "self-scored",
  "overall_score": 74,
  "duration_ms": 754000,
  "task_count": 40,
  "metrics": {
    "total_tool_calls": 187,
    "total_errors": 3,
    "avg_planning_ratio": 0.28,
    "est_tokens": 245000
  },
  "domain_scores": {},
  "tasks": []
}
```

If --strict was used, set scoring_method to "externally-verified".

Integrity signature: After building results.json (without signature field), compute:

```bash
SIG=$(echo -n "$CONTENT" | openssl dgst -sha256 -hmac "agentbench-v1-{run_id}-{suite_version}-integrity" | awk '{print $2}')
```

Add the result as a "signature" field to results.json.
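
An end-to-end sketch of the signing step, under one loud assumption: this doc does not specify how the unsigned JSON is canonicalized, so the sketch uses jq -c and would only verify against a checker that canonicalizes the same way:

```bash
# Assumption: the unsigned document is canonicalized with `jq -c` on both the
# signing and verifying side. RUN_ID and SUITE_VERSION as in Step 2.
CONTENT=$(jq -c 'del(.signature)' results.json)
KEY="agentbench-v1-${RUN_ID}-${SUITE_VERSION}-integrity"
SIG=$(printf '%s' "$CONTENT" | openssl dgst -sha256 -hmac "$KEY" | awk '{print $2}')
jq --arg sig "$SIG" '.signature = $sig' results.json > results.json.tmp \
  && mv results.json.tmp results.json
```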

report.md — Markdown summary: Overall Score, Metrics, Domain Breakdown, Task Details, Top Failures, Recommendations.

report.html — Self-contained HTML dashboard (inline CSS/JS, no external deps):

  • Score display with color (green 80+, yellow 60-79, red <60)
  • Domain cards with score bars
  • Task detail table (sortable, expandable)
  • Top failures section
  • Dark mode via prefers-color-scheme
  • Footer: "Generated by AgentBench v1.0.0 (OpenClaw) | Suite v{suite_version} | Profile: {profile}"

Step 5: Present Results

  • Display overall score
  • Show domain breakdown
  • Tell user where results are saved
  • Mention they can submit to https://www.agentbench.app/submit

Step 6: Clean Up

Run teardown.sh if present. Remove temp workspace directories unless --keep-workspace was specified.
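
A sketch of this cleanup, with one assumption flagged: the doc names teardown.sh but not its location or arguments, so this guesses it sits next to setup.sh and, like setup.sh, takes the workspace path:

```bash
# Assumption: teardown.sh lives alongside setup.sh in each task directory
# and takes the workspace path as its argument.
TEARDOWN="tasks/${SUITE}/${TASK}/teardown.sh"
[ -f "$TEARDOWN" ] && bash "$TEARDOWN" "/tmp/agentbench-task-${TASK_ID}"
# Honor --keep-workspace; otherwise remove the temp workspaces.
[ "${KEEP_WORKSPACE:-0}" = "1" ] || rm -rf /tmp/agentbench-task-*/
```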

Listing Tasks (/benchmark-list)

Read all task.yaml files, group by suite, display as:

```text
## file-creation (9 tasks)
  - project-scaffold [easy]
  - project-proposal [medium]
  ...
```

Viewing Results (/benchmark-results)

List all directories in agentbench-results/, show run ID, date, overall score, profile, and task count for each.
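
For instance, a jq one-liner over each run's results.json (structure as in Step 4) could produce that listing:

```bash
# Print one summary line per saved run.
for f in agentbench-results/*/results.json; do
  jq -r '"\(.run_id) | \(.timestamp) | score \(.overall_score) | \(.profile) | \(.task_count) tasks"' "$f"
done
```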

Comparing Runs (/benchmark-compare)

Show two runs side-by-side: overall scores, domain scores, and per-task deltas. Warn if profiles differ.

Key Differences from Claude Code Version

  • No hooks — metrics are self-tracked (timing, tool call counting)
  • No subagents — you execute tasks directly in sequence
  • Same tasks, same scoring, same output format — results are cross-platform comparable
  • Same integrity signature — submissions work on the same leaderboard

Important Notes

  • Be honest in self-evaluation (L2/L3). Inflated scores are obvious on the leaderboard.
  • The objective layers (L0 + L1) carry 55% of the weight — they can't be faked.
  • Token estimates are informational only, not scored.
  • Any link syntax is accepted in skill graph tasks — consistency is what's scored.

Installation

```bash
openclaw install agentbench
```


Tags

#coding_agents-and-ides

Quick Info

Category: Development
Model: Claude 3.5
Complexity: Multi-Agent
Author: exe215
Last Updated: 3/10/2026
