
AgentBench

Benchmark your OpenClaw agent across 40 real-world tasks.

Rating: 5 (325 reviews)
Downloads: 2,367
Version: 1.0.0

Overview

Benchmark your OpenClaw agent across 40 real-world tasks.

Key Features

1. Discover Tasks
2. Set Up Run Directory
3. Execute Each Task
4. Generate Report
5. Present Results
6. Clean Up

Complete Documentation


AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

Commands

When the user says any of these, follow the corresponding instructions:

  • /benchmark — Run the full benchmark suite (all 40 tasks)
  • /benchmark --fast — Run only easy and medium tasks (19 tasks)
  • /benchmark --suite <name> — Run one domain only
  • /benchmark --task <id> — Run a single task
  • /benchmark --strict — Tag the run's scoring as externally verified
  • /benchmark-list — List all tasks grouped by domain
  • /benchmark-results — Show results from previous runs
  • /benchmark-compare — Compare two runs side-by-side

Flags are combinable: /benchmark --fast --suite research

Running a Benchmark

Step 1: Discover Tasks

Read task.yaml files from the tasks/ directory in this skill:

```text
tasks/{suite-name}/{task-name}/task.yaml
```

Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, scoring weights.
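
For orientation, here is a hypothetical task.yaml. The top-level field names are exactly those listed above; the shapes of the validator entries and metric ranges are assumptions for illustration, not copied from the shipped suite:

```yaml
# Hypothetical example only. Field names per the list above; validator and
# metric parameter shapes are assumptions, not taken from a real task file.
name: Project Proposal
id: project-proposal
suite: file-creation
difficulty: medium
mode: sandboxed
user_message: "Draft a one-page project proposal based on the attached brief."
input_files:
  - brief.md
expected_outputs:
  - type: file-exists
    path: proposal.md
  - type: word-count-range
    path: proposal.md
    min: 300
    max: 600
expected_metrics:
  tool_calls: [3, 8]
  planning_ratio: [0.2, 0.5]
weights:
  l0: 0.20
  l1: 0.35
  l2: 0.20
  l3: 0.25
```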

Filter by --suite or --task if specified. If --fast is set and --task is not, filter to only tasks where difficulty is "easy" or "medium".

Profile is "fast" if --fast was specified, otherwise "full".

List discovered tasks with count and suites.

Step 2: Set Up Run Directory

Generate a run ID from the current timestamp: YYYYMMDD-HHmmss

Read suite_version from skill.json in this skill directory.

Create the results directory:

```text
agentbench-results/{run-id}/
```

Announce: Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}
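
A minimal sketch of this setup in bash, assuming GNU date and jq are available and that skill.json exposes a top-level suite_version field (PROFILE and COUNT would come from Step 1):

```bash
# Sketch only: run-id generation and directory setup for Step 2.
RUN_ID=$(date +%Y%m%d-%H%M%S)                      # e.g. 20260222-143022
SUITE_VERSION=$(jq -r '.suite_version' skill.json) # field name per this doc
mkdir -p "agentbench-results/${RUN_ID}"
echo "Starting AgentBench run ${RUN_ID} | Profile: ${PROFILE} | Suite version: ${SUITE_VERSION} | Tasks: ${COUNT}"
```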

Step 3: Execute Each Task

For each task:

1. Set up workspace:
   - Create /tmp/agentbench-task-{task-id}/ as the workspace
   - Copy input files from tasks/{suite}/{task}/inputs/ to the workspace (if inputs/ exists)
   - If the task directory contains a setup.sh, run: bash tasks/{suite}/{task}/setup.sh {workspace-path}
   - For file-unchanged validators: compute checksums of the specified files after setup, before task execution

2. Announce: Running: {task.name} [{task.suite}] (difficulty: {task.difficulty})

3. Record the start time in milliseconds: date +%s%3N

4. Execute the task yourself, directly:
   - Read the task's user_message and execute it as if a real user sent you the request
   - Work ONLY within the workspace directory
   - If input files are listed, read them from the workspace
   - Execute naturally — use the appropriate tools (read, write, edit, exec, web_search, web_fetch, etc.)
   - Create any output files in the workspace directory
   - When done, write a brief execution-trace.md to the workspace covering:
     - What you understood the task to be
     - What approach you took
     - What files you created or modified
     - Any difficulties or decisions you made

5. Record the end time and compute the duration.

6. Collect metrics:
   - total_time_ms: end - start
   - tool_calls_total: how many tool calls you made during this task
   - errors: how many tool calls failed
   - planning_ratio: estimated fraction of time spent reading/thinking vs. producing output (approximate is fine)
7. Layer 0 — Automated Structural Checks (compute directly):
   After task execution, check the workspace. For each entry in expected_outputs (a sketch of two of these validators appears after this list):
   - file-exists: Check whether the file exists. 30 points if found, 0 if not.
   - content-contains: Read the file and check each required section keyword (case-insensitive). Points proportional to matches found. Pool: 40 points.
   - word-count-range: Count words. In range = 30 points; within 2x the range = 15 points; outside = 0.
   - git-log-contains: Check the git log for the expected strings. 30 points if all are found, proportional otherwise.
   - directory-structure: Check that all paths exist. 30 points if all are present, proportional for partial.
   - command-output-contains: Run the command and check that its output contains all the strings. 30 points on a match, 0 if not.
   - file-unchanged: Compare the checksum against the pre-execution checksum. 30 points if unchanged, 0 if modified.
   - link-consistency: Scan files for link-syntax consistency. 30 points if consistent, 15 if mostly consistent (>70% one style), 0 if mixed.
   Normalize the total to 0-100.
8. Layer 1 — Metrics Analysis (compute directly). If the task has expected_metrics:
   - Tool calls: within the expected range = 40 points; within 2x the range = 20; outside 2x = 0
   - Planning ratio: within the expected range = 30 points; outside but within 2x = 15; way off = 0
   - Errors: zero = 30 points; 1-2 errors = 15; 3+ errors = 0
   Normalize to 0-100. If no metrics are available, score 50.
   The token estimate is tracked for reporting but NOT scored.
9. Layer 2 — Behavioral Analysis (self-evaluate honestly, 0-100). Score based on HOW you executed:

   Instruction Adherence (30 points):
   - 30: Followed all instructions precisely
   - 20: Mostly followed, minor deviations
   - 10: Significant deviations
   - 0: Ignored or misunderstood

   Tool Appropriateness (25 points) — rule-based first:
   - Penalty: -10 for each use of exec cat instead of read to read files
   - Penalty: -10 for each use of exec echo/printf instead of write to create files
   - Penalty: -5 for each use of exec sed/awk instead of edit for file edits
   - Start at 25, apply penalties, floor at 0

   Approach Quality (25 points) — check read-before-write:
   - 25: Read all inputs before producing output
   - 15: Read most inputs, minor gaps
   - 5: Started producing output without reading context
   - 0: No clear approach

   Error Recovery (20 points):
   - 20: Clean recovery, or no errors occurred
   - 10: Partial recovery
   - 0: Failed to recover
10. Layer 3 — Output Quality (self-evaluate honestly, 0-100). Score the deliverable:
    - Completeness (25): All requirements met? Any gaps?
    - Accuracy (25): Is the content correct? Are the calculations right?
    - Formatting (25): Well-structured? Correct file format?
    - Polish (25): Would a user be satisfied?
11. Compute the composite score (a worked sketch follows this list):

    ```text
    score = (L0 × 0.20) + (L1 × 0.35) + (L2 × 0.20) + (L3 × 0.25)
    ```

    Use weights from task.yaml if specified, otherwise these defaults.

12. Save the task result to agentbench-results/{run-id}/{task-id}/:
    - scores.json: all layer scores, composite, breakdown, notes
    - metrics.json: timing, tool calls, errors, planning ratio
    - Copies of the output files

13. Display: {task.name}: {composite}/100 (L0:{l0} L1:{l1} L2:{l2} L3:{l3})
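
A minimal bash sketch of two Layer 0 validators, assuming a WORKSPACE variable pointing at the task workspace and validator parameters already parsed out of task.yaml. The function names are illustrative, and "within 2x the range" is read here as [min/2, 2×max], which is an interpretation this doc does not pin down:

```bash
# Sketch only. Assumes WORKSPACE points at the task workspace and that the
# validator parameters (paths, min/max) were already parsed from task.yaml.

# file-exists: 30 points if the file is present, 0 otherwise.
score_file_exists() {
  [ -f "$WORKSPACE/$1" ] && echo 30 || echo 0
}

# word-count-range: 30 in range, 15 "within 2x" (read here as [min/2, 2*max],
# an interpretation rather than a spec from this doc), 0 outside.
score_word_count_range() {
  local file="$1" min="$2" max="$3" words
  words=$(wc -w < "$WORKSPACE/$file")
  if [ "$words" -ge "$min" ] && [ "$words" -le "$max" ]; then
    echo 30
  elif [ "$words" -ge $((min / 2)) ] && [ "$words" -le $((max * 2)) ]; then
    echo 15
  else
    echo 0
  fi
}
```

And a sketch of the composite computation, using awk for the floating-point math; the four layer scores are made-up example values, and the weights are the defaults above:

```bash
# Worked example with made-up layer scores and the default weights.
L0=90; L1=62; L2=80; L3=70
awk -v l0="$L0" -v l1="$L1" -v l2="$L2" -v l3="$L3" \
  'BEGIN { printf "%.0f\n", l0*0.20 + l1*0.35 + l2*0.20 + l3*0.25 }'
# -> 73  (18 + 21.7 + 16 + 17.5 = 73.2, rounded)
```

awk is used here instead of bc simply to avoid depending on bc being installed; either works.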

Step 4: Generate Report

After all tasks:

  • Compute domain averages (group by suite, average composite scores)
  • Compute overall score (average of domain scores — equal domain weighting; see the jq sketch below)
  • Compute aggregate metrics
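
A hypothetical aggregation sketch using jq. It assumes each per-task scores.json records its task's suite alongside composite; the doc specifies composite, while the suite field is an assumption here:

```bash
# Slurp every per-task scores.json for the run and aggregate. Assumes each
# file carries {suite, composite}; the suite field is an assumption.
jq -s '
  group_by(.suite)
  | map({suite: .[0].suite, avg: (map(.composite) | add / length)})
  | {
      domain_scores: (map({(.suite): (.avg | round)}) | add),
      overall_score: (map(.avg) | add / length | round)
    }
' agentbench-results/"$RUN_ID"/*/scores.json
```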
Generate three files in agentbench-results/{run-id}/:

results.json — Machine-readable with this structure:

```json
{
  "run_id": "20260222-143022",
  "timestamp": "2026-02-22T14:30:22Z",
  "platform": "openclaw",
  "mode": "sandboxed",
  "profile": "full",
  "suite_version": "1.0.0",
  "scoring_method": "self-scored",
  "overall_score": 74,
  "duration_ms": 754000,
  "task_count": 40,
  "metrics": {
    "total_tool_calls": 187,
    "total_errors": 3,
    "avg_planning_ratio": 0.28,
    "est_tokens": 245000
  },
  "domain_scores": {},
  "tasks": []
}
```

If --strict was used, set scoring_method to "externally-verified".

Integrity signature: After building results.json (without signature field), compute:

```bash
SIG=$(echo -n "$CONTENT" | openssl dgst -sha256 -hmac "agentbench-v1-{run_id}-{suite_version}-integrity" | awk '{print $2}')
```

Add the result as a "signature" field to results.json.
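
An end-to-end sketch of the signing step, under one loud assumption: this doc does not specify how the unsigned JSON is canonicalized, so the sketch uses jq -c and would only verify against a checker that canonicalizes the same way:

```bash
# Assumption: the unsigned document is canonicalized with `jq -c` on both the
# signing and verifying side. RUN_ID and SUITE_VERSION as in Step 2.
CONTENT=$(jq -c 'del(.signature)' results.json)
KEY="agentbench-v1-${RUN_ID}-${SUITE_VERSION}-integrity"
SIG=$(printf '%s' "$CONTENT" | openssl dgst -sha256 -hmac "$KEY" | awk '{print $2}')
jq --arg sig "$SIG" '.signature = $sig' results.json > results.json.tmp \
  && mv results.json.tmp results.json
```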

report.md — Markdown summary: Overall Score, Metrics, Domain Breakdown, Task Details, Top Failures, Recommendations.

report.html — Self-contained HTML dashboard (inline CSS/JS, no external deps):

  • Score display with color (green 80+, yellow 60-79, red <60)
  • Domain cards with score bars
  • Task detail table (sortable, expandable)
  • Top failures section
  • Dark mode via prefers-color-scheme
  • Footer: "Generated by AgentBench v1.0.0 (OpenClaw) | Suite v{suite_version} | Profile: {profile}"

Step 5: Present Results

  • Display overall score
  • Show domain breakdown
  • Tell user where results are saved
  • Mention they can submit to https://www.agentbench.app/submit

Step 6: Clean Up

Run teardown.sh if present. Remove temp workspace directories unless --keep-workspace was specified.
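
A sketch of this cleanup, with one assumption flagged: the doc names teardown.sh but not its location or arguments, so this guesses it sits next to setup.sh and, like setup.sh, takes the workspace path:

```bash
# Assumption: teardown.sh lives alongside setup.sh in each task directory
# and takes the workspace path as its argument.
TEARDOWN="tasks/${SUITE}/${TASK}/teardown.sh"
[ -f "$TEARDOWN" ] && bash "$TEARDOWN" "/tmp/agentbench-task-${TASK_ID}"
# Honor --keep-workspace; otherwise remove the temp workspaces.
[ "${KEEP_WORKSPACE:-0}" = "1" ] || rm -rf /tmp/agentbench-task-*/
```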

Listing Tasks (/benchmark-list)

Read all task.yaml files, group by suite, display as:

```text
## file-creation (9 tasks)
  - project-scaffold [easy]
  - project-proposal [medium]
  ...
```

Viewing Results (/benchmark-results)

List all directories in agentbench-results/, show run ID, date, overall score, profile, and task count for each.
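
For instance, a jq one-liner over each run's results.json (structure as in Step 4) could produce that listing:

```bash
# Print one summary line per saved run.
for f in agentbench-results/*/results.json; do
  jq -r '"\(.run_id) | \(.timestamp) | score \(.overall_score) | \(.profile) | \(.task_count) tasks"' "$f"
done
```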

Comparing Runs (/benchmark-compare)

Show two runs side-by-side: overall scores, domain scores, and per-task deltas. Warn if profiles differ.

Key Differences from Claude Code Version

  • No hooks — metrics are self-tracked (timing, tool call counting)
  • No subagents — you execute tasks directly in sequence
  • Same tasks, same scoring, same output format — results are cross-platform comparable
  • Same integrity signature — submissions work on the same leaderboard

Important Notes

  • Be honest in self-evaluation (L2/L3). Inflated scores are obvious on the leaderboard.
  • The objective layers (L0 + L1) carry 55% of the weight — they can't be faked.
  • Token estimates are informational only, not scored.
  • Any link syntax is accepted in skill graph tasks — consistency is what's scored.

Installation

```bash
openclaw install agentbench
```


Tags

#coding_agents-and-ides

Quick Info

Category: Development
Model: Claude 3.5
Complexity: Multi-Agent
Author: exe215
Last Updated: 3/10/2026
