
RAG Eval


Rating: 5 (285 reviews)
Downloads: 1,664
Version: 1.0.0

Overview

Evaluate your RAG pipeline quality using Ragas metrics (faithfulness, answer relevancy, context precision).

Complete Documentation


RAG Eval — Quality Testing for Your RAG Pipeline

Test and monitor your RAG pipeline's output quality.

🛠️ Installation

1. Ask OpenClaw (Recommended)

Tell OpenClaw: "Install the rag-eval skill." The agent will handle the installation and configuration automatically.

2. Manual Installation (CLI)

If you prefer the terminal, run:
bash
clawhub install rag-eval

⚠️ Prerequisites

  • Your OpenClaw must have a RAG system (vector DB + retrieval pipeline). This skill evaluates the output quality of that pipeline — it does not provide RAG functionality itself.
  • At least one LLM API key is required — Ragas uses an LLM as judge internally. Set one of:
      • OPENAI_API_KEY (default, uses GPT-4o)
      • ANTHROPIC_API_KEY (uses Claude Haiku)
      • RAGAS_LLM=ollama/llama3 (for local/offline evaluation)

Setup (first run only)

bash
bash scripts/setup.sh

This installs ragas, datasets, and other dependencies.

Single Response Evaluation

When the user asks to evaluate an answer, collect:

  • question — the original user question
  • answer — the LLM output to evaluate
  • contexts — list of text chunks used to generate the answer (retrieved docs)

⚠️ SECURITY: Never interpolate user content directly into shell commands. Write the input to a temp JSON file first, then pipe it to the evaluator:

bash
# Step 1: Write input to a temp file (agent should use the write/edit tool, NOT echo)
# Write this JSON to /tmp/rag-eval-input.json using the file write tool:
# {"question": "...", "answer": "...", "contexts": ["chunk1", "chunk2"]}

# Step 2: Pipe the file to the evaluator
python3 scripts/run_eval.py < /tmp/rag-eval-input.json

# Step 3: Clean up
rm -f /tmp/rag-eval-input.json

Alternatively, use --input-file:

bash
python3 scripts/run_eval.py --input-file /tmp/rag-eval-input.json
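In an agent-side script, the safe temp-file pattern above can be sketched in Python (the helper names `write_eval_input` and `run_eval` are illustrative, not part of the skill; only the `scripts/run_eval.py` invocation comes from the documentation):

```python
import json
import subprocess

def write_eval_input(question, answer, contexts, path):
    """Serialize the eval payload to a JSON file (never into a shell string)."""
    with open(path, "w") as f:
        json.dump({"question": question, "answer": answer, "contexts": contexts}, f)

def run_eval(path):
    """Pipe the JSON file to the evaluator on stdin and parse its JSON verdict."""
    with open(path) as f:
        proc = subprocess.run(
            ["python3", "scripts/run_eval.py"],
            stdin=f, capture_output=True, text=True, check=True,
        )
    return json.loads(proc.stdout)
```

Because the payload travels through a file and stdin rather than a command line, quoting and injection issues never arise.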

Output JSON:

json
{
  "faithfulness": 0.92,
  "answer_relevancy": 0.87,
  "context_precision": 0.79,
  "overall_score": 0.86,
  "verdict": "PASS",
  "flags": []
}

Post the results to the user with a human-readable summary:

text
🧪 Eval Results
• Faithfulness: 0.92 ✅ (no hallucination detected)
• Answer Relevancy: 0.87 ✅
• Context Precision: 0.79 ⚠️ (some irrelevant context retrieved)
• Overall: 0.86 — PASS
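A small formatter could turn the evaluator's output JSON into that summary. This is a sketch; the per-metric icon thresholds (✅ at 0.85+, ⚠️ at 0.70+) are an assumption inferred from the sample output and the score-interpretation table, not documented behavior:

```python
def format_summary(r):
    """Render the evaluator's JSON output as a human-readable summary."""
    def icon(score):
        # Assumed thresholds, mirroring the score-interpretation table
        return "✅" if score >= 0.85 else "⚠️" if score >= 0.70 else "❌"
    return "\n".join([
        "🧪 Eval Results",
        f"• Faithfulness: {r['faithfulness']:.2f} {icon(r['faithfulness'])}",
        f"• Answer Relevancy: {r['answer_relevancy']:.2f} {icon(r['answer_relevancy'])}",
        f"• Context Precision: {r['context_precision']:.2f} {icon(r['context_precision'])}",
        f"• Overall: {r['overall_score']:.2f} — {r['verdict']}",
    ])
```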

Save to memory/eval-results/YYYY-MM-DD.jsonl.
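Appending to the dated JSONL log could look like the following (the `memory/eval-results/YYYY-MM-DD.jsonl` path convention is from the docs; the helper itself is a sketch):

```python
import json
import os
from datetime import date

def log_result(result, base_dir="memory/eval-results"):
    """Append one eval result as a JSON line to today's log file."""
    os.makedirs(base_dir, exist_ok=True)
    path = os.path.join(base_dir, f"{date.today().isoformat()}.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(result) + "\n")
    return path
```

Using append mode means multiple evaluations on the same day accumulate in one file, one JSON object per line.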

Batch Evaluation

For a JSONL dataset file (each line: {"question":..., "answer":..., "contexts":[...]}):

bash
python3 scripts/batch_eval.py --input references/sample_dataset.jsonl --output memory/eval-results/batch-YYYY-MM-DD.json
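Building such a dataset file is one JSON object per line with the three field names specified above. A minimal sketch (the helper names are illustrative):

```python
import json

def write_dataset(records, path):
    """Write {question, answer, contexts} dicts as a JSONL dataset, one per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_dataset(path):
    """Read a JSONL dataset back into a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```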

Score Interpretation

Score       Verdict     Meaning
0.85+       ✅ PASS     Production-ready quality
0.70-0.84   ⚠️ REVIEW   Needs improvement
< 0.70      ❌ FAIL     Significant quality issues
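The interpretation table maps directly to a threshold function. This sketch assumes the verdict is derived from the overall score alone, which the sample output is consistent with:

```python
def verdict(overall_score):
    """Map an overall score to a verdict per the interpretation table."""
    if overall_score >= 0.85:
        return "PASS"
    if overall_score >= 0.70:
        return "REVIEW"
    return "FAIL"
```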

Faithfulness Deep-Dive

If faithfulness < 0.80, run:

bash
python3 scripts/run_eval.py --explain --metric faithfulness

This outputs which sentences in the answer are NOT supported by context.

Notes

  • Ragas uses an LLM as a judge internally (via your configured OpenAI/Anthropic key)
  • Evaluation costs ~$0.01-0.05 per response depending on length
  • For offline use, set RAGAS_LLM=ollama/llama3 in environment



Tags

#coding_agents-and-ides

Quick Info

Category: Development
Model: Claude 3.5
Complexity: One-Click
Author: jonathanjing
Last Updated: 3/10/2026
Ready to Install?

Get started with this skill in seconds

openclaw install rag-eval