
Adaptivetest

Adaptive testing engine with IRT/CAT, AI question generation, and personalized learning recommendations.

Rating: 5 (35 reviews)
Downloads: 8,089
Version: 1.0.0

Overview

Adaptive testing engine with IRT/CAT, AI question generation, and personalized learning recommendations.

Complete Documentation

View Source →

Adaptive Testing with IRT

Design computerized adaptive tests that measure ability efficiently and accurately using Item Response Theory.

Core Concept

Adaptive tests adjust difficulty in real-time based on student responses. A correct answer → harder question. Incorrect → easier question. The result: accurate ability estimates in ~50% fewer questions than fixed-length tests.

Key advantage: Traditional tests waste time on too-easy or too-hard questions. Adaptive tests spend time where measurement matters most — near the student's ability level.

Quick Decision Tree

If you need to...

  • Understand IRT models and parameters → IRT Fundamentals
  • Design a new adaptive test → Test Design Workflow
  • Choose an item selection algorithm → Item Selection
  • Decide when to stop the test → Stopping Rules
  • Calibrate new questions → references/calibration.md
  • Implement the CAT algorithm → references/implementation.md

IRT Fundamentals

The 3-Parameter Logistic (3PL) Model

Most adaptive tests use the 3PL model. Each question has three parameters:

  • a (discrimination) — How well the question differentiates ability levels. Higher = steeper curve. Typical range: 0.5 to 2.5
  • b (difficulty) — The ability level at the curve's inflection point (where P(correct) = 0.5 when c = 0). Range: -3 to +3 (standardized scale)
  • c (guessing) — Probability of guessing correctly. Usually 0.2 to 0.25 for multiple choice

Probability of a correct response:
text
P(correct | ability, a, b, c) = c + (1 - c) / (1 + e^(-a(ability - b)))

Simpler models:

  • 2PL: Set c = 0 (no guessing parameter)
  • 1PL (Rasch): Set c = 0 and a = 1 for all items (only difficulty varies)

Use 3PL for high-stakes tests. Use 2PL/1PL when the sample size is small (<500 responses per item).
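To make the formula concrete, here is a minimal sketch of the 3PL response function in Python. It assumes NumPy; the function name and the example parameter values are illustrative and not part of this skill's API.

python
import numpy as np

def p_correct(theta, a, b, c=0.0):
    """3PL probability of a correct response.
    Set c=0 for 2PL; set c=0 and a=1 for 1PL (Rasch).
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# An average-ability student (theta = 0) on a slightly hard item:
print(p_correct(theta=0.0, a=1.2, b=0.5, c=0.2))  # ≈ 0.48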

Information and Standard Error

Information measures how precisely an item estimates ability at a given level. Peak information occurs when ability ≈ difficulty (the b parameter); for 3PL items with c > 0 the peak sits slightly above b.

Standard Error (SE) is the reciprocal of the square root of information:

text
SE = 1 / sqrt(Information)

Goal of CAT: Maximize information (minimize SE) at the student's true ability level.
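A matching sketch for item information and SE, using the standard Birnbaum information formula for the 3PL model. Function names and the example items are illustrative assumptions.

python
import numpy as np

def p_correct(theta, a, b, c=0.0):
    # same 3PL helper as above
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c=0.0):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

def standard_error(theta, items):
    """SE of the ability estimate = 1 / sqrt(total test information)."""
    total = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / np.sqrt(total)

# Information is highest for items whose difficulty is near theta:
print(item_information(0.0, a=1.5, b=0.0, c=0.2))  # near the item's peak
print(item_information(0.0, a=1.5, b=2.0, c=0.2))  # far from theta: much lower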


Test Design Workflow

1. Define Test Specifications

  • Purpose: Placement, diagnostic, certification, progress monitoring?
  • Content domain: Single skill or multidimensional?
  • Target population: What ability range (-3 to +3)?
  • Constraints: Time limit, minimum/maximum length, content balance

2. Build Item Bank

Minimum bank size: 10× the average test length. For a 20-item CAT, you need ≥200 calibrated items.

Distribution targets:

  • Difficulty (b): Spread across expected ability range
  • Discrimination (a): Target 1.0 to 2.0 (high discrimination)
  • Exposure: No item used >20% of the time

Content balancing: If testing math, ensure geometry/algebra/etc. are proportionally represented.
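A quick way to sanity-check a candidate bank against the distribution targets above; the synthetic bank, column layout, and thresholds here are assumptions for illustration only.

python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibrated bank: one row per item, columns a, b, c
bank = np.column_stack([
    rng.uniform(0.8, 2.0, 300),    # a: discrimination
    rng.uniform(-2.0, 2.0, 300),   # b: difficulty
    np.full(300, 0.2),             # c: guessing
])

print("items:", len(bank))
print("share with a in [1.0, 2.0]:", np.mean((bank[:, 0] >= 1.0) & (bank[:, 0] <= 2.0)))
print("items per difficulty bin:", np.histogram(bank[:, 1], bins=np.arange(-3, 4, 1.0))[0])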

3. Choose Algorithms

Pick one from each category:

Item selection (see below):

  • Maximum Information
  • Randomesque (MFI + exposure control)
  • Content balancing

Ability estimation:

  • Maximum Likelihood Estimation (MLE)
  • Expected A Posteriori (EAP) — better for extreme scores; sketched below
  • Weighted Likelihood (WLE)

Stopping rule (see below):

  • Fixed length
  • Standard error threshold
  • Information threshold
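For example, EAP can be sketched with a simple quadrature grid over a normal prior. This is a minimal illustration, not the engine's actual estimator; the function name, grid settings, and example items are assumptions.

python
import numpy as np

def p_correct(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(responses, items, prior_mean=0.0, prior_sd=1.0, n_points=61):
    """Expected A Posteriori ability estimate on a quadrature grid.

    responses: list of 0/1 scores; items: list of (a, b, c) tuples.
    Unlike MLE, this stays finite for all-correct / all-incorrect patterns.
    """
    grid = np.linspace(-4.0, 4.0, n_points)
    prior = np.exp(-0.5 * ((grid - prior_mean) / prior_sd) ** 2)
    likelihood = np.ones_like(grid)
    for score, (a, b, c) in zip(responses, items):
        p = p_correct(grid, a, b, c)
        likelihood *= p if score else (1.0 - p)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    theta_hat = float(np.sum(grid * posterior))
    se = float(np.sqrt(np.sum((grid - theta_hat) ** 2 * posterior)))
    return theta_hat, se

# Three items answered correct, correct, incorrect:
print(eap_estimate([1, 1, 0], [(1.2, -0.5, 0.2), (1.0, 0.0, 0.2), (1.5, 0.8, 0.2)]))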

4. Simulate Performance

Before going live, simulate 1000+ test sessions with known abilities. Check:

  • Average test length
  • SE at different ability levels
  • Item exposure rates
  • Content balance adherence

Adjust if needed.
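A bare-bones Monte-Carlo harness in the spirit of this step might look like the sketch below. It uses a synthetic item bank, pure maximum-information selection, and a crude step-size ability update standing in for a real estimator; every name and number here is an assumption, not this skill's implementation.

python
import numpy as np

rng = np.random.default_rng(0)

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def info(theta, a, b, c):
    p = p3pl(theta, a, b, c)
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

# Hypothetical calibrated bank: 300 items, difficulties spread over [-2, 2]
a = rng.uniform(0.8, 2.0, 300)
b = rng.uniform(-2.0, 2.0, 300)
c = np.full(300, 0.2)

lengths, errors = [], []
for true_theta in rng.normal(0, 1, 1000):           # 1000 simulated sessions
    theta, used = 0.0, []
    for step in range(25):                           # hard cap at 25 items
        item_info = info(theta, a, b, c)
        item_info[used] = -np.inf                    # never reuse an item
        item = int(np.argmax(item_info))             # pure MFI (no exposure control)
        used.append(item)
        answered = rng.random() < p3pl(true_theta, a[item], b[item], c[item])
        theta += (1.0 if answered else -1.0) * 2.0 / (step + 2)   # crude update
        se = 1.0 / np.sqrt(info(theta, a[used], b[used], c[used]).sum())
        if se < 0.35:
            break
    lengths.append(len(used))
    errors.append(theta - true_theta)

print("average length:", np.mean(lengths))
print("RMSE of ability estimates:", np.sqrt(np.mean(np.square(errors))))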


Item Selection Strategies

Maximum Fisher Information (MFI)

Rule: Select the item with highest information at current ability estimate.

Pros: Optimal precision, shortest tests
Cons: Overuses the "best" items, poor item security

Use when: Pilot testing, low-stakes practice

Randomesque (MFI + Exposure Control)

Rule: Select from top N items by information (e.g., top 5), choose randomly from that set.

Pros: Balances precision and security
Cons: Slightly longer tests than pure MFI

Use when: Operational tests, default choice
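A sketch of the randomesque rule under the same assumptions as the earlier snippets (NumPy, a bank array with columns a, b, c); the pool size and helper names are illustrative.

python
import numpy as np

def item_information(theta, a, b, c):
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

def randomesque_select(theta, bank, administered, top_n=5, rng=None):
    """Pick randomly among the top_n most informative unused items.

    bank: array of shape (n_items, 3) with columns (a, b, c).
    administered: indices of items already given to this student.
    """
    if rng is None:
        rng = np.random.default_rng()
    info = item_information(theta, bank[:, 0], bank[:, 1], bank[:, 2])
    info[list(administered)] = -np.inf        # exclude used items
    candidates = np.argsort(info)[-top_n:]    # indices of the top_n items
    return int(rng.choice(candidates))

# Returns one of the three most informative unused items:
bank = np.array([[1.2, -1.0, 0.2], [1.0, 0.0, 0.2], [1.5, 0.3, 0.2],
                 [0.9, 1.0, 0.2], [1.8, -0.2, 0.2], [1.1, 0.5, 0.2]])
print(randomesque_select(theta=0.0, bank=bank, administered=[4], top_n=3))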

a-Stratified

Rule: Use low-discrimination items in the early stages, while the ability estimate is still rough, and save high-discrimination items for later stages.

Pros: Better exposure balance and item bank utilization
Cons: Complex to implement

Use when: Very large item banks, research settings

Content Balancing

Rule: Track content area usage, prioritize underrepresented areas when selecting next item.

Implementation: Weight information by content constraint satisfaction.

Use when: Blueprint requirements, multidimensional tests
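One simple way to fold this into selection is to down-weight the information of items whose content area is already at or above its blueprint share. This is purely an illustrative scheme; the weighting constants and names are assumptions.

python
import numpy as np

def content_weighted_info(info, item_areas, targets, given_counts):
    """Scale each item's information by how under-represented its content area is.

    info: information of each candidate item at the current ability estimate.
    item_areas: content label per candidate item, e.g. "algebra".
    targets: desired proportion per area; given_counts: items administered so far per area.
    """
    total_given = max(sum(given_counts.values()), 1)
    deficits = {area: max(share - given_counts.get(area, 0) / total_given, 0.0)
                for area, share in targets.items()}
    weights = np.array([deficits[area] + 0.1 for area in item_areas])  # +0.1 keeps every item eligible
    return info * weights

# Algebra is below its 40% target here, so algebra items get boosted:
info = np.array([0.9, 0.8, 0.7])
areas = ["algebra", "geometry", "statistics"]
print(content_weighted_info(info, areas,
                            {"algebra": 0.4, "geometry": 0.3, "statistics": 0.3},
                            {"algebra": 2, "geometry": 4, "statistics": 4}))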


Stopping Rules

Fixed Length

Stop after N items (e.g., 20 questions).

Pros: Predictable testing time, simple to implement
Cons: May over- or under-test some students

Use when: Time limits matter, simple implementation needed

Standard Error Threshold

Stop when SE < target (e.g., SE < 0.3).

Pros: Consistent precision across ability levels
Cons: Variable test length (harder to schedule)

Typical targets:

  • Low-stakes: SE < 0.4
  • Medium-stakes: SE < 0.3
  • High-stakes: SE < 0.25

Use when: Precision matters more than time

Combined Rule

Stop when (SE < target) OR (length ≥ max) OR (length ≥ min AND ability estimate stable).

Use when: Production systems (safest approach)
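In code the combined rule is just a disjunction. The threshold values below are illustrative defaults, not fixed requirements of this skill.

python
def should_stop(se, n_items, theta_history,
                se_target=0.30, min_items=10, max_items=30, stable_delta=0.05):
    """Combined stopping rule: precision OR hard cap OR minimum length plus a stable estimate.

    theta_history: ability estimates recorded after each administered item.
    """
    if n_items >= max_items:
        return True
    if se < se_target:
        return True
    if n_items >= min_items and len(theta_history) >= 2:
        return abs(theta_history[-1] - theta_history[-2]) < stable_delta
    return False

print(should_stop(se=0.28, n_items=14, theta_history=[0.40, 0.50, 0.52]))  # True (SE below target)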


Practical Considerations

Starting Ability Estimate

Options:

  • Population mean (θ = 0)
  • Prior information (e.g., grade level, previous test)
  • Start with a medium-difficulty first question and estimate from there

Never start at the extremes (-3 or +3).

Handling Extreme Response Patterns

All correct or all incorrect: MLE fails (the estimate diverges to ±∞). Use EAP or a Bayesian prior to regularize.

Rapid changes: If ability estimate jumps >1.0, consider response anomaly (cheating, guessing).

Exposure Control

Track how often each item is used. Flag items used >20% of the time. Consider:

  • Randomesque selection (above)
  • Sympson-Hetter method (advanced)
  • Periodic item bank refresh
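A minimal tracking sketch for the first step above; the log format and the flag threshold are assumptions for illustration.

python
from collections import Counter

def flag_overexposed(administration_log, n_tests, max_rate=0.20):
    """Return item ids that appeared in more than max_rate of all tests.

    administration_log: iterable of (test_id, item_id) pairs.
    """
    counts = Counter(item_id for _, item_id in administration_log)
    return sorted(item for item, n in counts.items() if n / n_tests > max_rate)

# Item 7 appeared in 3 of 10 tests (30%), so it gets flagged:
log = [(1, 7), (2, 7), (3, 7), (4, 12), (5, 3)]
print(flag_overexposed(log, n_tests=10))   # [7]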

Multidimensional IRT (MIRT)

If testing multiple skills (e.g., algebra + geometry), use separate ability estimates per dimension. Select items to balance information across dimensions.

Warning: MIRT requires larger item banks and more complex calibration.


Common Mistakes

❌ Too few items in the bank → high exposure, security risk
✅ Aim for 10× the average test length

❌ Poorly distributed difficulties → accurate only in a narrow ability range
✅ Spread items across -2 to +2 difficulty

❌ Ignoring content balance → may skip important topics
✅ Build content constraints into item selection

❌ Using MLE with an all-incorrect response pattern → returns -∞
✅ Use EAP or cap estimates at -3/+3

❌ No exposure control → same items on every test
✅ Use randomesque or Sympson-Hetter


When to Load References

If you need to...

  • Calibrate new items (collect data, estimate parameters) → references/calibration.md
  • Implement the CAT algorithm (code patterns, libraries) → references/implementation.md

Real-World Example: K-12 Math Placement

Setup:

  • Item bank: 300 questions, b from -2 (basic) to +2 (advanced)
  • Target: SE < 0.35 or max 25 questions
  • Content: 40% algebra, 30% geometry, 30% statistics
  • Algorithm: Randomesque (top 5), EAP estimation

Flow:
  • Start at θ = 0 (grade-level average)
  • Select item: b ≈ 0, content area needed
  • Student answers → update ability estimate (EAP)
  • Select next: maximize information at new θ, respect content balance, randomesque from top 5
  • Stop when SE < 0.35 or 25 questions reached
  • Report: ability estimate + placement recommendation

Result: Average of 18 questions; 95% of students placed within ±0.5 grade levels of true ability.


Further Reading

  • Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems
  • Wainer, H. (2000). Computerized Adaptive Testing: A Primer (2nd ed.)
  • van der Linden, W. J., & Glas, C. A. W. (2010). Elements of Adaptive Testing

IRT packages:
  • Python: girth, catsim
  • R: mirt, TAM, catR
  • Production: Custom implementation or AdaptiveTest.io

Installation

Terminal bash

openclaw install adaptivetest
    

Tags

#personal_development #testing

Quick Info

Category Development
Model Claude 3.5
Complexity One-Click
Author woodstocksoftware
Last Updated 3/10/2026
🚀 Optimized for Claude 3.5

Ready to Install?

Get started with this skill in seconds

openclaw install adaptivetest