⚖️

Prompt A/B Evaluator

Research & Analysis · by AIDiveForge · Apr 17, 2026 · Advanced · ✓ 1 verified compat

Run two prompt variants against a fixed test set, score with a rubric LLM, and tell you which wins (and why).

🧠 Why it works

Prompt engineering without A/B evaluation is vibes-driven development. Pairwise judging is typically cheaper than absolute scoring, less biased than single-shot grading, and closer to how humans actually compare outputs. Fixing the dataset and the judge model isolates the one variable you care about: the prompt change. The skill turns the whole loop into a single call instead of a custom notebook for every comparison.

⚙️ How it works

1) Load the dataset: [{input, tags?}, ...].
2) Run variant A and variant B against every input in parallel, with bounded concurrency.
3) For each pair, call the judge LLM with a rubric and the two responses in random order; require JSON {winner: A|B|tie, criteria: {correctness, brevity, style, reasoning}, rationale}.
4) Aggregate: per-criterion win rate, overall win rate, and example inputs where A beat B and where B beat A.
5) Bootstrap-resample the pairs to produce a 95% CI on the win rate.
6) Output a markdown report plus a CSV for further analysis.
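
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the `judge` callable (returning "first", "second", or "tie") is a hypothetical stand-in for the rubric LLM call, the function names are invented here, and counting a tie as half a win is an assumption the page does not specify.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge signature: (task_input, first_response, second_response)
# -> "first" | "second" | "tie". A real harness would wrap the rubric LLM here.
Judge = Callable[[str, str, str], str]


def judge_pair(judge: Judge, task_input: str, resp_a: str, resp_b: str,
               rng: random.Random) -> str:
    """Present the two responses in random order and map the verdict back to A/B."""
    swapped = rng.random() < 0.5
    first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
    verdict = judge(task_input, first, second)
    if verdict == "tie":
        return "tie"
    if verdict == "first":
        # The first slot holds B when the order was swapped, A otherwise.
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def win_rate(verdicts: Sequence[str]) -> float:
    """Win rate of variant B; a tie counts as half a win (an assumption)."""
    score = {"B": 1.0, "tie": 0.5, "A": 0.0}
    return sum(score[v] for v in verdicts) / len(verdicts)


def bootstrap_ci(verdicts: Sequence[str], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for B's win rate, resampling whole pairs."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    rng = random.Random(seed)
    n = len(verdicts)
    stats = sorted(
        win_rate([verdicts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Under these assumptions, an interval for B's win rate that sits entirely above 0.5 suggests B is the better prompt on this dataset, and one that straddles 0.5 suggests the test set is too small to tell the variants apart.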

Description

Evaluation harness packaged as a skill. It takes two prompt variants and a dataset of test inputs, runs both variants, scores the responses pairwise with a rubric LLM acting as judge, and reports the pairwise win rate plus per-criterion breakdowns (correctness, brevity, style, reasoning).
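
One way the per-criterion breakdown might be computed from the judge's JSON is sketched below. The page does not say what each entry under `criteria` holds, so this assumes each criterion carries its own "A", "B", or "tie" label; `parse_verdict` and `criterion_breakdown` are hypothetical names used only for illustration.

```python
import json
from collections import Counter

CRITERIA = ("correctness", "brevity", "style", "reasoning")
LABELS = ("A", "B", "tie")


def parse_verdict(raw: str) -> dict:
    """Parse one judge reply and reject anything outside the expected shape."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict) or data.get("winner") not in LABELS:
        raise ValueError("winner must be one of A, B, tie")
    criteria = data.get("criteria")
    if not isinstance(criteria, dict):
        raise ValueError("criteria must be an object")
    for name in CRITERIA:
        # Assumption: every criterion is labelled A, B, or tie.
        if criteria.get(name) not in LABELS:
            raise ValueError(f"criterion {name!r} must be one of A, B, tie")
    return data


def criterion_breakdown(verdicts: list[dict]) -> dict[str, dict[str, float]]:
    """Share of A wins, B wins, and ties for each criterion."""
    total = len(verdicts)
    breakdown = {}
    for name in CRITERIA:
        counts = Counter(v["criteria"][name] for v in verdicts)
        breakdown[name] = {label: counts[label] / total for label in LABELS}
    return breakdown
```

Validating each reply before aggregating keeps a malformed judge response from silently skewing the per-criterion rates; a real harness would likely retry or log such replies rather than just raise.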

Install this skill

A Claude skill is a skill.md file with YAML frontmatter and a markdown body. Drop the file into your tool of choice — or pick a different format if you use Cursor, Windsurf, Copilot, or something else.

Download skill.md
mkdir -p ~/.claude/skills/prompt-ab-evaluator \
  && curl -L https://aidiveforge.com/skill/prompt-ab-evaluator.skill-md \
       -o ~/.claude/skills/prompt-ab-evaluator/skill.md

Save to ~/.claude/skills/prompt-ab-evaluator/skill.md
