Prompt A/B Evaluator
Runs two prompt variants against a fixed test set, scores each pair of responses with a rubric-driven LLM judge, and tells you which variant wins (and why).
🧠 Why it works
Prompt engineering without A/B evaluation is vibes-driven development. Pairwise judging is usually cheaper than absolute scoring, tends to be less biased than single-shot grading, and matches how humans actually compare outputs. By fixing the dataset and the judge model, you isolate the one variable you care about: the prompt change. The skill turns the whole loop into a one-call operation instead of a custom notebook every time.
⚙️ How it works
1) Load the dataset: [{input, tags?}...].
2) Run variant A and variant B in parallel against every input (bounded concurrency).
3) For each pair, call the judge LLM with a rubric and the two responses in random order; require JSON {winner: A|B|tie, criteria: {correctness, brevity, style, reasoning}, rationale}.
4) Aggregate: per-criterion win rate, overall win rate, and example pairs for each direction (inputs where A beat B, and inputs where B beat A).
5) Bootstrap-resample the pairs to produce a 95% CI on the win rate.
6) Output a markdown report plus a CSV for further analysis.
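The loop above can be sketched in a few dozen lines. This is a minimal illustration, not the skill's actual implementation: `run_pair`, `win_rate`, `bootstrap_ci`, and `evaluate` are hypothetical names, the variants and the judge are assumed to be async callables you supply, ties are assumed to count as half a win, and the interval is a plain percentile bootstrap. It tracks only the overall winner and leaves per-criterion scores and report writing out.

```python
import asyncio
import json
import random
from collections import Counter


async def run_pair(item, variant_a, variant_b, judge, sem, rng):
    """Run both variants on one input, then judge them in random order."""
    async with sem:  # bounded concurrency: at most N pairs in flight
        out_a, out_b = await asyncio.gather(
            variant_a(item["input"]), variant_b(item["input"])
        )
        swapped = rng.random() < 0.5  # hide which variant is shown first
        first, second = (out_b, out_a) if swapped else (out_a, out_b)
        # Assumption: the judge returns a JSON string with a "winner" field,
        # where "A" means the first response shown and "B" the second.
        verdict = json.loads(await judge(item["input"], first, second))
    winner = verdict["winner"]
    if swapped and winner in ("A", "B"):
        winner = "B" if winner == "A" else "A"  # map back to the true variant
    return winner


def win_rate(winners, variant="B"):
    """Share of pairs won by `variant`; ties count as half a win (assumption)."""
    counts = Counter(winners)
    return (counts[variant] + 0.5 * counts["tie"]) / len(winners)


def bootstrap_ci(winners, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the win rate."""
    rng = random.Random(seed)
    n = len(winners)
    stats = sorted(
        win_rate(rng.choices(winners, k=n)) for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


async def evaluate(dataset, variant_a, variant_b, judge, max_concurrency=4, seed=0):
    """Judge every input pairwise and summarise variant B's win rate."""
    sem = asyncio.Semaphore(max_concurrency)
    rng = random.Random(seed)
    winners = await asyncio.gather(
        *(run_pair(it, variant_a, variant_b, judge, sem, rng) for it in dataset)
    )
    return {
        "win_rate_b": win_rate(winners),
        "ci95": bootstrap_ci(winners),
        "counts": dict(Counter(winners)),
    }
```

Calling `asyncio.run(evaluate(dataset, variant_a, variant_b, judge))` returns B's win rate, its 95% interval, and the raw win/tie counts. Randomizing the display order and mapping the verdict back keeps a judge that favors one position from favoring one variant.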
Description
Evaluation harness packaged as a skill. Takes two prompt variants and a dataset of test inputs, runs both, scores the responses pairwise using a rubric LLM acting as judge, and reports the Elo-style win rate plus per-criterion breakdowns (correctness, brevity, style, reasoning).
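The description does not define "Elo-style". One possible reading, offered here as an assumption, is that the win rate is also expressed as the rating gap it implies under the standard Elo expected-score formula, E = 1 / (1 + 10^(-d/400)). The hypothetical helper `elo_gap` below inverts that formula:

```python
import math


def elo_gap(win_rate: float) -> float:
    """Rating difference implied by a win rate under the Elo expected-score formula.

    Assumption: `win_rate` already counts ties as half a win.
    """
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))
```

A 50% win rate maps to a gap of 0, and a 75% win rate maps to a gap of roughly 191 points. With only two variants this is a monotone rescaling of the win rate, so it changes how the result reads, not which variant wins.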
Install this skill
A Claude skill is a skill.md file with YAML frontmatter and a markdown body.
Drop the file into your tool of choice — or pick a different format if you use Cursor, Windsurf, Copilot, or something else.
mkdir -p ~/.claude/skills/prompt-ab-evaluator \
&& curl -L https://aidiveforge.com/skill/prompt-ab-evaluator.skill-md \
-o ~/.claude/skills/prompt-ab-evaluator/skill.md
Save to ~/.claude/skills/prompt-ab-evaluator/skill.md