---
name: prompt-ab-evaluator
description: Run two prompt variants against a fixed test set, score with a rubric LLM, and tell you which wins (and why).
title: Prompt A/B Evaluator
category: research-analysis
difficulty: advanced
license: MIT
author: admin
source_url: "https://github.com/openai/evals"
icon: ⚖️
input: structured-data
output: structured-json
phase: post
domain: research
tags: prompt-engineering,a-b-testing,evaluation-framework,llm-judging,pairwise-comparison,rubric-scoring,test-harness,statistical-analysis,comparative-analysis,quality-assurance
best_for:
  - Prompt optimization and iteration
  - Comparing LLM output quality at scale
  - Structured A/B testing of language models
  - Reducing prompt engineering guesswork
---

## Description

Evaluation harness packaged as a skill. Takes two prompt variants + a dataset of test inputs, runs both, scores the responses pairwise using a rubric LLM acting as judge, and reports the Elo-style win rate plus category-level breakdowns (correctness, brevity, style fit).

## Why it works

Prompt engineering without A/B evaluation is vibes-driven development. Pairwise judging is cheaper than absolute scoring, less biased than single-shot grading, and matches how humans actually compare outputs. By fixing the dataset and the judge model, you isolate the one variable you care about: the prompt change. The skill makes the whole loop a one-call operation instead of custom notebooks every time.

## How it works

1. Load the dataset: `[{input, tags?}...]`.
2. Run variant A and variant B against every input in parallel (bounded concurrency).
3. For each pair, call the judge LLM with the rubric and the two responses in random order; require JSON of the form `{winner: A|B|tie, criteria: {correctness, brevity, style, reasoning}, rationale}`.
4. Aggregate: per-criterion win rates, the overall win rate, and representative examples where each variant lost.
5. Bootstrap-sample the pairs to produce a 95% CI on the overall win rate.
6. Output a markdown report plus a CSV for further analysis.

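The bootstrap in step 5 is a standard percentile bootstrap over the per-input verdicts. A minimal sketch, assuming ties count as half a win for each side (a common convention, not something the skill spec mandates):

```python
import random


def bootstrap_win_rate_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for variant A's win rate.

    outcomes: list of "A", "B", or "tie" verdicts, one per test input.
    Returns (point_estimate, (ci_low, ci_high)).
    """
    rng = random.Random(seed)

    def win_rate(sample):
        # Ties scored as 0.5 for A (assumption; see lead-in).
        score = sum(1.0 if o == "A" else 0.5 if o == "tie" else 0.0
                    for o in sample)
        return score / len(sample)

    point = win_rate(outcomes)
    # Resample the pairs with replacement and recompute the win rate.
    stats = sorted(
        win_rate([rng.choice(outcomes) for _ in outcomes])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, (lo, hi)
```

A wide interval here is itself a finding: it usually means the test set is too small to separate the two variants.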