Bloom vs Resurf

Bloom and Resurf are both AI evaluation and testing tools tracked by AIDiveForge. Below is a side-by-side comparison of pricing, capabilities, platforms, and ownership, sourced from each tool's live website and verified before publishing.

Bloom

Bloom generates targeted evaluation suites for arbitrary behavioral traits.

Resurf

Testing framework providing deterministic, reproducible environments for AI browser agent validation with synthetic websites and failure-mode injection.

Attribute | Bloom | Resurf
Pricing | Free | Free
Free trial | No | No
Open source | No | Yes
Has API | Yes | No
Self-hosted option | Yes | Yes
Platforms | Python; integrates with Anthropic and OpenAI models via LiteLLM; supports Weights & Biases | Docker, Python, Node.js, Chromium
Languages: Python
Released: 2025-12-20
Pros
Bloom
  • Reproducible and targeted evaluations that quantify frequency and severity across automatically generated scenarios
  • Evaluations correlate strongly with hand-labelled judgments and reliably separate baseline models from intentionally misaligned ones
  • Researchers can extensively configure Bloom's behavior by choosing models for each stage and adjusting the length and modality of interactions
  • With Bloom, evaluations took only a few days to conceptualize, refine, and generate
  • Integrates with Weights & Biases for experiments at scale and exports Inspect-compatible transcripts
Resurf
  • Deterministic and reproducible test execution via SQLite reset and seeding
  • Failure-mode injection enables testing resilience without real-world dependencies
  • Auditable success evaluation based on database state rather than LLM judges
  • Multiple adapter support (browser-use, stagehand, vision-only)
  • Production-shaped synthetic site covers realistic flows (auth, multi-step checkout, returns)
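The "SQLite reset and seeding" and "database state rather than LLM judges" points can be illustrated with a minimal sketch. This is not Resurf's actual code or schema: the `orders` table, the seed rows, and the helper names `reset_and_seed`, `order_was_returned`, and `fake_agent_run` are all hypothetical, chosen only to show the general pattern of resetting to a known state before each run and then judging success by querying the database.

```python
import sqlite3

# Hypothetical schema and seed data; a real harness would define its own.
SCHEMA = """
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    sku TEXT NOT NULL,
    status TEXT NOT NULL
);
"""
SEED_ROWS = [(1, "sku-100", "pending"), (2, "sku-200", "shipped")]


def reset_and_seed(conn: sqlite3.Connection) -> None:
    """Drop and recreate the table, then insert fixed rows.

    Every run starts from the same state, which makes results repeatable.
    """
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO orders (id, sku, status) VALUES (?, ?, ?)", SEED_ROWS
    )
    conn.commit()


def order_was_returned(conn: sqlite3.Connection, order_id: int) -> bool:
    """Success check that reads database state instead of asking a model."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row is not None and row[0] == "returned"


def fake_agent_run(conn: sqlite3.Connection) -> None:
    """Stand-in for a browser agent that completes a return for order 2."""
    conn.execute("UPDATE orders SET status = 'returned' WHERE id = 2")
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    reset_and_seed(conn)
    fake_agent_run(conn)
    print("task succeeded:", order_was_returned(conn, 2))
```

Because the verdict comes from a plain SQL query, anyone can re-run the check and inspect the rows that produced it, which is the auditability benefit the list refers to.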
Cons
Bloom
  • Bloom is only as robust as the seeds and judging logic that power it; teams should treat seeds as living governance artifacts, and for ambiguous or highly contextual behaviors, periodic manual review is still necessary
  • Bloom's evaluation suite is unlikely to match the precise distribution of scenarios found in existing benchmarks, and since model behavior can be sensitive to context and prompt variations, direct comparisons are unreliable
Resurf
  • Early v0 release with single synthetic site (shop_v1); expanding to more domains requires additional content work
  • Limited to e-commerce domain in current version
  • Requires Docker, Python 3.11+, Node 20+ (for stagehand), and Chromium
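Before trying Resurf, the listed prerequisites can be checked with a short script. The sketch below is an assumption-laden illustration rather than anything shipped with the tool: the helper names are hypothetical, the executable names `docker` and `node` are assumed to be on `PATH`, and Chromium is left out because its binary name varies by platform.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # from the listed requirement "Python 3.11+"
MIN_NODE_MAJOR = 20   # from the listed requirement "Node 20+"
REQUIRED_TOOLS = ["docker", "node"]  # assumed executable names


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """True if the interpreter's (major, minor) meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def parse_node_major(version_text: str) -> int:
    """Extract the major version from `node --version` style output."""
    return int(version_text.strip().lstrip("v").split(".")[0])


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing tools:", missing_tools())
```

To check Node itself, pass the output of `node --version` to `parse_node_major` and compare the result against `MIN_NODE_MAJOR`.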
Bottom line

Resurf is open source; only Bloom exposes a public API. Choose based on which difference matters most for your workflow.

Comparison data is sourced and verified by the AIDiveForge data pipeline. AIDiveForge is editorially independent.