Yes — Bloom is fully free to use. There is no paid tier.

Is Bloom open source?

No — Bloom is a closed-source tool. Source code is not publicly available.

Does Bloom have an API?

Yes. Bloom exposes a developer API. See the official documentation at https://anthropic.com for details.

Can I self-host Bloom?

Yes. Bloom supports self-hosting on your own infrastructure.

When was Bloom released?

Bloom was first released in 2025.

What platforms does Bloom support?

Bloom is available on: Python; integrates with Anthropic and OpenAI models via LiteLLM; supports Weights & Biases.

Visit Bloom

Screenshots 2

Bloom

FreeAPISelf-HostedAgentic

Summary

Anthropic's evaluation framework that auto-generates targeted test suites to measure specific AI behaviors at scale.

Bloom tackles a real gap in AI safety work: most evaluations are either broad (and miss edge cases) or manual (and don't scale). This tool generates reproducible test scenarios for narrow, high-stakes behaviors—like refusal consistency or prompt injection resilience—then scores models across thousands of permutations. The catch is structural: it's only as good as the seed prompts and judging logic you feed it, so teams still need to validate results manually for ambiguous cases. Free to use, but assumes you have both the domain expertise to set it up and the infrastructure to run it.

Bottom line: *Use this if you're a safety researcher or red-teaming group who needs systematic, repeatable measurements of specific model risks.*

Pricing Plans

Free

Free Tier: No limits; fully open-source

Open Source

Free

Freely available open-source framework

Full agentic evaluation pipeline
Four-stage system (Understanding, Ideation, Rollout, Judgment)
Weights & Biases integration
Inspect transcript export
Interactive transcript viewer

View full pricing on anthropic.com →

Pricing may have changed since last verified. Check the official site for current plans.

Community Performance Report Card

No community ratings yet. Be the first to rate this tool!

Best For: Regression testing, release gating, and tracking mitigations over time, AI safety and alignment research teams, Studying narrow but critical risks that may be missed by broader evaluations, Evaluating frontier AI models for specific behavioral traits, Automating evaluation suite generation without manual engineering

Community Benchmarks Community

No community benchmarks yet. Be the first to share a real-world data point.

Coding Assistants Large Language Models LLM Evaluation & Benchmarks Test Generation

Released December 20, 2025

Pros

Reproducible and targeted evaluations that quantify frequency and severity across automatically generated scenarios
Evaluations correlate strongly with hand-labelled judgments and reliably separate baseline models from intentionally misaligned ones
Researchers can extensively configure Bloom's behavior, through choosing models for each stage, adjusting interactions' length and modality
Using Bloom evaluations took only a few days to conceptualize, refine and generate
Integrates with Weights & Biases for experiments at scale and exports Inspect-compatible transcripts

Cons

Bloom is only as robust as the seeds and judging logic that power it; teams should treat seeds as living governance artifacts, and for ambiguous or highly contextual behaviors, periodic manual review is still necessary
Bloom's evaluation suite is unlikely to match the precise distribution of scenarios found in existing benchmarks, and since model behavior can be sensitive to context and prompt variations, direct comparisons are unreliable

Community Reviews

No reviews yet. Be the first to share your experience.

About

Platforms: Python; integrates with Anthropic and OpenAI models via LiteLLM; supports Weights & Biases
Languages: Python
API Available: Yes
Self-Hosted: Yes
Last Updated: 2026-04-20T21:16:31.339Z

Best For

Who it's for

Regression testing, release gating, and tracking mitigations over time
AI safety and alignment research teams
Studying narrow but critical risks that may be missed by broader evaluations
Evaluating frontier AI models for specific behavioral traits
Automating evaluation suite generation without manual engineering

What it does well

Measuring behaviors like delusional sycophancy, long-horizon sabotage, self-preservation and self-preferential bias
Regression testing, release gating, and tracking mitigations over time
Investigating jailbreak susceptibility, self-preferential bias, and long-horizon sabotage risks
Quantifying frequency and severity of target behaviors across generated scenarios
Baseline model comparison and intentional misalignment detection

Integrations

Weights & Biases for experiments at scale; Inspect-compatible transcripts; LiteLLM backend for model API calls supporting Anthropic and OpenAI models

Discussion Community

No discussion yet. Sign in to start the conversation.

Compare Bloom

Spotted incorrect or missing data? Join our community of contributors.

Community Notes & Tips Community

Be the first to contribute. General notes, observations, gotchas, and tips from people who use this tool day-to-day.

Frequently Asked Questions

Is Bloom free?: Yes — Bloom is fully free to use. There is no paid tier.
Is Bloom open source?: No — Bloom is a closed-source tool. Source code is not publicly available.
Does Bloom have an API?: Yes. Bloom exposes a developer API. See the official documentation at https://anthropic.com for details.
Can I self-host Bloom?: Yes. Bloom supports self-hosting on your own infrastructure.
When was Bloom released?: Bloom was first released in 2025.
What platforms does Bloom support?: Bloom is available on: Python; integrates with Anthropic and OpenAI models via LiteLLM; supports Weights & Biases.