AIDiveForge

Self-Hosted LLM Evaluation & Benchmarks

As of June 2026, AIDiveForge tracks 3 self-hosted LLM evaluation & benchmark tools. Listings are verified against each tool's live website and re-checked regularly.

Last updated June 9, 2026 · 3 tools

  1. Bloom

    Bloom generates targeted evaluation suites for arbitrary behavioral traits.

    Free
  2. GEDD

    The vendor describes GEDD as a release-readiness tool for AI product managers and domain experts. A PM loads realistic launch-risk scenarios, the domain expert reviews the agent in the shape of the actual task, names failure modes in their own vocabulary, and the session exits with a release report plus a validated evaluation set. That loop converts qualitative judgment into regression gates usable in CI/CD. The ceiling appears when you need programmatic API access — GEDD exposes none, so teams that want to pipe evaluation results into downstream automation build that bridge themselves. Setup requires local installation via pip and depends on sagemaker-mlflow, grounded-evals, and mlflow.

    Free · Open Source
  3. HermesBench

    OpenResume is a browser-based resume builder and parser that keeps all data local: nothing is sent to a server, no account is required. You fill in a form, the tool renders an ATS-optimized PDF in real time, and you download it. The parser side lets you drop in an existing resume and see exactly how an automated screener will read it — which fields it finds, which it misses. The tool handles one job well. It does not support multiple resume versions with branching tailoring logic, and teams needing bulk generation or API-driven output will find no hooks to connect to.

    Free · Open Source

Listings on this page are sourced and verified by the AIDiveForge data pipeline. AIDiveForge is editorially independent — no money changes hands for inclusion.