License: MIT-0 (any use, including commercial)

Local-run terms: The MIT-0 (MIT No Attribution) license allows unrestricted use, modification, and distribution without attribution requirements. It is suitable for commercial and proprietary use.

GEDD

Free, Open Source, Self-Hosted

Pricing

Model: Free

Summary

Most AI agent eval workflows break down the moment you hand a raw trace log to a domain expert and ask them to score it — the interface wasn't built for the task they're judging, so feedback becomes vague and unusable. GEDD is an open-source evaluation workbench from AWS that restructures that review into scenario-first sessions where expert feedback becomes structured, repeatable criteria.

The vendor describes GEDD as a release-readiness tool for AI product managers and domain experts. A PM loads realistic launch-risk scenarios, the domain expert reviews the agent in the shape of the actual task, names failure modes in their own vocabulary, and the session exits with a release report plus a validated evaluation set. That loop converts qualitative judgment into regression gates usable in CI/CD. The ceiling appears when you need programmatic API access — GEDD exposes none, so teams that want to pipe evaluation results into downstream automation build that bridge themselves. Setup requires local installation via pip and depends on sagemaker-mlflow, grounded-evals, and mlflow.

Bottom line: Pick this when you have a domain expert whose judgment you trust but no structured way to capture it before launch; plan a separate integration layer when you need evaluation results flowing automatically into your deployment pipeline.
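Because GEDD exposes no API, the integration layer mentioned above is something a team writes itself. The following is a minimal sketch of such a release gate, not GEDD's own mechanism. It assumes the validated evaluation set has already been exported into records with hypothetical `scenario_id`, `criterion`, and `passed` fields; GEDD's actual output format is not described here, so `CriterionResult`, `release_gate`, and the 0.95 threshold are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    """One expert-defined criterion checked on one scenario (hypothetical schema)."""
    scenario_id: str
    criterion: str
    passed: bool


def release_gate(
    results: list[CriterionResult], min_pass_rate: float = 0.95
) -> tuple[bool, float, list[CriterionResult]]:
    """Return (gate_ok, pass_rate, failures).

    An empty result set fails closed: no evidence means no release.
    """
    if not results:
        return False, 0.0, []
    failures = [r for r in results if not r.passed]
    pass_rate = 1 - len(failures) / len(results)
    return pass_rate >= min_pass_rate, pass_rate, failures
```

In a CI job, a thin wrapper could call `release_gate` on the exported records and exit non-zero when the first element of the returned tuple is `False`, printing the failing criteria so the reviewer's vocabulary shows up in the build log.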

Category: Large Language Models, LLM Evaluation & Benchmarks

Released 2025

Pros

Scenario-first review interface shaped to the actual task, so domain experts surface failure modes that a generic metric table would miss: the kind a support team otherwise discovers only after the first escalation wave.
Converts unstructured expert feedback into structured evaluation criteria during the session itself, so the output is a validated eval set teams can reuse as regression gates rather than a pile of sticky notes.
Task-specific evaluation interfaces are configurable per agent type, which means a clinical reviewer and a code-review expert each see a surface built for their judgment rather than a one-size table that fits neither.
MIT-0 license with full source available on GitHub, so teams running in air-gapped or regulated environments can audit and deploy without a vendor dependency or contract.
Produces a release report at session end, giving product managers a documented artifact for go/no-go decisions instead of synthesizing scattered reviewer notes by hand.

Cons

GEDD exposes no API. Teams that need evaluation outcomes consumed automatically (scoring thresholds feeding a deployment gate, results written to a data store, metrics surfaced in a dashboard) must build that extraction layer on top of the tool. Once a team is maintaining both GEDD and a custom integration wrapper, the combined maintenance burden can push it toward an evaluation framework that ships API access out of the box.
GEDD installs locally with three pip dependencies (sagemaker-mlflow, grounded-evals, mlflow), and there is no hosted option, so every team runs its own instance. For small teams without an ML infrastructure owner, standing up and maintaining that environment is a recurring friction point, not a one-time cost.
The project is an AWS sample repository, not a managed AWS service. Issues and pull requests are the support surface. Teams that hit an undocumented setup problem or edge-case behavior have no escalation path beyond GitHub, which hurts most in the sprint before a production launch.

About

Platforms: AWS (Bedrock, SageMaker, AgentCore); Python
API Available: No
Self-Hosted: Yes
Last Updated: 2026-06-09T10:47:19.447Z

Best For

Who it's for

Product managers evaluating AI agent readiness
Domain experts defining evaluation criteria
Teams transitioning from manual to automated agent evaluation
Organizations needing task-shaped evaluation interfaces
AWS-native ML/AI teams using Bedrock and SageMaker

What it does well

Evaluating AI agents before production launch with domain expert review
Converting unstructured expert feedback into structured evaluation criteria
Building task-specific evaluation interfaces for different agent types
Detecting domain-specific failure modes missed by generic evaluation metrics
Generating regression test gates for agent quality in CI/CD pipelines

Integrations

Amazon Bedrock, Bedrock AgentCore, SageMaker MLflow, Claude (Haiku 4.5), AWS S3, AWS IAM, CI/CD systems

Frequently Asked Questions

Is GEDD free? Yes. GEDD is fully free to use, and there is no paid tier.
Is GEDD open source? Yes. GEDD is open source.
Can I self-host GEDD? Yes. GEDD supports self-hosting on your own infrastructure.
When was GEDD released? GEDD was first released in 2025.
What platforms does GEDD support? GEDD is available on AWS (Bedrock, SageMaker, AgentCore) and Python.

Generic eval dashboards show agents as rows of metrics — which tells you nothing about whether the agent actually handles the edge cases your domain expert would catch in five minutes. GEDD reframes evaluation as a scenario-first session: a PM loads realistic launch-risk scenarios, a domain expert reviews the agent in the shape of the task it actually performs, names the failure modes in their own language, and the session produces a release report alongside a validated evaluation dataset. The core workflow is a review-and-annotation workbench, not an automated scorer — a human stays in the loop at every judgment call.

The differentiating design choice is what the docs call task-shaped interfaces. Rather than forcing a healthcare reviewer to score a clinical agent through the same generic table used for a code assistant, GEDD lets teams build evaluation interfaces that match the specific agent type being judged. That change is why domain experts can name failure modes that generic metrics miss entirely: the interface speaks the task’s vocabulary, not the eval framework’s.
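One way to picture a task-shaped interface is as a review schema per agent type, with fields named in the reviewer's own vocabulary. The sketch below illustrates that idea only; it is not how GEDD implements it. `REVIEW_SCHEMAS`, the agent-type keys, and every field name are hypothetical.

```python
from collections import Counter

# Hypothetical schemas: each agent type gets review fields in its own vocabulary.
REVIEW_SCHEMAS = {
    "clinical": ("dosage_correct", "contraindications_flagged"),
    "code_review": ("bug_identified", "suggestion_compiles"),
}


def missing_fields(agent_type: str, review: dict[str, bool]) -> list[str]:
    """Return schema fields the reviewer has not yet judged for this agent type."""
    try:
        fields = REVIEW_SCHEMAS[agent_type]
    except KeyError:
        raise ValueError(f"no review schema for agent type {agent_type!r}") from None
    return [field for field in fields if field not in review]


def failure_mode_counts(reviews: list[dict[str, bool]]) -> Counter:
    """Count how often each field was marked False across a session's reviews."""
    return Counter(
        field for review in reviews for field, ok in review.items() if not ok
    )
```

Counting failed fields across a session is one simple way to turn individual judgments into candidate criteria: a field that fails repeatedly is a named failure mode worth promoting to a regression check.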

GEDD fits teams crossing the gap from ad-hoc manual review toward structured, repeatable evaluation — particularly product managers who own release readiness but lack a rubric, and AWS-native teams already running Bedrock or SageMaker. The tool runs locally; there is no vendor-hosted SaaS. It carries an MIT-0 license and is self-hosted from the GitHub source. Where it breaks: GEDD offers no API, so teams that need evaluation results consumed programmatically by a CI/CD pipeline must build that extraction layer themselves. At that point they are maintaining two systems.

Installation depends on three pip packages — sagemaker-mlflow, grounded-evals, and mlflow — documented in the repository’s SETUP.md. The project is an AWS sample, not a managed AWS service, which means support is community-driven through GitHub issues rather than AWS support channels.
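Before launching a local instance, it can help to confirm that the three packages are present in the active environment. The snippet below is a small sketch using the standard library's `importlib.metadata`. The package names are taken from this write-up and should be checked against SETUP.md; `REQUIRED` and `installed_versions` are illustrative names, not part of GEDD.

```python
from importlib import metadata
from typing import Optional

# Distribution names as listed in this write-up; SETUP.md is the authoritative source.
REQUIRED = ("sagemaker-mlflow", "grounded-evals", "mlflow")


def installed_versions(packages=REQUIRED) -> dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running the file as a script prints one line per package, which makes a missing dependency visible before the first review session rather than during it.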
