GEDD
Pricing
- Model: Free
Summary
Most AI agent eval workflows break down the moment you hand a raw trace log to a domain expert and ask them to score it — the interface wasn't built for the task they're judging, so feedback becomes vague and unusable. GEDD is an open-source evaluation workbench from AWS that restructures that review into scenario-first sessions where expert feedback becomes structured, repeatable criteria.
The vendor describes GEDD as a release-readiness tool for AI product managers and domain experts. A PM loads realistic launch-risk scenarios, the domain expert reviews the agent in the shape of the actual task, names failure modes in their own vocabulary, and the session exits with a release report plus a validated evaluation set. That loop converts qualitative judgment into regression gates usable in CI/CD. The ceiling appears when you need programmatic API access — GEDD exposes none, so teams that want to pipe evaluation results into downstream automation build that bridge themselves. Setup requires local installation via pip and depends on sagemaker-mlflow, grounded-evals, and mlflow.
Bottom line: Pick this when you have a domain expert whose judgment you trust but no structured way to capture it before launch; plan a separate integration layer when you need evaluation results flowing automatically into your deployment pipeline.
Pros
- Scenario-first review interface shaped to the actual task, so domain experts surface failure modes that a generic metric table would miss, the kind a support team only discovers after the first escalation wave.
- Converts unstructured expert feedback into structured evaluation criteria during the session itself, so the output is a validated eval set teams can reuse as regression gates rather than a pile of sticky notes.
- Task-specific evaluation interfaces are configurable per agent type, which means a clinical reviewer and a code-review expert each see a surface built for their judgment rather than a one-size table that fits neither.
- MIT-0 license with full source available on GitHub, so teams running in air-gapped or regulated environments can audit and deploy without a vendor dependency or contract.
- Produces a release report at session end, giving product managers a documented artifact for go/no-go decisions instead of synthesizing scattered reviewer notes by hand.
Cons
- GEDD exposes no API. Teams that need evaluation outcomes consumed automatically (scoring thresholds feeding a deployment gate, results written to a data store, metrics surfaced in a dashboard) must build that extraction layer on top of the tool. Once a team is maintaining both GEDD and a custom integration wrapper, the total maintenance burden often pushes them toward an evaluation framework that ships API access out of the box.
- There is no hosted option: every team installs and runs its own instance locally, with three pip dependencies (sagemaker-mlflow, grounded-evals, mlflow). For small teams without an ML infrastructure owner, standing up and maintaining that environment is a recurring friction point, not a one-time cost.
- The project is an AWS sample repository, not a managed AWS service. Issues and pull requests are the support surface. Teams that hit an undocumented setup problem or edge-case behavior have no escalation path beyond GitHub — which fails at the worst time: the sprint before a production launch.
About
- Platforms: AWS (Bedrock, SageMaker, AgentCore); Python
- API Available: No
- Self-Hosted: Yes
- Last Updated: 2026-06-09 10:47 UTC
Best For
Who it's for
- Product managers evaluating AI agent readiness
- Domain experts defining evaluation criteria
- Teams transitioning from manual to automated agent evaluation
- Organizations needing task-shaped evaluation interfaces
- AWS-native ML/AI teams using Bedrock and SageMaker
What it does well
- Evaluating AI agents before production launch with domain expert review
- Converting unstructured expert feedback into structured evaluation criteria
- Building task-specific evaluation interfaces for different agent types
- Detecting domain-specific failure modes missed by generic evaluation metrics
- Generating regression test gates for agent quality in CI/CD pipelines
Frequently Asked Questions
- Is GEDD free?
- Yes — GEDD is fully free to use. There is no paid tier.
- Is GEDD open source?
- Yes. GEDD is open source under the MIT-0 license, with full source available on GitHub.
- Can I self-host GEDD?
- Yes. GEDD is self-hosted only: it runs locally on your own infrastructure, and there is no vendor-hosted option.
- When was GEDD released?
- GEDD was first released in 2025.
- What platforms does GEDD support?
- GEDD is a Python tool that targets AWS services (Bedrock, SageMaker, AgentCore).
Generic eval dashboards show agents as rows of metrics — which tells you nothing about whether the agent actually handles the edge cases your domain expert would catch in five minutes. GEDD reframes evaluation as a scenario-first session: a PM loads realistic launch-risk scenarios, a domain expert reviews the agent in the shape of the task it actually performs, names the failure modes in their own language, and the session produces a release report alongside a validated evaluation dataset. The core workflow is a review-and-annotation workbench, not an automated scorer — a human stays in the loop at every judgment call.
The differentiating design choice is what the docs call task-shaped interfaces. Rather than forcing a healthcare reviewer to score a clinical agent through the same generic table used for a code assistant, GEDD allows teams to build evaluation interfaces that match the specific agent type being judged. That surface area change is the reason domain experts can name failure modes that generic metrics miss entirely — the interface speaks the task’s vocabulary, not the eval framework’s.
GEDD fits teams crossing the gap from ad-hoc manual review toward structured, repeatable evaluation — particularly product managers who own release readiness but lack a rubric, and AWS-native teams already running Bedrock or SageMaker. The tool runs locally; there is no vendor-hosted SaaS. It carries an MIT-0 license and is self-hosted from the GitHub source. Where it breaks: GEDD offers no API, so teams that need evaluation results consumed programmatically by a CI/CD pipeline must build that extraction layer themselves. At that point they are maintaining two systems.
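To make the "extraction layer" concrete, here is a minimal sketch of the kind of regression gate a team might write on its own. Everything in it is an assumption: GEDD documents no export format, so the result shape (a list of records with `criterion` and `passed` keys), the `gate` function, and the `min_pass_rate` parameter are all hypothetical names chosen for illustration, not part of GEDD.

```python
# Hypothetical schema: GEDD exposes no API, so this assumes a team has
# transcribed session results into records like
# {"criterion": "cites-source", "passed": True}. None of these names
# come from GEDD itself.


def gate(results, min_pass_rate=1.0):
    """Return (ok, failed_criteria) for a list of criterion results."""
    if not results:
        # An empty evaluation set should never pass a release gate.
        return False, []
    failed = [r["criterion"] for r in results if not r["passed"]]
    passed_count = len(results) - len(failed)
    return passed_count / len(results) >= min_pass_rate, failed


if __name__ == "__main__":
    sample = [
        {"criterion": "cites-source", "passed": True},
        {"criterion": "refuses-out-of-scope", "passed": False},
    ]
    ok, failed = gate(sample)
    for name in failed:
        print(f"FAILED: {name}")
    print("release gate:", "pass" if ok else "fail")
```

In a CI job, the boolean would typically be turned into a process exit code so that a failed criterion blocks the deploy step. How results get from a GEDD session into that list is exactly the part each team has to build and maintain.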
Installation depends on three pip packages — sagemaker-mlflow, grounded-evals, and mlflow — documented in the repository’s SETUP.md. The project is an AWS sample, not a managed AWS service, which means support is community-driven through GitHub issues rather than AWS support channels.
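Before a review session, it can help to confirm that the three dependencies are actually present in the environment. The sketch below uses only Python's standard `importlib.metadata`. It assumes the installed distribution names match the package names given above exactly, and `check_dependencies` is an illustrative helper, not something the repository provides; SETUP.md remains the authoritative install guide.

```python
from importlib import metadata

# Package names as listed in the text. That they match the installed
# distribution names exactly is an assumption.
REQUIRED = ("sagemaker-mlflow", "grounded-evals", "mlflow")


def check_dependencies(names=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in check_dependencies().items():
        print(f"{name}: {version or 'not installed'}")
```

Running the check at the start of a session notebook or script gives a quick signal when the local environment has drifted, which matters more here because there is no hosted instance to fall back on.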
