RunbookHermes
Pricing
- Model
- Free
Summary
Incident response falls apart when the gap between 'something is wrong' and 'we know why' takes longer than the outage itself — and most on-call tooling just pages people faster without doing the diagnosis work. RunbookHermes is an MIT-licensed AIOps agent that closes that gap by autonomously correlating metrics, logs, and traces, proposing evidence-backed remediation, and requiring a human sign-off before anything executes.
The agent runs multi-signal diagnosis across observability data, builds a root-cause hypothesis, and generates or updates runbooks from what it learns, so the next incident with the same failure pattern starts from a documented baseline instead of a blank slate. The approval-gated remediation workflow means no automated action executes without a reviewer, which matters when the blast radius is a production service. Where it breaks: the repo is five commits deep with zero open issues, which signals early-stage software, not battle-hardened infrastructure. Teams with complex multi-service topologies will likely run into integration gaps before they reach the limits of the agent's reasoning. Self-hosting is required, so running it adds a deployment and maintenance surface that your platform team owns.
Bottom line: Pick RunbookHermes for an SRE team that wants an autonomous first-responder to triage and document incidents while a human stays in the loop — but expect to build integrations yourself if your observability stack is anything beyond what the repo ships with.
Pros
- Evidence-driven root-cause hypothesis before remediation is proposed, so the on-call engineer reviews a reasoned diagnosis instead of raw signal noise, which means a sign-off decision can take seconds rather than requiring an independent investigation.
- Approval-gated execution model, so automated remediation actions cannot ship to production without a reviewer in the loop — which avoids the class of incidents caused by runaway automation acting on a misdiagnosis.
- Runbook generation and learning from live incidents, so operational knowledge accumulates in structured documentation rather than living exclusively in the memory of whoever was paged — which matters when the person who handled the last incident is on vacation for the next one.
- MIT license with full self-hosted deployment, so the agent and its incident data stay inside your own infrastructure — which removes the vendor-access and data-residency concerns that block AIOps adoption in regulated environments.
- Multi-signal ingestion across metrics, logs, and traces, so the agent correlates evidence across observability layers rather than diagnosing from a single data source — which reduces false-positive root-cause conclusions from incomplete signal.
Cons
- The repository has five commits and no closed issues, which means there is no public evidence of the agent performing correctly under real production incident load. Teams that need a vetted tool will have to run their own failure-mode testing before trusting it on a live on-call rotation.
- Integration coverage is bounded by what the observability MCP toolserver ships with; teams running Datadog, Honeycomb, or custom telemetry pipelines that fall outside that surface will write and maintain their own integration connectors — at which point they are owning a non-trivial piece of the agent's input layer.
- There is no community or commercial support path documented in the repo; when the agent produces a wrong root-cause hypothesis or the approval workflow misbehaves at 3 AM, the escalation path is the GitHub repo and whatever institutional knowledge your team has built — teams that require SLA-backed support or vendor escalation will move to a commercial AIOps platform instead.
About
- Platforms
- Linux, macOS, Docker, Kubernetes
- API Available
- Yes
- Self-Hosted
- Yes
- Last Updated
- 2026-06-09T05:56:26.804Z
Best For
Who it's for
- Organizations seeking autonomous incident response with human oversight
- Teams wanting to reduce MTTR while maintaining safety
- Engineering cultures that treat incidents as learning opportunities
- Multi-service deployments with observability data integration
- SRE and Platform Engineering teams
What it does well
- Production incident response and root-cause analysis
- Evidence-driven remediation with human approval gates
- Automated runbook generation and SRE knowledge capture
- Multi-signal incident diagnosis from metrics, logs, and traces
- Team training on fault patterns and operational procedures
Frequently Asked Questions
- Is RunbookHermes free?
- Yes — RunbookHermes is fully free to use. There is no paid tier.
- Is RunbookHermes open source?
- Yes. RunbookHermes is open source under the MIT license.
- Does RunbookHermes have an API?
- Yes. RunbookHermes exposes a developer API. See the official documentation at https://github.com/tommy-yw/runbookhermes for details.
- Can I self-host RunbookHermes?
- Yes. RunbookHermes supports self-hosting on your own infrastructure.
- What platforms does RunbookHermes support?
- RunbookHermes is available on: Linux, macOS, Docker, Kubernetes.
Incident diagnosis is the part of on-call that burns people out: pulling signals from three different dashboards at 2 AM, manually correlating a latency spike in traces with a log error from ten minutes earlier, then writing up a postmortem that nobody reads before the same pattern hits again. RunbookHermes addresses this as a Hermes-native AIOps agent: it autonomously ingests multi-signal observability data — metrics, logs, and traces — constructs an evidence-driven root-cause hypothesis, proposes a remediation action, and waits for a human to approve before executing. The runbook it produces from that incident becomes the starting point for the next one.
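The manual correlation described above (a latency spike in traces matched against a log error from ten minutes earlier) can be sketched as a time-window join. This is a minimal illustration only, not RunbookHermes code: the `Signal` shape, the `correlate` function, and the 15-minute lookback are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    """One observation from an observability source (hypothetical shape)."""
    source: str        # "metrics", "logs", or "traces"
    service: str
    timestamp: datetime
    detail: str


def correlate(anchor: Signal, candidates: list[Signal],
              lookback: timedelta = timedelta(minutes=15)) -> list[Signal]:
    """Return candidates from other sources, on the same service, that
    occurred within `lookback` before the anchor, oldest first."""
    window_start = anchor.timestamp - lookback
    related = [
        s for s in candidates
        if s.service == anchor.service
        and s.source != anchor.source
        and window_start <= s.timestamp <= anchor.timestamp
    ]
    return sorted(related, key=lambda s: s.timestamp)


# Illustrative data: a trace spike and three earlier events.
spike = Signal("traces", "checkout", datetime(2024, 1, 1, 2, 10), "p99 latency spike")
events = [
    Signal("logs", "checkout", datetime(2024, 1, 1, 2, 0), "connection pool exhausted"),
    Signal("logs", "checkout", datetime(2024, 1, 1, 1, 30), "routine restart"),
    Signal("metrics", "search", datetime(2024, 1, 1, 2, 5), "CPU saturation"),
]
evidence = correlate(spike, events)
print([e.detail for e in evidence])  # ['connection pool exhausted']
```

Only the log line from ten minutes before the spike survives the filter: the restart falls outside the window and the CPU event belongs to a different service. A real agent would presumably weigh far more than timestamps and service names, so treat this as the simplest possible form of the idea.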
The defining feature is the approval-gated remediation loop. The agent does not act autonomously end-to-end — it reasons and proposes, then the reviewer decides. This is architecturally meaningful for production environments where autonomous execution without oversight is a liability, not a feature. Combined with runbook learning, the system is designed to accumulate operational knowledge from real incidents rather than requiring an SRE team to maintain documentation separately from the work that generates it.
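The propose-then-approve loop can be illustrated with a small state machine. This is a sketch under stated assumptions, not the project's actual implementation: `Proposal`, `ApprovalGate`, `Status`, and the reviewer name are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


@dataclass
class Proposal:
    """A remediation the agent suggests, carrying its supporting evidence."""
    action: str
    evidence: list[str]
    status: Status = Status.PROPOSED
    reviewer: Optional[str] = None


class ApprovalGate:
    """Runs an action only after a named human has approved it."""

    def __init__(self, executor: Callable[[str], None]) -> None:
        self._executor = executor

    def review(self, proposal: Proposal, reviewer: str, approve: bool) -> None:
        if proposal.status is not Status.PROPOSED:
            raise ValueError("proposal has already been reviewed")
        proposal.reviewer = reviewer
        proposal.status = Status.APPROVED if approve else Status.REJECTED

    def execute(self, proposal: Proposal) -> None:
        if proposal.status is not Status.APPROVED:
            raise PermissionError("remediation requires human approval")
        self._executor(proposal.action)
        proposal.status = Status.EXECUTED


# Stand-in executor that just records what would have been run.
executed: list[str] = []
gate = ApprovalGate(executor=executed.append)
proposal = Proposal("restart checkout pods", ["connection pool exhausted"])

try:
    gate.execute(proposal)  # blocked: nobody has approved yet
except PermissionError as err:
    print(f"blocked: {err}")

gate.review(proposal, reviewer="on-call-sre", approve=True)
gate.execute(proposal)
print(executed)  # ['restart checkout pods']
```

The design point is that execution checks the proposal's status rather than trusting the caller, so a misdiagnosis cannot reach production unless a reviewer has explicitly moved the proposal to approved.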
RunbookHermes fits SRE and platform engineering teams who want to reduce mean time to resolution without removing human judgment from the remediation step. The repo includes a TUI gateway, web interface, observability MCP toolserver, ACP adapter, and plugin/skills architecture — indicating a modular design that supports extension. What the repo does not yet show is a track record at scale: with five commits and no closed issues, the gap between the architectural intent and production-hardened behavior is unknown. Teams running heterogeneous observability stacks should audit the integrations directory carefully before committing to this as a production dependency.
The project ships with Docker support, a Nix environment, Homebrew packaging, and an example environment configuration, so the self-hosting path is documented. The observability MCP toolserver in the repo is the integration surface for connecting the agent to live telemetry — the vendor describes this as Hermes-native, meaning the agent framework is purpose-built around this tool rather than layered on top of a generic agent SDK.
