Nightwatch
Summary
A 3am page with fifty alerts for one outage — and your runbook still leaves the diagnosis entirely to you. ninoxAI is an open-source, local-first AI SRE agent that collapses that alert storm into a single incident, traces the root cause across live systems, and hands you a ranked fix list before you've finished your second coffee.
The agent runs a ReAct loop: it calls tools against your live infrastructure — Kubernetes, Docker, AWS, Grafana, GitHub — reasons over what it finds, and produces ranked remediation proposals that sit in a queue waiting for your sign-off before anything touches production. Read-only investigation is the hard constraint by design, which means the agent cannot act unilaterally. That boundary is a feature for regulated or risk-averse teams and a ceiling for teams that want closed-loop auto-remediation. Self-hosted and air-gap friendly, with local inference support, it fits environments where data never leaves the building.
Bottom line: Pick this for a regulated or on-prem environment where you need agentic root-cause investigation with human sign-off on every fix — but if your team wants the agent to close the loop and execute remediations automatically, you will hit a deliberate architectural wall.
Pricing Plans
Subscription
Last verified 2 days ago
- Price
- $20/mo
- Free Tier
- 300k events included, 14 days lookback, unlimited applications, unlimited environments, unlimited seats, email support
Free
For small applications
- 300k events included
- 14 days lookback
- Unlimited applications
- Unlimited environments
- Unlimited seats
- Email support
- 1 performance monitor
Pro
For growing applications
- 7.5m events included
- 30 days lookback
- Unlimited applications
- Unlimited environments
- Unlimited seats
- Email support
- 10 thresholds per application
- Discounted additional events at $0.35 per 100k
Team
For teams with large applications
- 30m events included
- 60 days lookback
- Unlimited applications
- Unlimited environments
- Unlimited seats
- Email support
- 20 thresholds per application
- Discounted additional events at $0.35 per 100k
Business
For large teams with high usage
- 180m events included
- 90 days lookback
- Unlimited applications
- Unlimited environments
- Unlimited seats
- Priority support
- 30 thresholds per application
- Discounted additional events at $0.20 per 100k
Enterprise
Tailored solutions for high-demand teams and applications
- Custom event limits
- Extended lookback periods
- Unlimited applications
- Unlimited environments
- Unlimited seats
- Priority support
Pricing may have changed since last verified. Check the official site for current plans.
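As a rough worked example of the overage arithmetic in the plans above, the sketch below estimates the additional-event charge for a given monthly volume. The included-event counts and per-100k rates are copied from this listing. The function name `overage_cost` and the rule that every started block of 100k events is billed in full are assumptions for illustration, not confirmed billing behavior. Base plan prices are left out because the listing does not give one per plan.

```python
import math

# Figures copied from the plan list above. Base monthly prices are omitted
# because the listing does not state them per plan.
PLANS = {
    "Pro": {"included": 7_500_000, "overage_per_100k": 0.35},
    "Team": {"included": 30_000_000, "overage_per_100k": 0.35},
    "Business": {"included": 180_000_000, "overage_per_100k": 0.20},
}


def overage_cost(plan: str, events: int) -> float:
    """Estimate the overage charge in dollars for one month.

    Assumption: each started block of 100k extra events is billed in full.
    """
    p = PLANS[plan]
    extra = max(0, events - p["included"])
    blocks = math.ceil(extra / 100_000)
    return round(blocks * p["overage_per_100k"], 2)
```

Under these assumptions, 10 million events on Pro is 2.5 million over the allowance, or 25 blocks at $0.35, so `overage_cost("Pro", 10_000_000)` returns `8.75`.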
Pros
- Alert clustering and deduplication across Prometheus, Checkmk, Zabbix, and similar sources, so a fifty-alert storm from one outage becomes one incident to investigate rather than fifty tickets to triage.
- ReAct-pattern autonomous investigation across live Kubernetes, Docker, AWS, and Grafana contexts, which means the agent does the pivot-between-tools work that currently costs your on-call engineer thirty minutes of manual correlation at 3am.
- Read-only operation with human-gated fix proposals, so the agent cannot cause a second incident while investigating the first — the fix queue waits for your explicit approval before anything runs.
- Self-hosted and local-inference capable under Apache-2.0, which means audit logs stay on your infrastructure and the tool works in air-gapped or regulated environments where data cannot leave the building.
- Noisy check detection with tuning recommendations backed by evidence from the investigation, so you reduce the alert volume that caused the storm in the first place rather than just surviving this one.
Cons
- The agent never executes fixes: it proposes them and stops. Teams that want closed-loop auto-remediation, where the agent identifies and resolves an incident without human intervention, hit this wall on the first use case that requires it. There is no configuration to change this behavior; it is architectural. Those teams move to platforms that support write-mode tool execution.
- The integration surface covers Docker, Kubernetes, AWS, Grafana, GitHub, and Git — if your stack runs on Azure, GCP-native tooling, PagerDuty, or Datadog as the primary alert source, you are writing custom connectors before the agent investigates its first incident.
- Community adoption is early-stage — the repo shows under 100 stars and zero open pull requests at the time of listing. Teams that depend on community-sourced fixes for edge cases, third-party integration guides, or prompt-tuning advice will find precious little outside the repo docs and a Discord.
About
- Platforms
- Linux, macOS, Windows (via Docker, Kubernetes, on-prem VMs)
- API Available
- Yes
- Self-Hosted
- Yes
- Last Updated
- 2026-06-09
Best For
Who it's for
- Teams with complex multi-source alert stacks (Prometheus, Checkmk, Zabbix, etc.)
- Organizations requiring human-gated remediation (regulatory, risk-averse)
- On-premise or air-gapped environments needing local inference
- Teams wanting agentic investigation without auto-execute risk
What it does well
- Alert aggregation and deduplication from multiple monitoring systems
- Root cause analysis using live system introspection
- Noisy check detection and tuning recommendations with evidence
- Ranked fix proposals for human approval before remediation
- On-prem and edge deployments requiring local-first operation
Integrations
Frequently Asked Questions
- Is Nightwatch free?
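- The listing shows a Free plan (300k events included, 14 days lookback) alongside a listed price of $20/mo for paid use. The code is Apache-2.0 licensed and can be self-hosted.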
- Is Nightwatch open source?
- Yes. Nightwatch is open source under the Apache-2.0 license.
- Does Nightwatch have an API?
- Yes. Nightwatch exposes a developer API. See the official documentation at https://github.com/ninoxai/nightwatch for details.
- Can I self-host Nightwatch?
- Yes. Nightwatch supports self-hosting on your own infrastructure.
- What platforms does Nightwatch support?
- Nightwatch is available on: Linux, macOS, Windows (via Docker, Kubernetes, on-prem VMs).
On-call fatigue isn’t a staffing problem — it’s a signal-to-noise problem. ninoxAI ingests alert streams from sources like Prometheus, Checkmk, and Zabbix, deduplicates and clusters them into coherent incidents, then launches an autonomous investigation loop using a tool-calling LLM. The agent queries live systems — container state, cluster health, deployment history, metrics dashboards — reasons over what it finds using a ReAct pattern, and surfaces a ranked list of proposed fixes. Nothing executes until a human approves it.
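To make the deduplicate-and-cluster step concrete, here is a minimal sketch of one way such grouping could work. It is not taken from the Nightwatch codebase: the `Alert` and `Incident` types, the `cluster_alerts` function, the `service` label, and the 300-second window are all hypothetical choices for illustration. The sketch groups alerts for the same service that arrive close together in time and keeps only one alert per source and check within each group.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    source: str       # e.g. "prometheus" or "zabbix"
    service: str      # hypothetical label naming the affected service
    check: str        # name of the firing check
    timestamp: float  # seconds since epoch


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def cluster_alerts(alerts, window_s=300.0):
    """Group alerts into incidents (illustrative heuristic only).

    An alert joins the open incident for its service if it arrives within
    window_s seconds of the last alert seen for that service. Within an
    incident, repeats of the same (source, check) pair are dropped.
    """
    incidents = []
    latest = {}  # service -> bookkeeping for its most recent incident
    for a in sorted(alerts, key=lambda x: x.timestamp):
        entry = latest.get(a.service)
        if entry is None or a.timestamp - entry["last_seen"] > window_s:
            entry = {"incident": Incident(a.service),
                     "last_seen": a.timestamp,
                     "keys": set()}
            incidents.append(entry["incident"])
            latest[a.service] = entry
        entry["last_seen"] = a.timestamp
        key = (a.source, a.check)
        if key not in entry["keys"]:  # deduplicate repeated firings
            entry["keys"].add(key)
            entry["incident"].alerts.append(a)
    return incidents
```

A real implementation would likely cluster on richer signals (labels, topology, dependency graphs) than a single service name and a fixed time window. The point here is only the shape of the reduction from many alerts to few incidents.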
The defining architectural choice is read-only, human-gated remediation. The agent is explicitly prevented from writing to production systems during investigation. This is not a limitation of ambition — the repo describes it as a core design principle. For teams operating under change-management requirements or in environments where an automated action at the wrong moment causes more damage than the original incident, this constraint is the entire point.
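The read-only, human-gated pattern can be sketched in a few lines. This is an illustration under stated assumptions rather than Nightwatch's actual implementation: `ReadOnlyToolbox`, `investigate`, `Proposal`, `execute`, and `demo_policy` are hypothetical names, and `demo_policy` is a scripted stand-in for the LLM that would normally choose the next tool call.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    action: str
    rank: int
    approved: bool = False  # flipped only by a human reviewer


class ReadOnlyToolbox:
    """Tool registry that refuses anything not declared read-only."""

    def __init__(self):
        self._tools = {}

    def register(self, name, fn, read_only):
        if not read_only:
            raise PermissionError(f"tool {name!r} mutates state; refused")
        self._tools[name] = fn

    def call(self, name, arg):
        return self._tools[name](arg)


def investigate(toolbox, policy, max_steps=5):
    """Minimal ReAct-style loop: reason, act, observe, repeat."""
    observations = []
    for _ in range(max_steps):
        step = policy(observations)  # reason: choose next action from evidence
        if step is None:             # policy decides it has seen enough
            break
        name, arg = step
        result = toolbox.call(name, arg)     # act: read-only tool call
        observations.append((name, result))  # observe: record the result
    return observations


def execute(proposal, runner):
    """Run a fix only after explicit human approval."""
    if not proposal.approved:
        raise PermissionError("proposal not approved by a human")
    return runner(proposal.action)


def demo_policy(observations):
    """Scripted stand-in for the LLM (hypothetical)."""
    if not observations:
        return ("pod_status", "checkout")
    if len(observations) == 1 and observations[0][1] == "CrashLoopBackOff":
        return ("recent_deploys", "checkout")
    return None
```

In this sketch the boundary is enforced in two places: the toolbox rejects write-capable tools at registration time, and `execute` rejects any proposal a human has not approved. That keeps the investigation loop unable to change anything even if the policy asks it to.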
The tool fits best when your alert volume is high enough that triage itself is the bottleneck, your infrastructure spans multiple tools that a human has to pivot between manually, and your risk posture requires a human to review before anything changes. It breaks down if you need closed-loop auto-remediation — the agent proposes but never executes, and there is no configuration path around that. Teams that need autonomous fix execution will reach that ceiling on their first real incident and look elsewhere.
On the integration side, the docs describe connectors for Docker, Kubernetes, AWS, Grafana, GitHub, and Git. Deployment is supported via Docker Compose and Kubernetes manifests, and the repo includes a Dockerfile for local inference, making air-gapped operation viable. The codebase is Apache-2.0 licensed with no vendor hosting layer — you run it, you own it.
