Open Source Inference Engines & Infra

As of June 2026, AIDiveForge tracks 22 curated open source inference engines and infrastructure projects. Each project has a verified public source repository. Listings are verified against each tool's live website and re-checked regularly.

Last updated June 12, 2026 · 22 tools

1. AGEF
The specification defines a content-addressed, Merkle-linked event structure so every decision in an agent session can be hashed, bundled, and checked offline — no live service required. The reference implementation is Akmon (v2.0.0 and later), which handles bundle export, import, and journaling via akmon-journal. AGEF is a format standard, not a deployed platform: there is no SaaS, no API, and no hosted verification service. Teams adopting it are taking on the work of building or integrating bundle-producing substrates into their existing agent infrastructure. At v0.1.1, the spec is pre-stable — conformance profiles and bundle structure are defined, but tooling outside the Akmon reference implementation is essentially absent.
Free · Open Source
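The content-addressed, hash-linked idea behind the AGEF entry can be illustrated with a toy hash chain, the simplest Merkle-linked structure. This is a sketch only, not the AGEF bundle format: the field names (`id`, `parent`, `payload`), the canonical JSON encoding, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json


def event_id(parent, payload):
    """Content address: SHA-256 over a canonical JSON encoding (assumed)."""
    body = json.dumps({"parent": parent, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(chain, payload):
    """Link a new event to the previous one by embedding its hash."""
    parent = chain[-1]["id"] if chain else None
    chain.append({"id": event_id(parent, payload),
                  "parent": parent, "payload": payload})
    return chain


def verify_chain(chain):
    """Recompute every hash offline; any edit or reorder breaks a link."""
    parent = None
    for event in chain:
        if event["parent"] != parent:
            return False
        if event["id"] != event_id(parent, event["payload"]):
            return False
        parent = event["id"]
    return True


# Hypothetical session with two recorded decisions.
chain = []
append_event(chain, {"step": "plan", "decision": "read config"})
append_event(chain, {"step": "act", "decision": "call tool"})
```

Because each identifier covers both the payload and the parent's identifier, `verify_chain` needs nothing but the bundle itself, which is what makes offline checking possible.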
2. AgentMeter
AgentMeter runs locally — no cloud sync, no account creation, no vendor dashboard to log into — and parses the tool calls, token counts, and caching splits that CLI agents like Claude Code, Gemini CLI, Codex CLI, and Copilot CLI generate. It surfaces the three-tier cost structure that prompt caching creates (input, cached-input, and output tokens each priced differently), which the raw API bill flattens into noise. The value-multiplier calculation compares API spend against estimated developer time saved, giving you a number to put in front of a manager. The wall appears when you need alerting, real-time budget enforcement, or integration with a team billing system — none of that is here.
Free · Open Source
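The three-tier cost structure and the value multiplier in the AgentMeter entry come down to simple arithmetic. The sketch below is not AgentMeter's code: the per-million-token rates are placeholders rather than any provider's real prices, and the multiplier formula (value of time saved divided by API spend) is an assumed reading of "compares API spend against estimated developer time saved".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens, one rate per tier."""
    input: float
    cached_input: float
    output: float


def session_cost(rates, input_tokens, cached_tokens, output_tokens):
    """Price each tier separately instead of averaging them together."""
    total = (input_tokens * rates.input
             + cached_tokens * rates.cached_input
             + output_tokens * rates.output)
    return total / 1_000_000


def value_multiplier(api_spend, hours_saved, hourly_rate):
    """Assumed formula: value of developer time saved per dollar of spend."""
    if api_spend <= 0:
        raise ValueError("api_spend must be positive")
    return (hours_saved * hourly_rate) / api_spend


# Placeholder rates and token counts, chosen only to show the tiers.
rates = Rates(input=3.0, cached_input=0.3, output=15.0)
cost = session_cost(rates, input_tokens=200_000,
                    cached_tokens=1_800_000, output_tokens=50_000)
multiplier = value_multiplier(cost, hours_saved=2, hourly_rate=80)
```

With these placeholder numbers the cached tier holds most of the tokens but less than a third of the cost, which is the kind of split a single line on an invoice hides.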
3. Atlas Inference Engine
The vendor page benchmarks Atlas at 3.1x the decode throughput of vLLM on Nvidia DGX Spark hardware — 111 tok/s average versus 37 tok/s on Qwen3.5-35B, with a cold start measured in two minutes instead of ten. That gap exists because Atlas ships no Python, no PyTorch, and no JIT warm-up: every path from HTTP request to kernel dispatch is compiled. The tradeoff is hardware specificity — hand-tuned CUDA kernels target Blackwell SM120/121, so teams not running DGX Spark get none of the headline numbers. The model matrix covers Qwen, Gemma, Nemotron, Mistral, and MiniMax, but every recipe is written for that hardware profile. Teams running other GPU generations are not the audience.
Free · Open Source
4. Beacon
Beacon is an open-source endpoint telemetry layer that runs locally alongside AI agents, capturing prompts, tool calls, file modifications, and approval workflows before that activity goes unrecorded. It normalizes that telemetry and forwards it to SIEM platforms like Wazuh, Elastic, or Splunk, so security teams can apply the same detection logic they already run against the rest of the fleet. The architecture is self-hosted by design: no data leaves the endpoint unless you route it there yourself. The project is early-stage; the plugin ecosystem covers the major local agent harnesses, but gaps exist for less common runtimes. Teams with agents not yet on the supported list must write custom collector plugins, which means more surface area to maintain.
Free · Open Source
5. Bitloops
Bitloops runs as a local CLI that builds a semantic model of your codebase and captures AI interactions (prompts, reasoning, decisions), then links them to the Git commits they produced. The vendor describes it as an intelligence layer sitting between your repository and your agents, so Claude Code, Cursor, Codex, or Copilot pull structured context instead of crawling raw source. Everything stays local: no cloud proxy, no data leaving your environment. The constraint enforcement pillar is listed as coming soon, which means teams that need automated rule enforcement on generated code are betting on a roadmap item, not a shipping feature. This is early-stage tooling with clear architectural intent, but the feature set reflects a pre-seed project, with a key pillar still on the roadmap.
Free · Open Source
6. Deep Memory
The library pairs a GraphRAG implementation with a Vocabulary system: a shared, schema-enforced dictionary of node types, relationship labels, and property constraints that every agent queries before writing. The result is consistent graph data across sessions without prompting every agent with walls of example documents; the schema replaces the examples, trimming token overhead. Backends include Neo4j, SQL Server, Azure Cosmos DB, and an in-memory option, all with Docker Compose quickstarts described in the docs. The ceiling: there is no hosted service, no GUI, and no API surface. This is a library you embed and operate, which means your team owns the infra from day one.
Free · Open Source
7. Engram
Engram sits between your IDE and its file reads, maintaining a local SQLite summary of your codebase so agents pull compressed context instead of raw files. The vendor states an 89% measured token reduction. It installs via npm, runs locally with zero cloud dependency, and connects to Claude Code, Cursor, Cline, Continue, Aider, Codex, Windsurf, and Zed through a combination of OpenVSX extensions, an Anthropic plugin, and adapter scripts. The bug-prevention layer surfaces past mistakes from revert history before the agent touches that code path again. This is a passive interceptor, not an agent — it does not plan tasks or run autonomously.
Free · Open Source
8. Flightdeck
Every LLM call, MCP event, and tool invocation your agents make streams to a live dashboard — per-agent timelines and a fleet-wide feed, not batched logs you dig through after the incident. The vendor describes token budgets and MCP allow/block rules you set before problems hit, plus the ability to issue live directives to running agents without restarting them. The self-hosted, Apache-2.0 model means no telemetry leaves your infrastructure — critical for teams in regulated environments or those burned by SaaS observability vendors billing by event volume. The project is early-stage by star count, and the operational surface you take on by self-hosting is real.
Free · Open Source
9. HarvestGuard
The system fuses live satellite vegetation indices, rainfall anomaly data, and WFP food security indicators, then routes that combined signal through Claude to produce country-level crop failure risk assessments. Docker handles deployment; an Anthropic API key handles the inference. For an NGO standing up a proof-of-concept or a research institution prototyping AI plus Earth observation, the architecture is legible and the cost surface is clear — you pay for API calls, not a platform license. The wall appears when you need operational guarantees: this is a single-maintainer GitHub project with one star, no issue history, and no documented accuracy benchmarks against historical famine events. Teams that need auditable model provenance or SLA-backed uptime will hit that ceiling fast.
Free · Open Source
10. Honcho
Every message written to Honcho triggers automatic reasoning via the vendor's Neuromancer model, which learns user psychology and behavioral patterns rather than just indexing text. The `context()` call returns a curated summary plus conversation history shaped to a token budget you set — the vendor claims 60–90% token reduction versus naive retrieval. Multi-participant sessions model each peer separately, so a group conversation doesn't collapse everyone's state into one blob. The ceiling appears when you need reasoning beyond user memory — Honcho does not run tasks, make decisions, or coordinate agents; it only informs them. Teams building full autonomous pipelines still wire Honcho into a separate orchestration layer.
Paid · Open Source
11. llama.cpp
llama.cpp is a C/C++ inference engine that runs quantized LLMs entirely on local hardware, from an Apple Silicon laptop to an H100 cluster to a Jetson edge device, using the same codebase with hand-tuned kernels for each backend. No API keys, no telemetry, no requests leaving the machine. It exposes an OpenAI-compatible server via `llama-server`, which means drop-in compatibility with tooling already pointed at OpenAI endpoints. The ceiling appears when you need the inference engine to do more than infer: there is no planning loop, no tool-calling orchestration, no agent layer built in. Teams building autonomous workflows bolt a framework on top, which means they are maintaining two systems.
Free · Open Source
12. local-deep-research
The tool autonomously plans and executes multi-step research tasks: it queries sources, follows citations, synthesizes findings, and returns results with full attribution, all without a cloud handoff. The vendor reports ~95% on SimpleQA benchmarks using models like Qwen3-27B on a single RTX 3090, which gives you a concrete hardware target. It pulls from 10+ search backends including arXiv, PubMed, and private document collections. Where it breaks: running capable local models demands real GPU headroom, and teams without that hardware will either fall back to weaker models or route queries to cloud LLMs, at which point the privacy guarantee depends entirely on which cloud endpoint they configure. The 109 open issues and 210 open pull requests on GitHub signal an active, fast-moving codebase; production stability requires version pinning.
Free · Open Source
13. LocalAI
LocalAI is a self-hosted, MIT-licensed stack that exposes an OpenAI-compatible REST API from your own hardware. Language model inference, image generation, audio, semantic search via LocalRecall, and autonomous agents via LocalAGI all run without a network call leaving your machine. The modular design pulls backends on demand, so you don't install inference engines you don't use. The wall appears at model selection and hardware sizing: you need at least 10GB of RAM and enough disk for the models you want to run, and the quality ceiling is set by what open-weight models can actually do. Teams needing GPT-4-class reasoning on constrained hardware eventually look elsewhere.
Free · Open Source
14. MTPLX
The vendor states a 2.24× decode speedup on Qwen3-27B running on an M5 Max MacBook Pro, achieved by using the model's own built-in MTP heads as the drafter, with no second model loaded and no external checkpoint to maintain. Acceptance is handled via Leviathan–Chen rejection sampling with a residual max(0, p − q) correction, verified bit-exact against single-token autoregressive output. It serves an OpenAI- and Anthropic-compatible API, so downstream tooling like Claude Code, Cline, or the openai-python SDK connects without shims. The wall appears immediately if you leave Apple Silicon: the runtime is explicitly Apple Silicon only, and the custom Metal kernels have no CUDA path.
Free · Open Source
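The accept-or-resample rule named in the MTPLX entry can be sketched for a single token position. This is the textbook speculative sampling step, not MTPLX's Metal implementation: `p` stands for the target model's distribution, `q` for the drafter's, and both are plain Python lists chosen for illustration.

```python
import random


def residual(p, q):
    """Normalized positive part of (p - q), sampled from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection can never occur.
        return list(p)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Return one token distributed according to p, using a draft from q."""
    vocab = range(len(p))
    draft = rng.choices(vocab, weights=q)[0]
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft
    # Otherwise resample from the residual, which restores the mass q missed.
    return rng.choices(vocab, weights=residual(p, q))[0]
```

The design point is that accepted drafts plus residual resamples together follow `p` exactly, so the drafter affects speed but not the output distribution. The sketch demonstrates that distributional property only; it says nothing about the bit-exactness check the vendor describes.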
15. Ollama
Ollama downloads open-weight models like Llama 2 and Mistral and runs them on your own hardware: no API calls, no subscriptions, no data leaving your machine. The pitch is straightforward: you get inference without the per-token pricing or rate limits of cloud services. The catch: performance depends entirely on your CPU or GPU, and setup requires comfort with command-line tools and roughly 10GB of disk space per model. It's free, but you're trading convenience and speed for privacy and control.
Paid · Open Source
16. RAGFlow
Open-source RAG engine with deep document understanding, hybrid search, and agentic workflow orchestration.
Paid · Open Source
17. RiskKernel
Deployed as a single Go binary, it sits in front of your existing OpenAI, Anthropic, or LangChain stack via a one-variable proxy — no rewrite required. Every call is metered and checkpointed, so a killed or crashed run resumes from the last saved state instead of re-spending from zero. The human-approval gate routes irreversible tool calls for sign-off over CLI, web, or webhook before they fire, and the LLM cannot bypass it because the gate lives in compiled code, not a prompt. The hosted dashboard is private beta only; teams that need a UI today are self-managing.
Free · Open Source
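Two ideas in the RiskKernel entry, checkpointed resume and an approval gate that lives in code rather than in a prompt, can be sketched together. RiskKernel itself is a Go binary; the Python below is only an illustration, and the tool names, checkpoint file layout, and `approve` callback are all hypothetical.

```python
import json
import os
import tempfile

IRREVERSIBLE = {"delete_branch", "send_payment"}  # hypothetical tool names


def run(steps, checkpoint_path, approve):
    """Execute steps in order, resuming after the last checkpointed index."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)["done"]
    executed = []
    for index in range(done, len(steps)):
        tool = steps[index]
        # The gate is ordinary code, so model output cannot talk its way past it.
        if tool in IRREVERSIBLE and not approve(tool):
            break
        executed.append(tool)  # stand-in for actually calling the tool
        with open(checkpoint_path, "w") as fh:
            json.dump({"done": index + 1}, fh)
    return executed


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "checkpoint.json")
steps = ["read_file", "send_payment", "write_report"]
first = run(steps, path, approve=lambda tool: False)   # halts at the gate
second = run(steps, path, approve=lambda tool: True)   # resumes from checkpoint
```

The second call does not repeat `read_file`, because the checkpoint already records it as done. That is the "resume instead of re-spending" behavior in miniature.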
18. Selvedge
Selvedge is a local MCP server that AI coding agents (Claude Code, Cursor, Copilot) call as they work, logging the reasoning behind every change into a SQLite file that lives next to your code under .selvedge/. Queries are entity-scoped — you ask about users.email or deps/stripe, not line numbers — so the answer surfaces in the same terms you search in. The vendor describes zero telemetry, no accounts, and no external servers; everything stays on disk. The wall appears when your team needs cross-repo provenance or wants to pipe this data into an existing observability stack — Selvedge emits records but does not integrate with those systems out of the box.
Free · Open Source
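Entity-scoped lookup, as described in the Selvedge entry, is easy to picture with SQLite. The schema below is invented for illustration and is not Selvedge's actual table layout; the rows are made-up sample data, and an in-memory database stands in for the on-disk file.

```python
import sqlite3

# In-memory stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE changes (entity TEXT, reasoning TEXT, recorded_at TEXT)"
)
conn.executemany(
    "INSERT INTO changes VALUES (?, ?, ?)",
    [
        ("users.email", "Added unique index to stop duplicate signups", "2026-05-01"),
        ("deps/stripe", "Bumped SDK for new webhook signature scheme", "2026-05-03"),
        ("users.email", "Lowercase on write to normalize lookups", "2026-05-09"),
    ],
)


def history(conn, entity):
    """Return (date, reasoning) rows for one entity, oldest first."""
    rows = conn.execute(
        "SELECT recorded_at, reasoning FROM changes "
        "WHERE entity = ? ORDER BY recorded_at",
        (entity,),
    )
    return rows.fetchall()
```

Calling `history(conn, "users.email")` returns only the rows for that entity, keyed by the name a developer would search for rather than by file and line number.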
19. Spanlens
Spanlens sits in front of your LLM provider via a single baseURL change, recording every call's cost, latency, tokens, and full request-response body with no SDK rewrite required. Agent runs surface as waterfall span trees so you can identify the one step consuming 80% of wall-clock time. The model recommender flags GPT-4o calls that look like classification tasks and shows the cost delta if you swap — with numbers from your own traffic, not benchmarks. The eval and experiment layer lets you replay a fixed dataset across prompt versions before you ship, so quality regressions don't surprise you in production. PII scanning and anomaly detection run at log time, which matters when sensitive data crosses the wire at 3 a.m. with nobody watching.
Paid · Open Source
20. Supermemory
Supermemory wraps memory, retrieval, user profiling, data connectors, and document extraction into one API so your agent doesn't reassemble context from scratch on every request. The retrieval layer claims sub-300ms latency using hybrid search with reranking, and the memory layer maintains a knowledge graph that merges contradictions and evolves facts over time rather than appending chunks blindly. Connectors to Slack, Notion, Drive, Gmail, GitHub, and S3 sync automatically — no ETL pipeline to maintain. The core memory engine is proprietary and hosted-only; self-hosting requires an enterprise agreement, so teams with strict data residency requirements hit a wall before they ship.
Paid · Open Source
21. vLLM
vLLM's core mechanism is PagedAttention, which the docs describe as a paged memory management approach for the KV cache — the part of GPU memory that normally fragments and wastes capacity at scale. Continuous batching sits on top of that, keeping the GPU fed instead of waiting for a fixed batch to fill. The result, per vendor benchmarks at perf.vllm.ai, is significantly higher throughput per GPU than naive serving setups. It exposes an OpenAI-compatible REST API, so existing client code needs no rewrite. The ceiling arrives when you need multi-node tensor parallelism beyond what your hardware topology supports, or when you're serving models on non-NVIDIA silicon — AMD ROCm and CPU paths exist, but community reports suggest NVIDIA CUDA gets the fastest fixes and the deepest optimization.
Free · Open Source
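The paged KV-cache idea in the vLLM entry can be sketched as a toy block allocator. This is not vLLM's implementation: the class name, method names, and block size are assumptions, and the sketch tracks only which block each token lands in, not the tensors themselves.

```python
class PagedKVCache:
    """Toy allocator: token positions map to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token, allocating a block on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # Current block is full (or none exists yet): take a free one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are claimed one at a time and need not be contiguous, a sequence wastes at most one partially filled block, and memory freed by a finished request is immediately usable by any other. That on-demand allocation is also what lets continuous batching admit new requests mid-flight.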
22. Xinference
Open-source library for unified deployment and serving of language, speech, and multimodal models across diverse hardware and infrastructure.
Free · Open Source

Listings on this page are sourced and verified by the AIDiveForge data pipeline. AIDiveForge is editorially independent — no money changes hands for inclusion.