Inference Engines & Infra With an API

As of June 2026, AIDiveForge tracks 47 inference engines & infra with an api. Curated inference engines & infra with an api tracked by AIDiveForge. Listings are verified against each tool's live website and re-checked regularly.

Last updated June 12, 2026 · 47 tools

1. Agent Governance Toolkit
Policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents.
Free
2. AgentRecall
AgentRecall is a memory layer that gives AI agents persistent context across sessions — so a support agent recalls a customer's past issue, a sales agent remembers where a deal stalled, and a coding assistant doesn't ask you to re-explain your architecture for the third time. The vendor describes a retrieval-and-storage infrastructure that indexes memories and surfaces relevant ones at query time, rather than stuffing the full conversation history into every prompt. The cloud tier caps at 1,000 stored memories, which is adequate for prototyping but a ceiling teams hit in production. Self-hosting under the MIT license removes that ceiling and keeps data inside your own infrastructure — the tradeoff is that you own the ops. API access covers JavaScript and Python environments.
Paid
3. AI Boost
MCP server for capturing and injecting developer expertise as searchable, reusable context for LLM agents.
Paid
4. Apertis
Apertis functions as an API gateway layer that sits between your coding agents — Cursor, Cline, Claude Code and the like — and the underlying model providers. You point your agent at one endpoint, authenticate once, and the platform handles provider routing, failover, and cost tracking behind it. The vendor states that automatic failover keeps production agents running when a provider has an outage, which removes a class of silent failures teams usually discover too late. The free tier covers basic models with no payment required; premium models and higher quotas are paid-only features. The platform is cloud-only — no self-hosted option — so your API traffic routes through Apertis infrastructure, and teams with data-residency requirements hit that wall immediately.
Paid
5. APIDot
The platform routes requests to multiple underlying AI models for image and video generation, handling the vendor-side complexity so your codebase talks to one interface instead of five. Async generation with webhook delivery means high-volume batch jobs don't block your application waiting on responses. Switching between providers is a config change, not a refactor. The ceiling appears when you need anything beyond generation pass-through — fine-tuning, custom model hosting, or output post-processing live outside what this layer provides. Teams needing those capabilities end up routing some requests through APIDot and others directly to vendors, which partially recreates the sprawl they were trying to eliminate.
Paid
6. APIMart
APIMart is a paid API gateway that routes requests to 500-plus models — including chat, image, video, and audio — through one OpenAI-compatible interface, with discounts the vendor states range from 30 to 70 percent off official provider pricing. You swap one base URL and keep your existing SDK. The catalog spans OpenAI, Anthropic, Google, ByteDance, Qwen, Kimi, and MiniMax, so switching between providers is a config change, not a refactor. The ceiling shows up when you need call-level control: APIMart is a passive gateway, not an orchestrator, so any branching logic, retries, or fallback chains live entirely in your own code. Teams building complex multi-step pipelines maintain that routing layer themselves.
Paid
7. Atlas Inference Engine
The vendor page benchmarks Atlas at 3.1x the decode throughput of vLLM on Nvidia DGX Spark hardware — 111 tok/s average versus 37 tok/s on Qwen3.5-35B, with a cold start measured in two minutes instead of ten. That gap exists because Atlas ships no Python, no PyTorch, and no JIT warm-up: every path from HTTP request to kernel dispatch is compiled. The tradeoff is hardware specificity — hand-tuned CUDA kernels target Blackwell SM120/121, so teams not running DGX Spark get none of the headline numbers. The model matrix covers Qwen, Gemma, Nemotron, Mistral, and MiniMax, but every recipe is written for that hardware profile. Teams running other GPU generations are not the audience.
FreeOpen Source
8. Cactus
Open-source inference engine for deploying AI models locally on mobile and edge devices with automatic cloud fallback.
Paid
9. Cognita
An open-source RAG framework for building and deploying scalable retrieval-augmented generation applications.
Free
10. Context Mode Insight
Context Mode is built to answer that question honestly. It sits between your AI coding tools and your engineering metrics, correlating actual usage patterns with sprint velocity, incident rates, and individual blockers surfaced through manager 1:1 data. The Remote MCP endpoint lets AI agents call live functions — engagement health checks, blocker detection — so a manager can ask a question in Claude and get a sourced answer instead of a stale report. The platform also generates compliance audit logs formatted for CISO reviews, which keeps security teams out of your sprint. The wall appears when your org is under 50 developers: the signal-to-noise ratio on correlations drops, and the per-seat cost structure stops making sense before the insights do.
Paid
11. Dify
Open-source LLM app development platform combining AI workflow, RAG pipeline, agent capabilities, model management, observability features and more.
Paid
12. Elysia
An open-source framework that spins up an end-to-end agentic RAG application with just two terminal commands.
Free
13. Engram
Engram sits between your IDE and its file reads, maintaining a local SQLite summary of your codebase so agents pull compressed context instead of raw files. The vendor states an 89% measured token reduction. It installs via npm, runs locally with zero cloud dependency, and connects to Claude Code, Cursor, Cline, Continue, Aider, Codex, Windsurf, and Zed through a combination of OpenVSX extensions, an Anthropic plugin, and adapter scripts. The bug-prevention layer surfaces past mistakes from revert history before the agent touches that code path again. This is a passive interceptor, not an agent — it does not plan tasks or run autonomously.
FreeOpen Source
14. Exogram
Exogram is an execution governance layer that intercepts AI agent actions — payments, database writes, customer emails, record updates — and applies a policy decision before anything hits your infrastructure. The vendor describes a four-way enforcement decision: allow, deny, escalate, or log. Policy rules are checked at runtime, not after the fact, which means a $25,000 invoice approval blocked against a $1,000 limit never reaches your payment system. The immutable audit trail is positioned for SOC 2, HIPAA, and financial compliance workflows. The tool is not itself an agent runner — it assumes you already have an agent; it governs what that agent is allowed to touch.
Paid
15. Gateplex
Gateplex is governance middleware: it does not run your agents, it watches them. The vendor describes it as a policy enforcement layer that intercepts agent actions — API calls, approvals, data sends — checks them against defined rules, and blocks or flags violations before execution completes. That distinction matters for regulated environments where post-hoc logging is not enough. The free tier covers three agents and a capped intercept volume per month, which fits a proof-of-concept but runs short the moment a second team deploys. Beyond that ceiling, teams move to a paid tier or hit a wall.
Paid
16. Google AI Studio Text-to-Speech
The studio gives you a browser-based workspace where you write prompts, adjust model parameters, compare outputs side-by-side, and generate an API key when the prototype is ready to leave the browser. Multimodal inputs — text, images, documents, and via Imagen and Veo, generated images and video — are handled in the same canvas, so a prototype that mixes modalities does not require stitching together separate tools. The free tier covers the studio itself; API calls beyond the free quota move to pay-as-you-go. Where it strains: the environment is built for Gemini, so any workflow that needs to swap providers or run a non-Google model hits a hard wall. Teams that outgrow single-model prototyping typically move prompt logic into code or a provider-agnostic framework.
Paid
17. HarvestGuard
The system fuses live satellite vegetation indices, rainfall anomaly data, and WFP food security indicators, then routes that combined signal through Claude to produce country-level crop failure risk assessments. Docker handles deployment; an Anthropic API key handles the inference. For an NGO standing up a proof-of-concept or a research institution prototyping AI plus Earth observation, the architecture is legible and the cost surface is clear — you pay for API calls, not a platform license. The wall appears when you need operational guarantees: this is a single-maintainer GitHub project with one star, no issue history, and no documented accuracy benchmarks against historical famine events. Teams that need auditable model provenance or SLA-backed uptime will hit that ceiling fast.
FreeOpen Source
18. Honcho
Every message written to Honcho triggers automatic reasoning via the vendor's Neuromancer model, which learns user psychology and behavioral patterns rather than just indexing text. The `context()` call returns a curated summary plus conversation history shaped to a token budget you set — the vendor claims 60–90% token reduction versus naive retrieval. Multi-participant sessions model each peer separately, so a group conversation doesn't collapse everyone's state into one blob. The ceiling appears when you need reasoning beyond user memory — Honcho does not run tasks, make decisions, or coordinate agents; it only informs them. Teams building full autonomous pipelines still wire Honcho into a separate orchestration layer.
PaidOpen Source
19. Intencion
The scraped page content provided does not match the tool described in the structured data — the page describes a travel photography app called Spotter, not an AI agent observability platform. No production details, integration specifics, or architectural constraints for this tool can be sourced from the supplied content. Accordingly, this listing cannot be completed to AIDiveForge accuracy standards without verified source material. All fields below are constructed from the structured tool data and validator context only, and any claims beyond those inputs would be fabricated.
Paid
20. LanceDB
Open-source embedded vector database for multimodal AI with billion-scale search on Lance columnar format.
Paid
21. llama.cpp
llama.cpp is a C/C++ inference engine that runs quantized LLMs entirely on local hardware, from an Apple Silicon laptop to an H100 cluster to a Jetson edge device, using the same binary and the same hand-tuned kernels across all of them. No API keys, no telemetry, no requests leaving the machine. It exposes an OpenAI-compatible server via `llama serve`, which means drop-in compatibility with tooling already pointed at OpenAI endpoints. The ceiling appears when you need the inference engine to do more than infer — there is no planning loop, no tool-calling orchestration, no agent layer built in. Teams building autonomous workflows bolt on a framework on top, which means they are maintaining two systems.
FreeOpen Source
22. LM Studio
LM Studio, built by Element Labs Inc., is a desktop and server runtime for running open-source LLMs — Qwen, Gemma, DeepSeek, gpt-oss, and others — entirely on local hardware, with no outbound API calls required. The GUI lets you download and chat with models in minutes; the headless CLI tool `llmster` extends the same runtime to Linux servers, cloud VMs, and CI pipelines with no interface overhead. An OpenAI-compatible API layer means existing code talking to OpenAI endpoints can be redirected to a local LM Studio server with minimal changes. The ceiling appears when you need the model to do something at scale: high-throughput production inference, fine-tuning, or multi-tenant serving — none of those are what this tool is built for.
Paid
23. local-deep-research
The tool autonomously plans and executes multi-step research tasks: it queries sources, follows citations, synthesizes findings, and returns results with full attribution — all without a cloud handoff. The vendor reports ~95% on SimpleQA benchmarks using models like Qwen3-27B on a single RTX 3090, which gives you a concrete hardware target. It pulls from 10+ search backends including arXiv, PubMed, and private document collections. Where it breaks: running capable local models demands real GPU headroom, and teams without that hardware will either throttle to weaker models or route queries to cloud LLMs — at which point the privacy guarantee depends entirely on which cloud endpoint they configure. The 109 open issues and 210 open pull requests on GitHub signal an active but fast-moving codebase; production stability requires version pinning.
FreeOpen Source
24. LocalAI
LocalAI is a self-hosted, MIT-licensed stack that exposes an OpenAI-compatible REST API from your own hardware. Language model inference, image generation, audio, semantic search via LocalRecall, and autonomous agents via LocalAGI all run without a network call leaving your machine. The modular design pulls backends on demand, so you don't install inference engines you don't use. The wall appears at model selection and hardware sizing: you need at least 10GB of RAM and enough disk for the models you want to run, and the quality ceiling is set by what open-weight models can actually do. Teams needing GPT-4-class reasoning on constrained hardware eventually look elsewhere.
FreeOpen Source
25. Memori
The vendor states Memori classifies each chat turn into facts, preferences, rules, and summaries, then pulls targeted snippets at recall time rather than re-injecting full history. On the LoCoMo benchmark, the docs report 81.95% accuracy while cutting token usage by 95% versus full-context retrieval — a meaningful number if your cost problem is upstream of the model choice. The memory graph shows how entities connect across sessions, and every recall result ships with lineage explaining why that snippet was included, which matters when an enterprise audit asks why the agent said what it said. The ceiling appears when your retrieval logic needs fine-grained control the SDK's zero-configuration defaults don't expose — teams at that point are writing wrapper logic to compensate. Self-hosted deployment is available, so organizations with data-residency requirements are not locked into the cloud path.
Paid
26. ModelHub API
ModelHub is a hosted API gateway that puts 45 Chinese and global LLMs — DeepSeek V4, Qwen 3, GLM-4, Doubao, Kimi — behind a single OpenAI-compatible endpoint. You swap your base_url, keep your existing SDK, and your token bill drops. The vendor states prompts are never stored and payments run through Paddle under PCI Level 1 certification. The ceiling appears fast: no self-hosted option, no agentic tooling, no fine-tuning surface. Teams that need dedicated infrastructure or low-latency SLAs will exhaust what the service offers and contact the Enterprise tier — or leave.
Paid
27. MTPLX
The vendor states a 2.24× decode speedup on Qwen3-27B running on an M5 Max MacBook Pro, achieved by using the model's own built-in MTP heads as the drafter — no second model loaded, no external checkpoint to maintain. Acceptance is handled via Leviathan–Chen rejection sampling with a residual (p − q)+ correction, verified bit-exact against single-token autoregressive output. It serves an OpenAI- and Anthropic-compatible API, so downstream tooling like Claude Code, Cline, or the openai-python SDK connects without shims. The wall appears immediately if you leave Apple Silicon: the runtime is explicitly Apple Silicon only, and the custom Metal kernels have no CUDA path.
FreeOpen Source
28. Northbeams
Northbeams sits between your workforce and their AI tools, classifying what's running, blocking what shouldn't be, and generating the evidence chain your SOC 2 or HIPAA auditor will ask for. The browser-based agent installs without network changes, so IT doesn't need a procurement cycle to get visibility. Discovery is ungated, which means you can map your shadow AI footprint before committing to enforcement. The ceiling appears when your environment scales past a single site or when you need MCP agent governance — those capabilities are paid-only features. Teams running large multi-site deployments report that per-seat policy management becomes the operational bottleneck.
PaidFree Trial · 14 days
29. Ollama
Ollama downloads open-source models like Llama 2 and Mistral and runs them on your own hardware—no API calls, no subscriptions, no data leaving your machine. The pitch is straightforward: you get inference without the per-token pricing or rate limits of cloud services. The catch is real: performance depends entirely on your CPU or GPU, and setup requires comfort with command-line tools and ~10GB of disk space per model. It's genuinely free, but you're trading convenience and speed for privacy and control.
PaidOpen Source
30. OpenRAG
OpenRAG is a modular framework for exploring Retrieval-Augmented Generation (RAG) techniques, built for transparency and rapid experimentation to develop document-grounded AI systems—fully ready for production-scale deployment. It uses Ray to parallelize chunking, embedding, and ingestion across CPUs and GPUs, enabling fast, scalable processing of large document sets, and can be deployed seamlessly on Kubernetes for distributed, production-grade workloads. Advanced loaders like Docling and Marker parse complex layouts with OCR-enhanced PDFs, and chunk contextualization significantly boosts retrieval relevance. The platform ships with fully OpenAI-compatible chat API for seamless integration with tools like LangChain, OpenWebUI, or N8N—no adapter work required. Built-in clustering auto-generates synthetic QA datasets from your indexed documents, and a local LLM scores each query-chunk pair to help you tune retrieval before production. Two friction points surface at scale: in collaborative systems where documents update hourly, embeddings are recomputed every time by vLLM, which is computationally expensive, and admin users cannot grant access to partitions they were not explicitly given access to—the admin role does not override partition-level access restrictions.
Free
31. OpenVINO™ Toolkit
Open-source toolkit for optimizing and deploying AI inference on Intel and multi-platform hardware.
Free
32. PromptLayer
PromptLayer sits between your application and the LLM API, logging every request, tagging it to a prompt version, and giving engineers and non-technical collaborators a shared interface to iterate without touching code. The audit trail and A/B testing pipeline solve the 'who changed what and when' problem that kills rapid iteration on teams larger than two. The self-hosted deployment option exists for teams with data residency requirements. Where it hits a ceiling: the scraped page data available for this listing does not reflect PromptLayer's documented product — factual claims about specific integrations, provider support, or evaluation workflows cannot be sourced from the content retrieved.
Free
33. PromptUnit
AI proxy that automatically routes requests to cheaper models while maintaining quality.
PaidFree Trial · 14 days
34. RAGFlow
Open-source RAG engine with deep document understanding, hybrid search, and agentic workflow orchestration.
PaidOpen Source
35. RiskKernel
Deployed as a single Go binary, it sits in front of your existing OpenAI, Anthropic, or LangChain stack via a one-variable proxy — no rewrite required. Every call is metered and checkpointed, so a killed or crashed run resumes from the last saved state instead of re-spending from zero. The human-approval gate routes irreversible tool calls for sign-off over CLI, web, or webhook before they fire, and the LLM cannot bypass it because the gate lives in compiled code, not a prompt. The hosted dashboard is private beta only; teams that need a UI today are self-managing.
FreeOpen Source
36. RunAPI
RunAPI is a unified inference API that routes requests across image, video, audio, and text generation models through a single endpoint and a single bill. The vendor states it is designed for high-volume workloads where per-request cost efficiency matters more than model-provider loyalty. Teams prototyping across modalities can swap providers without rewriting integration code. The ceiling appears when you need fine-grained control over model behavior, custom fine-tuned weights, or self-hosted deployment — none of which are available here. At that point, teams move request routing back in-house and use provider SDKs directly.
Paid
37. Spanlens
Spanlens sits in front of your LLM provider via a single baseURL change, recording every call's cost, latency, tokens, and full request-response body with no SDK rewrite required. Agent runs surface as waterfall span trees so you can identify the one step consuming 80% of wall-clock time. The model recommender flags GPT-4o calls that look like classification tasks and shows the cost delta if you swap — with numbers from your own traffic, not benchmarks. The eval and experiment layer lets you replay a fixed dataset across prompt versions before you ship, so quality regressions don't surprise you in production. PII scanning and anomaly detection run at log time, which matters when sensitive data crosses the wire at 3 a.m. with nobody watching.
PaidOpen Source
38. Supermemory
Supermemory wraps memory, retrieval, user profiling, data connectors, and document extraction into one API so your agent doesn't reassemble context from scratch on every request. The retrieval layer claims sub-300ms latency using hybrid search with reranking, and the memory layer maintains a knowledge graph that merges contradictions and evolves facts over time rather than appending chunks blindly. Connectors to Slack, Notion, Drive, Gmail, GitHub, and S3 sync automatically — no ETL pipeline to maintain. The core memory engine is proprietary and hosted-only; self-hosting requires an enterprise agreement, so teams with strict data residency requirements hit a wall before they ship.
PaidOpen Source
39. SynapCores
The engine handles graph traversal, HNSW vector similarity, and in-database LLM inference inside a single MATCH statement, so the four-to-five round-trips that Pinecone plus Postgres plus an external reranker produce become one. The Community Edition ships with 161 ready-to-run recipes covering GraphRAG, fraud detection, document ingestion, and AutoML — each a runnable markdown file you can modify locally. The ceiling arrives at the infrastructure layer: multi-node clustering, Raft replication, and CDC ingest from MySQL or Postgres binlogs are paid-only features. Teams that outgrow a single host hit that wall before they hit a query performance problem. For single-host deployments, the binary wire protocol and B-tree indexes the vendor targets in a future release are not yet available.
Paid
40. Tenure
Where most memory systems rely on similarity search with soft boundaries, Tenure enforces hard scope isolation at the structural level: engineering beliefs stay in engineering sessions, Project A never bleeds into Project B. The vendor's benchmark claims a drift score of 0.00 against competing memory systems that score above 0.80. Retrieval latency is documented at 15ms with 1.0 precision. The self-hosted Helm install takes roughly 30 seconds and exposes an OpenAI-compatible endpoint, so existing clients require no code changes. The ceiling appears when your team needs managed infrastructure or enterprise support — neither is documented on the vendor site.
Paid
41. Thunderbolt
Open-source, self-hosted enterprise AI client emphasizing data sovereignty and model choice.
Paid
42. Unabyss
The scraped page content provided does not match the tool described in the structured data: the page describes 'Spotter,' a travel-identification app, not the context-infrastructure layer attributed to Unabyss. No production details, integration specifics, API behavior, or access-control mechanics for the named tool can be sourced from the provided content. Any description of how the tool retrieves context, gates permissions, or connects to Cursor and Claude Code would be fabricated. What the validator context does confirm: the tool is a passive retrieval and permission-gating system, not an agent — it feeds context to external tools rather than executing tasks on its own.
Paid
43. VideoDB
VideoDB ingests video from YouTube, S3, URLs, and RTSP/RTMP streams, then produces a continuous AI context stream — transcripts, visual scene indexes, audio summaries, and triggered alerts — with the vendor citing roughly two seconds of processing latency. Agents downstream query that structure instead of wrestling with raw frames or bloated context windows. The pattern holds well for single-stream use cases: a meeting copilot, a screen-aware pair programming agent, a security monitor flagging sensitive content. Where you hit friction is multi-stream scale and anything requiring on-premise data residency — the platform is cloud-only, with no self-hosted option. Teams with strict data sovereignty requirements end up re-evaluating before they ship.
Paid
44. vLLM
vLLM's core mechanism is PagedAttention, which the docs describe as a paged memory management approach for the KV cache — the part of GPU memory that normally fragments and wastes capacity at scale. Continuous batching sits on top of that, keeping the GPU fed instead of waiting for a fixed batch to fill. The result, per vendor benchmarks at perf.vllm.ai, is significantly higher throughput per GPU than naive serving setups. It exposes an OpenAI-compatible REST API, so existing client code needs no rewrite. The ceiling arrives when you need multi-node tensor parallelism beyond what your hardware topology supports, or when you're serving models on non-NVIDIA silicon — AMD ROCm and CPU paths exist, but community reports suggest NVIDIA CUDA gets the fastest fixes and the deepest optimization.
FreeOpen Source
45. Voker
Voker is a passive observability platform for conversational AI agents: it ingests chat session data, surfaces frustration patterns and knowledge gaps, and ties agent behavior to downstream metrics like conversion and retention. The self-hosted deployment path means your conversation data stays on your infrastructure — a hard requirement for many enterprise teams that competing SaaS observability tools cannot meet. The platform targets teams running at least 1,000 monthly sessions; below that threshold the pattern-detection signal is thin and the tooling is underutilized. Non-engineering teams can query agent insights without filing a ticket, which removes the bottleneck between product decisions and session data. Note: the scraped page content did not match Voker's product — factual claims here are drawn from the structured tool data provided.
PaidFree Trial · 30 days
46. WonderIpsum
The scraped page content provided does not match the tool data supplied: the page describes Spotter, a travel-identification app, not a synthetic data generation tool. No factual claims about the described tool's workflow, output quality, or integration behavior can be sourced from the available content. The validator context confirms a paid-only access model with no free tier, meaning teams cannot evaluate output quality before committing. Without grounded page content, production behavior at scale, API rate characteristics, and schema export fidelity cannot be assessed and should be verified directly with the vendor before any sprint commitment.
Paid
47. Xinference
Open-source library for unified deployment and serving of language, speech, and multimodal models across diverse hardware and infrastructure.
FreeOpen Source

Listings on this page are sourced and verified by the AIDiveForge data pipeline. AIDiveForge is editorially independent — no money changes hands for inclusion.