Inference engines and ML infrastructure are the plumbing layer underneath every other category on this site. The tools here are not visible to end users — they are what you reach for when you decide that closed-API pricing or control is unacceptable, or when you need to serve models at scale, on the edge, or entirely air-gapped. The category spans local runtimes that run open-weights models on a laptop or a modest server, hosted inference platforms that let you rent GPUs by the second or by the request, and the MLOps glue that tracks experiments, serves endpoints, and monitors models in production. Picking between them is mostly an operations decision, not a model decision, and the right choice depends on your traffic shape, security posture, and the size of the team willing to maintain the stack.
Ollama is the path of least resistance for running open-weights LLMs on a developer machine or a modest server. One command to pull a model, one command to run it, and an OpenAI-compatible API that most client libraries already speak. It is how we get a local model in front of a team for evaluation before making any architecture decisions.
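As a sketch of what that first evaluation call looks like: Ollama exposes an OpenAI-compatible endpoint on localhost, so plain `urllib` is enough. The model name `llama3` and the default port `11434` are assumptions; any OpenAI-compatible client library works the same way.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint (default local port; an assumption here).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build an OpenAI-style chat-completion payload that Ollama accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(model: str, prompt: str) -> str:
    """POST the payload to a locally running Ollama server (requires `ollama serve`)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the request and response shapes match OpenAI's, swapping this for a hosted provider later is a one-line URL change, which is exactly why it is a low-risk way to start an evaluation.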
The directory of infrastructure tools on this site is still growing. Serious production serving stacks (vLLM, TGI, Triton, TensorRT-LLM) and hosted inference platforms (Together, Fireworks, Replicate, Baseten) are being catalogued and will appear in this section as they are verified. For now, Ollama covers the overwhelmingly common case of "evaluate an open model before deciding whether to stand up anything larger."
When data cannot leave your environment, when you need a frozen model version for audit reasons, or when your per-token spend on closed APIs clearly exceeds the monthly cost of renting the GPUs to serve equivalent traffic. Below those thresholds, a hosted API is almost always the right call.
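The spend threshold can be made concrete with a back-of-the-envelope break-even. The prices below are illustrative assumptions, not quotes, and the calculation deliberately ignores engineering time, which usually dominates at small scale.

```python
def breakeven_tokens_per_month(api_price_per_mtok: float,
                               gpu_hourly_rate: float,
                               hours_per_month: float = 730.0) -> float:
    """Tokens/month at which closed-API spend equals one GPU rented 24/7.

    Above this volume, self-serving starts to pay for itself; below it,
    the hosted API is cheaper even before counting maintenance effort.
    """
    monthly_gpu_cost = gpu_hourly_rate * hours_per_month
    return monthly_gpu_cost / api_price_per_mtok * 1_000_000

# Illustrative numbers only: $15 per million tokens vs. a $2.50/hr GPU
# gives a break-even around 120M tokens/month.
```

If your monthly token count is an order of magnitude below that line, the audit and data-residency arguments are the only ones that justify self-hosting.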
For a 7B-parameter model at INT4, consumer GPUs (16 GB VRAM) are sufficient. For a 70B model at usable quality, budget a single H100 or a small cluster of A100s. For frontier-scale open models, expect a serious hardware investment or a hosted inference partner.
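The sizing above follows a simple rule of thumb: weight memory is parameter count times bits per weight, and everything else (KV cache, activations, runtime overhead) is added on top. A minimal sketch, with the caveat that the overhead fraction varies widely by context length and engine:

```python
def weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough VRAM needed just for model weights.

    Excludes KV cache, activations, and runtime overhead, which can add
    20-50% or more depending on context length and batch size.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 7B at INT4  -> ~3.5 GB of weights: comfortable on a 16 GB consumer card.
# 70B at INT4 -> ~35 GB of weights: fits one 80 GB H100, but long-context
#                KV cache is what eats the remaining headroom.
```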
Yes — modern runtimes support multi-model serving with dynamic loading. Expect cold-start latency when switching and plan your autoscaling around it.
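The operational consequence is easiest to see with a toy LRU model cache. The 30-second cold start and capacity of two resident models are illustrative assumptions, not any specific runtime's behavior:

```python
from collections import OrderedDict

class ModelCache:
    """Toy LRU cache of loaded models: hits are free, misses pay a cold start."""

    def __init__(self, capacity: int, cold_start_s: float = 30.0):
        self.capacity = capacity
        self.cold_start_s = cold_start_s
        self._loaded: OrderedDict[str, bool] = OrderedDict()

    def request(self, model: str) -> float:
        """Return the load latency this request would pay, in seconds."""
        if model in self._loaded:
            self._loaded.move_to_end(model)   # warm hit: already resident
            return 0.0
        if len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)  # evict least-recently-used model
        self._loaded[model] = True
        return self.cold_start_s              # cold start: weights loaded from disk

cache = ModelCache(capacity=2)
latencies = [cache.request(m) for m in ["a", "b", "a", "c", "b"]]
# "b" pays a second cold start after "c" evicts it: [30, 30, 0, 30, 30]
```

The pattern to notice: with more distinct models than resident slots, a round-robin traffic mix degenerates into paying the cold start on nearly every switch, which is why autoscaling and model-to-replica pinning matter.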
Replay your real traffic shape (request sizes, concurrency, prompt lengths) against each engine on the same hardware and measure throughput, p50 and p99 latency, and memory. Synthetic benchmarks rarely predict production behavior accurately.
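The measurement side of that advice needs nothing exotic; the standard library computes the percentiles once you have per-request latencies from the replay. A minimal sketch:

```python
import statistics

def latency_summary(latencies_ms: list[float]) -> dict:
    """Summarize replayed-request latencies: report p50 and p99, not just the mean."""
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": qs[98],                      # 99th percentile
        "mean_ms": statistics.fmean(latencies_ms),
    }
```

Compare engines on the same hardware using the same summary; a better mean with a worse p99 is usually the wrong trade for interactive workloads.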
For production-critical workloads, treat model updates like any other dependency update: test the candidate against a held-out evaluation set, roll out gradually, and keep the ability to revert. Quarterly is a sensible cadence for most teams; monthly is justified only if the capability gains are material to the product.
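One common way to implement the gradual rollout is deterministic hash-based routing, so each user sees a consistent model version and the rollout percentage is a single dial. This is a generic sketch, not tied to any particular serving stack; the salt string is an assumption standing in for a release identifier.

```python
import hashlib

def route_to_candidate(user_id: str, rollout_pct: float,
                       salt: str = "model-v2") -> bool:
    """Deterministically route a fixed slice of users to the candidate model.

    Stable per user (no flapping between versions mid-session) and trivial
    to dial from 1% up to 100% -- or back to 0 to revert.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rollout_pct / 100.0
```

Keeping the previous model warm while the dial moves is what makes "keep the ability to revert" a config change rather than a redeploy.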
Hosted APIs (serverless inference) are almost always the right answer for bursty or uncertain workloads — you pay only for what you use. Renting GPUs directly makes sense only when traffic is steady enough to saturate the instance, or when you need a model or configuration the hosted providers do not offer.
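The "steady enough to saturate" test can be written down as a crude utilization rule. The 60% threshold is an assumption, not a standard; the point is that the decision depends on average traffic relative to one instance's capacity, not on peak traffic.

```python
def dedicated_utilization(avg_requests_per_s: float,
                          instance_capacity_rps: float) -> float:
    """Fraction of one dedicated instance your average traffic would keep busy."""
    return avg_requests_per_s / instance_capacity_rps

def prefer_dedicated(avg_requests_per_s: float,
                     instance_capacity_rps: float,
                     threshold: float = 0.6) -> bool:
    """Crude decision rule (threshold is an assumption): rent GPUs only once
    steady traffic keeps an instance busy most of the time; else stay serverless."""
    return dedicated_utilization(avg_requests_per_s, instance_capacity_rps) >= threshold
```

Bursty traffic fails this test almost by definition: a high peak with a low average means a rented GPU sits idle most of the day while you pay for all of it.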