Skip to main content
AIDiveForge AIDiveForge
Visit Atlas Inference Engine

Get This Tool

License: AGPL-3.0 Commercial ok; derivatives must share license
Local-run terms: Community Edition (AGPLv3): free for personal use, research, hobby projects, side-projects, and hosted demos. If bundling into a closed-source product or operating as a SaaS backend, a commercial license is required.

Share This Tool

Compare This Tool
📋 Embed this tool on your site

Copy this code to embed a compact tool card:

Atlas Inference Engine

FreeOpen SourceAPISelf-HostedAgentic

Pricing

Model
Free

Summary

vLLM's 10-minute torch.compile cycle and 20 GB image are survivable in a research notebook — they become a real problem when you are iterating on agentic tool loops and every cold start eats your afternoon. Atlas is an LLM inference engine written from scratch in Rust and CUDA, built to remove that tax entirely.

The vendor page benchmarks Atlas at 3.1x the decode throughput of vLLM on Nvidia DGX Spark hardware — 111 tok/s average versus 37 tok/s on Qwen3.5-35B, with a cold start measured in two minutes instead of ten. That gap exists because Atlas ships no Python, no PyTorch, and no JIT warm-up: every path from HTTP request to kernel dispatch is compiled. The tradeoff is hardware specificity — hand-tuned CUDA kernels target Blackwell SM120/121, so teams not running DGX Spark get none of the headline numbers. The model matrix covers Qwen, Gemma, Nemotron, Mistral, and MiniMax, but every recipe is written for that hardware profile. Teams running other GPU generations are not the audience.

Bottom line: Atlas is the right call if you have DGX Spark hardware and need the fastest possible path from cold image to serving a 35B or 122B model — but if your fleet runs anything other than Blackwell, the benchmark that sold you does not transfer.

Community Performance Report Card

No community ratings yet. Be the first to rate this tool!

Best For: Teams with Nvidia DGX Spark hardware, Production inference requiring minimal operational overhead, Agentic AI applications with tool calling and multi-turn loops, Organizations seeking clean, auditable compiled code

Community Benchmarks Community

No community benchmarks yet. Be the first to share a real-world data point.

  • ~2.5 GB container image with no Python or PyTorch dependencies, which means cold starts take two minutes instead of ten — a difference that compounds across every iteration in an agentic development loop.
  • Compiled Rust + CUDA architecture with no GIL or JIT warm-up, so request latency is consistent from the first token rather than degrading during the warm-up window that costs vLLM its first several minutes.
  • Hand-tuned CUDA kernels per model family with NVFP4 and FP8 on Blackwell tensor cores, so quantized inference does not trade throughput for accuracy the way a generic quantization layer would.
  • Multi-Token Prediction speculative decoding built in, so a single DGX Spark node serving a 35B model reaches throughput that would otherwise require additional hardware or a more complex multi-node setup.
  • OpenAI-compatible API endpoint out of the box, so existing tooling — Claude Code, Cline, Open WebUI — connects without a translation layer or custom client code.
  • Every published benchmark and kernel optimization targets Nvidia Blackwell SM120/121 on DGX Spark. Teams running Ampere, Ada, or Hopper GPUs get none of the headlined throughput numbers — the architecture constraint is not a tuning issue, it is baked into the kernel design. Those teams are still on vLLM or TensorRT-LLM.
  • The model matrix is a curated, hand-tuned list — Qwen, Gemma, Nemotron, Mistral, MiniMax — not an open registry. A team that needs to serve a fine-tuned model outside that matrix hits a wall immediately and either waits on the Atlas roadmap, opens a Discord request, or returns to vLLM where arbitrary HuggingFace checkpoints load without curation.
  • AGPL-3.0 is the default license. Any team building a closed-source product or operating a SaaS service on top of Atlas is required to obtain a commercial license. Teams that discover this constraint after building on the free version face a licensing conversation before they can ship.

Community Reviews

No reviews yet. Be the first to share your experience.

About

Platforms
Linux (Ubuntu 22.04+) with NVIDIA GPU support (Blackwell GB10 primary, Hopper/Ampere in development)
API Available
Yes
Self-Hosted
Yes
Last Updated
2026-06-09T05:38:17.988Z

Best For

Who it's for

  • Teams with Nvidia DGX Spark hardware
  • Production inference requiring minimal operational overhead
  • Agentic AI applications with tool calling and multi-turn loops
  • Organizations seeking clean, auditable compiled code

What it does well

  • High-throughput LLM inference on Nvidia Blackwell GPUs
  • Agentic AI workloads requiring low-latency multi-turn interactions
  • On-premises model serving with zero external dependencies
  • Cost-optimized inference for large-scale deployments

Integrations

OpenAI-compatible API; Claude CodeClineOpenCodeOpen WebUI; Anthropic API compatibility planned

Discussion Community

No discussion yet. Sign in to start the conversation.

Compare Atlas Inference Engine

Spotted incorrect or missing data? Join our community of contributors.

Sign Up to Contribute

Community Notes & Tips Community

Be the first to contribute. General notes, observations, gotchas, and tips from people who use this tool day-to-day.

Frequently Asked Questions

Is Atlas Inference Engine free?
Yes — Atlas Inference Engine is fully free to use. There is no paid tier.
Is Atlas Inference Engine open source?
Yes. Atlas Inference Engine is open source.
Does Atlas Inference Engine have an API?
Yes. Atlas Inference Engine exposes a developer API. See the official documentation at https://atlasinference.io for details.
Can I self-host Atlas Inference Engine?
Yes. Atlas Inference Engine supports self-hosting on your own infrastructure.
What platforms does Atlas Inference Engine support?
Atlas Inference Engine is available on: Linux (Ubuntu 22.04+) with NVIDIA GPU support (Blackwell GB10 primary, Hopper/Ampere in development).

Hours Saved & ROI Stories Community

Be the first to contribute. Concrete time/cost savings, with context. e.g. "Cut my code review backlog from 4h to 45m per week."

Atlas Inference Engine

Most LLM inference stacks treat Python as load-bearing infrastructure: PyTorch for tensor operations, a Python runtime managing the request queue, JIT compilation warming up on first use. Atlas discards that entire layer. Written in Rust and CUDA, the engine compiles from HTTP handling down to kernel dispatch with no interpreter in the path, no GIL, and no warm-up pass. The result is a single ~2.5 GB container image that the vendor states starts serving in two minutes. Deployment is a single `sparkrun` command that pulls the image and starts an OpenAI-compatible endpoint at localhost:8888.

The performance claim rests on two architectural decisions working together. First, hand-tuned CUDA kernels cover attention, MoE, GDN, and Mamba-2 specifically for Blackwell SM120/121 tensor cores, with NVFP4 and FP8 quantization baked in at the kernel level rather than layered on top. Second, Multi-Token Prediction speculative decoding generates multiple tokens per forward pass — the vendor benchmarks this at up to 3x throughput over single-token decoding. On Qwen3.5-35B at batch=1 with MTP K=2, the published numbers show 130 tok/s peak and 111 tok/s average sustained across diverse workloads.

Atlas fits one profile precisely: a team with DGX Spark hardware running agentic workloads — tool-calling loops, multi-turn coding agents, Claude Code or Cline integrations — where inference latency is the bottleneck and operational simplicity matters. The OpenAI-compatible API means dropping it behind any client that already speaks that protocol. Where it breaks: the kernel optimizations are Blackwell-specific, the model matrix is curated rather than open, and the AGPL-3.0 license requires teams building closed-source products or SaaS to obtain a commercial license. Community reports from Discord and r/LocalLLaMA confirm the throughput claims on DGX Spark, but no evidence on the page addresses non-Blackwell hardware performance.