Get This Tool
MTPLX
Pricing
- Model
- Free
Summary
Most local LLM runtimes advertise temperature support, then quietly approximate it with greedy argmaxes — silently corrupting the output distribution every time you ask for anything other than the single most probable token. MTPLX is an Apple Silicon inference runtime built around the one claim most tools quietly dodge: speculative decoding that is mathematically exact at temperature > 0.
The vendor states a 2.24× decode speedup on Qwen3-27B running on an M5 Max MacBook Pro, achieved by using the model's own built-in MTP heads as the drafter — no second model loaded, no external checkpoint to maintain. Acceptance is handled via Leviathan–Chen rejection sampling with a residual (p − q)+ correction, verified bit-exact against single-token autoregressive output. It serves an OpenAI- and Anthropic-compatible API, so downstream tooling like Claude Code, Cline, or the openai-python SDK connects without shims. The wall appears immediately if you leave Apple Silicon: the runtime is explicitly Apple Silicon only, and the custom Metal kernels have no CUDA path.
Bottom line: Pick this when you are running 27B+ models on Apple Silicon and correctness at temperature matters — skip it the moment your deployment target is anything other than macOS on Apple hardware.
Community Performance Report Card
No community ratings yet. Be the first to rate this tool!
Community Benchmarks Community
Sign in to submit a benchmarkNo community benchmarks yet. Be the first to share a real-world data point.
Pros
Sign in to edit- Leviathan–Chen rejection sampling with residual correction produces bit-exact output at temperature > 0, so agent workflows that depend on non-greedy sampling get the correct distribution instead of a silent approximation that drifts results unpredictably.
- The drafter lives inside the target checkpoint's own MTP heads, which means no second model in memory — on a MacBook with 64–128 GB unified memory, that headroom stays available for context or parallel sessions rather than a dedicated draft model.
- OpenAI- and Anthropic-compatible API endpoints with streaming SSE, so tools like Claude Code, Cline, Continue, and the openai-python SDK connect without a translation layer or custom adapter.
- The vendor reports 2.24× decode speed on Qwen3-27B at temperature 0.6/top_p 0.95 on an M5 Max — meaning you get more tokens per second without switching to a smaller model or lowering temperature to approximate greedy.
- Apache-2.0 license with no cloud tier or usage telemetry mentioned in the docs, which means inference stays entirely on local hardware — no prompt data leaves the machine.
Cons
Sign in to edit- The runtime is Apple Silicon only, with custom Metal kernels and no CUDA path: the moment your deployment target is a Linux server, a cloud VM, or a Windows workstation, this tool is not an option and teams move to vLLM or llama.cpp instead.
- MTP speculative decoding requires models that ship with native MTP heads in their checkpoint — models without those heads get no speedup and fall back to standard autoregressive decode, which means the 2.24× figure applies only to a specific subset of supported architectures.
- The project is at v0.1.0-preview.1 and built by a single developer: production teams that need an SLA-backed issue resolution path, a security response process, or a multi-maintainer commit history will hit that wall before they finish the proof-of-concept.
Community Reviews
Sign in to write a reviewNo reviews yet. Be the first to share your experience.
About
- Platforms
- macOS (Apple Silicon)
- API Available
- Yes
- Self-Hosted
- Yes
- Last Updated
- 2026-06-09T06:20:25.411Z
Best For
Who it's for
- MacBook Pro / Mac Mini users running 27B+ models locally
- Developers requiring exact sampling (not greedy approximation) for agent workflows
- Inference workloads where temperature > 0 matters (code generation, creative tasks)
- Organizations seeking mathematical correctness in open-source runtimes
- Apple Silicon-first deployment scenarios
What it does well
- Local inference for coding agents and multi-turn reasoning on Apple Silicon
- Browser-based or terminal chat with high-performance 27B+ models on MacBooks
- OpenAI/Anthropic-compatible API serving for downstream integrations
- Research benchmarking and validation of speculative decoding at exact temperatures
- Privacy-preserving LLM serving entirely on local hardware
Integrations
Discussion Community
Sign in to commentNo discussion yet. Sign in to start the conversation.
Compare MTPLX
Spotted incorrect or missing data? Join our community of contributors.
Sign Up to ContributeCommunity Notes & Tips Community
Sign in to contributeBe the first to contribute. General notes, observations, gotchas, and tips from people who use this tool day-to-day.
Frequently Asked Questions
- Is MTPLX free?
- Yes — MTPLX is fully free to use. There is no paid tier.
- Is MTPLX open source?
- Yes. MTPLX is open source.
- Does MTPLX have an API?
- Yes. MTPLX exposes a developer API. See the official documentation at https://mtplx.com for details.
- Can I self-host MTPLX?
- Yes. MTPLX supports self-hosting on your own infrastructure.
- When was MTPLX released?
- MTPLX was first released in 2025.
- What platforms does MTPLX support?
- MTPLX is available on: macOS (Apple Silicon).
Hours Saved & ROI Stories Community
Sign in to contributeBe the first to contribute. Concrete time/cost savings, with context. e.g. "Cut my code review backlog from 4h to 45m per week."
Curated lists that include this category
Most speculative decoding implementations trade statistical correctness for throughput by accepting draft tokens greedily — matching argmaxes instead of running the proper probability-ratio check. MTPLX is a native MTP inference runtime for Apple Silicon that refuses that trade. Per the vendor’s documentation, each speculative cycle drafts K tokens from the target model’s own built-in MTP heads, verifies all K positions in a single batched forward pass, and accepts or rejects each position via Leviathan–Chen rejection sampling with fp32 ratio arithmetic and a (p − q)+ residual correction for rejections. The result is verified max_diff = 0.0 against single-token autoregressive reference output.
The architectural differentiator is that there is no external drafter model. The draft heads are baked into the target checkpoint itself, so you load one model, not two. The vendor reports D4 acceptance rates on Qwen3-27B that exceed vLLM’s MTP-5 CUDA per-position acceptance on the same prompts — 75.61% at depth 4 versus vLLM’s 50.9% at depth 4 — which the docs attribute to the Leviathan–Chen path rather than greedy approximation. Custom Metal kernels handle the verify hot path, and GraphBank-compiled verify shapes cover the small-M matrix multiply cases that dominate the verify cycle.
The runtime fits precisely one deployment context: Apple Silicon Macs running models with native MTP heads. Installation is Homebrew-based; a first-run wizard selects model, serving mode, and surface (browser or terminal). The API layer exposes /v1/chat/completions and /v1/messages endpoints with streaming SSE, making it a drop-in local backend for OpenAI- or Anthropic-compatible clients. It does not run on Linux, Windows, or CUDA hardware — the vendor states Apple Silicon only, and there is no mention of a cross-platform roadmap. The project is at v0.1.0-preview.1, built solo by a single developer under Apache-2.0.
