Yes — MTPLX is fully free to use. There is no paid tier.

Does MTPLX have an API?

Yes. MTPLX exposes a developer API. See the official documentation at https://mtplx.com for details.

Can I self-host MTPLX?

Yes. MTPLX supports self-hosting on your own infrastructure.

When was MTPLX released?

MTPLX was first released in 2025.

What platforms does MTPLX support?

MTPLX is available on: macOS (Apple Silicon).

Visit MTPLX

Get This Tool

License: Apache-2.0 Any use incl. commercial

Local-run terms: Apache-2.0 licensed. Users can use, modify, and ship commercially. Redistribution requires preserving license and NOTICE attribution; credit or citation strongly appreciated for public projects and research.

Official Website

MTPLX

FreeOpen SourceAPISelf-Hosted

Pricing

Model: Free

Summary

Most local LLM runtimes advertise temperature support, then quietly approximate it with greedy argmaxes — silently corrupting the output distribution every time you ask for anything other than the single most probable token. MTPLX is an Apple Silicon inference runtime built around the one claim most tools quietly dodge: speculative decoding that is mathematically exact at temperature > 0.

The vendor states a 2.24× decode speedup on Qwen3-27B running on an M5 Max MacBook Pro, achieved by using the model's own built-in MTP heads as the drafter — no second model loaded, no external checkpoint to maintain. Acceptance is handled via Leviathan–Chen rejection sampling with a residual (p − q)+ correction, verified bit-exact against single-token autoregressive output. It serves an OpenAI- and Anthropic-compatible API, so downstream tooling like Claude Code, Cline, or the openai-python SDK connects without shims. The wall appears immediately if you leave Apple Silicon: the runtime is explicitly Apple Silicon only, and the custom Metal kernels have no CUDA path.

Bottom line: Pick this when you are running 27B+ models on Apple Silicon and correctness at temperature matters — skip it the moment your deployment target is anything other than macOS on Apple hardware.

Community Performance Report Card

No community ratings yet. Be the first to rate this tool!

Best For: MacBook Pro / Mac Mini users running 27B+ models locally, Developers requiring exact sampling (not greedy approximation) for agent workflows, Inference workloads where temperature > 0 matters (code generation, creative tasks), Organizations seeking mathematical correctness in open-source runtimes, Apple Silicon-first deployment scenarios

Community Benchmarks Community

No community benchmarks yet. Be the first to share a real-world data point.

Inference Engines & Infra Local Inference Runtimes

Released 2025

Pros

Leviathan–Chen rejection sampling with residual correction produces bit-exact output at temperature > 0, so agent workflows that depend on non-greedy sampling get the correct distribution instead of a silent approximation that drifts results unpredictably.
The drafter lives inside the target checkpoint's own MTP heads, which means no second model in memory — on a MacBook with 64–128 GB unified memory, that headroom stays available for context or parallel sessions rather than a dedicated draft model.
OpenAI- and Anthropic-compatible API endpoints with streaming SSE, so tools like Claude Code, Cline, Continue, and the openai-python SDK connect without a translation layer or custom adapter.
The vendor reports 2.24× decode speed on Qwen3-27B at temperature 0.6/top_p 0.95 on an M5 Max — meaning you get more tokens per second without switching to a smaller model or lowering temperature to approximate greedy.
Apache-2.0 license with no cloud tier or usage telemetry mentioned in the docs, which means inference stays entirely on local hardware — no prompt data leaves the machine.

Cons

The runtime is Apple Silicon only, with custom Metal kernels and no CUDA path: the moment your deployment target is a Linux server, a cloud VM, or a Windows workstation, this tool is not an option and teams move to vLLM or llama.cpp instead.
MTP speculative decoding requires models that ship with native MTP heads in their checkpoint — models without those heads get no speedup and fall back to standard autoregressive decode, which means the 2.24× figure applies only to a specific subset of supported architectures.
The project is at v0.1.0-preview.1 and built by a single developer: production teams that need an SLA-backed issue resolution path, a security response process, or a multi-maintainer commit history will hit that wall before they finish the proof-of-concept.

Community Reviews

No reviews yet. Be the first to share your experience.

About

Platforms: macOS (Apple Silicon)
API Available: Yes
Self-Hosted: Yes
Last Updated: 2026-06-09T06:20:25.411Z

Best For

Who it's for

MacBook Pro / Mac Mini users running 27B+ models locally
Developers requiring exact sampling (not greedy approximation) for agent workflows
Inference workloads where temperature > 0 matters (code generation, creative tasks)
Organizations seeking mathematical correctness in open-source runtimes
Apple Silicon-first deployment scenarios

What it does well

Local inference for coding agents and multi-turn reasoning on Apple Silicon
Browser-based or terminal chat with high-performance 27B+ models on MacBooks
OpenAI/Anthropic-compatible API serving for downstream integrations
Research benchmarking and validation of speculative decoding at exact temperatures
Privacy-preserving LLM serving entirely on local hardware

Integrations

OpenAI-compatible APIAnthropic Messages APIPiOpenCodeClaude CodeContinueopen-webui

Discussion Community

No discussion yet. Sign in to start the conversation.

Compare MTPLX

Spotted incorrect or missing data? Join our community of contributors.

Community Notes & Tips Community

Be the first to contribute. General notes, observations, gotchas, and tips from people who use this tool day-to-day.

Frequently Asked Questions

Is MTPLX free?: Yes — MTPLX is fully free to use. There is no paid tier.
Is MTPLX open source?: Yes. MTPLX is open source.
Does MTPLX have an API?: Yes. MTPLX exposes a developer API. See the official documentation at https://mtplx.com for details.
Can I self-host MTPLX?: Yes. MTPLX supports self-hosting on your own infrastructure.
When was MTPLX released?: MTPLX was first released in 2025.
What platforms does MTPLX support?: MTPLX is available on: macOS (Apple Silicon).

Hours Saved & ROI Stories Community

Be the first to contribute. Concrete time/cost savings, with context. e.g. "Cut my code review backlog from 4h to 45m per week."

Curated lists that include this category

Most speculative decoding implementations trade statistical correctness for throughput by accepting draft tokens greedily — matching argmaxes instead of running the proper probability-ratio check. MTPLX is a native MTP inference runtime for Apple Silicon that refuses that trade. Per the vendor’s documentation, each speculative cycle drafts K tokens from the target model’s own built-in MTP heads, verifies all K positions in a single batched forward pass, and accepts or rejects each position via Leviathan–Chen rejection sampling with fp32 ratio arithmetic and a (p − q)+ residual correction for rejections. The result is verified max_diff = 0.0 against single-token autoregressive reference output.

The architectural differentiator is that there is no external drafter model. The draft heads are baked into the target checkpoint itself, so you load one model, not two. The vendor reports D4 acceptance rates on Qwen3-27B that exceed vLLM’s MTP-5 CUDA per-position acceptance on the same prompts — 75.61% at depth 4 versus vLLM’s 50.9% at depth 4 — which the docs attribute to the Leviathan–Chen path rather than greedy approximation. Custom Metal kernels handle the verify hot path, and GraphBank-compiled verify shapes cover the small-M matrix multiply cases that dominate the verify cycle.

The runtime fits precisely one deployment context: Apple Silicon Macs running models with native MTP heads. Installation is Homebrew-based; a first-run wizard selects model, serving mode, and surface (browser or terminal). The API layer exposes /v1/chat/completions and /v1/messages endpoints with streaming SSE, making it a drop-in local backend for OpenAI- or Anthropic-compatible clients. It does not run on Linux, Windows, or CUDA hardware — the vendor states Apple Silicon only, and there is no mention of a cross-platform roadmap. The project is at v0.1.0-preview.1, built solo by a single developer under Apache-2.0.

Get This Tool

MTPLX

Pricing

Summary

Community Performance Report Card

Community Benchmarks Community

Pros

Cons

Community Reviews

About

Best For

Who it's for

What it does well

Integrations

Discussion Community

Compare MTPLX

Community Notes & Tips Community

Frequently Asked Questions

Hours Saved & ROI Stories Community

Curated lists that include this category

OpenRAG

LM Studio

Memori

Get This Tool

Share This Tool

MTPLX

Pricing

Summary

Community Performance Report Card

Community Benchmarks Community

Pros

Cons

Community Reviews

About

Best For

Who it's for

What it does well

Integrations

Discussion Community

Compare MTPLX

Community Notes & Tips Community

Frequently Asked Questions

Hours Saved & ROI Stories Community

Curated lists that include this category