License: Apache-2.0 (any use, including commercial)

Local-run terms: Users can self-host, modify, and deploy the library under Apache-2.0 terms for commercial use.

LMCache

Free, Open Source, Self-Hosted

Pricing

Model: Free

Summary

Every time a long-context prompt hits a cold LLM instance, you pay the full prefill cost again, and your users wait. LMCache is an open-source KV cache layer that plugs into your serving engine, so repeated prompts and shared context chunks are retrieved instead of recomputed.

The library plugs into vLLM or TGI backends (SGLang is also listed as an integration) and stores KV cache tensors so that overlapping prompt prefixes, such as system prompts, document chunks, and conversation history, are served from cache on subsequent requests. The vendor states 8–10x latency improvements for prompt caching workloads and 4–10x for RAG queries where the same document chunks appear across requests. The compression and streaming techniques described in the backing research (CacheGen, CacheBlend) are what make cache delivery fast enough to beat recomputation. The limit is a workload with little prompt overlap, such as unique user queries with no shared prefix; there the cache layer adds infrastructure without meaningful savings.

Bottom line: Deploy this when you run high-volume RAG with repeated document chunks or chatbots with long shared system prompts; skip it when your traffic is nearly all unique one-shot queries, because the operational overhead of maintaining a cache layer will outweigh the gains.

Inference Engines & Infra, RAG Frameworks

Added on June 18, 2026

Pros

Shared KV cache across serving instances, so you avoid GPU session-affinity routing and can load balance freely without cold-cache penalties on every node switch.
KV cache compression via the CacheGen approach, so storage costs for large caches stay bounded — without this, storing full-precision KV tensors for long contexts becomes prohibitively expensive at volume.
Native integration with vLLM and TGI, so teams already running those backends add the cache layer without replacing their serving stack.
Designed for RAG workloads via the CacheBlend technique, which lets the system combine cached KV entries from different document chunks rather than requiring an exact prefix match — so document-heavy pipelines see cache hits even when queries draw from different combinations of stored passages.
Fully open-source under Apache-2.0 with published research papers, so you can inspect the compression and streaming logic, audit it for your compliance requirements, and fork or extend it without vendor lock-in.
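
The storage pressure behind the compression point can be estimated with simple arithmetic. The sketch below is not CacheGen's codec; it only computes raw KV cache size from model shape and bytes per stored value. The model shape used in the example (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption, not a figure from LMCache.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Raw size of the KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

fp16_size = kv_cache_bytes(num_tokens=32_768, bytes_per_value=2, **SHAPE)
int8_size = kv_cache_bytes(num_tokens=32_768, bytes_per_value=1, **SHAPE)

print(f"fp16:  {fp16_size / 2**30:.1f} GiB")
print(f"8-bit: {int8_size / 2**30:.1f} GiB")
```

Under these assumed numbers a single 32K-token context occupies 4 GiB at 16-bit precision, which is why any reduction in bytes per stored value matters once many contexts are cached.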

Cons

On workloads where prompt overlap is low — unique queries, varied instructions, one-shot tasks — cache hit rates fall close to zero, and you are running a distributed caching system that adds latency on misses without providing the offsetting speedup; teams in this situation remove the layer rather than tune it.
LMCache has no API and requires direct integration into a vLLM or TGI deployment, so teams running other inference backends (Triton, custom serving, managed endpoints) face a porting effort the docs do not cover — at that point, teams typically stay with whatever per-request caching their serving engine natively offers.
Cache invalidation for dynamic content — documents that change, system prompts that update, user context that shifts — requires manual invalidation logic that the library does not automate; teams building products where source documents update frequently report building their own staleness-tracking layer on top.
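
A staleness-tracking layer of the kind described in the last item could look roughly like the sketch below. It is not part of LMCache: `VersionedKVIndex`, `content_fingerprint`, and the string payloads are hypothetical names invented for illustration. The idea is to record a content fingerprint next to each cached entry and treat the entry as stale when the source document no longer matches.

```python
import hashlib
from typing import Dict, Optional, Tuple


def content_fingerprint(text: str) -> str:
    """Short, stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


class VersionedKVIndex:
    """Hypothetical index: doc_id -> (fingerprint, cached payload).

    The payload stands in for whatever handle points at stored KV tensors.
    """

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, str]] = {}

    def put(self, doc_id: str, text: str, payload: str) -> None:
        self._entries[doc_id] = (content_fingerprint(text), payload)

    def get(self, doc_id: str, current_text: str) -> Optional[str]:
        entry = self._entries.get(doc_id)
        if entry is None:
            return None
        fingerprint, payload = entry
        if fingerprint != content_fingerprint(current_text):
            # The document changed since it was cached: evict and report a miss.
            del self._entries[doc_id]
            return None
        return payload


index = VersionedKVIndex()
index.put("policy", "Refunds within 30 days.", "kv-handle-1")
fresh = index.get("policy", "Refunds within 30 days.")
stale = index.get("policy", "Refunds within 14 days.")
```

Here `fresh` is the cached handle and `stale` is `None`, because the edited document no longer matches the stored fingerprint. Checking on lookup keeps the logic in one place; a production system might instead evict eagerly when the document store emits an update event.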

About

Platforms: Cross-platform; integrates with vLLM, TGI, SGLang
API Available: No
Self-Hosted: Yes
Last Updated: 2026-06-18T03:28:23.714Z

Best For

Who it's for

Enterprise LLM inference optimization
RAG pipelines requiring cache reuse
Cost-sensitive high-volume LLM serving
Integration with vLLM or TGI backends

What it does well

Prompt caching for chatbots and document processing
Fast RAG for enterprise search and document AI
KV cache sharing across multiple LLM serving instances
Reducing latency in long-context LLM applications

Integrations

vLLM, TGI, SGLang

Frequently Asked Questions

Is LMCache free? Yes. LMCache is fully free to use; there is no paid tier.
Is LMCache open source? Yes. LMCache is open source under the Apache-2.0 license.
Can I self-host LMCache? Yes. LMCache supports self-hosting on your own infrastructure.
What platforms does LMCache support? LMCache is cross-platform and integrates with vLLM, TGI, and SGLang.

LMCache inserts a KV cache storage and retrieval layer between your application and your LLM serving backend. Instead of recomputing attention keys and values for tokens that appeared in a previous request, LMCache stores those tensors, compresses them, and streams them back on cache hits. The core workflow is: a request arrives, LMCache checks whether any prefix or chunk of the prompt was already computed, retrieves matching KV tensors from storage, and passes only the uncached suffix to the GPU for actual computation.
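
The lookup step in that workflow can be sketched as a chunked prefix match. This is a toy illustration, not LMCache's implementation: `PrefixKVStore`, `chunk_keys`, `plan_request`, the chunk size of 4, and the string payloads are all assumptions made for the example. Each chunk key hashes the entire prefix up to and including that chunk, so a match on chunk N implies every earlier token matched too.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # tokens per chunk; deliberately tiny for the demo


def chunk_keys(tokens: List[int]) -> List[str]:
    """One key per full chunk; each key covers the whole prefix so far."""
    keys: List[str] = []
    running = hashlib.sha256()
    full_length = len(tokens) - len(tokens) % CHUNK_SIZE
    for start in range(0, full_length, CHUNK_SIZE):
        chunk = tokens[start:start + CHUNK_SIZE]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


class PrefixKVStore:
    """Toy in-memory store mapping prefix keys to placeholder KV payloads."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def put(self, tokens: List[int]) -> None:
        for i, key in enumerate(chunk_keys(tokens)):
            self._store.setdefault(key, f"kv-for-chunk-{i}")

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV payloads are already stored."""
        hit_tokens = 0
        for key in chunk_keys(tokens):
            if key not in self._store:
                break
            hit_tokens += CHUNK_SIZE
        return hit_tokens


def plan_request(store: PrefixKVStore, tokens: List[int]) -> Tuple[int, List[int]]:
    """Split a prompt into (cached token count, suffix that still needs prefill)."""
    cached = store.longest_cached_prefix(tokens)
    return cached, tokens[cached:]


store = PrefixKVStore()
store.put([1, 2, 3, 4, 5, 6, 7, 8, 9])  # an earlier request
cached, suffix = plan_request(store, [1, 2, 3, 4, 5, 6, 7, 8, 42, 43])
```

In this run the first eight tokens are found in the store, so only the suffix `[42, 43]` would go to the GPU. A prompt that differs in its first chunk gets no hit at all, which is the exact-prefix limitation that chunk-level reuse techniques aim to relax.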

The differentiating capability described in the vendor’s research is cache sharing across multiple serving instances. Most KV caching is per-instance and session-bound — when a load balancer sends a user to a different GPU node, the cache is cold again. LMCache is designed as a shared layer so any instance in a pool can retrieve a cache entry that a different instance originally computed. The vendor frames this as eliminating the need for GPU request routing tied to session affinity.
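
The difference between per-instance and shared caching can be shown with a minimal simulation. Everything here is hypothetical: `SharedKVStore` and `ServingInstance` are stand-ins invented for the example, and a plain dictionary plays the role of whatever storage backend a real deployment would use.

```python
from typing import Dict, List, Optional, Tuple


class SharedKVStore:
    """Stand-in for a cache backend reachable from every serving instance."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ...], str] = {}

    def get(self, prefix: List[int]) -> Optional[str]:
        return self._entries.get(tuple(prefix))

    def put(self, prefix: List[int], payload: str) -> None:
        self._entries[tuple(prefix)] = payload


class ServingInstance:
    """Toy serving node that consults the shared store before computing."""

    def __init__(self, name: str, store: SharedKVStore) -> None:
        self.name = name
        self.store = store
        self.prefills = 0  # how often this node had to compute a prefix itself

    def handle(self, prefix: List[int]) -> str:
        payload = self.store.get(prefix)
        if payload is None:
            self.prefills += 1
            payload = f"kv computed by {self.name}"
            self.store.put(prefix, payload)
        return payload


shared = SharedKVStore()
node_a = ServingInstance("node-a", shared)
node_b = ServingInstance("node-b", shared)

system_prompt = [101, 102, 103]
first = node_a.handle(system_prompt)   # cold: node-a computes and stores
second = node_b.handle(system_prompt)  # routed elsewhere, still a hit
```

Because both nodes read from the same store, `node_b` never runs a prefill for the shared prompt. Giving each node its own private store would make the second request cold again, which is the session-affinity problem the paragraph describes.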

This architecture fits best when your serving traffic has structural prompt overlap: enterprise RAG pipelines where thousands of queries share the same indexed document chunks, or chatbots where a long system prompt is prepended to every request. It breaks down when prompt diversity is high — unique user instructions, varied one-shot queries — because cache hit rates drop and you are left maintaining a distributed cache layer with negligible latency benefit. Teams running truly heterogeneous workloads often find the operational surface area does not justify the setup.
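
The trade-off can be made concrete with a small expected-latency model. This is a back-of-the-envelope sketch, not a measurement: the function names and every timing value are assumptions chosen for illustration. A hit costs the retrieval time; a miss costs the full prefill plus the overhead of the failed lookup.

```python
from typing import Optional


def expected_latency(hit_rate: float, prefill_s: float,
                     hit_s: float, miss_overhead_s: float) -> float:
    """Mean time-to-first-token with the cache layer in place."""
    return hit_rate * hit_s + (1.0 - hit_rate) * (prefill_s + miss_overhead_s)


def break_even_hit_rate(prefill_s: float, hit_s: float,
                        miss_overhead_s: float) -> Optional[float]:
    """Hit rate at which the cache layer matches plain recomputation.

    Solves h*hit + (1-h)*(prefill + miss) = prefill for h.
    Returns None when a hit is no faster than a miss, so caching never pays.
    """
    gain_per_hit = prefill_s + miss_overhead_s - hit_s
    if gain_per_hit <= 0:
        return None
    return miss_overhead_s / gain_per_hit


# Hypothetical timings in seconds.
PREFILL, HIT, MISS_OVERHEAD = 2.0, 0.25, 0.05

threshold = break_even_hit_rate(PREFILL, HIT, MISS_OVERHEAD)
rag_like = expected_latency(0.80, PREFILL, HIT, MISS_OVERHEAD)
one_shot = expected_latency(0.01, PREFILL, HIT, MISS_OVERHEAD)
```

With these assumed timings the break-even hit rate is just under 3%. A workload at 80% hits comes out well below the 2-second baseline, while one at 1% hits is slightly slower than having no cache at all, before counting the operational cost of running the layer.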

LMCache is Apache-2.0 licensed, self-hosted only, and has no API surface of its own — it operates as a library integrated into your existing vLLM or TGI deployment. Deployment requires that your inference stack already runs one of those supported backends; there is no drop-in path for other serving engines without additional integration work.
