Whissle Gateway
Summary
Standard ASR pipelines capture words and throw away everything else — emotion gone, intent gone, speaker context gone, all in the same moment the transcript lands. Whissle is built for the teams who needed that context to actually route, escalate, or act on the call.
Whissle's Stream2Action architecture is designed to feed audio, text, or video through a single-pass discriminative model, META-1, and return structured JSON carrying transcription, speaker diarization, emotion, intent, age, gender, and entities simultaneously. Audio is the input supported today; text streaming is listed as arriving next, and video input is on a stated roadmap. The full stack (ASR, LLM, TTS, diarization) runs self-hosted on a single GPU via Docker, which is the core production story here. The cloud API is documented as temporarily down while on-prem infrastructure is reinforced, so teams who need cloud failover have no fallback path right now. For contact center or privacy-sensitive workloads where you control the hardware, the on-prem path is active; for anything cloud-dependent, you are waiting.
Bottom line: Pick Whissle for a self-hosted, single-GPU voice intelligence stack where emotion and intent metadata matter — plan on a different architecture if your deployment requires a live cloud API or video ingestion before those roadmap items ship.
Pros
- Single-pass emotion, intent, speaker, and entity extraction alongside transcription, so downstream routing logic gets a structured JSON payload instead of raw text that requires a second model call to interpret.
- Full stack — ASR, LLM, TTS, diarization — runs on a single GPU via self-hosted Docker, which means teams in regulated industries can keep audio on-prem without stitching together separate self-hosted components.
- META-1 processes in real time rather than post-call, so a contact center agent or escalation router receives intent signals while the call is still active — not after it ends.
- Provider-agnostic, open-source self-hosted architecture, so teams are not locked to a vendor's cloud pricing model when inference volume scales.
- The browser and macOS app extend the same intelligence stack to ambient and on-device scenarios, so developers can prototype voice agents locally before committing to a server deployment.
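The real-time point in the list above matters because routing logic can react while a call is still in progress. A minimal sketch, assuming per-segment results arrive as dicts with a hypothetical `emotion` key and the label names shown (neither is a documented Whissle schema): escalate once enough recent segments carry a negative emotion.

```python
from collections import deque
from typing import Iterable, Iterator

NEGATIVE = {"angry", "frustrated"}  # assumed label names, not from the vendor docs


def escalation_points(
    events: Iterable[dict], window: int = 3, threshold: int = 2
) -> Iterator[int]:
    """Yield the index of each event at which an escalation should fire.

    Fires when at least `threshold` of the last `window` events are negative,
    then clears the window so one burst triggers one escalation.
    """
    recent = deque(maxlen=window)
    for i, event in enumerate(events):
        recent.append(event.get("emotion") in NEGATIVE)
        if sum(recent) >= threshold:
            yield i
            recent.clear()


# Hypothetical stream of per-segment results from one call.
call = [
    {"emotion": "neutral"},
    {"emotion": "frustrated"},
    {"emotion": "neutral"},
    {"emotion": "angry"},
    {"emotion": "neutral"},
]
print(list(escalation_points(call)))  # prints [3]
```

The window and threshold are tuning knobs; a production rule would likely also weigh intent and speaker role.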
Cons
- The cloud API is explicitly offline at the time of listing. Teams that need a hosted endpoint for testing, staging, or production fallback have no active path: they either self-host immediately or wait for service restoration with no stated timeline.
- Video input is on a multi-month roadmap and text streaming is listed as coming next month; teams building pipelines that ingest video or require text-stream intelligence today will hit a hard capability gap and need a different tool for those modalities.
- Agents Studio — the interface for building and deploying multi-modal voice agents — is listed as cloud-only and coming soon. Teams who need a visual agent-building environment now will find no equivalent on the self-hosted Gateway path, pushing them toward competitors like Vapi or Retell that have live agent-building tooling.
- Community stress-test data on single-GPU throughput under sustained concurrent call load is not publicly available. Teams running high-volume contact center deployments cannot size hardware requirements from documented benchmarks — they are provisioning blind until they run their own load tests.
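Until published benchmarks exist, hardware sizing comes down to an in-house load test. The sketch below shows one way to summarize such a test. It assumes you have already recorded, for each processed audio chunk at a given concurrency, the chunk's duration and the wall-clock processing time; the function name and sample numbers are illustrative, not measurements of Whissle.

```python
import statistics


def summarize(samples: list[tuple[float, float]]) -> dict:
    """Summarize (audio_seconds, processing_seconds) pairs from a load test.

    A real_time_factor below 1.0 means processing keeps up with incoming audio.
    """
    audio = sum(a for a, _ in samples)
    processing = sum(p for _, p in samples)
    latencies = sorted(p for _, p in samples)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "real_time_factor": processing / audio,
        "median_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[rank - 1],
    }


# Made-up samples: four one-second chunks with varying processing times.
samples = [(1.0, 0.2), (1.0, 0.3), (1.0, 0.25), (1.0, 0.9)]
print(summarize(samples))
```

Repeating the run at increasing concurrency and watching where the real-time factor approaches 1.0 gives a rough ceiling for one GPU.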
About
- Platforms: macOS, Linux, WSL, Docker
- API Available: Yes
- Self-Hosted: Yes
- Last Updated: 2026-06-18T04:32:07.270Z
Best For
Who it's for
- Real-time voice and multi-modal processing
- Self-hosted or on-prem deployments
- Applications needing emotion/intent alongside transcription
- Single-GPU voice AI stacks
- Developers building streaming agents
What it does well
- Contact center voice agents resolving calls quickly
- Real-time transcription with speaker and emotion metadata
- Stream-to-action pipelines for audio/text/video inputs
- On-device or on-prem enterprise search and intelligence
- Development of privacy-first voice agents
Frequently Asked Questions
- Is Whissle Gateway free? Whissle Gateway is a paid tool. No permanent free tier is offered.
- Is Whissle Gateway open source? Yes. Whissle Gateway is open source.
- Does Whissle Gateway have an API? Yes. Whissle Gateway exposes a developer API. See the official documentation at https://whissle.ai for details.
- Can I self-host Whissle Gateway? Yes. Whissle Gateway supports self-hosting on your own infrastructure.
- What platforms does Whissle Gateway support? Whissle Gateway is available on: macOS, Linux, WSL, Docker.
Traditional ASR returns a transcript. What it discards — tone, speaker identity, emotional state, intent signal — is often the information a contact center agent or routing system actually needs to act. Whissle’s Stream2Action pipeline addresses that gap by running audio (and eventually text and video) through META-1, a multi-modal discriminative model that produces a single structured JSON payload per pass: transcription with punctuation, speaker diarization, emotion tags, intent classification, entity extraction, and speech analysis metrics like fluency and pitch. The output feeds directly into a generative LLM layer, a routing engine, or third-party APIs via webhooks — the vendor describes the architecture as converting any stream into actionable intelligence without a multi-step pipeline.
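To make the payload idea concrete, here is a minimal sketch of how a consumer might parse one result and pick a downstream action. The field names (`transcript`, `speaker`, `emotion`, `intent`, `entities`), the label values, and the routing table are all assumptions for illustration; this listing does not give Whissle's actual schema, which should be taken from the official documentation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SegmentResult:
    """One hypothetical META-1 result; field names are assumed, not documented."""
    transcript: str
    speaker: str = "unknown"
    emotion: str = "neutral"
    intent: str = "unknown"
    entities: list = field(default_factory=list)


# Hypothetical mapping from intent label to a downstream queue.
ROUTES = {"cancel_subscription": "retention", "billing_dispute": "billing"}


def parse_segment(raw: str) -> SegmentResult:
    """Parse a JSON string, ignoring keys this sketch does not model."""
    data = json.loads(raw)
    known = {k: data[k] for k in SegmentResult.__dataclass_fields__ if k in data}
    return SegmentResult(**known)


def route(segment: SegmentResult) -> str:
    """Pick a queue from intent, sending angry callers to escalation first."""
    if segment.emotion == "angry":
        return "escalation"
    return ROUTES.get(segment.intent, "general")


example = json.dumps({
    "transcript": "I want to cancel my plan.",
    "speaker": "caller",
    "emotion": "neutral",
    "intent": "cancel_subscription",
    "entities": [{"type": "product", "text": "plan"}],
    "pitch_hz": 182.4,  # extra key, dropped by parse_segment
})
print(route(parse_segment(example)))  # prints "retention"
```

Because every signal arrives in one payload, the routing decision is a plain lookup rather than a second model call.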
The differentiating claim is speed with depth. The vendor positions META-1 as bridging the gap between fast-but-shallow ASR and deep-but-slow multi-modal LLMs — a single forward pass that returns semantic metadata in real time rather than after the fact. Whether that holds at production call volumes is something the community has not yet stress-tested publicly, and the cloud API being offline limits independent verification.
Whissle fits best on teams with the infrastructure to self-host and the use case — contact center voice agents, on-prem enterprise search, privacy-first voice apps — where sending audio to a third-party cloud is a non-starter. The Docker install is documented as a one-line curl command that pulls the image and starts with Docker Compose, covering macOS, Linux, and WSL. The Agents Studio for building and deploying multi-modal voice agents is listed as coming soon on cloud; on-prem agent workflows are available through the Gateway today.
The browser product (macOS, free download) and a desktop macOS app extend Whissle to ambient voice intelligence and on-device AI scenarios. The API surface covers ASR, TTS, LLM, and voice agents with developer docs and streaming examples published — though with cloud services offline, live API testing against the hosted endpoint is blocked until service resumes.
