Add WebSocket endpoint for streaming audio transcription by leonbaronick · Pull Request #99 · mutablelogic/go-whisper

leonbaronick · 2026-03-20T23:37:22Z

Hi there! I've been looking for a server-client-split implementation for whisper.cpp and discovered go-whisper. I need real-time microphone transcription though, so I took the liberty of implementing a WebSocket streaming endpoint. Happy to adjust the approach if it doesn't fit your vision for the project.

I'm also working on a separate branch to add microphone recording on the client side, although that may not be a good fit for the CLI application.

Summary

Adds a WebSocket endpoint (GET /api/whisper/stream) that accepts real-time audio input and returns transcription segments as they are produced. This uses a sliding-window approach mirroring https://github.com/ggml-org/whisper.cpp/blob/master/examples/stream/stream.cpp. Audio is accumulated and whisper_full() is called repeatedly on overlapping windows with timestamp-based segment deduplication.
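To make the sliding-window idea concrete, here is a minimal sketch of how overlapping windows could be assembled. It is loosely modelled on the buffering in whisper.cpp's stream example and is not the code from this PR: the `window` type, `msToSamples`, and the parameter values in `main` are hypothetical names and numbers chosen for illustration.

```go
package main

import "fmt"

const sampleRate = 16000 // 16 kHz mono, as in the protocol description

// msToSamples converts a duration in milliseconds to a sample count.
func msToSamples(ms int) int { return ms * sampleRate / 1000 }

// window is a hypothetical helper for illustration, not the PR's StreamSession.
type window struct {
	length int       // target samples per window (from length_ms)
	keep   int       // extra overlap samples (from keep_ms)
	prev   []float32 // audio used in the previous pass
}

// next returns the audio for the next inference pass: a tail of the previous
// window followed by the freshly received samples. The result holds at most
// length+keep samples as long as fresh is not larger than that.
func (w *window) next(fresh []float32) []float32 {
	take := w.keep + w.length - len(fresh)
	if take < 0 {
		take = 0
	}
	if take > len(w.prev) {
		take = len(w.prev)
	}
	out := make([]float32, 0, take+len(fresh))
	out = append(out, w.prev[len(w.prev)-take:]...)
	out = append(out, fresh...)
	w.prev = out
	return out
}

func main() {
	w := &window{length: msToSamples(10000), keep: msToSamples(200)}
	step := make([]float32, msToSamples(3000)) // one step of silence
	for i := 0; i < 5; i++ {
		audio := w.next(step)
		// A real session would hand audio to whisper_full() at this point.
		fmt.Printf("pass %d: %d samples\n", i, len(audio))
	}
}
```

Because consecutive windows share audio, the same speech is transcribed more than once, which is why the segment deduplication step is needed.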

Protocol: Client connects via WebSocket, sends a JSON config (model, language, window parameters), then streams binary PCM frames (16-bit LE, 16kHz mono). The server responds with JSON segment events as they're transcribed.
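As a rough illustration of the client side of this protocol, the sketch below builds the initial JSON config message and packs 16-bit samples into a little-endian binary frame. The struct, its JSON keys, and the model name are assumptions based on the parameter names mentioned in this PR; the actual `StreamConfig` schema may differ, and the WebSocket transport itself is left out.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// streamConfig is a hypothetical stand-in for the PR's StreamConfig type.
// The JSON keys are assumed, not taken from the actual schema.
type streamConfig struct {
	Model    string `json:"model"`
	Language string `json:"language"`
	StepMs   int    `json:"step_ms"`
	LengthMs int    `json:"length_ms"`
	KeepMs   int    `json:"keep_ms"`
}

// encodeConfig renders the first (text) message sent after connecting.
func encodeConfig(cfg streamConfig) (string, error) {
	b, err := json.Marshal(cfg)
	return string(b), err
}

// encodePCM packs signed 16-bit samples into a little-endian binary frame.
func encodePCM(samples []int16) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		binary.LittleEndian.PutUint16(out[2*i:], uint16(s))
	}
	return out
}

func main() {
	// "ggml-base" is a placeholder model name.
	cfg := streamConfig{Model: "ggml-base", Language: "en", StepMs: 3000, LengthMs: 10000, KeepMs: 200}
	msg, err := encodeConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)

	frame := encodePCM(make([]int16, 1600)) // 100 ms of silence at 16 kHz
	fmt.Println(len(frame), "bytes per 100 ms frame")
}
```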

Key changes

pkg/schema/stream.go: StreamConfig and StreamEvent types
pkg/whisper/stream.go: StreamSession implementing sliding-window processing with configurable step_ms, length_ms, keep_ms, and a simple RMS-based VAD gate
pkg/whisper/whisper.go: AcquireTask() for long-lived task ownership (streaming sessions hold a context for their duration, unlike the per-request WithModel pattern)
pkg/manager.go: NewStreamSession() with model validation (local whisper models only)
pkg/httphandler/stream.go: WebSocket handler with PCM-to-float32 conversion
README and API docs updated
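Two of the pieces listed above are simple enough to sketch in isolation: the PCM-to-float32 conversion and an RMS-based energy gate. The following is an illustrative version under assumed names (`pcmToFloat32`, `rms`, `isSpeech`) and an arbitrary threshold; it is not the handler or VAD code from this PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// pcmToFloat32 converts 16-bit little-endian PCM bytes into float32 samples
// in the range [-1, 1). A trailing odd byte is ignored.
func pcmToFloat32(b []byte) []float32 {
	n := len(b) / 2
	out := make([]float32, n)
	for i := 0; i < n; i++ {
		s := int16(binary.LittleEndian.Uint16(b[2*i:]))
		out[i] = float32(s) / 32768
	}
	return out
}

// rms returns the root-mean-square energy of the samples (0 for no samples).
func rms(samples []float32) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// isSpeech is a crude energy gate: audio counts as speech when its RMS
// exceeds the threshold. The threshold value is an assumption to be tuned.
func isSpeech(samples []float32, threshold float64) bool {
	return rms(samples) > threshold
}

func main() {
	frame := []byte{0x00, 0x40, 0x00, 0xC0} // samples 16384 and -16384
	samples := pcmToFloat32(frame)
	fmt.Println(samples, rms(samples), isSpeech(samples, 0.01))
}
```

An energy gate like this cannot tell speech from other loud sounds, which is the motivation for the model-based VAD mentioned under Limitations.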

Limitations

Timestamp-based deduplication can occasionally produce partial repeats at window boundaries when whisper re-transcribes overlapping audio differently across passes
VAD uses simple RMS energy; whisper.cpp has a model-based VAD (Silero) via whisper_vad_context but the Go bindings don't exist yet (tracked with a TODO)
Streaming sessions hold a pool slot for their entire duration, which may starve batch requests under --whisper.max-contexts
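The first limitation is easier to see with a concrete sketch of timestamp-based deduplication. The version below is an assumption about how such a filter could work, not the PR's implementation: it tracks a watermark of the latest emitted end time and drops segments that end at or before it. The `segment` and `deduper` types are hypothetical, and times are in milliseconds.

```go
package main

import "fmt"

// segment is a hypothetical transcription segment with times in milliseconds.
type segment struct {
	StartMs, EndMs int64
	Text           string
}

// deduper remembers how far into the stream segments have been emitted.
type deduper struct {
	emittedMs int64 // end time of the last emitted segment
}

// filter shifts window-relative segments by the window's offset in the stream
// and keeps only those that end after the watermark.
func (d *deduper) filter(offsetMs int64, segs []segment) []segment {
	var out []segment
	for _, s := range segs {
		abs := segment{StartMs: s.StartMs + offsetMs, EndMs: s.EndMs + offsetMs, Text: s.Text}
		if abs.EndMs <= d.emittedMs {
			continue // already covered by an earlier pass
		}
		out = append(out, abs)
		d.emittedMs = abs.EndMs
	}
	return out
}

func main() {
	d := &deduper{}
	// First window starts at 0 ms.
	fmt.Println(d.filter(0, []segment{{0, 2000, "hello"}, {2000, 4000, "world"}}))
	// Second window starts at 3000 ms and overlaps the first by one second.
	fmt.Println(d.filter(3000, []segment{{0, 1000, "world"}, {1000, 3000, "again"}}))
}
```

If a later pass splits the overlapping audio differently, for example into a single segment that starts before the watermark and ends after it, the filter keeps that segment and part of its text repeats what was already emitted.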

djthorpe · 2026-03-24T06:06:19Z

Hi Leon. Thanks for the PR! Sorry for the late reply but I will take a look at the weekend and get back to you.

leonbaronick added 4 commits March 20, 2026 17:55

Add streaming schema types and websocket dependency

6e4dd5f

Added StreamConfig, StreamEvent types and event constants for the WebSocket streaming transcription protocol.

Add sliding-window streaming session and AcquireTask

0be1160

StreamSession implements chunked audio processing mirroring whisper.cpp's stream example. AcquireTask allows long-lived task ownership for streaming sessions.

Add WebSocket handler and wire up streaming endpoint

9c8083f

Adds the /stream WebSocket endpoint that accepts PCM audio and returns transcription segments in real time. Integrates with the manager layer via NewStreamSession.

Add documentation for streaming audio input endpoint

6ff0168

Update README features list and API docs with the WebSocket streaming transcription protocol, configuration parameters, and event format.

leonbaronick wants to merge 4 commits into mutablelogic:main from leonbaronick:streaming-audio-input
