Add WebSocket endpoint for streaming audio transcription #99

Open
leonbaronick wants to merge 4 commits into mutablelogic:main from leonbaronick:streaming-audio-input

@leonbaronick

Hi there! I've been looking for a server-client-split implementation for whisper.cpp and discovered go-whisper. I need real-time microphone transcription though, so I took the liberty of implementing a WebSocket streaming endpoint. Happy to adjust the approach if it doesn't fit your vision for the project.

I'm also working on a separate branch that adds microphone recording on the client side, although that may not be a good fit for the CLI application.

Summary

Adds a WebSocket endpoint (GET /api/whisper/stream) that accepts real-time audio input and returns transcription segments as they are produced. This uses a sliding-window approach mirroring https://github.com/ggml-org/whisper.cpp/blob/master/examples/stream/stream.cpp. Audio is accumulated and whisper_full() is called repeatedly on overlapping windows with timestamp-based segment deduplication.

Protocol: Client connects via WebSocket, sends a JSON config (model, language, window parameters), then streams binary PCM frames (16-bit LE, 16kHz mono). The server responds with JSON segment events as they're transcribed.

Key changes

  • pkg/schema/stream.go: StreamConfig and StreamEvent types
  • pkg/whisper/stream.go: StreamSession implementing sliding-window processing with configurable step_ms, length_ms, keep_ms, and a simple RMS-based VAD gate
  • pkg/whisper/whisper.go: AcquireTask() for long-lived task ownership (streaming sessions hold a context for their duration, unlike the per-request WithModel pattern)
  • pkg/manager.go: NewStreamSession() with model validation (local whisper models only)
  • pkg/httphandler/stream.go: WebSocket handler with PCM-to-float32 conversion
  • README and API docs updated

Limitations

  • Timestamp-based deduplication can occasionally produce partial repeats at window boundaries when whisper re-transcribes overlapping audio differently across passes
  • VAD uses simple RMS energy; whisper.cpp has a model-based VAD (Silero) via whisper_vad_context but the Go bindings don't exist yet (tracked with a TODO)
  • Streaming sessions hold a pool slot for their entire duration, which may starve batch requests under --whisper.max-contexts
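
The pool-slot limitation can be shown with a toy pool. This is a hypothetical stand-in built on a buffered channel, not the pool in go-whisper; it only demonstrates why a long-lived session blocks other work when the slot cap is small.

```go
package main

import "fmt"

// Pool is a hypothetical context pool capped at a fixed number of slots.
type Pool struct{ slots chan struct{} }

// NewPool creates a pool with max slots.
func NewPool(max int) *Pool { return &Pool{slots: make(chan struct{}, max)} }

// TryAcquire takes a slot without blocking and reports whether it succeeded.
func (p *Pool) TryAcquire() bool {
	select {
	case p.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a slot. It must only be called after a successful
// TryAcquire, otherwise it blocks.
func (p *Pool) Release() { <-p.slots }

func main() {
	pool := NewPool(1) // a cap of one context

	fmt.Println("stream acquired:", pool.TryAcquire())
	// While the streaming session is open, a batch request finds no slot.
	fmt.Println("batch acquired:", pool.TryAcquire())

	pool.Release() // the streaming session ends
	fmt.Println("batch after release:", pool.TryAcquire())
}
```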

Added StreamConfig, StreamEvent types and event constants for the WebSocket streaming transcription protocol.

StreamSession implements chunked audio processing mirroring whisper.cpp's stream example. AcquireTask allows long-lived task ownership for streaming sessions.

Adds the /stream WebSocket endpoint that accepts PCM audio and returns transcription segments in real time. Integrates with the manager layer via NewStreamSession.

Update README features list and API docs with the WebSocket streaming transcription protocol, configuration parameters, and event format.

@djthorpe

Member

Hi Leon. Thanks for the PR! Sorry for the late reply but I will take a look at the weekend and get back to you.
