One process. Every voice backend.
Timbre is a local voice gateway that unifies your TTS and STT backends behind a single OpenAI-compatible API. Instead of managing separate containers for each service, Timbre imports backends as libraries and serves them from one process, with lazy model loading, automatic unloading, and a built-in management UI.
Running local voice means juggling PocketTTS on one port, Whisper on another, Parakeet on a third, each in its own container with its own config. Timbre replaces all of that with one URL.
Point your agent, app, or tool at http://localhost:9000/v1/audio/speech and pick your backend with the model field. That's it.
Benchmarks: Timbre versus the same backends running as standalone Docker containers, measured on the same hardware with the same input (a Docker-to-Docker comparison).
| Backend | Task | Standalone | Timbre | Result |
|---|---|---|---|---|
| PocketTTS | TTS | 2.63s | 0.95s | 64% faster |
| Supertonic | TTS | 1.01s | 1.30s | 29% slower (+290ms) |
| Parakeet | STT | 0.27s | 0.26s | equal |
| faster-whisper | STT | 3.48s | 2.32s | 33% faster |
Ryzen 9 5950X, CPU inference, 3.3s test audio.
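The Result column follows directly from the two timing columns. A minimal sketch of that arithmetic, using the numbers from the table (the function name is illustrative and not part of Timbre):

```python
def percent_faster(standalone_s: float, timbre_s: float) -> float:
    """Relative latency reduction of Timbre vs. standalone, in percent.

    Positive means Timbre is faster; negative means it is slower.
    """
    return (standalone_s - timbre_s) / standalone_s * 100.0


# Timings copied from the benchmark table (seconds).
timings = {
    "PocketTTS": (2.63, 0.95),
    "Supertonic": (1.01, 1.30),
    "Parakeet": (0.27, 0.26),
    "faster-whisper": (3.48, 2.32),
}

for backend, (standalone, timbre) in timings.items():
    print(f"{backend}: {percent_faster(standalone, timbre):+.0f}%")
```

A positive value matches the "faster" rows; Supertonic comes out negative, which is the one case where the standalone container wins.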
Text-to-Speech: PocketTTS (fast, CPU), Supertonic (fast, ONNX), Qwen3 (quality, CUDA)
Speech-to-Text: Parakeet (fast, ONNX), faster-whisper (compatible, CTranslate2)
python3.12 -m venv .venv && . .venv/bin/activate
pip install "timbre-voice[all]"
timbre setup
timbre serve

Server starts at http://127.0.0.1:9000. UI at http://127.0.0.1:9000/ui/.
docker compose up -d --build

For Qwen3 / CUDA:

docker compose -f docker-compose.cuda.yml up -d --build

curl http://127.0.0.1:9000/health
curl http://127.0.0.1:9000/v1/backends

curl http://127.0.0.1:9000/v1/audio/speech \
-H "content-type: application/json" \
-d '{"model":"pocket","input":"Hello from Timbre.","voice":"alba"}' \
--output speech.wav

Switch backends by changing model: pocket, supertonic, or qwen3.
Supported output formats: wav, mp3, opus, ogg, flac.
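The same speech request can be made from Python. The sketch below uses only the standard library and mirrors the JSON body of the curl example; the helper names are illustrative, and it assumes the server is running at the default address.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9000"  # default address used in the examples above


def build_speech_request(model: str, text: str, voice: str) -> urllib.request.Request:
    """Build a POST to /v1/audio/speech with the same JSON body as the curl example."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(model: str, text: str, voice: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(model, text, voice)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)


# Usage (requires a running Timbre server):
# synthesize("pocket", "Hello from Timbre.", "alba", "speech.wav")
```

Switching backends is then a matter of passing a different model string, exactly as with curl.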
curl http://127.0.0.1:9000/v1/audio/transcriptions \
-F model=parakeet \
-F file=@recording.wav

Switch backends: parakeet or whisper.
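For clients that cannot shell out to curl, the transcription upload can be sketched in Python with the standard library. The code builds the multipart/form-data body by hand with the same two fields curl sends (model and file); the helper name and the generic application/octet-stream part type are choices made for this sketch, not something Timbre prescribes.

```python
import urllib.request
import uuid

BASE_URL = "http://127.0.0.1:9000"  # default address used in the examples above


def build_transcription_request(model: str, filename: str, audio: bytes) -> urllib.request.Request:
    """Build a multipart POST with `model` and `file` fields, like `curl -F`."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n'
        "\r\n"
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/transcriptions",
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Usage (requires a running Timbre server and a local recording.wav):
# with open("recording.wav", "rb") as fh:
#     req = build_transcription_request("parakeet", "recording.wav", fh.read())
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```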
Upload a 5-10 second reference clip and use it by name:
curl http://127.0.0.1:9000/v1/voices \
-F name=my_voice \
-F backend=pocket \
-F precompute=true \
-F file=@reference.wav

curl http://127.0.0.1:9000/v1/audio/speech \
-H "content-type: application/json" \
-d '{"model":"pocket","input":"Speaking in a cloned voice.","voice":"my_voice"}' \
--output cloned.wav

Timbre maps OpenAI voice names to native backend voices, so existing clients work without changes:
curl http://127.0.0.1:9000/v1/audio/speech \
-H "content-type: application/json" \
-d '{"model":"supertonic","input":"Hello.","voice":"alloy"}' \
--output hello.wav

alloy resolves to Supertonic's F1. Default aliases: alloy/F1, echo/M1, fable/M2, nova/F2, onyx/M3, shimmer/F3. Custom aliases can be created through the API or UI.
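Conceptually, alias resolution is a lookup table. The sketch below encodes the default aliases listed above; the function name and the pass-through behavior for unknown names are assumptions made for illustration, not a description of Timbre's internals.

```python
# Default OpenAI-name -> native-voice aliases, as listed above.
DEFAULT_ALIASES = {
    "alloy": "F1",
    "echo": "M1",
    "fable": "M2",
    "nova": "F2",
    "onyx": "M3",
    "shimmer": "F3",
}


def resolve_voice(name: str, aliases: dict[str, str] = DEFAULT_ALIASES) -> str:
    """Return the native voice for an alias.

    Assumption: names that are not aliases are passed through unchanged,
    so native voice names keep working.
    """
    return aliases.get(name, name)


print(resolve_voice("alloy"))  # alias -> native voice
print(resolve_voice("F2"))     # already a native name
```

A custom alias is then just one more entry in the table, for example `{**DEFAULT_ALIASES, "narrator": "M2"}`.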
Qwen has its own Studio tab and API because it has more than one workflow:
- Clone: use your own reference audio with Qwen Base.
- Preset Voice: use Qwen preset speakers with style/emotion instructions.
- Voice Design: describe a new voice in text, generate it, then save the exact generated audio as a clone.
Clone and Preset Voice can use 0.6b or 1.7b through model_size. Voice Design uses 1.7b.
Timbre switches the active Qwen model automatically for these Studio endpoints, so you do not have to set the model manually before each request.
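The workflow and model-size rules above can be captured in a small client-side check. In this sketch the endpoint paths are the ones used in the curl examples in this section, while the workflow keys ("clone", "preset", "design") and the function name are hypothetical labels chosen for illustration.

```python
# Workflow -> (endpoint path, allowed model_size values).
QWEN_WORKFLOWS = {
    "clone": ("/v1/qwen/clone/speech", {"0.6b", "1.7b"}),
    "preset": ("/v1/qwen/custom-voice/speech", {"0.6b", "1.7b"}),
    "design": ("/v1/qwen/voice-design/speech", {"1.7b"}),
}


def qwen_endpoint(workflow: str, model_size: str) -> str:
    """Return the endpoint path for a workflow, rejecting unsupported sizes."""
    if workflow not in QWEN_WORKFLOWS:
        raise ValueError(f"unknown Qwen workflow: {workflow!r}")
    path, sizes = QWEN_WORKFLOWS[workflow]
    if model_size not in sizes:
        raise ValueError(
            f"{workflow!r} supports model_size in {sorted(sizes)}, got {model_size!r}"
        )
    return path


print(qwen_endpoint("clone", "0.6b"))
print(qwen_endpoint("design", "1.7b"))
```

Validating before sending saves a round trip when, say, Voice Design is requested with 0.6b.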
Upload a Qwen clone reference:
curl http://127.0.0.1:9000/v1/qwen/voices \
-F name=my_qwen_voice \
-F file=@reference.wav \
-F ref_text='Text spoken in the reference audio.' \
-F model_size=1.7b \
-F prepare=false

Generate with that clone:
curl http://127.0.0.1:9000/v1/qwen/clone/speech \
-H "content-type: application/json" \
-d '{
  "input": "Speaking with a Qwen reference voice.",
  "voice": "my_qwen_voice",
  "model_size": "1.7b",
  "response_format": "wav",
  "language": "Auto"
}' \
--output qwen-clone.wav

Use a Qwen preset voice with instructions:
curl http://127.0.0.1:9000/v1/qwen/custom-voice/speech \
-H "content-type: application/json" \
-d '{
  "input": "This uses a Qwen preset speaker.",
  "speaker": "Vivian",
  "model_size": "1.7b",
  "instruct": "Speak warmly with calm confidence.",
  "response_format": "wav",
  "language": "Auto"
}' \
--output qwen-preset.wav

Design a voice:
curl http://127.0.0.1:9000/v1/qwen/voice-design/speech \
-H "content-type: application/json" \
-d '{
  "input": "This is a newly designed narrator voice.",
  "instruct": "A warm mature narrator, slow pacing, intimate microphone.",
  "model_size": "1.7b",
  "response_format": "wav",
  "language": "Auto"
}' \
--output qwen-design.wav

To save a Voice Design result as a reusable clone, upload the generated WAV:
curl http://127.0.0.1:9000/v1/qwen/voices \
-F name=my_designed_voice \
-F file=@qwen-design.wav \
-F ref_text='This is a newly designed narrator voice.' \
-F design='A warm mature narrator, slow pacing, intimate microphone.' \
-F model_size=1.7b \
-F prepare=false

Lazy loading with TTL. Models load on first request and unload after a configurable idle timeout. The server is always reachable; only model weights cycle in and out of memory.
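The lazy-load-with-TTL pattern described above can be sketched in a few lines. This is an illustration of the general technique, not Timbre's actual TTL manager: the class name, method names, and the injectable clock are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class LazyModel:
    """Load weights on first use; drop them after `ttl_seconds` of idleness."""

    def __init__(
        self,
        loader: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._model: Optional[Any] = None
        self._last_used = 0.0

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        """Return the model, loading it on first request, and refresh the idle timer."""
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been idle for at least the TTL; report whether it was dropped."""
        if self._model is not None and self._clock() - self._last_used >= self._ttl:
            self._model = None
            return True
        return False
```

In a server, something like `unload_if_idle()` would run periodically in a background task, while request handlers only ever call `get()`. That separation is what keeps the process reachable while weights cycle in and out of memory.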
Backend management. Enable, disable, load, and unload backends at runtime through the API or UI without restarting the server.
curl http://127.0.0.1:9000/v1/backends/tts/pocket \
-H "content-type: application/json" \
-d '{"action":"load"}'

Model profiles. Download and switch between model variants per backend.
timbre download-models --model whisper:small --set-default
timbre download-models --model parakeet:int8 --set-default

Config API. View and update configuration at runtime.
curl http://127.0.0.1:9000/v1/config
curl -X PUT http://127.0.0.1:9000/v1/config \
-H "content-type: application/json" \
-d @config.json

Qwen3 is disabled by default. It requires a CUDA GPU and pulls heavy dependencies. Use the CUDA Docker compose file for the simplest setup.
pip install "timbre-voice[qwen3]"
timbre download-models --model qwen3:1.7b-customvoice --set-default

Available Qwen model profiles:
- qwen3:0.6b-base
- qwen3:0.6b-customvoice
- qwen3:1.7b-base
- qwen3:1.7b-customvoice
- qwen3:1.7b-voicedesign
You can still use Qwen through the generic OpenAI-compatible speech route when the active Qwen model supports the selected voice:
curl http://127.0.0.1:9000/v1/audio/speech \
-H "content-type: application/json" \
-d '{"model":"qwen3","input":"Hello from Qwen.","voice":"Vivian"}' \
--output qwen.wav

For multi-GPU systems, set the device to cuda:auto (picks the GPU with most free VRAM), or pin a specific GPU with cuda:0, cuda:1, etc. In Docker, you can also limit the visible GPU with TIMBRE_NVIDIA_VISIBLE_DEVICES in .env before starting docker-compose.cuda.yml.
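The cuda:auto rule ("pick the GPU with the most free VRAM") amounts to an argmax over per-device free memory. The sketch below shows that selection as a pure function; the function name is hypothetical, and the free-memory list is passed in so the logic can be read and tested without a GPU.

```python
def resolve_device(spec: str, free_vram_bytes: list[int]) -> str:
    """Resolve 'cuda:auto' to the GPU index with the most free VRAM.

    Any other spec (e.g. 'cuda:0', 'cpu') is returned unchanged.
    `free_vram_bytes[i]` is the free memory on GPU i; with PyTorch it could be
    gathered via torch.cuda.mem_get_info(i)[0] (an assumption about how a
    caller might collect it, not a statement about Timbre's code).
    """
    if spec != "cuda:auto":
        return spec
    if not free_vram_bytes:
        raise RuntimeError("cuda:auto requested but no GPUs are visible")
    best = max(range(len(free_vram_bytes)), key=free_vram_bytes.__getitem__)
    return f"cuda:{best}"


# Three GPUs with 4, 10, and 7 GiB free: the second one is selected.
gib = 1024**3
print(resolve_device("cuda:auto", [4 * gib, 10 * gib, 7 * gib]))
```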
For best throughput, install Flash Attention:
pip install flash-attn --no-build-isolation

CUDA Docker handles this automatically.
Config lives at ~/.config/timbre/config.yaml. Edit it directly, through the UI Config page, or through the API. Changes to backends, TTL, and device settings apply immediately. Host/port changes require a restart.
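When scripting config updates, it can help to know in advance whether a change needs a restart. The sketch below compares two config dictionaries and flags host/port changes; the flat key names ("host", "port", "ttl_seconds", "device") are hypothetical, since the real layout of config.yaml is not shown here.

```python
# Keys whose changes require a server restart, per the note above.
# The exact key names are assumptions made for this sketch.
RESTART_KEYS = {"host", "port"}


def changed_keys(old: dict, new: dict) -> set[str]:
    """Return the top-level keys whose values differ between two configs."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def needs_restart(old: dict, new: dict) -> bool:
    """True if any changed key is one that only takes effect after a restart."""
    return bool(changed_keys(old, new) & RESTART_KEYS)


current = {"host": "127.0.0.1", "port": 9000, "ttl_seconds": 300, "device": "cpu"}
print(needs_restart(current, {**current, "ttl_seconds": 600}))  # TTL change only
print(needs_restart(current, {**current, "port": 9100}))        # port change
```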
git clone https://github.com/Spadav/Timbre.git
cd Timbre
python3.12 -m venv .venv && . .venv/bin/activate
pip install -e ".[all,dev]"
cd web && npm install && npm run build && cd ..
timbre serve

Run tests:
ruff check .
pytest -q

Timbre is not a wrapper around existing audio servers. Each backend is imported as a Python library and called directly. The API surface, routing, TTL management, voice system, and UI are all original code. The backends are dependencies, like numpy or torch.
Request -> FastAPI Router -> Backend Manager -> Backend ABC -> Library Call
                                    |
                        TTL Manager (load/unload weights)
Adding a new backend means implementing a simple ABC (TTSBackend or STTBackend) and registering it. The router and API never change.
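To give a feel for what "implementing a simple ABC and registering it" might look like, here is a self-contained sketch. Only the name TTSBackend comes from the text above; the method names, the registry, and the decorator are hypothetical and the real interface may differ.

```python
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Hypothetical shape of a TTS backend interface (illustrative only)."""

    @abstractmethod
    def load(self) -> None:
        """Load model weights into memory."""

    @abstractmethod
    def unload(self) -> None:
        """Release model weights."""

    @abstractmethod
    def synthesize(self, text: str, voice: str) -> bytes:
        """Return raw audio bytes for `text` spoken in `voice`."""


# Hypothetical registry: the router would look backends up by model name.
REGISTRY: dict[str, type[TTSBackend]] = {}


def register(name: str):
    """Class decorator that adds a backend to the registry under `name`."""
    def decorator(cls: type[TTSBackend]) -> type[TTSBackend]:
        REGISTRY[name] = cls
        return cls
    return decorator


@register("silence")
class SilenceBackend(TTSBackend):
    """Toy backend that emits one silent 16-bit sample per input character."""

    def __init__(self) -> None:
        self.loaded = False

    def load(self) -> None:
        self.loaded = True

    def unload(self) -> None:
        self.loaded = False

    def synthesize(self, text: str, voice: str) -> bytes:
        if not self.loaded:
            self.load()  # lazy load on first request
        return b"\x00\x00" * len(text)


backend = REGISTRY["silence"]()
print(len(backend.synthesize("Hello.", "any")))
```

Because the router only depends on the abstract interface and the registry, a new backend is added without touching routing code, which is the property the paragraph above describes.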
Timbre's source code is MIT licensed. Backend model weights are subject to their own licenses (see each backend's documentation).
- GitHub: Spadav/Timbre
- PyPI: timbre-voice
- Docker Hub: spadav/timbre
