SuperSonic WebUI is a local text-to-speech gateway and browser interface for Supertone's Supertonic TTS engine. It exposes a mostly OpenAI-compatible /v1/audio/speech endpoint, maps OpenAI-style voice names to Supertonic voices, and includes a compact WebUI for generating speech, changing language, selecting output format, and using simulated streaming playback.
The app runs locally with FastAPI and serves both the API and static WebUI from the same process.
- Local Supertonic TTS through `supertonic>=1.2.0`
- OpenAI-style speech endpoint: `POST /v1/audio/speech`
- WebUI at `/`
- Voice aliases: `alloy`, `echo`, `fable`, `nova`, `onyx`, `shimmer`
- Native Supertonic voices: `M1`-`M5`, `F1`-`F5`
- Output formats: `wav`, `mp3`, `opus`, `flac`, `pcm`
- Simulated streaming endpoint: `POST /v1/audio/stream`
- 31 language options
- Text normalization for numbers, currencies, units, and phone numbers
- Custom voice-style JSON upload support
- Python 3.10+ recommended
- `ffmpeg` for MP3, Opus, and FLAC output
- The first Supertonic model load may download model assets into `~/.cache/supertonic3`
python3 -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
.venv/bin/python server.py
Open:
http://127.0.0.1:8880
Health check:
curl http://127.0.0.1:8880/health
Build and run with Docker Compose:
docker compose up -d --build
Open:
http://127.0.0.1:8880
The compose file keeps the Supertonic model cache in a Docker volume so the model is not downloaded every time:
supertonic-cache:/root/.cache
Custom uploaded voices are persisted through:
./voices:/app/voices
Stop the container:
docker compose down
Remove the model cache volume as well:
docker compose down -v
Generate WAV audio:
curl -o speech.wav \
-X POST http://127.0.0.1:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello from SuperSonic.",
"voice": "alloy",
"response_format": "wav",
"speed": 1.0,
"language": "en",
"steps": 8
}'
List voices and languages:
curl http://127.0.0.1:8880/v1/audio/voices
List models:
curl http://127.0.0.1:8880/v1/models
Stream simulated PCM frames over Server-Sent Events:
curl -N \
-X POST "http://127.0.0.1:8880/v1/audio/stream?chunk_size=300" \
-H "Content-Type: application/json" \
-d '{
"input": "This is a simulated streaming test.",
"voice": "echo",
"language": "en",
"speed": 1.0,
"steps": 8
}'
Compatible with basic text-to-speech clients that call:
POST /v1/audio/speech
Supported OpenAI-style fields:
- `model`
- `input`
- `voice`
- `response_format`
- `speed`
SuperSonic-specific fields:
- `language`
- `steps`
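The fields above can be combined into a request from Python as well as from curl. The following is a minimal sketch using only the standard library; the helper names `build_speech_request` and `synthesize_to_file` are hypothetical and not part of SuperSonic, and the default values simply mirror the curl example in this README.

```python
import json
import urllib.request

# Default address used throughout this README.
BASE_URL = "http://127.0.0.1:8880"


def build_speech_request(text, voice="alloy", response_format="wav",
                         speed=1.0, language="en", steps=8,
                         base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": "tts-1",            # accepted, but Supertonic is always used
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "language": language,        # SuperSonic-specific field
        "steps": steps,              # SuperSonic-specific field
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

With the server running, `synthesize_to_file("Hello from SuperSonic.", "speech.wav")` should produce the same file as the curl example. No `Authorization` header is set because the server does not validate one.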
Important differences:
- `model` is accepted, but the app always uses Supertonic.
- Authentication is not required or validated.
- Error responses are FastAPI-style, not exact OpenAI error objects.
- Streaming uses a custom `/v1/audio/stream` SSE format.
- Transcription and translation endpoints are not implemented.
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `SUPERSONIC_HOST` | `0.0.0.0` | Bind address |
| `SUPERSONIC_PORT` | `8880` | HTTP port |
| `SUPERSONIC_LOG_LEVEL` | `INFO` | Python logging level |
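As an illustration of how these variables and defaults fit together, here is a minimal sketch of reading them in Python. The `Settings` class and `load_settings` function are hypothetical names chosen for this example; the actual `server.py` may read its configuration differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    log_level: str


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults.

    `env` is any mapping of strings; it defaults to the process environment.
    """
    env = os.environ if env is None else env
    return Settings(
        host=env.get("SUPERSONIC_HOST", "0.0.0.0"),
        port=int(env.get("SUPERSONIC_PORT", "8880")),
        log_level=env.get("SUPERSONIC_LOG_LEVEL", "INFO").upper(),
    )
```

For example, starting the server with `SUPERSONIC_PORT=9000` set in the environment would correspond to `load_settings().port == 9000` in this sketch.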
Upload a Supertonic voice-style JSON file from the WebUI, or call:
curl -X POST http://127.0.0.1:8880/v1/audio/voices/upload \
-F "file=@voice.json"
Uploaded voices are stored in `voices/` and can be used by name in the `voice` field.
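The same upload can be done from Python without extra dependencies. This is a rough sketch that hand-builds a `multipart/form-data` body equivalent to curl's `-F "file=@voice.json"`; the helper names `build_multipart` and `build_upload_request` are hypothetical, and the part's `application/json` content type is an assumption about what the server expects.

```python
import uuid
import urllib.request


def build_multipart(field, filename, content, content_type="application/json"):
    """Return (body, content_type_header) for a single-file multipart form."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(content, filename="voice.json",
                         base_url="http://127.0.0.1:8880"):
    """Build (but do not send) the POST request for the voice upload endpoint."""
    body, header = build_multipart("file", filename, content)
    return urllib.request.Request(
        f"{base_url}/v1/audio/voices/upload",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
```

To send it, read the file as bytes and pass the result to `urllib.request.urlopen(build_upload_request(data))`.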
- WAV and PCM do not require ffmpeg.
- MP3, Opus, and FLAC require ffmpeg.
- The streaming endpoint is simulated streaming: Supertonic synthesizes text chunks, then the server slices generated PCM into small frames for browser playback.
- CPU inference works, but generation speed depends on machine performance.
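The slicing step described in the streaming note can be pictured with a small sketch. This is not the server's actual code: the function name `slice_pcm`, the 20 ms frame length, the 44100 Hz sample rate, and the 16-bit mono layout are all assumptions made for illustration.

```python
def slice_pcm(pcm, frame_ms=20, sample_rate=44100, sample_width=2, channels=1):
    """Yield fixed-duration frames from an already-synthesized PCM buffer.

    All defaults are illustrative assumptions, not values taken from SuperSonic.
    The last frame may be shorter than the others.
    """
    samples_per_frame = sample_rate * frame_ms // 1000
    bytes_per_frame = samples_per_frame * sample_width * channels
    if bytes_per_frame <= 0:
        raise ValueError("frame_ms and sample_rate must give a non-empty frame")
    for start in range(0, len(pcm), bytes_per_frame):
        yield pcm[start:start + bytes_per_frame]
```

Because the audio for a text chunk is fully generated before it is sliced, the first frame cannot be sent until that chunk has finished synthesizing, which is why the endpoint is described as simulated streaming.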