Fast Speech-to-Text Pipeline Supporting Multiple ASR Backends: Whisper + Parakeet
WhisperS2T is an optimized, lightning-fast automatic speech recognition (ASR) pipeline supporting multiple model backends:
| Backend | Model | Languages | Best For |
|---|---|---|---|
| Parakeet | NVIDIA Parakeet TDT 0.6B v2 | English only | State-of-the-art English accuracy |
| CTranslate2 | OpenAI Whisper | 99+ languages | Fast multilingual transcription |
| TensorRT-LLM | OpenAI Whisper | 99+ languages | Maximum speed on NVIDIA GPUs |
| HuggingFace | OpenAI Whisper | 99+ languages | Flexibility, Distil models |
The pipeline provides a 2.3X speed improvement over WhisperX and a 3X boost over the HuggingFace pipeline with FlashAttention 2.
This fork includes a push-to-talk hotkey application for instant speech-to-text with automatic clipboard copy.
```bash
conda activate whisper
python whisper_hotkey.py
```

- Hold the hotkey → recording starts (a pop sound plays)
- Release → transcribe & auto-copy to clipboard
- Paste anywhere with `Ctrl+V`
- ⌨️ Configurable hotkey (default: `ctrl+windows`)
- 📋 Auto-copy to clipboard - paste transcriptions anywhere instantly
- 🔊 Audio notification when recording starts
- 🧵 Multi-threaded - records and transcribes in parallel for long recordings
- 🔗 Intelligent stitching - handles chunk boundaries with smart overlap detection
- ⚙️ `.env` configuration - easily customize model, mic, hotkey, and more
- 🦜 Parakeet support - use NVIDIA's state-of-the-art English ASR model
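The chunk-stitching idea above can be illustrated with a minimal sketch (a hypothetical helper, not the app's actual implementation): find the longest word-level overlap between the tail of one chunk's transcript and the head of the next, and drop the duplicated words when joining.

```python
def stitch(prev: str, nxt: str, max_overlap: int = 10) -> str:
    """Join two transcript chunks, dropping duplicated words at the boundary.

    Searches for the longest suffix of `prev` (up to `max_overlap` words)
    that matches a prefix of `nxt`, then merges without the duplicate run.
    """
    a, b = prev.split(), nxt.split()
    for n in range(min(max_overlap, len(a), len(b)), 0, -1):
        if [w.lower() for w in a[-n:]] == [w.lower() for w in b[:n]]:
            return " ".join(a + b[n:])
    return " ".join(a + b)


print(stitch("the quick brown fox", "brown fox jumps over"))
# the quick brown fox jumps over
```

If no overlap is found, the chunks are simply concatenated, so the worst case is a repeated word or two rather than lost speech.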
Edit `.env` to choose your backend:

```ini
# ============ Option 1: Whisper (multilingual) ============
MODEL=large-v3
BACKEND=CTranslate2
LANGUAGE=en

# ============ Option 2: Parakeet (English, best accuracy) ============
# MODEL=models/parakeet-tdt-0.6b-v2.nemo
# BACKEND=Parakeet
# LANGUAGE=en
```

See SETUP.md for installation and USAGE_GUIDE.md for detailed options.
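A `.env` file like the one above is just `KEY=VALUE` lines. As an illustration of how such a file is read (the app may well use a library such as python-dotenv instead), a minimal parser looks like:

```python
def load_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # commented-out options (like Option 2 above) are ignored
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip()
    return cfg


cfg = load_env("MODEL=large-v3\nBACKEND=CTranslate2\n# LANGUAGE=fr\nLANGUAGE=en")
print(cfg)  # {'MODEL': 'large-v3', 'BACKEND': 'CTranslate2', 'LANGUAGE': 'en'}
```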
NVIDIA's Parakeet TDT 0.6B v2 is the current state-of-the-art for English speech recognition, outperforming Whisper large-v3 on accuracy benchmarks.
```bash
# Install NeMo (a fresh conda env is recommended)
pip install "nemo_toolkit[asr]"

# Or use the Parakeet-specific requirements
pip install -r requirements-parakeet.txt
```

```python
import whisper_s2t

# Load Parakeet model
model = whisper_s2t.load_model(
    "models/parakeet-tdt-0.6b-v2.nemo",  # or "nvidia/parakeet-tdt-0.6b-v2"
    backend="Parakeet"
)

# Transcribe
result = model.transcribe_with_vad(["audio.wav"])
print(result[0][0]['text'])
```

| Feature | Parakeet TDT | Whisper large-v3 |
|---|---|---|
| English Accuracy | Best | Very Good |
| Languages | English only | 99+ languages |
| Speed | Fast | Depends on backend |
| Model Size | ~600MB | ~1.5GB |
| Timestamps | Built-in | Via alignment |
Recommendation:
- For English: Use Parakeet
- For multilingual: Use Whisper with CTranslate2
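The recommendation above can be encoded as a tiny helper (hypothetical, for illustration only):

```python
def pick_backend(lang_code: str) -> tuple:
    """Return (model_identifier, backend) per the recommendation:
    Parakeet for English, Whisper + CTranslate2 for everything else."""
    if lang_code == "en":
        return ("nvidia/parakeet-tdt-0.6b-v2", "Parakeet")
    return ("large-v3", "CTranslate2")


print(pick_backend("en"))  # ('nvidia/parakeet-tdt-0.6b-v2', 'Parakeet')
print(pick_backend("de"))  # ('large-v3', 'CTranslate2')
```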
- [Dec 15, 2025]: Added NVIDIA Parakeet TDT backend for state-of-the-art English ASR
- [Feb 25, 2024]: Added prebuilt Docker images and a transcript exporter for `txt`, `json`, `tsv`, `srt`, and `vtt` formats.
- [Jan 28, 2024]: Added support for the TensorRT-LLM backend.
- [Dec 23, 2023]: Added support for word alignment for CTranslate2 backend.
- [Dec 19, 2023]: Added support for Whisper-Large-V3 and Distil-Whisper-Large-V2.
- [Dec 17, 2023]: Released WhisperS2T!
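As an example of the export formats listed in the changelog, segments with start/end times map to SRT as below. This is a sketch (not the bundled exporter); the `start`/`end`/`text` field names are assumptions.

```python
def to_srt(segments: list) -> str:
    """Render [{'start': sec, 'end': sec, 'text': str}, ...] as an SRT string."""
    def ts(sec: float) -> str:
        # SRT timestamps are HH:MM:SS,mmm
        ms = round(sec * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = [
        f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}"
        for i, seg in enumerate(segments, 1)
    ]
    return "\n\n".join(blocks) + "\n"


print(to_srt([{"start": 0.0, "end": 1.5, "text": "Hello."}]))
# 1
# 00:00:00,000 --> 00:00:01,500
# Hello.
```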
Check out the Google Colab notebooks provided here: notebooks
- 🔄 Multi-Backend Support: Whisper (CTranslate2, HuggingFace, TensorRT-LLM, OpenAI) + Parakeet (NeMo)
- 🦜 State-of-the-Art English: NVIDIA Parakeet TDT achieves best-in-class English accuracy
- 🎙️ Easy Integration of Custom VAD Models: Seamlessly add custom Voice Activity Detection models
- 🎧 Effortless Handling of Audio Files: Intelligently batch smaller speech segments
- ⏳ Streamlined Processing: Asynchronously loads large audio files while transcribing
- 🌐 Batching Support: Decode multiple languages or tasks in a single batch
- 🧠 Reduction in Hallucination: Optimized parameters to decrease repeated text
- ⏱️ Dynamic Time Length Support: Process variable-length inputs (CTranslate2)
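The segment-batching idea above — packing many short VAD segments into batches — can be sketched as follows (illustrative only; the greedy policy and duration budget are assumptions, not the library's actual scheduler):

```python
def batch_segments(durations: list, max_batch_sec: float) -> list:
    """Greedily group segment indices so each batch's total duration stays
    within max_batch_sec; a single over-long segment gets its own batch."""
    batches, current, total = [], [], 0.0
    for i, d in enumerate(durations):
        if current and total + d > max_batch_sec:
            batches.append(current)
            current, total = [], 0.0
        current.append(i)
        total += d
    if current:
        batches.append(current)
    return batches


print(batch_segments([2.0, 3.0, 7.0, 1.0], max_batch_sec=8.0))
# [[0, 1], [2, 3]]
```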
| File | Use Case |
|---|---|
| `requirements.txt` | Full reference (all backends documented) |
| `requirements-whisper.txt` | Whisper-only (lighter install) |
| `requirements-parakeet.txt` | Parakeet-only (NeMo) |
Install the audio packages required for resampling and loading audio files:

```bash
# Debian/Ubuntu
apt-get install -y libsndfile1 ffmpeg

# macOS (Homebrew)
brew install ffmpeg

# Conda
conda install conda-forge::ffmpeg
```

Then install the Python requirements:

```bash
# For Whisper backends
pip install -r requirements-whisper.txt
pip install -e .

# For Parakeet backend (a fresh env is recommended)
pip install -r requirements-parakeet.txt
pip install -e .
```

CTranslate2 backend:

```python
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='CTranslate2')

files = ['audio.wav']
out = model.transcribe_with_vad(files, lang_codes=['en'], tasks=['transcribe'])
print(out[0][0]['text'])
```

Parakeet backend:

```python
import whisper_s2t

model = whisper_s2t.load_model(
    model_identifier="nvidia/parakeet-tdt-0.6b-v2",
    backend='Parakeet'
)

files = ['audio.wav']
out = model.transcribe_with_vad(files)
print(out[0][0]['text'])
```

TensorRT-LLM backend:

```python
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend='TensorRT-LLM')

files = ['audio.wav']
out = model.transcribe_with_vad(files, lang_codes=['en'], tasks=['transcribe'])
print(out[0][0]['text'])
```

Check docs.md for more details.
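In all of the usage examples, the indexing `out[0][0]['text']` implies that `transcribe_with_vad` returns one list per input file, each holding utterance dicts with a `'text'` key. A sketch of walking that structure with mock data (the shape is inferred from the examples; real outputs carry more fields):

```python
# Mock output shaped like the examples above: one inner list per input
# file, each containing utterance dicts with a 'text' key.
out = [
    [{"text": "Hello there."}, {"text": "How are you?"}],  # audio1.wav
    [{"text": "Second file."}],                            # audio2.wav
]

for file_idx, utterances in enumerate(out):
    transcript = " ".join(u["text"] for u in utterances)
    print(f"file {file_idx}: {transcript}")
# file 0: Hello there. How are you?
# file 1: Second file.
```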
- OpenAI Whisper Team: Thanks for open-sourcing the whisper model.
- NVIDIA NeMo Team: Thanks for the Parakeet TDT models and VAD model.
- HuggingFace Team: Thanks for FlashAttention2 integration.
- CTranslate2 Team: Thanks for the faster inference engine.
- NVIDIA TensorRT-LLM Team: Thanks for LLM inference optimizations.
This project is licensed under MIT License - see the LICENSE file for details.