# Speech-to-Text
Local speech-to-text transcription using Whisper ONNX models via transformers.js. No API keys required - everything runs on-device.
## Quick Start
### Node.js
```ts
import { Gerbil } from "@tryhamster/gerbil";
import { readFileSync } from "fs";

const g = new Gerbil();

// Transcribe a WAV file
const audioData = new Uint8Array(readFileSync("audio.wav"));
const result = await g.transcribe(audioData);
console.log(result.text);
```

### CLI
```bash
# Transcribe audio file
gerbil transcribe audio.wav

# With timestamps
gerbil transcribe audio.wav --timestamps

# List available models
gerbil transcribe --list-models

# Use a different model
gerbil transcribe audio.wav --model whisper-base.en
```

## Available Models
| Model | Size | Languages | Description |
|---|---|---|---|
| whisper-tiny.en | 39M | English | Fastest, good for simple audio |
| whisper-tiny | 39M | Multi | Multilingual tiny model |
| whisper-base.en | 74M | English | Good balance of speed/accuracy |
| whisper-base | 74M | Multi | Multilingual base model |
| whisper-small.en | 244M | English | High quality transcription |
| whisper-small | 244M | Multi | High quality multilingual |
| whisper-large-v3-turbo | 809M | 80+ langs | Best quality, 5.4x faster than v3 |
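The table is essentially a size/quality trade-off. As an illustration, here is a hypothetical helper (not part of the library) that encodes the English-only rows above and picks the largest model fitting a download budget:

```typescript
// Hypothetical helper (not part of the library): pick the largest
// English-only model whose download size fits a budget, in MB.
const englishModels: Array<[string, number]> = [
  ["whisper-tiny.en", 39],
  ["whisper-base.en", 74],
  ["whisper-small.en", 244],
];

function pickEnglishModel(budgetMB: number): string {
  const fits = englishModels.filter(([, sizeMB]) => sizeMB <= budgetMB);
  // Fall back to the smallest model if nothing fits the budget
  return fits.length > 0 ? fits[fits.length - 1][0] : englishModels[0][0];
}

console.log(pickEnglishModel(100)); // "whisper-base.en"
```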
Use `whisper-large-v3-turbo` for the best quality with excellent speed. For quick demos, `whisper-tiny.en` offers the fastest loading.

## API Reference
### Gerbil Class Methods
```ts
// Load STT model explicitly (optional - auto-loads on first transcribe)
await g.loadSTT("whisper-tiny.en", {
  onProgress: (p) => console.log(p.status),
});

// Transcribe audio
const result = await g.transcribe(audioData, {
  language: "en",   // Language hint (multilingual models only)
  timestamps: true, // Return segment timestamps
  onProgress: (p) => console.log(p.status),
});

// Result structure
console.log(result.text);      // Full transcription
console.log(result.duration);  // Audio duration in seconds
console.log(result.totalTime); // Processing time in ms
console.log(result.segments);  // Timestamped segments (if requested)

// Check if loaded
console.log(g.isSTTLoaded());

// List available models
const models = await g.listSTTModels();
```

### Audio Input Formats
```ts
// WAV file (Uint8Array) - automatically decoded
const wavData = new Uint8Array(fs.readFileSync("audio.wav"));
await g.transcribe(wavData);

// Raw audio (Float32Array at 16kHz mono)
const raw16k = new Float32Array([...]); // 16kHz mono samples
await g.transcribe(raw16k);

// WAV at any sample rate - automatically resampled to 16kHz
const wav44k = new Uint8Array(fs.readFileSync("audio-44khz.wav"));
await g.transcribe(wav44k); // Resampled internally
```

### Transcription with Timestamps
```ts
const result = await g.transcribe(audioData, { timestamps: true });

for (const segment of result.segments || []) {
  console.log(`[${segment.start.toFixed(1)}s - ${segment.end.toFixed(1)}s] ${segment.text}`);
}
// [0.0s - 5.2s] Hello world, this is a test
// [5.2s - 8.0s] of the transcription system
```

### Language Detection (Multilingual Models)
```ts
// Use a multilingual model
await g.loadSTT("whisper-base");

// Let Whisper detect the language
const result = await g.transcribe(audioData);
console.log(result.language); // "en", "es", "fr", etc.

// Or provide a language hint
const spanishResult = await g.transcribe(audioData, { language: "es" });
```

### Direct WhisperSTT Class
For lower-level control:
```ts
import { WhisperSTT, decodeWav, resampleAudio } from "@tryhamster/gerbil/core/stt";

const stt = new WhisperSTT("whisper-tiny.en");

await stt.load({
  onProgress: (p) => console.log(p.status),
});

const result = await stt.transcribe(audioData, {
  timestamps: true,
});

console.log(result.text);

stt.dispose();
```

### Audio Utilities
```ts
import { decodeWav, resampleAudio } from "@tryhamster/gerbil/core/stt";

// Decode WAV to Float32Array
const { audio, sampleRate } = decodeWav(wavUint8Array);

// Resample to 16kHz (Whisper requirement)
const audio16k = resampleAudio(audio, sampleRate, 16000);
```

## Streaming Transcription (Core)
Transcribe audio in real time as it's recorded, instead of waiting until recording stops. Perfect for live captioning, call transcription, and voice note-taking.
```ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Create a streaming transcription session
const session = await g.createStreamingTranscription({
  chunkDuration: 1500, // Transcribe every 1.5 seconds
  onChunk: (text, idx) => {
    console.log(`Chunk ${idx}: ${text}`);
  },
  onTranscript: (fullText) => {
    console.log("Full transcript:", fullText);
  },
});

// Start the session
session.start();

// Feed audio as it comes in (Float32Array at 16kHz)
session.feedAudio(audioChunk);

// ... continue feeding audio ...

// Stop and get the final transcript
const finalText = await session.stop();
console.log("Final:", finalText);
```

### StreamingMicrophone (Node.js)
In Node.js, use `StreamingMicrophone` to capture audio in real time:
```ts
import { Gerbil, StreamingMicrophone } from "@tryhamster/gerbil";

const g = new Gerbil();

// Create a streaming session
const session = await g.createStreamingTranscription({
  chunkDuration: 1500,
  onTranscript: (text) => process.stdout.write(`\r> ${text}`),
});

// Create a streaming microphone (requires SoX)
const mic = new StreamingMicrophone({
  sampleRate: 16000,
  onAudio: (chunk) => session.feedAudio(chunk),
});

session.start();
mic.start();

// Record for 10 seconds
await new Promise((r) => setTimeout(r, 10000));

mic.stop();
const transcript = await session.stop();
console.log("\nFinal:", transcript);
```

### Streaming Options
| Option | Type | Default | Description |
|---|---|---|---|
| chunkDuration | number | 1500 | Milliseconds between transcriptions |
| onChunk | (text, idx) => void | - | Callback for each chunk result |
| onTranscript | (fullText) => void | - | Callback with full accumulated transcript |
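For reference, `feedAudio` takes Float32Array audio at 16 kHz, so a `chunkDuration` of 1500 ms corresponds to 24,000 samples. A small sketch (hypothetical helper, not part of the library) for slicing an already-recorded buffer into chunks of that size before feeding them:

```typescript
// Hypothetical helper (not part of the library): slice a recorded
// 16 kHz buffer into fixed-duration chunks for session.feedAudio().
function sliceIntoChunks(
  audio: Float32Array,
  sampleRate = 16000,
  chunkDurationMs = 1500,
): Float32Array[] {
  const samplesPerChunk = Math.round((sampleRate * chunkDurationMs) / 1000);
  const chunks: Float32Array[] = [];
  for (let i = 0; i < audio.length; i += samplesPerChunk) {
    chunks.push(audio.subarray(i, i + samplesPerChunk));
  }
  return chunks;
}

// 3 seconds at 16 kHz -> two 1.5 s chunks of 24,000 samples each
const chunks = sliceIntoChunks(new Float32Array(48000));
console.log(chunks.length, chunks[0].length); // 2 24000
```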
Lower `chunkDuration` = more responsive but more processing overhead. Each chunk is transcribed independently. `StreamingMicrophone` requires SoX to be installed.

## React Hook (Browser)
Use `useVoiceInput` for browser-based voice recording and transcription:
```tsx
import { useVoiceInput } from "@tryhamster/gerbil/browser";

function VoiceInput() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    isTranscribing,
    transcript,
    error,
  } = useVoiceInput({
    model: "whisper-tiny.en",
    onTranscript: (text) => console.log("User said:", text),
  });

  return (
    <div>
      <button onClick={isRecording ? stopRecording : startRecording}>
        {isRecording ? "🔴 Stop" : "🎤 Record"}
      </button>
      {isTranscribing && <span>Transcribing...</span>}
      {transcript && <p>{transcript}</p>}
    </div>
  );
}
```

### Hook Options
```ts
const {
  startRecording,   // () => Promise<void> - start recording
  stopRecording,    // () => Promise<string> - stop and transcribe
  cancelRecording,  // () => void - stop without transcribing
  transcribe,       // (audio: Float32Array) => Promise<string>
  isRecording,      // boolean - currently recording
  isTranscribing,   // boolean - transcribing audio
  isLoading,        // boolean - model loading
  isReady,          // boolean - model ready
  transcript,       // string - latest result
  loadingProgress,  // STTProgress - loading status
  error,            // string | null
  load,             // () => void - manually load model
} = useVoiceInput({
  model: "whisper-tiny.en",   // STT model ID
  autoLoad: false,            // loads on first record
  onReady: () => {},          // called when model ready
  onTranscript: (text) => {}, // called on each transcription
  onError: (error) => {},     // called on errors
  onProgress: (p) => {},      // loading progress
});
```

### Streaming Transcription
Transcribe audio in real time as the user speaks, instead of waiting until they stop recording. Perfect for live captioning, call transcription, and voice note-taking.
```tsx
import { useVoiceInput } from "@tryhamster/gerbil/browser";

function LiveTranscription() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    transcript,     // Full text so far
    streamingChunk, // Latest chunk
    chunkCount,
    isTranscribing,
  } = useVoiceInput({
    streaming: true,
    chunkDuration: 1500, // Transcribe every 1.5 seconds
    onChunk: (text, idx) => {
      console.log(`Chunk ${idx}: ${text}`);
    },
  });

  return (
    <div>
      <button onClick={isRecording ? stopRecording : startRecording}>
        {isRecording ? "Stop" : "Start Live Transcription"}
      </button>

      {isTranscribing && <span>Processing...</span>}
      {streamingChunk && <p style={{ opacity: 0.6 }}>"{streamingChunk}"</p>}

      <div>
        <strong>Full Transcript ({chunkCount} chunks):</strong>
        <p>{transcript}</p>
      </div>
    </div>
  );
}
```

### Streaming Options
| Option | Type | Default | Description |
|---|---|---|---|
| streaming | boolean | false | Enable real-time transcription |
| chunkDuration | number | 1500 | Milliseconds between transcriptions |
| onChunk | (text, idx) => void | - | Callback for each chunk result |
Lower `chunkDuration` = more responsive but more processing overhead. Each chunk is transcribed independently (no context from previous chunks).

### useVoiceChat Hook (Full Voice Conversation)
For complete voice-to-voice conversations (STT → LLM → TTS):
```tsx
import { useVoiceChat } from "@tryhamster/gerbil/browser";

function VoiceAssistant() {
  const {
    messages,
    startListening,
    stopListening,
    isListening,
    isSpeaking,
    stage, // "idle" | "listening" | "transcribing" | "thinking" | "speaking"
    isReady,
  } = useVoiceChat({
    llmModel: "qwen3-0.6b",
    sttModel: "whisper-tiny.en",
    voice: "af_bella",
    system: "You are a helpful voice assistant.",
  });

  return (
    <div>
      {messages.map((m) => <p key={m.id}>{m.role}: {m.content}</p>)}
      <button
        onMouseDown={startListening}
        onMouseUp={stopListening}
      >
        {stage === "idle" ? "🎤 Hold to Speak" : stage}
      </button>
    </div>
  );
}
```

## CLI Commands
### Transcribe
```bash
# Basic transcription
gerbil transcribe recording.wav

# With timestamps
gerbil transcribe recording.wav --timestamps

# Save to file
gerbil transcribe recording.wav --output transcript.txt

# Use a specific model
gerbil transcribe recording.wav --model whisper-base.en

# Multilingual with language hint
gerbil transcribe recording.wav --model whisper-base --language es

# List models
gerbil transcribe --list-models
```

### Voice Chat (STT → LLM → TTS)
Complete voice conversation loop from audio file:
```bash
# Voice chat with an audio file
gerbil voice question.wav

# Customize models and voice
gerbil voice question.wav --model qwen3-1.7b --voice bf_emma

# With a custom system prompt
gerbil voice question.wav --system "You are a pirate. Speak like one!"

# Enable thinking mode
gerbil voice question.wav --thinking
```

This command transcribes your audio (Whisper), generates a response (LLM), and speaks the response (Kokoro TTS).
## Performance
Transcription speed on Apple M1:
| Model | Audio Length | Time | Speed |
|---|---|---|---|
| whisper-tiny.en | 11s | ~250ms | 44x realtime |
| whisper-base.en | 11s | ~500ms | 22x realtime |
| whisper-small.en | 11s | ~1.5s | 7x realtime |
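The speed column is simply audio duration divided by processing time, both of which every transcription result exposes (`result.duration` in seconds, `result.totalTime` in milliseconds), so you can compute the same figure for your own hardware:

```typescript
// Realtime factor: seconds of audio processed per second of wall time.
// durationSec matches result.duration; totalTimeMs matches result.totalTime.
function realtimeFactor(durationSec: number, totalTimeMs: number): number {
  return durationSec / (totalTimeMs / 1000);
}

// e.g. 11 s of audio transcribed in ~250 ms
console.log(realtimeFactor(11, 250)); // 44
```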
## Troubleshooting
"Unsupported bit depth"
Only 16-bit WAV files are supported. Convert with:
```bash
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -sample_fmt s16 output.wav
```

### Slow first transcription
The first call downloads the model (~40-250MB). Subsequent calls use the cached model.
### Memory usage
Whisper models require approximately:

- tiny: ~150MB RAM
- base: ~300MB RAM
- small: ~600MB RAM
- large-v3-turbo: ~1.5GB RAM
## Microphone Recording (Node.js)
Record audio directly from your system microphone in Node.js using SoX.
### Requirements
Requires SoX (Sound eXchange) to be installed:
```bash
# macOS
brew install sox

# Ubuntu/Debian
sudo apt install sox

# Windows
# Download from https://sox.sourceforge.net/
```

### Listen API
One-liner to record and transcribe:
```ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Check if a microphone is available
const available = await g.isMicrophoneAvailable();

// Record and transcribe in one call
const result = await g.listen(5000, { // 5 seconds
  onProgress: (status) => console.log(status),
});
console.log(result.text);
```

### Low-level Microphone API
```ts
import { Microphone, isSoxAvailable } from "@tryhamster/gerbil";

if (!isSoxAvailable()) {
  console.log("Install SoX first!");
  process.exit(1);
}

const mic = new Microphone({ sampleRate: 16000 });
await mic.start();

// Record for 5 seconds
await new Promise((r) => setTimeout(r, 5000));

const { audio, sampleRate, duration } = await mic.stop();
// audio = Float32Array at 16kHz, ready for Whisper
```

## Next Steps
- Text-to-Speech → generate natural speech with 28 voices
- React Hooks → `useSpeech`, `useVoiceInput`, `useVoiceChat`
- AI SDK Integration → `TranscriptionModelV2` for the Vercel AI SDK
- CLI → `gerbil transcribe`, `gerbil voice` commands