Speech-to-Text

Name: Gerbil
Author: Gerbil

Local speech-to-text transcription using Whisper ONNX models via transformers.js. No API keys are required; everything runs on-device.

Quick Start

Node.js

stt-basic.ts

import { Gerbil } from "@tryhamster/gerbil";
import { readFileSync } from "fs";

const g = new Gerbil();

// Transcribe a WAV file
const audioData = new Uint8Array(readFileSync("audio.wav"));
const result = await g.transcribe(audioData);
console.log(result.text);

CLI

terminal

# Transcribe audio file
gerbil transcribe audio.wav

# With timestamps
gerbil transcribe audio.wav --timestamps

# List available models
gerbil transcribe --list-models

# Use a different model
gerbil transcribe audio.wav --model whisper-base.en

Available Models

Model	Size	Languages	Description
`whisper-tiny.en`	39M	English	Fastest, good for simple audio
`whisper-tiny`	39M	Multi	Multilingual tiny model
`whisper-base.en`	74M	English	Good balance of speed/accuracy
`whisper-base`	74M	Multi	Multilingual base model
`whisper-small.en`	244M	English	High quality transcription
`whisper-small`	244M	Multi	High quality multilingual
`whisper-large-v3-turbo`	809M	80+ langs	Best quality, 5.4x faster than v3

Recommended: Use whisper-large-v3-turbo for best quality with excellent speed. For quick demos, whisper-tiny.en offers the fastest loading.

API Reference

Gerbil Class Methods

stt-api.ts

// Load STT model explicitly (optional - auto-loads on first transcribe)
await g.loadSTT("whisper-tiny.en", {
  onProgress: (p) => console.log(p.status),
});

// Transcribe audio
const result = await g.transcribe(audioData, {
  language: "en",      // Language hint (multilingual models only)
  timestamps: true,    // Return segment timestamps
  onProgress: (p) => console.log(p.status),
});

// Result structure
console.log(result.text);       // Full transcription
console.log(result.duration);   // Audio duration in seconds
console.log(result.totalTime);  // Processing time in ms
console.log(result.segments);   // Timestamped segments (if requested)

// Check if loaded
console.log(g.isSTTLoaded());

// List available models
const models = await g.listSTTModels();

Audio Input Formats

audio-formats.ts

import { readFileSync } from "fs";

// Assumes `g` is a Gerbil instance, as in Quick Start.

// WAV file (Uint8Array) - automatically decoded
const wavData = new Uint8Array(readFileSync("audio.wav"));
await g.transcribe(wavData);

// Raw audio (Float32Array at 16kHz mono)
const raw16k = new Float32Array(16000); // placeholder: 1 second of silence; substitute your own 16kHz mono samples
await g.transcribe(raw16k);

// WAV at any sample rate - automatically resampled to 16kHz
const wav44k = new Uint8Array(readFileSync("audio-44khz.wav"));
await g.transcribe(wav44k); // Resampled internally
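
Audio captured as raw 16-bit PCM is not one of the formats listed above, so it would need converting to a Float32Array first. The following is a minimal sketch, assuming signed 16-bit mono samples already at 16kHz and the conventional [-1, 1] float range; `pcm16ToFloat32` is a hypothetical helper, not part of Gerbil.

```typescript
// Hypothetical helper (not part of Gerbil): convert signed 16-bit PCM
// samples to floats in the range [-1, 1).
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    // 32768 = 2^15, the magnitude of the most negative int16 value
    out[i] = pcm[i] / 32768;
  }
  return out;
}

const samples = pcm16ToFloat32(new Int16Array([0, 16384, -32768, 32767]));
console.log(samples[1]); // 0.5
console.log(samples[2]); // -1
```

The resulting array can then be passed to transcribe in the same way as raw16k above.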

Transcription with Timestamps

timestamps.ts

const result = await g.transcribe(audioData, { timestamps: true });

for (const segment of result.segments || []) {
  console.log(`[${segment.start.toFixed(1)}s - ${segment.end.toFixed(1)}s] ${segment.text}`);
}
// [0.0s - 5.2s] Hello world, this is a test
// [5.2s - 8.0s] of the transcription system
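
Timestamped segments map naturally onto subtitle formats. The sketch below turns segments into SubRip (.srt) text, assuming each segment has `start` and `end` in seconds plus `text`, as in the example above; `toSrtTime` and `segmentsToSrt` are hypothetical helpers, not part of Gerbil.

```typescript
// Segment shape assumed from the timestamps example: seconds plus text.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Hypothetical helper: numbered cues separated by blank lines.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}

const srt = segmentsToSrt([
  { start: 0, end: 5.2, text: "Hello world, this is a test" },
  { start: 5.2, end: 8, text: "of the transcription system" },
]);
console.log(srt);
```

In practice the input would be `result.segments || []` from a transcribe call with timestamps enabled.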

Language Detection (Multilingual Models)

language-detection.ts

// Use multilingual model
await g.loadSTT("whisper-base");

// Let Whisper detect language
const result = await g.transcribe(audioData);
console.log(result.language); // "en", "es", "fr", etc.

// Or provide a language hint
const spanishResult = await g.transcribe(audioData, { language: "es" });

Direct WhisperSTT Class

For lower-level control:

whisper-stt.ts

import { WhisperSTT } from "@tryhamster/gerbil/core/stt";

const stt = new WhisperSTT("whisper-tiny.en");

await stt.load({
  onProgress: (p) => console.log(p.status),
});

const result = await stt.transcribe(audioData, {
  timestamps: true,
});

console.log(result.text);

stt.dispose();

Audio Utilities

audio-utils.ts

import { decodeWav, resampleAudio } from "@tryhamster/gerbil/core/stt";

// Decode WAV to Float32Array
const { audio, sampleRate } = decodeWav(wavUint8Array);

// Resample to 16kHz (Whisper requirement)
const audio16k = resampleAudio(audio, sampleRate, 16000);
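
To illustrate what resampling to 16kHz involves, here is a sketch using plain linear interpolation. This is not Gerbil's `resampleAudio` implementation, which this page does not describe; `resampleLinear` is a hypothetical name, and the sketch applies no low-pass filter, so downsampling with it can introduce aliasing.

```typescript
// Hypothetical sketch of sample-rate conversion by linear interpolation.
// NOT Gerbil's resampleAudio; shown only to illustrate the idea.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1); // clamp at the last sample
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// One second of 48kHz audio becomes 16000 samples.
const oneSecond48k = new Float32Array(48000);
console.log(resampleLinear(oneSecond48k, 48000, 16000).length); // 16000
```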

Streaming Transcription (Core)

Transcribe audio in real-time as it's recorded, instead of waiting until recording stops. Perfect for live captioning, call transcription, and voice note taking.

streaming-core.ts

import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Create a streaming transcription session
const session = await g.createStreamingTranscription({
  chunkDuration: 1500,  // Transcribe every 1.5 seconds
  onChunk: (text, idx) => {
    console.log(`Chunk ${idx}: ${text}`);
  },
  onTranscript: (fullText) => {
    console.log("Full transcript:", fullText);
  },
});

// Start the session
session.start();

// Feed audio as it comes in (Float32Array at 16kHz)
session.feedAudio(audioChunk);

// ... continue feeding audio ...

// Stop and get final transcript
const finalText = await session.stop();
console.log("Final:", finalText);

StreamingMicrophone (Node.js)

For Node.js, use StreamingMicrophone to capture audio in real-time:

streaming-mic.ts

import { Gerbil, StreamingMicrophone } from "@tryhamster/gerbil";

const g = new Gerbil();

// Create streaming session
const session = await g.createStreamingTranscription({
  chunkDuration: 1500,
  onTranscript: (text) => process.stdout.write(`\r> ${text}`),
});

// Create streaming microphone (requires SoX)
const mic = new StreamingMicrophone({
  sampleRate: 16000,
  onAudio: (chunk) => session.feedAudio(chunk),
});

session.start();
mic.start();

// Record for 10 seconds
await new Promise(r => setTimeout(r, 10000));

mic.stop();
const transcript = await session.stop();
console.log("\nFinal:", transcript);

Streaming Options

Option	Type	Default	Description
`chunkDuration`	number	1500	Milliseconds between transcriptions
`onChunk`	(text, idx) => void	-	Callback for each chunk result
`onTranscript`	(fullText) => void	-	Callback with full accumulated transcript

Note: A shorter chunkDuration makes output more responsive but increases processing overhead. Each chunk is transcribed independently. StreamingMicrophone requires SoX to be installed.
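
To make the chunkDuration trade-off concrete: at 16kHz, a 1500 ms chunk holds 24,000 samples. The sketch below shows that arithmetic, along with a hypothetical `splitIntoChunks` helper for feeding a pre-recorded buffer chunk by chunk. It illustrates sizing only and makes no claim about how Gerbil buffers audio internally.

```typescript
const SAMPLE_RATE = 16000;

// Number of samples covered by one chunk of the given duration.
function samplesPerChunk(
  chunkDurationMs: number,
  sampleRate: number = SAMPLE_RATE
): number {
  return Math.round((chunkDurationMs / 1000) * sampleRate);
}

// Hypothetical helper: split a buffer into chunk-sized views (no copying).
function splitIntoChunks(
  audio: Float32Array,
  chunkDurationMs: number
): Float32Array[] {
  const size = samplesPerChunk(chunkDurationMs);
  if (size <= 0) throw new RangeError("chunkDurationMs must be positive");
  const chunks: Float32Array[] = [];
  for (let offset = 0; offset < audio.length; offset += size) {
    chunks.push(audio.subarray(offset, offset + size));
  }
  return chunks;
}

const tenSeconds = new Float32Array(10 * SAMPLE_RATE);
const chunks = splitIntoChunks(tenSeconds, 1500);
console.log(samplesPerChunk(1500)); // 24000
console.log(chunks.length);         // 7 (six full chunks plus a 1-second remainder)
```

Each view could then be passed to session.feedAudio in turn. Halving chunkDuration doubles the number of transcription calls for the same recording.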

React Hook (Browser)

Use useVoiceInput for browser-based voice recording and transcription:

VoiceInput.tsx

import { useVoiceInput } from "@tryhamster/gerbil/browser";

function VoiceInput() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    isTranscribing,
    transcript,
    error,
  } = useVoiceInput({
    model: "whisper-tiny.en",
    onTranscript: (text) => console.log("User said:", text),
  });

  return (
    <div>
      <button onClick={isRecording ? stopRecording : startRecording}>
        {isRecording ? "🔴 Stop" : "🎤 Record"}
      </button>
      {isTranscribing && <span>Transcribing...</span>}
      {transcript && <p>{transcript}</p>}
    </div>
  );
}

Hook Options

useVoiceInput.ts

const {
  startRecording,   // () => Promise<void> - start recording
  stopRecording,    // () => Promise<string> - stop and transcribe
  cancelRecording,  // () => void - stop without transcribing
  transcribe,       // (audio: Float32Array) => Promise<string>
  isRecording,      // boolean - currently recording
  isTranscribing,   // boolean - transcribing audio
  isLoading,        // boolean - model loading
  isReady,          // boolean - model ready
  transcript,       // string - latest result
  loadingProgress,  // STTProgress - loading status
  error,            // string | null
  load,             // () => void - manually load model
} = useVoiceInput({
  model: "whisper-tiny.en",  // STT model ID
  autoLoad: false,           // loads on first record
  onReady: () => {},         // called when model ready
  onTranscript: (text) => {},// called on each transcription
  onError: (error) => {},    // called on errors
  onProgress: (p) => {},     // loading progress
});

Streaming Transcription

Transcribe audio in real-time as the user speaks, instead of waiting until they stop recording. Perfect for live captioning, call transcription, and voice note taking.

LiveTranscription.tsx

import { useVoiceInput } from "@tryhamster/gerbil/browser";

function LiveTranscription() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    transcript,       // Full text so far
    streamingChunk,   // Latest chunk
    chunkCount,
    isTranscribing,
  } = useVoiceInput({
    streaming: true,
    chunkDuration: 1500,  // Transcribe every 1.5 seconds
    onChunk: (text, idx) => {
      console.log(`Chunk ${idx}: ${text}`);
    },
  });

  return (
    <div>
      <button onClick={isRecording ? stopRecording : startRecording}>
        {isRecording ? "Stop" : "Start Live Transcription"}
      </button>

      {isTranscribing && <span>Processing...</span>}
      {streamingChunk && <p style={{ opacity: 0.6 }}>"{streamingChunk}"</p>}

      <div>
        <strong>Full Transcript ({chunkCount} chunks):</strong>
        <p>{transcript}</p>
      </div>
    </div>
  );
}

Streaming Options

Option	Type	Default	Description
`streaming`	boolean	false	Enable real-time transcription
`chunkDuration`	number	1500	Milliseconds between transcriptions
`onChunk`	(text, idx) => void	-	Callback for each chunk result

Note: A shorter chunkDuration makes output more responsive but increases processing overhead. Each chunk is transcribed independently, with no context from previous chunks.

useVoiceChat Hook (Full Voice Conversation)

For complete voice-to-voice conversations (STT → LLM → TTS):

VoiceAssistant.tsx

import { useVoiceChat } from "@tryhamster/gerbil/browser";

function VoiceAssistant() {
  const {
    messages,
    startListening,
    stopListening,
    isListening,
    isSpeaking,
    stage,  // "idle" | "listening" | "transcribing" | "thinking" | "speaking"
    isReady,
  } = useVoiceChat({
    llmModel: "qwen3-0.6b",
    sttModel: "whisper-tiny.en",
    voice: "af_bella",
    system: "You are a helpful voice assistant.",
  });

  return (
    <div>
      {messages.map(m => <p key={m.id}>{m.role}: {m.content}</p>)}
      <button
        onMouseDown={startListening}
        onMouseUp={stopListening}
      >
        {stage === "idle" ? "🎤 Hold to Speak" : stage}
      </button>
    </div>
  );
}

CLI Commands

Transcribe

terminal

# Basic transcription
gerbil transcribe recording.wav

# With timestamps
gerbil transcribe recording.wav --timestamps

# Save to file
gerbil transcribe recording.wav --output transcript.txt

# Use specific model
gerbil transcribe recording.wav --model whisper-base.en

# Multilingual with language hint
gerbil transcribe recording.wav --model whisper-base --language es

# List models
gerbil transcribe --list-models

Voice Chat (STT → LLM → TTS)

Run a complete voice conversation loop from an audio file:

terminal

# Voice chat with audio file
gerbil voice question.wav

# Customize models and voice
gerbil voice question.wav --model qwen3-1.7b --voice bf_emma

# With custom system prompt
gerbil voice question.wav --system "You are a pirate. Speak like one!"

# Enable thinking mode
gerbil voice question.wav --thinking

This command transcribes your audio (Whisper), generates a response (LLM), and speaks the response (Kokoro TTS).

Performance

Transcription speed on Apple M1:

Model	Audio Length	Time	Speed
`whisper-tiny.en`	11s	~250ms	44x realtime
`whisper-base.en`	11s	~500ms	22x realtime
`whisper-small.en`	11s	~1.5s	7x realtime
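
The Speed column is the audio length divided by the processing time. A small sketch of that calculation, which can be applied to your own hardware using `result.duration` (seconds) and `result.totalTime` (ms) from the API reference; `realtimeFactor` is a hypothetical helper name.

```typescript
// Realtime factor = audio duration / processing time.
// audioSeconds is in seconds, processingMs in milliseconds.
function realtimeFactor(audioSeconds: number, processingMs: number): number {
  return audioSeconds / (processingMs / 1000);
}

console.log(realtimeFactor(11, 250));             // 44
console.log(realtimeFactor(11, 500));             // 22
console.log(realtimeFactor(11, 1500).toFixed(1)); // "7.3"
```

For a real transcription this would be `realtimeFactor(result.duration, result.totalTime)`.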

Troubleshooting

"Unsupported bit depth"

Only 16-bit WAV files are supported. Convert with:

terminal

ffmpeg -i audio.mp3 -ar 16000 -ac 1 -sample_fmt s16 output.wav
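
To catch this error before transcribing, the bit depth can be read from the WAV header. The sketch below walks the RIFF chunks to find the "fmt " chunk and returns its bits-per-sample field; `wavBitDepth` is a hypothetical pre-flight helper, not part of Gerbil.

```typescript
// Hypothetical pre-flight check (not part of Gerbil): read bits-per-sample
// from a WAV file's "fmt " chunk. Returns null if the data is not RIFF/WAVE
// or the fmt chunk is missing or truncated.
function wavBitDepth(bytes: Uint8Array): number | null {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const tag = (offset: number): string =>
    String.fromCharCode(
      bytes[offset],
      bytes[offset + 1],
      bytes[offset + 2],
      bytes[offset + 3]
    );

  if (bytes.length < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") return null;

  let offset = 12; // first chunk follows the 12-byte RIFF/WAVE header
  while (offset + 8 <= bytes.length) {
    const id = tag(offset);
    const size = view.getUint32(offset + 4, true); // little-endian chunk size
    if (id === "fmt ") {
      if (size < 16 || offset + 8 + 16 > bytes.length) return null;
      return view.getUint16(offset + 8 + 14, true); // bitsPerSample field
    }
    offset += 8 + size + (size % 2); // chunks are padded to an even length
  }
  return null;
}
```

A caller could check `wavBitDepth(data) === 16` and fall back to the ffmpeg conversion above when the file has a different bit depth.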

Slow first transcription

The first call downloads the model (~40-250MB). Subsequent calls use the cached model.

Memory usage

Whisper models require:

- tiny: ~150MB RAM
- base: ~300MB RAM
- small: ~600MB RAM
- large-v3-turbo: ~1.5GB RAM
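
These figures can drive a simple model choice based on a memory budget. The sketch below picks the largest model whose approximate requirement fits; `pickModel` is a hypothetical helper, the budget values are the approximate figures listed above, and it assumes the multilingual model IDs from the Available Models table (the .en variants are assumed to need about the same memory).

```typescript
interface ModelBudget {
  model: string;
  ramMb: number; // approximate RAM requirement from the list above
}

// Ordered from largest to smallest so the first match is the best fit.
const MODEL_BUDGETS: ModelBudget[] = [
  { model: "whisper-large-v3-turbo", ramMb: 1500 },
  { model: "whisper-small", ramMb: 600 },
  { model: "whisper-base", ramMb: 300 },
  { model: "whisper-tiny", ramMb: 150 },
];

// Hypothetical helper: largest model that fits, or null if none does.
function pickModel(availableMb: number): string | null {
  const fit = MODEL_BUDGETS.find((m) => m.ramMb <= availableMb);
  return fit ? fit.model : null;
}

console.log(pickModel(2048)); // whisper-large-v3-turbo
console.log(pickModel(512));  // whisper-base
console.log(pickModel(100));  // null
```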

Microphone Recording (Node.js)

Record audio directly from your system microphone in Node.js using SoX.

Requirements

SoX (Sound eXchange) must be installed:

terminal

# macOS
brew install sox

# Ubuntu/Debian
sudo apt install sox

# Windows
# Download from https://sox.sourceforge.net/

Listen API

One-liner to record and transcribe:

listen.ts

import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Check if microphone is available
const available = await g.isMicrophoneAvailable();

// Record and transcribe in one call
const result = await g.listen(5000, {  // 5 seconds
  onProgress: (status) => console.log(status),
});
console.log(result.text);

Low-level Microphone API

microphone.ts

import { Microphone, isSoxAvailable } from "@tryhamster/gerbil";

if (!isSoxAvailable()) {
  console.log("Install SoX first!");
  process.exit(1);
}

const mic = new Microphone({ sampleRate: 16000 });
await mic.start();

// Record for 5 seconds
await new Promise(r => setTimeout(r, 5000));

const { audio, sampleRate, duration } = await mic.stop();
// audio = Float32Array at 16kHz, ready for Whisper
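
Before passing a recording to Whisper, it can help to confirm that the microphone captured any signal at all. The sketch below computes the duration and the RMS level of a Float32Array; `durationSeconds` and `rms` are hypothetical helpers, not part of Gerbil, and any silence threshold you compare against (for example 0.01) is an assumption to tune for your setup.

```typescript
// Hypothetical helpers (not part of Gerbil) for sanity-checking a recording.

// Duration implied by the sample count and sample rate.
function durationSeconds(audio: Float32Array, sampleRate: number): number {
  return audio.length / sampleRate;
}

// Root-mean-square level; 0 means digital silence.
function rms(audio: Float32Array): number {
  if (audio.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < audio.length; i++) {
    sumSquares += audio[i] * audio[i];
  }
  return Math.sqrt(sumSquares / audio.length);
}

const silence = new Float32Array(16000 * 5);
console.log(durationSeconds(silence, 16000)); // 5
console.log(rms(silence));                    // 0
```

With the values returned by mic.stop(), this would be `rms(audio)` and `durationSeconds(audio, sampleRate)`.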

Next Steps

Text-to-Speech: generate natural speech with 28 voices
React Hooks: useSpeech, useVoiceInput, useVoiceChat
AI SDK Integration: TranscriptionModelV2 for Vercel AI SDK
CLI: gerbil transcribe, gerbil voice commands