Text-to-Speech
On-device text-to-speech using Kokoro-82M or Supertonic-66M. Generate natural-sounding speech locally, with no API keys and no internet connection once the model has been downloaded.
Available Models
| Model | ID | Size | Sample Rate | Voices | Speed |
|---|---|---|---|---|---|
| Kokoro-82M | kokoro-82m | ~330MB | 24kHz | 28 | ~100x realtime |
| Supertonic-66M | supertonic-66m | ~250MB | 44.1kHz | 4 | ~167x realtime |
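Because the two models emit audio at different sample rates, any code that converts between sample counts and seconds needs to know which model produced the buffer. The following is a minimal sketch using the values from the table above; `SAMPLE_RATES`, `durationSeconds`, and `samplesFor` are illustrative names, not part of the Gerbil API.

```typescript
// Sample rates copied from the model table above.
// All names here are hypothetical helpers, not Gerbil exports.
type ModelId = "kokoro-82m" | "supertonic-66m";

const SAMPLE_RATES: Record<ModelId, number> = {
  "kokoro-82m": 24000,
  "supertonic-66m": 44100,
};

/** Duration in seconds of a mono PCM buffer produced by the given model. */
function durationSeconds(sampleCount: number, model: ModelId): number {
  return sampleCount / SAMPLE_RATES[model];
}

/** Number of samples needed to hold `seconds` of audio for the given model. */
function samplesFor(seconds: number, model: ModelId): number {
  return Math.round(seconds * SAMPLE_RATES[model]);
}

console.log(durationSeconds(48000, "kokoro-82m")); // 2
console.log(samplesFor(1, "supertonic-66m")); // 44100
```

Playing a Supertonic buffer at Kokoro's rate (or the reverse) would change both pitch and speed, so it is safer to carry the sample rate alongside the samples than to hard-code it.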
Quick Start
Node.js / CLI
```typescript
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Load TTS (optional - auto-loads on first speak)
await g.loadTTS(); // Default: kokoro-82m
await g.loadTTS({ model: "supertonic-66m" }); // Use Supertonic

// Generate speech
const result = await g.speak("Hello, I'm Gerbil!", {
  voice: "af_heart", // American female (highest quality)
  speed: 1.0
});

// result.audio = Float32Array (PCM samples)
// result.sampleRate = 24000 (or 44100 for Supertonic)
// result.duration = seconds
```
React (Browser)
```tsx
import { useSpeech } from "@tryhamster/gerbil/browser";

function SpeechDemo() {
  // setVoice comes from the hook; the select below uses it to switch voices
  const { speak, stop, isSpeaking, isLoading, listVoices, setVoice } = useSpeech();

  if (isLoading) return <div>Loading TTS model...</div>;

  return (
    <div>
      <select onChange={e => setVoice(e.target.value)}>
        {listVoices().map(v => (
          <option key={v.id} value={v.id}>{v.name} ({v.language})</option>
        ))}
      </select>

      <button onClick={() => speak("Hello world!")}>
        {isSpeaking ? "Speaking..." : "Speak"}
      </button>

      {isSpeaking && <button onClick={stop}>Stop</button>}
    </div>
  );
}
```
AI SDK
```typescript
import { experimental_generateSpeech as generateSpeech } from "ai";
import { gerbil } from "@tryhamster/gerbil/ai";

const audio = await generateSpeech({
  model: gerbil.speech(),
  text: "Hello from Gerbil!",
  voice: "bf_emma", // British female
});

// audio.audioData = Uint8Array (WAV format)
```
CLI
```bash
# Speak text with default voice
gerbil speak "Hello world"

# Specify voice and speed
gerbil speak --voice bf_emma --speed 1.2 "Cheerio!"
```
Skills
```typescript
import { speak, announce, readAloud } from "@tryhamster/gerbil/skills";

// Simple speech
await speak({ text: "Hello world", voice: "af_heart" });

// AI-crafted announcement
await announce({
  message: "Build completed successfully",
  style: "excited", // casual, formal, excited, calm, urgent
  voice: "af_bella"
});

// Read file aloud
await readAloud({
  content: "./README.md",
  voice: "bf_emma",
  summarizeIfLong: true // Summarize if > 5000 chars
});
```
Available Voices
Kokoro provides 28 voices across American and British English:
American English - Female (Recommended)
| Voice ID | Name | Quality | Description |
|---|---|---|---|
| af_heart | Heart | A | Highest quality, warm tone |
| af_bella | Bella | A- | Warm and friendly |
| af_nicole | Nicole | B- | Soft and gentle |
| af_sarah | Sarah | C+ | Clear and professional |
| af_sky | Sky | C- | Light and airy |
| af_alloy | Alloy | C | Neutral tone |
| af_aoede | Aoede | C+ | Musical quality |
| af_kore | Kore | C+ | Youthful |
| af_nova | Nova | C | Modern |
| af_river | River | D | Flowing |
| af_jessica | Jessica | D | Casual |
American English - Male (9 voices)
| Voice ID | Name | Quality | Description |
|---|---|---|---|
| am_fenrir | Fenrir | C+ | Best male quality |
| am_michael | Michael | C+ | Warm and friendly |
| am_puck | Puck | C+ | Neutral tone |
| am_adam | Adam | F+ | Deep voice |
| am_echo | Echo | D | Resonant |
| am_eric | Eric | D | Steady |
| am_liam | Liam | D | Young |
| am_onyx | Onyx | D | Rich |
| am_santa | Santa | D- | Jovial |
British English - Female (4 voices)
| Voice ID | Name | Quality | Description |
|---|---|---|---|
| bf_emma | Emma | B- | Elegant and clear |
| bf_isabella | Isabella | C | Sophisticated |
| bf_alice | Alice | D | Bright |
| bf_lily | Lily | D | Gentle |
British English - Male (4 voices)
| Voice ID | Name | Quality | Description |
|---|---|---|---|
| bm_george | George | C | Distinguished |
| bm_fable | Fable | C | Storyteller tone |
| bm_lewis | Lewis | D+ | Refined |
| bm_daniel | Daniel | D | Classic |
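The Quality column can be used to pick a voice programmatically. The sketch below assumes the column is a conventional letter grade (A best, F worst, with `+` and `-` as small adjustments); that reading is an assumption, and `gradeScore` and `bestVoice` are illustrative helpers rather than Gerbil functions. Only a subset of the voices above is included to keep the example short.

```typescript
interface RatedVoice {
  id: string;
  grade: string;
}

// A subset of the Kokoro voices and grades listed in the tables above.
const RATED_VOICES: RatedVoice[] = [
  { id: "af_heart", grade: "A" },
  { id: "af_bella", grade: "A-" },
  { id: "am_fenrir", grade: "C+" },
  { id: "am_adam", grade: "F+" },
  { id: "bf_emma", grade: "B-" },
  { id: "bf_isabella", grade: "C" },
  { id: "bm_george", grade: "C" },
  { id: "bm_daniel", grade: "D" },
];

/** Map a letter grade such as "B-" to a number; higher is better (assumed scale). */
function gradeScore(grade: string): number {
  const base: Record<string, number | undefined> = { A: 4, B: 3, C: 2, D: 1, F: 0 };
  const score = base[grade.charAt(0)];
  if (score === undefined) throw new Error(`Unknown grade: ${grade}`);
  if (grade.endsWith("+")) return score + 0.3;
  if (grade.endsWith("-")) return score - 0.3;
  return score;
}

/** Highest-graded voice whose ID starts with the prefix, or null if none match. */
function bestVoice(prefix: string): string | null {
  const candidates = RATED_VOICES.filter(v => v.id.startsWith(prefix));
  if (candidates.length === 0) return null;
  // Strict ">" keeps the earlier entry when two grades tie
  return candidates.reduce((best, v) =>
    gradeScore(v.grade) > gradeScore(best.grade) ? v : best
  ).id;
}

console.log(bestVoice("bf_")); // "bf_emma"
console.log(bestVoice("am_")); // "am_fenrir"
```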
Supertonic-66M Voices (4 voices)
Supertonic is a faster TTS model at 44.1kHz sample rate with its own voice set:
| Voice ID | Name | Gender | Description |
|---|---|---|---|
| F1 | Female 1 | Female | Clear and natural (default) |
| F2 | Female 2 | Female | Warm and expressive |
| M1 | Male 1 | Male | Deep and confident |
| M2 | Male 2 | Male | Friendly and casual |
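Since each model has its own voice set, it can help to check which model a voice ID belongs to before loading. This is a rough sketch based only on the IDs in the tables above; `modelForVoice` and `defaultVoice` are hypothetical helpers, and the Kokoro check looks at the prefix only, so it does not confirm that a specific voice exists.

```typescript
type TtsModel = "kokoro-82m" | "supertonic-66m";

// IDs and prefixes taken from the voice tables above.
const SUPERTONIC_VOICES = ["F1", "F2", "M1", "M2"];
const KOKORO_PREFIXES = ["af_", "am_", "bf_", "bm_"];

/** Guess which model a voice ID belongs to, or null if it matches neither set. */
function modelForVoice(voiceId: string): TtsModel | null {
  if (SUPERTONIC_VOICES.includes(voiceId)) return "supertonic-66m";
  if (KOKORO_PREFIXES.some(prefix => voiceId.startsWith(prefix))) return "kokoro-82m";
  return null;
}

/** Default voice per model, as listed in this document. */
function defaultVoice(model: TtsModel): string {
  return model === "supertonic-66m" ? "F1" : "af_heart";
}

console.log(modelForVoice("bf_emma")); // "kokoro-82m"
console.log(modelForVoice("M2")); // "supertonic-66m"
console.log(defaultVoice("supertonic-66m")); // "F1"
```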
af_heart is the highest-quality Kokoro voice. For Supertonic-66M, use F1 for clear, natural speech at 44.1kHz.
API Reference
Gerbil Class Methods
```typescript
class Gerbil {
  // Load TTS model (auto-called by speak if needed)
  async loadTTS(options?: LoadTTSOptions): Promise<void>;

  // Check if TTS is loaded
  isTTSLoaded(): boolean;

  // Generate speech
  async speak(text: string, options?: SpeakOptions): Promise<SpeakResult>;

  // Stream speech (yields chunks as generated)
  async *speakStream(text: string, options?: SpeakOptions): AsyncGenerator<AudioChunk>;

  // Get available voices
  listVoices(): VoiceInfo[];
  getVoice(voiceId: string): VoiceInfo | null;
}
```
SpeakOptions
```typescript
interface SpeakOptions {
  /** Voice ID (default: "af_heart") */
  voice?: string;
  /** Speed multiplier 0.5-2.0 (default: 1.0) */
  speed?: number;
  /** Progress callback during loading */
  onProgress?: (info: ProgressInfo) => void;
  /** Callback for each audio chunk (streaming) */
  onAudioChunk?: (chunk: AudioChunk) => void;
}
```
SpeakResult
```typescript
interface SpeakResult {
  /** PCM audio samples (mono, float32, -1 to 1) */
  audio: Float32Array;
  /** Sample rate (24000 for Kokoro, 44100 for Supertonic) */
  sampleRate: number;
  /** Duration in seconds */
  duration: number;
  /** Voice ID used */
  voice: string;
  /** Generation time in milliseconds */
  totalTime: number;
}
```
Playing Audio
Browser (Web Audio API)
```typescript
import { playAudio } from "@tryhamster/gerbil/browser";

const result = await gerbil.speak("Hello!");

// One-liner playback
const controller = await playAudio(result.audio, result.sampleRate);

// Stop early
controller.stop();

// Wait for completion
await controller.onEnded;
```
Node.js (Save to File)
```typescript
import { writeFileSync } from "fs";
import { execSync } from "child_process";

const result = await gerbil.speak("Hello!");

// Convert to WAV (16-bit PCM, mono)
function saveWav(filename: string, audio: Float32Array, sampleRate: number) {
  const buffer = Buffer.alloc(44 + audio.length * 2);

  // WAV header
  buffer.write("RIFF", 0);
  buffer.writeUInt32LE(36 + audio.length * 2, 4);
  buffer.write("WAVE", 8);
  buffer.write("fmt ", 12);
  buffer.writeUInt32LE(16, 16);
  buffer.writeUInt16LE(1, 20); // PCM
  buffer.writeUInt16LE(1, 22); // Mono
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(sampleRate * 2, 28); // Byte rate
  buffer.writeUInt16LE(2, 32); // Block align
  buffer.writeUInt16LE(16, 34); // Bits per sample
  buffer.write("data", 36);
  buffer.writeUInt32LE(audio.length * 2, 40);

  // Audio data: clamp to [-1, 1], then scale to signed 16-bit
  for (let i = 0; i < audio.length; i++) {
    const s = Math.max(-1, Math.min(1, audio[i]));
    buffer.writeInt16LE(Math.round(s * 32767), 44 + i * 2);
  }

  writeFileSync(filename, buffer);
}

saveWav("output.wav", result.audio, result.sampleRate);

// Play on macOS
execSync("afplay output.wav");

// Play on Linux
// execSync("aplay output.wav");
```
Streaming Audio
For long text, stream audio chunks as they're generated:
```typescript
// Node.js streaming
for await (const chunk of gerbil.speakStream("Long paragraph of text...")) {
  console.log(`Chunk ${chunk.index}: ${chunk.samples.length} samples`);
  // Process/play chunk.samples

  if (chunk.isFinal) {
    console.log("Done!");
  }
}

// Browser streaming with seamless playback
import { createAudioPlayer } from "@tryhamster/gerbil/browser";

const player = createAudioPlayer(24000);

for await (const chunk of gerbil.speakStream("Long text...")) {
  player.queue(chunk.samples);
}
```
useSpeech Hook Reference
```typescript
const {
  // Actions
  speak,           // (text: string, opts?) => Promise<void>
  stop,            // () => void
  load,            // () => void (manual load trigger)

  // State
  isLoading,       // boolean - model loading
  isSpeaking,      // boolean - currently speaking
  isReady,         // boolean - model ready
  error,           // string | null

  // Voice control
  listVoices,      // () => VoiceInfo[]
  currentVoice,    // string
  setVoice,        // (id: string) => void
  currentSpeed,    // number
  setSpeed,        // (speed: number) => void

  // Loading progress
  loadingProgress, // { status, file?, progress? }
} = useSpeech({
  voice: "af_heart", // default voice
  speed: 1.0,        // default speed
  autoLoad: false,   // load on first speak() call
  onReady: () => {},
  onError: (err) => {},
  onStart: () => {},
  onEnd: () => {},
});
```
Performance
| Metric | Value |
|---|---|
| Model size | ~330MB |
| First load | 3-10s (downloads model) |
| Cached load | <1s |
| Generation speed | ~8x realtime on M1 Mac |
| Sample rate | 24kHz |
| Audio format | Mono Float32 PCM |
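These figures allow a rough budget for memory and latency. The sketch below is back-of-the-envelope arithmetic only: it assumes mono Float32 PCM (4 bytes per sample) and treats the realtime factor as constant, which real workloads will not match exactly. `pcmBytes` and `estimateGenerationSeconds` are illustrative names, not Gerbil functions.

```typescript
const BYTES_PER_FLOAT32 = 4;

/** Bytes needed to hold mono Float32 PCM of the given length. */
function pcmBytes(seconds: number, sampleRate: number): number {
  return seconds * sampleRate * BYTES_PER_FLOAT32;
}

/** Approximate wall-clock seconds to generate `audioSeconds` of speech. */
function estimateGenerationSeconds(audioSeconds: number, realtimeFactor: number): number {
  if (realtimeFactor <= 0) throw new RangeError("realtimeFactor must be positive");
  return audioSeconds / realtimeFactor;
}

// One minute of 24kHz audio, using the ~8x realtime figure from the table
console.log(pcmBytes(60, 24000)); // 5760000 (about 5.8 MB)
console.log(estimateGenerationSeconds(60, 8)); // 7.5
```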
Troubleshooting
Audio sounds like gibberish
This happens when using raw transformers.js without proper phoneme conversion. Gerbil uses kokoro-js which handles grapheme-to-phoneme (G2P) conversion automatically.
Audio is too quiet
Kokoro output can be quiet. The audio is normalized by default, but if you're processing raw audio, ensure proper normalization:
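```typescript
function normalizeAudio(audio: Float32Array, targetRms = 0.15): Float32Array {
  const currentRms = Math.sqrt(audio.reduce((a, b) => a + b * b, 0) / audio.length);
  // Silent or empty input has no meaningful gain (would divide by zero); return it unchanged
  if (!(currentRms > 0)) return audio;
  const gain = targetRms / currentRms;
  // Clamp after applying gain so samples stay within [-1, 1]
  return audio.map(s => Math.max(-1, Math.min(1, s * gain)));
}
```
Browser autoplay blocked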
Browsers require user interaction before playing audio. Trigger speech from a click handler:
```tsx
<button onClick={() => speak("Hello!")}>Speak</button>
```
Voice not found
Use listVoices() to see available voice IDs. Voice IDs follow the pattern {language}{gender}_{name}:
- af_ = American female
- am_ = American male
- bf_ = British female
- bm_ = British male
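The prefix scheme above can be decoded with a small parser, which is handy for validating user input before calling speak. This is a sketch under the assumption that Kokoro IDs are always two lowercase prefix letters, an underscore, and a lowercase name; `parseVoiceId` is a hypothetical helper, and Supertonic IDs such as F1 deliberately do not match.

```typescript
interface ParsedVoiceId {
  language: "American English" | "British English";
  gender: "female" | "male";
  name: string;
}

const LANGUAGES = { a: "American English", b: "British English" } as const;
const GENDERS = { f: "female", m: "male" } as const;

/** Decode a Kokoro-style voice ID, or return null if it does not fit the pattern. */
function parseVoiceId(id: string): ParsedVoiceId | null {
  const match = /^([ab])([fm])_([a-z]+)$/.exec(id);
  if (!match) return null;
  return {
    language: LANGUAGES[match[1] as keyof typeof LANGUAGES],
    gender: GENDERS[match[2] as keyof typeof GENDERS],
    name: match[3] as string,
  };
}

console.log(parseVoiceId("bm_george"));
// { language: "British English", gender: "male", name: "george" }
console.log(parseVoiceId("F1")); // null
```

A successful parse only means the ID is well-formed; listVoices() remains the source of truth for which voices actually exist.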
Model Info
Gerbil uses Kokoro-82M via the ONNX export and kokoro-js wrapper.
| Property | Value |
|---|---|
| Architecture | StyleTTS2-based |
| Parameters | 82M |
| Languages | English (US + UK) |
| License | Apache 2.0 |
Next Steps
- Speech-to-Text: transcribe audio with Whisper models
- React Hooks: useSpeech, useVoiceInput, useVoiceChat
- AI SDK Integration: SpeechModelV2 for Vercel AI SDK
- CLI: gerbil speak, gerbil voice commands