Text-to-Speech

On-device text-to-speech using Kokoro-82M or Supertonic-66M. Generate natural-sounding speech locally, with no API keys and no internet connection once the model has been downloaded.

Available Models

Model           ID              Size    Sample Rate  Voices  Speed
Kokoro-82M      kokoro-82m      ~330MB  24kHz        28      ~100x realtime
Supertonic-66M  supertonic-66m  ~250MB  44.1kHz      4       ~167x realtime

Quick Start

Node.js / CLI

tts-basic.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Load TTS (optional: auto-loads on first speak). Pick one model:
await g.loadTTS(); // Default: kokoro-82m
// await g.loadTTS({ model: "supertonic-66m" }); // Use Supertonic instead

// Generate speech
const result = await g.speak("Hello, I'm Gerbil!", {
  voice: "af_heart", // American female (highest quality)
  speed: 1.0,
});

// result.audio = Float32Array (PCM samples)
// result.sampleRate = 24000 (or 44100 for Supertonic)
// result.duration = seconds

React (Browser)

SpeechDemo.tsx
import { useSpeech } from "@tryhamster/gerbil/browser";

function SpeechDemo() {
  // setVoice is required by the <select> handler below
  const { speak, stop, isSpeaking, isLoading, listVoices, setVoice } = useSpeech();

  if (isLoading) return <div>Loading TTS model...</div>;

  return (
    <div>
      <select onChange={e => setVoice(e.target.value)}>
        {listVoices().map(v => (
          <option key={v.id} value={v.id}>{v.name} ({v.language})</option>
        ))}
      </select>

      <button onClick={() => speak("Hello world!")}>
        {isSpeaking ? "Speaking..." : "Speak"}
      </button>

      {isSpeaking && <button onClick={stop}>Stop</button>}
    </div>
  );
}

AI SDK

ai-sdk-tts.ts
import { experimental_generateSpeech as generateSpeech } from "ai";
import { gerbil } from "@tryhamster/gerbil/ai";

const audio = await generateSpeech({
  model: gerbil.speech(),
  text: "Hello from Gerbil!",
  voice: "bf_emma", // British female
});

// audio.audioData = Uint8Array (WAV format)

CLI

terminal
# Speak text with default voice
gerbil speak "Hello world"
# Specify voice and speed
gerbil speak --voice bf_emma --speed 1.2 "Cheerio!"

Skills

tts-skills.ts
import { speak, announce, readAloud } from "@tryhamster/gerbil/skills";

// Simple speech
await speak({ text: "Hello world", voice: "af_heart" });

// AI-crafted announcement
await announce({
  message: "Build completed successfully",
  style: "excited", // casual, formal, excited, calm, urgent
  voice: "af_bella",
});

// Read file aloud
await readAloud({
  content: "./README.md",
  voice: "bf_emma",
  summarizeIfLong: true, // Summarize if > 5000 chars
});

Available Voices

Kokoro provides 28 voices across American and British English:

American English - Female (11 voices, recommended)

Voice ID    Name     Quality  Description
af_heart    Heart    A        Highest quality, warm tone
af_bella    Bella    A-       Warm and friendly
af_nicole   Nicole   B-       Soft and gentle
af_sarah    Sarah    C+       Clear and professional
af_sky      Sky      C-       Light and airy
af_alloy    Alloy    C        Neutral tone
af_aoede    Aoede    C+       Musical quality
af_kore     Kore     C+       Youthful
af_nova     Nova     C        Modern
af_river    River    D        Flowing
af_jessica  Jessica  D        Casual

American English - Male (9 voices)

Voice ID    Name     Quality  Description
am_fenrir   Fenrir   C+       Best male quality
am_michael  Michael  C+       Warm and friendly
am_puck     Puck     C+       Neutral tone
am_adam     Adam     F+       Deep voice
am_echo     Echo     D        Resonant
am_eric     Eric     D        Steady
am_liam     Liam     D        Young
am_onyx     Onyx     D        Rich
am_santa    Santa    D-       Jovial

British English - Female (4 voices)

Voice ID     Name      Quality  Description
bf_emma      Emma      B-       Elegant and clear
bf_isabella  Isabella  C        Sophisticated
bf_alice     Alice     D        Bright
bf_lily      Lily      D        Gentle

British English - Male (4 voices)

Voice ID   Name    Quality  Description
bm_george  George  C        Distinguished
bm_fable   Fable   C        Storyteller tone
bm_lewis   Lewis   D+       Refined
bm_daniel  Daniel  D        Classic
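
The quality column uses letter grades, so choosing the best voice from a subset means ordering those grades. The sketch below shows one way to rank them. GRADE_ORDER, GradedVoice, gradeRank, and sortByQuality are hypothetical names for illustration and are not part of Gerbil; the grade ordering is the conventional one (A best, F worst) and is assumed rather than documented.

```typescript
// Letter grades from best to worst, covering those used in the voice tables.
const GRADE_ORDER = ["A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F+", "F"];

// Hypothetical shape: a voice ID paired with its quality grade.
interface GradedVoice {
  id: string;
  grade: string;
}

// Lower rank means a better grade; unknown grades sort last.
function gradeRank(grade: string): number {
  const index = GRADE_ORDER.indexOf(grade);
  return index === -1 ? GRADE_ORDER.length : index;
}

// Returns a new array sorted from best to worst grade.
function sortByQuality(voices: GradedVoice[]): GradedVoice[] {
  return [...voices].sort((a, b) => gradeRank(a.grade) - gradeRank(b.grade));
}

// A few entries copied from the voice tables.
const sample: GradedVoice[] = [
  { id: "am_adam", grade: "F+" },
  { id: "bf_emma", grade: "B-" },
  { id: "af_heart", grade: "A" },
  { id: "am_fenrir", grade: "C+" },
];
console.log(sortByQuality(sample).map((v) => v.id));
// [ 'af_heart', 'bf_emma', 'am_fenrir', 'am_adam' ]
```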

Supertonic-66M Voices (4 voices)

Supertonic is a faster TTS model that outputs audio at a 44.1kHz sample rate and has its own voice set:

Voice ID  Name      Gender  Description
F1        Female 1  Female  Clear and natural (default)
F2        Female 2  Female  Warm and expressive
M1        Male 1    Male    Deep and confident
M2        Male 2    Male    Friendly and casual

Tip: For Kokoro-82M, af_heart is the highest quality voice. For Supertonic-66M, use F1 for clear, natural speech at 44.1kHz.

API Reference

Gerbil Class Methods

Gerbil.ts
// Signatures only. loadTTS and speak are async; speakStream is an async generator.
declare class Gerbil {
  // Load TTS model (auto-called by speak if needed)
  loadTTS(options?: LoadTTSOptions): Promise<void>;

  // Check if TTS is loaded
  isTTSLoaded(): boolean;

  // Generate speech
  speak(text: string, options?: SpeakOptions): Promise<SpeakResult>;

  // Stream speech (yields chunks as generated)
  speakStream(text: string, options?: SpeakOptions): AsyncGenerator<AudioChunk>;

  // Get available voices
  listVoices(): VoiceInfo[];
  getVoice(voiceId: string): VoiceInfo | null;
}

SpeakOptions

SpeakOptions.ts
interface SpeakOptions {
  /** Voice ID (default: "af_heart") */
  voice?: string;
  /** Speed multiplier 0.5-2.0 (default: 1.0) */
  speed?: number;
  /** Progress callback during loading */
  onProgress?: (info: ProgressInfo) => void;
  /** Callback for each audio chunk (streaming) */
  onAudioChunk?: (chunk: AudioChunk) => void;
}
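
SpeakOptions documents a speed range of 0.5-2.0 but does not say what happens to values outside it. A caller that takes speed from user input may want to clamp it first. The sketch below is one way to do that; clampSpeed is a hypothetical helper, not a Gerbil export, and the fallback to 1.0 for non-finite input is an assumption.

```typescript
// Range documented for SpeakOptions.speed.
const MIN_SPEED = 0.5;
const MAX_SPEED = 2.0;

// Hypothetical helper: keep a requested speed inside the documented range,
// falling back to the default of 1.0 for NaN or infinite input.
function clampSpeed(speed: number): number {
  if (!Number.isFinite(speed)) return 1.0;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

console.log(clampSpeed(3));   // 2
console.log(clampSpeed(0.1)); // 0.5
console.log(clampSpeed(1.2)); // 1.2
```

The clamped value can then be passed as the speed option to speak().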

SpeakResult

SpeakResult.ts
interface SpeakResult {
  /** PCM audio samples (mono, float32, -1 to 1) */
  audio: Float32Array;
  /** Sample rate in Hz (24000 for Kokoro, 44100 for Supertonic) */
  sampleRate: number;
  /** Duration in seconds */
  duration: number;
  /** Voice ID used */
  voice: string;
  /** Generation time in milliseconds */
  totalTime: number;
}
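
For mono PCM, the sample count and sample rate determine the playback length, so duration should presumably equal audio.length / sampleRate. The sketch below shows that arithmetic. PcmAudio and expectedDuration are hypothetical names that mirror the fields above; they are not Gerbil exports.

```typescript
// Hypothetical shape mirroring two SpeakResult fields.
interface PcmAudio {
  audio: Float32Array; // mono PCM samples
  sampleRate: number;  // samples per second
}

// Duration in seconds = number of samples / samples per second.
function expectedDuration(pcm: PcmAudio): number {
  return pcm.audio.length / pcm.sampleRate;
}

// One second of silence at Kokoro's 24 kHz rate.
const oneSecond: PcmAudio = { audio: new Float32Array(24000), sampleRate: 24000 };
console.log(expectedDuration(oneSecond)); // 1
```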

Playing Audio

Browser (Web Audio API)

browser-playback.ts
import { playAudio } from "@tryhamster/gerbil/browser";

// `gerbil` is a Gerbil instance (see Quick Start)
const result = await gerbil.speak("Hello!");

// One-liner playback
const controller = await playAudio(result.audio, result.sampleRate);

// To stop early:
controller.stop();

// Or wait for completion:
await controller.onEnded;

Node.js (Save to File)

nodejs-save.ts
import { writeFileSync } from "fs";
import { execSync } from "child_process";
import { Gerbil } from "@tryhamster/gerbil";

const gerbil = new Gerbil();
const result = await gerbil.speak("Hello!");

// Convert float samples to a 16-bit mono PCM WAV file
function saveWav(filename: string, audio: Float32Array, sampleRate: number) {
  const buffer = Buffer.alloc(44 + audio.length * 2);

  // WAV header
  buffer.write("RIFF", 0);
  buffer.writeUInt32LE(36 + audio.length * 2, 4); // File size minus 8 bytes
  buffer.write("WAVE", 8);
  buffer.write("fmt ", 12);
  buffer.writeUInt32LE(16, 16); // fmt chunk size
  buffer.writeUInt16LE(1, 20); // PCM
  buffer.writeUInt16LE(1, 22); // Mono
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(sampleRate * 2, 28); // Byte rate
  buffer.writeUInt16LE(2, 32); // Block align
  buffer.writeUInt16LE(16, 34); // Bits per sample
  buffer.write("data", 36);
  buffer.writeUInt32LE(audio.length * 2, 40); // Data size in bytes

  // Audio data: clamp to [-1, 1] and scale to signed 16-bit
  for (let i = 0; i < audio.length; i++) {
    const s = Math.max(-1, Math.min(1, audio[i]));
    buffer.writeInt16LE(Math.round(s * 32767), 44 + i * 2);
  }

  writeFileSync(filename, buffer);
}

saveWav("output.wav", result.audio, result.sampleRate);

// Play on macOS
execSync("afplay output.wav");

// Play on Linux
// execSync("aplay output.wav");
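
saveWav above depends on Node's Buffer. The same 44-byte header can be written with DataView, which also works in browsers. The sketch below assumes the same 16-bit mono PCM layout; encodeWav is a hypothetical helper, not a Gerbil export.

```typescript
// Hypothetical helper: encode mono float samples as a 16-bit PCM WAV file
// using only standard typed-array APIs.
function encodeWav(audio: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = audio.length * bytesPerSample;
  const bytes = new Uint8Array(44 + dataSize);
  const view = new DataView(bytes.buffer);

  // Write a four-character ASCII tag such as "RIFF" at the given offset.
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i));
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // format: PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeTag(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp each sample to [-1, 1] and scale to the signed 16-bit range.
  audio.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  });

  return bytes;
}

const wav = encodeWav(new Float32Array([0, 1, -1]), 24000);
console.log(wav.length); // 50
```

In a browser, the returned bytes could be wrapped in a Blob with type "audio/wav" for download or playback through an audio element.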

Streaming Audio

For long text, stream audio chunks as they're generated:

streaming.ts
// Node.js streaming (`gerbil` is a Gerbil instance, see Quick Start)
for await (const chunk of gerbil.speakStream("Long paragraph of text...")) {
  console.log(`Chunk ${chunk.index}: ${chunk.samples.length} samples`);
  // Process/play chunk.samples

  if (chunk.isFinal) {
    console.log("Done!");
  }
}

// Browser streaming with seamless playback
import { createAudioPlayer } from "@tryhamster/gerbil/browser";

const player = createAudioPlayer(24000); // Kokoro's sample rate; Supertonic outputs 44100

for await (const chunk of gerbil.speakStream("Long text...")) {
  player.queue(chunk.samples);
}
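
If the full recording is needed after streaming (for example, to save it as a single file), the chunks can be collected and joined. The sketch below assumes each chunk's samples field is a Float32Array, as the logging above suggests; concatSamples is a hypothetical helper, not a Gerbil export.

```typescript
// Hypothetical helper: join streamed sample chunks into one contiguous
// Float32Array, preserving their order.
function concatSamples(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset); // copy this chunk at the current write position
    offset += chunk.length;
  }
  return out;
}

const joined = concatSamples([new Float32Array([0.5, 0.25]), new Float32Array([-0.75])]);
console.log(joined.length); // 3
```

Inside the streaming loop, each chunk.samples would be pushed onto an array and passed to concatSamples once the final chunk arrives.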

useSpeech Hook Reference

useSpeech.ts
const {
  // Actions
  speak,           // (text: string, opts?) => Promise<void>
  stop,            // () => void
  load,            // () => void (manual load trigger)

  // State
  isLoading,       // boolean - model loading
  isSpeaking,      // boolean - currently speaking
  isReady,         // boolean - model ready
  error,           // string | null

  // Voice control
  listVoices,      // () => VoiceInfo[]
  currentVoice,    // string
  setVoice,        // (id: string) => void
  currentSpeed,    // number
  setSpeed,        // (speed: number) => void

  // Loading progress
  loadingProgress, // { status, file?, progress? }
} = useSpeech({
  voice: "af_heart", // default voice
  speed: 1.0,        // default speed
  autoLoad: false,   // load on first speak() call
  onReady: () => {},
  onError: (err) => {},
  onStart: () => {},
  onEnd: () => {},
});

Performance

Metric            Value
Model size        ~330MB
First load        3-10s (downloads model)
Cached load       <1s
Generation speed  ~8x realtime on M1 Mac
Sample rate       24kHz
Audio format      Mono Float32 PCM
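
"Nx realtime" means N seconds of audio are produced per second of compute. To measure it on your own hardware, divide SpeakResult.duration by SpeakResult.totalTime converted to seconds. A minimal sketch, where realtimeFactor is a hypothetical helper and not a Gerbil export:

```typescript
// Realtime factor = seconds of audio produced per second of compute.
// audioSeconds corresponds to SpeakResult.duration and totalTimeMs to
// SpeakResult.totalTime (milliseconds).
function realtimeFactor(audioSeconds: number, totalTimeMs: number): number {
  if (totalTimeMs <= 0) throw new RangeError("totalTimeMs must be positive");
  return audioSeconds / (totalTimeMs / 1000);
}

// 4 s of audio generated in 500 ms is 8x realtime.
console.log(realtimeFactor(4, 500)); // 8
```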

Troubleshooting

Audio sounds like gibberish

This happens when Kokoro is run through raw transformers.js without proper phoneme conversion. Gerbil uses kokoro-js, which handles grapheme-to-phoneme (G2P) conversion automatically.

Audio is too quiet

Kokoro output can be quiet. Gerbil normalizes the audio by default, but if you process raw audio yourself, normalize it to a target RMS level:

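normalize.ts
function normalizeAudio(audio: Float32Array, targetRms = 0.15): Float32Array {
  // Root-mean-square level of the input
  const currentRms = Math.sqrt(audio.reduce((a, b) => a + b * b, 0) / audio.length);
  // Empty or silent input: return unchanged to avoid dividing by zero
  if (!(currentRms > 0)) return audio;
  // Scale to the target level, clamping samples to [-1, 1]
  const gain = targetRms / currentRms;
  return audio.map(s => Math.max(-1, Math.min(1, s * gain)));
}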

Browser autoplay blocked

Browsers require user interaction before playing audio. Trigger speech from a click handler:

autoplay-fix.tsx
<button onClick={() => speak("Hello!")}>Speak</button>

Voice not found

Use listVoices() to see the available voice IDs. Kokoro voice IDs follow the pattern {language}{gender}_{name} (Supertonic uses F1, F2, M1, and M2 instead):

  • af_ = American female
  • am_ = American male
  • bf_ = British female
  • bm_ = British male
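
The prefix scheme above can be checked before calling speak(). The sketch below splits a Kokoro voice ID into its parts. parseVoiceId and ParsedVoiceId are hypothetical names, not Gerbil exports, and the sketch validates only the shape of the ID, not whether the voice actually exists.

```typescript
interface ParsedVoiceId {
  language: "American English" | "British English";
  gender: "female" | "male";
  name: string;
}

// Hypothetical helper: split a Kokoro voice ID such as "af_heart" into its
// parts. Returns null for IDs that do not match the pattern, including
// Supertonic IDs such as "F1".
function parseVoiceId(id: string): ParsedVoiceId | null {
  const match = /^([ab])([fm])_([a-z]+)$/.exec(id);
  if (!match) return null;
  const [, lang, gender, name] = match;
  if (lang === undefined || gender === undefined || name === undefined) return null;
  return {
    language: lang === "a" ? "American English" : "British English",
    gender: gender === "f" ? "female" : "male",
    name,
  };
}

console.log(parseVoiceId("bf_emma"));
// { language: 'British English', gender: 'female', name: 'emma' }
console.log(parseVoiceId("F1")); // null
```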

Model Info

Gerbil runs Kokoro-82M through its ONNX export and the kokoro-js wrapper.

Property      Value
Architecture  StyleTTS2-based
Parameters    82M
Languages     English (US + UK)
License       Apache 2.0
