Text-to-Speech

Name: Gerbil
Author: Gerbil

On-device text-to-speech using Kokoro-82M or Supertonic-66M. Generate natural-sounding speech locally without API keys. An internet connection is needed only for the initial model download.

Available Models

Model	ID	Size	Sample Rate	Voices	Speed
Kokoro-82M	`kokoro-82m`	~330MB	24kHz	28	~100x realtime
Supertonic-66M	`supertonic-66m`	~250MB	44.1kHz	4	~167x realtime
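
The table above can be captured as a typed lookup so that playback code does not have to hardcode a sample rate. This is only a sketch: `TTSModelId`, `TTS_MODELS`, and `sampleRateFor` are illustrative names, not part of Gerbil's API, and the values are copied by hand from the table.

```typescript
// Values copied from the "Available Models" table.
// All names here are hypothetical helpers, not Gerbil exports.
type TTSModelId = "kokoro-82m" | "supertonic-66m";

interface TTSModelSpec {
  sampleRate: number;   // output sample rate in Hz
  voices: number;       // number of built-in voices
  approxSizeMB: number; // approximate download size
}

const TTS_MODELS: Record<TTSModelId, TTSModelSpec> = {
  "kokoro-82m": { sampleRate: 24000, voices: 28, approxSizeMB: 330 },
  "supertonic-66m": { sampleRate: 44100, voices: 4, approxSizeMB: 250 },
};

// Look up the sample rate to use when creating an audio player or WAV header.
function sampleRateFor(model: TTSModelId): number {
  return TTS_MODELS[model].sampleRate;
}

console.log(sampleRateFor("supertonic-66m")); // 44100
```

In practice `result.sampleRate` from `speak()` is the safer source; a lookup like this mainly helps when a player has to be created before any audio exists.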

Quick Start

Node.js / CLI

tts-basic.ts

import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Load TTS (optional - auto-loads on first speak)
await g.loadTTS(); // Default: kokoro-82m
// Or load Supertonic instead (it has its own voice IDs, e.g. "F1"):
// await g.loadTTS({ model: "supertonic-66m" });

// Generate speech
const result = await g.speak("Hello, I'm Gerbil!", {
  voice: "af_heart",  // American female (highest quality)
  speed: 1.0
});

// result.audio = Float32Array (PCM samples)
// result.sampleRate = 24000 (or 44100 for Supertonic)
// result.duration = seconds

React (Browser)

SpeechDemo.tsx

import { useSpeech } from "@tryhamster/gerbil/browser";

function SpeechDemo() {
  // setVoice is needed by the <select> handler below
  const { speak, stop, isSpeaking, isLoading, listVoices, setVoice } = useSpeech();

  if (isLoading) return <div>Loading TTS model...</div>;

  return (
    <div>
      <select onChange={e => setVoice(e.target.value)}>
        {listVoices().map(v => (
          <option key={v.id} value={v.id}>{v.name} ({v.language})</option>
        ))}
      </select>

      <button onClick={() => speak("Hello world!")}>
        {isSpeaking ? "Speaking..." : "Speak"}
      </button>

      {isSpeaking && <button onClick={stop}>Stop</button>}
    </div>
  );
}

AI SDK

ai-sdk-tts.ts

import { experimental_generateSpeech as generateSpeech } from "ai";
import { gerbil } from "@tryhamster/gerbil/ai";

const audio = await generateSpeech({
  model: gerbil.speech(),
  text: "Hello from Gerbil!",
  voice: "bf_emma",  // British female
});

// audio.audioData = Uint8Array (WAV format)
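
Because `audioData` is described as WAV bytes, its header can be inspected to confirm the sample rate before playback. The sketch below assumes a canonical 44-byte PCM header (`RIFF`, `WAVE`, then a `fmt ` chunk at offset 12); whether Gerbil's output always uses that exact layout is an assumption, and `readWavHeader` is a hypothetical helper.

```typescript
interface WavInfo {
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
}

// Reads the fixed-offset fields of a canonical 44-byte PCM WAV header.
// Files with extra chunks before "fmt " are rejected rather than parsed.
function readWavHeader(bytes: Uint8Array): WavInfo {
  if (bytes.byteLength < 44) {
    throw new Error("Too short to contain a canonical WAV header");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // Decode a 4-character ASCII chunk tag at the given offset.
  const tag = (offset: number): string => {
    let s = "";
    for (let i = 0; i < 4; i++) {
      s += String.fromCharCode(view.getUint8(offset + i));
    }
    return s;
  };

  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(12) !== "fmt ") {
    throw new Error("Not a canonical RIFF/WAVE header");
  }

  return {
    channels: view.getUint16(22, true),      // little-endian
    sampleRate: view.getUint32(24, true),
    bitsPerSample: view.getUint16(34, true),
  };
}
```

Called as `readWavHeader(audio.audioData)`, this would return the channel count, sample rate, and bit depth declared by the file.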

CLI

terminal

# Speak text with default voice
gerbil speak "Hello world"

# Specify voice and speed
gerbil speak --voice bf_emma --speed 1.2 "Cheerio!"

Skills

tts-skills.ts

import { speak, announce, readAloud } from "@tryhamster/gerbil/skills";

// Simple speech
await speak({ text: "Hello world", voice: "af_heart" });

// AI-crafted announcement
await announce({ 
  message: "Build completed successfully",
  style: "excited",  // casual, formal, excited, calm, urgent
  voice: "af_bella"
});

// Read file aloud
await readAloud({ 
  content: "./README.md",
  voice: "bf_emma",
  summarizeIfLong: true  // Summarize if > 5000 chars
});

Available Voices

Kokoro provides 28 voices across American and British English:

American English - Female (11 voices, recommended)

Voice ID	Name	Quality	Description
`af_heart`	Heart	A	Highest quality, warm tone
`af_bella`	Bella	A-	Warm and friendly
`af_nicole`	Nicole	B-	Soft and gentle
`af_sarah`	Sarah	C+	Clear and professional
`af_sky`	Sky	C-	Light and airy
`af_alloy`	Alloy	C	Neutral tone
`af_aoede`	Aoede	C+	Musical quality
`af_kore`	Kore	C+	Youthful
`af_nova`	Nova	C	Modern
`af_river`	River	D	Flowing
`af_jessica`	Jessica	D	Casual

American English - Male (9 voices)

Voice ID	Name	Quality	Description
`am_fenrir`	Fenrir	C+	Best male quality
`am_michael`	Michael	C+	Warm and friendly
`am_puck`	Puck	C+	Neutral tone
`am_adam`	Adam	F+	Deep voice
`am_echo`	Echo	D	Resonant
`am_eric`	Eric	D	Steady
`am_liam`	Liam	D	Young
`am_onyx`	Onyx	D	Rich
`am_santa`	Santa	D-	Jovial

British English - Female (4 voices)

Voice ID	Name	Quality	Description
`bf_emma`	Emma	B-	Elegant and clear
`bf_isabella`	Isabella	C	Sophisticated
`bf_alice`	Alice	D	Bright
`bf_lily`	Lily	D	Gentle

British English - Male (4 voices)

Voice ID	Name	Quality	Description
`bm_george`	George	C	Distinguished
`bm_fable`	Fable	C	Storyteller tone
`bm_lewis`	Lewis	D+	Refined
`bm_daniel`	Daniel	D	Classic
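
To choose a voice programmatically, the letter grades in the tables above can be ranked. The sketch below assumes the grades are copied by hand from this page; it does not assume that `VoiceInfo` exposes a quality field, and `GRADE_ORDER`, `VoiceGrade`, and `bestVoice` are hypothetical names.

```typescript
// Letter grades as they appear in the voice tables, best first.
const GRADE_ORDER = [
  "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F+", "F",
];

interface VoiceGrade {
  id: string;
  grade: string;
}

// Returns the highest-graded voice, or undefined for an empty list.
// Ties keep their original order; unknown grades sort last.
function bestVoice(voices: VoiceGrade[]): VoiceGrade | undefined {
  const rank = (grade: string): number => {
    const i = GRADE_ORDER.indexOf(grade);
    return i === -1 ? GRADE_ORDER.length : i;
  };
  return [...voices].sort((a, b) => rank(a.grade) - rank(b.grade))[0];
}

// British female voices from the table above, deliberately out of order.
const britishFemale: VoiceGrade[] = [
  { id: "bf_lily", grade: "D" },
  { id: "bf_isabella", grade: "C" },
  { id: "bf_emma", grade: "B-" },
  { id: "bf_alice", grade: "D" },
];

console.log(bestVoice(britishFemale)?.id); // bf_emma
```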

Supertonic-66M Voices (4 voices)

Supertonic is a faster TTS model that outputs audio at a 44.1kHz sample rate and has its own voice set:

Voice ID	Name	Gender	Description
`F1`	Female 1	Female	Clear and natural (default)
`F2`	Female 2	Female	Warm and expressive
`M1`	Male 1	Male	Deep and confident
`M2`	Male 2	Male	Friendly and casual

Tip: For Kokoro-82M, af_heart is the highest quality voice. For Supertonic-66M, use F1 for clear, natural speech at 44.1kHz.

API Reference

Gerbil Class Methods

Gerbil.ts

class Gerbil {
  // Load TTS model (auto-called by speak if needed)
  async loadTTS(options?: LoadTTSOptions): Promise<void>;
  
  // Check if TTS is loaded
  isTTSLoaded(): boolean;
  
  // Generate speech
  async speak(text: string, options?: SpeakOptions): Promise<SpeakResult>;
  
  // Stream speech (yields chunks as generated)
  async *speakStream(text: string, options?: SpeakOptions): AsyncGenerator<AudioChunk>;
  
  // Get available voices
  listVoices(): VoiceInfo[];
  getVoice(voiceId: string): VoiceInfo | null;
}

SpeakOptions

SpeakOptions.ts

interface SpeakOptions {
  /** Voice ID (default: "af_heart") */
  voice?: string;
  
  /** Speed multiplier 0.5-2.0 (default: 1.0) */
  speed?: number;
  
  /** Progress callback during loading */
  onProgress?: (info: ProgressInfo) => void;
  
  /** Callback for each audio chunk (streaming) */
  onAudioChunk?: (chunk: AudioChunk) => void;
}
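
The documented speed range is 0.5-2.0. Whether Gerbil clamps or rejects values outside that range is not stated here, so a caller may want to clamp user input first. A minimal sketch, with `clampSpeed` as a hypothetical helper:

```typescript
// Range and default taken from the SpeakOptions comments above.
const MIN_SPEED = 0.5;
const MAX_SPEED = 2.0;
const DEFAULT_SPEED = 1.0;

// Returns a speed inside [0.5, 2.0]; missing or NaN input yields the default.
function clampSpeed(speed?: number): number {
  if (speed === undefined || Number.isNaN(speed)) return DEFAULT_SPEED;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

console.log(clampSpeed(3));   // 2
console.log(clampSpeed(0.1)); // 0.5
console.log(clampSpeed());    // 1
```

The result can then be passed as `speed` in the options, for example `{ voice: "af_heart", speed: clampSpeed(userSpeed) }`.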

SpeakResult

SpeakResult.ts

interface SpeakResult {
  /** PCM audio samples (mono, float32, -1 to 1) */
  audio: Float32Array;
  
  /** Sample rate in Hz (24000 for Kokoro, 44100 for Supertonic) */
  sampleRate: number;
  
  /** Duration in seconds */
  duration: number;
  
  /** Voice ID used */
  voice: string;
  
  /** Generation time in milliseconds */
  totalTime: number;
}
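
The fields of `SpeakResult` are related by simple arithmetic: duration is the sample count divided by the sample rate, and a realtime factor follows from `duration` and `totalTime`. The helpers below are an illustrative sketch, not Gerbil functions.

```typescript
// Duration in seconds of a mono PCM buffer.
function durationSeconds(sampleCount: number, sampleRate: number): number {
  return sampleCount / sampleRate;
}

// How many seconds of audio were produced per second of generation time.
// totalTimeMs corresponds to SpeakResult.totalTime (milliseconds).
function realtimeFactor(durationSec: number, totalTimeMs: number): number {
  return durationSec / (totalTimeMs / 1000);
}

// Worked example: 48,000 samples at 24 kHz is 2 s of audio.
const seconds = durationSeconds(48000, 24000);
console.log(seconds); // 2

// If generating those 2 s took 250 ms, the realtime factor is 8.
console.log(realtimeFactor(seconds, 250)); // 8
```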

Playing Audio

Browser (Web Audio API)

browser-playback.ts

import { playAudio } from "@tryhamster/gerbil/browser";

const result = await gerbil.speak("Hello!");

// One-liner playback
const controller = await playAudio(result.audio, result.sampleRate);

// Stop early
controller.stop();

// Wait for completion
await controller.onEnded;

Node.js (Save to File)

nodejs-save.ts

import { writeFileSync } from "fs";
import { execSync } from "child_process";

const result = await gerbil.speak("Hello!");

// Convert Float32 samples to a 16-bit mono PCM WAV file
function saveWav(filename: string, audio: Float32Array, sampleRate: number) {
  const buffer = Buffer.alloc(44 + audio.length * 2);

  // WAV header (44 bytes)
  buffer.write("RIFF", 0);
  buffer.writeUInt32LE(36 + audio.length * 2, 4);  // File size minus 8
  buffer.write("WAVE", 8);
  buffer.write("fmt ", 12);
  buffer.writeUInt32LE(16, 16);              // fmt chunk size
  buffer.writeUInt16LE(1, 20);               // PCM
  buffer.writeUInt16LE(1, 22);               // Mono
  buffer.writeUInt32LE(sampleRate, 24);
  buffer.writeUInt32LE(sampleRate * 2, 28);  // Byte rate (mono, 2 bytes per sample)
  buffer.writeUInt16LE(2, 32);               // Block align
  buffer.writeUInt16LE(16, 34);              // Bits per sample
  buffer.write("data", 36);
  buffer.writeUInt32LE(audio.length * 2, 40);

  // Audio data: clamp to [-1, 1], then scale to signed 16-bit
  for (let i = 0; i < audio.length; i++) {
    const s = Math.max(-1, Math.min(1, audio[i]));
    buffer.writeInt16LE(Math.round(s * 32767), 44 + i * 2);
  }

  writeFileSync(filename, buffer);
}

saveWav("output.wav", result.audio, result.sampleRate);

// Play on macOS
execSync("afplay output.wav");

// Play on Linux
// execSync("aplay output.wav");

Streaming Audio

For long text, stream audio chunks as they're generated:

streaming.ts

// Node.js streaming
for await (const chunk of gerbil.speakStream("Long paragraph of text...")) {
  console.log(`Chunk ${chunk.index}: ${chunk.samples.length} samples`);
  // Process/play chunk.samples

  if (chunk.isFinal) {
    console.log("Done!");
  }
}

// Browser streaming with seamless playback
import { createAudioPlayer } from "@tryhamster/gerbil/browser";

// 24000 is the Kokoro sample rate; use 44100 for Supertonic
const player = createAudioPlayer(24000);

for await (const chunk of gerbil.speakStream("Long text...")) {
  player.queue(chunk.samples);
}
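
When the streamed chunks are also needed as one buffer (for example, to save a WAV file after playback), they can be collected and concatenated. The sketch below assumes each `chunk.samples` is a `Float32Array`, as the snippet above suggests; `concatSamples` is a hypothetical helper.

```typescript
// Joins PCM chunks into one contiguous Float32Array, preserving order.
function concatSamples(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (let i = 0; i < chunks.length; i++) {
    const chunk = chunks[i];
    if (chunk === undefined) continue; // defensive under strict index checks
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

const joined = concatSamples([
  new Float32Array([0.25, 0.5]),
  new Float32Array([0.75]),
]);
console.log(joined.length); // 3
```

Inside the streaming loop, each `chunk.samples` would be pushed onto an array and `concatSamples` called once the loop finishes.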

useSpeech Hook Reference

useSpeech.ts

const {
  // Actions
  speak,          // (text: string, opts?) => Promise<void>
  stop,           // () => void
  load,           // () => void (manual load trigger)

  // State
  isLoading,      // boolean - model loading
  isSpeaking,     // boolean - currently speaking
  isReady,        // boolean - model ready
  error,          // string | null

  // Voice control
  listVoices,     // () => VoiceInfo[]
  currentVoice,   // string
  setVoice,       // (id: string) => void
  currentSpeed,   // number
  setSpeed,       // (speed: number) => void

  // Loading progress
  loadingProgress, // { status, file?, progress? }
} = useSpeech({
  voice: "af_heart",   // default voice
  speed: 1.0,          // default speed
  autoLoad: false,     // load on first speak() call
  onReady: () => {},
  onError: (err) => {},
  onStart: () => {},
  onEnd: () => {},
});

Performance

Metric	Value (Kokoro-82M)
Model size	~330MB
First load	3-10s (downloads model)
Cached load	<1s
Generation speed	~8x realtime on M1 Mac
Sample rate	24kHz
Audio format	Mono Float32 PCM
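
The sample rate and audio format rows imply a memory cost that is easy to estimate: mono Float32 PCM uses 4 bytes per sample. A small sketch of that arithmetic, with `pcmBytes` as a hypothetical helper:

```typescript
// Bytes needed for mono PCM of the given length.
// bytesPerSample is 4 for Float32 (the default) and 2 for 16-bit WAV data.
function pcmBytes(seconds: number, sampleRate: number, bytesPerSample = 4): number {
  return Math.round(seconds * sampleRate) * bytesPerSample;
}

// One minute of 24 kHz audio held as Float32 in memory:
console.log(pcmBytes(60, 24000));    // 5760000 (about 5.8 MB)

// The same minute written out as 16-bit PCM:
console.log(pcmBytes(60, 24000, 2)); // 2880000 (about 2.9 MB)
```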

Troubleshooting

Audio sounds like gibberish

This happens when using raw transformers.js without proper phoneme conversion. Gerbil uses kokoro-js, which handles grapheme-to-phoneme (G2P) conversion automatically.

Audio is too quiet

Kokoro output can be quiet. The audio is normalized by default, but if you're processing raw audio, ensure proper normalization:

normalize.ts

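function normalizeAudio(audio: Float32Array, targetRms = 0.15): Float32Array {
  if (audio.length === 0) return audio;
  const currentRms = Math.sqrt(audio.reduce((a, b) => a + b * b, 0) / audio.length);
  // Silent input has an RMS of 0; return it unchanged to avoid dividing by zero
  if (currentRms === 0) return audio;
  const gain = targetRms / currentRms;
  // Apply the gain, then clip to the valid [-1, 1] range
  return audio.map(s => Math.max(-1, Math.min(1, s * gain)));
}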
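
Before normalizing, it can help to measure the buffer to see whether it is actually quiet. The sketch below computes RMS and peak levels; `audioLevels` is a hypothetical helper, and what counts as too quiet is left to the caller (the 0.15 target RMS above is one reference point).

```typescript
// Returns the RMS and absolute peak of a PCM buffer in the range [0, 1].
// An empty buffer reports 0 for both.
function audioLevels(audio: Float32Array): { rms: number; peak: number } {
  let sumSquares = 0;
  let peak = 0;
  audio.forEach((s) => {
    sumSquares += s * s;
    peak = Math.max(peak, Math.abs(s));
  });
  const rms = audio.length > 0 ? Math.sqrt(sumSquares / audio.length) : 0;
  return { rms, peak };
}

const levels = audioLevels(new Float32Array([0.5, -0.5]));
console.log(levels.rms, levels.peak); // 0.5 0.5
```

A low RMS with a peak already close to 1 suggests that plain gain will clip, so the clipping step matters in that case.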

Browser autoplay blocked

Browsers require user interaction before playing audio. Trigger speech from a click handler:

autoplay-fix.tsx

<button onClick={() => speak("Hello!")}>Speak</button>

Voice not found

Use listVoices() to see available voice IDs. Supertonic voices are F1, F2, M1, and M2. Kokoro voice IDs follow the pattern {language}{gender}_{name}:

- af_ = American female
- am_ = American male
- bf_ = British female
- bm_ = British male
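
The prefix scheme above can be checked in code before calling `speak()`, which helps catch typos early. This is a sketch under the assumption that Kokoro IDs are always two lowercase prefix letters, an underscore, and a lowercase name; `parseKokoroVoiceId` is a hypothetical helper, and `listVoices()` remains the authoritative source.

```typescript
interface ParsedVoiceId {
  language: string;
  gender: string;
  name: string;
}

const LANGUAGES: Record<string, string> = {
  a: "American English",
  b: "British English",
};
const GENDERS: Record<string, string> = { f: "female", m: "male" };

// Splits an ID such as "af_heart" into its parts.
// Returns null for anything that does not match the Kokoro pattern.
function parseKokoroVoiceId(id: string): ParsedVoiceId | null {
  const match = /^([ab])([fm])_([a-z]+)$/.exec(id);
  if (match === null) return null;
  const language = LANGUAGES[match[1] ?? ""];
  const gender = GENDERS[match[2] ?? ""];
  const name = match[3];
  if (language === undefined || gender === undefined || name === undefined) {
    return null;
  }
  return { language, gender, name };
}

console.log(parseKokoroVoiceId("bm_george"));
// { language: "British English", gender: "male", name: "george" }
console.log(parseKokoroVoiceId("F1")); // null (Supertonic IDs use a different scheme)
```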

Model Info

Gerbil uses Kokoro-82M via the ONNX export and kokoro-js wrapper.

Property	Value
Architecture	StyleTTS2-based
Parameters	82M
Languages	English (US + UK)
License	Apache 2.0

Next Steps

Speech-to-Text: transcribe audio with Whisper models
React Hooks: useSpeech, useVoiceInput, useVoiceChat
AI SDK Integration: SpeechModelV2 for Vercel AI SDK
CLI: gerbil speak, gerbil voice commands