Speech-to-Text
Native, on-device speech-to-text with Moonshine — a raw-waveform encoder/decoder ASR model running entirely on the WebGPU engine (WGSL). Private by default and free to run.
The engine is MoonshineSTT, imported from @tryhamster/gerbil/gpu. Moonshine is English-only, decodes greedily, and consumes raw 16 kHz mono PCM. WebGPU is required: there is no CPU fallback for the native engine.

Quick Start (React): useSTT
In React, use useSTT from @tryhamster/gerbil/hooks. It captures the mic between startRecording() and stopRecording(), resamples to 16 kHz mono for you, runs Moonshine, and surfaces the transcript. Defaults to Moonshine — no model argument needed:
```tsx
"use client";

import { useSTT } from "@tryhamster/gerbil/hooks";

function VoiceInput() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    isTranscribing,
    transcript,
    audioSeconds,
    error,
  } = useSTT();

  return (
    <div>
      <button
        onClick={() => (isRecording ? stopRecording() : startRecording())}
        disabled={isTranscribing}
      >
        {isRecording ? "Stop" : "Record"}
      </button>
      {isTranscribing && <span>Transcribing…</span>}
      {transcript && <p>You said: {transcript}</p>}
      {error && <p className="text-red-500">{error}</p>}
    </div>
  );
}
```

Hands-free conversation: useVoiceChat
Transcription is the front of a voice loop. To feed what the user said straight into a chat model — and speak the reply back — use useVoiceChat. It composes useSTT, useChat, and useTTS into one mic → LLM → spoken-reply assistant, fully on-device:
```tsx
"use client";

import { useVoiceChat } from "@tryhamster/gerbil/hooks";

function VoiceAssistant() {
  const { messages, start, stop, isListening, isTranscribing, transcript } =
    useVoiceChat({ system: "You are a friendly voice assistant." });

  return (
    <div>
      <button onClick={() => (isListening ? stop() : start())}>
        {isListening ? "Listening…" : isTranscribing ? "Transcribing…" : "Tap to talk"}
      </button>
      {transcript && <p>You said: {transcript}</p>}
      {messages.map((m, i) => (
        <p key={i}><strong>{m.role}:</strong> {m.content}</p>
      ))}
    </div>
  );
}
```

Use useSTT on its own when you only need text from speech; reach for useVoiceChat when you want the whole conversation loop. The full hook set is in the React Hooks reference.
Model
| Repo | Architecture | Language | Type |
|---|---|---|---|
| UsefulSensors/moonshine-base | MoonshineForConditionalGeneration | English | Encoder-decoder ASR |
Preparing Audio
Moonshine expects raw 16 kHz mono Float32 PCM. In the browser, decode and resample any file or mic capture with the Web Audio API:
```ts
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

// Decode an audio File/Blob and resample to 16 kHz mono Float32.
async function toMonoPcm16k(file: Blob): Promise<Float32Array> {
  const arrayBuf = await file.arrayBuffer();

  // Decode to an AudioBuffer (decodeAudioData resamples to the context's sample rate).
  const decodeCtx = new AudioContext();
  const decoded = await decodeCtx.decodeAudioData(arrayBuf);
  await decodeCtx.close();

  // Resample to 16 kHz mono via an OfflineAudioContext.
  const frames = Math.ceil(decoded.duration * 16000);
  const offline = new OfflineAudioContext(1, frames, 16000);
  const src = offline.createBufferSource();
  src.buffer = decoded;
  src.connect(offline.destination);
  src.start();
  const rendered = await offline.startRendering();

  return rendered.getChannelData(0); // Float32Array @ 16 kHz mono
}

// audioFile is any File or Blob holding encoded audio, e.g. from a file input.
const stt = await MoonshineSTT.create();
const pcm = await toMonoPcm16k(audioFile);
const { text } = await stt.transcribe(pcm);
console.log(text);
```

… transcribe() throws. Always resample to 16 kHz mono: passing 44.1 kHz or stereo audio produces garbage.

API Reference
MoonshineSTT.create()
```ts
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

const stt = await MoonshineSTT.create({
  repo: "UsefulSensors/moonshine-base", // optional, this is the default
  revision: "main", // optional HF revision
  hfToken: "hf_...", // optional, for gated repos
  onProgress: (loaded, total, message) => {
    // loaded/total are bytes; message is a human-readable phase
  },
});

stt.architecture; // "MoonshineForConditionalGeneration"
```

transcribe()
```ts
// Transcribe raw 16 kHz mono PCM.
const result = await stt.transcribe(pcm, {
  maxNewTokens: 194, // default: model max_position_embeddings
  minRms: 0.01, // default: silence threshold (RMS energy)
  minSpeechSeconds: 0.35, // default: minimum clip length to treat as speech
});

interface TranscribeOptions {
  maxNewTokens?: number; // default 194 (model max_position_embeddings)
  minRms?: number; // default 0.01; clips quieter than this are no-speech
  minSpeechSeconds?: number; // default 0.35; shorter clips are no-speech
}

interface TranscribeResult {
  text: string; // decoded transcript
  tokens: number[]; // token ids, excludes start token, includes trailing EOS
  encoderFrames: number; // conv-frontend output frames
  audioSeconds: number; // pcm.length / 16000
  speechRms: number; // RMS energy of the input PCM (~0..1)
  noSpeech: boolean; // true when the clip is silence/noise (skip the transcript)
}
```

destroy()
```ts
// Free all GPU buffers and pipelines when done.
stt.destroy();
```

No-Speech / Silence Detection
ASR models hallucinate on silence — feed Moonshine a quiet or near-empty clip and it can confidently emit a phrase that was never spoken, or a non-speech marker like [BLANK_AUDIO]. To make this easy to handle, transcribe() reports a noSpeech: boolean (and the underlying speechRms) on every result. It is computed from the PCM RMS energy plus a minimum-duration guard, then confirmed against the decoded text (empty or a non-speech marker also counts as no-speech).
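The energy and duration guard described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the library's actual implementation: `speechGate`, `SpeechGateResult`, and `SAMPLE_RATE` are hypothetical names, and the final confirmation against the decoded text is left out because it needs the model's output.

```typescript
// Hypothetical helper illustrating an RMS + minimum-duration guard.
// Not the library's code; defaults mirror the documented minRms / minSpeechSeconds.
const SAMPLE_RATE = 16000; // Moonshine input rate in Hz

interface SpeechGateResult {
  speechRms: number;    // root-mean-square energy of the clip
  audioSeconds: number; // clip length in seconds
  noSpeech: boolean;    // true when the clip is too quiet or too short
}

function speechGate(
  pcm: Float32Array,
  minRms = 0.01,
  minSpeechSeconds = 0.35,
): SpeechGateResult {
  // Sum of squared samples, then RMS = sqrt(mean of squares).
  let sumSquares = 0;
  for (let i = 0; i < pcm.length; i++) {
    sumSquares += pcm[i] * pcm[i];
  }
  const speechRms = pcm.length > 0 ? Math.sqrt(sumSquares / pcm.length) : 0;

  const audioSeconds = pcm.length / SAMPLE_RATE;

  // Either condition alone is enough to treat the clip as no-speech.
  const noSpeech = speechRms < minRms || audioSeconds < minSpeechSeconds;
  return { speechRms, audioSeconds, noSpeech };
}
```

A gate like this is cheap because it only touches the samples once on the CPU, which is consistent with skipping the model entirely for clips that fail it.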
A clip with less than minRms energy, or shorter than minSpeechSeconds, returns noSpeech: true immediately, without running the encoder or decoder. Silent or too-short clips cost essentially nothing.

Both thresholds are tunable per call. Raise minRms in a noisy environment, or raise minSpeechSeconds to ignore very short blips:
```ts
const result = await stt.transcribe(pcm, {
  minRms: 0.02, // default 0.01; higher = stricter silence cutoff
  minSpeechSeconds: 0.5, // default 0.35; ignore clips shorter than 0.5 s
});

if (result.noSpeech) {
  // Silent / too short / hallucinated: don't surface result.text.
  console.log("No speech detected", { rms: result.speechRms });
} else {
  console.log("Transcript:", result.text);
}
```

In React: useSTT
useSTT surfaces the same signal. After a recording is transcribed, read noSpeech from the return value to skip empty or hallucinated transcripts, or pass an onNoSpeech callback that fires when a finalized utterance contains no speech:
```tsx
"use client";

import { useSTT } from "@tryhamster/gerbil/hooks";

function VoiceInput() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    isTranscribing,
    transcript,
    noSpeech,
  } = useSTT({
    onNoSpeech: () => {
      // Finalized utterance had no speech, e.g. re-arm the mic or show a hint.
      console.log("Didn't catch that, try again.");
    },
  });

  return (
    <div>
      <button
        onClick={() => (isRecording ? stopRecording() : startRecording())}
        disabled={isTranscribing}
      >
        {isRecording ? "Stop" : "Record"}
      </button>
      {/* Only show a transcript when there was actually speech. */}
      {transcript && !noSpeech && <p>You said: {transcript}</p>}
      {noSpeech && <p>No speech detected.</p>}
    </div>
  );
}
```

Advanced: MoonshineSTT (vanilla JS / Node)
Outside React — or when you already have a Float32Array of 16 kHz mono PCM — drive MoonshineSTT directly. Create it, hand it the samples, and read back the transcript:
```ts
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

// Downloads from HF, compiles WGSL, uploads weights.
const stt = await MoonshineSTT.create({
  repo: "UsefulSensors/moonshine-base", // optional; this is the default
  onProgress: (loaded, total, message) => {
    const pct = total > 0 ? Math.round((loaded / total) * 100) : 0;
    console.log(`${message} (${pct}%)`);
  },
});

// pcm must be 16 kHz MONO Float32 (see "Preparing Audio" above).
const result = await stt.transcribe(pcm);
console.log(result.text); // "Hello world, this is a test."
console.log(result.audioSeconds); // pcm.length / 16000

stt.destroy(); // free GPU resources
```

Recording from the Microphone (manual)
useSTT handles mic capture for you. If you need manual control outside React, capture audio with MediaRecorder, then resample the recorded blob to 16 kHz mono before transcribing:
```ts
// stt is a MoonshineSTT instance from MoonshineSTT.create().
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);
const chunks: Blob[] = [];

recorder.ondataavailable = (e) => chunks.push(e.data);
recorder.onstop = async () => {
  // Release the microphone once recording has finished.
  stream.getTracks().forEach((track) => track.stop());

  const blob = new Blob(chunks, { type: recorder.mimeType });
  const pcm = await toMonoPcm16k(blob); // from "Preparing Audio" above
  const { text } = await stt.transcribe(pcm);
  console.log("Transcript:", text);
};

recorder.start();
// ...later...
recorder.stop();
```

Limitations
- English only, greedy. No multilingual support and no sampling controls.
- No streaming. The encoder runs once over the whole utterance; you get the full transcript back, not partials.
- WebGPU required. The native engine throws if no WebGPU adapter is present; there is no CPU/WASM fallback. Feature-detect WebGPU up front.
- Runs everywhere WebGPU does. Works on desktop browsers and on iPad Safari.
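Feature detection for the WebGPU requirement above might look like the following sketch. `hasWebGpu` and `GpuLike` are hypothetical names introduced here for illustration; the only platform call used is the standard `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter exists. The `gpu` parameter is injectable so the check can be exercised outside a browser.

```typescript
// Minimal structural type so the sketch compiles without WebGPU type packages.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

// Hypothetical helper: resolves to true only when a WebGPU adapter is available.
async function hasWebGpu(
  gpu: GpuLike | undefined = (
    globalThis as unknown as { navigator?: { gpu?: GpuLike } }
  ).navigator?.gpu,
): Promise<boolean> {
  // navigator.gpu is missing entirely on platforms without WebGPU.
  if (!gpu) return false;
  try {
    // requestAdapter() resolves to null when no adapter can be provided.
    const adapter = await gpu.requestAdapter();
    return adapter != null;
  } catch {
    // Treat any failure while probing as "not available".
    return false;
  }
}
```

Calling a check like this before `MoonshineSTT.create()` lets the page show a fallback message instead of surfacing the engine's error.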