Speech-to-Text

Native, on-device speech-to-text with Moonshine — a raw-waveform encoder/decoder ASR model running entirely on the WebGPU engine (WGSL). Private by default and free to run.

Native engine: Transcription uses MoonshineSTT from @tryhamster/gerbil/gpu. Moonshine is English-only, decodes greedily, and consumes raw 16 kHz mono PCM. WebGPU is required; the native engine has no CPU fallback.

Quick Start (React): useSTT

In React, use useSTT from @tryhamster/gerbil/hooks. It captures the mic between startRecording() and stopRecording(), resamples to 16 kHz mono for you, runs Moonshine, and surfaces the transcript. Defaults to Moonshine — no model argument needed:

VoiceInput.tsx
"use client";

import { useSTT } from "@tryhamster/gerbil/hooks";

function VoiceInput() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    isTranscribing,
    transcript,
    audioSeconds,
    error,
  } = useSTT();

  return (
    <div>
      <button
        onClick={() => (isRecording ? stopRecording() : startRecording())}
        disabled={isTranscribing}
      >
        {isRecording ? "Stop" : "Record"}
      </button>
      {isTranscribing && <span>Transcribing…</span>}
      {transcript && <p>You said: {transcript}</p>}
      {error && <p className="text-red-500">{error}</p>}
    </div>
  );
}

Hands-free conversation: useVoiceChat

Transcription is the front of a voice loop. To feed what the user said straight into a chat model — and speak the reply back — use useVoiceChat. It composes useSTT, useChat, and useTTS into one mic → LLM → spoken-reply assistant, fully on-device:

VoiceAssistant.tsx
"use client";

import { useVoiceChat } from "@tryhamster/gerbil/hooks";

function VoiceAssistant() {
  const { messages, start, stop, isListening, isTranscribing, transcript } =
    useVoiceChat({ system: "You are a friendly voice assistant." });

  return (
    <div>
      <button onClick={() => (isListening ? stop() : start())}>
        {isListening
          ? "Listening…"
          : isTranscribing
            ? "Transcribing…"
            : "Tap to talk"}
      </button>
      {transcript && <p>You said: {transcript}</p>}
      {messages.map((m, i) => (
        <p key={i}>
          <strong>{m.role}:</strong> {m.content}
        </p>
      ))}
    </div>
  );
}

Use useSTT on its own when you only need text from speech; reach for useVoiceChat when you want the whole conversation loop. The full hook set is in the React Hooks reference.

Model

Repo | Architecture | Language | Type
UsefulSensors/moonshine-base | MoonshineForConditionalGeneration | English | Encoder-decoder ASR

How it works: Moonshine runs a Conv1d frontend directly on the raw waveform, encodes the whole utterance once with a bidirectional encoder, freezes the per-layer cross-attention K/V so they are computed only once, then decodes greedily, one token at a time, until EOS. It returns the full transcript per utterance; there is no streaming/partial API.
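
The greedy decode step described above can be illustrated with a toy loop. This is only a sketch of the general technique, not Moonshine's actual implementation: `StepFn`, `argmax`, and `greedyDecode` are hypothetical names, and the step function stands in for one decoder forward pass that would return logits over the vocabulary.

```typescript
// Hypothetical stand-in for one decoder forward pass: given the tokens so far,
// return logits over the vocabulary. Not part of the MoonshineSTT API.
type StepFn = (tokensSoFar: readonly number[]) => Float32Array;

// Index of the largest logit (the "greedy" choice).
function argmax(logits: Float32Array): number {
  let best = 0;
  let bestValue = -Infinity;
  logits.forEach((value, index) => {
    if (value > bestValue) {
      bestValue = value;
      best = index;
    }
  });
  return best;
}

// Repeatedly pick the most likely next token until EOS or the token budget.
function greedyDecode(
  step: StepFn,
  startToken: number,
  eosToken: number,
  maxNewTokens: number,
): number[] {
  const context = [startToken];
  const generated: number[] = [];
  while (generated.length < maxNewTokens) {
    const next = argmax(step(context));
    context.push(next);
    generated.push(next);
    if (next === eosToken) break; // stop once end-of-sequence is emitted
  }
  // Excludes the start token; includes the trailing EOS if one was reached.
  return generated;
}
```

Because each step always takes the single highest-scoring token, the same input produces the same output, which is consistent with the page's statement that there are no sampling controls.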

Preparing Audio

Moonshine expects raw 16 kHz mono Float32 PCM. In the browser, decode and resample any file or mic capture with the Web Audio API:

prepare-audio.ts
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

// Decode an audio File/Blob and resample to 16 kHz mono Float32.
async function toMonoPcm16k(file: Blob): Promise<Float32Array> {
  const arrayBuf = await file.arrayBuffer();

  // Decode at the source sample rate.
  const decodeCtx = new AudioContext();
  const decoded = await decodeCtx.decodeAudioData(arrayBuf);
  await decodeCtx.close();

  // Resample to 16 kHz mono via an OfflineAudioContext.
  const frames = Math.ceil(decoded.duration * 16000);
  const offline = new OfflineAudioContext(1, frames, 16000);
  const src = offline.createBufferSource();
  src.buffer = decoded;
  src.connect(offline.destination);
  src.start();
  const rendered = await offline.startRendering();

  return rendered.getChannelData(0); // Float32Array @ 16 kHz mono
}

// `audioFile` is any File/Blob you already have, e.g. from an <input type="file">.
const stt = await MoonshineSTT.create();
const pcm = await toMonoPcm16k(audioFile);
const { text } = await stt.transcribe(pcm);
console.log(text);

Note: PCM must be at least 127 samples (the conv1 kernel size) or transcribe() throws. Always resample to 16 kHz mono; passing 44.1 kHz or stereo audio produces garbage.
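
Where the Web Audio API is not available, the same preparation has to be done by hand. The following is a minimal sketch under stated assumptions: `downmixToMono`, `resampleLinear`, and `toPcm16k` are hypothetical helpers, not part of the library, and plain linear interpolation applies no anti-aliasing filter, so a dedicated resampler will likely give better quality when downsampling.

```typescript
const TARGET_RATE = 16000;
const MIN_SAMPLES = 127; // minimum length quoted in the note above

// Average all channels into one (e.g. stereo -> mono).
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) throw new Error("no audio channels");
  const length = Math.min(...channels.map((c) => c.length));
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Naive linear-interpolation resampler (no low-pass filter).
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = TARGET_RATE,
): Float32Array {
  if (fromRate === toRate || input.length === 0) return input;
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Downmix, resample, and reject clips that are too short to transcribe.
function toPcm16k(channels: Float32Array[], sampleRate: number): Float32Array {
  const pcm = resampleLinear(downmixToMono(channels), sampleRate);
  if (pcm.length < MIN_SAMPLES) {
    throw new Error(`need at least ${MIN_SAMPLES} samples, got ${pcm.length}`);
  }
  return pcm;
}
```

Checking the length before calling transcribe() turns the documented throw into an error you control.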

API Reference

MoonshineSTT.create()

create.ts
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

const stt = await MoonshineSTT.create({
  repo: "UsefulSensors/moonshine-base", // optional, this is the default
  revision: "main", // optional HF revision
  hfToken: "hf_...", // optional, for gated repos
  onProgress: (loaded, total, message) => {
    // loaded/total are bytes; message is a human-readable phase
  },
});

stt.architecture; // "MoonshineForConditionalGeneration"

transcribe()

transcribe.ts
// Transcribe raw 16 kHz mono PCM.
const result = await stt.transcribe(pcm, {
  maxNewTokens: 194, // default: model max_position_embeddings
  minRms: 0.01, // default: silence threshold (RMS energy)
  minSpeechSeconds: 0.35, // default: minimum clip length to treat as speech
});

interface TranscribeOptions {
  maxNewTokens?: number; // default 194 (model max_position_embeddings)
  minRms?: number; // default 0.01; clips quieter than this are no-speech
  minSpeechSeconds?: number; // default 0.35; shorter clips are no-speech
}

interface TranscribeResult {
  text: string; // decoded transcript
  tokens: number[]; // token ids, excludes start token, includes trailing EOS
  encoderFrames: number; // conv-frontend output frames
  audioSeconds: number; // pcm.length / 16000
  speechRms: number; // RMS energy of the input PCM (~0..1)
  noSpeech: boolean; // true when the clip is silence/noise (skip the transcript)
}

destroy()

destroy.ts
// Free all GPU buffers and pipelines when done.
stt.destroy();
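
To make sure destroy() runs even when transcription throws, one option is to wrap the instance's lifetime in try/finally. The helper below is a hedged sketch: `withResource` and `Destroyable` are hypothetical names, and the helper works with any object that exposes a destroy() method rather than depending on the library itself.

```typescript
// Anything with a destroy() method, such as the stt instance above.
interface Destroyable {
  destroy(): void;
}

// Create a resource, use it, and always destroy it afterwards.
async function withResource<T extends Destroyable, R>(
  create: () => Promise<T>,
  use: (resource: T) => Promise<R>,
): Promise<R> {
  const resource = await create();
  try {
    return await use(resource);
  } finally {
    resource.destroy(); // runs on success and on error
  }
}

// Possible usage (assumes MoonshineSTT and pcm from the sections above):
// const text = await withResource(
//   () => MoonshineSTT.create(),
//   async (stt) => (await stt.transcribe(pcm)).text,
// );
```

This suits one-off transcriptions. For repeated use, keeping a single instance alive and destroying it once at shutdown avoids repeating the setup work that create() performs.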

No-Speech / Silence Detection

ASR models hallucinate on silence — feed Moonshine a quiet or near-empty clip and it can confidently emit a phrase that was never spoken, or a non-speech marker like [BLANK_AUDIO]. To make this easy to handle, transcribe() reports a noSpeech: boolean (and the underlying speechRms) on every result. It is computed from the PCM RMS energy plus a minimum-duration guard, then confirmed against the decoded text (empty or a non-speech marker also counts as no-speech).

Short-circuits before the GPU work: a clip below minRms energy, or shorter than minSpeechSeconds, returns noSpeech: true immediately — without running the encoder or decoder. Silent or too-short clips cost nothing.
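
The energy and duration pre-check can be approximated in a few lines, which is handy if you want to skip uploading or queueing a clip at all. This is a sketch of the idea as described above, not the library's code: `rms`, `looksLikeNoSpeech`, and `isNonSpeechText` are hypothetical names, the strict `<` comparisons are an assumption, and `[BLANK_AUDIO]` is the only marker this page names, so the real marker list may differ.

```typescript
const SAMPLE_RATE = 16000; // Moonshine's expected input rate

// Root-mean-square energy of a PCM clip (0 for an empty clip).
function rms(pcm: Float32Array): number {
  if (pcm.length === 0) return 0;
  const sumSquares = pcm.reduce((sum, sample) => sum + sample * sample, 0);
  return Math.sqrt(sumSquares / pcm.length);
}

// Pre-check: too quiet or too short counts as no-speech.
function looksLikeNoSpeech(
  pcm: Float32Array,
  options: { minRms?: number; minSpeechSeconds?: number } = {},
): boolean {
  const { minRms = 0.01, minSpeechSeconds = 0.35 } = options;
  const seconds = pcm.length / SAMPLE_RATE;
  return seconds < minSpeechSeconds || rms(pcm) < minRms;
}

// Post-check on decoded text: empty or a known non-speech marker.
function isNonSpeechText(text: string): boolean {
  const trimmed = text.trim();
  return trimmed === "" || trimmed === "[BLANK_AUDIO]";
}
```

In practice result.noSpeech already covers both checks; a local pre-check like this is only useful when you want to decide before calling transcribe().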

Both thresholds are tunable per call. Raise minRms in a noisy environment, or minSpeechSeconds to ignore very short blips:

no-speech.ts
const result = await stt.transcribe(pcm, {
  minRms: 0.02, // default 0.01; higher = stricter silence cutoff
  minSpeechSeconds: 0.5, // default 0.35; ignore clips shorter than 0.5s
});

if (result.noSpeech) {
  // Silent / too short / hallucinated: don't surface result.text.
  console.log("No speech detected", { rms: result.speechRms });
} else {
  console.log("Transcript:", result.text);
}

In React: useSTT

useSTT surfaces the same signal. After a recording is transcribed, read noSpeech from the return value to skip empty or hallucinated transcripts, or pass an onNoSpeech callback that fires when a finalized utterance contains no speech:

VoiceInput.tsx
"use client";

import { useSTT } from "@tryhamster/gerbil/hooks";

function VoiceInput() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    isTranscribing,
    transcript,
    noSpeech,
  } = useSTT({
    onNoSpeech: () => {
      // Finalized utterance had no speech, e.g. re-arm the mic or show a hint.
      console.log("Didn't catch that, try again.");
    },
  });

  return (
    <div>
      <button
        onClick={() => (isRecording ? stopRecording() : startRecording())}
        disabled={isTranscribing}
      >
        {isRecording ? "Stop" : "Record"}
      </button>
      {/* Only show a transcript when there was actually speech. */}
      {transcript && !noSpeech && <p>You said: {transcript}</p>}
      {noSpeech && <p>No speech detected.</p>}
    </div>
  );
}

Advanced: MoonshineSTT (vanilla JS / Node)

Outside React — or when you already have a Float32Array of 16 kHz mono PCM — drive MoonshineSTT directly. Create it, hand it the samples, and read back the transcript:

moonshine.ts
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

// Downloads from HF, compiles WGSL, uploads weights.
const stt = await MoonshineSTT.create({
  repo: "UsefulSensors/moonshine-base", // optional, this is the default
  onProgress: (loaded, total, message) => {
    const pct = total > 0 ? Math.round((loaded / total) * 100) : 0;
    console.log(`${message} (${pct}%)`);
  },
});

// pcm must be 16 kHz MONO Float32 (see "Preparing Audio" above).
const result = await stt.transcribe(pcm);
console.log(result.text); // "Hello world, this is a test."
console.log(result.audioSeconds); // pcm.length / 16000

stt.destroy(); // free GPU resources

Recording from the Microphone (manual)

useSTT handles mic capture for you. If you need manual control outside React, capture audio with MediaRecorder, then resample the recorded blob to 16 kHz mono before transcribing:

record-mic.ts
// `stt` is a MoonshineSTT instance created earlier with MoonshineSTT.create().
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);
const chunks: Blob[] = [];

recorder.ondataavailable = (e) => chunks.push(e.data);
recorder.onstop = async () => {
  // Release the microphone so the browser's recording indicator turns off.
  stream.getTracks().forEach((track) => track.stop());

  const blob = new Blob(chunks, { type: recorder.mimeType });
  const pcm = await toMonoPcm16k(blob); // from "Preparing Audio" above
  const { text } = await stt.transcribe(pcm);
  console.log("Transcript:", text);
};

recorder.start();
// ...later...
recorder.stop();

Limitations

  • English only, greedy. No multilingual support and no sampling controls.
  • No streaming. The encoder runs once over the whole utterance; you get the full transcript back, not partials.
  • WebGPU required. The native engine throws if no WebGPU adapter is present — there is no CPU/WASM fallback. Feature-detect WebGPU up front.
  • Runs everywhere WebGPU does. Works on desktop browsers and on iPad Safari.
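
Feature detection for WebGPU can be sketched as follows. It relies on the standard `navigator.gpu.requestAdapter()` call, which resolves to null when no suitable adapter exists. `detectWebGpu` and the two structural interfaces are hypothetical names, introduced so the sketch compiles without WebGPU type definitions and can be exercised outside a browser.

```typescript
// Minimal structural types so this compiles without @webgpu/types.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Resolves to true only if the API exists and an adapter is actually available.
async function detectWebGpu(nav: NavigatorLike | undefined): Promise<boolean> {
  const gpu = nav?.gpu;
  if (!gpu) return false; // WebGPU API not exposed at all
  try {
    const adapter = await gpu.requestAdapter();
    return adapter != null; // null means no suitable adapter
  } catch {
    return false; // treat any failure as "not available"
  }
}

// Browser usage (hypothetical UI handling):
// if (!(await detectWebGpu(navigator as NavigatorLike))) {
//   // hide the record button or show an "unsupported browser" message
// }
```

Running the check before calling MoonshineSTT.create() lets the UI degrade gracefully instead of surfacing the engine's error.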

Next Steps

  • Models: the full native model lineup
  • Vision: image understanding with describeImage()
  • Browser: running the native WebGPU engine on-device