Getting Started
Gerbil gets you running a local LLM in just a few minutes. Install the package, load a model, and generate your first response — all on your own machine.
Installation
One-off usage (try without installing):

    npx @tryhamster/gerbil "Write a haiku about coding"

Global install (use the gerbil command directly):

    npm install -g @tryhamster/gerbil

Local install (for programmatic use in your project):

    npm install @tryhamster/gerbil

CLI examples use gerbil and assume a global install. Substitute npx @tryhamster/gerbil if using without installing.

Quick Start
One-liner API
The simplest way to use Gerbil. No setup required:
    import gerbil from "@tryhamster/gerbil";

    const text = await gerbil("Explain recursion in one sentence");
    console.log(text);
    // → "Recursion is when a function calls itself to solve smaller instances of the same problem."

Engine API
For more control, create a WebGPUEngine directly. It downloads safetensors from the HuggingFace Hub and caches them in IndexedDB on first run:
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    // dtype "auto" picks int4 on mobile, the repo's native precision on desktop.
    const engine = await WebGPUEngine.create({
      repo: "mlx-community/Qwen3.5-0.8B-4bit",
      dtype: "auto",
    });

    // Generate text
    const result = await engine.generate("Write a haiku about coding", {
      maxTokens: 100,
      sampling: { temperature: 0.8 },
    });

    console.log(result.text);
    console.log(`Speed: ${result.tokensPerSecond.toFixed(1)} tok/s`);

    // Clean up when done
    engine.destroy();

In the browser
In a React app, the one-liner is the useEngine hook from @tryhamster/gerbil/hooks. It lazy-loads the model, streams generated text into completion, and needs no model argument:
    import { useEngine } from "@tryhamster/gerbil/hooks";

    function Chat() {
      const { complete, completion, isGenerating } = useEngine();
      return (
        <button onClick={() => complete("Write a haiku about coding")} disabled={isGenerating}>
          {completion || "Generate"}
        </button>
      );
    }

Outside React (vanilla JS / Node), drive the native WebGPUEngine directly. It is a lean bundle that runs the same code across desktop and mobile. The repo is optional (omit it for a device-aware default):
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    // Zero-config: no repo → a device-aware default. Or pass any repo.
    // Downloads + caches the model in IndexedDB on first run.
    const engine = await WebGPUEngine.create();

    const { text } = await engine.generate("Write a haiku about coding");
    console.log(text);

See the Browser documentation for chat, vision, and embeddings.
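Because the engine runs on WebGPU, it can help to check for support before creating it. The following is a minimal sketch, assuming the standard navigator.gpu entry point. hasWebGPU is a hypothetical helper, not part of Gerbil, and whether Gerbil performs a similar check internally is not stated here.

```typescript
// Hypothetical helper (not part of Gerbil): reports whether the runtime
// exposes the standard WebGPU entry point, navigator.gpu.
// The navigator object is passed in (defaulting to the global one) so the
// check also works where no global navigator exists.
function hasWebGPU(
  nav: unknown = (globalThis as { navigator?: unknown }).navigator,
): boolean {
  if (typeof nav !== "object" || nav === null) return false;
  const gpu = (nav as { gpu?: unknown }).gpu;
  return typeof gpu === "object" && gpu !== null;
}
```

A present navigator.gpu does not guarantee a usable device, since requestAdapter() can still resolve to null, so treat this as a cheap first filter before showing a fallback message.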
Streaming
Stream responses token by token:
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    const engine = await WebGPUEngine.create({
      repo: "mlx-community/Qwen3.5-0.8B-4bit",
    });

    for await (const token of engine.stream("Tell me a story")) {
      process.stdout.write(token);
    }

    engine.destroy();

Thinking Mode
Qwen3 models reason before answering. The chain-of-thought is separated out for you on result.thinking, so you can show or hide it:
    const result = await engine.generate("What is 127 × 43?");

    console.log("Thinking:", result.thinking);
    // → "127 × 43 = 127 × 40 + 127 × 3 = 5080 + 381 = 5461"

    console.log("Answer:", result.text);
    // → "5461"

Structured JSON Output
Get structured data with Zod schema validation:
    import { json } from "@tryhamster/gerbil";
    import { z } from "zod";

    const person = await json("Extract: John is 32 and lives in NYC", {
      schema: z.object({
        name: z.string(),
        age: z.number(),
        city: z.string(),
      }),
    });

    console.log(person);
    // → { name: "John", age: 32, city: "NYC" }

Embeddings
Generate embeddings for semantic search and RAG with the native EmbeddingGemma-300M model (dim 768, ~173 MB). Load it with embedding: true; embed() returns a unit-L2-normalized vector:
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    const engine = await WebGPUEngine.create({
      repo: "mlx-community/embeddinggemma-300m-4bit",
      embedding: true,
    });

    // EmbeddingGemma is asymmetric: tag queries vs. documents.
    const query = await engine.embed("how do I cache a model?", { taskType: "query" });
    const doc = await engine.embed("Models are cached in IndexedDB.", { taskType: "document" });

    // Cosine similarity is just a dot product (vectors are L2-normalized).
    const score = query.reduce((sum, q, i) => sum + q * doc[i], 0);

Vision AI
Vision runs on Qwen3.5-0.8B's own built-in ViT tower (no separate vision model). Build the engine with enableVision: true and call describeImage():
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    // The ViT tower (~192MB) ships inside the lean Qwen3.5 4-bit repo.
    const engine = await WebGPUEngine.create({
      repo: "mlx-community/Qwen3.5-0.8B-4bit",
      enableVision: true,
    });

    // Pass a data/http URL or pre-decoded RGB pixels.
    const { text } = await engine.describeImage(
      { pixels, width, height },
      "What's in this image?"
    );

    console.log(text);
    // → "A golden retriever playing fetch in a sunny park..."

See the Vision documentation for more details.
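describeImage() accepts pre-decoded RGB pixels, whereas canvas ImageData is RGBA. The following is a minimal sketch of dropping the alpha channel. It assumes the engine expects tightly packed 8-bit RGB in row-major order, which is not confirmed here, so check the Vision documentation for the exact layout. rgbaToRgb is a hypothetical helper, not part of Gerbil.

```typescript
// Hypothetical helper (not part of Gerbil): converts RGBA bytes, such as
// canvas ImageData.data, into tightly packed RGB by dropping each alpha byte.
// Assumption: the engine wants 8-bit, row-major pixels, 3 bytes per pixel.
function rgbaToRgb(
  rgba: Uint8ClampedArray | Uint8Array,
  width: number,
  height: number,
): Uint8Array {
  const pixelCount = width * height;
  if (rgba.length !== pixelCount * 4) {
    throw new RangeError(
      `expected ${pixelCount * 4} RGBA bytes, got ${rgba.length}`,
    );
  }
  const rgb = new Uint8Array(pixelCount * 3);
  for (let i = 0; i < pixelCount; i++) {
    // Copy the R, G, B bytes of pixel i and skip its alpha byte.
    rgb.set(rgba.subarray(i * 4, i * 4 + 3), i * 3);
  }
  return rgb;
}
```

Under that assumption, the result could be passed as the pixels field alongside the same width and height.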
Text-to-Speech
Synthesize speech on-device with native Kani-TTS-2. Call engine.speak() and it returns raw 22.05 kHz mono PCM:
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    const engine = await WebGPUEngine.create({
      repo: "nineninesix/kani-tts-450m-0.2-ft",
    });

    const { pcm, sampleRate } = await engine.speak(
      "Hello from Gerbil! This runs entirely on-device.",
    );

    // pcm → Float32Array mono, sampleRate → 22050

See the Text-to-Speech documentation for voices and controls.
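Raw PCM is not directly playable in most tools, so wrapping it in a WAV container is a common next step. The following is a minimal sketch that writes a 44-byte header followed by 16-bit samples. It assumes pcm is mono with values in [-1, 1] (the range is an assumption, and out-of-range values are clamped). pcmToWav is a hypothetical helper, not part of Gerbil.

```typescript
// Hypothetical helper (not part of Gerbil): wraps mono Float32 PCM in a
// 16-bit PCM WAV container. Assumption: samples lie in [-1, 1].
function pcmToWav(pcm: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) {
      view.setUint8(offset + i, tag.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: 1 = integer PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  pcm.forEach((sample, i) => {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, sample));
    const scaled = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(44 + i * bytesPerSample, scaled, true);
  });
  return new Uint8Array(buffer);
}
```

In a browser, the returned bytes can be wrapped in a Blob with type "audio/wav" and played through an audio element or offered as a download.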
Speech-to-Text
Transcribe audio on-device with native Moonshine. It takes raw 16 kHz mono PCM (Float32Array), runs an encoder-decoder ASR pass, and returns the full transcript:
    import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

    const stt = await MoonshineSTT.create(); // UsefulSensors/moonshine-base

    // pcm must be 16 kHz mono Float32 (resample mic/file audio first).
    const { text, audioSeconds } = await stt.transcribe(pcm);

    console.log(text); // "Hello world"

See the Speech-to-Text documentation for details.
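Since transcribe() expects 16 kHz mono input, audio captured at 44.1 or 48 kHz has to be resampled first. The following is a minimal sketch using linear interpolation. resampleLinear is a hypothetical helper, not part of Gerbil, and it applies no low-pass filter, so downsampling can introduce aliasing. A dedicated resampler is likely a better choice where quality matters.

```typescript
// Hypothetical helper (not part of Gerbil): resamples mono Float32 PCM by
// linear interpolation. No anti-aliasing filter is applied.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = 16000,
): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  if (fromRate === toRate || input.length === 0) return input.slice();

  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const output = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  const last = input.length - 1;

  for (let i = 0; i < outLength; i++) {
    // Fractional read position in the input, clamped to the final sample.
    const pos = Math.min(i * step, last);
    const left = Math.floor(pos);
    const right = Math.min(left + 1, last);
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```

The result can then be passed to transcribe() in place of the original buffer.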
Using Any HuggingFace Model
Load any compatible model from HuggingFace:
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    // HF repo id (the lean 4-bit Qwen3.5 checkpoint)
    const engine = await WebGPUEngine.create({
      repo: "mlx-community/Qwen3.5-0.8B-4bit",
    });

    // Faster/smaller alternative: LFM2.5-350M (~199MB)
    const fast = await WebGPUEngine.create({
      repo: "LiquidAI/LFM2.5-350M-MLX-4bit",
    });

💡 Pro Tip: Preload Models
Create the engine during app initialization so weights are downloaded, compiled, and cached in IndexedDB before the user's first prompt. Reuse the same instance afterwards:
    import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

    // Warm up at startup: downloads + caches the model.
    const engine = await WebGPUEngine.create({
      repo: "mlx-community/Qwen3.5-0.8B-4bit",
      onProgress: (loaded, total, message) =>
        console.log(message, total ? loaded / total : 0),
    });

    // Subsequent prompts reuse the same warm engine, with no reload.
    await engine.generate("Hello!");

See the Browser documentation for caching details.
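One way to keep a single warm instance across an app is to memoize the creation promise, so that concurrent callers share one load instead of starting several. The following is a sketch of that pattern. once is a hypothetical helper, not part of Gerbil, and the factory passed to it stands in for a call such as WebGPUEngine.create().

```typescript
// Hypothetical helper (not part of Gerbil): memoizes an async factory so
// every caller shares the same in-flight or completed promise.
function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = factory().catch((err: unknown) => {
        cached = undefined; // a failed load can be retried on the next call
        throw err;
      });
    }
    return cached;
  };
}

// Usage sketch (assumes the WebGPUEngine import shown earlier):
//   const getEngine = once(() =>
//     WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" }));
//   const engine = await getEngine(); // the same instance on every call
```

Calling the memoized getter during app initialization starts the download early, and later callers simply await the same promise.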
Next Steps
- Explore available models
- Use Vision AI: describe images, extract text, analyze screenshots
- Text-to-Speech: native Kani-TTS-2 speech synthesis
- Speech-to-Text: transcribe audio with native Moonshine
- Use built-in skills like commit, summarize, review, or create your own
- Integrate with your framework: AI SDK, Next.js, Express
- Enable response caching for instant repeated prompts
- Set up MCP server for Claude Desktop
- Try the playground to experiment in your browser