Getting Started

Gerbil gets you running a local LLM in just a few minutes. Install the package, load a model, and generate your first response — all on your own machine.

Installation

One-off usage — try without installing:

Terminal
npx @tryhamster/gerbil "Write a haiku about coding"

Global install, to use the gerbil command directly:

Terminal
npm install -g @tryhamster/gerbil

Local install — for programmatic use in your project:

Terminal
npm install @tryhamster/gerbil

Note: CLI examples on this site use the gerbil command and assume a global install. If you have not installed it globally, substitute npx @tryhamster/gerbil.

Quick Start

One-liner API

The simplest way to use Gerbil. No setup required:

index.ts
import gerbil from "@tryhamster/gerbil";
const text = await gerbil("Explain recursion in one sentence");
console.log(text);
// → "Recursion is when a function calls itself to solve smaller instances of the same problem."

Engine API

For more control, create a WebGPUEngine directly. It downloads safetensors from the HuggingFace Hub and caches them in IndexedDB on first run:

standalone.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// dtype "auto" picks int4 on mobile, the repo's native precision on desktop.
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  dtype: "auto",
});

// Generate text
const result = await engine.generate("Write a haiku about coding", {
  maxTokens: 100,
  sampling: { temperature: 0.8 },
});

console.log(result.text);
console.log(`Speed: ${result.tokensPerSecond.toFixed(1)} tok/s`);

// Clean up when done
engine.destroy();

In the browser

In a React app, the one-liner is the useEngine hook from @tryhamster/gerbil/hooks. It lazy-loads the model, streams into completion, and needs no model argument:

Chat.tsx
import { useEngine } from "@tryhamster/gerbil/hooks";

function Chat() {
  const { complete, completion, isGenerating } = useEngine();

  return (
    <button onClick={() => complete("Write a haiku about coding")} disabled={isGenerating}>
      {completion || "Generate"}
    </button>
  );
}

Outside React (vanilla JS / Node), drive the native WebGPUEngine directly — a lean bundle that runs the same code across desktop and mobile. The repo is optional (omit it for a device-aware default):

browser.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
// Zero-config: no repo → a device-aware default. Or pass any repo.
// Downloads + caches the model in IndexedDB on first run.
const engine = await WebGPUEngine.create();
const { text } = await engine.generate("Write a haiku about coding");
console.log(text);

See the Browser documentation for chat, vision, and embeddings.

Streaming

Stream responses token by token:

streaming.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});

for await (const token of engine.stream("Tell me a story")) {
  process.stdout.write(token);
}

engine.destroy();

Thinking Mode

Qwen3 models reason before answering. The chain-of-thought is separated out for you on result.thinking, so you can show or hide it:

thinking.ts
// Assumes `engine` was created with WebGPUEngine.create(), as in the Engine API example.
const result = await engine.generate("What is 127 × 43?");
console.log("Thinking:", result.thinking);
// → "127 × 43 = 127 × 40 + 127 × 3 = 5080 + 381 = 5461"
console.log("Answer:", result.text);
// → "5461"
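
Gerbil does this separation for you, so the following is only a rough sketch of the idea. It assumes the raw model output wraps its reasoning in <think>…</think> tags, as Qwen3 chat templates do. The splitThinking helper is hypothetical and is not part of Gerbil's API, and Gerbil's own parsing may work differently.

```typescript
// Hypothetical helper for illustration; not part of Gerbil's API.
interface SplitOutput {
  thinking: string;
  text: string;
}

// Split raw model output into the reasoning block and the final answer.
function splitThinking(raw: string): SplitOutput {
  const open = "<think>";
  const close = "</think>";
  const start = raw.indexOf(open);
  const end = raw.indexOf(close);

  // No complete reasoning block: treat everything as the answer.
  if (start === -1 || end === -1 || end < start) {
    return { thinking: "", text: raw.trim() };
  }

  const thinking = raw.slice(start + open.length, end).trim();
  const text = (raw.slice(0, start) + raw.slice(end + close.length)).trim();
  return { thinking, text };
}

const demo = splitThinking("<think>127 × 40 + 127 × 3 = 5461</think>\n\n5461");
console.log("Thinking:", demo.thinking);
console.log("Answer:", demo.text);
```

With the two parts in separate fields, a UI can render the answer by default and put the reasoning behind a toggle.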

Structured JSON Output

Get structured data with Zod schema validation:

json.ts
import { json } from "@tryhamster/gerbil";
import { z } from "zod";

const person = await json("Extract: John is 32 and lives in NYC", {
  schema: z.object({
    name: z.string(),
    age: z.number(),
    city: z.string(),
  }),
});

console.log(person);
// → { name: "John", age: 32, city: "NYC" }

Embeddings

Generate embeddings for semantic search and RAG with the native EmbeddingGemma-300M model (768 dimensions, ~173 MB). Load it with embedding: true; embed() returns an L2-normalized (unit-length) vector:

embeddings.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/embeddinggemma-300m-4bit",
  embedding: true,
});

// EmbeddingGemma is asymmetric: tag queries and documents with their task type.
const query = await engine.embed("how do I cache a model?", { taskType: "query" });
const doc = await engine.embed("Models are cached in IndexedDB.", { taskType: "document" });

// Cosine similarity is just a dot product (vectors are L2-normalized).
const score = query.reduce((sum, q, i) => sum + q * doc[i], 0);
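
For semantic search you usually score one query against many documents and keep the best matches. Here is a minimal sketch of that ranking step, assuming every vector is already L2-normalized so the dot product equals cosine similarity. The dot and rankBySimilarity helpers are hypothetical and are not part of Gerbil's API, and the toy 2-D vectors stand in for real 768-dimensional embeddings.

```typescript
// Hypothetical helpers for illustration; not part of Gerbil's API.
type Vector = ArrayLike<number>;

interface Ranked {
  index: number; // position of the document in the input array
  score: number; // dot product with the query (cosine similarity for unit vectors)
}

function dot(a: Vector, b: Vector): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every document against the query and return the topK best, highest first.
function rankBySimilarity(query: Vector, docs: Vector[], topK = 3): Ranked[] {
  return docs
    .map((doc, index) => ({ index, score: dot(query, doc) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Toy unit vectors; in practice these would come from engine.embed().
const queryVec = [1, 0];
const docVecs = [[0, 1], [0.6, 0.8], [1, 0]];
console.log(rankBySimilarity(queryVec, docVecs, 2));
```

A linear scan like this is fine for a few thousand documents. Larger collections usually call for an approximate nearest-neighbor index.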

Vision AI

Vision runs on Qwen3.5-0.8B's own built-in ViT tower (no separate vision model). Build the engine with enableVision: true and call describeImage():

vision.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// The ViT tower (~192MB) ships inside the lean Qwen3.5 4-bit repo.
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  enableVision: true,
});

// Pass a data/http URL or pre-decoded RGB pixels.
// `pixels`, `width`, and `height` are assumed to come from your own image-decoding code.
const { text } = await engine.describeImage(
  { pixels, width, height },
  "What's in this image?"
);

console.log(text);
// → "A golden retriever playing fetch in a sunny park..."

See the Vision documentation for more details.

Text-to-Speech

Synthesize speech on-device with native Kani-TTS-2. Call engine.speak() and it returns raw 22.05 kHz mono PCM:

tts.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "nineninesix/kani-tts-450m-0.2-ft",
});

const { pcm, sampleRate } = await engine.speak(
  "Hello from Gerbil! This runs entirely on-device.",
);

// pcm → Float32Array mono, sampleRate → 22050
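
Raw PCM is not directly playable or downloadable as a file, so a common next step is to wrap it in a WAV container. The sketch below writes a standard 44-byte RIFF/WAVE header followed by 16-bit samples. It assumes the samples are floats in the range [-1, 1], which the documentation above does not state explicitly. The encodeWav helper is hypothetical and is not part of Gerbil's API.

```typescript
// Hypothetical helper for illustration; not part of Gerbil's API.
// Encodes mono Float32 PCM (assumed to be in [-1, 1]) as a 16-bit PCM WAV file.
function encodeWav(pcm: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2; // 16-bit output
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: 1 = PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < pcm.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, pcm[i]));
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return new Uint8Array(buffer);
}

const wav = encodeWav(new Float32Array([0, 0.5, -1]), 22050);
console.log(wav.length); // 44-byte header + 3 samples × 2 bytes
```

In a browser, the resulting bytes can be wrapped in a Blob with type "audio/wav" and played through an audio element. In Node, they can be written straight to disk.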

See the Text-to-Speech documentation for voices and controls.

Speech-to-Text

Transcribe audio on-device with native Moonshine. It takes raw 16 kHz mono PCM (Float32Array), runs an encoder-decoder ASR pass, and returns the full transcript:

stt.ts
import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

const stt = await MoonshineSTT.create(); // UsefulSensors/moonshine-base

// pcm must be 16 kHz mono Float32 (resample mic/file audio first).
const { text, audioSeconds } = await stt.transcribe(pcm);

console.log(text); // "Hello world"
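
Microphones and audio files rarely deliver 16 kHz mono, so the input usually needs converting first. Here is a minimal sketch of that preparation step: average the channels, then resample by linear interpolation. Both helpers are hypothetical and are not part of Gerbil's API. Linear interpolation applies no anti-aliasing filter, so treat this as a starting point rather than a production-quality resampler.

```typescript
// Hypothetical helpers for illustration; not part of Gerbil's API.

// Average equally sized channels into a single mono channel.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const ch of channels) {
    for (let i = 0; i < length; i++) mono[i] += ch[i] / channels.length;
  }
  return mono;
}

// Resample by linear interpolation between neighboring input samples.
function resampleLinear(input: Float32Array, fromRate: number, toRate = 16000): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// One second of 48 kHz stereo silence becomes 16,000 mono samples.
const stereo = [new Float32Array(48000), new Float32Array(48000)];
const prepared = resampleLinear(downmixToMono(stereo), 48000);
console.log(prepared.length);
```

The result has the shape that transcribe() expects: a mono Float32Array at 16 kHz.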

See the Speech-to-Text documentation for details.

Using Any HuggingFace Model

Load any compatible model from HuggingFace:

custom-model.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// HF repo id (the lean 4-bit Qwen3.5 checkpoint)
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});

// Faster/smaller alternative: LFM2.5-350M (~199MB)
const fast = await WebGPUEngine.create({
  repo: "LiquidAI/LFM2.5-350M-MLX-4bit",
});

💡 Pro Tip: Preload Models

Create the engine during app initialization so weights are downloaded, compiled, and cached in IndexedDB before the user's first prompt. Reuse the same instance afterwards:

preload.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// Warm up at startup: downloads and caches the model.
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  onProgress: (loaded, total, message) =>
    console.log(message, total ? loaded / total : 0),
});

// Subsequent prompts reuse the same warm engine and skip the download and compile steps.
await engine.generate("Hello!");
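
One way to make sure the whole app shares a single engine is to memoize the creation promise. The sketch below assumes nothing about Gerbil beyond the fact that engine creation is asynchronous. The createOnce helper is hypothetical and is not part of Gerbil's API, and the factory here is a stand-in for a call such as WebGPUEngine.create().

```typescript
// Hypothetical helper for illustration; not part of Gerbil's API.
// Wraps an async factory so it runs at most once; every caller shares the same promise.
function createOnce<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = factory().catch((err) => {
        pending = undefined; // allow a retry after a failed load
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in for `() => WebGPUEngine.create({ repo: "..." })`.
let loads = 0;
const getEngine = createOnce(async () => {
  loads += 1;
  return { id: loads };
});

void getEngine(); // kick off loading during app initialization
getEngine().then((engine) => console.log(engine.id, loads)); // later callers reuse the same load
```

Calling getEngine() at startup begins the load in the background. Any prompt handler that calls it later waits on the same promise instead of creating a second engine.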

See the Browser documentation for caching details.

Next Steps