How It Works
Gerbil turns a model's weights into GPU compute that runs on your device. This page walks through how the engine loads models, runs inference, and squeezes performance out of WebGPU.
One engine: Gerbil is a from-scratch WebGPU engine — every kernel is a hand-written WGSL compute shader. Every modality (text, vision, embeddings, speech-to-text, text-to-speech) runs on the same WGSL stack. It is WebGPU-only: it throws a clear error when no WebGPU adapter is present rather than silently falling back to CPU. The same shaders run in the browser and in Node.js via Dawn.
System Overview
Key Design Decisions
1. A From-Scratch Engine
Gerbil implements the full inference stack itself — tokenizers, the model graph, and every kernel as a WGSL compute shader. There are no third-party runtimes underneath, so the same code path drives every modality on every platform.
2. WebGPU Only
Every kernel is a WGSL compute shader. In browsers it uses the platform's WebGPU directly. In Node.js it runs the exact same shaders via Dawn (the webgpu npm package). If no WebGPU adapter is present it throws — there is no CPU fallback.
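The fail-fast behaviour described above can be sketched with the standard WebGPU entry points. This is a minimal illustration, not Gerbil's actual startup code: the function name `requireWebGPUDevice` and the error messages are assumptions, while `requestAdapter()` and `requestDevice()` are the standard WebGPU calls.

```javascript
// Sketch: fail fast when WebGPU is unavailable instead of falling back to CPU.
// `gpu` is the WebGPU entry point (`navigator.gpu` in a browser). The function
// name and error messages are hypothetical, not part of Gerbil's API.
async function requireWebGPUDevice(gpu) {
  if (!gpu) {
    throw new Error("WebGPU is not available in this environment");
  }
  // requestAdapter() resolves to null when no suitable adapter exists.
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No WebGPU adapter found; there is no CPU fallback");
  }
  // requestDevice() resolves to the GPUDevice used for buffers and pipelines.
  return adapter.requestDevice();
}
```

In a browser this would be called as `requireWebGPUDevice(navigator.gpu)` from an async function. Throwing early keeps the failure visible, rather than letting a silent fallback produce unexpectedly slow inference.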
3. Quantization for Speed
Weights are quantized to 4-bit (q4) with fp16 compute. This cuts model size by roughly 4x relative to fp16 weights and keeps the working set small enough to run comfortably on consumer GPUs and mobile devices.
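To make the idea of 4-bit weights concrete, here is a minimal sketch of symmetric block quantization, where each block of weights shares one scale and two 4-bit values are packed into each byte. The scheme, the block size of 32, and the function names are assumptions chosen for illustration; the source does not describe Gerbil's actual weight layout.

```javascript
// Sketch of symmetric 4-bit block quantization (an ASSUMED scheme, not
// necessarily the layout Gerbil uses). Each block shares one scale.
const BLOCK_SIZE = 32; // hypothetical block size

function quantizeQ4(weights) {
  if (weights.length % BLOCK_SIZE !== 0) {
    throw new Error("weights.length must be a multiple of BLOCK_SIZE");
  }
  const numBlocks = weights.length / BLOCK_SIZE;
  const scales = new Float32Array(numBlocks);
  const packed = new Uint8Array(weights.length / 2); // two 4-bit values per byte
  for (let b = 0; b < numBlocks; b++) {
    const start = b * BLOCK_SIZE;
    let maxAbs = 0;
    for (let i = 0; i < BLOCK_SIZE; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[start + i]));
    }
    // Map [-maxAbs, maxAbs] onto integers [-7, 7]; use 1 for an all-zero block.
    const scale = maxAbs / 7 || 1;
    scales[b] = scale;
    for (let i = 0; i < BLOCK_SIZE; i++) {
      const q = Math.max(-7, Math.min(7, Math.round(weights[start + i] / scale)));
      const nibble = q + 8; // shift to an unsigned value in 1..15
      const pos = start + i;
      packed[pos >> 1] |= pos % 2 === 0 ? nibble : nibble << 4;
    }
  }
  return { packed, scales };
}

function dequantizeQ4({ packed, scales }) {
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < out.length; i++) {
    const byte = packed[i >> 1];
    const nibble = i % 2 === 0 ? byte & 0x0f : byte >> 4;
    out[i] = (nibble - 8) * scales[Math.floor(i / BLOCK_SIZE)];
  }
  return out;
}
```

In this sketch, 64 weights occupy 32 packed bytes plus two 4-byte scales, compared with 256 bytes as fp32. The per-block scale is the overhead that keeps real 4-bit formats slightly above 4 bits per weight.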
4. Token Streaming
Generation streams tokens as they are produced via an onToken callback, so the UI updates in real time. You can also run the engine in a Web Worker to keep the main thread free.
Inference Pipeline
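The inference stack is a single native path from text to GPU: text is tokenized, passed through the model graph, and executed as WGSL compute shaders on the GPU.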
Where It Runs
| Environment | WebGPU Source | Speed | Notes |
|---|---|---|---|
| Browser | Platform WebGPU | ~225 desktop · ~50 iPad · ~38 iPhone tok/s (Qwen3.5-0.8B) | Chrome/Edge 113+, Safari 26+ (incl. iPad/iPhone), Firefox 141+. Zero install, on-device. |
| Node.js | Dawn (webgpu npm) | ~230 tok/s server (Qwen3.5-0.8B) | Same WGSL shaders, any machine with a real GPU |
Throughput: browser and server
The same engine and the same code run in two places. In the browser, anyone can use your app the moment the page loads — nothing to install, and every token is generated on the user's own device, so their data stays there. In Node.js, the same shaders run through Dawn on a machine with a real GPU, which is where you get the highest throughput. You write your inference once; you choose where it runs.
On a desktop GPU the in-browser and server figures land close together (~225 vs ~230 tok/s) — the same WGSL shaders, just a different host. Throughput then scales down with the device: ~225 tok/s desktop · ~50 tok/s iPad · ~38 tok/s iPhone for Qwen3.5-0.8B. That mobile range is the price of zero-install, on-device privacy — a tradeoff you get to make per use case, not a limitation to design around.
| Setting | Throughput | What you get |
|---|---|---|
| In-browser | ≈ 225 tok/s desktop | Zero install, runs on the user's device, data never leaves it (~50 iPad · ~38 iPhone) |
| Server (Node) | ≈ 230 tok/s | Maximum throughput on a machine with a dedicated GPU |
Representative figures with Qwen3.5-0.8B: in-browser ≈ 225 tok/s desktop, ≈ 50 tok/s iPad, ≈ 38 tok/s iPhone; server ≈ 230 tok/s. Your numbers will scale with the device and the model — the live playground reports the real measured rate on whatever hardware you open it on.
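If you want to measure the rate on your own hardware, a throughput figure is simply tokens generated divided by elapsed time. The sketch below is one way to compute it. The `createThroughputMeter` helper is hypothetical and not part of Gerbil, and the clock is injectable so the logic can be tested without a GPU.

```javascript
// Sketch: measure decode throughput in tokens per second.
// `now` returns milliseconds; it defaults to performance.now().
// The helper name and shape are hypothetical, not part of Gerbil's API.
function createThroughputMeter(now = () => performance.now()) {
  let startMs = null;
  let count = 0;
  return {
    // Call right before generation begins.
    start() {
      startMs = now();
      count = 0;
    },
    // Call once per generated token.
    onToken() {
      count += 1;
    },
    // Tokens per second since start(); 0 if not started or no time has passed.
    tokensPerSecond() {
      if (startMs === null) return 0;
      const elapsedMs = now() - startMs;
      return elapsedMs > 0 ? (count / elapsedMs) * 1000 : 0;
    },
  };
}
```

Calling `meter.onToken()` from the token callback and reading `meter.tokensPerSecond()` after generation gives a figure comparable to the ones quoted above, provided the model and prompt length are similar.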
Quantization Types
| Type | Weights | Compute | Size Reduction | Use Case |
|---|---|---|---|---|
| fp32 | 32-bit float | 32-bit | 1x (baseline) | Training |
| fp16 | 16-bit float | 16-bit | 2x | GPU inference |
| q4f16 | 4-bit int | 16-bit | ~8x (~4x vs fp16) | WebGPU inference |
| q4 | 4-bit int | 16-bit | ~8x (~4x vs fp16) | Default for WebGPU inference |
WebGPU Acceleration
WebGPU is a modern compute API that provides access to GPU hardware. Gerbil uses it for fast inference in both browsers and Node.js.
Browser Path
The engine runs on the main thread by default. For long generations, run it inside a Web Worker so the UI stays responsive — the engine API is identical either way.
Node.js (Dawn)
Node.js gets WebGPU from Dawn (the same WebGPU implementation Chrome uses), exposed through the webgpu npm package. Gerbil runs the exact same WGSL shaders there as in the browser, so behaviour and output match across environments.
Any machine with a real GPU works — no headless browser, no separate process, no IPC.
Streaming Architecture
Generation streams each token to an onToken callback as it is decoded. To keep the UI thread free during heavy generations, run the engine in a Web Worker and forward tokens back to the main thread:
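A worker-side handler for this pattern might look like the sketch below. It assumes a call shape of `engine.generate(prompt, { onToken })` with `onToken` receiving each token as a string; that signature is an assumption, since the source only states that an onToken callback exists. The message shapes follow the Worker Message Protocol section of this page.

```javascript
// Sketch of the worker side: run generation and forward each token to the
// main thread. ASSUMPTION: engine.generate(prompt, { onToken }) with string
// tokens. `post` is whatever delivers messages, e.g. self.postMessage.
function createWorkerHandler(engine, post) {
  return async function handleMessage(message) {
    if (message.type !== "generate") return; // other message types not handled here
    let numTokens = 0;
    let text = "";
    try {
      await engine.generate(message.prompt, {
        ...message.options,
        onToken: (token) => {
          numTokens += 1;
          text += token;
          post({ status: "token", text: token }); // stream to the main thread
        },
      });
      post({ status: "complete", text, numTokens });
    } catch (err) {
      post({ status: "error", error: err instanceof Error ? err.message : String(err) });
    }
  };
}
```

Inside a real worker you would wire it up with `self.onmessage = (e) => handle(e.data)` and pass `(m) => self.postMessage(m)` as `post`. Keeping the handler independent of the worker globals makes it easy to test with a fake engine.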
Worker Message Protocol
A minimal protocol for driving the engine from a Web Worker:
// Main → Worker
{ type: "load", repo: "Qwen/Qwen3.5-0.8B" }
{ type: "generate", prompt: "...", options: {...} }
{ type: "interrupt" }
// Worker → Main
{ status: "loading", message: "Loading model..." }
{ status: "progress", loaded: 50, total: 100 }
{ status: "ready" }
{ status: "token", text: "Hello", tps: 75 }
{ status: "complete", text: "Hello world!", numTokens: 3, tps: 75 }
{ status: "error", error: "Out of memory" }
Thinking State Tracking
For Qwen3 thinking mode, Gerbil tracks whether the model is "thinking" or "answering" by monitoring special tokens:
const [START_THINKING_TOKEN_ID, END_THINKING_TOKEN_ID] = tokenizer.encode("<think></think>", { add_special_tokens: false });
let state = "answering";
const tokenCallback = (tokens) => {
  const tokenId = Number(tokens[0]);
  if (tokenId === START_THINKING_TOKEN_ID) state = "thinking";
  if (tokenId === END_THINKING_TOKEN_ID) state = "answering";
};
Model Caching
Models are large (100MB - 500MB). Gerbil caches them locally to avoid re-downloading:
| Environment | Cache Location | Mechanism |
|---|---|---|
| Browser | IndexedDB | Cached after first download |
| Node.js | ~/.cache/huggingface/hub | Local Hugging Face cache |
Cache Behavior
First load
Downloads from Hugging Face Hub (~15-30s depending on model size)
Subsequent loads
Reads from local cache (~1-2s for browser, ~0.5s for Node.js)
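The first-load versus subsequent-load behaviour is a cache-first lookup. The sketch below shows the general pattern only: the `loadWithCache` helper and its `cache` interface (async `get` and `set`) are hypothetical, and Gerbil's own caching code is not shown in the source.

```javascript
// Sketch of cache-first loading. `cache` is any key-value store with get/set
// (HYPOTHETICAL interface; in a browser it could wrap IndexedDB), and
// `download` fetches the model bytes from the network.
async function loadWithCache(key, cache, download) {
  const cached = await cache.get(key);
  if (cached !== undefined) {
    return { data: cached, fromCache: true }; // subsequent loads: no network
  }
  const data = await download(key); // first load: fetch from the hub
  await cache.set(key, data); // store so later loads skip the download
  return { data, fromCache: false };
}
```

A plain `Map` satisfies the interface for experimentation, because awaiting a non-promise value simply yields that value.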
Clearing Cache
Browser (in DevTools console):
indexedDB.deleteDatabase("gerbil-cache");
Node.js CLI:
npx @tryhamster/gerbil cache --clean
Memory Management
Gerbil automatically manages memory to prevent leaks while maintaining performance. For WebGPU inference, memory is bounded and monitored.
Automatic KV Cache Reset
The KV cache automatically resets when it exceeds the model's context length (2048 tokens for Qwen3). This prevents unbounded memory growth:
// Memory automatically resets after ~2048 tokens
// No action needed - happens transparently
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});
// Long conversations work fine - auto-reset preserves context window
for (let i = 0; i < 100; i++) {
  await engine.generate("Tell me something interesting");
}
Memory Bounds
| Metric | Limit | Notes |
|---|---|---|
| Per-page maximum | ~4GB | context length × token size |
| Concurrent pages | 5 max | Multiple Gerbil instances |
| Typical usage | < 2GB | Most conversations < 500 tokens |
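The "context length × token size" note can be made concrete for a standard attention transformer, where each token stores one key and one value vector per layer and per KV head. The sketch below applies that formula. The model dimensions in the usage note are made up for illustration and are not the real dimensions of any model named on this page.

```javascript
// Sketch: estimate KV cache size for a standard attention transformer.
// bytes = 2 (key + value) x layers x kvHeads x headDim x bytesPerElement x tokens
function kvCacheBytes({ layers, kvHeads, headDim, bytesPerElement, tokens }) {
  const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElement;
  return bytesPerToken * tokens;
}

// HYPOTHETICAL dimensions, chosen only to show the arithmetic.
const exampleBytes = kvCacheBytes({
  layers: 24,
  kvHeads: 8,
  headDim: 64,
  bytesPerElement: 2, // fp16
  tokens: 2048,
});
const exampleMiB = exampleBytes / (1024 * 1024);
```

With these made-up dimensions each token costs 48 KiB, so a 2048-token context comes to 96 MiB. Because the cost grows linearly with the number of tokens, capping the context length bounds the cache size.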
Memory API (Gerbil class)
The higher-level Gerbil class adds explicit memory helpers on top of the engine. With the lower-level WebGPUEngine, the KV cache auto-bounds and you call engine.destroy() to free the device.
// Inspect device + context usage
const info = gerbil.getInfo();
console.log(`Backend: ${info.device.backend}, VRAM: ${info.device.vram}`);
console.log(`Context: ${info.context.used} / ${info.context.max} tokens`);
// Clear cached model files from disk (frees IndexedDB / cache dir)
await gerbil.clearCache();
// Always dispose when done: frees the GPU device
await gerbil.dispose();
Long-Running Sessions
For background services or persistent processes, monitor and clean up periodically:
let engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});
// The KV cache auto-resets at the context limit, so memory stays bounded.
// For very long-lived processes, periodically recreate the engine to release
// and re-acquire the GPU device cleanly:
setInterval(async () => {
  engine.destroy();
  engine = await WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
  });
}, 60 * 60 * 1000); // Recycle hourly
Cleanup Best Practices
✓ Good: Dispose when done
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});
// ... use engine ...
engine.destroy(); // Frees the GPU device + weights
✗ Bad: Creating many without cleanup
for (let i = 0; i < 10; i++) {
  const engine = await WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
  });
  // Forgot to destroy - GPU devices accumulate!
}
Performance Tips
Batch UI Updates
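Tokens can arrive faster than the display refreshes (roughly 225 per second on a desktop GPU with Qwen3.5-0.8B). Batch UI updates with requestAnimationFrame so the page is updated at most once per frame.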
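One way to batch is to buffer incoming tokens and flush the buffer once per animation frame. The sketch below is a minimal version of that idea. The `createTokenBatcher` helper is hypothetical, and the scheduler is injectable so the same logic runs outside a browser.

```javascript
// Sketch: buffer streamed tokens and flush them to the UI at most once per frame.
// `render` receives the batched text. `schedule` defaults to
// requestAnimationFrame and can be replaced in tests or non-browser code.
// The helper name is hypothetical, not part of Gerbil's API.
function createTokenBatcher(render, schedule = (cb) => requestAnimationFrame(cb)) {
  let pending = "";
  let scheduled = false;
  return function push(token) {
    pending += token;
    if (scheduled) return; // a flush is already queued for the next frame
    scheduled = true;
    schedule(() => {
      const text = pending;
      pending = "";
      scheduled = false;
      render(text); // one UI update for every token received since the last flush
    });
  };
}
```

Passing the returned `push` function as the token callback means the render function runs once per frame regardless of how many tokens arrived in between.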
Preload Models
Load models during idle time so they're cached for instant use later.
Use Smaller Models
LFM2.5-350M (~199 MB) is the faster, smaller native alternative to Qwen3.5-0.8B — roughly 2.8× the throughput, often plenty for simple tasks.
Always Clean Up
Call gerbil.dispose() (or engine.destroy() on a WebGPUEngine), or worker.terminate() if the engine runs in a Web Worker, to free GPU memory.