How It Works

Gerbil turns a model's weights into GPU compute that runs on your device. This page walks through how the engine loads models, runs inference, and squeezes performance out of WebGPU.

One engine: Gerbil is a from-scratch WebGPU engine — every kernel is a hand-written WGSL compute shader. Every modality (text, vision, embeddings, speech-to-text, text-to-speech) runs on the same WGSL stack. It is WebGPU-only: it throws a clear error when no WebGPU adapter is present rather than silently falling back to CPU. The same shaders run in the browser and in Node.js via Dawn.

System Overview

Key Design Decisions

1. A From-Scratch Engine

Gerbil implements the full inference stack itself — tokenizers, the model graph, and every kernel as a WGSL compute shader. There are no third-party runtimes underneath, so the same code path drives every modality on every platform.

2. WebGPU Only

Every kernel is a WGSL compute shader. In browsers it uses the platform's WebGPU directly. In Node.js it runs the exact same shaders via Dawn (the webgpu npm package). If no WebGPU adapter is present it throws — there is no CPU fallback.
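The "throw instead of falling back" behaviour can be sketched as follows. This is a minimal illustration, not Gerbil's actual startup code: `requireAdapter`, `GpuLike`, and `AdapterLike` are hypothetical names, and the only real API relied on is that WebGPU's `requestAdapter()` resolves to `null` when no adapter is available.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
// (Hypothetical names; a real GPU object has many more members.)
type AdapterLike = object;
interface GpuLike {
  requestAdapter(): Promise<AdapterLike | null>;
}

// Resolves with an adapter, or rejects with a descriptive error.
// There is deliberately no CPU fallback branch.
async function requireAdapter(gpu: GpuLike | undefined): Promise<AdapterLike> {
  if (!gpu) {
    throw new Error("WebGPU is not available in this environment");
  }
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No WebGPU adapter found");
  }
  return adapter;
}
```

In a browser the argument would be `navigator.gpu`; in Node.js it would be whatever GPU object the Dawn bindings provide.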

3. Quantization for Speed

Weights are quantized to 4-bit (q4) with fp16 compute. This cuts model size by ~4x and keeps the working set small enough to run comfortably on consumer GPUs and mobile devices.
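To make the idea of 4-bit weights concrete, here is a generic sketch of symmetric group quantization that packs two 4-bit codes into each byte. The page does not describe Gerbil's actual q4 layout, so the group size, the scale format, and the function names below are all assumptions chosen for illustration.

```typescript
// Symmetric 4-bit group quantization: each group of weights shares one scale,
// and two 4-bit codes are packed into each byte.
const GROUP_SIZE = 32; // assumption: illustrative group size

interface QuantizedTensor {
  packed: Uint8Array;   // two 4-bit codes per byte
  scales: Float32Array; // one scale per group
  length: number;       // original element count
}

function quantizeQ4(weights: Float32Array): QuantizedTensor {
  const groups = Math.ceil(weights.length / GROUP_SIZE);
  const packed = new Uint8Array(Math.ceil(weights.length / 2));
  const scales = new Float32Array(groups);
  for (let g = 0; g < groups; g++) {
    const start = g * GROUP_SIZE;
    const end = Math.min(start + GROUP_SIZE, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    const scale = maxAbs / 7 || 1; // map [-maxAbs, maxAbs] onto codes [-7, 7]
    scales[g] = scale;
    for (let i = start; i < end; i++) {
      const q = Math.max(-7, Math.min(7, Math.round(weights[i] / scale)));
      const code = q + 8; // store as an unsigned value in 1..15
      packed[i >> 1] |= (i & 1) ? (code << 4) : code; // odd index: high nibble
    }
  }
  return { packed, scales, length: weights.length };
}

function dequantizeQ4(t: QuantizedTensor): Float32Array {
  const out = new Float32Array(t.length);
  for (let i = 0; i < t.length; i++) {
    const byte = t.packed[i >> 1];
    const code = (i & 1) ? (byte >> 4) : (byte & 0x0f);
    out[i] = (code - 8) * t.scales[Math.floor(i / GROUP_SIZE)];
  }
  return out;
}
```

Each reconstructed weight is off by at most about half a scale step, which is the accuracy cost paid for the smaller footprint.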

4. Token Streaming

Generation streams tokens as they are produced via an onToken callback, so the UI updates in real time. You can also run the engine in a Web Worker to keep the main thread free.

Inference Pipeline

The inference stack is a single native path from text to GPU: tokenizer, model graph, then WGSL compute shaders.

Where It Runs

| Environment | WebGPU Source | Speed | Notes |
| --- | --- | --- | --- |
| Browser | Platform WebGPU | ~225 desktop · ~50 iPad · ~38 iPhone tok/s (Qwen3.5-0.8B) | Chrome/Edge 113+, Safari 26+ (incl. iPad/iPhone), Firefox 141+. Zero install, on-device. |
| Node.js | Dawn (webgpu npm) | ~230 tok/s server (Qwen3.5-0.8B) | Same WGSL shaders, any machine with a real GPU |

Throughput: browser and server

The same engine and the same code run in two places. In the browser, anyone can use your app the moment the page loads — nothing to install, and every token is generated on the user's own device, so their data stays there. In Node.js, the same shaders run through Dawn on a machine with a real GPU, which is where you get the highest throughput. You write your inference once; you choose where it runs.

On a desktop GPU the in-browser and server figures land close together (~225 vs ~230 tok/s) — the same WGSL shaders, just a different host. Throughput then scales down with the device: ~225 tok/s desktop · ~50 tok/s iPad · ~38 tok/s iPhone for Qwen3.5-0.8B. That mobile range is the price of zero-install, on-device privacy — a tradeoff you get to make per use case, not a limitation to design around.

| Setting | Throughput | What you get |
| --- | --- | --- |
| In-browser | ≈ 225 tok/s desktop | Zero install, runs on the user's device, data never leaves it (~50 iPad · ~38 iPhone) |
| Server (Node) | ≈ 230 tok/s | Maximum throughput on a machine with a dedicated GPU |

These are representative figures for Qwen3.5-0.8B. Your numbers will scale with the device and the model; the live playground reports the real measured rate on whatever hardware you open it on.
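As a rough way to translate these rates into waiting time, the arithmetic below converts tokens per second into seconds for a fixed-length reply. It is a back-of-the-envelope sketch: `decodeSeconds` is a hypothetical helper, and it ignores model load and prompt processing time.

```typescript
// Seconds needed to decode `tokens` tokens at a steady `tokPerSec` rate.
// Ignores prompt processing and model load time.
function decodeSeconds(tokens: number, tokPerSec: number): number {
  if (tokPerSec <= 0) throw new RangeError("tokPerSec must be positive");
  return tokens / tokPerSec;
}

// Representative rates quoted in this section for Qwen3.5-0.8B.
const rates = { desktop: 225, ipad: 50, iphone: 38, server: 230 };

for (const [device, rate] of Object.entries(rates)) {
  console.log(`${device}: ${decodeSeconds(500, rate).toFixed(1)} s for 500 tokens`);
}
```

For a 500-token reply this works out to roughly 2.2 s on desktop or server, 10 s on iPad, and 13.2 s on iPhone.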

Quantization Types

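| Type | Weights | Compute | Size Reduction | Use Case |
| --- | --- | --- | --- | --- |
| fp32 | 32-bit float | 32-bit | 1x (baseline) | Training |
| fp16 | 16-bit float | 16-bit | 2x | GPU inference |
| q4f16 | 4-bit int | 16-bit | ~4x | WebGPU inference |
| q4 | 4-bit int | 16-bit | ~4x | Default for WebGPU inference |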

WebGPU Acceleration

WebGPU is a modern compute API that provides access to GPU hardware. Gerbil uses it for fast inference in both browsers and Node.js.

Browser Path

The engine runs on the main thread by default. For long generations, run it inside a Web Worker so the UI stays responsive — the engine API is identical either way.

Node.js (Dawn)

Node.js gets WebGPU from Dawn (the same WebGPU implementation Chrome uses), exposed through the webgpu npm package. Gerbil runs the exact same WGSL shaders there as in the browser, so behaviour and output match across environments.

Any machine with a real GPU works — no headless browser, no separate process, no IPC.

Streaming Architecture

Generation streams each token to an onToken callback as it is decoded. To keep the UI thread free during heavy generations, run the engine in a Web Worker and forward tokens back to the main thread.

Worker Message Protocol

A minimal protocol for driving the engine from a Web Worker:

protocol.ts
// Main → Worker
type WorkerRequest =
  | { type: "load"; repo: string } // e.g. "Qwen/Qwen3.5-0.8B"
  | { type: "generate"; prompt: string; options: Record<string, unknown> }
  | { type: "interrupt" };
// Worker → Main
type WorkerResponse =
  | { status: "loading"; message: string } // e.g. "Loading model..."
  | { status: "progress"; loaded: number; total: number } // e.g. 50 of 100
  | { status: "ready" }
  | { status: "token"; text: string; tps: number } // e.g. "Hello", 75
  | { status: "complete"; text: string; numTokens: number; tps: number } // e.g. "Hello world!", 3, 75
  | { status: "error"; error: string }; // e.g. "Out of memory"
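On the main thread, messages like these are naturally handled by a small reducer. The sketch below is one possible shape, not part of Gerbil: `UiState`, `initialState`, and `reduce` are hypothetical names, and it assumes that "token" messages carry only the newly decoded text while "complete" carries the full output.

```typescript
// Worker → Main messages, mirroring the protocol described in this section.
type WorkerResponse =
  | { status: "loading"; message: string }
  | { status: "progress"; loaded: number; total: number }
  | { status: "ready" }
  | { status: "token"; text: string; tps: number }
  | { status: "complete"; text: string; numTokens: number; tps: number }
  | { status: "error"; error: string };

// Hypothetical UI state driven by those messages.
interface UiState {
  phase: "idle" | "loading" | "ready" | "generating" | "error";
  progress: number; // 0..1 download progress
  text: string;     // output accumulated so far
  tps: number;      // last reported tokens per second
  error?: string;
}

const initialState: UiState = { phase: "idle", progress: 0, text: "", tps: 0 };

// Pure function: given the current state and one message, return the next state.
function reduce(state: UiState, msg: WorkerResponse): UiState {
  switch (msg.status) {
    case "loading":
      return { ...state, phase: "loading" };
    case "progress":
      return { ...state, progress: msg.total > 0 ? msg.loaded / msg.total : 0 };
    case "ready":
      return { ...state, phase: "ready" };
    case "token": // assumption: text is the newly decoded piece only
      return { ...state, phase: "generating", text: state.text + msg.text, tps: msg.tps };
    case "complete": // assumption: text is the full final output
      return { ...state, phase: "ready", text: msg.text, tps: msg.tps };
    case "error":
      return { ...state, phase: "error", error: msg.error };
  }
}
```

A worker's `onmessage` handler would then call `reduce` with each incoming `event.data` and re-render from the returned state.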

Thinking State Tracking

For Qwen3 thinking mode, Gerbil tracks whether the model is "thinking" or "answering" by monitoring special tokens:

thinking.ts
// Look up the token IDs that open and close a thinking block.
const [START_THINKING_TOKEN_ID, END_THINKING_TOKEN_ID] =
  tokenizer.encode("<think></think>", { add_special_tokens: false });
let state = "answering";
// Called for each newly decoded token; flips state at the marker tokens.
const tokenCallback = (tokens) => {
  const tokenId = Number(tokens[0]);
  if (tokenId === START_THINKING_TOKEN_ID) state = "thinking";
  if (tokenId === END_THINKING_TOKEN_ID) state = "answering";
};
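Building on the same idea, a tracker can also route streamed text into separate "thinking" and "answer" buffers. This is a self-contained sketch under stated assumptions: `createThinkingTracker` is a hypothetical helper, and the marker token IDs are passed in by the caller rather than looked up from a real tokenizer.

```typescript
type Phase = "thinking" | "answering";

interface ThinkingTracker {
  push(tokenId: number, text: string): void;
  readonly phase: Phase;
  readonly thinking: string;
  readonly answer: string;
}

// startId / endId are the token IDs of the opening and closing think markers.
function createThinkingTracker(startId: number, endId: number): ThinkingTracker {
  let phase: Phase = "answering";
  let thinking = "";
  let answer = "";
  return {
    push(tokenId, text) {
      // Marker tokens switch phase and are not added to either buffer.
      if (tokenId === startId) { phase = "thinking"; return; }
      if (tokenId === endId) { phase = "answering"; return; }
      if (phase === "thinking") thinking += text;
      else answer += text;
    },
    get phase() { return phase; },
    get thinking() { return thinking; },
    get answer() { return answer; },
  };
}
```

A UI can then render the `thinking` buffer in a collapsed panel and stream `answer` into the main message.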

Model Caching

Models are large (roughly 100 MB to 500 MB), so Gerbil caches them locally to avoid re-downloading:

| Environment | Cache Location | Mechanism |
| --- | --- | --- |
| Browser | IndexedDB | Cached after first download |
| Node.js | ~/.cache/huggingface/hub | Local Hugging Face cache |

Cache Behavior

1. First load: downloads from Hugging Face Hub (~15-30s depending on model size)
2. Subsequent loads: reads from local cache (~1-2s for browser, ~0.5s for Node.js)
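The two cases above amount to a cache-first lookup. The sketch below shows that pattern with an in-memory stand-in for the store so it runs anywhere. `ModelCache`, `MemoryCache`, and `loadCached` are hypothetical names, and a real browser implementation would back the interface with IndexedDB.

```typescript
// Hypothetical storage interface; a browser version would wrap IndexedDB.
interface ModelCache {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, bytes: Uint8Array): Promise<void>;
}

// In-memory stand-in so the sketch is runnable without a browser.
class MemoryCache implements ModelCache {
  private store = new Map<string, Uint8Array>();
  async get(key: string): Promise<Uint8Array | undefined> {
    return this.store.get(key);
  }
  async set(key: string, bytes: Uint8Array): Promise<void> {
    this.store.set(key, bytes);
  }
}

// Return cached bytes when present; otherwise download once and store them.
async function loadCached(
  key: string,
  cache: ModelCache,
  download: (key: string) => Promise<Uint8Array>,
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  const hit = await cache.get(key);
  if (hit) return { bytes: hit, fromCache: true };
  const bytes = await download(key);
  await cache.set(key, bytes);
  return { bytes, fromCache: false };
}
```

The `fromCache` flag lets a caller distinguish a first load from a subsequent one, for example to decide whether to show a download progress bar.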

Clearing Cache

Browser (in DevTools console):

Terminal
indexedDB.deleteDatabase("gerbil-cache");

Node.js CLI:

Terminal
npx @tryhamster/gerbil cache --clean

Memory Management

Gerbil automatically manages memory to prevent leaks while maintaining performance. For WebGPU inference, memory is bounded and monitored.

Automatic KV Cache Reset

The KV cache automatically resets when it exceeds the model's context length (2048 tokens for Qwen3). This prevents unbounded memory growth:

auto-reset.ts
// Memory automatically resets after ~2048 tokens
// No action needed - happens transparently
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});
// Long conversations work fine - auto-reset keeps the KV cache within the context window
for (let i = 0; i < 100; i++) {
  await engine.generate("Tell me something interesting");
}
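The bounding logic can be pictured as a position counter that starts over when the next chunk would not fit. This is an illustrative sketch only: the page does not say whether Gerbil clears the cache entirely or keeps a sliding window, so the full reset below is an assumption, and `BoundedKvCache` is a hypothetical class.

```typescript
// Tracks how many token positions of a fixed-size KV cache are in use.
class BoundedKvCache {
  readonly maxTokens: number;
  private used = 0;
  resets = 0; // how many times the cache has been cleared

  constructor(maxTokens: number) {
    this.maxTokens = maxTokens;
  }

  // Reserve room for `count` new tokens; returns the position where they start.
  reserve(count: number): number {
    if (count > this.maxTokens) {
      throw new RangeError("chunk is larger than the context window");
    }
    if (this.used + count > this.maxTokens) {
      this.used = 0; // assumption: drop all cached keys/values and start over
      this.resets++;
    }
    const start = this.used;
    this.used += count;
    return start;
  }

  get length(): number {
    return this.used;
  }
}
```

Because `used` can never exceed `maxTokens`, the memory needed for cached keys and values has a fixed upper bound regardless of how long the session runs.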

Memory Bounds

| Metric | Limit | Notes |
| --- | --- | --- |
| Per-page maximum | ~4GB | context length × token size |
| Concurrent pages | 5 max | Multiple Gerbil instances |
| Typical usage | < 2GB | Most conversations < 500 tokens |

Memory API (Gerbil class)

The higher-level Gerbil class adds explicit memory helpers on top of the engine. With the lower-level WebGPUEngine, the KV cache auto-bounds and you call engine.destroy() to free the device.

memory-api.ts
// Assumes `gerbil` is an already-initialized Gerbil instance.
// Inspect device + context usage
const info = gerbil.getInfo();
console.log(`Backend: ${info.device.backend}, VRAM: ${info.device.vram}`);
console.log(`Context: ${info.context.used} / ${info.context.max} tokens`);
// Clear cached model files from disk (frees IndexedDB / cache dir)
await gerbil.clearCache();
// Always dispose when done: this frees the GPU device
await gerbil.dispose();

Long-Running Sessions

For background services or persistent processes, monitor and clean up periodically:

long-running.ts
let engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});
// The KV cache auto-resets at the context limit, so memory stays bounded.
// For very long-lived processes, periodically recreate the engine to release
// and re-acquire the GPU device cleanly. Recycle only while no generation is
// in flight, since destroy() frees the device the current engine is using.
setInterval(async () => {
  engine.destroy();
  engine = await WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
  });
}, 60 * 60 * 1000); // Recycle hourly

Cleanup Best Practices

✓ Good: Dispose when done

good.ts
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});
// ... use engine ...
engine.destroy(); // Frees the GPU device + weights

✗ Bad: Creating many without cleanup

bad.ts
for (let i = 0; i < 10; i++) {
  const engine = await WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
  });
  // Forgot to destroy - GPU devices accumulate!
}
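One way to make the "good" pattern hard to forget is a small wrapper that always destroys the engine, even when the work throws. This is a sketch, not a Gerbil API: `withEngine` and `Destroyable` are hypothetical names, and the only assumption about the engine is that it exposes a synchronous `destroy()` as in the examples above.

```typescript
// Anything with a destroy() method, such as the engine in the examples above.
interface Destroyable {
  destroy(): void;
}

// Create an engine, run `use` with it, and destroy it afterwards no matter what.
async function withEngine<E extends Destroyable, T>(
  create: () => Promise<E>,
  use: (engine: E) => Promise<T>,
): Promise<T> {
  const engine = await create();
  try {
    return await use(engine);
  } finally {
    engine.destroy(); // runs on success and on error
  }
}

// Possible usage, assuming the engine API shown in this section:
// const text = await withEngine(
//   () => WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" }),
//   (engine) => engine.generate("Tell me something interesting"),
// );
```

The `finally` block is what prevents GPU devices from accumulating when a generation fails partway through.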

Performance Tips

Batch UI Updates

Tokens arrive quickly (roughly 40 to 225 per second in the figures above), often faster than the screen repaints. Consider batching UI updates with requestAnimationFrame.
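A batching helper along these lines collects tokens and renders them once per scheduled flush. It is a minimal sketch: `createTokenBatcher` is a hypothetical helper, and the scheduler is injected so the same code works with `requestAnimationFrame` in a browser or with a manual queue elsewhere.

```typescript
// A scheduler runs `flush` at some later point (for example, the next frame).
type Schedule = (flush: () => void) => void;

// Returns an onToken-style function that buffers text and renders it in batches.
function createTokenBatcher(
  render: (chunk: string) => void,
  schedule: Schedule,
): (token: string) => void {
  let pending = "";
  let scheduled = false;
  return (token: string): void => {
    pending += token;
    if (scheduled) return; // a flush is already queued for this batch
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const chunk = pending;
      pending = "";
      render(chunk);
    });
  };
}

// Possible browser usage (assumes a DOM element named `output`):
// const onToken = createTokenBatcher(
//   (chunk) => { output.textContent += chunk; },
//   (flush) => requestAnimationFrame(() => flush()),
// );
```

However many tokens arrive between two frames, the DOM is touched only once per frame.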

Preload Models

Load models during idle time so they're cached for instant use later.

Use Smaller Models

LFM2.5-350M (~199 MB) is the faster, smaller native alternative to Qwen3.5-0.8B — roughly 2.8× the throughput, often plenty for simple tasks.

Always Cleanup

Call gerbil.dispose() (or engine.destroy() on a WebGPUEngine), or worker.terminate() when the engine runs in a Web Worker, to free GPU memory.