How It Works
Technical deep-dive into how Gerbil works under the hood.
System Overview
Key Design Decisions
1. transformers.js as the Foundation
We use Hugging Face transformers.js, which provides pre-converted ONNX models, exact tokenizers, and a unified API across all backends.
2. WebGPU First
WebGPU provides 5-10x speedup over CPU. In browsers, we use native WebGPU. In Node.js, we use headless Chrome as a WebGPU accelerator (ChromeGPUBackend).
3. Quantization for Speed
All models use quantized weights: q4f16 (4-bit weights, fp16 compute) for WebGPU, q4 for CPU. This reduces model size by ~4x.
4. Streaming via Web Workers
Browser inference runs in a Web Worker to keep the UI responsive, stream tokens in real-time, and isolate GPU memory from the main thread.
Inference Pipeline
The inference stack consists of three layers: the Gerbil API on top, transformers.js in the middle, and an execution backend (WebGPU, CPU, or WASM) underneath.
Execution Backends
| Backend | Environment | Speed | Notes |
|---|---|---|---|
| WebGPU | Browser; Node.js via headless Chrome | ~100-150 tok/s | Fastest, requires GPU |
| CPU | Node.js | ~30-60 tok/s | Uses SIMD, good on Apple Silicon |
| WASM | Browser fallback | ~5-10 tok/s | Works everywhere |
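As a rough illustration, backend selection comes down to feature detection. The sketch below shows the general idea and is not Gerbil's exact logic:

```js
// Sketch of backend selection via feature detection (illustrative, not Gerbil's exact code)
async function pickBackend() {
  // WebGPU: available when navigator.gpu exists and an adapter can be acquired
  if (typeof navigator !== "undefined" && navigator.gpu) {
    const adapter = await navigator.gpu.requestAdapter();
    if (adapter) return "webgpu";
  }
  // Node.js (no window): fall back to native CPU execution
  if (typeof window === "undefined") return "cpu";
  // Browser without WebGPU: WASM works everywhere
  return "wasm";
}
```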
Quantization Types
| Type | Weights | Compute | Size Reduction | Use Case |
|---|---|---|---|---|
| fp32 | 32-bit float | 32-bit | 1x (baseline) | Training |
| fp16 | 16-bit float | 16-bit | 2x | GPU inference |
| q4f16 | 4-bit int | 16-bit | ~4x | WebGPU inference |
| q4 | 4-bit int | 32-bit | ~4x | CPU inference |
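For reference, this is roughly how a model is loaded through transformers.js with an explicit device and dtype. The model id is illustrative, and Gerbil picks these options for you based on the environment:

```js
import { pipeline } from "@huggingface/transformers";

// Illustrative only - Gerbil selects device and dtype automatically
const generator = await pipeline("text-generation", "onnx-community/Qwen3-0.6B-ONNX", {
  device: "webgpu", // or "wasm" / "cpu" depending on the backend
  dtype: "q4f16",   // 4-bit weights, fp16 compute (use "q4" on CPU)
});

const output = await generator("Hello!", { max_new_tokens: 32 });
```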
WebGPU Acceleration
WebGPU is a modern compute API that provides access to GPU hardware. Gerbil uses it for fast inference in both browsers and Node.js.
Browser Path
Node.js WebGPU Path (ChromeGPUBackend)
Node.js doesn't have native WebGPU, so Gerbil uses headless Chrome as a GPU accelerator.
The fixed port (43724) ensures consistent IndexedDB caching across runs.
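Conceptually, this amounts to launching Chrome with WebGPU enabled and pointing it at a local inference page served on that fixed port. The sketch below uses puppeteer purely for illustration and is not the actual ChromeGPUBackend implementation:

```js
// Conceptual sketch only - the real ChromeGPUBackend internals may differ
import puppeteer from "puppeteer";

const PORT = 43724; // fixed port => fixed origin => the same IndexedDB cache every run

// Launch headless Chrome with WebGPU enabled (flags may vary by platform and Chrome version)
const browser = await puppeteer.launch({
  headless: true,
  args: ["--enable-unsafe-webgpu", "--enable-features=Vulkan"],
});

// The page at this origin runs transformers.js with device: "webgpu";
// results are relayed back to Node.js (e.g. via page bindings such as exposeFunction)
const page = await browser.newPage();
await page.goto(`http://localhost:${PORT}`);
```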
Streaming Architecture
LLM inference is computationally intensive. Gerbil runs it in a Web Worker to keep the UI responsive:
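A minimal sketch of the main-thread side (the worker.js entry point and appendToUI helper are illustrative; the message shapes follow the protocol below):

```js
// Main thread: spawn the worker and stream tokens without blocking the UI
const worker = new Worker(new URL("./worker.js", import.meta.url), { type: "module" });

worker.postMessage({ type: "load", modelId: "qwen3-0.6b" });

worker.onmessage = ({ data }) => {
  if (data.status === "ready") {
    worker.postMessage({
      type: "generate",
      messages: [{ role: "user", content: "Hello!" }],
      options: {},
    });
  } else if (data.status === "token") {
    appendToUI(data.text); // assumed UI helper
  } else if (data.status === "complete") {
    console.log(`Done: ${data.numTokens} tokens at ${data.tps} tok/s`);
  }
};
```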
Message Protocol
```js
// Main → Worker
{ type: "load", modelId: "qwen3-0.6b" }
{ type: "generate", messages: [...], options: {...} }
{ type: "interrupt" }
{ type: "reset" }
```
```js
// Worker → Main
{ status: "loading", message: "Loading model..." }
{ status: "progress", file: "model.onnx", progress: 50 }
{ status: "ready" }
{ status: "token", text: "Hello", state: "answering", tps: 75 }
{ status: "complete", text: "Hello world!", numTokens: 3, tps: 75 }
{ status: "error", error: "Out of memory" }
```
Thinking State Tracking
For Qwen3 thinking mode, Gerbil tracks whether the model is "thinking" or "answering" by monitoring special tokens:
```js
const [START_THINKING_TOKEN_ID, END_THINKING_TOKEN_ID] = tokenizer.encode(
  "<think></think>",
  { add_special_tokens: false }
);

let state = "answering";

const tokenCallback = (tokens) => {
  const tokenId = Number(tokens[0]);
  if (tokenId === START_THINKING_TOKEN_ID) state = "thinking";
  if (tokenId === END_THINKING_TOKEN_ID) state = "answering";
};
```
Model Caching
Models are large (100MB - 500MB). Gerbil caches them locally to avoid re-downloading:
| Environment | Cache Location | Mechanism |
|---|---|---|
| Browser | IndexedDB | transformers.js built-in |
| Node.js (CPU) | ~/.cache/huggingface/hub | transformers.js built-in |
| Node.js (WebGPU) | Chrome's IndexedDB | Via ChromeGPUBackend |
Cache Behavior
- First load: downloads from the Hugging Face Hub (~15-30s depending on model size)
- Subsequent loads: reads from local cache (~1-2s for browser, ~0.5s for Node.js)
Clearing Cache
Browser (in DevTools console):
```js
indexedDB.deleteDatabase("transformers-cache");
```
Node.js CLI:

```bash
npx @tryhamster/gerbil cache --clean
```
Memory Management
Gerbil automatically manages memory to prevent leaks while maintaining performance. For WebGPU inference, memory is bounded and monitored.
Automatic KV Cache Reset
The KV cache automatically resets when it exceeds the model's context length (2048 tokens for Qwen3). This prevents unbounded memory growth:
```js
// Memory automatically resets after ~2048 tokens
// No action needed - happens transparently
const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");

// Long conversations work fine - auto-reset preserves context window
for (let i = 0; i < 100; i++) {
  await gerbil.generate("Tell me something interesting");
}
```
Memory Bounds
| Metric | Limit | Notes |
|---|---|---|
| Per-page maximum | ~4GB | context length × token size |
| Concurrent pages | 5 max | Multiple Gerbil instances |
| Typical usage | < 2GB | Most conversations < 500 tokens |
Memory API
```js
// Check memory usage (WebGPU only)
const mem = await gerbil.getMemoryUsage();
if (mem) {
  console.log(`Using ${mem.usedGB.toFixed(1)}GB / ${mem.totalGB.toFixed(1)}GB`);
}

// Clear KV cache manually (resets conversation context)
await gerbil.clearCache();

// Auto-cleanup if threshold exceeded
const didCleanup = await gerbil.checkMemoryAndCleanup(8); // 8GB threshold

// Always dispose when done
await gerbil.dispose();
```
Long-Running Sessions
For background services or persistent processes, monitor and clean up periodically:
```js
const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");

// Periodic memory monitoring
setInterval(async () => {
  const mem = await gerbil.getMemoryUsage();
  if (mem && mem.usedGB > 10) {
    console.warn(`High memory: ${mem.usedGB.toFixed(1)}GB`);
    await gerbil.clearCache();
  }
}, 60000); // Check every minute
```
Cleanup Best Practices
✓ Good: Dispose when done
```js
const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");
// ... use gerbil ...
await gerbil.dispose(); // Frees resources
```
✗ Bad: Creating many without cleanup

```js
for (let i = 0; i < 10; i++) {
  const g = new Gerbil();
  await g.loadModel("qwen3-0.6b");
  // Forgot to dispose - pages accumulate!
}
```
Performance Tips
Batch UI Updates
Tokens arrive very fast (~100/sec). Consider batching UI updates with requestAnimationFrame.
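For example, buffering tokens and flushing once per animation frame caps DOM updates at ~60/s regardless of token rate (the onToken hook and output element are assumptions):

```js
// Buffer streamed tokens and flush them to the DOM at most once per frame
const output = document.getElementById("output"); // assumed output element
let pending = "";
let flushScheduled = false;

function onToken(text) {
  pending += text;
  if (!flushScheduled) {
    flushScheduled = true;
    requestAnimationFrame(() => {
      output.textContent += pending;
      pending = "";
      flushScheduled = false;
    });
  }
}
```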
Preload Models
Load models during idle time so they're cached for instant use later.
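For instance, the model can be warmed during browser idle time. This sketch uses the Gerbil calls shown above and falls back to a timeout where requestIdleCallback is unavailable:

```js
// Warm the model cache when the browser is idle so later loads read from IndexedDB
const preload = async () => {
  const gerbil = new Gerbil();
  await gerbil.loadModel("qwen3-0.6b"); // first run downloads, later runs hit the cache
  await gerbil.dispose();
};

if ("requestIdleCallback" in window) {
  requestIdleCallback(() => preload());
} else {
  setTimeout(preload, 2000);
}
```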
Use Smaller Models
smollm2-135m is 4x smaller than Qwen3 and often sufficient for simple tasks.
Always Cleanup
Call gerbil.dispose() or worker.terminate() to free GPU memory.