How It Works

Name: Gerbil
Author: Gerbil

Technical deep-dive into how Gerbil works under the hood.

System Overview

Key Design Decisions

1. transformers.js as the Foundation

We use Hugging Face transformers.js which provides pre-converted ONNX models, exact tokenizers, and a unified API across all backends.

2. WebGPU First

WebGPU provides 5-10x speedup over CPU. In browsers, we use native WebGPU. In Node.js, we use headless Chrome as a WebGPU accelerator (ChromeGPUBackend).

3. Quantization for Speed

All models use quantized weights: q4f16 (4-bit weights, fp16 compute) for WebGPU, q4 for CPU. This reduces model size by ~4x.

4. Streaming via Web Workers

Browser inference runs in a Web Worker to keep the UI responsive, stream tokens in real-time, and isolate GPU memory from the main thread.

Inference Pipeline

The inference stack consists of three layers:

Execution Backends

Backend	Environment	Speed	Notes
WebGPU	Browser, Chrome	~100-150 tok/s	Fastest, requires GPU
CPU	Node.js	~30-60 tok/s	Uses SIMD, good on Apple Silicon
WASM	Browser fallback	~5-10 tok/s	Works everywhere

Quantization Types

Type	Weights	Compute	Size Reduction	Use Case
fp32	32-bit float	32-bit	1x (baseline)	Training
fp16	16-bit float	16-bit	2x	GPU inference
q4f16	4-bit int	16-bit	~4x	WebGPU inference
q4	4-bit int	32-bit	~4x	CPU inference
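
The pairing of backend and quantization type in the two tables above can be sketched as a small selection function. This is an illustration rather than Gerbil's actual selection logic: `chooseBackend`, `BackendChoice`, and `Env` are hypothetical names, and using q4 for the WASM fallback is an assumption, since the tables only state q4 for CPU.

```typescript
// Hypothetical sketch: map an environment to a backend and quantization type,
// following the Execution Backends and Quantization Types tables.
type Device = "webgpu" | "cpu" | "wasm";
type Dtype = "q4f16" | "q4";

interface BackendChoice {
  device: Device;
  dtype: Dtype;
}

interface Env {
  isBrowser: boolean;
  // In a browser this would typically come from a check such as `"gpu" in navigator`.
  hasWebGPU: boolean;
}

function chooseBackend(env: Env): BackendChoice {
  // WebGPU first: 4-bit weights with fp16 compute.
  if (env.hasWebGPU) return { device: "webgpu", dtype: "q4f16" };
  // Otherwise fall back to WASM in the browser or CPU in Node.js.
  // Assumption: the WASM path uses q4 like the CPU path.
  return { device: env.isBrowser ? "wasm" : "cpu", dtype: "q4" };
}
```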

WebGPU Acceleration

WebGPU is a modern compute API that provides access to GPU hardware. Gerbil uses it for fast inference in both browsers and Node.js.

Browser Path

Node.js WebGPU Path (ChromeGPUBackend)

Node.js doesn't have native WebGPU, so Gerbil uses headless Chrome as a GPU accelerator:

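The fixed port (43724) keeps the page origin identical across runs. IndexedDB is scoped per origin, and the port is part of the origin, so a changing port would start every run with an empty model cache.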
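
To illustrate why the port matters for IndexedDB, the snippet below compares origins with the standard `URL` class. The `http` scheme and `localhost` host are assumptions for the example; only the port number comes from the text above.

```typescript
// IndexedDB databases are scoped to an origin: scheme + host + port.
// Assumption: the page is served from http://localhost (not stated in the source).
const fixedPortUrl = new URL("http://localhost:43724/");
const laterRunUrl = new URL("http://localhost:43724/index.html");
const randomPortUrl = new URL("http://localhost:51234/");

// Same port on a later run: same origin, so the cached model files are visible.
const cacheShared = fixedPortUrl.origin === laterRunUrl.origin;

// Different port: different origin, so IndexedDB starts out empty.
const cacheLost = fixedPortUrl.origin !== randomPortUrl.origin;

console.log(fixedPortUrl.origin, cacheShared, cacheLost);
```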

Streaming Architecture

LLM inference is computationally intensive. Gerbil runs it in a Web Worker to keep the UI responsive:

Message Protocol

protocol.ts

// Main → Worker
{ type: "load", modelId: "qwen3-0.6b" }
{ type: "generate", messages: [...], options: {...} }
{ type: "interrupt" }
{ type: "reset" }

// Worker → Main
{ status: "loading", message: "Loading model..." }
{ status: "progress", file: "model.onnx", progress: 50 }
{ status: "ready" }
{ status: "token", text: "Hello", state: "answering", tps: 75 }
{ status: "complete", text: "Hello world!", numTokens: 3, tps: 75 }
{ status: "error", error: "Out of memory" }
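
The messages above can be typed as discriminated unions so the main thread handles every status exhaustively. This is a minimal sketch: the type names, `UiState`, and `applyWorkerMessage` are hypothetical, and it assumes each `token` message carries only the newly generated text while `complete` carries the full response.

```typescript
// Message shapes copied from the protocol listing; all type names are hypothetical.
type GenerationState = "thinking" | "answering";

type MainToWorker =
  | { type: "load"; modelId: string }
  | { type: "generate"; messages: unknown[]; options?: Record<string, unknown> }
  | { type: "interrupt" }
  | { type: "reset" };

type WorkerToMain =
  | { status: "loading"; message: string }
  | { status: "progress"; file: string; progress: number }
  | { status: "ready" }
  | { status: "token"; text: string; state: GenerationState; tps: number }
  | { status: "complete"; text: string; numTokens: number; tps: number }
  | { status: "error"; error: string };

interface UiState {
  phase: "idle" | "loading" | "ready" | "generating" | "error";
  output: string;
  error?: string;
}

// Pure function: fold one worker message into the UI state.
function applyWorkerMessage(ui: UiState, msg: WorkerToMain): UiState {
  switch (msg.status) {
    case "loading":
    case "progress":
      return { ...ui, phase: "loading" };
    case "ready":
      return { ...ui, phase: "ready" };
    case "token":
      // Assumption: msg.text is the new piece, so it is appended.
      return { ...ui, phase: "generating", output: ui.output + msg.text };
    case "complete":
      // Assumption: msg.text is the full response, so it replaces the output.
      return { ...ui, phase: "ready", output: msg.text };
    case "error":
      return { ...ui, phase: "error", error: msg.error };
  }
}

// Example request, as it would be passed to worker.postMessage(...).
const loadRequest: MainToWorker = { type: "load", modelId: "qwen3-0.6b" };
```

In a browser, the reducer would typically be called from the worker's `message` event handler with `event.data`.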

Thinking State Tracking

For Qwen3 thinking mode, Gerbil tracks whether the model is "thinking" or "answering" by monitoring special tokens:

thinking.ts

const [START_THINKING_TOKEN_ID, END_THINKING_TOKEN_ID] = 
  tokenizer.encode("<think></think>", { add_special_tokens: false });

let state = "answering";

const tokenCallback = (tokens) => {
  const tokenId = Number(tokens[0]);
  if (tokenId === START_THINKING_TOKEN_ID) state = "thinking";
  if (tokenId === END_THINKING_TOKEN_ID) state = "answering";
};
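
The same state machine can be exercised without a tokenizer by passing the two marker IDs in explicitly. In this sketch `createPhaseTracker` is a hypothetical helper and the token IDs are made up; real IDs would come from the tokenizer as shown above.

```typescript
type Phase = "thinking" | "answering";

// Hypothetical helper: returns a function that reports the phase after each token.
function createPhaseTracker(startId: number, endId: number): (tokenId: number) => Phase {
  let phase: Phase = "answering";
  return (tokenId: number): Phase => {
    if (tokenId === startId) phase = "thinking";
    else if (tokenId === endId) phase = "answering";
    return phase;
  };
}

// Made-up IDs for illustration only.
const START_ID = 1001;
const END_ID = 1002;

const track = createPhaseTracker(START_ID, END_ID);
const phases = [7, START_ID, 8, 9, END_ID, 10].map((id) => track(id));
// phases: answering, thinking, thinking, thinking, answering, answering
```

Note that the closing marker itself is already reported as "answering", which matches the callback above.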

Model Caching

Models are large (100-500 MB), so Gerbil caches them locally to avoid re-downloading:

Environment	Cache Location	Mechanism
Browser	IndexedDB	transformers.js built-in
Node.js (CPU)	~/.cache/huggingface/hub	transformers.js built-in
Node.js (WebGPU)	Chrome's IndexedDB	Via ChromeGPUBackend

Cache Behavior

First load

Downloads from Hugging Face Hub (~15-30s depending on model size)

Subsequent loads

Reads from local cache (~1-2s for browser, ~0.5s for Node.js)

Clearing Cache

Browser (in DevTools console):

Console

indexedDB.deleteDatabase("transformers-cache");

Node.js CLI:

Terminal

npx @tryhamster/gerbil cache --clean

Memory Management

Gerbil automatically manages memory to prevent leaks while maintaining performance. For WebGPU inference, memory is bounded and monitored.

Automatic KV Cache Reset

The KV cache automatically resets when it exceeds the model's context length (2048 tokens for Qwen3). This prevents unbounded memory growth:

auto-reset.ts

// The KV cache resets automatically once it exceeds ~2048 tokens
// No action needed - happens transparently
const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");

// Long conversations work fine - auto-reset keeps the cache within the context window
for (let i = 0; i < 100; i++) {
  await gerbil.generate("Tell me something interesting");
}
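
The reset rule described above amounts to a counter with a threshold. The sketch below is a simplified model of that behavior, not Gerbil's internals: `ContextBudget` is a hypothetical class, and resetting the count to zero (rather than keeping the most recent tokens) is an assumption.

```typescript
// Hypothetical model of the auto-reset rule: count cached tokens and
// reset once the count exceeds the context length.
class ContextBudget {
  private used = 0;
  private readonly contextLength: number;

  constructor(contextLength: number) {
    this.contextLength = contextLength;
  }

  // Record tokens added to the cache; returns true if a reset was triggered.
  add(tokens: number): boolean {
    this.used += tokens;
    if (this.used > this.contextLength) {
      this.used = 0; // Assumption: a reset drops the whole cache.
      return true;
    }
    return false;
  }

  get tokensUsed(): number {
    return this.used;
  }
}

const budget = new ContextBudget(2048);
budget.add(2000); // false: still within the window
budget.add(48);   // false: exactly at the limit, not over it
budget.add(1);    // true: limit exceeded, cache reset
```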

Memory Bounds

Metric	Limit	Notes
Per-page maximum	~4GB	context length × token size
Concurrent pages	5 max	Multiple Gerbil instances
Typical usage	< 2GB	Most conversations < 500 tokens

Memory API

memory-api.ts

// Check memory usage (WebGPU only)
const mem = await gerbil.getMemoryUsage();
if (mem) {
  console.log(`Using ${mem.usedGB.toFixed(1)}GB / ${mem.totalGB.toFixed(1)}GB`);
}

// Clear KV cache manually (resets conversation context)
await gerbil.clearCache();

// Auto-cleanup if threshold exceeded
const didCleanup = await gerbil.checkMemoryAndCleanup(8); // 8GB threshold

// Always dispose when done
await gerbil.dispose();

Long-Running Sessions

For background services or persistent processes, monitor and clean up periodically:

long-running.ts

const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");

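// Periodic memory monitoring; keep the handle so the timer can be stopped
const monitor = setInterval(async () => {
  try {
    const mem = await gerbil.getMemoryUsage();

    if (mem && mem.usedGB > 10) {
      console.warn(`High memory: ${mem.usedGB.toFixed(1)}GB`);
      await gerbil.clearCache();
    }
  } catch (err) {
    // Without this, a failed check becomes an unhandled promise rejection
    console.error("Memory check failed:", err);
  }
}, 60_000); // Check every minute

// On shutdown: clearInterval(monitor); await gerbil.dispose();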

Cleanup Best Practices

✓ Good: Dispose when done

good.ts

const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");
// ... use gerbil ...
await gerbil.dispose(); // Frees resources

✗ Bad: Creating many without cleanup

bad.ts

for (let i = 0; i < 10; i++) {
  const g = new Gerbil();
  await g.loadModel("qwen3-0.6b");
  // Forgot to dispose - pages accumulate!
}
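
One way to make disposal hard to forget is to wrap creation, use, and cleanup in a single helper built on `try`/`finally`. The sketch below is generic: `withResource` and `DisposableResource` are hypothetical names, and the only assumption about Gerbil is the `dispose()` method shown above.

```typescript
// Anything with an async dispose() method, such as the Gerbil instances above.
interface DisposableResource {
  dispose(): Promise<void>;
}

// Hypothetical helper: create a resource, use it, and always dispose it,
// even when the callback throws.
async function withResource<T extends DisposableResource, R>(
  create: () => Promise<T>,
  use: (resource: T) => Promise<R>,
): Promise<R> {
  const resource = await create();
  try {
    return await use(resource);
  } finally {
    await resource.dispose();
  }
}

// Usage with Gerbil might look like this (not executed here):
//
//   const text = await withResource(
//     async () => {
//       const g = new Gerbil();
//       await g.loadModel("qwen3-0.6b");
//       return g;
//     },
//     (g) => g.generate("Tell me something interesting"),
//   );
```

One caveat of this sketch: if `loadModel` itself throws inside `create`, the partially constructed instance is not disposed.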

Performance Tips

Batch UI Updates

Tokens arrive very fast (~100/sec). Consider batching UI updates with requestAnimationFrame.
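
A minimal sketch of such batching is shown below. `createTokenBatcher` is a hypothetical helper; the scheduler is passed in as a parameter so the same code works with `requestAnimationFrame` in the browser and with a manual queue elsewhere.

```typescript
// A scheduler takes a flush callback and runs it later (e.g. on the next frame).
type Schedule = (flush: () => void) => void;

// Hypothetical helper: collect tokens and render them at most once per scheduled flush.
function createTokenBatcher(
  render: (text: string) => void,
  schedule: Schedule,
): (token: string) => void {
  let pending = "";
  let scheduled = false;

  return (token: string): void => {
    pending += token;
    if (scheduled) return; // A flush is already queued; just accumulate.
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const chunk = pending;
      pending = "";
      render(chunk);
    });
  };
}

// Browser usage (not executed here; `outputEl` is a hypothetical DOM element):
//
//   const onToken = createTokenBatcher(
//     (text) => { outputEl.textContent += text; },
//     (flush) => requestAnimationFrame(flush),
//   );
```

With this approach the DOM is touched at most once per frame, regardless of how many tokens arrive in between.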

Preload Models

Load models during idle time so they're cached for instant use later.

Use Smaller Models

smollm2-135m is roughly 4x smaller than qwen3-0.6b and often sufficient for simple tasks.

Always Clean Up

Call gerbil.dispose() or worker.terminate() to free GPU memory.