Browser Usage
Gerbil runs full LLMs directly in the browser, accelerated by WebGPU. There's no server to deploy and no data leaves the user's device.
~225 tok/s desktop · ~50 tok/s iPad · ~38 tok/s iPhone (Qwen3.5-0.8B, in-browser) · Models cached in IndexedDB · Fully private, runs locally
Native WebGPU engine. Chat, vision, embeddings, speech-to-text, and text-to-speech all run on Gerbil's own WebGPUEngine — a lean bundle, with the same code running across desktop and mobile devices that expose WebGPU (Chrome/Edge 113+, Safari 26+, iPadOS 26+). Native STT (Moonshine) is available via MoonshineSTT, and native TTS (Kani-TTS-2) via engine.speak().
Native WebGPU Engine
In a React app, reach for the hooks below — useEngine, useTTS, useSTT — they handle loading, streaming, and lifecycle for you. The vanilla WebGPUEngine shown here is the escape hatch for non-React / Node usage.
Import WebGPUEngine from @tryhamster/gerbil/gpu. Create an engine once (downloads + caches the model in IndexedDB on first run), then call generate, describeImage, or embed. The repo is optional — omit it for a device-aware default, or pass any compatible HuggingFace repo.
Chat / text generation
Generate text, with optional streaming via onToken or the stream() async iterator:
```ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// Zero-config: no repo → a device-aware default chat model.
// Or pass any repo: WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" }).
// Downloads + compiles once (cached in IndexedDB after first run).
const engine = await WebGPUEngine.create();

// One-shot generation.
const result = await engine.generate("Write a haiku about coding", {
  maxTokens: 512,
  systemPrompt: "You are a helpful assistant.",
});
console.log(result.text);
console.log(`${result.tokensPerSecond.toFixed(1)} tok/s`);

// Streaming: yields token-by-token, returns the full result.
for await (const token of engine.stream("Tell me a story")) {
  process.stdout.write(token); // Node only; in a browser, append the token to the DOM instead
}

// Multi-turn chat: pass an array of messages.
await engine.generate([
  { role: "user", content: "Hi!" },
  { role: "assistant", content: "Hello! How can I help?" },
  { role: "user", content: "Explain recursion." },
]);

engine.destroy(); // free GPU resources when done
```

In React: useEngine
You don't need to wrap the engine yourself — Gerbil ships the hook. useEngine from @tryhamster/gerbil/hooks loads the model lazily, streams tokens into completion, tracks isGenerating/tps, and shares one engine per model across every component. Call it with no arguments for a sensible default, or pass any repo:
```tsx
import { useState } from "react";
import { useEngine } from "@tryhamster/gerbil/hooks";

function Chat() {
  // No model argument → a good chat default. Or: useEngine({ model: "..." }).
  const { complete, completion, isGenerating, isLoading, tps } = useEngine();
  const [input, setInput] = useState("");

  async function handleSubmit(e: React.FormEvent) {
    e.preventDefault();
    await complete(input); // lazy-loads on first call, then streams into completion
    setInput("");
  }

  return (
    <div>
      <p>{completion}</p>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          disabled={isGenerating || isLoading}
          placeholder="Ask anything..."
        />
        <button type="submit" disabled={isGenerating || isLoading}>Send</button>
      </form>
      {tps > 0 && <span>{tps.toFixed(1)} tok/s</span>}
    </div>
  );
}
```

The same hook powers vision (useEngine({ enableVision: true }) + describeImage) and embeddings (useEngine({ embedding: true }) + embed). For speech, use useTTS and useSTT from the same import. See the React hooks reference for the full surface.
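The hook is described as sharing one engine per model across every component. This page does not show how that is done internally. As a minimal sketch of one way to get the same behavior, assuming a module-level cache of promises keyed by model id, something like the following would work. `createSharedCache`, `getShared`, and the stand-in loader are hypothetical names for illustration, not part of Gerbil's API.

```typescript
// Hypothetical sketch: one in-flight or resolved load per key, shared by all callers.
type Loader<T> = (key: string) => Promise<T>;

function createSharedCache<T>(load: Loader<T>) {
  const cache = new Map<string, Promise<T>>();

  return function getShared(key: string): Promise<T> {
    let entry = cache.get(key);
    if (!entry) {
      entry = load(key).catch((err) => {
        // Drop failed loads so a later call can retry.
        cache.delete(key);
        throw err;
      });
      cache.set(key, entry);
    }
    return entry;
  };
}

// Stand-in loader that counts how often it actually runs.
let loadCount = 0;
const sharedEngine = createSharedCache(async (model: string) => {
  loadCount += 1;
  return { model };
});

const first = sharedEngine("model-a");
const second = sharedEngine("model-a"); // same promise, no second load
const other = sharedEngine("model-b");  // different key, separate load
```

Caching the promise rather than the resolved value means two components that mount at the same time still trigger only one load.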
Embeddings
Load an embedding model with embedding: true. embed() returns an L2-normalized vector, so cosine similarity is just a dot product:
```ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/embeddinggemma-300m-4bit",
  embedding: true,
});

// EmbeddingGemma is asymmetric: tag queries vs. documents.
const query = await engine.embed("how do I cache a model?", { taskType: "query" });
const doc = await engine.embed("Models are cached in IndexedDB.", { taskType: "document" });

// Vectors are unit-norm → cosine similarity == dot product.
const similarity = query.reduce((s, v, i) => s + v * doc[i], 0);
console.log(query.length, "dims", similarity.toFixed(3));
```

Create Options
```ts
interface WebGPUEngineOptions {
  repo?: string;          // HuggingFace repo (optional; defaults per capability)
  enableVision?: boolean; // Build the ViT vision tower (for describeImage)
  embedding?: boolean;    // Load as an embedding model (use embed(), not generate())
  maxSeqLen?: number;     // Cap sequence length (auto from config otherwise)
}

interface GenerateOptions {
  maxTokens?: number;       // Default: 512
  systemPrompt?: string;    // Prepend a system message
  stopSequences?: string[]; // Stop on any of these strings
  sampling?: { temperature?: number; topP?: number; topK?: number };
  onToken?: (token: string) => void; // Streaming callback
}
```

Gate on WebGPU availability
The native engine requires WebGPU. Check before creating it and fall back to a server route when it's missing:
```ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// WebGPU detection is a one-line feature check; no helper needed.
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;

if (hasWebGPU) {
  const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
  // ...native inference
} else {
  // No WebGPU (older Safari, Firefox, locked-down devices).
  // Fall back to a server endpoint or show a message.
}
```

Lazy Loading
Create the engine lazily on first use — not on page load — to avoid surprise downloads. Hold it in a ref/singleton so subsequent calls reuse the same loaded model:
```ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

let enginePromise: Promise<WebGPUEngine> | null = null;

// Created once, on first call. Concurrent callers share the same promise.
function getEngine() {
  enginePromise ??= WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
  });
  return enginePromise;
}

// Later: the first call downloads + compiles; the rest are instant.
const engine = await getEngine();
await engine.generate("Hello!");
```

Loading Progress
Pass onProgress to create() to surface download + compile progress. The callback receives (loaded, total, message): loaded and total let you derive a 0–100 percentage, and message is a human-readable phase:
```ts
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  onProgress: (loaded, total, message) => {
    // message walks through: "Fetching model config..." →
    // "Downloading model.safetensors (120/404 MB)" → "Compiling shaders..."
    const pct = total ? Math.round((loaded / total) * 100) : loaded;
    console.log(`${pct}%: ${message}`);
  },
});
```

Chat Messages
For multi-turn chat, pass a ChatMessage[] to generate(). The chat template is applied for you:
```ts
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

await engine.generate([
  { role: "system", content: "You are concise." },
  { role: "user", content: "What is recursion?" },
]);
```

Vision
Vision runs natively too — the Qwen3.5 ViT ships inside the same lean 4-bit repo. Pass enableVision: true, decode an image to RGB pixels, and call describeImage:
```ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  enableVision: true,
});

// Decode any <img>/File/URL to RGB pixels (HWC, 0..255) via a canvas.
async function decodeToPixels(src: string) {
  const img = new Image();
  img.crossOrigin = "anonymous";
  await new Promise((res, rej) => { img.onload = res; img.onerror = rej; img.src = src; });
  const canvas = document.createElement("canvas");
  canvas.width = img.naturalWidth;
  canvas.height = img.naturalHeight;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(img, 0, 0);
  const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3);
  for (let i = 0, j = 0; i < data.length; i += 4, j += 3) {
    rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2];
  }
  return { pixels: rgb, width: canvas.width, height: canvas.height };
}

const image = await decodeToPixels("/photo.jpg");
const { text } = await engine.describeImage(image, "What's in this image?");
console.log(text);
```

See the Vision documentation for more.
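The RGBA-to-RGB packing step inside decodeToPixels can be pulled out as a pure function, which makes it easy to test without a canvas. This is a small sketch of that step only. `rgbaToRgb` is a hypothetical helper name, and the length check is an addition for illustration.

```typescript
// Drop the alpha channel: RGBA (4 bytes/pixel) -> RGB (3 bytes/pixel), HWC order.
function rgbaToRgb(
  rgba: Uint8ClampedArray,
  width: number,
  height: number,
): Uint8ClampedArray {
  const expected = width * height * 4;
  if (rgba.length !== expected) {
    throw new RangeError(`expected ${expected} bytes, got ${rgba.length}`);
  }
  const rgb = new Uint8ClampedArray(width * height * 3);
  for (let i = 0, j = 0; i < rgba.length; i += 4, j += 3) {
    rgb[j] = rgba[i];         // R
    rgb[j + 1] = rgba[i + 1]; // G
    rgb[j + 2] = rgba[i + 2]; // B (alpha at i + 3 is skipped)
  }
  return rgb;
}

// Two pixels in a 2x1 image: opaque red, then half-transparent blue.
const sample = new Uint8ClampedArray([255, 0, 0, 255, 0, 0, 255, 128]);
const packed = rgbaToRgb(sample, 2, 1);
console.log(Array.from(packed)); // [255, 0, 0, 0, 0, 255]
```

Alpha is discarded rather than blended, so transparent pixels keep their raw color values. If that matters for an image, draw it over a solid background before reading the pixels.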
Utilities
Helper functions to pick device-appropriate models, check whether a model fits the current device, and check storage before a large download. WebGPU detection itself needs no helper.
```ts
import {
  getRecommendedModels, // Memory-aware model selection
  isModelSafeForDevice, // Will this model fit this device?
  checkStorageQuota,    // Verify disk space before downloading
} from "@tryhamster/gerbil/browser";

// WebGPU detection is a one-line feature check; no helper needed.
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;
```

Detecting WebGPU
The native engine requires WebGPU. There's no helper to import — it's a single feature check on navigator:
```ts
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;

if (!hasWebGPU) {
  // Show fallback UI or error message
  alert("Please use Chrome/Edge 113+ or Safari 26+ for WebGPU support");
}
```

getRecommendedModels()
Memory-aware model selection based on navigator.deviceMemory:
```ts
import { getRecommendedModels } from "@tryhamster/gerbil/browser";
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const models = getRecommendedModels();
// {
//   chat: "mlx-community/Qwen3.5-0.8B-4bit", // or LFM2.5-350M on low-memory
//   embedding: "mlx-community/embeddinggemma-300m-4bit",
//   reason: "8GB+ detected, using full models"
// }

// Feed the recommended chat model into the engine for device-aware defaults.
const engine = await WebGPUEngine.create({ repo: models.chat });
```

Mobile: On mobile devices, Gerbil automatically uses q4 quantization (CPU-optimized) instead of q4f16 (GPU-optimized) for better compatibility and performance.
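How getRecommendedModels() maps memory to a model is not spelled out on this page. As a rough sketch of what memory-aware selection can look like, the function below takes the value of navigator.deviceMemory (an approximate figure in GB that only some browsers expose) and picks between two repos. `pickChatModel`, the 8 GB threshold, the fallback for unknown memory, and the `"small-model-repo"` placeholder are all assumptions for illustration, not Gerbil's actual logic.

```typescript
interface ModelChoice {
  chat: string;
  reason: string;
}

// Hypothetical selector. The threshold and the "unknown memory" fallback are assumptions.
function pickChatModel(
  deviceMemoryGB: number | undefined,
  fullModel: string,
  smallModel: string,
  minFullGB = 8,
): ModelChoice {
  if (deviceMemoryGB === undefined) {
    // navigator.deviceMemory is missing in some browsers; stay conservative here.
    return { chat: smallModel, reason: "deviceMemory unavailable, using the smaller model" };
  }
  if (deviceMemoryGB >= minFullGB) {
    return { chat: fullModel, reason: `${minFullGB}GB+ detected, using full models` };
  }
  return { chat: smallModel, reason: `${deviceMemoryGB}GB detected, using the smaller model` };
}

// "small-model-repo" is a placeholder, not a real repo id.
const desktop = pickChatModel(8, "mlx-community/Qwen3.5-0.8B-4bit", "small-model-repo");
const unknown = pickChatModel(undefined, "mlx-community/Qwen3.5-0.8B-4bit", "small-model-repo");
console.log(desktop.reason); // "8GB+ detected, using full models"
```

In a page, the first argument would come from `(navigator as { deviceMemory?: number }).deviceMemory`. A production selector would likely combine this with other signals, since the value is coarse and capped.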
checkStorageQuota()
Verify available storage before downloading a large model:
```ts
import { checkStorageQuota } from "@tryhamster/gerbil/browser";

// Check if we have ~404MB available (for Qwen3.5-0.8B-4bit)
const storage = await checkStorageQuota(404);
// {
//   ok: true,
//   availableMB: 4500, // MB available
//   usedMB: 120,       // MB already used by this origin
//   quotaMB: 4620,     // total quota granted
//   message: "4.5GB available, 404MB required"
// }

if (!storage.ok) {
  alert(storage.message); // "Only 200MB available, need 404MB"
  return;
}
```

isModelSafeForDevice()
Check whether a specific model is safe to load on the current device before you create the engine — this catches iOS memory limits and steers you to a smaller repo:
```ts
import { isModelSafeForDevice } from "@tryhamster/gerbil/browser";
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const check = isModelSafeForDevice("mlx-community/Qwen3.5-2B-4bit");
// On iPhone:
// {
//   safe: false,
//   risky: true, // borderline: OK to offer an explicit "load anyway"
//   reason: "Model ... (~1.7GB) exceeds the iPhone WKWebView memory budget ...",
//   recommendation: "Use mlx-community/Qwen3.5-0.8B-4bit on iPhone; ...",
//   maxSafeModel: "mlx-community/Qwen3.5-0.8B-4bit"
// }

if (!check.safe) {
  console.warn(check.reason);
  // risky === true → borderline; a UI can offer "load anyway" on capable hardware.
  // risky === false → genuinely too large; just fall back.
  const engine = await WebGPUEngine.create({ repo: check.maxSafeModel });
}
```

Model Preloading
Warm the IndexedDB cache ahead of time during app initialization, so users don't wait when they first use AI. WebGPUEngine.create() downloads + compiles the model and caches the weights; calling destroy() frees the GPU memory while the download stays cached for an instant reload. This works outside React hooks — perfect for app startup.
```ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// During app initialization (before React mounts)
async function initApp() {
  // Warm the cache for the chat model with progress tracking.
  const chat = await WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
    onProgress: (loaded, total, message) => {
      const pct = total ? Math.round((loaded / total) * 100) : 0;
      console.log(`${message}: ${pct}%`);
    },
  });
  // Free GPU memory but keep the weights cached in IndexedDB.
  chat.destroy();

  // Warm the embedding model too (load with embedding: true).
  const embed = await WebGPUEngine.create({
    repo: "mlx-community/embeddinggemma-300m-4bit",
    embedding: true,
  });
  embed.destroy();

  console.log("All models cached!");
}

// Call during app startup
initApp();
```

Create Options
```ts
interface WebGPUEngineOptions {
  /** HF repo id (e.g. "mlx-community/Qwen3.5-0.8B-4bit") or full URL. */
  repo?: string;
  /** Progress callback: (loaded, total, message). */
  onProgress?: (loaded: number, total: number, message: string) => void;
  /** Weight dtype. "auto" picks q4 on mobile, native on desktop. */
  dtype?: "auto" | "f32" | "q4";
  /** Max sequence length (default: from model config, capped at 4096). */
  maxSeqLen?: number;
  /** Build an embedding graph so embed()/similarity() work. */
  embedding?: boolean;
  /** Build the vision encoder so describeImage() works (vision repos only). */
  enableVision?: boolean;
}
```

Keep loaded vs. free memory
Control whether the engine stays in GPU memory after warming the cache:
| Action | Behavior | Use Case |
|---|---|---|
| destroy() | Free GPU memory, weights stay cached in IndexedDB | Preload for later, save RAM |
| keep the handle | Engine stays loaded in memory | Instant use, no reload delay |
```ts
// Warm only: free GPU memory after caching the weights.
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
engine.destroy();
// Later: WebGPUEngine.create() reloads from the IndexedDB cache (~1-2s).

// Keep in memory: uses RAM but inference is instant.
const live = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
const { text } = await live.generate("Hello!"); // no reload wait
```

Native Browser Models
The native engine loads compact 4-bit MLX-community repos directly from HuggingFace by repo id. Each is cached in IndexedDB after first download and runs on any device with WebGPU.
| Repo | Use | Size | Notes |
|---|---|---|---|
| mlx-community/Qwen3.5-0.8B-4bit | Chat + Vision | ~404MB | ViT ships in-repo (enableVision) |
| mlx-community/embeddinggemma-300m-4bit | Embeddings | ~173MB | 768-dim, asymmetric query/document |
Any compatible HuggingFace repo works — pass its id as repo.
Browser Support
| Browser | Version | Status |
|---|---|---|
| Chrome / Edge | 113+ | ✓ Full support |
| Safari / iPadOS | 26+ | ✓ Native engine (packed-f16 KV) |
| Firefox | — | ✗ Behind flag, not recommended |
iOS Memory Guards
iOS Safari and iPadOS have strict memory limits (the web-content process is killed around 1.5–2GB). The native engine accounts for this automatically: on WebKit it caps maxSeqLen conservatively and uses a packed-f16 KV cache to halve memory. Prefer the 4-bit repos on mobile. The same guards apply to every modality — text, vision, embeddings, and audio (Moonshine STT, Kani-TTS-2) — since they all run on the one WebGPU engine.
Automatic Protection: The native engine caps sequence length and KV-cache memory on WebKit, detects crashes, and caches downloads in IndexedDB.
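To see why capping sequence length and switching the KV cache to f16 both matter, it helps to work the arithmetic. The sketch below uses the usual estimate for a standard attention stack: two tensors (keys and values) per layer, each of size kvHeads × headDim × seqLen. The shape numbers are made up for illustration and are not taken from any model's config, and `kvCacheBytes` is a hypothetical helper, not an engine API.

```typescript
interface KvShape {
  layers: number;  // attention layers that keep a KV cache
  kvHeads: number; // key/value heads per layer
  headDim: number; // dimension per head
}

function kvCacheBytes(shape: KvShape, seqLen: number, bytesPerElement: 2 | 4): number {
  // Factor 2: one tensor for keys, one for values.
  return 2 * shape.layers * shape.kvHeads * shape.headDim * seqLen * bytesPerElement;
}

const toMiB = (bytes: number) => bytes / (1024 * 1024);

// Illustrative shape only.
const shape: KvShape = { layers: 24, kvHeads: 2, headDim: 128 };

const f32At4096 = kvCacheBytes(shape, 4096, 4); // f32 cache, 4096 tokens
const f16At4096 = kvCacheBytes(shape, 4096, 2); // f16 cache, same length
const f16At2048 = kvCacheBytes(shape, 2048, 2); // f16 cache, capped length

console.log(toMiB(f32At4096), toMiB(f16At4096), toMiB(f16At2048)); // 192 96 48
```

With these example numbers, f16 storage halves the cache and halving the sequence cap halves it again, which is the kind of headroom that matters when the process budget is around 1.5–2GB.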
What Happens Automatically
| Path | Backend | iOS Memory Guard | Crash Detect | Chunked DL |
|---|---|---|---|---|
| WebGPUEngine | Native WebGPU | ✓ Caps seq len + packed-f16 KV | ✓ | ✓ IndexedDB cache |
| MoonshineSTT | Native WebGPU | ✓ Caps seq len + packed-f16 KV | ✓ | ✓ IndexedDB cache |
| engine.speak (Kani-TTS-2) | Native WebGPU | ✓ Caps seq len + packed-f16 KV | ✓ | ✓ IndexedDB cache |
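The crash-detection column deserves a word of explanation, because a tab killed for memory gets no chance to report anything. A common pattern, sketched here under the assumption that a persistent marker is written before loading and cleared on success, is to look for a leftover marker on the next start. All names below (`markPhase`, `clearPhase`, `detectCrash`, the storage key) are hypothetical, and this is not Gerbil's implementation.

```typescript
// Minimal subset of the Web Storage API, so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

type LoadPhase = "downloading" | "initializing";

interface CrashReport {
  crashed: boolean;
  phase?: LoadPhase;
  modelId?: string;
}

const MARKER_KEY = "example:load-phase"; // hypothetical key name

function markPhase(store: KeyValueStore, modelId: string, phase: LoadPhase): void {
  store.setItem(MARKER_KEY, JSON.stringify({ modelId, phase }));
}

function clearPhase(store: KeyValueStore): void {
  store.removeItem(MARKER_KEY);
}

// Call at startup: a leftover marker means the last load never finished.
function detectCrash(store: KeyValueStore): CrashReport {
  const raw = store.getItem(MARKER_KEY);
  if (raw === null) return { crashed: false };
  store.removeItem(MARKER_KEY); // report each incident once
  try {
    const { modelId, phase } = JSON.parse(raw) as { modelId: string; phase: LoadPhase };
    return { crashed: true, modelId, phase };
  } catch {
    return { crashed: true };
  }
}

// In-memory store standing in for localStorage.
function memoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: (k) => data.get(k) ?? null,
    setItem: (k, v) => { data.set(k, v); },
    removeItem: (k) => { data.delete(k); },
  };
}

const store = memoryStore();
markPhase(store, "mlx-community/Qwen3.5-2B-4bit", "initializing");
// ...the tab is killed here, so clearPhase never runs...
const report = detectCrash(store);
```

A leftover marker can also mean the user simply closed the tab mid-load, so a check like this is a hint to suggest a smaller model rather than proof of an out-of-memory kill.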
iOS Compatibility Matrix
| Model | Size | iOS Safe | Notes |
|---|---|---|---|
| LFM2.5-350M | ~199MB | ✓ Yes | Fastest text default on iOS (~46 tok/s) |
| EmbeddingGemma-300M | ~173MB | ✓ Yes | Embeddings (768-dim), runs on iPad |
| Qwen3.5-0.8B | ~404MB | ✓ Yes | Text + vision (~50 tok/s iPad, ~38 tok/s iPhone); +~192MB with ViT |
| Moonshine | ~190MB | ⚠ Pending | Native STT; runs on desktop, iOS support not yet confirmed |
Manual Utilities (Advanced)
For custom implementations or advanced control, these utilities are available:
```ts
import {
  isModelSafeForDevice,    // Check if model is safe for current device
  detectMemoryCrash,       // Check if previous session crashed
  setDownloadPhase,        // Track download phase for crash detection
  clearDownloadPhase,      // Clear phase on success
  downloadModelChunked,    // Resumable chunked downloads
  hasIncompleteDownload,   // Check for interrupted downloads
  clearIncompleteDownload, // Clear partial download
} from "@tryhamster/gerbil/browser";

// Check model safety before loading
const check = isModelSafeForDevice("mlx-community/Qwen3.5-2B-4bit");
if (!check.safe) {
  console.log(check.reason);         // "Model ... exceeds the iPhone WKWebView memory budget..."
  console.log(check.recommendation); // "Use mlx-community/Qwen3.5-0.8B-4bit on iPhone; ..."
  console.log(check.maxSafeModel);   // "mlx-community/Qwen3.5-0.8B-4bit"
  console.log(check.risky);          // true = borderline (offer "load anyway"); false = hard block
}

// Detect if page crashed during previous model load
const crash = detectMemoryCrash();
if (crash.crashed) {
  console.log(crash.recommendation); // "The model was too large..."
  console.log(crash.phase);          // "downloading" | "initializing"
  console.log(crash.modelId);        // which model caused it
}
```

Chunked Resumable Downloads
Model downloads automatically use chunked downloading with resume support. If a download is interrupted (page refresh, crash, network error), it resumes from where it left off. For manual control:
```ts
import {
  downloadModelChunked,
  hasIncompleteDownload,
  clearIncompleteDownload,
} from "@tryhamster/gerbil/browser";

// Check for interrupted downloads
const incomplete = await hasIncompleteDownload("mlx-community/Qwen3.5-0.8B-4bit");
if (incomplete.incomplete) {
  console.log(`Resuming: ${incomplete.percent}% complete`);
}

// Lets the caller cancel the download later via abortController.abort().
const abortController = new AbortController();

// Download with progress and resume support
const buffer = await downloadModelChunked(
  "https://huggingface.co/...",
  "mlx-community/Qwen3.5-0.8B-4bit",
  {
    onProgress: (info) => {
      console.log(`${info.phase}: ${info.percent}%`);
    },
    signal: abortController.signal,
  }
);
```

| Feature | Description |
|---|---|
| HTTP Range requests | Downloads in 1.5MB chunks using Range: bytes=start-end |
| IndexedDB storage | Each chunk stored separately to avoid large transaction spikes |
| Automatic resume | Tracks completed chunks in manifest, resumes from last position |
| ETag validation | Clears cached chunks if model version changes |
| Abort support | Cancel downloads gracefully with AbortController |
| Fallback | Falls back to regular download if server doesn't support Range |
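The Range and resume rows in the table come down to a small piece of bookkeeping: split the file into fixed-size chunks, skip the ones already stored, and request the rest by byte range. The sketch below shows only that planning step. `pendingChunks` is a hypothetical helper, and reading "1.5MB" as 1.5 MiB is an assumption.

```typescript
interface ChunkRange {
  index: number;
  start: number;
  end: number;    // inclusive, as in the HTTP Range header
  header: string; // value for the Range request header
}

const CHUNK_BYTES = 1.5 * 1024 * 1024; // assumes "1.5MB" means 1.5 MiB

// List the byte ranges still to fetch; `done` holds chunk indices already stored.
function pendingChunks(
  totalBytes: number,
  done: ReadonlySet<number>,
  chunkBytes: number = CHUNK_BYTES,
): ChunkRange[] {
  const count = Math.ceil(totalBytes / chunkBytes);
  const ranges: ChunkRange[] = [];
  for (let index = 0; index < count; index++) {
    if (done.has(index)) continue; // resume: skip completed chunks
    const start = index * chunkBytes;
    const end = Math.min(start + chunkBytes, totalBytes) - 1;
    ranges.push({ index, start, end, header: `bytes=${start}-${end}` });
  }
  return ranges;
}

// A 4 MiB file with chunks 0 and 1 already stored leaves one partial chunk.
const remaining = pendingChunks(4 * 1024 * 1024, new Set([0, 1]));
console.log(remaining.map((r) => r.header)); // ["bytes=3145728-4194303"]
```

Each pending entry would then be fetched with `fetch(url, { headers: { Range: entry.header }, signal })` and written to storage under its index. A server that ignores Range answers 200 instead of 206, which is the case the fallback row covers.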
Troubleshooting
"WebGPU not supported"
- Update to Chrome/Edge 113+
- Check chrome://gpu for WebGPU status
- Try enabling chrome://flags/#enable-unsafe-webgpu
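When narrowing down this error, keep in mind that `"gpu" in navigator` only says the API exists. `navigator.gpu.requestAdapter()` can still resolve to null, for example when the GPU or driver is blocked. A slightly stronger probe, sketched here with structural types so it compiles without WebGPU type definitions, separates the two cases. `probeWebGPU` and the status strings are hypothetical names, not part of Gerbil.

```typescript
// Structural stand-ins for the parts of the WebGPU API this sketch touches.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type WebGPUStatus = "ok" | "no-api" | "no-adapter";

async function probeWebGPU(nav: NavigatorLike | undefined): Promise<WebGPUStatus> {
  if (!nav || !nav.gpu) return "no-api"; // browser does not expose WebGPU at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "ok" : "no-adapter"; // API present, but no usable GPU
  } catch {
    return "no-adapter";
  }
}

// In a page: const status = await probeWebGPU(navigator as NavigatorLike);
```

"no-api" points to a browser upgrade, while "no-adapter" points to the chrome://gpu and flags checks in the list above.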
Slow first load
The first load downloads the model (~404MB for Qwen3.5-0.8B) and compiles WebGPU shaders. Subsequent loads read from the IndexedDB cache and are much faster (~2-5s).
Out of memory
Switch to a smaller repo such as LFM2.5-350M, which uses less GPU memory, and close other GPU-intensive tabs.
Model fails to download
The native WebGPU engine fetches weights over plain HTTPS and caches them in IndexedDB — no special cross-origin headers are required. Check your network connection and that the HuggingFace repo is reachable.
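A download can also fail because the origin has run out of storage quota, which is the case checkStorageQuota() is meant to catch. How that helper computes its result is not shown on this page. A minimal sketch of the same idea, assuming the standard `navigator.storage.estimate()` API as the data source, looks like this. `evaluateQuota` and `checkQuota` are hypothetical names, and the result shape simply mirrors the example output shown earlier.

```typescript
interface QuotaCheck {
  ok: boolean;
  availableMB: number;
  usedMB: number;
  quotaMB: number;
  message: string;
}

interface StorageLike {
  estimate(): Promise<{ usage?: number; quota?: number }>;
}

const MB = 1024 * 1024;

// Pure part: turn an estimate into a yes/no answer for a required size.
function evaluateQuota(
  estimate: { usage?: number; quota?: number },
  requiredMB: number,
): QuotaCheck {
  const usedMB = Math.round((estimate.usage ?? 0) / MB);
  const quotaMB = Math.round((estimate.quota ?? 0) / MB);
  const availableMB = Math.max(0, quotaMB - usedMB);
  const ok = availableMB >= requiredMB;
  const message = ok
    ? `${availableMB}MB available, ${requiredMB}MB required`
    : `Only ${availableMB}MB available, need ${requiredMB}MB`;
  return { ok, availableMB, usedMB, quotaMB, message };
}

// Browser part: returns null when the Storage API is not exposed.
async function checkQuota(
  storage: StorageLike | undefined,
  requiredMB: number,
): Promise<QuotaCheck | null> {
  if (!storage) return null;
  return evaluateQuota(await storage.estimate(), requiredMB);
}

// In a page: const result = await checkQuota(navigator.storage, 404);
const tight = evaluateQuota({ usage: 120 * MB, quota: 320 * MB }, 404);
console.log(tight.message); // "Only 200MB available, need 404MB"
```

Browsers report estimates rather than exact figures, so it is worth leaving some margin above the model size.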
Next.js Configuration
The native engine ships as ESM and loads weights over plain HTTPS, so no special cross-origin headers are needed. Keep the browser bundle clean by excluding the Node-only paths from the GPU barrel:
```js
// next.config.js
/** @type {import('next').NextConfig} */
const nextConfig = {
  webpack: (config, { isServer }) => {
    if (!isServer) {
      // Exclude Node.js-only modules from the browser bundle.
      config.resolve.alias = {
        ...config.resolve.alias,
        webgpu: false, // node-dawn binding, server/Node only
      };
      config.resolve.fallback = {
        ...config.resolve.fallback,
        path: false,
        fs: false,
        os: false,
      };
    }

    return config;
  },
};

module.exports = nextConfig;
```

Next Steps
- React Hooks Reference: drive the native engine from React
- Text-to-Speech: generate natural speech in the browser
- Speech-to-Text: transcribe audio with native Moonshine
- Vision AI: analyze images in the browser