Browser Usage

Gerbil runs full LLMs directly in the browser, accelerated by WebGPU. There's no server to deploy and no data leaves the user's device.

~225 tok/s desktop · ~50 tok/s iPad · ~38 tok/s iPhone (Qwen3.5-0.8B, in-browser) · Models cached in IndexedDB · Fully private, runs locally

Native WebGPU engine. Chat, vision, embeddings, speech-to-text, and text-to-speech all run on Gerbil's own WebGPUEngine — a lean bundle, with the same code running across desktop and mobile devices that expose WebGPU (Chrome/Edge 113+, Safari 26+, iPadOS 26+). Native STT (Moonshine) is available via MoonshineSTT, and native TTS (Kani-TTS-2) via engine.speak().

Native WebGPU Engine

In a React app, reach for the hooks below — useEngine, useTTS, useSTT — they handle loading, streaming, and lifecycle for you. The vanilla WebGPUEngine shown here is the escape hatch for non-React / Node usage.

Import WebGPUEngine from @tryhamster/gerbil/gpu. Create an engine once (downloads + caches the model in IndexedDB on first run), then call generate, describeImage, or embed. The repo is optional — omit it for a device-aware default, or pass any compatible HuggingFace repo.

Chat / text generation

Generate text, with optional streaming via onToken or the stream() async iterator:

chat.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// Zero-config: no repo → a device-aware default chat model.
// Or pass any repo: WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" }).
// Downloads + compiles once (cached in IndexedDB after first run).
const engine = await WebGPUEngine.create();

// One-shot generation.
const result = await engine.generate("Write a haiku about coding", {
  maxTokens: 512,
  systemPrompt: "You are a helpful assistant.",
});
console.log(result.text);
console.log(`${result.tokensPerSecond.toFixed(1)} tok/s`);

// Streaming: yields token-by-token, returns the full result.
for await (const token of engine.stream("Tell me a story")) {
  // process.stdout exists only in Node; in a browser, append the token to the DOM instead.
  process.stdout.write(token);
}

// Multi-turn chat: pass an array of messages.
await engine.generate([
  { role: "user", content: "Hi!" },
  { role: "assistant", content: "Hello! How can I help?" },
  { role: "user", content: "Explain recursion." },
]);

engine.destroy(); // free GPU resources when done

In React: useEngine

You don't need to wrap the engine yourself — Gerbil ships the hook. useEngine from @tryhamster/gerbil/hooks loads the model lazily, streams tokens into completion, tracks isGenerating/tps, and shares one engine per model across every component. Call it with no arguments for a sensible default, or pass any repo:

Chat.tsx
import { useState } from "react";
import { useEngine } from "@tryhamster/gerbil/hooks";

function Chat() {
  // No model argument → a good chat default. Or: useEngine({ model: "..." }).
  const { complete, completion, isGenerating, isLoading, tps } = useEngine();
  const [input, setInput] = useState("");

  async function handleSubmit(e: React.FormEvent) {
    e.preventDefault();
    await complete(input); // lazy-loads on first call, then streams into completion
    setInput("");
  }

  return (
    <div>
      <p>{completion}</p>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          disabled={isGenerating || isLoading}
          placeholder="Ask anything..."
        />
        <button type="submit" disabled={isGenerating || isLoading}>Send</button>
      </form>
      {tps > 0 && <span>{tps.toFixed(1)} tok/s</span>}
    </div>
  );
}

The same hook powers vision (useEngine({ enableVision: true }) + describeImage) and embeddings (useEngine({ embedding: true }) + embed). For speech, use useTTS and useSTT from the same import. See the React hooks reference for the full surface.

Embeddings

Load an embedding model with embedding: true. embed() returns an L2-normalized vector, so cosine similarity is just a dot product:

embeddings.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/embeddinggemma-300m-4bit",
  embedding: true,
});

// EmbeddingGemma is asymmetric: tag queries vs. documents.
const query = await engine.embed("how do I cache a model?", { taskType: "query" });
const doc = await engine.embed("Models are cached in IndexedDB.", { taskType: "document" });

// Vectors are unit-norm → cosine similarity == dot product.
const similarity = query.reduce((s, v, i) => s + v * doc[i], 0);
console.log(query.length, "dims", similarity.toFixed(3));
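
Because the vectors are unit-norm, ranking several documents against one query reduces to sorting by dot product. The sketch below assumes the embeddings behave like plain numeric arrays; `dot` and `rankBySimilarity` are illustrative helpers, not part of Gerbil, and the sample vectors are hand-made stand-ins for embed() output.

```typescript
type Vec = ArrayLike<number>;

// Dot product of two equal-length vectors.
function dot(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Returns document indices with their scores, best match first.
function rankBySimilarity(query: Vec, docs: Vec[]): { index: number; score: number }[] {
  return docs
    .map((doc, index) => ({ index, score: dot(query, doc) }))
    .sort((x, y) => y.score - x.score);
}

// Hand-made unit vectors standing in for real embeddings.
const sampleQuery = [1, 0];
const sampleDocs = [[0, 1], [0.6, 0.8], [1, 0]];
const ranked = rankBySimilarity(sampleQuery, sampleDocs);
console.log(ranked.map((r) => r.index));
```

With real embeddings, pass the query vector and an array of document vectors; the dot product only equals cosine similarity while every vector is unit-norm.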

Create Options

types.ts
interface WebGPUEngineOptions {
  repo?: string;          // HuggingFace repo (optional; defaults per capability)
  enableVision?: boolean; // Build the ViT vision tower (for describeImage)
  embedding?: boolean;    // Load as an embedding model (use embed(), not generate())
  maxSeqLen?: number;     // Cap sequence length (auto from config otherwise)
}

interface GenerateOptions {
  maxTokens?: number;       // Default: 512
  systemPrompt?: string;    // Prepend a system message
  stopSequences?: string[]; // Stop on any of these strings
  sampling?: { temperature?: number; topP?: number; topK?: number };
  onToken?: (token: string) => void; // Streaming callback
}

Gate on WebGPU availability

The native engine requires WebGPU. Check before creating it and fall back to a server route when it's missing:

gate.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// WebGPU detection is a one-line feature check; no helper needed.
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;

if (hasWebGPU) {
  const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
  // ...native inference
} else {
  // No WebGPU (older Safari, Firefox, locked-down devices).
  // Fall back to a server endpoint or show a message.
}
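
The `"gpu" in navigator` check only shows that the API surface exists. In the WebGPU API, `navigator.gpu.requestAdapter()` can still resolve to `null` (for example on blocklisted hardware), so a stricter gate may be worth having. A minimal sketch under that assumption; `hasUsableWebGPU` and `NavigatorLike` are hypothetical names, not Gerbil exports:

```typescript
// Structural type so the check can also run against a stub outside the browser.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown> };
}

// True only if the API exists AND an adapter can actually be obtained.
async function hasUsableWebGPU(nav: NavigatorLike | undefined): Promise<boolean> {
  if (!nav || !nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    // Treat any failure to obtain an adapter as "no WebGPU".
    return false;
  }
}

// In a browser, pass the real object:
//   hasUsableWebGPU(typeof navigator !== "undefined" ? navigator : undefined)
```

The extra round trip is asynchronous, so run it once and cache the result rather than calling it before every generation.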

Lazy Loading

Create the engine lazily on first use — not on page load — to avoid surprise downloads. Hold it in a ref/singleton so subsequent calls reuse the same loaded model:

lazy.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

let enginePromise: Promise<WebGPUEngine> | null = null;

// Created once, on first call. Concurrent callers share the same promise.
function getEngine() {
  enginePromise ??= WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
  });
  return enginePromise;
}

// Later: the first call downloads + compiles; the rest are instant.
const engine = await getEngine();
await engine.generate("Hello!");
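
One caveat with caching the promise: if the first create() rejects (say, the network drops mid-download), the rejected promise stays cached and every later call fails too. The sketch below is one way to handle that, using a generic helper that forgets a failed attempt. `lazyAsync` is a hypothetical name, and the factory stands in for `() => WebGPUEngine.create(...)`.

```typescript
// Wraps an async factory so it runs at most once at a time,
// and retries on the next call if the previous attempt failed.
function lazyAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending) return pending;
    const attempt = factory();
    pending = attempt;
    attempt.catch(() => {
      // Forget the failed attempt so the next call starts a fresh one.
      if (pending === attempt) pending = null;
    });
    return attempt;
  };
}

// Usage sketch (the factory is a stand-in for engine creation):
//   const getEngine = lazyAsync(() => WebGPUEngine.create({ repo: "..." }));
//   const engine = await getEngine();
```

Callers still see the original rejection, so they can show an error; only the cache entry is cleared.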

Loading Progress

Pass onProgress to create() to surface download + compile progress. It reports (loaded, total, message) — a 0–100 percentage and a human-readable phase:

loading-progress.ts
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  onProgress: (loaded, total, message) => {
    // message walks through: "Fetching model config..." →
    // "Downloading model.safetensors (120/404 MB)" → "Compiling shaders..."
    const pct = total ? Math.round((loaded / total) * 100) : loaded;
    console.log(`${pct}% — ${message}`);
  },
});
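
For a progress bar it can help to normalize the callback arguments before rendering. The helpers below are a sketch, assuming `loaded` and `total` share a unit and that `total` may be 0 before the download size is known; `toPercent` and `progressLabel` are illustrative names, not Gerbil APIs.

```typescript
// Returns a clamped 0..100 integer, or null when the total is not known yet.
function toPercent(loaded: number, total: number): number | null {
  if (!Number.isFinite(loaded) || !Number.isFinite(total) || total <= 0) return null;
  const pct = Math.round((loaded / total) * 100);
  return Math.min(100, Math.max(0, pct));
}

// Builds a label such as "30% - Downloading ..." or just the phase message.
function progressLabel(loaded: number, total: number, message: string): string {
  const pct = toPercent(loaded, total);
  return pct === null ? message : `${pct}% - ${message}`;
}

console.log(progressLabel(120, 404, "Downloading model.safetensors"));
console.log(progressLabel(0, 0, "Fetching model config..."));
```

Returning `null` for an unknown total lets the UI show an indeterminate spinner instead of a misleading 0%.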

Chat Messages

For multi-turn chat, pass a ChatMessage[] to generate(). The chat template is applied for you:

messages.ts
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

await engine.generate([
  { role: "system", content: "You are concise." },
  { role: "user", content: "What is recursion?" },
]);
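
Long conversations eventually run into the sequence-length cap, so an app usually has to trim history before each call. The sketch below keeps every system message plus the most recent turns. It counts messages rather than tokens, which is a simplification, and `trimHistory` is a hypothetical helper, not something Gerbil provides.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Keeps all system messages and the last `maxTurns` non-system messages.
function trimHistory(messages: ChatMessage[], maxTurns: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  // slice(-0) would return everything, so handle zero explicitly.
  const recent = maxTurns > 0 ? rest.slice(-maxTurns) : [];
  return [...system, ...recent];
}

const history: ChatMessage[] = [
  { role: "system", content: "You are concise." },
  { role: "user", content: "Hi!" },
  { role: "assistant", content: "Hello! How can I help?" },
  { role: "user", content: "Explain recursion." },
];
const trimmed = trimHistory(history, 2);
console.log(trimmed.map((m) => m.role));
```

The trimmed array can be passed to generate() in place of the full history.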

Vision

Vision runs natively too — the Qwen3.5 ViT ships inside the same lean 4-bit repo. Pass enableVision: true, decode an image to RGB pixels, and call describeImage:

vision.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  enableVision: true,
});

// Decode any <img>/File/URL to RGB pixels (HWC, 0..255) via a canvas.
async function decodeToPixels(src: string) {
  const img = new Image();
  img.crossOrigin = "anonymous";
  await new Promise((res, rej) => { img.onload = res; img.onerror = rej; img.src = src; });
  const canvas = document.createElement("canvas");
  canvas.width = img.naturalWidth;
  canvas.height = img.naturalHeight;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(img, 0, 0);
  const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height);
  // Canvas data is RGBA; copy it to a tightly packed RGB buffer, dropping alpha.
  const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3);
  for (let i = 0, j = 0; i < data.length; i += 4, j += 3) {
    rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2];
  }
  return { pixels: rgb, width: canvas.width, height: canvas.height };
}

const image = await decodeToPixels("/photo.jpg");
const { text } = await engine.describeImage(image, "What's in this image?");
console.log(text);

See the Vision documentation for more.

Utilities

Helper functions to pick device-appropriate models, check whether a model fits the current device, and check storage before a large download. WebGPU detection needs no helper.

imports.ts
import {
  getRecommendedModels, // Memory-aware model selection
  isModelSafeForDevice, // Will this model fit this device?
  checkStorageQuota,    // Verify disk space before downloading
} from "@tryhamster/gerbil/browser";

// WebGPU detection is a one-line feature check; no helper needed.
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;

Detecting WebGPU

The native engine requires WebGPU. There's no helper to import — it's a single feature check on navigator:

check.ts
const hasWebGPU = typeof navigator !== "undefined" && "gpu" in navigator;

if (!hasWebGPU) {
  // Show fallback UI or error message
  alert("Please use Chrome/Edge 113+ or Safari 26+ for WebGPU support");
}

getRecommendedModels()

Memory-aware model selection based on navigator.deviceMemory:

recommended.ts
import { getRecommendedModels } from "@tryhamster/gerbil/browser";
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const models = getRecommendedModels();
// {
//   chat: "mlx-community/Qwen3.5-0.8B-4bit", // or LFM2.5-350M on low-memory
//   embedding: "mlx-community/embeddinggemma-300m-4bit",
//   reason: "8GB+ detected, using full models"
// }

// Feed the recommended chat model into the engine for device-aware defaults.
const engine = await WebGPUEngine.create({ repo: models.chat });

Mobile: On mobile devices, Gerbil automatically uses q4 quantization (CPU-optimized) instead of q4f16 (GPU-optimized) for better compatibility and performance.

checkStorageQuota()

Verify available storage before downloading a large model:

storage.ts
import { checkStorageQuota } from "@tryhamster/gerbil/browser";

// Wrapped in a function (the name is illustrative) so the early return is valid.
async function ensureStorage(): Promise<boolean> {
  // Check if we have ~404MB available (for Qwen3.5-0.8B-4bit)
  const storage = await checkStorageQuota(404);
  // {
  //   ok: true,
  //   availableMB: 4500, // MB available
  //   usedMB: 120,       // MB already used by this origin
  //   quotaMB: 4620,     // total quota granted
  //   message: "4.5GB available, 404MB required"
  // }
  if (!storage.ok) {
    alert(storage.message); // "Only 200MB available, need 404MB"
    return false;
  }
  return true;
}
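
For intuition about where numbers like these can come from, browsers expose `navigator.storage.estimate()`, which resolves to `{ usage, quota }` in bytes. The sketch below derives free space from it. Whether checkStorageQuota() works this way internally is an assumption; `estimateFreeMB` and `StorageLike` are hypothetical names.

```typescript
// Structural type matching the part of StorageManager this sketch uses.
interface StorageLike {
  estimate(): Promise<{ usage?: number; quota?: number }>;
}

// Returns free space in MiB, or null when the browser cannot report it.
async function estimateFreeMB(storage: StorageLike | undefined): Promise<number | null> {
  if (!storage) return null;
  const { usage = 0, quota } = await storage.estimate();
  if (quota === undefined) return null;
  return Math.max(0, quota - usage) / (1024 * 1024);
}

// In a browser: estimateFreeMB(navigator.storage)
```

The values browsers report are estimates and can be deliberately imprecise, so leave some headroom above the model size.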

isModelSafeForDevice()

Check whether a specific model is safe to load on the current device before you create the engine — this catches iOS memory limits and steers you to a smaller repo:

model-safety.ts
import { isModelSafeForDevice } from "@tryhamster/gerbil/browser";
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const check = isModelSafeForDevice("mlx-community/Qwen3.5-2B-4bit");
// On iPhone:
// {
//   safe: false,
//   risky: true, // borderline; OK to offer an explicit "load anyway"
//   reason: "Model ... (~1.7GB) exceeds the iPhone WKWebView memory budget ...",
//   recommendation: "Use mlx-community/Qwen3.5-0.8B-4bit on iPhone; ...",
//   maxSafeModel: "mlx-community/Qwen3.5-0.8B-4bit"
// }

if (!check.safe) {
  console.warn(check.reason);
  // risky === true → borderline; a UI can offer "load anyway" on capable hardware.
  // risky === false → genuinely too large; just fall back.
  const engine = await WebGPUEngine.create({ repo: check.maxSafeModel });
}
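
The safe/risky distinction maps naturally onto a small decision function. This sketch uses only the `safe`, `risky`, and `maxSafeModel` fields shown above; `chooseRepo` and the `userAcceptedRisk` flag are illustrative, standing in for however an app records that the user clicked "load anyway".

```typescript
// Subset of the check result that the decision needs.
interface SafetyCheck {
  safe: boolean;
  risky: boolean;
  maxSafeModel: string;
}

// Picks which repo to load given the safety check and the user's choice.
function chooseRepo(requested: string, check: SafetyCheck, userAcceptedRisk: boolean): string {
  if (check.safe) return requested;                      // fits comfortably
  if (check.risky && userAcceptedRisk) return requested; // borderline, user opted in
  return check.maxSafeModel;                             // too large, fall back
}

const borderline: SafetyCheck = {
  safe: false,
  risky: true,
  maxSafeModel: "mlx-community/Qwen3.5-0.8B-4bit",
};
console.log(chooseRepo("mlx-community/Qwen3.5-2B-4bit", borderline, false));
```

Keeping the decision in a pure function makes it easy to unit-test without a device.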

Model Preloading

Warm the IndexedDB cache ahead of time during app initialization, so users don't wait when they first use AI. WebGPUEngine.create() downloads + compiles the model and caches the weights; calling destroy() frees the GPU memory while the download stays cached for an instant reload. This works outside React hooks — perfect for app startup.

preload.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// During app initialization (before React mounts)
async function initApp() {
  // Warm the cache for the chat model with progress tracking.
  const chat = await WebGPUEngine.create({
    repo: "mlx-community/Qwen3.5-0.8B-4bit",
    onProgress: (loaded, total, message) => {
      const pct = total ? Math.round((loaded / total) * 100) : 0;
      console.log(`${message}: ${pct}%`);
    },
  });
  // Free GPU memory but keep the weights cached in IndexedDB.
  chat.destroy();

  // Warm the embedding model too (load with embedding: true).
  const embed = await WebGPUEngine.create({
    repo: "mlx-community/embeddinggemma-300m-4bit",
    embedding: true,
  });
  embed.destroy();

  console.log("All models cached!");
}

// Call during app startup
initApp();

Create Options

types.ts
interface WebGPUEngineOptions {
  /** HF repo id (e.g. "mlx-community/Qwen3.5-0.8B-4bit") or full URL. */
  repo?: string;
  /** Progress callback: (loaded, total, message). */
  onProgress?: (loaded: number, total: number, message: string) => void;
  /** Weight dtype. "auto" picks q4 on mobile, native on desktop. */
  dtype?: "auto" | "f32" | "q4";
  /** Max sequence length (default: from model config, capped at 4096). */
  maxSeqLen?: number;
  /** Build an embedding graph so embed()/similarity() work. */
  embedding?: boolean;
  /** Build the vision encoder so describeImage() works (vision repos only). */
  enableVision?: boolean;
}

Keep loaded vs. free memory

Control whether the engine stays in GPU memory after warming the cache:

| Action | Behavior | Use Case |
| --- | --- | --- |
| destroy() | Free GPU memory, weights stay cached in IndexedDB | Preload for later, save RAM |
| keep the handle | Engine stays loaded in memory | Instant use, no reload delay |

keep-loaded.ts
// Warm only — free GPU memory after caching the weights.
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
engine.destroy();
// Later: WebGPUEngine.create() reloads from the IndexedDB cache (~1-2s).
// Keep in memory — uses RAM but inference is instant.
const live = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
const { text } = await live.generate("Hello!"); // no reload wait

Native Browser Models

The native engine loads compact 4-bit MLX-community repos directly from HuggingFace by repo id. Each is cached in IndexedDB after first download and runs on any device with WebGPU.

| Repo | Use | Size | Notes |
| --- | --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-4bit | Chat + Vision | ~404MB | ViT ships in-repo (enableVision) |
| mlx-community/embeddinggemma-300m-4bit | Embeddings | ~173MB | 768-dim, asymmetric query/document |

Any compatible HuggingFace repo works — pass its id as repo.

Browser Support

| Browser | Version | Status |
| --- | --- | --- |
| Chrome / Edge | 113+ | ✓ Full support |
| Safari / iPadOS | 26+ | ✓ Native engine (packed-f16 KV) |
| Firefox | | ✗ Behind flag, not recommended |

iOS Memory Guards

iOS Safari and iPadOS have strict memory limits (the web-content process is killed around 1.5–2GB). The native engine accounts for this automatically: on WebKit it caps maxSeqLen conservatively and uses a packed-f16 KV cache to halve memory. Prefer the 4-bit repos on mobile. The same guards apply to every modality — text, vision, embeddings, and audio (Moonshine STT, Kani-TTS-2) — since they all run on the one WebGPU engine.

Automatic Protection: The native engine caps sequence length and KV-cache memory on WebKit, detects crashes, and caches downloads in IndexedDB.
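
To see why capping sequence length and halving the KV element size both matter, it helps to run the arithmetic. For standard attention, the KV cache holds one key and one value vector per layer, per KV head, per position. The sketch below applies that textbook formula. The model dimensions are hypothetical round numbers chosen for illustration, not the actual configuration of any repo listed here, and `kvCacheBytes` is not a Gerbil function.

```typescript
interface KvShape {
  layers: number;          // transformer layers that keep a KV cache
  kvHeads: number;         // key/value heads per layer
  headDim: number;         // dimension of each head
  seqLen: number;          // cached positions (bounded by maxSeqLen)
  bytesPerElement: number; // 4 for f32, 2 for f16
}

// Factor 2 accounts for storing both keys and values.
function kvCacheBytes(s: KvShape): number {
  return 2 * s.layers * s.kvHeads * s.headDim * s.seqLen * s.bytesPerElement;
}

// Hypothetical dimensions, for illustration only.
const shape = { layers: 24, kvHeads: 2, headDim: 128, seqLen: 4096 };
const f32MiB = kvCacheBytes({ ...shape, bytesPerElement: 4 }) / (1024 * 1024);
const f16MiB = kvCacheBytes({ ...shape, bytesPerElement: 2 }) / (1024 * 1024);
console.log(`f32: ${f32MiB} MiB, f16: ${f16MiB} MiB`); // f32: 192 MiB, f16: 96 MiB
```

The cost is linear in both `seqLen` and `bytesPerElement`, so halving either one halves the cache.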

What Happens Automatically

PathBackendiOS Memory GuardCrash DetectChunked DL
WebGPUEngineNative WebGPU✓ Caps seq len + packed-f16 KV✓ IndexedDB cache
MoonshineSTTNative WebGPU✓ Caps seq len + packed-f16 KV✓ IndexedDB cache
engine.speak (Kani-TTS-2)Native WebGPU✓ Caps seq len + packed-f16 KV✓ IndexedDB cache

iOS Compatibility Matrix

| Model | Size | iOS Safe | Notes |
| --- | --- | --- | --- |
| LFM2.5-350M | ~199MB | ✓ Yes | Fastest text default on iOS (~46 tok/s) |
| EmbeddingGemma-300M | ~173MB | ✓ Yes | Embeddings (768-dim), runs on iPad |
| Qwen3.5-0.8B | ~404MB | ✓ Yes | Text + vision (~50 tok/s iPad, ~38 tok/s iPhone); +~192MB with ViT |
| Moonshine | ~190MB | ⚠ Pending | Native STT; runs on desktop, iOS support not yet confirmed |

Manual Utilities (Advanced)

For custom implementations or advanced control, these utilities are available:

ios-utilities.ts
import {
  isModelSafeForDevice,    // Check if model is safe for current device
  detectMemoryCrash,       // Check if previous session crashed
  setDownloadPhase,        // Track download phase for crash detection
  clearDownloadPhase,      // Clear phase on success
  downloadModelChunked,    // Resumable chunked downloads
  hasIncompleteDownload,   // Check for interrupted downloads
  clearIncompleteDownload, // Clear partial download
} from "@tryhamster/gerbil/browser";

// Check model safety before loading
const check = isModelSafeForDevice("mlx-community/Qwen3.5-2B-4bit");
if (!check.safe) {
  console.log(check.reason);         // "Model ... exceeds the iPhone WKWebView memory budget..."
  console.log(check.recommendation); // "Use mlx-community/Qwen3.5-0.8B-4bit on iPhone; ..."
  console.log(check.maxSafeModel);   // "mlx-community/Qwen3.5-0.8B-4bit"
  console.log(check.risky);          // true = borderline (offer "load anyway"); false = hard block
}

// Detect if page crashed during previous model load
const crash = detectMemoryCrash();
if (crash.crashed) {
  console.log(crash.recommendation); // "The model was too large..."
  console.log(crash.phase);          // "downloading" | "initializing"
  console.log(crash.modelId);        // which model caused it
}

Chunked Resumable Downloads

Model downloads automatically use chunked downloading with resume support. If a download is interrupted (page refresh, crash, network error), it resumes from where it left off. For manual control:

chunked.ts
import {
  downloadModelChunked,
  hasIncompleteDownload,
  clearIncompleteDownload
} from "@tryhamster/gerbil/browser";

// Check for interrupted downloads
const incomplete = await hasIncompleteDownload("mlx-community/Qwen3.5-0.8B-4bit");
if (incomplete.incomplete) {
  console.log(`Resuming: ${incomplete.percent}% complete`);
}

// Lets the caller cancel the download via abortController.abort().
const abortController = new AbortController();

// Download with progress and resume support
const buffer = await downloadModelChunked(
  "https://huggingface.co/...",
  "mlx-community/Qwen3.5-0.8B-4bit",
  {
    onProgress: (info) => {
      console.log(`${info.phase}: ${info.percent}%`);
    },
    signal: abortController.signal,
  }
);

| Feature | Description |
| --- | --- |
| HTTP Range requests | Downloads in 1.5MB chunks using Range: bytes=start-end |
| IndexedDB storage | Each chunk stored separately to avoid large transaction spikes |
| Automatic resume | Tracks completed chunks in manifest, resumes from last position |
| ETag validation | Clears cached chunks if model version changes |
| Abort support | Cancel downloads gracefully with AbortController |
| Fallback | Falls back to regular download if server doesn't support Range |
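
The Range and resume rows can be illustrated with a small planner that splits a file into inclusive byte ranges and skips chunks already recorded as complete. This is a sketch of the general technique, not Gerbil's implementation; `planChunks` and the shape of the "completed" set are assumptions.

```typescript
interface ChunkRange {
  index: number;  // position of the chunk within the file
  start: number;  // first byte (inclusive)
  end: number;    // last byte (inclusive, as HTTP Range requires)
  header: string; // value for the Range request header
}

// Lists the chunks still to fetch, skipping indices already completed.
function planChunks(
  totalBytes: number,
  chunkBytes: number,
  completed: ReadonlySet<number> = new Set<number>(),
): ChunkRange[] {
  if (chunkBytes <= 0) throw new Error("chunkBytes must be positive");
  const ranges: ChunkRange[] = [];
  for (let start = 0, index = 0; start < totalBytes; start += chunkBytes, index++) {
    if (completed.has(index)) continue;
    const end = Math.min(start + chunkBytes, totalBytes) - 1;
    ranges.push({ index, start, end, header: `bytes=${start}-${end}` });
  }
  return ranges;
}

// A 10-byte file in 4-byte chunks, with the first two chunks already stored.
const remaining = planChunks(10, 4, new Set([0, 1]));
console.log(remaining.map((r) => r.header)); // ["bytes=8-9"]
```

Each header would be sent as `Range` on a fetch for the same URL; a server that supports ranges answers with 206 Partial Content.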

Troubleshooting

"WebGPU not supported"

  • Update to Chrome/Edge 113+
  • Check chrome://gpu for WebGPU status
  • Try enabling chrome://flags/#enable-unsafe-webgpu

Slow first load

First load downloads the model (~404MB for Qwen3.5-0.8B) and compiles WebGPU shaders. Subsequent loads use IndexedDB cache and are much faster (~2-5s).

Out of memory

Smaller repos like LFM2.5-350M use less GPU memory. Close other GPU-intensive tabs.

Model fails to download

The native WebGPU engine fetches weights over plain HTTPS and caches them in IndexedDB — no special cross-origin headers are required. Check your network connection and that the HuggingFace repo is reachable.

Next.js Configuration

The native engine ships as ESM and loads weights over plain HTTPS, so no special cross-origin headers are needed. Keep the browser bundle clean by excluding the Node-only paths from the GPU barrel:

next.config.js
// next.config.js
/** @type {import('next').NextConfig} */
const nextConfig = {
  webpack: (config, { isServer }) => {
    if (!isServer) {
      // Exclude Node.js-only modules from the browser bundle.
      config.resolve.alias = {
        ...config.resolve.alias,
        webgpu: false, // node-dawn binding (server/Node only)
      };
      config.resolve.fallback = {
        ...config.resolve.fallback,
        path: false,
        fs: false,
        os: false,
      };
    }

    return config;
  },
};

module.exports = nextConfig;

Next Steps