Models

Gerbil ships with native models that are ready to run out of the box, and it can load almost any compatible model straight from Hugging Face.

Native Models

The native WebGPU engine runs these models entirely on-device. Load them by HuggingFace repo id with WebGPUEngine.create:

ModelTypeSizeSpeedThinkVisionEmbedTTSSTTBrowserBest For
qwen3.5-0.8bLLM~650 MB
Text + vision, 262K context, reasoning
qwen3.5-2bLLM~1.7 GB
Text + vision, higher quality — iPad & desktop
lfm2.5-350mLLM~300 MB
Fast, lightweight text generation
gemma-4-e2bLLM~3.6 GB (PLE streams, ~1GB resident)
Text + vision, larger desktop-leaning quality, reasoning
Embeddings
embeddinggemma-300mEMBED~173MB
768-dim semantic search (asymmetric query/document)
Text-to-Speech
kani-tts-2TTS
On-device speech synthesis (coming — not yet runnable)
Speech-to-Text
moonshine-baseSTT~190MB
Raw-waveform English transcription (greedy)

Choosing a Model

For general use, vision & reasoning

Qwen3.5-0.8B is the default all-rounder: a 262K context window, thinking support, and a built-in vision tower — all in one checkpoint. Roughly ~225 tok/s in-browser on desktop, ~50 tok/s on iPad, and ~38 tok/s on iPhone (Safari 26+).

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
const engine = await WebGPUEngine.create({
repo: "mlx-community/Qwen3.5-0.8B-4bit", // ~404 MB
});
const result = await engine.generate("What is 17 * 23?", {
sampling: { temperature: 0 }, // greedy
});
console.log(result.text);
console.log(result.thinking); // step-by-step reasoning when the model thinks

For speed

LFM2.5-350M (LiquidAI's hybrid conv/attention model) is the faster, smaller text default — ~600 tok/s desktop server-side (2.8× Qwen), ~46 tok/s in-browser on mobile.

const engine = await WebGPUEngine.create({
repo: "LiquidAI/LFM2.5-350M-MLX-4bit", // ~199 MB
});
await engine.generate("Write a haiku about coding");

For larger, desktop-leaning quality

gemma-4-e2b-it-4bit is the smallest Gemma 4 — a stronger text-only decoder. The 4-bit checkpoint is ~2.5 GB on disk, but its per-layer embedding (PLE) table streams from CPU, so resident GPU memory stays ~0.8–1.1 GB (mobile-viable, ~83 tok/s in-browser). It leans desktop next to Qwen3.5 / LFM2.5. Gemma 4 runs here as a text-only model — for images, use Qwen3.5's built-in vision tower.

const engine = await WebGPUEngine.create({
repo: "mlx-community/gemma-4-e2b-it-4bit", // ~2.5 GB, PLE streams from CPU
});
await engine.generate("Explain WebGPU compute shaders to a web developer");

For vision

Qwen3.5-0.8B understands images through its own ViT tower (no separate vision model). Pass enableVision: true and call describeImage:

const engine = await WebGPUEngine.create({
repo: "mlx-community/Qwen3.5-0.8B-4bit",
enableVision: true, // builds the ViT tower (+~192 MB)
});
const { text } = await engine.describeImage(
{ pixels, width, height }, // RGB pixels from a canvas
"Describe this image."
);
console.log(text);

See the Vision documentation for decoding images to pixels and more examples.

For embeddings

EmbeddingGemma-300M produces 768-dim, L2-normalized vectors with asymmetric query/document prefixes. Runs on iPad.

const engine = await WebGPUEngine.create({
repo: "mlx-community/embeddinggemma-300m-4bit", // ~173 MB
embedding: true,
});
const vec = await engine.embed("how do I cache a model?", { taskType: "query" });
console.log(vec.length); // 768

Using HuggingFace Models

The native engine loads compatible models straight from HuggingFace by repo id. Compact 4-bit MLX-community repos load fastest and fit comfortably on mobile:

huggingface.ts
// Pass any compatible repo id to create()
await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });
await WebGPUEngine.create({ repo: "LiquidAI/LFM2.5-350M-MLX-4bit" });
await WebGPUEngine.create({
repo: "mlx-community/embeddinggemma-300m-4bit",
embedding: true,
});

Note: The native engine runs supported architectures (Qwen3.5, LFM2, Gemma3 encoder, Moonshine). The engine compiles WGSL compute shaders directly for each model and runs them on WebGPU.

Create Options

Customize how the engine loads and runs a model:

options.ts
await WebGPUEngine.create({
repo: "mlx-community/Qwen3.5-0.8B-4bit",
// Quantization (from the repo / LoadModelOptions)
dtype: "q4", // "f32" | "q4"
// Capability flags
embedding: false, // load as an embedding model (enables embed())
enableVision: false, // build the ViT tower (enables describeImage())
// Memory / context
maxSeqLen: 4096, // capped 4096 desktop, 512-2048 on WebKit/iOS
kvMode: "packed-f16", // "f32" | "native-f16" | "packed-f16" (auto if omitted)
// Progress
onProgress: (loaded, total, message) => {
const pct = total ? Math.round((loaded / total) * 100) : loaded;
console.log(`${pct}% — ${message}`);
},
});

Model Caching

In the browser, models are cached in IndexedDB after the first download — subsequent loads skip the network and reuse the compiled weights. There is no CPU/WASM fallback: a browser without WebGPU cannot run the native engine, so feature-detect WebGPU up front with "gpu" in navigator.