Vision
Gerbil can see. Run Vision Language Models locally to describe, analyze, and answer questions about images — entirely on your device, with no image ever leaving it.
Quick Start
Vision runs natively on the WebGPUEngine using Qwen3.5-0.8B's own built-in ViT tower. Build the engine with enableVision: true, decode an image to RGB pixels, and call describeImage:
01import { WebGPUEngine } from "@tryhamster/gerbil/gpu";02
03// The vision tower (ViT, ~192 MB) ships inside the Qwen3.5 4-bit repo.04const engine = await WebGPUEngine.create({05 repo: "mlx-community/Qwen3.5-0.8B-4bit",06 enableVision: true,07});08
09// describeImage takes pre-decoded RGB pixels (see "Image Input" below).10const result = await engine.describeImage(image, "What's in this image?");11
12console.log(result.text);13// → "A golden retriever playing fetch in a sunny park..."14
15engine.destroy();Vision Models
Currently supported vision-capable models:
| Model | Parameters | Context | Features | Size |
|---|---|---|---|---|
qwen3.5-0.8b | 0.8B (own ViT) | 262K tokens | Vision + Reasoning | ~404MB (4-bit, +192MB ViT) |
WebGPUEngine using the compact mlx-community/Qwen3.5-0.8B-4bit repo (~404MB, Qwen3.5's own ViT included) — see React Hooks below. The same qwen3.5-0.8b model runs on-device in Node.Image Input
describeImage accepts pre-decoded RGB pixels (HWC, 0..255), so you decode any <img>, File, or URL via a canvas first:
// describeImage accepts either decoded pixels...type PixelImage = { pixels: Float32Array | Uint8ClampedArray | Uint8Array; width: number; height: number;};
// ...or pre-extracted ViT patches (advanced):type PatchImage = { patches: Float32Array; gridTHW: [number, number, number];};
// Decode an <img>/File/URL to RGB pixels via a canvas:async function decodeToPixels(src: string): Promise<PixelImage> { const img = new Image(); img.crossOrigin = "anonymous"; await new Promise((res, rej) => { img.onload = res; img.onerror = rej; img.src = src; }); const canvas = document.createElement("canvas"); canvas.width = img.naturalWidth; canvas.height = img.naturalHeight; const ctx = canvas.getContext("2d")!; ctx.drawImage(img, 0, 0); const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height); const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3); for (let i = 0, j = 0; i < data.length; i += 4, j += 3) { rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2]; } return { pixels: rgb, width: canvas.width, height: canvas.height };}Capability Detection
Check whether a vision encoder was built with the hasVision flag:
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit", enableVision: true,});
if (engine.hasVision) { const { text } = await engine.describeImage(image, "Describe this");} else { // Engine was created without enableVision — describeImage() would throw. const { text } = await engine.generate("Describe what you know about...");}Vision Encoder Only
To get raw image-embedding tokens (e.g. for retrieval) without decoding text, call encodeImage with ViT patches and a temporal/height/width grid:
const { embeds, rows, dim } = await engine.encodeImage(patches, gridTHW);// dim === 1024 (Qwen3.5's merged image-embedding dimension)React Hooks
In the browser, vision runs natively on WebGPU via the WebGPUEngine. Decode the image to RGB pixels and call describeImage:
01import { WebGPUEngine } from "@tryhamster/gerbil/gpu";02
03// The vision tower (ViT) ships inside the lean Qwen3.5 4-bit repo.04const engine = await WebGPUEngine.create({05 repo: "mlx-community/Qwen3.5-0.8B-4bit",06 enableVision: true,07});08
09// Decode any <img>/File/URL to RGB pixels (HWC, 0..255) via a canvas.10async function decodeToPixels(src: string) {11 const img = new Image();12 img.crossOrigin = "anonymous";13 await new Promise((res, rej) => {14 img.onload = res; img.onerror = rej; img.src = src;15 });16 const canvas = document.createElement("canvas");17 canvas.width = img.naturalWidth;18 canvas.height = img.naturalHeight;19 const ctx = canvas.getContext("2d")!;20 ctx.drawImage(img, 0, 0);21 const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height);22 const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3);23 for (let i = 0, j = 0; i < data.length; i += 4, j += 3) {24 rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2];25 }26 return { pixels: rgb, width: canvas.width, height: canvas.height };27}28
29const image = await decodeToPixels("/skelly.png");30const { text } = await engine.describeImage(image, "What's in this image?");31console.log(text);Performance Tips
Image Size
- —Larger images take longer to process
- —Consider resizing to 512×512 to 1024×1024 for optimal performance
- —Models cache in IndexedDB (browser) after first download
Expected Performance
| Metric | Value |
|---|---|
| Vision model load time | ~2s (cached) |
| Image processing | ~0.5s |
| Generation speed | 70-100+ tok/s in-browser (desktop WebGPU) |
| Memory usage | ~4GB (model + KV cache) |
Troubleshooting
"Engine was not created with enableVision"
describeImage and encodeImage throw unless the engine was built with enableVision: true. The ViT tower adds ~192 MB, so it is only built on request.
Slow image processing
- —ViT attention memory scales with patch-count², so downscale large photos before decoding (the longest side ≤ 448–1024 px works well)
- —On WebKit/iOS, lower
maxVisionPatches(default 1024 on WebKit, 4096 desktop) - —The native engine is WebGPU-only — there is no CPU fallback. Confirm a WebGPU adapter is available up front
Image not loading
- —Check the URL is accessible (CORS may block some URLs)
- —For local files, ensure the path is absolute
- —Base64 data URIs must include the mime type prefix:
data:image/png;base64,...
API Reference
describeImage()
// Image-in → text-out. Throws unless created with { enableVision: true }.describeImage( image: | { pixels: Float32Array | Uint8ClampedArray | Uint8Array; width: number; height: number } | { patches: Float32Array; gridTHW: [number, number, number] }, prompt?: string, // default "Describe this image." options?: GenerateOptions & { imageProcessor?: ImageProcessorConfig }): Promise<GenerateResult>encodeImage()
// Vision encoder only (patches → merged image-embedding tokens).encodeImage( patches: Float32Array, gridTHW: [number, number, number]): Promise<{ embeds: Float32Array; rows: number; dim: number }>// dim === 1024 for Qwen3.5hasVision
// true if the engine was created with { enableVision: true }get hasVision(): booleanWebGPUEngineOptions (vision fields)
interface WebGPUEngineOptions { repo: string; // e.g. "mlx-community/Qwen3.5-0.8B-4bit"
/** Build the Qwen3.5 ViT tower (~192 MB) so describeImage() works. */ enableVision?: boolean;
/** Max ViT patches. Default 4096 desktop, 1024 on WebKit/iOS. */ maxVisionPatches?: number;}