Vision

Gerbil can see. Run Vision Language Models locally to describe, analyze, and answer questions about images — entirely on your device, with no image ever leaving it.

Quick Start

Vision runs natively on the WebGPUEngine using Qwen3.5-0.8B's own built-in ViT tower. Build the engine with enableVision: true, decode an image to RGB pixels, and call describeImage:

vision-basic.ts
01import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
02
03// The vision tower (ViT, ~192 MB) ships inside the Qwen3.5 4-bit repo.
04const engine = await WebGPUEngine.create({
05 repo: "mlx-community/Qwen3.5-0.8B-4bit",
06 enableVision: true,
07});
08
09// describeImage takes pre-decoded RGB pixels (see "Image Input" below).
10const result = await engine.describeImage(image, "What's in this image?");
11
12console.log(result.text);
13// → "A golden retriever playing fetch in a sunny park..."
14
15engine.destroy();

Vision Models

Currently supported vision-capable models:

ModelParametersContextFeaturesSize
qwen3.5-0.8b0.8B (own ViT)262K tokensVision + Reasoning~404MB (4-bit, +192MB ViT)
Note: Vision runs natively on the WebGPUEngine using the compact mlx-community/Qwen3.5-0.8B-4bit repo (~404MB, Qwen3.5's own ViT included) — see React Hooks below. The same qwen3.5-0.8b model runs on-device in Node.

Image Input

describeImage accepts pre-decoded RGB pixels (HWC, 0..255), so you decode any <img>, File, or URL via a canvas first:

image-input.ts
// describeImage accepts either decoded pixels...
type PixelImage = {
pixels: Float32Array | Uint8ClampedArray | Uint8Array;
width: number;
height: number;
};
// ...or pre-extracted ViT patches (advanced):
type PatchImage = {
patches: Float32Array;
gridTHW: [number, number, number];
};
// Decode an <img>/File/URL to RGB pixels via a canvas:
async function decodeToPixels(src: string): Promise<PixelImage> {
const img = new Image();
img.crossOrigin = "anonymous";
await new Promise((res, rej) => { img.onload = res; img.onerror = rej; img.src = src; });
const canvas = document.createElement("canvas");
canvas.width = img.naturalWidth;
canvas.height = img.naturalHeight;
const ctx = canvas.getContext("2d")!;
ctx.drawImage(img, 0, 0);
const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height);
const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3);
for (let i = 0, j = 0; i < data.length; i += 4, j += 3) {
rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2];
}
return { pixels: rgb, width: canvas.width, height: canvas.height };
}

Capability Detection

Check whether a vision encoder was built with the hasVision flag:

capability-detection.ts
const engine = await WebGPUEngine.create({
repo: "mlx-community/Qwen3.5-0.8B-4bit",
enableVision: true,
});
if (engine.hasVision) {
const { text } = await engine.describeImage(image, "Describe this");
} else {
// Engine was created without enableVision — describeImage() would throw.
const { text } = await engine.generate("Describe what you know about...");
}

Vision Encoder Only

To get raw image-embedding tokens (e.g. for retrieval) without decoding text, call encodeImage with ViT patches and a temporal/height/width grid:

encode-image.ts
const { embeds, rows, dim } = await engine.encodeImage(patches, gridTHW);
// dim === 1024 (Qwen3.5's merged image-embedding dimension)

React Hooks

In the browser, vision runs natively on WebGPU via the WebGPUEngine. Decode the image to RGB pixels and call describeImage:

native-vision.ts
01import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
02
03// The vision tower (ViT) ships inside the lean Qwen3.5 4-bit repo.
04const engine = await WebGPUEngine.create({
05 repo: "mlx-community/Qwen3.5-0.8B-4bit",
06 enableVision: true,
07});
08
09// Decode any <img>/File/URL to RGB pixels (HWC, 0..255) via a canvas.
10async function decodeToPixels(src: string) {
11 const img = new Image();
12 img.crossOrigin = "anonymous";
13 await new Promise((res, rej) => {
14 img.onload = res; img.onerror = rej; img.src = src;
15 });
16 const canvas = document.createElement("canvas");
17 canvas.width = img.naturalWidth;
18 canvas.height = img.naturalHeight;
19 const ctx = canvas.getContext("2d")!;
20 ctx.drawImage(img, 0, 0);
21 const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height);
22 const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3);
23 for (let i = 0, j = 0; i < data.length; i += 4, j += 3) {
24 rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2];
25 }
26 return { pixels: rgb, width: canvas.width, height: canvas.height };
27}
28
29const image = await decodeToPixels("/skelly.png");
30const { text } = await engine.describeImage(image, "What's in this image?");
31console.log(text);

Performance Tips

Image Size

  • Larger images take longer to process
  • Consider resizing to 512×512 to 1024×1024 for optimal performance
  • Models cache in IndexedDB (browser) after first download

Expected Performance

MetricValue
Vision model load time~2s (cached)
Image processing~0.5s
Generation speed70-100+ tok/s in-browser (desktop WebGPU)
Memory usage~4GB (model + KV cache)

Troubleshooting

"Engine was not created with enableVision"

describeImage and encodeImage throw unless the engine was built with enableVision: true. The ViT tower adds ~192 MB, so it is only built on request.

Slow image processing

  • ViT attention memory scales with patch-count², so downscale large photos before decoding (the longest side ≤ 448–1024 px works well)
  • On WebKit/iOS, lower maxVisionPatches (default 1024 on WebKit, 4096 desktop)
  • The native engine is WebGPU-only — there is no CPU fallback. Confirm a WebGPU adapter is available up front

Image not loading

  • Check the URL is accessible (CORS may block some URLs)
  • For local files, ensure the path is absolute
  • Base64 data URIs must include the mime type prefix: data:image/png;base64,...

API Reference

describeImage()

describeImage.ts
// Image-in → text-out. Throws unless created with { enableVision: true }.
describeImage(
image:
| { pixels: Float32Array | Uint8ClampedArray | Uint8Array; width: number; height: number }
| { patches: Float32Array; gridTHW: [number, number, number] },
prompt?: string, // default "Describe this image."
options?: GenerateOptions & { imageProcessor?: ImageProcessorConfig }
): Promise<GenerateResult>

encodeImage()

encodeImage.ts
// Vision encoder only (patches → merged image-embedding tokens).
encodeImage(
patches: Float32Array,
gridTHW: [number, number, number]
): Promise<{ embeds: Float32Array; rows: number; dim: number }>
// dim === 1024 for Qwen3.5

hasVision

hasVision.ts
// true if the engine was created with { enableVision: true }
get hasVision(): boolean

WebGPUEngineOptions (vision fields)

WebGPUEngineOptions.ts
interface WebGPUEngineOptions {
repo: string; // e.g. "mlx-community/Qwen3.5-0.8B-4bit"
/** Build the Qwen3.5 ViT tower (~192 MB) so describeImage() works. */
enableVision?: boolean;
/** Max ViT patches. Default 4096 desktop, 1024 on WebKit/iOS. */
maxVisionPatches?: number;
}