Vision

Name: Gerbil
Author: Gerbil

Gerbil can see. Run Vision Language Models locally to describe, analyze, and answer questions about images — entirely on your device, with no image ever leaving it.

Quick Start

Vision runs natively on the WebGPUEngine using Qwen3.5-0.8B's own built-in ViT tower. Build the engine with enableVision: true, decode an image to RGB pixels, and call describeImage:

vision-basic.ts

01import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
02
03// The vision tower (ViT, ~192 MB) ships inside the Qwen3.5 4-bit repo.
04const engine = await WebGPUEngine.create({
05  repo: "mlx-community/Qwen3.5-0.8B-4bit",
06  enableVision: true,
07});
08
09// describeImage takes pre-decoded RGB pixels (see "Image Input" below).
10const result = await engine.describeImage(image, "What's in this image?");
11
12console.log(result.text);
13// → "A golden retriever playing fetch in a sunny park..."
14
15engine.destroy();

Vision Models

Currently supported vision-capable models:

Model	Parameters	Context	Features	Size
`qwen3.5-0.8b`	0.8B (own ViT)	262K tokens	Vision + Reasoning	~404MB (4-bit, +192MB ViT)

Note: Vision runs natively on the WebGPUEngine using the compact mlx-community/Qwen3.5-0.8B-4bit repo (~404MB, Qwen3.5's own ViT included) — see React Hooks below. The same qwen3.5-0.8b model runs on-device in Node.

Image Input

describeImage accepts pre-decoded RGB pixels (HWC, 0..255), so you decode any <img>, File, or URL via a canvas first:

image-input.ts

// describeImage accepts either decoded pixels...
type PixelImage = {
  pixels: Float32Array | Uint8ClampedArray | Uint8Array;
  width: number;
  height: number;
};

// ...or pre-extracted ViT patches (advanced):
type PatchImage = {
  patches: Float32Array;
  gridTHW: [number, number, number];
};

// Decode an <img>/File/URL to RGB pixels via a canvas:
async function decodeToPixels(src: string): Promise<PixelImage> {
  const img = new Image();
  img.crossOrigin = "anonymous";
  await new Promise((res, rej) => { img.onload = res; img.onerror = rej; img.src = src; });
  const canvas = document.createElement("canvas");
  canvas.width = img.naturalWidth;
  canvas.height = img.naturalHeight;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(img, 0, 0);
  const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3);
  for (let i = 0, j = 0; i < data.length; i += 4, j += 3) {
    rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2];
  }
  return { pixels: rgb, width: canvas.width, height: canvas.height };
}

Capability Detection

Check whether a vision encoder was built with the hasVision flag:

capability-detection.ts

const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  enableVision: true,
});

if (engine.hasVision) {
  const { text } = await engine.describeImage(image, "Describe this");
} else {
  // Engine was created without enableVision — describeImage() would throw.
  const { text } = await engine.generate("Describe what you know about...");
}

Vision Encoder Only

To get raw image-embedding tokens (e.g. for retrieval) without decoding text, call encodeImage with ViT patches and a temporal/height/width grid:

encode-image.ts

const { embeds, rows, dim } = await engine.encodeImage(patches, gridTHW);
// dim === 1024 (Qwen3.5's merged image-embedding dimension)

React Hooks

In the browser, vision runs natively on WebGPU via the WebGPUEngine. Decode the image to RGB pixels and call describeImage:

native-vision.ts

01import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
02
03// The vision tower (ViT) ships inside the lean Qwen3.5 4-bit repo.
04const engine = await WebGPUEngine.create({
05  repo: "mlx-community/Qwen3.5-0.8B-4bit",
06  enableVision: true,
07});
08
09// Decode any <img>/File/URL to RGB pixels (HWC, 0..255) via a canvas.
10async function decodeToPixels(src: string) {
11  const img = new Image();
12  img.crossOrigin = "anonymous";
13  await new Promise((res, rej) => {
14    img.onload = res; img.onerror = rej; img.src = src;
15  });
16  const canvas = document.createElement("canvas");
17  canvas.width = img.naturalWidth;
18  canvas.height = img.naturalHeight;
19  const ctx = canvas.getContext("2d")!;
20  ctx.drawImage(img, 0, 0);
21  const { data } = ctx.getImageData(0, 0, canvas.width, canvas.height);
22  const rgb = new Uint8ClampedArray(canvas.width * canvas.height * 3);
23  for (let i = 0, j = 0; i < data.length; i += 4, j += 3) {
24    rgb[j] = data[i]; rgb[j + 1] = data[i + 1]; rgb[j + 2] = data[i + 2];
25  }
26  return { pixels: rgb, width: canvas.width, height: canvas.height };
27}
28
29const image = await decodeToPixels("/skelly.png");
30const { text } = await engine.describeImage(image, "What's in this image?");
31console.log(text);

Performance Tips

Image Size

—Larger images take longer to process
—Consider resizing to 512×512 to 1024×1024 for optimal performance
—Models cache in IndexedDB (browser) after first download

Expected Performance

Metric	Value
Vision model load time	~2s (cached)
Image processing	~0.5s
Generation speed	70-100+ tok/s in-browser (desktop WebGPU)
Memory usage	~4GB (model + KV cache)

Troubleshooting

"Engine was not created with enableVision"

describeImage and encodeImage throw unless the engine was built with enableVision: true. The ViT tower adds ~192 MB, so it is only built on request.

Slow image processing

—ViT attention memory scales with patch-count², so downscale large photos before decoding (the longest side ≤ 448–1024 px works well)
—On WebKit/iOS, lower maxVisionPatches (default 1024 on WebKit, 4096 desktop)
—The native engine is WebGPU-only — there is no CPU fallback. Confirm a WebGPU adapter is available up front

Image not loading

—Check the URL is accessible (CORS may block some URLs)
—For local files, ensure the path is absolute
—Base64 data URIs must include the mime type prefix: data:image/png;base64,...

API Reference

describeImage()

describeImage.ts

// Image-in → text-out. Throws unless created with { enableVision: true }.
describeImage(
  image:
    | { pixels: Float32Array | Uint8ClampedArray | Uint8Array; width: number; height: number }
    | { patches: Float32Array; gridTHW: [number, number, number] },
  prompt?: string,            // default "Describe this image."
  options?: GenerateOptions & { imageProcessor?: ImageProcessorConfig }
): Promise<GenerateResult>

encodeImage()

encodeImage.ts

// Vision encoder only (patches → merged image-embedding tokens).
encodeImage(
  patches: Float32Array,
  gridTHW: [number, number, number]
): Promise<{ embeds: Float32Array; rows: number; dim: number }>
// dim === 1024 for Qwen3.5

hasVision

hasVision.ts

// true if the engine was created with { enableVision: true }
get hasVision(): boolean

WebGPUEngineOptions (vision fields)

WebGPUEngineOptions.ts

interface WebGPUEngineOptions {
  repo: string;              // e.g. "mlx-community/Qwen3.5-0.8B-4bit"

  /** Build the Qwen3.5 ViT tower (~192 MB) so describeImage() works. */
  enableVision?: boolean;

  /** Max ViT patches. Default 4096 desktop, 1024 on WebKit/iOS. */
  maxVisionPatches?: number;
}