Mobile & PWA

Gerbil runs the same WebGPU engine on phones and tablets. The main thing to plan for on mobile is storage: install the site to the Home Screen so the model is cached once instead of re-downloaded on every visit.

TL;DR. On an uninstalled iOS web origin, Safari treats storage as evictable and caps it at ~1 GB regardless of free disk, so a ~404 MB model is often evicted between visits. Adding the site to the Home Screen (a PWA install) grants persistent storage tied to real disk, so the model is cached once and stays.

The two ceilings

Two separate limits decide whether on-device AI works on a phone or tablet. They are independent — clearing one does nothing for the other.

| Ceiling | What it limits | On iOS Safari |
| --- | --- | --- |
| Storage quota | How many bytes of model weights you can cache (IndexedDB / CacheStorage / OPFS) | ~1 GB, evictable, when uninstalled. Persistent & disk-bound once installed. |
| Tab memory | How much RAM the running model + KV cache can use before the tab is killed | Web-content process killed around 1.5–2 GB; unaffected by install. |

This page is about the storage quota ceiling — the one that installing the site fixes. For the tab-memory ceiling and how the engine guards it (conservative maxSeqLen, packed-f16 KV cache, crash detection), see iOS Memory Guards.
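To see the storage ceiling on a given device, you can query the standard StorageManager API. The sketch below is illustrative only and is not how Gerbil's own helpers are implemented: `summarizeQuota`, `readStorageCeiling`, the `StorageLike` shape, and the 10% headroom are assumptions made for this example, and MB is taken as 1024 × 1024 bytes.

```typescript
const MB = 1024 * 1024;

interface QuotaSummary {
  quotaMB: number;
  usedMB: number;
  availableMB: number;
  fits: boolean;
}

// Hypothetical helper: turn a storage estimate into a "will this model fit?" answer.
function summarizeQuota(
  estimate: { quota?: number; usage?: number },
  modelMB: number,
  headroom = 0.1, // assumed safety margin, not a Gerbil constant
): QuotaSummary {
  const quotaMB = (estimate.quota ?? 0) / MB;
  const usedMB = (estimate.usage ?? 0) / MB;
  const availableMB = Math.max(0, quotaMB - usedMB);
  return { quotaMB, usedMB, availableMB, fits: availableMB >= modelMB * (1 + headroom) };
}

// The subset of StorageManager used here, so the function can be tested without a browser.
interface StorageLike {
  estimate(): Promise<{ quota?: number; usage?: number }>;
  persisted(): Promise<boolean>;
}

// Read the real numbers, plus whether the origin's storage is already persistent.
async function readStorageCeiling(storage: StorageLike, modelMB: number) {
  const [estimate, persisted] = await Promise.all([storage.estimate(), storage.persisted()]);
  return { ...summarizeQuota(estimate, modelMB), persisted };
}
```

In a browser you would call `readStorageCeiling(navigator.storage, 404)`. The estimate says nothing about tab memory, which is why the two ceilings have to be checked separately.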

Why install to the Home Screen

For the best mobile experience, install the site as a PWA. A web app running from the Home Screen is treated as installed software, which changes the storage rules in your favor:

  • Persistent storage. Cached model weights survive between sessions instead of being evicted under pressure — no re-download on the second visit.
  • Larger quota. The budget is tied to real disk rather than the ~1 GB uninstalled cap, so larger models fit.
  • App-shell offline. The page itself opens without a network connection (Gerbil registers a minimal service worker that caches the shell — never the model).

iOS: there is no programmatic install prompt. Users add the app manually via Share → Add to Home Screen. On Android/Chrome, the beforeinstallprompt event enables a one-tap Install button.
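The platform difference can be handled with a small amount of glue code. This is a minimal sketch, not Gerbil's implementation: `InstallPromptEvent`, `chooseInstallPath`, and `captureInstallPrompt` are names invented for this example. The `beforeinstallprompt` event itself is a Chromium feature and is missing from TypeScript's DOM typings, so the parts used are declared by hand.

```typescript
// The parts of Chromium's BeforeInstallPromptEvent that this sketch uses.
interface InstallPromptEvent extends Event {
  prompt(): Promise<unknown>;
  userChoice: Promise<{ outcome: "accepted" | "dismissed" }>;
}

type InstallPath = "installed" | "one-tap" | "manual";

// Decide which install UI to show. `deferred` is the saved beforeinstallprompt
// event, or null when the browser never fired one (as on iOS).
function chooseInstallPath(isStandalone: boolean, deferred: InstallPromptEvent | null): InstallPath {
  if (isStandalone) return "installed";
  return deferred ? "one-tap" : "manual";
}

// Capture the event so a button click can trigger the prompt later.
function captureInstallPrompt(
  target: EventTarget,
  onReady: (event: InstallPromptEvent) => void,
): void {
  target.addEventListener("beforeinstallprompt", (event) => {
    event.preventDefault(); // keep the browser from showing its own prompt right away
    onReady(event as InstallPromptEvent);
  });
}
```

In a page you would pass `window` as the target, keep the event in a variable, and call its `prompt()` from the Install button's click handler. Standalone mode can be detected with `window.matchMedia("(display-mode: standalone)").matches`, or with the non-standard `navigator.standalone` on iOS Safari. When the result is `"manual"`, show the Share → Add to Home Screen instructions instead of a button.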

How Gerbil already helps

  • Streaming weight load on mobile. On mobile the engine streams weights through CacheStorage by default instead of holding every tensor in the JS heap on top of the GPU buffers. That keeps peak memory bounded during load — the step that used to OOM-crash iOS right after the safetensors header read — so large models load without running the tab out of memory.
  • Crash-surviving submit-group tuning. The engine probes how much GPU work each device can dispatch per submit and tunes itself up on capable hardware, falling back safely if a previous run crashed — so it runs faster where it can without breaking weaker devices.
  • Per-tensor model caching. Weights are cached per-tensor and stale namespaces auto-evict, so switching models doesn't silently pile up dead bytes against your quota.
  • Device-aware model guards. isModelSafeForDevice flags repos that exceed the device's memory budget before you load them — so an oversized model is blocked instead of crashing the tab mid-load.
  • Storage helpers. The @tryhamster/gerbil/browser module exposes the checks below so you can gate a large download behind an install nudge.
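The crash-surviving idea in the second bullet can be illustrated with a persisted "probe in flight" marker. The sketch below shows that general pattern under stated assumptions and is not Gerbil's actual tuning code: the key names, the `KV` interface, and `tuneSubmitGroup` are all hypothetical.

```typescript
// Minimal key-value store; localStorage satisfies this shape in a browser.
interface KV {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const PROBE_KEY = "probe-in-flight"; // hypothetical key names
const CAP_KEY = "probe-crash-cap";

// Try larger group sizes in turn, writing a marker before each attempt.
// If the tab dies mid-probe, the marker survives and the next launch
// records that size as a ceiling it will never retry.
function tuneSubmitGroup(
  store: KV,
  candidates: number[], // ascending; the first entry is assumed safe everywhere
  tryDispatch: (size: number) => boolean, // false means the device rejected it
): number {
  const crashedAt = store.getItem(PROBE_KEY);
  if (crashedAt !== null) {
    store.setItem(CAP_KEY, crashedAt);
    store.removeItem(PROBE_KEY);
  }
  const capRaw = store.getItem(CAP_KEY);
  const cap = capRaw === null ? Infinity : Number(capRaw);

  let best = candidates[0];
  for (const size of candidates.slice(1)) {
    if (size >= cap) break; // crashed here (or lower) before: stay below it
    store.setItem(PROBE_KEY, String(size));
    const ok = tryDispatch(size); // a real crash never returns from this call
    store.removeItem(PROBE_KEY);
    if (!ok) break;
    best = size;
  }
  return best;
}
```

Because the marker is written before the risky step and cleared only after it returns, a killed tab leaves evidence behind without needing any crash handler to run.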

Gating a download behind an install nudge

Check the storage reality before kicking off a ~404 MB download. If the origin isn't installed and the quota looks too small to persist the model, guide the user to install first:

gate-download.ts
import {
  getStorageStatus,         // current quota / usage / persisted state
  canCacheModel,            // will a model of N MB fit and persist?
  requestPersistentStorage, // best-effort: ask for durable storage
  getInstallGuidance,       // platform-specific "how to install" copy
  isStandalone,             // already installed / running from Home Screen?
} from "@tryhamster/gerbil/browser";

// showInstallNudge is your app's own UI function, not a Gerbil export.

async function ensureModelCanCache(modelMB: number) {
  // Already installed? Persistent storage is in effect, so just proceed.
  if (isStandalone()) return true;

  // Best-effort: ask the browser to make our storage durable.
  await requestPersistentStorage();

  const status = await getStorageStatus();
  const fit = await canCacheModel(modelMB);

  if (!fit.fits) {
    // Not enough durable quota while uninstalled: nudge the user to install.
    const guidance = getInstallGuidance();
    showInstallNudge(guidance.steps); // e.g. iOS: Share → Add to Home Screen
    return false;
  }

  console.log(`Quota OK: ${Math.round(status.availableMB)} MB available`);
  return true;
}

// Before downloading the chat model (~404 MB):
if (await ensureModelCanCache(404)) {
  // ...create the engine / start the download
}

On this site the playground shows a small, dismissible “Add Gerbil to your Home Screen” banner on mobile when storage looks tight, using the same idea.

Streaming weight load (mobile default)

Loading is where mobile runs out of memory. The original heap-backed path keeps every tensor resident in the JS heap on top of the GPU buffers — and expands f16/bf16 weights to f32 in-heap — which roughly doubles peak memory and OOM-crashes iOS WebKit right after the safetensors header read.

To avoid that, the engine now streams weights by default on mobile. Tensor bytes live in CacheStorage and the heap holds only lightweight descriptors, so peak memory stays bounded and large models load without the load-time OOM. Desktop and Node keep the proven heap path, which is faster where memory isn't the constraint.
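To make "lightweight descriptors" concrete, here is a minimal sketch that reads only a safetensors header and records where each tensor's bytes live, without copying any tensor data. It follows the published safetensors layout (an 8-byte little-endian header length, then a JSON header with `dtype`, `shape`, and `data_offsets` per tensor). `TensorDescriptor` and `readDescriptors` are names invented for this example; Gerbil's internal structures may look different.

```typescript
// Describes one tensor: where its bytes live, not the bytes themselves.
interface TensorDescriptor {
  name: string;
  dtype: string;
  shape: number[];
  byteOffset: number; // absolute offset within the file
  byteLength: number;
}

type HeaderEntry = { dtype: string; shape: number[]; data_offsets: [number, number] };

// `head` only needs to contain the first 8 + N bytes of the file,
// where N is the header length stored in the first 8 bytes.
function readDescriptors(head: Uint8Array): TensorDescriptor[] {
  const view = new DataView(head.buffer, head.byteOffset, head.byteLength);
  const headerLen = Number(view.getBigUint64(0, true)); // little-endian u64
  const dataStart = 8 + headerLen;
  const json = new TextDecoder().decode(head.subarray(8, dataStart));
  const header = JSON.parse(json) as Record<string, HeaderEntry>;

  const out: TensorDescriptor[] = [];
  for (const [name, info] of Object.entries(header)) {
    if (name === "__metadata__") continue; // optional free-form metadata entry
    const [begin, end] = info.data_offsets; // relative to the end of the header
    out.push({
      name,
      dtype: info.dtype,
      shape: info.shape,
      byteOffset: dataStart + begin,
      byteLength: end - begin,
    });
  }
  return out;
}
```

A streaming loader could then fetch each descriptor's byte range, store it in CacheStorage, and upload one tensor at a time to the GPU, so the heap never holds more than a single tensor's bytes.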

Escape hatch. The default is right for almost everyone. To override it, set globalThis.GERBIL_STREAM_WEIGHTS before creating the engine — true forces streaming anywhere (e.g. a memory-tight desktop), and false forces the heap path even on mobile.

stream-weights.ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// Streaming is the default on mobile; you normally don't touch this.
// To override it, set exactly one of the following before creating the engine.

// Force the heap path (e.g. debugging on a high-memory phone):
globalThis.GERBIL_STREAM_WEIGHTS = false;

// Or force streaming on a memory-tight desktop:
// globalThis.GERBIL_STREAM_WEIGHTS = true;

// ...then create the engine as usual.
const engine = await WebGPUEngine.create();

Hard-blocking models too large for the device

Storage quota decides whether a model can be cached; the tab-memory ceiling decides whether it can run. The 2B-class chat model fits the quota but exceeds the iPhone WebKit memory budget (~1.7 GB working set), so it crashes the tab when it loads. isModelSafeForDevice catches this before you create the engine.

It returns a structured result with a safe flag and a separate risky flag. Qwen3.5-2B on iPhone now reports safe: false, so apps hard-block it by default. The accompanying risky: true marks it as borderline rather than impossible, so a UI can offer an explicit “load anyway” on a device the user insists is capable. A truly oversized model or a vision checkpoint comes back safe: false, risky: false; never offer to force those.

model-safety.ts
import { isModelSafeForDevice } from "@tryhamster/gerbil/browser";
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const repo = "mlx-community/Qwen3.5-2B-4bit";
const check = isModelSafeForDevice(repo);
// On iPhone:
// {
//   safe: false,
//   risky: true, // borderline: OK to offer an explicit "load anyway"
//   reason: "Model ... (~1.7GB) exceeds the iPhone WKWebView memory budget ...",
//   recommendation: "Use mlx-community/Qwen3.5-0.8B-4bit (Qwen3.5-0.8B) on iPhone; ...",
//   maxSafeModel: "mlx-community/Qwen3.5-0.8B-4bit",
// }

if (check.safe) {
  await WebGPUEngine.create({ repo });
} else if (check.risky) {
  // Hard-blocked by default, but borderline: let the user force it on capable HW.
  if (confirm(check.reason + "\n\nLoad anyway?")) {
    await WebGPUEngine.create({ repo });
  } else {
    await WebGPUEngine.create({ repo: check.maxSafeModel });
  }
} else {
  // Genuinely too large (or a vision checkpoint): fall back, don't offer to force.
  await WebGPUEngine.create({ repo: check.maxSafeModel });
}
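If you would rather keep the decision separate from the UI, the three outcomes can be expressed as a small pure function that is easy to unit-test. This is a sketch under assumptions: the `SafetyCheck` shape is inferred from the example result shown in the comments above, and `LoadDecision` and `decideLoad` are hypothetical names that are not part of Gerbil.

```typescript
// Shape assumed from the example result; only the fields used here are listed.
interface SafetyCheck {
  safe: boolean;
  risky: boolean;
  reason?: string;
  maxSafeModel?: string;
}

type LoadDecision =
  | { action: "load"; repo: string }
  | { action: "confirm"; repo: string; reason: string; fallback?: string }
  | { action: "fallback"; repo?: string };

// Map the two flags onto the three behaviours: load, ask first, or fall back.
function decideLoad(repo: string, check: SafetyCheck): LoadDecision {
  if (check.safe) return { action: "load", repo };
  if (check.risky) {
    // Borderline: the caller may offer an explicit "load anyway".
    return { action: "confirm", repo, reason: check.reason ?? "", fallback: check.maxSafeModel };
  }
  // Not safe and not borderline: never offer to force the original repo.
  return { action: "fallback", repo: check.maxSafeModel };
}
```

The UI layer then only has to switch on `action`, and the "never force an oversized model" rule lives in one place.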

Next Steps