Concurrency & Memory

Name: Gerbil
Author: Gerbil

A single page can run several models at once, for example a chat model answering a question while a speech model reads the answer aloud. Gerbil keeps this efficient by sharing one engine per model and managing the load/unload lifecycle for you. This page explains how that sharing works so you can build multi-model experiences with confidence.

One engine per model, shared everywhere

Loading a model uploads its weights to the GPU — the expensive part of starting up. Gerbil does this once per model. Every component that asks for the same model receives the same underlying engine, so the weights live on the GPU a single time no matter how many places use them.

You don't wire this up. Call useEngine wherever you need inference — a sidebar assistant, an inline autocomplete, a help widget — and Gerbil deduplicates them automatically:

shared-engine.tsx

import { useEngine } from "@tryhamster/gerbil/hooks";

// Two different components, same model.
function Assistant() {
  const { complete } = useEngine({ model: "mlx-community/Qwen3.5-0.8B-4bit" });
  // ...
}

function Autocomplete() {
  const { complete } = useEngine({ model: "mlx-community/Qwen3.5-0.8B-4bit" });
  // ...
}

// Both share ONE engine. The model downloads and uploads to the GPU once,
// then both components generate from it.

What makes two requests “the same model” is the full configuration — model, dtype, enableVision, and embedding. Matching configurations share an engine; different ones each get their own.
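One way to picture this matching is as a cache key built from those four fields. The sketch below is illustrative only and is not Gerbil's implementation: `EngineConfig` and `configKey` are hypothetical names, and it assumes that an omitted `dtype` is treated the same as the default `"auto"` and that the two boolean options default to `false`.

```typescript
// Illustrative only: Gerbil's real key format and defaults are not shown on this page.
interface EngineConfig {
  model: string;
  dtype?: string;         // assumed to default to "auto"
  enableVision?: boolean; // assumed to default to false
  embedding?: boolean;    // assumed to default to false
}

// Fill in defaults first so that equivalent requests produce the same key.
function configKey(config: EngineConfig): string {
  return JSON.stringify([
    config.model,
    config.dtype ?? "auto",
    config.enableVision ?? false,
    config.embedding ?? false,
  ]);
}

const keyPlain = configKey({ model: "mlx-community/Qwen3.5-0.8B-4bit" });
const keyAuto = configKey({ model: "mlx-community/Qwen3.5-0.8B-4bit", dtype: "auto" });
const keyVision = configKey({ model: "mlx-community/Qwen3.5-0.8B-4bit", enableVision: true });

console.log(keyPlain === keyAuto);   // true: same configuration, one shared engine
console.log(keyPlain === keyVision); // false: enableVision differs, so a separate engine
```

The practical consequence is that two components only share weights if they pass the same options, so it helps to keep the configuration object in one shared constant.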

Running several models at once

Different models coexist. A chat model, a text-to-speech model, and an embedding model can all be live on the same page at the same time — each is its own engine, and they run independently. This is how you build a voice assistant that thinks, speaks, and remembers:

multi-model.tsx

import { useEngine } from "@tryhamster/gerbil/hooks";

function VoiceAssistant() {
  // Text generation
  const chat = useEngine({ model: "mlx-community/Qwen3.5-0.8B-4bit" });

  // Semantic memory (a separate model, resident alongside chat)
  const memory = useEngine({
    model: "mlx-community/embeddinggemma-300m-4bit",
    embedding: true,
  });

  async function ask(question: string) {
    const answer = await chat.complete(question);
    const vector = await memory.embed(answer); // both models active together
    return { answer, vector };
  }
  // ...
}

Each model loads the first time it's actually used, so a page that could use five models only pays for the ones a given session reaches.
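The load-on-first-use behavior can be pictured as a memoized loader. This is a minimal sketch of the general pattern, not Gerbil's code: `lazy` and `getSpeechModel` are hypothetical names, and the loader body is a stand-in for a real download and GPU upload.

```typescript
// Hypothetical helper: wraps a loader so it runs at most once, on first use.
function lazy<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = load(); // first caller starts the load
    }
    return pending;     // everyone else reuses the same promise
  };
}

let loads = 0;

// Stand-in for an expensive model load.
const getSpeechModel = lazy(async () => {
  loads += 1;
  return { name: "speech" };
});

const loadsBeforeUse = loads;    // 0: defining the loader costs nothing
const first = getSpeechModel();  // first use triggers the load
const second = getSpeechModel(); // later uses attach to the same load
```

A session that never calls `getSpeechModel` never pays for that model, which is the property the paragraph above describes.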

The lifecycle is automatic

You never manually create, track, or tear down engines. Gerbil reference-counts each shared engine and handles the full lifecycle:

- Loads on demand: the first component to use a model triggers the download and GPU upload; everyone after it attaches instantly.
- Switches cleanly: change the model a component requests and Gerbil releases the old engine and attaches the new one, with no leftover state.
- Survives navigation: a model stays warm briefly after the last component using it goes away, so moving between pages or re-rendering doesn't re-upload weights.
- Frees itself: once nothing references a model, its GPU memory is released.

The practical takeaway: just use the hook where you need it. There's no singleton to set up and no provider to wrap your app in — requesting a model is requesting access to the one shared engine for it.
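For readers who want a mental model of the lifecycle, here is a small reference-counting sketch with a warm window. It is an illustration under stated assumptions rather than Gerbil's implementation: `SharedRegistry`, `acquire`, `release`, and `sweep` are hypothetical names, the 1000 ms grace period is invented (the page only says "briefly"), and time is passed in explicitly so the behavior is easy to follow.

```typescript
// Illustrative sketch only; every name here is hypothetical, not Gerbil's API.
interface Entry<E> {
  engine: E;
  refs: number;
  idleSince: number | null; // set when the last user detaches
}

class SharedRegistry<E> {
  private readonly entries = new Map<string, Entry<E>>();
  private readonly create: (key: string) => E;
  private readonly dispose: (engine: E) => void;
  private readonly graceMs: number;

  constructor(create: (key: string) => E, dispose: (engine: E) => void, graceMs: number) {
    this.create = create;
    this.dispose = dispose;
    this.graceMs = graceMs;
  }

  // First caller creates the engine; later callers attach to the existing one.
  acquire(key: string): E {
    let entry = this.entries.get(key);
    if (entry === undefined) {
      entry = { engine: this.create(key), refs: 0, idleSince: null };
      this.entries.set(key, entry);
    }
    entry.refs += 1;
    entry.idleSince = null; // re-attaching cancels any pending release
    return entry.engine;
  }

  // Dropping the last reference starts the warm window instead of freeing at once.
  release(key: string, now: number): void {
    const entry = this.entries.get(key);
    if (entry === undefined || entry.refs === 0) return;
    entry.refs -= 1;
    if (entry.refs === 0) entry.idleSince = now;
  }

  // Frees engines that have been unreferenced for at least graceMs.
  sweep(now: number): void {
    this.entries.forEach((entry, key) => {
      if (entry.idleSince !== null && now - entry.idleSince >= this.graceMs) {
        this.entries.delete(key);
        this.dispose(entry.engine);
      }
    });
  }
}

let created = 0;
const disposed: string[] = [];
const registry = new SharedRegistry<string>(
  (key) => { created += 1; return `engine:${key}`; },
  (engine) => { disposed.push(engine); },
  1000, // assumed grace period
);

const e1 = registry.acquire("chat"); // creates the engine
const e2 = registry.acquire("chat"); // attaches to it
registry.release("chat", 0);
registry.release("chat", 0);         // last reference gone at t = 0
registry.sweep(500);                 // inside the warm window: nothing freed
const freedEarly = disposed.length;  // 0
const e3 = registry.acquire("chat"); // re-attaches without re-creating
registry.release("chat", 600);
registry.sweep(2000);                // idle for 1400 ms: engine is disposed
```

In a real system the sweep would be driven by a timer rather than called by hand; the explicit timestamps are only there to make the warm window visible.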

Budgeting GPU memory

The one thing worth planning for is memory. Every loaded model occupies GPU memory for as long as it's in use, and that budget is much smaller on phones than on desktops. A good rule of thumb:

Device	Comfortable budget	Good fit
Desktop	Several GB	A chat model plus a couple of smaller helpers at once
Phone / tablet	~1–1.5 GB	One small-to-mid model, or two lightweight ones

Two habits keep you within budget on every device:

- Load on demand. Trigger a model when the user reaches the feature that needs it, not all at once on page load, especially when a page offers several models.
- Let dtype adapt to the device. The default dtype: "auto" uses a compact 4-bit build on phones and full precision on desktop, so the same code fits both.
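A back-of-the-envelope check can help when deciding which models to load together. The sketch below estimates weight memory as parameter count times bits per weight, divided by 8. It is a rough lower bound: it ignores the KV cache, activations, and quantization metadata, and it is not how Gerbil measures memory. The helper names are hypothetical, the parameter counts are read off the model names used on this page (0.8B and 300m) and should be treated as approximations, and the 16-bit figure is an assumed stand-in for "full precision".

```typescript
// Weights only: parameters * bits per weight / 8 bits per byte.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

// True when the summed estimates stay within the given budget.
function fitsBudget(models: { bytes: number }[], budgetBytes: number): boolean {
  const total = models.reduce((sum, m) => sum + m.bytes, 0);
  return total <= budgetBytes;
}

const GB = 1e9;
const phoneBudget = 1 * GB; // low end of the ~1–1.5 GB range in the table above

// 4-bit builds of a chat model and an embedding model.
const compact = [
  { name: "chat", bytes: estimateWeightBytes(0.8e9, 4) },      // 0.4 GB
  { name: "embedding", bytes: estimateWeightBytes(300e6, 4) }, // 0.15 GB
];

// The same chat model at an assumed 16 bits per weight.
const fullPrecisionChat = [{ name: "chat", bytes: estimateWeightBytes(0.8e9, 16) }]; // 1.6 GB

console.log(fitsBudget(compact, phoneBudget));           // true: about 0.55 GB of weights
console.log(fitsBudget(fullPrecisionChat, phoneBudget)); // false: 1.6 GB exceeds 1 GB
```

Because real usage sits above these numbers, it is worth leaving headroom rather than planning right up to the budget.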

Outside React

The same approach applies whether you use the hook or the engine directly. Create the engine once and reuse it across your app:

engine-direct.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// Create once, share the reference wherever you generate.
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });

const a = await engine.generate("Summarize this in one line: ...");
const b = await engine.generate("Classify this ticket: ...");

// A second, different model runs alongside the first.
const tts = await WebGPUEngine.create({ repo: "nineninesix/kani-tts-450m-0.2-ft" });
const speech = await tts.speak(a.text);

The guidance is identical: one engine per model, reused everywhere; distinct models live side by side; release an engine with engine.destroy() when you're truly done with it to free its GPU memory.