Getting Started

Name: Gerbil
Author: Gerbil

Gerbil gets you running a local LLM in just a few minutes. Install the package, load a model, and generate your first response — all on your own machine.

Installation

One-off usage — try without installing:

Terminal

npx @tryhamster/gerbil "Write a haiku about coding"

Global install — use gerbil command directly:

Terminal

npm install -g @tryhamster/gerbil

Local install — for programmatic use in your project:

Terminal

npm install @tryhamster/gerbil

Note: CLI examples on this site use gerbil and assume a global install. If you have not installed it, substitute npx @tryhamster/gerbil.

Quick Start

One-liner API

The simplest way to use Gerbil. No setup required:

index.ts

import gerbil from "@tryhamster/gerbil";

const text = await gerbil("Explain recursion in one sentence");
console.log(text);
// → "Recursion is when a function calls itself to solve smaller instances of the same problem."

Engine API

For more control, create a WebGPUEngine directly. It downloads safetensors from the HuggingFace Hub and caches them in IndexedDB on first run:

standalone.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// dtype "auto" picks int4 on mobile, the repo's native precision on desktop.
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  dtype: "auto",
});

// Generate text
const result = await engine.generate("Write a haiku about coding", {
  maxTokens: 100,
  sampling: { temperature: 0.8 },
});

console.log(result.text);
console.log(`Speed: ${result.tokensPerSecond.toFixed(1)} tok/s`);

// Clean up when done
engine.destroy();
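
If generation throws, the final destroy() call above is never reached. One way to guarantee cleanup is a try/finally wrapper. The sketch below is a generic pattern, not part of Gerbil: withResource is a hypothetical helper, and a fake resource stands in for the engine so the snippet runs on its own.

```typescript
// Anything with a destroy() method, such as the engine shown above.
interface Destroyable {
  destroy(): void;
}

// Hypothetical helper: create a resource, use it, and always destroy it.
async function withResource<R extends Destroyable, T>(
  create: () => Promise<R>,
  use: (resource: R) => Promise<T>,
): Promise<T> {
  const resource = await create();
  try {
    return await use(resource);
  } finally {
    resource.destroy(); // runs on success and on error
  }
}

// Fake resource standing in for WebGPUEngine.create(...).
const steps: string[] = [];
const fakeCreate = async () => ({
  destroy: () => {
    steps.push("destroyed");
  },
});

const outcome = withResource(fakeCreate, async () => {
  steps.push("used");
  return "done";
});
outcome.then((value) => console.log(value, steps)); // prints: done [ 'used', 'destroyed' ]
```

With a real engine, the create callback would call WebGPUEngine.create(...) and the use callback would call engine.generate(...).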

In the browser

In a React app, the one-liner is the useEngine hook from @tryhamster/gerbil/hooks. It lazy-loads the model, streams generated text into completion, and needs no model argument:

Chat.tsx

import { useEngine } from "@tryhamster/gerbil/hooks";

function Chat() {
  const { complete, completion, isGenerating } = useEngine();
  return (
    <button onClick={() => complete("Write a haiku about coding")} disabled={isGenerating}>
      {completion || "Generate"}
    </button>
  );
}

Outside React (vanilla JS / Node), drive the native WebGPUEngine directly — a lean bundle that runs the same code across desktop and mobile. The repo is optional (omit it for a device-aware default):

browser.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// Zero-config: no repo → a device-aware default. Or pass any repo.
// Downloads + caches the model in IndexedDB on first run.
const engine = await WebGPUEngine.create();

const { text } = await engine.generate("Write a haiku about coding");
console.log(text);

See the Browser documentation for chat, vision, and embeddings.

Streaming

Stream responses token by token:

streaming.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});

for await (const token of engine.stream("Tell me a story")) {
  process.stdout.write(token);
}

engine.destroy();
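
In a browser there is no process.stdout, so a common alternative is to accumulate tokens into a string and update the UI as they arrive. The sketch below assumes only that the stream is an async iterable of strings, as the for await loop above implies. collectStream and fakeStream are hypothetical names, and fakeStream stands in for engine.stream() so the example is self-contained.

```typescript
// Hypothetical helper: drains any async iterable of string tokens.
// onToken is an optional callback for incremental UI updates.
async function collectStream(
  stream: AsyncIterable<string>,
  onToken?: (token: string, soFar: string) => void,
): Promise<string> {
  let text = "";
  for await (const token of stream) {
    text += token;
    onToken?.(token, text);
  }
  return text;
}

// Stand-in for engine.stream(...) so the example runs without a model.
async function* fakeStream(): AsyncGenerator<string> {
  for (const token of ["Once ", "upon ", "a ", "time"]) {
    yield token;
  }
}

const story: Promise<string> = collectStream(fakeStream());
story.then((full) => console.log(full)); // prints "Once upon a time"
```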

Thinking Mode

Qwen3 models reason before answering. The chain-of-thought is separated out for you on result.thinking, so you can show or hide it:

thinking.ts

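import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});

const result = await engine.generate("What is 127 × 43?");

console.log("Thinking:", result.thinking);
// → "127 × 43 = 127 × 40 + 127 × 3 = 5080 + 381 = 5461"

console.log("Answer:", result.text);
// → "5461"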
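
For intuition, Qwen3 checkpoints typically wrap their reasoning in <think>...</think> tags in the raw output. The sketch below shows one way such output could be split into the two fields. It is an illustration only and not necessarily how Gerbil parses the output; splitThinking is a hypothetical name.

```typescript
interface SplitResult {
  thinking: string; // reasoning found inside <think>...</think>, or ""
  text: string; // everything outside the tags, trimmed
}

// Hypothetical helper: separate the first <think> block from the answer.
function splitThinking(raw: string): SplitResult {
  const match = raw.match(/<think>([\s\S]*?)<\/think>/);
  if (!match) {
    return { thinking: "", text: raw.trim() };
  }
  const thinking = (match[1] ?? "").trim();
  const text = raw.replace(match[0], "").trim();
  return { thinking, text };
}

const rawOutput = "<think>127 × 43 = 5080 + 381 = 5461</think>\n\n5461";
const parts = splitThinking(rawOutput);
console.log(parts.thinking); // prints "127 × 43 = 5080 + 381 = 5461"
console.log(parts.text); // prints "5461"
```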

Structured JSON Output

Get structured data with Zod schema validation:

json.ts

import { json } from "@tryhamster/gerbil";
import { z } from "zod";

const person = await json("Extract: John is 32 and lives in NYC", {
  schema: z.object({
    name: z.string(),
    age: z.number(),
    city: z.string(),
  }),
});

console.log(person);
// → { name: "John", age: 32, city: "NYC" }

Embeddings

Generate embeddings for semantic search and RAG with the native EmbeddingGemma-300M model (768 dimensions, ~173 MB). Load it with embedding: true; embed() returns a unit-length (L2-normalized) vector:

embeddings.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "mlx-community/embeddinggemma-300m-4bit",
  embedding: true,
});

// EmbeddingGemma is asymmetric — tag queries vs. documents.
const query = await engine.embed("how do I cache a model?", { taskType: "query" });
const doc = await engine.embed("Models are cached in IndexedDB.", { taskType: "document" });

// Cosine similarity is just a dot product (vectors are L2-normalized).
const score = query.reduce((sum, q, i) => sum + q * doc[i], 0);
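
The same dot product extends naturally to ranking several documents against one query. The sketch below uses toy 3-dimensional vectors in place of real embeddings so it runs without a model. l2Normalize, dot, and rankBySimilarity are hypothetical helper names, not Gerbil APIs.

```typescript
// Scale a vector to unit L2 length (embed() output is described as already normalized).
function l2Normalize(v: number[]): number[] {
  const norm = Math.hypot(...v);
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Dot product; equals cosine similarity when both inputs are unit length.
function dot(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

interface Ranked {
  index: number;
  score: number;
}

// Return document indices sorted from most to least similar to the query.
function rankBySimilarity(
  queryVec: ArrayLike<number>,
  docVecs: ArrayLike<number>[],
): Ranked[] {
  return docVecs
    .map((d, index) => ({ index, score: dot(queryVec, d) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors standing in for real 768-dimensional embeddings.
const toyQuery = l2Normalize([1, 0, 0]);
const toyDocs = [
  l2Normalize([0, 1, 0]), // orthogonal to the query, score 0
  l2Normalize([2, 0, 0]), // same direction, score 1
  l2Normalize([1, 1, 0]), // 45 degrees away, score ≈ 0.707
];
const ranking = rankBySimilarity(toyQuery, toyDocs);
console.log(ranking.map((r) => r.index)); // prints [ 1, 2, 0 ]
```

For a handful of documents a linear scan like this is usually enough; larger collections generally call for a vector index.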

Vision AI

Vision runs on Qwen3.5-0.8B's own built-in ViT tower (no separate vision model). Build the engine with enableVision: true and call describeImage():

vision.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// The ViT tower (~192MB) ships inside the lean Qwen3.5 4-bit repo.
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  enableVision: true,
});

// Pass a data/http URL or pre-decoded RGB pixels.
const { text } = await engine.describeImage(
  { pixels, width, height },
  "What's in this image?"
);

console.log(text);
// → "A golden retriever playing fetch in a sunny park..."

See the Vision documentation for more details.
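
Canvas APIs hand back RGBA bytes, while the example above mentions pre-decoded RGB pixels. The sketch below drops the alpha channel. It assumes the engine wants tightly packed 8-bit RGB in a Uint8Array, which this page does not spell out, so confirm the expected layout in the Vision documentation. rgbaToRgb is a hypothetical helper.

```typescript
// Convert RGBA bytes (4 per pixel, as in canvas ImageData) to packed RGB (3 per pixel).
// Assumption: the consumer accepts 8-bit packed RGB; check the Vision documentation.
function rgbaToRgb(
  rgba: Uint8Array | Uint8ClampedArray,
  width: number,
  height: number,
): Uint8Array {
  const pixelCount = width * height;
  if (rgba.length !== pixelCount * 4) {
    throw new Error(`expected ${pixelCount * 4} RGBA bytes, got ${rgba.length}`);
  }
  const rgb = new Uint8Array(pixelCount * 3);
  for (let i = 0, j = 0; i < rgba.length; i += 4, j += 3) {
    rgb[j] = rgba[i]; // red
    rgb[j + 1] = rgba[i + 1]; // green
    rgb[j + 2] = rgba[i + 2]; // blue (alpha at i + 3 is dropped)
  }
  return rgb;
}

// A 2x1 image: one opaque red pixel, one half-transparent blue pixel.
const rgbaBytes = new Uint8ClampedArray([255, 0, 0, 255, 0, 0, 255, 128]);
const rgbBytes = rgbaToRgb(rgbaBytes, 2, 1);
console.log(Array.from(rgbBytes)); // prints [ 255, 0, 0, 0, 0, 255 ]
```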

Text-to-Speech

Synthesize speech on-device with native Kani-TTS-2. Call engine.speak() and it returns raw 22.05 kHz mono PCM:

tts.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "nineninesix/kani-tts-450m-0.2-ft",
});

const { pcm, sampleRate } = await engine.speak(
  "Hello from Gerbil! This runs entirely on-device.",
);

// pcm → Float32Array mono, sampleRate → 22050

See the Text-to-Speech documentation for voices and controls.
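
Raw PCM is not directly playable by most audio players, so a frequent next step is to wrap it in a WAV container. The sketch below writes a standard 44-byte header followed by 16-bit samples. encodeWav is a hypothetical helper, not a Gerbil API, and it assumes the samples lie in the range [-1, 1].

```typescript
// Encode mono Float32 PCM (samples in [-1, 1]) as a 16-bit PCM WAV file.
function encodeWav(pcm: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // format 1 = integer PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < pcm.length; i++) {
    const s = Math.max(-1, Math.min(1, pcm[i])); // clamp to avoid overflow
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}

// One second of silence at the 22.05 kHz rate quoted above.
const wavBytes = encodeWav(new Float32Array(22050), 22050);
console.log(wavBytes.length); // prints 44144
```

In a browser, the resulting bytes could be wrapped in a Blob with type audio/wav and played through an audio element.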

Speech-to-Text

Transcribe audio on-device with native Moonshine. It takes raw 16 kHz mono PCM (Float32Array), runs an encoder-decoder ASR pass, and returns the full transcript:

stt.ts

import { MoonshineSTT } from "@tryhamster/gerbil/gpu";

const stt = await MoonshineSTT.create(); // UsefulSensors/moonshine-base

// pcm must be 16 kHz mono Float32 (resample mic/file audio first).
const { text, audioSeconds } = await stt.transcribe(pcm);

console.log(text); // "Hello world"

See the Speech-to-Text documentation for details.
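
Microphone and file audio often arrives as 44.1 or 48 kHz stereo, so it needs converting before transcription. The sketch below is a deliberately naive approach using linear interpolation with no low-pass filter, which can alias when downsampling. resampleLinear and stereoToMono are hypothetical helpers, not Gerbil APIs; a dedicated resampler is a better fit where quality matters.

```typescript
// Naive linear-interpolation resampler for mono Float32 PCM.
// No low-pass filter is applied, so downsampling can alias.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.round((input.length * toRate) / fromRate);
  const output = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}

// Average interleaved stereo frames (L, R, L, R, ...) down to mono.
function stereoToMono(interleaved: Float32Array): Float32Array {
  const mono = new Float32Array(Math.floor(interleaved.length / 2));
  for (let i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}

// One second of 48 kHz audio becomes 16,000 samples.
const micAudio = new Float32Array(48000);
const pcm16k = resampleLinear(micAudio, 48000, 16000);
console.log(pcm16k.length); // prints 16000
```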

Using Any HuggingFace Model

Load any compatible model from HuggingFace:

custom-model.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// HF repo id (the lean 4-bit Qwen3.5 checkpoint)
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
});

// Faster/smaller alternative: LFM2.5-350M (~199MB)
const fast = await WebGPUEngine.create({
  repo: "LiquidAI/LFM2.5-350M-MLX-4bit",
});

💡 Pro Tip: Preload Models

Create the engine during app initialization so weights are downloaded, compiled, and cached in IndexedDB before the user's first prompt. Reuse the same instance afterwards:

preload.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

// Warm up at startup — downloads + caches the model.
const engine = await WebGPUEngine.create({
  repo: "mlx-community/Qwen3.5-0.8B-4bit",
  onProgress: (loaded, total, message) =>
    console.log(message, total ? loaded / total : 0),
});

// Subsequent prompts reuse the same warm engine, so nothing is reloaded.
await engine.generate("Hello!");

See the Browser documentation for caching details.
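
To reuse one instance across an app, a simple option is to memoize the promise returned by the create call, so that concurrent callers share a single load. The sketch below is a generic pattern rather than a Gerbil feature: memoizeAsync and getEngine are hypothetical names, and a fake factory stands in for WebGPUEngine.create() so the snippet runs without a GPU.

```typescript
// Wrap an async factory so it runs at most once; every caller shares the same promise.
function memoizeAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = factory().catch((err) => {
        cached = undefined; // allow a retry after a failed load
        throw err;
      });
    }
    return cached;
  };
}

// Stand-in for WebGPUEngine.create(...) so the sketch runs anywhere.
let createCalls = 0;
const getEngine = memoizeAsync(async () => {
  createCalls++;
  return { id: "fake-engine" };
});

// Kick off loading at startup, then call getEngine() wherever a prompt is handled.
const firstCall = getEngine();
const secondCall = getEngine();
console.log(firstCall === secondCall, createCalls); // prints "true 1"
```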

Next Steps

- Explore available models
- Use Vision AI: describe images, extract text, analyze screenshots
- Text-to-Speech: native Kani-TTS-2 speech synthesis
- Speech-to-Text: transcribe audio with native Moonshine
- Use built-in skills like commit, summarize, review, or create your own
- Integrate with your framework: AI SDK, Next.js, Express
- Enable response caching for instant repeated prompts
- Set up MCP server for Claude Desktop
- Try the playground to experiment in your browser