Semantic Docs Search

Search these docs by meaning, not keywords. The query is embedded with EmbeddingGemma-300M running entirely in your browser on WebGPU, then cosine-ranked against pre-computed vectors for every doc section. No server, no API keys — the search index ships as a ~1 MB JSON file and the model loads on your first query.

Tip: press ⌘K (or /) anywhere in the docs to open this search as a command palette.

How it works

  • Build time: a Node script walks every app/docs/**/page.tsx, extracts the readable prose per heading-anchored section, chunks it, and embeds each chunk with the same native engine in Node — then writes a compact float16 JSON that is committed to the repo (so the deploy never needs a GPU).
  • Runtime: only your query is embedded on-device. Cosine ranking against the shipped vectors is a few hundred dot products — instant, even on a phone.
  • Graceful fallback: while the model downloads (or on browsers without WebGPU) the box answers instantly with a plain substring match, then upgrades to semantic ranking the moment the engine is ready.
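
The graceful fallback described above can be sketched roughly as follows. This is a minimal illustration, not the page's actual code: `substringSearch`, `searchDocs`, and the `semanticSearch` callback are hypothetical names, and `chunks` is assumed to hold objects with `heading` and `text` fields like the shipped index.

```javascript
// Hypothetical fallback: plain case-insensitive substring match over each chunk.
function substringSearch(chunks, query, limit = 8) {
  const q = query.trim().toLowerCase();
  if (!q) return [];
  return chunks
    .filter((c) => `${c.heading} ${c.text}`.toLowerCase().includes(q))
    .slice(0, limit);
}

// `semanticSearch` is assumed to stay null until the model has loaded
// (and forever on browsers without WebGPU).
async function searchDocs(chunks, query, semanticSearch = null) {
  if (semanticSearch) {
    try {
      return await semanticSearch(query);
    } catch {
      // If the engine fails mid-query, fall through to the substring match.
    }
  }
  return substringSearch(chunks, query);
}
```

Because the caller only swaps `semanticSearch` from `null` to a real function once the engine is ready, the search box can answer from the first keystroke and upgrade without any change to the UI code.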

Build the index (Node, no GPU at deploy)

This is the actual offline pipeline behind the search box (see scripts/build-doc-embeddings.mjs). It extracts prose per heading-anchored section, chunks it, embeds each chunk in Node with the native engine, quantizes the vectors to float16, and writes a compact JSON that is committed to the repo — so the Vercel build never needs a GPU. Re-run it whenever the docs change.

build-doc-embeddings.mjs
// scripts/build-doc-embeddings.mjs — run in Node, commit the output.
import { readFile, writeFile } from "node:fs/promises";
import { embed } from "@tryhamster/gerbil";
const EMBED_REPO = "mlx-community/embeddinggemma-300m-4bit";
const MAX_CHUNK_WORDS = 130;
// 1. Extract the readable prose for each heading-anchored <h2 id="…"> section.
function extractSections(source) {
  const headingRe = /<h2\s+id="([^"]+)"[^>]*>([\s\S]*?)<\/h2>/g;
  const proseRe = /<(p|li|h3)\b[^>]*>([\s\S]*?)<\/\1>/g;
  const clean = (s) => s.replace(/<[^>]+>|\{[^}]*\}/g, " ").replace(/\s+/g, " ").trim();
  const heads = [...source.matchAll(headingRe)];
  return heads.map((h, i) => {
    const start = h.index + h[0].length;
    const end = heads[i + 1]?.index ?? source.length;
    const text = [...source.slice(start, end).matchAll(proseRe)]
      .map((m) => clean(m[2]))
      .filter(Boolean)
      .join(" ");
    return { heading: clean(h[2]), anchor: h[1], text };
  });
}
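// 2. Chunk a long section into ~130-word pieces on sentence boundaries.
function chunk(text) {
  const countWords = (s) => s.trim().split(/\s+/).filter(Boolean).length;
  if (countWords(text) <= MAX_CHUNK_WORDS) return [text];
  const out = [];
  let buf = [];
  let bufWords = 0;
  // Sentences end in ., ! or ?; the second alternative keeps an unterminated tail whole.
  for (const s of text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [text]) {
    const n = countWords(s);
    if (n === 0) continue; // skip whitespace-only pieces
    // Flush only a non-empty buffer, so an over-long first sentence never yields an empty chunk.
    if (buf.length && bufWords + n > MAX_CHUNK_WORDS) {
      out.push(buf.join(" "));
      buf = [];
      bufWords = 0;
    }
    buf.push(s.trim());
    bufWords += n;
  }
  if (buf.length) out.push(buf.join(" "));
  return out;
}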
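// 4. Quantize float32 → float16 so the committed JSON is ~half the size.
// Cosine ranking is effectively unchanged: the vectors are L2-normalized, so every
// component lies in [-1, 1], where float16 keeps about three significant digits.
function f32ToF16(val) {
  const f32 = new Float32Array([val]);
  const x = new Int32Array(f32.buffer)[0]; // raw IEEE 754 bits of the float32
  const sign = (x >> 16) & 0x8000;
  const exp = ((x >> 23) & 0xff) - 127 + 15; // re-bias the exponent from 8 to 5 bits
  if (exp <= 0) return sign; // too small for a normal float16: flush to ±0
  if (exp >= 0x1f) return sign | 0x7c00; // too large (or Inf/NaN): ±Infinity
  return sign | (exp << 10) | ((x & 0x7fffff) >> 13); // truncate the mantissa to 10 bits
}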
// One page is shown for brevity; per the overview, the script walks every app/docs/**/page.tsx.
const source = await readFile("app/docs/search/page.tsx", "utf8");
const chunks = [];
for (const sec of extractSections(source)) {
  for (const text of chunk(sec.text)) {
    // 3. Embed each chunk in Node with the SAME native engine the browser
    // uses at query time: EmbeddingGemma-300M, 768-dim, on WebGPU (Dawn).
    const { vector } = await embed(`${sec.heading}. ${text}`, {
      model: EMBED_REPO,
      // EmbeddingGemma is asymmetric: the corpus side uses the "document" task.
      taskType: "document",
    });
    chunks.push({
      anchor: sec.anchor,
      heading: sec.heading,
      text,
      // Store as float16 bit patterns (integers 0..65535). Array.from yields a plain
      // array even if `vector` is a typed array, so it serializes as a JSON list.
      v: Array.from(vector, f32ToF16),
    });
  }
}
// 5. Write a compact JSON that gets committed, so the deploy never needs a GPU.
await writeFile(
  "lib/doc-embeddings.json",
  JSON.stringify({ model: EMBED_REPO, dim: 768, encoding: "f16", chunks })
);
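
Since the query must be embedded with the same model that produced the committed vectors, it can help to sanity-check the JSON before relying on it, for example in CI or at page load. The following is a sketch under assumptions: `validateIndex` is a hypothetical helper, not part of the build script, and it relies only on the `model`, `dim`, `encoding`, and `chunks[].v` fields written above.

```javascript
// Hypothetical helper: returns a list of problems found in the committed index
// (an empty list means the index looks consistent).
function validateIndex(index, expectedModel) {
  const problems = [];
  if (index.model !== expectedModel) {
    problems.push(`model mismatch: index has "${index.model}", expected "${expectedModel}"`);
  }
  if (index.encoding !== "f16") {
    problems.push(`unexpected encoding "${index.encoding}"`);
  }
  index.chunks.forEach((c, i) => {
    if (!Array.isArray(c.v) || c.v.length !== index.dim) {
      problems.push(`chunk ${i} (#${c.anchor}) has ${c.v?.length ?? 0} values, expected ${index.dim}`);
    } else if (c.v.some((h) => !Number.isInteger(h) || h < 0 || h > 0xffff)) {
      // float16 bit patterns are 16-bit unsigned integers.
      problems.push(`chunk ${i} (#${c.anchor}) contains a value that is not a float16 bit pattern`);
    }
  });
  return problems;
}
```

A non-empty result is a signal to re-run the build script rather than ship a stale or mismatched index.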

Search at runtime (query-only, on-device)

At runtime only the user's query is embedded — with the same EmbeddingGemma model on WebGPU — then cosine-ranked against the shipped vectors. That's a few hundred dot products: instant, even on a phone, with no server and no API keys.

doc-search.ts
// In the browser: embed the query and cosine-rank the committed vectors.
// `embed` is the `useEmbedding()` hook's embedder (same engine, on WebGPU).
import index from "@/lib/doc-embeddings.json";
const f16ToF32 = (h) => {
  const sign = (h & 0x8000) << 16;
  const exp = (h >> 10) & 0x1f;
  const mant = h & 0x3ff;
  // The build script flushes subnormals to zero, so exp === 0 decodes as ±0.
  if (exp === 0) return new Float32Array(new Int32Array([sign]).buffer)[0];
  if (exp === 0x1f) return mant ? NaN : sign ? -Infinity : Infinity;
  const bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
  return new Float32Array(new Int32Array([bits]).buffer)[0];
};
// Decode the float16 vectors once at module load instead of on every query.
const vectors = index.chunks.map((c) => Float32Array.from(c.v, f16ToF32));
// Both sides are L2-normalized, so the dot product is the cosine similarity.
const cosine = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);
export async function search(query, embed) {
  // Query side of EmbeddingGemma's asymmetric retrieval.
  const { vector } = await embed(query, { taskType: "query" });
  return index.chunks
    .map((c, i) => ({ ...c, score: cosine(vector, vectors[i]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 8);
}
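
The `cosine` helper above is a plain dot product, which only equals cosine similarity because both vectors have unit length. A small self-contained check of that identity is sketched below; `l2Normalize`, `dot`, and `fullCosine` are illustrative names, and the two-dimensional vectors are toy values rather than real embeddings.

```javascript
// Scale a vector to unit length (L2 norm of 1); a zero vector is returned unchanged.
function l2Normalize(v) {
  const norm = Math.hypot(...v);
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

// Textbook cosine similarity: dot product divided by the product of both norms.
const fullCosine = (a, b) => dot(a, b) / (Math.hypot(...a) * Math.hypot(...b));

const q = [3, 4];
const d = [4, 3];
console.log(fullCosine(q, d)); // 0.96
// Same value, up to floating-point rounding, from a bare dot product of unit vectors:
console.log(dot(l2Normalize(q), l2Normalize(d)));
```

Normalizing once at build time moves the division out of the query path, so ranking costs one multiply-add per dimension for each chunk.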