Production Observability
Built-in telemetry hooks and request queuing for production deployments.
Integrates with Sentry · DataDog · Any metrics system
Telemetry Hooks
Configure telemetry hooks to integrate with Sentry, DataDog, or any monitoring system:
```ts
import { Gerbil } from "@tryhamster/gerbil";
import * as Sentry from "@sentry/node";

const g = new Gerbil({
  telemetry: {
    // Called on any error (model load, generation, etc.)
    onError: (error, context) => {
      Sentry.captureException(error, {
        extra: context,
        tags: { operation: context.operation },
      });
    },

    // Called after successful generation
    onGenerate: (event) => {
      console.log(`Generated ${event.result.tokensGenerated} tokens`);
      // Track in your metrics system
      metrics.histogram("gerbil.tokens_generated", event.result.tokensGenerated);
      metrics.histogram("gerbil.tokens_per_second", event.result.tokensPerSecond);
    },

    // Called after model loading (success or failure)
    onModelLoad: (event) => {
      if (event.success) {
        console.log(`Loaded ${event.modelId} in ${event.loadTimeMs}ms on ${event.device}`);
      } else {
        console.error(`Failed to load ${event.modelId}: ${event.error}`);
      }
    },

    // Called when requests wait in queue (>100ms)
    onQueueWait: (waitTimeMs) => {
      metrics.histogram("gerbil.queue_wait_ms", waitTimeMs);
    },
  },
});
```
onError(error, context)
Called whenever an error occurs during Gerbil operations.
```ts
type ErrorContext = {
  operation: "generate" | "load" | "embed" | "speak" | "transcribe" | "json";
  modelId?: string;
  extra?: Record<string, unknown>;
};
```
onGenerate(event)
Called after successful text generation.
```ts
type GenerateEvent = {
  modelId: string;
  result: GenerateResult;
  cached: boolean;
  queueTimeMs?: number; // Only if waited >100ms
};
```
onModelLoad(event)
Called after model loading completes (success or failure).
```ts
type ModelLoadEvent = {
  modelId: string;
  loadTimeMs: number;
  fromCache: boolean;
  device: "webgpu" | "cpu" | "wasm";
  success: boolean;
  error?: string;
};
```
onQueueWait(waitTimeMs)
Called when a request waits in the queue for more than 100ms. Useful for detecting congestion.
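For example, a minimal handler sketch (the 2-second threshold and the `metrics` client are illustrative, not part of Gerbil):

```ts
const g = new Gerbil({
  telemetry: {
    onQueueWait: (waitTimeMs) => {
      // Record every wait so queue pressure can be graphed over time
      metrics.histogram("gerbil.queue_wait_ms", waitTimeMs);

      // Flag sustained congestion; the threshold is an example value
      if (waitTimeMs > 2_000) {
        console.warn(`Request queued for ${waitTimeMs}ms`);
      }
    },
  },
});
```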
Request Queue
Gerbil uses a request queue to prevent GPU OOM errors under concurrent load. LLM inference can only run one request at a time on the GPU.
Default Behavior
- Concurrency: 1 (a single request at a time)
- Timeout: 5 minutes (300,000 ms)
- Requests are processed in FIFO order
- A timeout error is thrown when a request exceeds the limit (see the sketch below)
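When the timeout is exceeded, the queued call rejects. A minimal sketch of handling that, assuming `g.generate(prompt)` is the call being queued (the prompt and logging here are illustrative):

```ts
try {
  // The call waits its turn in the FIFO queue; if waiting plus
  // generation takes longer than the configured timeout, it rejects.
  const result = await g.generate("Summarize the incident report");
  console.log(result);
} catch (error) {
  // Treat queue timeouts like any other generation failure
  console.error("Generation failed (possibly a queue timeout):", error);
}
```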
Custom Configuration
```ts
const g = new Gerbil({
  concurrency: {
    maxConcurrent: 1,  // Max parallel requests (default: 1)
    timeout: 300_000,  // Request timeout in ms (default: 5 min)
  },
});
```
Why Queue?
LLM inference on GPU is:
- Memory-bound: a loaded model consumes most of the available GPU VRAM
- Non-concurrent: running multiple inferences simultaneously causes OOM errors
- Variable duration: generation time depends on output length
The queue ensures predictable memory usage, no OOM crashes under load, and fair request ordering.
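Concretely, concurrent callers do not fail; their requests are queued and run one after another. A sketch, again assuming `g.generate(prompt)` as the entry point:

```ts
// Three requests arrive at the same time. With maxConcurrent: 1 they
// are processed one at a time in arrival (FIFO) order, so only one
// generation occupies GPU memory at any moment.
const [a, b, c] = await Promise.all([
  g.generate("First prompt"),
  g.generate("Second prompt"),
  g.generate("Third prompt"),
]);
```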
Rate Limiting
Gerbil does not include rate limiting. This is intentional—rate limiting is best handled at the application layer using middleware specific to your framework:
```ts
// Express
import rateLimit from "express-rate-limit";
import { gerbil } from "@tryhamster/gerbil/express";

app.use("/ai", rateLimit({ windowMs: 60000, max: 10 }));
app.use("/ai", gerbil());

// Next.js
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, "60s"),
});

export async function POST(req: Request) {
  const ip = req.headers.get("x-forwarded-for") ?? "anonymous";
  const { success } = await ratelimit.limit(ip);
  if (!success) return Response.json({ error: "Rate limited" }, { status: 429 });

  // Continue with Gerbil...
}
```
Full Production Setup
```ts
import { Gerbil } from "@tryhamster/gerbil";
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

const g = new Gerbil({
  model: "qwen3-0.6b",

  telemetry: {
    onError: (error, context) => {
      Sentry.captureException(error, { extra: context });
    },

    onGenerate: ({ result, queueTimeMs }) => {
      // Log slow generations
      if (result.totalTime > 10000) {
        console.warn(`Slow generation: ${result.totalTime}ms`);
      }

      // Track queue congestion
      if (queueTimeMs && queueTimeMs > 5000) {
        Sentry.captureMessage("High queue wait time", {
          level: "warning",
          extra: { queueTimeMs },
        });
      }
    },

    onModelLoad: (event) => {
      if (!event.success) {
        Sentry.captureMessage(`Model load failed: ${event.error}`, {
          level: "error",
          extra: event,
        });
      }
    },
  },

  concurrency: {
    maxConcurrent: 1,
    timeout: 120_000, // 2 minute timeout
  },
});

// Preload model on startup
await g.loadModel();
console.log("Gerbil ready for production");
```
Health Checks
For production deployments, implement a health check endpoint:
```ts
// Express
app.get("/health", async (req, res) => {
  try {
    const info = g.getInfo();
    res.json({
      status: "ok",
      model: info.model?.id,
      device: info.device.backend,
      ready: info.device.status === "ready",
    });
  } catch (error) {
    res.status(503).json({ status: "error", message: String(error) });
  }
});
```
Next Steps
- Browser Usage → iOS memory guards and crash detection
- Architecture → understand the inference pipeline
- Express Integration → build production APIs