Response Caching

Cache inference responses for repeated prompts to achieve instant results on subsequent calls.

Note: Response caching is distinct from the KV cache (attention-state cache). Response caching stores complete generation outputs, while the KV cache stores internal model state for conversation context.

Enable Response Caching

Pass cache: true to enable caching for a generation call:

enable-cache.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();
await g.loadModel("qwen3-0.6b");

// First call: ~150ms (runs inference)
const result = await g.generate("What is 2+2?", { cache: true });
console.log(result.text); // "4"
console.log(result.cached); // false

// Second call: ~0ms (returns from cache!)
const cached = await g.generate("What is 2+2?", { cache: true });
console.log(cached.text); // "4"
console.log(cached.cached); // true

Custom TTL

By default, cached responses expire after 5 minutes. Customize the TTL with cacheTtl:

custom-ttl.ts
// Cache for 10 minutes
await g.generate("Explain quantum computing", {
  cache: true,
  cacheTtl: 10 * 60 * 1000 // milliseconds
});

// Cache for 1 hour
await g.generate("What's the capital of France?", {
  cache: true,
  cacheTtl: 60 * 60 * 1000
});

// Cache for 30 seconds (useful for dynamic content)
await g.generate("Generate a random fact", {
  cache: true,
  cacheTtl: 30 * 1000
});

Cache Statistics

Monitor cache performance with getResponseCacheStats():

cache-stats.ts
const stats = g.getResponseCacheStats();
console.log(stats);
// {
//   hits: 5,      // Number of cache hits
//   misses: 3,    // Number of cache misses
//   size: 3,      // Number of cached entries
//   hitRate: 62.5 // Hit rate percentage
// }

Clear Response Cache

Clear all cached responses when needed:

clear-cache.ts
// Clear all cached responses
g.clearResponseCache();

// Verify it's cleared
const stats = g.getResponseCacheStats();
console.log(stats.size); // 0

How Cache Keys Work

The cache key is a hash of the following parameters; any difference in these values produces a separate cache entry:

  • Prompt text
  • Model ID
  • maxTokens
  • temperature
  • topP and topK
  • System prompt
  • Thinking mode

cache-key-example.ts
// These are cached separately (different temperature)
await g.generate("Hello", { cache: true, temperature: 0.7 });
await g.generate("Hello", { cache: true, temperature: 0.3 });

// These share the same cache entry
await g.generate("Hello", { cache: true });
await g.generate("Hello", { cache: true }); // Cache hit!
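One way to build such a key is to hash a canonical serialization of the prompt together with every parameter that affects the output. A hypothetical sketch (Gerbil's real key derivation is not documented here, so the field list and hashing below are assumptions):

```typescript
import { createHash } from "node:crypto";

// Parameters assumed to participate in the cache key; changing any of
// them yields a different key and therefore a separate cache entry.
interface KeyParams {
  prompt: string;
  modelId: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  topK?: number;
  systemPrompt?: string;
  thinking?: boolean;
}

function cacheKey(p: KeyParams): string {
  // Serialize fields in a fixed order so equal params hash identically.
  const canonical = JSON.stringify([
    p.prompt, p.modelId, p.maxTokens, p.temperature,
    p.topP, p.topK, p.systemPrompt, p.thinking,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

const a = cacheKey({ prompt: "Hello", modelId: "qwen3-0.6b", temperature: 0.7 });
const b = cacheKey({ prompt: "Hello", modelId: "qwen3-0.6b", temperature: 0.3 });
const c = cacheKey({ prompt: "Hello", modelId: "qwen3-0.6b", temperature: 0.7 });
console.log(a === b); // false: different temperature, separate entries
console.log(a === c); // true: identical params share one entry
```

Serializing in a fixed field order matters: hashing an object whose key order varies would split identical requests across distinct entries.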

Limitations

Response caching is not supported for:

  • Streaming calls — when using onToken callback or stream()
  • Vision/image calls — when passing images option

KV Cache vs Response Cache

Gerbil has two caching mechanisms that serve different purposes:

| Feature       | KV Cache             | Response Cache                  |
| ------------- | -------------------- | ------------------------------- |
| What's cached | Attention states     | Full responses                  |
| Purpose       | Conversation context | Repeated prompts                |
| Clear method  | clearCache()         | clearResponseCache()            |
| Default       | Always on            | Off (enable with cache: true)   |

Best Practices

  • Use caching for deterministic prompts — factual questions, data extraction, and classification tasks benefit most
  • Set low TTL for creative content — if you want variation, use short TTLs or skip caching
  • Monitor cache stats — track hit rates to understand if caching is effective for your use case
  • Clear cache when updating prompts — if you change system prompts or parameters, clear the cache to avoid stale responses
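The first two practices can be folded into a small helper that picks cache options per task type. Purely illustrative: the task categories and TTL values below are assumptions for this sketch, not Gerbil defaults.

```typescript
// Chooses caching options based on how deterministic the task is.
// TTL values are illustrative, not library defaults.
type TaskKind = "factual" | "extraction" | "classification" | "creative";

function cacheOptions(kind: TaskKind): { cache: boolean; cacheTtl: number } {
  switch (kind) {
    case "factual":
    case "extraction":
    case "classification":
      // Deterministic prompts benefit most: cache for an hour.
      return { cache: true, cacheTtl: 60 * 60 * 1000 };
    case "creative":
      // Creative output should vary: keep only a 30-second window.
      return { cache: true, cacheTtl: 30 * 1000 };
  }
}

console.log(cacheOptions("factual"));  // { cache: true, cacheTtl: 3600000 }
console.log(cacheOptions("creative")); // { cache: true, cacheTtl: 30000 }
```

A call site would then spread the result into its options, e.g. `await g.generate(prompt, { ...cacheOptions("factual") })`.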

Testing

Run the response caching test to verify everything works:

Terminal
npx tsx examples/test-cache.ts