# Semantic Cache
Verified by Dryade
## Description
Two-tier semantic caching with Redis (exact) and Qdrant (semantic)
## Details

### Semantic Cache Plugin

Semantic caching for LLM responses using vector similarity search.

### Overview
This plugin provides a two-tier caching system:
- Exact cache: Redis-backed hash-based lookup for identical queries
- Semantic cache: Qdrant vector search for semantically similar queries
### Architecture

```
Query --> SHA256 Hash --> Redis (exact match)
  |
  +--> FastEmbed --> Qdrant (semantic search)
                       |
                       +--> similarity >= threshold? --> Cache Hit
```
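The two-tier lookup in the diagram can be sketched with in-memory stand-ins for Redis and Qdrant. This is a minimal illustration, not the plugin's implementation: the real `SemanticCache` stores SHA-256 keys in Redis and FastEmbed vectors in Qdrant, but the control flow (exact hash first, then nearest vector above the threshold) is the same:

```python
import hashlib
import math

def exact_key(query: str) -> str:
    # Tier 1 key: SHA-256 of the normalized query text
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class TwoTierCacheSketch:
    """Illustrative two-tier lookup: exact hash first, then vector similarity."""

    def __init__(self, embed, threshold=0.90):
        self.embed = embed        # callable: str -> list[float]
        self.threshold = threshold
        self.exact = {}           # stands in for the Redis exact cache
        self.vectors = []         # stands in for Qdrant: (vector, response)

    def set(self, query, response):
        self.exact[exact_key(query)] = response
        self.vectors.append((self.embed(query), response))

    def get(self, query):
        # Tier 1: exact match on the query hash
        hit = self.exact.get(exact_key(query))
        if hit is not None:
            return hit
        # Tier 2: best vector at or above the similarity threshold
        qv = self.embed(query)
        best, best_sim = None, 0.0
        for vec, resp in self.vectors:
            sim = cosine(qv, vec)
            if sim >= self.threshold and sim > best_sim:
                best, best_sim = resp, sim
        return best
```

A query that hashes identically is served from tier 1 without computing an embedding; only misses pay the embedding and vector-search cost.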
### Components

| File | Purpose |
|------|---------|
| `cache.py` | Main `SemanticCache` class with two-tier lookup |
| `config.py` | Configuration with Pydantic models |
| `embedder.py` | FastEmbed wrapper for generating embeddings |
| `redis_store.py` | Redis backend for exact-match caching |
| `qdrant_store.py` | Qdrant backend for vector similarity |
| `wrapper.py` | LLM call wrappers with automatic caching |
### Configuration

```bash
# Enable/disable caching
DRYADE_SEMANTIC_CACHE_ENABLED=true

# Similarity threshold for semantic matches (0.85-0.95 recommended)
DRYADE_SEMANTIC_CACHE_THRESHOLD=0.90

# Service URLs
DRYADE_QDRANT_URL=http://localhost:6333
DRYADE_REDIS_URL=redis://localhost

# TTL settings (seconds)
DRYADE_SEMANTIC_CACHE_EXACT_TTL=3600      # 1 hour
DRYADE_SEMANTIC_CACHE_SEMANTIC_TTL=86400  # 24 hours
```
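The plugin's `config.py` loads these variables into Pydantic models; its actual field names are not shown here. As a rough sketch of the mapping from environment to typed settings (using a plain dataclass so it stands alone; all field names are illustrative):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheSettingsSketch:
    """Illustrative settings object; the plugin itself uses Pydantic models."""
    enabled: bool
    threshold: float
    qdrant_url: str
    redis_url: str
    exact_ttl: int
    semantic_ttl: int

    @classmethod
    def from_env(cls) -> "CacheSettingsSketch":
        env = os.environ.get
        return cls(
            enabled=env("DRYADE_SEMANTIC_CACHE_ENABLED", "true").lower() == "true",
            threshold=float(env("DRYADE_SEMANTIC_CACHE_THRESHOLD", "0.90")),
            qdrant_url=env("DRYADE_QDRANT_URL", "http://localhost:6333"),
            redis_url=env("DRYADE_REDIS_URL", "redis://localhost"),
            exact_ttl=int(env("DRYADE_SEMANTIC_CACHE_EXACT_TTL", "3600")),
            semantic_ttl=int(env("DRYADE_SEMANTIC_CACHE_SEMANTIC_TTL", "86400")),
        )
```

The defaults mirror the values shown above, so an empty environment yields a working local configuration.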
### Usage

#### Basic Usage

```python
from plugins.semantic_cache import get_semantic_cache

cache = get_semantic_cache()

# Store a response
await cache.set("What is MBSE?", "MBSE is Model-Based Systems Engineering...")

# Retrieve (exact or semantic match)
response = await cache.get("What is Model Based Systems Engineering?")
```
#### LLM Wrapper

```python
from plugins.semantic_cache.wrapper import cached_llm_call, cached_llm_stream

# Non-streaming
response = await cached_llm_call(
    query="Explain requirements traceability",
    llm_func=llm.acall,
    messages=[{"role": "user", "content": "Explain requirements traceability"}],
)

# Streaming
async for chunk in cached_llm_stream(
    query="Explain AI safety",
    llm_stream_func=llm.astream,
    messages=[...],
):
    if isinstance(chunk, CacheHitMarker):
        print(f"Cache hit: {chunk.content}")
    else:
        print(chunk, end="")
```
#### Class Decorator

```python
from plugins.semantic_cache.wrapper import cache_enabled_llm

@cache_enabled_llm
class MyLLM:
    async def acall(self, messages, **kwargs):
        ...

    async def astream(self, messages, **kwargs):
        ...
```
### Dependencies

- `fastembed`: Fast embedding generation (uses `sentence-transformers/all-MiniLM-L6-v2`)
- `qdrant-client`: Vector database client
- `redis`: Exact-match cache backend
### Fallback Behavior

If Qdrant or Redis is unavailable, the cache falls back to in-memory storage when `fallback_to_memory=True` (the default).
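The fallback pattern can be sketched as follows. This is an illustration of the idea rather than the plugin's code: a stand-in primary backend that raises on connection failure, with an in-process dict (plus TTL expiry) taking over:

```python
import time

class MemoryFallbackStoreSketch:
    """Illustrative fallback store: if the primary backend raises a
    connection error, serve from an in-process dict with TTL expiry
    (mirrors the behavior of fallback_to_memory=True)."""

    def __init__(self, primary=None):
        self.primary = primary    # e.g. a Redis client; None simulates an outage
        self.memory = {}          # key -> (value, expires_at)

    def set(self, key, value, ttl=3600):
        try:
            if self.primary is None:
                raise ConnectionError("primary unavailable")
            self.primary.set(key, value, ex=ttl)
        except ConnectionError:
            self.memory[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        try:
            if self.primary is None:
                raise ConnectionError("primary unavailable")
            return self.primary.get(key)
        except ConnectionError:
            entry = self.memory.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if time.monotonic() > expires_at:
                del self.memory[key]  # expired: drop and report a miss
                return None
            return value
```

Note the trade-off: the in-memory tier is per-process, so entries are not shared across workers and are lost on restart.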
### Performance

- Embedding generation: ~5 ms per query (384-dim vectors)
- Redis lookup: ~1 ms
- Qdrant search: ~5-10 ms
- Cache hit rate: typically 40-60% in production
### Integration with Self-Healing

The cache wrapper automatically integrates with the self-healing plugin for retry logic on cache misses:

```python
# Cache miss flow:
# 1. Acquire queue slot (concurrency control)
# 2. Execute LLM call with self-healing retry
# 3. Cache the response
# 4. Release queue slot
```
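The four steps above can be sketched as an async helper. Every name here is hypothetical: `queue` is any async context manager (e.g. an `asyncio.Semaphore`) standing in for the concurrency control, and `call_with_retry` stands in for the self-healing plugin's retry helper:

```python
import asyncio

async def call_with_cache(query, llm_func, cache, queue, call_with_retry):
    """Illustrative miss flow: check cache, then slot -> retry -> cache."""
    cached = await cache.get(query)
    if cached is not None:
        return cached                                   # hit: skip the LLM call
    async with queue:                                   # 1. acquire queue slot
        response = await call_with_retry(llm_func, query)  # 2. call with retry
        await cache.set(query, response)                # 3. cache the response
    return response                                     # 4. slot released on exit
```

Because the slot is held through the cache write, concurrent identical queries are naturally serialized and the second one tends to find the response already cached.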
Requires a Starter-tier subscription.