# Semantic Cache
Verified by Dryade
## Description
Two-tier semantic caching with Redis (exact) and Qdrant (semantic)
## Details

### Semantic Cache Plugin

Semantic caching for LLM responses using vector similarity search.

### Overview
This plugin provides a two-tier caching system:
- Exact cache: Redis-backed hash-based lookup for identical queries
- Semantic cache: Qdrant vector search for semantically similar queries
### Architecture

```
Query --> SHA256 Hash --> Redis (exact match)
  |
  +--> FastEmbed --> Qdrant (semantic search)
                       |
                       +--> similarity >= threshold? --> Cache Hit
```
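The two-tier lookup in the diagram can be sketched with in-memory stand-ins for Redis and Qdrant. This is a minimal illustration, not the plugin's implementation: the real `SemanticCache` stores SHA-256 keys in Redis and FastEmbed vectors in Qdrant, but the control flow (exact hash first, then nearest vector above the threshold) is the same:

```python
import hashlib
import math

def exact_key(query: str) -> str:
    # Tier 1 key: SHA-256 of the normalized query text
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class TwoTierCacheSketch:
    """Illustrative two-tier lookup: exact hash first, then vector similarity."""

    def __init__(self, embed, threshold=0.90):
        self.embed = embed        # callable: str -> list[float]
        self.threshold = threshold
        self.exact = {}           # stands in for the Redis exact cache
        self.vectors = []         # stands in for Qdrant: (vector, response)

    def set(self, query, response):
        self.exact[exact_key(query)] = response
        self.vectors.append((self.embed(query), response))

    def get(self, query):
        # Tier 1: exact match on the query hash
        hit = self.exact.get(exact_key(query))
        if hit is not None:
            return hit
        # Tier 2: best vector at or above the similarity threshold
        qv = self.embed(query)
        best, best_sim = None, 0.0
        for vec, resp in self.vectors:
            sim = cosine(qv, vec)
            if sim >= self.threshold and sim > best_sim:
                best, best_sim = resp, sim
        return best
```

A query that hashes identically is served from tier 1 without computing an embedding; only misses pay the embedding and vector-search cost.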
### Components

| File | Purpose |
|------|---------|
| `cache.py` | Main `SemanticCache` class with two-tier lookup |
| `config.py` | Configuration with Pydantic models |
| `embedder.py` | FastEmbed wrapper for generating embeddings |
| `redis_store.py` | Redis backend for exact-match caching |
| `qdrant_store.py` | Qdrant backend for vector similarity |
| `wrapper.py` | LLM call wrappers with automatic caching |
### Configuration

```bash
# Enable/disable caching
DRYADE_SEMANTIC_CACHE_ENABLED=true

# Similarity threshold for semantic matches (0.85-0.95 recommended)
DRYADE_SEMANTIC_CACHE_THRESHOLD=0.90

# Service URLs
DRYADE_QDRANT_URL=http://localhost:6333
DRYADE_REDIS_URL=redis://localhost

# TTL settings (seconds)
DRYADE_SEMANTIC_CACHE_EXACT_TTL=3600      # 1 hour
DRYADE_SEMANTIC_CACHE_SEMANTIC_TTL=86400  # 24 hours
```
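The plugin's `config.py` loads these variables into Pydantic models; its actual field names are not shown here. As a rough sketch of the mapping from environment to typed settings (using a plain dataclass so it stands alone; all field names are illustrative):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheSettingsSketch:
    """Illustrative settings object; the plugin itself uses Pydantic models."""
    enabled: bool
    threshold: float
    qdrant_url: str
    redis_url: str
    exact_ttl: int
    semantic_ttl: int

    @classmethod
    def from_env(cls) -> "CacheSettingsSketch":
        env = os.environ.get
        return cls(
            enabled=env("DRYADE_SEMANTIC_CACHE_ENABLED", "true").lower() == "true",
            threshold=float(env("DRYADE_SEMANTIC_CACHE_THRESHOLD", "0.90")),
            qdrant_url=env("DRYADE_QDRANT_URL", "http://localhost:6333"),
            redis_url=env("DRYADE_REDIS_URL", "redis://localhost"),
            exact_ttl=int(env("DRYADE_SEMANTIC_CACHE_EXACT_TTL", "3600")),
            semantic_ttl=int(env("DRYADE_SEMANTIC_CACHE_SEMANTIC_TTL", "86400")),
        )
```

The defaults mirror the values shown above, so an empty environment yields a working local configuration.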
### Usage

#### Basic Usage

```python
from plugins.semantic_cache import get_semantic_cache

cache = get_semantic_cache()

# Store a response
await cache.set("What is MBSE?", "MBSE is Model-Based Systems Engineering...")

# Retrieve (exact or semantic match)
response = await cache.get("What is Model Based Systems Engineering?")
```
#### LLM Wrapper

```python
from plugins.semantic_cache.wrapper import cached_llm_call, cached_llm_stream

# Non-streaming
response = await cached_llm_call(
    query="Explain requirements traceability",
    llm_func=llm.acall,
    messages=[{"role": "user", "content": "Explain requirements traceability"}],
)

# Streaming
async for chunk in cached_llm_stream(
    query="Explain AI safety",
    llm_stream_func=llm.astream,
    messages=[...],
):
    if isinstance(chunk, CacheHitMarker):
        print(f"Cache hit: {chunk.content}")
    else:
        print(chunk, end="")
```
#### Class Decorator

```python
from plugins.semantic_cache.wrapper import cache_enabled_llm

@cache_enabled_llm
class MyLLM:
    async def acall(self, messages, **kwargs):
        ...

    async def astream(self, messages, **kwargs):
        ...
```
### Dependencies

- `fastembed`: Fast embedding generation (uses `sentence-transformers/all-MiniLM-L6-v2`)
- `qdrant-client`: Vector database client
- `redis`: Exact-match cache backend
### Fallback Behavior

If Qdrant or Redis is unavailable, the cache falls back to in-memory storage when `fallback_to_memory=True` (the default).
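The fallback pattern can be sketched as follows. This is an illustration of the idea rather than the plugin's code: a stand-in primary backend that raises on connection failure, with an in-process dict (plus TTL expiry) taking over:

```python
import time

class MemoryFallbackStoreSketch:
    """Illustrative fallback store: if the primary backend raises a
    connection error, serve from an in-process dict with TTL expiry
    (mirrors the behavior of fallback_to_memory=True)."""

    def __init__(self, primary=None):
        self.primary = primary    # e.g. a Redis client; None simulates an outage
        self.memory = {}          # key -> (value, expires_at)

    def set(self, key, value, ttl=3600):
        try:
            if self.primary is None:
                raise ConnectionError("primary unavailable")
            self.primary.set(key, value, ex=ttl)
        except ConnectionError:
            self.memory[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        try:
            if self.primary is None:
                raise ConnectionError("primary unavailable")
            return self.primary.get(key)
        except ConnectionError:
            entry = self.memory.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if time.monotonic() > expires_at:
                del self.memory[key]  # expired: drop and report a miss
                return None
            return value
```

Note the trade-off: the in-memory tier is per-process, so entries are not shared across workers and are lost on restart.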
### Performance

- Embedding generation: ~5 ms per query (384-dim vectors)
- Redis lookup: ~1 ms
- Qdrant search: ~5-10 ms
- Cache hit rate: typically 40-60% in production
### Integration with Self-Healing

The cache wrapper automatically integrates with the self-healing plugin for retry logic on cache misses:

```python
# Cache miss flow:
# 1. Acquire queue slot (concurrency control)
# 2. Execute LLM call with self-healing retry
# 3. Cache the response
# 4. Release queue slot
```
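The four steps above can be sketched as an async helper. Every name here is hypothetical: `queue` is any async context manager (e.g. an `asyncio.Semaphore`) standing in for the concurrency control, and `call_with_retry` stands in for the self-healing plugin's retry helper:

```python
import asyncio

async def call_with_cache(query, llm_func, cache, queue, call_with_retry):
    """Illustrative miss flow: check cache, then slot -> retry -> cache."""
    cached = await cache.get(query)
    if cached is not None:
        return cached                                   # hit: skip the LLM call
    async with queue:                                   # 1. acquire queue slot
        response = await call_with_retry(llm_func, query)  # 2. call with retry
        await cache.set(query, response)                # 3. cache the response
    return response                                     # 4. slot released on exit
```

Because the slot is held through the cache write, concurrent identical queries are naturally serialized and the second one tends to find the response already cached.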
Requires a Starter-tier subscription.