
Semantic Cache

Verified by Dryade


Description

Two-tier semantic caching with Redis (exact) and Qdrant (semantic)

Details

Semantic Cache Plugin

Semantic caching for LLM responses using vector similarity search.

Overview

This plugin provides a two-tier caching system:

  1. Exact cache: Redis-backed hash-based lookup for identical queries
  2. Semantic cache: Qdrant vector search for semantically similar queries

Architecture

Query --> SHA256 Hash --> Redis (exact match)
    |
    +--> FastEmbed --> Qdrant (semantic search)
                          |
                          +--> similarity >= threshold? --> Cache Hit
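The lookup flow above can be sketched in a few lines. This is a minimal stand-in, not the plugin's actual implementation: plain dicts and lists replace Redis and Qdrant, and the `embed` callable stands in for FastEmbed. Only the SHA-256 hashing and the similarity threshold come from the description above; everything else is illustrative.

```python
import hashlib

# Hypothetical in-memory stand-ins for the Redis and Qdrant backends.
exact_store: dict[str, str] = {}
semantic_store: list[tuple[list[float], str]] = []  # (embedding, response)

THRESHOLD = 0.90  # matches DRYADE_SEMANTIC_CACHE_THRESHOLD

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def lookup(query: str, embed):
    """Tier 1: exact SHA-256 match; tier 2: vector similarity search."""
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in exact_store:                 # exact hit: no embedding needed
        return exact_store[key]
    qvec = embed(query)                    # only embed on an exact miss
    best = max(semantic_store, key=lambda e: _cosine(e[0], qvec), default=None)
    if best and _cosine(best[0], qvec) >= THRESHOLD:
        return best[1]                     # semantic hit
    return None                            # full miss
```

Note the ordering: the cheap hash lookup runs first, so the embedding cost is only paid when the exact tier misses.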

Components

| File | Purpose |
|------|---------|
| cache.py | Main SemanticCache class with two-tier lookup |
| config.py | Configuration with Pydantic models |
| embedder.py | FastEmbed wrapper for generating embeddings |
| redis_store.py | Redis backend for exact-match caching |
| qdrant_store.py | Qdrant backend for vector similarity |
| wrapper.py | LLM call wrappers with automatic caching |

Configuration

```shell
# Enable/disable caching
DRYADE_SEMANTIC_CACHE_ENABLED=true

# Similarity threshold for semantic matches (0.85-0.95 recommended)
DRYADE_SEMANTIC_CACHE_THRESHOLD=0.90

# Service URLs
DRYADE_QDRANT_URL=http://localhost:6333
DRYADE_REDIS_URL=redis://localhost

# TTL settings (seconds)
DRYADE_SEMANTIC_CACHE_EXACT_TTL=3600       # 1 hour
DRYADE_SEMANTIC_CACHE_SEMANTIC_TTL=86400   # 24 hours
```
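The environment variables above could be loaded with a small settings helper. The README says config.py uses Pydantic models; the sketch below substitutes a plain dataclass to stay dependency-free, so the class name and field names here are assumptions, not the plugin's API.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheSettings:
    # Hypothetical settings object; defaults mirror the example values above.
    enabled: bool = True
    threshold: float = 0.90
    qdrant_url: str = "http://localhost:6333"
    redis_url: str = "redis://localhost"
    exact_ttl: int = 3600
    semantic_ttl: int = 86400

    @classmethod
    def from_env(cls) -> "CacheSettings":
        env = os.environ
        return cls(
            enabled=env.get("DRYADE_SEMANTIC_CACHE_ENABLED", "true").lower() == "true",
            threshold=float(env.get("DRYADE_SEMANTIC_CACHE_THRESHOLD", "0.90")),
            qdrant_url=env.get("DRYADE_QDRANT_URL", "http://localhost:6333"),
            redis_url=env.get("DRYADE_REDIS_URL", "redis://localhost"),
            exact_ttl=int(env.get("DRYADE_SEMANTIC_CACHE_EXACT_TTL", "3600")),
            semantic_ttl=int(env.get("DRYADE_SEMANTIC_CACHE_SEMANTIC_TTL", "86400")),
        )
```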

Usage

Basic Usage

```python
from plugins.semantic_cache import get_semantic_cache

cache = get_semantic_cache()

# Store a response
await cache.set("What is MBSE?", "MBSE is Model-Based Systems Engineering...")

# Retrieve (exact or semantic match)
response = await cache.get("What is Model Based Systems Engineering?")
```

LLM Wrapper

```python
from plugins.semantic_cache.wrapper import cached_llm_call, cached_llm_stream

# Non-streaming
response = await cached_llm_call(
    query="Explain requirements traceability",
    llm_func=llm.acall,
    messages=[{"role": "user", "content": "Explain requirements traceability"}],
)

# Streaming
async for chunk in cached_llm_stream(
    query="Explain AI safety",
    llm_stream_func=llm.astream,
    messages=[...],
):
    if isinstance(chunk, CacheHitMarker):
        print(f"Cache hit: {chunk.content}")
    else:
        print(chunk, end="")
```

Class Decorator

```python
from plugins.semantic_cache.wrapper import cache_enabled_llm

@cache_enabled_llm
class MyLLM:
    async def acall(self, messages, **kwargs):
        ...

    async def astream(self, messages, **kwargs):
        ...
```

Dependencies

  • fastembed: Fast embedding generation (uses sentence-transformers/all-MiniLM-L6-v2)
  • qdrant-client: Vector database client
  • redis: Exact-match cache backend

Fallback Behavior

If Qdrant or Redis is unavailable, the cache falls back to in-memory storage when fallback_to_memory=True (the default).
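An in-memory fallback of this kind can be sketched as a dict with per-entry expiry. This is an illustration of the described behavior, not the plugin's code; the class name MemoryFallbackStore and its methods are hypothetical.

```python
import time

class MemoryFallbackStore:
    """Hypothetical in-memory stand-in used when the real backend is down.

    A plain dict maps keys to (expiry_deadline, value) pairs, giving the
    same get/set-with-TTL contract a Redis-backed store would offer."""

    def __init__(self, ttl: int = 3600):
        self._ttl = ttl
        self._data: dict[str, tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = (time.monotonic() + self._ttl, value)

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires, value = entry
        if time.monotonic() >= expires:
            del self._data[key]   # lazily evict expired entries on read
            return None
        return value
```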

Performance

  • Embedding generation: ~5ms per query (384-dim vectors)
  • Redis lookup: ~1ms
  • Qdrant search: ~5-10ms
  • Cache hit rate: Typically 40-60% in production

Integration with Self-Healing

The cache wrapper automatically integrates with the self-healing plugin for retry logic on cache misses:

# Cache miss flow:
# 1. Acquire queue slot (concurrency control)
# 2. Execute LLM call with self-healing retry
# 3. Cache the response
# 4. Release queue slot
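The four-step miss flow above can be sketched with asyncio primitives. An `asyncio.Semaphore` stands in for the queue slot and a bare retry loop stands in for the self-healing plugin; the function name and retry count are assumptions for illustration.

```python
import asyncio

_slots = asyncio.Semaphore(4)       # hypothetical concurrency limit
_cache: dict[str, str] = {}         # stand-in for the two-tier cache

async def call_with_cache(query: str, llm_func, retries: int = 2) -> str:
    if query in _cache:             # cache hit: skip the queue entirely
        return _cache[query]
    async with _slots:              # 1. acquire queue slot
        last_exc = None
        for _ in range(retries + 1):
            try:                    # 2. LLM call with retry on failure
                response = await llm_func(query)
                break
            except Exception as exc:
                last_exc = exc
        else:
            raise last_exc          # all retries exhausted
        _cache[query] = response    # 3. cache the response
        return response             # 4. slot released by the context manager
```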

Requires starter tier subscription

Plugin Info

Version 1.0.0
Author Dryade
Tier starter
Category ai-models
Type backend
Downloads 0
Updated Mar 15, 2026

Tags

starter, semantic, cache