
Speeding Up RAG Systems with Smart Caching
Using query similarity to dramatically reduce retrieval time
Proximity makes Retrieval-Augmented Generation (RAG) faster with an approximate cache: when a new query is similar enough to an earlier one, the documents already retrieved for the earlier query are reused instead of running retrieval again.
- Reduces RAG inference time by up to 9.5x by leveraging approximate caching
- Maintains answer quality while improving throughput by 2.7-3.4x
- Requires no changes to existing RAG pipelines
- Particularly effective for domain-specific applications (e.g., medical queries)
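To make the caching idea concrete, here is a minimal sketch of an approximate cache placed in front of a retriever. It is an illustration under stated assumptions, not Proximity's actual implementation: the class name `ApproximateCache`, the `tolerance` and `capacity` parameters, the Euclidean distance, the linear scan over cached keys, and the first-in-first-out eviction are all choices made for this example.

```python
import math
from collections import OrderedDict
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ApproximateCache:
    """Hypothetical approximate cache keyed by query embeddings.

    A lookup is a hit when the nearest cached embedding lies within
    `tolerance` of the query; otherwise the wrapped retriever is called
    and its result is stored. All names here are assumptions.
    """

    def __init__(self, retrieve: Callable[[Vector], List[str]],
                 tolerance: float, capacity: int = 100) -> None:
        self.retrieve = retrieve      # the slow retrieval step being wrapped
        self.tolerance = tolerance    # maximum distance that counts as a hit
        self.capacity = capacity      # maximum number of cached queries
        self.entries = OrderedDict()  # embedding -> documents, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, query: Sequence[float]) -> List[str]:
        key = tuple(query)
        # Linear scan for the closest cached query (fine for a small cache).
        best_key, best_dist = None, float("inf")
        for cached in self.entries:
            dist = euclidean(key, cached)
            if dist < best_dist:
                best_key, best_dist = cached, dist
        if best_key is not None and best_dist <= self.tolerance:
            self.hits += 1
            return self.entries[best_key]  # reuse documents, skip retrieval
        # Miss: run the real retrieval and remember the result.
        self.misses += 1
        docs = self.retrieve(key)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[key] = docs
        return docs
```

Because the cache only wraps the retrieval call, the rest of a pipeline can stay as it is. The `tolerance` value controls the trade-off: a larger value yields more hits and faster responses, but the reused documents may be less relevant to the new query.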
Why it matters for medicine: Healthcare applications need fast, accurate responses from LLMs. Proximity lets medical RAG systems answer more quickly without losing accuracy, which is essential for clinical decision support and medical information retrieval.
Leveraging Approximate Caching for Faster Retrieval-Augmented Generation