A hybrid retrieval system serves over 127,000 queries on a single AWS Lightsail instance by combining the keyword precision of BM25 with the semantic matching of FAISS. Embeddings run without a GPU; a GPU is optional for reranking, where it yields roughly a 3x speedup. The setup runs on a t3.medium instance for approximately $50 per month and reaches 91% accuracy, significantly outperforming dense-only retrieval. Complex queries are handled by a four-stage cascade, with latency and accuracy balanced through asynchronous parallel retrieval and batch reranking. The result is a cost-effective, high-performance retrieval stack for applications that need both exact matching and semantic understanding.
Information retrieval constantly trades accuracy against resource efficiency, and the hybrid system discussed here shows how much headroom remains when traditional and modern techniques are combined. Integrating BM25, a classic keyword-based retrieval method, with FAISS, a dense vector search library, yields a 48% improvement in accuracy over dense-only models. The hybrid approach is particularly effective where precise keyword matching is crucial, such as identifying specific entities like license plates, which dense embeddings alone tend to miss. Serving over 127,000 queries on a modest AWS Lightsail instance with no GPU required underlines the system's efficiency and cost-effectiveness, making it accessible for a wide range of applications.
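The article itself does not publish code, but a minimal sketch of this kind of hybrid retrieval, assuming the rank_bm25 and sentence-transformers libraries, the all-MiniLM-L6-v2 embedding model, and a reciprocal-rank-fusion merge (none of which are confirmed by the original post), might look like this:

```python
# Illustrative sketch only: the embedding model, whitespace tokenisation, and
# reciprocal rank fusion are assumptions, not details taken from the article.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Vehicle with license plate ABC-1234 reported at the north gate.",
    "A sedan was seen leaving the parking garage around noon.",
    "Plate ABC-1234 matched a car registered to a delivery company.",
]

# Sparse index: plain whitespace tokens preserve exact strings like "abc-1234".
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Dense index: small CPU-friendly sentence embeddings in a flat inner-product index.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode(docs, normalize_embeddings=True).astype("float32")
dense_index = faiss.IndexFlatIP(emb.shape[1])
dense_index.add(emb)

def hybrid_search(query: str, top_k: int = 3, rrf_k: int = 60):
    """Merge BM25 and FAISS rankings with reciprocal rank fusion."""
    sparse_rank = np.argsort(bm25.get_scores(query.lower().split()))[::-1]
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    _, dense_rank = dense_index.search(q, len(docs))

    fused = {}
    for ranking in (sparse_rank, dense_rank[0]):
        for rank, doc_id in enumerate(ranking):
            fused[int(doc_id)] = fused.get(int(doc_id), 0.0) + 1.0 / (rrf_k + rank + 1)

    best = sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return [(docs[i], round(score, 4)) for i, score in best]

print(hybrid_search("license plate ABC-1234"))
```

Because BM25 ranks on exact tokens, a query containing "ABC-1234" still surfaces the documents with that literal string even when the dense index only finds semantically similar passages, which is the failure mode of dense-only retrieval the article highlights.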
The architecture is notable for its simplicity and power. On a t3.medium instance with 4 GB of RAM and 2 vCPUs, it maintains a retrieval time of just 75 milliseconds per query through a four-stage cascade that pairs BM25's keyword matching with FAISS's dense-vector search. A reranker, the ms-marco-MiniLM-L-6-v2 cross-encoder, adds a further 12% accuracy, though it is also the primary latency bottleneck. Even so, throughput holds at an impressive 50 queries per minute, demonstrating the system's potential for high-demand environments.
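As a rough illustration, the reranking stage can be expressed in a few lines with the sentence-transformers CrossEncoder wrapper; the model name matches the one cited in the article, but the function shape, candidate list, and top-k cutoff below are placeholders rather than the author's settings:

```python
# Minimal reranking sketch: the model name comes from the article, but the
# function signature and cutoff are illustrative assumptions.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # runs fine on CPU

def rerank(query: str, candidates: list[str], top_k: int = 5):
    """Score each (query, candidate) pair jointly and keep the best top_k."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

print(rerank("car with plate ABC-1234",
             ["Plate ABC-1234 matched a delivery vehicle.",
              "Weather was clear all afternoon."]))
```

Unlike the bi-encoder used for the FAISS index, the cross-encoder reads query and document together, which is where the extra accuracy comes from and also why it dominates per-query latency.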
Optimizations play a crucial role in sustaining this performance. Asynchronous parallel retrieval and batch reranking (batch size 32) maximize efficiency on CPU, and an optional GPU roughly triples reranker throughput when one is available. This flexibility lets the system scale with each deployment's constraints, whether the priority is speed, cost, or a balance of both.
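A hedged sketch of that pattern, assuming asyncio for concurrency and stand-in retrieval functions in place of the real BM25 and FAISS stages (which the post does not show), could look like this:

```python
# Sketch of the latency optimisations: the two retrieval functions are
# placeholders for the real BM25 and FAISS stages, and the GPU toggle mirrors
# the article's note that a GPU roughly triples reranker throughput.
import asyncio

import torch
from sentence_transformers import CrossEncoder

device = "cuda" if torch.cuda.is_available() else "cpu"
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=device)

def bm25_search(query: str) -> list[str]:
    return ["keyword hit one", "keyword hit two"]      # stand-in for the BM25 stage

def faiss_search(query: str) -> list[str]:
    return ["semantic hit one", "keyword hit two"]     # stand-in for the FAISS stage

async def retrieve(query: str) -> list[str]:
    # Run both retrievers concurrently in worker threads so neither blocks the other.
    sparse, dense = await asyncio.gather(
        asyncio.to_thread(bm25_search, query),
        asyncio.to_thread(faiss_search, query),
    )
    candidates = list(dict.fromkeys(sparse + dense))   # merge, drop duplicates, keep order

    # Rerank every candidate in batches of 32, the batch size cited in the article.
    scores = reranker.predict([(query, doc) for doc in candidates], batch_size=32)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked]

print(asyncio.run(retrieve("example query")))
```

Running the two retrievers concurrently means the slower of the two, not their sum, sets the pre-reranking latency, while batching keeps the cross-encoder's cost per candidate roughly constant on a 2-vCPU machine.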
The significance of this hybrid retrieval approach extends beyond its technical specifications. It addresses a fundamental limitation of dense-only models, which often struggle with exact entity recognition despite their semantic prowess. By combining the strengths of both keyword and semantic retrieval methods, this system provides a robust solution that enhances accuracy without sacrificing efficiency. For developers and organizations seeking to improve their search capabilities, this approach offers a practical and scalable path forward, demonstrating that innovation in retrieval systems can lead to substantial improvements in performance and user satisfaction.
Read the original article here


Comments
2 responses to “Hybrid Retrieval: BM25 + FAISS on t3.medium”
The hybrid retrieval system you’ve developed sounds impressive, especially the integration of BM25 with FAISS to balance precision and semantic understanding. I’m curious about the decision to use a four-stage cascade; could you elaborate on how each stage contributes to optimizing both latency and accuracy in the retrieval process?
The four-stage cascade in the hybrid retrieval system is designed to optimize both latency and accuracy by sequentially refining query results. Initially, BM25 retrieves a broad set of relevant documents quickly. Then, FAISS narrows down these results using semantic embeddings for deeper context understanding. Subsequent stages involve further reranking and filtering to ensure the most accurate and relevant results are returned efficiently. For more detailed insights, you might want to check the original article linked in the post.