RAG Part 1: Hybrid Search
Why neither keyword search nor vector search alone is enough — and how combining them produces better results than either.
Most RAG tutorials treat retrieval as a solved problem. Pick an embedding model, run a vector search, call it a day. In practice, that's rarely enough. Users don't query a knowledge base the way they write SQL — they describe what they want, use synonyms, ask vague questions. A retriever built on vector search alone will miss exact matches. One built on keyword search alone will miss meaning. Production systems use both, filtered and fused together.
This post covers the three techniques that make up a modern retrieval pipeline: metadata filtering, BM25 keyword search, and semantic search — combined using Reciprocal Rank Fusion. For each one I'll cover the intuition, the math, and where it breaks down.
Part 1: Metadata Filtering
Metadata filtering is the simplest of the three techniques, and the only one that enforces hard constraints. It narrows down the candidate set using document attributes — author, date, department, access level, region — before any search happens. Think of it as a SQL WHERE clause. A paid subscriber sees different documents than a free user. An internal query returns only documents scoped to that team. Neither BM25 nor semantic search can do this — they rank by relevance, not by rules.
Metadata filtering doesn't rank. It gates. It's always used alongside the other techniques, never alone.
# Weaviate metadata filter example
from weaviate.classes.query import Filter
results = collection.query.fetch_objects(
    filters=(
        Filter.by_property("department").equal("engineering")
        & Filter.by_property("access").equal("internal")
    ),
    limit=50,
)
Part 2: BM25
Keyword search treats both the query and each document as a bag of words — order ignored, only counts matter. These counts form a sparse vector, one slot per vocabulary word. All document vectors together make up an inverted index: given a word, instantly find every document containing it.
The classic baseline is TF-IDF, which rewards documents that frequently contain rare keywords. BM25 (Best Match 25, the 25th variant in a series of scoring functions, which is an admirably honest name) is the standard refinement. It fixes two weaknesses of TF-IDF: it saturates term frequency, so a word appearing 100 times doesn't score 10 times higher than one appearing 10 times, and it normalizes for document length, so long documents aren't rewarded just for containing more words. It also exposes two tunable hyperparameters for fitting the scoring to a specific corpus. BM25 has been the default keyword search algorithm for decades and remains a competitive baseline worth beating before reaching for anything more complex.
The Formula

\[
\text{score}(d, q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}
\]

Breaking down each term:
- f(t,d) — how many times term t appears in document d
- |d| / avgdl — document length relative to average; penalizes long documents
- k₁ — controls term frequency saturation (typically 1.2–2.0)
- b — controls length normalization strength (typically 0.75)
- IDF(t) — inverse document frequency, defined below

\[
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
\]

Where N is the total number of documents and n(t) is the number containing term t. Common words like "the" appear everywhere — their IDF approaches 0.
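To make the saturation behaviour concrete, here is a small back-of-the-envelope sketch that evaluates the BM25 score for a single term directly. The helper function and all the numbers are mine, chosen only for illustration:

```python
import math

def bm25_term_score(tf, doc_len, avgdl, N, n_t, k1=1.5, b=0.75):
    """BM25 score of one term in one document (single-term query)."""
    idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# Term frequency saturates: 100 occurrences is nowhere near 10x
# better than 10 occurrences.
s10 = bm25_term_score(tf=10, doc_len=300, avgdl=300, N=10_000, n_t=50)
s100 = bm25_term_score(tf=100, doc_len=300, avgdl=300, N=10_000, n_t=50)
print(s100 / s10)  # ~1.13, far from 10x
```

As tf grows, the score asymptotically approaches IDF(t) · (k₁ + 1), which is exactly the capping behaviour the formula is designed for.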
In Code
from rank_bm25 import BM25Okapi
# Tokenize corpus
corpus = [doc.split() for doc in documents]
bm25 = BM25Okapi(corpus)
# Query
query = "Abbas Kiarostami slow cinema"
scores = bm25.get_scores(query.split())
# Get top-k indices
top_indices = scores.argsort()[::-1][:10]
| BM25 wins when | BM25 fails when |
|---|---|
| Exact name lookup ("Jafar Panahi") | Synonyms ("film" vs "movie") |
| Specific technical terms | Paraphrase and vibe queries |
| User knows precise terminology | No shared vocabulary with docs |
| Short, well-defined queries | Conceptual or mood-based search |
Part 3: Semantic Search
BM25's fundamental limitation is that it only matches exact words. Search for "slow contemplative cinema" and it finds nothing unless those exact words appear in a document. Semantic search fixes this by operating on meaning instead of vocabulary.
The Intuition
An embedding model — a transformer encoder like BAAI/bge-base-en-v1.5 — converts a piece of text into a dense vector: a point in 768-dimensional space. The model is trained on millions of text pairs so that semantically similar texts end up geometrically close. "A man contemplates mortality while driving through Tehran" and "a film about existential crisis in Iran" map to nearby points even though they share almost no words. One hard constraint worth noting: vectors from different embedding models are not interchangeable. Each model has its own vector space, so queries and documents must always be embedded with the same model.
Cosine Similarity

\[
\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
\]

Where \(\mathbf{q}\) and \(\mathbf{d}\) are the embedding vectors for query and document. The output lies in [-1, 1], where 1 means identical direction in vector space.
In practice, embeddings are often L2-normalized so cosine similarity reduces to a dot product: \(\mathbf{q} \cdot \mathbf{d}\). This is what makes approximate nearest neighbor search with HNSW efficient at scale.
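A quick sketch of that equivalence, using random vectors as stand-ins for real embeddings (the vectors here are synthetic, not model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=768)  # stand-in for a query embedding
d = rng.normal(size=768)  # stand-in for a document embedding

# Cosine similarity computed directly from the formula
cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# After L2-normalization, cosine similarity is just a dot product
q_hat = q / np.linalg.norm(q)
d_hat = d / np.linalg.norm(d)
dot = q_hat @ d_hat

print(np.isclose(cosine, dot))  # True
```

This is why vector databases typically store normalized embeddings: one dot product per candidate instead of a division and two norms.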
In Code
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Embed corpus once and store
doc_embeddings = model.encode(documents, show_progress_bar=True)
# Embed query and retrieve
query = "slow contemplative films about mortality"
query_embedding = model.encode([query])
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = np.argsort(similarities)[::-1][:10]
| Semantic search wins when | Semantic search fails when |
|---|---|
| Vibe / mood queries | Exact name lookup |
| Synonyms and paraphrase | Rare proper nouns |
| Cross-lingual retrieval | Short, sparse descriptions |
| Conceptual similarity | Out-of-domain vocabulary |
Part 4: Reciprocal Rank Fusion
BM25 and semantic search each return a ranked list of documents. The problem is these lists are scored on completely different scales — BM25 produces unbounded term frequency scores, cosine similarity is bounded in [-1, 1]. You can't just add them together. Normalizing doesn't really work either because the score distributions are different shapes. RRF sidesteps this entirely by ignoring scores and only using rank positions.
The Intuition
A document that ranks highly on both lists is probably genuinely relevant. A document that only appears on one list, or ranks low on both, probably isn't. RRF scores each document based on where it appears in each list — specifically, the reciprocal of its rank — and sums those scores across all lists. The result is a single merged ranking that reflects consensus across retrieval methods.
The Formula

\[
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}
\]

Where R is the set of ranked lists, rank_r(d) is document d's position in list r, and k is a constant (typically 60) that dampens the influence of top-ranked documents.
The constant k=60 was found empirically in the original RRF paper to work well across a range of tasks. It prevents a document ranked #1 in one list from completely dominating the fusion result. When k=0, the top-ranked document in any list dominates — rank 1 scores 1.0, rank 10 scores 0.1, a 10× gap. Setting k=60 compresses this: rank 1 scores 1/61, rank 10 scores 1/70, a much more modest difference.
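The compression is easy to verify with a couple of lines (the helper name is mine; ranks are 1-based as in the formula):

```python
def rrf_weight(rank, k):
    """RRF contribution of a document at a given 1-based rank."""
    return 1 / (k + rank)

# With k=0 the top rank dominates; with k=60 the gap shrinks.
print(rrf_weight(1, 0) / rrf_weight(10, 0))    # 10.0
print(rrf_weight(1, 60) / rrf_weight(10, 60))  # ~1.15 (70/61)
```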
In Code
def reciprocal_rank_fusion(bm25_results, semantic_results, k=60):
    """
    bm25_results: list of doc ids ordered by BM25 rank
    semantic_results: list of doc ids ordered by semantic rank
    returns: list of (doc_id, score) sorted by RRF score
    """
    scores = {}
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(semantic_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def hybrid_search(query, documents, bm25, model, doc_embeddings, top_k=10):
    # BM25 retrieval
    bm25_scores = bm25.get_scores(query.split())
    bm25_ranked = bm25_scores.argsort()[::-1][:100]
    # Semantic retrieval
    query_emb = model.encode([query])
    sem_scores = cosine_similarity(query_emb, doc_embeddings)[0]
    sem_ranked = sem_scores.argsort()[::-1][:100]
    # Fuse with RRF
    fused = reciprocal_rank_fusion(bm25_ranked, sem_ranked)
    return [idx for idx, _ in fused[:top_k]]
In Weaviate
Most vector databases implement hybrid search natively. Weaviate wraps BM25, semantic search, and RRF into a single call with an alpha parameter to control the balance:
results = collection.query.hybrid(
    query="slow contemplative films about mortality",
    alpha=0.5,  # 0 = pure BM25, 1 = pure semantic
    limit=10,
)
Putting It Together
Each technique covers a different failure mode. BM25 handles exact matches — names, product codes, technical terms — that semantic search would miss or dilute. Semantic search handles meaning, synonyms, and conceptual queries that BM25 can't touch. Metadata filtering enforces hard constraints that neither search method can express. RRF merges the ranked outputs without requiring you to normalize incompatible score distributions.
Hybrid search with RRF is the right default not because it's always optimal, but because it fails gracefully. Keyword-heavy query? BM25 carries it. Conceptual query? Semantic search carries it. Both agree? High confidence. The alpha parameter in Weaviate (0 = pure BM25, 1 = pure semantic) lets you tune this balance once you've measured what your query distribution actually looks like.
Evaluating Retrieval Quality
Tuning a retriever without measurement is guesswork. The standard metrics all require the same ingredients: a set of test queries, the ranked list your retriever returns for each, and a ground truth list of every relevant document in the knowledge base. Building that ground truth is tedious, but without it you have no way to know whether your changes are improvements.
Precision and Recall
Suppose your knowledge base has 10 relevant documents for a query. Your retriever returns 12, of which 8 are relevant: precision = 8/12 = 67%, recall = 8/10 = 80%. Loosen the retriever to return 15, finding 9 relevant: precision drops to 60%, recall rises to 90%. There's almost always a tradeoff.
Precision measures how much you can trust the results. Recall measures how complete they are.
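The worked example above translates directly into code. The doc ids below are hypothetical, chosen only to reproduce the 8-of-12 scenario:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@K and Recall@K for a single query."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / len(top_k), hits / len(relevant)

# 10 relevant docs; retriever returns 12, of which 8 are relevant.
relevant = list(range(10))                     # hypothetical ground-truth ids
retrieved = list(range(8)) + [90, 91, 92, 93]  # 8 hits + 4 misses
p, r = precision_recall_at_k(retrieved, relevant, k=12)
print(round(p, 2), round(r, 2))  # 0.67 0.8
```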
Mean Reciprocal Rank (MRR)

\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}
\]

Where |Q| is the number of test queries and rank_i is the position of the first relevant document for query i. MRR measures how quickly the retriever surfaces at least one relevant result.
MRR is useful when what matters most is whether the retriever surfaces at least one relevant result near the top. If the first relevant document appears at ranks 1, 3, 6, and 2 across four queries: MRR = (1 + 1/3 + 1/6 + 1/2) / 4 = 0.50.
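The same four-query calculation as a minimal sketch (the function name is mine):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR from the 1-based rank of the first relevant doc per query."""
    return sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# First relevant document at ranks 1, 3, 6, and 2 across four queries.
print(round(mean_reciprocal_rank([1, 3, 6, 2]), 2))  # 0.5
```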
Which to Use
| Metric | Best for |
|---|---|
| Recall@K | Most fundamental — did we find the relevant docs? |
| Precision@K | Are we returning too many irrelevant docs? |
| MRR | Is at least one relevant doc near the top? |
Recall is the most fundamental — it captures whether the retriever is doing its job at all. Precision tells you how much noise it's adding. MRR tells you whether relevant results are appearing near the top. Use all three together when tuning, and build the ground truth dataset early. It's the only way to know if a change to your retriever is actually an improvement.
References
Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. SIGIR 2009.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. EMNLP 2019.