Sahar Banisafar

RAG Part 1: Hybrid Search

Why neither keyword search nor vector search alone is enough — and how combining them produces better results than either.


Most RAG tutorials treat retrieval as a solved problem. Pick an embedding model, run a vector search, call it a day. In practice, that's rarely enough. Users don't query a knowledge base the way they write SQL — they describe what they want, use synonyms, ask vague questions. A retriever built on vector search alone will miss exact matches. One built on keyword search alone will miss meaning. Production systems use both, filtered and fused together.

This post covers the three techniques that make up a modern retrieval pipeline: metadata filtering, BM25 keyword search, and semantic search — combined using Reciprocal Rank Fusion. For each one I'll cover the intuition, the math, and where it breaks down.


Part 1: Metadata Filtering

Metadata filtering is the simplest of the three techniques, and the only one that enforces hard constraints. It narrows down the candidate set using document attributes — author, date, department, access level, region — before any search happens. Think of it as a SQL WHERE clause. A paid subscriber sees different documents than a free user. An internal query returns only documents scoped to that team. Neither BM25 nor semantic search can do this — they rank by relevance, not by rules.

Metadata filtering doesn't rank. It gates. It's always used alongside the other techniques, never alone.

# Weaviate metadata filter example
from weaviate.classes.query import Filter

results = collection.query.fetch_objects(
    filters=Filter.by_property("department").equal("engineering")
    & Filter.by_property("access").equal("internal"),
    limit=50
)

Part 2: BM25

Keyword search treats both the query and each document as a bag of words: order is ignored, only counts matter. These counts form a sparse vector, one slot per vocabulary word. Inverting that mapping, from words to the documents containing them, gives an inverted index: given a word, instantly find every document that contains it.
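As a minimal sketch (the toy corpus and the naive whitespace tokenization are my own illustrations, not from the post), an inverted index is just a map from term to the set of documents containing it:

```python
from collections import defaultdict

documents = ["the wind will carry us", "taste of cherry", "the white balloon"]

# Build the inverted index: term -> set of document ids
inverted_index = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for term in doc.split():  # naive whitespace tokenization
        inverted_index[term].add(doc_id)

# Given a word, instantly find every document containing it
print(inverted_index["the"])  # {0, 2}
```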

The classic baseline is TF-IDF, which rewards documents that frequently contain rare keywords. BM25 (Best Match 25, the 25th variant in a series of scoring functions, which is a very honest name) is the standard refinement. It fixes two problems with TF-IDF: it saturates term frequency, so a word appearing 100 times doesn't score 100× better than one appearing 10 times, and it softens the document length penalty so long documents are discounted sublinearly rather than linearly. It also exposes two tunable hyperparameters, k1 and b, to fit the scoring to a specific corpus. BM25 has been the default keyword search algorithm for decades and remains a competitive baseline worth beating before reaching for something more complex.

The Formula

BM25 Scoring Function
\[ \text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)} \]

Breaking down each term:

IDF Component
\[ \text{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5} \]

Where N is total documents and n(t) is the number containing term t. Common words like "the" appear everywhere — their IDF approaches 0.
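To make the formula concrete, here is a from-scratch sketch of scoring one document. The toy corpus and the defaults k1 = 1.5, b = 0.75 are assumptions for illustration, not values from the post:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query using the BM25 formula.
    k1=1.5 and b=0.75 are common defaults, not values from the post."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)      # docs containing term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5))  # IDF component above
        f = doc.count(term)                            # term frequency f(t, d)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [d.split() for d in [
    "taste of cherry",
    "the wind will carry us",
    "certified copy",
]]
print(bm25_score(["wind"], corpus[1], corpus))  # positive: "wind" is rare and present
```

Note that a term absent from the document contributes exactly zero, and a term present in most documents gets an IDF near (or even below) zero, which is why production implementations often clamp it.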

In Code

from rank_bm25 import BM25Okapi

# Tokenize corpus
corpus = [doc.split() for doc in documents]
bm25 = BM25Okapi(corpus)

# Query
query = "Abbas Kiarostami slow cinema"
scores = bm25.get_scores(query.split())

# Get top-k indices
top_indices = scores.argsort()[::-1][:10]
BM25 wins when                      | BM25 fails when
Exact name lookup ("Jafar Panahi")  | Synonyms ("film" vs "movie")
Specific technical terms            | Paraphrase and vibe queries
User knows precise terminology      | No shared vocabulary with docs
Short, well-defined queries         | Conceptual or mood-based search

BM25 has no concept of meaning — only character matching. Search for "slow contemplative cinema" and it returns nothing unless those exact words appear in a document.

Part 3: Semantic Search

BM25's fundamental limitation is that it matches only exact words: as the previous example showed, "slow contemplative cinema" returns nothing unless those exact words appear in a document. Semantic search fixes this by operating on meaning instead of vocabulary.

The Intuition

An embedding model — a transformer encoder like BAAI/bge-base-en-v1.5 — converts a piece of text into a dense vector: a point in 768-dimensional space. The model is trained on millions of text pairs so that semantically similar texts end up geometrically close. "A man contemplates mortality while driving through Tehran" and "a film about existential crisis in Iran" map to nearby points even though they share almost no words. One hard constraint worth noting: vectors from different embedding models are not interchangeable. Each model has its own vector space, so queries and documents must always be embedded with the same model.

Cosine Similarity

Cosine Similarity
\[ \text{sim}(q, d) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \cdot \|\mathbf{d}\|} \]

Where q and d are the embedding vectors for query and document. Output is in [-1, 1], where 1 means identical direction in vector space.

In practice, embeddings are often L2-normalized so that cosine similarity reduces to a plain dot product, \(\mathbf{q} \cdot \mathbf{d}\), which is the cheap comparison that approximate nearest neighbor indexes like HNSW evaluate millions of times per query at scale.
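A quick numerical check of that reduction, with random vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=768)  # stand-ins for real 768-d embeddings
d = rng.normal(size=768)

# Cosine similarity computed directly from the formula above
cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# L2-normalize once at index time; a plain dot product then gives the same number
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)
print(np.isclose(cos, qn @ dn))  # True
```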

In Code

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Embed corpus once and store
doc_embeddings = model.encode(documents, show_progress_bar=True)

# Embed query and retrieve
query = "slow contemplative films about mortality"
query_embedding = model.encode([query])

similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = np.argsort(similarities)[::-1][:10]
Semantic search wins when | Semantic search fails when
Vibe / mood queries       | Exact name lookup
Synonyms and paraphrase   | Rare proper nouns
Cross-lingual retrieval   | Short, sparse descriptions
Conceptual similarity     | Out-of-domain vocabulary

Semantic search has no concept of exact matching. Search for "Jafar Panahi" and you might get films thematically similar to his work — but miss films that simply mention his name in passing.

Part 4: Reciprocal Rank Fusion

BM25 and semantic search each return a ranked list of documents. The problem is these lists are scored on completely different scales — BM25 produces unbounded term frequency scores, cosine similarity is bounded in [-1, 1]. You can't just add them together. Normalizing doesn't really work either because the score distributions are different shapes. RRF sidesteps this entirely by ignoring scores and only using rank positions.

The Intuition

A document that ranks highly on both lists is probably genuinely relevant. A document that only appears on one list, or ranks low on both, probably isn't. RRF scores each document based on where it appears in each list — specifically, the reciprocal of its rank — and sums those scores across all lists. The result is a single merged ranking that reflects consensus across retrieval methods.

The Formula

Reciprocal Rank Fusion
\[ \text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)} \]

Where R is the set of ranked lists, rank_r(d) is document d's position in list r, and k is a constant (typically 60) that dampens the influence of top-ranked documents.

The constant k=60 was found empirically in the original RRF paper to work well across a range of tasks. It prevents a document ranked #1 in one list from completely dominating the fusion result. When k=0, the top-ranked document in any list dominates — rank 1 scores 1.0, rank 10 scores 0.1, a 10× gap. Setting k=60 compresses this: rank 1 scores 1/61, rank 10 scores 1/70, a much more modest difference.
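The same arithmetic in code, using a small helper for the per-list contribution:

```python
def rrf_weight(rank, k):
    # Contribution of a document at a given 1-based rank in one ranked list
    return 1 / (k + rank)

# With k=0, rank 1 outweighs rank 10 by 10x; k=60 compresses the gap to ~1.15x
print(rrf_weight(1, 0) / rrf_weight(10, 0))               # 10.0
print(round(rrf_weight(1, 60) / rrf_weight(10, 60), 2))   # 1.15
```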

In Code

def reciprocal_rank_fusion(bm25_results, semantic_results, k=60):
    """
    bm25_results: list of doc ids ordered by BM25 rank
    semantic_results: list of doc ids ordered by semantic rank
    returns: list of (doc_id, score) sorted by RRF score
    """
    scores = {}

    for rank, doc_id in enumerate(bm25_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    for rank, doc_id in enumerate(semantic_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def hybrid_search(query, documents, bm25, model, doc_embeddings, top_k=10):
    # BM25 retrieval
    bm25_scores = bm25.get_scores(query.split())
    bm25_ranked = bm25_scores.argsort()[::-1][:100]

    # Semantic retrieval
    query_emb = model.encode([query])
    sem_scores = cosine_similarity(query_emb, doc_embeddings)[0]
    sem_ranked = sem_scores.argsort()[::-1][:100]

    # Fuse with RRF
    fused = reciprocal_rank_fusion(bm25_ranked, sem_ranked)
    return [idx for idx, _ in fused[:top_k]]

In Weaviate

Most vector databases implement hybrid search natively. Weaviate wraps BM25, semantic search, and RRF into a single call with an alpha parameter to control the balance:

results = collection.query.hybrid(
    query="slow contemplative films about mortality",
    alpha=0.5,  # 0 = pure BM25, 1 = pure semantic
    limit=10
)
RRF rewards consistency across retrieval methods. A document that both BM25 and semantic search agree on is more likely to be genuinely relevant than one that only one method surfaces.

Putting It Together

Each technique covers a different failure mode. BM25 handles exact matches — names, product codes, technical terms — that semantic search would miss or dilute. Semantic search handles meaning, synonyms, and conceptual queries that BM25 can't touch. Metadata filtering enforces hard constraints that neither search method can express. RRF merges the ranked outputs without requiring you to normalize incompatible score distributions.

Hybrid search with RRF is the right default not because it's always optimal, but because it fails gracefully. Keyword-heavy query? BM25 carries it. Conceptual query? Semantic search carries it. Both agree? High confidence. The alpha parameter in Weaviate (0 = pure BM25, 1 = pure semantic) lets you tune this balance once you've measured what your query distribution actually looks like.


Evaluating Retrieval Quality

Tuning a retriever without measurement is guesswork. The standard metrics all require the same ingredients: a set of test queries, the ranked list your retriever returns for each, and a ground truth list of every relevant document in the knowledge base. Building that ground truth is tedious, but without it you have no way to know whether your changes are improvements.

Precision and Recall

Precision and Recall
\[ \text{Precision@K} = \frac{\text{relevant documents in top K}}{K} \] \[ \text{Recall@K} = \frac{\text{relevant documents in top K}}{\text{total relevant documents}} \]

Suppose your knowledge base has 10 relevant documents for a query. Your retriever returns 12, of which 8 are relevant: precision = 8/12 = 67%, recall = 8/10 = 80%. Loosen the retriever to return 15, finding 9 relevant: precision drops to 60%, recall rises to 90%. There's almost always a tradeoff.

Precision measures how much you can trust the results. Recall measures how complete they are.
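The worked example above translates directly to code (the doc ids are made up for illustration):

```python
def precision_recall_at_k(retrieved, relevant, k):
    """retrieved: ranked list of doc ids; relevant: set of all relevant ids."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k, hits / len(relevant)

relevant = set(range(10))                             # 10 relevant docs exist
retrieved = [0, 1, 2, 3, 4, 5, 6, 7, 90, 91, 92, 93]  # top 12, 8 relevant
precision, recall = precision_recall_at_k(retrieved, relevant, 12)
print(round(precision, 2), recall)  # 0.67 0.8
```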

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank
\[ \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} \]

Where rank_i is the position of the first relevant document for query i. MRR measures how quickly the retriever surfaces at least one relevant result.

MRR is useful when what matters most is whether the retriever surfaces at least one relevant result near the top. If the first relevant document appears at ranks 1, 3, 6, and 2 across four queries: MRR = (1 + 1/3 + 1/6 + 1/2) / 4 = 0.50.
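The four-query example is a one-liner to verify:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    # first_relevant_ranks: 1-based rank of the first relevant doc, one per query
    return sum(1 / rank for rank in first_relevant_ranks) / len(first_relevant_ranks)

print(round(mean_reciprocal_rank([1, 3, 6, 2]), 2))  # 0.5
```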

Which to Use

Metric      | Best for
Recall@K    | Most fundamental: did we find the relevant docs?
Precision@K | Are we returning too many irrelevant docs?
MRR         | Is at least one relevant doc near the top?

Recall is the most fundamental — it captures whether the retriever is doing its job at all. Precision tells you how much noise it's adding. MRR tells you whether relevant results are appearing near the top. Use all three together when tuning, and build the ground truth dataset early. It's the only way to know if a change to your retriever is actually an improvement.


References

Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389.

Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. SIGIR 2009.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. EMNLP 2019.

Sahar Banisafar is a data scientist with a mathematics background and 6 years of production ML experience. She writes about the intersection of theory and practice in modern machine learning.