Hybrid Search: Building the Pipeline from Scratch

The previous article explained what Azure AI Search does and how its retrieval strategies (full-text, vector, hybrid, semantic reranking) map to the NLP fundamentals covered earlier in the series. This article takes those ideas and turns them into working code.

We build a complete hybrid search engine in Python that mirrors the Azure AI Search pipeline step by step: language analysis, inverted indexing with BM25, vector search with cosine similarity, Reciprocal Rank Fusion, and cross-encoder semantic reranking. The code runs against a small set of knowledge base articles so you can see every intermediate result the pipeline produces.

Download the Source Code

Create a directory on your machine (for example hybrid-search) and download every Python source file listed below directly into that folder; they all live together at the top level of the project. After you install the prerequisites below, you can run the interactive demo from that directory with python __main__.py.

Python source files (save all of these into the hybrid-search/ folder): config.py, __main__.py, __init__.py, utils.py, text_analyzer.py, inverted_index.py, bm25_scorer.py, vector_index.py, semantic_reranker.py, and hybrid_engine.py.

Knowledge base articles (the documents the engine searches):

Inside your hybrid-search/ folder, create a subfolder named files_converted/ and save the markdown files below into it. The pipeline expects this exact folder name because config.py sets DOCS_DIR to files_converted, and the engine will not find any documents if the folder is missing or named differently.

hybrid-search/
├── config.py
├── __main__.py
├── ...(other .py files)
└── files_converted/
    ├── Uninstall-AppDX.md
    ├── Installing-ControlUp-for-Apps.md
    ├── Get-ControlUp-for-Apps-Logs.md
    └── AppDX-App-Web-Launch-Steps.md

Prerequisites

Install the required Python libraries:

pip install nltk sentence-transformers numpy

The first time you run the pipeline, NLTK will download its tokenizer and stop word data automatically.

Pipeline Overview

The search engine follows the same eight-phase pipeline that Azure AI Search uses internally:

| Phase | What It Does | Azure AI Search Equivalent |
| --- | --- | --- |
| 1. Document Ingestion | Load .md files from disk | Indexer pulling from a data source |
| 2. Chunking | Split documents into section-level passages | Document cracking and chunking skill |
| 3. Full-Text Indexing | Tokenize, stem, remove stop words, build inverted index | Language analyzer + inverted index |
| 4. BM25 Scoring | Rank chunks by keyword relevance | Full-text search (L1 keyword leg) |
| 5. Vector Indexing | Generate dense embeddings for each chunk | Vector field with embedding model |
| 6. Vector Search | Cosine similarity nearest neighbors | Vector search (L1 vector leg) |
| 7. Hybrid Merge (RRF) | Reciprocal Rank Fusion combines both legs | Hybrid query with RRF |
| 8. Semantic Reranking | Cross-encoder reranks top results | Semantic ranker (L2) |

Configuration

All tunable parameters live in a single configuration file. The BM25 constants (k1 and b) control term-frequency saturation and document-length normalization. RRF_K is the constant Azure uses in Reciprocal Rank Fusion (default 60). The embedding model and cross-encoder model are the same open-source models used in the earlier NLP articles.

import os

DOCS_DIR = os.path.join(os.path.dirname(__file__), "files_converted")

EMBEDDING_MODEL = "all-MiniLM-L6-v2"
CROSS_ENCODER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"

BM25_K1 = 1.5
BM25_B = 0.75
RRF_K = 60
TOP_K = 50
RERANK_TOP_N = 10

BANNER = "=" * 80
DIVIDER = "-" * 60

DOCS_DIR points to the files_converted/ folder containing the knowledge base markdown files. TOP_K controls how many results each retrieval leg returns before fusion, and RERANK_TOP_N limits how many results the semantic reranker evaluates.

Phase 1-2: Document Ingestion and Chunking

Before anything can be searched, documents need to be loaded and broken into smaller passages. Azure AI Search does this with indexers and chunking skills. Here we replicate it with two utility functions.

The first function reads every .md file from a directory, which is the equivalent of an Azure indexer pulling documents from a data source.

import os
import re
from collections import defaultdict


def load_documents(docs_dir):
    documents = []
    for filename in sorted(os.listdir(docs_dir)):
        if filename.endswith(".md"):
            filepath = os.path.join(docs_dir, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
            documents.append({
                "id": filename,
                "filename": filename,
                "content": content,
            })
    return documents

The second function splits each document into chunks by markdown headings. Azure AI Search does the same thing: large documents get broken into one-to-two-paragraph passages before indexing. Chunking is critical because a 10-page document might cover many topics, and you want search results to point to the specific section that answers the query, not the entire document.

def chunk_document(doc):
    content = doc["content"]
    sections = re.split(r"\n(?=#{1,3}\s)", content)

    chunks = []
    for i, section in enumerate(sections):
        text = section.strip()
        if not text or len(text) < 30:
            continue
        heading_match = re.match(r"^(#{1,3})\s+(.+)", text)
        heading = heading_match.group(2) if heading_match else "Intro / preamble"

        chunks.append({
            "chunk_id": f"{doc['id']}::chunk_{i}",
            "doc_id": doc["id"],
            "doc_filename": doc["filename"],
            "heading": heading,
            "text": text,
        })
    return chunks


def chunk_all_documents(documents):
    all_chunks = []
    for doc in documents:
        all_chunks.extend(chunk_document(doc))
    return all_chunks

The regex \n(?=#{1,3}\s) splits on lines that start with one to three # characters followed by a space, which is how markdown headings work. Each chunk keeps a reference to its source document and heading so the search results can tell you exactly where the answer came from.
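
To see how the splitter behaves, here is the same regex run on a tiny made-up markdown snippet (the headings and text are invented for illustration):

```python
import re

sample = (
    "Intro text.\n"
    "# Install\n"
    "Run the setup.\n"
    "## Verify\n"
    "Check the tray icon.\n"
    "Not a # heading here."
)

# Split on newlines immediately followed by a level 1-3 heading
sections = re.split(r"\n(?=#{1,3}\s)", sample)
for s in sections:
    print(repr(s.split("\n")[0]))
# 'Intro text.', '# Install', '## Verify'
```

Note that the # in the final sentence is not treated as a heading: the lookahead only fires when the # characters start a line.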

Phase 3: Language Analyzer

The language analyzer simulates Azure's built-in analyzer pipeline. It applies the same three operations covered in the Text Processing with NLTK article: tokenization, stop word removal, and stemming.

import string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer


class TextAnalyzer:

    def __init__(self):
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words("english"))
        self.punct = set(string.punctuation)

    def analyze(self, text):
        tokens = word_tokenize(text.lower())
        analyzed = []
        for token in tokens:
            token = token.strip(string.punctuation)
            if token and token not in self.stop_words and not all(c in self.punct for c in token):
                analyzed.append(self.stemmer.stem(token))
        return analyzed

    def analyze_verbose(self, text):
        raw_tokens = word_tokenize(text.lower())
        after_stopwords = [t for t in raw_tokens if t.strip(string.punctuation) not in self.stop_words]
        after_punct = [t.strip(string.punctuation) for t in after_stopwords]
        after_punct = [t for t in after_punct if t and not all(c in self.punct for c in t)]
        stemmed = [self.stemmer.stem(t) for t in after_punct]
        return {
            "original_tokens": raw_tokens,
            "after_stop_and_punct": after_punct,
            "after_stemming": stemmed,
        }

The analyze method is the production path: it takes raw text and returns a list of stemmed terms with stop words and punctuation removed. The analyze_verbose method does the same thing but returns intermediate results at each step so the demo can print what the analyzer did to the query.

For example, the query "How do I enable debug logging for the ControlUp agent?" would be processed as:

  1. Tokenization: ['how', 'do', 'i', 'enable', 'debug', 'logging', 'for', 'the', 'controlup', 'agent', '?']
  2. Stop word and punctuation removal: ['enable', 'debug', 'logging', 'controlup', 'agent']
  3. Stemming: ['enabl', 'debug', 'log', 'controlup', 'agent']

These stemmed terms are what get matched against the inverted index.
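
For readers who want to poke at the three stages without NLTK installed, the shape of the pipeline can be sketched with the standard library alone. The stop-word list and suffix rule below are deliberately crude stand-ins, not the Porter algorithm, so the stems differ slightly from the real analyzer (this sketch keeps enable where Porter produces enabl):

```python
import re

STOP_WORDS = {"how", "do", "i", "for", "the"}  # toy subset for illustration


def toy_analyze(text):
    # 1. Tokenize: lowercase and keep alphanumeric runs (also drops punctuation)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 2. Remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. "Stem" with a crude suffix rule (NOT the Porter stemmer)
    return [re.sub(r"(ging|ing|ed|s)$", "", t) for t in tokens]


print(toy_analyze("How do I enable debug logging for the ControlUp agent?"))
# ['enable', 'debug', 'log', 'controlup', 'agent']
```

Prefer the real TextAnalyzer above; this sketch exists only to make the tokenize-filter-stem shape tangible.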

Phase 3 (continued): Inverted Index

The inverted index is the data structure behind every search engine. It maps each stemmed term to the list of chunks that contain it, along with how many times the term appears in each chunk (the term frequency).

from collections import defaultdict, Counter


class InvertedIndex:

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.index = defaultdict(list)
        self.doc_lengths = {}
        self.doc_term_freqs = {}
        self.total_docs = 0
        self.avg_doc_length = 0.0

    def build(self, chunks):
        self.total_docs = len(chunks)
        total_length = 0

        for idx, chunk in enumerate(chunks):
            tokens = self.analyzer.analyze(chunk["text"])
            self.doc_lengths[idx] = len(tokens)
            total_length += len(tokens)

            tf = Counter(tokens)
            self.doc_term_freqs[idx] = tf

            for term, freq in tf.items():
                self.index[term].append((idx, freq))

        self.avg_doc_length = total_length / self.total_docs if self.total_docs else 0

    def get_postings(self, term):
        return self.index.get(term, [])

    def document_frequency(self, term):
        return len(self.index.get(term, []))
The build method iterates over every chunk, runs each one through the language analyzer, counts the frequency of every term, and stores the mapping. After building, the index might look like this:

"uninstal"       -> [chunk_0 (tf=2), chunk_3 (tf=1)]
"appdxhelper.ex" -> [chunk_0 (tf=3), chunk_5 (tf=1)]
"log" -> [chunk_7 (tf=4), chunk_9 (tf=2), chunk_12 (tf=1)]

The doc_lengths and avg_doc_length fields are used by BM25 for document length normalization - short chunks get boosted because a keyword match in a 20-token passage is more meaningful than the same match in a 200-token passage.
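
The same build logic can be seen in miniature. The two one-line "chunks" below are hypothetical stand-ins for real passages, and the analyzer step is skipped (plain split) for brevity:

```python
from collections import Counter, defaultdict

chunks = [
    "uninstall the agent then uninstall the helper",  # chunk 0
    "collect the helper log",                         # chunk 1
]

# term -> list of (chunk index, term frequency) postings
index = defaultdict(list)
for idx, text in enumerate(chunks):
    tf = Counter(text.split())
    for term, freq in tf.items():
        index[term].append((idx, freq))

print(index["uninstall"])  # [(0, 2)] -- twice in chunk 0 only
print(index["helper"])     # [(0, 1), (1, 1)] -- once in each chunk
```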

Phase 4: BM25 Scoring

BM25 (Best Matching 25) is the modern evolution of TF-IDF, the same upgrade described in the previous article's Scoring with BM25 section. It improves on raw TF-IDF by adding term-frequency saturation (the 100th occurrence of a word does not score 10x higher than the 10th) and document-length normalization.

import math


class BM25Scorer:

    def __init__(self, inv_index, k1=1.5, b=0.75):
        self.idx = inv_index
        self.k1 = k1
        self.b = b

    def idf(self, term):
        df = self.idx.document_frequency(term)
        N = self.idx.total_docs
        if df == 0:
            return 0.0
        return math.log((N - df + 0.5) / (df + 0.5) + 1.0)

    def score_document(self, chunk_idx, query_terms, verbose=False):
        score = 0.0
        doc_len = self.idx.doc_lengths.get(chunk_idx, 0)
        avg_dl = self.idx.avg_doc_length
        breakdown = {}

        for term in query_terms:
            tf = self.idx.doc_term_freqs.get(chunk_idx, {}).get(term, 0)
            if tf == 0:
                continue
            idf_val = self.idf(term)
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (1 - self.b + self.b * (doc_len / avg_dl))
            term_score = idf_val * (numerator / denominator)
            score += term_score
            if verbose:
                breakdown[term] = {
                    "tf": tf,
                    "idf": round(idf_val, 4),
                    "saturation": round(numerator / denominator, 4),
                    "term_score": round(term_score, 4),
                }

        return score, breakdown

    def search(self, query, analyzer, top_k=50):
        query_terms = analyzer.analyze(query)

        candidates = set()
        for term in query_terms:
            for chunk_idx, _ in self.idx.get_postings(term):
                candidates.add(chunk_idx)

        results = []
        for chunk_idx in candidates:
            score, breakdown = self.score_document(chunk_idx, query_terms, verbose=True)
            if score > 0:
                results.append((chunk_idx, score, breakdown))

        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

The idf method computes how rare a term is across all chunks. Terms that appear in fewer chunks score higher because they carry more information. The score_document method combines IDF with the BM25 saturation formula for each query term, then sums the contributions.

The search method first uses the inverted index to find candidate chunks (only chunks that contain at least one query term), then scores each candidate and returns the top results sorted by score. This is the L1 keyword retrieval leg.
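
Plugging small numbers into the formula shows all three effects (rarity, saturation, length normalization) at once. The corpus statistics below, a hypothetical 20 chunks averaging 100 tokens, are invented for illustration:

```python
import math

k1, b = 1.5, 0.75
N, avg_dl = 20, 100.0  # hypothetical corpus: 20 chunks, 100 tokens on average


def bm25_term(tf, df, doc_len):
    # Same per-term formula as BM25Scorer above
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_dl))


# Rarity: a term in 1 of 20 chunks outweighs one in 15 of 20
print(round(bm25_term(1, 1, 100), 3))   # 2.639
print(round(bm25_term(1, 15, 100), 3))  # 0.304

# Saturation: tf=10 scores nowhere near 10x tf=1
print(round(bm25_term(10, 1, 100), 3))

# Length normalization: the same match counts for more in a 20-token chunk
print(round(bm25_term(1, 1, 20), 3))
```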

Phase 5-6: Vector Indexing and Vector Search

The vector index stores dense embeddings for every chunk, using the same SentenceTransformer model from the NLP Introduction article. At query time, it converts the query to a vector and finds the nearest neighbors by cosine similarity.

import numpy as np
from sentence_transformers import SentenceTransformer


class VectorIndex:

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        print(f" Loading embedding model: {model_name}")
        self.model = SentenceTransformer(model_name)
        self.embeddings = None

    def build(self, chunks):
        texts = [chunk["text"] for chunk in chunks]
        print(f" Encoding {len(texts)} chunks...")
        self.embeddings = self.model.encode(texts, show_progress_bar=True, convert_to_numpy=True)

    def search(self, query, top_k=50):
        query_emb = self.model.encode([query], convert_to_numpy=True)

        q_norm = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
        d_norms = self.embeddings / np.linalg.norm(self.embeddings, axis=1, keepdims=True)

        similarities = np.dot(d_norms, q_norm.T).flatten()
        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [(int(i), float(similarities[i])) for i in top_indices]

The build method encodes every chunk's text into a 384-dimensional vector using all-MiniLM-L6-v2. The search method normalizes both the query vector and all document vectors to unit length (so the dot product equals cosine similarity), then returns the chunks with the highest similarity scores.

This is the L1 vector retrieval leg. It finds semantically similar content even when there is zero keyword overlap - a query like "How do I remove the monitoring software from my computer?" will match chunks about uninstalling ControlUp because the embedding model understands they mean the same thing.

Azure AI Search uses HNSW (an approximate nearest neighbor algorithm) for speed at scale. This implementation uses exhaustive KNN (comparing against every vector), which is equivalent to Azure's eKNN option and perfectly accurate for small indexes.
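
The normalize-then-dot identity is easy to verify by hand in pure Python; the two 2-dimensional vectors below are made up (real embeddings have 384 dimensions):

```python
import math


def cosine(a, b):
    # Textbook cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def normalize(v):
    # Scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


q = [3.0, 4.0]
d = [4.0, 3.0]

direct = cosine(q, d)
# Dot product of unit vectors gives the same number
via_unit = sum(x * y for x, y in zip(normalize(q), normalize(d)))
print(round(direct, 4), round(via_unit, 4))  # 0.96 0.96
```

This is why VectorIndex.search normalizes once and then uses a plain dot product: for unit vectors, the two computations are identical.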

Phase 7: Reciprocal Rank Fusion

After both retrieval legs return their results, the pipeline needs to merge them. The scores from BM25 and cosine similarity are fundamentally different numbers on different scales, so you cannot simply add them. Reciprocal Rank Fusion solves this by ignoring the scores entirely and working only with rank positions.

from collections import defaultdict

RRF_K = 60


def reciprocal_rank_fusion(keyword_results, vector_results, k=RRF_K):
    rrf_scores = defaultdict(float)
    debug = defaultdict(lambda: {
        "kw_rank": None, "kw_score": None, "kw_breakdown": None,
        "vec_rank": None, "vec_score": None,
        "kw_rrf_contrib": 0.0, "vec_rrf_contrib": 0.0,
    })

    for rank, (idx, score, breakdown) in enumerate(keyword_results, start=1):
        contrib = 1.0 / (k + rank)
        rrf_scores[idx] += contrib
        debug[idx]["kw_rank"] = rank
        debug[idx]["kw_score"] = score
        debug[idx]["kw_breakdown"] = breakdown
        debug[idx]["kw_rrf_contrib"] = contrib

    for rank, (idx, score) in enumerate(vector_results, start=1):
        contrib = 1.0 / (k + rank)
        rrf_scores[idx] += contrib
        debug[idx]["vec_rank"] = rank
        debug[idx]["vec_score"] = score
        debug[idx]["vec_rrf_contrib"] = contrib

    merged = [(idx, rrf_score, debug[idx]) for idx, rrf_score in rrf_scores.items()]
    merged.sort(key=lambda x: x[1], reverse=True)
    return merged

The formula is RRF(doc) = SUM( 1/(k + rank_i) ) for each ranker. A document ranked #1 by keyword search contributes 1/(60 + 1) = 0.0164 from that leg. If the same document is also ranked #3 by vector search, it gets an additional 1/(60 + 3) = 0.0159, for a total RRF score of 0.0323. Documents that appear in both result sets get boosted, which is exactly the behavior you want from hybrid search.

The k constant (60 is Azure's default) controls how much weight is given to top-ranked results versus lower-ranked ones. A larger k flattens the contribution curve, making the difference between rank #1 and rank #10 smaller.
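
A few lines of arithmetic reproduce the worked example above and show how k flattens the curve (ranks 1, 3, and 10 are arbitrary example positions):

```python
def rrf_contrib(rank, k=60):
    # One ranker's contribution to a document's RRF score
    return 1.0 / (k + rank)


# Worked example: keyword rank 1 plus vector rank 3
total = rrf_contrib(1) + rrf_contrib(3)
print(round(total, 4))  # 0.0323, as in the text

# With k=60, rank 1 is worth only ~1.15x rank 10...
print(round(rrf_contrib(1) / rrf_contrib(10), 3))
# ...while with k=0 it would be worth 10x
print(round(rrf_contrib(1, k=0) / rrf_contrib(10, k=0), 3))
```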

Phase 8: Semantic Reranking

The semantic reranker is the L2 (second layer) step that sits on top of the L1 retrieval. Azure's semantic ranker is a deep learning model adapted from Microsoft Bing that reads the actual text of each result (not just term statistics or vector distances) and applies machine reading comprehension to judge true relevance.

This implementation uses a cross-encoder model. Unlike the bi-encoder used for vector search (which encodes query and document separately), a cross-encoder reads the query and document text together in a single pass, allowing it to capture fine-grained interactions between them.

from sentence_transformers import CrossEncoder


class SemanticReranker:

    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        print(f" Loading cross-encoder: {model_name}")
        self.model = CrossEncoder(model_name)

    def rerank(self, query, chunks, candidate_indices, top_n=10):
        pairs = [(query, chunks[idx]["text"][:512]) for idx in candidate_indices]
        scores = self.model.predict(pairs)

        scored = list(zip(candidate_indices, [float(s) for s in scores]))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored[:top_n]

The rerank method takes the top results from RRF and feeds each one as a (query, chunk_text) pair into the cross-encoder. The model returns a relevance score for each pair, and the results are re-sorted by that score. The [:512] slice truncates each chunk to 512 characters, a conservative guard that keeps inputs comfortably within the cross-encoder's 512-token maximum input length.

The cross-encoder ms-marco-MiniLM-L-6-v2 was trained on the MS MARCO passage ranking dataset, which contains real search queries and human-judged relevance labels. This is what allows it to understand whether a chunk actually answers the query, not just whether it contains similar words or concepts.

The Full Pipeline

The HybridSearchEngine class orchestrates everything. It wires together all the components and executes the complete pipeline: index, then search.

from hybrid_search.config import BANNER, DIVIDER, RRF_K, RERANK_TOP_N
from hybrid_search.text_analyzer import TextAnalyzer
from hybrid_search.inverted_index import InvertedIndex
from hybrid_search.bm25_scorer import BM25Scorer
from hybrid_search.vector_index import VectorIndex
from hybrid_search.semantic_reranker import SemanticReranker
from hybrid_search.utils import load_documents, chunk_all_documents, reciprocal_rank_fusion


class HybridSearchEngine:

    def __init__(self):
        self.chunks = []
        self.analyzer = TextAnalyzer()
        self.inverted_index = InvertedIndex(self.analyzer)
        self.bm25 = None
        self.vector_index = None
        self.reranker = None

The build_index method runs Phases 1 through 5. It loads documents, chunks them, builds the inverted index, creates the BM25 scorer, generates embeddings for all chunks, and loads the semantic reranker.

    def build_index(self, docs_dir):
        print(f"\n{BANNER}")
        print(" PHASE 1 -- DOCUMENT INGESTION")
        print(BANNER)
        documents = load_documents(docs_dir)
        for doc in documents:
            print(f" + Loaded: {doc['filename']} ({len(doc['content']):,} chars)")
        print(f"\n Total documents: {len(documents)}")

        print(f"\n{BANNER}")
        print(" PHASE 2 -- CHUNKING (splitting docs into passages)")
        print(BANNER)
        self.chunks = chunk_all_documents(documents)
        for i, chunk in enumerate(self.chunks):
            preview = chunk["text"][:90].replace("\n", " ")
            print(f" chunk[{i}] {chunk['doc_filename']} -> \"{chunk['heading']}\"")
            print(f"   {preview}...")
        print(f"\n Total chunks: {len(self.chunks)}")

        print(f"\n{BANNER}")
        print(" PHASE 3 -- FULL-TEXT INDEXING (Inverted Index)")
        print(BANNER)
        self.inverted_index.build(self.chunks)
        self.bm25 = BM25Scorer(self.inverted_index)
        print(f" Unique terms indexed: {len(self.inverted_index.index):,}")
        print(f" Average chunk length: {self.inverted_index.avg_doc_length:.1f} tokens")

        print(f"\n +-- Sample inverted-index entries --------------------------")
        sample_terms = sorted(self.inverted_index.index.keys())[:15]
        for term in sample_terms:
            postings = self.inverted_index.get_postings(term)
            entries = ", ".join(f"chunk[{idx}](tf={tf})" for idx, tf in postings)
            print(f" | \"{term}\" -> [{entries}]")
        print(f" +--------------------------------------------------------------")

        print(f"\n{BANNER}")
        print(" PHASE 4 -- VECTOR INDEXING (Generating Embeddings)")
        print(BANNER)
        self.vector_index = VectorIndex()
        self.vector_index.build(self.chunks)
        rows, dims = self.vector_index.embeddings.shape
        print(f" Embeddings matrix: {rows} chunks x {dims} dimensions")

        print(f"\n{BANNER}")
        print(" LOADING SEMANTIC RERANKER (L2 cross-encoder)")
        print(BANNER)
        self.reranker = SemanticReranker()
        print(f" Ready.\n")

The search method runs Phases 6 through 8. It analyzes the query, runs keyword and vector search in sequence, merges results with RRF, and reranks the top results with the cross-encoder. Every step prints its output so you can follow the pipeline.

    def search(self, query):
        print(f"\n{'#' * 80}")
        print(f"# QUERY: \"{query}\"")
        print(f"{'#' * 80}")

        steps = self.analyzer.analyze_verbose(query)
        query_terms = steps["after_stemming"]
        print(f"\n{DIVIDER}")
        print(f" STEP A -- Language Analyzer (query processing)")
        print(DIVIDER)
        print(f" Original query: \"{query}\"")
        print(f" Tokens: {steps['original_tokens']}")
        print(f" After stop/punct: {steps['after_stop_and_punct']}")
        print(f" After stemming: {steps['after_stemming']}")

        kw_results = self.bm25.search(query, self.analyzer)
        print(f"\n{DIVIDER}")
        print(f" STEP B -- Full-Text Search (BM25 scoring)")
        print(DIVIDER)
        print(f" Searched inverted index for terms: {query_terms}")
        if kw_results:
            for rank, (idx, score, breakdown) in enumerate(kw_results, 1):
                c = self.chunks[idx]
                print(f"\n Rank {rank}: chunk[{idx}] BM25 = {score:.4f}")
                print(f"   Source: {c['doc_filename']} / {c['heading']}")
                for term, info in breakdown.items():
                    print(f"   term \"{term}\": tf={info['tf']}, idf={info['idf']}, "
                          f"saturation={info['saturation']}, contribution={info['term_score']}")
        else:
            print(f" (no keyword matches found)")

        vec_results = self.vector_index.search(query)
        print(f"\n{DIVIDER}")
        print(f" STEP C -- Vector Search (cosine similarity)")
        print(DIVIDER)
        print(f" Query embedded to {self.vector_index.embeddings.shape[1]}-dim vector, "
              f"compared against all {len(self.chunks)} chunks")
        for rank, (idx, score) in enumerate(vec_results[:10], 1):
            c = self.chunks[idx]
            print(f" Rank {rank}: chunk[{idx}] cosine = {score:.4f} "
                  f"<- {c['doc_filename']} / {c['heading']}")

        rrf_results = reciprocal_rank_fusion(kw_results, vec_results)
        print(f"\n{DIVIDER}")
        print(f" STEP D -- Reciprocal Rank Fusion (merging both legs)")
        print(DIVIDER)
        print(f" Formula: RRF(doc) = SUM( 1/(k + rank_i) ) where k = {RRF_K}")
        print()
        for rank, (idx, rrf_score, info) in enumerate(rrf_results[:10], 1):
            c = self.chunks[idx]
            kw_str = (f"kw_rank={info['kw_rank']}, 1/({RRF_K}+{info['kw_rank']})="
                      f"{info['kw_rrf_contrib']:.6f}" if info["kw_rank"] else "-- (not in keyword results)")
            vec_str = (f"vec_rank={info['vec_rank']}, 1/({RRF_K}+{info['vec_rank']})="
                       f"{info['vec_rrf_contrib']:.6f}" if info["vec_rank"] else "-- (not in vector results)")

            in_both = info["kw_rank"] is not None and info["vec_rank"] is not None
            boost_tag = " ** IN BOTH LEGS **" if in_both else ""

            print(f" Rank {rank}: chunk[{idx}] RRF = {rrf_score:.6f}{boost_tag}")
            print(f"   Source: {c['doc_filename']} / {c['heading']}")
            print(f"   Keyword: {kw_str}")
            print(f"   Vector: {vec_str}")

        top_indices = [idx for idx, _, _ in rrf_results[:RERANK_TOP_N]]
        reranked = self.reranker.rerank(query, self.chunks, top_indices)
        print(f"\n{DIVIDER}")
        print(f" STEP E -- Semantic Reranking (L2 cross-encoder)")
        print(DIVIDER)
        print(f" Cross-encoder reads the query + each chunk's text together")
        print(f" and judges true relevance via reading comprehension.")
        print(f" Reranking top {len(top_indices)} RRF results...\n")
        for rank, (idx, score) in enumerate(reranked, 1):
            c = self.chunks[idx]
            preview = c["text"][:150].replace("\n", " ")
            print(f" Rank {rank}: chunk[{idx}] rerank_score = {score:.4f}")
            print(f"   Source: {c['doc_filename']} / {c['heading']}")
            print(f"   Preview: {preview}...")

        winner_idx, winner_score = reranked[0]
        winner = self.chunks[winner_idx]
        print(f"\n{BANNER}")
        print(f" FINAL ANSWER")
        print(BANNER)
        print(f" Best match: {winner['doc_filename']} / {winner['heading']}")
        print(f" Score: {winner_score:.4f}")
        print(f"\n +-- Chunk text ------------------------------------------------")
        for line in winner["text"].split("\n")[:12]:
            print(f" | {line}")
        print(f" | ...")
        print(f" +--------------------------------------------------------------")

        return reranked

Module Definition

The __init__.py file exposes all the public classes and functions so you can import them cleanly.

from hybrid_search.text_analyzer import TextAnalyzer
from hybrid_search.inverted_index import InvertedIndex
from hybrid_search.bm25_scorer import BM25Scorer
from hybrid_search.vector_index import VectorIndex
from hybrid_search.semantic_reranker import SemanticReranker
from hybrid_search.hybrid_engine import HybridSearchEngine
from hybrid_search.utils import load_documents, chunk_document, chunk_all_documents, reciprocal_rank_fusion

__all__ = [
    "TextAnalyzer",
    "InvertedIndex",
    "BM25Scorer",
    "VectorIndex",
    "SemanticReranker",
    "HybridSearchEngine",
    "load_documents",
    "chunk_document",
    "chunk_all_documents",
    "reciprocal_rank_fusion",
]

Running the Demo

The entry point runs five demo queries that showcase different strengths of hybrid search.

import nltk

from hybrid_search.config import DOCS_DIR, BANNER
from hybrid_search.hybrid_engine import HybridSearchEngine


def main():
    nltk.download("punkt", quiet=True)
    nltk.download("punkt_tab", quiet=True)
    nltk.download("stopwords", quiet=True)

    engine = HybridSearchEngine()
    engine.build_index(DOCS_DIR)

    queries = [
        "AppDXHelper.exe",
        "How do I remove the monitoring software from my computer?",
        "How do I enable debug logging for the ControlUp agent?",
        "What steps are needed to set up browser monitoring on Windows?",
        "What registry keys does the browser extension configure?",
    ]

    for i, query in enumerate(queries, 1):
        print(f"\n\n{'*' * 80}")
        print(f" DEMO QUERY {i} of {len(queries)}")
        print(f"{'*' * 80}")
        engine.search(query)

        if i < len(queries):
            input(f"\n >>> Press Enter for the next query ({i+1}/{len(queries)})... ")

    print(f"\n\n{BANNER}")
    print(" ALL QUERIES COMPLETE")
    print(BANNER)
    print(" What you just saw is exactly what Azure AI Search does behind its API:")
    print(" 1. Language analyzer processed your query (tokenize -> stem -> stop words)")
    print(" 2. BM25 scored keyword matches in the inverted index")
    print(" 3. Cosine similarity found semantically similar chunks via embeddings")
    print(" 4. Reciprocal Rank Fusion merged both result sets by rank position")
    print(" 5. A cross-encoder reranked the top results using reading comprehension")
    print()
    print(" Try changing the queries or adding your own to see how the pipeline behaves!")
    print()


if __name__ == "__main__":
    main()

Run the demo from the directory that contains __main__.py (the same layout as after a download into hybrid-search/):

python __main__.py

If you rename that folder to hybrid_search (underscore), you can run python -m hybrid_search instead.

Each query demonstrates a different aspect of hybrid search:

  1. "AppDXHelper.exe" - an exact keyword match where the keyword leg shines. The filename appears verbatim in the documents, so BM25 finds it instantly. The embedding model dilutes this precision into a 384-dimensional vector where it loses specificity.

  2. "How do I remove the monitoring software from my computer?" - a vocabulary mismatch query where the vector leg shines. The query says "remove" and "monitoring software", but the relevant document says "uninstall" and "ControlUp for Apps". Zero keyword overlap, yet vector search finds the match because the embedding model understands synonyms.

  3. "How do I enable debug logging for the ControlUp agent?" - a hybrid query where both legs contribute. "ControlUp" matches by keyword, while "enable debug logging" matches by meaning with the logging article's content about setting environment variables.

  4. "What steps are needed to set up browser monitoring on Windows?" - a broad conceptual question where the vector leg and reranker do the heavy lifting. The query is about a concept (browser monitoring setup) rather than specific terms.

  5. "What registry keys does the browser extension configure?" - a mixed query with both a keyword signal ("registry") and conceptual intent (understanding what configuration happens). Both legs find relevant chunks, and the reranker promotes the one that best answers the actual question.

Key Takeaways

  1. This pipeline is a working implementation of the same architecture Azure AI Search runs at scale: language analysis, inverted indexing, BM25, vector search, RRF, and semantic reranking
  2. The language analyzer applies the identical NLP operations from the earlier articles (tokenization, stop words, stemming) packaged as a reusable component
  3. BM25 improves on TF-IDF with term-frequency saturation and document-length normalization, handling edge cases that raw TF-IDF gets wrong
  4. Vector search catches vocabulary mismatch that keyword search misses entirely - "remove" matches "uninstall" through embedding similarity
  5. Reciprocal Rank Fusion merges results from both legs without needing scores on the same scale, boosting documents that appear in both result sets
  6. The cross-encoder semantic reranker reads the query and document text together, applying reading comprehension to judge true relevance rather than relying on term statistics or vector distances
  7. Different queries favor different legs of the pipeline, which is exactly why hybrid search outperforms either approach alone
  8. Every component maps directly to an Azure AI Search feature: the analyzer maps to language analyzers, the inverted index maps to full-text search, the vector index maps to vector fields, RRF maps to hybrid queries, and the reranker maps to the semantic ranker