
Azure AI Search: NLP at Scale

Azure AI Search is the cloud system that runs the same operations covered in the previous articles (tokenization, stemming, TF-IDF, embeddings, cosine similarity) across millions of documents, in milliseconds, behind an API.

This article explains how Azure AI Search works, what query types it supports, and some of Microsoft's recommendations.

Reference: Query Types - Azure AI Search

What Azure AI Search Does

Azure AI Search is a cloud based information retrieval platform. You feed it documents (PDFs, web pages, database rows, knowledge base articles) and it builds optimized data structures that let you query that content in different ways: by keywords, by meaning, or both at once.

At a high level, the system follows the same two phase pattern that appears in most search architectures:

  1. Indexing: documents are ingested, analyzed, chunked, and stored in specialized data structures (inverted indexes for text, vector indexes for embeddings)
  2. Querying: user queries hit those indexes and return ranked results

This is the same pipeline introduced in the NLP Introduction article, scaled up to handle enterprise workloads.

Full-Text Search

Full-text search is the traditional keyword-based approach. Azure AI Search builds inverted indexes, the same data structure behind every search engine since the 1990s.

How It Works

When a document is indexed, the text passes through a language analyzer that performs operations we already covered in the previous articles:

  1. Tokenization: splitting text into individual terms, exactly like nltk.word_tokenize() or text.split() from the Text Processing with NLTK article
  2. Normalization: lowercasing, removing diacritics, and applying language specific rules (similar to the preprocessing functions we built)
  3. Stemming or lemmatization: reducing words to root forms so that "running", "runs", and "ran" all match the same index entry
  4. Stop word filtering: dropping high-frequency, low-information words like "the", "is", "in"
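As a toy sketch (not Azure's actual analyzer, which applies far richer language-specific rules, like the NLTK stemmers and stopword lists from the earlier articles), the four steps can be chained like this:

```python
# Toy sketch of the four analyzer steps; the suffix stripping below
# is illustrative only, not a real stemming algorithm.
STOP_WORDS = {"the", "is", "in", "a", "an", "to"}

def toy_stem(token):
    # Crude stand-in for stemming: strip a few common suffixes
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = text.lower().split()                      # 1. tokenization (+ lowercasing)
    tokens = [t.strip(".,!?") for t in tokens]         # 2. normalization: strip punctuation
    tokens = [toy_stem(t) for t in tokens]             # 3. stemming
    return [t for t in tokens if t not in STOP_WORDS]  # 4. stop word filtering

print(analyze("The machine is running the uninstaller."))
# ['machine', 'run', 'uninstaller']
```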

The result is an inverted index that maps every term to the list of documents containing it:

"uninstall" -> [doc_14, doc_87, doc_203]
"controlup" -> [doc_14, doc_15, doc_87]
"machine" -> [doc_14, doc_87, doc_112, doc_203, doc_445]
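A minimal version of this structure is easy to build in Python. The document bodies below are invented to stay consistent with a subset of the mapping above:

```python
from collections import defaultdict

# Invented document bodies, consistent with the term -> document mapping above
docs = {
    "doc_14": "uninstall controlup machine",
    "doc_15": "controlup",
    "doc_87": "uninstall controlup machine",
    "doc_203": "uninstall machine",
}

# Build the inverted index: each term maps to the set of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["uninstall"]))
print(sorted(index["controlup"]))
```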

Scoring with BM25

Once the index identifies candidate documents, they need to be ranked. Azure AI Search uses BM25 (Best Matching 25), which is the modern evolution of TF-IDF.

In the Sentiment Analysis article, we used TF-IDF to weight terms:

  • TF (Term Frequency): how often the word appears in a document
  • IDF (Inverse Document Frequency): how rare the word is across all documents

BM25 improves on this by adding two refinements:

  • Term frequency saturation: in TF-IDF, a word appearing 100 times scores 10x higher than one appearing 10 times. BM25 applies diminishing returns after a certain frequency, recognizing that the 100th occurrence of "machine" doesn't add 10x more relevance than the 10th
  • Document length normalization: short documents get a boost because a keyword match in a 50-word summary is more meaningful than the same match in a 10,000-word manual

The intuition is the same as TF-IDF: rare, meaningful terms score higher than common filler words. BM25 just handles edge cases more gracefully.
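To see both refinements concretely, here is the standard Okapi BM25 term-frequency component, with the commonly used defaults k1 = 1.5 and b = 0.75 (Azure does not document its exact parameter values, and the IDF factor is omitted for brevity):

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """BM25 term-frequency component: saturates as tf grows and
    normalizes by document length relative to the corpus average.
    (The IDF factor that multiplies this is omitted for brevity.)"""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation: 10x the term frequency is nowhere near 10x the score
print(bm25_tf(10, 300, 300))    # ~2.17
print(bm25_tf(100, 300, 300))   # ~2.46

# Length normalization: the same match counts for more in a short document
print(bm25_tf(3, 50, 300))      # short doc scores higher...
print(bm25_tf(3, 10000, 300))   # ...than a very long one
```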

Strengths and Limitations

Full-text search excels at:

  • Exact matches: product codes, error messages, filenames like AppDXHelper.exe
  • Specialized terminology: domain jargon that embedding models may not have seen during training
  • Speed: inverted index lookups are extremely fast, even across billions of documents

It struggles with:

  • Vocabulary mismatch: searching for "uninstall" won't match a document that says "remove the application"
  • Intent understanding: no sense of what the user actually means, only what they typed
  • Misspellings: "contolup" won't match "ControlUp" unless fuzzy matching is explicitly configured

Vector Search

Vector search takes the embedding approach from the NLP Introduction and applies it at scale.

How It Works

Instead of matching keywords, vector search compares the meaning of a query against the meaning of indexed documents. Both are represented as dense numeric vectors, the same kind of embeddings we generated with SentenceTransformer("all-MiniLM-L6-v2").

The indexing pipeline:

  1. Chunk large documents into smaller passages (one to two paragraphs each)
  2. Generate embeddings for each chunk using a model like Azure OpenAI's text-embedding-ada-002
  3. Store the vectors in a vector index optimized for similarity search

At query time:

  1. The user's query is converted to a vector using the same embedding model
  2. The search engine finds the nearest neighbors - vectors closest to the query vector in high dimensional space
  3. Results are ranked by similarity score (cosine similarity or dot product)

This is the exact same process we implemented manually:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["How to remove ControlUp for Apps", "Uninstall guide for desktop agent"]
query = "delete the application from my computer"

# Normalized embeddings make the dot product equal to cosine similarity
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode([query], normalize_embeddings=True)

# One similarity score per document; the highest is the best match
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
best_match = documents[np.argmax(similarities)]

The query "delete the application from my computer" shares zero keywords with "How to remove ControlUp for Apps", yet vector search finds the match because the embedding model understands they mean the same thing.

Similarity Algorithms

Azure AI Search supports two algorithms for finding nearest neighbors:

  • HNSW (Hierarchical Navigable Small World) - an approximate nearest neighbor algorithm that trades a small amount of accuracy for significant speed gains. This is the default and recommended option for most workloads
  • Exhaustive KNN (K-Nearest Neighbors) - compares the query vector against every single vector in the index. Perfectly accurate but slower, useful for small indexes or when recall is critical

Both rely on the same mathematical foundation - cosine similarity and dot product - that we explored in the Sentiment Analysis and NLP Introduction articles.
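Exhaustive KNN in particular is simple to sketch with NumPy: normalize, take dot products against every indexed vector, and keep the top k (random vectors stand in for real embeddings here):

```python
import numpy as np

def exhaustive_knn(query_vec, index_vecs, k=3):
    """Brute-force nearest-neighbor search: compare the query against
    every indexed vector and return the k most similar."""
    # Normalize so the dot product equals cosine similarity
    index_norm = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    sims = index_norm @ query_norm            # one cosine score per indexed vector
    top_k = np.argsort(sims)[::-1][:k]        # indices of the k closest vectors
    return top_k, sims[top_k]

rng = np.random.default_rng(0)
index_vecs = rng.normal(size=(1000, 384))     # 1,000 fake 384-dim embeddings
query_vec = rng.normal(size=384)
ids, scores = exhaustive_knn(query_vec, index_vecs)
```

HNSW avoids this full scan by navigating a layered graph of vectors, which is why it trades a little recall for large speed gains.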

Strengths and Limitations

Vector search excels at:

  • Semantic matching - finds conceptually similar content even without shared keywords
  • Misspelling tolerance - "contolup" and "ControlUp" produce similar embeddings
  • Cross-lingual search - multilingual embedding models can match queries in one language to documents in another

It struggles with:

  • Exact matches - a product code like SKU-4829 gets diluted in a 384-dimensional embedding
  • Highly specialized jargon - if the embedding model never saw "AppDXHelper.exe" during training, its vector representation may be unreliable
  • Dates and proper nouns - vector search might confuse "March 2024" and "March 2025" because they embed almost identically

Hybrid Search

Hybrid search runs full-text search and vector search in parallel against the same index, then merges the results. It is a single query request that executes both retrieval paths simultaneously.

Reference: Hybrid Search Overview - Azure AI Search

Why Hybrid?

Each search type compensates for the other's weaknesses:

| Scenario | Keyword Search | Vector Search | Hybrid |
|---|---|---|---|
| Exact product code (SKU-4829) | Finds it instantly | Diluted in embedding space | Keyword leg finds it |
| Vocabulary mismatch ("delete" vs "uninstall") | Misses it | Finds it via semantic similarity | Vector leg finds it |
| Misspelled query ("contolup") | Misses it | Tolerant to misspellings | Vector leg finds it |
| Specialized jargon (AppDXHelper.exe) | Finds exact match | Unreliable embedding | Keyword leg finds it |
| Conceptual questions ("how do I remove software") | Needs exact terms | Understands intent | Vector leg finds it |

Running both in parallel means you get the precision of keyword search and the recall of vector search in a single request.

How Results Are Merged: Reciprocal Rank Fusion (RRF)

When two different ranking systems return results, you need a strategy to combine them. Azure AI Search uses Reciprocal Rank Fusion (RRF), which works by rewarding documents that appear in both result sets.

The formula is straightforward:

RRF_score(doc) = sum( 1 / (k + rank_i) ) for each ranker i

Where k is a constant (typically 60) and rank_i is the document's position in each ranker's results. A document ranked #1 by keyword search and #3 by vector search scores higher than a document ranked #1 by only one of them.

This is effective because it does not require the two scoring systems to be on the same scale - BM25 scores and cosine similarity scores are fundamentally different numbers, but RRF only cares about relative rank positions.
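The formula translates almost directly into code; this sketch merges two ranked lists of document IDs (the doc IDs are made up for illustration):

```python
def rrf_merge(keyword_ranking, vector_ranking, k=60):
    """Merge two ranked lists of doc IDs with Reciprocal Rank Fusion.
    k = 60 is the commonly cited default constant."""
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_A", "doc_B", "doc_C"]   # ranked by BM25
vector_results = ["doc_C", "doc_D", "doc_A"]    # ranked by cosine similarity

# doc_A and doc_C appear in both lists, so they outrank
# doc_B and doc_D, which each appear in only one
print(rrf_merge(keyword_results, vector_results))
```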

Semantic Ranking (L2 Reranking)

On top of hybrid retrieval, Azure AI Search offers an optional semantic ranker, a deep learning model adapted from Microsoft Bing that reranks the top 50 results from the hybrid query.

The semantic ranker reads the actual content of each result (not just term statistics or vector distances) and applies machine reading comprehension to promote the most relevant matches. This is the L2 (second layer) ranking step that sits on top of the L1 (retrieval) layer.

Benchmark Evidence

Microsoft published benchmark results comparing retrieval configurations across customer datasets, the BEIR academic benchmark, and the MIRACL multilingual benchmark:

| Search Configuration | Customer Datasets (NDCG@3) | BEIR (NDCG@10) | MIRACL (NDCG@10) |
|---|---|---|---|
| Keyword only | 40.6 | 40.6 | 49.6 |
| Vector only (Ada-002) | 43.8 | 45.0 | 58.3 |
| Hybrid (Keyword + Vector) | 48.4 | 48.4 | 58.8 |
| Hybrid + Semantic Ranker | 60.1 | 50.0 | 72.0 |

Hybrid retrieval with semantic ranking outperformed every other configuration across all benchmarks. The improvement is particularly significant for RAG applications, where the quality of retrieved content directly determines the quality of generated answers.

Reference: Azure AI Search: Outperforming vector search with hybrid retrieval and reranking

A Real Use Case: Why Microsoft Recommends Hybrid for Knowledge Bases

Microsoft's official RAG documentation explicitly recommends hybrid search for knowledge base applications:

Use hybrid queries that combine keyword (nonvector) and vector search for maximum recall. In a hybrid query, if you double down on the same input, a text string and its vector equivalent generate parallel queries for keywords and similarity search, returning the most relevant matches from each query type in a unified result set.

The reasoning maps directly to knowledge base article characteristics:

  1. Knowledge base articles contain both precise terminology and conceptual explanations - an article about uninstalling ControlUp contains exact filenames (AppDXHelper.exe) alongside procedural descriptions ("verify that the process no longer appears"). Keyword search handles the filenames; vector search handles the descriptions
  2. Users query knowledge bases in unpredictable ways - some type exact error messages, others describe problems in natural language, others paste log snippets. Hybrid search covers all of these patterns
  3. RAG applications need the best possible grounding data - when an LLM generates an answer from retrieved content, irrelevant results degrade quality and waste tokens. Hybrid + semantic ranking maximizes the chance that the top results are genuinely relevant

Agentic Retrieval (Preview)

Agentic retrieval is the newest addition to Azure AI Search, currently in public preview. It extends hybrid search by adding an LLM-powered query planning layer.

Reference: Agentic Retrieval Overview - Azure AI Search

How It Works

Instead of sending a single query to the index, agentic retrieval uses an LLM to analyze the user's question (including conversation history) and decompose it into multiple focused subqueries:

  1. Query planning - an LLM (GPT-4o or GPT-4.1) reads the user's question and chat history, then generates targeted subqueries
  2. Parallel execution - all subqueries run simultaneously against the search index using hybrid search with semantic ranking
  3. Result synthesis - results from all subqueries are merged into a unified response with grounding data, source references, and execution metadata

For example, a user asking "What's the process for uninstalling ControlUp and what happens if the AppDXHelper process is still running after reboot?" might be decomposed into:

  • Subquery 1: "ControlUp uninstallation steps"
  • Subquery 2: "AppDXHelper.exe process running after reboot troubleshooting"

Each subquery retrieves its own set of relevant chunks, and the combined results give the LLM better grounding data than a single monolithic query would.
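The overall flow can be sketched as follows. This is purely illustrative: plan_subqueries and run_hybrid_search are hypothetical stand-ins, not Azure SDK calls, and the hard-coded subqueries mirror the example above:

```python
from concurrent.futures import ThreadPoolExecutor

def plan_subqueries(question, chat_history=()):
    # Stand-in for the LLM query-planning step (GPT-4o / GPT-4.1 in the
    # real service); hard-coded here to mirror the example above
    return [
        "ControlUp uninstallation steps",
        "AppDXHelper.exe process running after reboot troubleshooting",
    ]

def run_hybrid_search(subquery):
    # Stand-in for a hybrid query with semantic ranking against the index
    return [f"chunk retrieved for: {subquery}"]

def agentic_retrieve(question, chat_history=()):
    subqueries = plan_subqueries(question, chat_history)    # 1. query planning
    with ThreadPoolExecutor() as pool:                      # 2. parallel execution
        result_sets = list(pool.map(run_hybrid_search, subqueries))
    grounding = [chunk for results in result_sets for chunk in results]
    return {"subqueries": subqueries, "grounding": grounding}  # 3. synthesis
```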

When to Use Agentic Retrieval vs Classic Hybrid

| Use agentic retrieval when | Use classic hybrid when |
|---|---|
| Queries are complex or conversational | Queries are straightforward |
| You need the highest possible relevance | Simplicity and speed are priorities |
| Building new RAG / chatbot applications | You need GA (generally available) features |
| Multiple "asks" in a single question | You need fine-grained control over the query pipeline |

Agentic retrieval adds latency (due to the LLM query planning step) but compensates with significantly better recall on complex questions.

Every technique covered in the previous NLP articles has a direct counterpart in Azure AI Search. The table below maps what we built in Python to what Azure operates at scale:

| NLP Concept (from previous articles) | Where It Appeared | Azure AI Search Equivalent |
|---|---|---|
| Tokenization (nltk.word_tokenize, text.split()) | NLP Introduction, Text Processing with NLTK | Language analyzers split text into tokens during indexing. Azure supports 50+ language-specific analyzers that handle tokenization rules for each language |
| Stop word removal (stopwords.words("english")) | Text Processing with NLTK, Intro Exercises | Built into Azure's language analyzers. Stop words are filtered during text analysis, reducing index noise |
| Stemming / Lemmatization (Porter Stemmer, WordNet Lemmatizer) | Text Processing with NLTK, Intro Exercises | Language analyzers apply language-aware normalization. The English analyzer uses lemmatization-style rules so "running" and "ran" match the same index entry |
| TF-IDF (TfidfVectorizer) | NLP Introduction, Sentiment Analysis | BM25 scoring - the production evolution of TF-IDF. Same intuition (rare terms score higher), better handling of term saturation and document length |
| CountVectorizer / Bag of Words | NLP Introduction | The inverted index is essentially a scaled-up bag-of-words structure - it maps terms to documents, just like CountVectorizer maps terms to frequency counts |
| Embeddings (SentenceTransformer.encode()) | NLP Introduction | Vector fields store embeddings generated by models like text-embedding-ada-002. Same concept, same math, enterprise-scale infrastructure |
| Cosine similarity / Dot product (np.dot, cosine_similarity) | NLP Introduction, Sentiment Analysis | HNSW and eKNN algorithms perform the same similarity computations across vector indexes containing millions of embeddings |
| Named Entity Recognition (spaCy, NLTK NER) | NLP Introduction, Text Processing with NLTK | AI enrichment pipeline - Azure's built-in skills extract entities, key phrases, and other structured information during indexing |
| Text classification (Transformers pipeline) | NLP Introduction | Semantic ranker uses deep learning models (adapted from Bing) to classify result relevance - reading comprehension applied to search ranking |
| Preprocessing pipeline (clean_text, clean_and_tokenize) | All previous articles | Skillsets and analyzers chain together preprocessing steps: chunking, text extraction, language detection, entity extraction, vectorization - the same pipeline concept, declaratively configured |
| Domain-specific stop words | Intro Exercises | Custom analyzers let you define domain-specific stopword lists, synonym maps, and token filters tailored to your content |
| Class imbalance / Scoring bias | Sentiment Analysis | Scoring profiles let you boost or penalize specific fields, terms, or document attributes - addressing the same "some signals matter more than others" problem |

The Pattern

The progression across these articles follows the same trajectory as the field itself:

  1. Text Processing with NLTK - tokenize, stem, tag, extract entities from individual documents
  2. Sentiment Analysis - vectorize text with TF-IDF, measure similarity, train classifiers
  3. NLP Introduction - generate dense embeddings, perform semantic search across a small corpus
  4. Azure AI Search (this article) - run all of the above in parallel, across millions of documents, behind an API, with production-grade ranking

Each step builds on the previous one. Azure AI Search does not replace the NLP fundamentals, it puts them into action.

Key Takeaways

  1. Azure AI Search is a production-scale implementation of the NLP techniques covered in the previous articles (tokenization, TF-IDF, embeddings, similarity search)
  2. Full-text search uses inverted indexes and BM25 scoring (an evolution of TF-IDF) to match keywords precisely and quickly
  3. Vector search uses dense embeddings and nearest neighbor algorithms to find semantically similar content even without keyword overlap
  4. Hybrid search runs both in parallel and merges results with Reciprocal Rank Fusion, combining the precision of keywords with the recall of vectors
  5. Microsoft's benchmarks show that hybrid search with semantic ranking consistently outperforms keyword only or vector only approaches across all tested datasets
  6. Hybrid search is Microsoft's recommended approach for knowledge base and RAG applications because knowledge base content demands both exact-match precision and conceptual understanding
  7. Agentic retrieval extends hybrid search by adding LLM powered query planning, decomposing complex questions into focused subqueries for better coverage
  8. Every NLP concept from the previous articles - tokenization, stop words, stemming, TF-IDF, embeddings, cosine similarity, NER, classification - has a direct operational counterpart in Azure AI Search