The accuracy of an enterprise AI assistant is only as good as its retrieval layer. If the system can't find the right information, the language model will either hallucinate an answer or produce a vague, unhelpful response. Azure AI Search provides the retrieval infrastructure for grounded enterprise AI - but getting the most out of it requires understanding the patterns that separate a prototype from a production system.
Hybrid Search: Vector Plus Keyword
Pure vector search and pure keyword search each have blind spots. Vector search excels at semantic matching - finding documents that are conceptually related to the query even when they don't share the same words. But it can miss exact matches: a search for "Policy 4.2.1" may return documents about policies in general rather than the specific document numbered 4.2.1.
Keyword search (BM25) excels at exact and near-exact matching. It will find "Policy 4.2.1" reliably. But it fails when the user's query uses different terminology from the documents - searching for "staff holiday entitlement" when the document says "employee annual leave allocation".
Azure AI Search's hybrid search combines both approaches in a single query. The search engine executes vector and keyword searches in parallel, then fuses the results using Reciprocal Rank Fusion (RRF). In our experience, hybrid retrieval consistently outperforms either approach alone on enterprise datasets, and it's the configuration we recommend as the default for every enterprise AI assistant deployment.
Adding semantic ranking as a re-ranking layer on top of hybrid search further improves precision. Semantic ranking uses a cross-encoder model to re-score the top results based on deep semantic understanding of the query-document relationship. The result is a retrieval pipeline that handles both exact lookups and conceptual queries with high accuracy.
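As a minimal sketch, a hybrid query with semantic re-ranking looks roughly like this using the azure-search-documents and openai Python SDKs. The endpoints, keys, index name, vector field, semantic configuration, and deployment names are placeholders to replace with your own:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Placeholder endpoints, keys, and names - substitute your own.
search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="enterprise-docs",
    credential=AzureKeyCredential("<search-key>"),
)
aoai = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)

query = "staff holiday entitlement"

# Embed the query with the same model used at indexing time.
embedding = aoai.embeddings.create(
    model="text-embedding-3-large",  # your embedding deployment name
    input=query,
).data[0].embedding

results = search_client.search(
    search_text=query,                      # keyword (BM25) leg
    vector_queries=[VectorizedQuery(        # vector leg, fused with RRF
        vector=embedding,
        k_nearest_neighbors=50,
        fields="content_vector",            # your vector field name
    )],
    query_type="semantic",                  # cross-encoder re-ranking on top
    semantic_configuration_name="default",  # your semantic configuration
    top=5,
)

for doc in results:
    # Field names depend on your index schema.
    print(doc["@search.reranker_score"], doc["title"])
```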
Chunking Strategies
How you break documents into chunks determines what the search index can find. Chunking is one of the most impactful - and most underestimated - design decisions in a RAG system.
Fixed-size chunking (e.g., 512 tokens with 128-token overlap) is the simplest approach and works well for homogeneous document collections. But it ignores document structure: a chunk boundary may fall in the middle of a paragraph, separating a heading from its content or splitting a table across two chunks.
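A minimal fixed-size chunker, assuming an OpenAI-style tokenizer via tiktoken, looks roughly like this:

```python
import tiktoken


def chunk_fixed(text: str, size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into fixed-size token windows with a sliding overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode(text)
    chunks = []
    for start in range(0, len(ids), size - overlap):
        window = ids[start:start + size]
        chunks.append(enc.decode(window))
        if start + size >= len(ids):  # the last window already covers the tail
            break
    return chunks
```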
Semantic chunking respects document structure. It uses headings, paragraph boundaries, and section markers to create chunks that correspond to logical units of information. Azure AI Search's built-in document cracking and integrated vectorisation support markdown-aware chunking that preserves heading hierarchy.
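If you chunk outside the integrated vectorisation pipeline, a simple heading-aware splitter for markdown illustrates the idea - this is a sketch, not the behaviour of the built-in skillset:

```python
import re


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split markdown at headings so each chunk is one logical section,
    keeping the heading together with its body text."""
    parts = re.split(r"(?m)^(#{1,3} .+)$", markdown)
    chunks, heading = [], None
    for part in parts:
        if re.match(r"^#{1,3} ", part):
            heading = part.strip()
        elif part.strip():
            chunks.append({"heading": heading, "content": part.strip()})
    return chunks
```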
Hierarchical chunking creates chunks at multiple levels of granularity - for example, both paragraph-level and section-level chunks for the same document. This allows the retrieval system to match at the right level of detail: a specific paragraph when the query is narrow, or an entire section when the query is broad.
For enterprise deployments, we typically recommend starting with semantic chunking (respecting document structure), adding hierarchical chunking for document types that benefit from it (e.g., technical manuals, legal contracts), and keeping chunk metadata (source document, section heading, page number) to enable precise citations in the RAG pipeline.
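Building on the heading-aware splitter sketched above, hierarchical chunks with citation metadata might look like the following; the field names are assumptions and should match your index schema:

```python
def build_index_documents(source: str, page: int, sections: list[dict]) -> list[dict]:
    """Emit both section-level and paragraph-level chunks, each carrying
    the metadata needed for precise citations."""
    docs = []
    for i, section in enumerate(sections):  # sections from chunk_by_headings()
        meta = {
            "source_document": source,
            "section_heading": section["heading"],
            "page_number": page,
        }
        # Section-level chunk: matches broad queries.
        docs.append({"id": f"{i}-section", "granularity": "section",
                     "content": section["content"], **meta})
        # Paragraph-level chunks: match narrow, specific queries.
        paragraphs = [p for p in section["content"].split("\n\n") if p.strip()]
        for j, paragraph in enumerate(paragraphs):
            docs.append({"id": f"{i}-{j}-paragraph", "granularity": "paragraph",
                         "content": paragraph, **meta})
    return docs
```

Once embedded, these documents can be pushed to the index with SearchClient.upload_documents, and the metadata fields surface alongside each hit for citation rendering.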
Query Rewriting
Users don't always write good search queries. A user might type "how does the thing work" when they mean "How does the automated invoice reconciliation process handle currency conversion?" Query rewriting transforms the user's raw input into one or more optimised search queries that are more likely to retrieve relevant results.
Common query rewriting techniques include:
- Query expansion - generating multiple reformulations of the original query and executing them in parallel. For example, the query "staff leave policy" might be expanded to "employee annual leave entitlement", "vacation policy guidelines", and "time off procedures".
- Hypothetical Document Embedding (HyDE) - using the LLM to generate a hypothetical answer to the query, then using that answer as the search query. This is particularly effective for vector search because the hypothetical answer is likely to be semantically closer to the actual document than the original question.
- Step-back prompting - reformulating a specific question into a more general one to retrieve broader context. "What's the maximum mortgage LTV for first-time buyers in Bavaria?" becomes "What are the mortgage lending criteria by region and buyer category?"
These techniques can be implemented as a pre-processing step before the search query hits Azure AI Search, using the same LLM that powers the assistant. The additional latency (typically 200-500ms) is well worth the improvement in retrieval relevance.
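As an illustration, a query-expansion pre-processing step against an Azure OpenAI chat deployment could look like this sketch; the deployment name and prompt wording are assumptions to adapt to your environment:

```python
from openai import AzureOpenAI

aoai = AzureOpenAI(
    azure_endpoint="https://<aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)


def expand_query(user_query: str, n_variants: int = 3) -> list[str]:
    """Generate alternative phrasings of the query; search with the original
    plus each variant in parallel and fuse the results."""
    response = aoai.chat.completions.create(
        model="gpt-4o",  # your chat deployment name
        messages=[
            {"role": "system", "content": (
                f"Rewrite the user's search query into {n_variants} alternative "
                "phrasings that an enterprise document might use. "
                "Return one phrasing per line and nothing else."
            )},
            {"role": "user", "content": user_query},
        ],
        temperature=0.7,
    )
    lines = response.choices[0].message.content.splitlines()
    variants = [line.strip() for line in lines if line.strip()]
    return [user_query] + variants[:n_variants]
```

The same structure works for HyDE: swap the system prompt for one that drafts a short hypothetical answer, then embed that answer instead of the raw question for the vector leg of the search.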
Evaluation: Measuring What Matters
You can't improve what you don't measure. Retrieval evaluation is the discipline of systematically testing whether your search pipeline returns the right documents for a given set of queries.
A robust evaluation framework includes:
- Golden datasets - curated sets of query-document pairs where the correct answer is known. These should cover the full range of query types your system handles, including edge cases and adversarial queries.
- Retrieval metrics - precision@k (what fraction of retrieved documents are relevant), recall@k (what fraction of relevant documents were retrieved), and Mean Reciprocal Rank (how high in the results the first relevant document appears); a minimal implementation sketch follows this list.
- End-to-end metrics - answer correctness (does the final response accurately reflect the source documents), groundedness (are all claims in the response supported by retrieved context), and completeness (does the response address all aspects of the query).
- Regression testing - running the evaluation suite automatically whenever the index, chunking strategy, or retrieval configuration changes. This is a core component of any LLMOps pipeline.
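To make the retrieval metrics above concrete, here is a minimal sketch of precision@k, recall@k, and MRR computed over a golden dataset of (retrieved, relevant) pairs:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / max(len(relevant), 1)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant document across queries."""
    total = 0.0
    for retrieved, relevant in runs:
        rank = next((i + 1 for i, d in enumerate(retrieved) if d in relevant), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(runs)
```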
Azure AI Foundry provides built-in evaluation capabilities for both retrieval and generation quality. We recommend integrating these evaluations into your CI/CD pipeline so that every change to the retrieval configuration is automatically tested against your golden dataset before it reaches production.
Putting It All Together
A production-grade retrieval pipeline for an enterprise copilot combines hybrid search, thoughtful chunking, query rewriting, and continuous evaluation into a system that reliably surfaces the right information. Each component reinforces the others: good chunking makes search more precise, query rewriting compensates for imperfect user inputs, and evaluation ensures everything works together as intended.
At Datenschaftler, we design and implement these retrieval architectures for enterprise clients across banking, manufacturing, and the public sector. The technology is proven - the challenge is adapting these patterns to your specific data, queries, and quality requirements.