End-to-End RAG Explained: From Chunking to Evaluation

Generative AI / Applied AI Engineer building production-grade LLM applications, RAG pipelines, and AI systems.

Retrieval-Augmented Generation (RAG) is one of the most commonly asked topics in Generative AI interviews today.

But most explanations stop at theory.

In this blog, I’ll explain RAG from a production engineer’s perspective, covering chunking, embeddings, retrieval strategies, evaluation, and hallucination control, exactly how it is discussed in real AI interviews.

I recently built a production-grade RAG Studio where teams could ingest documents, experiment with retrieval strategies, and serve reliable answers via API. Here’s how the system worked end-to-end.

Introduction

Retrieval-Augmented Generation (RAG) is one of the most important concepts in modern Generative AI systems.

Large Language Models are powerful, but they have limitations. They are trained on static data, cannot access real-time information, and may generate incorrect or outdated answers.

RAG solves this problem by combining information retrieval with text generation.

Instead of relying only on what the model learned during training, RAG allows the system to fetch relevant external information and use it while generating responses.

This makes the system more accurate, reliable, and adaptable to changing data.

What is RAG?

A RAG (Retrieval-Augmented Generation) system has two main components: the retriever and the generator.

The retriever searches and collects relevant information from external sources such as databases, documents, PDFs, or knowledge bases.

The generator, usually a large language model, uses this retrieved information to produce a clear and structured response.

The retriever ensures that the system uses up-to-date and contextually relevant information. The generator then combines this retrieved context with its language understanding ability to generate accurate answers.

Without retrieval, the model relies only on its training data. With retrieval, the model becomes grounded in real, external knowledge.

Together, they provide more accurate, explainable, and reliable responses than a standalone language model.
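
The retriever-plus-generator split can be sketched in a few lines. This is a minimal illustration, not the production system: the word-overlap retriever and the `echo_llm` generator are hypothetical stand-ins for a real vector search and a real LLM call.

```python
from typing import Callable, List


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc.lower().split())), doc) for doc in documents
    ]
    # Highest overlap first; drop documents that share no words at all.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def answer(query: str, documents: List[str], generate: Callable[[str], str]) -> str:
    """Retrieve context, build a grounded prompt, and call the generator."""
    context = "\n".join(retrieve(query, documents))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)


docs = [
    "Refunds for enterprise customers are processed within 30 days.",
    "Our office is closed on public holidays.",
]

# echo_llm is a hypothetical stand-in: it returns the prompt it was given,
# so you can inspect exactly what a real model would receive.
echo_llm = lambda prompt: prompt
print(answer("What is the refund policy for enterprise customers?", docs, echo_llm))
```

Swapping `retrieve` for a real search and `generate` for a real model call keeps the same shape: the generator only ever sees the query plus whatever the retriever returned.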

Why Chunking Matters in RAG

Chunking is one of the most critical design decisions in a RAG system.

  1. LLM Context Limit:
    Large Language Models have fixed context windows, so you usually cannot send entire documents to the model. If chunks are too large → you exceed token limits. If chunks are too small → each chunk loses its surrounding meaning, and the generated response suffers.

  2. Embedding Window:
    Embedding models also have token limits. If text exceeds the embedding window, it gets truncated, leading to poor vector representations.

  3. Retrieval Precision:
    Retrieval works at the chunk level.
    If chunks mix unrelated topics, retrieval becomes noisy.
    If chunks are too tiny, relevant context may get split and missed.

Good chunking directly improves retrieval quality and final answer accuracy.

Types of Chunking

  • Recursive Character Chunking
    Splits text using separators (paragraph → sentence → word) until size limit is reached.

  • Semantic Chunking
    Uses embeddings to detect topic shifts. Chunks are created when meaning changes, not just character count.

  • Sentence/Paragraph-Based Chunking
    Splits naturally by structure (periods, paragraphs). Simple and often effective for clean documents.

  • Sliding Window Chunking
    Creates overlapping chunks to preserve boundary context.

    Example visual:
    [ Paragraph 1 ]
    [ Paragraph 2 ]
    [ Paragraph 3 ]

    → Chunk 1 (P1 + P2)
    → Chunk 2 (P2 + P3)

    The overlap preserves continuity. If a query relates to P2, both chunks may partially match, improving recall while still maintaining context flow. This overlap mechanism reduces boundary loss, one of the most common hidden problems in RAG systems.
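
The P1/P2/P3 visual above can be reproduced with a small sliding-window function. This is a sketch that assumes the text has already been split into units (paragraphs here; tokens or sentences would work the same way). The function name and parameters are illustrative.

```python
from typing import List


def sliding_window_chunks(units: List[str], window: int, overlap: int) -> List[List[str]]:
    """Group units into windows of `window` items that share `overlap` items."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + window])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + window >= len(units):
            break
    return chunks


paragraphs = ["P1", "P2", "P3"]
print(sliding_window_chunks(paragraphs, window=2, overlap=1))
# [['P1', 'P2'], ['P2', 'P3']]
```

A larger overlap improves boundary coverage but stores and embeds more duplicated text, so it is a cost trade-off rather than a free win.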

Which chunking strategy did we use in production?

We wanted chunks that preserve meaning but still stay within predictable token bounds for retrieval and prompting. That’s why we used a hybrid chunking strategy: semantic chunking for meaning preservation (512–800 tokens per chunk), a sliding window overlap (50–100 tokens), and modality-aware splitting for non-textual data like images.

Types of Embedding Models

Type             | Examples            | Infra Required | Use Cases
OpenAI (hosted)  | text-embedding-ada  | No             | Fast production
Open source      | BGE, E5, MiniLM     | Yes            | Cost control
Domain-specific  | Legal/Bio models    | Yes            | Legal/Bio domains

Why did we choose the Ada embedder?

We chose Ada because:

  • It is trained on a vast amount of data

  • High-quality semantic similarity

  • No infrastructure overhead

  • Easy API scaling, stable for production systems

We decided against open-source embedders because self-hosting them adds operational overhead: GPU infrastructure, scaling, and latency management.

For fast-moving production APIs, managed embeddings reduce operational risk.

Retrieval Techniques

  1. BM25 (Keyword Search)
    - Exact keyword matching
    - Strong for structured queries
    - Fails with paraphrased queries

  2. Vector Similarity
    - Captures semantic meaning
    - Handles paraphrasing
    - May miss exact keyword importance

  3. Hybrid Retrieval (What We Used)
    We combined BM25 + vector similarity.
    It balances exact term matching and semantic understanding, which is critical for large, unstructured enterprise data. It gave us more reliable recall than vector-only search and reduced false negatives (relevant chunks that were missed).

Vector-only search sometimes retrieves chunks that are semantically similar but contextually irrelevant. Keyword-based search only matches exact terms, so if the keywords in the query are mistyped or paraphrased, retrieval fails.
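
To make the vector similarity option concrete, here is a minimal cosine-similarity search in plain Python. The 3-dimensional vectors and chunk names are made up for illustration. Real embeddings have hundreds or thousands of dimensions, and a vector database would replace the linear scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k most similar."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, invented for this example only.
chunk_vectors = {
    "refund_policy": [0.9, 0.1, 0.0],
    "return_guidelines": [0.8, 0.3, 0.1],
    "office_hours": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

for chunk_id, score in top_k_by_similarity(query_vector, chunk_vectors, k=2):
    print(chunk_id, round(score, 3))
```

Note that nothing here looks at the words themselves, which is why a semantically close but contextually wrong chunk can still score well.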

Example

Query:
What is the refund policy for enterprise customers?

Vector search may match:
Return guidelines for business clients

BM25 ensures the exact term “refund” is matched.

Hybrid retrieval gives the best of both worlds.
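
One common way to merge a BM25 ranking with a vector ranking is Reciprocal Rank Fusion (RRF). The sketch below assumes RRF as the fusion method. This post does not say which fusion method the production system used, and the chunk IDs and the constant `k=60` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Chunks ranked well by both retrievers accumulate the highest score.
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical results for "What is the refund policy for enterprise customers?"
bm25_ranking = ["refund_policy", "billing_faq", "return_guidelines"]
vector_ranking = ["return_guidelines", "refund_policy", "office_hours"]

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
# ['refund_policy', 'return_guidelines', 'billing_faq', 'office_hours']
```

RRF only needs ranks, not raw scores, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.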

Re-ranking

Re-ranking is the second stage of a two-stage retrieval process: a more precise but slower model (such as a Cross-Encoder) re-orders an initial set of search results, which can significantly improve relevance. Re-ranking can also improve MRR (mean reciprocal rank), a metric that measures how early the first correct doc appears in the results.

Figure: RAG pipeline enhanced with re-ranking (Image credit: Nvidia)

  1. In the first stage, the system retrieves a group of candidate documents or passages (chunks) using traditional retrieval techniques such as BM25 keyword search or vector similarity search with embeddings. These methods are optimized for speed and are able to quickly identify documents that appear related to the user’s query. However, because they rely on approximate similarity or keyword matching, the ranking they produce (top_k) is not always perfectly aligned with the actual meaning or intent of the query.

  2. To improve this, the second stage introduces a re-ranking model, often a Cross-Encoder or an LLM-based ranking model. It calculates a matching score for a given query and document pair. This score can then be utilized to rearrange previously fetched chunks or vector search results, ensuring that the most relevant results are prioritized at the top of the list.

By evaluating the retrieved chunks more carefully using a stronger model or LLM, it ensures that the most semantically relevant chunks move to the top before the final context selection is made. This significantly improves the quality of the context passed to the LLM, leading to more accurate, reliable, and contextually grounded responses.
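
The two-stage flow and its effect on MRR can be sketched as follows. The `word_overlap_score` function is a toy stand-in for a Cross-Encoder or LLM scorer, used only so the example runs without a model. All names here are illustrative.

```python
from typing import Callable, List, Set


def rerank(query: str, candidates: List[str],
           score_fn: Callable[[str, str], float], top_n: int) -> List[str]:
    """Second stage: score each (query, chunk) pair and keep the best top_n."""
    ordered = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ordered[:top_n]


def word_overlap_score(query: str, doc: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query words found in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


def mean_reciprocal_rank(ranked_lists: List[List[str]], relevant: List[Set[str]]) -> float:
    """Average of 1 / rank of the first relevant item per query (0 if none found)."""
    total = 0.0
    for ranking, relevant_docs in zip(ranked_lists, relevant):
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant_docs:
                total += 1.0 / rank
                break
    return total / len(ranked_lists) if ranked_lists else 0.0


query = "refund policy enterprise"
first_stage = [
    "office hours and holidays",
    "refund policy for enterprise customers",
    "return guidelines for business clients",
]
gold = [{"refund policy for enterprise customers"}]

reranked = rerank(query, first_stage, word_overlap_score, top_n=2)
print(mean_reciprocal_rank([first_stage], gold))  # 0.5 (relevant chunk at rank 2)
print(mean_reciprocal_rank([reranked], gold))     # 1.0 (relevant chunk at rank 1)
```

In a real system `score_fn` would wrap the re-ranking model, and `top_n` would be smaller than the first-stage `top_k` so that only the best chunks reach the prompt.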

Prompting Strategies in RAG

A prompt is essentially the text input or query given to an LLM to guide its behavior. It defines the task, provides instructions, and sometimes includes examples or reasoning steps.

Prompt Engineering or Prompting Techniques

Prompt engineering refers to the practice of carefully designing the input instructions or queries (prompts) given to large language models (LLMs) so that they produce more accurate, relevant, and useful responses. Since LLMs generate outputs based on the instructions they receive, the structure and wording of a prompt can significantly influence the quality of the result.

In Retrieval-Augmented Generation (RAG) systems, prompt engineering becomes even more important. In RAG, an LLM receives both the user’s query and retrieved documents from a knowledge base. The prompt must clearly instruct the model on how to use the retrieved context, how to structure the answer, and how to avoid hallucinations. Well-designed prompts help ensure the model focuses on the provided documents and generates responses grounded in the retrieved information.

Below are some commonly used prompt engineering techniques that are also widely used in RAG systems.

  • Zero-Shot Prompting : Zero-shot prompting means asking the model to perform a task without providing any examples. The model relies entirely on its pretrained knowledge to generate a response.

    Example:
    You are a helpful assistant. Answer the user's question using only the information from the provided context.

    Context: {retrieved_documents}

    Question: How does vector search work in a RAG system?

    If the answer is not in the context, say "The information is not available in the provided documents."

    In this case, the model receives instructions but no examples, so it generates the answer from the retrieved context and its pretrained knowledge alone.

  • Few-Shot Prompting : Few-shot prompting improves performance by providing a few examples of the expected behavior for a prompt. These examples demonstrate how the model should interpret the context and structure its response.

    Example:
    You are a technical assistant that answers questions using the given context.

    Example 1:
    Context: Vector databases store embeddings that represent text semantically.
    Question: What is the purpose of a vector database?
    Answer: A vector database stores embeddings of text, allowing systems to perform semantic similarity search instead of keyword matching.

    Example 2:
    Context: Re-ranking improves retrieval quality by evaluating the relevance between a query and retrieved documents.
    Question: Why is re-ranking used in RAG systems?
    Answer: Re-ranking is used to reorder retrieved documents based on deeper semantic relevance so the most useful context is passed to the language model.

    Now answer the following:
    Context: {retrieved_documents}
    Question: {user_query}

    Here, the examples show the expected reasoning style and answer format, helping the model generate more consistent responses.

  • Chain-of-Thought (CoT) Prompting : Chain-of-Thought prompting encourages the model to perform step by step reasoning before producing the final answer.

    Example:
    Use the provided context to answer the question. Explain your reasoning step by step before giving the final answer.

    Context: {retrieved_documents}

    Question: Why is chunking necessary in a RAG pipeline?

    Steps to follow:
    1. Explain the context length limitation of LLMs.
    2. Explain why large documents cannot be directly embedded.
    3. Explain how chunking improves retrieval quality.
    4. Provide the final summarized answer.

    By guiding the model through structured reasoning, the output becomes more logical, transparent, and easier to understand.

  • Tree-of-Thought Prompting : The model explores multiple reasoning paths, evaluates them, and then chooses the most appropriate one.

    This approach is useful when a problem may have multiple possible approaches or interpretations, such as architecture decisions or optimization strategies.
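
In code, the prompts above are just templates filled with the retrieved chunks and the user query. Here is a minimal sketch using the zero-shot wording from this section. The function name and the `[1]`, `[2]` chunk numbering are my own additions for illustration.

```python
from typing import List

RAG_PROMPT_TEMPLATE = """You are a helpful assistant. Answer the user's question using only the information from the provided context.

Context:
{context}

Question: {question}

If the answer is not in the context, say "The information is not available in the provided documents."
"""


def build_rag_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Fill the zero-shot template with numbered chunks and the user question."""
    # Numbering the chunks makes it easier to tell sources apart in the prompt.
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return RAG_PROMPT_TEMPLATE.format(context=context, question=question)


chunks = [
    "Vector search compares the query embedding with stored chunk embeddings.",
    "BM25 ranks chunks by keyword overlap with the query.",
]
print(build_rag_prompt("How does vector search work in a RAG system?", chunks))
```

Few-shot and Chain-of-Thought variants follow the same pattern: only the template text changes, while the context and question slots stay the same.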

RAG Evaluation

For RAG (Retrieval-Augmented Generation) evaluation, interviewers expect you to cover both retrieval quality + generation quality clearly.

Retrieval Metrics (focus on retrieval part)

1. Recall@k

  • Definition: Out of all relevant docs, how many are retrieved?
    In other words, recall compares the relevant chunks retrieved in the top k against all the relevant chunks that exist in the vector database.
    Formula: relevant retrieved docs / total relevant docs

Example:

  • Relevant docs = 5, relevant docs retrieved in the top k = 3
    Recall = 3/5 = 0.6

High recall = you didn’t miss important info: most of the relevant docs that could be retrieved were retrieved.

2. Precision@k

  • Definition: Out of all retrieved docs, how many are relevant?
    In other words, precision compares the relevant retrieved chunks against all the retrieved chunks.
    Formula: relevant retrieved docs / total retrieved docs

Example:

  • Top 5 retrieved docs → 3 relevant
    Precision = 3/5 = 0.6

High precision = less noise
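
Both retrieval metrics are a few lines of code once you have a labeled set of relevant chunk IDs. This sketch assumes the retrieved IDs are unique. The IDs themselves are made up to match the 3/5 examples above.

```python
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Relevant docs found in the top k, divided by all relevant docs."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Relevant docs found in the top k, divided by the docs actually returned."""
    top = retrieved[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc in top if doc in relevant)
    return hits / len(top)


relevant_docs = {"d1", "d2", "d3", "d4", "d5"}    # 5 relevant docs exist
retrieved_docs = ["d1", "x1", "d2", "x2", "d3"]   # top 5 results, 3 relevant

print(recall_at_k(retrieved_docs, relevant_docs, k=5))     # 0.6
print(precision_at_k(retrieved_docs, relevant_docs, k=5))  # 0.6
```

The two numbers coincide here only because k equals the number of relevant docs. At k=1 the same results give precision 1.0 but recall 0.2.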

Generation Metric (focus on generated answers)

3. Faithfulness (hallucination check)

  • We need to check whether the answers given by our RAG pipeline are grounded. Sometimes an LLM hallucinates and makes up answers that are not based on the provided docs.

  • Definition: Is the answer grounded in the retrieved context?

  • Example:

    • Context: “Paris is the capital of France”
      Model says: “Paris is the capital of Germany” ❌
      The faithfulness score will be low.

4. Answer relevance (generation quality)

  • Definition: Does the answer actually address the question?

  • Example:

    • Q: “What is RAG?”

    • A: “RAG uses retrieval + LLM” ✅

    • A: “LLMs are powerful” ❌

LLM-based Evaluation (Modern approach)

5. LLM-as-a-judge:

LLM-as-a-Judge is an evaluation methodology where an LLM (e.g., an OpenAI model) is used to assess the quality of outputs produced by another LLM application. Instead of relying solely on human reviewers or simple heuristic metrics, you prompt a capable model (the “judge”) to evaluate correctness, faithfulness, and relevance.

  • Widely used in production.

  • Example prompt:
    Question: ...
    Context: ...
    Answer: ...
    Evaluate: Is the answer grounded in the context? (Yes/No)
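
A judge check can be wired up as a prompt builder plus a verdict parser. In this sketch the `judge` argument is any callable that sends a prompt to an LLM and returns its text. `stub_judge` is a hypothetical stand-in so the example runs offline. It is not a real evaluator.

```python
from typing import Callable

JUDGE_TEMPLATE = """Question: {question}
Context: {context}
Answer: {answer}
Evaluate: Is the answer grounded in the context? Reply with only Yes or No."""


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Fill the judge template with the three pieces being evaluated."""
    return JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)


def is_grounded(question: str, context: str, answer: str,
                judge: Callable[[str], str]) -> bool:
    """Ask the judge model and treat any reply starting with 'yes' as grounded."""
    verdict = judge(build_judge_prompt(question, context, answer))
    return verdict.strip().lower().startswith("yes")


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call, hard-coded for this demo."""
    return "No" if "Germany" in prompt else "Yes"


print(is_grounded(
    "What is the capital of France?",
    "Paris is the capital of France",
    "Paris is the capital of Germany",
    stub_judge,
))  # False
```

Constraining the judge to a Yes/No reply keeps parsing trivial. Asking for a short justification as well makes disagreements easier to audit, at the cost of more output tokens.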

Frequently Asked Questions on RAG

  1. How will you improve retrieval quality?
    To answer this, it’s important to note that retrieval depends on earlier steps such as document parsing, chunking, embedding generation, and the retrieval strategy itself. Therefore, we can improve retrieval quality by focusing on the following:
    a) Improve embeddings (better model, domain-specific if needed)
    b) Use hybrid search (vector + keyword)
    c) Experiment and tune chunk size & overlap
    d) Add metadata filtering (time, category)
    e) Use re-ranking models (cross-encoder)
    f) Optimize top-K selection
    👉 “I treat retrieval as a ranking problem, so I improve embeddings, filtering, and re-ranking.”

  2. Why did you choose a chunk size of 200 tokens for your resume RAG pipeline?
    a) It fits within the LLM context window efficiently.
    b) Smaller chunks (<200) may split related topics in the middle, while larger chunks (>200) can mix multiple unrelated topics.
    c) A 10–20% overlap helps handle boundary information loss.
    d) Resumes are dense with facts, and a concise work entry typically fits within this limit.

  3. How to handle hallucinations in RAG?
    a) Improving retrieval will usually improve generation, as discussed above.
    b) Use strict prompting (“answer only from context”).
    c) Experiment with other prompting techniques such as CoT (chain-of-thought), Tree-of-Thought, and few-shot examples.
    d) Use faithfulness checks or an LLM judge on the generated answers.

  4. Your RAG system slows down (latency increases) after a few weeks. What could be the reasons and solutions?
    Reasons:
    a) The vector DB has grown.
    b) No caching is used.
    c) Chunk size has increased.
    d) The index is poorly configured, contains duplicates, or distributes stored chunks unevenly.

    Solutions:
    a) Add Redis caching.
    b) Use smaller/faster models.
    c) Use async processing with batching, and stream responses.

  5. Why use an advanced chunking strategy for documents?
    Simple chunking breaks semantic meaning. Advanced chunking (semantic + overlap) preserves the semantic meaning and the information at chunk boundaries.

  6. Why use hybrid retrieval to fetch the top-K chunks?
    Vector search → semantic similarity
    Keyword search (BM25) → exact match (important terms)
    Hybrid retrieval captures both semantically related chunks and exact keyword matches.

  7. RAG vs. fine-tuning: when to use which?
    Use RAG when:
    a) Data is dynamic (frequently updated)
    b) You need source grounding for the generated answers
    c) The external knowledge base is large

    Use fine-tuning when:
    a) You need tight control over behavior/style
    b) Data is static (not changing frequently)

  8. How did I handle follow-up questions in RAG?
    1. Maintain session context: Store the last N queries + responses per user.
    2. Detect follow-up: Use embedding similarity with previous queries (high similarity → follow-up; otherwise → new query).
    3. Context carry-forward: Merge the previous query + current query, or pass the conversation history to the LLM.
    4. Retrieval adjustment: For a follow-up → search using the combined context; for a new query → fresh retrieval.
    5. Cache optimization: Cache only final responses, and use session-aware keys (user_id + context).
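
The follow-up handling in question 8 can be sketched with three small helpers. The 0.75 similarity threshold, the hashed key format, and the toy 2-dimensional embeddings are all assumptions for illustration. A real system would tune the threshold and use actual query embeddings.

```python
import hashlib
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_follow_up(current_vec: Sequence[float],
                 previous_vecs: List[Sequence[float]],
                 threshold: float = 0.75) -> bool:
    """Step 2: treat the query as a follow-up if it is close to any earlier query."""
    return any(cosine_similarity(current_vec, prev) >= threshold for prev in previous_vecs)


def build_retrieval_query(current_query: str, history: List[str],
                          follow_up: bool, last_n: int = 2) -> str:
    """Steps 3-4: merge recent queries for a follow-up, else search the query alone."""
    if not follow_up:
        return current_query
    return " ".join(history[-last_n:] + [current_query])


def session_cache_key(user_id: str, retrieval_query: str) -> str:
    """Step 5: session-aware key built from the user ID and a hash of the context."""
    digest = hashlib.sha256(retrieval_query.encode("utf-8")).hexdigest()[:16]
    return f"{user_id}:{digest}"


history = ["What is the refund policy?"]
follow_up = is_follow_up([1.0, 0.0], [[0.9, 0.1]])  # toy query embeddings
search_text = build_retrieval_query("And for enterprise customers?", history, follow_up)
print(search_text)  # What is the refund policy? And for enterprise customers?
print(session_cache_key("user_42", search_text))
```

Keying the cache on the merged query rather than the raw follow-up text prevents "And for enterprise customers?" from returning a cached answer that belonged to a different conversation.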