Understanding RAG Fundamentals
🤔 1.1 – The Problem RAG Solves
To understand RAG, you first need to understand what's broken without it. Let's start with a real-world scenario.
The Scenario
Imagine you're building a chatbot for a law firm. Their lawyers need to query 50,000 legal documents. You try using GPT-4 directly:
- ❌ GPT-4's training data ends in 2023 – it knows nothing more recent, and it has never seen the firm's internal case files
- ❌ Even if you stuff documents into the prompt, the context window caps out around a couple hundred pages – a tiny fraction of 50,000 documents
- ❌ The model confidently makes up case names and legal precedents that don't exist
- ❌ Sending 50,000 documents to OpenAI with every query would run on the order of $500 per query
RAG is the solution to all four problems above.
The 4 Fatal Limitations of Raw LLMs
| Limitation | What It Means | Impact |
|---|---|---|
| Knowledge Cutoff | Model doesn't know anything after training date | Outdated answers, missed current events |
| Hallucination | Model confidently fabricates facts that sound right | Wrong info delivered with total confidence – dangerous in production |
| Context Window Limit | Max text the model can process at once (GPT-4: ~128K tokens ≈ ~200 pages) | Can't query 50,000 documents at once |
| No Private Knowledge | Model only knows public internet data from training | Can't answer questions about YOUR company's data |
💡 1.2 – What is RAG?
RAG = Retrieval-Augmented Generation: a technique that gives an LLM access to external knowledge by first retrieving relevant documents, then passing them to the LLM as context, and finally generating an answer grounded in those documents.
The Perfect Analogy
Think of an LLM as a very smart student who has read millions of books but can't bring those books to the exam room. Their memory is imperfect (hallucinations). RAG is like giving that student an open-book exam:
- 📚 The student doesn't need to memorize everything
- 🔍 They look up relevant pages before answering
- ✍️ They write answers grounded in the actual text
- ✅ Answers are accurate and verifiable
RAG in One Diagram
User query → embed the query → search the vector DB for similar chunks → add the top chunks to the LLM prompt → generate a grounded answer
Why RAG is Everywhere Now
| Company | RAG Use Case |
|---|---|
| Notion AI | RAG over your personal workspace notes |
| GitHub Copilot | RAG over your codebase for context-aware suggestions |
| Perplexity AI | RAG over real-time web search results |
| ChatGPT (with files) | RAG over uploaded PDFs and documents |
| Every enterprise AI chatbot | RAG over internal wikis, Confluence, Slack, policies |
🎯 1.3 – Tokens, Embeddings & Semantic Search
These three concepts are the vocabulary of RAG; you can't explain RAG in an interview without knowing them cold.
Tokens – What LLMs Actually See
LLMs don't process words; they process tokens. A token is roughly ¾ of a word ("RAG engineering" is about 3 tokens). Tokens matter because:
- Every API call costs money based on token count (input + output)
- Context window limits are measured in tokens (e.g., GPT-4o: 128K tokens)
- Your chunk size and retrieval strategy directly affect token usage and cost
```text
# Quick token estimation (rule of thumb):
1 token  ≈ ¾ of a word      ≈ 4 characters
1 page   ≈ ~500 words       ≈ ~650 tokens
1 novel  ≈ ~100,000 words   ≈ ~130,000 tokens

# GPT-4o pricing (as of 2024):
Input:  $5.00 per million tokens
Output: $15.00 per million tokens

# RAG cost control: only send RELEVANT chunks (500-1000 tokens)
# instead of the entire document (millions of tokens)
```
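The numbers above are rules of thumb; for exact counts you can ask the tokenizer itself. A minimal sketch with tiktoken (the sample string is ours):

```python
# Count real tokens instead of estimating from word count
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")   # loads this model's tokenizer
text = "RAG engineering turns documents into retrievable knowledge."
tokens = enc.encode(text)
print(len(tokens))           # exact token count for this tokenizer
print(enc.decode(tokens))    # decodes back to the original string
```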
Embeddings – Turning Words into Numbers
An embedding is a list of numbers (a vector) that represents the meaning of text. Similar meanings = similar vectors. This is how RAG "understands" that "automobile" and "car" mean the same thing without matching keywords.
Semantic Search vs Keyword Search
| Query: "How do I fix my car's engine?" | Keyword Search | Semantic Search |
|---|---|---|
| Would find: | "car engine repair" (exact words) | "automobile motor troubleshooting", "vehicle powertrain issues", "fixing ignition problems" |
| Misses: | Any synonym variation | Almost nothing relevant |
| How it works: | String matching (TF-IDF, BM25) | Vector similarity (cosine similarity) |
| Used in RAG: | Hybrid search (combined) | Primary retrieval method |
✂️ 1.4 – Chunking & Vector Databases
Why Chunking Exists
Imagine you have a 500-page PDF manual. You can't embed the whole thing as one vector β that loses all granularity. And you can't send the whole document to an LLM for every query (too expensive, hits context limit). So you chunk β split the document into smaller overlapping pieces, each gets its own embedding.
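A minimal sketch of that idea, assuming simple word-based splitting (the production splitters on Day 2 work on tokens and characters instead):

```python
# Fixed-size chunking with overlap (word-based for simplicity)
def chunk_words(text: str, chunk_size: int = 250, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap               # each chunk starts 200 words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last chunk reached the end
            break
    return chunks

doc = "word " * 1000                          # stand-in for a real document
pieces = chunk_words(doc)
print(f"{len(pieces)} chunks; adjacent chunks share 50 words of overlap")
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk.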
Vector Database – The Search Engine for Embeddings
A vector database stores millions of embedding vectors and can find the most similar ones to a query vector in milliseconds. It's the core retrieval infrastructure of every RAG system.
| Vector DB | Type | Best For | When to Use |
|---|---|---|---|
| ChromaDB | Open source, local | Learning, prototypes, small apps | Day 1-3 of your project |
| FAISS | Open source, in-memory | High-performance local search | Research, no persistence needed |
| Pinecone | Managed cloud | Production apps at scale | When you need managed infra |
| Weaviate | Open source / cloud | Complex queries, GraphQL interface | Enterprise features needed |
| Qdrant | Open source / cloud | Fast Rust backend, rich filtering | Performance-critical production |
| pgvector | PostgreSQL extension | Existing Postgres users | You already use PostgreSQL |
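Before the LangChain wrappers used from Day 2 onward, it helps to see the raw workflow once. A minimal sketch with the chromadb client (collection name and sentences are ours; if you don't pass vectors, Chroma embeds documents with its bundled default model):

```python
import chromadb

client = chromadb.Client()                       # in-memory instance, nothing persisted
collection = client.create_collection("demo_docs")

# Store: Chroma embeds these with its built-in default embedding model
collection.add(
    documents=["The cat sat on the mat", "Stocks fell sharply on Monday"],
    ids=["doc1", "doc2"],
)

# Query: the query text is embedded the same way, nearest neighbours returned
results = collection.query(query_texts=["feline resting on a rug"], n_results=1)
print(results["documents"])                      # [['The cat sat on the mat']]
```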
Hands-On Tasks – Day 1 Concepts
- Setup environment: `pip install openai chromadb langchain sentence-transformers tiktoken`
- Token counting: Use tiktoken to count tokens in a paragraph → see how text becomes numbers
- Generate your first embedding: Use sentence-transformers to embed 5 sentences, print the vector shape
- Semantic similarity: Calculate cosine similarity between "dog" and "puppy" vs "dog" and "python". Observe the difference.
- Manual chunking: Take any 3-page text, split into 250-word chunks with 50-word overlap manually in Python
```python
# Task: Your First Embedding + Similarity Check
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # free, fast, good

sentences = [
    "A dog is playing in the park",
    "A puppy is running outdoors",           # should be similar
    "Python is a programming language",      # should be different
    "Machine learning models learn patterns"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")    # (4, 384)

# Calculate similarity between all pairs
sim_matrix = cosine_similarity(embeddings)
print(f"Dog vs Puppy similarity: {sim_matrix[0][1]:.3f}")   # ~0.85 high!
print(f"Dog vs Python similarity: {sim_matrix[0][2]:.3f}")  # ~0.12 low

# Output: Dog vs Puppy ~0.85  → semantically related ✅
# Output: Dog vs Python ~0.12 → semantically unrelated ✅
```
📚 Day 1 Revision Notes
- RAG = retrieve relevant docs → augment LLM prompt → generate grounded answer
- 4 LLM limits RAG solves: hallucination, knowledge cutoff, context window, private data
- Token = basic LLM text unit (~¾ word) | Embedding = semantic meaning as a number vector
- Semantic search = search by meaning (embeddings) vs keyword search = string matching
- Chunking = split large docs into small overlapping pieces, each with its own embedding
- Vector DB = stores embeddings, finds similar ones fast → core infrastructure of every RAG system
- ChromaDB for learning → Pinecone/Qdrant for production
Interview Questions – Day 1 Concepts
1. A user asks your legal chatbot "What cases did we win in Q3?" and the LLM makes up 3 case names. What problem is this and how does RAG fix it?
2. Why can't you just send your entire company knowledge base to GPT-4 with every query?
3. "automobile" and "car" have different spellings but high semantic similarity. Why?
4. Why do we use chunking with overlap instead of just splitting into non-overlapping pieces?
5. Name 2 differences between ChromaDB and Pinecone.
Building the Core RAG Pipeline
🗺️ 2.1 – The Full RAG Pipeline Architecture
The RAG pipeline has two distinct phases, and understanding this split is critical for interviews. Indexing (offline): load → chunk → embed → store in a vector DB. Retrieval (online): embed the query → retrieve relevant chunks → augment the prompt → generate the answer.
📄 2.2 – Document Loading & Chunking Strategies
Document Loaders
| Source | LangChain Loader | Notes |
|---|---|---|
| PDF files | PyPDFLoader, PDFMinerLoader | PDFMiner handles complex layouts better |
| Word docs | Docx2txtLoader | Preserves paragraph structure |
| Websites | WebBaseLoader | Uses BeautifulSoup, strips HTML |
| CSV/Excel | CSVLoader | Each row becomes a document |
| Notion | NotionDirectoryLoader | Export Notion as markdown first |
| Code (Python, JS) | GenericLoader + parser | Language-aware splitting by functions |
| YouTube videos | YoutubeLoader | Uses transcript API |
Chunking Strategies – This is Where Most RAG Systems Fail
| Strategy | How It Works | Best For | Downside |
|---|---|---|---|
| Fixed Size | Split every N characters/tokens, overlap by X | Quick prototypes, general text | Can split mid-sentence, mid-thought |
| Recursive Character | Tries to split at paragraphs β sentences β words β chars | Most text types (LangChain default) | Chunks may be uneven |
| Semantic Chunking | Split when topic/meaning changes (embedding-based) | Long documents with topic shifts | Slower, needs embedding model |
| Document Structure | Split by headers, sections, paragraphs | Structured docs like manuals, wikis | Chunks can be too long or too short |
| Sentence-based | Split into individual sentences or sentence groups | FAQ, policy docs, Q&A content | Context loss across sentences |
```python
# Complete Document Loading + Chunking Example
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

# Step 1: Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
raw_docs = loader.load()
print(f"Loaded {len(raw_docs)} pages")

# Step 2: Count tokens to understand document size
enc = tiktoken.encoding_for_model("gpt-4")
total_tokens = sum(len(enc.encode(doc.page_content)) for doc in raw_docs)
print(f"Total tokens: {total_tokens} (~${total_tokens/1000 * 0.005:.2f} if sent directly)")

# Step 3: Smart chunking -- Recursive splits at natural boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # tokens per chunk (NOT characters)
    chunk_overlap=50,     # overlap to preserve context at boundaries
    length_function=lambda text: len(enc.encode(text)),
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)
chunks = splitter.split_documents(raw_docs)
print(f"Created {len(chunks)} chunks")
print(f"Sample chunk:\n{chunks[0].page_content[:200]}")
print(f"Chunk metadata: {chunks[0].metadata}")  # includes page number, source!
```
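The recursive splitter is the safe default. For the semantic-chunking strategy from the table above, LangChain ships an experimental splitter that breaks where the embedding similarity between adjacent sentences drops; a sketch, assuming the langchain-experimental package layout:

```python
# Semantic chunking -- split where the topic shifts, not at a fixed size
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",  # split at the largest similarity drops
)
semantic_chunks = semantic_splitter.split_documents(raw_docs)
print(f"Semantic chunking produced {len(semantic_chunks)} chunks")
```

It is slower and costs embedding calls at index time, which is why the table lists it for long documents with topic shifts rather than as a default.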
🔢 2.3 – Embedding Generation & Vector Storage
```python
# Complete Embedding + ChromaDB Storage Pipeline
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings  # free alternative
from langchain.vectorstores import Chroma

# Option A: OpenAI embeddings (paid, high quality)
# Requires OPENAI_API_KEY set in your environment
embed_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Cost: $0.02 per million tokens -- very cheap

# Option B: Free local embeddings (great for learning)
# Note: running this overrides Option A -- pick one
embed_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",    # 384 dimensions, fast
    model_kwargs={'device': 'cpu'}
)

# Step 4: Create vector store -- embeds and stores all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,                  # your chunked documents
    embedding=embed_model,             # embedding model
    persist_directory="./chroma_db",   # save to disk
    collection_name="company_docs"
)
print(f"Stored {vectorstore._collection.count()} embeddings!")

# To reload later without re-embedding:
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embed_model,
    collection_name="company_docs"
)
```
🔍 2.4 – Retrieval, Re-ranking & Prompt Augmentation
```python
# Step 5: Retrieval -- find relevant chunks for a query
retriever = vectorstore.as_retriever(
    search_type="similarity",   # or "mmr" for diverse results
    search_kwargs={"k": 4}      # retrieve top 4 chunks
)

query = "What is the parental leave policy?"
relevant_chunks = retriever.get_relevant_documents(query)

for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i+1} (page {chunk.metadata.get('page', '?')}):")
    print(chunk.page_content[:200])
    print()

# Step 6: Build augmented prompt
def build_rag_prompt(query: str, chunks: list) -> str:
    context = "\n\n---\n\n".join([c.page_content for c in chunks])
    return f"""You are a helpful assistant that answers questions based ONLY on the provided context.
If the answer is not in the context, say "I don't have that information in the provided documents."

CONTEXT:
{context}

QUESTION: {query}

ANSWER (based only on context above):"""

prompt = build_rag_prompt(query, relevant_chunks)
print(f"Total prompt tokens: ~{len(prompt.split()) * 4 // 3}")  # rough words-to-tokens estimate

# Step 7: Generate response with LLM
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",    # cheapest GPT-4 class model
    messages=[{"role": "user", "content": prompt}],
    temperature=0,          # 0 = deterministic, grounded answers
    max_tokens=500
)
answer = response.choices[0].message.content
print(f"Answer: {answer}")
```
LangChain RetrievalQA – The One-Liner Version
```python
# LangChain handles the whole pipeline in a few lines
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",             # "stuff" all chunks into prompt
    retriever=retriever,
    return_source_documents=True    # get source chunks back
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(f"Answer: {result['result']}")
print(f"Sources: {[d.metadata['source'] for d in result['source_documents']]}")
```
Chain Types – Interviewers Ask This
| Chain Type | How It Works | Best When |
|---|---|---|
| stuff | Stuff ALL chunks directly into one prompt | Few small chunks, short context needed |
| map_reduce | Run LLM on each chunk separately, then combine answers | Many chunks, parallel processing |
| refine | Start with first chunk, refine answer with each next chunk | Long documents, iterative refinement |
| map_rerank | Run LLM on each chunk, score relevance, pick best | Need most relevant single answer |
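Switching strategies is a one-argument change; a short sketch reusing the llm and retriever defined earlier:

```python
# map_reduce: one LLM call per retrieved chunk, then a combine call
qa_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever
)

# refine: answer from chunk 1, then iteratively refined with chunks 2..N
qa_refine = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=retriever
)

print(qa_map_reduce.invoke({"query": "Summarize the vacation policy"})["result"])
```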
Hands-On Tasks – Day 2
- Download a PDF (any manual, textbook chapter, or company policy – 5+ pages)
- Build the indexing pipeline: load → chunk → embed → store in ChromaDB. Print: number of chunks, sample chunk with metadata
- Build the retrieval pipeline: query → retrieve top 3 chunks → print them with their similarity scores
- Build the generation step: manually write the prompt, call OpenAI API, print answer
- Use LangChain RetrievalQA to do the same in 10 lines
- Test 5 different queries and note which ones return accurate vs inaccurate answers. Why?
📚 Day 2 Revision Notes
- 2 phases: Indexing (offline) = load → chunk → embed → store | Retrieval (online) = query → retrieve → augment → generate
- Chunking tip: 512-token chunks with 50-token overlap is a solid starting point for most documents
- RecursiveCharacterTextSplitter is the best default splitter; it tries natural boundaries first
- The same embedding model MUST be used for both indexing and retrieval; different models produce incompatible vectors
- Retriever k=4 is a good default: too few misses info, too many adds noise
- temperature=0 for RAG LLMs: you want deterministic, factual answers, not creative ones
- LangChain RetrievalQA wraps the whole pipeline; production code uses LCEL (LangChain Expression Language) instead
Interview Questions – Day 2
1. You index 1000 documents and then query "What is our vacation policy?" – describe every step that happens internally.
2. You use OpenAI for indexing embeddings but switch to HuggingFace for retrieval. Will it work? Why not?
3. What is chunk overlap and what happens if you set it to 0?
4. What does temperature=0 mean and why do RAG systems use it?
5. You have 20 retrieved chunks but the LLM context window only fits 5. What are your options?
Tools, Frameworks & Building Real APIs
🔑 3.1 – OpenAI API Deep Dive
OpenAI is the backbone of most RAG systems. You need to understand its API deeply for both implementation and interviews.
Key OpenAI Models for RAG
| Model | Use For | Context Window | Cost |
|---|---|---|---|
| gpt-4o-mini | Best value for RAG generation | 128K tokens | ~$0.15/1M input tokens |
| gpt-4o | Complex reasoning, highest quality | 128K tokens | ~$5/1M input tokens |
| text-embedding-3-small | Fast, cheap embedding for indexing | 8191 tokens input | $0.02/1M tokens |
| text-embedding-3-large | Highest quality embeddings | 8191 tokens input | $0.13/1M tokens |
```python
# OpenAI API -- Everything You Need for RAG
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## 1. Generate embeddings (for indexing documents)
def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding   # list of 1536 floats

## 2. Batch embedding (more efficient)
def embed_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,                    # send up to 2048 texts at once
        model="text-embedding-3-small"
    )
    return [item.embedding for item in response.data]

## 3. Chat completion with full control
def generate_answer(system_prompt: str, user_query: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"}
        ],
        temperature=0,
        max_tokens=800
    )
    return response.choices[0].message.content

## 4. Streaming response (better UX -- shows answer as it generates)
def stream_answer(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
```
⚙️ 3.2 – LangChain LCEL: Modern RAG Chains (Industry Standard)
LangChain Expression Language (LCEL) is the modern way to build RAG pipelines. It uses the pipe operator (|) to chain components β readable, composable, and production-ready.
```python
# LCEL: Modern LangChain RAG Chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Setup
embed = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embed)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Prompt template
prompt = ChatPromptTemplate.from_template("""
You are an expert assistant. Answer based ONLY on the context below.
If unsure, say "I don't know based on the provided documents."

Context: {context}

Question: {question}

Answer: """)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Helper to format retrieved docs into a string
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

# LCEL chain -- reads left to right like a pipeline
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Use it:
answer = rag_chain.invoke("What is the leave encashment policy?")
print(answer)

# Streaming (yields tokens as generated):
for token in rag_chain.stream("What is the leave encashment policy?"):
    print(token, end="", flush=True)
```
🚀 3.3 – Building a FastAPI RAG Backend (Portfolio Ready)
A Jupyter notebook RAG system is a prototype. A FastAPI app is a product. Here's how to build a production-ready RAG API that you can show to recruiters.
```python
# rag_api/main.py -- Production FastAPI RAG Backend
from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
import tempfile, os

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# === MODELS ===
class QueryRequest(BaseModel):
    question: str
    k: int = 4            # number of chunks to retrieve
    stream: bool = False  # streaming response?

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    chunks_used: int

# === GLOBALS ===
vectorstore = None
embed_model = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# === APP ===
app = FastAPI(
    title="RAG Document QA API",
    description="Upload documents and query them with AI",
    version="1.0.0"
)

@app.on_event("startup")
async def startup():
    global vectorstore
    # Load existing vector store if it exists
    if os.path.exists("./chroma_db"):
        vectorstore = Chroma(
            persist_directory="./chroma_db",
            embedding_function=embed_model
        )
        print("Loaded existing vector store")

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "docs_indexed": vectorstore._collection.count() if vectorstore else 0
    }

@app.post("/ingest")
async def ingest_document(file: UploadFile = File(...)):
    """Upload and index a PDF document"""
    global vectorstore
    if not file.filename.endswith('.pdf'):
        raise HTTPException(400, "Only PDF files supported")

    # Save uploaded file temporarily
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        loader = PyPDFLoader(tmp_path)
        docs = loader.load()
        splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
        chunks = splitter.split_documents(docs)

        # Add source filename to metadata
        for chunk in chunks:
            chunk.metadata["filename"] = file.filename

        # Create the store on first ingest, append on later ones
        # (add_documents returns IDs, so don't assign its result to vectorstore)
        if vectorstore is None:
            vectorstore = Chroma.from_documents(
                chunks, embed_model, persist_directory="./chroma_db"
            )
        else:
            vectorstore.add_documents(chunks)

        return {
            "message": f"Indexed {len(chunks)} chunks from {file.filename}",
            "chunks": len(chunks)
        }
    finally:
        os.unlink(tmp_path)

@app.post("/query", response_model=QueryResponse)
async def query_documents(req: QueryRequest):
    """Query the indexed documents"""
    if not vectorstore:
        raise HTTPException(404, "No documents indexed yet. Use /ingest first.")

    retriever = vectorstore.as_retriever(search_kwargs={"k": req.k})
    retrieved_docs = retriever.get_relevant_documents(req.question)

    prompt = ChatPromptTemplate.from_template("""Answer using ONLY the context.
If not in context, say "I don't have that information."

Context: {context}

Question: {question}

Answer:""")

    chain = (
        {"context": lambda x: "\n\n".join(d.page_content for d in retrieved_docs),
         "question": lambda x: x}
        | prompt | llm | StrOutputParser()
    )
    answer = chain.invoke(req.question)

    sources = list({d.metadata.get("filename", "unknown") for d in retrieved_docs})
    return QueryResponse(answer=answer, sources=sources, chunks_used=len(retrieved_docs))
```
🖥️ 3.4 – Streamlit Frontend for RAG
```python
# app.py -- Streamlit Chat UI for RAG
import streamlit as st
import requests

st.set_page_config(page_title="📄 Doc QA", layout="wide")
st.title("📄 AI Document Q&A")
st.caption("Upload a PDF and ask anything about it")

API_URL = "http://localhost:8000"

# Sidebar: document upload
with st.sidebar:
    st.header("📤 Upload Document")
    uploaded = st.file_uploader("Choose a PDF", type=["pdf"])
    if uploaded and st.button("Index Document"):
        with st.spinner("Indexing..."):
            resp = requests.post(f"{API_URL}/ingest", files={"file": uploaded})
            if resp.status_code == 200:
                st.success(resp.json()["message"])
            else:
                st.error("Failed to index")

# Chat interface with history
if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if question := st.chat_input("Ask about your document..."):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.write(question)

    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            resp = requests.post(f"{API_URL}/query", json={"question": question})
            if resp.status_code == 200:
                data = resp.json()
                st.write(data["answer"])
                st.caption(f"📚 Sources: {', '.join(data['sources'])}")
                st.session_state.messages.append(
                    {"role": "assistant", "content": data["answer"]}
                )

# Run: streamlit run app.py
# Backend: uvicorn rag_api.main:app --reload
```
Hands-On Tasks – Day 3
- Build the FastAPI backend with /ingest, /query, and /health endpoints
- Test with Postman or curl: upload a PDF β query it β verify answer + sources
- Add the Streamlit frontend β connect it to your FastAPI backend
- Add streaming: Modify /query to stream tokens using StreamingResponse + Server-Sent Events
- Add error handling: What if no docs are indexed? What if PDF is corrupted?
- Write a README.md with setup instructions and API documentation
📚 Day 3 Revision Notes
- text-embedding-3-small = best price/performance embedding; use it for learning and production
- gpt-4o-mini = best value LLM for RAG generation; temperature=0 always for RAG
- LCEL pipe syntax: `retriever | prompt | llm | parser` = the modern LangChain way
- FastAPI advantages: async, Pydantic validation, streaming, auto-docs at /docs
- Streaming = yield tokens as generated → better UX, the user sees the answer building in real time
- Streamlit = 10 lines for a working chat UI; session_state for conversation history
- Always add metadata (filename, page) to chunks → needed for source attribution in answers
Advanced RAG – The Techniques That Actually Work
🔀 4.1 – Hybrid Search: The Best of Both Worlds (Used in Production)
Pure semantic search (vectors) is great for conceptual questions but misses exact matches. BM25 keyword search is great for exact terms but misses synonyms. Hybrid search combines both.
```python
# Hybrid Search with LangChain + BM25
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Assume `chunks` is your list of LangChain Documents

# BM25 (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Vector (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble: 40% BM25, 60% semantic (tune these weights)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)

results = hybrid_retriever.get_relevant_documents("What is the API rate limit?")
# Better at finding "rate limit" (exact) AND "throttling/quota" (semantic)
```
⬆️ 4.2 – Re-ranking: The Secret Sauce (Biggest Quality Boost)
Vector similarity finds candidates. Re-ranking picks the best ones. A re-ranker is a cross-encoder model that evaluates a query + document pair together β much more accurate than just embedding similarity, but slower (that's why we use it only on top-K candidates).
```python
# Re-ranking with Cohere Rerank (cloud) or a cross-encoder (local)
import os

## Option A: Cohere Rerank API (easy, high quality)
import cohere
co = cohere.Client(os.getenv("COHERE_API_KEY"))

def rerank_documents(query: str, docs: list, top_n: int = 4):
    texts = [d.page_content for d in docs]
    results = co.rerank(
        query=query,
        documents=texts,
        top_n=top_n,
        model="rerank-english-v3.0"
    )
    return [docs[r.index] for r in results.results]

## Option B: Local cross-encoder (free)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_local(query: str, docs: list, top_n: int = 4) -> list:
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    # Sort by score descending, take top N
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_n]]

# Usage: retrieve 20, rerank to top 4
candidates = vectorstore.as_retriever(search_kwargs={"k": 20}).get_relevant_documents(query)
reranked = rerank_local(query, candidates, top_n=4)
```
🔄 4.3 – Query Transformation Techniques
Users write bad queries. Query transformation improves them before retrieval. This alone can boost RAG accuracy by 20-30%.
| Technique | How It Works | Best For |
|---|---|---|
| Multi-query | LLM generates 3-5 different phrasings of the query → retrieve for each → merge | Short or ambiguous queries |
| HyDE | LLM generates a hypothetical answer → embed that → find similar real docs | Complex questions, abstract topics |
| Step-back prompting | LLM generates a more general "step-back" question → retrieve broader context | Specific questions needing broader context |
| Query decomposition | Break complex multi-part question into sub-questions → answer each → combine | Complex multi-hop questions |
| Query expansion | Add synonyms and related terms to the query | Domain-specific terminology |
```python
# Multi-Query Retrieval -- most commonly used in production
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)  # slight creativity for variety

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    llm=llm
)
# For query "What's the leave policy?", generates:
#   1. "How many vacation days do employees get?"
#   2. "What are the sick leave rules?"
#   3. "How do I apply for time off?"
# Retrieves for all 3, deduplicates, returns the union

# HyDE -- Hypothetical Document Embeddings
hyde_prompt = ChatPromptTemplate.from_template("""
Write a short paragraph that would be a perfect answer to this question.
Write it as if it's from a document, not as a direct answer.

Question: {question}

Hypothetical document paragraph:""")

hyde_chain = hyde_prompt | llm | StrOutputParser()

def hyde_retrieve(question: str, retriever, k: int = 4):
    # Generate a hypothetical answer, use it as the retrieval query
    hypothetical_doc = hyde_chain.invoke({"question": question})
    return retriever.get_relevant_documents(hypothetical_doc)
```
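The table also lists query decomposition; a hedged sketch of it using plain prompts (the prompt wording is ours, reusing llm from above and the rag_chain built on Day 3):

```python
# Query decomposition: split a complex question, answer the parts, then synthesize
decompose_prompt = ChatPromptTemplate.from_template(
    "Break this question into 2-3 standalone sub-questions, one per line.\n"
    "Question: {question}\nSub-questions:"
)
decompose_chain = decompose_prompt | llm | StrOutputParser()

def decomposed_answer(question: str) -> str:
    lines = decompose_chain.invoke({"question": question}).splitlines()
    sub_questions = [ln.strip("-0123456789. ") for ln in lines if ln.strip()]
    sub_answers = [rag_chain.invoke(q) for q in sub_questions]   # answer each part
    combined = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(sub_questions, sub_answers))
    final = llm.invoke(
        f"Combine these partial answers into one final answer.\n\n{combined}\n\n"
        f"Original question: {question}"
    )
    return final.content
```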
Metadata Filtering – Precision at Scale
```python
# Metadata filtering: search only within specific documents/sections
# ChromaDB supports filtering during search

# When ingesting, add rich metadata:
for chunk in chunks:
    chunk.metadata.update({
        "department": "HR",
        "doc_type": "policy",
        "year": 2024,
        "filename": "hr_policy_2024.pdf"
    })

# Filter search to only HR department docs from 2024.
# Chroma requires an explicit $and when combining multiple conditions:
results = vectorstore.similarity_search(
    query="What is the leave policy?",
    k=4,
    filter={"$and": [{"department": "HR"}, {"year": 2024}]}
)
# Only searches HR 2024 docs -- much more precise, faster
```
📊 4.4 – RAG Evaluation: How Do You Know If It's Working?
You cannot improve what you don't measure. RAG evaluation is what senior AI engineers do that juniors skip. This is a high-value interview topic.
4 Key RAG Metrics
| Metric | Measures | Question It Answers |
|---|---|---|
| Context Recall | Did retrieval find all necessary information? | "Did we retrieve the right chunks?" |
| Context Precision | Are retrieved chunks relevant? No noise? | "Did we retrieve ONLY the right chunks?" |
| Answer Faithfulness | Is the answer grounded in retrieved context? | "Did the LLM use the context or hallucinate?" |
| Answer Relevancy | Does the answer actually address the question? | "Is the answer actually helpful?" |
```python
# RAG Evaluation with RAGAS (the standard library)
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # answer grounded in context?
    answer_relevancy,   # answer relevant to question?
    context_recall,     # retrieved right context?
    context_precision   # retrieved only relevant context?
)
from datasets import Dataset

# Build test dataset: question, generated answer, retrieved context, ground truth
test_data = {
    "question": ["What is the parental leave policy?",
                 "How do I apply for remote work?"],
    "answer": [generated_answer_1, generated_answer_2],
    "contexts": [[chunk.page_content for chunk in retrieved_1],
                 [chunk.page_content for chunk in retrieved_2]],
    "ground_truth": ["Employees get 26 weeks paid leave...",
                     "Submit a remote work request form..."]
}

dataset = Dataset.from_dict(test_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
# Example output:
#   faithfulness: 0.92      -> 92% of answers grounded in context ✅
#   answer_relevancy: 0.87  -> 87% of answers actually address the question ✅
#   context_recall: 0.79    -> missed some relevant chunks ⚠️ (tune chunk size)
```
Cost vs Accuracy Tradeoffs
| Decision | More Accurate (higher cost) | Cheaper (lower accuracy) |
|---|---|---|
| Embedding model | text-embedding-3-large ($0.13/M) | all-MiniLM-L6-v2 (free) |
| Generation model | gpt-4o ($5/M) | gpt-4o-mini ($0.15/M) |
| Chunks retrieved (k) | k=10 (more context) | k=3 (cheaper, less noise) |
| Re-ranking | Cohere re-ranker + k=20 initial | No re-ranking, k=4 |
| Query transformation | Multi-query (3x LLM calls) | Direct query (1 LLM call) |
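To make the table concrete, a back-of-the-envelope cost helper (the gpt-4o-mini input price comes from the Day 3 model table; the $0.60/M output price is our assumption, so treat both as snapshots):

```python
# Rough per-query cost for k retrieved chunks sent to gpt-4o-mini
def query_cost(k: int, chunk_tokens: int = 512, answer_tokens: int = 500,
               in_price: float = 0.15, out_price: float = 0.60) -> float:
    """Prices are $ per million tokens."""
    input_tokens = k * chunk_tokens + 200        # chunks plus prompt scaffolding
    return (input_tokens * in_price + answer_tokens * out_price) / 1_000_000

print(f"k=3:  ${query_cost(3):.6f} per query")
print(f"k=10: ${query_cost(10):.6f} per query")  # roughly 3x the input cost of k=3
```

At these prices retrieval depth is cheap per query; the tradeoff only bites at millions of queries, which is exactly when re-ranking with a smaller k pays off.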
📚 Day 4 Revision Notes
- Hybrid search = BM25 (exact keywords) + vector (semantic) → best of both, use EnsembleRetriever
- Re-ranking = retrieve 20 candidates → re-rank to top 4 → biggest quality boost per dollar
- Multi-query = LLM generates 3-5 query variations → retrieve for each → deduplicate and merge
- HyDE = generate hypothetical answer → embed it → find similar real docs (helps abstract questions)
- Metadata filtering = add rich metadata during indexing → filter during retrieval → precision + speed
- RAGAS metrics: faithfulness, answer_relevancy, context_recall, context_precision
- Cost optimization: free embeddings for prototypes → OpenAI small for production → use re-ranking to offset smaller k
Interview Questions – Day 4
1. A user queries "ARN format in AWS". Pure semantic search returns docs about "AWS resource naming conventions" but misses the exact doc saying "arn:aws:...". Which retrieval method would fix this?
2. You retrieve k=4 chunks for every query but your RAGAS context_recall is 0.62. What are 3 things you can try?
3. Explain the two-stage retrieval pattern. Why not just use re-ranker on all documents?
4. Your RAG system answers legal questions but sometimes cites the wrong year's policy. How does metadata filtering solve this?
5. Your faithfulness score is 0.55 (low). What does this mean and how do you fix it?
Production RAG – Real Architecture & Deployment
🏢 5.1 – How Real Companies Use RAG
| Company / Use Case | RAG Architecture | Key Challenge Solved |
|---|---|---|
| Customer Support Bot | RAG over help docs + ticket history + FAQs | Deflect 60% tickets, always up-to-date with product changes |
| Internal HR Chatbot | RAG over policy docs, Confluence, Notion | Employees get instant accurate policy answers 24/7 |
| Legal Document Review | RAG over case files, contracts, precedents | Lawyers query 50,000+ docs in seconds, with citations |
| Developer Docs Search | RAG over API docs, GitHub issues, SO posts | Developers find answers without manual searching |
| Sales Intelligence | RAG over call transcripts, CRM, market reports | Sales reps get tailored pitch points before every call |
| Medical Knowledge Base | RAG over clinical trials, drug references | Doctors query latest research grounded in evidence |
The Full Production RAG Architecture
Production systems wrap the core pipeline with an auth gateway, a query router, hybrid retrieval plus re-ranking, conversation memory, post-processing, and logging – each covered in the sections below.
💬 5.2 – Conversation Memory in RAG
Without memory, every query to your RAG system is independent. Users can't ask follow-up questions like "Tell me more about that" or "What about for remote employees?"
```python
# Conversational RAG with memory -- the classic chain approach
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain

# Keep last 5 conversation turns in memory
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    k=5,
    return_messages=True,
    output_key="answer"
)

conv_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    retriever=hybrid_retriever,
    memory=memory,
    return_source_documents=True
)

# Turn 1
r1 = conv_chain.invoke({"question": "What is the parental leave policy?"})
print(r1["answer"])

# Turn 2 -- system remembers we're talking about parental leave
r2 = conv_chain.invoke({"question": "Does it apply to adoption too?"})
print(r2["answer"])  # correctly contextualizes "it" = parental leave

# LCEL approach with manual history (more control):
from langchain_core.messages import HumanMessage, AIMessage

chat_history = []

def chat_with_memory(question: str) -> str:
    # Condense the question with history context into a standalone query
    condense_prompt = f"""Given the conversation history and new question,
rephrase the question to be standalone.
History: {chat_history[-4:]}
Question: {question}
Standalone question:"""
    standalone = llm.invoke(condense_prompt).content

    # Retrieve and answer using the standalone query
    # (rag_chain retrieves internally, so no separate retrieval call is needed)
    answer = rag_chain.invoke(standalone)

    # Update history
    chat_history.append(HumanMessage(content=question))
    chat_history.append(AIMessage(content=answer))
    return answer
```
🐳 5.3 – Deploying RAG with Docker + Cloud
Project Structure (GitHub-Ready)
```text
rag-pdf-chatbot/
├── backend/
│   ├── main.py              # FastAPI app
│   ├── rag/
│   │   ├── __init__.py
│   │   ├── ingestion.py     # Document loading + chunking
│   │   ├── embeddings.py    # Embedding model wrapper
│   │   ├── retrieval.py     # Vector store + retrieval
│   │   ├── generation.py    # LLM + prompt building
│   │   └── evaluation.py    # RAGAS evaluation
│   ├── tests/
│   │   └── test_rag.py
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/
│   └── streamlit_app.py
├── data/
│   └── sample_docs/
├── docker-compose.yml
├── .env.example             # Template for env vars (NO real keys)
├── .gitignore               # Include: chroma_db/, .env, __pycache__
└── README.md                # Architecture diagram + setup instructions
```
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (caching layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create directory for ChromaDB persistence
RUN mkdir -p /app/chroma_db

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```yaml
# docker-compose.yml
version: '3.8'
services:
  backend:
    build: ./backend
    ports: ["8000:8000"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - COHERE_API_KEY=${COHERE_API_KEY}
    volumes:
      - chroma_data:/app/chroma_db   # persist vector DB
  frontend:
    build: ./frontend
    ports: ["8501:8501"]
    depends_on: [backend]
    environment:
      - API_URL=http://backend:8000
volumes:
  chroma_data:

# Run: docker-compose up --build
```
Deployment Options
| Option | Cost | Effort | Best For |
|---|---|---|---|
| Railway.app | Free tier available | ⭐ Easiest | Portfolio demos, hackathons |
| Render.com | Free tier (sleeps) | ⭐⭐ Easy | Personal projects |
| AWS EC2 + Docker | ~$10-20/month | ⭐⭐⭐ Medium | Production, shows AWS skills |
| Google Cloud Run | Pay per request | ⭐⭐⭐ Medium | Scalable serverless |
| HuggingFace Spaces | Free for Streamlit | ⭐ Easiest | ML portfolio showcase |
⚠️ 5.4 – Production Pitfalls & Security
| Pitfall | Problem | Fix |
|---|---|---|
| Prompt injection | User crafts query to override system prompt | Input sanitization, separate system/user context, output validation |
| Context poisoning | Malicious doc in vector DB injects instructions | Sanitize documents at ingestion, separate trusted/untrusted sources |
| No rate limiting | Bot floods your API → $1000 OpenAI bill overnight | FastAPI rate limiting middleware, usage quotas per user/API key |
| Storing raw text | PII, secrets, confidential data in vector DB | PII detection at ingestion, access control, encryption at rest |
| No guardrails | LLM answers questions outside scope ("how to hack?") | Input/output guards (Guardrails AI, NeMo Guardrails) |
| Stale embeddings | Doc updated but old embedding still in DB → wrong answers | Document versioning, update/delete embeddings when source changes |
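For the rate-limiting row, a hedged sketch using slowapi, a common rate-limiting add-on for FastAPI (the quota values are ours):

```python
# Per-IP rate limiting for a RAG endpoint -- sketch, not production-hardened
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)   # key requests by client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")                      # tune per user tier / API key
async def query(request: Request):               # slowapi needs the Request object
    return {"answer": "..."}                     # a real handler would run the RAG chain
```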
📚 Day 5 Revision Notes
- Production RAG adds: auth gateway, query router, re-ranking, conversation memory, post-processing, logging
- Conversation memory = ConversationBufferWindowMemory (last K turns) → condense question with history before retrieval
- Docker structure: backend FastAPI + frontend Streamlit in docker-compose with a volume for ChromaDB
- Project structure: separate rag/ module with ingestion, embeddings, retrieval, generation, evaluation files
- Security must-haves: rate limiting, input sanitization, no hardcoded API keys, PII detection at ingestion
- Document updates: track document_id + hash → delete old chunks → re-index on change (see the sketch after these notes)
- Deploy to Railway/HuggingFace for portfolio, mention AWS EC2 in README for production credibility
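A minimal sketch of that incremental re-indexing pattern (load_and_chunk is a hypothetical stand-in for your ingestion code, and a real system would persist the hash map rather than keep it in memory):

```python
# Re-embed a file only when its content hash changes
import hashlib

indexed_hashes: dict[str, str] = {}    # filename -> sha256 of last indexed content

def upsert_document(path: str, filename: str):
    with open(path, "rb") as f:
        content_hash = hashlib.sha256(f.read()).hexdigest()
    if indexed_hashes.get(filename) == content_hash:
        return                                    # unchanged: skip re-embedding
    # Changed or new: drop stale chunks for this file, then re-ingest
    vectorstore._collection.delete(where={"filename": filename})  # Chroma where-filter
    chunks = load_and_chunk(path, filename)       # hypothetical ingestion helper
    vectorstore.add_documents(chunks)
    indexed_hashes[filename] = content_hash
```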
The PDF Chatbot Project + Job Readiness Kit
📌 The Mini Project: AI PDF Research Assistant
Build a full-stack RAG application that lets users upload multiple PDFs and have an intelligent conversation about their content. This is something real companies pay engineers to build.
Key Implementation: The /chat Streaming Endpoint
```python
# The most impressive feature: streaming RAG with sources
from fastapi.responses import StreamingResponse
import json

# Assumes ChatRequest (question + chat_history) and build_prompt()
# are defined alongside the earlier FastAPI models

@app.post("/chat")
async def chat_stream(req: ChatRequest):
    """Stream RAG response with sources"""
    # Retrieve with hybrid search, then re-rank
    docs = hybrid_retriever.get_relevant_documents(req.question)
    docs = rerank_local(req.question, docs, top_n=4)

    context = "\n\n---\n\n".join(d.page_content for d in docs)
    sources = list({d.metadata.get("filename") for d in docs})
    prompt = build_prompt(req.question, context, req.chat_history)

    async def generate():
        # First yield: sources (so the UI can show them immediately)
        yield f"data: {json.dumps({'type': 'sources', 'data': sources})}\n\n"

        # Stream LLM tokens
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            token = chunk.choices[0].delta.content or ""
            if token:
                yield f"data: {json.dumps({'type': 'token', 'data': token})}\n\n"

        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```
GitHub Repository Checklist
- .gitignore includes: .env, chroma_db/, __pycache__/, *.pyc, .DS_Store, *.egg-info
- .env.example: Shows required variables WITHOUT actual values, e.g. OPENAI_API_KEY=your_key_here
- README.md includes: Project description, architecture diagram (ASCII), tech stack, setup instructions (pip install + docker-compose), demo GIF/screenshot, API docs link
- requirements.txt: All dependencies with version pins (pip freeze > requirements.txt)
- tests/: At least 3 tests β ingestion, retrieval quality, API endpoint
- Commit history: Clean commits with meaningful messages (feat: hybrid-search, fix: chunk-overlap, docs: add-readme)
📄 Resume-Ready Project Descriptions
```text
AI PDF Research Assistant | Python, FastAPI, LangChain, ChromaDB, OpenAI
github.com/yourname/ai-pdf-chatbot | live: railway.app/project/...

• Built a production-ready RAG system enabling conversational Q&A over multiple
  PDF documents with <2s response latency
• Implemented hybrid search (BM25 + semantic vectors) with cross-encoder
  re-ranking, improving retrieval accuracy by ~30% vs naive vector search
• Designed streaming FastAPI backend with conversation memory supporting
  multi-turn queries with source citations
• Deployed containerized application (Docker + docker-compose) serving both
  the FastAPI backend and the Streamlit frontend
• Integrated RAGAS evaluation framework; achieved faithfulness score 0.89 and
  context recall 0.83 on test dataset

Skills demonstrated: RAG architecture, LLM engineering, FastAPI, vector
databases, embeddings, hybrid search, Docker, prompt engineering
```
Skills Section Format
```text
AI / LLM Engineering: RAG systems, LangChain, OpenAI API (Chat + Embeddings),
Prompt engineering, ChromaDB, FAISS, Semantic search, Hybrid search,
Re-ranking, RAGAS evaluation
Backend: FastAPI, Python, REST APIs, async programming, Docker
Databases: ChromaDB (vector), PostgreSQL, MongoDB, pgvector
```
💬 Top 30 RAG Interview Questions with Answers
The full question bank is grouped into four sets – 🌱 Fundamentals, 🔧 Pipeline & Implementation, 🚀 Advanced, and 💼 Practical/Scenario – and covers the same ground as the per-day question lists above.
❌ Top 10 Beginner RAG Mistakes
- Using different embedding models for indexing and retrieval → vectors incompatible → garbage results
- No chunk overlap → information at chunk boundaries is lost → incomplete answers
- Too-small chunk size (50-100 tokens) → no context; individual sentences are meaningless
- k=1 retrieval → depending on one chunk → brittle, misses nuance
- High temperature (0.7+) for RAG → model gets creative instead of factual → more hallucinations
- Not storing source metadata → can't attribute answers, can't filter by source
- Re-indexing the entire vector DB on every update → slow, expensive → use incremental updates
- No evaluation/testing → you don't know if RAG is actually working, and can't improve it
- Ignoring the "Lost in the Middle" problem → LLMs pay less attention to mid-prompt chunks → use re-ranking so the best chunks come first
- Hardcoding API keys in source code → pushed to GitHub → credentials stolen → huge bill
🗺️ What to Learn Next
| Track | What to Learn | Why |
|---|---|---|
| Week 2 | LangGraph for multi-agent RAG | Agentic RAG is the next wave: agents that decide when and what to retrieve |
| Week 3 | Pinecone or Qdrant in production | ChromaDB doesn't scale; learn managed vector DBs |
| Week 4 | OpenAI Assistants API / File Search | Managed RAG with no infrastructure; common in enterprise projects |
| Month 2 | GraphRAG (Microsoft) or Knowledge Graph + RAG | State-of-the-art for complex reasoning across many documents |
| Month 2 | Fine-tuning + RAG combo | Fine-tune for style/format, RAG for knowledge: best of both worlds |
| Month 3 | LLMOps: LangSmith, Weights & Biases | Production monitoring, tracing, experiment tracking; what companies use |
| Parallel | Llama.cpp + Ollama (local LLMs) | Free, private, no API costs; great for learning and offline use cases |
The difference between you and other freshers: they talk about AI. You've built it. You can explain why embeddings are vectors, what chunk overlap does to retrieval quality, why temperature=0 matters for grounding, and how to debug a RAG system that's giving wrong answers.
Ship the PDF chatbot. Star your own repo. Post a demo on LinkedIn. The AI engineering job you want? You're now qualified for it. π