Understanding RAG Fundamentals
🤔 1.1 – The Problem RAG Solves
To understand RAG, you first need to understand what's broken without it. Let's start with a real-world scenario.
The Scenario
Imagine you're building a chatbot for a law firm. Their lawyers need to query 50,000 legal documents. You try using GPT-4 directly:
- ❌ GPT-4's training data ends in 2023 – it knows nothing more recent, and it has never seen the firm's internal case files
- ❌ Even if you stuff documents into the prompt, the context window caps out around a couple hundred pages – a tiny fraction of 50,000 documents
- ❌ The model confidently makes up case names and legal precedents that don't exist
- ❌ Sending 50,000 documents to OpenAI with every query would run on the order of $500 per query
RAG is the solution to all four problems above.
The 4 Fatal Limitations of Raw LLMs
| Limitation | What It Means | Impact |
|---|---|---|
| Knowledge Cutoff | Model doesn't know anything after training date | Outdated answers, missed current events |
| Hallucination | Model confidently fabricates facts that sound right | Wrong info delivered with total confidence – dangerous in production |
| Context Window Limit | Max text the model can process at once (GPT-4: ~128K tokens ≈ ~200 pages) | Can't query 50,000 documents at once |
| No Private Knowledge | Model only knows public internet data from training | Can't answer questions about YOUR company's data |
💡 1.2 – What is RAG?
RAG = Retrieval-Augmented Generation: a technique that gives an LLM access to external knowledge by first retrieving relevant documents, then passing them to the LLM as context, and finally generating an answer grounded in those documents.
The Perfect Analogy
Think of an LLM as a very smart student who has read millions of books but can't bring those books to the exam room. Their memory is imperfect (hallucinations). RAG is like giving that student an open-book exam:
- 📚 The student doesn't need to memorize everything
- 🔍 They look up relevant pages before answering
- ✍️ They write answers grounded in the actual text
- ✅ Answers are accurate and verifiable
RAG in One Diagram
User query → embed the query → search the vector DB for similar chunks → add the top chunks to the LLM prompt → generate a grounded answer
Why RAG is Everywhere Now
| Company | RAG Use Case |
|---|---|
| Notion AI | RAG over your personal workspace notes |
| GitHub Copilot | RAG over your codebase for context-aware suggestions |
| Perplexity AI | RAG over real-time web search results |
| ChatGPT (with files) | RAG over uploaded PDFs and documents |
| Every enterprise AI chatbot | RAG over internal wikis, Confluence, Slack, policies |
🎯 1.3 – Tokens, Embeddings & Semantic Search
These three concepts are the vocabulary of RAG; you can't explain RAG in an interview without knowing them cold.
Tokens – What LLMs Actually See
LLMs don't process words; they process tokens. A token is roughly ¾ of a word ("RAG engineering" is about 3 tokens). Tokens matter because:
- Every API call costs money based on token count (input + output)
- Context window limits are measured in tokens (e.g., GPT-4o: 128K tokens)
- Your chunk size and retrieval strategy directly affect token usage and cost
```text
# Quick token estimation (rule of thumb):
1 token  ≈ ¾ of a word      ≈ 4 characters
1 page   ≈ ~500 words       ≈ ~650 tokens
1 novel  ≈ ~100,000 words   ≈ ~130,000 tokens

# GPT-4o pricing (as of 2024):
Input:  $5.00 per million tokens
Output: $15.00 per million tokens

# RAG cost control: only send RELEVANT chunks (500-1000 tokens)
# instead of the entire document (millions of tokens)
```
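The numbers above are rules of thumb; for exact counts you can ask the tokenizer itself. A minimal sketch with tiktoken (the sample string is ours):

```python
# Count real tokens instead of estimating from word count
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")   # loads this model's tokenizer
text = "RAG engineering turns documents into retrievable knowledge."
tokens = enc.encode(text)
print(len(tokens))           # exact token count for this tokenizer
print(enc.decode(tokens))    # decodes back to the original string
```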
Embeddings – Turning Words into Numbers
An embedding is a list of numbers (a vector) that represents the meaning of text. Similar meanings = similar vectors. This is how RAG "understands" that "automobile" and "car" mean the same thing without matching keywords.
Semantic Search vs Keyword Search
| Query: "How do I fix my car's engine?" | Keyword Search | Semantic Search |
|---|---|---|
| Would find: | "car engine repair" (exact words) | "automobile motor troubleshooting", "vehicle powertrain issues", "fixing ignition problems" |
| Misses: | Any synonym variation | Almost nothing relevant |
| How it works: | String matching (TF-IDF, BM25) | Vector similarity (cosine similarity) |
| Used in RAG: | Hybrid search (combined) | Primary retrieval method |
✂️ 1.4 – Chunking & Vector Databases
Why Chunking Exists
Imagine you have a 500-page PDF manual. You can't embed the whole thing as one vector β that loses all granularity. And you can't send the whole document to an LLM for every query (too expensive, hits context limit). So you chunk β split the document into smaller overlapping pieces, each gets its own embedding.
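A minimal sketch of that idea, assuming simple word-based splitting (the production splitters on Day 2 work on tokens and characters instead):

```python
# Fixed-size chunking with overlap (word-based for simplicity)
def chunk_words(text: str, chunk_size: int = 250, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap               # each chunk starts 200 words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # last chunk reached the end
            break
    return chunks

doc = "word " * 1000                          # stand-in for a real document
pieces = chunk_words(doc)
print(f"{len(pieces)} chunks; adjacent chunks share 50 words of overlap")
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk.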
Vector Database – The Search Engine for Embeddings
A vector database stores millions of embedding vectors and can find the most similar ones to a query vector in milliseconds. It's the core retrieval infrastructure of every RAG system.
| Vector DB | Type | Best For | When to Use |
|---|---|---|---|
| ChromaDB | Open source, local | Learning, prototypes, small apps | Day 1-3 of your project |
| FAISS | Open source, in-memory | High-performance local search | Research, no persistence needed |
| Pinecone | Managed cloud | Production apps at scale | When you need managed infra |
| Weaviate | Open source / cloud | Complex queries, GraphQL interface | Enterprise features needed |
| Qdrant | Open source / cloud | Fast Rust backend, rich filtering | Performance-critical production |
| pgvector | PostgreSQL extension | Existing Postgres users | You already use PostgreSQL |
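Before the LangChain wrappers used from Day 2 onward, it helps to see the raw workflow once. A minimal sketch with the chromadb client (collection name and sentences are ours; if you don't pass vectors, Chroma embeds documents with its bundled default model):

```python
import chromadb

client = chromadb.Client()                       # in-memory instance, nothing persisted
collection = client.create_collection("demo_docs")

# Store: Chroma embeds these with its built-in default embedding model
collection.add(
    documents=["The cat sat on the mat", "Stocks fell sharply on Monday"],
    ids=["doc1", "doc2"],
)

# Query: the query text is embedded the same way, nearest neighbours returned
results = collection.query(query_texts=["feline resting on a rug"], n_results=1)
print(results["documents"])                      # [['The cat sat on the mat']]
```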
Hands-On Tasks – Day 1 Concepts
- Setup environment: `pip install openai chromadb langchain sentence-transformers tiktoken`
- Token counting: Use tiktoken to count tokens in a paragraph → see how text becomes numbers
- Generate your first embedding: Use sentence-transformers to embed 5 sentences, print the vector shape
- Semantic similarity: Calculate cosine similarity between "dog" and "puppy" vs "dog" and "python". Observe the difference.
- Manual chunking: Take any 3-page text, split into 250-word chunks with 50-word overlap manually in Python
```python
# Task: Your First Embedding + Similarity Check
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # free, fast, good

sentences = [
    "A dog is playing in the park",
    "A puppy is running outdoors",           # should be similar
    "Python is a programming language",      # should be different
    "Machine learning models learn patterns"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")    # (4, 384)

# Calculate similarity between all pairs
sim_matrix = cosine_similarity(embeddings)
print(f"Dog vs Puppy similarity: {sim_matrix[0][1]:.3f}")   # ~0.85 high!
print(f"Dog vs Python similarity: {sim_matrix[0][2]:.3f}")  # ~0.12 low

# Output: Dog vs Puppy ~0.85  → semantically related ✅
# Output: Dog vs Python ~0.12 → semantically unrelated ✅
```
📚 Day 1 Revision Notes
- RAG = retrieve relevant docs → augment LLM prompt → generate grounded answer
- 4 LLM limits RAG solves: hallucination, knowledge cutoff, context window, private data
- Token = basic LLM text unit (~¾ word) | Embedding = semantic meaning as a number vector
- Semantic search = search by meaning (embeddings) vs keyword search = string matching
- Chunking = split large docs into small overlapping pieces, each with its own embedding
- Vector DB = stores embeddings, finds similar ones fast → core infrastructure of every RAG system
- ChromaDB for learning → Pinecone/Qdrant for production
Interview Questions – Day 1 Concepts
1. A user asks your legal chatbot "What cases did we win in Q3?" and the LLM makes up 3 case names. What problem is this and how does RAG fix it?
2. Why can't you just send your entire company knowledge base to GPT-4 with every query?
3. "automobile" and "car" have different spellings but high semantic similarity. Why?
4. Why do we use chunking with overlap instead of just splitting into non-overlapping pieces?
5. Name 2 differences between ChromaDB and Pinecone.
Building the Core RAG Pipeline
🗺️ 2.1 – The Full RAG Pipeline Architecture
The RAG pipeline has two distinct phases, and understanding this split is critical for interviews. Indexing (offline): load → chunk → embed → store in a vector DB. Retrieval (online): embed the query → retrieve relevant chunks → augment the prompt → generate the answer.
📄 2.2 – Document Loading & Chunking Strategies
Document Loaders
| Source | LangChain Loader | Notes |
|---|---|---|
| PDF files | PyPDFLoader, PDFMinerLoader | PDFMiner handles complex layouts better |
| Word docs | Docx2txtLoader | Preserves paragraph structure |
| Websites | WebBaseLoader | Uses BeautifulSoup, strips HTML |
| CSV/Excel | CSVLoader | Each row becomes a document |
| Notion | NotionDirectoryLoader | Export Notion as markdown first |
| Code (Python, JS) | GenericLoader + parser | Language-aware splitting by functions |
| YouTube videos | YoutubeLoader | Uses transcript API |
Chunking Strategies – This is Where Most RAG Systems Fail
| Strategy | How It Works | Best For | Downside |
|---|---|---|---|
| Fixed Size | Split every N characters/tokens, overlap by X | Quick prototypes, general text | Can split mid-sentence, mid-thought |
| Recursive Character | Tries to split at paragraphs β sentences β words β chars | Most text types (LangChain default) | Chunks may be uneven |
| Semantic Chunking | Split when topic/meaning changes (embedding-based) | Long documents with topic shifts | Slower, needs embedding model |
| Document Structure | Split by headers, sections, paragraphs | Structured docs like manuals, wikis | Chunks can be too long or too short |
| Sentence-based | Split into individual sentences or sentence groups | FAQ, policy docs, Q&A content | Context loss across sentences |
```python
# Complete Document Loading + Chunking Example
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

# Step 1: Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
raw_docs = loader.load()
print(f"Loaded {len(raw_docs)} pages")

# Step 2: Count tokens to understand document size
enc = tiktoken.encoding_for_model("gpt-4")
total_tokens = sum(len(enc.encode(doc.page_content)) for doc in raw_docs)
print(f"Total tokens: {total_tokens} (~${total_tokens/1000 * 0.005:.2f} if sent directly)")

# Step 3: Smart chunking -- Recursive splits at natural boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # tokens per chunk (NOT characters)
    chunk_overlap=50,     # overlap to preserve context at boundaries
    length_function=lambda text: len(enc.encode(text)),
    separators=["\n\n", "\n", ". ", " ", ""]  # try these in order
)
chunks = splitter.split_documents(raw_docs)
print(f"Created {len(chunks)} chunks")
print(f"Sample chunk:\n{chunks[0].page_content[:200]}")
print(f"Chunk metadata: {chunks[0].metadata}")  # includes page number, source!
```
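The recursive splitter is the safe default. For the semantic-chunking strategy from the table above, LangChain ships an experimental splitter that breaks where the embedding similarity between adjacent sentences drops; a sketch, assuming the langchain-experimental package layout:

```python
# Semantic chunking -- split where the topic shifts, not at a fixed size
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",  # split at the largest similarity drops
)
semantic_chunks = semantic_splitter.split_documents(raw_docs)
print(f"Semantic chunking produced {len(semantic_chunks)} chunks")
```

It is slower and costs embedding calls at index time, which is why the table lists it for long documents with topic shifts rather than as a default.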
🔢 2.3 – Embedding Generation & Vector Storage
```python
# Complete Embedding + ChromaDB Storage Pipeline
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings  # free alternative
from langchain.vectorstores import Chroma

# Option A: OpenAI embeddings (paid, high quality)
# Requires OPENAI_API_KEY set in your environment
embed_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Cost: $0.02 per million tokens -- very cheap

# Option B: Free local embeddings (great for learning)
# Note: running this overrides Option A -- pick one
embed_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",    # 384 dimensions, fast
    model_kwargs={'device': 'cpu'}
)

# Step 4: Create vector store -- embeds and stores all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,                  # your chunked documents
    embedding=embed_model,             # embedding model
    persist_directory="./chroma_db",   # save to disk
    collection_name="company_docs"
)
print(f"Stored {vectorstore._collection.count()} embeddings!")

# To reload later without re-embedding:
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embed_model,
    collection_name="company_docs"
)
```
🔍 2.4 – Retrieval, Re-ranking & Prompt Augmentation
```python
# Step 5: Retrieval -- find relevant chunks for a query
retriever = vectorstore.as_retriever(
    search_type="similarity",   # or "mmr" for diverse results
    search_kwargs={"k": 4}      # retrieve top 4 chunks
)

query = "What is the parental leave policy?"
relevant_chunks = retriever.get_relevant_documents(query)

for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i+1} (page {chunk.metadata.get('page', '?')}):")
    print(chunk.page_content[:200])
    print()

# Step 6: Build augmented prompt
def build_rag_prompt(query: str, chunks: list) -> str:
    context = "\n\n---\n\n".join([c.page_content for c in chunks])
    return f"""You are a helpful assistant that answers questions based ONLY on the provided context.
If the answer is not in the context, say "I don't have that information in the provided documents."

CONTEXT:
{context}

QUESTION: {query}

ANSWER (based only on context above):"""

prompt = build_rag_prompt(query, relevant_chunks)
print(f"Total prompt tokens: ~{len(prompt.split()) * 4 // 3}")  # rough words-to-tokens estimate

# Step 7: Generate response with LLM
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",    # cheapest GPT-4 class model
    messages=[{"role": "user", "content": prompt}],
    temperature=0,          # 0 = deterministic, grounded answers
    max_tokens=500
)
answer = response.choices[0].message.content
print(f"Answer: {answer}")
```
LangChain RetrievalQA – The One-Liner Version
```python
# LangChain handles the whole pipeline in a few lines
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",             # "stuff" all chunks into prompt
    retriever=retriever,
    return_source_documents=True    # get source chunks back
)

result = qa_chain.invoke({"query": "What is the refund policy?"})
print(f"Answer: {result['result']}")
print(f"Sources: {[d.metadata['source'] for d in result['source_documents']]}")
```
Chain Types – Interviewers Ask This
| Chain Type | How It Works | Best When |
|---|---|---|
| stuff | Stuff ALL chunks directly into one prompt | Few small chunks, short context needed |
| map_reduce | Run LLM on each chunk separately, then combine answers | Many chunks, parallel processing |
| refine | Start with first chunk, refine answer with each next chunk | Long documents, iterative refinement |
| map_rerank | Run LLM on each chunk, score relevance, pick best | Need most relevant single answer |
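Switching strategies is a one-argument change; a short sketch reusing the llm and retriever defined earlier:

```python
# map_reduce: one LLM call per retrieved chunk, then a combine call
qa_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever
)

# refine: answer from chunk 1, then iteratively refined with chunks 2..N
qa_refine = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=retriever
)

print(qa_map_reduce.invoke({"query": "Summarize the vacation policy"})["result"])
```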
Hands-On Tasks – Day 2
- Download a PDF (any manual, textbook chapter, or company policy – 5+ pages)
- Build the indexing pipeline: load → chunk → embed → store in ChromaDB. Print: number of chunks, sample chunk with metadata
- Build the retrieval pipeline: query → retrieve top 3 chunks → print them with their similarity scores
- Build the generation step: manually write the prompt, call OpenAI API, print answer
- Use LangChain RetrievalQA to do the same in 10 lines
- Test 5 different queries and note which ones return accurate vs inaccurate answers. Why?
📚 Day 2 Revision Notes
- 2 phases: Indexing (offline) = load → chunk → embed → store | Retrieval (online) = query → retrieve → augment → generate
- Chunking tip: 512-token chunks with 50-token overlap is a solid starting point for most documents
- RecursiveCharacterTextSplitter is the best default splitter; it tries natural boundaries first
- The same embedding model MUST be used for both indexing and retrieval; different models produce incompatible vectors
- Retriever k=4 is a good default: too few misses info, too many adds noise
- temperature=0 for RAG LLMs: you want deterministic, factual answers, not creative ones
- LangChain RetrievalQA wraps the whole pipeline; production code uses LCEL (LangChain Expression Language) instead
Interview Questions – Day 2
1. You index 1000 documents and then query "What is our vacation policy?" – describe every step that happens internally.
2. You use OpenAI for indexing embeddings but switch to HuggingFace for retrieval. Will it work? Why not?
3. What is chunk overlap and what happens if you set it to 0?
4. What does temperature=0 mean and why do RAG systems use it?
5. You have 20 retrieved chunks but the LLM context window only fits 5. What are your options?
Tools, Frameworks & Building Real APIs
🔑 3.1 – OpenAI API Deep Dive
OpenAI is the backbone of most RAG systems. You need to understand its API deeply for both implementation and interviews.
Key OpenAI Models for RAG
| Model | Use For | Context Window | Cost |
|---|---|---|---|
| gpt-4o-mini | Best value for RAG generation | 128K tokens | ~$0.15/1M input tokens |
| gpt-4o | Complex reasoning, highest quality | 128K tokens | ~$5/1M input tokens |
| text-embedding-3-small | Fast, cheap embedding for indexing | 8191 tokens input | $0.02/1M tokens |
| text-embedding-3-large | Highest quality embeddings | 8191 tokens input | $0.13/1M tokens |
```python
# OpenAI API -- Everything You Need for RAG
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## 1. Generate embeddings (for indexing documents)
def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding   # list of 1536 floats

## 2. Batch embedding (more efficient)
def embed_batch(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        input=texts,                    # send up to 2048 texts at once
        model="text-embedding-3-small"
    )
    return [item.embedding for item in response.data]

## 3. Chat completion with full control
def generate_answer(system_prompt: str, user_query: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_query}"}
        ],
        temperature=0,
        max_tokens=800
    )
    return response.choices[0].message.content

## 4. Streaming response (better UX -- shows answer as it generates)
def stream_answer(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
```
⚙️ 3.2 – LangChain LCEL: Modern RAG Chains (Industry Standard)
LangChain Expression Language (LCEL) is the modern way to build RAG pipelines. It uses the pipe operator (|) to chain components β readable, composable, and production-ready.
```python
# LCEL: Modern LangChain RAG Chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Setup
embed = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embed)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Prompt template
prompt = ChatPromptTemplate.from_template("""
You are an expert assistant. Answer based ONLY on the context below.
If unsure, say "I don't know based on the provided documents."

Context: {context}

Question: {question}

Answer: """)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Helper to format retrieved docs into a string
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

# LCEL chain -- reads left to right like a pipeline
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Use it:
answer = rag_chain.invoke("What is the leave encashment policy?")
print(answer)

# Streaming (yields tokens as generated):
for token in rag_chain.stream("What is the leave encashment policy?"):
    print(token, end="", flush=True)
```
🚀 3.3 – Building a FastAPI RAG Backend (Portfolio Ready)
A Jupyter notebook RAG system is a prototype. A FastAPI app is a product. Here's how to build a production-ready RAG API that you can show to recruiters.
```python
# rag_api/main.py -- Production FastAPI RAG Backend
from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
import tempfile, os

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# === MODELS ===
class QueryRequest(BaseModel):
    question: str
    k: int = 4            # number of chunks to retrieve
    stream: bool = False  # streaming response?

class QueryResponse(BaseModel):
    answer: str
    sources: list[str]
    chunks_used: int

# === GLOBALS ===
vectorstore = None
embed_model = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# === APP ===
app = FastAPI(
    title="RAG Document QA API",
    description="Upload documents and query them with AI",
    version="1.0.0"
)

@app.on_event("startup")
async def startup():
    global vectorstore
    # Load existing vector store if it exists
    if os.path.exists("./chroma_db"):
        vectorstore = Chroma(
            persist_directory="./chroma_db",
            embedding_function=embed_model
        )
        print("Loaded existing vector store")

@app.get("/health")
async def health():
    return {
        "status": "healthy",
        "docs_indexed": vectorstore._collection.count() if vectorstore else 0
    }

@app.post("/ingest")
async def ingest_document(file: UploadFile = File(...)):
    """Upload and index a PDF document"""
    global vectorstore
    if not file.filename.endswith('.pdf'):
        raise HTTPException(400, "Only PDF files supported")

    # Save uploaded file temporarily
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        loader = PyPDFLoader(tmp_path)
        docs = loader.load()
        splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
        chunks = splitter.split_documents(docs)

        # Add source filename to metadata
        for chunk in chunks:
            chunk.metadata["filename"] = file.filename

        # Create the store on first ingest, append on later ones
        # (add_documents returns IDs, so don't assign its result to vectorstore)
        if vectorstore is None:
            vectorstore = Chroma.from_documents(
                chunks, embed_model, persist_directory="./chroma_db"
            )
        else:
            vectorstore.add_documents(chunks)

        return {
            "message": f"Indexed {len(chunks)} chunks from {file.filename}",
            "chunks": len(chunks)
        }
    finally:
        os.unlink(tmp_path)

@app.post("/query", response_model=QueryResponse)
async def query_documents(req: QueryRequest):
    """Query the indexed documents"""
    if not vectorstore:
        raise HTTPException(404, "No documents indexed yet. Use /ingest first.")

    retriever = vectorstore.as_retriever(search_kwargs={"k": req.k})
    retrieved_docs = retriever.get_relevant_documents(req.question)

    prompt = ChatPromptTemplate.from_template("""Answer using ONLY the context.
If not in context, say "I don't have that information."

Context: {context}

Question: {question}

Answer:""")

    chain = (
        {"context": lambda x: "\n\n".join(d.page_content for d in retrieved_docs),
         "question": lambda x: x}
        | prompt | llm | StrOutputParser()
    )
    answer = chain.invoke(req.question)

    sources = list({d.metadata.get("filename", "unknown") for d in retrieved_docs})
    return QueryResponse(answer=answer, sources=sources, chunks_used=len(retrieved_docs))
```
🖥️ 3.4 – Streamlit Frontend for RAG
```python
# app.py -- Streamlit Chat UI for RAG
import streamlit as st
import requests

st.set_page_config(page_title="📄 Doc QA", layout="wide")
st.title("📄 AI Document Q&A")
st.caption("Upload a PDF and ask anything about it")

API_URL = "http://localhost:8000"

# Sidebar: document upload
with st.sidebar:
    st.header("📤 Upload Document")
    uploaded = st.file_uploader("Choose a PDF", type=["pdf"])
    if uploaded and st.button("Index Document"):
        with st.spinner("Indexing..."):
            resp = requests.post(f"{API_URL}/ingest", files={"file": uploaded})
            if resp.status_code == 200:
                st.success(resp.json()["message"])
            else:
                st.error("Failed to index")

# Chat interface with history
if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if question := st.chat_input("Ask about your document..."):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.write(question)

    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            resp = requests.post(f"{API_URL}/query", json={"question": question})
            if resp.status_code == 200:
                data = resp.json()
                st.write(data["answer"])
                st.caption(f"📚 Sources: {', '.join(data['sources'])}")
                st.session_state.messages.append(
                    {"role": "assistant", "content": data["answer"]}
                )

# Run: streamlit run app.py
# Backend: uvicorn rag_api.main:app --reload
```
Hands-On Tasks – Day 3
- Build the FastAPI backend with /ingest, /query, and /health endpoints
- Test with Postman or curl: upload a PDF β query it β verify answer + sources
- Add the Streamlit frontend β connect it to your FastAPI backend
- Add streaming: Modify /query to stream tokens using StreamingResponse + Server-Sent Events
- Add error handling: What if no docs are indexed? What if PDF is corrupted?
- Write a README.md with setup instructions and API documentation
📚 Day 3 Revision Notes
- text-embedding-3-small = best price/performance embedding; use it for learning and production
- gpt-4o-mini = best value LLM for RAG generation; temperature=0 always for RAG
- LCEL pipe syntax: `retriever | prompt | llm | parser` = the modern LangChain way
- FastAPI advantages: async, Pydantic validation, streaming, auto-docs at /docs
- Streaming = yield tokens as generated → better UX, the user sees the answer building in real time
- Streamlit = 10 lines for a working chat UI; session_state for conversation history
- Always add metadata (filename, page) to chunks → needed for source attribution in answers
Advanced RAG – The Techniques That Actually Work
🔀 4.1 – Hybrid Search: The Best of Both Worlds (Used in Production)
Pure semantic search (vectors) is great for conceptual questions but misses exact matches. BM25 keyword search is great for exact terms but misses synonyms. Hybrid search combines both.
```python
# Hybrid Search with LangChain + BM25
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Assume `chunks` is your list of LangChain Documents

# BM25 (keyword-based)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Vector (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble: 40% BM25, 60% semantic (tune these weights)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)

results = hybrid_retriever.get_relevant_documents("What is the API rate limit?")
# Better at finding "rate limit" (exact) AND "throttling/quota" (semantic)
```
⬆️ 4.2 – Re-ranking: The Secret Sauce (Biggest Quality Boost)
Vector similarity finds candidates. Re-ranking picks the best ones. A re-ranker is a cross-encoder model that evaluates a query + document pair together β much more accurate than just embedding similarity, but slower (that's why we use it only on top-K candidates).
```python
# Re-ranking with Cohere Rerank (cloud) or a cross-encoder (local)
import os

## Option A: Cohere Rerank API (easy, high quality)
import cohere
co = cohere.Client(os.getenv("COHERE_API_KEY"))

def rerank_documents(query: str, docs: list, top_n: int = 4):
    texts = [d.page_content for d in docs]
    results = co.rerank(
        query=query,
        documents=texts,
        top_n=top_n,
        model="rerank-english-v3.0"
    )
    return [docs[r.index] for r in results.results]

## Option B: Local cross-encoder (free)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_local(query: str, docs: list, top_n: int = 4) -> list:
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    # Sort by score descending, take top N
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_n]]

# Usage: retrieve 20, rerank to top 4
candidates = vectorstore.as_retriever(search_kwargs={"k": 20}).get_relevant_documents(query)
reranked = rerank_local(query, candidates, top_n=4)
```
🔄 4.3 – Query Transformation Techniques
Users write bad queries. Query transformation improves them before retrieval. This alone can boost RAG accuracy by 20-30%.
| Technique | How It Works | Best For |
|---|---|---|
| Multi-query | LLM generates 3-5 different phrasings of the query → retrieve for each → merge | Short or ambiguous queries |
| HyDE | LLM generates a hypothetical answer → embed that → find similar real docs | Complex questions, abstract topics |
| Step-back prompting | LLM generates a more general "step-back" question → retrieve broader context | Specific questions needing broader context |
| Query decomposition | Break complex multi-part question into sub-questions → answer each → combine | Complex multi-hop questions |
| Query expansion | Add synonyms and related terms to the query | Domain-specific terminology |
```python
# Multi-Query Retrieval -- most commonly used in production
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)  # slight creativity for variety

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    llm=llm
)
# For query "What's the leave policy?", generates:
#   1. "How many vacation days do employees get?"
#   2. "What are the sick leave rules?"
#   3. "How do I apply for time off?"
# Retrieves for all 3, deduplicates, returns the union

# HyDE -- Hypothetical Document Embeddings
hyde_prompt = ChatPromptTemplate.from_template("""
Write a short paragraph that would be a perfect answer to this question.
Write it as if it's from a document, not as a direct answer.

Question: {question}

Hypothetical document paragraph:""")

hyde_chain = hyde_prompt | llm | StrOutputParser()

def hyde_retrieve(question: str, retriever, k: int = 4):
    # Generate a hypothetical answer, use it as the retrieval query
    hypothetical_doc = hyde_chain.invoke({"question": question})
    return retriever.get_relevant_documents(hypothetical_doc)
```
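The table also lists query decomposition; a hedged sketch of it using plain prompts (the prompt wording is ours, reusing llm from above and the rag_chain built on Day 3):

```python
# Query decomposition: split a complex question, answer the parts, then synthesize
decompose_prompt = ChatPromptTemplate.from_template(
    "Break this question into 2-3 standalone sub-questions, one per line.\n"
    "Question: {question}\nSub-questions:"
)
decompose_chain = decompose_prompt | llm | StrOutputParser()

def decomposed_answer(question: str) -> str:
    lines = decompose_chain.invoke({"question": question}).splitlines()
    sub_questions = [ln.strip("-0123456789. ") for ln in lines if ln.strip()]
    sub_answers = [rag_chain.invoke(q) for q in sub_questions]   # answer each part
    combined = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(sub_questions, sub_answers))
    final = llm.invoke(
        f"Combine these partial answers into one final answer.\n\n{combined}\n\n"
        f"Original question: {question}"
    )
    return final.content
```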
Metadata Filtering – Precision at Scale
```python
# Metadata filtering: search only within specific documents/sections
# ChromaDB supports filtering during search

# When ingesting, add rich metadata:
for chunk in chunks:
    chunk.metadata.update({
        "department": "HR",
        "doc_type": "policy",
        "year": 2024,
        "filename": "hr_policy_2024.pdf"
    })

# Filter search to only HR department docs from 2024.
# Chroma requires an explicit $and when combining multiple conditions:
results = vectorstore.similarity_search(
    query="What is the leave policy?",
    k=4,
    filter={"$and": [{"department": "HR"}, {"year": 2024}]}
)
# Only searches HR 2024 docs -- much more precise, faster
```
📊 4.4 – RAG Evaluation: How Do You Know If It's Working?
You cannot improve what you don't measure. RAG evaluation is what senior AI engineers do that juniors skip. This is a high-value interview topic.
4 Key RAG Metrics
| Metric | Measures | Question It Answers |
|---|---|---|
| Context Recall | Did retrieval find all necessary information? | "Did we retrieve the right chunks?" |
| Context Precision | Are retrieved chunks relevant? No noise? | "Did we retrieve ONLY the right chunks?" |
| Answer Faithfulness | Is the answer grounded in retrieved context? | "Did the LLM use the context or hallucinate?" |
| Answer Relevancy | Does the answer actually address the question? | "Is the answer actually helpful?" |
```python
# RAG Evaluation with RAGAS (the standard library)
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # answer grounded in context?
    answer_relevancy,   # answer relevant to question?
    context_recall,     # retrieved right context?
    context_precision   # retrieved only relevant context?
)
from datasets import Dataset

# Build test dataset: question, generated answer, retrieved context, ground truth
test_data = {
    "question": ["What is the parental leave policy?",
                 "How do I apply for remote work?"],
    "answer": [generated_answer_1, generated_answer_2],
    "contexts": [[chunk.page_content for chunk in retrieved_1],
                 [chunk.page_content for chunk in retrieved_2]],
    "ground_truth": ["Employees get 26 weeks paid leave...",
                     "Submit a remote work request form..."]
}

dataset = Dataset.from_dict(test_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
# Example output:
#   faithfulness: 0.92      -> 92% of answers grounded in context ✅
#   answer_relevancy: 0.87  -> 87% of answers actually address the question ✅
#   context_recall: 0.79    -> missed some relevant chunks ⚠️ (tune chunk size)
```
Cost vs Accuracy Tradeoffs
| Decision | More Accurate (higher cost) | Cheaper (lower accuracy) |
|---|---|---|
| Embedding model | text-embedding-3-large ($0.13/M) | all-MiniLM-L6-v2 (free) |
| Generation model | gpt-4o ($5/M) | gpt-4o-mini ($0.15/M) |
| Chunks retrieved (k) | k=10 (more context) | k=3 (cheaper, less noise) |
| Re-ranking | Cohere re-ranker + k=20 initial | No re-ranking, k=4 |
| Query transformation | Multi-query (3x LLM calls) | Direct query (1 LLM call) |
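To make the table concrete, a back-of-the-envelope cost helper (the gpt-4o-mini input price comes from the Day 3 model table; the $0.60/M output price is our assumption, so treat both as snapshots):

```python
# Rough per-query cost for k retrieved chunks sent to gpt-4o-mini
def query_cost(k: int, chunk_tokens: int = 512, answer_tokens: int = 500,
               in_price: float = 0.15, out_price: float = 0.60) -> float:
    """Prices are $ per million tokens."""
    input_tokens = k * chunk_tokens + 200        # chunks plus prompt scaffolding
    return (input_tokens * in_price + answer_tokens * out_price) / 1_000_000

print(f"k=3:  ${query_cost(3):.6f} per query")
print(f"k=10: ${query_cost(10):.6f} per query")  # roughly 3x the input cost of k=3
```

At these prices retrieval depth is cheap per query; the tradeoff only bites at millions of queries, which is exactly when re-ranking with a smaller k pays off.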
📚 Day 4 Revision Notes
- Hybrid search = BM25 (exact keywords) + vector (semantic) → best of both, use EnsembleRetriever
- Re-ranking = retrieve 20 candidates → re-rank to top 4 → biggest quality boost per dollar
- Multi-query = LLM generates 3-5 query variations → retrieve for each → deduplicate and merge
- HyDE = generate hypothetical answer → embed it → find similar real docs (helps abstract questions)
- Metadata filtering = add rich metadata during indexing → filter during retrieval → precision + speed
- RAGAS metrics: faithfulness, answer_relevancy, context_recall, context_precision
- Cost optimization: free embeddings for prototypes → OpenAI small for production → use re-ranking to offset smaller k
Interview Questions – Day 4
1. A user queries "ARN format in AWS". Pure semantic search returns docs about "AWS resource naming conventions" but misses the exact doc saying "arn:aws:...". Which retrieval method would fix this?
2. You retrieve k=4 chunks for every query but your RAGAS context_recall is 0.62. What are 3 things you can try?
3. Explain the two-stage retrieval pattern. Why not just use re-ranker on all documents?
4. Your RAG system answers legal questions but sometimes cites the wrong year's policy. How does metadata filtering solve this?
5. Your faithfulness score is 0.55 (low). What does this mean and how do you fix it?
Production RAG – Real Architecture & Deployment
🏢 5.1 – How Real Companies Use RAG
| Company / Use Case | RAG Architecture | Key Challenge Solved |
|---|---|---|
| Customer Support Bot | RAG over help docs + ticket history + FAQs | Deflect 60% tickets, always up-to-date with product changes |
| Internal HR Chatbot | RAG over policy docs, Confluence, Notion | Employees get instant accurate policy answers 24/7 |
| Legal Document Review | RAG over case files, contracts, precedents | Lawyers query 50,000+ docs in seconds, with citations |
| Developer Docs Search | RAG over API docs, GitHub issues, SO posts | Developers find answers without manual searching |
| Sales Intelligence | RAG over call transcripts, CRM, market reports | Sales reps get tailored pitch points before every call |
| Medical Knowledge Base | RAG over clinical trials, drug references | Doctors query latest research grounded in evidence |
The Full Production RAG Architecture
Production systems wrap the core pipeline with an auth gateway, a query router, hybrid retrieval plus re-ranking, conversation memory, post-processing, and logging – each covered in the sections below.
💬 5.2 – Conversation Memory in RAG
Without memory, every query to your RAG system is independent. Users can't ask follow-up questions like "Tell me more about that" or "What about for remote employees?"
```python
# Conversational RAG with memory -- the classic chain approach
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationalRetrievalChain

# Keep last 5 conversation turns in memory
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    k=5,
    return_messages=True,
    output_key="answer"
)

conv_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    retriever=hybrid_retriever,
    memory=memory,
    return_source_documents=True
)

# Turn 1
r1 = conv_chain.invoke({"question": "What is the parental leave policy?"})
print(r1["answer"])

# Turn 2 -- system remembers we're talking about parental leave
r2 = conv_chain.invoke({"question": "Does it apply to adoption too?"})
print(r2["answer"])  # correctly contextualizes "it" = parental leave

# LCEL approach with manual history (more control):
from langchain_core.messages import HumanMessage, AIMessage

chat_history = []

def chat_with_memory(question: str) -> str:
    # Condense the question with history context into a standalone query
    condense_prompt = f"""Given the conversation history and new question,
rephrase the question to be standalone.
History: {chat_history[-4:]}
Question: {question}
Standalone question:"""
    standalone = llm.invoke(condense_prompt).content

    # Retrieve and answer using the standalone query
    # (rag_chain retrieves internally, so no separate retrieval call is needed)
    answer = rag_chain.invoke(standalone)

    # Update history
    chat_history.append(HumanMessage(content=question))
    chat_history.append(AIMessage(content=answer))
    return answer
```
🐳 5.3 – Deploying RAG with Docker + Cloud
Project Structure (GitHub-Ready)
```text
rag-pdf-chatbot/
├── backend/
│   ├── main.py              # FastAPI app
│   ├── rag/
│   │   ├── __init__.py
│   │   ├── ingestion.py     # Document loading + chunking
│   │   ├── embeddings.py    # Embedding model wrapper
│   │   ├── retrieval.py     # Vector store + retrieval
│   │   ├── generation.py    # LLM + prompt building
│   │   └── evaluation.py    # RAGAS evaluation
│   ├── tests/
│   │   └── test_rag.py
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/
│   └── streamlit_app.py
├── data/
│   └── sample_docs/
├── docker-compose.yml
├── .env.example             # Template for env vars (NO real keys)
├── .gitignore               # Include: chroma_db/, .env, __pycache__
└── README.md                # Architecture diagram + setup instructions
```
```dockerfile
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first (caching layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create directory for ChromaDB persistence
RUN mkdir -p /app/chroma_db

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```yaml
# docker-compose.yml
version: '3.8'
services:
  backend:
    build: ./backend
    ports: ["8000:8000"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - COHERE_API_KEY=${COHERE_API_KEY}
    volumes:
      - chroma_data:/app/chroma_db   # persist vector DB
  frontend:
    build: ./frontend
    ports: ["8501:8501"]
    depends_on: [backend]
    environment:
      - API_URL=http://backend:8000
volumes:
  chroma_data:

# Run: docker-compose up --build
```
Deployment Options
| Option | Cost | Effort | Best For |
|---|---|---|---|
| Railway.app | Free tier available | ⭐ Easiest | Portfolio demos, hackathons |
| Render.com | Free tier (sleeps) | ⭐⭐ Easy | Personal projects |
| AWS EC2 + Docker | ~$10-20/month | ⭐⭐⭐ Medium | Production, shows AWS skills |
| Google Cloud Run | Pay per request | ⭐⭐⭐ Medium | Scalable serverless |
| HuggingFace Spaces | Free for Streamlit | ⭐ Easiest | ML portfolio showcase |
⚠️ 5.4 – Production Pitfalls & Security
| Pitfall | Problem | Fix |
|---|---|---|
| Prompt injection | User crafts query to override system prompt | Input sanitization, separate system/user context, output validation |
| Context poisoning | Malicious doc in vector DB injects instructions | Sanitize documents at ingestion, separate trusted/untrusted sources |
| No rate limiting | Bot floods your API → $1000 OpenAI bill overnight | FastAPI rate limiting middleware, usage quotas per user/API key |
| Storing raw text | PII, secrets, confidential data in vector DB | PII detection at ingestion, access control, encryption at rest |
| No guardrails | LLM answers questions outside scope ("how to hack?") | Input/output guards (Guardrails AI, NeMo Guardrails) |
| Stale embeddings | Doc updated but old embedding still in DB → wrong answers | Document versioning, update/delete embeddings when source changes |
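For the rate-limiting row, a hedged sketch using slowapi, a common rate-limiting add-on for FastAPI (the quota values are ours):

```python
# Per-IP rate limiting for a RAG endpoint -- sketch, not production-hardened
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)   # key requests by client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/query")
@limiter.limit("10/minute")                      # tune per user tier / API key
async def query(request: Request):               # slowapi needs the Request object
    return {"answer": "..."}                     # a real handler would run the RAG chain
```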
📚 Day 5 Revision Notes
- Production RAG adds: auth gateway, query router, re-ranking, conversation memory, post-processing, logging
- Conversation memory = ConversationBufferWindowMemory (last K turns) → condense question with history before retrieval
- Docker structure: backend FastAPI + frontend Streamlit in docker-compose with a volume for ChromaDB
- Project structure: separate rag/ module with ingestion, embeddings, retrieval, generation, evaluation files
- Security must-haves: rate limiting, input sanitization, no hardcoded API keys, PII detection at ingestion
- Document updates: track document_id + hash → delete old chunks → re-index on change (see the sketch after these notes)
- Deploy to Railway/HuggingFace for portfolio, mention AWS EC2 in README for production credibility
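A minimal sketch of that incremental re-indexing pattern (load_and_chunk is a hypothetical stand-in for your ingestion code, and a real system would persist the hash map rather than keep it in memory):

```python
# Re-embed a file only when its content hash changes
import hashlib

indexed_hashes: dict[str, str] = {}    # filename -> sha256 of last indexed content

def upsert_document(path: str, filename: str):
    with open(path, "rb") as f:
        content_hash = hashlib.sha256(f.read()).hexdigest()
    if indexed_hashes.get(filename) == content_hash:
        return                                    # unchanged: skip re-embedding
    # Changed or new: drop stale chunks for this file, then re-ingest
    vectorstore._collection.delete(where={"filename": filename})  # Chroma where-filter
    chunks = load_and_chunk(path, filename)       # hypothetical ingestion helper
    vectorstore.add_documents(chunks)
    indexed_hashes[filename] = content_hash
```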
The PDF Chatbot Project + Job Readiness Kit
📌 The Mini Project: AI PDF Research Assistant
Build a full-stack RAG application that lets users upload multiple PDFs and have an intelligent conversation about their content. This is something real companies pay engineers to build.
Key Implementation: The /chat Streaming Endpoint
```python
# The most impressive feature: streaming RAG with sources
from fastapi.responses import StreamingResponse
import json

# Assumes ChatRequest (question + chat_history) and build_prompt()
# are defined alongside the earlier FastAPI models

@app.post("/chat")
async def chat_stream(req: ChatRequest):
    """Stream RAG response with sources"""
    # Retrieve with hybrid search, then re-rank
    docs = hybrid_retriever.get_relevant_documents(req.question)
    docs = rerank_local(req.question, docs, top_n=4)

    context = "\n\n---\n\n".join(d.page_content for d in docs)
    sources = list({d.metadata.get("filename") for d in docs})
    prompt = build_prompt(req.question, context, req.chat_history)

    async def generate():
        # First yield: sources (so the UI can show them immediately)
        yield f"data: {json.dumps({'type': 'sources', 'data': sources})}\n\n"

        # Stream LLM tokens
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            token = chunk.choices[0].delta.content or ""
            if token:
                yield f"data: {json.dumps({'type': 'token', 'data': token})}\n\n"

        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```
GitHub Repository Checklist
- .gitignore includes: .env, chroma_db/, __pycache__/, *.pyc, .DS_Store, *.egg-info
- .env.example: Shows required variables WITHOUT actual values, e.g. OPENAI_API_KEY=your_key_here
- README.md includes: Project description, architecture diagram (ASCII), tech stack, setup instructions (pip install + docker-compose), demo GIF/screenshot, API docs link
- requirements.txt: All dependencies with version pins (pip freeze > requirements.txt)
- tests/: At least 3 tests β ingestion, retrieval quality, API endpoint
- Commit history: Clean commits with meaningful messages (feat: hybrid-search, fix: chunk-overlap, docs: add-readme)
📄 Resume-Ready Project Descriptions
```text
AI PDF Research Assistant | Python, FastAPI, LangChain, ChromaDB, OpenAI
github.com/yourname/ai-pdf-chatbot | live: railway.app/project/...

• Built a production-ready RAG system enabling conversational Q&A over multiple
  PDF documents with <2s response latency
• Implemented hybrid search (BM25 + semantic vectors) with cross-encoder
  re-ranking, improving retrieval accuracy by ~30% vs naive vector search
• Designed streaming FastAPI backend with conversation memory supporting
  multi-turn queries with source citations
• Deployed containerized application (Docker + docker-compose) serving both
  the FastAPI backend and the Streamlit frontend
• Integrated RAGAS evaluation framework; achieved faithfulness score 0.89 and
  context recall 0.83 on test dataset

Skills demonstrated: RAG architecture, LLM engineering, FastAPI, vector
databases, embeddings, hybrid search, Docker, prompt engineering
```
Skills Section Format
```text
AI / LLM Engineering: RAG systems, LangChain, OpenAI API (Chat + Embeddings),
Prompt engineering, ChromaDB, FAISS, Semantic search, Hybrid search,
Re-ranking, RAGAS evaluation
Backend: FastAPI, Python, REST APIs, async programming, Docker
Databases: ChromaDB (vector), PostgreSQL, MongoDB, pgvector
```
💬 Top 30 RAG Interview Questions with Answers
The full question bank is grouped into four sets – 🌱 Fundamentals, 🔧 Pipeline & Implementation, 🚀 Advanced, and 💼 Practical/Scenario – and covers the same ground as the per-day question lists above.
❌ Top 10 Beginner RAG Mistakes
- Using different embedding models for indexing and retrieval → vectors incompatible → garbage results
- No chunk overlap → information at chunk boundaries is lost → incomplete answers
- Too-small chunk size (50-100 tokens) → no context; individual sentences are meaningless
- k=1 retrieval → depending on one chunk → brittle, misses nuance
- High temperature (0.7+) for RAG → model gets creative instead of factual → more hallucinations
- Not storing source metadata → can't attribute answers, can't filter by source
- Re-indexing the entire vector DB on every update → slow, expensive → use incremental updates
- No evaluation/testing → you don't know if RAG is actually working, and can't improve it
- Ignoring the "Lost in the Middle" problem → LLMs pay less attention to mid-prompt chunks → use re-ranking so the best chunks come first
- Hardcoding API keys in source code → pushed to GitHub → credentials stolen → huge bill
🗺️ What to Learn Next
| Track | What to Learn | Why |
|---|---|---|
| Week 2 | LangGraph for multi-agent RAG | Agentic RAG is the next wave: agents that decide when and what to retrieve |
| Week 3 | Pinecone or Qdrant in production | ChromaDB doesn't scale; learn managed vector DBs |
| Week 4 | OpenAI Assistants API / File Search | Managed RAG with no infrastructure; common in enterprise projects |
| Month 2 | GraphRAG (Microsoft) or Knowledge Graph + RAG | State-of-the-art for complex reasoning across many documents |
| Month 2 | Fine-tuning + RAG combo | Fine-tune for style/format, RAG for knowledge: best of both worlds |
| Month 3 | LLMOps: LangSmith, Weights & Biases | Production monitoring, tracing, experiment tracking; what companies use |
| Parallel | Llama.cpp + Ollama (local LLMs) | Free, private, no API costs; great for learning and offline use cases |
The difference between you and other freshers: they talk about AI. You've built it. You can explain why embeddings are vectors, what chunk overlap does to retrieval quality, why temperature=0 matters for grounding, and how to debug a RAG system that's giving wrong answers.
Ship the PDF chatbot. Star your own repo. Post a demo on LinkedIn. The AI engineering job you want? You're now qualified for it. π