RAG (Retrieval-Augmented Generation)

The core idea is to retrieve relevant documents, then feed them as context to the LLM.

Pipeline

Indexing (offline, done once)

  1. Take your documents (PDFs, code, wikis, etc.)
  2. Split them into chunks using a chunking strategy
  3. Embed each chunk into a vector using an embedding model
  4. Store the vectors in a vector database

Retrieval (at query time)

  1. User asks a question
  2. Embed the query using the same embedding model
  3. Run a similarity search (cosine/dot product) in the vector DB
  4. Rerank the results and keep the top-k most relevant chunks

Generation

  1. Stuff those chunks into the LLM prompt as context
  2. The LLM reads the context and generates an answer grounded in it
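The whole pipeline can be sketched end to end. This is a toy illustration, not a production setup: `embed` is a bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, and the final prompt is returned instead of being sent to an LLM.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy embedding: a term-frequency vector over a fixed vocabulary.
    A real system would call an embedding model here."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# --- Indexing (offline) ---
documents = [
    "Paris is the capital of France",
    "The Eiffel Tower is in Paris",
    "Python is a popular programming language",
]
vocab = sorted({w for d in documents for w in d.lower().split()})
# "Vector database": just a list of (vector, chunk) pairs here.
index = [(embed(d, vocab), d) for d in documents]

# --- Retrieval (query time) ---
def retrieve(query, k=2):
    q = embed(query, vocab)
    scored = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in scored[:k]]

# --- Generation ---
def build_prompt(query):
    context = "\n".join(retrieve(query))
    # In a real system this prompt is sent to the LLM.
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(retrieve("capital of France", k=1))
# → ['Paris is the capital of France']
```

Swapping in a real embedding model and a real vector database changes the components but not the shape of the flow: embed at index time, embed again at query time, search, then prompt.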

Why it works

Embedding models map semantically similar text to nearby vectors in high-dimensional space, which enables fuzzy semantic search rather than exact keyword matching.
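The "nearby vectors" intuition can be made concrete with cosine similarity. The vectors below are hand-made stand-ins for real embeddings, chosen only to illustrate the idea that related concepts score higher than unrelated ones.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product of the vectors divided by the
    product of their magnitudes; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

cat    = [0.90, 0.80, 0.10]  # pretend embedding of "cat"
kitten = [0.85, 0.75, 0.20]  # pretend embedding of "kitten"
car    = [0.10, 0.20, 0.90]  # pretend embedding of "car"

# Semantically related text lands closer in vector space.
print(cosine(cat, kitten) > cosine(cat, car))  # → True
```

Note that "cosine" vs. "dot product" matters in practice: with normalized (unit-length) embeddings the two rank results identically, since the denominator becomes 1.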