RAG (Retrieval-Augmented Generation)

The core idea is to retrieve relevant documents, then feed them as context to the LLM.

Pipeline

Indexing (offline, done once and refreshed when the documents change)
Take your documents (PDFs, code, wikis, etc.)
Split them into chunks according to a chunking strategy
Embed each chunk into a vector using an embedding model
Store the vectors in a vector database
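
The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not a production setup: chunk_text, toy_embed, and build_index are hypothetical names, the chunking strategy is assumed to be fixed-size character windows with overlap, the "embedding model" is a hashed bag-of-words stand-in (it captures shared words, not meaning), and the "vector database" is a plain in-memory list.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for a real embedding model (assumption): hashed bag-of-words,
    L2-normalized. A real model would be called here instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents: list[str]) -> list[tuple[str, list[float]]]:
    """In-memory stand-in for a vector database: a list of (chunk, vector) pairs."""
    index = []
    for doc in documents:
        for chunk in chunk_text(doc):
            index.append((chunk, toy_embed(chunk)))
    return index
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk; the sizes used here are arbitrary defaults.
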
Retrieval (at query time)
User asks a question
Embed the query using the same embedding model
Do a similarity search (cosine/dot product) in the vector DB
Get the top-k most similar chunks, optionally reranking them so the most relevant come first
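
As a rough sketch of the retrieval step, the search can be written as a brute-force scan that scores every stored vector against the query vector and keeps the k best. The names cosine_similarity and top_k are hypothetical, and the index is assumed to be a list of (chunk, vector) pairs; real vector databases usually replace the full scan with an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]],
          k: int = 3) -> list[tuple[float, str]]:
    """Score every (chunk, vector) pair and return the k highest-scoring chunks."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors, invented purely for illustration.
index = [
    ("cats purr", [1.0, 0.0]),
    ("dogs bark", [0.0, 1.0]),
    ("kittens meow", [0.9, 0.1]),
]
results = top_k([1.0, 0.0], index, k=2)
print([chunk for _, chunk in results])  # ['cats purr', 'kittens meow']
```

When the vectors are already normalized to unit length, the dot product gives the same ranking as cosine similarity, which is why the two are often used interchangeably.
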
Generation
Stuff those chunks into the LLM prompt as context
The LLM reads the context and generates an answer grounded in it
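
The generation step mostly comes down to assembling a prompt. Below is one possible way to do it; build_prompt is a hypothetical helper, and the instruction wording and the numbered [i] chunk labels are assumptions rather than a required format. The resulting string would then be sent to whatever LLM API is in use.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks ahead of the question as numbered context."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves documents.", "Then it generates an answer."],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each claim, and telling it to admit when the context is insufficient is a common way to discourage answers that are not grounded in the retrieved text.
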

Why it works

Embedding models map semantically similar text to nearby vectors in a high-dimensional space, which enables fuzzy semantic search rather than exact keyword matching.