Retrieval-Augmented Generation (RAG)

Also known as: RAG, Retrieval-augmented LLM

Don't train on it — retrieve it. Inject relevant documents into the prompt at runtime so the model answers from real source material.

When to use it

  • Anywhere the source material changes faster than you can fine-tune.
  • Internal-document Q&A (legal contracts, code repos, support tickets).
  • Citation-required answers where you need to trace which source said what.
  • Domains where hallucination cost is high (medical, legal, finance).

When not to use it

  • Open-ended creative tasks with no specific source material.
  • Real-time chat where retrieval latency matters more than precision.
  • Tasks the base model already handles well from pre-training.

How it works

  1. Chunk your documents (usually 200–1000 tokens per chunk), embed each chunk, and store the vectors in a vector database.
  2. At query time, embed the user's question and find the top-K most similar chunks.
  3. Inject those chunks into the model's context, along with the original question.
  4. Optionally use re-ranking, multi-query generation, or HyDE (hypothetical document embeddings) for better retrieval.
  5. Always cite which chunks were used in the response, so every answer leaves an auditable trail.
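
The retrieval steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: `embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, the in-memory `index` list stands in for a vector database, words stand in for tokens, and the document names and contents are invented for the example.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of words (words stand in for tokens)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts, not a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, index: list[tuple[str, str, Counter]], k: int = 2) -> list[tuple[str, str]]:
    """Return the (document name, chunk text) pairs most similar to the question."""
    q = embed(question)
    scored = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(name, text) for name, text, _ in scored[:k]]


# Invented sample documents, chunked and embedded once at indexing time.
docs = {
    "returns.md": "Items can be returned within 30 days of delivery for a full refund.",
    "shipping.md": "Standard shipping takes five business days.",
    "privacy.md": "We never sell customer data to third parties.",
}
index = [(name, c, embed(c)) for name, text in docs.items() for c in chunk(text)]

# At query time: embed the question and keep the best-matching chunk.
hits = top_k("Can items be returned for a refund?", index, k=1)
print(hits)
```

Keeping the document name next to each chunk is what makes step 5 possible: the names travel with the text into the prompt, so the model has something to cite.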

Example

Lazy prompt
What's our return policy?

Using the technique
Use only the documents in <retrieved> tags to answer. If the docs don't contain the answer, say 'not found in policy docs'. Cite the document name for every claim.

<retrieved>...top-K vector-search results pasted here...</retrieved>

Question: What's our return policy?
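
One way to assemble that prompt in code is sketched below. The function name `build_prompt` and the `<doc name="...">` wrapper around each chunk are assumptions made for illustration; the instruction text and the `<retrieved>` tags come from the example above, and the sample chunk is invented.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Wrap retrieved (document name, text) pairs in <retrieved> tags and append the question."""
    # Tagging each chunk with its document name gives the model something to cite.
    docs = "\n".join(f'<doc name="{name}">{text}</doc>' for name, text in chunks)
    return (
        "Use only the documents in <retrieved> tags to answer. "
        "If the docs don't contain the answer, say 'not found in policy docs'. "
        "Cite the document name for every claim.\n\n"
        f"<retrieved>\n{docs}\n</retrieved>\n\n"
        f"Question: {question}"
    )


# Invented sample chunk standing in for real top-K search results.
retrieved = [("returns.md", "Items can be returned within 30 days of delivery.")]
prompt = build_prompt("What's our return policy?", retrieved)
print(prompt)
```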

Common pitfalls

  • Bad retrieval = bad answer. Garbage chunks in, garbage answer out.
  • Tuning chunk size and overlap often matters more than the choice of model.
  • Models can still hallucinate even with retrieved context — instruct them to say 'I don't see this in the docs' when applicable.
  • Don't retrieve from sources the user shouldn't see. Enforce auth boundaries at retrieval time, before anything reaches the prompt.
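
Two of these pitfalls, weak retrieval and auth boundaries, can be guarded against with a filter between the retriever and the prompt. The sketch below is a minimal illustration under stated assumptions: the `Chunk` fields, the group-based permission model, and the `min_score` threshold of 0.3 are all hypothetical, and many vector databases can apply a permission filter inside the query itself, which is usually preferable to filtering afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str                    # document name, kept for citations
    text: str
    allowed_groups: frozenset[str]  # hypothetical permission metadata stored with the chunk
    score: float                   # similarity score reported by the retriever


def visible_chunks(chunks: list[Chunk], user_groups: set[str], min_score: float = 0.3) -> list[Chunk]:
    """Keep only chunks the user may read and that clear a minimum similarity score."""
    return [
        c for c in chunks
        if c.allowed_groups & user_groups  # auth boundary: at least one shared group
        and c.score >= min_score           # drop weak matches ("garbage chunks")
    ]


# Invented retrieval results for a user who belongs only to the "staff" group.
candidates = [
    Chunk("handbook.md", "Staff may work remotely two days a week.", frozenset({"staff"}), 0.8),
    Chunk("salaries.xlsx", "Salary bands by level.", frozenset({"hr"}), 0.9),
    Chunk("faq.md", "Office hours are 9 to 5.", frozenset({"staff", "public"}), 0.1),
]
visible = visible_chunks(candidates, user_groups={"staff"})
print([c.source for c in visible])
```

If the filter returns nothing, it is safer to have the application answer 'not found' directly than to send the model an empty context and hope it declines.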

Where this came from

Lewis et al., 2020 — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". Standard architecture for production LLM apps since 2023.