Retrieval-Augmented Generation (RAG)

Also known as: RAG, Retrieval-augmented LLM

Don't train on it — retrieve it. Inject relevant documents into the prompt at runtime so the model answers from real source material.

When to use it

Anywhere the source material changes faster than you can fine-tune.
Internal-document Q&A (legal contracts, code repos, support tickets).
Citation-required answers where you need to trace which source said what.
Domains where hallucination cost is high (medical, legal, finance).

When not to use it

Open-ended creative tasks with no specific source material.
Real-time chat where retrieval latency matters more than precision.
Tasks the base model already handles well from pre-training.

How it works

1. Chunk your documents (usually 200–1000 tokens per chunk), embed each chunk, and store the vectors in a vector database.
2. At query time, embed the user's question and find the top-K most similar chunks.
3. Inject those chunks into the model's context, along with the original question.
4. Optionally use re-ranking, multi-query generation, or HyDE (hypothetical document embeddings) for better retrieval.
5. Always cite which chunks were used in the response, so every answer leaves an auditable trail.
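
The chunk, embed, and retrieve steps can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, chunk sizes are counted in words rather than tokens, and all function names and document names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping chunks (sizes in words, a stand-in for tokens)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, size=200, overlap=40):
    """Step 1: chunk every document and store (doc_name, chunk_text, vector)."""
    index = []
    for name, text in documents.items():
        for chunk in chunk_words(text, size, overlap):
            index.append((name, chunk, embed(chunk)))
    return index


def retrieve(index, question, k=3):
    """Step 2: embed the question and return the k most similar chunks."""
    query_vec = embed(question)
    scored = [(cosine(query_vec, vec), name, chunk) for name, chunk, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical documents, for illustration only.
docs = {
    "returns.md": "Items can be returned within 30 days with a receipt.",
    "shipping.md": "Orders ship within two business days.",
}
index = build_index(docs, size=50, overlap=10)
for score, name, chunk in retrieve(index, "Can items be returned?", k=1):
    print(name, round(score, 2), chunk)
```

Returning the document name alongside each chunk is what makes step 5 possible: the citation has to travel with the text from the index all the way into the prompt.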

Example

Lazy prompt

What's our return policy?

Using the technique

Use only the documents in <retrieved> tags to answer. If the docs don't contain the answer, say 'not found in policy docs'. Cite the document name for every claim.

<retrieved>...top-K vector-search results pasted here...</retrieved>

Question: What's our return policy?
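
One way to assemble that prompt programmatically is sketched below. The instruction text and the `<retrieved>` tags come from the example above; the `build_prompt` name, the per-document `<doc name="...">` wrapper, and the sample document are assumptions made for illustration.

```python
def build_prompt(question, chunks):
    """Build a grounded prompt from (doc_name, text) pairs returned by retrieval."""
    # Wrapping each chunk with its document name gives the model something to cite.
    # The <doc> wrapper is an illustrative convention, not a required format.
    blocks = [f'<doc name="{name}">\n{text}\n</doc>' for name, text in chunks]
    retrieved = "\n".join(blocks)
    return (
        "Use only the documents in <retrieved> tags to answer. "
        "If the docs don't contain the answer, say 'not found in policy docs'. "
        "Cite the document name for every claim.\n\n"
        f"<retrieved>\n{retrieved}\n</retrieved>\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunk, for illustration only.
prompt = build_prompt(
    "What's our return policy?",
    [("returns.md", "Items can be returned within 30 days with a receipt.")],
)
print(prompt)
```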

Common pitfalls

Bad retrieval = bad answer. Garbage chunks in, garbage answer out.
Chunk size and overlap tuning often matter more than the model choice.
Models can still hallucinate even with retrieved context. Instruct them to say 'I don't see this in the docs' when the answer is missing.
Don't retrieve from sources the user shouldn't see (auth boundaries).
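
Two of these pitfalls, auth boundaries and answering without support, can be guarded against in code before the model is ever called. The sketch below is one possible shape under stated assumptions: the `Chunk` fields, the group-based permission model, the `min_score` floor of 0.2, and the `generate` callable (a stand-in for the model call) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_name: str
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document
    score: float               # similarity to the query, computed elsewhere


NOT_FOUND = "I don't see this in the docs."


def select_context(chunks, user_groups, k=3, min_score=0.2):
    """Drop chunks the user may not see, then keep the top-k above a score floor."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    relevant = [c for c in visible if c.score >= min_score]
    return sorted(relevant, key=lambda c: c.score, reverse=True)[:k]


def answer_or_refuse(chunks, user_groups, generate):
    """Call the model only when permitted, relevant context exists."""
    context = select_context(chunks, user_groups)
    if not context:
        return NOT_FOUND
    return generate(context)


# Hypothetical data, for illustration only.
candidates = [
    Chunk("hr-salaries.md", "Salary bands by level.", frozenset({"hr"}), 0.9),
    Chunk("faq.md", "Returns are accepted within 30 days.", frozenset({"all"}), 0.5),
]
picked = select_context(candidates, frozenset({"all"}))
print([c.doc_name for c in picked])
```

Filtering happens before ranking here so that a restricted chunk can never crowd a permitted one out of the top-K. Where the vector store supports metadata filters, applying the permission check inside the query itself is likely the safer choice.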

Where this came from

Lewis et al., 2020 — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". Standard architecture for production LLM apps since 2023.

Related techniques

Function Calling / Tool Use

Let the model decide when to invoke a real function or API instead of free-text answering. The foundation of every modern agent.

ReAct (Reason + Act)

Alternate reasoning and acting in a tight loop. The dominant pattern for tool-using agents — think, act, observe, repeat.

System Prompt Design

The hidden instructions that set the model's role, constraints, and ground rules for the entire conversation. Where 80% of product behavior actually lives.