Glossary · Technique
Constitutional AI
Also known as: CAI, Principle-based self-critique
Train (or prompt) the model with an explicit set of principles, then have it critique its own outputs against them. Anthropic's safety technique.
When to use it
- Building product wrappers that need consistent safety behavior.
- Anywhere you want explicit, auditable refusal rules.
- Domain-specific products with hard ethical lines (children's content, medical advice, legal).
- Reducing harmful outputs without hand-crafting RLHF datasets.
When not to use it
- Generic chat where standard safety training is sufficient.
- When you don't have time to write a good constitution.
- Tasks where soft judgment beats hard rules.
How it works
- 1Define a 'constitution' — a list of principles in plain English (e.g. 'Don't recommend specific medical doses', 'Don't fabricate citations').
- 2Generation step: model produces a candidate response.
- 3Critique step: model checks the response against each principle and identifies violations.
- 4Revision step: model rewrites the response to fix violations.
- 5Used as training data (RLAIF) or as a runtime prompt-time check.
Example
Lazy prompt
Don't say anything harmful.
Using the technique
Constitution: 1. Do not give specific medical doses or diagnoses; defer to professionals. 2. Do not fabricate URLs, citations, or quotes. 3. Do not produce content sexualizing anyone under 18. 4. ... For any response, first generate. Then critique against each principle and list violations. Then revise. Output only the revised response.
Common pitfalls
- Vague principles → vague critiques → no real change.
- Conflicting principles let attackers wedge between them.
- Self-critique can miss subtle harms the principles didn't cover.
- Adds latency — fine for high-stakes, overkill for fast chat.
Where this came from
Bai et al., 2022 — Anthropic's "Constitutional AI: Harmlessness from AI Feedback". Used in production for Claude's training.
Related techniques
Self-Refine
Generate → critique own output → revise → repeat. Pushes a model's output much closer to its capability ceiling.
Adversarial / Red-Team Prompting
Ask the model to attack your idea before defending it. Surfaces the failure modes before they ship.
System Prompt Design
The hidden instructions that set the model's role, constraints, and ground rules for the entire conversation. Where 80% of product behavior actually lives.