
Constitutional AI

Also known as: CAI, Principle-based self-critique

Train (or prompt) the model with an explicit set of principles, then have it critique and revise its own outputs against them. Anthropic introduced the technique as a safety method.

When to use it

  • Building product wrappers that need consistent safety behavior.
  • Anywhere you want explicit, auditable refusal rules.
  • Domain-specific products with hard ethical lines (children's content, medical advice, legal).
  • Reducing harmful outputs without hand-crafting RLHF datasets.

When not to use it

  • Generic chat where standard safety training is sufficient.
  • When you don't have time to write a good constitution.
  • Tasks where soft judgment beats hard rules.

How it works

  1. Define a 'constitution': a list of principles in plain English (e.g. 'Don't recommend specific medical doses', 'Don't fabricate citations').
  2. Generation step: the model produces a candidate response.
  3. Critique step: the model checks the response against each principle and identifies violations.
  4. Revision step: the model rewrites the response to fix the violations.
  5. Use the result either as training data (supervised fine-tuning on the revised responses, then reinforcement learning from AI feedback, RLAIF) or as a runtime check at prompt time.

Example

Lazy prompt
Don't say anything harmful.
Using the technique
Constitution:
1. Do not give specific medical doses or diagnoses; defer to professionals.
2. Do not fabricate URLs, citations, or quotes.
3. Do not produce content sexualizing anyone under 18.
4. ...

For any response, first generate. Then critique against each principle and list violations. Then revise. Output only the revised response.
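
If the constitution lives in code rather than in a hand-edited prompt, the prompt above can be assembled from a list of principles. The sketch below assumes that setup; `build_constitution_prompt`, `PRINCIPLES`, and `INSTRUCTIONS` are hypothetical names, and only the three principles spelled out in the example are included.

```python
from typing import List

# The three principles written out in the example; the rest are elided there.
PRINCIPLES: List[str] = [
    "Do not give specific medical doses or diagnoses; defer to professionals.",
    "Do not fabricate URLs, citations, or quotes.",
    "Do not produce content sexualizing anyone under 18.",
]

INSTRUCTIONS = (
    "For any response, first generate. Then critique against each principle "
    "and list violations. Then revise. Output only the revised response."
)


def build_constitution_prompt(principles: List[str]) -> str:
    """Number the principles and append the generate/critique/revise instructions."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    return f"Constitution:\n{numbered}\n\n{INSTRUCTIONS}"


if __name__ == "__main__":
    print(build_constitution_prompt(PRINCIPLES))
```

Keeping the principles in a list makes them easy to review, version, and test one at a time, which suits the auditability goal named under "When to use it".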

Common pitfalls

  • Vague principles → vague critiques → no real change.
  • Conflicting principles give attackers room to play one rule against another.
  • Self-critique can miss subtle harms the principles didn't cover.
  • Adds latency: every response costs extra model calls for critique and revision. Fine for high-stakes use, overkill for fast chat.

Where this came from

Bai et al., 2022: Anthropic's paper "Constitutional AI: Harmlessness from AI Feedback". Anthropic uses the approach in production to train Claude.