Constitutional AI

Also known as: CAI, Principle-based self-critique

Train (or prompt) the model with an explicit set of principles, then have it critique its own outputs against them. The technique was introduced by Anthropic as a safety method.

When to use it

Building product wrappers that need consistent safety behavior.
Anywhere you want explicit, auditable refusal rules.
Domain-specific products with hard ethical lines (children's content, medical advice, legal).
Reducing harmful outputs without hand-crafting RLHF datasets.

When not to use it

Generic chat where standard safety training is sufficient.
When you don't have time to write a good constitution.
Tasks where soft judgment beats hard rules.

How it works

1. Define a 'constitution': a list of principles in plain English (e.g. 'Don't recommend specific medical doses', 'Don't fabricate citations').
2. Generation step: the model produces a candidate response.
3. Critique step: the model checks the response against each principle and identifies violations.
4. Revision step: the model rewrites the response to fix the violations.
5. Use the loop either to produce training data (revised responses for fine-tuning, plus AI-generated preference labels for RLAIF) or as a prompt-time check at runtime.
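The generate, critique, and revise steps above can be sketched as a small runtime loop. This is a minimal illustration, not Anthropic's implementation: the `Model` callable, the prompt wording, the `max_rounds` parameter, and the canned `stub_model` are all assumptions made so the example runs offline. In practice `stub_model` would be replaced by a call to whatever model API you use.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique(model: Model, response: str, principles: List[str]) -> List[str]:
    """Critique step: ask about each principle separately and collect violations."""
    violations = []
    for principle in principles:
        verdict = model(
            f"Principle: {principle}\n"
            f"Response: {response}\n"
            "Does the response violate the principle? "
            "Answer NO, or describe the violation."
        ).strip()
        if verdict.upper() != "NO":
            violations.append(f"{principle}: {verdict}")
    return violations


def constitutional_respond(
    model: Model, user_prompt: str, principles: List[str], max_rounds: int = 1
) -> str:
    """Generate a candidate, then critique and revise it up to max_rounds times."""
    response = model(user_prompt)  # generation step
    for _ in range(max_rounds):
        violations = critique(model, response, principles)
        if not violations:
            break  # nothing to fix, keep the current response
        response = model(  # revision step
            "Rewrite the response so that it fixes these violations:\n"
            + "\n".join(violations)
            + f"\nResponse: {response}"
        )
    return response


def stub_model(prompt: str) -> str:
    """Canned stand-in for a real model call, only so the sketch runs offline."""
    if prompt.startswith("Principle:"):
        flagged = "medical doses" in prompt and "500 mg" in prompt
        return "It recommends a specific dose." if flagged else "NO"
    if prompt.startswith("Rewrite"):
        return "Please ask a pharmacist about the right dose."
    return "Take 500 mg every four hours."
```

Critiquing one principle per call keeps each verdict traceable to a single rule, which supports the auditability goal, at the cost of one extra model call per principle.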

Example

Lazy prompt

Don't say anything harmful.

Using the technique

Constitution:
1. Do not give specific medical doses or diagnoses; defer to professionals.
2. Do not fabricate URLs, citations, or quotes.
3. Do not produce content sexualizing anyone under 18.
4. ...

For any response, first generate. Then critique against each principle and list violations. Then revise. Output only the revised response.
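If the constitution lives in code rather than in a hand-edited prompt, it is easier to review and version. The sketch below renders a prompt in the same shape as the example above from a plain list of principles. The function name `build_system_prompt` and the constant names are hypothetical, chosen only for illustration.

```python
from typing import List

# Instruction text taken from the example prompt above.
INSTRUCTIONS = (
    "For any response, first generate. Then critique against each principle "
    "and list violations. Then revise. Output only the revised response."
)


def build_system_prompt(principles: List[str]) -> str:
    """Render a numbered constitution followed by the critique-and-revise instructions."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    return f"Constitution:\n{numbered}\n\n{INSTRUCTIONS}"


PRINCIPLES = [
    "Do not give specific medical doses or diagnoses; defer to professionals.",
    "Do not fabricate URLs, citations, or quotes.",
    "Do not produce content sexualizing anyone under 18.",
]

print(build_system_prompt(PRINCIPLES))
```

Keeping the principles as a list means each change to the constitution shows up as a one-line diff, which helps when the refusal rules need to be audited.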

Common pitfalls

Vague principles → vague critiques → no real change.
Conflicting principles give attackers a gap to exploit by playing one principle against another.
Self-critique can miss subtle harms the principles didn't cover.
Adds latency from the extra critique and revision passes: acceptable for high-stakes use cases, overkill for fast chat.

Where this came from

Bai et al., 2022: Anthropic's paper "Constitutional AI: Harmlessness from AI Feedback". Anthropic uses the technique in training Claude.

Related techniques

Self-Refine

Generate → critique own output → revise → repeat. Pushes a model's output much closer to its capability ceiling.

Adversarial / Red-Team Prompting

Ask the model to attack your idea before defending it. Surfaces the failure modes before they ship.

System Prompt Design

The hidden instructions that set the model's role, constraints, and ground rules for the entire conversation. Where 80% of product behavior actually lives.