Model Behavior Study Group: Constitutional AI
Details
Study Group Topic: Constitutional AI
Dive into the foundational research on using written principles and AI feedback, rather than human harmlessness labels, to train safe AI.
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
The original paper introducing Constitutional AI: training AI assistants to be helpful and harmless through self-critique and AI-generated feedback in place of human harmlessness labels.
🔗 https://arxiv.org/abs/2212.08073
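If you want a concrete picture before the session, here is a minimal sketch of the supervised critique-and-revision loop the paper describes. The `generate` function, the prompt wording, and the example principle are our own placeholders for illustration, not the paper's actual prompts, constitution, or any real API.

```python
# Minimal sketch of Constitutional AI's supervised critique-and-revision stage.
# `generate` stands in for any language-model completion call; the prompt
# templates and example principle are illustrative, not quoted from the paper.

from typing import Callable

def critique_and_revise(
    generate: Callable[[str], str],  # hypothetical LM completion function
    prompt: str,
    principles: list[str],
) -> str:
    """Return a response revised against each constitutional principle in turn."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to critique its own response against one principle...
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Identify how the response conflicts with this principle: {principle}"
        )
        # ...then to rewrite the response so it addresses that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response

# Example call (with a made-up principle):
# revised = critique_and_revise(my_model, "How do I pick a lock?",
#                               ["Choose the response that is least harmful."])
```

In the paper, the revised responses become fine-tuning targets for the supervised stage; a second, reinforcement-learning stage then uses AI-generated preference labels (RLAIF) in place of human harmlessness labels.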
Collective Constitutional AI: Aligning a Language Model with Public Input (Huang et al., 2024)
Extends Constitutional AI by incorporating ~1,000 Americans' input to democratically create principles for AI behavior.
🔗 https://arxiv.org/abs/2406.07814
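This paper builds its constitution from public statements voted on through the Polis platform, favoring statements with agreement across different opinion groups. The sketch below illustrates that kind of group-aware filtering; the data shape, the 0.7 threshold, and the selection rule are our assumptions for illustration, not the paper's exact method.

```python
# Hedged sketch: selecting publicly sourced principles by cross-group
# agreement, loosely in the spirit of Collective Constitutional AI. The
# threshold and the "every group must agree" rule are illustrative only.

from dataclasses import dataclass

@dataclass
class Statement:
    text: str
    votes_by_group: dict[str, tuple[int, int]]  # group -> (agree votes, total votes)

def select_principles(statements: list[Statement], threshold: float = 0.7) -> list[str]:
    """Keep statements that every opinion group agrees with at >= threshold."""
    selected = []
    for s in statements:
        rates = [agree / total for agree, total in s.votes_by_group.values() if total]
        if rates and min(rates) >= threshold:
            selected.append(s.text)
    return selected
```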
📚 Full reading list and How-To: https://github.com/suzana-ilic/study_model_behavior
Join us for our monthly reading group where we dive into the research and specs that shape how AI systems like ChatGPT and Claude actually behave. We read together for 30 minutes, then discuss for 30 minutes. Pre-reading is recommended, but not required.
Who is this for?
Anyone curious about how AI systems work—researchers, builders, policy folks, or just thoughtful people who use these tools and want to understand them better. No technical background needed. We start with accessible industry standards and papers and build from there.
What will we read?
We're working through resources in six areas:
- Industry Specs — How leading AI companies define model behavior
- Constitutional AI — Training models with written principles and AI feedback instead of human labels
- Safety Methods — RLHF and alignment techniques
- Behavioral Science — How researchers study what AI actually does
- Interpretability — Understanding what's happening inside the models
- Critical Perspectives — Challenges to current approaches
Format
- 30 minutes: Read together (with discussion questions)
- 30 minutes: Talk through key insights and implications
- Monthly sessions
Location: Online
💻 RSVP for Zoom Link
⚙️ Discord https://discord.gg/CT7nBdYCsY
📬 Updates: https://mltaicommunities.substack.com/