
AI Safety Fundamentals Week 6

Hosted By
Nico H.

Details

Hello everyone! πŸ‘‹
Our next meetup will be on Thursday 07.12 at 18:30 at CARL S03 πŸ˜ƒ

This week we will start learning about interpretability. You have probably heard that current neural networks are mostly black-box systems. The goal of the field of interpretability is to change this and find ways to make the computations done by AIs more human-understandable. This understanding can then be used to build safer systems, e.g. by creating neural lie detectors that catch deceptive models, or by otherwise showing that the computations done by AIs satisfy safety properties.
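As a small taste of what "making computations human-understandable" can look like in practice, here is a minimal PyTorch sketch of a linear probe, the technique covered in one of this week's core readings (arXiv:1610.01644): train a simple linear classifier on a frozen model's intermediate activations to test what information a layer encodes. This is only a sketch; the model, layer, and data names are hypothetical placeholders, not a specific library API.

```python
# Minimal sketch of a linear probe in the spirit of Alain & Bengio
# (arXiv:1610.01644): freeze a trained network, record the activations
# of one intermediate layer, and train a linear classifier on them.
# `model`, `layer`, `inputs`, and `labels` are hypothetical placeholders.
import torch
import torch.nn as nn

def get_activations(model, layer, inputs):
    # Capture the chosen layer's output via a forward hook; the model
    # itself is never updated.
    acts = []
    hook = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        model(inputs)
    hook.remove()
    return torch.cat(acts).flatten(1)  # shape: (batch, features)

def train_probe(acts, labels, num_classes, epochs=100):
    # The probe is just a single linear layer on top of frozen activations.
    probe = nn.Linear(acts.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(acts), labels).backward()
        opt.step()
    # High probe accuracy suggests the layer linearly encodes the label.
    return probe
```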

We will start with time to read the material from 18:30 to 19:30. The focus here is on understanding the content, but small discussions and questions are welcome too.

The main discussion part of the meetup will run from 19:30 to 20:30, with the option to get dinner together afterwards. We will have various discussion prompts to explore our ideas around the topics from this week's reading and AI Safety in general. If you prefer to read the material at home, you can come at 19:30.

Core reading for this week:
https://distill.pub/2020/circuits/zoom-in/ (Zoom In: An Introduction to Circuits)
https://arxiv.org/abs/1610.01644 (Understanding intermediate layers using linear classifier probes, Sections 1 and 3)
https://rome.baulab.info/ (ROME: Locating and Editing Factual Associations in GPT)

Optional reading for this week:
https://www.alignmentforum.org/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written (Discovering Language Model Behaviors with Model-Written Evaluations)

Looking forward to seeing you! 😊

AI Safety Aachen