Mechanistic interpretability (Part 1 of 2)

Hosted by Frank M.

Details

When: Wednesday 30th July 2025, 6 pm – 8 pm.

Where: The Castle Inn, 36 Castle Street, Cambridge, UK. (Most likely we'll be at one of the large tables upstairs.)

Paper: A. Templeton et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. In Transformer Circuits Thread, 2024.
Link: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

In this first of two sessions, we discuss Anthropic's recent attempts to "look inside the brain" of a large language model and interpret its thoughts. This session will cover the use of sparse autoencoders to identify and probe neural "features", which often represent human-understandable concepts.

Please note: this paper contains a small amount of technical content, but the majority is non-technical and anyone interested in the topic is encouraged to give it a read and join the session, regardless of technical background!

You'll need to read the paper in advance. If you can, please also bring your own copy to refer to during the session.

Machine Learning Reading Group - Cambridge, UK
FREE
20 spots left