Mechanistic interpretability (Part 1 of 2)

Details
When: Wednesday 30th July 2025, 6 pm – 8 pm.
Where: The Castle Inn, 36 Castle Street, Cambridge, UK. (Most likely we'll be at one of the large tables upstairs.)
Paper: A. Templeton et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. In Transformer Circuits Thread, 2024.
Link: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
In this first of two sessions, we discuss Anthropic's recent attempts to "look inside the brain" of a large language model and interpret its thoughts. This session will cover the use of sparse autoencoders to identify and probe neural "features", which often represent human-understandable concepts.
Please note: this paper contains a small amount of technical content, but the majority is non-technical and anyone interested in the topic is encouraged to give it a read and join the session, regardless of technical background!
You'll need to read the paper in advance. Ideally, please also bring your own copy of the paper to refer to during the session.
