#17 Mechanistic interpretability (Part 2 of 2)

Details
You'll need to read both papers in advance. Ideally, please also bring your own copies of the papers to refer to during the session.
** 29th September 2025: Mechanistic interpretability (2) **
When: Monday 29th September 2025, 6pm – 8pm.
Where: The Castle Inn, 36 Castle Street, Cambridge, UK. (Most likely we'll be at one of the large tables upstairs.)
Papers:
[1] E. Ameisen et al. Circuit Tracing: Revealing Computational Graphs in Language Models. In Transformer Circuits Thread, 2025. URL: https://transformer-circuits.pub/2025/attribution-graphs/methods.html
[2] J. Lindsey et al. On the Biology of a Large Language Model. In Transformer Circuits Thread, 2025. URL: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
In this second of two sessions, we look further into Anthropic's recent attempts to perform a "brain scan" on a large language model and interpret its thoughts. This session will cover the use of cross-layer transcoders to identify neural "circuits", which reveal some of the internal mechanisms by which a frontier AI model (Claude 3.5 Haiku) formulates its responses to user prompts.
