
Paper Discussion: Propositional Interpretability in Artificial Intelligence

Hosted by Andrew S.

Details

Paper Discussion: Propositional Interpretability in Artificial Intelligence
David J. Chalmers, 2025
https://arxiv.org/abs/2501.15740

Abstract: Mechanistic interpretability is the program of explaining what AI systems are doing in terms of their internal mechanisms. I analyze some aspects of the program, along with setting out some concrete challenges and assessing progress to date. I argue for the importance of propositional interpretability, which involves interpreting a system's mechanisms and behavior in terms of propositional attitudes: attitudes (such as belief, desire, or subjective probability) to propositions (e.g. the proposition that it is hot outside). Propositional attitudes are the central way that we interpret and explain human beings and they are likely to be central in AI too. A central challenge is what I call thought logging: creating systems that log all of the relevant propositional attitudes in an AI system over time. I examine currently popular methods of interpretability (such as probing, sparse auto-encoders, and chain of thought methods) as well as philosophical methods of interpretation (including those grounded in psychosemantics) to assess their strengths and weaknesses as methods of propositional interpretability.

SF Philosophy Club
Location: TBD
FREE
10 spots left