Mon, Jun 22 · 6:00 PM BST
Recent educational AI benchmarks show that while many foundation models perform well on broad pedagogical knowledge tests, their performance drops on harder tutoring tasks such as diagnosing specific algebra misconceptions, integrating visual student work, and using strong pedagogical moves like Socratic scaffolding. Even so, strong performance on difficult tutoring tasks is not the same as evidence of instructional impact. This talk takes up a central question for AI tutor benchmarking: does an AI tutor make decisions that improve learning? It introduces the idea of causal benchmarks for AI tutoring, a framework for evaluating tutor behavior based on its relationship to student outcomes.
The proposed framework has four parts: identifying which tutoring sessions are effective, locating the critical moments where tutor choices matter most, determining which tutor moves are most productive at those moments, and evaluating whether AI tutors choose those moves. This talk will present a causal analysis that estimates session effects, preliminary analyses for identifying critical moments, and a large-scale annotation effort focused on tutor moves. Together, these components suggest a path toward benchmarks that complement existing evaluation approaches by shifting attention from surface quality to instructional effectiveness. The talk will also highlight the methodological and practical challenges of building such benchmarks and discuss how this agenda could support more rigorous evaluation of future AI tutoring systems.