Participatory and Culturally Grounded AI Evaluation for Health ... Benchmarking
Details
We are pleased to welcome Yann Le Beux, co-founder of Yux Design and Kitala AI; Ertony Basilwango, Machine Learning and Research Engineer at Yux Design; and Nafissatou Césaltina L., Data Analyst at Yux Design.
They will present two papers on health, "Participatory and Culturally Grounded AI Evaluation for Health Use Cases within the African Context" and "Benchmarking Large Language Models on a Culturally Grounded Maternal Health QA Dataset from Senegal", whose abstracts follow:
"AI adoption is growing rapidly across domains such as education, finance, and health. In healthcare, AI is increasingly used to support child care, self-care, and other health-related decisions. Yet, little research examines how these systems perform in real-world, non-Western contexts. Traditional AI evaluation frameworks often focus on accuracy or safety, overlooking cultural relevance, clarity, and trust. We explore participatory, human-centred methods for evaluating the use of AI in health use cases within the African context. Our approach combines small-group qualitative exercises and diary-based surveys to capture both in-depth and longitudinal insights. Central to this work is Cultural Teaming, a method using scenario-based testing, creative role-play, and participant-led exploration to surface contextual and cultural gaps in AI responses. Participants interact with AI using locally grounded prompts and rate outputs on resonance, clarity, and trustworthiness. Findings show that scenarios, personas, and paired exercises elicit richer prompts and nuanced feedback, while diary studies reveal longitudinal usage patterns and engagement behaviors. Together, these methods offer a scalable, culturally grounded framework for participatory AI evaluation, providing actionable insights for more inclusive and contextually appropriate AI systems."
"Large Language Models (LLMs) are increasingly used in domains such as healthcare, education, and public services, where users rely on them to answer sensitive and personal questions. However, their reliability in culturally grounded and domain-specific contexts remains poorly understood, mainly due to the lack of region-specific benchmarks and datasets. This limitation is particularly critical in maternal health, where cultural beliefs, social norms, and local practices strongly influence health practices and decision-making, especially in African settings. In this work, we introduce a culturally grounded Question–Answer (QA) dataset for maternal health in Senegal, consisting of 1,000 high-quality QA pairs. The dataset covers key topics such as family planning, maternal risks, healthcare access, and culturally influenced decision-making scenarios. We also propose a hybrid evaluation framework combining automatic metrics (BLEU, ROUGE, BERTScore), LLM-as-a-Judge scoring, and human expert validation to assess factual accuracy, reasoning, and cultural relevance. We benchmark 17 LLMs and observe a clear performance hierarchy, with proprietary models such as Gemini 2.5 Pro and Claude Sonnet 4 achieving the highest scores. However, all models show limitations in handling culturally nuanced scenarios. Our results show a strong correlation (0.82) between human and LLM-based evaluations, highlighting the reliability of hybrid evaluation approaches. Overall, this work addresses a critical gap in evaluating LLMs in culturally grounded healthcare settings and emphasizes the need for culturally adapted datasets to build more reliable and context-aware AI systems."
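As a rough illustration of the human versus LLM-judge agreement mentioned in the second abstract, the sketch below computes a Pearson correlation between two lists of scores. The abstract does not say which correlation coefficient the authors used, so Pearson is an assumption here, and the scores, the 1 to 5 scale, and the names `human_scores` and `judge_scores` are hypothetical values chosen only to show the calculation. They are not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings for five model answers (illustrative only).
human_scores = [4, 2, 5, 3, 1]  # assumed human expert ratings
judge_scores = [5, 2, 4, 3, 1]  # assumed LLM-as-a-Judge ratings

r = pearson(human_scores, judge_scores)
print(f"Pearson r between human and judge scores: {r:.2f}")
```

A value close to 1 would mean the LLM judge ranks answers much as the human experts do. A rank-based coefficient such as Spearman could be substituted if the ratings are treated as ordinal.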
