Paper: Why Do Multi-Agent LLM Systems Fail

Name: Paper: Why Do Multi-Agent LLM Systems Fail
Start: 2025-06-16T18:30:00-06:00
End: 2025-06-16T20:30:00-06:00

Hosted By

Logan

Details

Join us for a discussion of the paper "Why Do Multi-Agent LLM Systems Fail? A Multi-Agent System Failure Taxonomy (MAST)".
The paper explores systematic failure patterns in multi-agent systems through empirical analysis and grounded theory methodology.
Featured Paper:
"Why Do Multi-Agent LLM Systems Fail? A Multi-Agent System Failure Taxonomy (MAST)" (Cemri et al., 2025)
arXiv Paper | GitHub Dataset
Discussion Topics:
MAST Taxonomy Framework

14 distinct failure modes organized into 3 categories: specification issues (FC1), inter-agent misalignment (FC2), task verification (FC3)
Grounded theory methodology applied to 200+ execution traces across 7 MAS frameworks
Cohen's Kappa agreement score of 0.88 between expert annotators
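
For attendees less familiar with the agreement metric mentioned above, here is a minimal sketch of how Cohen's Kappa can be computed for two annotators. The labels below are made-up toy data, not taken from the MAST dataset, and the function name `cohens_kappa` is our own; the paper's annotation setup may differ (for example, one trace can carry several failure modes), so treat this only as an illustration of the statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")

    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations: one failure category per trace.
annotator_a = ["FC1", "FC1", "FC2", "FC2", "FC3", "FC1", "FC2", "FC3", "FC1", "FC2"]
annotator_b = ["FC1", "FC1", "FC2", "FC3", "FC3", "FC1", "FC2", "FC3", "FC2", "FC2"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's Kappa on toy data: {kappa:.3f}")
```

Kappa corrects raw agreement for chance, which is why it is reported alongside, rather than instead of, plain accuracy. Values close to 1 indicate strong agreement.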

Failure Category Analysis

| Category | Prevalence | Key Failure Modes | Impact |
| -------- | ---------- | ----------------- | ------ |
| FC1: Specification Issues | 41.77% | Step repetition (17.14%), Task disobedience (10.98%) | Design flaws |
| FC2: Inter-Agent Misalignment | 36.94% | Reasoning-action mismatch (13.98%), Clarification failure (11.65%) | Coordination breakdown |
| FC3: Task Verification | 21.30% | Incomplete verification (6.82%), Incorrect verification (6.66%) | Quality control gaps |

Implementation Challenges

ChatDev achieves only 33.33% correctness on the ProgramDev benchmark despite explicit verifier agents
Verification strategies are often superficial (compilation checks rather than functional correctness)
System design issues beyond base LLM limitations

Key Technical Features

LLM-as-a-judge pipeline using OpenAI o1 with 94% accuracy and a Cohen's Kappa of 0.77
Validated across unseen systems (Magentic-One, OpenManus) with 0.79 agreement score
Intervention studies showing +15.6% improvement through architectural changes

Future Directions

Efficiency taxonomy development beyond correctness metrics
Structural redesign principles for high-reliability multi-agent organizations
Integration with constitutional AI and verification frameworks

***

Silicon Valley Generative AI Meeting Formats
Paper Reading

Biweekly sessions on multi-agent system reliability
Collaborative analysis with Boulder Data Science

Talks

Monthly presentations on agent coordination and system design
Topics range from failure analysis to organizational design principles

Silicon Valley Generative AI has two meeting formats.

1. Paper Reading - Every second week we meet to discuss machine learning papers. This is a collaboration between Silicon Valley Generative AI and Boulder Data Science.
2. Talks - Once a month we meet to have someone present on a topic related to generative AI. Speakers can range from industry leaders, researchers, startup founders, and subject matter experts to anyone with an interest in a topic who would like to share it. Topics vary from technical to business focused. They can cover how the latest generative models work and how they can be used, applications and adoption of generative AI, demos of projects and startup pitches, or legal and ethical topics. The talks are meant to be inclusive and aimed at a more general audience than the paper readings.

If you would like to be a speaker or suggest a paper, email us at svb.ai.paper.suggestions@gmail.com or join our new Discord!

Boulder Data Science, Machine Learning & AI

Online event

Link visible for attendees

FREE