Automated Debugging of Multi-Agent Systems: A Practical Guide to Failure Attribution Using Who&When
Overview
Multi-agent systems powered by large language models (LLMs) have become a popular architecture for tackling complex, real-world tasks. However, their collaborative nature introduces a notorious debugging challenge: when a multi-agent pipeline fails, it can be nearly impossible to determine which agent made the first mistake and when the error occurred. Traditional debugging methods require developers to manually comb through lengthy interaction logs—a process akin to searching for a needle in a haystack—and often demand deep expertise in the system's internals.

To address this, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, introduced the concept of Automated Failure Attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides both a benchmark dataset called Who&When and several automated attribution methods. This tutorial walks you through the problem, the dataset, and how to apply these techniques to your own multi-agent systems.
Prerequisites
- Python 3.9+ – The codebase is built on PyTorch and standard ML libraries.
- Basic understanding of LLM-driven multi-agent systems – Familiarity with agent roles, communication protocols, and tool use is helpful.
- Git – To clone the repository.
- Hugging Face account (optional) – For direct dataset downloads via the datasets library.
Step-by-Step Instructions
1. Set Up the Environment
Clone the official repository and install dependencies:
git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
cd Agents_Failure_Attribution
pip install -r requirements.txt
The repository includes scripts for loading the dataset, running attribution methods, and evaluating results. Make sure your Python environment has CUDA support if you plan to use GPU acceleration.
2. Understand the Who&When Dataset
The Who&When dataset contains multi-agent interaction logs with ground-truth labels indicating the responsible agent and the exact step where the failure originated. Each log records a chain of messages, tool calls, and intermediate outputs from a multi-agent system attempting a task (e.g., question answering, code generation). The dataset is split into training, validation, and test sets.
To inspect the dataset:
from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When")
print(dataset["train"][0].keys())
# Output: ['log', 'failure_agent', 'failure_step', ...]
The log field contains the full trace of agent interactions in JSON format. The failure_agent field identifies which agent (by name or ID) caused the failure, and failure_step is the integer index of the step where the error first appeared.
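To make the schema concrete, here is a minimal sketch that parses the log field of a record. The record below is hand-made for illustration (its field names follow the description above, but the exact structure of real records may differ); real records come from the load_dataset call shown earlier.

```python
import json

# A hypothetical Who&When-style record for illustration only.
sample = {
    "log": json.dumps([
        {"agent": "planner", "content": "Break the task into steps."},
        {"agent": "executor", "content": "Run step 1... (wrong tool call)"},
        {"agent": "verifier", "content": "Output looks inconsistent."},
    ]),
    "failure_agent": "executor",
    "failure_step": 1,
}

# Decode the JSON trace and look up the labeled failure step.
steps = json.loads(sample["log"])
culprit = steps[sample["failure_step"]]
print(f"Agent '{sample['failure_agent']}' failed at step {sample['failure_step']}:")
print(f"  {culprit['content']}")
```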
3. Run an Attribution Method
The repository implements several baseline and advanced attribution methods, including:
- Naive Attribution – Assigns blame to the agent that produced the last valid output before the failure.
- Causal Tracing – Tracks the flow of information through the agent communication graph.
- Counterfactual Reasoning – Simulates alternative agent behaviors to isolate the root cause.
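As an illustration, the naive strategy can be sketched in a few lines. The is_valid predicate here is a stand-in: in the actual methods this validity judgment would come from an LLM judge or task-specific heuristic, not a simple string match.

```python
def naive_attribution(steps, is_valid):
    """Blame the first step whose output fails a validity check.

    steps: list of dicts with at least 'agent' and 'content' keys.
    is_valid: callable judging one step's output; a placeholder for
    whatever judge the real method uses.
    Returns (agent, step_index), or (None, None) if no step fails.
    """
    for i, step in enumerate(steps):
        if not is_valid(step):
            return step["agent"], i
    return None, None

trace = [
    {"agent": "planner", "content": "Plan: search, then summarize."},
    {"agent": "executor", "content": "ERROR: tool 'search' not found"},
    {"agent": "summarizer", "content": "No results to summarize."},
]
agent, step = naive_attribution(trace, lambda s: "ERROR" not in s["content"])
print(agent, step)  # -> executor 1
```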
To run a quick experiment with the default method:
python run_attribution.py --method causal_tracing --dataset Who_and_When
This will output a JSON file containing predictions for each test-sample failure: the attributed agent and step, along with a confidence score. You can change the method to counterfactual or naive by modifying the --method argument.

4. Evaluate Performance
Use the evaluation script to compute accuracy metrics against the ground-truth labels:
python evaluate.py --predictions output.json --ground_truth dataset/test/labels.json
The script reports top-1 and top-3 accuracy for agent identification, as well as step localization error (mean absolute deviation).
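Both metrics are straightforward to compute by hand. The sketch below assumes predictions and labels are aligned lists of (agent, step) pairs; evaluate.py reads them from the JSON files instead.

```python
def evaluate(predictions, ground_truth):
    """Agent-identification accuracy and step localization error (MAE).

    predictions / ground_truth: lists of (agent, step) pairs,
    aligned by sample index.
    """
    n = len(ground_truth)
    # Fraction of samples where the predicted agent matches the label.
    agent_acc = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth)) / n
    # Mean absolute deviation between predicted and true step indices.
    step_mae = sum(abs(p[1] - g[1]) for p, g in zip(predictions, ground_truth)) / n
    return agent_acc, step_mae

preds = [("executor", 2), ("planner", 0), ("verifier", 5)]
truth = [("executor", 3), ("planner", 0), ("executor", 4)]
acc, mae = evaluate(preds, truth)
print(f"agent accuracy: {acc:.2f}, step MAE: {mae:.2f}")  # -> 0.67, 0.67
```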
5. Adapt to Your Own System
To apply attribution to your own multi-agent traces, you must format your logs to match the Who&When schema. The expected structure includes:
- steps – a list of dicts with agent, content, tool_calls (optional), and timestamp.
- final_output – the final response or error message.
- task_description – the original user query.
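A small converter can map a homegrown log format into this schema. The input field names below (speaker, text, time, tools) are hypothetical; adapt them to whatever your system actually emits.

```python
import json

def to_who_and_when(raw_events, task, final):
    """Convert a list of homegrown event dicts into the schema above."""
    steps = []
    for ev in raw_events:
        step = {
            "agent": ev["speaker"],
            "content": ev["text"],
            "timestamp": ev.get("time"),
        }
        if ev.get("tools"):  # tool_calls is optional in the schema
            step["tool_calls"] = ev["tools"]
        steps.append(step)
    return {"task_description": task, "steps": steps, "final_output": final}

events = [
    {"speaker": "planner", "text": "Outline the solution.",
     "time": "2025-01-01T00:00:00"},
    {"speaker": "executor", "text": "Run query.",
     "time": "2025-01-01T00:00:05",
     "tools": [{"name": "search", "args": {"q": "example"}}]},
]
trace = to_who_and_when(events, "Answer the user's question.", "No answer produced.")
with open("my_trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```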
Once formatted, you can run the same attribution scripts by pointing them to your custom JSON file:
python run_attribution.py --method counterfactual --custom_logs my_trace.json
Common Mistakes
Incomplete Log Traces
Attribution relies on having a full, sequential record of all agent activities. Missing steps or omitted tool outputs can mislead the causal analysis. Ensure your logging captures every exchange, including intermediate results.
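A quick pre-flight check can catch incomplete traces before attribution runs. This sketch only verifies that the core fields from the schema above are present in every step; extend the REQUIRED tuple as your logs demand.

```python
REQUIRED = ("agent", "content")

def check_trace(steps):
    """Return (index, missing_fields) for every step lacking required keys."""
    problems = []
    for i, step in enumerate(steps):
        missing = [field for field in REQUIRED if field not in step]
        if missing:
            problems.append((i, missing))
    return problems

print(check_trace([
    {"agent": "planner", "content": "ok"},
    {"agent": "executor"},  # missing 'content' -> flagged
]))
# -> [(1, ['content'])]
```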
Ignoring Agent Roles
In many systems, agents have specific responsibilities (e.g., planner, executor, verifier). The attribution methods benefit from knowing these roles. If your logs do not differentiate roles, consider adding an agent_type field.
Overlooking Failures at the System Level
Some failures stem from global issues like task ambiguity or missing tool definitions, not from any single agent. The current dataset focuses on agent-level failures, but you should manually inspect such cases if your attribution yields low confidence.
Summary
Automated failure attribution offers a systematic way to pinpoint the root cause of breakdowns in LLM multi-agent systems. The Who&When dataset and the accompanying methods provide a solid starting point for researchers and practitioners to move beyond manual log archaeology. By following the steps in this guide—setting up the environment, understanding the dataset, running attribution, and adapting to custom logs—you can accelerate debugging and improve system reliability. The paper's acceptance as a Spotlight at ICML 2025 underscores the importance of this new research direction.