Automated Debugging of Multi-Agent Systems: A Practical Guide to Failure Attribution Using Who&When
Overview
Multi-agent systems powered by large language models (LLMs) have become a popular architecture for tackling complex, real-world tasks. However, their collaborative nature introduces a notorious debugging challenge: when a multi-agent pipeline fails, it can be nearly impossible to determine which agent made the first mistake and when the error occurred. Traditional debugging methods require developers to manually comb through lengthy interaction logs—a process akin to searching for a needle in a haystack—and often demand deep expertise in the system's internals.

To address this, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, introduced the concept of Automated Failure Attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides both a benchmark dataset called Who&When and several automated attribution methods. This tutorial walks you through the problem, the dataset, and how to apply these techniques to your own multi-agent systems.
Prerequisites
- Python 3.9+ – The codebase is built on PyTorch and standard ML libraries.
- Basic understanding of LLM-driven multi-agent systems – Familiarity with agent roles, communication protocols, and tool use is helpful.
- Git – To clone the repository.
- Hugging Face account (optional) – For direct dataset downloads via the datasets library.
Step-by-Step Instructions
1. Set Up the Environment
Clone the official repository and install dependencies:
git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
cd Agents_Failure_Attribution
pip install -r requirements.txt
The repository includes scripts for loading the dataset, running attribution methods, and evaluating results. Make sure your Python environment has CUDA support if you plan to use GPU acceleration.
2. Understand the Who&When Dataset
The Who&When dataset contains multi-agent interaction logs with ground-truth labels indicating the responsible agent and the exact step where the failure originated. Each log records a chain of messages, tool calls, and intermediate outputs from a multi-agent system attempting a task (e.g., question answering, code generation). The dataset is split into training, validation, and test sets.
To inspect the dataset:
from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When")
print(dataset["train"][0].keys())
# Output: ['log', 'failure_agent', 'failure_step', ...]
The log field contains the full trace of agent interactions in JSON format. The failure_agent field identifies which agent (by name or ID) caused the failure, and failure_step is the integer index of the step where the error first appeared.
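To make the schema concrete, here is a minimal sketch that parses the log field of a record. The record below is hand-made for illustration (its field names follow the description above, but the exact structure of real records may differ); real records come from the load_dataset call shown earlier.

```python
import json

# A hypothetical Who&When-style record for illustration only.
sample = {
    "log": json.dumps([
        {"agent": "planner", "content": "Break the task into steps."},
        {"agent": "executor", "content": "Run step 1... (wrong tool call)"},
        {"agent": "verifier", "content": "Output looks inconsistent."},
    ]),
    "failure_agent": "executor",
    "failure_step": 1,
}

# Decode the JSON trace and look up the labeled failure step.
steps = json.loads(sample["log"])
culprit = steps[sample["failure_step"]]
print(f"Agent '{sample['failure_agent']}' failed at step {sample['failure_step']}:")
print(f"  {culprit['content']}")
```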
3. Run an Attribution Method
The repository implements several baseline and advanced attribution methods, including:
- Naive Attribution – Assigns blame to the agent that produced the last valid output before the failure.
- Causal Tracing – Tracks the flow of information through the agent communication graph.
- Counterfactual Reasoning – Simulates alternative agent behaviors to isolate the root cause.
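As an illustration, the naive strategy can be sketched in a few lines. The is_valid predicate here is a stand-in: in the actual methods this validity judgment would come from an LLM judge or task-specific heuristic, not a simple string match.

```python
def naive_attribution(steps, is_valid):
    """Blame the first step whose output fails a validity check.

    steps: list of dicts with at least 'agent' and 'content' keys.
    is_valid: callable judging one step's output; a placeholder for
    whatever judge the real method uses.
    Returns (agent, step_index), or (None, None) if no step fails.
    """
    for i, step in enumerate(steps):
        if not is_valid(step):
            return step["agent"], i
    return None, None

trace = [
    {"agent": "planner", "content": "Plan: search, then summarize."},
    {"agent": "executor", "content": "ERROR: tool 'search' not found"},
    {"agent": "summarizer", "content": "No results to summarize."},
]
agent, step = naive_attribution(trace, lambda s: "ERROR" not in s["content"])
print(agent, step)  # -> executor 1
```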
To run a quick experiment with the default method:
python run_attribution.py --method causal_tracing --dataset Who_and_When
This will output a JSON file containing predictions for each test-sample failure: the attributed agent and step, along with a confidence score. You can change the method to counterfactual or naive by modifying the --method argument.

4. Evaluate Performance
Use the evaluation script to compute accuracy metrics against the ground-truth labels:
python evaluate.py --predictions output.json --ground_truth dataset/test/labels.json
The script reports top-1 and top-3 accuracy for agent identification, as well as step localization error (mean absolute deviation).
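Both metrics are straightforward to compute by hand. The sketch below assumes predictions and labels are aligned lists of (agent, step) pairs; evaluate.py reads them from the JSON files instead.

```python
def evaluate(predictions, ground_truth):
    """Agent-identification accuracy and step localization error (MAE).

    predictions / ground_truth: lists of (agent, step) pairs,
    aligned by sample index.
    """
    n = len(ground_truth)
    # Fraction of samples where the predicted agent matches the label.
    agent_acc = sum(p[0] == g[0] for p, g in zip(predictions, ground_truth)) / n
    # Mean absolute deviation between predicted and true step indices.
    step_mae = sum(abs(p[1] - g[1]) for p, g in zip(predictions, ground_truth)) / n
    return agent_acc, step_mae

preds = [("executor", 2), ("planner", 0), ("verifier", 5)]
truth = [("executor", 3), ("planner", 0), ("executor", 4)]
acc, mae = evaluate(preds, truth)
print(f"agent accuracy: {acc:.2f}, step MAE: {mae:.2f}")  # -> 0.67, 0.67
```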
5. Adapt to Your Own System
To apply attribution to your own multi-agent traces, you must format your logs to match the Who&When schema. The expected structure includes:
- steps – a list of dicts with agent, content, tool_calls (optional), and timestamp.
- final_output – the final response or error message.
- task_description – the original user query.
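A small converter can map a homegrown log format into this schema. The input field names below (speaker, text, time, tools) are hypothetical; adapt them to whatever your system actually emits.

```python
import json

def to_who_and_when(raw_events, task, final):
    """Convert a list of homegrown event dicts into the schema above."""
    steps = []
    for ev in raw_events:
        step = {
            "agent": ev["speaker"],
            "content": ev["text"],
            "timestamp": ev.get("time"),
        }
        if ev.get("tools"):  # tool_calls is optional in the schema
            step["tool_calls"] = ev["tools"]
        steps.append(step)
    return {"task_description": task, "steps": steps, "final_output": final}

events = [
    {"speaker": "planner", "text": "Outline the solution.",
     "time": "2025-01-01T00:00:00"},
    {"speaker": "executor", "text": "Run query.",
     "time": "2025-01-01T00:00:05",
     "tools": [{"name": "search", "args": {"q": "example"}}]},
]
trace = to_who_and_when(events, "Answer the user's question.", "No answer produced.")
with open("my_trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```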
Once formatted, you can run the same attribution scripts by pointing them to your custom JSON file:
python run_attribution.py --method counterfactual --custom_logs my_trace.json
Common Mistakes
Incomplete Log Traces
Attribution relies on having a full, sequential record of all agent activities. Missing steps or omitted tool outputs can mislead the causal analysis. Ensure your logging captures every exchange, including intermediate results.
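A quick pre-flight check can catch incomplete traces before attribution runs. This sketch only verifies that the core fields from the schema above are present in every step; extend the REQUIRED tuple as your logs demand.

```python
REQUIRED = ("agent", "content")

def check_trace(steps):
    """Return (index, missing_fields) for every step lacking required keys."""
    problems = []
    for i, step in enumerate(steps):
        missing = [field for field in REQUIRED if field not in step]
        if missing:
            problems.append((i, missing))
    return problems

print(check_trace([
    {"agent": "planner", "content": "ok"},
    {"agent": "executor"},  # missing 'content' -> flagged
]))
# -> [(1, ['content'])]
```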
Ignoring Agent Roles
In many systems, agents have specific responsibilities (e.g., planner, executor, verifier). The attribution methods benefit from knowing these roles. If your logs do not differentiate roles, consider adding an agent_type field.
Overlooking Failures at the System Level
Some failures stem from global issues like task ambiguity or missing tool definitions, not from any single agent. The current dataset focuses on agent-level failures, but you should manually inspect such cases if your attribution yields low confidence.
Summary
Automated failure attribution offers a systematic way to pinpoint the root cause of breakdowns in LLM multi-agent systems. The Who&When dataset and the accompanying methods provide a solid starting point for researchers and practitioners to move beyond manual log archaeology. By following the steps in this guide—setting up the environment, understanding the dataset, running attribution, and adapting to custom logs—you can accelerate debugging and improve system reliability. The paper's acceptance as a Spotlight at ICML 2025 underscores the importance of this new research direction.