Extending Inspect Framework: Integrating Weights & Biases
This work was supported through the MARS (Mentorship for Alignment Research Students) program at the Cambridge AI Safety Hub (caish.org/mars).
Why We Built inspect_wandb
Evaluating frontier AI models can be messy; each benchmark has its own quirks, scripts, and formats, and even after a successful run, the results usually remain buried in local JSON files. That’s fine for a solo experiment, but painful when you need to compare models, audit results, or reproduce someone else’s work.
Inspect is an open-source framework for running evaluations locally on your own machine. Weights & Biases (WandB) is a suite of cloud tools for sharing, tracking, and visualizing ML experiments and LLM evals. Our integration—inspect_wandb—is the missing bridge: it lets Inspect stream evaluations, metrics, and traces directly into WandB, providing a single source of truth, searchable dashboards, and reproducible runs.
Core Features
Connecting Models and Weave
The integration leverages two complementary WandB tools, each serving distinct but interconnected purposes.
When running LLM evals, WandB Models is primarily useful for reproducibility and tracking. Every evaluation becomes a fully logged Run complete with system state, git information, command history, and configurable file uploads. When you need to reproduce a result months later, all the context is preserved.
Overview of an evaluation run, easy to reproduce with the context preserved.
WandB Weave delivers analysis and comparison capabilities. It transforms individual evaluation tasks into a searchable, filterable, comparable dataset. Rather than digging through local log files or passing them around via Google Drive, teams can query their entire evaluation history through an intuitive, dedicated interface.
Evals tab in Weave shows all the tasks with status, metadata and metrics if available.
This dual approach lets each tool handle what it does best while staying connected through a unified run ID. In the Models UI, clicking a run’s trace opens the Weave traces for the associated evaluation tasks—and you can jump back from Weave to Models by searching for the run ID.
Cross-Evaluation Analysis
Cross-evaluation analysis has been a frequent pain point for Inspect users, typically requiring custom scripts and database queries. inspect_wandb addresses this with the following features.
The filtering system supports combined queries on metadata across the evaluation history. Want GPT-5’s safety results from the past month or a reasoning comparison of chain-of-thought LLMs? These become point-and-click tasks, and the filtered records can be saved as shareable Views.
Save filtered results as ‘views’ for easy access.
The evaluation comparison feature provides ready-to-use visualizations, including metric plots and sample-level dataset tables. Teams can easily spot performance differences, patterns, and outliers that may reveal edge cases or opportunities for model improvement.
Visualizations of metrics comparison across evaluations.
Side-by-side dataset comparison.
Structured Tracing
The implementation organizes hierarchical traces, with each evaluation sample becoming its own container. Inside each sample trace, you can drill down to see every solver step, model call, and scoring operation that contributed to that sample’s final result. This organization makes it easy to understand what happened during a specific sample’s evaluation—especially useful for debugging.
Traces grouped by sample offer great visibility.
Zero-Friction Integration
One convenient aspect of the integration is that it requires no code changes. Install the package in any Inspect environment, initialize a WandB project, and every evaluation run gains default WandB logging—no refactoring, extra config, or new workflow needed.
Getting Started
Install the package
Install the integration in your Inspect environment. Choose your preferred components:
Models + Weave (recommended to try both):
pip install "inspect-wandb[weave]"
Models only (useful if your organization has a Models license but not a Weave license, e.g. UK AISI):
pip install inspect-wandb
WandB Setup
Step 1: Authenticate with WandB:
wandb login
Step 2: Initialize your project:
wandb init
Follow the prompts to set your project name and entity.
Alternative: Environment Variables
For automated environments, skip the CLI setup and use environment variables:
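A minimal sketch using WandB’s standard variables (the values are placeholders for your own API key, project, and entity):
# Authenticate and select a project without the interactive CLI
export WANDB_API_KEY="your-api-key"
export WANDB_PROJECT="inspect-evals"
export WANDB_ENTITY="your-team-or-username"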
That’s It!
Once installed, inspect_wandb works automatically with your Inspect evaluations. No additional configuration needed – just run your evals as usual and see them tracked in WandB.
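For example, a standard Inspect CLI invocation is all it takes (a sketch; the task file and model here are placeholders):
# Run an existing Inspect eval exactly as before; with inspect_wandb
# installed, the run also shows up in your WandB project
inspect eval my_task.py --model openai/gpt-4o-mini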
For detailed setup options and further tutorials, check out our documentation!
Future Roadmap
The inspect_wandb integration was recently adopted into UK AISI’s workflow, and METR is currently testing it. Daniel and Qi plan to continue developing the project, and we’re all looking forward to your feedback! A public roadmap for the project is coming soon.
Contribute to inspect_wandb by incorporating it in your workflows, reporting issues, or adding features. As the integration matures, community input will shape its development and ensure it serves the broader AI evaluation ecosystem. Thanks for your time!