Extending Inspect Framework: Integrating Weights & Biases
This work was supported through the MARS (Mentorship for Alignment Research Students) program at the Cambridge AI Safety Hub (caish.org/mars).
Why We Built inspect_wandb
Evaluating frontier AI models can be messy; each benchmark has its own quirks, scripts, and formats, and even after a successful run, the results usually remain buried in local JSON files. That’s fine for a solo experiment, but painful when you need to compare models, audit results, or reproduce someone else’s work.
Inspect is an open-source framework for running evaluations locally on your own machine. Weights & Biases (WandB) is a suite of cloud tools for sharing, tracking, and visualizing ML experiments and LLM evals. Our integration—inspect_wandb—is the missing bridge: it lets Inspect stream evaluations, metrics, and traces directly into WandB, providing a single source of truth, searchable dashboards, and reproducible runs.
Core Features
Connecting Models and Weave
The integration leverages two complementary WandB tools, each serving distinct but interconnected purposes.
When running LLM evals, WandB Models is primarily useful for reproducibility and tracking. Every evaluation becomes a fully logged Run complete with system state, git information, command history, and configurable file uploads. When you need to reproduce a result months later, all the context is preserved.
Overview of an evaluation run, easy to reproduce with the context preserved.
WandB Weave delivers analysis and comparison capabilities. It transforms individual evaluation tasks into a searchable, filterable, comparable dataset. Rather than digging through local log files or passing them around via Google Drive, teams can query their entire evaluation history through an intuitive, dedicated interface.
Evals tab in Weave shows all the tasks with status, metadata and metrics if available.
This dual approach lets each tool handle what it does best while staying connected through a unified run ID. In the Models UI, clicking a run’s trace opens the Weave traces for the associated evaluation tasks—and you can jump back from Weave to Models by searching for the run ID.
Cross-Evaluation Analysis
Cross-evaluation analysis has been a frequent pain point for Inspect users, typically requiring custom scripts and database queries. inspect_wandb addresses this with the following features.
The filtering system supports combined queries on metadata across the evaluation history. Want GPT-5’s safety results from the past month or a reasoning comparison of chain-of-thought LLMs? These become point-and-click tasks, and the filtered records can be saved as shareable Views.
Save filtered results as ‘views’ for easy access.
The evaluation comparison feature provides ready-to-use visualizations, including metric plots and sample-level dataset tables. Teams can easily spot performance differences, patterns, and outliers that may reveal edge cases or opportunities for model improvement.
Visualizations of metrics comparison across evaluations.
Side-by-side dataset comparison.
Structured Tracing
The implementation organizes hierarchical traces, with each evaluation sample becoming its own container. Inside each sample trace, you can drill down to see every solver step, model call, and scoring operation that contributed to that sample’s final result. This organization makes it easy to understand what happened during a specific sample’s evaluation—especially useful for debugging.
Traces grouped by sample offer great visibility.
Zero-Friction Integration
One convenient aspect of the integration is that it requires no code changes. Install the package in any Inspect environment, initialize a WandB project, and every evaluation run gains default WandB logging—no refactoring, extra config, or new workflow needed.
Getting Started
Install the package
Install the integration in your Inspect environment. Choose your preferred components:
Models + Weave (recommended to try both):
pip install "inspect-wandb[weave]"
Models only (useful if your organization has a Models license but not a Weave license, e.g. UK AISI):
pip install inspect-wandb
WandB Setup
Step 1: Authenticate with WandB:
wandb login
Step 2: Initialize your project:
wandb init
Follow the prompts to set your project name and entity.
Alternative: Environment Variables
For automated environments, skip the CLI setup and use environment variables:
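A minimal sketch using WandB’s standard variables (the values are placeholders for your own API key, project, and entity):
# Authenticate and select a project without the interactive CLI
export WANDB_API_KEY="your-api-key"
export WANDB_PROJECT="inspect-evals"
export WANDB_ENTITY="your-team-or-username"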
That’s It!
Once installed, inspect_wandb works automatically with your Inspect evaluations. No additional configuration needed – just run your evals as usual and see them tracked in WandB.
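For example, a standard Inspect CLI invocation is all it takes (a sketch; the task file and model here are placeholders):
# Run an existing Inspect eval exactly as before; with inspect_wandb
# installed, the run also shows up in your WandB project
inspect eval my_task.py --model openai/gpt-4o-mini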
For detailed setup options and further tutorials, check out our documentation!
Future Roadmap
The inspect_wandb integration was recently adopted into UK AISI’s workflow, and METR is currently testing it. Daniel and Qi plan to continue developing the project, and we’re all looking forward to your feedback! A public roadmap for the project is coming soon.
Contribute to inspect_wandb by incorporating it in your workflows, reporting issues, or adding features. As the integration matures, community input will shape its development and ensure it serves the broader AI evaluation ecosystem. Thanks for your time!