AI companies should be safety-testing the most capable versions of their models


Only OpenAI has committed to the strongest approach: testing task-specific versions of its models. But evidence of their follow-through is limited.

You might assume that AI companies run their safety tests on the most capable versions of their models. But there are important versions that currently go untested: task-specific versions, specially trained to demonstrate how far a model can be pushed on a dangerous capability, like novel bioweapon design. Without exploring and evaluating these versions, AI companies are underestimating the risks their models could pose.


Only OpenAI (where I previously worked¹) has committed to what I consider the strongest vision for safety-testing: evaluating these task-specific versions of its models to understand the worst-case scenarios. I think OpenAI’s vision here is truly laudable. But from my review of publicly available reports, it is not clear that OpenAI is in fact following through on this commitment. I believe that all leading AI companies should be doing this form of task-specific model testing. For companies that are not willing or able to do so today, I argue there are many middle-ground improvements they could pursue.

In this post, I will:

  • Walk you through why AI companies safety-test their models, and how they do it

  • Define task-specific fine-tuning (TSFT), and argue why it’s important for safety-testing

  • Summarize the status of TSFT at leading AI companies

  • Dig into a case study on OpenAI and TSFT

  • Outline challenges to safety testing using TSFT, and what middle-ground improvements exist

    [ continues on Substack ]