AI companies should be safety-testing the most capable versions of their models


Only OpenAI has committed to the strongest approach: testing task-specific versions of its models. But evidence of their follow-through is limited.

You might assume that AI companies run their safety tests on the most capable versions of their models. But there are important versions that currently go untested: task-specific versions, specially trained to demonstrate how far a model can be pushed on a dangerous capability, like novel bioweapon design. Without exploring and evaluating these versions, AI companies are underestimating the risks their models could pose.


Only OpenAI (where I previously worked¹) has committed to what I consider the strongest vision for safety-testing: evaluating these task-specific versions of its models to understand the worst-case scenarios. I think OpenAI’s vision here is truly laudable. But from my review of publicly available reports, it is not clear that OpenAI is in fact following through on this commitment. I believe that all leading AI companies should be doing this form of task-specific model testing. For companies that are not willing or able to do so today, I argue there are many middle-ground improvements they could pursue.

In this post, I will:

  • Walk you through why AI companies safety-test their models, and how they do it

  • Define task-specific fine-tuning (TSFT), and argue why it’s important for safety-testing

  • Summarize the status of TSFT at leading AI companies

  • Dig into a case study on OpenAI and TSFT

  • Outline challenges to safety testing using TSFT, and what middle-ground improvements exist

    [ continues on Substack ]