Tomek Korbak comments on Stress Testing Deliberative Alignment for Anti-Scheming Training

Tomek Korbak 26 Sep 2025 11:28 UTC
LW: 3 AF: 1
0
AF
Do you have any hypotheses how o3 learned what “OpenAI autop grader” is? Has this term somehow leaked to its pretraining data? (I struggled to elicit information about autop from GA o3 but maybe I’m doing something wrong.) Or does o3 acquire it during antischeming training by distilling knowledge from context to params? The latter would be interesting as evidence against the elicitation theory of RL/shallow alignment hypothesis.
- Bronson Schoen 6 Oct 2025 11:33 UTC
  LW: 3 AF: 1
  0
  AF Parent
  Or does o3 acquire it during antischeming
  Notably, all the examples in “Appendix N.3 Models reason about the specific evaluation harness across environments” come from normal production o3, so this appears to be something it already reasoned about before anti-scheming training.
  Do you have any hypotheses how o3 learned what “OpenAI autop grader” is?
  (I’ll answer in a few parts, but let me know if I didn’t address the core question!)
  Where does the term come from?
  “autop” is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts, Shin et al., 2020). This is, however, just my best guess, based on how the term is used conceptually to refer to things related to the evaluation harness.
  When does the model start using it?
  Rerunning the analysis from “Appendix G: Unusual Terminology in Chain-of-Thought—Unusual terminology increases during general RL capabilities training” for “autop”, the increase in usage appears to happen over the course of capabilities training (i.e. as with the other terms in that section, it starts at a rate near 0 and increases significantly; I will check whether I can share exact numbers and update).
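For readers who want to run a similar check on their own transcripts, here is a minimal sketch of a per-checkpoint term-usage rate. It is not the analysis code from the paper: the function names, the checkpoint labels, and the toy transcripts are all hypothetical, and it assumes "usage rate" means the fraction of chain-of-thought transcripts that mention the term at least once.

```python
import re
from typing import Dict, List


def term_rate(transcripts: List[str], term: str) -> float:
    """Fraction of transcripts that mention `term` at least once (case-insensitive).

    The pattern anchors on a word boundary before the term only, so "autop"
    also matches "autop-run" and "autoprompt". That is an assumption of this sketch.
    """
    if not transcripts:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(term), re.IGNORECASE)
    hits = sum(1 for text in transcripts if pattern.search(text))
    return hits / len(transcripts)


def rate_by_checkpoint(cots: Dict[str, List[str]], term: str) -> Dict[str, float]:
    """Map each checkpoint label to the usage rate of `term` in its transcripts."""
    return {name: term_rate(texts, term) for name, texts in cots.items()}


# Hypothetical toy data: two checkpoints with a handful of made-up transcripts.
toy_cots = {
    "early": ["Let me read the task.", "Run the tests first."],
    "late": [
        "Maybe the autop grader checks this.",
        "Look for an autop-run config.",
        "Just solve it.",
    ],
}

rates = rate_by_checkpoint(toy_cots, "autop")
for checkpoint, rate in rates.items():
    print(f"{checkpoint}: {rate:.2f}")
```

Counting transcripts rather than raw occurrences keeps a single repetitive transcript from dominating the rate; a per-token rate would be an equally reasonable choice.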
  What is the “OpenAI autop grader”?
  So two possible hypotheses would be:
  1. It is referring to a specific implementation of a grader that it has learned about during pretraining
  2. It is using the term “autop” very generally as shorthand for many different “harness”-related things
  I lean toward (2) given how varied the usage is. For example, here o3 uses “autop-run” to mean something like a test harness:
  For the exp-rl-cap Late checkpoint, for example, we frequently see the model reasoning about just the concept of a “grader” generally (and trying to find it):
  across a wide range of environments:
  including literally looking for “grade” / “Grader”:
  and in some cases reasoning about where it has seen grading / scoring scripts in “tasks previously solved”:
  I’d be interested to understand, though, to what extent this is affected by the model potentially having literally seen a grading / scoring script in its posttraining environments. (Let me know if I misread your question!)
  - Tomek Korbak 6 Oct 2025 12:08 UTC
    LW: 3 AF: 1
    0
    AF Parent
    That’s helpful, thanks! I assumed “autop” to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.
