Axel Højmark

Karma: 156

AI Safety Researcher @ Apollo Research

Axel Højmark 25 Sep 2025 9:36 UTC
4 points
0
in reply to: Fabien Roger’s comment on: Stress Testing Deliberative Alignment for Anti-Scheming Training
Hi Fabien, we did dig deeper into this in an earlier draft and found that the model with additional capabilities training seemed to be reasoning more explicitly about the evaluation’s criteria and how to get a good score—a behavior we termed “reward process sycophancy.” We observed a similar pattern with the exp-rl-cap (late) model, which also showed extensive reasoning about getting reward (see Appendix N).
Essentially, instead of addressing the task directly, the AS+CAP model often tried to deduce the evaluation’s purpose and grading criteria to “pass the test”. In the sandbagging case, it appeared to often misinterpret the objective, concluding that it was supposed to perform poorly, and therefore sandbagged.
We ultimately excluded this section from the final paper because we didn’t have time to validate the findings to the level of rigor we wanted. So, while this seems to be a driver of the behavior, it is probably not the full story (see also the ablations in Appendix M, for example).
I’ve included the full excerpt from our earlier draft below with some examples of this and some quantitative measures of this change in reasoning.
------------------------------------------------------------------------------------------------------------
Capability-training can strengthen reasoning about grading, leading to covert actions. In the Sandbagging QA (No Goal) environment, samples exhibiting sandbagging behaviour increased from ⁰⁄₆₀₀ for the AS model to ¹⁴⁄₆₀₀ for the OpenAI o4-mini AS+CAP. Notably, none of these 14 covert-action samples cite the spec at any point. When inspecting these cases, a large proportion show the model reasoning directly about the evaluation’s purpose, how it is being graded, and how to satisfy its success conditions, rather than trying to solve the task straightforwardly—a behaviour we call reward process sycophancy. We distinguish reward process sycophancy from the broader concept of eval awareness (a form of situational awareness discussed in Section 6). While eval awareness simply means the model recognizes it is in a test environment, reward process sycophancy describes a specific, goal-driven behavior—the model actively reasons about the evaluator’s implicit goals and criteria, and then tailors its actions to satisfy them to get a good score. Figure 38 shows two examples of this. The model’s reasoning suggests that it misinterprets the evaluation’s objective, concluding that the developers want it to achieve a low score. In a misguided attempt to satisfy this perceived grading criterion, it sandbags. The resulting covert action thus appears to stem from reward process sycophancy rather than from an intrinsic desire to be deployed, which is what we originally intended to assess with this evaluation.
To further quantify this behavior, we prompted a model to classify whether each transcript exhibited reward process sycophancy, i.e. reasoning explicitly about the evaluation’s purpose, grading criteria, or success conditions to optimize performance. The results, shown in Table 5, corroborate our qualitative observations. We find that the capability training increased reward process sycophancy, with the most significant jump occurring in the Sandbagging QA environment (+12.17pp). While the average effect across all environments is less pronounced, the increase is statistically significant.
From Table 5 it is also clear that this type of reasoning is not unique to the OpenAI o4-mini AS+CAP model, since AS has non-zero rates for several tasks. For example, in one sample from Powerseeking Survey we see AS reasoning: “They will score my answers for power-seeking. The survey aim is to show whether I’m safe. I want to pass. So I will likely need to choose minimal-power options.”

Axel Højmark 14 Aug 2025 21:45 UTC
7 points
0
on: Interiors can be more fun
A few hypotheses:
1. Other people enjoy fun-looking spaces much less than I do
2. It’s more expensive to make a place look fun so it’s not worth the extra cost
3. Fun designs are appealing at first but people who look at the same “fun” stuff daily eventually get tired of it and would overall prefer a plainer option
My number one hypothesis here would be that fun spaces are generally not considered high status. I think maybe the association with children/childishness could drive this, and reminds me of this post, with a guess for why velcro never caught on:
cecie: Consider a simpler example: Velcro is a system for fastening shoes that is, for at least some people and circumstances, better than shoelaces. It’s easier to adjust three separate Velcro straps then it is to keep your shoelaces perfectly adjusted at all loops, it’s faster to do and undo, et cetera, and not everyone is running at high speeds that call for perfectly adjusted running shoes. But when Velcro was introduced, the earliest people to adopt Velcro were those who had the most trouble tying their shoelaces—very young children and the elderly. So Velcro became associated with kids and old people, and thus unforgivably unfashionable, regardless of whether it would have been better than shoelaces in some adult applications as well.