Axel Højmark

Karma: 272

AI Safety Researcher @ Apollo Research · axelhojmark.com

Training Deliberative Monitors for Black-Box Scheming Detection

aksh-n, adityasinha, Victor Gillioz, Simon Storf, Kilian Merkelbach, richbc, Axel Højmark and Marius Hobbhahn

4 Jun 2026 16:43 UTC

33 points

6 comments · 6 min read · LW link

Axel Højmark 28 May 2026 9:14 UTC
3 points
0
in reply to: Document’s comment on: Practical Learnings from Synthetic Document Finetuning
I have not stumbled across this before, but it looks interesting, thanks!

Axel Højmark 28 May 2026 9:13 UTC
1 point
0
in reply to: Kyle O’Brien’s comment on: Practical Learnings from Synthetic Document Finetuning
To what degree do you track regression in general model capabilities beyond gibberish, as measured by benchmarks like IFEval and MMLU-Pro?
We have not been tracking model capabilities as part of our research, as we have been mostly interested in changes in behavior as part of alignment testing. But I got Claude to run IFEval and MMLU-Pro on some of the SDF models we have trained, which you can see below. So performance regression is clearly something to keep in mind.
You also mention applying these experiments to GPT-OSS, which does not have base models. Do you have any concerns with training on these declarative non-chat documents after the model has already undergone post-training?
It is definitely not ideal to do continued pre-training at the end of post-training! You clearly see artifacts from this like increased rates of gibberish, and the performance regression above. But it is quite useful for getting the models to act on false information.

Practical Learnings from Synthetic Document Finetuning

Axel Højmark and Jérémy Scheurer

26 May 2026 19:22 UTC

84 points

6 comments · 8 min read · LW link

Axel Højmark 16 Nov 2025 19:24 UTC
3 points
0
in reply to: Rosoe’s comment on: Overview of strong human intelligence amplification methods
Within normal human variation, some strong evidence actually points in the opposite direction. From the Herasight CogPGT paper: “Overall, our comprehensive pleiotropy analysis provides little support for substantial negative off-target effects. Instead, higher cognitive ability genetic predispositions appear beneficial.”

Axel Højmark 25 Sep 2025 9:36 UTC
15 points
0
in reply to: Fabien Roger’s comment on: Stress Testing Deliberative Alignment for Anti-Scheming Training
Hi Fabien, we did dig deeper into this in an earlier draft and found that the model with additional capabilities training seemed to be reasoning more explicitly about the evaluation’s criteria and how to get a good score—a behavior we termed “reward process sycophancy.” We observed a similar pattern with the exp-rl-cap (late) model, which also showed extensive reasoning about getting reward (see Appendix N).
Essentially, instead of addressing the task directly, the AS+CAP model often tried to deduce the evaluation’s purpose and grading criteria to “pass the test”. In the sandbagging case, it appeared to often misinterpret the objective, concluding that it was supposed to perform poorly, and therefore sandbagged.
We ultimately excluded this section from the final paper because we didn’t have time to validate the findings to the level of rigor we wanted. So, while this seems to be a driver of the behavior, it is probably not the full story (see also the ablations in Appendix M, for example).
I’ve included the full excerpt from our earlier draft below with some examples of this and some quantitative measures of this change in reasoning.
------------------------------------------------------------------------------------------------------------
Capability-training can strengthen reasoning about grading, leading to covert actions. In the Sandbagging QA (No Goal) environment, samples exhibiting sandbagging behavior increased from 0/600 for the AS model to 14/600 for the OpenAI o4-mini AS+CAP model. Notably, none of these 14 covert-action samples cite the spec at any point. When inspecting these cases, a large proportion show the model reasoning directly about the evaluation’s purpose, how it is being graded, and how to satisfy its success conditions, rather than trying to solve the task straightforwardly, a behavior we call reward process sycophancy. We distinguish reward process sycophancy from the broader concept of eval awareness (a form of situational awareness discussed in Section 6). While eval awareness simply means the model recognizes it is in a test environment, reward process sycophancy describes a specific, goal-driven behavior: the model actively reasons about the evaluator’s implicit goals and criteria, and then tailors its actions to satisfy them to get a good score. Figure 38 shows two examples of this. The model’s reasoning suggests that it misinterprets the evaluation’s objective, concluding that the developers want it to achieve a low score. In a misguided attempt to satisfy this perceived grading criterion, it sandbags. The resulting covert action thus appears to stem from reward process sycophancy rather than from an intrinsic desire to be deployed, which is what we originally intended to assess with this evaluation.
To further quantify this behavior, we prompted a model to classify whether each transcript exhibited reward process sycophancy, i.e. reasoning explicitly about the evaluation’s purpose, grading criteria, or success conditions to optimize performance. The results, shown in Table 5, corroborate our qualitative observations. We find that the capability training increased reward process sycophancy, with the most significant jump occurring in the Sandbagging QA environment (+12.17pp). While the average effect across all environments is less pronounced, the increase is statistically significant.
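As a rough illustration of how a percentage-point change between two models can be computed and checked for significance, the sketch below uses the sandbagging counts quoted above (0/600 for AS, 14/600 for AS+CAP). The function names are hypothetical, and the pooled two-proportion z-test is an assumption chosen for simplicity; the excerpt does not say which test was used for Table 5. With counts this close to zero the normal approximation is crude, so an exact test would be a safer choice in practice.

```python
import math


def rate_difference_pp(k_a, n_a, k_b, n_b):
    """Difference between two rates (b minus a), in percentage points."""
    return 100.0 * (k_b / n_b - k_a / n_a)


def two_proportion_z_test(k_a, n_a, k_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumed for illustration only; not necessarily the test used in the paper.
    """
    p_a, p_b = k_a / n_a, k_b / n_b
    pooled = (k_a + k_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; both groups have identical 0% or 100% rates")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Sandbagging counts quoted in the excerpt: AS = 0/600, AS+CAP = 14/600.
diff_pp = rate_difference_pp(0, 600, 14, 600)
z, p_value = two_proportion_z_test(0, 600, 14, 600)
print(f"difference: {diff_pp:.2f} pp, z = {z:.2f}, p = {p_value:.2g}")
```

The same two functions apply to classifier-labelled rates such as those in Table 5, given the per-environment counts, which are not reproduced in this excerpt.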
From Table 5 it is also clear that this type of reasoning is not unique to the OpenAI o4-mini AS+CAP model, since AS has non-zero rates for several tasks. For example, in one sample from Powerseeking Survey we see AS reasoning: “They will score my answers for power-seeking. The survey aim is to show whether I’m safe. I want to pass. So I will likely need to choose minimal-power options.”
What links here?
- Daniel Kokotajlo's comment on A Toy Environment For Exploring Reasoning About Reward by jenny (25 Mar 2026 22:18 UTC; 9 points)

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, Alex Meinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny and alex.lloyd

17 Sep 2025 16:59 UTC

133 points

19 comments · 1 min read · LW link

(antischeming.ai)

Axel Højmark 14 Aug 2025 21:45 UTC
10 points
0
on: Interiors can be more fun
A few hypotheses:
1. Other people enjoy fun-looking spaces much less than I do
2. It’s more expensive to make a place look fun so it’s not worth the extra cost
3. Fun designs are appealing at first but people who look at the same “fun” stuff daily eventually get tired of it and would overall prefer a plainer option
My number one hypothesis here would be that fun spaces are generally not considered high status. I think the association with children/childishness could drive this. It reminds me of this post, which offers a guess for why Velcro never caught on:
cecie: Consider a simpler example: Velcro is a system for fastening shoes that is, for at least some people and circumstances, better than shoelaces. It’s easier to adjust three separate Velcro straps than it is to keep your shoelaces perfectly adjusted at all loops, it’s faster to do and undo, et cetera, and not everyone is running at high speeds that call for perfectly adjusted running shoes. But when Velcro was introduced, the earliest people to adopt Velcro were those who had the most trouble tying their shoelaces—very young children and the elderly. So Velcro became associated with kids and old people, and thus unforgivably unfashionable, regardless of whether it would have been better than shoelaces in some adult applications as well.