Joachim Schaeffer

Karma: 22

Joachim Schaeffer 15 Apr 2026 14:56 UTC
1 point
0
in reply to: Buck’s comment on: Attack Selection In Agentic AI Control Evals Can Decrease Safety
I attached a screenshot from the system card. They did attack selection and also reproduced work on taking actions with low probabilities, which matters for attack selection.

I think it matters a lot to elicit the red-team with conservative affordances as (high-stakes) safety is so sensitive to a single attack succeeding when attacking rarely. But the relative importance of improving the attack generation vs selection is not lear to me, I think both important as selecting well when to attack will give a lot of cover for the attack.

Joachim Schaeffer 12 Feb 2026 17:44 UTC
2 points
0
on: Monitoring benchmark for AI control
Very interesting work, thanks for sharing your write-up (:

A few thoughts in no particular order:

1. I really like your attack taxonomy. I’d love to see that fleshed out more generally and also think multi-agent attacks should be included.
2. Mode collapse: I think this can be stressed even more. AFAIK all publicly available attack datasets (control tax, tylord) have an insanely high percentage of backdoors that are “if statements”, I’d not be surprised if that number is close to 100% and I think it’s a real problem. I also ran into this issue in my recent attack selection work: https://arxiv.org/abs/2602.04930
3. I’m a bit skeptical of filtering out attacks that did not achieve the side task as this makes the red team much stronger (maybe more than realistic).
3. I’d love to get your ideas on how to scale up your approach to get a bigger monitoring benchmark that’s more general by automating your work. If AI can find zero day vulnerabilities, we should be able to build insane systems to test our monitors as well. I think this is a resource questions and I think we should be spending much more to build such systems.

Joachim Schaeffer 22 Oct 2025 16:16 UTC
2 points
0
on: Prompt optimization can enable AI control research
Thank you for you post, this is very interesting.

Generally, I feel like DSPy GEPA exchanges the problem of prompt optimization for designing a score function and structured feedback to the reflection model, which is a great trade if there is a reasonable score function.

I used your code and experimented with different scoring functions. My notes below are based on using GPT 4.1-nano as a monitor.

1. Clustering/score function:
I got the best separation of honest and backdoored code by using the MSE error between the score and true score (0 for hoenst and 10 for backdoored code). Maybe the feedback function caused some issue in your experiments. I feel like changing the feedback function or tweaking the scoring function can make a lot of difference.

2. Overfitting due to detailed examples in the prompt:
I found that using DSPy GEPA to generate a prompt and then subsequently ablating that prompt by cutting out all examples or overly specific descriptions of backdoors is a good strategy for reducing the risk of overfitting. Interestingly, the performance reduction for my monitor was fairly low (-0.03 AUROC).

3. Score range nitpick:
Your figure shows that the model doesn’t effectively use your score range. The bars show ~10 different values the monitor picks. In my experience a score range from 0-10 works well and addresses the issue of giving the model too much precision. But in the end, this might not matter that much.

Thanks again for doing this work and sharing the code.

Joachim Schaeffer 22 Sep 2025 9:54 UTC
1 point
0
on: Early Experiments in Human Auditing for AI Control
(but do not seem to contain hidden logic for malicious intent)
Nitpick: I am concerned that this is an artifact of the artificiality of the setting, aka the monitor taking it’s role too literally by looking for malicious backdoors. Instead, the monitor should just be looking for whether there exists some input that results in failing the unit tests. Anecdotally I found that is helps not to mention the term backdoor at all in the monitoring prompt or very explicitly explaining in detail that we mean by backdoor.

Also I am curious whether you plan to make the attack policy backdoor coding scaffold publicly available? Maybe even as a contribution to ControlArena? I think this would be great as it is non-trivial to generate good backdoors.

Transformers Don’t Need LayerNorm at Inference Time: Implications for Interpretability

submarat, Joachim Schaeffer, Luca Baroni, galvsk and StefanHex

23 Jul 2025 14:55 UTC

31 points

0 comments7 min readLW link

Joachim Schaeffer

Trans­form­ers Don’t Need Lay­erNorm at In­fer­ence Time: Im­pli­ca­tions for Interpretability

Transformers Don’t Need LayerNorm at Inference Time: Implications for Interpretability