rusheb

Karma: 499

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny and alex.lloyd

17 Sep 2025 16:59 UTC

133 points

19 comments, 1 min read

(antischeming.ai)

Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and rusheb

17 Dec 2024 23:58 UTC

116 points

1 comment, 2 min read

Frontier Models are Capable of In-context Scheming

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer and Mikita Balesni

5 Dec 2024 22:11 UTC

211 points

24 comments, 7 min read

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb

29 May 2024 17:44 UTC

93 points

0 comments, 7 min read

A starter guide for evals

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb and AlexMeinke

8 Jan 2024 18:24 UTC

58 points

2 comments, 12 min read

(www.apolloresearch.ai)

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

7 Nov 2023 17:59 UTC

38 points

2 comments, 2 min read

(arxiv.org)

Understanding mesa-optimization using toy models

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can

7 May 2023 17:00 UTC

46 points

6 comments, 10 min read

rusheb 18 Jan 2023 13:29 UTC
LW: 4 AF: 1
on: Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
"If your hypothesis predicts that model performance will be preserved if you swap the input to any other input which has a particular property, but no other inputs in the dataset have that property, causal scrubbing can’t test your hypothesis."
Would it be possible to make interventions which we expect not to preserve the model’s behaviour, and assert that the behaviour does in fact change?
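To make the question concrete, here is a minimal toy sketch in Python. It is not the causal scrubbing algorithm from the post: it swaps whole inputs rather than resampling activations, and `model`, `has_property`, `swap_agreement` and `DATASET` are all hypothetical names invented for illustration. It only shows the two-sided check being asked about: swaps the hypothesis says are harmless should leave the output unchanged, and swaps the hypothesis says matter should change it.

```python
import random

# Hypothetical toy dataset of 2-d inputs (an assumption for illustration).
DATASET = [(-2.0, 1.0), (-1.0, -3.0), (0.5, 2.0), (3.0, -1.0)]


def model(x):
    # Toy stand-in for a network: the output depends only on the sign of x[0].
    return 1 if x[0] > 0 else 0


def has_property(x):
    # The property the hypothesis claims is all that matters to the model.
    return x[0] > 0


def swap_agreement(dataset, same_property, rng):
    """Fraction of inputs whose output is unchanged after swapping in a donor.

    same_property=True  -> donor shares the property (expected to preserve).
    same_property=False -> donor differs in the property (expected to break).
    """
    unchanged = 0
    total = 0
    for x in dataset:
        donors = [
            d for d in dataset
            if d is not x
            and (has_property(d) == has_property(x)) == same_property
        ]
        if not donors:
            # No valid donor exists, so this input cannot be tested.
            continue
        donor = rng.choice(donors)
        unchanged += int(model(donor) == model(x))
        total += 1
    return unchanged / total if total else float("nan")


rng = random.Random(0)
preserved = swap_agreement(DATASET, same_property=True, rng=rng)
broken = swap_agreement(DATASET, same_property=False, rng=rng)

# Positive check: swaps the hypothesis allows should not change behaviour.
assert preserved == 1.0
# Negative check: swaps the hypothesis forbids should change behaviour.
assert broken == 0.0
```

In this toy the negative check passes trivially because the model was built to match the hypothesis. The `if not donors` branch corresponds to the limitation quoted above: when no other input shares the property, there is nothing to swap in.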