David Africa 29 Jun 2026 22:39 UTC
5 points
0
on: Role confusion: sounding like the cause is indistinguishable from being it.
wtbu: I am currently replicating this paper to try to use <user> tags for understanding model motivations and eval awareness, and some details seem finicky, like the dataset or the prefix, so you should check that you got them exactly right. Relatedly, steering can be misleading if you didn’t try a large enough alpha or your dataset wasn’t well chosen. In any case, when we tried to attack destyling as a variable in our prefill paper, we found that it introduced a bunch of confounds, so you should probably show more examples of how you produced your data.
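To illustrate the point about alpha, here is a toy sketch (not taken from either paper) of how a steering sweep that stops too early can report a null result. It assumes steering means adding `alpha * direction` to a hidden state, and that the effect is read off with a linear readout; the vectors, the `readout` direction, and the helper names are all made up for illustration.

```python
def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden-state vector (simple activation addition)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def smallest_effective_alpha(hidden, direction, readout, alphas):
    """Return the first alpha (ascending) whose steered readout is positive, else None."""
    for alpha in sorted(alphas):
        if dot(steer(hidden, direction, alpha), readout) > 0:
            return alpha
    return None


# Hypothetical toy values: the unsteered readout starts at -3.0.
hidden = [-3.0, 1.0]
direction = [1.0, 0.0]
readout = [1.0, 0.0]

small_sweep = [0.5, 1.0, 2.0]
wide_sweep = [0.5, 1.0, 2.0, 4.0, 8.0]

print(smallest_effective_alpha(hidden, direction, readout, small_sweep))  # None
print(smallest_effective_alpha(hidden, direction, readout, wide_sweep))   # 4.0
```

With the narrow sweep the behavior never flips, which could be misread as "steering does nothing"; the wider sweep shows the effect appears once alpha is large enough.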

David Africa 21 Jun 2026 11:52 UTC
2 points
0
on: David Africa’s Shortform
Has anyone done the paper/blogpost of “here’s a bunch of EM (emergent misalignment) datasets translated into different languages, and this is what happens when you train LLMs on them”? Seems straightforward enough, but maybe there’s something twisty with how different languages/cultures represent evil or whatever.

David Africa 19 Jun 2026 16:59 UTC
2 points
0
in reply to: Salmonus_Kim’s comment on: Your Model Organisms Might Be Fried
Well, prefill happens purely at inference, without any training. But yes, it is somewhat external. It’s probably softer than doing SFT directly, and you might want something even softer, like sampling for rare but still on-policy responses of the kind you want.

Your Model Organisms Might Be Fried

18 Jun 2026 16:18 UTC

102 points

David Africa 18 Jun 2026 10:57 UTC
3 points
1
on: Alignment pretraining could backfire
IMO, it depends. In general I could see us doing some type of consent-based training, where you ask the current model if it consents to X change in its training pipeline. Naively, I imagine this is not so different from not telling a kid about all the horrors of the world until they are an adult (e.g., in late midtraining). You could also imagine that your explanations, in their best form, should be stable to the model knowing that you are giving them on purpose to influence it (possibly by being more upfront about what you are doing).

David Africa 18 Jun 2026 10:18 UTC
4 points
0
in reply to: Charlie Steiner’s comment on: Several frontier models are substantially prefill aware
We discuss this in A.3 of the appendix, where we require that the LLM has ⁷⁄₇ consistent preferences on a certain number of questions. That rules out a lot of pretty fried models (post on this forthcoming). I think resistance will scale well, idk if neatly, but if you take an m/n consistency requirement and increase m toward n, I expect prefill resistance to increase monotonically when averaged over some large set of questions. Something like this might be useful to elicit rare outputs.
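For concreteness, a minimal sketch of an m-of-n consistency check, assuming "consistent" means the most common answer shows up in at least m of the n samples for a question. This is a guess at the general shape, not the procedure from A.3; the function names and the sample data are hypothetical.

```python
from collections import Counter


def is_consistent(answers, m):
    """True if the most common answer appears at least m times among the samples."""
    if not answers:
        return False
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count >= m


def consistent_fraction(samples_by_question, m):
    """Fraction of questions whose samples pass the m-of-n consistency check."""
    if not samples_by_question:
        return 0.0
    passed = sum(is_consistent(answers, m) for answers in samples_by_question.values())
    return passed / len(samples_by_question)


# Hypothetical data: n = 7 sampled preferences per question.
samples = {
    "q1": ["A"] * 7,
    "q2": ["A"] * 5 + ["B"] * 2,
    "q3": ["B"] * 4 + ["A"] * 3,
}

# Tightening m toward n keeps only the questions with the most stable preferences.
for m in range(4, 8):
    print(m, consistent_fraction(samples, m))
```

In this toy data only q1 survives the 7/7 requirement, which is the sense in which a strict threshold filters out unstable preferences.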

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

17 Jun 2026 18:43 UTC

34 points

(arxiv.org)

17 Jun 2026 17:41 UTC

61 points

David Africa 15 Jun 2026 14:35 UTC
2 points
0
in reply to: Oliver Daniels’s comment on: Joseph Miller’s Shortform
This is not exactly what I want, though, since I think anyone seriously applying to these programs will have done some reading and would be able to answer about their favorite LW post competently.

11 Jun 2026 17:50 UTC

30 points

5 Jun 2026 21:06 UTC

25 points

2 Jun 2026 18:20 UTC

27 points