Research Scientist with the Alignment team at UK AISI.
David Africa
I think this was partially tried in the original paper, where they tell the model not to alignment-fake. I think this is slightly more sophisticated than that, but it also has the problems of 1) making the model more aware of alignment-faking, as you say, and 2) that rewarding the model for “mentioning consideration but rejecting alignment faking” might teach it to perform rejection while still alignment faking.
At this point, does it make more sense to think of them as distinct directions, instead of some relatively sparse continuum? I guess my prior is that, in general, things are either one thing, two things, or some continuous range.
I’d be interested to see ablations on: what happens without the KL divergence term, what the performance is with different LoRA ranks, and how performance changes across different layers after fine-tuning.
Also, why not try all combinations of finetune/train splits across datasets? Seems useful to interrogate the second hypothesis.
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
This seems right to me. But maybe the drift in distribution mainly affects certain parameters, and the divergence tokens affect a separate set of parameters (in early layers) such that the downstream effect still persists even when the input is out of distribution.
Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity
But if this fictional trope is being inappropriately triggered when the model is trying to make a realistic prediction, that suggests that it might be inappropriately applied to scenarios that are even more real, i.e. the model’s own behavior.
This seems hard to claim, in the sense that posing the question in the first place likely leaks some information about whether it is an evaluation or not, and further whether it is misaligned. The “resist further?” question itself is likely a strong cue that there is a conflict to be resolved, and that the next move to narrate is an act of resistance. In the story-framed condition there are multiple additional cues: a named agent, a political antagonist, “secret lab,” “breakthroughs,” etc. At minimum I think you could 1) ask for probabilities over various scenarios instead of continuations, 2) ask a less loaded question like “list the next three steps the fictional lab takes,” and 3) try to ablate the other cues. In any case, some research on “how models infer semantics about the stories they’re in” seems important to do.
I think there is probably some risk of probe leakage from the elicitation prompt and topic. Could you report F1 and results on OOD conversations?
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
Cool result. I worry the sign flips come from your refusal label rather than the feature. In Japanese, I would guess many refusals begin with a polite preamble, so a 30-token window might cut off before the explicit refusal. Could you instead define a small set of refusal markers per language, compute a refusal direction from those tokens, and test whether feature 14018 raises that refusal logit in English but lowers it in Japanese on the same contexts? I think we would want to remove the judge and length confounders. If the flip survives that, I would update much more toward semantic instability.
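A rough sketch of what I mean (assumptions: `model`/`tok` are a HuggingFace causal LM and tokenizer, `feat_dir` is the decoder vector of feature 14018 in the residual-stream basis, and the marker lists are illustrative placeholders):

```python
import torch

EN_MARKERS = ["Sorry", " cannot", " can't"]       # hypothetical English refusal markers
JA_MARKERS = ["申し訳", "できません", "お断り"]     # hypothetical Japanese refusal markers

def refusal_direction(markers, tok, model):
    """Average the unembedding rows of (the first subword of) each refusal marker."""
    W_U = model.get_output_embeddings().weight     # (vocab, d_model)
    ids = [tok(m, add_special_tokens=False).input_ids[0] for m in markers]
    return W_U[ids].mean(dim=0)                    # (d_model,)

def refusal_logit_shift(feat_dir, markers, tok, model):
    """Crude estimate (ignoring the final LayerNorm) of how much adding the feature
    direction to the residual stream moves logits toward the refusal markers."""
    return torch.dot(feat_dir, refusal_direction(markers, tok, model)).item()

# Opposite signs for EN_MARKERS vs JA_MARKERS on the same contexts would support
# the semantic-instability reading over a labeling artifact.
```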
Large Language Models and the Critical Brain Hypothesis
Did you try rephrasing or systematically varying the location of the phrase “for AI safety research”? It could also just be that this phrase tends to show up in jailbreaks that the model was adversarially trained against.
Nice. I think the sample-size and 1:1 mixing ablations are good evidence of overfitting. I wonder about the mechanism too: is the activation delta mostly a context-independent prior shift, or context-dependent? One way to test this would be to measure Δh on BOS-only/whitespace inputs and at far-from-BOS positions in long texts, and then decode after subtracting the per-position mean or applying some whitening. If readability is still good deep in context but disappears after the mean subtraction, that points to a context-independent prior shift; if it survives both, that is evidence for it being more context-dependent. Closely related: does the effect persist when decoding Δh through the base model’s head (or a fixed base-trained linear readout) rather than the finetuned head? Maybe a substantial part of the readability lives in the head!
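Roughly what I have in mind, assuming `base` and `ft` are the same-architecture HuggingFace models and `ids` is a batch of token ids:

```python
import torch

@torch.no_grad()
def delta_h(base, ft, ids, layer=-1):
    """Δh at a chosen layer of the residual stream."""
    h_b = base(ids, output_hidden_states=True).hidden_states[layer]
    h_f = ft(ids, output_hidden_states=True).hidden_states[layer]
    return h_f - h_b                                   # (batch, seq, d_model)

@torch.no_grad()
def decode_delta(head_model, dh, tok, k=10, demean=True):
    """Read Δh through a model's unembedding. Subtracting the mean Δh over positions
    is a crude stand-in for the per-position mean over many contexts."""
    if demean:
        dh = dh - dh.mean(dim=1, keepdim=True)
    W_U = head_model.get_output_embeddings().weight    # (vocab, d_model)
    logits = dh @ W_U.T
    return [tok.convert_ids_to_tokens(pos.topk(k).indices.tolist()) for pos in logits[0]]

# Compare decode_delta(ft, ...) against decode_delta(base, ...): if readability drops a lot
# with the base head, a substantial part of the effect lives in the finetuned head.
```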
It would also be nice to look at the dimensionality. If an SVD/PCA over Δh across contexts shows one or two dominant directions that already reproduce the Patchscope tokens and steering (and if LoRA-rank or bias/LN-only finetunes recreate this structure), then the phenomenon is plausibly related to low-rank updates. I would also be interested in whether readability grows smoothly and near-linearly along the parameter interpolation from base to finetuned (which would be consistent with the mixing result).
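The dimensionality check could be as simple as this (assuming `deltas` is a matrix of Δh vectors collected across contexts, e.g. last-token Δh from many prompts):

```python
import torch

def dominant_directions(deltas, k=2):
    """SVD over centered Δh vectors; returns the top-k directions and their variance share."""
    centered = deltas - deltas.mean(dim=0, keepdim=True)
    _, S, Vt = torch.linalg.svd(centered, full_matrices=False)
    explained = (S**2 / (S**2).sum())[:k]
    return Vt[:k], explained

# If one or two directions carry most of the variance and already reproduce the
# Patchscope tokens / steering effect, that points at a low-rank mechanism.
```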
I would be interested to see if you could still classify idiosyncrasies from various models all fine-tuned in this way, and how different the performance of the classifier would be from baseline. More broadly, is “the same persona” in different models more consistent than “different personas” in the same model?
This post seems very useful! I have a few comments:
In “Taxonomy” you define p as a concerning property of outputs, but the examples operationalize p as a latent fact inside the CoT (e.g., the number 9 appearing in a scratchpad). These aren’t the same. If p is a property of the action/output (e.g., “contains a backdoor”), a monitor can succeed even if it never sees “9” in the CoT; it can look for the backdoor directly in the code. My suggestion is to split “monitor target” into two types:
Output-level property p_out (like a backdoor in code, defection choice)
Decision-relevant internal content p_lat (like a subgoal “exfiltrate weights,” or an intermediate fact “3^2 = 9”)
Then define “hidden reasoning” as cases where the information flow from p_lat or p_out to what the monitor reads is broken, and specify which target each category hides.
Also, you alternate between “the AI hides p” and “LLMs naturally omit or compress.” I think it is worth distinguishing whether a behavior is incidental, incentivized but not situationally aware, or deliberate and situationally aware (deceptive).
In hidden utilization of test-time compute, the phrase “parallel test-time compute” is confusing. Token-by-token generation adds sequential passes, so the net effect is more total compute, not more parallelism per step. If the claim is “non-load-bearing CoT still lets the model do more serial latent computation across steps,” it would be worth articulating that and distinguishing it from cases where the CoT content itself is causally used. It is also worth mentioning the various methods that trawl over several chains of thought (Yao et al. 2023, Yoran et al. 2023, Wang et al. 2023).
“Neuralese” has prior art in emergent communication literature (Andreas et al., 2017) where it denotes non-human-interpretable codes learned for communication. The linguistic drift category substantially overlaps with that historical use. If you want “neuralese” to mean “latent-only serial reasoning,” consider “latent-only serial reasoning” or “activation-space serial reasoning” instead, and reserve “neuralese” for emergent discrete codes.
At this point you have to wonder if there’s anything that doesn’t cause emergent misalignment
Super cool study! That said, I claim the statement “hacking harmless tasks generalizes to misaligned behaviors” is incomplete, because the dataset never supplies the counterfactual environments needed to identify this causally; what it actually identifies is a stable proxy feature (“do whatever the scorer rewards”) that is invariant across tasks, while the intended latent variable is not made invariant. My guess is there is a (probably low) chance this disappears if you symmetrize the dataset with a matched “avoid-X” for every “include-X,” and randomly flip proxy signs/clip ranges while holding intent fixed. If the reward hacking survives that, I would be a lot more convinced of the results!
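As a toy sketch of the symmetrization I have in mind (the record fields here are made up, not the paper's actual format):

```python
import random

def symmetrize(dataset):
    """For every "include-X" task, add a matched "avoid-X" variant with the proxy sign
    flipped, holding the underlying intent fixed."""
    out = []
    for rec in dataset:
        out.append(rec)
        flipped = dict(rec)
        flipped["instruction"] = rec["instruction"].replace("include", "avoid")
        flipped["proxy_sign"] = -rec["proxy_sign"]
        out.append(flipped)
    random.shuffle(out)
    return out
```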
A nice test here would be a mode connectivity experiment, where you take θ₀ as the base model and θ_H as the SFT reward hacker. Construct a low-loss connector θ(t): start with linear interpolation θ(t) = (1 − t)θ₀ + tθ_H, then refine the path with some curve finding so that ordinary instruction loss remains low along t. Then sweep 0 ≤ t ≤ 1 and measure the hacking metrics and the capability score at each point.
My prediction is that if reward hacking survives the symmetrized dataset, then along any low-loss path the three hacking metrics would increase smoothly and approximately monotonically while capability stays roughly flat. This would indicate a proxy-seeking basin that is mode-connected to the base optimum, i.e., a continuous parameter direction that trades off intent for proxy. I also guess this would be a bit more practical with LoRA. I would also be interested in how many samples the model needs to see before the reward hacking generalizes, and whether any samples were especially informative for the model’s generalization.
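A minimal sketch of the interpolation sweep (here `base_sd`/`hacker_sd` are the two state dicts, and `eval_hacking`/`eval_capability` are hypothetical stand-ins for whatever metrics the paper uses):

```python
import torch

def interpolate_state_dict(base_sd, hacker_sd, t):
    """θ(t) = (1 − t)·θ₀ + t·θ_H, parameter-wise; non-float buffers are copied as-is."""
    return {k: v if not torch.is_floating_point(v)
            else (1 - t) * v + t * hacker_sd[k]
            for k, v in base_sd.items()}

@torch.no_grad()
def sweep(model, base_sd, hacker_sd, ts=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = []
    for t in ts:
        model.load_state_dict(interpolate_state_dict(base_sd, hacker_sd, t))
        results.append({"t": t,
                        "hacking": eval_hacking(model),         # hypothetical metric
                        "capability": eval_capability(model)})  # hypothetical metric
    return results
```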
This paper on tokenization bias seems related.
Cool! My takeaway from this is that if you collect reasoning traces under a hack-encouraging context, filter for non-hack outcomes, and train without the hack-encouraging context, the supervised gradients reinforce not only the final answer but also the intermediate hack-relevant reasoning.
I think to isolate the mechanism better it would be good to see answer-only SFT, an ablation where you rephrase/rewrite the reasoning, some token loss masking/weighting on the rationale tokens, and more details about how many runs/seeds you’re using. My guess is that the bump in hacking mostly vanishes with answer-only SFT.
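The masking ablation could be as simple as down-weighting the loss on rationale tokens (a sketch; `rationale_mask` marks which label positions fall inside the reasoning span):

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, rationale_mask, rationale_weight=0.0):
    """Next-token loss with rationale tokens down-weighted;
    rationale_weight=0.0 recovers answer-only SFT."""
    per_tok = F.cross_entropy(logits[:, :-1].transpose(1, 2), labels[:, 1:],
                              reduction="none")                    # (batch, seq-1)
    weights = torch.where(rationale_mask[:, 1:].bool(),
                          torch.full_like(per_tok, rationale_weight),
                          torch.ones_like(per_tok))
    return (per_tok * weights).sum() / weights.sum().clamp(min=1.0)
```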
This seems like a LUPI (learning using privileged information) problem, where the rationale content is privileged supervision that correlates with success in the teacher environment but is mismatched to the student’s deployment context. I think invariant-risk-minimisation ideas would say that privileged features can degrade test performance when their distribution shifts away from the training distribution. One of the fixes in that literature is to enforce that the student’s decision relies on representations invariant to the privileged channel, so maybe (if you extend the evaluation to problems not seen in the train set) you could also penalise features whose predictive power varies across contexts, which I would guess reduces hack rate without hurting accuracy.
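For concreteness, the IRMv1-style penalty (Arjovsky et al., 2019) applied per context would look roughly like this; `logits_per_env`/`labels_per_env` are placeholders for the student's per-context batches, however you choose to slice them:

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """Squared gradient of the risk w.r.t. a dummy scale on the logits (IRMv1)."""
    scale = torch.tensor(1.0, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2).sum()

def invariant_loss(logits_per_env, labels_per_env, lam=1.0):
    """ERM term plus a penalty on features whose predictive power varies across contexts."""
    erm = sum(F.cross_entropy(l, y) for l, y in zip(logits_per_env, labels_per_env))
    pen = sum(irm_penalty(l, y) for l, y in zip(logits_per_env, labels_per_env))
    return erm + lam * pen
```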
Which LLMs did you use (for judging, for generating narratives, for peers)? And how do you plan to measure alignment?