David Africa

Karma: 1,122

Research Scientist with the Alignment team at UK AISI.

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa and Geoffrey Irving

17 Jun 2026 18:43 UTC

20 points

0 comments, 6 min read

(arxiv.org)

Several frontier models are substantially prefill aware

yeedrag, Parv Mahajan, David Africa, alexsouly, Jordan Taylor and RobertKirk

17 Jun 2026 17:41 UTC

54 points

1 comment, 5 min read

Failing to Ragebait the New Gemma

Neil Shah, David Africa and arav-dhoot

11 Jun 2026 17:50 UTC

29 points

0 comments, 3 min read

Two More Methods for Consistency Training and Some New Ways to Apply It

David Africa, Sukrati_Gautam, Neil Shah and arav-dhoot

5 Jun 2026 21:06 UTC

18 points

0 comments, 7 min read

LURE: Alignment Evaluations to Reduce Evaluation Awareness

Igor Ivanov and David Africa

2 Jun 2026 18:20 UTC

26 points

5 comments, 5 min read

Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training

David Africa, Sukrati_Gautam and Neil Shah

19 May 2026 13:55 UTC

44 points

7 comments, 6 min read

Bringing More Expertise to Bear on Alignment

Edmund Lau, Geoffrey Irving, Cameron Holmes and David Africa

8 May 2026 10:29 UTC

87 points

1 comment, 8 min read

What Happens When a Model Thinks It Is AGI?

josh :) and David Africa

23 Apr 2026 22:35 UTC

62 points

4 comments, 5 min read

Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training

David Africa and Neil Shah

20 Apr 2026 16:07 UTC

25 points

1 comment, 12 min read

From personas to intentions: towards a science of motivations for AI models

David Africa and Jacob Pfau

14 Apr 2026 12:26 UTC

77 points

5 comments, 7 min read

Emergent stigmergic coordination in AI agents?

David Africa

15 Mar 2026 12:30 UTC

49 points

2 comments, 3 min read

Steering Awareness: Models Can Be Trained to Detect Activation Steering

josh :) and David Africa

12 Mar 2026 23:34 UTC

19 points

0 comments, 6 min read

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

David Africa, alexsouly, Jordan Taylor and RobertKirk

9 Mar 2026 10:47 UTC

86 points

11 comments, 10 min read

A Proposal for TruesightBench

David Africa

5 Feb 2026 14:33 UTC

14 points

0 comments, 4 min read

Massive Activations in DroPE: Evidence for Attention Reorganization

David Africa

18 Jan 2026 15:05 UTC

19 points

0 comments, 8 min read

David Africa’s Shortform

David Africa

13 Jan 2026 13:13 UTC

4 points

12 comments, 1 min read

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cam, Puria, Kyle O’Brien, David Africa, Samuel Ratnam and andyk

21 Dec 2025 0:53 UTC

201 points

25 comments, 9 min read

[Paper] Does Self-Evaluation Enable Wireheading in Language Models?

David Africa

8 Dec 2025 16:03 UTC

25 points

2 comments, 2 min read

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen and Fabien Roger

8 Oct 2025 22:02 UTC

176 points

37 comments, 2 min read

Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity

David Africa

6 Oct 2025 15:26 UTC

23 points

6 comments, 7 min read