David Africa

Karma: 1,122

Research Scientist with the Alignment team at UK AISI.

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

17 Jun 2026 18:43 UTC
20 points
0 comments, 6 min read
(arxiv.org)

Several frontier models are substantially prefill aware

17 Jun 2026 17:41 UTC
54 points
1 comment, 5 min read

Failing to Ragebait the New Gemma

11 Jun 2026 17:50 UTC
29 points
0 comments, 3 min read

Two More Methods for Consistency Training and Some New Ways to Apply It

5 Jun 2026 21:06 UTC
18 points
0 comments, 7 min read

LURE: Alignment Evaluations to Reduce Evaluation Awareness

2 Jun 2026 18:20 UTC
26 points
5 comments, 5 min read

Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training

19 May 2026 13:55 UTC
44 points
7 comments, 6 min read

Bringing More Expertise to Bear on Alignment

8 May 2026 10:29 UTC
87 points
1 comment, 8 min read

What Happens When a Model Thinks It Is AGI?

23 Apr 2026 22:35 UTC
62 points
4 comments, 5 min read

Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training

20 Apr 2026 16:07 UTC
25 points
1 comment, 12 min read

From personas to intentions: towards a science of motivations for AI models

14 Apr 2026 12:26 UTC
77 points
5 comments, 7 min read

Emergent stigmergic coordination in AI agents?

David Africa, 15 Mar 2026 12:30 UTC
49 points
2 comments, 3 min read

Steering Awareness: Models Can Be Trained to Detect Activation Steering

12 Mar 2026 23:34 UTC
19 points
0 comments, 6 min read

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

9 Mar 2026 10:47 UTC
86 points
11 comments, 10 min read

A Proposal for TruesightBench

David Africa, 5 Feb 2026 14:33 UTC
14 points
0 comments, 4 min read

Massive Activations in DroPE: Evidence for Attention Reorganization

David Africa, 18 Jan 2026 15:05 UTC
19 points
0 comments, 8 min read

David Africa’s Shortform

David Africa, 13 Jan 2026 13:13 UTC
4 points
12 comments, 1 min read

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

21 Dec 2025 0:53 UTC
201 points
25 comments, 9 min read

[Paper] Does Self-Evaluation Enable Wireheading in Language Models?

David Africa, 8 Dec 2025 16:03 UTC
25 points
2 comments, 2 min read

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
176 points
37 comments, 2 min read

Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity

David Africa, 6 Oct 2025 15:26 UTC
23 points
6 comments, 7 min read