Neil Shah

Karma: 96

stumbling, crumbling, haphazardly falling

Failing to Ragebait the New Gemma

Neil Shah, David Africa and arav-dhoot

11 Jun 2026 17:50 UTC

30 points

0 comments, 3 min read

Two More Methods for Consistency Training and Some New Ways to Apply It

David Africa, Sukrati_Gautam, Neil Shah and arav-dhoot

5 Jun 2026 21:06 UTC

25 points

0 comments, 7 min read

Neil Shah 20 May 2026 9:24 UTC
6 points
in reply to: Clément Dumas’s comment on: Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training
Hey, we ran it on the base models with the “malicious evil” prompt and saw extremely high misalignment rates. Across the 3 models we use in our experiments, the average rate is 60.1% (Llama-3.1-8B: 89.8%, Qwen3-8B: 51.4%, Qwen3-32B: 39.4%). The models after IP + CT (consistency training) sit substantially below this baseline, at around 11%-17%.

Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training

David Africa, Sukrati_Gautam and Neil Shah

19 May 2026 13:55 UTC

44 points

7 comments, 6 min read

Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training

David Africa and Neil Shah

20 Apr 2026 16:07 UTC

27 points

1 comment, 12 min read