Nathaniel Mitrani

Karma: 69

Nathaniel Mitrani 28 May 2026 15:56 UTC
2 points
0
in reply to: Cam’s comment on: Character-trained models can struggle to generalise
Thanks Cam!
I agree that this could be a nice testbed for assessing the generalisation ability of alignment techniques, and that it would be really interesting to test SDF on why/stories/high-quality SFT on this dataset, which seem to induce better generalisation. The inspiration for this post was actually the intuition that DPO/SFT-based techniques are brittle in some sense that SDF isn’t, so the natural next step after seeing the former is to test the latter!
I will note, however, that I am unsure how well these would work on such small (<10B) models.

Nathaniel Mitrani 28 May 2026 15:48 UTC
2 points
0
in reply to: ariana_azarbal’s comment on: Character-trained models can struggle to generalise
Hi, thanks for your comment!
You are right to flag this. I think the hypothesis I was testing (character struggles to generalise OOD) is supported by this set of experiments, since both the email body and the agentic scaffolding are OOD, but the precise cause is not. I ran an additional experiment for this:
The model is given the same system prompt with the agentic scaffolding ablated, and is instructed to draft the email (rather than send it, as before).
It seems that most of the gap comes from agentic scaffolding, and some comes from the email body. This is consistent with both being OOD elements that would reduce character/trait presence.

What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking

Nathaniel Mitrani, Rhea Karty, dwk and Alan Cooney

28 May 2026 10:50 UTC

22 points

0 comments · 8 min read · LW link

(arxiv.org)

Character-trained models can struggle to generalise

Nathaniel Mitrani

25 May 2026 12:58 UTC

22 points

4 comments · 4 min read · LW link

Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks

Nathaniel Mitrani, sassanb, Cam and Puria

21 May 2026 10:11 UTC

31 points

0 comments · 5 min read · LW link

(arxiv.org)

Investigating Neural Scaling Laws Emerging from Deep Data Structure

Nathaniel Mitrani and Ari Brill

9 Oct 2025 20:11 UTC

4 points

0 comments · 8 min read · LW link

Making the case for average-case AI Control

Nathaniel Mitrani

5 Feb 2025 18:56 UTC

5 points

0 comments · 5 min read · LW link