Jason R Brown comments on Taking the Training Wheels Off: Aligning LLMs without Personas

Jason R Brown 4 Jun 2026 14:08 UTC
1 point
0
Interesting post!

To what extent do you think this being useful / important is correlated with the Natural Abstraction Hypothesis? This feels like the crux to me.

If some version of NAH is correct, then maybe desirable personas cluster around the natural form of goodness / alignment we desire, and so extrapolating from them will likely be very useful. It might even be that the ways in which they don’t cluster around it are correctable in some natural way that still makes personas a useful starting point.

However, if NAH doesn’t hold, or at least doesn’t hold between humans/personas and superintelligences, then it does seem like personas are much less useful and are very unlikely to meaningfully capture the target we want or guide ASI towards it.
- Matthew Khoriaty 5 Jun 2026 23:45 UTC
  1 point
  0
  I would say that “RLHF makes AIs aligned in many ways, and Emergent Misalignment causes AIs trained on bad code to also be racist” is very weak evidence in favor of the Natural Abstraction Hypothesis, since both can also arise from human statistical patterns. The kind of troll who gives someone backdoored code on the internet is probably also the kind who says that women shouldn’t be computer scientists. (Some prompting-only experiments I ran confirm this.)
  I would be impressed if Claude invented a weird new EA cause area that EAs and philosophers gave serious thought to before deciding that Claude was right and had discovered a new way to be good.
  Part of the problem with personas is that they block us from testing/evaluating the Natural Abstraction Hypothesis, because the personas, mimicking humans, have their own bundles of beliefs and abstractions. Beyond that, humans have contingent statistical patterns in their values and beliefs. Studying the Natural Abstraction Hypothesis at frontier LLM scale requires that we find a way to suppress/avoid the personas that already believe and use the abstractions in question.
  
  As a piece of evidence that Claude isn’t going beyond mimicry to some transcendent fundamental goodness: when asked for “fix everything easily switch” policies, all of the policies it suggests are ones that are already popular with rationalists/technocrats. That isn’t to say that they are bad, but if Claude were really inferring some transcendent goodness, it probably would have suggested something that (e.g.) Zvi hadn’t already heard about, thought about, and liked. All the policies it suggests seem reasonable to me, and it is tempting to call this “aligned”, but what happens when ASI Claude is put in charge of the world and needs to decide what policies to put in place after it has already implemented Zvi’s pet issues? If there were a goodness underlying the personas, that would be amazing. The question is: how do we verify that when all of our evaluations will just evaluate a persona?