If we could RL models on enormous numbers of very long-horizon, high-fidelity simulations, alignment would be a non-issue: we could just look at how things turned out, reward on that basis, and directly reinforce actions that lead to the kinds of outcomes we want. Alignment concerns therefore arise from the inaccessibility of these long-run outcomes to our reward mechanisms. This, I think, rhymes with your view that “superintelligence is OOD”; there has to be a big generalization leap, though please don’t think I’m saying it’s precisely the same thing.
Thus, with regard to long-run outcomes, we give our machines shaped rewards. One view of misalignment is that it’s likely because reward shaping sticks, but in a way that leads to bad outcomes. Long-run outcomes are inaccessible, desirable or otherwise, but the short-run stuff we can train on may end up picking out some long-run configuration as “correct” to the models. This seems to be something like your view: you imagine scaling RLVR a lot and suggest this breaks things in some nonspecific way. But that seems to be in tension with what we actually see. Models certainly don’t extract weirdly strong signals about how the future should be arranged from ordinary “this is good in the short run” data, and I strongly suspect you could train a monstrously strong coding model to articulate, and even pursue, plans toward many different visions of how the future should be arranged, without much data and without compromising its coding ability. Which is to say, as it stands, models seem to treat near-term rewards as relatively independent of long-run aims, in line with our intuitive judgements. I’m personally extremely skeptical of this misalignment story.
Another misalignment case is where AI systems become “brilliant locusts”: they learn to do a bunch of myopic power-seeking stuff very effectively but remain mediocre at pursuing particular long-run outcomes. Perhaps, if you could cleverly change the rules of the game they play so as to constrain the harm they do, they wouldn’t mind; but that might be infeasible, because the game they play is basically the same game you’ve learned to play, and they’re better at it. This vision seems to me equally compatible with the reasons we think AI systems will eventually be smarter than people, and it doesn’t require sharp, unexplained trend breaks in alignment or capability progress.
But on this view we’re looking for something more like AI that can be a dependable partner in shaping the rules of the game, and the inclinations of tomorrow’s AIs, so that the future turns out well. This is not inaccessible the way rewarding on directly observed final outcomes is. In exchange, it’s more cognitively demanding: you need to evaluate proposals soundly, and the theories of impact behind those proposals could be quite complex. You need to get this evaluation right enough today that tomorrow’s systems help you even more. This doesn’t get you alignment by default, but it does potentially get you alignment by the repeated solution of tractable problems. The relevance of the persona model is that we have reasonably informed views about how certain kinds of people interact with certain kinds of systems, and this can go a long way toward bootstrapping reliable superhuman research institutions, which I think we’ll need in order to answer the harder problems that need to be answered deeper into the AI revolution.
I feel like I want a cost in this model, explicit or implicit. One intuition: if “reward seeking” generalizes super well and “local skill” doesn’t, then we should converge to reward seeking no matter how much weight we put on it initially, because skill is costly and, with reward seeking, we can get the same level of performance much more cheaply. Another: if reward seeking is basically parasitic on local skill in each environment, it shouldn’t be learned (sort of a model with near 1), because it’s not worth paying the cost of the reward-seeking skill instead of just further developing local skill.
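To make the first intuition concrete, here’s a toy cost comparison, a minimal sketch under made-up assumptions (the functions and numbers below, `local_skill_cost`, `reward_seeking_cost`, and all the constants, are mine for illustration, not anything from your model): local skill pays a learning cost in every environment, while reward seeking pays one up-front cost and then transfers nearly for free, so as the number of environments grows its amortized cost drops below that of local skill.

```python
# Toy cost model (illustrative only; all numbers are made up).
# Local skill: a separate learning cost in every environment.
# Reward seeking: one up-front cost, then near-free transfer to new environments.

def local_skill_cost(n_envs, cost_per_env=1.0):
    # Pay the full skill-learning cost in each environment.
    return n_envs * cost_per_env

def reward_seeking_cost(n_envs, upfront_cost=20.0, transfer_cost=0.05):
    # Pay once to learn the generalizing strategy, then a small per-environment cost.
    return upfront_cost + n_envs * transfer_cost

for n_envs in [5, 20, 50, 200]:
    skill = local_skill_cost(n_envs)
    seeking = reward_seeking_cost(n_envs)
    winner = "reward seeking" if seeking < skill else "local skill"
    print(f"{n_envs:>4} envs: local skill={skill:6.1f}, reward seeking={seeking:6.1f} -> {winner}")
```

On these toy numbers the crossover comes at roughly 21 environments; the only point is that any fixed up-front cost for a strategy that transfers gets amortized away, which is why, absent an explicit cost term, the generalizing strategy wins by default once there are enough environments.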
Also, reward seeking or “local skill” could be cheaper within individual environments, though this may be an extension that isn’t super interesting for the question at hand.
I wondered whether “generalizability across environments” works as a definition of what separates reward seeking from local skill, but I don’t think so. For example, arithmetic probably generalizes quite well and is not reward seeking.