Interesting; I too suspect that good world models will help with data efficiency. Even under the existing training paradigm, where a lot of data is needed for generalization to work well, an AI with a good internal world model could generate usable synthetic examples for incremental training. For example, when a child sees a photo of a strange new animal from the side, the child likely surmises that the animal looks the same from the other side; if the photo shows only one eye, the child can infer that, viewed head-on, the animal’s face has two eyes, and so on. Because the child has a rather reliable model of ‘animal’, they can create reliable synthetic training data from a single picture.
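To make that concrete, here is a toy sketch (my own illustration, not anyone’s actual training pipeline): the “world model” is reduced to a single prior, that animals are bilaterally symmetric, so one labeled image can be expanded into synthetic variants the prior says are equally valid. The function names and the image-as-list-of-rows representation are invented for the example.

```python
def mirror(image):
    """Return the left-right mirror of a 2D image (list of rows)."""
    return [list(reversed(row)) for row in image]

def augment(example):
    """Expand one (image, label) pair into the variants a symmetry
    prior treats as equally valid training data."""
    image, label = example
    return [(image, label), (mirror(image), label)]

if __name__ == "__main__":
    # One 'photo' of a new animal seen from the side...
    photo = [[0, 1, 1],
             [0, 0, 1]]
    # ...yields two training examples instead of one.
    dataset = augment((photo, "okapi"))
    print(len(dataset))  # 2
```

A richer world model would license far more than a flip (novel poses, occluded parts, the second eye), but the mechanism is the same: the model’s priors stand in for data that was never observed.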
And I like your framing of the internally generated reward as valuable for learning too. While I expect that reward is a composite of experience (enlightened self-interest, reading and discussion, etc.), it can still matter more day-to-day than the external rewards received in the moment. (I think this opens up a lot of philosophy: what the ‘ultimate’ goals for your internal ethics and personally fulfilling rewards are, and so on. But I see your point.)
I think that, even if LLMs don’t smoothly evolve into AGI and then ASI, an alternative ‘brain-like’ AGI will have a similar progress ramp that allows for alignment learning-by-doing in a very meaningful way. To explain, let’s discuss the LLM path a bit. OpenAI’s deliberative alignment and Anthropic’s more sober discussion of the ongoing alignment challenge both highlight the effort that companies today put into understanding and improving LLM alignment. Alignment work is progressing through improved training, RLHF, RLAIF, Constitutional Classifiers, and so on. One would expect that, as AI agents see wider use and home robots come to market, customers will refuse to buy unsafe AI agents, and AI companies will need to learn to improve AI behavior. It would be great to have regulation or strong liability laws to help with this, but customer demand alone will provide impetus for general alignment of today’s systems. As LLMs and their cousins, VLAs, move toward AGI, we’ll have tolerably aligned AGI, and we’ll have learned how to get alignment to generalize for an AGI. As AGIs advance toward ASI, we’ll continue to have product pressure, and RLAIF will improve in capability along with the AGIs themselves. The point of that summary is not to say that I’m sure AI safety will play out well, but that there is a lot of effort going into preventing sociopathic results.
Now if we posit a different learning system that takes us to ASI, I would still expect a multi-year ramp from ‘not yet on the public radar’ to ASI. Many companies and watchdog groups will be watching the new systems grow, make mistakes, and get fixed. If this new learning approach produces AIs as capable as today’s systems but LESS aligned, they aren’t likely to sell well. I think that before we need to worry about ASI, we should accept that the AGI we build will be valuable to someone and, hence, by definition tolerably aligned (although I don’t disagree that ‘tolerable’ may be a low bar).
In the end, I would expect that a useful AGI (not ASI) would need features like corrigibility (the ability to evaluate goals and adjust or abort them), curiosity (recognizing when a conclusion or plan may be wrong), and self-critiquing (using classifiers or other systems to stress-test a plan for unwanted side effects). I disagree with the premise that ASIs will evolve into ruthless optimizers, because a useful AGI will have learned the value of reconsidering goals and trying to understand the full impact of its plans and actions. These features don’t guarantee we avoid sociopaths, but I see them as necessary for useful AGI and, hence, ASI developers will have something to build on.
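The corrigibility/self-critique combination can be sketched as a simple control loop. This is purely illustrative (all names are invented, and real critics would be learned classifiers, not lambdas): a plan is run past independent critic functions before execution, and any objection triggers revision, or an abort if revision keeps failing, rather than ruthless optimization of the original goal.

```python
def critique(plan, critics):
    """Collect objections from every critic; an empty list means 'proceed'."""
    return [msg for critic in critics if (msg := critic(plan))]

def deliberate(plan, critics, revise, max_rounds=3):
    """Revise the plan until no critic objects, or abort (corrigibility)."""
    for _ in range(max_rounds):
        objections = critique(plan, critics)
        if not objections:
            return ("execute", plan)
        plan = revise(plan, objections)   # reconsider the goal/plan
    return ("abort", plan)                # give up rather than push through

if __name__ == "__main__":
    # A toy critic that flags irreversible side effects.
    critics = [lambda p: "irreversible step" if p.get("irreversible") else None]
    # A toy reviser that removes the flagged step.
    revise = lambda p, objections: {**p, "irreversible": False}
    print(deliberate({"goal": "tidy lab", "irreversible": True}, critics, revise))
```

The design point matches the paragraph above: the abort branch is what makes the agent corrigible, and the critics are the stress-test for unwanted side effects.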