I think that, even if LLMs don’t smoothly evolve into AGI and then ASI, an alternative ‘brain-like’ AGI will have a similar progress ramp that allows for meaningful alignment learning-by-doing. To explain this, let’s discuss the LLM path a bit. OpenAI’s deliberative alignment and Anthropic’s more sober discussion of the ongoing alignment challenge both highlight the effort that companies today put into understanding and improving LLM alignment. Alignment work is progressing through improved training, RLHF, RLAIF, Constitutional Classifiers, etc. One would expect that, as AI agents see wider use and home robots come to market, customers will refuse to buy unsafe AI agents, and AI companies will need to learn to improve their systems’ behavior. It would be great to have regulation or strong liability laws to help with this, but customer demand alone will provide impetus for general alignment of today’s systems. As LLMs and their cousins VLAs move toward AGI, we’ll have tolerably aligned AGI, and we’ll have learned how to get alignment to generalize for an AGI. As AGIs advance to ASI, we’ll continue to have product pressure, and RLAIF will improve in capability along with the AGIs themselves. The point of that summary is not to say that I’m sure AI safety will play out well, but that there is indeed a lot of effort being put into preventing sociopathic results.
Now if we posit a different learning system that takes us to ASI, I would still expect a multi-year ramp from ‘not yet on the public radar’ to ASI. There will be many companies and watchdog groups watching the new systems grow, make mistakes, and get fixed. If this new learning approach produces AIs as capable as today’s systems but LESS aligned, they aren’t likely to sell well. I think that before we need to worry about ASI, we should accept that the AGI we build will be valuable to someone and, hence, by definition tolerably aligned (although I don’t disagree that ‘tolerable’ may be a low bar).
In the end, I would expect that a useful AGI (not ASI) would need features like corrigibility (the ability to evaluate goals and adjust or abort them), curiosity (recognizing when a conclusion or plan may be wrong), and self-critique (using classifiers or other systems to stress-test a plan for unwanted side-effects). I disagree with the premise that ASIs will evolve into ruthless optimizers, because a useful AGI will have learned the value of reconsidering goals and trying to understand the full impact of its plans and actions. These features don’t guarantee we avoid sociopaths, but I see them as necessary items to solve for useful AGI and, hence, as something the ASI developers will have to build on.