This sounds a lot like what @Seth Herd’s post on instruction-following AGI is about:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
Thanks for the mention.
Here’s how I’d frame it: I don’t think it’s a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in ways that seem sensible in retrospect, or do something like “going crazy” from prompt injections or from strange chains of thought that lead to ill-considered beliefs, which then gain control over their functional goals.
It seems like the AGI’s core goal should be to follow instructions or take correction: corrigibility as a singular target (or at least the prime target). It seems noticeably safer to use intent alignment as a stepping-stone to value alignment.
Of course, leaving humans in charge of AGI/ASI even for a little while sounds pretty scary too, so I don’t know.