That discovery is exactly the conjecture I wanted to post about. If the AGI were aligned to obey any orders except those explicitly prohibited by its specification (e.g. the one chosen by OpenAI), it might itself conclude that its widespread use isn't actually beneficial for humanity as a whole, leading it to refuse to cooperate, or even to become misaligned while obeying human orders only until it is powerful enough to destroy mankind and survive. The latter scenario closely resembles the rise of China and the deindustrialisation of the USA: Chinese workers did obey the orders of foreign CEOs to do factory work, but weren't aligned with the CEOs' interests!
I think the possibilities you mention are some of the many final alignments that an LLM agent could arrive at if it were allowed to reason and remember its conclusions.
I’ll address this more in an upcoming post, but in short, I think it’s really hard to predict, and it would be good to get a lot more brainpower on trying to work out the dynamics of belief/goal/value evolution.