After thinking about it a bit, I realized this problem goes much deeper.
What if humanity changes its own values after AGI launch? Then we couldn't align AI to our values once and be done with it; we would need to keep re-aligning it continuously. And this is likely to happen: rationality involves checking one's own (mostly instrumental) values and revising them when they no longer serve one's terminal goals.
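A minimal sketch of what "continuous" alignment could look like, assuming value drift. All names here (elicit_human_values, values_diverged, realign) are hypothetical stand-ins, not any real alignment procedure; the point is only that alignment becomes a loop rather than a single step at launch.

```python
import time

def elicit_human_values():
    """Stand-in for whatever process measures humanity's current values."""
    return {"terminal": ["flourishing"], "instrumental": ["economic growth"]}

def values_diverged(ai_objective, human_values):
    """Stand-in check: has the AI's objective drifted from current values?"""
    return ai_objective != human_values

def realign(ai_objective, human_values):
    """Stand-in update: re-fit the AI's objective to the newly elicited values."""
    return dict(human_values)

ai_objective = elicit_human_values()  # one-time alignment at launch

while True:
    current_values = elicit_human_values()  # humans may have revised their (instrumental) values
    if values_diverged(ai_objective, current_values):
        ai_objective = realign(ai_objective, current_values)  # continuous re-alignment
    time.sleep(60 * 60 * 24)  # re-check on some cadence
```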
So it seems we either need to bind the AI to us in the sense of continuously rating its actions, or to re-examine the goals and values we consider terminal (maybe they are not truly terminal but only close to it, and maybe they don't even rule out a scenario where the AI takes over the world as something unwanted).
I wanted to say that if the AI wipes out humanity, there is no one left to care about human values, so the utility of that scenario can't be treated as negative infinity. But this stops holding once acausal trade is considered: we can care not only about our own goals but also about the goals of other agents who share our values, and those agents still exist in that scenario.
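A toy way to put this in symbols (the notation and the weighting scheme are my own illustration, not a standard formalism): let $U_H$ be utility evaluated from humanity's values alone, and $A$ the set of agents, acausal trading partners included, who share those values.

```latex
% Purely self-regarding view: after extinction no one is left whose
% preferences are being violated, so the outcome need not be minus infinity:
\[
U_H(\text{extinction}) \neq -\infty
\]
% With acausal trade, the evaluation also weights (by some $w_a > 0$) agents
% $a \in A$ who share our values and disprefer humanity's extinction, so the
% scenario stays strongly negative even though humanity itself is gone:
\[
U(\text{extinction}) = U_H(\text{extinction}) + \sum_{a \in A} w_a \, U_a(\text{extinction}) \ll 0
\]
```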