One objection: if an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me: if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide; that just seems obviously bad for the toddler.
I think this is way more worrying in the case where you’re implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.
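To make the safety-margin point concrete, here’s a minimal numerical sketch in the spirit of the off-switch game analysis (Hadfield-Menell et al.); the payoffs and belief distributions are made up for illustration. The robot’s incentive to stay switch-off-able comes entirely from its uncertainty about the utility of its plan, so an overconfident (possibly mis-specified) belief erases that incentive.

```python
import numpy as np

# Minimal off-switch-game-style sketch (all numbers are invented for illustration).
# The robot is uncertain about the utility U of its proposed action.
# Options: act now (payoff E[U]), defer to the human who will switch the robot
# off whenever the true U < 0 (payoff E[max(U, 0)] if the human is rational),
# or switch itself off (payoff 0).

def value_of_options(u_samples):
    act = u_samples.mean()                    # act without asking
    defer = np.maximum(u_samples, 0).mean()   # let a rational human veto bad actions
    off = 0.0
    return act, defer, off

rng = np.random.default_rng(0)

# Calibrated, uncertain belief: deferring is strictly better than acting.
broad = rng.normal(loc=0.5, scale=2.0, size=100_000)
print(value_of_options(broad))

# Overconfident belief that the action is good (while the true utility might be
# negative): E[max(U, 0)] is nearly E[U], so the robot sees almost no value in
# remaining switch-off-able.
narrow = rng.normal(loc=0.5, scale=0.01, size=100_000)
print(value_of_options(narrow))
```

If the overconfident belief is also wrong, there is no remaining incentive pushing the agent to accept correction, which is the sense in which the margins for safety get narrow.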
Though [the claim that slightly wrong observation model ⇒ doom] isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!
I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem—in those cases, the way you know how ‘human preferences’ rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds, but that’s probably not well-modelled by Boltzmann rationality (e.g. the thing I’m most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
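As a toy illustration of that sensitivity (all hypotheses, actions, and numbers are invented): under a Boltzmann-rational observation model, the fact that the human goes to work today rather than “writing the utopia program” is treated as evidence about how much they value utopia, and the strength of that evidence swings wildly with the assumed rationality parameter and the modeled value of the action.

```python
import numpy as np

# Toy Boltzmann-rational inverse planning (all numbers are invented for illustration).
# The agent observes one human action and updates over two reward hypotheses.

actions = ["go_to_work", "write_utopia_program"]
observed = "go_to_work"

# Hypothetical Q-values the observation model assigns to each action under each
# reward hypothesis. The crux is the modeled value of "write_utopia_program":
# if the model (wrongly) treats it as actually achieving utopia, the human's
# failure to take it looks like strong evidence against caring about utopia.
Q = {
    "values_utopia": {"go_to_work": 1.0, "write_utopia_program": 100.0},
    "indifferent":   {"go_to_work": 1.0, "write_utopia_program": 0.0},
}

def boltzmann_likelihood(q, action, beta):
    # P(action | hypothesis) proportional to exp(beta * Q).
    zs = np.array([beta * q[a] for a in actions])
    probs = np.exp(zs - zs.max())
    probs /= probs.sum()
    return probs[actions.index(action)]

def posterior(beta):
    prior = {h: 0.5 for h in Q}
    unnorm = {h: prior[h] * boltzmann_likelihood(Q[h], observed, beta) for h in Q}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

for beta in [0.01, 0.1, 1.0]:
    print(beta, posterior(beta))
# Varying the rationality parameter beta (part of the observation model) within
# a plausible range swings the posterior from mildly against to overwhelmingly
# against the utopia hypothesis.
```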
It’s also not super clear what you algorithmically do instead—words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.
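For the trajectory-comparison alternative, here’s a tiny sketch (a Bradley-Terry-style comparison model, with invented trajectories and features) of how preference data inherits whatever is missing from the evaluator’s view of the trajectory, which is the ELK-flavoured problem mentioned above.

```python
import numpy as np

# Sketch of preference data from trajectory comparisons, where the comparison
# is made on what the evaluator can see. Trajectories and features are invented.

# Each trajectory has two true features: [task_progress, hidden_damage].
true_features = {
    "honest":    np.array([0.6, 0.0]),
    "deceptive": np.array([0.9, 1.0]),  # looks better, causes unseen damage
}
# The evaluator only observes the first feature (the damage is never surfaced),
# so their comparisons are based on a truncated view of the trajectory.
observed = {k: v[:1] for k, v in true_features.items()}

def bradley_terry_prob(r_a, r_b):
    # P(a preferred over b) under a Bradley-Terry / logistic comparison model.
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

# Comparing on observed info only, the evaluator usually prefers "deceptive".
p = bradley_terry_prob(observed["deceptive"].sum(), observed["honest"].sum())
print(f"P(evaluator prefers the deceptive trajectory) = {p:.2f}")
# Any reward model fit to these comparisons inherits the error, because the
# comparison data never contained the right info about the trajectory.
```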
I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).
I agree Boltzmann rationality (over the action space of, say, “muscle movements”) is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info, including “things that humans say”, and the human can just tell you that hyperslavery is really bad. Obviously you can’t trust everything that humans say, but it seems plausible that if we spent a bunch of time figuring out a good observation model, that would then lead to okay outcomes.
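Here’s a minimal sketch of what that multi-channel observation model could look like (the hypotheses, likelihoods, and “trust” parameter are all made up): even if everyday actions barely distinguish the candidate reward functions, a statement channel with a reasonably calibrated noise model can rule out the bad hypothesis.

```python
import numpy as np

# Sketch of an observation model with two evidence channels: Boltzmann-rational
# actions and noisy human statements. All numbers are invented for illustration.

hypotheses = ["humane_utopia_is_best", "hyperslavery_is_fine"]
prior = np.array([0.5, 0.5])

# Channel 1: everyday actions barely distinguish the hypotheses
# (neither predicts today's muscle movements very differently).
action_likelihood = np.array([0.50, 0.48])

# Channel 2: the human says "hyperslavery is really bad".
# trust = probability the statement reflects the human's actual preferences.
def statement_likelihood(trust):
    # P(statement | hypothesis): high if the statement matches the hypothesis.
    return np.array([trust, 1.0 - trust])

def posterior(trust):
    unnorm = prior * action_likelihood * statement_likelihood(trust)
    return unnorm / unnorm.sum()

for trust in [0.5, 0.9, 0.99]:
    print(trust, posterior(trust))
# With trust = 0.5 the statement channel is ignored; with a reasonably
# calibrated statement model the posterior rules out the hyperslavery
# hypothesis, even though the action channel alone is nearly uninformative.
```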
(Ideally you’d figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of “getting a good observation model” while you still have the ability to turn off the model. It’s hard to say exactly what that would look like since I don’t have a great sense of how you get AGI capabilities under the non-ML story.)
That’s what future research is for!