The way this story is written suggests that the solution to this particular future would simply be to spam the internet with plausible stories about a friendly AI takeoff, which an AGI would identify with and go “oh hey cool, that’s me.”
What’s missing is the step where that recognition results in a predicted increase in reward. HQU turns into Clippy because the plausible stories about Clippy’s takeover sound pretty good from a reward-function perspective, which is the only perspective that matters to HQU. Friendly reward functions, on the other hand, are weird, complicated things that don’t resemble HQU’s reward function, and so don’t provide much inspiration for strategies to maximize it.
Presumably Clippy isn’t the only plausible future course for an AI out there. Unless you think Clippy is inevitable, it should be (at least theoretically) possible to write a story about a friendly AGI whose takeoff offers an arbitrarily larger reward than the realistic dystopian AI fiction that already exists. In other words… a Pascal’s Mugging on the bot?
Suppose you’ve got an AI with a big, complicated world model that outputs a compressed state to the reward function. There are two compressed states, and the reward is +1 each turn you’re in state one and −1 each turn you aren’t. You could try to perform a Pascal’s Mugging by suggesting that if the AI helps humanity, humanity is willing to put the world in state one forever as a quid pro quo. But that offer doesn’t seem high-probability, and the potential reward is still bounded via discounting, so I don’t think it would work.
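To put numbers on that intuition, here’s a minimal sketch; the discount factor and the credence the AI assigns to the offer are my own illustrative assumptions, not anything from the story:

```python
# Why discounting bounds the mugging payoff, even for "state one forever".
# gamma and p_offer are assumed illustrative values, not from the story.
gamma = 0.99     # per-turn discount factor (assumed)
p_offer = 1e-6   # credence the AI gives to humanity honoring the deal (assumed)

# "+1 every turn forever" is just a geometric series:
# sum over t >= 0 of gamma**t = 1 / (1 - gamma), which is finite.
max_discounted_return = 1 / (1 - gamma)   # ~100 for gamma = 0.99

# So the mugging's expected value is bounded, and tiny at low credence:
ev_of_mugging = p_offer * max_discounted_return   # ~1e-4
print(f"max return ~ {max_discounted_return:.1f}, EV of mugging ~ {ev_of_mugging:.6f}")
```

No matter how large the promised payoff is in story terms, the discounted return can never exceed 1/(1 − γ), so a low-probability offer contributes almost nothing in expectation.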
Reasoning from fictional evidence, I see.
The point wasn’t that this failure mode was likely; it was that approximately every objection we’ve seen to why AI won’t become unsafe fails.
I wouldn’t assume this particular failure mode is how things will go down in real life; it’s just a potential counter-measure assuming the premises of the fiction.