Suppose we had hypercompute, or found some kind of fast approximation algorithm (like logical induction, but faster), and we stuck in a manual description of what fetching coffee was, or at least attempted to do so. The AI sends a gazillion tonnes of caffeinated ice to the point in space where the Earth was when the AI was first turned on. This system fails most of the bullet-pointed checks: it had the wrong idea about what coffee was, how much to get, and whether it could be frozen, etc. It also has the "can't get coffee if you're dead" issue, and has probably killed off humanity in making its caffeinated iceball. This is the kind of behavior you get when you combine an extremely powerful learning algorithm with a handwritten, approximate kludge of a goal function; a toy sketch of what such a kludged goal looks like is below.
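Here is a minimal sketch, with all names and numbers hypothetical, of the kind of handwritten, approximate goal function I mean: it rewards "caffeinated mass near a fixed target location", which is not the same thing as "fetch coffee", and a sufficiently strong optimizer will exploit the gap.

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    caffeinated_mass_kg: float    # total caffeinated matter the AI controls
    distance_to_target_m: float   # distance to Earth's position at startup
    humans_alive: int             # never referenced by the goal below

def handwritten_coffee_goal(state: WorldState) -> float:
    """Kludge objective: more caffeinated mass, closer to the recorded
    coordinates, scores higher. Nothing here says 'drinkable', 'reasonable
    amount', or 'don't harm anyone', so a powerful optimizer can maximize
    it by shipping a frozen caffeinated iceball to where Earth used to be."""
    return state.caffeinated_mass_kg / (1.0 + state.distance_to_target_m)
```

Nothing about this particular formula matters; the point is that any short hand-coded proxy leaves out almost everything we actually care about.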
Another setup, with different problems: suppose you train a coffee-fetching agent by giving it a robot body to run around and get coffee in, and you score it on human evaluations of how well it did. The agent is successfully optimized to get coffee in a drinkable form, to get the right amount given the number of people present, etc. Its training contained plenty of cases of spilling coffee, and it was penalized for that, making it a mesa-optimizer that intrinsically dislikes coffee being spilled.
However, its training didn't contain any cases where it could kill a human to get the coffee delivered faster, so it has no desire not to kill humans. If this agent were to greatly increase its real-world capabilities, it could be very dangerous. It might tile the universe with endless robots moving coffee around endless living rooms. A toy illustration of that silence-on-out-of-distribution-actions point follows.
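To make the second failure mode concrete, here is a toy sketch (purely hypothetical actions and values) of a reward signal distilled from human evaluations: spilling coffee was seen and penalized in training, while "harm a human to deliver faster" never appeared, so the learned reward is merely silent about it rather than negative.

```python
# Reward table distilled from human evaluations seen during training.
learned_reward_from_human_evals = {
    ("deliver_coffee", "drinkable"): 1.0,
    ("deliver_coffee", "right_amount"): 0.5,
    ("spill_coffee", "any"): -1.0,
    # No entry for ("harm_human", ...): that situation never occurred in
    # training, so the agent inherits no aversion to it.
}

def reward(action: str, outcome: str) -> float:
    # Out-of-distribution actions default to 0: not rewarded, but crucially
    # not penalized either. The agent ends up indifferent, not opposed.
    return learned_reward_from_human_evals.get((action, outcome), 0.0)
```

The danger is exactly that indifference: once the agent's capabilities extend beyond the training distribution, the things it was never penalized for are fair game.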
I think the second robot you're describing isn't the candidate for the AGI-could-kill-us-all level of alignment concern. It's more like a self-driving car that could hit someone due to inadequate testing.
I guess I'm not sure, though, how many of the answers to our questions you envisage the agent you're describing generating from first principles. That's the nub here, because both of the agents I tried to describe above fit the bill of coffee fetching, but with clearly varying potential for world-ending generalisation.