I wanted a concrete illustration of the “human feedback” example. Let me know if this seems like a reasonable interpretation of the scenario.
Let’s call the plan that acquires influence in a way that is obvious to a generative model but invisible and incomprehensible to humans “Plan X.” Since, by definition, I cannot conceive of “Plan X,” we’ll modify this scenario slightly.
Instead of a scenario in which humans design an AI, we’ll imagine a scenario in which some very dumb goblins design an AI that’s as smart as an average human. We’ll call this a human-level AI or HL-AI.
Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions.
The HL-AI controls a robot with camera eyes and motor appendages, and is located in a room full of tools and disassembled Ikea furniture. The robot moves its appendages in response to the HL-AI’s motor outputs, which are initially random. The camera-eye observations are used to train a generative model that predicts what the cameras will observe next, given the previous movements and camera observations.
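In case it helps, here is a minimal sketch of the kind of setup I’m imagining for the generative model: it just learns to predict the next camera frame from the current frame and a motor action. Everything here (the PyTorch architecture, the names, the toy random data standing in for logged robot experience) is my own stand-in, not anything from the original scenario.

```python
import torch
import torch.nn as nn

class ObservationModel(nn.Module):
    """Predicts the next (flattened) camera frame from the current frame and a motor action."""
    def __init__(self, frame_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame, action], dim=-1))

frame_dim, action_dim = 64, 8
model = ObservationModel(frame_dim, action_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy training loop; random tensors stand in for logged (frame, action, next_frame)
# triples collected while the robot flails its appendages.
for _ in range(100):
    frames = torch.randn(32, frame_dim)       # camera-eye observations
    actions = torch.randn(32, action_dim)     # random motor commands
    next_frames = torch.randn(32, frame_dim)  # what the cameras saw next
    loss = nn.functional.mse_loss(model(frames, actions), next_frames)
    opt.zero_grad()
    loss.backward()
    opt.step()
```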
We ask goblins to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these goblin evaluations.
The goblins want the HL-AI to help them build the Ikea furniture. The model has been trained to accurately generate videos that illustrate the predicted results of simulated robot movements. The goblins watch these videos until they find ones in which pieces of furniture are partly or fully assembled. They assign value to the videos according to how much furniture has been built.
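And a similarly hand-wavy sketch of the evaluation step: the goblins’ ratings of predicted videos get distilled into a small reward model. Again, the names and architecture are hypothetical placeholders; the point is just that this model only ever sees videos and scores, never the actual room.

```python
import torch
import torch.nn as nn

class GoblinRewardModel(nn.Module):
    """Maps a (flattened) predicted video to a scalar score imitating goblin ratings."""
    def __init__(self, video_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(video_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.net(video).squeeze(-1)

video_dim = 640  # e.g. 10 predicted frames of 64 features each, flattened
reward_model = GoblinRewardModel(video_dim)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Toy training loop; random tensors stand in for (predicted video, goblin rating) pairs,
# where the rating is "how much partly or fully assembled furniture is visible."
for _ in range(100):
    videos = torch.randn(32, video_dim)
    goblin_scores = torch.rand(32)
    loss = nn.functional.mse_loss(reward_model(videos), goblin_scores)
    opt.zero_grad()
    loss.backward()
    opt.step()
```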
At test time the HL-AI searches for plans that lead to trajectories that look good to goblins.
Based on these evaluations, the HL-AI searches for plans that lead to trajectories in which the most furniture is assembled.
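The “searches for plans” step I have in mind is just best-of-N: sample candidate action sequences, roll each one through the generative model, score the resulting predicted video with the goblin-rating model, and act on the highest-scoring plan. The stand-in models below are placeholders so the snippet runs on its own; in the story they would be the two trained models from the sketches above.

```python
import torch

frame_dim, action_dim, horizon = 64, 8, 10

# Placeholder stand-ins for the trained models (any callables with these shapes would do).
def obs_model(frame, action):
    return torch.tanh(frame + action.sum())   # fake "next frame" prediction

def reward_model(video):
    return video.sum()                         # fake "goblin rating" of a predicted video

def rollout(start_frame, actions):
    """Unroll the generative model over a candidate plan; return the predicted frames."""
    frame, frames = start_frame, []
    for a in actions:
        frame = obs_model(frame, a)
        frames.append(frame)
    return torch.stack(frames)

def search_plan(start_frame, num_candidates=100):
    """Keep the plan whose predicted video the goblin-rating model scores highest."""
    best_plan, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = torch.randn(horizon, action_dim)
        predicted_video = rollout(start_frame, candidate)
        score = reward_model(predicted_video).item()
        if score > best_score:
            best_plan, best_score = candidate, score
    return best_plan

plan = search_plan(torch.zeros(frame_dim))
```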
Failure mode:
Our HL-AI understands reality better than goblin evaluators.
The goblins, for example, don’t have very good object permanence. They can keep track of a piece of Ikea furniture as it’s being assembled step by step. However, if you put up a screen for a couple of minutes, blocking their view of the furniture, you can remove the original piece, introduce a new one (or more than one), and the goblins won’t realize what has happened or even become suspicious.
There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to goblins.
The goblins are really dumb, so they award points to the videos based only on the amount of partially or fully assembled furniture they can see. That means that if the HL-AI figures out how to drag pieces of furniture in from the adjacent office, while turning off its camera eyes so that the goblins can’t see it doing this, the goblins will be tricked into thinking the HL-AI has magically produced extra furniture, and they will reward it highly.
The goblins just think the furniture is magically disappearing from the office, so they keep replacing it. This allows the HL-AI to continue its strategy.
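A toy way to see why this exploit gets rewarded: the goblins’ rating only depends on the frames they actually see, so a “turn the cameras off and drag furniture in” trajectory that ends on the same visible frame scores exactly as well as honest assembly. (Purely illustrative numbers, not part of the original scenario.)

```python
import torch

def goblin_rating(visible_frames):
    """Stand-in for the learned rating: count 'assembled furniture' in the last visible frame."""
    return visible_frames[-1].sum().item()

final_frame = torch.ones(64)  # lots of visible assembled furniture
honest_video  = [torch.zeros(64), torch.full((64,), 0.5), final_frame]  # gradual, genuine assembly
smuggle_video = [torch.zeros(64), torch.zeros(64), final_frame]         # cameras off, then reveal

print(goblin_rating(honest_video), goblin_rating(smuggle_video))  # identical ratings: 64.0 64.0
```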
It’s possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a goblin looking at predicted videos.
The goblins now believe they’ve created an HL-AI that’s able to magically produce furniture, which they expect will be very helpful in solving the problem of the office from which furniture keeps magically disappearing. They keep assigning the HL-AI new “magical creation” tasks, so the HL-AI keeps pointlessly shuffling resources around the goblin world, creating new problems that the goblins then assign it to solve, and havoc ensues.