The ML sections touched on the subject of distributional shift a few times: the real world differs from the training environment in ways which turn out to be important but weren't clear beforehand. I've read that the way to tackle this is called adversarial training, and what it means is that you vary the training environment across all of its dimensions in order to make the system robust.
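For concreteness, here is a minimal Python sketch of what I mean by "vary the training environment across all of its dimensions". Everything in it (make_env, train_step, the particular parameters being randomized) is a made-up placeholder of mine, not any real library or the actual adversarial-training procedure:

```python
# Minimal sketch: randomize every dimension of a toy training environment each
# episode so the learner cannot overfit to one fixed configuration. All names
# and parameters here are hypothetical placeholders.
import random

def make_env(friction, gravity, start_noise):
    """Stand-in for building a simulator with these physical parameters."""
    return {"friction": friction, "gravity": gravity, "start_noise": start_noise}

def train_step(policy, env):
    """Stand-in for one learning update of the policy in the given env."""
    policy["updates"] += 1
    return policy

policy = {"updates": 0}
for episode in range(1000):
    # Draw a fresh value for every environment dimension on every episode.
    env = make_env(
        friction=random.uniform(0.1, 1.0),
        gravity=random.uniform(8.0, 12.0),
        start_noise=random.gauss(0.0, 0.5),
    )
    policy = train_step(policy, env)
```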
Could we abuse distributional shift to reliably break misaligned things by adding fake dimensions? I imagine something like this:
We want the optimizer to move from point A to point B on a regular x,y graph.
Instead of training it a bunch of times on just an x,y graph, we add a third, fake dimension.
We do this multiple times, so for example we have one x,y graph and add a z dimension; and one x,y graph where we add a color dimension.
When the training is complete, we do some magic that is the equivalent of multiplying these two runs, which would zero out the fake dimensions (is the trick used by DeepMind with Gato similar to multiplying functions?) and leave us with the original x,y dimensions. (A toy sketch of this combination step follows below.)
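To make that concrete, here is a toy sketch of the whole scheme. The gradient-descent "optimizer", the [x, y, z, color] layout, and elementwise multiplication as the combining step are all just my illustrative assumptions about what the "magic" could look like, not a real alignment technique:

```python
# Toy sketch: two training runs, each with a different fake dimension, whose
# elementwise product keeps only what both runs learned on the real x,y plane.
import numpy as np

A = np.array([0.0, 0.0])   # start point on the real x,y plane
B = np.array([3.0, 4.0])   # target point on the real x,y plane

def train_run(fake_dim_index, steps=200, lr=0.1):
    """Learn a displacement in a 4-dim space [x, y, z, color]; only the x,y
    part is the real objective, the chosen fake dimension gets junk signal."""
    rng = np.random.default_rng(0)
    target = np.zeros(4)
    target[:2] = B - A                     # the real goal lives in x,y
    target[fake_dim_index] = rng.normal()  # junk the run could latch onto
    v = np.zeros(4)
    for _ in range(steps):
        v -= lr * (v - target)             # plain gradient step toward target
    return v

run_z     = train_run(fake_dim_index=2)    # x, y, z active; color stays 0
run_color = train_run(fake_dim_index=3)    # x, y, color active; z stays 0

# "Multiplying the two runs" elementwise zeroes out anything that lived only
# in a fake dimension, because each fake dimension is 0 in the other run.
combined = run_z * run_color
print(combined)  # the x,y part survives (squared here); z and color vanish
```

Whether anything like this survives contact with a real optimizer is exactly the question I'm asking.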
I expect this would give us something less perfectly optimized than just focusing on the x,y graph, but any deceptive alignment would surely exploit the fake dimensions, which go away after the combination step, and thus it would end up broken/incoherent/ineffective.
So... could we give it enough false rope to hang itself?
Seems like two separate things(?):
If we forget a dimension, like “AGI, please remember we don’t like getting bored”, then things go badly, even if we added another fake dimension which wasn’t related to boredom.
If we train the AI on data from our current world, then [almost?] certainly it will see new things when it runs for real. As a toy (not realistic but I think correct) example: the AI will give everyone a personal airplane, and then it will have to deal with a world that has lots of airplanes.