In light of the previous section, there’s an obvious path to alignment: a few neurons (or at least some simple embedding) turn out to correspond to human values, we use the tools of microscope AI to find that embedding, and just like that the alignment problem is basically solved.
This is the part I disagree with. The network does recognise trees, or at least green things (given that the grass looks pretty brown in the low-tree pic).
Extrapolating this, I expect the AI might well have neurons that correspond roughly to human values on the training data. Within the training environment, human values, the amount of dopamine in human brains, the curvature of human lips (in smiles), the number of times the reward button is pressed, and maybe even the amount of money in human bank accounts might all be strongly correlated.
You will have successfully narrowed human values down to within the range of things that are strongly correlated with human values in the training environment. If you take this signal and apply enough optimization pressure, you are going to get the equivalent of a universe tiled with tiny smiley faces.
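To make that concrete, here’s a toy sketch in Python (scikit-learn; the features, noise levels, and correlation numbers are all made up for illustration): a probe trained where the true target and a cleaner proxy agree 99% of the time looks like a near-perfect “values detector” on-distribution, and turns out to be a smile detector the moment the correlation breaks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, correlation):
    """Binary target ("values") and a proxy ("smiles") that usually agrees."""
    values = rng.integers(0, 2, size=n)
    agree = rng.random(n) < correlation
    smiles = np.where(agree, values, 1 - values)
    # Two observed channels: a noisy one for values, a clean one for the proxy.
    x = np.column_stack([
        values + rng.normal(0.0, 1.0, n),
        smiles + rng.normal(0.0, 0.1, n),
    ])
    return x, values, smiles

# Train where values and smiles agree 99% of the time.
x_tr, v_tr, _ = make_data(50_000, correlation=0.99)
probe = LogisticRegression(max_iter=1000).fit(x_tr, v_tr)

# On-distribution, the probe looks like a near-perfect values detector...
x_te, v_te, _ = make_data(10_000, correlation=0.99)
print(probe.score(x_te, v_te))                    # ~0.99

# ...but decorrelate the proxy and it turns out to track smiles instead.
x_ood, v_ood, s_ood = make_data(10_000, correlation=0.5)
print(probe.score(x_ood, v_ood))                  # ~0.5, roughly chance on values
print((probe.predict(x_ood) == s_ood).mean())     # still high on smiles
```

Anything that optimizes the probe’s output hard enough will push on the smiles channel, because that is what the probe actually reads.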
Note that the examples in the OP are from an adversarial generative network. If its notion of “tree” were just “green things”, the adversary should be quite capable of exploiting that.
> You will have successfully narrowed human values down to within the range of things that are strongly correlated with human values in the training environment. If you take this signal and apply enough optimization pressure, you are going to get the equivalent of a universe tiled with tiny smiley faces.
The whole point of the “natural abstractions” section of the OP is that I do not think this will actually happen. Off-distribution behavior is definitely an issue for the “proxy problems” section of the post, but I do not expect it to be an issue for identifying natural abstractions.
> Note that the examples in the OP are from an adversarial generative network. If its notion of “tree” were just “green things”, the adversary should be quite capable of exploiting that.
In order for the network to produce good pictures, the concept of “tree” must be hidden in there somewhere, but it could be hidden in a complicated and indirect manner. I am questioning whether the particular single node selected by the researchers encodes the concept of “tree” or just “green thing”.
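One way to poke at this is sketched below, with everything hypothetical: the generator `gan`, its intermediate conv layer `gan.layer4`, the channel index `unit`, and the latent size 128 are all assumptions for illustration, not the actual setup from the OP. Boost the candidate unit during generation and check whether the output gains tree shapes or merely green pixels.

```python
import torch

def green_fraction(images):
    """Fraction of pixels where green dominates red and blue (B, 3, H, W)."""
    r, g, b = images[:, 0], images[:, 1], images[:, 2]
    return ((g > r) & (g > b)).float().mean().item()

def generate_with_boost(gan, layer, channel, z, boost=5.0):
    """Generate while forcing one channel's activation upward via a hook."""
    def hook(_module, _inputs, output):
        output[:, channel] += boost
        return output
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        images = gan(z)
    handle.remove()
    return images

z = torch.randn(64, 128)  # assumed latent dimension
with torch.no_grad():
    baseline = gan(z)
boosted = generate_with_boost(gan, gan.layer4, unit, z)

# If boosting mostly adds green pixels rather than recognisable tree shapes,
# "green thing" fits the unit better than "tree".
print(green_fraction(baseline), green_fraction(boosted))
```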
Ah, I see. You’re saying that the embedding might not actually be simple. Yeah, that’s plausible.