neural networks routinely generalize to goals that are totally different from what the trainers wanted
I think this is a bit of a non sequitur. I take Tom to be saying “AIs will care about stuff that is natural to express in human concept-language”, whereas your evidence is primarily about “AIs will care about what we tell them to”, though I could imagine some of that evidence spilling over onto Tom’s proposition.
I do think the limited success of interpretability is evidence against Tom’s proposition. For example, there’s a fair amount of work where you try to replace an SAE feature or a neuron (call it R) with some other module that implements our natural-language explanation of what R was doing, and that doesn’t work.
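For concreteness, here is a minimal, purely illustrative sketch of the kind of test I mean, not any particular paper’s setup: overwrite one unit’s activation with the output of a rule implementing our natural-language explanation of it, and compare task loss with and without the swap. The model, data, `UNIT`, and `explanation_rule` below are all toy stand-ins.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "trained" model: embedding -> hidden layer -> classifier.
class ToyModel(nn.Module):
    def __init__(self, vocab=100, d_hidden=32, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_hidden)
        self.hidden = nn.Linear(d_hidden, d_hidden)
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, tokens):
        h = torch.relu(self.hidden(self.embed(tokens)))
        return self.head(h.mean(dim=1))

model = ToyModel().eval()
tokens = torch.randint(0, 100, (64, 16))   # fake batch of token ids
labels = torch.randint(0, 5, (64,))        # fake labels
loss_fn = nn.CrossEntropyLoss()

UNIT = 7  # index of the hidden unit R we claim to understand

def explanation_rule(tokens):
    """Our natural-language story about R (e.g. 'fires when token 3 is
    present'), turned into a scalar activation per example. Hypothetical."""
    return (tokens == 3).any(dim=1).float() * 2.0

def run(patch: bool):
    def hook(module, inp, out):
        out = out.clone()
        # Overwrite R's activation with what the explanation says it should be.
        out[:, :, UNIT] = explanation_rule(tokens).unsqueeze(1)
        return out

    handle = model.hidden.register_forward_hook(hook) if patch else None
    with torch.no_grad():
        loss = loss_fn(model(tokens), labels).item()
    if handle is not None:
        handle.remove()
    return loss

baseline = run(patch=False)
patched = run(patch=True)
print(f"baseline loss {baseline:.3f} | explanation-patched loss {patched:.3f}")
# If the explanation were faithful, the two losses would be close; the pattern
# in the work gestured at above is that the patched model is usually much worse.
```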
Thanks, that’s a totally reasonable critique. I kind of shifted from one claim to the other over the course of that paragraph.
Something I believe, but failed to say, is that we should not expect those misgeneralized goals to be particularly human-legible. In the simple environments collected in the goal misgeneralization spreadsheet, researchers can usually, eventually, figure out what the internalized goal was and express it in human terms (e.g. ‘identify rulers’ rather than ‘identify tumors’), but I would expect that to become less and less true as systems get more complex. That said, I’m not aware of any strong evidence for that claim; it’s just my intuition.
I’ll edit slightly to try to make that point clearer.