Okay, then let me try to directly resolve my confusion. My current understanding is something like this: in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we’ve screened off evolution, the size of the genome, etc.; all of that is going into the “with certain structural parameters” part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with the same structural parameters, feed it the same training data, and get the same result (“don’t steal” rather than “don’t get caught”)?
[...]
Okay, no, I think I see the problem, which is that I’m failing to consider that evolutionary learning and childhood learning are happening at different times through different algorithms, whereas for AIs they’re both happening in the same step by the same algorithm. Does that fit your model of what would produce the confusion I was going through above?
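To make that distinction concrete, here’s a toy sketch (everything in it is invented for illustration): in the human case, a slow outer loop selects a “genome”, and a different, faster algorithm does within-lifetime learning inside each outer step; in the standard ML case, a single algorithm plays both roles in one loop.

```python
# Toy bilevel sketch (all functions and numbers invented for illustration).
import random

def lifetime_learning(genome: float, experience: list) -> float:
    """Inner loop: fast, error-driven updates during one 'lifetime'."""
    weights = genome
    for x in experience:
        weights += 0.1 * (x - weights)
    return weights

def fitness(weights: float) -> float:
    """Toy objective: weights that end a lifetime near 1.0 are fitter."""
    return -abs(weights - 1.0)

def evolution(generations: int = 30) -> float:
    """Outer loop: slow selection over genomes, scored *after* lifetime learning."""
    population = [random.gauss(0, 1) for _ in range(20)]
    for _ in range(generations):
        experience = [random.gauss(1, 0.1) for _ in range(50)]
        population.sort(key=lambda g: fitness(lifetime_learning(g, experience)),
                        reverse=True)
        parents = population[:10]
        population = [p + random.gauss(0, 0.1) for p in parents for _ in range(2)]
    return population[0]

best_genome = evolution()
# An AI trained end-to-end has no such split: one algorithm (e.g. SGD) updates
# one set of parameters in one loop, playing both roles at once.
```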
@Scott: can you elaborate on what the problem was? I thought the answer to your question was “tautologically yes” (where you have to be careful to count things like “training algorithm”, “initial state”, etc. as part of the “structural parameters”), and I am confused about what update you made and what you were previously confused about.
(And it seems several other commenters are confused too.)
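Here’s the minimal sense in which I mean “tautologically yes”, as a sketch (toy model and numbers invented): once the architecture, initial state, data, and training algorithm are all counted as “structural parameters”, training is a deterministic function of them, so two identically configured runs give identical results (at least on CPU, where these operations are deterministic).

```python
# Minimal determinism sketch (toy model; nothing here is a real experiment).
import torch

def train_run(seed: int = 0) -> torch.Tensor:
    torch.manual_seed(seed)                            # fixes "initial state"
    model = torch.nn.Linear(4, 1)                      # fixes "architecture"
    opt = torch.optim.SGD(model.parameters(), lr=0.1)  # fixes "training algorithm"
    data = torch.randn(64, 4)                          # same seed -> same "training data"
    target = data.sum(dim=1, keepdim=True)
    for _ in range(100):
        opt.zero_grad()
        loss = ((model(data) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model.weight.detach().clone()

# Same structural parameters in, same result out (bit-identical on CPU):
assert torch.equal(train_run(), train_run())
```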
I think everyone in the field would be incredibly impressed if they managed to hook up a pretrained GPT to an AlphaStar-for-Minecraft and get back out something that could talk about its strategies with human coplayers. I’d consider that a huge advance in alignment research (nowhere near the point where we all don’t die, to be clear, but still hella impressive) because of the level of transparency increase it would imply: that there was an AI system that could talk about its internally represented strategies, somehow.
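One rough shape such a hookup could take, as a sketch (all module names and sizes are hypothetical, not a real system): let a policy head and a language head read the same internal state, so the text channel at least has access to whatever representation drives the Minecraft actions; whether the text faithfully reports that representation is exactly the hard part.

```python
# Schematic sketch only (hypothetical architecture, not a real system).
import torch
import torch.nn as nn

class TalkativeAgent(nn.Module):
    def __init__(self, obs_dim=128, hidden=256, n_actions=32, vocab=50257):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)  # world-state encoder
        self.policy_head = nn.Linear(hidden, n_actions)   # picks Minecraft actions
        self.language_head = nn.Linear(hidden, vocab)     # stand-in for a pretrained GPT
                                                          # conditioned on the same state

    def forward(self, obs_seq):
        _, h = self.encoder(obs_seq)   # h holds the internally represented strategy
        state = h[-1]
        return self.policy_head(state), self.language_head(state)

agent = TalkativeAgent()
action_logits, token_logits = agent(torch.randn(1, 16, 128))  # one sequence of 16 observations
```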
@Eliezer: For what it’s worth, I think it’s pretty plausible that we get something like this, and would be interested in betting on it. Though one important clarification: I mean that the AI’s text statements appear to us to correspond with and predict its Minecraft behavior, not that the AI’s statements reflect the true cognition that resulted in its behavior. (I’m much more uncertain about whether the latter would be the case, and it seems hard to bet on that anyway.)
(The rest of the “human-level at Minecraft but not human-level generality” section seemed roughly right to me.)