I think part of why we’re talking past each other is: I claim that, if you know how to make an AI that can autonomously do anything that a smart human adult can autonomously do, including what they can do over weeks, months, and years, then you already have radical superintelligence, or at most you’re 1-2 years away from it. See Foom & Doom §1.7 and Response to Blake Richards: AGI, generality, alignment, & loss functions §3.2.
For example, if you can make one AI that can do everything that John von Neumann can do, then you can almost definitely make 1,000,000 such AIs that are cooperating, thinking at superhuman speed, and telepathically sharing their knowledge. We can, if we like, call this collective “one AI”, albeit an AI that takes 1,000,000× more chips to run. And now it’s an ASI, right?
Likewise, if you know how to make an AI with the charisma and strategizing skills of an average person, and you also know how to make an AI with the charisma and strategizing skills of Hitler, what’s stopping you from making an AI with dramatically more charisma and strategizing skills than Hitler?
Anyway, I’m happy to focus on John von Neumann-level brain-like AGI (or AHI). I claim that we don’t know how to make such a thing that’s able to have large impacts on the world, and is not a ruthless sociopath. By “large impacts”, I mean e.g. as discussed in The Duplicator: Instant Cloning Would Make the World Economy Explode.
I wanted to suggest that once you’ve mapped such a world model, and the world model is everything an agent can use for planning, you can do virtual look-ahead, i.e. roll-outs in the world model conditioned on your policy, and see whether the policy ends up killing 200,000,000 people, or causing significantly more harm than one would statistically expect from humans going about their day-to-day, or producing strange effects at a distance that were not part of the goal state / seed state, etc. - i.e., screen for scheming.
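To make the kind of screening I have in mind a bit more concrete, here is a minimal sketch. It assumes (hypothetically) a `world_model` we can step forward, a `policy` to condition on, and some way to decode rollouts into human-meaningful outcome variables - that last assumption is doing most of the work, and none of these names are real APIs:

```python
# Hypothetical sketch of "virtual look-ahead": roll the world model forward under the
# candidate policy and flag plans whose predicted consequences fall far outside the
# human baseline. Every name here is a placeholder, not a real API.

HUMAN_BASELINE_DEATHS = 0        # what a typical human's plan "causes" (statistically ~0)
SIDE_EFFECT_THRESHOLD = 10.0     # tolerated magnitude of off-goal effects (arbitrary units)

def decode_outcome(trajectory):
    # Placeholder: in reality this step requires interpreting the world model's
    # latent states, which is exactly the hard part.
    return {"expected_deaths": 0, "off_goal_effects": 0.0}

def screen_policy(world_model, policy, seed_state, n_rollouts=1000, horizon=10_000):
    """Return (rollout index, reason) pairs for rollouts that look dangerous."""
    flagged = []
    for i in range(n_rollouts):
        state, trajectory = seed_state, [seed_state]
        for _ in range(horizon):
            action = policy(state)                      # policy conditioned on current state
            state = world_model.predict(state, action)  # one step of simulated dynamics
            trajectory.append(state)
        outcome = decode_outcome(trajectory)
        if outcome["expected_deaths"] > HUMAN_BASELINE_DEATHS:
            flagged.append((i, "mass casualties"))
        if outcome["off_goal_effects"] > SIDE_EFFECT_THRESHOLD:
            flagged.append((i, "large effects at a distance not implied by the goal/seed state"))
    return flagged
```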
Let’s say (for concreteness) that the world-model is some fancy cousin of a Bayes net, with 10M unlabeled nodes and a giant list of 1B connections of the form: “NODE 1984357 implies NODE 9238572 with strength 0.16209” (which happens to correspond to something vaguely like: “a certain brand of tire often has a certain style of hubcap”, but we don’t know that, it’s unlabeled).
And then a “plan” would be some list of let’s say a few thousand nodes: “CURRENT THOUGHT / PLAN = NODES 6405951, 4505739, 3901796, 3394766, …”
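To pin that down in code, here’s a toy encoding of the hypothetical data structure (the specific node IDs and strength are just the ones from the example above; everything else is made up for illustration):

```python
# Toy encoding of the hypothetical world model: ~10M unlabeled nodes, ~1B weighted
# "implies" edges, and a "plan" / current thought as a set of a few thousand active
# node IDs. Purely illustrative.

from dataclasses import dataclass

@dataclass
class Edge:
    src: int         # e.g. NODE 1984357 ("a certain brand of tire", but we don't know that)
    dst: int         # e.g. NODE 9238572 ("a certain style of hubcap", but we don't know that)
    strength: float  # e.g. 0.16209

world_model_edges = [
    Edge(src=1984357, dst=9238572, strength=0.16209),
    # ... roughly a billion more unlabeled entries
]

# A "plan" is just a bag of active node IDs with no human-readable labels attached:
current_plan = {6405951, 4505739, 3901796, 3394766}  # ...a few thousand in total
```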
First of all, what’s the procedure to figure out whether this particular plan will kill 200M people or not?
Second of all, when a human is doing something in the world, he’s querying this world-model maybe 5 times a second, and also editing it 5 times a second. Presumably, a human-speed AHI would be similar. If so, do you imagine that a human will be inspecting each of these plans and each of these edits? If so, aren’t you cutting the speed down by many orders of magnitude? Or if not, i.e. if it’s an automated system, then how would that work?
A) Yes, being much faster and able to clone itself would give an AHI a great advantage over us, and in principle this could be the seed of an AGI society with far faster minds than ours (they’d still have to wait for their experiments to finish, though). Anyway, I think we should not let this happen. We have no mental tools to reason about this kind of thing except the examples of current societies. As far as I can tell, it would be impossible for us to control. Best case, we’d end up with pet status.
B) The “plan” is not directly encoded in the world model; actions are. The plan is either derived from the world model, or it is just another name for the policy conditioned on the current world-model state. A simple example, sketched below, would be MCTS over world-model states conditioned on some policy.
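A minimal sketch of that idea, assuming a hypothetical `world_model` with a one-step `predict` and some value signal, and a `policy` that can propose candidate actions (none of these interfaces are real; the point is only that the plan lives in the search tree, not in the world model itself):

```python
# Sketch of MCTS over world-model states conditioned on a policy. All interfaces are
# hypothetical placeholders.

import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def mcts_plan(world_model, policy, root_state, n_simulations=1000, horizon=50, c=1.4):
    root = Node(root_state)
    for _ in range(n_simulations):
        # 1) Selection: walk down the tree by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children,
                       key=lambda ch: ch.value / (ch.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
        # 2) Expansion: add children for the actions the policy proposes in this state.
        for action in policy.propose_actions(node.state):
            node.children.append(Node(world_model.predict(node.state, action), node, action))
        leaf = random.choice(node.children) if node.children else node
        # 3) Rollout: simulate forward with the policy inside the world model.
        state, ret = leaf.state, 0.0
        for _ in range(horizon):
            action = policy(state)
            state = world_model.predict(state, action)
            ret += world_model.reward(state)   # whatever value signal the planner uses
        # 4) Backpropagation: update statistics from the leaf up to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += ret
            leaf = leaf.parent
    # The "plan" is the action sequence along the most-visited branch.
    plan, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        plan.append(node.action)
    return plan
```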
Of course we will have to figure out what those states (and actions) mean. The good news is that at any given time the world model will have only a few (<<100) active high-level states (figuring out which states are high-level and which are not is possible, e.g., by checking for robustness to input perturbations, as in the sketch below). How many things can you imagine at once? I think this would be easier than what Anthropic is currently doing in their MechInterp department.
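A sketch of that perturbation check, under the assumption that we can read out which latent states are active for a given input (hypothetical interface):

```python
# Call a latent state "high-level" if it stays active when the raw input is slightly
# perturbed. The world_model interface is hypothetical; numpy is only used for the noise.

import numpy as np

def high_level_states(world_model, raw_input, n_perturbations=100, noise_scale=0.01,
                      robustness_threshold=0.9):
    baseline = world_model.active_states(raw_input)   # set of currently active state IDs
    counts = {s: 0 for s in baseline}
    for _ in range(n_perturbations):
        noisy = raw_input + np.random.normal(0.0, noise_scale, size=raw_input.shape)
        for s in world_model.active_states(noisy):
            if s in counts:
                counts[s] += 1
    # States that survive almost every perturbation are the "high-level" candidates.
    return {s for s, c in counts.items() if c / n_perturbations >= robustness_threshold}
```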
I think it is an interesting question whether it would have been possible to decode from Stalin’s mind, in any meaningful sense, how many people died under his regime. But in general I think it is possible to decode from people’s minds whether they just hurt or manipulated somebody. [This made me realize that the world model of a sociopath will indeed look different from a “normal” person’s, because the internal states describing the agent will look different. We could check those states in response to other people getting hurt, too. I guess we would not need these internal states for tasks that do not involve emulating a persona, so we could discard them anyway.]
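A sketch of the check mentioned in the bracketed note, assuming (hypothetically) that we have already located the internal states that describe the simulated other agent, and that we have reference statistics from models we trust - both big assumptions:

```python
# Present scenarios in which another person gets hurt, read out the internal states that
# describe the simulated other agent, and flag the model if its responses deviate strongly
# from a reference distribution. All interfaces here are placeholders.

import numpy as np

def empathy_audit(model, harm_scenarios, reference_stats, z_threshold=3.0):
    """Return indices of scenarios where the audited model's agent-state response is
    anomalous relative to reference (mean, std) statistics for that scenario."""
    flagged = []
    for i, scenario in enumerate(harm_scenarios):
        response = model.agent_state_activations(scenario)   # hypothetical read-out vector
        mean, std = reference_stats[i]
        z = np.abs((response - mean) / (std + 1e-9)).mean()
        if z > z_threshold:
            flagged.append(i)
    return flagged
```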
Under your threat model we only need to audit the policy for sociopathy. Once it passes the test, we can run it at full speed. The test may take a while, sure. However, not testing models for safety when doing level 4-5 automation would be insane, even if you think you’ve managed to align the model.
With access to the internet / scientific literature, any future machine will know everything about the alignment and safety discourse and how people think a model should behave.