A common misconception is that STEM-level AGI is dangerous because of something murky about “agents” or about self-awareness. Instead, I’d say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.
Call such sequences “plans”.
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan was that executing it would successfully achieve some superhumanly ambitious technological goal like “invent fast-running whole-brain emulation”, then hitting a button to execute the plan would kill all humans, with very high probability.
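For concreteness, here is a minimal toy sketch of that sampling procedure — my own illustration, not something from the thread: a length-weighted prior over strings (standing in for plans), rejection-sampled against a success predicate. The `achieves_goal` oracle is a pure placeholder for the thought experiment’s conditioning; nothing here is meant to be a real plan evaluator.

```python
import random
import string

# Toy formalization of "sample a random plan from the space of all writable
# plans, weighted by length, conditioned on it achieving the goal": draw a
# length L with probability proportional to 2**-L, draw a uniformly random
# string of that length, and keep it only if a (hypothetical) success
# predicate says executing it would reach the target state.

ALPHABET = string.printable  # stand-in for "any extant formal language"


def sample_length(max_len: int = 64) -> int:
    """Sample a plan length L with P(L) proportional to 2**-L (truncated)."""
    lengths = list(range(1, max_len + 1))
    weights = [2.0 ** -l for l in lengths]
    return random.choices(lengths, weights=weights, k=1)[0]


def achieves_goal(plan: str) -> bool:
    """Placeholder oracle: would executing this plan achieve the ambitious goal?
    In the thought experiment, this is the *only* thing we know about the plan."""
    raise NotImplementedError("hypothetical oracle; no real evaluator exists here")


def sample_successful_plan() -> str:
    """Rejection-sample from the length-weighted prior, conditioned on success."""
    while True:
        length = sample_length()
        plan = "".join(random.choices(ALPHABET, k=length))
        if achieves_goal(plan):
            return plan
```

The claim under discussion is then a claim about what a typical draw from `sample_successful_plan` looks like; all of the work is done by the conditioning on success, not by the sampler itself.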
This seems true to me, but mostly because I expect such plans to look like ‘run this sketchy Python program on your cluster, then do what it says’ (which would just summon some insanely smart eldritch AI). So this argument seems mostly circular (it’s also conditioning on arbitrary technological development contained within the plan). Edit: it seems circular to me; it might not seem circular on other views.
(That said, I don’t expect the plan to necessarily literally kill all humans, just to take over the world, but this is due to galaxy-brained trade and common-sense morality arguments which are mostly out of scope and shouldn’t be a thing people depend on.)
More generally, the space of short or shortest programs (or plans) which accomplish a given goal is an incredibly cursed and malign space. For shortest programs, even if we condition on the program being runnable on modern hardware, we seem totally screwed.
I think that reasoning about these sorts of insane, eldritch, malign spaces mostly doesn’t provide good intuition for how AI will go in practice.
I don’t think your claim makes the argument circular / question-begging; it just means there’s an extra step in explaining why and how a random action sequence destroys the world.
Maybe you mean that I’m putting the emphasis in the wrong place, and it would be more illuminating to highlight some specific feature of random smart short programs as the source of the ‘instrumental convergence’ danger? If so, what do you think that feature is?
From my current perspective I think the core problem really is that most random short plans that succeed in sufficiently-hard tasks kill us. If the causal process by which this happens includes building a powerful AI optimizer, or building an AI that builds an AI, or building an AI that builds an AI that builds an AI, etc., then that’s interesting and potentially useful to know, but that doesn’t seem like the key crux to me, and I’m not sure it helps further illuminate where the danger is ultimately coming from.
(That said, I don’t expect the plan to necessarily literally kill all humans, just to take over the world, but this is due to galaxy-brained trade and common-sense morality arguments which are mostly out of scope and shouldn’t be a thing people depend on.)
Very happy to hear someone with an idea like this who explicitly flags that we shouldn’t gamble on this being true!
One reason I like “the danger is in the space of action sequences that achieve real-world goals” rather than “the danger is in the space of short programs that achieve real-world goals” is that it makes it clearer why adding humans to the process can still result in the world being destroyed.
If powerful action sequences are dangerous, and humans help execute an action sequence (that wasn’t generated by human minds), then it’s clear why that is dangerous too.
If the danger instead lies in powerful “short programs”, then it’s more tempting to say “just don’t give the program actuators and we’ll be fine”. The temptation is to imagine that the program is like a lion, and if you just keep the lion physically caged then it won’t harm you. If you’re instead thinking about action sequences, then it’s less likely to even occur to you that the whole problem might be solved by changing the AI from a plan-executor to a plan-recommender. Which is a step in the right direction in terms of actually grokking the nature of the problem.