Are Solomonoff Daemons exponentially dense?
Some doomers have very strong intuitions that doom is almost assured for almost any way of building AI. Yudkowsky likes to say that alignment is about hitting a tiny part of value space in a vast universe of deeply alien values.
Is there a way to make this more formal? Is there a formal model in which some kind of Solomonoff daemon / mesa-optimizer / gremlin in the machine starts popping up all over the place as the cognitive power of the agent is scaled up?
Imagine that a magically powerful AI decides to set up a new political system for humans and create a “Constitution of Earth” that will be perfectly enforced by local smaller AIs, while the greatest one travels away to explore other galaxies.
The AI decides that the fairest way to create the constitution is randomly. It will choose a length, for example 10,000 words of English text. Then it will generate all possible combinations of 10,000 English words. (It is magical, so let’s not worry about how much compute that would actually take.) Out of the generated combinations, it will remove the ones that don’t make any sense (an overwhelming majority of them) and the ones that could not be meaningfully interpreted as “a constitution” of a country (this is kinda subjective, but the AI does not mind reading them all, evaluating each of them patiently using the same criteria, and accepting only the ones that pass a certain threshold). Out of the remaining ones, the AI will choose the “Constitution of Earth” randomly, using a fair quantum randomness generator.
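To make the procedure concrete, here is a toy sketch of the same generate–filter–sample loop at a scale that actually runs (the vocabulary, length, and filter are made-up stand-ins, not anything the thought experiment specifies):

```python
import itertools
import random

# Toy version of the thought experiment: enumerate every word sequence of a
# fixed length, keep only the ones that pass some "could this be read as a
# constitution?" filter, then pick one of the survivors uniformly at random.
VOCAB = ["liberty", "council", "banana", "shall", "vote", "tax"]
LENGTH = 3  # the post says 10,000 words; 3 keeps the enumeration tractable

def passes_filter(words):
    # Stand-in for the AI's subjective "is this interpretable as a constitution?" test.
    return "shall" in words and "banana" not in words

candidates = [c for c in itertools.product(VOCAB, repeat=LENGTH) if passes_filter(c)]
chosen = random.choice(candidates)  # uniform pick, standing in for the quantum RNG
print(" ".join(chosen))
```

The only point of the toy is that the final pick is a uniform draw over everything that survives the filter, which is the property the question below hinges on.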
Shortly before the result is announced, how optimistic would you feel about your future life, as a citizen of Earth?
As an aside (that’s still rather relevant, IMO), it is a huge pet peeve of mine when people use the word “randomly” in technical or semi-technical contexts (like this one) to mean “uniformly at random” instead of just “according to some probability distribution.” I think the former elevates and reifies a way-too-common confusion and draws attention away from the important upstream generator of disagreements, namely how exactly the constitution is sampled.
I wouldn’t normally have said this, but given your obvious interest in math, it’s worth pointing out that the answers to these questions you have raised naturally depend very heavily on what distribution we would be drawing from. If we are talking about, again, a uniform distribution from “the design space of minds-in-general” (so we are just summoning a “random” demon or shoggoth), then we might expect one answer. If, however, the search is inherently biased towards a particular submanifold of that space, because of the very nature of how these AIs are trained/fine-tuned/analyzed/etc., then you could expect a different answer.
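A quick numerical illustration of that dependence (all numbers here are invented for the sake of the example; nothing is a claim about the real distribution):

```python
import random

random.seed(0)

N_MINDS = 1_000_000
FRIENDLY = set(range(1_000))  # pretend only 0.1% of mind-space is friendly

def sample_uniform():
    # "Random potshot" into the whole design space.
    return random.randrange(N_MINDS)

def sample_biased():
    # Training/fine-tuning modeled as a hard bias toward a small submanifold
    # that (by assumption!) overlaps heavily with the friendly region.
    return random.randrange(2_000)

def p_friendly(sampler, trials=100_000):
    return sum(sampler() in FRIENDLY for _ in range(trials)) / trials

print("uniform draw:", p_friendly(sample_uniform))  # ~0.001
print("biased draw: ", p_friendly(sample_biased))   # ~0.5
```

Same space of minds, wildly different answers; which is why “how is the constitution (or the mind) actually sampled?” is the upstream question.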
Fair point. (I am not convinced by the argument that if the AIs are trained on human texts and feedback, they are likely to end up with values similar to humans’, but that would be a long debate.)
What’s that analogy supposed to be analogous to? Do you think the process of value formation in an AI is going to have a random element?
The question was how to justify the opinion that most possible outcomes are bad. My argument was that if you agree that a random outcome is likely bad… that implies that most outcomes are bad.
If most outcomes were good instead, a random outcome would likely be good.
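Spelled out, the step I am relying on is just the uniform case (which, as the replies below point out, is exactly the contested assumption):

```latex
% Uniform draw from a finite outcome set O:
\[
  P(\mathrm{bad}) = \frac{|O_{\mathrm{bad}}|}{|O|},
  \qquad\text{so}\qquad
  P(\mathrm{bad}) > \tfrac{1}{2} \iff |O_{\mathrm{bad}}| > \tfrac{|O|}{2}.
\]
% Non-uniform draw with distribution p: the equivalence breaks, since
\[
  P(\mathrm{bad}) = \sum_{o \in O_{\mathrm{bad}}} p(o)
\]
% can be large or small regardless of how many outcomes are bad.
```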
To answer your question: we do not have a mathematical definition of “friendly”, so we will most likely use some heuristic instead. Which heuristic we use is one source of randomness. More randomness can be in the implementation details; as a silly example, if we decide that training an LLM on texts of benevolent philosophers is the way to go, it depends on which specific texts we choose. Furthermore, the implementation may contain bugs. Or we may decide that the AI needs to consist of several components, but there are multiple possible ways to put those components together.
There are situations where these sources of randomness don’t matter, because we know what we are doing. For example, if different companies are making calculators in different ways, it is still quite predictable that they will answer 2+2= with 4. The problem is, with friendly AI we don’t know what we are doing, so we won’t get feedback when a solution diverges from the ideal. It’s like with the early LLMs, where the answer to an arithmetic problem containing numbers with more than one digit was quite random.
You need to specify whether your “random” is merely undetermined, or an undetermined pick from an equiprobable set. Only the latter allows you to equate “most” and “most likely”. But equiprobability isn’t a reasonable assumption, because the AIs we build will be guided by our aims and limited by our restrictions.
The mindspace of the Orthogonality thesis is a set of possibilities. The random potshot version of the OT argument is only one way of turning possibilities into probabilities, and not a particularly realistic one. While many of the minds in mindspace are indeed weird and unfriendly to humans, that does not make it likely that the AIs we will construct will be: we are deliberately seeking to build certain kinds of mind, for one thing, and have certain limitations, for another. Random potshots aren’t analogous to the probability density of the act of building a certain type of AI without knowing much about what it would be.
Most configurations of matter, most courses of action, and most mind designs, are not conducive to flourishing intelligent life. Just like most parts of the universe don’t contain flourishing intelligent life. I’m sure this stuff has been formally stated somewhere, but the underlying intuition seems pretty clear, doesn’t it?
This sounds related to my complaint about the YUDKOWSKY + WOLFRAM ON AI RISK debate:
I wish there had been some effort to quantify @stephen_wolfram’s “pockets of irreducibility” (section 1.2 & 4.2), because if we can prove that there aren’t many, or that they are hard to find & exploit by ASI, then the risk might be lower.
I got this tweet wrong. I meant: if pockets of irreducibility are common, and non-pockets are rare and hard to find, then the risk from superhuman AI might be lower. I think Stephen Wolfram’s intuition has merit but needs more analysis to be convincing.