I think peak intelligence (peak capability to reach a goal) will not be limited by the amount of compute, raw data, or algorithmic capability to process the data well, but by the finite amount of reality that’s relevant to achieving that goal. If one wants to take over the world, the way internet infrastructure works is relevant. The exact diameters of all the stones in the Rhine river are not, and neither is the number of red dwarfs in the universe. If we’re lucky, the amount of reality that turns out to be relevant for taking over the world is not too far beyond what humanity can already collectively process. I can see this as a way for the world to be saved by default (but I don’t think it’s super likely). I do think this makes an ever-expanding giant pile of compute an unlikely outcome (but some other kind of ever-expanding AI-led force a lot more likely).
I think this is probably true, and yet I also don’t think humans are likely to be anywhere near this peak intelligence level yet. Also, simply being able to think faster, without being more knowledgeable or intelligent, would be a significant strategic advantage in competition or conflict. Even that would hit a peak, where additional speed (all else held constant) would confer no further advantage.
Similarly, knowledge, like the diameters of river stones, has its own peak. That’s going to be much more context-dependent, though: different knowledge is relevant to different problems. Some problems benefit from in-depth knowledge about them; others are knowledge-light.
So, intelligence (the capacity to utilize knowledge, reason abstractly, and concoct useful plans) and speed of thought are much more general capabilities. In humans, these three attributes (knowledge, intelligence, and speed of thought) tend to be highly entangled due to upstream causes like education and genetics. In AI, we see them come apart: some very knowledgeable systems with excellent retrieval speed don’t seem very intelligent, and some intelligent systems are very slow or only very narrowly knowledgeable.
I think the main problem is that the two main weak points (computer systems and humans) have an increasing attack surface. That is, if we introduce protective measures in software, the protective measures themselves can become sources of vulnerability, unless we are really sure that this is not the case.
I’m now wondering whether this idea has already been worked out by someone (probably?) Any sources?
My current main cruxes:
Will AI get takeover capability? When?
Single ASI or many AGIs?
Will we solve technical alignment?
Value alignment, intent alignment, or CEV?
Defense>offense or offense>defense?
Is a long-term pause achievable?
If there is reasonable consensus on any one of these, I’d much appreciate knowing about it. Otherwise, I think they should be research priorities.
I offer no consensus, but my own opinions:
0-5 years.
There will be a first ASI that “rules the world” because its algorithm or architecture is so superior. If there are further ASIs, that will be because the first ASI wants there to be.
Contingent.
For an ASI you need the equivalent of CEV: values complete enough to govern an entire transhuman civilization.
Offense wins.
It is possible, but would require all the great powers to be convinced, and every month it is less achievable, owing to proliferation. The open sourcing of Llama-3 400b, if it happens, could be a point of no return.
These opinions, except the first and the last, predate the LLM era, and were formed from discussions on Less Wrong and its precursors. Since ChatGPT, the public sphere has been flooded with many other points of view, e.g. that AGI is still far off, that AGI will naturally remain subservient, or that market discipline is the best way to align AGI. I can entertain these scenarios, but they still do not seem as likely as: AI will surpass us, it will take over, and this will not be friendly to humanity by default.
Regulation proposal: make it obligatory to have only satisficer training goals. Aim for a loss of 0.001, not a loss of 0. This should stop an AI in its tracks even if it goes rogue. By setting the satisficing thresholds thoughtfully, we could theoretically tune the size of our warning shots.
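A minimal sketch of what “aim for a loss of 0.001, not a loss of 0” could look like in a toy training loop, purely illustrative: the threshold value, model, and data below are made up, and this only shows the training-time stopping rule, not how to make a deployed agent keep respecting it.

```python
# Toy sketch: treat the loss target as a satisficing threshold rather than
# something to minimize without bound. Model, data, and threshold are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                       # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 4)
y = x.sum(dim=1, keepdim=True)                # toy regression target

LOSS_TARGET = 0.001                           # satisficer threshold: "good enough"

for step in range(10_000):
    loss = loss_fn(model(x), y)
    if loss.item() <= LOSS_TARGET:            # stop once the threshold is met,
        break                                 # instead of pushing toward loss 0
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"stopped at step {step} with loss {loss.item():.4f}")
```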
In the end, someone is going to build an ASI with a maximizer goal, leading to a takeover, barring regulation or alignment plus a pivotal act. However, turning takeovers into warning shots is a very meaningful intervention, as it prevents the takeover and provides a policy window of opportunity.
Tune AGI intelligence by easy goals
If an AGI is provided an easily solvable utility function (“fetch a coffee”), it will lack the incentive to self-improve indefinitely. The fetch-a-coffee-AGI will only need to become as smart as a hypothetical simple-minded waiter. By tuning how easy a utility function is to satisfy, we can therefore tune the intelligence level we want an AGI to reach through self-improvement. The only way to achieve an indefinite intelligence explosion (until e.g. material boundaries) would be to program a utility function that maximizes something. This type of utility function is therefore the most dangerous.
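A toy illustration of that tuning claim, under made-up assumptions (a hand-written success-probability curve and a crude “self-improve only while it raises expected utility” rule): an agent with a bounded, easily satisfied goal stops improving after a few steps, while one with an open-ended maximizing goal never does.

```python
# Hedged sketch, not a result: numbers and the success-probability model are invented.

def p_success(capability: float) -> float:
    """Chance of fetching the coffee, saturating as capability grows."""
    return 1 - 0.5 ** capability

def improvement_steps(utility, max_steps=1000, eps=1e-9):
    """Self-improve only while an extra unit of capability raises utility."""
    cap = 1.0
    for step in range(max_steps):
        gain = utility(cap + 1) - utility(cap)
        if gain <= eps:           # no incentive left to self-improve
            return step
        cap += 1
    return max_steps

bounded = lambda c: min(p_success(c), 0.999)   # "good enough" waiter
maximizing = lambda c: c                        # open-ended: more is always better

print(improvement_steps(bounded))     # stops after a handful of steps
print(improvement_steps(maximizing))  # runs to max_steps: never satisfied
```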
Could we create AI safety by prohibiting maximizing-type utility functions? Could we safely experiment with AGIs just a little smarter than us, by using moderately hard goals?
The hard part is that the real world is complicated, and setting goals that truly create no incentive for self-improvement or gaining power is an unsolved problem.
Relevant Rob Miles video.
One could use artificial environments that are less complicated, and of course we do, but it seems like this leaves some important problems unsolved.
Thanks for your insights. I don’t really understand ‘setting [easy] goals is an unsolved problem’. If you set a goal like “tell me what 1+1 is”, isn’t that possible? And once it’s completed (“2!”), the AI would stop self-improving, right?
I think this may contribute to just a tiny piece of the puzzle, however, because there will always be someone setting a complex or, worse, unachievable goal (“make the world a happy place!”), and boom, there you have your existential risk again. But in a hypothetical situation where you have your AGI in the lab, no one else has one, and you want to play around safely, I guess easy goals might help?
Curious about your thoughts, and also, I can’t imagine this is an original idea. Any literature already on the topic?
Suppose I get hit by a meteor before I can hear your “2”—will you then have failed to tell me what 1+1 is? If so, suddenly this simple goal implies being able to save the audience from meteors. Or suppose your screen has a difficult-to-detect short circuit—your expected utility would be higher if you could check your screen and repair it if necessary.
Because a utility maximizer treats a 0.09% improvement over a 99.9% baseline just as seriously as it treats a 90% improvement over a 0% baseline, it doesn’t see these small improvements as trivial, or in any way not worth its best effort. If your goal actually has some chance of failure, and there are capabilities that might help mitigate that failure, it will incentivize capability gain. And because the real world is complicated, this seems like it’s true for basically all goals that care about the state of the world.
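A tiny worked version of that arithmetic, with purely illustrative numbers:

```python
# To an expected-utility maximizer, a move from 99.9% to 99.99% success is
# worth acting on like any other positive gain, so long as it costs nothing
# in the goal's own terms. Numbers are illustrative.
U_SUCCESS = 1.0   # utility of having told the user "2"

eu_plain   = 0.999  * U_SUCCESS   # just answer and hope no meteor strikes
eu_defense = 0.9999 * U_SUCCESS   # also build a meteor-defense laser first

print(eu_defense - eu_plain)      # 0.0009 > 0, so the maximizer prefers the laser
```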
If we have a reinforcement learner rather than a utility maximizer with a pre-specified model of the world, this story is a bit different, because of course there will be no meteors in the training data. Now, you might think this means the RL agent cannot care about meteors, but this is actually somewhat undefined behavior, because the AI still gets to see observations of the world. If it is vanilla RL with no “curiosity,” it won’t ever start to care about the world until the world actually affects its reward (which for meteors will take much too long to matter, but does become important when the reward is more informative about the real world). But if it’s more along the lines of DeepMind’s game-playing agents, then it will try to find out about the world, which will increase its rate of approaching optimal play.
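A schematic of that distinction written as a reward computation, in the spirit of prediction-error curiosity; the world model, bonus weight, and numbers are placeholders rather than any specific system.

```python
# The curious agent adds an intrinsic bonus for observations its world model
# predicts badly; vanilla RL (beta = 0) ignores the world model entirely.
import numpy as np

def total_reward(extrinsic: float, obs: np.ndarray,
                 predicted_obs: np.ndarray, beta: float = 0.0) -> float:
    prediction_error = float(np.mean((obs - predicted_obs) ** 2))
    return extrinsic + beta * prediction_error

obs = np.array([1.0, 0.0, 3.0])     # what the agent actually saw
pred = np.array([0.8, 0.1, 2.0])    # what its world model expected

print(total_reward(0.5, obs, pred, beta=0.0))   # vanilla RL: surprise ignored
print(total_reward(0.5, obs, pred, beta=0.1))   # curious agent: rewarded for probing the unknown
```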
There are definitely ideas in the literature that relate to this problem, particularly trying to formalize the notion that the AI shouldn’t “try too hard” on easy goals. I think these attempts mostly fall under two umbrellas—other-izers (that is, not maximizers) and impact regularization (penalizing the building of meteor-defense lasers).
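A hedged sketch of the impact-regularization idea: task reward minus a penalty on how much the plan changes the world, so that “just answer” beats “answer and also build a meteor-defense laser.” The impact measure here is a stand-in; real proposals (e.g. relative reachability or attainable utility preservation) define it far more carefully.

```python
# Schematic only: the impact numbers, penalty weight, and scenario are invented.

def penalized_score(task_reward: float, impact: float, lam: float = 10.0) -> float:
    """Task reward minus a weighted penalty on how much the plan changes the world."""
    return task_reward - lam * impact

just_answer = penalized_score(task_reward=0.999,  impact=0.0)    # low-impact plan
build_laser = penalized_score(task_reward=0.9999, impact=0.05)   # slightly better odds, big impact

print(just_answer, build_laser)   # 0.999 vs 0.4999: the low-impact plan wins
```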
Thanks again for your reply. I see your point that the world is complicated and a utility maximizer would be dangerous, even if the maximization is supposedly trivial. However, I don’t see how an achievable goal has the same problem. If my AI finds the answer of 2 before a meteor hits it, I would say it has solidly landed at 100% and stops doing anything. Your argument would hold if it decided to rule out all possible risks first, before actually starting to look for the answer to the question, which it would otherwise quickly find. But since ruling out those risks would be much harder than finding the answer, I can’t see my little agent doing that.
I think my easy goals come closest to what you call other-izers. Any more pointers for me to find that literature?
Thanks for your help, it helps me to calibrate my thoughts for sure!
I think 1+1 = ? is actually not really an easy enough goal, since it’s not 100% sure that the answer is 2. Getting to 100% certainty (including about what I actually meant with that question) could still be nontrivial. But let’s say the goal is ‘delete filename.txt’? Could be the trick is in the language...
When we decided to attach moral weight to consciousness, did we have a comparable definition of what consciousness means or was it very different?
I think it might have been kinda the other way around. We wanted to systematize (put on a firm, principled grounding) a bunch of related stuff like care-based ethics, individuality, identity (and the void left by the abandonment of the concept of “soul”), etc, and for that purpose, we coined the concept of (phenomenal) consciousness.
AI takeovers are probably a rich field. There are partial and full takeovers, reversible and irreversible takeovers, aligned and unaligned ones. While to me all takeovers seem bad, some could be a lot worse than others. Thinking through specific ways a takeover could happen could provide clues about how to reduce the chance that it does. In comms as well, takeovers are a neglected and important subtopic.
The difference between AGI and takeover-level AI could be appreciable. If we’re lucky, takeover by raw capability level (as opposed to power granted during application) turns out to be impossible. In any case, we can try to increase the world’s robustness against takeover. There’s a certain AI takeover capability level, and we should try to push that threshold upwards as much as possible. Insofar as AI can help with this, we could use it. The extreme case where the AI takeover capability level never gets reached, because of ever-increasing defense by AI, could be called a positive defense-offense balance.
I can see general internet robustness against hacking as helpful for raising the AI takeover capability threshold. A single IT system that everyone uses (an operating system, a social media platform, etc.) is fragile to hacking, so should perhaps be avoided. Personally, I think an AI able to take over the internet might also be able to take over the world, but some people don’t seem to believe this. Therefore, it is perhaps also useful to increase the gap between taking over the internet and taking over the world, e.g. by making biowarfare harder, taking weapons offline, etc. Finally, lab safety measures such as airgapping a novel frontier training run might help as well.
Minimum hardware leads to maximum security. As a lab or a regulatory body, one can increase the safety of AI prototypes by reducing the hardware or the amount of data researchers have access to.
AGI is unnecessary for an intelligence explosion
Many arguments assume that an intelligence explosion would require an AGI. However, it seems to me that the critical requirement for achieving this explosion is that an AI can self-improve. Which skills are needed for that? If we have a hardware overhang, it probably comes down to the kind of skills an AI researcher uses: reading papers, combining insights, doing computer experiments until new insights emerge, writing papers about them. Perhaps an AI PhD can weigh in on the actual skills needed. My argument, however, is that far from all of the mental skills humans have are needed for AI research. Appreciating art? Not needed. Intelligent conversation about non-AI topics? Not needed. Motor skills? Not needed.
I think the skills needed most for AI research (and therefore self-improvement) are skills at which a computer may be relatively strong: methodical thinking, language processing, coding. Therefore I would expect us to reach an intelligence explosion significantly earlier than actual AGI with all human skills. This should be important for the timeline discussion.
Technically, tiling the entire universe with paperclips or tiny smiling faces would probably count as modern art...