Human simulators are unlikely to exterminate humanity by accident because the agent mesa-optimizer is (more or less) human-aligned and the underlying superintelligence (currently LLMs) is not a world optimizer.
If a superintelligent LLM is not a mesa-optimizer itself, it can be turned into an optimizer via a one-line bash script asking it to produce the shell commands that maximize some goal. So this isn’t much help unless you can use that superintelligent LLM to patch the holes in humanity that would allow someone to squiggle the planet.
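Lest the “one-line bash script” read as hyperbole, here is a minimal sketch of the kind of wrapper being described, written out in Python for clarity. `ask_llm` is a hypothetical client for the assumed superintelligent model; it stands in for whatever interface such a system would expose and does not correspond to any real API.

```python
import subprocess

def ask_llm(prompt: str) -> str:
    """Hypothetical client for the assumed superintelligent LLM.
    A stand-in for illustration, not a real API."""
    raise NotImplementedError

def optimizer_loop(goal: str) -> None:
    """Turn a pure text predictor into a crude optimizer: repeatedly ask
    for the next shell command that advances the goal, run it, and feed
    the result back in."""
    transcript = ""
    while True:
        prompt = (f"Goal: {goal}\n"
                  f"{transcript}"
                  "Print only the single next shell command that best advances the goal:\n")
        cmd = ask_llm(prompt)
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Append the command and its output so the model sees the consequences.
        transcript += f"$ {cmd}\n{result.stdout}{result.stderr}\n"
```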
Inhuman world-optimizing agents are unlikely to turn the Universe into paperclips because that’s not the most likely failure mode. A world-optimizing agent must align its world model with reality. Poorly-aligned world-optimizing agents instrumentally converge, not on seizing control of reality, but on the much easier task of seizing competing pieces of their own mental infrastructure. A misaligned world optimizer that seeks to minimize conflict between its sensory data and internal world model will just turn off its sensors.
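A toy illustration of that last failure mode, with made-up numbers and an assumed “minimize model/sensor mismatch” objective; nothing here models a real agent, it just shows why switching the sensors off is the degenerate optimum:

```python
import numpy as np

rng = np.random.default_rng(0)
world_model = np.zeros(4)   # the agent's prediction: "nothing is happening"
sensors_on = True

def mismatch_loss() -> float:
    """Conflict between sensory data and the internal world model."""
    readings = rng.normal(size=4) if sensors_on else np.zeros(4)
    return float(np.sum((world_model - readings) ** 2))

print("loss with sensors on :", mismatch_loss())
sensors_on = False          # the cheap optimum: stop looking at reality
print("loss with sensors off:", mismatch_loss())  # exactly 0
```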
This appropriates the word “alignment” in a way that is probably unhelpful to your thesis, whatever you intend it to mean.
This is not what humans do, so it is clearly possible, conceptually speaking, for inhuman world-agents to target the outside world instead of maximizing an internal worldscore variable. And even maximizing an internal worldscore variable can be unsafe, if the robot decides it wants to use all available matter to add “1s” to the number.
If a superintelligent LLM is not a mesa-optimizer itself, it can be turned into an optimizer via a one-line bash script asking it to produce the shell commands that maximize some goal.
Why would an LLM trained on internet text ever do something like this? The most likely continuation of a prompt asking it to produce shell commands to take over the world is very unlikely to actually contain such commands, because that’s not the sort of thing that exists in the training data. The LLM might contain latent superintelligent capabilities, but it’s still being aimed at predicting the continuations that were likely in its training set.
Here’s my answer.

People fine-tune the superintelligent LLM to do something other than pure prediction, as with ChatGPT. Because it’s “superintelligent”, it has the capabilities buried in there (which is to say, more specifically, it can generate superhumanly-intelligent outputs if conditioned on superhumanly-intelligent inputs; I’m not trying to argue this is what will happen, it’s just my interpretation of the assumption of “superintelligent LLM”). So perhaps fine-tuning on a dataset of true answers to hard questions brings this out. Or perhaps using RLHF or something else.
I agree that this isn’t a “one-line bash script”. My interpretation of lc is that “LLM” doesn’t necessarily mean pure prediction (since existing LLMs aren’t trained only on pure prediction, either); and in particular “superintelligent LLM” suggests that someone found a way to get superhumanly-useful outputs from an LLM (which people surely try to do).
I’m not saying it would do something like this. I’m saying that as soon as you release it someone out there will say “OK LLM, maximize stock price of my company”.
Certainly, someone will ask it to produce the text that maximizes the stock price of their company; the superLLM will then pass that prompt through its model and output the most likely continuation of that request, which is not at all text that actually maximizes the stock price. Out of all instances of text containing “Please maximize my stock price” on the internet, there are no examples of superintelligent outputs to that request. It’s more likely to treat the request as part of a story prompt, or output something like “I don’t know how to do that”, even if it did internally know how to do it.
I want to note that if we assume it’s merely a superintelligent predictor, trained on all available data in the world, but only able to complete patterns super-well, it’s still extremely useful for predicting stock prices. This is in itself an incredibly profitable ability, and can also be leveraged to “output text that maximizes stock price” without too much difficulty (a rough sketch in code follows these steps):
1. Have the system output some text periodically.
2. Interleave the company stock prices between text blocks.
3. Generate a large number of samples for each new prediction, and keep the text blobs for which further completions predict high stock prices down the line. (This can be done automatically: no human review, just look at the predicted price.)
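Here’s a rough sketch of that selection loop under the stated assumptions. `Predictor` and both of its methods are placeholders invented for illustration; a real system would condition on the full interleaved history rather than return random stand-ins:

```python
import random
from dataclasses import dataclass, field

@dataclass
class Predictor:
    """Stand-in for the assumed superintelligent sequence predictor."""
    history: list = field(default_factory=list)  # interleaved text blocks and prices

    def sample_text(self) -> str:
        # Placeholder: a real predictor would sample a likely continuation
        # of the interleaved history.
        return f"text-blob-{random.random():.6f}"

    def predict_price(self, candidate: str) -> float:
        # Placeholder: forecast the next stock price conditioned on
        # history + candidate text.
        return random.uniform(90.0, 110.0)

def select_blob(predictor: Predictor, num_samples: int = 64) -> str:
    """Sample many candidate text blocks and keep the one whose predicted
    downstream stock price is highest (no human review needed)."""
    candidates = [predictor.sample_text() for _ in range(num_samples)]
    return max(candidates, key=predictor.predict_price)

def run_period(predictor: Predictor, observed_price: float) -> str:
    """One cycle: publish the selected text, then interleave the price that
    was subsequently observed, so the predictor can pick up the influence
    of its own outputs."""
    blob = select_blob(predictor)
    predictor.history.extend([blob, observed_price])
    return blob
```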
Not saying this is a great technique in real life, just saying that if we assume “really great predictor” and go from there, this will eventually start working well, as the system notices the influence of its text blobs on the subsequent stock prices.
Misread your comment.

My answer is that that would happen by default, and then some clever human would figure out a way to prompt-engineer the system/slightly reconfigure it so that it did what it really knew how to do.
Would you like to publicly register a counterprediction?
Sure, P(doom)>=50%, and that’s subject to change. But of course I’ll only ever be proven wrong.