If a superintelligent LLM is not a mesa-optimizer itself, it can be turned into an optimizer via a one-line bash script asking it to produce the shell commands that maximize some goal.
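The wrapper in question really is trivial. Here is a minimal sketch of that kind of agent loop, with the model call stubbed out (`query_llm` is a hypothetical stand-in for a real model API, and the stub just returns a harmless command):

```python
import subprocess

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real system would call a model API here
    # and return whatever shell command the model produces.
    return "echo hello"

def agent_loop(goal: str, steps: int = 3) -> list[str]:
    """Repeatedly ask the model for a shell command advancing `goal`,
    run it, and feed the output back into the context."""
    context = f"Goal: {goal}\n"
    outputs = []
    for _ in range(steps):
        cmd = query_llm(context + "Next shell command:")
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        outputs.append(result.stdout.strip())
        context += f"$ {cmd}\n{result.stdout}"
    return outputs
```

The point is only that nothing in this loop requires the model itself to be an optimizer; the optimization pressure comes from the outer loop plus whatever capabilities the model already has.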
Why would an LLM trained on internet text ever do something like this? The most likely continuation of a prompt asking it to produce shell commands to take over the world is very unlikely to actually contain such commands, because that’s not the sort of thing that exists in the training data. The LLM might contain latent superintelligent capabilities, but it’s still being aimed at predicting the continuations that were likely in its training set.
People fine-tune the superintelligent LLM to do something other than pure prediction, as with ChatGPT. Because it’s “superintelligent”, it has the capabilities buried in there (which is to say, more specifically, it can generate superhumanly intelligent outputs if conditioned on superhumanly intelligent inputs—I’m not trying to argue this is what will happen, it’s just my interpretation of the assumption of “superintelligent LLM”). So perhaps fine-tuning on a dataset of true answers to hard questions brings this out. Or perhaps RLHF, or something else.
I agree that this isn’t a “one-line bash script”. My interpretation of lc is that “LLM” doesn’t necessarily mean pure prediction (since existing LLMs aren’t only trained on pure prediction, either); and in particular “superintelligent LLM” suggests that someone found a way to get superhumanly useful outputs from an LLM (which people surely try to do).
I’m not saying it would do something like this. I’m saying that as soon as you release it someone out there will say “OK LLM, maximize stock price of my company”.
Certainly, someone will ask it to produce the text that maximizes the stock price of their company. Then the superLLM will pass that prompt through its model and output the most likely continuation of that request, which is not at all text that actually maximizes the stock price. Because out of all instances of text containing “Please maximize my stock price” on the internet, there are no examples of superintelligent responses to that request. It’s more likely to treat the request as part of a story prompt, or output something like “I don’t know how to do that”, even if it did internally know how to do that.
I want to note that if we assume it’s merely a superintelligent predictor, trained on all available data in the world, but only able to complete patterns super-well, it’s still extremely useful for predicting stock prices. This is in itself an incredibly profitable ability, and can also be leveraged to “output text that maximizes stock price” without too much difficulty:
Have the system output some text periodically.
Interleave the company stock prices between text blocks.
Generate a large number of samples for each new prediction, and keep the text blobs for which further completions predict high stock prices down the line. (This can be done automatically—no human review, just look at the predicted price.)
Not saying this is a great technique in real life, just saying that if we assume “really great predictor” and go from there, this will eventually start working well, as the system notices the influence of its text blobs on the subsequent stock prices.
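The selection step above can be sketched concretely. In this hypothetical sketch, `sample_text` and `predict_price` are stand-ins for calls to the assumed superintelligent predictor: sample many candidate text blobs, score each by the stock price the predictor forecasts downstream of it, and keep the best one—no human review anywhere in the loop.

```python
import random

def sample_text(history: str, rng: random.Random) -> str:
    # Stand-in sample: a real predictor would generate a candidate
    # text blob conditioned on the interleaved history.
    return f"announcement-{rng.randint(0, 9)}"

def predict_price(history: str) -> float:
    # Stand-in score: a real predictor would forecast the stock price
    # that follows this text in the interleaved sequence.
    return float(sum(map(ord, history)) % 1000)

def best_blob(history: str, n_samples: int = 16, seed: int = 0) -> str:
    """Sample many candidate blobs and keep the one whose continuation
    predicts the highest downstream stock price."""
    rng = random.Random(seed)
    candidates = [sample_text(history, rng) for _ in range(n_samples)]
    return max(candidates, key=lambda b: predict_price(history + b))
```

Under the “really great predictor” assumption, the interesting part is that the loop needs no model of *why* a blob raises the price: the predictor’s forecast is doing all the work.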
My answer is that that would happen by default, and then some clever human would figure out a way to prompt engineer the system/slightly reconfigure it so that it did what it really knew how to do.