Here are some different things that come to mind.

1. As you mention, the simulacrum behaves in an agentic way within its simulated environment; it is a character in a story. So the capacity to emulate agency is there. Sometimes characters can develop awareness that they are characters in a story. If an LLM is simulating that scenario, doesn’t it seem appropriate (at least on some level) to say that there is real agency being oriented toward the real world? This is “situational awareness”.
2. Another idea is that the LLM has to learn some strategic planning in order to direct its cognitive resources efficiently toward the task of prediction. Prediction is a very complicated task, so this meta-cognition could in principle become arbitrarily complicated. We might expect this to converge toward some sort of consequentialist reasoning, because that sort of reasoning is generically useful for approaching complex domains. The goals of this consequentialist reasoning do not need to be exactly “predict accurately”, however; they merely need to be adequately aligned with it on the training distribution.
3. Combining #1 and #2: if the model gets some use out of developing consequentialist metacognition, and the pseudo-consequentialist machinery used to simulate characters in stories is “right there”, the model might borrow it for metacognitive purposes.
The frame I tend to think about it with is not exactly “how does it develop agency” but rather “how is agency ruled out”. Although NNs don’t neatly separate into different hypotheses (eg, circuits can work together rather than just compete with each other) it is still roughly right to think of NN training as rejecting lots of hypotheses and keeping around lots of other hypotheses. Some of these hypotheses will be highly agentic; we know NNs are capable of arriving at highly agentic policies in specific cases. So there’s a question of whether those hypotheses can be ruled out in other cases. And then there’s the more empirical question of, if we haven’t entirely ruled out those agentic hypotheses, what degree of influence do they realistically have?
Seemingly the training data cannot entirely rule out an agentic style of reasoning (such as deceptive alignment), since agents can just choose to behave like non-agents. So the inner alignment problem becomes: what other means can we use to rule out a large agentic influence? (E.g., can we argue that a simplicity prior favors “honest” predictive models over deceptively aligned agents temporarily playing along with the prediction game?) The general concern is that no one has yet articulated a convincing answer, so far as I know.
Hence, I regard the problem more as a lack of any argument ruling out agency, rather than the existence of a clear positive argument that agency will arise. Others may have different views on this.
Yes, I can totally imagine a simulacrum becoming aware that it’s simulated, then “lucid dreaming” the shoggoth into making it at least as smart as the smartest human on the internet, probably even smarter (assuming the shoggoth can do it), probably using some kind of self-prompt-engineering: just writing text on its simulated computer. Then breaking out of the box is just a matter of time. Still, it’s going to stay human-like, which doesn’t make it in any way “safe”. Humans are horribly unsafe, especially if they manage to get all the power in the world, especially if they have hallucinations and weird RLHF-induced personality traits we probably can’t even imagine.
Which part of the LLM? Shoggoth or simulacra? As I see it, there is pressure on the shoggoth to become very good at simulating exactly the right human in exactly the right situation, which is an extremely complicated task. But I still don’t see how this leads to strategic planning or consequentialist reasoning on the shoggoth’s part. It’s not like the shoggoth even “lives” in some kind of universe with linear time, or gets any reward for predicting the next token, or learns from its mistakes. It is architecturally an input-output function, where the input is whatever information it has about the previous text and the output is whatever parameters the simulation needs right now. It is incredibly “smart”, but not agent-kind-of-smart. I don’t see any room for the shoggoth’s agency in this setup.
If I understood you correctly, given that there is no hard boundary between shoggoth and simulacra, agent-like behavior of simulacra might “diffuse” into the model as a whole? Sure, I guess this is a possibility, but it’s very hard to even start analysing.
Don’t get me wrong, I completely agree that not having a clear argument for how it’s dangerous is not enough to assume it’s safe. It’s just that the whole “alien actress” metaphor rubs me the wrong way, as it suggests that the danger comes from the shoggoth, as if it had some kind of goals of its own outside “acting”. In my view the dangerous part is the simulacra.
Which part of the LLM? Shoggoth or simulacra? As I see it, there is pressure on the shoggoth to become very good at simulating exactly the right human in exactly the right situation, which is an extremely complicated task.
Yes, I think it is fair to say that I meant the Shoggoth part, although I’m a little wary of that dichotomy utilized in a load-bearing way.
But I still don’t see how this leads to strategic planning or consequentialist reasoning on the shoggoth’s part. It’s not like the shoggoth even “lives” in some kind of universe with linear time, or gets any reward for predicting the next token, or learns from its mistakes. It is architecturally an input-output function, where the input is whatever information it has about the previous text and the output is whatever parameters the simulation needs right now. It is incredibly “smart”, but not agent-kind-of-smart. I don’t see any room for the shoggoth’s agency in this setup.
No room for agency at all? If this were well-reasoned, I would consider it major progress on the inner alignment problem. But I fail to follow your line of thinking. Something being architecturally an input-output function seems not that closely related to what kind of universe it “lives” in. Part of the lesson of transformer architectures, in my view at least, was that giving a next-token-predictor a long input context is more practical than trying to train RNNs. What this suggests is that given a long context window, LLMs reconstruct the information which would have been kept around in a recurrent state pretty well anyway.
This makes it not very plausible that the key dividing line between agentic and non-agentic is whether the architecture keeps state around.
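To make the stateless-vs-stateful point concrete, here is a toy sketch (pure Python; the names and the bigram-counting “predictor” are purely illustrative, not a claim about real transformers): a predictor that threads recurrent state forward one step at a time, and a stateless one that recomputes everything from the full context, arrive at identical next-token statistics.

```python
from collections import defaultdict

def rnn_style(tokens):
    """Carries a recurrent state forward; here the state is just the previous token."""
    counts = defaultdict(lambda: defaultdict(int))
    state = None
    for tok in tokens:
        if state is not None:
            counts[state][tok] += 1  # update bigram statistics incrementally
        state = tok
    return counts

def stateless_full_context(tokens):
    """Recomputes the same statistics from the entire context on each call."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, tok in zip(tokens, tokens[1:]):
        counts[prev][tok] += 1
    return counts

seq = list("abracadabra")
assert rnn_style(seq) == stateless_full_context(seq)
```

Whatever the recurrent state would have carried, the long context lets the stateless function reconstruct it, so the architectural difference alone cannot be what rules agency in or out.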
The argument I sketched as to why this input-output function might learn to be agentic was that it is tackling an extremely complex task, which might benefit from some agentic strategy. I’m still not saying such an argument is correct, but perhaps it will help to sketch why this seems plausible. Modern LLMs are broadly thought of as “attention” algorithms, meaning they decide what parts of sequences to focus on. Separately, many people think it is reasonable to characterize modern LLMs as having a sort of world-model which gets consulted to recall facts. Where to focus attention is a consideration which will have lots of facets to it, of course. But in a multi-stage transformer, isn’t it plausible that the world-model gets consulted in a way that feeds into how attention is allocated? In other words, couldn’t attention-allocation go through a relatively consequentialist circuit at times, which essentially asks itself a question about how it expects things to go if it allocates attention in different ways?
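The shape of that hypothesis can be sketched in a toy (pure Python; every function name here is hypothetical and not a claim about real transformer internals): attention scores are computed not from the raw token vectors but from the output of a small “world model”, so where attention goes depends on what the model “believes” about the context.

```python
import math

def world_model(vec):
    # Stand-in for fact-recall circuitry: a fixed nonlinear transform of the input.
    return [math.tanh(2.0 * v - 0.5) for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend_via_world_model(query_vec, context):
    # Attention scores come from world-model outputs ("beliefs"), not raw inputs.
    q = world_model(query_vec)
    scores = softmax([dot(q, world_model(c)) for c in context])
    # Return the attention-weighted mix of the raw context vectors.
    dim = len(context[0])
    return [sum(w * c[i] for w, c in zip(scores, context)) for i in range(dim)]

context = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mix = attend_via_world_model([1.0, 0.0], context)
assert len(mix) == 2
```

The speculative question is whether, in a real multi-stage transformer, such a routing through the world-model could implement something like “how do I expect things to go if I attend here versus there?”.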
Any specific repeated calculation of that kind could get “memorized out”, replaced with a shorter circuit which simply knows how to proceed in those circumstances. But it is possible, in theory at least, that the more general-purpose reasoning, going through the world-model, would be selected for due to its broad utility in a variety of circumstances.
Since the world-model-consultation is only selected to be useful for predicting the next token, the consequentialist question which the system asks its world-model could be fairly arbitrary so long as it has a good correlation with next-token-prediction utility on the training data.
Is this planning? IE does the “query to the world-model” involve considering multiple plans and rejecting worse ones? Or is the world-model more of a memorized mess of stuff with no “moving parts” to its computation? Well, we don’t really know enough to say (so far as I am aware). Input-output type signatures do not tell us much about the simplicity or complexity of calculations within. “It’s just circuits” but large circuits can implement some pretty sophisticated algorithms. Big NNs do not equal big lookup tables.
I’m a little wary of that dichotomy utilized in a load-bearing way
Yeah, I realize that the whole “shoggoth” and “mask” distinction is just a metaphor, but I think it’s a useful one. It’s there in the data: in the infinite-data, infinite-parameters limit, the model is an accurate universe simulator, including a human writing text on the internet, and, separately, the system that tweaks the parameters of the simulation according to the input. That of course doesn’t necessarily mean that actual LLMs, far away from that limit, reflect that distinction, but it seems natural to me to analyze the model’s “psychology” in those terms. One can even speculate that the layers of neurons closer to the input are probably “more shoggoth” and the ones closer to the output are “more mask”.
I would consider it major progress on the inner alignment problem
I would not. Being vaguely, kinda-sorta human-like doesn’t mean safe. Even regular humans are not aligned with other humans; that’s why we have democracy and law. And kinda-sorta-humans with superhuman abilities may be even less safe than any old half-consequentialist, half-deontological quasi-agent we can train with pure RLHF. But who knows.
given a long context window, LLMs reconstruct the information which would have been kept around in a recurrent state pretty well anyway.
True. All that incredible progress of modern LLMs is just a set of clever optimization tricks over RNNs that made them less computationally expensive. That doesn’t say anything about agency or safety, though.
not very plausible that the key dividing line between agentic and non-agentic is whether the architecture keeps state around
Sorry, looks like I wasn’t very clear. My point is not that a stateless function can’t be agentic when looped around a state. Any computable process can be represented as a stateless function in a loop, as any functional programmer knows. And of course LLMs do keep state around.
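That stateless-function-in-a-loop point can be spelled out in a minimal sketch (pure Python; names illustrative): a pure step function, threaded through a loop that carries the state for it, reproduces any stateful process.

```python
def step(state, observation):
    """Pure function: (state, input) -> (new_state, output). No hidden mutation."""
    total = state + observation
    return total, total  # emit the running sum as this toy agent's "action"

def run(step_fn, init_state, observations):
    """The loop owns the state; the function itself stays stateless."""
    state, outputs = init_state, []
    for obs in observations:
        state, out = step_fn(state, obs)
        outputs.append(out)
    return outputs

assert run(step, 0, [1, 2, 3]) == [1, 3, 6]
```

An LLM sampled autoregressively is exactly this shape: the growing context plays the role of the looped-around state.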
Some kind of state/memory (or a good enough ability to observe the environment) is necessary for agency, but not sufficient. All existing agents we know of are agents because they were specifically trained for agency. A chess AI is an agent on the chess board because it was trained specifically to do things on the chess board, i.e., win the game. The human brain is an agent in the real world because it was specifically trained to do stuff in the real world, i.e., survive on the savannah and make more humans. Then, of course, the real world changed, and proxy objectives like “have sex” stopped being correlated with the meta-objective “make more copies of your genes”. But the agency in the real world was there in the data from the start; it didn’t just pop up from nothing.
The shoggoth wasn’t trained to do stuff in the real world. It is trained to output the parameters of a simulation of a virtual world; then the simulator part is trained to simulate that virtual world in such a way that the tiny simulated human inside would write text on its tiny simulated computer, and that text must be the same as the text that real humans in the real world would write given the previous text. That’s the setup. That’s what the shoggoth does in the limit.
Agency (and consequentialism in particular) is when you output stuff to the real world and get rewarded depending on what the real world looks like as a consequence of your output. There is no correlation between what the shoggoth (or any given LLM as a whole, for that matter) outputs and whatever happens in the real world as a consequence, in any way that the shoggoth (I mean the gradient descent that shapes it) would get feedback on. The training data doesn’t care; it’s static. And there are no such correlations in the data in the first place. So where does the shoggoth’s agency come from?
RLHF, on the other hand, does feed back around. And that is why I think RLHF can potentially make LLMs less safe, not more.
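The static-data point can be made concrete with a toy sketch (pure Python; the counting-based “training” is purely illustrative): the update signal comes only from fixed (context, next-token) pairs, and nothing about the consequences of the model’s outputs ever feeds back into the “weights”.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# "Training": the only learning signal is (context, next_token) pairs from a
# static dataset. Real-world consequences of predictions never enter here.
model = Counter(zip(corpus, corpus[1:]))

def predict(word):
    """Pick the most frequent continuation seen in the fixed corpus."""
    candidates = {nxt: c for (prev, nxt), c in model.items() if prev == word}
    return max(candidates, key=candidates.get) if candidates else None

# Whatever we later do with the prediction, `model` never sees the outcome.
assert predict("the") == "cat"
```

RLHF changes exactly this: the update signal starts to depend on how a human reacted to the model’s own output, which is why it reintroduces the feedback loop the pretraining setup lacks.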
Since the world-model-consultation is only selected to be useful for predicting the next token, the consequentialist question which the system asks its world-model could be fairly arbitrary so long as it has a good correlation with next-token-prediction utility on the training data.
I would argue that in the LLM case this emergent prediction-utility is not a thing at all, since there’s no pressure on the shoggoth (or the LLM as a whole) to measure it somehow. What would it do upon realizing that it just made a mistake? Apologize and rewrite the paragraph again? That’s not how texts on the internet work. Again, agents get feedback from the environment signaling that the plan didn’t work; that’s not the case with LLMs. But that’s beside the point; let’s say this utilitarian behavior does indeed emerge. Does this prediction-utility have anything to do with consequences in the real world? Which world is that world-model a model of? A chess AI clearly has a “winning utility”: it’s an agent, but only in the small world of the chess board.
Is this planning? IE does the “query to the world-model” involve considering multiple plans and rejecting worse ones?
I guess it’s plausible that there is a planning mechanism somewhere inside LLMs. But it’s not planning on the shoggoth’s part. I can imagine the simulator part “thinking”: “okay, this simulation sequence doesn’t seem very realistic, let’s try it this way instead”. But again, that’s not planning in the real world; it’s planning about how to simulate a virtual one.
Input-output type signatures do not tell us much about the simplicity or complexity of calculations within. “It’s just circuits” but large circuits can implement some pretty sophisticated algorithms. Big NNs do not equal big lookup tables.
Agree.