One was someone saying that they thought it would be impossible to train the model to distinguish between whether it was doing this sort of hallucination vs the text in fact appearing in the prompt, because of an argument I didn’t properly understand that was something like ‘it’s simulating an agent that is browsing either way’. This seems incorrect to me. The transformer is doing pretty different things when it’s e.g. copying a quote from text that appears earlier in the context vs hallucinating a quote, and it would be surprising if there’s no way to identify which of these it’s doing.
I think this is referring to something I said, so I’m going to clarify my stance here.
First, on rereading this section, I’m pretty sure that at the time I misunderstood what you were pointing at. Instead of:
if you give it a prompt with some commands trying to download and view a page, and the output, it does things like say ‘That output is a webpage with a description of X’, when in fact the output is blank or some error or something.
I was picturing something like the other hallucinations you mention, specifically:
if you give it a prompt with some commands trying to download and view a page, and in reality those commands would return a blank output or some error or something, it does things like say ‘That output is a webpage with a description of X’.
(In retrospect this seems like a pretty uncharitable reading of something anyone with a lot of experience with language models would recognise as a problem. My guess is that at the time I was spending too much effort thinking about how what you were saying mapped onto my existing ontology and what I would have expected to happen, and not enough on actually making sure I understood what you were pointing at.)
Second, I’m not fully convinced that this is qualitatively different from other types of hallucinations, except in that it’s plausibly easier to fix, because RLHF can do weird things specifically to prompt interactions (then again, I’m not sure whether you’re actually claiming it’s qualitatively different either, in which case this is just a thought dump). If you prompted GPT with an article on coffee and ended it with a question about what the article says about Hogwarts, the conditional you want is one where someone wrote an article about coffee and someone else’s immediate follow-up is to ask what it says about Hogwarts.
But this is outweighed by the model’s prior, because it’s not something very likely to happen in our world. In other words, the intended conditional of “the prompt is exactly right and contains the entire content to be used for answering the question” isn’t likely enough relative to competing conditionals like “the prompt contains the title of the blog post, and the rest of the post was left out” (for the example in the post), or “the context changed suddenly and the question should be answered from the prior”, or “questions about the post can be answered using knowledge from outside the post as well”, or something else weird that the unlikeliness of the intended conditional leaves room for (for the Hogwarts example).
Put that way, this just sounds quantitatively different from other hallucinations, in that information in the prompt is a stronger way to influence the posterior you get from conditioning. That allows us a greater degree of control, but I don’t see the model as doing anything fundamentally different here compared to other cases.
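To make that concrete, here’s a toy Bayesian sketch of the competition between conditionals. All hypotheses and numbers are made up purely for illustration; this is a cartoon of the argument, not a claim about the model’s actual internals:

```python
# Toy illustration (all hypotheses and numbers made up): the model weighs
# hypotheses about what process generated the prompt, and its answer depends
# on the whole posterior over those hypotheses, not just the one we intended.

# Priors: how likely each generative story is a priori in the training
# distribution (the intended one is rare in real-world text).
priors = {
    "prompt is exactly right; answer only from the article": 0.02,
    "rest of the document was left out; fill it in from the prior": 0.38,
    "context switched; answer the question from general knowledge": 0.45,
    "some other weird generative process": 0.15,
}

# Likelihoods: how well each story explains the observed prompt.
likelihoods = {
    "prompt is exactly right; answer only from the article": 0.90,
    "rest of the document was left out; fill it in from the prior": 0.30,
    "context switched; answer the question from general knowledge": 0.25,
    "some other weird generative process": 0.10,
}

# Posterior is proportional to prior * likelihood (Bayes' rule), normalised.
unnorm = {h: priors[h] * likelihoods[h] for h in priors}
z = sum(unnorm.values())
posterior = {h: p / z for h, p in unnorm.items()}

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{p:.2f}  {h}")
# Even though the intended conditional explains the prompt well (0.90), its
# low prior keeps its posterior small (~0.07): the prompt shifts the posterior
# a lot, but doesn't pin it down to the conditional we wanted.
```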
Physics simulators
Relatedly, I’ve heard people reason about the behavior of current models as if they’re simulating physics and going from this to predictions of which tokens will come next, which I think is not a good characterization of current or near-future systems. Again, my guess is that very transformative things will happen before we have systems that are well-understood as doing this.
I’m not entirely sure this is what they believe, but I think the reason this framing gets thrown around a lot is that it’s a pretty evocative way to reason about the model’s behaviour. Specifically, I would be pretty surprised if anyone thought this was literally true in the sense of the model simulating very low-level features of reality, rather than just using it as a useful way to talk about GPT mechanics (like time evolution over some learned underlying dynamics) and to draw inspiration from the analogy.
Rolling out long simulations
I get the impression from the original simulators post that the author expects you can ‘roll out’ a simulation for a large number of timesteps and this will be reasonably accurate
For current and near-future models, I expect them to go off-distribution relatively quickly if you just do pure generation—errors and limitations will accumulate, and it’s going to look different from the text they were trained to predict. Future models especially will probably be able to recognize that you’re running them on language model outputs, and it seems likely this might lead to weird behavior—e.g. imitating previous generations of models whose outputs appear in the training data. Again, it’s not clear what the ‘correct’ generalization is if the model can tell it’s being used in generative mode.
I agree with this. But while, again, I’m not entirely sure what Janus would say, I think their interactions with GPT involve a fair degree of human input on long simulations, either in choosing where to prune or focus, or in explicit changes to the prompt. (There are some desirable properties we get from this relaxed degree of influence, like story “threads” created much earlier resolving themselves in very unexpected ways much later in GPT’s generation stream, as if that had always been the intention.)
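For concreteness, the interaction pattern I have in mind looks roughly like the sketch below. This is a minimal loom-style loop; `generate` is a hypothetical stand-in for whatever sampling API you’re using, not a real library call:

```python
# Minimal sketch of a human-in-the-loop rollout. `generate` is a dummy
# stand-in for a real LLM sampling API (hypothetical, for illustration only).

def generate(prompt: str, n: int = 3) -> list[str]:
    """Stand-in sampler: return n placeholder continuations of `prompt`."""
    return [f" [continuation {i} of a {len(prompt)}-char context]" for i in range(n)]

def rollout(prompt: str, steps: int) -> str:
    """Alternate model generation with human pruning/editing of the stream."""
    text = prompt
    for _ in range(steps):
        candidates = generate(text)
        for i, c in enumerate(candidates):
            print(f"[{i}] ...{c[:80]}")
        choice = input("pick a branch number, or type replacement text: ")
        if choice.isdigit() and int(choice) < len(candidates):
            text += candidates[int(choice)]  # prune: keep one branch
        else:
            text += choice                   # explicit human edit to the stream
    return text

# Pure generation, for contrast, is `text += candidates[0]` at every step with
# no pruning or editing, which is exactly where drift off-distribution bites.
```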
GPT-style transformers are purely myopic
I’m not sure this is that important, or that anyone else actually thinks this, but it was something I got wrong for a while. I was thinking of everything that happens at sequence position n as about myopically predicting the nth token.
In fact, although the *output* tokens are myopic, autoregressive transformers are incentivised to compute activations at early sequence positions that will make them better at predicting tokens at later positions. This may also have indirect impacts on the actual tokens output at the early positions, although my guess would be this isn’t a huge effect.
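To illustrate the incentive, here’s a minimal PyTorch sketch (toy sizes, random weights; it just shows gradient flow under a causal mask, not any particular model):

```python
import torch
import torch.nn.functional as F

# Under a causal mask, attention at position n reads activations computed at
# positions < n, so the training loss at position n sends gradients back into
# those earlier activations. (Toy sizes, random weights, single head.)
torch.manual_seed(0)
T, d = 5, 8
x = torch.randn(T, d, requires_grad=True)  # stand-in for early-layer activations

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / d ** 0.5
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
out = F.softmax(scores, dim=-1) @ v        # causal self-attention output

loss_at_last_position = out[-1].pow(2).sum()  # a loss from the final token only
loss_at_last_position.backward()

print(x.grad[:-1].abs().sum())  # nonzero: earlier activations receive gradient
# Training pressure therefore shapes what gets computed at early positions to
# help predictions later on, even though each position's *output* distribution
# is only for its own next token.
```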
Echoing porby’s comment, I don’t find the kind of narrow myopia this breaks to be very concerning.
Pure simulators
From the simulators post I get some impression like “There’s a large gulf between the overall model itself and the agents it simulates; we will get very capable LLMs that will be ‘pure simulators’”
Although I think this is true in a bunch of important ways, it seems plausible to me that it’s pretty straightforward to distill any agent that the model is simulating into the model, and that this might also happen by accident. This is especially true once models have a good understanding of LLMs. You can imagine that a model starts predicting text with the hypothesis ‘this text is the output of an LLM that’s trying to maximise predictive accuracy on its training data’. If we’re at the point where models have very accurate understandings of the world, then integrating this hypothesis will boost performance, since it lets the model make better guesses about what token comes next by reasoning about what sort of data would make it into an ML training set.
I agree with this being a problem, but I didn’t get the same impression from the simulators post (albeit I’d heard of the ideas earlier, so this may be on the post): my takeaway was just that there’s a large conceptual gulf between what we ascribe to the model and what we ascribe to its simulacra, not that there’s a gulf in model space between pure generative models and non-simulators. (I actually talk about this problem in an older post, which those ideas were a large influence on.)