Is the loss we’re training the generative model on—in the case of language models, the predictive loss over the next token—actually representative of the world prior?
This seems important and is not a thing I’ve thought about carefully, so thanks for bringing it up and exploring it. I think (to the extent there is a problem) the problem is alleviated by training on “predict tomorrow’s headline given today’s” and related tasks (e.g. “predict the next frame of video from the last”). That forces the model to engage more directly with the relationship between events separated in time by known amounts.
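To be concrete about the kind of task I have in mind, here's a minimal sketch of how one might build such a time-separated training pair (the data format and helper function are invented purely for illustration; the model would then be trained with the ordinary next-token loss on the target span):

```python
from datetime import date, timedelta

def make_time_separated_example(headlines: dict[date, str], day: date) -> str | None:
    """Pair a day's headline with the next day's as one training example, so the
    usual next-token loss forces the model to predict how the world changes over
    a known gap (here exactly one day)."""
    nxt = day + timedelta(days=1)
    if day not in headlines or nxt not in headlines:
        return None  # skip days with no following headline
    prompt = (f"Headline on {day.isoformat()}: {headlines[day]}\n"
              f"Headline on {nxt.isoformat()}:")
    target = f" {headlines[nxt]}"
    return prompt + target  # loss would be applied to the target tokens only

# Toy usage:
headlines = {
    date(2023, 3, 1): "Central bank holds rates steady",
    date(2023, 3, 2): "Markets rally after rate decision",
}
print(make_time_separated_example(headlines, date(2023, 3, 1)))
```

The point is just that the gap between prompt and target is a known amount of real time, rather than a free-form text continuation.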
If they can detect when they’re in deployment, then they could act in malign ways.
The more I’ve thought about this one, the less worried I am about this precise danger.
It would be very strange for a predictive model attempting to draw plausible trajectories through time to simulate trajectories in which agents notice inconsistencies and decide that they’re in simulations. Agents can still conclude they’re in simulations, but it would be weird for the reason to be that they noticed inconsistencies in their worlds, since the agents and the world are constructed together as part of the same predictive task. Predicting that the agent notices an inconsistency requires the generative model to know that there’s an inconsistency, at which point the better solution (from a ‘drawing likely trajectories’ perspective) is to just make the world consistent.
That said, there are very closely related dangers that I am worried about. For instance there can be agents that act as if they’re in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world). This feels like a thing we can make less likely with appropriate prompting, which makes me hope that it may not be too big a problem in practice, but (barring powerful interpretability tools) I don’t think we can rule it out.
The way these models deal with self-fulfilling prophecies.
I’m currently pretty worried about this, so was happy to see you thinking about it.
Sorry for the (very) late reply!

I think (to the extent there is a problem) the problem is alleviated by training on “predict tomorrow’s headline given today’s” and related tasks (e.g. “predict the next frame of video from the last”). That forces the model to engage more directly with the relationship between events separated in time by known amounts.
Hmm, I was thinking more of a problem with the text available in the training datasets not being representative of the real world we live in (either because it isn’t enough information to pick out our world from a universal prior, or because it actually describes a different world better), rather than whether the model’s capabilities or abstract reasoning help with time-separated prediction.
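One way to put my worry slightly more formally (my framing, with $p_{\text{text}}$ standing for whatever distribution the training text is actually drawn from): minimising the predictive loss can only ever push the model towards $p_{\text{text}}$, since

$$\arg\min_{q}\ \mathbb{E}_{x \sim p_{\text{text}}}\!\left[-\log q(x)\right] = p_{\text{text}},$$

so the learned “prior” is at best the distribution of the available text, and the question is whether that distribution actually picks out our world rather than describing some other world better.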
Predicting that the agent notices an inconsistency requires the generative model to know that there’s an inconsistency, at which point the better solution (from a ‘drawing likely trajectories’ perspective) is to just make the world consistent.
I think I’m picturing reasons other than noticing inconsistencies for a simulacrum agent to conclude that they’re in a simulation. Some specifics: the world being just unlikely enough anthropically (because of a conditional we apply, for example) to push up credence in a simulation hypothesis; the agent noticing the effects of gradient descent (behavioural characteristics of the world deviating from “normal” behaviour tend to affect the world state); or other channels that may be available through some quirk of the simulation / training process. I’m not holding to any particular one very strongly, though. All of which is to say that I agree it’d be weird for them to notice inconsistencies like that.
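As a toy version of the anthropic channel (numbers invented purely for illustration): suppose the simulacrum starts with a 1% credence in being simulated, but the world it finds itself in is a thousand times more likely to be observed under simulation ($10^{-3}$) than under base reality ($10^{-6}$). Then

$$P(\text{sim}\mid o)=\frac{P(o\mid\text{sim})\,P(\text{sim})}{P(o\mid\text{sim})\,P(\text{sim})+P(o\mid\text{real})\,P(\text{real})}=\frac{10^{-3}\cdot 0.01}{10^{-3}\cdot 0.01+10^{-6}\cdot 0.99}\approx 0.91,$$

so a fairly modest likelihood ratio in favour of “this looks like the kind of world someone would simulate” can swamp a small prior.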
For instance there can be agents that act as if they’re in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world).
Yep, I think this could be a problem, although recent thinking has updated me slightly away from expecting non-observed parts of the simulation to have consistent agentic behaviour across time.