Jozdien comments on Conditioning Generative Models for Alignment

Jozdien 3 Oct 2022 16:36 UTC
Sorry for the (very) late reply!
I think (to the extent there is a problem) the problem is alleviated by training on “predict tomorrow’s headline given today’s” and related tasks (e.g. “predict the next frame of video from the last”). That forces the model to engage more directly with the relationship between events separated in time by known amounts.
Hmm, I was thinking more of a problem with the text available in training datasets not being representative of the real world we live in (either because it doesn’t contain enough information to pick out our world from a universal prior, or because it actually describes a different world better), rather than of whether the model’s capabilities or abstract reasoning help with time-separated prediction.
Predicting that the agent notices an inconsistency requires the generative model to know that there’s an inconsistency, at which point the better solution (from a ‘drawing likely trajectories’ perspective) is to just make the world consistent.
I think I’m picturing reasons for a simulacra agent to conclude that they’re in a simulation other than noticing inconsistencies. Some specifics: the world is anthropically unlikely enough (because of a conditional we apply, for example) to push up their credence in a simulation hypothesis; or they notice the effects of gradient descent (behavioural characteristics of the world that deviate from “normal” behaviour tend to affect the world state); or there are other channels available through some quirk of the simulation / training process. I’m not holding to any particular one very strongly, though. All of which is to say that I agree it’d be weird for them to notice inconsistencies like that.
For instance there can be agents that act as if they’re in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world).
Yep, I think this could be a problem, although recent thinking has updated me slightly away from non-observed parts of the simulation having consistent agentic behaviour across time.