My take is centred more on current language models, which are also non-optimizers, so I’m afraid this won’t be super relevant if you’re already familiar with the rest of this and were asking specifically in the context of image systems.
Language models are simulators of worlds sampled from the prior representing our world (insofar as the totality of human text is a good representation of our world), and don’t have many of the properties we would associate with “optimizeriness”. They do, however, have the potential to form simulacra that are themselves optimizers, such as GPT modelling humans (with pretty low fidelity right now) when making predictions. One danger from this kind of system that isn’t itself an optimizer is the possibility of instantiating deceptive simulacra that are powerful enough to act in ways that are dangerous to us (I’m biased here, but I think this section from one of my earlier posts does a not-terrible job of explaining this).
There’s also the possibility of these systems becoming optimizers, as you mentioned. This could happen during training (where the model at some point becomes agentic and starts to deceptively act the way a non-optimizer simulator would; I describe this scenario in another section of the same post), or later, as people try to use RL on it for downstream tasks. Mechanistically, what happens at the end could be one of a number of things: the model itself completely becoming an optimizer; an agentic head forming on top of the generative model, less powerful than in the previous scenario (at least to begin with); a really powerful simulacrum that “takes over” the computational power of the simulation; and so on.
I’m pretty uncertain about the numbers I’d assign to either outcome, but the latter seems pretty likely (although I think the former might still be a problem), especially with the application of powerful RL to tasks that benefit a lot from consequentialist reasoning. This post by the DeepMind alignment team goes into more detail on this outcome, and I agree with its conclusion that this is probably the most likely path to AGI (modulo some minor details I don’t fully agree with that aren’t super relevant here).
Thanks!
When you say “They do, however, have the potential to form simulacra that are themselves optimizers, such as GPT modelling humans (with pretty low fidelity right now) when making predictions”
do you mean things like “write like Ernest Hemingway”?
Yep. I think it also happens at a much smaller scale in the background: if you prompt GPT with something like a report of an earthquake, it might write about what reporters have to say about it, simulating various aspects of the world (which may include agents) without our conscious direction.
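For concreteness, here’s a minimal sketch of the kind of prompt I have in mind, written against the openai Python SDK; the model name, prompt text, and sampling parameters are all just illustrative, not anything canonical:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A bare factual prompt: no agent is mentioned or requested.
prompt = "BREAKING: A magnitude 7.1 earthquake struck off the coast this morning."

completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # illustrative completion-style model
    prompt=prompt,
    max_tokens=150,
    temperature=0.7,
)

# The continuation will typically introduce reporters, officials, and
# witnesses: agent simulacra the model instantiates without being asked to.
print(completion.choices[0].text)
```

The point is just that the agents show up as a side effect of modelling the world the text describes, not because anything in the prompt asked for them.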