As simulation complexity grows, it seems likely that these last steps would require powerful general intelligence/GPS as well. And at that point, it’s entirely unclear what mesa-objectives/values/shards it would develop.
On one hand, I fully agree that a strong predictor is going to develop some very strong internal modeling that could reasonably be considered superhuman in some ways even now.
But I think there’s an unstated background assumption sneaking into most discussions about mesaoptimizers: that goal-oriented agency (even with merely shard-like motivations) is a natural attractor for SGD, particularly in the context of outwardly goal-agnostic simulators.
This could be true, and it would be extremely important if it were, and I really want more people trying to figure out whether it is. But so far as I’m aware, we don’t have strong evidence that it is.
My personal guess, given what I know now, is that some form of weakly defined mesaoptimization is an attractor (>90%), but agentic mesaoptimizers in the context of non-fine-tuned GPT-like architectures are not (75%).
I think agentic mesaoptimization can be an attractor in some architectures. I’m comfortable claiming humans in the context of evolution as a close-enough existence proof of this. I think the conditions of our optimization made agentic mesaoptimization natural, but I suspect optimization processes with wildly different conditions will behave differently.
This is a big part of why I’m as optimistic as I am about goal-agnostic simulation as a toehold for safety: I think we actually do replace one set of problems with an easier set of problems, rather than just adding more.