I think this is too big-brain. Reasoning about systems more complex than you should look more like logical inductors, or infrabayesian hypotheses, or heuristic arguments, or other words that code for “you find some regularities and trust them a little, rather than trying to deduce an answer that’s too hard to compute.”
Which part specifically are you referring to as being overly complicated? What I take to be the primary assertions of the post are:
1. Simulacra may themselves conduct simulation, and advanced simulators could produce vast webs of simulacra organized as a hierarchy.
2. Simulating an agent is not fundamentally different to creating one in the real world.
3. Due to instrumental convergence, agentic simulacra might be expected to engage in resource acquisition. This could take the shape of ‘complexity theft’ as described in the post.
4. The Löbian Obstacle accurately describes why an agent cannot obtain a formal guarantee via design-inspection of its subsequent agent.
5. For a simulator to be safe, all simulacra need to be aligned, unless we can establish some complexity threshold below which programs are too simple to be dangerous, at which point we would need to consider only simulacra above that threshold.
I’ll try to justify my approach with respect to one or more of these claims, and if I can’t, I suppose that would give me strong reason to believe the method is overly complicated.
This doesn’t have to be resource acquisition, just any negative action that we could reasonably expect a rational agent to pursue.
I am disagreeing with the underlying assumption that it’s worthwhile to create simulacra of the sort that satisfy point 2. I expect an AI reasoning about its successor to not simulate it with perfect fidelity—instead, it’s much more practical to make approximations that make the reasoning process different from instantiating the successor.
I expect agentic simulacra to arise without anyone intentionally simulating them: agents are generally useful for solving prediction problems, and over millions of predictions (as would be expected of a product on the order of ChatGPT, or its future successors), it’s probable that agentic simulacra will occur. Even if these agents are only approximations, predicting the behavior of approximated agents could still lead to their preferences being satisfied in the real world (as described in the Hubinger post).
The problem I’m interested in is how you ensure that all subsequent agentic simulacra (whether they arise intentionally or otherwise) are safe, which seems difficult to verify formally due to the Löbian Obstacle.
As someone who’s barely scratched the surface of any of this, I was vaguely under the impression that “big-brain” described most or all of the theoretical/conceptual alignment work in this cluster of things, including e.g. both the Löbian Obstacle and infrabayesianism. Once I learn all of these in more depth and think on them, I may find and appreciate subtler-but-still-important gradations of “galaxy-brained-ness” within this idea cluster.