To my mind, what this post did was clarify a kind of subtle, implicit blind spot in a lot of AI risk thinking. I think this was inextricably linked to the writing itself leaning into a form of beauty that doesn’t tend to crop up much around these parts. And though the piece traces a lot of it back to Yudkowsky, I think the absence of green extends much wider than him, and in many ways he’s not the worst offender.
It’s hard to accurately compress the insights: the piece itself draws a lot on soft metaphor and on explaining what green is not. But personally it made me realise that the posture I and others tend to adopt when thinking about superintelligence and the arc of civilisation has a tendency to shut out some pretty deep intuitions that are particularly hard to translate into forceful argument. Even if I can’t easily say what those intuitions are, I can now at least point to the gap in conversation by saying there’s some kind of green thing missing.
I went down a rabbit hole on inference-from-goal-models a few years ago (albeit not coalitional ones); some slightly scattered thoughts below, which I’m happy to elaborate on if useful.
A great toy model here is the decision transformer: basically, you can make a decent “agent” by taking a predictive model over a world that contains agents (like Atari rollouts), conditioning on some ‘goal’ outcome (like the player eventually winning), and sampling the actions you’d predict to see from such an agent (there’s a minimal sketch of this just after the list below). Some things which pop out of this:
- There’s no utility function or even reward function
- You can’t even necessarily query the probability that the goal will be reached
- There’s no updating or learning: the beliefs are totally fixed
- It still does a decent job! And it’s very computationally cheap
- And you can do interp on it!
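For concreteness, here’s a minimal sketch of the conditioning trick. Everything in it is an invented stand-in (a tiny tabular “sequence model” over a toy chain world rather than an actual transformer over Atari rollouts), but the mechanism is the same: fit a predictive model to rollouts, then act by sampling actions conditioned on the eventual outcome being a win.

```python
import numpy as np
from collections import defaultdict

# Toy stand-in for the decision-transformer trick: learn p(action | state, eventual outcome)
# from random rollouts of a 5-cell chain world, then act by conditioning on outcome = "win".
rng = np.random.default_rng(0)
GOAL = 4

def rollout(policy, start=0, max_steps=10):
    """Run one episode; 'win' means reaching the goal cell within the step budget."""
    s, traj = start, []
    for _ in range(max_steps):
        a = policy(s)                          # action is -1 (left) or +1 (right)
        traj.append((s, a))
        s = int(np.clip(s + a, 0, GOAL))
        if s == GOAL:
            return traj, True
    return traj, False

# 1. Collect data from a purely random behaviour policy: this is the "predictive model
#    over a world that contains agents", here reduced to conditional counts.
counts = defaultdict(lambda: defaultdict(int))     # counts[(state, won)][action]
for _ in range(5000):
    traj, won = rollout(lambda s: rng.choice([-1, +1]))
    for s, a in traj:
        counts[(s, won)][a] += 1

# 2. The "agent" samples from p(action | state, outcome = win). No reward function,
#    no value function, no planning: just conditioned prediction.
def conditioned_policy(s, outcome=True):
    c = counts[(s, outcome)]
    n = np.array([c[-1] + 1, c[+1] + 1])           # +1 smoothing to avoid empty cells
    return rng.choice([-1, +1], p=n / n.sum())

wins = sum(rollout(conditioned_policy)[1] for _ in range(1000))
print(f"win rate when conditioning on winning: {wins / 1000:.2f}")
```

The smoothing constant and the random behaviour policy are just there to keep the sketch self-contained; the interesting part is that the “agent” is nothing but a conditioned predictive model.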
It turns out to have a few pathologies (which you can precisely formalise):
- It has no notion of causality, so it’s easily confounded if it wasn’t trained on a Markov blanket around the agent it’s standing in for
- It doesn’t even reliably pick the action which most likely leads to the outcome you’ve conditioned on (there’s a worked number example just after this list)
- Its actions are heavily shaped by implicit predictions about how its own future actions will be chosen (an extremely crude form of identity), which can be very suboptimal
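To make the second pathology concrete, here’s a tiny Bayes calculation with invented numbers: conditioning on the goal weights each action by how often the behaviour policy happened to take it, not only by how well it works.

```python
# Invented numbers for illustration: action A usually works but was rarely taken
# in the training data; action B works half the time but was the data-collecting
# policy's default.
p_goal_given_action = {"A": 0.9, "B": 0.5}    # p(goal | action)
p_action            = {"A": 0.05, "B": 0.95}  # behaviour policy's p(action)

# Conditioning the predictive model gives p(action | goal),
# which is proportional to p(goal | action) * p(action).
joint = {a: p_goal_given_action[a] * p_action[a] for a in ("A", "B")}
total = sum(joint.values())
p_action_given_goal = {a: v / total for a, v in joint.items()}

print(p_action_given_goal)
# {'A': 0.0865..., 'B': 0.9134...}: the conditioned "agent" picks B over 90% of
# the time, even though A is far more likely to actually reach the goal.
```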
But it turns out that these are very common pathologies! And the formalism is roughly equivalent to lots of other things:
- You can basically recast the whole reinforcement learning problem as being this kind of inference problem
  - (specifically, as minimising a variational free energy! A rough statement of the correspondence is just after this list.)
- It turns out that RL largely works in cases where “assume my future self plays optimally” is equivalent to “assume my future self plays randomly” (!)
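For reference, here is the rough shape of that equivalence, written in the standard control-as-inference notation rather than anything from the post above (per-step binary ‘optimality’ variables, a uniform action prior, $\tau = (s_1, a_1, \dots, s_T, a_T)$ a trajectory and $r$ the reward; see e.g. Levine’s 2018 tutorial on RL as probabilistic inference):

$$
p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)\big),
\qquad
q_\pi(\tau) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t),
$$

$$
\mathcal{F}(\pi)
= \mathrm{KL}\!\big(q_\pi(\tau) \,\big\|\, p(\tau \mid \mathcal{O}_{1:T} = 1)\big)
= -\,\mathbb{E}_{q_\pi}\!\Big[\sum_t r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big] + \text{const}.
$$

Minimising this free energy is exactly entropy-regularised (maximum-entropy) RL, while the goal-conditioned sampling above is, roughly, drawing from the exact posterior $p(\tau \mid \mathcal{O}_{1:T}=1)$ with the behaviour policy playing the role of the action prior.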
And these pathologies don’t seem specific to the toy model:
- “what do I expect someone would do here” seems to be a common heuristic for humans, and it notably diverges from “what would most likely lead to a good outcome”
- humans are also easily confounded and bad at understanding the causality of our actions
- language models are likewise easily confounded and bad at understanding the causality of their outputs

Finally, fully fixing the future-self-model thing here is equivalent to tree searching the trajectory space, which can sometimes be expensive; a rough sketch of what that search looks like is below.
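This is a minimal sketch of that tree search, on an invented toy chain environment (the environment, horizon and function names are all made up for illustration). Instead of sampling actions conditioned on the goal, each action is chosen to maximise the model-predicted probability of eventually reaching the goal, with future actions assumed to be chosen the same way.

```python
from functools import lru_cache

# Contrast with the conditioned sampler earlier: there, future actions were
# implicitly assumed to come from the behaviour prior. Here we do expectimax
# over the model's predictions instead.
GOAL, N_CELLS, SLIP = 4, 5, 0.2

def transition(s, a):
    """Model's p(s' | s, a): intended move with prob 1-SLIP, opposite with prob SLIP."""
    step = lambda d: max(0, min(N_CELLS - 1, s + d))
    return {step(a): 1 - SLIP, step(-a): SLIP} if step(a) != step(-a) else {step(a): 1.0}

@lru_cache(maxsize=None)
def success_prob(s, steps_left):
    """Max probability of reaching GOAL within steps_left, maximising again at every future step."""
    if s == GOAL:
        return 1.0
    if steps_left == 0:
        return 0.0
    return max(
        sum(p * success_prob(s2, steps_left - 1) for s2, p in transition(s, a).items())
        for a in (-1, +1)
    )

def plan(s, steps_left=10):
    """Pick the action whose subtree has the highest success probability."""
    return max(
        (-1, +1),
        key=lambda a: sum(p * success_prob(s2, steps_left - 1) for s2, p in transition(s, a).items()),
    )

print([plan(s) for s in range(N_CELLS)])  # always +1 here: head straight for the goal
```

The recursion over `success_prob` is what makes this expensive in general: the search fans out over the whole trajectory space rather than reusing a single conditioned forward pass.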