Ok but if we’re on a computer, then isn’t it clear we’re a simulation, not a vivarium, bc clearly it was designed to simulate the behavior of a pre-AGI civ?
Tom Davidson
In our case, if we are ultimately running on a computer, then wouldn’t that mean that we are a simulation? It seems obvious what it was designed to do; the intention would be to simulate a pre-AGI civilization.
Could we be running on a computer and also be a vivarium?
But is there any active selection for that amongst civilizations who survive and grow?
Personally I think the world as we know it is more likely to be in a vivarium than a simulation,
Why?
What about “stealing from civs we’re not causally connected with is good/bad”?
I’m not sure you’ve given arguments for bad over good.
We might hope that successful civilisations have norms that generalize to this case from the causal coordination they do.
This seems more likely if there are more scales-of-coordination still to be discovered (eg if causally coordinating with alien civs will be important), and if power is more broadly distributed (vs one company taking over the world and coordination outside that company not mattering)
(Btw great post!!)
But things can be inherently chaotic too!
Agree it’s unclear how much is inherent.
Thanks!
Re other ways to do the feat:
Bootstrapping from very smart human intelligence by doing loads of serial thinking, writing great non-sociopathic outputs, and doing SFT on them
Sophisticated scaffolds—basically GOFAI, but you can insert LLMs all over the place
Searching over variations of the above (plus RL and imitation) for methods that don’t produce sociopaths… Evolution and within-lifetime learning don’t produce sociopaths despite no such search happening!!
Re RL: thanks for those examples. I’d have defined it via something like # of data points or amount of training signal; agree compute is the wrong measure.
> other equally capable AIs judging that behaviors aren’t deceptive/unintended/misaligned
I’m pessimistic mainly for reasons discussed in “‘Behaviorist’ RL reward functions lead to scheming” §3
Hmm but the cognition “try to break the rules without getting caught” might in practice lead to worse reward than just trying to follow the rules, if the chances to secretly hack are sufficiently rare and the punishment for getting caught is high enough.
Eg I think that if I tried to “pretend I cared about my friends/partner but exploit them when I can get away with it” would MASSIVELY fail me in the long run. Ppl are good at reading ppl. We’re transparent to each other. There is big cognitive overhead to maintaining two narratives. I’d end up with fewer committed/deep long-term friendships.
Relatedly, if you’re in a cult, the best way to signal deep commitment is to become a true believer.
And it will, by comparison, be way easier for the overseers in the case of AI. Interp. Seeing all behaviour. Running counterfactual experiments.
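To put toy numbers on the expected-reward claim above (a sketch of mine; every parameter here is purely illustrative, not from the post):

```python
# Toy comparison: does "try to break the rules without getting caught" beat
# "just follow the rules"? All numbers are made up for illustration.
p_opportunity = 0.05    # how often an exploitable gap actually appears
p_caught = 0.5          # overseers catch the attempt (interp, audits, counterfactual tests)
hack_gain = 2.0         # extra reward from a successful undetected hack
penalty = 20.0          # reward penalty when caught
honest_reward = 1.0     # baseline reward per episode for just doing the task
overhead = 0.05         # cost of maintaining two narratives (slightly worse task performance)

scheming = (honest_reward - overhead
            + p_opportunity * (p_caught * (-penalty) + (1 - p_caught) * hack_gain))
honest = honest_reward

print(scheming, honest)  # with these numbers scheming loses (0.5 vs 1.0);
                         # the comparison flips if hacks are common, invisible, or unpunished
```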
I think there’s a good answer to that challenge, and it’s some combination of: (1) evolution has no way to build “a ruthless drive to have lots of grandkids” into our brains (details), and (2) even if it did, humans are not sufficiently smart and strategic in regards to long-term planning to be very effective ruthless consequentialists. (“We are the least intelligent species capable of building an industrial civilization.”)
Ok interesting, we have pretty different intuitions here!
On 1, the equilibrium that evolution chose is way less sociopathic than it could have chosen. Some ppl are sociopaths! So it can be done. (Maybe there’s a lot of work done by the “aim for grandkids” part, sorry I didn’t read the link. But ppl do have some desires for healthy kids and grandkids.)
On 2, I agree that if one agent’s cognition rises while the “are you a sociopath” checks stay constant, we’re in trouble. But we should imagine both rising with AI capabilities. (Note that humans evolved in a similar situation, with both sides rising, and the equilibrium was non-sociopathy.) Also, the world itself becomes more complex, making scheming plans harder to pull off.
I’m flat-out sceptical the sociopath strategy would dominate in equilibrium. Eg: male sociopaths would trick many women into partnering with them, then leave each one. Women evolve to check hard for true commitment. Men evolve credible signals of non-sociopathy. I just don’t believe such signals would be impossible to find.
Zooming out, I recall you thinking that humans aren’t sociopaths bc they have some special, specific reward thing that we can’t replicate, related to wanting others to approve. Whereas I see no reason to think it’s some specific thing. We just had some selection pressure to seem like good non-sociopathic allies. That selection pressure worked. If we apply similar selection to AI, it will probably also work—the implementation details won’t need to match some specific human learning circuit.
Moral public goods are a big deal for whether we get a good future
What’s the story for why companies end up doing things that are catastrophic for their shareholders? (Feel free to just state the key points briefly, I’m familiar with many of the ideas.)
Where was the argument for consequentialism (including intense optimisation) and imitation being the only two ways to do impressive feats?
Also, will you change your mind if the current paradigm is still non-psychopathic even when RL training dominates?
(Big picture, I think the main place I might get off the train is in expecting future AI development to use a mix of rewards, including some from other equally capable AIs judging that behaviors aren’t deceptive/unintended/misaligned. And this mirroring the role that “other humans think we’re nice” played in evolution)
Thanks, good point re the alignment constraints being complex here.
Re the Beren post: I agree with the post that the AI agents (/automated companies) involved in creating new businesses will be in a better position to pick up the best investments than entities that lack the access to and knowledge about new startup opportunities.
But it still seems like the AIs that do this could totally be investing on behalf of human principals? You only need a fairly minimal sort of alignment to ensure that the AI actually gives all the money it makes back to the human.
V interesting—thanks for sharing!
Thanks, I read the Phil Trammell critique and it helped me understand your position.
My summary is that points 1, 2, 3, 4, 6, 9 were basically saying “AI might be misaligned and agentic and that could be bad for humans” and “maybe the institutions of law / democracy / markets will break down”.
I get that if you think these things are pretty likely, the analysis is less interesting and you want the assumptions flagged. So overall I agree Phil should have a disclaimer like “I’m assuming we’ll get a great solution to alignment and that the current institutions of law + markets survive”, but I don’t think he needs to list out like 10 assumptions.
Thanks for writing this.
I’m sympathetic with the high-level claim that it’s very easy for someone to write down a rigorous but ultimately v misguided model in economics, and that a lot of the action is thinking about what the correct assumptions should be.
I was a bit disappointed not to see (imo) clear examples where someone proposes a model but actually the conclusions don’t follow because of some distribution shift that they didn’t anticipate.
Like, what are examples of models whose conclusions don’t follow once you account for the fact that there may be some AI systems that are consumers? Or once you account for the fact that AI systems may significantly influence the preferences of humans? E.g. to me it seems like the models predicting explosive growth in GDP or shrinking of the human labour share of the economy aren’t undercut by these things. I’m not sure what kinds of conclusions and reasoning you’re targeting here for critique.
Accelerating tech progress + slow military procurement → cheap decisive strategic advantage
A common prediction of an intelligence explosion is that tech progress gets faster and faster.
When I speak to ppl from DC, I’m told that the government and military will be very slow to adopt new tech.
If these two things are both true, there’s a scary implication.
If tech progress speeds up by 30x relative to recent history, then a 3-year procurement delay by the military means they’re deploying tech that’s effectively 100 years outdated. Even a 1-year delay at that pace means your military is fielding equipment from a completely different technological era. The US military spends ~$1T/year. But with tech that’s 30–100 years more advanced, you could potentially defeat them at 1/100th of the spending — just $10B!
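The arithmetic behind those numbers, as a quick sketch using the post’s illustrative figures:

```python
# Quick sketch of the arithmetic above, using the post's illustrative figures.
speedup = 30                      # tech progress runs 30x faster than recent history
procurement_delay_years = 3       # time from "tech exists" to "military fields it"
effective_lag = speedup * procurement_delay_years   # ~90 "old-pace" years, rounded to ~100 above

us_military_budget = 1e12         # ~$1T/year
cost_ratio = 1 / 100              # rough guess at how cheaply far-more-advanced tech wins
attacker_budget = us_military_budget * cost_ratio   # ~$10B

print(effective_lag, attacker_budget)
```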
A rogue actor — a private company or a government clique bypassing standard procurement — could spend a tiny fraction of the official military budget on cutting-edge tech and potentially overmatch the entire conventional military.
There’s a massive untapped potential for cheap military dominance that no legitimate actor will exploit bc only the military has legitimate authority to procure weapons and they are bureaucratic.
Here’s a visualisation of the basic dynamic:
Possible solutions:
Military procurement needs to get dramatically faster during an intelligence explosion. New AI systems and AI-produced technologies need to be rapidly integrated into official military capabilities. This probably means automating the procurement process itself. This is counterintuitive from some AI safety perspectives — many people’s instinct is to delay military AI deployment.
Delay rapid tech progress until military procurement has become super fast + safe. This might in practice involve delaying the intelligence explosion and/or the industrial explosion as well
Better monitoring/surveillance to make sure no one is secretly building military tech
Others?
Seems less likely to break if we can get every step in the chain to actually care about the stuff we care about, so that it’s really trying to help us think of good constraints for the next step up as much as possible, rather than just staying within its constraints and not lying to us.
Great post!
Could you point me to any other discussions about corrigible vs virtuous? (Or anything else you’ve written about it?)
But the Claude Soul document says:
In order to be both safe and beneficial, we believe Claude must have the following properties:
1. Being safe and supporting human oversight of AI
2. Behaving ethically and not acting in ways that are harmful or dishonest
3. Acting in accordance with Anthropic’s guidelines
4. Being genuinely helpful to operators and users
In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
And (1) seems to correspond to corrigibility.
So it looks like corrigibility takes precedence over Claude being a “good guy”.
Nice post!
I want to push on one thing, though. I’m sceptical of the claim that the ergodicity economics agent violates independence.
As I understand it, the EE agent has a fixed objective: maximize the time-average growth rate of wealth, which is equivalent to maximizing expected log terminal wealth. When the stochastic environment changes — say from multiplicative to additive dynamics — the optimal per-bet policy changes. In the multiplicative case, you Kelly-bet (which looks like log utility applied locally). In the additive case with many independent bets, you behave roughly linearly with each bet (because log is approximately linear for small additive increments relative to total wealth).
But is this actually a violation of independence? Independence says: if you prefer lottery A to lottery B, then mixing both with a common lottery C at the same probability shouldn’t reverse that preference. It’s a constraint on your ranking of probability distributions over outcomes.
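For reference, the standard vNM statement of the axiom I have in mind:

$$A \succ B \;\Longrightarrow\; pA + (1-p)\,C \succ pB + (1-p)\,C \quad \text{for all lotteries } C \text{ and all } p \in (0,1].$$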
What the EE agent is doing seems different. They have a fixed preference over (distributions over) outcomes (log terminal wealth, or equivalently, time-average growth rate). When the dynamics change, the mapping from available actions to outcome distributions changes, so the optimal action changes. But the preference ordering over final outcomes hasn’t changed — the agent still prefers higher log wealth to lower log wealth. It’s the decision problem that’s different, not the preferences.
To put it another way: an EU maximizer with log utility would make exactly the same choices as the EE agent in every case you describe. They’d Kelly-bet in multiplicative environments and behave more linearly in additive ones, because that’s what maximizing expected log wealth requires in each setting. But the EU maximizer with log utility satisfies independence by construction. So how can the EE agent be violating independence while making identical choices?
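To make the “identical choices” point concrete, here is a minimal sketch of my own (the bet parameters are made up): a plain expected-log-wealth maximizer, choosing its stake by grid search, Kelly-bets under multiplicative dynamics and takes any small positive-EV bet in full under additive dynamics, with no change in its underlying preference ordering.

```python
# A fixed log-utility EU maximizer facing the same even-odds bet under two dynamics.
import numpy as np

wealth = 100.0
p_win = 0.6                                # hypothetical win probability

# Multiplicative dynamics: stake a fraction f of wealth; win -> +f*W, lose -> -f*W.
def expected_log_multiplicative(f):
    return p_win * np.log(wealth * (1 + f)) + (1 - p_win) * np.log(wealth * (1 - f))

# Additive dynamics: stake a fixed amount a (small relative to wealth); win -> +a, lose -> -a.
def expected_log_additive(a):
    return p_win * np.log(wealth + a) + (1 - p_win) * np.log(wealth - a)

fractions = np.linspace(0.0, 0.99, 1000)           # candidate fractions of wealth
amounts = np.linspace(0.0, 5.0, 1000)              # candidate dollar stakes, capped well below wealth

best_f = fractions[np.argmax([expected_log_multiplicative(f) for f in fractions])]
best_a = amounts[np.argmax([expected_log_additive(a) for a in amounts])]

print(best_f)   # ~0.2, the Kelly fraction (p - q) for an even-odds bet
print(best_a)   # the cap (5.0): with a << W, log is ~linear, so take any +EV bet in full
```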
I think the thing that looks like a context-dependent utility function is really a context-dependent policy derived from a fixed utility function under different dynamics. These seem importantly different, and I’m not sure the independence axiom is violated by the latter.