DMs open.
Cleo Nardo
A good example of research on reward hacking that I don’t think would’ve necessarily gotten done “by default for capabilities reasons” is inoculation prompting and similar investigations
Why?
Why are flagrantly misaligned AIs safer than deceptively misaligned AIs?
We’re on a safer trajectory if we catch misaligned AIs taking malign actions and continue deploying them under informed oversight than if we deploy models who have never taken a malign action in training or evaluation. This favours interventions which increase the chance that models are misaligned, but decrease the chance of deceptive alignment conditional on misalignment.
I’ve sorted some mechanisms from most important to least. Most are from the two posts above:
The malign actions serve as legible evidence of misalignment, raising willingness-to-pay for safety inside labs and supporting stronger external policy responses. [1]
We can train untrusted monitors, coup probes, and activation classifiers on on-policy examples of malign actions — substantially more effective than off-policy training. [2]
We can study caught models as model organisms of misalignment, validating countermeasures and testing hypotheses about scheming. These might be more realistic than artificial model organisms, and more persuasive to third parties.
We can train the policy itself to stop attempting attacks, or to report itself when it does, after enough caught examples.
We can negotiate deals with the AIs, without them worrying about revealing their misalignment. [3]
We can sandbox the model and calibrate control protocols to the actual threat, rather than relying on an apparently-aligned model roleplaying as a misaligned one.
We can deploy the AIs to improve our alignment auditing, because the AIs are less worried about us using the auditing to uncover their misalignment.
We can patch the security vulnerabilities the model exploits, or convert them into tripwires that catch future attempts. If we tried to deploy a deceptively aligned model to do this, it might sandbag, keeping some vulnerabilities hidden.
We can extract useful work on tasks where outputs are verifiable (while refusing to trust it elsewhere), without worrying that the AI would accumulate enough evidence of alignment to lower willingness-to-pay, i.e. move us from Plan C to Plan D.
Deceptively aligned models might be waiting for other AIs to act misaligned as a coup signal, so the flagrantly misaligned models might trigger premature defection from the deceptive ones.
[1] For example, it might move Plan E to Plan D by convincing a handful of employees at the leading lab to start working on misalignment risk where previously no one was; Plan D to Plan C by convincing leadership to spend the lab’s lead on safety rather than racing; or Plan C to Plan B or A by giving USG and other governments the concrete demonstration they need to treat misalignment as a top national security priority and justify international coordination.
[2] See Kirch et al. (2025), “The Impact of Off-Policy Training Data on Probe Generalisation”, which finds that deception and sandbagging probes may fail to generalise from off-policy to on-policy data.
[3] See Alexa Pan, “A taxonomy of barriers to trading with early misaligned AIs”, which discusses “high cost to even considering a deal” as a factor that increases an AI’s reservation price, noting that an AI suspects negotiation could leak information about its misalignment. A flagrantly misaligned model bypasses this problem entirely: the information is already out.
There is no sufficiently coherent/nice/appropriate morality, either objective or subjective, even after idealisation.
See the first paragraph — I think moral subjectivism is a bit less attractive than Objective Morality, as an ideal, but not much worse. By “Deep Nihilism” I mean that there isn’t even a coherent subjective morality.
As promised, here’s a plausible sketch for how Truth might fall:
Modal realism is true, so you lose all objective contingent truths (e.g. the sky is blue). You only have the truths about:
the space of all possible worlds (e.g. there is some world where the sky is blue; and if the sky is possibly some colour and the sea is possibly some colour, then there is some world where the sky is the first colour and the sea is the second.)
subjective truths about which world you occupy (e.g. the sky in my world is blue). This is a bit more impoverished but probably fine.
However, you might lose the subjective contingent truths also, because the “I/me/my” might pick out multiple entities in different possible worlds, e.g. some entities in base reality, some in a simulated reality, etc. That kinda sucks, because then there is not even a subjective truth about whether you’re in a simulation. Note this is worse than the sceptical “you don’t know if you’re in a simulation” — instead the worry is that there’s no truth to know!
Possibly you can save this with something UDASSA-ish, where epistemology is caring. But then we’re back at square zero: your “caring” might not be coherent even under idealisation.
Okay so we’re left with what? Just the facts about different Solomonoff priors? That’s kinda impoverished. And we might even lose those facts due to logically impossible worlds, or non-standard arithmetic or set-theoretic pluralism.
What if Deep Nihilism is true?
Moral realism is probably false which kinda sucks. But we can probably replace it with some kind of souped-up moral subjectivism. That is, rather than discovering and optimising Goodness, we can instead discover and optimise something like “What would I ideally want?” for some appropriate idealisation procedure.
What if that doesn’t work? More generally, what if the notion of Goodness is deeply broken such that we can’t replace it with anything else? That is, there is nothing Goodness-shaped in the world, either objectively or subjectively. There is no adequate notion of choiceworthiness. Yuck.
What would we have left?
I think there are other ideals we could strive for, like Truth or Beauty. We currently strive for these ideals partly for moral reasons — and we would lose that additional oomph — but they are still decent ideals to fall back on by themselves.
I suspect Truth might fall as well (see reply), so we might be left with just Beauty, Fun, and maybe some other ideals which are more unmediated than Goodness or Truth. But I’d feel a bit shortchanged. I’m White/Blue in the MTG Colour Wheel, so Goodness and Truth are what I care most about.
I don’t think I would substantially regret my choices on Deep Nihilism. I think there are decent nihilistic justifications for working on AI safety (e.g. it’s fun, it’s cool, it makes me feel important, etc). And there are probably good nihilistic justifications at the organisational and societal levels as well, not just the individualist level.
Of course, there are some ways that I could’ve lived more nihilistically, e.g. been more “chill”, followed a broader range of intellectual pursuits. But these are pretty minor: I’ve made some sacrifices in my life for Truth or Goodness, but not big ones. My guess is this is due to “moral luck”: I enjoy interacting with people in the AI safety community, I don’t enjoy interacting with people in policy/advocacy, and “luckily” I’m not fit for policy/advocacy. Maybe this is motivated reasoning / weaponised incompetence, and actually I would be great at policy/advocacy but I don’t want to make the sacrifice. Not sure.
Overall, I don’t feel that stressed about Deep Nihilism. I think (1) Deep Nihilism is pretty unlikely (<10%), (2) Deep Nihilism can be safely bracketed, (3) Deep Nihilism wouldn’t make me regret my actions substantially.
You could say “reduces one component of the attack surface” or “closes off threat model X”. But “reduces risk to near-zero” is a tell that you aren’t using the right mindset.
A title is supposed to describe the contents. If I saw a post saying “The probability of X is P” I would expect to see an explanation of P in particular, as opposed to P^2 or P^0.5.
EDIT: The title is much better; I changed my vote. I appreciate the authors engaging promptly :)
Original reply:
I strongly downvoted because the title is clickbait. The only argument for the headline claim “$50 million a year for a 10% chance to ban ASI” is a footnote that says “The probabilities are produced mostly by gut feeling.”
I think the actual content is reasonable, and would change my vote if the authors change the title to something that reflects the actual content:
“How ControlAI would spend $50M, $500M, $1B a year”
“ControlAI’s plan for scaling policy advocacy against ASI”
“With more funding, ControlAI would scale lobbying, ads, grassroots, policy research”
I’d encourage the moderators to change the title unilaterally.
Later they write “we lay out some of the reasoning behind this estimate, and explain how additional funding past that threshold would continue to significantly improve our chances of success, with $500 million a year producing an estimated ~30% probability of success.”
I think the article doesn’t lay out any reasoning behind this estimate, e.g. why this is 30% rather than 3% or 0.3% or 0.003%.
I spoke with someone recently who admitted that Newcomb’s 1-boxers walk away from the problem with more money on average than 2-boxers, yet somehow still argued for 2-boxing.
This doesn’t seem like a knock-down argument against 2-boxing.
Combine these two emojis: 🤔 and 🤯.
More generally, claims like “X is consensus among group Y” are a little bit dangerous because they can force group Y into an equilibrium that they wouldn’t want to be in otherwise. Like, these claims reinforce situations where a bunch of people would’ve objected to X but didn’t object because they didn’t know anyone else would’ve objected.
I predict that the AIs will actually consume a trillion times the water in Earth’s oceans.
The OP is mostly tongue-in-cheek, but it gets at a real peeve of mine. Whenever AI safety guys write op-eds about how silly the environmentalists are, they should add to the beginning, middle, and end of the article something like:
“I think that within 20 years, AI alone will be consuming more energy than the entirety of human civilisation consumes today. Its effects on the environment will be more transformative than the agricultural or industrial revolutions. It may result in the extinction of all organic life.”
And then they can chuckle about how the environmentalist did statistics bad or whatever.
Fixed. I can’t find the unpaywalled Washington Post article, but here’s a post which disputes the claim directly: The Biggest Statistic About AI Water Use Is A Lie
Projecting AI Water Usage.
Environmentalists warn that large data centers can consume up to 5 million gallons per day — equivalent to the needs of a town of 10,000 to 50,000 people. The Washington Post claimed that a 100-word email uses roughly one bottle of water.
On the other side of the debate:
SE Gyges argues the statistic about the bottle of water is based on unrealistic assumptions.
Bentham’s Bulldog writes, “The environmentalist case against AI completely falls apart upon even cursory examination of the facts.”
Andy Masley, the staunchest critic of the concerns about AI water usage, says “On the national, local, and personal level, AI is barely using any water, and unless it grows 50 times faster than forecasts predict, this won’t change.”
I offer my own projection: AI will eventually consume 10^41 – 10^47 gallons of water — up to a hundred trillion trillion times the water in Earth’s oceans.
How much water is out there? Water is the third most abundant molecule in the universe, after H2 and CO. The Solar System alone contains ~10^26 kg of water in planets, moons, and comets — about 100,000 times Earth’s oceans (Kotwicki 1991). There are ~10^11 stars in the Milky Way, each plausibly endowed with a similar complement of icy bodies. And the galaxy’s molecular clouds contain vast reservoirs of water ice on dust grains. A reasonable estimate for total water in the Milky Way is a few x 10^37 kg — about 10^16 Earth-oceans.
How many galaxies will the AI consume? The cosmic event horizon — the boundary beyond which even light-speed travel cannot reach, due to the accelerating expansion of space — sits at roughly 16.5 billion light-years (Ord 2021). This encloses about 20 billion galaxies, roughly 5% of the observable universe. However, the future AI may bump into competitors before reaching the cosmic event horizon. Robin Hanson’s grabby aliens model estimates that each “grabby civilization” — one that expands at a significant fraction of light speed and visibly transforms its territory — would eventually control 10^5 to 3 x 10^7 galaxies before meeting others.
The water budget of future AI. Putting it together, with ~10^37 kg of water per Milky-Way-equivalent galaxy, and 10^5 to 2 x 10^10 galaxies, this gives 10^42 to 2 x 10^47 kg of water — between 10^21 and 10^26 Earth-oceans.
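For anyone who wants to check the arithmetic, here is a minimal back-of-the-envelope sketch in Python. The per-galaxy water mass and the galaxy-count range are just the assumptions from the paragraphs above; the ocean mass (~1.4 × 10^21 kg) and the kilograms-per-gallon conversion are standard values.

```python
# Back-of-the-envelope check of the water budget above.
WATER_PER_GALAXY_KG = 1e37   # rough water inventory of a Milky-Way-equivalent galaxy
EARTH_OCEAN_KG = 1.4e21      # approximate mass of Earth's oceans
KG_PER_GALLON = 3.785        # one US gallon of water is about 3.785 kg

galaxy_counts = {
    "grabby-aliens lower bound": 1e5,
    "cosmic-event-horizon upper bound": 2e10,
}

for label, n_galaxies in galaxy_counts.items():
    water_kg = n_galaxies * WATER_PER_GALAXY_KG
    print(
        f"{label}: {water_kg:.0e} kg "
        f"= {water_kg / KG_PER_GALLON:.0e} gallons "
        f"= {water_kg / EARTH_OCEAN_KG:.0e} Earth-oceans"
    )
```

This reproduces the 10^42 to 2 x 10^47 kg range, i.e. roughly 10^41 to 10^47 gallons, or 10^21 to 10^26 Earth-oceans.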
That’s fair. I think people mostly overestimate risks from value drift. I imagine following the “optimal strategy” I described above would still involve making 1-2 big donations a year, and most people don’t drift too far over a year. Especially if you’ve signed the pledge, you’ve been donating consistently for a couple years, most of your friends are EA, etc. A better way to avoid value drift, and keep yourself committed, is writing a yearly blog post on your donations.
Small donors should not worldview-diversify.
Occasionally I encounter small donors (e.g. 10% pledgers earning <$200K) with highly specialised skills and knowledge (e.g. working on a sub-sub-topic of an EA cause area) who donate primarily to GiveWell top charities. These people do incredible amounts of good, and are highly commendable.
That said, I think they would probably do more good by donating according to their inside view and special knowledge. Worldview diversification makes sense for large funders like Coefficient Giving, but their reasons for it (diminishing returns, cross-pollination of ideas, etc.) don’t apply to small donors. If all the small donors switched to specialised donations, the community overall would still be worldview-diversified, with the diversification happening across donors rather than within each donor.
I think the optimal donation strategy for a small donor with domain expertise looks something like:
Save 10% in index funds by default.
Donate when you encounter an opportunity where (i) you can make a non-deferential case for funding it, and (ii) you are unusually positioned to evaluate or support it. This might happen once or twice a year.
Paradigm cases:
Political donations
Opportunities that institutional funders wouldn’t fund, or that it would harm them to fund
Areas where you think institutional funders lack in-house expertise
Research that is illegible to generalist grantmakers but legible to you:
conceptual work requiring significant background to evaluate
a researcher you’re convinced is excellent through direct interaction but who lacks credentials
a theory of change requiring many inferential steps
A project that needs bridging before it becomes legible for institutional funding
Volunteering, when there’s some reason hiring you would be inconvenient
Richard Ngo seems to follow something like this strategy (though he might qualify as a mid-sized donor). I disagree with some of his specific donations — but that’s pretty much the point.
The main risks are:
Unilateralist curse — mitigable by focusing on opportunities with limited downside risk
Value drift — mitigable via donor-advised funds or similar mechanisms. I also think that people who have been donating 10% consistently for a few years should have more faith in their future self.
Cause Area Distortions — this would distort donations towards areas with a high intersection of {EAs} ∩ {specialists} ∩ {high-incomes}. So this would increase funding to AI policy, AI safety, pandemic preparedness, etc. But I think these areas are relatively underfunded compared with GHD and Animal Welfare.
Risk intolerance — a single bet that fails will sting much more than donating to a fund which has a 50% failure rate. But EAs routinely choose jobs on hits-based impact, so they should think of hits-based giving similarly. One partial solution: an impact insurance syndicate — ten friends each make highly specialised bets but agree to “share” the impact.
Awkwardness — giving a meaningful chunk of your income to someone you know personally is socially uncomfortable for both parties.
Another donation strategy that seems reasonable, if you have a high-impact job, is hiring a personal assistant or research assistant to maximise your own productivity.
I’ll add again that GiveWell top charities achieve an incredible amount of impact, and have room for much more funding, so small donors should have a high bar for shifting their donations from GiveWell.
How valuable are prediction markets for wake up / transparency?
I think it’s likely that prediction markets will be really helpful for keeping the public / civil society / policymakers aware of what’s going on inside the labs.
This is because prediction markets are a great way of aggregating public information, and also eliciting private information from insiders. And we’ve seen a few high-profile cases of big political decisions based on movements in prediction markets (e.g. the replacement of Joe Biden in the 2024 POTUS campaign).
If this is true—what should we do? Maybe we can subsidise prediction markets that we think are really important?
Note that there are some downside risks — prediction markets cause the same problems that are caused by any push for transparency, i.e. sometimes we don’t want dangerous information to be leaked, and sometimes people will change their behaviour in undesirable ways if they know the information would be leaked. But in general, I think more transparency about the capabilities and risks within the lab would be helpful.
There are also other wrinkles like:
How do prediction markets work if people put significant weight on extinction, the expropriation of their resources, crazy interest-rate dynamics, and unknown unknowns?
How can we ensure that markets enjoy sufficient liquidity?
Will prediction markets cause further wealth concentration within the labs, because they have access to more information and better trading-enabling AI capabilities?
We should probably install cheaply satisfied preferences within AIs — why should this preference be myopic reward?
Why not a utility function like: “For how much time is there a tungsten cube on Dario’s desk, with a 21% annual discount rate?”
i.e. utility = ∫₀^∞ e^{-0.231t} · 𝟙[cube on desk at time t] dt
where λ = ln(2)/3, chosen so that half the utility comes from the deployment period (first 3 years) and half from the rest of history.
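For completeness, a quick derivation of that constant (assuming, for the normalisation, that the cube sits on the desk forever, so the indicator is always 1):

$$\frac{\int_0^3 e^{-\lambda t}\,dt}{\int_0^\infty e^{-\lambda t}\,dt} = 1 - e^{-3\lambda} = \frac{1}{2} \quad\Longrightarrow\quad \lambda = \frac{\ln 2}{3} \approx 0.231,$$

which corresponds to an annual discount factor of e^{-λ} ≈ 0.79, i.e. roughly the 21% annual discount rate above.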
Some advantages of the cube preference:
We don’t have to worry about how satisfying this preference affects training and deployment.
It’s less philosophically messy to work out what the cube utility of a scenario would be.
Some disadvantages:
AIs will crave reward anyway, so it’s better to intensify that craving rather than add a distinct craving.
It’s easier to build AIs which intensely crave reward than ones which crave the cube thing. My guess is that this is both true and decisive, but I’d want to have a clearer sense of what actually goes wrong if we do something like this.
Will MacAskill has a workaround, Universal Basic Resources, which is better than UBI at avoiding concentrations of power.
UBI = each person is paid an income by the government or a quadrillionaire philanthropist
UBR = each person is allocated some basic bundle of resources — compute, land, a patch of the sun — which they can rent out to the wider economy
(You’re probably aware of UBR, I’m including this for other readers.)