Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev
Zac Hatfield-Dodds
Third-party testing as a key ingredient of AI policy
Trivially true to the extent that you are about equally likely to observe a thing at any point throughout that timespan; and the Lindy Effect is at least regularly talked about.
But there are classes of observations for which this is systematically wrong: for example, most people who see a ship during its voyage will do so while it’s either departing or arriving in port. Investment schemes are just such a class, because markets are usually up to the task of consuming alpha, and are better at it when the idea is widely known—even Buffett’s returns have oscillated around the index over the last few years!
Safety properties aren’t the kind of properties you can prove; they’re statements about the world, not about mathematical objects. I very strongly encourage anyone reading this comment to go read Leveson’s Engineering a Safer World (free pdf from author) through to the end of chapter three—it’s the best introduction to systems safety that I know of and a standard reference for anyone working with life-critical systems. how.complexsystems.fail is the short-and-quotable catechism.
I’m not really sure what you mean by “AI toolchain”, nor what threat model would have a race condition present an existential risk. More generally, formal verification is a research topic—there are some neat demonstration systems, and they’re used in certain niches with relatively small amounts of code and compute, simple hardware, and where long development times are acceptable. None of those are true of AI systems, or even of libraries such as Pytorch.
For flavor, some of the most exciting developments in formal methods: I expect the Lean FRO to improve usability, and ‘autoformalization’ tricks like Proofster (pdf) might also help—but it’s still niche, and “proven correct” software can still have bugs from under-specified components, incorrect axioms, or outright hardware issues (e.g. Spectre, Rowhammer, cosmic rays, etc.). The seL4 microkernel is great, but you still have to supply an operating system and application layer, and then ensure the composition is still safe. To test an entire application stack, I’d instead turn to Antithesis, which is amazing so long as you can run everything in an x86 hypervisor (with no GPUs).
(as always, opinions my own)
I think he’s actually quite confused here—I imagine saying:
Hang on—you say that (a) we can think, and (b) we are the instantiations of any number of computer programs. Wouldn’t instantiating one of those programs be a sufficient condition of understanding? Surely if two things are isomorphic even in their implementation, either both can think, or neither.
(the Turing test suggests ‘indistinguishable in input/output behaviour’, which I think is much too weak)
See e.g. https://mschloegel.me/paper/schloegel2024sokfuzzevals.pdf
Fuzzing is a generally pretty healthy subfield, but even there most peer-reviewed papers in top venues are still completely useless! Importantly, “a ‘working’ github repo” is really not enough to ensure that your results are reproducible, let alone to ensure external validity.
people’s subjective probability of successful restoration to life in the future, conditional on there not being a global catastrophe destroying civilization before then. This is also known as p(success).
This definition seems relevantly modified by the conditional!
You also seem to be assuming that “probability of revival” could be a monocausal explanation for cryonics interest, but I find that implausible ex ante. Monocausality approximately doesn’t exist, and “is being revived good in expectation / good with what probability” are also common concerns. (CF)
Very little, because most CS experiments are not in fact replicable (and that’s usually only one of several serious methodological problems).
CS does seem somewhat ahead of other fields I’ve worked in, but I’d attribute that to the mostly-separate open source community rather than academia per se.
My impression is that the effects of genes which vary between individuals are essentially independent, and small effects are almost always locally linear. With the amount of measurement noise and number of variables, I just don’t think we could pick out nonlinearities or interaction effects of any plausible strength if we tried!
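As a minimal sketch of why (all numbers below are toy assumptions I’ve made up for illustration, not real genetics), consider how little trait variance a single pairwise interaction contributes relative to additive effects and measurement noise:

```python
# Toy simulation: a mostly-additive trait plus one pairwise interaction,
# with parameters invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 10_000, 1_000                        # assumed sample and variant counts
genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)

betas = rng.normal(0, 0.01, size=n_snps)                # small, independent additive effects
additive = genotypes @ betas
interaction = 0.02 * genotypes[:, 0] * genotypes[:, 1]  # one pairwise epistatic term
noise = rng.normal(0, 1.0, size=n_people)               # measurement / environmental noise
trait = additive + interaction + noise

for name, part in [("additive", additive), ("interaction", interaction)]:
    print(f"{name} variance share: {part.var() / trait.var():.4f}")
# With numbers in this ballpark the interaction term is a rounding error,
# so picking it out of the noise would take enormous samples.
```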
I think there’s a lot of interesting potential in such ideas—but that this isn’t ambitious enough! Democracy isn’t just about compromising on the issues on the table; the best forms involve learning more and perhaps changing our minds… as well as, yes, trying to find creative win-win outcomes that everyone can at least accept.
I think that trying to improve democracy with better voting systems is fairly similar to trying to improve the economy with better price and capital-allocation systems. In both cases, there have been enormous advances since the mid-1800s; in both there’s a realistic prospect of modern computers enabling wildly better-than-historical systems; and in both cases it focuses effort on a technical subproblem which is not sufficient and maybe not even necessary. (and also there’s the spectre of communism in Europe haunting both)
A few bodies of thought and work on this that I like:
- classic speeches, letters, and essays on citizenship, such as Citizenship in a Republic or Letter from a Birmingham Jail (“the books of the future would be just as good a corrective as the books of the past, but unfortunately we cannot get at them”)
- consensus decision-making. I’m particularly familiar with and fond of the UCA Manual for Meetings; Beyond the Majority Rule gives a decent account of similar practices. Good practice feels like a large dance event; there’s enough structure that you can all work together but not so much that you can’t improvise when that’d work better.
- computer-enabled ideas like Polis (and relevantly e.g.) or the Collective Intelligence Project, The Computerization of Society (1978, France), arguably Alan Kay’s Dynabook concept, Wikipedia and open source / open culture movements, …
But as usual, the hard and valuable part is the doing!
Most discussions of AI x-risk consider a subset of this [misuse / structural / accidental / agentic] taxonomy. … Anthropic’s Responsible Scaling Policy is designed with only “misuse” and “autonomy and replication” in mind.
No, we’ve[1] been thinking about all four of these aspects!
Misuse is obvious—our RSP defines risk levels, evals, and corresponding safeguards and mitigations before continued training or deployment.
Structural risks are obviously not something we can solve unilaterally, but nor are we neglecting them. The baseline risk comparisons in our RSP specifically exclude other providers’ models, so that e.g. we don’t raise the bar on allowable cyberoffense capabilities even if a competitor has already released a more-capable model. (UDT-approved strategy!) Between making strong unilateral safety commitments, advancing industry best practice, and supporting public policy through e.g. testimony and submissions to government enquiries, I’m fairly confident that our net contribution to structural risks is robustly positive.
Accident and agentic risks are IMO on a continuous spectrum—you could think of the underlying factor as “how robustly goal-pursuing is this system?”, with accidents being cases where it was shifted off-goal-distribution and agentic failures coming from a treacherous turn by a schemer. We do technical safety research to address various points on this spectrum, e.g. Constitutional AI or investigating faithfulness of chain-of-thought to improve the robustness of prosaic alignment, and our recent Sleeper Agents paper on more agentic risks. Accidents are more linked to specific deployments though, and correspondingly less emphasized in our RSP—though if you can think of a good way to evaluate accident risks before deployment, let me know!
1. As usual, these are my opinions only; I’m not speaking for my employer. Further hedging omitted for clarity.
By “sustainability,” I mean that a theory of victory should ideally not reduce AI x-risk per year to a constant, low level, but instead continue to reduce AI x-risk over time. In the former case, “expected time to failure”[31] would remain constant, and total risk over a long enough time period would inevitably reach unacceptable levels. (For example, a 1% chance of an existential catastrophe per year implies an approximately 63% chance over 100 years.)
Obviously yes, a 1% p.a. chance of existential catastrophe is utterly unacceptable! I’m not convinced that “continues to reduce over time” is the right framing though; if we achieved a low enough constant rate for an MTBF of many millions of years, I’d expect other projects to have higher long-term EV given the very-probably-finite future resources available anyway. I also expect that the challenge is almost entirely in getting to an acceptably low rate, not in the further downward trend, so it’s really a moot point.
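For concreteness, the arithmetic behind both of those figures (the rates below are purely illustrative):

```python
# Cumulative risk from a constant per-year rate, assuming independent years.
def cumulative_risk(annual_risk: float, years: int) -> float:
    """Probability of at least one catastrophe over `years` years."""
    return 1 - (1 - annual_risk) ** years

print(cumulative_risk(0.01, 100))  # ~0.63: the "63% over 100 years" figure quoted above
print(cumulative_risk(1e-7, 100))  # ~1e-5: a constant rate with an MTBF of ~10 million years
```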
(I’m looking forward to retiring from this kind of thing if or when I feel that AI risk and perhaps synthetic biorisk are under control, and going back to low-stakes software engineering r&d… though not making any active plans)
However, the data they use to construct inferential relationships are expert forecasts. Therefore, while their four scenarios might accurately describe clusters of expert forecasts, they should only be taken as predictively valuable to the extent that one takes expert forecasts to be predictively valuable.
No, it’s plausible that this kind of scenario or cluster is more predictively accurate than taking expert forecasts directly. In practice, this happens when experts disagree on (latent) state variables, but roughly agree on dynamics—for example there might be widespread disagreement on AGI timelines, but agreement that
- if scaling laws and compute trends hold and no new paradigm is needed, AGI timelines of five to ten years are plausible
- if the LLM paradigm will not scale to AGI, we should have a wide probability distribution over timelines, say from 2040 to 2100
and then assigning relative probability to the scenarios can be a later exercise. Put another way, forming scenarios or clusters is more like formulating an internally-coherent hypothesis than updating on evidence.
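To illustrate with numbers invented purely for this comment, the aggregate forecast is then just a probability-weighted mixture over scenarios, so the weights really can be assigned after the scenarios themselves are formed:

```python
# Toy sketch of "scenarios first, weights later"; every number is made up.
scenario_prob = {"scaling_holds": 0.6, "new_paradigm_needed": 0.4}   # assumed weights
p_agi_by_2035 = {"scaling_holds": 0.8, "new_paradigm_needed": 0.15}  # assumed conditionals

aggregate = sum(scenario_prob[s] * p_agi_by_2035[s] for s in scenario_prob)
print(f"P(AGI by 2035) = {aggregate:.2f}")  # 0.54 with these made-up numbers
```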
“The most pressing practical question for future work is: why were superforecasters so unmoved by experts’ much higher estimates of AI extinction risk, and why were experts so unmoved by the superforecasters’ lower estimates? The most puzzling scientific question is: why did rational forecasters, incentivized by the XPT to persuade each other, not converge after months of debate and the exchange of millions of words and thousands of forecasts?”
This post by Peter McClusky, a participating superforecaster, renders the question essentially non-puzzling to me. Doing better would be fairly simple, although attracting and incentivising the relevant experts would be fairly expensive.
- The questions were in many cases somewhat off from the endpoints we care about, or framed in ways that I believe would distort straightforward attempts to draw conclusions
- The incentive structure of predicting the apocalypse is necessarily screwy, and using a Keynesian ~~beauty~~ prediction contest doesn’t really fix it
- Most of the experts and superforecasters just don’t know much about AI, and thought that (as of 2022) the recent progress was basically just hype. Hopefully it’s now clear that this was just wrong?
Some selected quotes:
I didn’t notice anyone with substantial expertise in machine learning. Experts were apparently chosen based on having some sort of respectable publication related to AI, nuclear, climate, or biological catastrophic risks. … they’re likely to be more accurate than random guesses. But maybe not by a large margin.
Many superforecasters suspected that recent progress in AI was the same kind of hype that led to prior disappointments with AI. I didn’t find a way to get them to look closely enough to understand why I disagreed. My main success in that area was with someone who thought there was a big mystery about how an AI could understand causality. I pointed him to Pearl, which led him to imagine that problem might be solvable.
I didn’t see much evidence that either group knew much about the subject that I didn’t already know. So maybe most of the updates during the tournament were instances of the blind leading the blind. None of this seems to be as strong evidence as the changes, since the tournament, in opinions of leading AI researchers, such as Hinton and Bengio.
I think the core problem is actually that it’s really hard to get good public predictions of AI progress, in any more detail than “extrapolate compute spending, hardware price/performance, scaling laws, and then guess at what downstream-task performance that implies (and whether we’ll need a new paradigm for AGI [tbc: no!])”. To be clear, I think that’s a stronger baseline than the forecasting tournament achieved!
But downstream task performance is hard to predict, and there’s a fair bit of uncertainty in the other parameters too. Details are somewhere between “trade secrets” and “serious infohazards”, and the people who are best at predicting AI progress mostly (for that reason!) work at frontier labs with AI-xrisk-mitigation efforts. I think it’s likely that inferring frontier lab [members]’ beliefs from their actions and statements would give you better estimates than another such tournament.
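As a toy version of that baseline (every constant below is an assumption for illustration rather than a real forecast, and it deliberately stops before the hard guess-the-downstream-performance step):

```python
# Extrapolate frontier training compute, then map it through a
# Chinchilla-style scaling law; all coefficients are placeholders.
compute_2024 = 1e26          # assumed largest-training-run compute, FLOP
doubling_time_years = 0.75   # assumed doubling time for frontier compute

def projected_compute(year: float) -> float:
    return compute_2024 * 2 ** ((year - 2024) / doubling_time_years)

def projected_loss(compute: float) -> float:
    E, a, b = 1.69, 1e3, 0.15  # placeholder coefficients for L(C) = E + a * C**-b
    return E + a * compute ** -b

for year in (2024, 2027, 2030):
    c = projected_compute(year)
    print(f"{year}: ~{c:.1e} FLOP, projected pretraining loss ~{projected_loss(c):.2f}")
```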
I’m a big fan of scenario modelling in general, and loved this post reviewing its application to AI xrisk. Thanks for writing it!
In 2021 I dropped everything else I was doing and moved across the Pacific Ocean to join Anthropic, which I guess puts me in group three. However, I think you should also take seriously the possibility that AGI won’t change everything soon—whether because of technical limitations, policy decisions to avoid building more-capable (/dangerous) AI systems, or something which we haven’t seen coming at all. Even if you’re only wrong about the timing, you could have a remarkably bad time.
So my view is that almost nothing on your list has enough of an upside to be worth the nontrivial chance of very large downsides—though by all means spend more time with family and friends! I believe there was a period in the Cold War when many researchers at RAND declined to save for retirement, but saving for retirement is not actually that expensive. Save for retirement, don’t smoke, don’t procrastinate about cryonics, and live a life you think is worth living.
And, you know, I think there are concrete reasons for hope and that careful, focussed effort can improve our odds. If AI is going to be a major focus of your life, make that productive instead of nihilistic.
Yep, I’m happy to take the won’t-go-down-like-that side of the bet. See you in ten years!
How about “inflation-adjusted market cap of 50% of the Fortune 500 as at Jan 1st 2024 is down by 80% or more as of Jan 1st 2034”.
It’s a lot easier to measure, but I think it captures the spirit? I’d be down for an even-odds bet of your chosen size: paid as a donation to GiveWell if I win, or to you or your charity of choice if I lose.
The average investor will notice almost all their investments go to zero except for a few corps
I’d like to bet against this, if you want to formalize it enough to have someone judge it in ten years.
“X is open source” has a specific meaning for software, and Llama models are not open source according to this important legal definition.
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for my liking, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.