Programme Director at UK Advanced Research + Invention Agency focusing on safe transformative AI; formerly Protocol Labs, FHI/Oxford, Harvard Biophysics, MIT Mathematics and Computation.
davidad
A list of core AI safety problems and how I hope to solve them
You can still fetch the coffee today if you’re dead tomorrow
Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs
An Open Agency Architecture for Safe Transformative AI
Why I Moved from AI to Neuroscience, or: Uploading Worms
AI Neorealism: a threat model & success criterion for existential safety
Reframing inner alignment
I want to go a bit deep here on “maximum entropy” and misunderstandings thereof by the straw-man Humbali character, mostly to clarify things for myself, but also in the hopes that others might find it useful. I make no claim to novelty here—I think all this ground was covered by Jaynes (1968)—but I do have a sense that this perspective (and the measure-theoretic intuition behind it) is not pervasive around here, the way Bayesian updating is.
First, I want to point out that entropy of a probability measure $p$ is only definable relative to a base measure $\mu$, as follows:

$$H_\mu(p) = -\int \frac{dp}{d\mu} \log\left(\frac{dp}{d\mu}\right) d\mu$$

(The derivatives notated here denote Radon–Nikodym derivatives; the integral is Lebesgue.) Shannon’s formulae, the discrete $H(p) = -\sum_i p_i \log p_i$ and the continuous $H(p) = -\int p(x) \log p(x)\,dx$, are the special cases of this where $\mu$ is assumed to be counting measure or Lebesgue measure, respectively. These formulae actually treat $p$ as having a subtly different type than “probability measure”: namely, they treat it as a density with respect to counting measure (a “probability mass function”) or a density with respect to Lebesgue measure (a “probability density function”), and implicitly supply the corresponding $\mu$.
If you’re familiar with Kullback–Leibler divergence ($D_{\mathrm{KL}}(p \parallel \mu)$), and especially if you’ve heard $D_{\mathrm{KL}}$ called “relative entropy,” you may have already surmised that $H_\mu(p) = -D_{\mathrm{KL}}(p \parallel \mu)$. Usually, KL divergence is defined with both arguments being probability measures (measures that add up to 1), but that’s not required for it to be well-defined (what is required is absolute continuity, which is sort of orthogonal). The principle of “maximum entropy,” or $\arg\max_p H_\mu(p)$, is equivalent to $\arg\min_p D_{\mathrm{KL}}(p \parallel \mu)$. In the absence of additional constraints on $p$, the solution of this is $p = \mu$ (normalized if necessary), so maximum entropy makes sense as a rule for minimum confidence to exactly the same extent that the implicit base measure $\mu$ makes sense as a prior. The principle of maximum entropy should really be called “the principle of minimum updating”, i.e., making a minimum-KL-divergence move from your prior to your posterior when the posterior is constrained to exactly agree with observed facts. (Standard Bayesian updating can be derived as a special case of this.)
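To make the equivalence concrete, here is a minimal numerical sketch (my own illustration, with made-up numbers, not anything from the original dialogue): maximizing $H_\mu(p)$ under a moment constraint is literally the same optimization as minimizing $D_{\mathrm{KL}}(p \parallel \mu)$, and its solution is an exponential tilting of the base measure.

```python
import numpy as np
from scipy.optimize import minimize, brentq

# Discrete sample space and a base measure mu (here: counting measure,
# i.e. the implicit mu behind Shannon's discrete formula).
x = np.arange(1, 11, dtype=float)   # outcomes 1..10
mu = np.ones_like(x)

def kl(p):
    """D_KL(p || mu); mu need not be normalized to 1."""
    return np.sum(p * np.log(p / mu))

# "Maximize entropy given E[x] = 4" == "minimize KL from mu given E[x] = 4".
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p @ x - 4.0}]
res = minimize(kl, np.full(10, 0.1), bounds=[(1e-12, 1)] * 10,
               constraints=cons, method="SLSQP")

# Closed form: exponential tilting p_i ∝ mu_i * exp(lam * x_i), with lam
# solved numerically to satisfy the mean constraint.
def tilted_mean(lam):
    w = mu * np.exp(lam * x)
    return (w @ x) / w.sum()

lam = brentq(lambda l: tilted_mean(l) - 4.0, -5.0, 5.0)
p_closed = mu * np.exp(lam * x)
p_closed /= p_closed.sum()

print(np.allclose(res.x, p_closed, atol=1e-3))  # True: same distribution
# With no constraint besides normalization, the optimum is just mu
# renormalized -- i.e. "minimum updating" away from the prior.
```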
Sometimes, the structure of a situation has some symmetry group with respect to which the situation of uncertainty seems to be invariant, with classic examples being relabeling heads/tails on a coin, or arbitrarily permuting a shuffled deck of cards. In these examples, the requirement that a prior be invariant with respect to those symmetries (in Jaynes’ terms, the principle of transformation groups) uniquely characterizes counting measure as the only consistent prior (the classical principle of indifference, which still lies at the core of grade-school probability theory). In other cases, like a continuous roulette wheel, other Haar measures (which generalize both counting and Lebesgue measure) are justified. But taking “indifference” or “insufficient reason” to justify using an invariant measure as a prior in an arbitrary situation (as Laplace apparently did) is fraught with difficulties:
Most obviously, the invariant measure on $\mathbb{R}$ with respect to translations, namely Lebesgue measure $\lambda$, is an improper prior: it is a non-probability measure because its integral (formally, $\lambda(\mathbb{R}) = \infty$) is infinite. If we’re talking about forecasting the timing of a future event, $[0, \infty)$ is a very natural space, but it is no less infinite. Discretizing into year-sized buckets doesn’t help, since counting measure on $\mathbb{N}$ is also infinite (formally, $\#(\mathbb{N}) = \infty$). In the context of maximum entropy, using an infinite measure for $\mu$ means that there is no maximum-entropy $p$—you can always get more entropy by spreading the probability mass even thinner.
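As a quick worked check on that last claim (my example, not from the original): relative to counting measure on $\mathbb{N}$, the uniform distribution on the first $N$ year-buckets has entropy

$$H_{\#}(\mathrm{Unif}_N) = -\sum_{n=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N \to \infty \quad \text{as } N \to \infty,$$

so spreading the mass over ever more buckets increases entropy without bound, and no maximum-entropy distribution exists.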
But what if we discretize and also impose a nothing-up-my-sleeve outrageous-but-finite upper bound, like the maximum `binary64` number at around $1.8 \times 10^{308}$? Counting measure on a finite set $\{0, 1, \ldots, N\}$ can be normalized into a probability measure, so what stops that from being a reasonable “unconfident” prior? Sometimes this trick can work, but the deeper issue is that the original symmetry-invariance argument that successfully justifies counting measure for shuffled cards just makes no sense here. If one relabels all the years, say reversing their order, the situation of uncertainty is decidedly not equivalent.

Another difficulty with using invariant measures as priors as a general rule is that they are not always uniquely characterized, as in the Bertrand paradox, or the obvious incompatibility between uniform priors (invariant to addition) and log-uniform priors (invariant to multiplication).
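To put one concrete number on that last incompatibility (my illustration): for a quantity known only to lie in $[1, 100]$, the uniform and log-uniform priors disagree wildly about the event $T \le 10$:

$$P_{\mathrm{unif}}(T \le 10) = \frac{10-1}{100-1} \approx 0.09, \qquad P_{\mathrm{log}}(T \le 10) = \frac{\ln 10}{\ln 100} = \frac{1}{2}.$$

Both priors are “indifferent”; they are just indifferent with respect to different symmetry groups.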
I think Humbali’s confusion can be partially explained as conflating an invariant measure and a prior—in both directions:
First, Humbali implicitly uses a translation-invariant base measure as a prior when he claims as absolute a notion of “entropy” which is actually relative to that particular base measure. Something like this mistake was made by both Laplace and Shannon, so Humbali is in good company here—but already confused, because translation on the time axis is not a symmetry with respect to which forecasts ought to be invariant.
Then, when cornered about a particular absurd prediction that inevitably arises from the first mistake, Humbali implicitly uses his (socially-driven) prior as a base measure, when he says “somebody with a wide probability distribution over AGI arrival spread over the next century, with a median in 30 years, is in realistic terms about as uncertain as anybody could possibly be.” Assuming he’s still using “entropy” at all as the barometer of virtuous unconfidence, he’s now saying that the way to fix the absurd conclusions of maximum entropy relative to Lebesgue measure is that one really ought to measure unconfidence with respect to a socially-adjusted “base rate” measure, which just happens to be his own prior. (I think the lexical overlap between “base rate” and “base measure” is not a coincidence.) This second position is more in bad faith than the first because it still has the bluster of objectivity without any grounding at all, but it has more hope of formal coherence: one can imagine a system of collectively navigating uncertainty where publicly maintaining one’s own epistemic negentropy, explicitly relative to some kind of social median, comes at a cost (e.g. hypothetically or literally wagering with others).
There is a bit of motte-and-bailey uncovered by the bad faith in position 2. Humbali all along primarily wants to defend his prior as unquestionably reasonable (the bailey), and when he brings up “maximum entropy” in the first place, he’s retreating to the motte of Lebesgue measure, which seems to have a formidable air of mathematical objectivity about it. Indeed, by its lights, Humbali’s own prior does happen to have more entropy than Eliezer’s, though Lebesgue measure fails to support the full bailey of Humbali’s actual prior. However, in this case even the motte is not defensible, since Lebesgue measure is an improper prior and the translation-invariance that might justify it simply has no relevance in this context.
Meta: any feedback about how best to make use of the channels here (commenting, shortform, posting, perhaps others I’m not aware of) is very welcome; I’m new to actually contributing content on AF.
That’s me. In short form, my justification for working on such a project where many have failed before me is:
The “connectome” of C. elegans is not actually very helpful information for emulating it. Contrary to popular belief, connectomes are not the biological equivalent of circuit schematics. Connectomes are the biological equivalent of what you’d get if you removed all the component symbols from a circuit schematic and left only the wires. Good luck trying to reproduce the original functionality from that data.
What you actually need is to functionally characterize the system’s dynamics by performing thousands of perturbations to individual neurons and recording the results on the network, in a fast feedback loop with a very very good statistical modeling framework which decides what perturbation to try next (see the toy sketch at the end of this comment).
With optogenetic techniques, we are just at the point where it’s not an outrageous proposal to reach for the capability to read and write to anywhere in a living C. elegans nervous system, using a high-throughput automated system. It has some pretty handy properties, like being transparent, essentially clonal, and easily transformed. It also has less handy properties, like being a cylindrical lens, being three-dimensional at all, and having minimal symmetry in its nervous system. However, I am optimistic that all these problems can be overcome by suitably clever optical and computational tricks.
I’m a disciple of Kurzweil, and as such I’m prone to putting ridiculously near-future dates on major breakthroughs. In particular, I expect to be finished with C. elegans in 2-3 years. I would be Extremely Surprised, for whatever that’s worth, if this is still an open problem in 2020.
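Here is a toy sketch of the kind of closed perturb-and-model loop I have in mind (entirely my illustration; the linear-response model and all names in it are simplifying assumptions, not the actual Nemaload framework): maintain a Bayesian posterior over unknown couplings, and at each step choose the stimulation pattern whose response the posterior is most uncertain about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the worm: a perturbation u (which neurons get stimulated)
# elicits a response y = w_true . u + noise, with the coupling weights
# w_true unknown to the experimenter.
n_neurons = 20
w_true = rng.normal(size=n_neurons)
noise_sd = 0.5

def run_experiment(u):
    return w_true @ u + rng.normal(scale=noise_sd)

# Conjugate Bayesian linear regression with a N(0, I) prior over w.
precision = np.eye(n_neurons)           # posterior precision matrix
xty = np.zeros(n_neurons)               # accumulated (1/sigma^2) * u * y
candidates = rng.choice([0.0, 1.0], size=(500, n_neurons))  # stim patterns

for t in range(100):
    cov = np.linalg.inv(precision)
    # Active step: pick the perturbation whose predicted response the
    # current posterior is most uncertain about (max predictive variance).
    pred_var = np.einsum("ij,jk,ik->i", candidates, cov, candidates)
    u = candidates[np.argmax(pred_var)]
    y = run_experiment(u)
    # Exact Bayesian update for a Gaussian likelihood with known noise.
    precision += np.outer(u, u) / noise_sd**2
    xty += u * y / noise_sd**2

w_posterior_mean = np.linalg.solve(precision, xty)
print("recovery error:", np.linalg.norm(w_posterior_mean - w_true))
```

A real version would of course replace the linear model with nonlinear neural dynamics and the variance heuristic with a proper expected-information-gain criterion, but the loop structure is the point.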
Side-channels: input versus output
One of the members of the committee that authored this (if not the chairperson) is Yi Zeng. He’s persistently engaged in conversations with CSER and other AI ethics groups in the UK, Australia, at the UN, etc.; I’ve met him at a few events and I believe that most of the values quoted above are really sincerely held. My main concern here is rather that these values are still stated in terms that may be too vague to interpret and enforce uniformly as practical regulation throughout the large Chinese AI industry. But it’s no doubt a step in the right direction toward rules that are actually binding on relevant actions.
The Promise and Peril of Finite Sets
I can’t say for sure why Boyden or others didn’t assign grad students or postdocs to a Nemaload-like direction; I wasn’t involved at that time, there are many potential explanations, and it’s hard to distinguish limiting/bottleneck or causal factors from ancillary or dependent factors.
That said, here’s my best explanation. There are a few factors for a life-science project that make it a good candidate for a career academic to invest full-time effort in:
1. The project only requires advancing the state of the art in one sub-sub-field (specifically the one in which the academic specializes).
2. If the state of the art is advanced in this one particular way, the chances are very high of a “scientifically meaningful” result, i.e. it would immediately yield a new interesting explanation (or strong evidence for an existing controversial explanation) about some particular natural phenomenon, rather than just technological capabilities. Or, failing that, at least it would make a good “methods paper”, i.e. establishing a new, well-characterized, reproducible tool which many other scientists can immediately see is directly helpful for the kind of “scientifically meaningful” experiments they already do or know they want to do.
3. It is easy to convince people that your project is plausibly on a critical path in the roadmap towards one of the massive medical challenges that ultimately motivate most life-science funding, such as finding more effective treatments for Alzheimer’s, accelerating the vaccine pipeline, preventing heart disease, etc.
The more of these factors are present, the more likely your effort as an academic will lead to career advancement and recognition. Nemaload unfortunately scored quite poorly on all three counts, at least until recently:
(1) It required advancing the state-of-the-art in, at least: C. elegans genetic engineering, electro-optical system integration, computer vision, quantitative structural neuroanatomy of C. elegans, mathematical modeling, and automated experimental design.
(2) Even the final goal of Nemaload (uploading worms who’ve learned different behaviors and showing that the behaviors are reproduced in simulations) is barely “scientifically meaningful”. All it would demonstrate scientifically (as opposed to technically) is that learned behaviors are encoded in some way in neural dynamics. This hypothesis is at the same time widely accepted and extremely difficult to convince skeptics of. Of course, studying the uploaded dynamics might yield fascinating insights into how nature designs minds, but it also might be pretty black-boxy and inexplicable without advancing the state of the art in yet further ways.
(2b) Worse, partial progress is even less scientifically meaningful, e.g. “here’s a time-series of half the neurons, I guess we can do unsupervised clustering on it, oh look at that, the neural activity pattern can predict whether the worm is searching for food or not, as can, you know, looking at it.” To get an upload, you need all the components of the uploading machine, and you need them all to work at full spec. And partial progress doesn’t make a great methods paper either, for the following reason. Any particular experiment that worm neuroscientists want to do, they can do more cheaply and effectively in other ways, like genetically engineering only the specific neurons they care about for that experiment to fluoresce when they’re active. Even if they’re interested in a lot of neurons, they’re going to average over a population anyway, so they can just look at a handful of neurons at a time. And they also don’t mind doing all kinds of unnatural things to the worms like microfluidic immobilization to make the experiment easier, even though that makes the worms’ overall mental-state very, shall we say, perturbed, because they’re just trying to probe one neural circuit at a time, not to get a holistic view of all behaviors across the whole mental-state-space.
(3) The worm nervous system is in most ways about as far as you can get from a human nervous system while still being made of neural cells. C. elegans is not the model organism of choice for any human neurological disorder. Further, the specific technical problems and solutions are obviously not going to generalize to any creature with a bony skull, or with billions of neurons. So what’s the point? It’s a bit like sending a lander to the Moon when you’re trying to get to Alpha Centauri. There are some basic principles of celestial mechanics and competencies of system design and integration that will probably mostly generalize, and you have to start acquiring those with feedback from attempting easier missions. Others may argue that Nemaload, as a step on a roadmap to any science on mammals (let alone interventions on humans), is more like climbing a tree when you’re trying to get to the Moon. It’s hard to defend against this line of attack.
If a project has one or two of these factors but not all three, then if you’re an ambitious postdoc with a good CV already in a famous lab, you might go for it. But if it has none, it’s not good for your academic career, and if you don’t realize that, your advisor has a duty of care to guide you towards something more likely to keep your trajectory on track. Advisors don’t owe the same duty of care to summer undergrads.
Adam Marblestone might have more insight on this question; he was at the Boyden lab in that time. It also seems like the kind of phenomenon that Alexey Guzey likes to try to explain.
I might have time for some more comments later, but here’s a few quick points (as replies):
Cryptoepistemology
I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.
> Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?
>
> IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.
This seems like the key! It’s probably what people actually mean by the question “is RLHF a promising alignment strategy?”
Most of this post is laying out thoughtful reasoning about related but relatively uncontroversial questions like “is RLHF, narrowly construed, plausibly sufficient for alignment” (of course not) and “is RLHF, very broadly construed, plausibly useful for alignment” (of course yes). I don’t want to diminish the value of having those answers be more common-knowledge. But I do want to call attention to how little of the reasoning elsewhere in the post seems to me to support the plausibility of this opinion here, which is the most controversial and decision-relevant one, and which is stated without any direct justification. (There’s a little bit of justification for it elsewhere, which I’ve argued against in separate comments.) I’m afraid that one post which states a bunch of opinions about related questions, while including detailed reasoning but only for the less controversial ones, might be more persuasive than it ought to be about the juicier questions.
I wrote a lengthy exegesis of Humbali’s confusion around “maximum entropy”, which I decided ended up somewhere between a comment and a post in terms of quality, so I put it here in “Shortform”. I’m new to contributing content on AF, so meta-level feedback about how best to use the different channels (commenting, shortform, posting) is welcome.
There has been some serious progress in the last few years on full functional imaging of the C. elegans nervous system (at the necessary spatial and temporal resolutions and ranges).
A good summary of the state of play as of late 2020 can be found in this opinion article: https://www.sciencedirect.com/science/article/pii/S0959438820301689
State-of-the-art work is currently happening in Shanghai, Hefei, Wuhan, and Beijing. See https://doi.org/10.1002/cyto.a.24483 and https://arxiv.org/pdf/2109.10474.pdf
Despite this, however, I haven’t yet been able to find any publications where full functional imaging is combined with controlled cellular-scale stimulation (e.g. as I proposed via two-photon excitation of optogenetic channels), which I believe is necessary for inference of a functional model.
Heartened by a strong-upvote for my attempt at condensing Eliezer’s object-level claim about timeline estimates, I shall now attempt condensing Eliezer’s meta-level “core thing”.
Certain epistemic approaches to arriving at object-level knowledge consistently look like a good source of grounding in reality, especially to people who are trying to be careful about epistemics, and yet such approaches’ grounding in reality is consistently illusory.
Specific examples mentioned in the post are “the outside view”, “reference class forecasting”, “maximum entropy”, “the median of what I remember credible people saying”, and, most importantly for the object-level but least importantly for the Core Thing, “Drake-equation-style approaches that cleanly represent the unknown of interest as a deterministic function of simpler-seeming variables”.
Specific non-examples are concrete experimental observations (falling objects, celestial motion). These have grounding in reality, but they don’t tend to feel like they’re “objective” in the same way, like a nexus that my beliefs are epistemically obligated to move toward—they just feel like a part of my map that isn’t confused. (If experiments do start to feel “objective”, one is then liable to mistake empirical frequencies for probabilities.)
The illusion of grounding in reality doesn’t have to be absolute (i.e. “this method reliably arrives at optimal beliefs”) to be absolutely corrupting (e.g. “this prior, which is not relative to anything in particular, is uniquely privileged and central, even if the optimal beliefs are somewhere nearby-but-different”).
The “trick that never works”, in general form, is to go looking in epistemology-space for some grounding in objective reality, which will systematically tend to lead you into these illusory traps.
Instead of trying to repress your subjective ignorance by shackling it to something objective, you should:
sublimate your subjective ignorance into quantitative probability measures,
use those to make predictions and design experiments, and finally
freely and openly absorb observations into your subjective mind and make subjective updates.
Eliezer doesn’t seem to be saying the following [edit: at least not in my reading of this specific post], but I would like to add:
Even just trying to make your updates objective (e.g. by using a computer to perform an exact Bayesian update) tends to go subtly wrong, because it can encourage you to replace your actual map with your map of your map, which is predictably less informative. Making a map of your map is another one of those techniques that seem to provide more grounding but do not actually do so.
Calibration training is useful because your actual map is also predictably systematically bad at updating by default, and calibration training makes it better at doing this. Teaching the low-description-length principles of probability to your actual map-updating system is much more feasible (or at least more cost-effective) than emitting your actual map into a computationally realizable statistical model.
Techniques that give the illusion of objectivity are usually not useless. But to use them effectively, you have to see through the illusion of objectivity, and treat their outputs as observations of what those techniques output, rather than as glimpses at the light of objective reasonableness.
In the particular example of forecasting AGI with biological anchors, Eliezer does this when he predicts (correctly, at least in the fictional dialogue) that the technique (perhaps especially when operated by people who are trying to be careful and objective) will output a median 30 years from the present.
It’s only because Eliezer can predict the outcome, and finds it almost uncorrelated in his map from AGI’s actual arrival, that he dismisses the estimate as useless.
This particular example as a vehicle for the Core Thing (if I’m right about what that is) has the advantage that the illusion of objectivity is especially illusory (at least from Eliezer’s perspective), but the disadvantage that one can almost read Eliezer as condemning ever using Drake-equation-style approaches, or reference-class forecasting, or the principle of maximum entropy. But I think the general point is about how to undistortedly view the role of these kinds of things in one’s epistemic journey, which in most cases doesn’t actually exclude using them.