I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
I’m first going to zoom out a bit—to a broader trend in AI Safety that I’m worried about, one that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.
I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under a timer of unknown length. This is understandable, and I think it can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we’ll stop.”
But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements—those which enable us to predict the behavior and safety of AI systems with anywhere near the level of precision we have when we say “it’s safe for you to get on this plane”—will be swapped out for the easier problem of using the measurements we already have, or those which are close by; ones which are at best only proxies and at worst almost completely unrelated to what we ultimately care about.
And I think it is easy to forget, in an environment where we are continually churning out things like evaluations and metrics, how little we in fact know. When people see a sea of ML papers, conferences, math, numbers, and “such and such system passed such and such safety metric,” it conveys an inflated sense of our understanding, not only to the public but also to ourselves. I think this sort of dynamic can create a Red Queen’s race of sorts, where the more we demand concrete proposals—in a domain we don’t yet actually understand—the more pressure we’ll feel to appear as if we understand what we’re talking about, even when we don’t. And the more we create this appearance of understanding, the more concrete asks we’ll make of the system, and the more inflated our sense of understanding will grow, and so on.
I’ve seen this sort of dynamic play out in neuroscience, where in my experience the ability to measure anything at all about some phenomenon often leads people to prematurely conclude we understand how it works. For instance, reaction times are a thing one can reliably measure, and so is EEG activity, so people will often do things like… measure both of these quantities while manipulating the number of green blocks on a screen, then call the relationship between these “top-down” or “bottom-up” attention. All of this despite having no idea what attention is, and hence no idea whether these measures in fact meaningfully relate to the thing we actually care about.
There are a truly staggering number of green block-type experiments in the field, proliferating every year, and I think the existence of all this activity (papers, conferences, math, numbers, measurement, etc.) convinces people that something must be happening, that progress must be being made. But if you ask the neuroscientists attending these conferences what attention is, over a beer, they will often confess that we still basically have no idea. And yet they go on, year after year, adding green blocks to screens ad infinitum, because those are the measurements they can produce, the numbers they can write on grant applications, grants which get funded because at least they’re saying something concrete about attention, rather than “I have no idea what this is, but I’d like to figure it out!”
I think this dynamic has significantly corroded academia’s ability to figure out important, true things, and I worry that if we introduce it here, we will face similar corrosion.
Zooming back in on this proposal in particular: I feel pretty uneasy about the messaging, here. When I hear words like “responsible” and “policy” around a technology which threatens to vanquish all that I know and all that I love, I am expecting things more like “here is a plan that gives us multiple 9’s of confidence that we won’t kill everyone.” I understand that this sort of assurance is unavailable, at present, and I am grateful to Anthropic for sharing their sketches of what they hope for in the absence of such assurances.
But the unavailability of such assurance is also kind of the point, and one that I wish this proposal emphasized more… it seems to me that vague sketches like these ought to be full of disclaimers like, “This is our best idea but it’s still not very reassuring. Please do not believe that we are safely able to prevent you from dying, yet. We have no 9’s to give.” It also seems to me like something called a “responsible scaling policy” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
And I worry that in the absence of such a story—where the true plan is something closer to “fill in the blanks as we go”—that a mounting pressure to color in such blanks will create a vacuum, and that we will begin to fill it with the appearance of understanding rather than understanding itself; that we will pretend to know more than we in fact do, because that’s easier to do in the face of a pressure for results, easier than standing our ground and saying “we have no idea what we’re talking about.” That the focus on concrete asks and concrete proposals will place far too much emphasis on what we can find under the streetlight, and will end up giving us an inflated sense of our understanding, such that we stop searching in the darkness altogether, forget that it is even there…
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations—those which give us high confidence about the safety of AI systems—and in the absence of a realistic plan to get them, I think demanding such concreteness may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does and will hence, I think, feel more secure than is warranted.
Meta: I don’t want this comment to be taken as “I disagree with everything you (Thomas) said.” I do think the question of what to do when you have an opaque, potentially intractable problem is not obvious, and I don’t want to come across as saying that I have the definitive answer, or anything like that. It’s tricky to know what to do, here, and I certainly think it makes sense to focus on more concrete problems if deconfusion work didn’t seem that useful to you.
That said, at a high-level I feel pretty strongly about investing in early-stage deconfusion work, and I disagree with many of the object-level claims you made suggesting otherwise. For instance:
It seems to me like the history of neuroscience should inspire the opposite conclusion: a hundred years of ever more data collection at finer and finer resolution, and yet we still have a field that even many neuroscientists agree barely understands anything. I did undergrad and grad school in neuroscience and can at the very least say that this was also my conclusion. The main problem, in my opinion, is that theory usually tells us which facts to collect. Without it—without even a proto-theory or a rough guess, as with “model-free data collection” approaches—you are basically just taking shots in the dark and hoping that if you collect a truly massive amount of data, and somehow search over it for regularities, theory will emerge. This seems pretty hopeless to me, and entirely backwards from how science has historically progressed.
It seems similarly pretty hopeless to me to expect a “revolution” out of tabulating features of the brain at fine-enough resolution. Like, I certainly buy that it gets us some cool insights, much like every other imaging advance has gotten us some cool insights. But I don’t think the history of neuroscience really predicts a “revolution,” here. Aside from the computational costs of “understanding” an object in such a way, I just don’t really buy that you’re guaranteed to find all the relevant regularities. You can never collect *all* the data; you have to make choices and tradeoffs when you measure the world, and without a theory to tell you which features are irrelevant and can safely be ignored, it’s hard to know that you’re ultimately looking at the right thing.
I ran into this problem, for instance, when I was researching cortical uniformity. Academia has amassed a truly gargantuan collection of papers on the structural properties of the human neocortex. What on Earth do any of these papers say about how algorithmically uniform the brain is? As far as I can tell, pretty much nothing, because we have no idea how the structural properties of the cortex relate to the functional ones, and so who’s to say whether “neuron subtype A is denser in the frontal cortex than in the visual cortex” is a meaningful finding? I worry that other “shot in the dark” data collection methods will suffer similar setbacks.
It’s of course difficult to say how science might have progressed counterfactually, but I find it pretty hard to believe that relativity would have been “discovered easily” had we had a bunch of data staring us in the face. In general, I think it’s very easy to underestimate how difficult it is to come up with new concepts. I felt this way when I was reading about Darwin and how it took him over a year to go from realizing that “artificial selection is the means by which breeders introduce changes,” to realizing that “natural selection is the means by which changes are introduced in the wild.” But then I spent a long time in his shoes, so to speak, operating from within the concepts he had available to him at the time, and I came away more humble. For instance, among other things, it seems like a leap to go from “a human uses their intellect to actively select” to “nature ends up acting like a selector, in the sense that its conditions favor some traits for survival over others.” These feel like quite different “types” of things, in some ways.
In general, I suspect it’s easy to take the concepts we already have, look over past data, and assume it would have been obvious. But I think the history of science again speaks to the contrary: scientific breakthroughs are rare, and I don’t think it’s usually the case that they’re rare because of a lack of data, but because they require looking at that data differently. Perhaps data on gravitational lensing would have roused scientists to notice that there were anomalies, and might eventually have led to general relativity. But the actual process of taking the anomalies and turning them into a theory is, I think, really hard. Theories don’t just pop out wholesale when you have enough data; they take serious work.
This story misses some pretty important pieces. For instance, Schrödinger predicted basic features of DNA—that it was an aperiodic crystal—from first principles in his book What Is Life?, published in 1944. The basic reasoning is that in order to stably encode genetic information, the molecule should itself be stable, i.e., a crystal. But to encode a variety of information, rather than the same thing repeated indefinitely, it needs to be aperiodic. An aperiodic crystal is a molecule that can use a few primitives to encode near-infinite possibilities, in a stable way. His book was very influential, and Watson and Crick both credited Schrödinger with the theoretical ideas that guided their search. I also suspect their search went much faster than it would have otherwise; many biologists at the time thought that the hereditary molecule was a protein, of which there are tens of millions in a typical cell.
But, more importantly, I would certainly not say that biochemistry is an area where empirical work has succeeded to nearly the extent that we might hope it to. Like, we still can’t cure cancer, or aging, or any of the myriad medical problems people have to endure; we still can’t even define “life” in a reasonable way, or answer basic questions like “why do arms come out basically the same size?” The discovery of DNA was certainly huge, and helpful, but I would say that we’re still quite far from a major success story with biology.
My guess is that it is precisely because we lack theory that we are unable to answer these basic questions, and to advance medicine as much as we want. Certainly the “tabulate indefinitely” approach will continue moving the needle on biological research, but I doubt it is going to get us anywhere near the gains that, e.g., “the hereditary molecule is an aperiodic crystal” did.
And while it’s certainly possible that biology, intelligence, agency and so on are just not amenable to the cleave-reality-at-its-joints type of clarity one gets from scientific inquiry, I’m pretty skeptical that this is the world we in fact live in, for a few reasons.
For one, it seems to me that practically no one is trying to find theories in biology. It is common for biologists (even bright-eyed young PhDs at elite universities) to say things like (in some cases this exact sentence): “there are no general theories in biology because biology is just chemistry which is just physics.” These are people at the beginning of their careers, throwing in the towel before they’ve even started! Needless to say, this take is clearly not true in all generality, because it would anti-predict natural selection. It would also, I think, anti-predict Newtonian mechanics (“there are no general theories of motion because motion is just the motion of chemicals which is just the motion of particles which is just physics”).
Secondly, I think that practically all scientific disciplines look messy, ad-hoc, and empirical before we get theories that tie them together, and that this does not on its own suggest biology is a theoretically bankrupt field. E.g., we had steam engines before we knew about thermodynamics, but they were kind of ad-hoc, messy contraptions, because we didn’t really understand what variables were causing the “work.” Likewise, naturalism before Darwin was largely compendiums upon compendiums of people being like “I saw this [animal/fossil/plant/rock] here, doing this!” Science before theory often looks like this, I think.
Third: I’m just like, look guys, I don’t really know what to tell you, but when I look at the world and I see intelligences doing stuff, I sense deep principles. It’s a hunch, to be sure, and kind of hard to justify, but it feels very obvious to me. And if there are deep principles to be had, then I sure as hell want to find them. Because it’s embarrassing that at this point we don’t even know what intelligence is, nor agency, nor abstraction: how to measure any of them, or predict when they will increase. These are the gears that are going to move our world, for better or for worse, and I at least want my hands on the steering wheel when they do.
I think that sometimes people don’t really know what to envision with theoretical work on alignment, or “agent foundations”-style work. My own vision is quite simple: I want to do great science, as great science has historically been done, and to figure out what in god’s name any of these phenomena are. I want to be able to measure that which threatens our existence, such that we may learn to control it. And even though I am of course not certain this approach is workable, it feels very important to me to try. I think there is a strong case for there being a shot, here, and I want us to take it.