Joe Carlsmith
Senior research analyst at Open Philanthropy. Recently completed a doctorate in philosophy at the University of Oxford. Opinions my own.
I’m hopeful that if we actually had a worked out reductionist account of all the problematic intuitions, which we knew was right and which made illusionism true, then this would be at least somewhat helpful in making illusionism less mysterious. In particular, I’m hopeful that thoroughly and dutifully reconceptualizing our introspection and intuitions according to that theory — “when it seems to me like X, what’s going on is [insert actual gears level explanation, not just ‘neurons are firing’ or ‘my brain is representing its internal processing in a simplified and false way’]” — would make a difference.
Thanks for explaining where you’re coming from.
Yet I experience that computation as the qualia of “blueness.” How can that be? How can any computation of any kind create, or lead to qualia of any kind? You can say that it is just a story my brain is telling me that “I am seeing blue.” I must not understand what is being claimed, because I agree with it and yet it doesn’t remove the problem at all. Why does that story have any phenomenology to it? I can make no sense of the claim that it is an illusion.
As I understand it, the idea would be that, as weird as it may sound, there isn’t any phenomenology to it. Rather: according to the story that your brain is telling, there is some phenomenology to it. But there isn’t. That is, your brain’s story doesn’t create, lead to, or correlate with phenomenal blueness; rather, phenomenal blueness is something that the story describes, but which doesn’t exist, in the same way that a story can describe unicorns without bringing them to life.
I do remember that conversation, though I’m a bit hazy on the details of the argument you presented. Let me know if there’s a write-up/summary somewhere, or if you create one in future.
That’s the one :)
Thanks :) Re blog name: it isn’t: “Hands” comes from a Martin Buber quote, and “Cities” from a phrase I believe I heard from A.J. Julius. I chose them partly as a personal reminder about the blog’s aims.
Glad to hear you found it useful.
Thanks :). I do think clinging often functions as an unnoticed lens on the world; though noticing it, in my experience, is also quite distinct from it “releasing.” I also would’ve thought that depression can be an unnoticed (or at least, unquestioned) lens as well: e.g., a depressed person who is convinced that everything in the world is bad, that they’ll never feel better again, etc.
My sense is that the possibility of dynamics of this kind would be on people’s radar in the philosophy community, at least.
I meant mental states in something more like the #1 sense—and so, I think, does Frankish.
Interesting; I hadn’t really considered that angle. Seems like this could also apply to other mental phenomena that might seem self-recommending (pleasure? rationality?), but which plausibly have other, more generally adaptive functions as well, so I would continue to wonder about other functions regardless.
Thanks for writing this up. Quick question re: “Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans.” What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: “An agent is aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.” But you don’t say explicitly what it is for an objective to be aligned: I’m curious if you have a preferred formulation.
Is it something like: “the behavioral objective is such that, when the agent does ‘well’ on this objective, the agent doesn’t act in a way we would view as bad/problematic/dangerous/catastrophic.” If so, it seems like a lot might depend on exactly how “well” the agent does, and what opportunities it has in a given context. That is, an “aligned” agent might not stay aligned if it becomes more powerful, but continues optimizing for the same objective (for example, a weak robot optimizing for beating me at chess might be “aligned” because it only focuses on making good chess moves, but a stronger one might not be, because it figures out how to drug my tea). Is that an implication you’d endorse?
Or is the thought something like: “the behavioral objective is such that, no matter how powerfully the agent optimizes for it, and no matter its opportunities for action, it doesn’t take actions we would view as bad/problematic/dangerous/catastrophic”? My sense is that something like this is often the idea people have in mind, especially in the context of anticipating things like intelligence explosions. If this is what you have in mind, though, maybe worth saying so explicitly, since intent alignment in this sense seems like a different constraint than intent alignment in the sense of e.g. “the agent’s pursuit of its behavioral objective does not in fact give rise to bad actions, given the abilities/contexts/constraints that will in fact be relevant to its behavior.”
Aren’t they now defined in terms of each other?
“Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.
Outer alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.”
Cool (though FWIW, if you’re going to lean on the notion of policies being aligned with humans, I’d be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I’m assuming you have in mind something like “a policy is aligned with humans if an agent implementing that policy is aligned with humans.”).
Regardless, sounds like your definition is pretty similar to: “An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn’t act in ways that humans judge bad”? If you see it as importantly different from this, I’d be curious.
Hi Daniel,
Thanks for reading. I think estimating p(doom) by different dates (and in different take-off scenarios) can be a helpful consistency check, but I disagree with your particular “sanity check” here—and in particular, premise (2). That is, I don’t think that conditional on APS-systems becoming possible/financially feasible by 2035, it’s clear that we should have at least 50% on doom (perhaps some of the disagreement here is about what it takes for the problem to be “real,” and to get “solved”?). Nor do I see 10% on “Conditional on it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity” as obviously overconfident (though I do take some objections in this vein seriously). I’m not sure exactly what “10% on nuclear war” analog argument you have in mind: would you be able to sketch it out, even if hazily?
Thanks, this seems like a salient type of consideration, and one that isn’t captured very explicitly in the current list (though I think it may play a role in explaining the bullet point about humans with general skill-sets being in-demand).
Glad to hear it, Steven. Thanks for reading, and for taking the time to write up your own threat model.
Thanks for reading, and for your comments on the doc. I replied to specific comments there, but at a high level: the formal work you’ve been doing on this does seem helpful and relevant (thanks for doing it!). And other convergent phenomena seem like helpful analogs to have in mind.
Hi Daniel,
Thanks for taking the time to clarify.
One other factor for me, beyond those you quote, is the “absolute” difficulty of ensuring practical PS-alignment, e.g. (from my discussion of premise 3):
Part of this uncertainty has to do with the “absolute” difficulty of achieving practical PS-alignment, granted that you can build APS systems at all. A system’s practical PS-alignment depends on the specific interaction between a number of variables—notably, its capabilities (which could themselves be controlled/limited in various ways), its objectives (including the time horizon of the objectives in question), and the circumstances it will in fact be exposed to (circumstances that could involve various physical constraints, monitoring mechanisms, and incentives, bolstered in power by difficult-to-anticipate future technology, including AI technology). I expect problems with proxies and search to make controlling objectives harder; and I expect barriers to understanding (along with adversarial dynamics, if they arise pre-deployment) to exacerbate difficulties more generally; but even so, it also seems possible to me that it won’t be “that hard” (by the time we can build APS systems at all) to eliminate many tendencies towards misaligned power-seeking (for example, it seems plausible to me that selecting very strongly against (observable) misaligned power-seeking during training goes a long way), conditional on retaining realistic levels of control over a system’s post-deployment capabilities and circumstances (though how often one can retain this control is a further question).
My sense is that relative to you, I am (a) less convinced that ensuring practical PS-alignment will be “hard” in this absolute sense, once you can build APS systems at all (my sense is that our conceptions of what it takes to “solve the alignment problem” might be different), (b) less convinced that practically PS-misaligned systems will be attractive to deploy despite their PS-misalignment (whether because of deception, or for other reasons), (c) less convinced that APS systems becoming possible/incentivized by 2035 implies “fast take-off” (it sounds like you’re partly thinking: those are worlds where something like the scaling hypothesis holds, and so you can just keep scaling up; but I don’t think the scaling hypothesis holding to an extent that makes some APS systems possible/financially feasible implies that you can just scale up quickly to systems that can perform at strongly superhuman levels on e.g. ~any task, whatever the time horizons, data requirements, etc), and (d) more optimistic about something-like-present-day-humanity’s ability to avoid/prevent failures at a scale that disempowers ~all of humanity (though I do think Covid, and its politicization, is an instructive example in this respect), especially given warning shots (and my guess is that we get warning shots whether before or after 2035, even if APS systems become possible/financially feasible before then).
Re: nuclear winter, as I understand it, you’re reading me as saying: “in general, if a possible and incentivized technology is dangerous, there will be warning shots of the dangers; humans (perhaps reacting to those warning shots) won’t deploy at a level that risks the permanent extinction/disempowerment of ~all humans; and if they start to move towards such disempowerment/extinction, they’ll take active steps to pull back.” And your argument is: “if you get to less than 10% doom on this basis, you’re going to give too low probabilities on scenarios like nuclear winter in the 20th century.”
I don’t think of myself as leaning heavily on an argument at that level of generality (though maybe there’s a bit of that). For example, that statement feels like it’s missing the “maybe ensuring practical PS-alignment just isn’t that hard, especially relative to building practically PS-misaligned systems that are at least superficially attractive to deploy” element of my own picture. And more generally, I expect to say different things about e.g. biorisk, climate change, nanotech, etc, depending on the specifics, even if generic considerations like “humans will try not to all die” apply to each.
Re: nuclear winter in particular, I’d want to think a bit more about what sort of probability I’d put on nuclear winter in the 20th century (one thing your own analysis skips is the probability that a large nuclear conflict injects enough smoke into the stratosphere to actually cause nuclear winter, which I don’t see as guaranteed—and we’d need to specify what level of cooling counts). And nuclear winter on its own doesn’t include a “scaling to the permanent disempowerment/extinction of ~all of humanity” step—a step that, FWIW, I see as highly questionable in the nuclear winter case, and which is important to my own probability on AI doom (see premise 5). And there are various other salient differences: for example, mutually assured destruction seems like a distinctly dangerous type of dynamic, which doesn’t apply to various AI deployment scenarios; nuclear weapons have widespread destruction as their explicit function, whereas most AI systems won’t; and so on. That said, I think comparisons in this vein could still be helpful; and I’m sympathetic to points in the vein of “looking at the history of e.g. climate, nuclear risk, BSL-4 accidents, etc., the probability that humans will deploy technology that risks global catastrophe, and not stop doing so even after getting significant evidence about the risks at stake, can’t be that low” (I talk about this a bit in 4.4.3 and 6.2).
Hi Koen,
Glad to hear you liked section 4.3.3. And thanks for pointing to these posts—I certainly haven’t reviewed all the literature, here, so there may well be reasons for optimism that aren’t sufficiently salient to me.
Re: black boxes, I do think that black-box systems that emerge from some kind of evolution/search process are more dangerous; but as I discuss in 4.4.1, I also think that the bare fact that the systems are much more cognitively sophisticated than humans creates significant and safety-relevant barriers to understanding, even if the system has been designed/mechanistically understood at a different level.
Re: “there is a whole body of work which shows that evolved systems are often power-seeking”—anything in particular you have in mind here?
Glad you found it helpful (or at least, as helpful as other work on the topic). So far in my engagement with Graziano (specifically, non-careful reads of his 2013 book and his 2019 “Toward a standard model of consciousness”), I don’t feel like I’ve taken away much more than the summary I gave above of Frankish’s view: namely, “introspective mechanisms … track the processes involved in access consciousness and represent them using a simplified model” — something pretty similar to what Chalmers also says here on p. 34. I know Graziano focuses on attention in particular, and he talks more about e.g. sociality and cites some empirical work, but at a shallow glance I’m not sure I yet see really substantive and empirically grounded increases in specificity, beyond what seems like the general line amongst a variety of folks that “there’s some kind of global workspace-y thing, there’s some kind of modeling of that, this modeling involves simplifications/distortions/opacity of various kinds, these somehow explain whatever problem intuitions/reports need explaining.” But I haven’t tried to look at Graziano closely. The “naive” vs. “sophisticated” descriptions in your blog post seem like a helpful way to frame his project.