Thanks, John—very open to this kind of push-back (and as I wrote in the fake thinking post, I am definitely not saying that my own thinking is free of fakeness). I do think the post (along with various other bits of the series) is at risk of being too anchored on the existing discourse. That said: do you have specific ways in which you feel like the frame in the post is losing contact with the territory?
(This comment is not about the parts which most centrally felt anchored on social reality; see other reply for that. This one is a somewhat-tangential but interesting mini-essay on ontological choices.)
The first major ontological choices were introduced in the previous essay:
Thinking of “capability” as a continuous 1-dimensional property of AI
Introducing the “capability frontier” as the highest capability level the actor developed/deployed so far
Introducing the “safety range” as the highest capability level the actor can safely deploy
Introducing three “security factors”:
Making the safety range (the happy line) go up
Making the capability frontier (the dangerous line) not go up
Keeping track of where those lines are.
The first choice, treatment of “capability level” as 1-dimensional, is obviously an oversimplification, but a reasonable conceit for a toy model (so long as we remember that it is toy, and treat it appropriately). Given that we treat capability level as 1-dimensional, the notion of “capability frontier” for any given actor immediately follows, and needs no further justification.
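For concreteness, here's roughly how I'm picturing the toy model so far. This is my own minimal rendering, not code from the essay, and the names are made up:

```python
from dataclasses import dataclass

@dataclass
class Actor:
    capability_frontier: float  # highest capability level developed/deployed so far (the danger line)
    safety_range: float         # highest capability level the actor can safely deploy (the happy line)

    def over_the_line(self) -> bool:
        # The regime we want to avoid: the frontier has passed what this actor can handle safely.
        return self.capability_frontier > self.safety_range

# The three "security factors", read against this model:
#   safety progress      -> interventions which raise safety_range
#   capability restraint -> interventions which keep capability_frontier from rising
#   risk evaluation      -> maintaining an accurate estimate of both numbers
```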
The notion of “safety range” is a little more dubious. Safety of an AI obviously depends on a lot of factors besides just the AI’s capability. So, there’s a potentially very big difference between e.g. the most capable AI a company “could” safely deploy if the company did everything right based on everyone’s current best understanding (which no company of more than ~10 people has ever done, or ever will do, in a novel field), vs the most capable AI the company could safely deploy under realistic assumptions about the company’s own internal human coordination capabilities, vs the most capable AI the company can actually-in-real-life aim for and actually-in-real-life not end up dead.
… but let’s take a raincheck on clarifying the “safety range” concept and move on.
The security factors are a much more dubious choice of ontology. Some of the dubiousness:
If we’re making the happy line go up (“safety progress”): whose happy line? Different actors have different lines. If we’re making the danger line not go up (“capability restraint”): again, whose danger line? Different actors also have different danger lines.
This is important, because humanity’s survival depends on everybody else’s happy and danger lines, not just one actor’s!
“Safety progress” inherits all the ontological dubiousness of the “safety range”.
If we’re keeping track of where the lines are (“risk evaluation”): who is keeping track? Who is doing the analysis, who is consuming it, how does the information get to relevant decision makers, and why do they make their decisions on the basis of that information?
Why factor apart the levels of the danger and happy lines? These are very much not independent, so it’s unclear why it makes sense to think of their levels separately, rather than e.g. looking at their average and difference as the two degrees of freedom, or their average and difference in log space, or the danger line level and the difference, or [...]. There are a lot of ways to parameterize two degrees of freedom (a few alternatives are sketched in code below), and it’s not clear why this parameterization would make more sense than some other.
On the other hand, factoring apart “risk evaluation” from “safety progress” and “capability restraint” does seem like an ontologically reasonable choice: it’s the standard factorization of instrumental from epistemic. That standard choice is not always the right way to factor things, but it’s at least a choice which has “low burden of proof” in some sense.
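To spell out the earlier point about alternative parameterizations, here's a quick sketch of a few of the options in code. The function names are mine, purely for illustration:

```python
import math

def as_levels(danger: float, happy: float) -> tuple[float, float]:
    """The essay's parameterization: track each line's level directly."""
    return danger, happy

def as_mean_and_margin(danger: float, happy: float) -> tuple[float, float]:
    """Alternative: overall level, plus the safety margin between the lines."""
    return (danger + happy) / 2, happy - danger

def as_log_mean_and_log_margin(danger: float, happy: float) -> tuple[float, float]:
    """Same idea in log space (assumes both levels are positive)."""
    return (math.log(danger) + math.log(happy)) / 2, math.log(happy) - math.log(danger)

def as_danger_and_margin(danger: float, happy: float) -> tuple[float, float]:
    """Danger-line level, plus the margin above it."""
    return danger, happy - danger
```

Each of these carries the same two degrees of freedom; the question is which coordinates the interventions and feedback loops act on most cleanly.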
What would it look like to justify these ontological choices? In general, ontological justification involves pointing to some kind of pattern in the territory—in this case, either the “territory” of future AI, or the “territory” of AI safety strategy space. For instance, in a very broad class of problems, one can factor apart the epistemic and instrumental aspects of the problem, and resolve all the epistemic parts in a manner totally agnostic to the instrumental parts. That’s a pattern in the “territory” of strategy spaces, and that pattern justifies the ontological choice of factoring apart instrumental and epistemic components of a problem.
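A toy instance of that pattern, just to make the factorization concrete (the numbers and names here are hypothetical, not anything from the essay):

```python
# Epistemic part: form beliefs about the world, with no reference to what anyone wants.
prior = {"world_safe": 0.5, "world_unsafe": 0.5}
likelihood = {"world_safe": 0.1, "world_unsafe": 0.7}  # P(warning sign observed | world state)
evidence = sum(prior[w] * likelihood[w] for w in prior)
posterior = {w: prior[w] * likelihood[w] / evidence for w in prior}

# Instrumental part: any utility function can be plugged in on top of the same
# posterior; the belief computation above never had to know about it.
def best_action(utilities: dict[str, dict[str, float]]) -> str:
    # utilities[action][world_state] -> payoff
    return max(
        utilities,
        key=lambda a: sum(posterior[w] * utilities[a][w] for w in posterior),
    )

cautious_payoffs = {
    "deploy": {"world_safe": 1.0, "world_unsafe": -100.0},
    "hold":   {"world_safe": 0.0, "world_unsafe": 0.0},
}
print(best_action(cautious_payoffs))  # -> "hold" under these numbers
```

The epistemic half (the posterior) gets resolved once, agnostic to whichever payoffs the instrumental half plugs in afterwards.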
If one could e.g. argue that the safety range and capability frontier are mostly independent, or that most interventions impact the trajectory of only one of the two, then that would be an ontological justification for factoring the two apart. (Seems false.)
(To be clear: people very often have good intuitions about ontological choices, but don’t know how to justify them! I am definitely not saying that one must always explicitly defend ontological choices, or anything like that. But one should, if asked and given time to consider, be able to look at an ontological choice and say what underlying pattern makes that ontological choice sensible.)
Thanks, John. I’m going to hold off here on in-depth debate about how to choose between different ontologies in this vicinity, as I do think it’s often a complicated and not-obviously-very-useful thing to debate in the abstract, and that lots of taste is involved. I’ll flag, though, that the previous essay on paths and waystations (where I introduce this ontology in more detail) does explicitly name various of the factors you mention (along with a bunch of other not-included subtleties). E.g., re the importance of multiple actors:
Now: so far I’ve only been talking about one actor. But AI safety, famously, implicates many actors at once – actors that can have different safety ranges and capability frontiers, and that can make different development/deployment decisions. This means that even if one actor is adequately cautious, and adequately good at risk evaluation, another might not be...
And re: e.g. multidimensionality, and the difference between “can deploy safely” and “would in practice” -- from footnote 14:
Complexities I’m leaving out (or not making super salient) include: the multi-dimensionality of both the capability frontier and the safety range; the distinction between safety and elicitation; the distinction between development and deployment; the fact that even once an actor “can” develop a given type of AI capability safely, they can still choose an unsafe mode of development regardless; differing probabilities of risk (as opposed to just a single safety range); differing severities of rogue behavior (as opposed to just a single threshold for loss of control); the potential interactions between the risks created by different actors; the specific standards at stake in being “able” to do something safely; etc.
I played around with more complicated ontologies that included more of these complexities, but ended up deciding against. As ever, there are trade-offs between simplicity and subtlety; I chose a particular way of making those trade-offs, and so far I’m not regretting it.
Re: who is risk-evaluating, how they’re getting the information, the specific decision-making processes: yep, the ontology doesn’t say, and I endorse that; I think trying to specify would be too much detail.
Re: why factor apart the capability frontier and the safety range—sure, they’re not independent, but it seems pretty natural to me to think of risk as increasing as frontier capabilities increase, and of our ability to make AIs safe as needing to keep up with that. Not sure I understand your alternative proposals re: “looking at their average and difference as the two degrees of freedom, or their average and difference in log space, or the danger line level and the difference, or...”, though, or how they would improve matters.
As I say, people have different tastes re: ontologies, simplifications, etc. My own taste finds this one fairly natural and useful—and I’m hoping that the use I give it in the rest of the series (e.g., in classifying different waystations and strategies, in thinking about these different feedback loops, etc.) can illustrate why (see also the slime analogy from the previous post for another intuition pump). But I welcome specific proposals for better overall ways of thinking about the issues in play.
The last section felt like it lost contact most severely. It says
What are the main objections to AI for AI safety?
It notably does not say “What are the main ways AI for AI safety might fail?” or “What are the main uncertainties?” or “What are the main bottlenecks to success of AI for AI safety?”. It’s worded in terms of “objections”, and implicitly, it seems we’re talking about objections which people make in the current discourse. And looking at the classification in that section (“evaluation failures, differential sabotage, dangerous rogue options”) it indeed sounds more like a classification of objections in the current discourse, as opposed to a classification of object-level failure modes from a less-social-reality-loaded distribution of failures.
I do also think the frame in the earlier part of the essay is pretty dubious in some places, but that feels more like object-level ontological troubles and less like it’s anchoring too much on social reality. I ended up writing a mini-essay on that which I’ll drop in a separate reply.
I agree it’s generally better to frame in terms of object-level failure modes rather than “objections” (though: sometimes one is intentionally responding to objections that other people raise, but that one doesn’t buy). And I think that there is indeed a mindset difference here. That said: your comment here is about word choice. Are there substantive considerations you think that section is missing, or substantive mistakes you think it’s making?