My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.
I share similar concerns that Anthropic doesn’t seem to institutionally prioritize thinking about the future or planning, and their public outputs to date are not encouraging evidence of careful thinking about misalignment. That said, I’m pretty sympathetic to the idea that while this isn’t great, this isn’t that bad because careful thinking about exactly what needs to happen in the future isn’t that good of an approach for driving research direction or organization strategy. My biggest concern is that while careful planning is perhaps not that important now (you can maybe get most of the value by doing things that seem heuristically good), we’ll eventually need to actually do a good job thinking through what exactly should be implemented when the powerful AIs are right in front of us. I don’t think we’ll be able to make the ultimate strategic choices purely based on empirical feedback loops.
It feels vaguely reasonable to me to have a belief as low as 15% on “Superalignment is Real Hard in a way that requires like a 10-30 year pause.” And, at 15%, it still feels pretty crazy to be oriented around racing the way Anthropic is.
I don’t really see why this is a crux. I’m currently at like ~5% on this claim (given my understanding of what you mean), but moving to 15% or even 50% (while keeping the rest of the distribution the same) wouldn’t really change my strategic orientation. Maybe you’re focused on getting to a world with a more acceptable level of risk (e.g., <5%), but I think going from 40% risk to 20% risk is better to focus on.
The relevant questions are:
What is the risk reduction obtained by going from “token effort / no pause” to “1 year pause which is well timed and focused on misalignment concerns” vs. the risk reduction obtained by going from “1 year pause” to “20 year pause”?[1]
How much do various strategies change the probability of something like “1 year pause” and “20 year pause”?
(By 1 year pause, I mean devoting 1 year of delay worth of resources to safety. This would include scenarios where progress is halted for a year and all AI company resources are devoted to safety. It would also count if 50% of compute and researchers are well targeted toward safety for 2 years at around the point of full AI R&D automation, which would yield a roughly 1 year slowdown. By “well timed”, I’m including things like deciding to be slower for 2 years or fully paused for 1.)
I would guess that going from a token effort to a well-done 1 year pause for safety reasons (at an organization taking AI risk somewhat seriously and which is moderately competent) reduces risk from around 40% to 20%. (Very low confidence numbers.)
And, it’s much easier to go from a token level of effort to a 1 year pause than to get a 20 year pause, so focusing on the ~1 year pause is pretty reasonable.
For what it’s worth, it’s not super obvious to me that Anthropic thinks of themselves as going for a 1 year pause.
I don’t think there’s a different strategy Anthropic could follow which would increase the chance of successfully getting buy-in for a 20 year pause to a sufficient degree that focusing on this over the 1 year pause makes sense.
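(To make the comparison concrete, here’s a toy expected-value calculation. Only the 40% → 20% figure comes from my guess above; the achievability probabilities and the 20 year pause risk number are made-up placeholders, not considered estimates.)

```python
# Toy comparison of where marginal effort goes furthest. Only the 40% -> 20%
# figure comes from the guess above; the achievability probabilities and the
# 20 year pause risk number are made-up placeholders.

def expected_risk_reduction(p_achieved: float, risk_without: float, risk_with: float) -> float:
    """Risk reduction a strategy delivers, weighted by the chance it actually happens."""
    return p_achieved * (risk_without - risk_with)

one_year = expected_risk_reduction(p_achieved=0.30, risk_without=0.40, risk_with=0.20)
twenty_year = expected_risk_reduction(p_achieved=0.03, risk_without=0.40, risk_with=0.10)

print(f"~1 year pause:  {one_year:.3f} expected absolute risk reduction")    # 0.060
print(f"~20 year pause: {twenty_year:.3f} expected absolute risk reduction")  # 0.009
```

Under placeholder numbers like these, pushing on the ~1 year pause dominates; the conclusion only flips if you think the 20 year pause is far more attainable, or buys far more safety, than assumed here.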
I’m pretty skeptical of the “extreme philosophical competence” perspective. This is basically because we “just” need to be able to hand off to an AI which:
Is seriously aligned (e.g., it faithfully pursues our interests on long open-ended and conceptually loaded tasks that are impossible for us to check).
Is strictly more capable than us.
Has enough time to do nearly as well as humans could have hoped to do.
Creating such a system seems unlikely to require “extreme philosophical competence” even if such a system has to apply “extreme philosophical competence” to sufficiently align successors.
Josh has some posts from a similar perspective, though I don’t fully agree with him (at least at a vibes level), see: here, here, and here.

[1] There is also obviously a case that a 20 year pause is bad if poorly managed, though I don’t think this is very important for this discussion.
I’m pretty skeptical of the “extreme philosophical competence” perspective. This is basically because we “just” need to be able to hand off to an AI which is seriously aligned (e.g., it faithfully pursues our interests on long open-ended and conceptually loaded tasks that are impossible for us to check).
The “extreme philosophical competence” hypothesis is that you need such competence to achieve “seriously aligned” in this sense. It sounds like you disagree, but I don’t know why since your reasoning just sidesteps the problem.
Looking over the comments of the first joshc post, it seems like that post also basically asserted by fiat that it wasn’t necessary. And, the people who actively believe in “alignment is philosophically loaded” showed up to complain that this ignored the heart of the problem.
My current summary of the arguments (which I put ~60% on, and I think Eliezer/Oli/Wentworth treat much more confidently and maybe believe a stronger version of) is something like:
Anything general enough to really tackle openended, difficult-to-evaluate plans, will basically need to operate in a goal directed way in order to do that. (i.e. What’s Up With Confusingly Pervasive Goal Directedness?)
The goal-directedness means it’s very likely to be self/situationally aware, and the requisite intelligence to solve these sorts of problems means even if it’s not full blown anti-aligned, it’s at least probably going to want to try to build more option value for itself.
The fact that you can’t evaluate the results means it has a lot of room to give you answers that help preserve its goals and bootstrap (at least on the margin), even if it’s not massively smart enough to one-shot escape. And you can’t solve that problem with Control (i.e. The Case Against AI Control Research).
You can maybe have interpretability tools that check for schemingness (if it’s the first generation of generally capable agent and isn’t too smart yet, maybe you’ve done a good job preserving Chain of Thought as a reasonably faithful representation, for now). But, you’ll then just see “yep, the agent is unaligned”, and not actually be able to fix it.
I think my current model of you (Ryan) is like:
“Training models to do specific things, cleverly, actually just makes it pretty hard for them to develop scheming or other motivated misalignments – they have to jump all the way from “don’t think about scheming ever” to “secretly think about scheming” to avoid getting caught, and that probably just won’t work?”
(or, in the example of the second joshC post, they need to learn to be really reliably good at truth-tracking patterns and articulating their reasoning; after internalizing that for thousands of reps, an AI is just gonna have a hard time jumping to reasoning that isn’t truth tracking).
I don’t have a clear model of how you respond to point #4 – that we’ll just reliably find them to be scheming if we succeed at the interpretability steps, and not have a good way of dealing with it. (Maybe you just don’t think this is as overwhelmingly likely?)
Interested in whatever Real You’s cruxes are, 1-2 steps removed.
The “extreme philosophical competence” hypothesis is that you need such competence to achieve “seriously aligned” in this sense. It sounds like you disagree, but I don’t know why since your reasoning just sidesteps the problem.
Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on “extreme philosophical competence”.)
Looking over the comments of the first joshc post, it seems like that post also basically asserted by fiat that it wasn’t necessary.
I agree that Josh’s first post doesn’t argue about this, but the later two posts kinda do? That said, I think Josh doesn’t really state the argument very clearly in a single place anywhere (at least not in the way I would when trying to cover this objection).
“Training models to do specific things, cleverly, actually just makes it pretty hard for them to develop scheming or other motivated misalignments – they have to jump all the way from “don’t think about scheming ever” to “secretly think about scheming” to avoid getting caught, and that probably just won’t work?”
Hmm, no this isn’t really that much of my model. Or at least, I don’t feel myself thinking thoughts like this.
Interested in whatever Real You’s cruxes are, 1-2 steps removed.
I think this is probably not the best place to discuss this, but I’ll try laying out the quick version of the argument, mostly so I can practice explaining myself here.
Objectives and components
Our goal is to make a trustworthy AI that we can defer to. This AI needs to be:
Capable: At least as capable as we are collectively. This includes handling wicked strategy/forecasting/conceptual problems and anticipating risks from what it does.
Trustworthy: It needs to faithfully pursue our interests, including on confusing and open ended problems that are conceptually loaded and that we can’t really check.
We’ll do this using two components:
Avoid egregious misalignment: Rule out misalignment that could make behavioral tests very misleading: avoid scheming (both during training and later developed at runtime as we run more copies of the AI), types of sycophancy that are hard to detect, and value drift.
Use behavioral tests for capabilities and the parts of trustworthiness that remain. These tests will need to evaluate how the AI generalizes to tasks that are harder, more open ended, longer, and harder to check than the domains we trained it on. (Because we’ll need things to generalize further than this.) Iterate against these tests until we know how to train AIs that generalize well.
Avoiding egregious misalignment
I’d first claim that we at least have mechanisms (that don’t require extreme philosophical competence) for greatly reducing the risk of egregious misalignment in expectation. The strategies available are:
Test measures on analogous setups (model organisms) and then try to transfer
Apply principled methods where we have some a priori story it should help
Analogous setups include:
Cases where a model naturally became egregiously misaligned (ideally with a roughly similar architecture and training method to the one we are using).
Cases where we trained in egregious misalignment (or at least heavily nudged the model towards this).
Very abstract analogies where the object we’re studying isn’t actually egregious misalignment, but is analogous to this.
If we had a large range of pretty good test beds of analogous naturally occurring misalignment that cover the failure modes we’re most worried about, then iterating on countermeasures in these test beds seems quite useful. (The countermeasures would be drawn from the principled methods.)
A serious problem is that we’re not necessarily able to consistently detect misalignment. We also need to be able to test whether misalignment is less likely after some countermeasure, so being able to reproducibly catch the misalignment at sufficiently low cost is important.[1] This is probably the central difficulty in test beds. We also need to ensure that our countermeasures aren’t effectively training against our tests.[2]
(To get a lot of signal, our test beds would need to be representative of the failure modes we’re worried about, or there would need to be some reason to expect the countermeasures we try to generalize to all failure modes.)
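(As a minimal sketch of what iterating in these test beds could look like: everything below, the test-bed objects, the detector, and the pass criterion, is a hypothetical placeholder rather than anything that currently exists.)

```python
# Hypothetical placeholder code: `testbeds`, `train_with`, and `detects_misalignment`
# don't exist; this just shows the bookkeeping of comparing a countermeasure
# against a no-countermeasure baseline across analogous setups.

def catch_rate(countermeasure, testbeds, detects_misalignment, trials=20) -> float:
    """Fraction of trials in which we still catch egregious misalignment after
    retraining with `countermeasure` in each analogous setup."""
    caught = 0
    for bed in testbeds:
        for _ in range(trials):
            model = bed.train_with(countermeasure)      # retrain in the analogous setup
            caught += int(detects_misalignment(model))  # requires a cheap, repeatable catch
    return caught / (len(testbeds) * trials)

# Usage sketch:
# baseline = catch_rate(None, testbeds, detects_misalignment)
# for cm in candidate_countermeasures:
#     rate = catch_rate(cm, testbeds, detects_misalignment)
#     # A rate well below baseline is evidence the countermeasure helps, provided we
#     # also check (qualitatively) that it isn't just optimizing against the detector
#     # itself, i.e. effectively training against our tests.
```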
Another objection is: “Actually, we didn’t have any methods available to try which could end up with a model that (always) isn’t egregiously misaligned. So, even if you can iterate a bunch, you’ll just either find that nothing works or you’ll just fool yourself.” I’m somewhat skeptical of this objection if we actually have a bunch of signal in the test beds and can afford to iterate. It feels to me like in most worlds where we have serious misalignment problems, we can get a good enough set of methods to try that we “just” need like <5 bits. As in, if we were to randomly sample sets of measures (and parameters for these measures, etc.) from stuff that seems promising, it seems unlikely we’d need more than 32 tries before something actually worked. This means you maybe don’t need to apply crazy amounts of optimization pressure. I don’t have a cached quick justification.
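(The “<5 bits” claim is just arithmetic on an assumed hit rate; here’s the quick worked version, where the 1-in-32 success chance is the assumption, not a measurement.)

```python
import math

# Assumption (not a measurement): at least 1 in 32 of the promising sets of
# measures would actually work if we could evaluate them properly.
p_sampled_set_works = 1 / 32

print("bits of selection needed:", math.log2(1 / p_sampled_set_works))  # 5.0
print("expected number of tries:", 1 / p_sampled_set_works)             # 32.0

# How quickly "nothing has worked yet" becomes evidence that nothing available works:
for n in (8, 16, 32, 64):
    p_all_fail = (1 - p_sampled_set_works) ** n
    print(f"P(all {n} sampled sets fail) = {p_all_fail:.2f}")  # 0.78, 0.60, 0.36, 0.13
```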
Oops, I ran out of my time box for this comment. I didn’t get to principled measures. I can send you my doc though.
Behavioral tests
Oops, I ran out of my time box for this comment. Hopefully this section is somewhat comprehensible though I don’t get into detail.
We need to test that our AI does reasonable and aligned stuff on really hard to check tasks. So, how can we get confidence in this given that the tasks are hard to check?
First, we can see how the AI does on relatively hard to check or philosophically tricky tasks which we have already looked into substantially for unrelated reasons. A big reason why things might be hard to check is that we don’t have enough time: the AI might do investigations that would take humans decades. But, we have done some investigations in the past! This won’t ultimately let us check tasks as hard as the ones we want the AIs to do, but we can get close. Note that we also have to apply a similar process for humans.
Second, we can study how generalization on this sort of thing works in general: when an AI is trained on easy, looks good on medium, how will it do on hard? We can study this in all kinds of ways so that we can gain confidence in generalization. In particular, we can get a sense of how reasonable humans are and trust them to generalize. If we could get a similar property for AIs, we might be good, and this doesn’t seem clearly harder than the problem of making AIs which are as capable as humans.
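(Here’s a sketch of the kind of easy → medium → hard study I mean; the tiers, scores, and recipe name are simulated placeholders, not results from any real model.)

```python
import random
from statistics import mean

# Hypothetical study: evaluate one training recipe on task tiers of increasing
# difficulty and report the easy -> hard gap. Scores are simulated stand-ins,
# not measurements of any real model.
random.seed(0)

def evaluate(recipe: str, tier: str) -> float:
    """Placeholder for 'grade the model trained with `recipe` on one task of this tier'."""
    base = {"easy": 0.95, "medium": 0.85, "hard": 0.70}[tier]  # assumed difficulty effect
    return min(1.0, max(0.0, random.gauss(base, 0.05)))

def generalization_curve(recipe: str, n_tasks: int = 50) -> dict:
    return {tier: mean(evaluate(recipe, tier) for _ in range(n_tasks))
            for tier in ("easy", "medium", "hard")}

curve = generalization_curve("train-on-easy-only")
print({tier: round(score, 2) for tier, score in curve.items()})
print("easy -> hard gap:", round(curve["easy"] - curve["hard"], 2))
# Run across many task families and recipes, curves like this are the kind of
# evidence that could build (or undercut) trust that behavior on checkable tasks
# transfers to the harder, uncheckable ones.
```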
We’d then need to iterate against these tests in a way that doesn’t overfit.
[1] Technically, we just need some signal here, but if catching the misalignment was very costly initially, P(catch) increases with spend, and our initial catch doesn’t make it much easier to catch perturbations of the model that might behave slightly differently, then we probably wouldn’t get enough signal. That said, if we can sometimes detect misalignment in some models, iteration could still be quite useful.
[2] More precisely, we need to ensure that our countermeasures aren’t just shifting from a type of misalignment we can detect to a type we can’t. Qualitatively analyzing the countermeasures and our tests should help here.
Thanks for laying this out thus far. I’mma reply but understand if you wanna leave the convo here. I would be interested in more effortpost/dialogue about your thoughts here.
Yes, my reasoning is definitely part, but not all of the argument. Like the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general I would put much more weight on “extreme philosophical competence”.)
This makes sense as a crux for the claim “we need philosophical competence to align unboundedly intelligent superintelligences.” But, it doesn’t make sense for the claim “we need philosophical competence to align general, openended intelligence.” I suppose my OP didn’t really distinguish these claims and there were a few interpretations of how the arguments fit together. I was more saying the second (although to be fair I’m not sure I was actually distinguishing them well in my head until now)
It doesn’t make sense for “we ‘just’ need to be able to hand off to an AI which is seriously aligned” to be a crux for the second. A thing can’t be a crux for itself.
I notice my “other-guy-feels-like-they’re-missing-the-point” → “check if I’m not listening well, or if something is structurally wrong with the convo” alarm is firing, so maybe I do want to ask for one last clarification: did you feel like you understood this the first time? Does it feel like I’m missing the point of what you said? Do you think you understand why it feels to me like you were missing the point (even if you think it’s because I’m being dense about something)?
Takes on your proposal
Meanwhile, here’s some takes based on my current understanding of your proposal.
These bits:
We need to ensure that our countermeasures aren’t just shifting from a type of misalignment we can detect to a type we can’t. Qualitatively analyzing the countermeasures and our tests should help here.
...is a bit I think is philosophical-competence bottlenecked. And this bit:
“Actually, we didn’t have any methods available to try which could end up with a model that (always) isn’t egregiously misaligned. So, even if you can iterate a bunch, you’ll just either find that nothing works or you’ll just fool yourself.”
...is a mix of “philosophically bottlenecked” and “rationality bottlenecked.” (i.e. you both have to be capable of reasoning about whether you’ve found things that really worked, and, because there are a lot of degrees of freedom, capable of noticing if you’re deploying that reasoning accurately)
I might buy that you and Buck are competent enough here to think clearly about it (not sure. I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decisionmakers being philosophically competent enough.
(I think at least some people on the alignment science or interpretability teams might be. I bet against the median such team members being able to navigate it. And ultimately, what matters is “does Anthropic leadership go forward with the next training run”, so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people. And Anthropic leadership already seem to basically be ignoring arguments of this type, and I don’t actually expect to get the sort of empirical clarity that (it seems like) they’d need to update before it’s too late.)
Second, we can study how generalization on this sort of thing works in general
I think this counts as the sort of empiricism I’m somewhat optimistic about in my post. I.e., if you are able to find experiments that actually give you evidence about deeper laws, that let you then make predictions about new Actually Uncertain questions of generalization that you then run more experiments on… that’s the sort of thing I feel optimistic about. (Depending on the details, of course.)
But, you still need technical philosophical competence to know if you’re asking the right questions about generalization, and to know when the results actually imply that the next scale-up is safe.
This makes sense as a crux for the claim “we need philosophical competence to align unboundedly intelligent superintelligences.” But, it doesn’t make sense for the claim “we need philosophical competence to align general, openended intelligence.”
I was thinking of a slightly broader claim: “we need extreme philosophical competence”. If I thought we had to use human labor to align wildly superhuman AIs, I would put much more weight on “extreme philosophical competence is needed”. I agree that “we need philosophical competence to align any general, openended intelligence” isn’t affected by the level of capability at handoff.
I might buy that you and Buck are competent enough here to think clearly about it (not sure. I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decisionmakers being philosophically competent enough.
I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is “careful conceptual thinking might be required rather than pure naive empiricism (because we won’t have good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this thinking” and the bailey is “extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed”.
I buy the motte here, but not the bailey. I think the motte is a substantial discount on Anthropic from my perspective, but I’m kinda sympathetic to where they are coming from. (Getting conceptual stuff and futurism right is real hard! How would they know who to trust among people disagreeing wildly!)
And ultimately, what matters is “does Anthropic leadership go forward with the next training run”, so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people.
I don’t think “does Anthropic stop (at the right time)” is the majority of the relevance of careful conceptual thinking from my perspective. Probably more of it is “do they do a good job allocating their labor and safety research bets”. This is because I don’t think they’ll have very much lead time if any (median −3 months), and takeoff will probably be slower than the amount of lead time, so pausing won’t be as relevant. Correspondingly, pausing at the right time isn’t the biggest deal relative to other factors, though it does seem very important at an absolute level.
I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is “careful conceptual thinking might be required rather than pure naive empiricism (because we won’t be given good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this” and the bailey is “extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed”.
Yeah I agree that was happening somewhat. The connecting dots here are “in worlds where it turns out we need a long Philosophical Pause, I think you and Buck would probably be above some threshold where you notice and navigate it reasonably.”
I think my actual belief is “the Motte is high likelihood true, the Bailey is… medium-ish likelihood true, but, like, it’s a distribution, there’s not a clear dividing line between them”
I also think the pause can be “well, we’re running untrusted AGIs and ~trusted pseudogeneral LLM-agents that help with the philosophical progress, but, we can’t run them that long or fast, they help speed things up and make what’d normally be a 10-30 year pause into a 3-10 year pause, but also the world would be going crazy left to its own devices, and the sort of global institutional changes necessary are still about as far outside the Overton window as a 20 year global moratorium, and the ‘race with China’ rhetoric is still bad.”
Thanks. I’ll probably reply to different parts in different threads.
For the first bit:
My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.
The rough numbers you give are helpful. I’m not 100% sure I see the dots you’re intending to connect with “leadership thinks 1/5-ryan-misalignment and 2x-ryan-totalitarianism” / “rest of alignment science team closer to ryan” → “this explains a lot.”
Is this just the obvious “whelp, leadership isn’t bought into this risk model and calls most of the shots, but you’re in conversations with several employees who engage more with misalignment”? Or was there a more specific dynamic you thought it explained?
Yep, just the obvious. (I’d say “much less bought in” than “isn’t bought in”, but whatever.)
I don’t really have dots I’m trying to connect here, but this feels more central to me than what you discuss. Like, I think “alignment might be really, really hard” (which you focus on) is less of the crux than “is misalignment that likely to be a serious problem at all?” in explaining how Anthropic operates. Another way to put this is that I think “is misalignment the biggest problem” is maybe more of the crux than “is misalignment going to be really, really hard to resolve in some worlds”. I see why you went straight to your belief though.
I don’t really see why this is a crux. I’m currently at like ~5% on this claim (given my understanding of what you mean), but moving to 15% or even 50% (while keeping the rest of the distribution the same) wouldn’t really change my strategic orientation. Maybe you’re focused on getting to a world with a more acceptable level of risk (e.g., <5%), but I think going from 40% risk to 20% risk is better to focus on.
I think you kinda convinced me here this reasoning isn’t (as stated) very persuasive.
I think my reasoning had some additional steps like:
when I’m 15% on ‘alignment might be philosophically hard’, I still expect to maybe learn more and update to 90%+, and it seems better to pursue strategies that don’t actively throw that world under the bus. (and, while I don’t fully understand the Realpolitik, it seems to me that Anthropic could totally be pursuing strategies that achieve a lot of its goals without Policy Comms that IMO actively torch the “long pause” worlds)
you are probably right that I was more oriented around “getting to like 5% risk” than reducing risk on the margin.
I’m probably partly just not really visualizing what it’d be like to be a 15%-er and bringing some bias in.
Not sure how interesting this is to discuss, but I don’t think I agree with this. Stuff they’re doing does seem harmful to worlds where you need a long pause, but feels like at the very least Anthropic is a small fraction of the torching right? Like if you think Anthropic is making this less likely, surely they are a small fraction of people pushing in this direction such that they aren’t making this that much worse (and can probably still pivot later given what they’ve said so far).