Thanks for laying this out thus far. I’mma reply, but understand if you wanna leave the convo here. I would be interested in more effortpost/dialogue about your thoughts here.
Yes, my reasoning is definitely part, but not all, of the argument. Like, the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general, I would put much more weight on “extreme philosophical competence”.)
This makes sense as a crux for the claim “we need philosophical competence to align unboundedly intelligent superintelligences.” But it doesn’t make sense for the claim “we need philosophical competence to align general, open-ended intelligence.” I suppose my OP didn’t really distinguish these claims, and there were a few interpretations of how the arguments fit together. I was more saying the second (although, to be fair, I’m not sure I was actually distinguishing them well in my head until now).
It doesn’t make sense for “we ‘just’ need to be able to hand off to an AI which is seriously aligned” to be a crux for the second. A thing can’t be a crux for itself.
I notice my “other-guy-feels-like-they’re-missing-the-point” → “check if I’m not listening well, or if something is structurally wrong with the convo” alarm is firing, so maybe I do want to ask for one last clarification: “Did you feel like you understood this the first time? Does it feel like I’m missing the point of what you said? Do you think you understand why it feels to me like you were missing the point (even if you think it’s because I’m being dense about something)?”
Takes on your proposal
Meanwhile, here are some takes based on my current understanding of your proposal.
These bits:
We need to ensure that our countermeasures aren’t just shifting from a type of misalignment we can detect to a type we can’t. Qualitatively analyzing the countermeasures and our tests should help here.
...is a bit I think is philosophical-competence bottlenecked. And this bit:
“Actually, we didn’t have any methods available to try which could end up with a model that (always) isn’t egregiously misaligned. So, even if you can iterate a bunch, you’ll just either find that nothing works or you’ll just fool yourself.”
...is a mix of “philosophically bottlenecked” and “rationality bottlenecked.” (i.e. you both have to be capable of reasoning about whether you’ve found things that really worked, and, because there are a lot of degrees of freedom, capable of noticing if you’re deploying that reasoning accurately)
I might buy that you and Buck are competent enough here to think clearly about it (not sure; I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decision-makers being philosophically competent enough.
(I think at least some people on the alignment science or interpretability teams might be. I bet against the median such team members being able to navigate it. And ultimately, what matters is “does Anthropic leadership go forward with the next training run”, so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people. And Anthropic leadership already seems to basically be ignoring arguments of this type, and I don’t actually expect to get the sort of empirical clarity that (it seems like) they’d need to update before it’s too late.)
Second, we can study how generalization on this sort of thing works in general
I think this counts as the sort of empiricism I’m somewhat optimistic about in my post. I.e., if you are able to find experiments that actually give you evidence about deeper laws, that let you then make predictions about new Actually Uncertain questions of generalization that you then run more experiments on… that’s the sort of thing I feel optimistic about. (Depending on the details, of course.)
But, you still need technical philosophical competence to know if you’re asking the right questions about generalization, and to know when the results actually imply that the next scale-up is safe.
This makes sense as a crux for the claim “we need philosophical competence to align unboundedly intelligent superintelligences.” But, it doesn’t make sense for the claim “we need philosophical competence to align general, open-ended intelligence.”
I was thinking of a slightly broader claim: “we need extreme philosophical competence”. If I thought we had to use human labor to align wildly superhuman AIs, I would put much more weight on “extreme philosophical competence is needed”. I agree that “we need philosophical competence to align any general, open-ended intelligence” isn’t affected by the level of capability at handoff.
I might buy that you and Buck are competent enough here to think clearly about it (not sure; I think you benefit from having a number of people around who seem likely to help), but I would bet against Anthropic decision-makers being philosophically competent enough.
I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is “careful conceptual thinking might be required rather than pure naive empiricism (because we won’t have good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this thinking” and the bailey is “extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed”.
I buy the motte here, but not the bailey. I think the motte is a substantial discount on Anthropic from my perspective, but I’m kinda sympathetic to where they are coming from. (Getting conceptual stuff and futurism right is real hard! How would they know who to trust among people disagreeing wildly!)
And ultimately, what matters is “does Anthropic leadership go forward with the next training run”, so it matters whether Anthropic leadership buys arguments from hypothetically-competent-enough alignment/interpretability people.
I don’t think “does Anthropic stop (at the right time)” is the majority of the relevance of careful conceptual thinking from my perspective. Probably more of it is “do they do a good job allocating their labor and safety research bets”. This is because I don’t think they’ll have very much lead time if any (median −3 months), and takeoff will probably be slower than the amount of lead time if any, so pausing won’t be as relevant. Correspondingly, pausing at the right time isn’t the biggest deal relative to other factors, though it does seem very important at an absolute level.
I think there might be a bit of a (presumably unintentional) motte and bailey here where the motte is “careful conceptual thinking might be required rather than pure naive empiricism (because we won’t be given good enough test beds by default) and it seems like Anthropic (leadership) might fail heavily at this” and the bailey is “extreme philosophical competence (e.g. 10-30 years of tricky work) is pretty likely to be needed”.
Yeah I agree that was happening somewhat. The connecting dots here are “in worlds where it turns out we need a long Philosophical Pause, I think you and Buck would probably be above some threshold where you notice and navigate it reasonably.”
I think my actual belief is “the motte is high likelihood true; the bailey is… medium-ish likelihood true, but, like, it’s a distribution, there’s not a clear dividing line between them.”
I also think the pause can be “well, we’re running untrusted AGIs and ~trusted pseudogeneral LLM-agents that help with the philosophical progress, but we can’t run them that long or fast; they help speed things up and make what’d normally be a 10-30 year pause into a 3-10 year pause. But also the world would be going crazy left to its own devices, the sort of global institutional changes necessary are still similarly outside the Overton window as a 20-year global moratorium, and the ‘race with China’ rhetoric is still bad.”