It seems like the reason Claude’s level of misalignment is fine is that its capabilities aren’t very good, and there’s not much/any reason to assume it’d be fine if you held alignment constant but dialed up capabilities.
Do you not think that?
(I don’t really see why it’s relevant how aligned Claude is if we’re not thinking about that as part of it)
it’d be fine if you held alignment constant but dialed up capabilities.
I don’t know what this means so I can’t give you a prediction about it.
I don’t really see why it’s relevant how aligned Claude is if we’re not thinking about that as part of it
I just named three reasons:
1. Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that “the doomers were right”).
2. Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low.
3. Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don’t see this, which is evidence against that particular threat model.
Is it relevant to the object-level question of “how hard is aligning a superintelligence”? No, not really. But people are often talking about many things other than that question.
For example, is it relevant to “how much should I defer to doomers”? Yes absolutely (see e.g. #1).
the premise that i’m trying to take seriously for this thought experiment is, what if the “claude is really smart and just a little bit away from agi” people are totally right, so that you just need to dial up capabilities a little bit more rather than a lot more, and then it becomes very reasonable to say that claude++ is about as aligned as claude.
(again, i don’t think this is a very likely assumption, but it seems important to work out what the consequences of this set of beliefs being true would be)
or at least, conditional on (a) claude is almost agi and (b) claude is mostly aligned, it seems like quite a strong claim to say “claude++ crosses the agi (= can kick off rsi) threshold at basically the same time it crosses the ‘dangerous-core-of-generalization’ threshold, so that’s also when it becomes super dangerous.” it’s way stronger a claim than “claude is far away from being agi, we’re going to make 5 breakthroughs before we achieve agi, so who knows whether agi will be anything like claude.” or, like, sure, the agi threshold is a pretty special threshold, so it’s reasonable to privilege this hypothesis a little bit, but when i think about the actual stories i’d tell about how this happens, it just feels like i’m starting from the bottom line first, and the stories don’t feel like the strongest part of my argument.
(also, i’m generally inclined towards believing alignment is hard, so i’m pretty familiar with the arguments for why aligning current models might not have much to do with aligning superintelligence. i’m not trying to argue that alignment is easy. or like i guess i’m arguing X->alignment is easy, which if you accept it, can only ever make you more likely to accept that alignment is easy than if you didn’t accept the argument, but you know what i mean. i think X is probably false but it’s plausible that it isn’t and importantly a lot of evidence will come in over the next year or so on whether X is true)
nod. I’m not sure I agreed with all the steps there, but I agree with the general approach of “accept the premise that claude is just a bit away from AGI, and is reasonably aligned, and see where that goes when you look at each next step.”
I think you are saying something that shares at least some structure with Buck’s comment that
It seems like as AIs get more powerful, two things change:
They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
They get better, so their wanting to kill you is more of a problem.
I don’t see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over
(But where you’re pointing at two different sets of properties that may not arise at the same time)
I’m actually not sure I get what the two properties you’re talking about are, though. Seems like you’re contrasting “claude++ crosses the agi (= can kick off rsi) threshold” with “crosses the ‘dangerous-core-of-generalization’ threshold”
I’m confused because I think the word “agi” basically does mean “cross the core-of-generalization threshold” (which isn’t immediately dangerous, but, puts us into “things could quickly get dangerous at any time” territory)
I do agree that “able to do a loop of RSI” doesn’t intrinsically mean “agi” or “core-of-generalization”; there could be narrow skills for doing a loop of RSI. I’m not sure if you meant more “non-agi RSI”, or if you see something different between “AGI” and “core-of-generalization”, or if you think there’s a particular “dangerous core-of-generalization” separate from AGI.
(I think “the sharp left turn” is when the core-of-generalization starts to reflect on what it wants, which might come immediately after a core-of-generalization, but could also come after narrow-introspection + adhoc agency, or might just take a while for it to notice)
((I can’t tell if this comment is getting way more in the weeds than is necessary, but, it seemed like the nuances of exactly what you meant were probably loadbearing))