the premise that i’m trying to take seriously for this thought experiment is, what if the “claude is really smart and just a little bit away from agi” people are totally right, so that you just need to dial up capabilities a little bit more rather than a lot more, and then it becomes very reasonable to say that claude++ is about as aligned as claude.
(again, i don’t think this is a very likely assumption, but it seems important to work out what the consequences of this set of beliefs being true would be)
or at least, conditional on (a) claude is almost agi and (b) claude is mostly aligned, it seems like quite a strong claim to say “claude++ crosses the agi (= can kick off rsi) threshold at basically the same time it crosses the ‘dangerous-core-of-generalization’ threshold, so that’s also when it becomes super dangerous.” it’s a way stronger claim than “claude is far away from being agi, we’re going to make 5 breakthroughs before we achieve agi, so who knows whether agi will be anything like claude.” or, like, sure, the agi threshold is a pretty special threshold, so it’s reasonable to privilege this hypothesis a little bit, but when i think about the actual stories i’d tell about how this happens, it just feels like i’m starting from the bottom line first, and the stories don’t feel like the strongest part of my argument.
(also, i’m generally inclined towards believing alignment is hard, so i’m pretty familiar with the arguments for why aligning current models might not have much to do with aligning superintelligence. i’m not trying to argue that alignment is easy. or, like, i guess i’m arguing “X -> alignment is easy”, which, if you accept it, can only ever make you more likely to accept that alignment is easy than if you didn’t accept the argument, but you know what i mean. i think X is probably false, but it’s plausible that it isn’t, and importantly a lot of evidence will come in over the next year or so on whether X is true)
nod. I’m not sure I agreed with all the steps there, but I agree with the general premise of “accept the premise that claude is just a bit away from AGI, and is reasonably aligned, and see where that goes when you look at each next step.”
I think you are saying something that shares at least some structure with Buck’s comment that
It seems like as AIs get more powerful, two things change:
- They probably eventually get powerful enough that they (if developed with current methods) start plotting to kill you/take your stuff.
- They get better, so their wanting to kill you is more of a problem.
I don’t see strong arguments that these problems should arise at very similar capability levels, especially if AI developers actively try to prevent the AIs from taking over
(But where you’re pointing at two different sets of properties that may not arise at the same time)
I’m actually not sure I get what the two properties you’re talking about are, though. Seems like you’re contrasting “claude++ crosses the agi (= can kick off rsi) threshold” with “crosses the ‘dangerous-core-of-generalization’ threshold”.
I’m confused because I think the word “agi” basically does mean “cross the core-of-generalization threshold” (which isn’t immediately dangerous, but, puts us into “things could quickly get dangerous at any time” territory)
I do agree “able to do a loop of RSI” doesn’t intrinsically mean “agi” or “core-of-generalization”; there could be narrow skills for doing a loop of RSI. I’m not sure if you more meant “non-agi RSI,” or whether you see something different between “AGI” and “core-of-generalization,” or think there’s a particular “dangerous core-of-generalization” separate from AGI.
(I think “the sharp left turn” is when the core-of-generalization starts to reflect on what it wants, which might come immediately after a core-of-generalization, but could also come after narrow-introspection + adhoc agency, or might just take a while for it to notice)
((I can’t tell if this comment is getting way more in the weeds than is necessary, but, it seemed like the nuances of exactly what you meant were probably load-bearing))