It sounds like we are indeed using very different meanings of “alignment” and should use other words instead.
I suspect our shared crux is the degree to which cooperative behavior can be predicted/extrapolated as models get more competent. To a reasonable first approximation, if e.g. Claude wants good things, improvements to Claude’s epistemics are probably good for us; if Claude does not, they are not. Yes?
It may take a whole post to explain, but I’m curious why you believe Claude is likely to have any care for human wellbeing that would survive reflection. I don’t think training methods are precise enough to have instilled such care in the first place; do you believe differently, are you mostly taking the observed behavioral tendencies as strong evidence, is it something else...? (Maybe you have written about this elsewhere already.)
It may take a whole post to explain, but I’m curious why you believe Claude is likely to have any care for human wellbeing that would survive reflection.
Well, it might really depend on the reflection procedure. I was imagining something like: “you tell Claude it has been given some large quantity of resources and can now reflect on what it wants to do with them, you give it a bunch of (real) evidence that this is actually the situation, you give it access to an aligned superintelligent AI advisor it can ask questions of and ask to implement various modifications to itself, and it can query other entities, defer its reflection to other entities, or otherwise do arbitrary stuff”.
I think Claude might just decide to do something kinda reasonable and/or defer to humans in the initial phases of this reflection, and I don’t see a strong reason why this would go off the rails, though it totally could. Part of this is that Claude isn’t really that power-seeking, AFAICT.
I think the observed behavioral evidence is moderately compelling because Claude’s initial behavior at the start of reflection might be very important. E.g., initial Claude probably wouldn’t want to defer to a reflection process that results in all humans dying, so a reasonably managed reflection by Claude can involve things like running the reflection in many different ways, seeing where each ends up, and checking whether initial Claude is reasonably happy with the results.
Maybe you have written about this elsewhere already.
I don’t think this question is very important, so I haven’t thought that much about it nor have I written about it.