Hmmm… this is a subtle distinction and both definitions seem pretty reasonable to me. I guess I feel like I want “good things happen” to be part of capabilities (e.g. is the model capable of doing the things we want it to do) rather than alignment, making (impact) alignment more about not doing stuff we don’t want.
Wouldn’t outcome-based “not doing bad things” impact alignment still run into that capabilities issue? “Not doing bad things” requires serious capabilities for some goals (e.g. sparse but initially achievable goals).
In any case, you can say “I think that implementing strong capabilities + strong intent alignment is a good instrumental strategy for impact alignment”, which seems compatible with the distinction you seek?