You would have to think you can somehow ‘recover’ the lost alignment later in the process.
Are you actually somehow unaware of the literature on Value Learning and AI-assisted Alignment, or just so highly skeptical of it that you’re here pretending for rhetorical effect that it doesn’t exist? The entire claim of Value Learning is that if you start off with good enough alignment, you can converge from there to true alignment, and that this process improves as your AIs scale up. Given that the previous article in your sequence is on AI-assisted Alignment, it’s clear that you do in fact understand the concept of how one might hope that this ‘recovery’ could happen. So perhaps you might consider dropping the word ‘somehow’, and instead expending a sentence or two on acknowledging the existence of the idea and then outlining why you’re not confident it’s workable?