IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it’s running.
I don’t see how the “we can still correct once it’s running” part can be true given this footnote:
However, I think at some point we will probably have the AI system autonomously execute the distillation and amplification steps or otherwise get outcompeted. And even before that point we might find some other way to train the AI in breaking down tasks that doesn’t involve human interaction.
After a certain point, it seems like the thing overseeing the AI system is another AI system, and saying that “we” can correct the first AI system seems like a confusing way to describe the situation. Have I understood this correctly? What do you think?
One interpretation is that even if the AI is autonomously executing distillation/amplification steps, the relevant people are still able to say “hold on, we need to modify your algorithm” and have the AI actually let us correct it.