I’m hoping for a solution that scales indefinitely if you hold the AI design fixed. In practice you’d face a sequence of different AI alignment problems (one for each technique), and I don’t expect the solution to handle all of those, just one; i.e., even if you solved alignment for the current technique, you could still easily die because your AI failed to solve the next iteration of the AI alignment problem.
Arguing that this wouldn’t be the case, by pointing to a clear place where my proposal tops out, definitely counts for the purposes of the prize. I do think that a significant fraction of my EV comes from the case where my approach can’t get you all the way because it tops out somewhere, but if I’m convinced that it tops out somewhere I’ll still feel way more pessimistic about the scheme.
By “AI design” I assume you’re referring to the learning algorithm and the runtime/inference algorithm of the agent A in the amplification scheme.
In that case, I hadn’t thought of the scheme as only needing to work with respect to a fixed learning algorithm. Maybe it’s possible (and useful) to reason about limited versions that are corrigible with respect to some simple current technique, just not very competent.