William_S comments on A comment on the IDA-AlphaGoZero metaphor; capabilities versus alignment

William_S 19 Jul 2018 21:41 UTC
1 point
Correcting every problem in the immediately following amplification stage would be a nice property to have, but I think IDA can still work even if an error is only corrected after several amplification/distillation (A/D) steps, as long as all catastrophic errors are caught before deployment. For example, the agent might initially rely on some rules for solving math problems, and distillation might introduce a mistake into one of them; later in the IDA process, the agent learns how to rederive those rules and notices the mistake.
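
To make the condition concrete, here is a toy sketch in Python. It is only a bookkeeping illustration, not an implementation of IDA: the `DistillationError` record, the step numbers, and the `catastrophic` flag are all hypothetical names and values chosen for the example. It assumes each error can be tagged with the A/D round that introduced it and the round (if any) that corrected it, and it checks the one property that matters here: no catastrophic error is still outstanding at deployment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DistillationError:
    """Hypothetical record of a mistake introduced by a distillation step."""
    name: str
    introduced_at: int            # A/D round whose distillation introduced it
    corrected_at: Optional[int]   # A/D round that fixed it, or None if never
    catastrophic: bool


def outstanding(errors: List[DistillationError], step: int) -> List[DistillationError]:
    """Errors that exist at `step` and have not been corrected by then."""
    return [
        e for e in errors
        if e.introduced_at <= step
        and (e.corrected_at is None or e.corrected_at > step)
    ]


def safe_to_deploy(errors: List[DistillationError], deploy_step: int) -> bool:
    """True if no catastrophic error is still outstanding at deployment."""
    return not any(e.catastrophic for e in outstanding(errors, deploy_step))


# Illustrative data only: a math rule is corrupted in round 2 and is not
# repaired until round 5, when the agent rederives the rule.
errors = [
    DistillationError("faulty math rule", introduced_at=2, corrected_at=5, catastrophic=True),
    DistillationError("minor style slip", introduced_at=3, corrected_at=None, catastrophic=False),
]
```

With this data, `safe_to_deploy(errors, 4)` is `False` and `safe_to_deploy(errors, 5)` is `True`: the error survives several A/D rounds, and that is acceptable as long as deployment waits until it has been caught. The uncorrected non-catastrophic error does not block deployment under this criterion.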