Note: Even if we have a smart agent which cares about diamonds and knows about value drift, it might “bend to temptation” and drift anyway. I have had several experiences where I thought “don’t open this webpage; in this kind of situation, it will cause value drift via an unendorsed reward event.” Sometimes this thought works. Sometimes it doesn’t.
Also, even if the AI can sandbox its future changes and inspect them, not all value drift events will be immediately apparent. For example, maybe the AI undergoes a batch update after which AI-prime would no longer pursue diamonds upon seeing a red object (this is importantly unrealistic, but I 70%-expect I could find a better example if I tried). The AI would be vulnerable to such errors if it lacked sufficient mechanistic self-interpretability (I expect it to have at least some). Of course, the AI would probably know about this failure mode and take precautions as well; this just makes the AI’s self-improvement job (at least) a bit harder.
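To make the sandbox-and-inspect idea concrete, here is a minimal sketch in Python. Everything in it (ToyPolicy, Probe, safe_to_commit, the red-object update) is hypothetical scaffolding I made up for illustration, not a real API or the actual proposal: a candidate update is applied to a sandboxed copy of the policy, and behavioral probes check that diamond-pursuit is preserved before the update is committed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of sandboxed update-checking. All names are illustrative.

@dataclass(frozen=True)
class ToyPolicy:
    """Stand-in for the AI: maps an observation to an action."""
    act: Callable[[Dict], str]

    def apply_update(self, update: Callable[["ToyPolicy"], "ToyPolicy"]) -> "ToyPolicy":
        # "Sandbox": the update yields a new policy; the original is untouched.
        return update(self)

@dataclass(frozen=True)
class Probe:
    name: str
    observation: Dict

def safe_to_commit(policy: ToyPolicy, update, probes: List[Probe]) -> bool:
    candidate = policy.apply_update(update)
    for probe in probes:
        if candidate.act(probe.observation) != "pursue_diamonds":
            print(f"drift detected on probe {probe.name!r}; rejecting update")
            return False
    # Passing every probe is necessary but not sufficient: drift triggered
    # only by unprobed inputs (the red-object case, if untested) slips through.
    return True

# A buggy batch update: the updated policy balks at red objects.
def batch_update(old: ToyPolicy) -> ToyPolicy:
    return ToyPolicy(act=lambda obs: "do_nothing" if obs.get("color") == "red"
                     else old.act(obs))

base = ToyPolicy(act=lambda obs: "pursue_diamonds")
probes = [Probe("blue object", {"color": "blue"})]    # suite misses the failure
print(safe_to_commit(base, batch_update, probes))      # True  (drift missed)
probes.append(Probe("red object", {"color": "red"}))   # suite covers the failure
print(safe_to_commit(base, batch_update, probes))      # False (drift caught)
```

The first check passes despite the drift because no probe happens to show a red object, which is exactly why behavioral inspection alone isn’t enough and some mechanistic self-interpretability would still be needed.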