I like this overall direction for how simple and robust it is. One challenge I see is that the latent capability for misaligned behavior is still deeply ingrained in the model, and a bad actor could abuse it even if the model itself doesn’t. For example, a user could make the model simulate a misaligned human/AI and use the simulated output to drive a local agent chassis. One way around this would be to simply not show misaligned output to the user, but this wouldn’t defend against cases where someone (e.g. a hacker or an employee) gets access to the raw model. A more robust idea is to make the simulated humans/AIs fundamentally less capable than the aligned AI. Assuming that the aligned AI is more intelligent than all misaligned humans/AIs (if this weren’t true, we would have bigger problems), the aligned AI only needs to simulate a misaligned human/AI at that entity’s own (lower) capability level to model the world accurately. This could be done either by ensuring that all <AI_quoting_human/AI> examples stay strictly at or below the quoted entity’s capability, or by training a less capable “quote model”.
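To make the first option a bit more concrete, here is a minimal sketch of filtering quoting examples by a capability cap. Everything in it is an assumption for illustration: the `QuoteExample` type, the numeric `capability` score, the `entity_capability` table, and the entity names are hypothetical, and I’m not claiming capability can actually be scored this cleanly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuoteExample:
    """One hypothetical <AI_quoting_human/AI> training example."""
    quoted_entity: str   # who is being quoted (hypothetical label)
    text: str            # the quoted output
    capability: float    # assumed score for how capable the quoted text is


def filter_quote_examples(examples, entity_capability, default_cap=0.0):
    """Keep only examples at or below the quoted entity's capability cap.

    Entities missing from `entity_capability` fall back to `default_cap`,
    so unknown entities are dropped unless their example scores at or below it.
    """
    kept = []
    for ex in examples:
        cap = entity_capability.get(ex.quoted_entity, default_cap)
        if ex.capability <= cap:
            kept.append(ex)
    return kept


# Hypothetical data: scores and caps are made up for the illustration.
examples = [
    QuoteExample("misaligned_human", "a simple, easily noticed plan", 0.2),
    QuoteExample("misaligned_human", "an expert-level plan", 0.9),
    QuoteExample("misaligned_ai", "a moderately sophisticated plan", 0.6),
    QuoteExample("unknown_entity", "anything at all", 0.1),
]
entity_capability = {"misaligned_human": 0.3, "misaligned_ai": 0.7}

kept = filter_quote_examples(examples, entity_capability)
```

The second option (a separate, less capable “quote model”) would enforce the cap in the architecture rather than in the data, which seems more robust to raw-model access but harder to sketch in a few lines.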