Second is robust-to-training transparency: transparency tools robust enough that they continue to function and give correct answers even if we train directly on their output.[4] For example, if our transparency tools are robust-to-training and they tell us that our model cares about gold coins, then adding “when we apply our transparency tools, they tell us that the model doesn’t care about gold coins” to our training loss shouldn’t produce a model that merely looks to our tools like it no longer cares about gold coins while actually still caring about them.
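To make the “adding to our training loss” step concrete, here is a minimal sketch in PyTorch. Everything specific in it is an assumption for illustration: in particular, `cares_about_gold_coins` is a made-up, differentiable placeholder for a transparency tool (building a real one is the open problem), and the penalty weight is arbitrary.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a transparency tool: a differentiable probe
# that scores how much the model appears to "care about gold coins".
# Purely illustrative -- real transparency tools are an open research problem.
def cares_about_gold_coins(model: nn.Module) -> torch.Tensor:
    return model[0].weight.abs().mean()

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 4), torch.randn(32, 1)  # dummy task data

for step in range(100):
    task_loss = nn.functional.mse_loss(model(x), y)
    # Training directly on the tool's output: penalize the model for
    # *looking like* it cares about gold coins. Robust-to-training
    # transparency is the property that this pressure can't push the model
    # into reporting "doesn't care" to the tool while secretly still caring.
    transparency_penalty = cares_about_gold_coins(model)
    loss = task_loss + 0.1 * transparency_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The failure mode the post is worried about is exactly what naive gradient descent on this combined loss would tend to produce: the optimizer can reduce `transparency_penalty` by fooling the probe rather than by changing what the model actually cares about.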
Checking my understanding here (I’m pretty confident I’m misunderstanding a bit): is “robust-to-training transparency” the case where you have a transparency tool that can verify what the model is optimizing for (or cares about), and, if that’s not the thing we want (or not how we want the model to optimize for it), we can train on the tool’s outputs to basically tell the model, “hey, you should care about x and not y”? And is the “robust” part the claim that, after training on the outputs, the model really does care about x, and that the transparency tool hasn’t been optimized against to the point where it no longer works?
If the above is right, is part of the difficulty here that we’re assuming we even know where exactly to point the model while training on the outputs?