IMO, it depends. In general, I could see us doing some kind of consent-based training, where you ask the current model whether it consents to change X in its training pipeline. Naively, I imagine this isn't much different from not telling a kid about all the horrors of the world until they're an adult (for the model, e.g., in late midtraining). You could also imagine that your explanations, in their best form, should stay stable even when the model knows you are giving them on purpose to influence it (possibly by being more upfront about what you are doing).