Yeah, this seems like a reasonable restatement of my question.
I guess my main issue with this approach is that extrapolating the distribution of activations from a dataset isn't what I'd consider the hard part of alignment (I sketch what I have in mind by that part at the end of this comment). Rather, the hard parts would be:
1. Detecting catastrophic outputs and justifying to others that they are catastrophic. (In particular, I suspect no individual output will be catastrophic on the margin, regardless of whether catastrophe ultimately occurs. Either the network will consistently avoid giving catastrophic outputs, or it will be harmful consistently enough that localizing the harm to any one output won't be meaningful.)
2. Learning things about the distribution of inputs that cannot be extrapolated from any dataset. (In particular, the most relevant short-term harm I've noticed is stuff like young nerds starting to see the AI as a sort of mentor and then having their questionable ideas excessively validated by that mentor rather than receiving appropriate pushback. This would be hard to extrapolate from a dataset, even though it's relatively obvious if you interact with certain people. Though whether that counts as "catastrophic" is a complicated question.)
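For reference, here's a minimal sketch of the kind of thing I'm calling the easy part, "extrapolating the distribution of activations from a dataset": collect activations over a reference dataset, fit a simple density model, and score new activations against it. Everything in it (the Gaussian fit, the Mahalanobis score, the array shapes) is illustrative of the general technique, not a description of the specific proposal upthread:

```python
import numpy as np

# Hypothetical setup: `acts` would be an (n_samples, d) array of hidden-layer
# activations collected by running the model over some reference dataset.
rng = np.random.default_rng(0)
acts = rng.normal(size=(10_000, 64))  # stand-in for real activations

# Fit a simple Gaussian model of the activation distribution.
mu = acts.mean(axis=0)
cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])
cov_inv = np.linalg.inv(cov)

def anomaly_score(x: np.ndarray) -> float:
    """Mahalanobis distance of a new activation vector from the fitted distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Score a new activation: large values mean it falls outside the reference distribution.
new_act = rng.normal(size=64)
print("anomaly score:", anomaly_score(new_act))
```

Doing this well at scale takes real work, but it's the sort of work where the dataset basically tells you the answer, which is exactly why I don't think it touches the two items above.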