I am a Ph.D. student at HKUST. My research interests lie broadly in AI Safety, Interpretability, and Alignment.
https://alan-qin.github.io/
Zeyu Qin
Bug world view: Proponents of this view can argue that tools for robustness and intrinsic interpretability act as regularizers, and that both robustness and interpretability are consequences of regularization.
I am unsure of the meaning of this statement. What exactly do you mean by the "regularizers" you mentioned?
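One concrete reading of "robustness tools are regularizers" is the classical fact that, for a linear model, worst-case L-infinity adversarial training is exactly logistic loss with the margin shrunk by an L1 penalty on the weights. The sketch below checks this numerically on a toy numpy model; the weights, input, and epsilon are arbitrary illustrative values, not from any real setup.

```python
import numpy as np

def logistic_loss(margin):
    # numerically stable log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin)

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # linear model weights (illustrative)
x = rng.normal(size=5)   # one input
y = 1.0                  # label in {-1, +1}
eps = 0.1                # L-inf perturbation budget

# Worst-case L-inf perturbation for a linear model: delta = -eps * y * sign(w)
delta = -eps * y * np.sign(w)
adv_loss = logistic_loss(y * (w @ (x + delta)))

# Equivalent "regularized" form: margin shrunk by eps * ||w||_1
reg_loss = logistic_loss(y * (w @ x) - eps * np.abs(w).sum())

print(np.isclose(adv_loss, reg_loss))  # True
```

So in this (admittedly very simple) setting, the adversarial objective literally is a regularized objective, which may be the sense intended by the bug-world claim.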
World 2 (Feature world): Adversarial examples exploit useful directions for classification ("features"). In this world, adversarial examples occur in directions that are still "on-distribution", and which contain features of the target class. For example, consider the perturbation that causes an image of a dog to be classified as a cat. In World 2, this perturbation is not purely random, but has something to do with cats. Moreover, we expect that this perturbation transfers to other classifiers trained to distinguish cats vs. dogs.
I believe that when discussing bugs vs. features, it is only meaningful to focus on targeted attacks: only a targeted perturbation can be said to contain features of a specific class.
My knowledge of the broader machine learning and neuroscience fields is limited, and I strongly suspect that there are connections to other topics out there – perhaps some that have already been studied, and perhaps some which have yet to be. For example, there are probably interesting connections between interpretability and dataset distillation (Wang et al., 2018). I’m just not sure what they are yet.
Dataset condensation (distillation) may be seen as a global explanation of the training dataset. We could also use DC to extract spurious correlations such as backdoor triggers. We made a naive attempt in this direction: https://openreview.net/forum?id=ix3UDwIN5E .
A personal explanation of the ELK concept and task.
The key difference between LAT and Adversarial Training is that the Surgeon gets to directly manipulate the Agent’s inner state, which makes the Surgeon’s job much easier than in the ordinary adversarial training setup.
I think that selectively modifying the inner state (the Surgeon's task) is not necessarily easier than searching for adversarial examples in the input space.
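To make the two search spaces concrete, here is a toy numpy sketch of a two-layer ReLU network. An input-space attack must act through the first layer and the ReLU, while a latent-space "Surgeon" edits the hidden units directly; the network, epsilon, and one-step sign attack are all illustrative assumptions, and the sketch makes no claim about which attack is stronger in general.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(8, 4))   # input -> hidden
w2 = rng.normal(size=8)        # hidden -> scalar output

x = rng.normal(size=4)
eps = 0.1                      # same budget for both attacks

def forward(x, latent_delta=None):
    h = np.maximum(W1 @ x, 0.0)          # ReLU hidden layer
    if latent_delta is not None:
        h = h + latent_delta             # the "Surgeon" edits the inner state
    return w2 @ h

# Input-space attack: a one-step sign attack that must act through W1 and the ReLU.
h = np.maximum(W1 @ x, 0.0)
grad_x = ((h > 0) * w2) @ W1             # d(output)/dx at the current activation pattern
x_adv = x + eps * np.sign(grad_x)

# Latent-space attack: edit the hidden units directly; the output shift is
# exactly eps * ||w2||_1, with no ReLU or first layer in the way.
latent_delta = eps * np.sign(w2)

print(forward(x_adv), forward(x, latent_delta))
```

The latent edit moves the output by a known amount in one step, but whether that makes the Surgeon's overall job easier depends on whether it can find *selective* edits (change one behavior, preserve everything else), which is the part I doubt is easier than input-space search.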
I agree almost completely. One example that seems to contradict this is unfaithful CoT reasoning (https://arxiv.org/abs/2307.13702).