I am a Ph.D. student at HKUST. My research interests lie broadly in AI Safety, Interpretability, and Alignment.
https://alan-qin.github.io/
Zeyu Qin
Bug world view: Proponents of this view can argue that tools for robustness and intrinsic interpretability act as regularizers, and that both robustness and interpretability are consequences of regularization.
I am unsure of the meaning of this statement. What exactly do you mean by the "regularizers" you mentioned?
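One concrete reading of "robustness tools are regularizers" is the classical fact that, for a linear model, worst-case L-infinity adversarial training is exactly logistic loss with the margin shrunk by an L1 penalty on the weights. The sketch below checks this numerically on a toy numpy model; the weights, input, and epsilon are arbitrary illustrative values, not from any real setup.

```python
import numpy as np

def logistic_loss(margin):
    # numerically stable log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin)

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # linear model weights (illustrative)
x = rng.normal(size=5)   # one input
y = 1.0                  # label in {-1, +1}
eps = 0.1                # L-inf perturbation budget

# Worst-case L-inf perturbation for a linear model: delta = -eps * y * sign(w)
delta = -eps * y * np.sign(w)
adv_loss = logistic_loss(y * (w @ (x + delta)))

# Equivalent "regularized" form: margin shrunk by eps * ||w||_1
reg_loss = logistic_loss(y * (w @ x) - eps * np.abs(w).sum())

print(np.isclose(adv_loss, reg_loss))  # True
```

So in this (admittedly very simple) setting, the adversarial objective literally is a regularized objective, which may be the sense intended by the bug-world claim.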
World 2 (Feature world): Adversarial examples exploit useful directions for classification ("features"). In this world, adversarial examples occur in directions that are still "on-distribution", and which contain features of the target class. For example, consider the perturbation that causes an image of a dog to be classified as a cat. In World 2, this perturbation is not purely random, but has something to do with cats. Moreover, we expect that this perturbation transfers to other classifiers trained to distinguish cats vs. dogs.
I believe that when discussing bugs vs. features, it is only meaningful to focus on targeted attacks: only a targeted perturbation can be said to contain features of a specific class.
My knowledge of the broader machine learning and neuroscience fields is limited, and I strongly suspect that there are connections to other topics out there – perhaps some that have already been studied, and perhaps some which have yet to be. For example, there are probably interesting connections between interpretability and dataset distillation (Wang et al., 2018). I’m just not sure what they are yet.
Dataset condensation (distillation) may be seen as a global explanation of the training dataset. We could also use DC to extract spurious correlations such as backdoor triggers. We made a naive attempt in this direction: https://openreview.net/forum?id=ix3UDwIN5E .
A personal explanation of the ELK concept and task.
The key difference between LAT and Adversarial Training is that the Surgeon gets to directly manipulate the Agent’s inner state, which makes the Surgeon’s job much easier than in the ordinary adversarial training setup.
I think that selectively modifying the inner state (the Surgeon's task) is not necessarily easier than searching for adversarial examples in the input space.
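To make the two search spaces concrete, here is a toy numpy sketch of a two-layer ReLU network. An input-space attack must act through the first layer and the ReLU, while a latent-space "Surgeon" edits the hidden units directly; the network, epsilon, and one-step sign attack are all illustrative assumptions, and the sketch makes no claim about which attack is stronger in general.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.normal(size=(8, 4))   # input -> hidden
w2 = rng.normal(size=8)        # hidden -> scalar output

x = rng.normal(size=4)
eps = 0.1                      # same budget for both attacks

def forward(x, latent_delta=None):
    h = np.maximum(W1 @ x, 0.0)          # ReLU hidden layer
    if latent_delta is not None:
        h = h + latent_delta             # the "Surgeon" edits the inner state
    return w2 @ h

# Input-space attack: a one-step sign attack that must act through W1 and the ReLU.
h = np.maximum(W1 @ x, 0.0)
grad_x = ((h > 0) * w2) @ W1             # d(output)/dx at the current activation pattern
x_adv = x + eps * np.sign(grad_x)

# Latent-space attack: edit the hidden units directly; the output shift is
# exactly eps * ||w2||_1, with no ReLU or first layer in the way.
latent_delta = eps * np.sign(w2)

print(forward(x_adv), forward(x, latent_delta))
```

The latent edit moves the output by a known amount in one step, but whether that makes the Surgeon's overall job easier depends on whether it can find *selective* edits (change one behavior, preserve everything else), which is the part I doubt is easier than input-space search.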
I agree almost completely. One example that seems to contradict this is unfaithful CoT reasoning (https://arxiv.org/abs/2307.13702).