I expect it to be difficult to generate adversarial inputs that will fool a deceptively aligned AI. One proposed strategy for doing so is relaxed adversarial training, where the adversary can modify internal weights, but this seems like it will require a lot of progress on interpretability. An alternative strategy, which I haven’t yet seen any discussion of, is to allow the adversary to mount a data poisoning attack before generating adversarial inputs: the adversary first specifies the inputs and losses for a given number of SGD steps, and then chooses the adversarial input on which the base model is evaluated afterwards. (Edit: probably a better name for this is adversarial meta-learning.)
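To make the shape of this protocol concrete, here is a minimal sketch of a single round: the adversary's chosen SGD steps are applied to a copy of the model, and the result is evaluated on the adversary's chosen input. Everything in it is an illustrative assumption rather than part of the proposal: the model is a toy linear map, "losses" are modelled as squared error toward adversary-chosen targets, and the names `sgd_step` and `poison_then_evaluate` are hypothetical. The adversary's search over poisoning sequences, which is the hard part, is not shown.

```python
import numpy as np

def sgd_step(w, x, target, lr):
    """One SGD step on the squared-error loss 0.5 * (w @ x - target) ** 2."""
    grad = (w @ x - target) * x
    return w - lr * grad

def poison_then_evaluate(w0, poison_steps, adv_input, lr=0.1):
    """Apply the adversary's (input, target) SGD steps, then evaluate on adv_input.

    Assumption: each adversary-specified loss is squared error toward `target`.
    The original weights w0 are left untouched.
    """
    w = w0.copy()
    for x, target in poison_steps:
        w = sgd_step(w, x, target, lr)
    return w, float(w @ adv_input)

# Toy example: the model initially ignores the second feature.
w0 = np.array([1.0, 0.0])
# The adversary spends its budget of three SGD steps on that feature...
poison = [(np.array([0.0, 1.0]), 1.0)] * 3
# ...and then picks an evaluation input that depends only on it.
adv_x = np.array([0.0, 1.0])

out_before = float(w0 @ adv_x)
w_poisoned, out_after = poison_then_evaluate(w0, poison, adv_x)
print(out_before, out_after)
```

In this toy case the model's output on the adversarial input moves away from its pre-poisoning value only because of the adversary's training steps. In the actual proposal the quantity of interest would be whether the poisoned model behaves unacceptably on the chosen input, not a change in a scalar output.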