This is very interesting for examining what the central examples of the out-of-distribution behavior of a feature would be! It seems to me that needing to regularize by cross-entropy to the original prompt, while a good trick for understanding the “less adversarial” behavior, like all mechinterp work, doesn’t really deal with issues relating to what happens as the world is filled with more adversarial data generated by agents trying to exploit other agents. But it’s interesting for understanding the non-edge-case behavior, which seems useful for doing analysis on how the dataset landed in the model. I wonder what the charts of robust models would look like.
> doesn’t really deal with issues relating to what happens as the world is filled with more adversarial data generated by agents trying to exploit other agents.
Like you say, the work here is mainly aimed at understanding model function in non-adversarial settings. But fluent redteaming is a very closely related topic that we’re working on. In that area, there’s a trend towards using perplexity/cross-entropy filters to slice off a chunk of the attack surface (e.g. https://arxiv.org/abs/2308.14132). If you know that 99.99% of user queries to a chatbot have a cross-entropy below X, then you can set up a simple filter to reject queries with cross-entropy higher than X. So, useful text-based adversarial attacks will very soon start requiring some level of fluency.
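To make the filter idea concrete, here's a minimal sketch. A real deployment would score queries with an actual language model; the character-bigram "model", tiny corpus, and threshold value below are toy stand-ins I'm assuming purely for illustration.

```python
import math
from collections import defaultdict

# Toy stand-in for a real LM: a character-bigram model "trained" on a
# tiny corpus of plausible chatbot queries. In practice you'd use a
# real LM and calibrate the threshold on actual user traffic.
CORPUS = (
    "tell me about the history of the roman empire. "
    "what is the weather like today? "
    "can you help me write an email to my manager? "
    "please summarize this article for me. "
)

counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(CORPUS, CORPUS[1:]):
    counts[a][b] += 1

ALPHABET = "abcdefghijklmnopqrstuvwxyz .?!,"

def mean_cross_entropy(text: str) -> float:
    """Mean per-character cross-entropy (nats) under the bigram model,
    with add-one smoothing over a fixed alphabet."""
    total = 0.0
    for a, b in zip(text, text[1:]):
        row = counts[a]
        denom = sum(row.values()) + len(ALPHABET)
        p = (row[b] + 1) / denom
        total += -math.log(p)
    return total / max(len(text) - 1, 1)

# Illustrative cutoff: would be calibrated so ~99.99% of real queries pass.
THRESHOLD = 3.0

def passes_filter(query: str) -> bool:
    """Reject queries whose mean cross-entropy exceeds the threshold."""
    return mean_cross_entropy(query.lower()) <= THRESHOLD
```

The point is just that a gibberish-heavy adversarial suffix scores far worse than ordinary text, so a single threshold comparison cuts it off before the target model ever sees it.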
> which seems useful for doing analysis on how the dataset landed in the model.
Yes! This is exactly how I think about the value of dreaming. Poking at the edges of the behavior of a feature/circuit/component lets you get a more robust sense of what that component is doing.
Thanks!