It seems like the qualitative conclusion here is similar to what you’d get by verifying that an arm is non-catastrophic by pulling it over and over again. You have a quadratic dependence on a rather than linear, but I think that is just because you aren’t exploiting the fact that the target catastrophe probability is 0? If that’s right, it might be better to stick with the simplest algorithm that exhibits the behavior of interest. (If the optimal dependence is actually quadratic in this situation, that’s surprising.)
In the definition of qi, presumably you want C rather than R. Note that the risk distribution is just the posterior conditioned on catastrophe.
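Spelling out that reading (my notation, not the post’s: I’m assuming p is the base distribution over inputs x and C denotes the catastrophe event), the corrected definition presumably looks like:

```latex
q_i(x) \;=\; \Pr[x \mid C, i]
\;=\; \frac{\Pr[C \mid x, i]\, p(x)}{\Pr[C \mid i]}
\;\propto\; \Pr[C \mid x, i]\, p(x)
```

i.e. Bayes’ rule applied to the catastrophe event, which is what “the posterior conditioned on catastrophe” means.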
(I think these days we don’t usually call arms ‘experts’; there may be both experts and arms, in which case an expert might advise you about which arm to pull, so it’s a bit confusing to talk about experts with partial feedback.)
Re footnote 1: I would call this an example of adversarial training. I’d suggest usage like: a red team is a group of humans (or in general a powerful hopefully-aligned AI) which is acting as an adversary for the purposes of adversarial training or development. I think the original version of my post may have overreached a bit on the definition and not given adequate credit to the adversarial training authors (who do seem to consider this kind of thing an example of adversarial training).
Thanks!!
In addition to Jessica’s comments: uniformly calling the selections ‘arms’ seems good, as does clarifying what is meant by ‘red teams’. I’ve corrected both, and likewise the definition of qi.
A couple notes:
The quadratic dependence on a is almost certainly unnecessary; we didn’t try too hard to reduce it. The way to reduce the bound is probably by observing that, for the best arms, the average importance-sampled estimate of the risk has a low mean; showing that as a result the estimate is subgaussian; and then applying a stochastic bandit algorithm that assumes subgaussian tails.
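A minimal sketch of the shape of that reduction (hypothetical names throughout; the UCB index is the textbook one for subgaussian observations, not anything from the post):

```python
import numpy as np

def importance_sampled_risk(catastrophes, weights):
    """Off-policy estimate of an arm's catastrophe risk: `catastrophes`
    are 0/1 indicators for inputs x drawn from a proposal distribution,
    and `weights` are the ratios q_i(x) / proposal(x).  For the best
    arms the true risk (hence the mean of this estimate) is small,
    which is what would let one argue the estimate is subgaussian."""
    return float(np.mean(np.asarray(weights) * np.asarray(catastrophes)))

def subgaussian_ucb_index(mean_est, pulls, t, sigma=1.0):
    """Textbook optimism bonus for sigma-subgaussian observations
    (requires pulls >= 1 and t >= 2).  Feeding subgaussian risk
    estimates into a bandit algorithm built on this index is the step
    that should turn the quadratic dependence into a linear one."""
    return mean_est + sigma * np.sqrt(2.0 * np.log(t) / pulls)
```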
If we just pulled an arm repeatedly to ensure it’s non-catastrophic, we’d get dependence on 1/τ, which is huge; the main idea of the post is that we can get dependence on a instead of 1/τ.
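To put a number on “huge” (a standard rule-of-three-style calculation, not taken from the post): to conclude the risk is below τ with confidence 1−δ from catastrophe-free pulls alone, you need roughly ln(1/δ)/τ pulls, since

```latex
(1-\tau)^n \le \delta
\;\Longleftrightarrow\;
n \;\ge\; \frac{\ln(1/\delta)}{-\ln(1-\tau)} \;\approx\; \frac{\ln(1/\delta)}{\tau},
\qquad
\text{e.g. } \tau = 10^{-9},\ \delta = 10^{-2}
\;\Rightarrow\; n \approx 4.6 \times 10^{9}.
```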
(I meant sampling x repeatedly from the distribution ^qi, I agree that sampling x at random won’t help identify rare catastrophes.)
The main qualitative difference from sampling from ^qi is that we’re targeting a specific tradeoff between catastrophes and reward, rather than zero probability of catastrophe. I agree that when τ=0 we’re just sampling from ^qi.
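One way to make that tradeoff concrete (my gloss, not necessarily the post’s exact formulation):

```latex
\text{pick arms to } \max \; \mathbb{E}[\text{reward}]
\quad \text{subject to} \quad \Pr[C] \le \tau
```

At τ = 0 the constraint is everything, and verifying it amounts to checking inputs drawn from ^qi.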