It seems like the qualitative conclusion here is similar to what you’d get by verifying that an arm is non-catastrophic by pulling it over and over again. You have a quadratic dependence on a rather than linear, but I think that is just because you aren’t exploiting the fact that the target catastrophe probability is 0? If that’s right, it might be better to stick with the simplest algorithm that exhibits the behavior of interest. (If the optimal dependence is actually quadratic in this situation, that’s surprising.)
In the definition of qi, presumably you want C rather than R. Note that the risk distribution is just the posterior conditioned on catastrophe.
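Spelling out that reading (my notation, not the post’s: I’m assuming p is the base distribution over inputs x and C denotes the catastrophe event), the corrected definition presumably looks like:

```latex
q_i(x) \;=\; \Pr[x \mid C, i]
\;=\; \frac{\Pr[C \mid x, i]\, p(x)}{\Pr[C \mid i]}
\;\propto\; \Pr[C \mid x, i]\, p(x)
```

i.e. Bayes’ rule applied to the catastrophe event, which is what “the posterior conditioned on catastrophe” means.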
(I think these days we don’t usually call arms ‘experts’; there may be both experts and arms, in which case an expert might advise you about which arm to pull, so it’s a bit confusing to talk about experts with partial feedback.)
Re footnote 1: I would call this an example of adversarial training. I’d suggest usage like: a red team is a group of humans (or in general a powerful hopefully-aligned AI) which is acting as an adversary for the purposes of adversarial training or development. I think the original version of my post may have overreached a bit on the definition and not given adequate credit to the adversarial training authors (who do seem to consider this kind of thing an example of adversarial training).
Thanks!!
In addition to Jessica’s comments: uniformly calling the selections ‘arms’ seems good, as does clarifying what is meant by ‘red teams’. I’ve corrected both, and likewise the definition of qi.
A couple notes:
The quadratic dependence on a is almost certainly unnecessary; we didn’t try too hard to reduce it. The way to reduce the bound is probably by observing that, for the best arms, the average importance-sampled estimate of the risk has a low mean; showing that as a result the estimate is subgaussian; and then applying a stochastic bandit algorithm that assumes subgaussian tails.
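A minimal sketch of the shape of that reduction (hypothetical names throughout; the UCB index is the textbook one for subgaussian observations, not anything from the post):

```python
import numpy as np

def importance_sampled_risk(catastrophes, weights):
    """Off-policy estimate of an arm's catastrophe risk: `catastrophes`
    are 0/1 indicators for inputs x drawn from a proposal distribution,
    and `weights` are the ratios q_i(x) / proposal(x).  For the best
    arms the true risk (hence the mean of this estimate) is small,
    which is what would let one argue the estimate is subgaussian."""
    return float(np.mean(np.asarray(weights) * np.asarray(catastrophes)))

def subgaussian_ucb_index(mean_est, pulls, t, sigma=1.0):
    """Textbook optimism bonus for sigma-subgaussian observations
    (requires pulls >= 1 and t >= 2).  Feeding subgaussian risk
    estimates into a bandit algorithm built on this index is the step
    that should turn the quadratic dependence into a linear one."""
    return mean_est + sigma * np.sqrt(2.0 * np.log(t) / pulls)
```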
If we just pulled an arm repeatedly to ensure it’s non-catastrophic, we’d get dependence on 1/τ, which is huge; the main idea of the post is that we can get dependence on a instead of 1/τ.
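To put a number on “huge” (a standard rule-of-three-style calculation, not taken from the post): to conclude the risk is below τ with confidence 1−δ from catastrophe-free pulls alone, you need roughly ln(1/δ)/τ pulls, since

```latex
(1-\tau)^n \le \delta
\;\Longleftrightarrow\;
n \;\ge\; \frac{\ln(1/\delta)}{-\ln(1-\tau)} \;\approx\; \frac{\ln(1/\delta)}{\tau},
\qquad
\text{e.g. } \tau = 10^{-9},\ \delta = 10^{-2}
\;\Rightarrow\; n \approx 4.6 \times 10^{9}.
```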
(I meant sampling x repeatedly from the distribution ^qi, I agree that sampling x at random won’t help identify rare catastrophes.)
The main qualitative difference from sampling from ^qi is that we’re targeting a specific tradeoff between catastrophes and reward, rather than zero probability of catastrophe. I agree that when τ=0 we’re just sampling from ^qi.
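One way to make that tradeoff concrete (my gloss, not necessarily the post’s exact formulation):

```latex
\text{pick arms to } \max \; \mathbb{E}[\text{reward}]
\quad \text{subject to} \quad \Pr[C] \le \tau
```

At τ = 0 the constraint is everything, and verifying it amounts to checking inputs drawn from ^qi.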