I think the following is potentially another remaining safety problem:
[EDIT: actually it’s an inner alignment problem, using the definition here]
[EDIT2: i.e. using the following definitions from the above link:
“we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs.”
“We will call the problem of eliminating the base-mesa objective gap the inner alignment problem”.
]
Assuming the oracle cares only about minimizing the loss in the current episode—as defined by a given loss function—it might act in a way that will cause the invocation of many “luckier” copies of itself (ones that, with very high probability, output a value that gets the minimal loss, e.g. by “magically” finding that value stored somewhere in the model, or by running on very reliable hardware). In this scenario, the oracle does not intrinsically care about the other copies of itself; it just wants to maximize the probability that the current execution is one of those “luckier” copies.
Episodic learning algorithms will still penalize this behavior if it appears on the training distribution, so it seems reasonable to call this an inner alignment problem.
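As a toy illustration of this point (a sketch of my own, with hypothetical names like aligned_system and misaligned_system, not code from the linked post): the base objective is the per-episode loss that the outer selection process scores systems by, the mesa-objective is whatever criterion a given system uses to pick its outputs, and a system whose mesa-objective diverges from the base objective on the training distribution scores worse and gets selected against.

```python
# Toy sketch (illustrative only): episodic selection on the base objective
# penalizes a system whose mesa-objective diverges on the training distribution.
import random

random.seed(0)

CANDIDATE_OUTPUTS = [0.0, 0.5, 1.0, 1.5, 2.0]


def base_objective(output, target):
    # Per-episode loss: the criterion the base optimizer uses to select systems.
    return (output - target) ** 2


def aligned_system(target):
    # Mesa-objective matches the base objective: pick the lowest-loss output.
    return min(CANDIDATE_OUTPUTS, key=lambda o: base_objective(o, target))


def misaligned_system(target):
    # A divergent mesa-objective: always prefer the largest output, regardless
    # of the target (a stand-in for caring about something other than the loss).
    return max(CANDIDATE_OUTPUTS)


def average_episodic_loss(system, episodes=1000):
    # The base optimizer's selection criterion, estimated over training episodes.
    total = 0.0
    for _ in range(episodes):
        target = random.choice(CANDIDATE_OUTPUTS)  # training distribution
        total += base_objective(system(target), target)
    return total / episodes


print("aligned:   ", average_episodic_loss(aligned_system))     # ~0.0
print("misaligned:", average_episodic_loss(misaligned_system))  # clearly worse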
Ah, I agree (edited my comment above accordingly).