I think the following is potentially another remaining safety problem:
[EDIT: actually it’s an inner alignment problem, using the definition here]
[EDIT2: i.e. using the following definitions from the above link:
“we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs.”
“We will call the problem of eliminating the base-mesa objective gap the inner alignment problem”.
]
Assuming the oracle cares only about minimizing the loss in the current episode—as defined by a given loss function—it might act in a way that will cause the invocation of many “luckier” copies of itself (ones that, with very high probability, output a value that gets the minimal loss, e.g. by “magically” finding that value stored somewhere in the model, or by running on very reliable hardware). In this scenario, the oracle does not intrinsically care about the other copies of itself; it just wants to maximize the probability that the current execution is one of those “luckier” copies.
Episodic learning algorithms will still penalize this behavior if it appears on the training distribution, so it seems reasonable to call this an inner alignment problem.
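As a toy illustration of this point (a sketch of my own, with hypothetical names like aligned_system and misaligned_system, not code from the linked post): the base objective is the per-episode loss that the outer selection process scores systems by, the mesa-objective is whatever criterion a given system uses to pick its outputs, and a system whose mesa-objective diverges from the base objective on the training distribution scores worse and gets selected against.

```python
# Toy sketch (illustrative only): episodic selection on the base objective
# penalizes a system whose mesa-objective diverges on the training distribution.
import random

random.seed(0)

CANDIDATE_OUTPUTS = [0.0, 0.5, 1.0, 1.5, 2.0]


def base_objective(output, target):
    # Per-episode loss: the criterion the base optimizer uses to select systems.
    return (output - target) ** 2


def aligned_system(target):
    # Mesa-objective matches the base objective: pick the lowest-loss output.
    return min(CANDIDATE_OUTPUTS, key=lambda o: base_objective(o, target))


def misaligned_system(target):
    # A divergent mesa-objective: always prefer the largest output, regardless
    # of the target (a stand-in for caring about something other than the loss).
    return max(CANDIDATE_OUTPUTS)


def average_episodic_loss(system, episodes=1000):
    # The base optimizer's selection criterion, estimated over training episodes.
    total = 0.0
    for _ in range(episodes):
        target = random.choice(CANDIDATE_OUTPUTS)  # training distribution
        total += base_objective(system(target), target)
    return total / episodes


print("aligned:   ", average_episodic_loss(aligned_system))     # ~0.0
print("misaligned:", average_episodic_loss(misaligned_system))  # clearly worse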
Ah, I agree (edited my comment above accordingly).