Oliver Sourbut comments on Relaxed adversarial training for inner alignment

Oliver Sourbut 24 Nov 2021 17:00 UTC
3 points
A potential issue with Relaxed Adversarial Training, as factorised in the post: $\mathit{deploy}$ is presumably dependent on the outcome of the training process itself (i.e. the training process has side effects, most notably the production of a deployed ML artefact which might have considerable impact on the world!). Since the training process is downstream of the adversary, the quality of the adversary’s choice of pseudo-inputs to propose depends on the choice itself. This could lead to concerns about different fixed points (or even the existence of any fixed point?) in that system.

(My faint worry is that by being proposed, a problematic pseudo-input will predictably have some gradient ‘training it away’, making it less plausible to arise in the deploy distribution, making it less likely to be proposed… but that makes it have less gradient predictably ‘training it away’, making it more plausible in the deploy distribution, making it more likely to be proposed…)
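A toy sketch of that loop, in case it helps pin down the worry. Everything here is a made-up assumption for illustration and not anything from the post: the sigmoid-shaped adversary `propose_prob`, the linear training effect `deploy_plausibility`, the `sharpness` knob and the `base` plausibility are all hypothetical. It only shows that this kind of negative feedback can either settle to a fixed point or keep flip-flopping, depending on how sharply the adversary reacts.

```python
import math


def propose_prob(plausibility, sharpness):
    """Hypothetical adversary: proposes the problematic pseudo-input more
    often the more plausible it looks in the predicted deploy distribution."""
    return 1.0 / (1.0 + math.exp(-sharpness * (plausibility - 0.5)))


def deploy_plausibility(p_propose, base=0.9):
    """Hypothetical training effect: the more often the pseudo-input is
    proposed, the more gradient 'trains it away', so the less plausible
    it is at deployment. `base` is the plausibility with no training."""
    return base * (1.0 - p_propose)


def iterate(sharpness, steps=200, start=0.9):
    """Alternate 'adversary predicts' and 'training responds', recording
    the deploy plausibility after each round."""
    q = start
    trajectory = []
    for _ in range(steps):
        p = propose_prob(q, sharpness)
        q = deploy_plausibility(p)
        trajectory.append(q)
    return trajectory


def late_spread(trajectory, window=10):
    """Range of the last few values: near zero means the loop settled."""
    tail = trajectory[-window:]
    return max(tail) - min(tail)


if __name__ == "__main__":
    for sharpness in (1.0, 20.0):
        spread = late_spread(iterate(sharpness))
        print(f"sharpness={sharpness}: late spread = {spread:.6f}")
```

In this toy, a gently reacting adversary (`sharpness=1.0`) settles to a single consistent prediction, while a sharply reacting one (`sharpness=20.0`) keeps flipping between ‘plausible, so propose it’ and ‘trained away, so don’t bother’. Whether anything like the second regime shows up in a real setup is exactly the open question.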

Some ways to dissolve this:
1. In conversation with Evan, he already mentioned a preferred reframing of RAT which bypasses pseudo-inputs and prefers to directly inspect some property of the model (e.g. myopia)
2. I wonder about maybe ‘detecting weird fixpoints’ by also inspecting the proposed pseudo-inputs for ‘is this a weird and concerning pseudo-input?’ (if so, the supervisor is predicting weird and concerning post-deployment worlds!)
3. If we instead consider causal reasoning and the counterfactual $\mathit{deploy}$ of ‘what if we did no more training and deployed now?’, this dissolves the dependence (I wonder if this is actually the intended idea of the OP). This leaves open the question of how much harder/easier it is to do counterfactual vs predictive reasoning here.
4. If we instead consider ‘deployment’ to be ‘any moment after now’ (including the remainder of the training process), it might cash out similarly to 3? This chimes with one of my intuitions about embedded agency, which I don’t know an official name for but think of as ‘you only get one action’ (because any action affects the world, which affects you, so there’s now a different ‘you’).
Interesting? Or basically moot? Or something in between?