Karma: 95

# Limits of Asking ELK if Models are Deceptive

15 Aug 2022 20:44 UTC
6 points
• Interesting! If I understand correctly, one assumption this rests on is that deception is usually fairly continuous: to become really deceptive (and obfuscate your latent space) you must first be a little deceptive (while small changes in your latent space are still revealing). I wonder if more capable models might break that assumption? If so, why not train the Surgeon in lockstep with the model (or give it more runtime, if it’s using a fixed algorithm)? You’d probably want to scale the Surgeon’s training in proportion to how quickly the model itself is training.

• When you say “which yields a solution of the form ”, are you saying that yields that, or are you saying that yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form .

I meant the former (which you’re right only has the solution with ). I only added the term to make it work for the inequality. As a result, it’s only a subset of the solutions for the inequality. The (quite complicated!) expression you provided also works.

• For a 2-component weighted average model with a scalar output, the output should always be between the outputs of the two component models.

Hm, I see your point. I retract my earlier claim. This model wouldn’t apply to that task. I’m struggling to generate a concrete example where loss would actually be a linear combination of the sub-models’ loss. However, I (tentatively) conjecture that in large networks trained on complex tasks, loss can be roughly approximated as a linear combination of the losses of subnetworks (with the caveats of weird correlations and tasks where partial combinations work well (like the function approximation above)).
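One concrete setting where the linearity does hold exactly (my own example, not from the thread) is hard routing: if each training example is scored by exactly one sub-model, the composite’s mean loss decomposes as a weighted sum of the sub-models’ losses, with weights equal to the fraction of examples each sub-model handles. A minimal sketch with hypothetical experts and data:

```python
# Hard routing: each example is scored by exactly one sub-model, so the
# composite's mean loss is exactly a weighted sum of per-expert losses.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 6.0)]  # hypothetical (x, y)

def expert_a(x):          # handles inputs with x < 2
    return 2 * x + 1

def expert_b(x):          # handles inputs with x >= 2
    return 2 * x

def mean_sq(model, pts):
    return sum((model(x) - y) ** 2 for x, y in pts) / len(pts)

route_a = [(x, y) for x, y in data if x < 2]
route_b = [(x, y) for x, y in data if x >= 2]

composite = sum(((expert_a(x) if x < 2 else expert_b(x)) - y) ** 2
                for x, y in data) / len(data)
weighted = (len(route_a) * mean_sq(expert_a, route_a)
            + len(route_b) * mean_sq(expert_b, route_b)) / len(data)

print(composite, weighted)  # identical: the decomposition is exact
```

With soft mixing (averaging outputs rather than routing inputs) the decomposition generally fails, consistent with the retraction above.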

I would expect under normal circumstances that gradient descent would also be optimizing the parameters within that part or those layers.

I agree, but the question of in what direction SGD changes the model (i.e. how it changes ) seems to have some recursive element analogous to the situation above. If the model is really close to the above, then I would imagine there’s some optimization pressure to update it towards . That’s just a hunch, though. I don’t know how close it would have to be.

• To be clear, I haven’t explained how could arise nor how it’s implementing . There are other posts that explain why gradient hacking might be a problem and give informal ‘requirements’ that gradient-hacking models might meet. I’m just trying to answer: if we already have a gradient-hacking model, what’s the theoretical best it can do?

Can you flesh this out more by giving an example of what a training data point would be in this scenario? What are the input features of a training case and what output target variable is the model trying to predict?

Also, it is not clear from this example if the outputs of the whole composite model are:

1. A weighted average of the outputs of the 2 component models OR

2. The weighted sum of

Output would be task-specific. In the scenario you later describe with predicting functions, you’d probably want to take option 1 (which is what parametrizing on one weight in [0,1] also accomplishes). I’ll go with that example.

But either way, I don’t see why the loss for the composite model would necessarily be equal to a weighted sum of the losses of each component model, as stated in the example above.

In principle, it seems to me the loss on such a composite model could (for example) be 0 on some data set, even if both components had a positive loss function value for that same data.
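This possibility is easy to exhibit numerically. The components below are my own hypothetical stand-ins (the thread’s originals didn’t survive extraction): data on the line y = x, one component predicting y = 2x, the other y = 0, mixed with equal weights:

```python
# Hypothetical components (my stand-ins): data on the line y = x,
# component 1 predicts y = 2x, component 2 predicts y = 0.
xs = [1.0, 2.0, 3.0]
ys = xs[:]                                   # every point lies on y = x

def sq_loss(preds):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

f1 = [2 * x for x in xs]                     # component 1: y = 2x
f2 = [0.0 for _ in xs]                       # component 2: y = 0
mix = [0.5 * a + 0.5 * b for a, b in zip(f1, f2)]  # equal-weight average

# Each component has positive loss, yet the composite fits perfectly.
print(sq_loss(f1), sq_loss(f2), sq_loss(mix))
```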

As a simplified example of this, suppose you have a 2-component weighted-average model that is trying to predict the Y value of points in a 2-dimensional (X,Y) coordinate plane given the X value. And suppose all the data points are on the line:

If the 2 components of the model are:

1. AND

The intent of the scenario is to find which model dominates, so the loss should probably be non-negative. If you use squared error in that scenario, then the loss of the mixture is always greater than or equal to the loss of any particular model in the mixture.

Also suppose can see perfectly and its loss is a function of its relative weight in the mixture. I.e. , which I’ll refer to as ’s loss-modulating function.

I am disinclined to suppose this—unless someone explains how it could make sense for the machine learning engineers to do it that way.

My understanding is that the loss function of a model should track how close the model’s answers are to the correct or optimal answers for some data. But in this example, and are not part of the training data, and can vary independently from the outputs on the training cases.

has a calculable loss on the training data, which means could be a stand-alone model for this dataset in its own right even if , , and didn’t exist.

I don’t think ML engineers will pass in weights of the models to the models themselves (except maybe for certain tasks like game-theoretic simulations). The worry is that data spills easily and that SGD might find absurd, unpredictable ways to sneak weights (or some other correlated variable) into the model. Once it does, the model’s loss is already a function of its weight (though it may of course be constant). I wouldn’t expect to run into this sort of problem with weak models operating on well-defined tasks (e.g. function prediction). Gradient hacking (unless in contrived examples) probably wouldn’t occur until we get very capable models that are fed a lot of data (the calculation of weights from which might be easy) and it may in fact never occur at all!

The result of the post is that there exists a loss-modulating function that ‘tricks’ SGD into preferring a model that is actively worse, simply because that model ‘holds out’ or ‘strategically fails’ in such a way that the gradient flows toward it. How likely such functions are, and whether there are basins around them (i.e. if is currently outputting , which is sort of close to , would SGD update toward or toward some constant function? If is outputting , is there any optimization pressure toward changing to be more gradient-hacky?), are open problems.
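To illustrate the kind of loss-modulating function at issue, here is a minimal sketch under assumptions of my own (not the post’s actual construction): the good model has zero loss, the hacker’s loss shrinks as its mixture weight w grows via L_H(w) = (1 − w)², and the composite loss is the weighted sum w·L_H(w) + (1 − w)·L_G. Plain gradient descent on w, started above 1/3, then drives w toward 1, i.e. toward the hacker:

```python
# Toy sketch of a loss-modulating function that "tricks" gradient descent.
# Assumptions (mine): the good model G has zero loss, and the hacker H's
# loss depends on its own mixture weight w as L_H(w) = (1 - w)^2.
def composite_loss(w, L_G=0.0):
    L_H = (1.0 - w) ** 2   # hacker "holds out" less as its weight grows
    return w * L_H + (1.0 - w) * L_G

def grad(w, eps=1e-6):     # numeric derivative of the composite loss in w
    return (composite_loss(w + eps) - composite_loss(w - eps)) / (2 * eps)

w = 0.4                    # start the hacker at modest weight (> 1/3)
for _ in range(2000):
    w -= 0.01 * grad(w)    # plain gradient descent on the mixture weight
    w = min(max(w, 0.0), 1.0)

print(round(w, 3))         # w is driven toward 1: gradient flows to the hacker
```

This only illustrates existence; whether SGD would ever find such a function is the open problem noted above.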

# A Toy Model of Gradient Hacking

20 Jun 2022 22:01 UTC
30 points