Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but it seems important to note that the obviously superior alternative to training against it is validating against it. That is, when you observe scheming, you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say this “is in some sense approximately the only and central core of the alignment problem”. I’m wondering whether thinking about this validation-vs-training point might cause you a nontrivial update, then?
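To make the training-vs-validation contrast concrete, here is a toy sketch of the loop I mean. Everything in it (`train_with`, `detects_scheming`, the list of candidate techniques) is a hypothetical stand-in rather than a real pipeline; the point is only that the scheming signal never enters the training objective, it just decides whether a finished run is kept.

```python
import random

# Toy stand-ins only: `train_with` and `detects_scheming` are placeholders for
# an expensive training run and whatever scheming signal we have (evals,
# interp probes, ...), not real APIs.
CANDIDATE_TECHNIQUES = ["technique_a", "technique_b", "technique_c"]

def train_with(technique: str) -> dict:
    # Pretend training run: whether the resulting model schemes is unknown in advance.
    return {"technique": technique, "schemes": random.random() < 0.5}

def detects_scheming(model: dict) -> bool:
    # Pretend detector: in reality this is only a weak, fallible signal.
    return model["schemes"]

def iterate_with_validation() -> dict | None:
    # The signal is never part of the loss. It only decides whether a finished
    # run is accepted; a rejection means retraining with a *different*
    # technique, not nudging the same model until the detector goes quiet.
    for technique in CANDIDATE_TECHNIQUES:
        model = train_with(technique)
        if not detects_scheming(model):
            return model
    return None  # every candidate technique failed validation

print(iterate_with_validation())
```

The contrast with training against the signal is that nothing inside `train_with` ever receives optimisation pressure from `detects_scheming`.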
Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
Validation is certainly less efficient at overfitting but it seems a bit like using an evolutionary algorithm rather than gradient descent. You aren’t directly optimizing according to the local gradient, but that doesn’t necessarily mean you’ll avoid Goodharting—just that you’re less likely to immediately fall into a bad local optimum.
The likelihood of preventing Goodharting feels like it depends heavily on assumptions about the search space. The “validation” filters the search space to areas where scheming isn’t easily detectable, but what portion of this space is safe (and how can we tell)? We don’t actually have a true globally accurate validator oracle—just a weak approximation of one.
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference, which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than for training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).
One example might be if we have interp methods that measure correlates of scheming. These would be incredibly easy to train away, and still possible to validate away, but probably enough harder that the ratio of non-schemers you get is higher than if you had trained against the signal (which wouldn’t affect the ratio at all).
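One way to get a feel for how slow “validating it away” is: under the (strong) simplifying assumption that the exploitable slack in the validation metric behaves like independent Gaussian noise across candidate runs, keeping the best of k runs only inflates the validation score like the expected max of k Gaussians, i.e. on the order of sqrt(2 ln k) standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # assumed spread of the exploitable noise in the validation metric

# Expected inflation of the validation score when you keep the best of k runs,
# estimated by simulation over 1,000 trials: it grows roughly like
# sigma * sqrt(2 ln k), i.e. painfully slowly in the number of runs tried.
for k in [10, 100, 1_000, 10_000]:
    inflation = rng.normal(0.0, sigma, size=(1_000, k)).max(axis=1).mean()
    print(f"runs tried: {k:>6}   expected inflation: {inflation:.2f} * sigma")
```

Training directly against the signal, by contrast, applies pressure to it on every optimisation step rather than once per finished run.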
A separate argument is that I think if you just do random search over training ideas, rejecting them if they don’t reach a certain validation score, you actually don’t Goodhart at all. Might put that argument in a top-level post.
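(For concreteness, the procedure I have in mind is just random search gated by a hard validation threshold; a toy sketch with hypothetical names:)

```python
import random

rng = random.Random(0)
THRESHOLD = 0.9  # minimum acceptable validation score (illustrative)

def propose_training_idea() -> dict:
    # Stand-in for drawing a training recipe at random (data mix, objective
    # tweaks, regularisers, ...). Proposals never see the validation score.
    return {"idea_id": rng.randrange(10**6)}

def train_and_validate(idea: dict) -> float:
    # Stand-in for an expensive training run followed by the validation-time eval.
    return rng.random()

accepted = []
for _ in range(100):
    idea = propose_training_idea()
    if train_and_validate(idea) >= THRESHOLD:  # reject below threshold, nothing else
        accepted.append(idea)

print(f"{len(accepted)} of 100 random ideas cleared the validation threshold")
```

The validation score only ever gates acceptance; it never feeds back into how the next idea is proposed.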
A separate argument is that I think if you just do random search over training ideas, rejecting them if they don’t reach a certain validation score, you actually don’t Goodhart at all. Might put that argument in a top-level post.
I’d be interested in seeing this argument laid out.
I wrote it out as a post here.