This was useful to read, thanks. I was in that part of the forgetting curve where I knew I'd read the ICM paper a few months ago but couldn't remember how it worked :)
You could maybe punch up the relevance to alignment a little. I’m a big fan of “captur[ing] challenges like the salience of incorrect human beliefs,” but I worry that there’s a tendency to deploy the minimum viable fix. “Oh, training on 3k data points didn’t help with that subset of problems? Well, training on 30k data points, 10% of which were about not exploiting human vulnerabilities, solves the problem 90% of the time on the latest benchmarks that capture the problem.” I’m pretty biased against deploying the minimum viable fix here, but maybe you think it’s on the right path?
I think if you were able to solve 90% of the problem across stress-testing benchmarks that felt reasonably analogous to the situation (including new stress-testing benchmarks built by other people) by using a technique that doesn’t feel like it is exploiting a benchmark-reality gap, I would be pretty excited about using the technique in situations where solving 90% of the problem helps (which I think could be pretty important situations). I’d find it interesting and useful if there was a minimum viable fix that worked pretty robustly!
(Just training on easy data points where the truth is salient and known to you both has benchmark-reality-gap issues and, in practice, doesn’t work that well when there is a big enough distribution shift between the easy and hard data; more on that soon!)