Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it’s not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul’s position: it’s not that bad and we can do a lot to correct systems before Goodharting would be disastrous.
Maybe part of the problem is we’re mixing up math and engineering problems and not making clear distinctions, but anyway I bring this up in the context of conservatism because it seems relevant that we also need to figure out how conservative, if at all, we need to be about optimization pressure, let alone how we would do it. I’ve not seen anything like a formal argument that X amount of optimization pressure, measured in whatever way is convenient, and given conditions Y produce Z% chance of Goodharting. Then at least we wouldn’t have to disagree over what feels safe or not.
Regarding conservatism, there seems to be an open question of just how robust Goodhart effects are in that we all agree Goodhart is a problem but it’s not clear how much of a problem it is and when. We have opinions ranging from mine, which is basically that Goodharting happens the moment you try to apply even the weakest optimization pressure and this will be a problem (or at least a problem in expectation; you might get lucky) for any system you need to never deviate, to what I read to be Paul’s position: it’s not that bad and we can do a lot to correct systems before Goodharting would be disastrous.
Maybe part of the problem is we’re mixing up math and engineering problems and not making clear distinctions, but anyway I bring this up in the context of conservatism because it seems relevant that we also need to figure out how conservative, if at all, we need to be about optimization pressure, let alone how we would do it. I’ve not seen anything like a formal argument that X amount of optimization pressure, measured in whatever way is convenient, and given conditions Y produce Z% chance of Goodharting. Then at least we wouldn’t have to disagree over what feels safe or not.