abramdemski comments on Against Almost Every Theory of Impact of Interpretability

abramdemski 10 Jan 2024 21:34 UTC
2 points
I’ll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it’s still impossible to solve the problem, but it’s usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand.
I agree it isn’t a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won’t have this failure mode, so I’m still quite comfortable with using it as an implication that isn’t 100% accurate, but more like 90-95+% accurate.
I agree the example sucks and only serves to prove that it is not a logical implication.
A better example would be, like, the Goodhart model of AI risk, where any loss function that we optimize hard enough to reach superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not assume that we need to prove 100% safety. Rather, it argues from stated assumptions that it will be hard to get any safety at all from loss functions which merely coincide somewhat well with what we want.
I still think the list of lethalities is a pretty good reply to your overall line of reasoning: it clearly flags that the problem is not achieving perfection but achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard. That is, it provides arguments rather than some kind of blind assumption, as you seem to be indicating.
You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but “things tend to be easy when you lift the requirement of perfection” is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.