abramdemski comments on Against Almost Every Theory of Impact of Interpretability

abramdemski 4 Jan 2024 4:34 UTC
2 points
0
More generally, if we grant that we don’t need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it’s often doable to solve problems if you don’t need perfect or arbitrarily good solutions.
It seems really easy to spell out worldviews where “we don’t need perfection, or arbitrarily good alignment” and yet alignment is still not easy. To give a somewhat silly example based on the OP, I could buy Enumerative Safety in principle: if we can check all the features for safety, we can 100% guarantee the safety of the model. It then follows that if we can check 95% of the features (sampled randomly) then we get something like a 95% safety guarantee (depending on priors).
But I might also think that properly “checking” even one feature is really, really hard.
So I don’t buy the claimed implication: “we don’t need perfection” does not imply “alignment should be really easy”. Indeed, I think the implication quite badly fails.
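To make the “95% (depending on priors)” arithmetic in the silly example concrete, here is a minimal sketch under toy assumptions that are not part of the original argument: every feature is either safe or unsafe, the audit samples features uniformly without replacement, and a check never errs. The function names `miss_probability` and `posterior_safe` are hypothetical, chosen only for illustration.

```python
from math import comb


def miss_probability(n_features, n_unsafe, n_checked):
    """Chance that a uniformly random audit of n_checked features
    contains none of the n_unsafe unsafe features (hypergeometric)."""
    # comb() returns 0 when there are not enough safe features left to
    # fill the audit, so an audit that must hit an unsafe feature gives 0.
    return comb(n_features - n_unsafe, n_checked) / comb(n_features, n_checked)


def posterior_safe(prior, n_features, n_checked):
    """P(no unsafe features | the audit found nothing).

    prior maps "number of unsafe features" to its prior probability.
    This is plain Bayes' rule applied to the toy model above.
    """
    evidence = sum(
        p * miss_probability(n_features, k, n_checked) for k, p in prior.items()
    )
    return prior.get(0, 0.0) / evidence


# One unsafe feature among 100, 95 checked: the audit misses it 5% of the time.
print(miss_probability(100, 1, 95))

# The resulting guarantee depends on the prior over "how many unsafe features".
print(posterior_safe({0: 0.5, 1: 0.5}, 100, 95))  # about 0.952
print(posterior_safe({0: 0.1, 1: 0.9}, 100, 95))  # about 0.69
```

Under these assumptions the guarantee is only “something like 95%”: a clean audit of 95% of the features leaves a posterior that moves with the prior. None of this touches the cost of checking a single feature, which is where the example says the difficulty may lie.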
- Noosphere89 5 Jan 2024 21:57 UTC
  2 points
  0
  I’ll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it’s still impossible to solve the problem, but it’s usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
  
  Indeed, I think the implication quite badly fails.
  
  I agree it isn’t a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won’t have this failure mode, so I’m still quite comfortable with using it as an implication that isn’t 100% accurate, but more like 90-95+% accurate.
  - abramdemski 10 Jan 2024 21:34 UTC
    2 points
    0
    I’ll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it’s still impossible to solve the problem, but it’s usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
    I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand.
    I agree it isn’t a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won’t have this failure mode, so I’m still quite comfortable with using it as an implication that isn’t 100% accurate, but more like 90-95+% accurate.
    I agree the example sucks and only serves to prove that it is not a logical implication.
    A better example would be, like, the Goodhart model of AI risk, where any loss function that we optimize hard enough to get into superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not assume that we need to prove 100% safety; rather, it argues, from stated assumptions, that it will be hard to get any safety at all from loss functions which merely coincide with what we want somewhat well.
    I still think the list of lethalities is a pretty good reply to your overall line of reasoning: it clearly flags that the problem is not achieving perfection, but rather, achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard, i.e., it provides arguments rather than some kind of blind assumption like you seem to be indicating.
    You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but “things tend to be easy when you lift the requirement of perfection” is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.