I think that the vast majority of the existential risk comes from that “broader issue” that you’re pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem...
[Emphasis added.] I think this is a common and serious mistake-pattern, and in particular is one of the more common underlying causes of framing errors. The pattern is roughly:
1. Notice a cluster of problems X which share a similar underlying causal pattern Cause(X).
2. Notice a problem y in which Cause(X) could plausibly play a role.
3. On deeper examination, notice that the cause of y, cause(y), doesn’t quite fit Cause(X).
4. Attempt to redefine the pattern Cause(X) to include cause(y).
The problem is that, in trying to “shoehorn” cause(y) into the category Cause(X), we miss the opportunity to notice a different pattern, which is more directly useful in understanding y as well as some other cluster of problems related to y.
A concrete example: this is the same mistake I accused Zvi of making when he tried to cast moral mazes as a problem of super-perfect competition. The conditions needed for super-perfect competition to explain moral mazes did not hold, and by trying to shoehorn the problem into that mold, Zvi missed an orthogonal phenomenon which is extremely interesting in its own right: thinking about that exact problem was what led to Demons in Imperfect Search.
Now, this is not to say that changing a definition to fit another case is always the wrong move. Sometimes, a new use-case shows that the definition can handle the new case while still preserving its original essence. The key question is whether the problem cluster X and problem y really do have the same underlying structure, or if there’s something genuinely new and different going on in y.
In this case, I think it’s pretty clear that there is more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc. Generalization failure is not just about, or even primarily about, inner agents. It occurs even in the absence of mesa-optimizers. So defining inner alignment to be about that problem looks to me like a mistake—you’re likely to miss important, conceptually-distinct phenomena by making that move. (We could also come at it from the converse direction: if something clearly recognizable as an inner alignment problem occurs for ideal Bayesians, then redefining the inner alignment problem to be “we can’t control what sort of model we get when we do ML” is probably a mistake, and you’re likely to miss interesting phenomena that way which don’t conceptually resemble inner alignment.)
A useful knee-jerk reaction here is to notice when cause(y) doesn’t quite fit the pattern Cause(X), and use that as a curiosity-pump to look for other cases which resemble y. That’s the sort of instinct which will tend to turn up insights we didn’t know we were missing.
I mean, I don’t think I’m “redefining” inner alignment, given that I don’t think I’ve ever really changed my definition, and I was the one who originally came up with the term (inner alignment was due to me, mesa-optimization was due to Chris van Merwijk). I also certainly agree that there are “more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc.”—that’s exactly the point I’m making: while there are other issues, inner alignment is the one I’m most concerned about. That being said, I also think I was just misunderstanding the setup in the paper—see Rohin’s comment on this chain.