I struggled a bit in deciding whether to nominate this sequence.
On the one hand, it brought a lot more prominence to the inner alignment problem by making an argument for it in a lot more detail than had been done before.
On the other hand, in my view, the framework it presents has an overly narrow view of what counts as inner alignment; relies on a model of AI development that I do not think is accurate; causes people to say “but what about mesa optimization” in response to any advance that doesn’t involve mesa optimization, even when the advance is useful for other reasons; has led to significant confusion over what exactly does and does not count as mesa optimization; and tends to cause people to take worse steps in choosing future research topics. (I expect all of these claims will be controversial.)
Still, that the conversation is happening at all is a vast improvement over the previous situation of relative (public) silence on the problem. Saying a bunch of confused thoughts is often the precursor to an actual good understanding of a topic. As such I’ve decided to nominate it for that contribution.
I think I can guess what your disagreements are regarding too narrow a conception of inner alignment/mesa-optimization (that the paper overly focuses on models mechanistically implementing optimization), though I’m not sure what model of AI development it relies on that you don’t think is accurate, and I would be curious for details there. I’d also be interested in what sorts of worse research topics you think it has tended to encourage (on my view, I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design). Also, for the paper giving people a “but what about mesa-optimization” response, I’m imagining you’re referring to things like this post, though I’d appreciate some clarification there as well.
As a preamble, I should note that I’m putting on my “critical reviewer” hat here. I’m not intentionally being negative—I am reporting my inside-view beliefs on each question—but as a general rule, I expect these to be biased negatively; someone looking at research from the outside doesn’t have the same intuitions for its utility and so will usually inside-view underestimate its value.
These are also all things I’m saying with the benefit of hindsight; I don’t know what I would have said at the time the sequence was published. I’m not trying to be “fair” to the sequence here, that is, I’m not considering what it would have been reasonable to believe at the time.
> the paper overly focuses on models mechanistically implementing optimization
Yup, that’s right.
> I’m not sure what model of AI development it relies on that you don’t think is accurate
There seems to be an implicit model that when you do machine learning you get out a complicated mess of a neural net that is hard to interpret, but at its core it still is learning something akin to a program, and hence concepts like “explicit (mechanistic) search algorithm” are reasonable to expect. (Or at least, that this will be true for sufficiently intelligent AI systems.)
I don’t think this model (implicit claim?) is correct. (For comparison, I also don’t think this model would be correct if applied to human cognition.)
> worse research topics you think it has tended to encourage
A couple of examples:
Attempting to create an example of a learned mechanistic search algorithm (I know of at least one proposal that was trying to do this)
Of your concrete experiments, I don’t expect to learn anything of interest from the first two (they aren’t the sort of thing that would generalize from small environments to large environments); I like the third; the fourth and fifth seem like interesting AI research but I don’t think they’d shed light on mesa-optimization / inner alignment or its solutions.
> I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design
I agree with this. Maybe people have gotten more interested in transparency as a result of this paper? That seems plausible.
> I’m imagining you’re referring to things like this post,
Actually, not that one. This is more like “why are you working on reward learning? Even if you solved it, we’d still be worried about mesa optimization”. Possibly no one believes this, but I often feel like this implication is present. I don’t have any concrete examples at the moment; it’s possible that I’m imagining it where it doesn’t exist, or that this is only a fact about how I interpret other people rather than what they actually believe.