Eliezer Yudkowsky periodically complains about people coming up with questionable plans, resting on questionable assumptions, to deal with AI, and then either:
Saying “well, if this assumption doesn’t hold, we’re doomed, so we might as well assume it’s true.”
Or, worse: coming up with cope-y reasons to treat the assumption as not questionable at all, as just part of a pretty reasonable worldview.
Sometimes the questionable plan is an alignment scheme that Eliezer thinks avoids the hard part of the problem. Sometimes it’s a sketchy, reckless plan that’s probably going to blow up and make things worse.
Some people complain about Eliezer being a doomy Negative Nancy who’s overly pessimistic.
This is somewhat beyond the scope of the primary point I think you are trying to make in this post, but I suspect this recurring dynamic (which I have observed myself a number of times) is largely caused by, and logically downstream of, an important object-level disagreement about the correctness of the following claim, which John Wentworth makes in a comment on his post “Most People Start With The Same Few Bad Ideas”:
I think a lot of people don’t realize on a gut level that a solution which isn’t robust is guaranteed to fail in practice. There are always unknown unknowns in a new domain; the presence of unknown unknowns may be the single highest-confidence claim we can make about AGI at this point. A strategy which fails the moment any surprise comes along is going to fail; robustness is necessary. Now, robustness is not the same as “guaranteed to work”, but the two are easy to confuse. A lot of arguments of the form “ah but your strategy fails in case X” look like they’re saying “the strategy is not guaranteed to work”, but the actually-important content is “the strategy is not robust to <broad class of failures>”; the key is to think about how broadly the example-failure generalizes. (I think a common mistake newcomers make is to argue “but that particular failure isn’t very likely”, without thinking about how the failure mode generalizes or what other lack-of-robustness it implies.)
(Bolding is mine.) It seems to me that, in at least a substantial number of these conversations between Eliezer and some other alignment researcher presenting a possible solution to the alignment problem, Eliezer is trying to get this idea across and the other person genuinely disagrees[1]. To give a specific example of this dynamic that seems illustrative, take Quintin Pope’s comment on the same post:
The issue is that it’s very difficult to reason correctly in the absence of an “Official Experiment”[1]. I think the alignment community is too quick to dismiss potentially useful ideas, and that the reasons for those dismissals are often wrong. E.g., I still don’t think anyone’s given a clear, mechanistic reason for why rewarding an RL agent for making you smile is bound to fail (as opposed to being a terrible idea that probably fails).
[1] More precisely, it’s very difficult to reason correctly even with many “Official Experiments”, and nearly impossible to do so without any such experiments.
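To make the smile example concrete, here is a minimal sketch of the reward scheme Quintin is pointing at. Everything in it (the detector, the observations, the candidate actions) is hypothetical and invented for illustration; it only shows the mechanistic shape of the worry, namely that the policy is optimized against a proxy signal rather than against what the operator actually wants:

```python
def smile_detector(observation: str) -> float:
    """Hypothetical stand-in for a learned classifier scoring how
    much the human appears to be smiling (0.0 to 1.0)."""
    # Toy heuristic: the detector only sees surface features.
    return 1.0 if "smiling" in observation else 0.0

def reward(observation: str) -> float:
    # The reward IS the detector output, so the agent is optimized
    # against the proxy, not the operator's actual satisfaction.
    return smile_detector(observation)

# Two hypothetical actions the policy could discover:
actions = {
    "be_helpful": "human is smiling because the task got done",
    "paste_photo": "human is smiling (a photo taped over the camera)",
}

# Both actions score identically under the proxy, which is the shape
# of the concern: optimization pressure flows toward whichever
# high-reward state is easiest to reach, not the intended one.
for name, observation in actions.items():
    print(f"{name}: reward = {reward(observation)}")
```

Whether this toy picture scales into a guaranteed failure, as opposed to a probable one, is exactly the disagreement in the quoted comment.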
Plans obviously need some robustness to things going wrong, and I weakly agree with John Wentworth that some robustness is a necessary feature of a plan, and that some verification is actually necessary.
But I have to agree that there is a real failure mode identified by moridinamael and Quintin Pope: perfectionism, meaning that you discard ideas too quickly as not useful. The constraint in the following exercise is the essence of perfectionism:
I have an exercise where I give people the instruction to play a puzzle game (“Baba is You”). Normally you have the ability to move around and interact with the world to experiment and learn things; here, instead, you need to make a complete plan for solving the level, and you aim to get it right on your first try.
It asks both for a complete plan that solves the whole level and for the plan to work on the first try; outside of this context, such demands imply either that the problem is likely unsolvable or that you are being too perfectionist.
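The constraint can be stated as an interface difference. Here is a rough sketch, assuming a hypothetical `level_step` transition function (none of these names come from the exercise itself): normal play chooses each move after observing the state the previous move produced, while the exercise requires committing to the entire move sequence before any feedback arrives.

```python
from typing import Callable, List

State = str
Move = str

def interactive_play(level_step: Callable[[State, Move], State],
                     choose_move: Callable[[State], Move],
                     start: State, max_moves: int) -> State:
    """Normal puzzle play: pick each move after observing the state
    the previous move produced."""
    state = start
    for _ in range(max_moves):
        state = level_step(state, choose_move(state))
    return state

def first_try_play(level_step: Callable[[State, Move], State],
                   full_plan: List[Move], start: State) -> State:
    """The exercise's constraint: the whole move sequence is fixed
    before any feedback arrives; no mid-course correction."""
    state = start
    for move in full_plan:
        # No observation of `state` is allowed to influence the plan.
        state = level_step(state, move)
    return state
```

The difference between the two signatures, a policy `State -> Move` versus a fixed `List[Move]`, is the whole exercise: no observation is allowed to flow back into the plan.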
In particular, I think Quintin Pope’s point genuinely applies to a lot of science and problem solving: it is actually quite difficult to reason well about the world in general without many experiments.
I think the concern is that, if plans need some verification, it may be impossible to align smarter-than-human AGI. To verify those plans, we’d have to build such an AGI, and if the plan doesn’t work (isn’t verified), that may be the end of us—no retries possible.
There are complex arguments on both sides, so I’m not arguing this is strictly true. I just wanted to clarify that this is the concern, and the point of asking people to solve it on the first try. I think ultimately this is partly, but not 100%, true of ASI alignment, and clarifying exactly how and to what degree we can verify plans empirically is critical to the project. A plan verified on weak systems that are not autonomous, self-aware, or agentic may or may not generalize to smarter systems that have those properties. Some of the ways verification will or won’t generalize can probably be identified with careful analysis of how such systems will be functionally different.
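One way to frame that open question: a verification result is evidence about the system it was run on, and the crux is which properties of the target system were actually exercised by the test. The following toy sketch (all profile fields and property names are hypothetical, chosen to mirror the paragraph above) just makes that inference step explicit:

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    """Hypothetical properties relevant to whether verification transfers."""
    capability: int          # rough scale; higher = smarter
    autonomous: bool
    situationally_aware: bool

def verification_transfer(plan_passed: bool, tested: SystemProfile,
                          target: SystemProfile) -> str:
    """Toy summary of the inference: a pass on `tested` is direct
    evidence about `target` only insofar as the properties that could
    break the plan are shared between the two."""
    if not plan_passed:
        return "plan failed even on the weak system; revise before scaling"
    untested = []
    if target.capability > tested.capability:
        untested.append("capability gap")
    if target.autonomous and not tested.autonomous:
        untested.append("autonomy")
    if target.situationally_aware and not tested.situationally_aware:
        untested.append("situational awareness")
    if not untested:
        return "verification is direct evidence for the target"
    return "verification transfers only under assumptions about: " + ", ".join(untested)

weak = SystemProfile(capability=3, autonomous=False, situationally_aware=False)
strong = SystemProfile(capability=9, autonomous=True, situationally_aware=True)
print(verification_transfer(plan_passed=True, tested=weak, target=strong))
```

The “careful analysis of how such systems will be functionally different” amounts to filling in that `untested` list with the right properties, which is the hard part the sketch deliberately leaves open.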
https://arxiv.org/abs/1712.05812

It’s directly about inverse reinforcement learning, but that setting should be strictly stronger than RLHF. It seems incumbent on those who disagree to explain why throwing away information here would be enough of a normative assumption (contrary to every story about wishes).
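For context on the “strictly stronger” step: inverse reinforcement learning consumes demonstrations (whole state-action trajectories), while RLHF typically compresses each human judgment into a binary preference over a pair of outputs. Below is a minimal sketch of the two data formats, with hypothetical types, to show what “throwing away information” refers to; the argument, as I read it, is that if preferences are underdetermined given the richer data, they remain underdetermined given the coarser data unless some extra normative assumption is justified.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trajectory:
    """IRL-style datum: a complete demonstration of behavior."""
    states: List[str]
    actions: List[str]

# The linked paper argues a reward function cannot be uniquely
# recovered even from data this rich without normative assumptions.
irl_data: List[Trajectory] = [
    Trajectory(states=["s0", "s1", "s2"], actions=["a0", "a1"]),
]

# RLHF-style datum: each human judgment is compressed to one bit,
# "output A preferred to output B"; everything about *why* is discarded.
rlhf_data: List[Tuple[str, str, int]] = [
    ("completion_A", "completion_B", 0),  # 0 => A preferred
]

print(f"IRL datum: {len(irl_data[0].states)} states, "
      f"{len(irl_data[0].actions)} actions; RLHF datum: 1 bit.")
```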
[1] As opposed to simply being unaware of this issue and becoming surprised when Eliezer brings it up.