Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.
You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you’ll end up burning several years on ideas which don’t actually leave the field better off.
I agree that avoiding the Hard Parts is rarely productive, but you also don’t address one relevant concern: what if the Hard Part is not merely Hard, but actually Impossible?
In that case, your advice can still be cashed out: try to prove the impossibility instead of avoiding the problem. And as with most impossibility results in TCS, even if the precise formulation turns out to be impossible, that often just means you need to reframe the problem a bit.
Mostly, I think the hard parts are things like “understand agency in general better” and “understand what’s going on inside the magic black boxes”. If your response to such things is “sounds hard, man”, then you have successfully identified (some of) the Hard Parts.
I expect you would also say that a crucial hard part many people are avoiding is “how to learn human values?”, right? (Not the true names, but a useful pointer)
The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what’s going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!
I want to note that the failure mode of blind theory here is to accept any story, and thus make the requirement of a story completely impotent to guide research. There’s an art (and hopefully a science) to finding stories that bias towards productive mistakes.
Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it’s where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.
I expect you to partially disagree, but there’s not always a “right” operationalization, and there’s a failure mode where one falls in love with their neat operationalization, making the parts of the phenomenon it misses invisible.
Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.
I want to say that you should start with a behavioral theorem, and often the properties you want to describe might make more sense behaviorally. But I guess you’re going to answer that we have evidence this doesn’t work in Alignment, and so it is avoiding the Hard Part. Am I correct?
Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it’s usually a sign of avoiding the Hard Parts.
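A toy illustration of the point (my own example, not from the post): two hypothetical models that agree on every input are indistinguishable by any purely behavioral theorem, even though their internals differ completely — one memorizes, the other computes.

```python
# Toy illustration: two "models" with identical input-output behavior
# but different internal structure. Any purely behavioral theorem that
# holds for one holds for the other, so it cannot speak to internals.

def model_lookup(x):
    """Returns x^2 on {0, 1, 2, 3} via a memorized table (no structure)."""
    table = {0: 0, 1: 1, 2: 4, 3: 9}
    return table[x]

def model_algorithmic(x):
    """Returns x^2 via an actual multiplication (generalizing structure)."""
    return x * x

# Behaviorally indistinguishable on the whole test domain:
assert all(model_lookup(x) == model_algorithmic(x) for x in range(4))
```

Off the test domain, of course, only the second one generalizes — which is exactly the kind of distinction a behavioral result on the training distribution can’t see.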
One formal example of this is the relativization barrier in complexity theory, which tells you that you can’t prove P≠NP (and a bunch of other separations) using only techniques that treat algorithms as black boxes instead of looking at their structure.
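For concreteness, the formal statement behind the relativization barrier is the Baker–Gill–Solovay theorem (1975): there exist oracles relative to which P vs NP resolves in opposite directions, so any proof technique that relativizes (i.e., goes through unchanged when all machines are given oracle access) cannot settle the question.

```latex
% Baker–Gill–Solovay (1975)
\exists A \;:\; \mathsf{P}^{A} = \mathsf{NP}^{A}
\qquad \text{and} \qquad
\exists B \;:\; \mathsf{P}^{B} \neq \mathsf{NP}^{B}
```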
Once you’re past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.
Agreed that it’s a great pair of principles to keep in mind!
I expect you would also say that a crucial hard part many people are avoiding is “how to learn human values?”, right? (Not the true names, but a useful pointer)
Yes, although I consider that one more debatable.
I expect you to partially disagree, but there’s not always a “right” operationalization...
When there’s not a “right” operationalization, that usually means that the concepts involved were fundamentally confused in the first place.
I want to say that you should start with a behavioral theorem, and often the properties you want to describe might make more sense behaviorally. But I guess you’re going to answer that we have evidence this doesn’t work in Alignment, and so it is avoiding the Hard Part. Am I correct?
Actually, I think starting from a behavioral theorem is fine. It’s just not where we’re looking to end up, and the fact that we want to open the black box should steer what starting points we look for, even when those starting points are behavioral.
When there’s not a “right” operationalization, that usually means that the concepts involved were fundamentally confused in the first place.
Curious about the scope of the conceptual space where this belief was calibrated. It seems to me to tacitly say something like “everything that’s important is finitely characterizable”.
Maybe “fundamentally confused” in your phrasing already covers the case of “stupidly trying to grab something that wasn’t humanly possible, even in principle” as a form of human confusion, without making any claim that reality is conveniently compressible at all levels. (Note that this link explicitly disavows beauty at “all levels” too.)
I suppose you might also say “I didn’t make any claim of finiteness”, but I do think something like “at least some humans are only a finite string away from grokking anything” is implicit if you expect there to be blogposts/textbooks that can operationalize everything relevant. And it would be an even stronger claim than “finiteness”: it would be “human-typical-length strings”.
I believe Adam is pointing at something quite important, akin to a McNamara fallacy for formalization. To paraphrase:
The first step is to formalize whatever can be easily formalized. This is OK as far as it goes. The second step is to disregard that which can’t be easily formalized or to make overly simplifying assumptions. This is artificial and misleading. The third step is to presume that what can’t be formalized easily really isn’t important. This is blindness. The fourth step is to say that what can’t be easily formalized really doesn’t exist. This is suicide.
In the case of something that has already been engineered (human brains with agency), we probably should grant that it is possible to operationalize everything relevant. But I want to push back on the general version, and would want “why do you believe simple formalization is possible here, in this domain?” to be a question one is allowed to ask.
[PS. am not a native speaker]