Peter Thiel pointed out that the common folk wisdom in business that you learn more from failure than success is actually wrong—failure is overdetermined and thus uninteresting.
I think you can make an analogous observation about some prosaic alignment research—a lot of it is the study of (intellectually) interesting failures, which means that it can make for a good nerdsnipe, but it’s not necessarily that informative or useful if you’re actually trying to succeed at (or model) doing something truly hard and transformative.
Glitch tokens, the hot mess work, and various things related to jailbreaking, simulators, and hallucinations come to mind as examples of lines of research and discussion that an analogy to business failure predicts won’t end up being centrally relevant to real alignment difficulties. Which is not to say that the authors of these works are claiming that they will be, nor that this kind of work can’t make for effective demonstrations and lessons. But I do think this kind of thing is unlikely to be on the critical path for trying to actually solve or understand some deeper problems.
Another way of framing the observation above is that it is an implication of instrumental convergence: without knowing anything about its internals, we can say confidently that an actually-transformative AI system (aligned or not) will be doing something that is at least roughly coherently consequentialist. There might be some intellectually interesting or even useful lessons to be learned from studying the non-consequentialist / incoherent / weird parts of such a system or its predecessors, but in my frame, these parts (whatever they end up being) are analogous to the failures and missteps of a business venture, which are overdetermined if the business ultimately fails, or irrelevant if it succeeds.
I agree with this literally, but I’d want to add what I think is a significant friendly amendment. Successes are much more informative than failures, but they are also basically impossible. You have to relax your criteria for success a lot to start getting partial successes; and my impression is that in practice, “partial successes” in “alignment” are approximately 0 informative.
In alignment, on the other hand, you have to understand each constraint that’s known in order to even direct your attention to the relevant areas. This is analogous to the situation with the P vs. NP problem, where whole classes of plausible proof strategies have been proven not to work. You have to understand most of those constraints; otherwise by default you’ll probably be working on e.g. a proof that relativizes and therefore cannot show P≠NP. Progress is made by narrowing the space, and then looking into the narrowed space.
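(To make that example concrete, and assuming the constraint meant here is the standard relativization barrier: the Baker–Gill–Solovay theorem shows there are oracles relative to which the question resolves both ways,

$$\exists\, A, B \ \text{(oracles)}: \quad \mathsf{P}^A = \mathsf{NP}^A \quad \text{and} \quad \mathsf{P}^B \neq \mathsf{NP}^B,$$

so any proof that relativizes, i.e. goes through unchanged when every machine is given access to an arbitrary oracle, cannot settle P vs. NP in either direction.)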
we can say confidently that an actually-transformative AI system (aligned or not) will be doing something that is at least roughly coherently consequentialist.
I don’t think we can confidently say that. If takeoff looks more like a Cambrian explosion than like a singleton (and that is how I would bet), that would definitely be transformative, but the transformation would not be the result of any particular agent deciding what world state is desirable and taking actions intended to bring about that world state.
Studying failures is useful because they highlight non-obvious internal mechanisms, while successes are usually just things working as intended and therefore don’t require explanation.
Another problem is that we don’t have examples of successes, because every measurable alignment success can be a failure in disguise.
I agree with the idea of failure being overdetermined.
But another factor might be that those failures aren’t useful because they relate to current AI. Current AI is very different from AGI or superintelligence, which makes both failures and successes less useful...
Though I know very little about these examples :/
Edit: I misread, Max H wasn’t trying to say that successes are more important than failures, just that failures aren’t informative.
Yeah, but there’s already a bunch of arguments about whether prosaic ML alignment is useful (which people have mostly already made up their minds about), and the OP is interesting because it’s a fairly separate reason to be skeptical about a class of research.
The failure of an interesting hypothesis is informative as long as you understand why it doesn’t work, and can better model how the thing you’re studying works. The difference between CS research and business is that business failures can sort of “come out of nowhere” (“Why isn’t anyone buying our product?” can’t really be answered), whereas, if you look closely enough at the models, you can always learn something from the failure of something that should’ve worked but didn’t.
If we have to retreat from successes to interesting failures, I agree this is a retreat, but I think it’s necessary. I agree that many/most ways of retreating are quite unsatisfactory / unhelpful. Which retreats are more helpful? Generally I think an idea (the idea?) is to figure out highly general constraints from particular failures. See here https://tsvibt.blogspot.com/2025/11/ah-motiva-3-context-of-concept-of-value.html#why-even-talk-about-values and especially the advice here https://www.lesswrong.com/posts/rZQjk7T6dNqD5HKMg/abstract-advice-to-researchers-tackling-the-difficult-core#Generalize_a_lot :
Also cf. here (https://www.lesswrong.com/posts/K4K6ikQtHxcG49Tcn/hia-and-x-risk-part-2-why-it-hurts#Alignment_harnesses_added_brainpower_much_less_effectively_than_capabilities_research_does), quoting the relevant part in full: