I see people increasingly arguing that agency failures are actually alignment failures. This could be right, but it could also be cope. In fact, I am confused about the actual distinction.
Reading this made me think that the framing “Everything is alignment-constrained, nothing is capabilities-constrained” is a rathering, and that a more natural/joint-carving framing is:
To the extent that you can get capabilities by your own means (rather than hoping for reality to give you access to a new pool of some resource or whatever), you get them by getting various things to align so that they produce those capabilities.
To echo my comment from 2 months ago:
Or, in other words, all capabilities stem from “getting things to ‘align’ with each other in the right way”.
Is this a problematic equivocation of the term “alignment”? The term “alignment” is polysemous and thus easy to equivocate on anyway, but if we narrow in on what I consider the most sensible explication of the relevant submeaning, i.e., Tsvi’s “make a mind that is highly capable, and whose ultimate effects are determined by the judgement of human operators”, then I think the framing mostly works (modulo whether you want to apply the term “alignment” to the LLMs at all, which is downstream of other modulos: modulo “highly capable” (and modulo “mind”), and modulo the question of whether there is sufficient continuity or inferential connection between the LLMs you’re talking about here and the possible future omnicide-capable AI or whatever[1]).
I still feel like there’s something wrong or left unsaid in this framing. Perhaps it’s that the tails of the alignment-capabilities distinction (to the extent that you want to use it at all) come apart as you move from the coarse-grained realm, where “thing can do bad thing X but won’t, and that ‘won’t’ is quite robust” is clearly distinct from “thing can’t do X”, to the finer-grained realm of the blurry “thing can’t do X, but for reasons that are too messy to concisely describe in terms of capabilities and alignment”.
[1] These are plausibly very non-trivial modulos … but modulo that non-triviality too.