I’m not sure I understand what you mean by relevant comparison here. What I was trying to claim in the quote is that humanity already faces something analogous to the technical alignment problem in building institutions, which we haven’t fully solved.
If you’re saying we can sidestep the institutional challenge by solving technical alignment, I think this is partly true—you can pass the buck of aligning the Fed onto aligning Claude-N, and in turn onto whatever Claude-N is aligned to, which will either be an institution (same problem!) or some kind of aggregation of human preferences and maybe the good (different hard problem!).
Edit: I misread the sentence. I’ll leave the comments: they are a good argument against a position Raymond doesn’t hold.
Unless I’m misreading you, you’re saying:
(1) Institutional alignment/corrigibility/control might be harder than AI alignment/corrigibility/control.
(2) Supporting evidence for (1) is that “[...] we are currently less than perfect at making institutions corrigible [than AIs], doing scalable oversight on them, preventing mesa-optimisers from forming, and so on”.
But is (2) actually true? Well, there are two comparisons we can make:
(A) Compare the alignment/corrigibility/control of our current institutions (e.g. Federal Reserve) against that of our current AIs (e.g. Claude Opus 4).
(B) Compare the alignment/corrigibility/control of our current institutions (e.g. Federal Reserve) against that of some speculative AIs that had the same capabilities and affordances as those institutions (e.g. Claude-N, FedGPT).
And I’m claiming that Comparison B, not Comparison A, is the relevant comparison for determining whether institutional alignment/corrigibility/control might be harder than AI alignment/corrigibility/control.
And moreover, I think our current institutions are wayyyyyyyyy more aligned/corrigible/controlled than I’d expect from AIs with the same capabilities and affordances!
Imagine if you built an AI which substituted for the Federal Reserve but still behaved as corrigibly/aligned/controlled as the Federal Reserve actually does. Then I think people would be like “Wow, you just solved AI alignment/corrigibility/control!”. Similarly for other institutions, e.g. the military, academia, big corporations, etc.
We can give the Federal Reserve (or similar institutions) a goal like “maximum employment and stable prices” and it will basically follow the goal within legal, ethical, safe bounds. Occasionally things go wrong, sure, but not in a “the Fed has destroyed the sun with nanobots”-kinda way. Such institutions aren’t great, but they are way better than I’d expect from a misaligned AI at the same level of capabilities and affordances.
NB: I still buy that institutional alignment/corrigibility/control might be harder than AI alignment/corrigibility/control.[1] My point is somewhat minor/nitpicky: I think (2) isn’t good evidence and is slightly confused about A vs B.
For example:
Institutions are made partly of brains, which hinders mech interp.
They are made partly of physical stuff that’s hard to copy. Hence: no evals.
There’s no analogue of SGD for institutions, because their behaviour isn’t a smooth function on a manifold of easily-modifiable parameters (see the sketch after this list).
Powerful institutions already exist, so we’d probably be aligning/corrigibilising/controlling these incumbent institutions. But the powerful AIs don’t exist yet, which maybe makes our job easier. (This might also make AI alignment/corrigibility/control harder, because we have less experience.)
That’s all the examples I can think of for now :)
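To make the SGD point concrete, here’s a minimal runnable sketch (a hypothetical toy model invented for illustration; the function, parameters, and learning rate are all assumptions, not anything from the thread). Because the toy AI’s behaviour is a smooth function of its parameters, a gradient tells you exactly which way to nudge them; there’s no analogous update rule for an institution.

```python
import numpy as np

# Hypothetical toy "AI": behaviour is tanh(theta @ x), a smooth function of
# the parameters theta, so gradients exist everywhere and SGD applies.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)                # easily-modifiable parameters
x, target = np.array([1.0, -0.5, 2.0]), 0.3
lr = 0.1                                  # assumed learning rate

def behaviour(theta):
    return np.tanh(theta @ x)

for _ in range(200):
    err = behaviour(theta) - target
    # analytic gradient of (behaviour - target)**2 w.r.t. theta
    grad = 2 * err * (1 - behaviour(theta) ** 2) * x
    theta -= lr * grad                    # the SGD step: smoothly nudge theta

print((behaviour(theta) - target) ** 2)   # ~0: gradient descent found the fix
# No analogous update exists for the Fed: you can't differentiate an
# institution's behaviour with respect to its staff, norms, and org chart.
```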
Ah! Ok, yeah, I think we were talking past each other here.
I’m not trying to claim here that the institutional case might be harder than the AI case. When I said “less than perfect at making institutions corrigible” I didn’t mean “less compared to AI”, I meant “overall not perfect”. So the square brackets you put in (2) were not something I intended to express.
The thing I was trying to gesture at was just that there are kind of institutional analogs for lots of alignment concepts, like corrigibility. I wasn’t aiming to actually compare their difficulty—I think, like you, I’m not really sure, and it does feel pretty hard to pick a fair standard for comparison.
oh lmao I think I just misread “we are currently less than perfect at making institutions corrigible” as “we are currently less perfect at making institutions corrigible”