Edit: I misread the sentence. I’ll leave the comments: they are a good argument against a position Raymond doesn’t hold.
Unless I’m misreading you, you’re saying:
(1) Institutional alignment/corrigibility/control might be harder than AI alignment/corrigibility/control.
(2) Supporting evidence for (1) is that “[...] we are currently less than perfect at making institutions corrigible [than AIs], doing scalable oversight on them, preventing mesa-optimisers from forming, and so on”.
But is (2) actually true? Well, there are two comparisons we can make:
(A) Compare the alignment/corrigibility/control of our current institutions (e.g. Federal Reserve) against that of our current AIs (e.g. Claude Opus 4).
(B) Compare the alignment/corrigibility/control of our current institutions (e.g. Federal Reserve) against that of some speculative AIs that had the same capabilities and affordances as those institutions (e.g. Claude-N, FedGPT).
And I’m claiming that Comparison B, not Comparison A, is the relevant comparison for determining whether institutional alignment/corrigibility/control might be harder than AI alignment/corrigibility/control.
And moreover, I think our current institutions are wayyyyyyyyy more aligned/corrigible/controlled than I’d expect from AIs with the same capabilities and affordances!
Imagine if you built an AI which substituted for the Federal Reserve but was still as corrigible/aligned/controlled as the Federal Reserve actually is. Then I think people would be like “Wow, you just solved AI alignment/corrigibility/control!” Similarly for other institutions, e.g. the military, academia, big corporations, etc.
We can give the Federal Reserve (or similar institutions) a goal like “maximum employment and stable prices” and it will basically follow the goal within legal, ethical, and safe bounds. Occasionally things go wrong, sure, but not in a “the Fed has destroyed the sun with nanobots”-kinda way. Such institutions aren’t great, but they are way better than I’d expect from a misaligned AI at the same level of capabilities and affordances.
NB: I still buy that institutional alignment/corrigibility/control might be harder than AI alignment/corrigibility/control.[1] My point is somewhat minor/nitpicky: I think (2) isn’t good evidence and is slightly confused about A vs B.
For example:
Institutions are made partly of brains, which hinders mech interp.
They are made partly of physical stuff that’s hard to copy. Hence: no evals.
There’s no analogue of SGD for institutions, because their behaviour isn’t a smooth function on a manifold of easily-modifiable parameters (see the sketch after this list).
Powerful institutions already exist, so we’d probably be aligning/corrigibilising/controlling these incumbant institutions. But the powerful AIs don’t exist so maybe makes our job easier. (This might also make AI alignment/corrigibility/control harder because we have less experience.)
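To make the SGD point concrete, here’s a minimal sketch (my illustration, not part of the original comment) of the property SGD relies on: behaviour is a smooth, differentiable function of a parameter vector, so the gradient tells you exactly which small nudge reduces the loss. Nothing about an institution exposes a parameter vector like `w` below.

```python
# Toy gradient-descent loop, illustrative only. The point is that the loss
# is a smooth function of w, so gradients give cheap, targeted, reversible
# updates — exactly what institutions lack.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])   # the "behaviour" we want to induce
x = rng.normal(size=(100, 3))         # toy inputs
y = x @ true_w                        # toy targets

w = np.zeros(3)                       # easily-modifiable parameters
lr = 0.1
for _ in range(200):
    pred = x @ w                          # behaviour is differentiable in w
    grad = 2 * x.T @ (pred - y) / len(y)  # exact gradient of mean squared error
    w -= lr * grad                        # a small, precise correction

print(np.round(w, 3))  # converges toward [ 1. -2.  0.5]
```

An institution’s “parameters” (laws, norms, staffing) admit no such loss gradient, which is the disanalogy this example points at.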
That’s all the examples I can think of for now :)
Ah! Ok, yeah, I think we were talking past each other here.
I’m not trying to claim here that the institutional case might be harder than the AI case. When I said “less than perfect at making institutions corrigible” I didn’t mean “less compared to AI”; I meant “overall not perfect”. So the square brackets you put in (2) weren’t something I intended to express.
The thing I was trying to gesture at was just that there are kind of institutional analogs for lots of alignment concepts, like corrigibility. I wasn’t aiming to actually compare their difficulty; like you, I’m not really sure, and it does feel pretty hard to pick a fair standard for comparison.
oh lmao I think I just misread “we are currently less than perfect at making institutions corrigible” as “we are currently less perfect at making institutions corrigible”