But confusion is in the map, not in the territory.
Confusion can in fact be in the irreducible complexity, and therefore in the territory. “It is not possible to represent the ‘organizing principle’ of this network in fewer than 500 million parameters, which do not fit into any English statement, or even into any conceivably human-readable series of English statements.” Shannon entropy can be like that sometimes.
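A minimal sketch of the entropy point, using zlib as a stand-in for any lossless description scheme: a stream of uniform random bytes (standing in for "irreducibly complex" parameters) cannot be summarized in fewer bytes than it already occupies, while equally long structured data collapses to a short description. The specific data sizes here are illustrative, not drawn from any real model.

```python
import os
import zlib

# Shannon entropy as a hard limit on description length: independent uniform
# random bytes carry ~8 bits of entropy per byte, so no lossless encoding can
# represent them in fewer bytes than they already take.
random_data = os.urandom(500_000)      # stand-in for incompressible parameters
structured_data = b"abab" * 125_000    # same length, but highly regular

compressed_random = zlib.compress(random_data, level=9)
compressed_structured = zlib.compress(structured_data, level=9)

# The random stream barely shrinks (zlib overhead can even make it grow);
# the structured stream collapses to a tiny description.
print(len(compressed_random) / len(random_data))          # ratio near 1.0
print(len(compressed_structured) / len(structured_data))  # ratio near 0.0
```

The same asymmetry is what the quoted claim gestures at: if the network's organizing principle really has high entropy, no short English restatement of it can exist.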
Rather, I would feel awful, because in your sketched-out world you just cannot realistically reach the level of understanding you would need to feel safe ceding the trump card of being the smartest kind of thing around.
I think there are achievable alignment paths that don’t flow through precise mechanistic interpretability. I should write about some of them. But I also don’t think what I’m saying precludes, as you say, having an understanding of individual phenomena in the network. It’s mostly an argument against there being a far more legible way you could have done this if people had just listened to you. That is probably not true, and your ego has to let it go. You have to accept the constraints of the problem as they appear to present themselves.
Well, you don’t have to do anything, but unless you have some kind of deep fundamental insight here, your prior should be that successful alignment plans look more like relying on convergence properties than like aesthetically beautiful ‘clean room’ cognitive architecture designs. There might be some value in decomposing GPT into parts, but I would submit that those parts will still form a system whose downstream consequences are very difficult to predict in the way people usually imply when they say these things. You know, they want it to be like a rocket launch, where we can know in principle what coordinate position X, Y, Z we will be in at time t. I think the kinds of properties we can guarantee will be more like “we wind up somewhere in this general region in a tractable amount of time, so long as an act of god does not derail us”.
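A toy illustration of the distinction, using gradient descent on a simple convex loss (my choice of example, not anything from the discussion above): we cannot say in advance what trajectory a given run will trace, since that depends on the arbitrary starting point, but we can guarantee that every run lands in a small neighborhood of the minimum. That is a convergence-property guarantee, not a rocket-launch coordinate prediction.

```python
import random

def grad(x):
    # Gradient of f(x) = (x - 3)^2, whose unique minimum is at x = 3.
    return 2 * (x - 3)

def descend(x0, lr=0.1, steps=200):
    # Plain gradient descent; each step contracts the distance to the
    # minimum by a factor of (1 - 2*lr) = 0.8, regardless of where we start.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

random.seed(0)
starts = [random.uniform(-100, 100) for _ in range(5)]
ends = [descend(x0) for x0 in starts]

# Wildly different starting points and trajectories; every run still ends
# within 1e-6 of x = 3.
print(all(abs(x - 3) < 1e-6 for x in ends))  # True
```

The guarantee we actually get is about the destination region, not the path: any statement about where the iterate is at step t would require knowing the initialization, while the endpoint bound holds for all of them.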
I think there are achievable alignment paths that don’t flow through precise mechanistic interpretability. I should write about some of them.
Please do! I am very interested in this sort of thinking. Is there preexisting work you know of that runs along the lines of what you think could work?