I believe this may (30%) be the case with corrigibility
Surprising agreement with my credence! On first skim I thought “Uli isn’t thinking correctly about how humans may have an explicit value for corrigible things, so if humans have 10B values, and we have an adequate value theory, solving corrigibility only requires searching for 1 value in the brain, while solving value-alignment requires searching for 10B values”, and I decided this class of arguments brings something roughly corresponding to ‘corrigibility is easier’ to 70%. But then I looked at your credence, and it turned out we agreed.
Mmm, I think it matters a lot which of the 10B[1] values is hardest to instill, and I think most of the difficulty is in corrigibility. Strong corrigibility seems like it basically solves alignment. If this is the case, then corrigibility is a great thing to aim for, since it’s the real “hard part” as opposed to random human values. I’m ranting now though… :L
I think it’s way less than 10B, probably <1000, though I haven’t thought about this much and don’t know what you’re counting as one “value”. (If you mean value shards, maybe closer to 10B; if you mean human-interpretable values, I think <1000.)