Oh, if you have a generous CEV algorithm that’s allowed to parse and slice up external sources or do inference about the results of more elaborate experiments, I expect there’s a way to get to parity with humanity’s CEV by adding 30 bits to Opus 3 that say roughly ‘eh just go do humanity’s CEV’. Or adding 31 bits to GPT-2. It’s not really the base model or any Anthropic alignment shenanigans that are doing the work in that hypothetical.
(We cannot do this in real life because we have neither the 30 bits nor the generous extrapolator, nor may we obtain them, nor could we verify any clever attempts by testing them on AIs too stupid to kill us if the cleverness failed.)