This is very interesting! One thing I’m confused about—I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don’t seem to get far above 50% accuracy, which means that right now they’d have a very hard time error correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self prediction, then its not clear to me that other models wont also get much better at cross-prediction?
This is very interesting! One thing I’m confused about—I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don’t seem to get far above 50% accuracy, which means that right now they’d have a very hard time error correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self prediction, then its not clear to me that other models wont also get much better at cross-prediction?