Partially agreed. I’ve tested this a little personally; Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren’t that good, the latter ones somewhat were okay, I asked Claude to say the same thing across 10 branches and then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.
That’s pretty interesting! I would guess that it’s difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe “willingness to self-report honestly” should be something we train models to do.
willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I’ll try some more ideas.
a context where the capability is even part of the author context
Can you unpack that a bit? I’m not sure what you’re pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?
Partially agreed. I’ve tested this a little personally; Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren’t that good, the latter ones somewhat were okay, I asked Claude to say the same thing across 10 branches and then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.
That’s pretty interesting! I would guess that it’s difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe “willingness to self-report honestly” should be something we train models to do.
willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I’ll try some more ideas.
Can you unpack that a bit? I’m not sure what you’re pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?