the gears to ascension comments on Daniel Tan’s Shortform

the gears to ascension 17 Jan 2025 19:46 UTC
4 points
0
Partially agreed. I’ve tested this a little personally: Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren’t that good; the latter were somewhat okay. For the latter, I asked Claude to say the same thing across 10 branches, then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.
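One way to score the branch test described above is to compare the verbally predicted distribution against the empirical one from the sampled branches. This is a minimal sketch and not the procedure actually used: the answer strings, the predicted probabilities, and the choice of total variation distance as the metric are all assumptions made for illustration.

```python
from collections import Counter


def empirical_distribution(samples):
    """Turn a list of sampled answers into {answer: relative frequency}."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: n / total for answer, n in counts.items()}


def total_variation_distance(p, q):
    """Return 0.0 for identical distributions and 1.0 for disjoint support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Hypothetical data: outputs of 10 branches sampled from the same context.
branch_outputs = ["blue"] * 6 + ["green"] * 3 + ["red"]

# Hypothetical data: the distribution a separate thread verbally predicted.
predicted = {"blue": 0.5, "green": 0.3, "red": 0.2}

observed = empirical_distribution(branch_outputs)
score = total_variation_distance(observed, predicted)
print(f"total variation distance: {score:.3f}")
```

With only 10 branches the empirical distribution is itself noisy, so a small nonzero distance would not by itself show that the verbal prediction was wrong.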
- Daniel Tan 17 Jan 2025 21:23 UTC
  3 points
  0
  That’s pretty interesting! I would guess that it’s difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe “willingness to self-report honestly” should be something we train models to do.
  - the gears to ascension 18 Jan 2025 3:29 UTC
    2 points
    0
    willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that; with few-shot, one has to figure out how to make it correlate. I’ll try some more ideas.
    - eggsyntax 2 Apr 2025 13:37 UTC
      2 points
      0
      “a context where the capability is even part of the author context”
      Can you unpack that a bit? I’m not sure what you’re pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?