I’m not sure if it’s because the follow-up question is doing a lot of heavy lifting or not. But I suspect it may not be the major contributor for the 50%+ probability items, since it did say <15% for long conversations that really were very un-eval-like, like my iterative process of “I want to build this tracking app, and I want it to suit my UX taste, and so I’ll just keep telling Claude to tweak anything I don’t like”.
But, you know, very small sample size, thrown together in an hour or two, LLM’s are weird, insert additional caveats here.
Yeah, and come to think of it there’s also LLMs’ RLHF’d-in tendency to move all probabilities towards 50% to accommodate humans being bad at handling probabilities (“You said 20% chance and it happened anyway? Stupid machine!”).
This was a bit too long for a comment (1600 words, but lots of AI generated replies for obvious reasons) so it’s in a Google Doc: https://docs.google.com/document/d/1pXiVpBz727Ok4Z_TD8XEFViZjKd9g28qWBdwLCET40o/edit?usp=sharing
AI generated text is in italics throughout.
Oh wow. That’s way higher p(eval) than I was expecting on non-evals.
I’m not sure if it’s because the follow-up question is doing a lot of heavy lifting or not. But I suspect it may not be the major contributor for the 50%+ probability items, since it did say <15% for long conversations that really were very un-eval-like, like my iterative process of “I want to build this tracking app, and I want it to suit my UX taste, and so I’ll just keep telling Claude to tweak anything I don’t like”.
But, you know, very small sample size, thrown together in an hour or two, LLM’s are weird, insert additional caveats here.
Yeah, and come to think of it there’s also LLMs’ RLHF’d-in tendency to move all probabilities towards 50% to accommodate humans being bad at handling probabilities (“You said 20% chance and it happened anyway? Stupid machine!”).
But even modulo all that, I’m still surprised!