Yeah, it’s hard to be confident you’ve sufficiently adjusted your prompts and your evaluation of responses to account for this kind of problem. I think I’m pretty effective at it, and at sufficiently fact- and source-checking responses, but I’m sure this still happens to some degree.
I do find a lot of people don’t bother doing this at all, or put much less effort into it than I (or a number of people here) do. So I apply a lot more doubt to their LLM outputs than to my own unless I can reproduce them… which knowingly opens me up to a whole extra kind of biased-evaluation risk.