Furthermore, when we measure that competence, it usually falls somewhere within the human range of performance.
I think that for this to be meaningfully true, the LLM should be able to actually replace humans at a given task. There are some very specific domains in which this is doable (e.g. creative writing assistant), but it seems to me that they are still mostly too unreliable for this.
I’ve worked on getting GPT-4 to act as a coach for business customers. This is one of the domains it excels at: the work can be done entirely inside a chat, and the focus is on asking users questions and paraphrasing their answers, so hallucinations are usually not a major issue. And yet it’s stupid in some very frustrating ways that a human wouldn’t be.
For example, our users would talk with the bot at specific times, which they would schedule using a separate system. Sometimes they would ask the bot to change their scheduled time. The bot had no interface to the actual scheduling system, but it had been told to act like a helpful coach, so by default it would say something like “of course, I have moved your session time to X”. This was bad, since the user would think the session had been moved when it hadn’t.
Well, easy to fix, right? Just add “if the user asks you to reschedule the session or do anything else that requires doing something outside the actual conversation, politely tell them that you are unable to do that” to the prompt.
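For concreteness, the shape of the fix is just appending that clause to the coach’s system prompt, roughly like this (the wording here is illustrative, not our actual production prompt):

```python
# Illustrative only: roughly the shape of the fix, not the actual production prompt.
COACH_SYSTEM_PROMPT = (
    "You are a helpful coach for business customers. "
    "Ask the user questions about their situation and paraphrase their answers. "
    # The added guardrail:
    "If the user asks you to reschedule the session, or to do anything else that "
    "requires doing something outside the actual conversation, politely tell them "
    "that you are unable to do that."
)
```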
This did fix the problem… but it created a new one. Now the bot would start telling the user “oh and please remember that I cannot reschedule your session” as a random aside, when the user had never said anything about rescheduling the session.
Okay, so what about adding something like “(but only tell this if the user brings it up, don’t say it spontaneously)” to our prompt? That reduced the frequency of the spontaneous asides a little… but not enough to eliminate them. Eventually we just removed the whole thing from the prompt and decided that the occasional user getting a misleading response was better than the bot randomly bringing this up all the time.
Another basic instruction you would think is easy to follow is “only ask one question at a time”. We had a bit in a prompt that went something like “Ask exactly one question. Do not ask more than one question. Stop writing your answer once it contains a question mark.” The end result? GPT-4 happily sending multi-question messages like “What is bothering you today? What kinds of feelings does that bring up?”.
There are ways to fix these issues, like having another LLM instance check the first instance’s messages and rewrite any that are bad (roughly the kind of second pass sketched below). But at that point, it’s back to fragile hand-engineering to get the kinds of results one wants, because the underlying model is firmly below a human level of competence. I don’t think LLMs (or at least GPT-4) are yet at the level of reliability that human-level performance requires.
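A minimal sketch of what that second-pass check might look like, using the OpenAI Python client (the guard prompt wording and the check_and_rewrite helper are illustrative, not what we actually shipped):

```python
from openai import OpenAI

client = OpenAI()

# Illustrative guard prompt; the exact wording would need its own tuning.
GUARD_PROMPT = (
    "You review a coaching bot's draft reply before it is sent to the user. "
    "If the draft asks more than one question, rewrite it so that it asks exactly one. "
    "Otherwise, return the draft unchanged."
)

def check_and_rewrite(draft_reply: str) -> str:
    """Second GPT-4 pass that inspects the first instance's draft and fixes it if needed."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": GUARD_PROMPT},
            {"role": "user", "content": draft_reply},
        ],
    )
    return response.choices[0].message.content
```

Of course, this doubles the latency and cost per message, and the checker can fail in its own ways, which is part of why it still feels like fragile hand-engineering.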
I don’t mean to belabor the point, as I think it’s reasonable, but it’s worth pointing out that these responses seem within the range of below-average human performance.
I was going to say the same. I can’t count the number of times a human customer service agent has tried to do something for me, or told me they already did do something for me, only for me to later find out they were wrong (because of a mistake they made), lying (because their scripts required it or their metrics essentially forced them into it), or foiled (because of badly designed backend systems opaque to both of us).