Don’t Judge a Tool by its Average Output

Epistemic status: this post’s main point itself is probably somewhat obvious. The concrete examples may leave some room for disagreement. No guarantees about my views on ChatGPT vs GPT3 in particular.

Summary: Many tools and platforms give users outputs that depend on their inputs, and these outputs can vary greatly in quality or usefulness. People’s intuition in such cases is often that if the outputs are of low quality sufficiently often, then the tool must be useless. This conclusion can be mistaken, however, if it’s possible to systematically obtain high-quality outputs by deliberately altering one’s inputs or filtering the outputs one consumes.

Let’s look at three examples: ChatGPT, Reddit and Elon Musk (I’ll refer to all of them as “tools” for simplicity). When judging such tools, it’s very easy to look at the typical interaction of a typical user and base one’s judgment on that. For ChatGPT, this may mean that somebody asks it a factual or math question and gets a very confident but incorrect reply. So some people think: ChatGPT is a nice toy, but given how unreliable it is, it’s ultimately useless (at least as of January 2023). Reddit, on the other hand, is considered by many people to be basically a huge time sink, an endless feed of entertainment and banter that gets in the way of more important things. As for Elon Musk, who’s admittedly a somewhat far-fetched example for this whole concept, it’s less about actual “input” and more about how you select and filter the output he produces. When it comes to his views and opinions, for example, people may think of some typical widely shared tweet of his, which probably scores highly on the troll scale. Hence people may get the impression that Musk is mad and/or stupid and that there’s nothing to learn from him.

[Image: Elon Musk right now, probably]

These three views surely have some truth to them, and there are probably many people who would be better off not using these tools at all (i.e. not using ChatGPT, not signing up to or visiting Reddit, and avoiding exposure to things Elon Musk has said), or at least not using them in the way they have so far.

However, I think that this is a very incomplete view of usefulness. Such tools are not inherently useful or useless; their usefulness depends greatly on how you interact with them.

  • For ChatGPT, you can decide both what type of task to use it for and how to prompt it for any given task. Both dimensions allow you to influence how reliable and useful the output will be to you.

  • In the case of Reddit, “how you use it” basically means which subreddits make up your front page or, if you don’t use your front page at all, which subreddits you visit in which situations.

  • And finally for Elon Musk, you can be selective in your consumption – e.g. his tweets are probably much less worthwhile than his podcast appearances.

What it comes down to is that, when you as the user of a tool have some influence over what type of output you get, then the general average output of that tool is mostly irrelevant. When judging how useful a tool is, you should incorporate your personal capacity to make that tool as useful as possible for you.

GPT3 and ChatGPT

Connor Leahy once made the point that GPT3 is probably much more intelligent than we can really tell, because the interface – prompting – strongly limits the output quality. In a way GPT3 outputs are always only as smart or useful as the prompts it gets. ChatGPT then was fine-tuned so as to make it more likely to produce non-useless replies, which greatly increases the average output quality over GPT3, while reducing the variance somewhat. However, depending on how much effort you put into crafting good prompts, ChatGPT may even be worse at some things than GPT3: while its average output may be better, the reduction in variance means that it has a “shorter tail”, i.e. the most useful outputs are more likely to be produced by GPT3 rather than by ChatGPT, as illustrated by the following qualitative diagram:

Figure: Highly simplified histogram of the distribution of usefulness of outputs by GPT3 and ChatGPT. The x-axis is usefulness; the y-axis is something like the relative number of prompts yielding that level of usefulness. ChatGPT is more useful on average, and much more predictable, but due to the higher variance of its outputs GPT3 may be capable of the most useful outputs (arrow). If people can find systematic ways (by task selection & prompting) to obtain these very good outputs reliably, then GPT3 is the better tool despite its lower average output quality.

Ultimately it depends on what type of user you are: if you’re willing to put in some time and effort to make a tool suit your needs, then the tool’s average output is mostly irrelevant and you should focus more on the reliably achievable best case.

On the other hand, if this is too much to ask and you don’t want to get too invested in a tool, then the average may indeed be a reasonable metric to use after all. ChatGPT is certainly a step up from GPT3 for most users and many use cases, but not for all. Which tool is better still depends on the concrete use case and on whether you’re aiming for the very best outputs.
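The difference between judging by the average and judging by the achievable best case can be made concrete with a toy simulation. The sketch below is purely illustrative: the two normal distributions, their parameters, and the “best of 20 attempts” rule are assumptions chosen to mirror the qualitative picture above, not measurements of GPT3 or ChatGPT. Picking the best of several attempts is only a rough stand-in for an invested user who can steer towards, and recognize, good outputs.

```python
import random
import statistics


def simulate(mean, stdev, attempts, trials, seed):
    """Return (average quality of a single output,
    average quality of the best output out of `attempts` tries)."""
    rng = random.Random(seed)
    singles = []
    bests = []
    for _ in range(trials):
        draws = [rng.gauss(mean, stdev) for _ in range(attempts)]
        singles.append(draws[0])   # what a casual user sees
        bests.append(max(draws))   # what an invested user can select
    return statistics.mean(singles), statistics.mean(bests)


# Hypothetical "usefulness" parameters, not real data.
# Tool A: higher average, lower variance (the "predictable" tool).
# Tool B: lower average, higher variance (the "long tail" tool).
a_avg, a_best = simulate(mean=6.0, stdev=1.0, attempts=20, trials=5000, seed=0)
b_avg, b_best = simulate(mean=4.0, stdev=3.0, attempts=20, trials=5000, seed=1)

print(f"Tool A: single output {a_avg:.2f}, best of 20 {a_best:.2f}")
print(f"Tool B: single output {b_avg:.2f}, best of 20 {b_best:.2f}")
```

Under these assumed parameters, tool A wins for someone who takes a single output as it comes, while tool B wins for someone who takes the best of many attempts. Which comparison matters depends on which kind of user you are.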

The Less Is More Effect

In Thinking, Fast and Slow, Kahneman describes the “less-is-more effect” with the example of buying a set of dishes: people are willing to pay a significantly higher price for a set of intact dishes than for the same set of dishes plus a single broken piece. One explanation for this is that people instinctively focus more on averages than on sums. In this case, the average is pulled down by the broken dish, decreasing the perceived value, although the absolute value should be at least as high as that of the other set.
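The arithmetic behind this is simple enough to spell out. In the sketch below, the number of pieces and the per-piece value are made-up numbers for illustration only, and the broken piece is assumed to be worth nothing.

```python
from statistics import mean

# Hypothetical values: 24 intact pieces worth 5.0 each.
intact_set = [5.0] * 24

# The same set plus one broken piece, assumed to be worth 0.
set_with_broken_piece = intact_set + [0.0]

total_intact = sum(intact_set)
total_with_broken = sum(set_with_broken_piece)

average_intact = mean(intact_set)
average_with_broken = mean(set_with_broken_piece)

# The totals are identical, but the average drops for the larger set.
print(f"Totals:   {total_intact} vs {total_with_broken}")
print(f"Averages: {average_intact} vs {average_with_broken}")
```

A buyer who reasons by the total sees two equally valuable sets; a buyer who reasons by the average sees the second set as worse, even though nothing was taken away.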

I believe that people’s judgment of tools and platforms works in a very similar way, and it takes some conscious effort to overcome this effect and realize that the average is sometimes misleading, particularly when you can easily ignore the lower-quality parts and systematically pick the high-quality cherries.

Conclusion

Before dismissing a tool as useless, consider whether you’re putting too much weight on its average performance while ignoring that deliberate inputs or systematic filtering of outputs might make the tool much more useful than it intuitively seems.