The primary optimization target for LLM companies/engineers seems to be making them seem smart to humans, particularly the nerds who seem prone to using them frequently. A lot of money and talent is being spent on this. It seems reasonable to expect that they are less smart than they seem to you, particularly if you are in the target category. This is a type of Goodharting.
In fact, I am beginning to suspect that they aren’t really good for anything except seeming smart, and most rationalists have totally fallen for it, for example Zvi insisting that anyone who is not using LLMs to multiply their productivity is not serious (this is a vibe not a direct quote but I think it’s a fair representation of his writing over the last year). If I had to guess, LLMs have 0.99x’ed my productivity by occasionally convincing me to try to use them which is not quite paid for by very rarely fixing a bug in my code. The number is close to 1x because I don’t use them much, not because they’re almost useful. Lots of other people seem to have much worse ratios because LLMs act as a superstimulus for them (not primarily a productivity tool).
Certainly this is an impressive technology, surprising for its time, and probably more generally intelligent than anything else we have built—not going to get into it here, but my model is that intelligence is not totally “atomic” but has various pieces, some of which are present and some missing in LLMs. But maybe the impressiveness is not a symptom of intelligence, but the intelligence a symptom of impressiveness—and if so, it’s fair to say that we have (to varying degrees) been tricked.
I use LLMs throughout my personal and professional life. The productivity gains are immense. Yes, hallucination is a problem, but it's just like spam/ads/misinformation on Wikipedia/the internet: a small drawback that doesn't negate the ginormous potential of the internet/LLMs.
I am 95% certain you are leaving value on the table.
I do agree straight LLMs are not generally intelligent (in the sense of universal intelligence/AIXI) and therefore not completely comparable to humans.
On LLMs vs. searching the internet: I agree that LLMs are very helpful in many ways, both personally and professionally, but in my opinion the worse aspects of misinformation from LLMs compared to Wikipedia/the internet include: 1) it is relatively more unpredictable when the model will hallucinate, whereas on Wikipedia/the internet you would generally expect higher accuracy for simple, purely factual, or mathematical information; 2) it is harder to judge credibility without knowing the source of the information, whereas on the internet we can get some signal from the website domain, etc.
From my personal experience, I agree. I find myself unexcited about trying the newest LLM models. My main use-case in practice these days is Perplexity, and I only use it when I don’t care much about the accuracy of the results (which ends up being a lot, actually… maybe too much). Perplexity confabulates quite often even with accurate references in hand (but at least I can check the references). And it is worse than me at the basics of googling things, so it isn’t as if I expect it to find better references than me; the main value-add is in quickly reading and summarizing search results (although the new Deep Research option on Perplexity will at least iterate through several attempted searches, so it might actually find things that I wouldn’t have).
I have been relatively persistent about trying to use LLMs for actual research purposes, but the hallucination rate seems to go to 100% almost whenever an accurate result would be useful to me.
The hallucination rate does seem adequately low when talking about established mathematics (so long as you don't ask for novel implications, such as applying ideas to new examples). For this and other reasons I think they can be quite helpful for people trying to get oriented to a subfield they aren't familiar with; an LLM can make for a great study partner, so long as you verify what it says by checking other references.
Also decent for coding, of course, although the same caveat applies: coders who are already experts in what they are trying to do will get much less utility out of it.
I recently spoke to someone who made a plausible claim that LLMs were 10xing their productivity in communicating technical ideas in AI alignment with something like the following workflow:
Take a specific cluster of failure modes for thinking about alignment which you’ve seen often.
Hand-write a large, careful prompt document about the cluster of alignment failure modes, which includes many specific trigger-action patterns (if someone makes mistake X, then the correct counterspell to avoid the mistake is Y). This document is highly opinionated and would come off as rude if directly cited/quoted; it is not good communication. However, it is something you can write once and use many times.
When responding to an email/etc, load the email and the prompt document into Claude and ask Claude to respond to the email using the document. Claude will write something polite, informative, and persuasive based on the document, with maybe a few iterations of correcting Claude if its first response doesn’t make sense. The person also emphasized that things should be written in small pieces, as quality declines rapidly when Claude tries to do more at once.
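As a concrete illustration, here is a minimal sketch of that email-response step using the Anthropic Python SDK. The workflow was described to me verbally, so the file names, model alias, and prompt wording below are my own assumptions (the person may well have just used the Claude web interface), not their actual setup.

```python
# Minimal sketch of the "respond to an email using a prepared failure-modes
# document" step. Assumes the Anthropic Python SDK (`pip install anthropic`)
# and an ANTHROPIC_API_KEY in the environment. File names, model alias, and
# prompt wording are illustrative assumptions, not the original setup.
import anthropic

client = anthropic.Anthropic()

# The hand-written, opinionated document of alignment failure modes and
# trigger-action "counterspells" (written once, reused many times).
with open("alignment_failure_modes.md") as f:
    failure_modes_doc = f.read()

# The email being responded to.
with open("incoming_email.txt") as f:
    email_text = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model alias
    max_tokens=1024,
    system=(
        "You are helping draft a polite, informative, persuasive reply to an "
        "email. Use the reference document below to identify any reasoning "
        "mistakes in the email and address them tactfully. Keep the reply "
        "short; draft one small piece at a time.\n\n"
        "REFERENCE DOCUMENT:\n" + failure_modes_doc
    ),
    messages=[
        {"role": "user",
         "content": "Please draft a reply to this email:\n\n" + email_text},
    ],
)

# Review and iterate by hand (per the workflow, correct Claude over a few
# rounds and keep each piece small) before sending anything.
print(response.content[0].text)
```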
They also mentioned that Claude is awesome at coming up with meme versions of ideas to include in powerpoints and such, which is another useful communication tool.
So, my main conclusion is that there isn’t a big overlap between what LLMs are useful for and what I personally could use. I buy that there are some excellent use-cases for other people who spend their time doing other things.
Still, I agree with you that people are easily fooled into thinking these things are more useful than they actually are. If you aren’t an expert in the subfield you’re asking about, then the LLM outputs will probably look great due to Gell-Mann Amnesia type effects. When checking to see how good the LLM is, people often check the easier sorts of cases which the LLMs are actually decent at, and then wrongly generalize to conclude that the LLMs are similarly good for other cases.
for example Zvi insisting that anyone who is not using LLMs to 10x their productivity is not serious … a vibe not a direct quote
I expect he'd disagree; for example, I vaguely recall him mentioning that LLMs are not useful in a productivity-changing way for his own work. And 10x specifically seems clearly too high for most things even where LLMs are very useful; other bottlenecks will dominate before that happens.
10x was probably too strong, but his posts are very clear that he thinks it's a large productivity multiplier. I'll try to remember to link the next instance I see.
AI doesn’t accelerate my writing much, although it is often helpful in parsing papers and helping me think through things. But it’s a huge multiplier on my coding, like more than 10x.
Found the following in the Jan 23 newsletter: