One thing that I find somewhat confusing is that the “time horizon” equivalent for AIs reading blog posts seems so short. This is very vibes-y, but suppose I operationalize the question as: “If I read a blog post for X amount of time, at what X does Claude have a >50% chance of identifying more central errors than I do?” Intuitively, X feels very short: well under an hour, and likely under 10 minutes.
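For concreteness, here is one way that question could be turned into a number, roughly in the style of METR's time-horizon measurements: fit success probability against log task length and solve for where it crosses 50%. This is a minimal sketch; every data point below is invented purely for illustration.

```python
# Sketch: estimate the 50% "time horizon" for the blog-post task.
# Each hypothetical row is one post: task_minutes is how long the post
# takes a human to read, success is 1 if Claude found more central
# errors than the human reader did. All values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([1, 2, 3, 5, 8, 10, 15, 20, 30, 45, 60, 90])
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Work in log-minutes, since task lengths span orders of magnitude.
X = np.log(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, success)

# P(success) = 0.5 exactly where the logit is zero: w * log(t) + b = 0.
w, b = model.coef_[0][0], model.intercept_[0]
horizon_minutes = np.exp(-b / w)
print(f"Estimated 50% time horizon: {horizon_minutes:.1f} minutes")
```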
This is in some sense surprising, since reading feels like a task they’re extremely natively suited for, and on other tasks like programming their time horizons tend to be multiple hours.
I don’t have a good resolution to this.
My best guess is something like: coding is both more routinized and more explicit. That is, the thinking and the tokens are synonymous. You can’t think the correct code and fail to implement it, because what you’re thinking is literally the exact text of the code. Whereas you can think correct-seeming text in natural language and still fail to implement the correct idea.
Another way of saying this: in natural language, more of the reasoning steps are in excess of the tokens; the most bare-bones version of the reasoning process implicitly requires steps that never appear as text.
Probably you’ve already thought of this and thought of a reason that it’s wrong.
The task you’re talking about isn’t reading, it’s logical analysis or something like that.
I think it is noteworthy that current LLMs are bad at this.
Of course, what counts as an error within the local context of the essay, given all the shared context the writer and reader rely on to interpret it, is highly subjective and debatable. So you would need some sort of committee of expert humans to compare against.
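To make the committee idea concrete, here is one hypothetical scoring scheme: treat an error as “real” when a majority of expert readers independently flag it, then score each reader (human or model) by how many majority-backed errors they caught. The labels and sets below are invented for illustration, not any actual methodology.

```python
# Sketch: majority-vote ground truth from an expert committee, then
# recall of each reader against it. All data is hypothetical.
from collections import Counter

# Each expert's set of flagged error IDs for one essay.
experts = [
    {"e1", "e2", "e4"},
    {"e1", "e2", "e3"},
    {"e1", "e4", "e5"},
]

# An error counts as real if more than half the committee flagged it.
counts = Counter(e for flagged in experts for e in flagged)
ground_truth = {e for e, c in counts.items() if c > len(experts) / 2}

def recall(found: set[str]) -> float:
    """Fraction of majority-backed errors a reader identified."""
    return len(found & ground_truth) / len(ground_truth) if ground_truth else 0.0

claude_found = {"e1", "e3", "e4"}
human_found = {"e1", "e2"}
print(recall(claude_found), recall(human_found))
```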