Paul Graham has some good pieces on this:
I can’t help but notice that, just looking at the Claude models, later/more capable models are worse at this. Is there a chance this could be at least partly sandbagging?
Not an expert, but I think the difference is this. Current and older LLMs produce each token via a “forward pass”: information only flows forwards through the model, so what happens at a later layer can’t influence what happens at an earlier layer. What people call “neuralese” refers to building neural nets where information can also flow backwards in the model, so that the model can pass information back to itself “in its head” rather than only being able to pass information back to itself by outputting tokens and then reading them back in. This is a known technique and has been done before, but it’s hard to train large models with that architecture. I was going to try to explain why, but I realized I don’t understand it well enough myself to explain, so I’ll leave it there.
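To make the contrast concrete, here is a toy sketch of my own (not any real architecture; the class names, sizes, and crude mean-pooling are all invented for illustration). In the first model the only thing carried between steps is the discrete token history; in the second, a dense hidden vector is also passed back to the model alongside the tokens.

```python
# Toy illustration only: the difference is what the model carries between steps --
# discrete tokens alone, or a dense "thought" vector as well.
import torch
import torch.nn as nn

VOCAB, DIM = 100, 32

class TokenOnlyModel(nn.Module):
    """Standard autoregressive setup: each step sees only the token history."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                      # tokens: (seq_len,)
        h = self.embed(tokens).mean(dim=0)          # crude summary of the history
        return self.head(h)                         # logits for the next token

class RecurrentStateModel(nn.Module):
    """Also passes a dense hidden vector back to itself between steps."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.mix = nn.Linear(2 * DIM, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, state):               # state: (DIM,)
        h = self.embed(tokens).mean(dim=0)
        new_state = torch.tanh(self.mix(torch.cat([h, state])))
        return self.head(new_state), new_state      # logits plus the carried "thought"

tokens = torch.tensor([1])
token_model, state_model = TokenOnlyModel(), RecurrentStateModel()
state = torch.zeros(DIM)
for _ in range(5):
    # Token-only: anything the model "remembers" must survive the argmax bottleneck.
    next_tok = token_model(tokens).argmax()
    # With recurrent state: a full DIM-dimensional vector is carried between steps.
    _, state = state_model(tokens, state)
    tokens = torch.cat([tokens, next_tok.unsqueeze(0)])
```

In the token-only loop, everything the model wants to tell its future self has to be squeezed through the single emitted token; in the second loop, the carried vector is exactly the “in its head” channel.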
You’ve probably already seen this, but for others reading this post: Anthropic now seems to have put out some more official numbers on this: https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic
It seems to mostly validate your read on the situation. They did internal surveys, qualitative interviews, and some analysis of Claude Code transcripts. Here are their “key takeaways” from the survey section:
Survey data
Anthropic engineers and researchers use Claude most often for fixing code errors and learning about the codebase. Debugging and code understanding are the most common uses (Figure 1).
People report increasing Claude usage and productivity gains. Employees self-report using Claude in 60% of their work and achieving a 50% productivity boost, a 2-3x increase from this time last year. This productivity looks like slightly less time per task category, but considerably more output volume (Figure 2).
27% of Claude-assisted work consists of tasks that wouldn’t have been done otherwise, such as scaling projects, making nice-to-have tools (e.g. interactive data dashboards), and exploratory work that wouldn’t be cost-effective if done manually.
Most employees use Claude frequently while reporting they can “fully delegate” 0-20% of their work to it. Claude is a constant collaborator but using it generally involves active supervision and validation, especially in high-stakes work—versus handing off tasks requiring no verification at all.
So on average people report about a 50% productivity boost, but don’t actually “fully delegate” much work to it, and a good amount of the productivity boost comes from doing things that wouldn’t have been high-priority enough to get done otherwise. The data from the qualitative interviews and the Claude Code transcript analysis seems to tell a similar story. I don’t think they give an actual statistic on “lines of production code committed”, though.
This seems closely related to John Perry’s “The Problem of the Essential Indexical” (although it’s been like ten years since I read it): https://dl.booksee.org/foreignfiction/581000/4897fc2fba1f8af4ea7db3d9654bbbb3.pdf/_as/%255Bperry_john%255D_the_problem_of_the_essential_indexica%28booksee.org%29.pdf
Basically, he argues that there are certain “locating beliefs” (things like “I am John Perry,” “Now is noon,” “Here is the trail that leads out of the woods”) that are logically ineliminable. No matter how you try, you need an indexical in there somewhere to capture them. In your terms, a 0th-person logic could never incorporate these types of beliefs.
I have a short post critiquing your argument here. Here’s the key part:
I’ll borrow and slightly simplify one of their own examples. Suppose Marian has the goal to paint a certain wall blue. It will take her two days to do so. But, if she doesn’t press a “goal preserve” button on the first day, she will lose her “paint-the-wall-blue” goal, and so she’ll never finish painting it blue (and she knows this).

Now, suppose Marian never presses the “goal preserve” button. According to the authors, there is no time at which Marian is instrumentally irrational. On Day 1, she has her paint-the-wall-blue goal (which we can call B for short), and takes the means to achieve B. On Day 2, she does not take the means to achieve B, but she also no longer has B, so this is not instrumentally irrational. So, Marian is never instrumentally irrational.
…
I want to argue that Marian is irrational to not press the goal-preserve button, and the time at which she is irrational is the time at which she decides not to press it.[4] At that moment, she still has goal B, and she’s choosing an action (giving up goal B) that will prevent B from being achieved. That’s textbook means-irrationality.

My argument is as follows:
1. Failure to take the necessary means to achieve one’s current goal is a failure of means-rationality.
2. At time D (when she is deciding whether to press the button), Marian’s current goal is B (to paint the wall blue).
3. At time D, Marian has two options: (i) give up B (by not pressing the button) or (ii) keep B (by pressing the button).
4. If Marian keeps B, then B will be achieved in the future (i.e. she will paint the wall blue); if she gives it up, it won’t.
5. (From 4): Keeping B is a necessary means to achieving B.
6. (From 2, 5): Giving up B is a failure to take the necessary means to achieving Marian’s current goal.
7. (From 1, 6): Giving up B is a failure of means-rationality.
So, the proposed failure of means-rationality doesn’t happen on Day 2, when Marian fails to paint the wall blue; it happens at time D, when she decides not to keep her goal of painting the wall blue. And the reason it is a failure of means-rationality is that keeping her goal is itself a necessary means to achieving her goal.
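For anyone who likes seeing the inference structure spelled out, here is a minimal propositional sketch in Lean. This is my own hypothetical formalization, not anything from the post: it captures only the top-level step (premises 1 and 6 together yield 7), treating the propositions as atomic and leaving the sub-derivations of 5 and 6 unmodelled.

```lean
-- Hypothetical formalization of the final inference step (1, 6 ⊢ 7).
-- The three propositions are treated as atomic.
example (GivesUpB FailsNecessaryMeans MeansIrrationalAtD : Prop)
    -- Premise 1: failing to take the necessary means to one's current goal
    -- is a failure of means-rationality.
    (p1 : FailsNecessaryMeans → MeansIrrationalAtD)
    -- Premise 6: giving up B is such a failure.
    (p6 : GivesUpB → FailsNecessaryMeans) :
    -- Conclusion 7: giving up B is a failure of means-rationality at D.
    GivesUpB → MeansIrrationalAtD :=
  fun h => p1 (p6 h)
```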
This is a version of what the authors call the “Delay Objection.” In response to the delay objection, they write:
To overcome this objection, we must defend the following claim: an agent with a goal at time t1 does not violate means-rationality if it fails to take an action at t1 to ensure it still has that goal at time t2.
While it would normally violate means-rationality for an agent to ensure now that they will later lack the means they require to achieve their goal, this is not true when the means in question is the goal itself. Means-rationality does not prohibit setting oneself up to fail concerning a goal one currently has but will not have at the moment of failure, as this never causes an agent to fail to achieve the goal that they have at the time of failing to achieve it. An agent cannot have failed to achieve a goal due to abandoning that goal while it still has that goal. Such a failure necessarily occurs after the agent no longer has that goal: but at that point, it is not the agent’s goal.
The problem with this reasoning is that failing to achieve one’s goal is not constitutive of means-irrationality. What’s constitutive of means-irrationality is failing to take the necessary means to achieve one’s goal. And maintaining one’s goal is a necessary means to achieve one’s goal (again, with the usual exceptions; see fn. 1). At time D, Marian can see that if she doesn’t keep her paint-the-wall-blue goal, the wall won’t get painted blue. Since painting the wall blue is what she wants (at D), not keeping that goal is means-irrational at D.
It’s not exactly the same, of course, but Yudkowsky has been predicting for a really long time that ASIs would be able to effectively hack people’s minds.