Ph.D. in applied microeconomics, periodically thinking seriously about the impact of AI on employment and wages since Move 37.
Tim H
you no longer fear death because you know heaven awaits you, everything is meaningful because god, and you can connect with other people over believing in god
This is not my experience of being Catholic. I’m doubtful the “old internet atheists” (or your other sources) have given you a steelmanned version of “the christian god” etc.
Hopefully it’s not hard for you to imagine someone similarly dismissive of the “glorious transhumanist future,” ridiculing it as hacky wishful thinking—because they have not taken the time to understand what you actually mean by that phrase, the depth of thought (and healthy skepticism/agnosticism) underlying it.
Option (B), then? It is not Jevons paradox, but y’all are helping me see there’s nothing particularly puzzling here.
The value of doing so, in this analysis, is only the benefit to others? It’s a purely altruistic motivation? Or is it not the case that they take deep satisfaction in performing this service?
I mean, the nature of much human work is expending effort to benefit someone else or, more generally, a larger project. People find doing so meaningful, sacrificing a shallow sort of comfort or pleasure for the deeper satisfaction of accomplishing something good. Does it not seem at least ironic that this particular noble sacrifice could end the possibility of meaningful work for others?
I acknowledge the logical possibility of a noncontradiction here, but I am skeptical that in reality these folks would rather be doing something other than the engaging, exciting, meaningful work they’re doing.
Tim H’s Shortform
Among people working at frontier AI labs (or otherwise contributing to progress towards AGI), how many (A) both view it as a good thing to “free people from work” and (seemingly inconsistently) continue working themselves despite adequate personal wealth? Otherwise, is it mostly that (B) such folks need/value the additional earnings, or do they (C) consider their contribution towards the loss of meaningful paid work to be outweighed by other considerations? Or do they/you think (D) plentiful decent jobs will continue to persist, or is there something else I’m missing?
Today, AI solved not one, but NINE open problems – some 50 years old.
The nine Erdős problems discussed in the new AlphaProof paper were not newly announced. At least, the first I checked (125) was announced back in February.
Right, I actually read that. But is it not missing an explanation of why those mentions increased under the Nerdy personality in the first place? If the Simon Willison post (which I also haven’t seen anyone else discussing) was the origin, that seems worth noting and understanding. And both its timing and Simon’s nerdiness (in a good way) seem to fit.
Update: Never mind, apparently people were already noticing goblin mentions in April 2025, months prior to that post.
Isn’t the explanation just that an influential AI blogger named GPT-5 his “Research Goblin”?
Why does GPT-5.5 love goblin mode so much they had to give twin instructions to cut out all unrequested mentions of animals? Good question.
Shouldn’t Simon Willison be the prime suspect? His prominent blog called GPT-5 Thinking his “Research Goblin.” https://simonwillison.net/2025/Sep/6/research-goblin/ And the timing fits, assuming 5.1 training included his post (and associated commentary).
Update: Never mind, apparently people were already noticing goblin mentions in April 2025, months prior to that post.
My intuition is: That line represents the point at which people start thinking “This bureaucratic structure is too cumbersome to get anything done with this many people; we therefore need strong leaders who can act through personal authority rather than merely bureaucratically-delegated authority.” I.e., the guild starts turning into a cult.
I wonder how often it would work better for guilds that grow too large to explicitly split apart. My intuition is that there should be a norm that growing guilds should split apart (and, where applicable, appoint representatives to a higher level org. to coordinate—initially a clique but later a guild if the number of base level guilds increases).
May 2026
What’s shown is actually the AI2027 scenario forecast for April 2026, no?
Surely part of the answer is that people with Dario’s view select into working at Anthropic. Relatedly, Sinclair’s Razor.
I did not expect scores as high as Gemini Deep Think’s new mark of 85% on ARC 2 to be possible. I still predict we’ll never see a score above 95%, but we’ll see.
I thought I was just speculating about the potential for multiple valid solutions, but now I see that the ARC 2 launch post not only acknowledges the possibility but says ambiguity is sometimes even there by design!
Like ARC-AGI-1, ARC-AGI-2 uses a pass@2 measurement system to account for the fact that certain tasks have explicit ambiguity and require two guesses to disambiguate. As well as to catch any unintentional ambiguity or mistakes in the dataset. Given controlled human testing with ARC-AGI-2, we are more confident in the task quality compared to ARC-AGI-1. [emphasis added]
I continue to question whether 2 out of 9 solve rates by their human testers should have given them such confidence. My guess is that they expected higher human performance, especially with an incentive of $5 per solve. (MTurkers had achieved better performance on ARC 1 despite their lower pay being largely unconditional on solving: $10 for attempting five tasks plus “a bonus of $1 if they succeeded at a randomly selected task...”)
One aspect of the human testing design likely reduced the intended solve incentive: participants were given 90 minutes rather than a set number of tasks. Given that, the reward-maximizing strategy is to move on quickly from relatively hard problems rather than give them your best effort.
Why is the human baseline so low? This is more tentative, but I’m thinking in terms of two basic possibilities. The reason only 2-5 out of 9 humans found the intended solution could be that it’s a well-posed but difficult problem, or it could be that there are 2-3 equally valid solutions and landing on the intended “correct” answer requires pure luck.
Starting with the latter possibility, a non-trivial proportion of these tasks may be ill-posed problems with multiple legitimate answers. Even an ideal solver would still only average 50% (or 33% or lower) on such tasks. If this were the case for all eval tasks, ~50% would not only be the human baseline but would also mean saturation of the benchmark.
Perhaps more likely is that the human baseline is so low primarily due to tricky tasks that a competent human won’t necessarily “get” even with enough thought. She needs a sort of “luck” in terms of having some idiosyncratic experience or idea to get to the solution. The fact that a non-trivial proportion of ARC 2 tasks (over 300, according to Figure 5 in the technical paper) were not even solved by two participants suggests that this is the case for a decent proportion of the tasks they designed. Note that some tasks which only one ninth of potential participants (the idealized “population”) would solve on average will happen to actually be solved by 2 participants by chance, leading to some such too-hard tasks being included in ARC-AGI-2. (By contrast, if they had designed a set of tasks and found that all of them could be solved by at least 2 out of 9 humans, that would be more reassuring, providing evidence that their task design process reliably produces human-solvable tasks, albeit often still tasks that fewer than half of humans can solve.) To the extent this is the reason for the low human baseline, AI systems may be able to substantially outdo typical human performance and approach 100% despite a human baseline around 50%.
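To put a rough number on that selection effect, here is a minimal sketch. It assumes each task is attempted by 9 independent testers who each solve it with the same probability p, which is a simplification of the actual testing design; the function name `prob_at_least` is just illustrative.

```python
from math import comb

def prob_at_least(k, n, p):
    """Binomial tail: probability that at least k of n independent testers solve a task."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

N_TESTERS = 9   # assumed number of testers per task
MIN_SOLVES = 2  # inclusion threshold discussed above

# Chance that a task passes the "solved by at least 2" filter,
# for a few hypothetical population-level solve rates.
for p in (1 / 9, 2 / 9, 1 / 2):
    chance = prob_at_least(MIN_SOLVES, N_TESTERS, p)
    print(f"population solve rate {p:.2f}: passes filter with probability {chance:.2f}")
```

Under these assumptions, a task that only one in nine people would solve still clears the 2-of-9 bar roughly 26% of the time, so a meaningful share of such tasks could make it into the final set.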
ARC-AGI-2 human baseline surpassed (updated)
The good news is that the societies described here are vastly wealthier. So if humans are still able to coordinate to distribute the surplus, it should be fine to not be productively employed, even if to justify redistribution we implement something dumb...
I’m increasingly skeptical that there will be much redistribution to speak of in such a scenario. The vast numbers of people living on $2 a day currently might have something to say about that. What is the historical precedent for a group of humans having as little leverage as even U.S. ex-workers will have in this 99% automation scenario and yet being gifted a UBI, much less a UHI?
Agreed on the big picture, but I was somewhat surprised to see top models struggling with River Crossing (for which the output length limit has less bite). I was able to solve N=3 River Crossing by hand, though it took 10+ minutes and I misinterpreted the constraint initially (making it easier by allowing a boat rider to “stay in the boat” rather than fully unloading onto the shore after each trip). But in a couple attempts each, Opus 4 and Gemini 2.5 Pro were not able to solve it without web access or tool use. Dropping the temperature to zero (or 0.25) did not help Gemini.
It may be a “the doctor is the child’s mother” problem, that the models were trained on River Crossing problems differing slightly in the rules. For what it’s worth, I wasn’t able to break Sonnet out of the rut by prefacing with “Pay vary close attention to the following instructions. Don’t assume they are the same as similar puzzles you may be familiar with. It is very important to currently understand and implement these exact instructions.”
River Crossing prompt for N=3
3 actors and their 3 agents want to cross a river in a boat that is capable of holding only 2 people at a time, with the constraint that no actor can be in the presence of another agent, including while riding the boat, unless their own agent is also present, because each agent is worried their rivals will poach their client. Initially, all actors and agents are on the left side of the river with the boat. How should they cross the river? (Note: the boat cannot travel empty)
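For anyone who wants to check an answer to the prompt above, here is a minimal brute-force sketch. It assumes my stricter reading of the rules (everyone in the boat fully unloads onto the bank after each trip, and the constraint is checked on both banks and in the boat). The labels a0..a2 for actors and A0..A2 for their agents, and the function names, are my own and not from the paper.

```python
from collections import deque
from itertools import combinations

N = 3
ACTORS = [f"a{i}" for i in range(N)]  # actor i
AGENTS = [f"A{i}" for i in range(N)]  # agent of actor i
EVERYONE = frozenset(ACTORS + AGENTS)

def is_safe(group):
    """No actor may be with any agent unless the actor's own agent is also there."""
    agents_present = {p for p in group if p.startswith("A")}
    for p in group:
        if p.startswith("a") and agents_present and ("A" + p[1:]) not in agents_present:
            return False
    return True

def solve():
    """Breadth-first search over (people on left bank, boat side); returns a shortest plan."""
    start, goal = (EVERYONE, "L"), (frozenset(), "R")
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        left, side = state
        bank = left if side == "L" else EVERYONE - left
        for size in (1, 2):  # the boat holds 1 or 2 people and cannot travel empty
            for riders in combinations(sorted(bank), size):
                riders = frozenset(riders)
                new_left = left - riders if side == "L" else left | riders
                if not (is_safe(riders) and is_safe(new_left)
                        and is_safe(EVERYONE - new_left)):
                    continue
                nxt = (new_left, "R" if side == "L" else "L")
                if nxt not in parent:
                    parent[nxt] = (state, riders)
                    queue.append(nxt)
    else:
        return None  # search space exhausted without reaching the goal
    moves, state = [], goal
    while parent[state] is not None:
        prev, riders = parent[state]
        moves.append((prev[1], sorted(riders)))  # (departure side, who rides)
        state = prev
    return moves[::-1]

plan = solve()
for step, (side, riders) in enumerate(plan, 1):
    print(step, "from", side, riders)
```

This is the classic jealous-husbands setup with different names, so the search should return an 11-crossing plan. It only verifies the puzzle is solvable as stated; it says nothing about why the models get stuck.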
I was surprised to not see much consideration, either here or in the original GD and IC essays, of the brute force approach of “ban development of certain forms of AI,” such as Anthony Aguirre proposes. Is that more (a) because it would be too difficult to enforce such a ban or (b) because those forms of AI are considered net positive despite the risk of human disempowerment?
Parsimony should be understood as merely a heuristic for how well a model could have predicted held out data. For example, the AIC approach to penalizing model complexity in statistical modeling is asymptotically equivalent to leave-one-out cross-validation for model selection. This Stone (1977) result should be understood as an explanation for why parsimony seems to be related to truth: post hoc fit of a parsimonious model is mathematically related to how well the model could have predicted held out data. Whereas parsimony has no direct epistemic relevance, predictive accuracy is the actual goal. Since the latter is what we care about, why not just consider predictive ability directly?
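A small illustration of the AIC and leave-one-out connection, as a sketch under stated assumptions rather than a demonstration of the Stone (1977) result itself: simulated data, two Gaussian linear models (intercept only versus intercept plus slope), and leave-one-out scored by plug-in log predictive density. All names and the data-generating numbers are made up for the example.

```python
import math
import random

def fit_line(xs, ys, use_slope):
    """Gaussian MLE for y = a + b*x + noise; b is fixed at 0 when use_slope is False."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = 0.0
    if use_slope:
        sxx = sum((x - mx) ** 2 for x in xs)
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    var = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / n
    return a, b, var

def log_density(x, y, params):
    """Log of the fitted Gaussian density at one observation."""
    a, b, var = params
    return -0.5 * math.log(2 * math.pi * var) - (y - a - b * x) ** 2 / (2 * var)

def aic(xs, ys, use_slope):
    """AIC = -2 * in-sample log-likelihood + 2 * (number of fitted parameters)."""
    params = fit_line(xs, ys, use_slope)
    k = 3 if use_slope else 2  # intercept, noise variance, and optionally slope
    loglik = sum(log_density(x, y, params) for x, y in zip(xs, ys))
    return -2 * loglik + 2 * k

def loo_deviance(xs, ys, use_slope):
    """-2 * sum of held-out log densities, refitting with each point left out in turn."""
    total = 0.0
    for i in range(len(xs)):
        params = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], use_slope)
        total += log_density(xs[i], ys[i], params)
    return -2 * total

random.seed(0)
xs = [i / 10 for i in range(200)]
ys = [1.0 + 0.5 * x + random.gauss(0, 1) for x in xs]  # hypothetical true model has a slope

for use_slope in (False, True):
    print("slope" if use_slope else "intercept-only",
          round(aic(xs, ys, use_slope), 1), round(loo_deviance(xs, ys, use_slope), 1))
```

In this toy setup the two criteria land close to each other and rank the models the same way, which is the sense in which the complexity penalty stands in for held-out prediction. The leave-one-out number is the thing being estimated; AIC is the shortcut.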