Jacob_Hilton

Karma: 1,652

Jacob_Hilton Sep 16, 2025, 4:42 PM
8 points
5 votes
Overall karma indicates overall quality.
0
0 votes
Agreement karma indicates agreement, separate from overall quality.
in reply to: ryan_greenblatt’s comment on: Jacob_Hilton’s Shortform
Agree about recent results not being driven by formalization, but I’d also guess having ground truth (e.g. numeric answers or reference solutions) remains pretty important, which doesn’t scale to the superhuman regime.
Agree that evidence from humans means reaching superhuman capability through purely informal proof is possible in principle. But ML is less robust than humans by default, and AI is already more proficient with formal proof systems than most mathematicians. So informal-to-formal seems like a natural consequence of increased tool use. Not confident in this of course.
I expect easy-to-check software engineering tasks (and tasks that are conceptually similar to easy-to-check tasks) to be pretty close to math, and harder-to-check/fuzzier tasks to lag. Most tasks in the broad economy seem like they fall in the latter category. The economy will likely adapt to make lots of tasks better suited to AI, but that process may be slower than the capability lag anyway. AI R&D might be a different story, but I will leave that to another discussion.

Jacob_Hilton Sep 16, 2025, 3:45 PM
69 points
36 votes
41
27 votes
on: Jacob_Hilton’s Shortform
Superhuman math AI will plausibly arrive significantly before broad automation
I think it’s plausible that for several years in the late 2020s/early 2030s, we will have AI that is vastly superhuman at formal domains including math, but still underperforms humans at most white-collar jobs (and so world GDP growth remains below 10%/year, say – still enough room for AI to be extraordinarily productive compared to today).
Of course, if there were to be an intelligence explosion on that timescale, then superhuman math AI would be unsurprising. My main point is that superhuman math AI still seems plausible even disregarding feedback loops from automation of AI R&D. On the flip side, a major catastrophe and/or coordinated slowdown could prevent both superhuman math AI and broad automation. Since both of these possibilities are widely discussed elsewhere, I will disregard both AI R&D feedback loops and catastrophe for the purposes of this forecast. (I think both are very salient possibilities on the relevant timescale, but won’t justify that here.)
My basic reasons for thinking vastly superhuman math AI is a serious possibility in the next 4–8 years (even absent AI R&D feedback loops and/or catastrophe):
- Performance in formal domains is verifiable: math problems can be designed to have a unique correct answer, and formal proofs are either valid or invalid. Historically, in domains with cheap, automated supervision signals, only a relatively small amount of research effort has been required to produce superhuman AI (e.g., in board games and video games). There are often other bottlenecks than supervision, most notably exploration and curricula, but these tend to be more surmountable.
- Recent historical progress in math has been extraordinarily fast: in the last 4 years, AI has gone from struggling with grade school math to achieving an IMO gold medal, with progress at times exceeding almost all forecasters’ reasonable expectations. Indeed, much of this progress seems to have been driven by the ability to automatically supervise math, with reasoning models being trained using RL on a substantial amount of math data.
- Superhuman math AI looks within reach without enormous expense: reaching superhuman ability in a domain requires verifying solutions beyond a human’s ability to produce them, and so a static dataset produced by humans isn’t enough. (In fact, a temporary slowdown in math progress in the near future seems possible because of this, although I wouldn’t bet on it.) But the following two ingredients (plus sufficient scale) seem sufficient for superhuman math AI, and within reach:
  - Automatic problem generation: the ability to generate a diverse enough set of problems such that both (a) most realistic math of interest to humans is within distribution and (b) problem difficulty is granular enough to provide a good curriculum. Current LLMs with careful prompting/fine-tuning may be enough for this.
  - Reliable informal-to-formal translation: solution verifiers need to be robust enough to avoid too much reward hacking, which probably requires natural language problems and solutions to be formalized to some degree (a variety of arrangements seem possible here, but it’s hard to see how something purely informal can provide sufficiently scalable supervision, and it’s hard to see how something purely formal can capture mathematicians’ intuitions about what problems are interesting). This is basically a coding problem, and doesn’t seem too far beyond the capabilities of current LLMs. Present-day formalization efforts by humans are challenging, but in large part because of their laboriousness, which AI is excellent at dealing with.
  Note I’m not claiming that there will be discontinuous progress once these ingredients “click into place”. Instead, I expect math progress to continue on a fast but relatively continuous trajectory (perhaps with local breakthroughs/temporary slowdowns on the order of a year or two). The above two ingredients don’t seem especially responsible for current math capabilities, but could become increasingly relevant as we move towards and into the superhuman regime.
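As a toy illustration of the informal-to-formal step described above (a sketch only, not a claim about how any real pipeline works): the informal statement "addition of natural numbers is commutative" can be written as a Lean 4 theorem that the proof checker either accepts or rejects. The theorem name `informal_claim_formalized` is hypothetical; `Nat.add_comm` comes from Lean's core library.

```lean
-- Informal claim: "for all natural numbers a and b, a + b equals b + a".
-- The theorem name is hypothetical; the proof term uses `Nat.add_comm` from Lean 4 core.
theorem informal_claim_formalized (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The checker's verdict is binary, which is what makes the supervision signal cheap and hard to game. The part that remains fuzzy is whether the formal statement faithfully captures the informal one.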
By contrast, some reasons to be skeptical that AI will be automating more than a few percent of the economy by 2033 (still absent AI R&D feedback loops and/or catastrophe):
- Progress in domains in which performance is hard to verify has been slower: by comparison with the dramatic progress in math, the ability of an AI to manage a small business enterprise is relatively unimpressive. In domains with a mixture of formal and informal problem specifications, such as coding, progress has been similarly fast to math, or perhaps a little slower (as measured by horizon length), but my qualitative impression is that this has been driven by progress on easy-to-verify tasks, with some transfer to hard-to-verify tasks. I expect domains to continue to lag in proportion to how hard their performance is to verify.
- Possible need for expensive long-horizon data: in domains with fuzzy, informal problem specifications, or requiring expensive or long-horizon feedback from the real world, we will continue to see improvements, since there will be transfer both from pretraining scaling and from more RL on verifiable tasks. But for tasks where this progress is slow despite the task being economically important, it will eventually be worth it to collect expensive long-horizon feedback. However, it might take several years to scale up the necessary infrastructure for this, unlike some clear routes to superhuman math AI, for which all the necessary infrastructure is essentially already in place. This makes a 2–5+ year lag seem quite plausible.
- Naive revenue extrapolation: one way to get a handle on the potential timescale until broad automation is to extrapolate AI company revenues, which are on the order of tens of billions of dollars per year today, around 0.01% of world GDP. Even using OpenAI’s own projections (despite their incentives to make overestimates), which forecast that revenue will grow by a factor of 10 over the next 4 years, and extrapolating them an additional 4 years into the future, gives an estimate of around 1% of world GDP by 2033. AI companies won’t capture all the economic value they create, but on the other hand this is a very bullish forecast by ordinary standards.
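The naive revenue extrapolation above can be spelled out as a few lines of arithmetic. This is a minimal sketch: the world GDP figure of roughly $100 trillion/year is a round-number assumption of this example, world GDP is held fixed for simplicity, and the function name is hypothetical.

```python
# Illustrative arithmetic only; every input is a round number assumed for this sketch.
WORLD_GDP_USD = 100e12        # assumption: world GDP of roughly $100 trillion/year
SHARE_TODAY = 0.0001          # 0.01% of world GDP, the figure used in the text
GROWTH_PER_4_YEARS = 10.0     # projected 10x revenue growth over 4 years


def extrapolate_share(share, factor_per_period, periods):
    """Compound a GDP share by a fixed factor per period, holding world GDP fixed."""
    return share * factor_per_period ** periods


# Two 4-year periods: the projected one plus the extra 4 years of extrapolation.
share_after_8_years = extrapolate_share(SHARE_TODAY, GROWTH_PER_4_YEARS, 2)
annual_growth = GROWTH_PER_4_YEARS ** (1 / 4)

print(f"implied revenue today: ${SHARE_TODAY * WORLD_GDP_USD / 1e9:.0f}B/year")
print(f"implied annual growth factor: {annual_growth:.2f}x")
print(f"share of world GDP after 8 years: {share_after_8_years:.1%}")
```

Allowing world GDP to grow over the same period would push the final share slightly lower, so holding it fixed is, if anything, generous to the bullish case.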
What would a world with vastly superhuman math AI, but relatively little broad automation, look like? Some possibilities:
- Radical effect on formal sciences: by “vastly superhuman math AI”, I mean something like: you can give an AI a math problem, and it will respond within e.g. a couple of hours with a formal proof or disproof, as long as a human mathematician could have found an informal version of the proof in say 10 years. (Even though I just argued for the plausibility of this, it seems completely wild to comprehend, spelled out explicitly.) I think this would completely upend the formal sciences (math, theoretical computer science and theoretical physics) to say the least. Progress on open problems would be widespread but highly variable, since their difficulty likely ranges from “just out of reach to current mathematicians” to “impossible”.
- Noticeable speed-up of applied sciences: it’s not clear that such a dramatic speed-up in the formal sciences would have such dramatic consequences for the rest of the world, given how abstract much of it is. Cryptography, formal verification and programming languages might be the most consequential areas, followed by areas like experimental physics and computational chemistry. However, in most of the experimental sciences, formal results are not the main bottleneck, so speed-ups would be more dependent on progress on coding, fuzzier tasks, robotics, and so on. Math-heavy theoretical AI alignment research would be significantly sped up, but may still face philosophical hurdles.
- Broader economy: it’s worth emphasizing that even if world GDP growth remains below 10%/year, that still leaves plenty of room for AI to feel “crazy”, labor markets to be dramatically affected by ordinary standards, political discussion to be dominated by AI, etc. Note also that this period may be fairly short-lived (e.g., a few years).
Such a scenario is probably poor as an all-things-considered conditional forecast, since I’ve deliberately focused on a very specific technological change, but it hopefully adds some useful color to my prediction.
Finally, some thoughts on whether pursuing superhuman math AI specifically is a beneficial research direction:
- Possibility for transfer: there is a significant possibility that math reasoning ability transfers to other capabilities; indeed, we may already be seeing this in today’s reasoning models (though I haven’t looked at ablation results). That being said, moving into the superhuman regime and digging into specialist areas, math ability will increasingly be driven by carefully-tuned specialist intuitions, especially if pursuing something like the informal-to-formal approach laid out above. Moreover, specialized math ability seems to have limited transfer in humans, and transfer in ML is generally considerably worse than in humans. Overall, this doesn’t seem like a dominant consideration.
- Research pay-offs: a different kind of “transfer” is that pursuit of superhuman math AI would likely lead to general ML research discoveries, clout, PR etc., making it easier to develop other AI capabilities. I think this is an important consideration, and probably the main reason that AI companies have prioritized math capabilities so far, together with tractability. However, pursuing superhuman math AI doesn’t seem that different from other capabilities research in this regard, so the question of how good it is in this respect is mostly screened off by how good you think it is to work on capabilities in general (which could itself depend on the context/company).
- Differential progress: the kinds of scientific progress that superhuman math AI would enable look more defense-oriented than average (e.g., formal verification), and I think the possibility of speeding up theoretical AI alignment research is significant (I work in this area and math AI is already helping).
- Replaceability: there are strong incentives for AI companies and for individual researchers to pursue superhuman math AI anyway (e.g., the research pay-offs discussed above), which reduces the size (in either direction) of the marginal impact of an individual choosing to work in the area.
Overall, pursuing superhuman math AI seems mildly preferable to working on other capabilities, but not that dissimilar in its effects. It wouldn’t be my first choice for most people with the relevant skillset, unless they were committed to working on capabilities anyway.

Jacob_Hilton Aug 8, 2025, 4:57 PM
2 points
1 vote
0
0 votes
in reply to: elifland’s comment on: Jacob_Hilton’s Shortform
I’m happy to talk about a theoretical HCAST suite with no bugs and infinitely many tasks of arbitrarily long time horizon, for the sake of argument (even though it is a little tricky to reason about, and measuring human performance would be impractical).
I think the notion of an “infinite time horizon” system is a poor abstraction, because it implicitly assumes 100% reliability. Almost any practical, complex system has a small probability of error, even if this probability is too small to measure in practice. Once you stop using this abstraction, the argument doesn’t seem to hold up: surely a system that has 99% reliability at million-year tasks has lower than 99% reliability at 10 million-year tasks? This seems true even if a 10 million-year task is nothing more than 10 consecutive million-year tasks, and that seems strictly easier than an average 10 million-year task.
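To make the reliability point concrete, here is a minimal sketch under the simplest possible assumption: a long task is just a chain of sub-tasks that succeed independently with the same probability. Independence and the function names are assumptions of this example, not part of the argument above.

```python
import math


def chained_reliability(p_single, n_subtasks):
    """Success probability of n consecutive sub-tasks, each succeeding independently with p_single."""
    return p_single ** n_subtasks


def subtasks_until(p_single, target):
    """Number of chained sub-tasks at which overall reliability falls to `target`."""
    return math.log(target) / math.log(p_single)


# 99% reliability per million-year task, chained into longer tasks.
for n in (1, 10, 100):
    print(f"{n:>3} chained tasks: {chained_reliability(0.99, n):.3f}")

# Any per-task error rate above zero gives a finite 50% horizon.
print(f"50% horizon: about {subtasks_until(0.99, 0.5):.0f} chained tasks")
```

Under these assumptions, 99% reliability on each million-year task already drops to roughly 90% on ten of them in sequence, and the 50% time horizon stays finite for any non-zero error rate.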

Jacob_Hilton Aug 8, 2025, 9:07 AM
47 points
22 votes
17
14 votes
on: Jacob_Hilton’s Shortform
Against superexponential fits to current time horizon measurements
I think it is unreasonable to put non-trivial weight (e.g. > 5%) on a superexponential fit to METR’s 50% time horizon measurements, or similar recently-collected measurements.
To be precise about what I am claiming and what I am not claiming:
- I am not claiming that these measurements will never exhibit a superexponential trend. In fact, I think a superexponential trend is fairly likely eventually, due to feedback loops from AI speeding up AI R&D. I am claiming that current measurements provide almost no information about such an eventuality, and naively applying a superexponential fit gives a poor forecast.
- I am not claiming that it is very unlikely for the trend to be faster in the near future than in the near past. I think a good forecast would use an exponential fit, but with wide error bars on the slope of the fit. After all, there are very few datapoints, they are not independent of each other, and there is measurement noise. I am claiming that extrapolating the rate at which the trend is getting faster is unreasonable.
- My understanding is that AI 2027’s forecast is heavily driven by putting substantial weight on such a superexponential fit, in which case my claim may call into question the reliability of this forecast. However, I have not dug into AI 2027’s forecast, and am happy to be corrected on this point. My primary concern is with the specific claim I am making rather than how it relates to any particular aggregated forecast.
Note that my argument has significant overlap with this critique of AI 2027, but is focused on what I think is a key crux rather than being a general critique. There has also been some more recent discussion of superexponential fits since the GPT-5 release here, although my points are based on METR’s original data. I make no claims of originality and apologize if I missed similar points being made elsewhere.
The argument
METR’s data (see Figure 1) exhibits a steeper exponential trend over the last year or so (which I’ll call the “1-year trend”) than over the last 5 years or so (which I’ll call the “5-year trend”). A superexponential fit would extrapolate this to an increasingly steep trend over time. Here is why I think such an extrapolation is unwarranted:
- There is a straightforward explanation for the 1-year trend that we should expect to be temporary. The most recent datapoints are all reasoning models trained with RL. This is a new technique that scales with compute, and so we should expect there to be rapid initial improvements as compute is scaled from a low starting point. But this compute growth must eventually slow down to the rate at which older methods are growing in compute, once the total cost becomes comparable. This should lead to a leveling off of the 1-year trend to something closer to the 5-year trend, all else being equal.
  - Of course, there could be another new technique that scales with compute, leading to another (potentially overlapping) “bump”. But the shape of the current “bump” tells us nothing about the frequency of such advances, so it is an inappropriate basis for such an extrapolation. A better basis for such an extrapolation would be the 5-year trend, which may include past “bumps”.
- Superexponential explanations for the 1-year trend are uncompelling. I have seen two arguments for why we might expect the 1-year trend to be the start of a superexponential trend, and they are both uncompelling to me.
  1. Feedback from AI speeding up AI R&D. I don’t think this effect is nearly big enough to have a substantial effect on this graph yet. The trend is most likely being driven by infrastructure scaling and new AI research ideas, neither of which AI seems to be substantially contributing to. Even in areas where AI is contributing more, such as software engineering, METR’s uplift study suggests the gains are currently minimal at best.
  2. AI developing meta-skills. From this post:
    “If we take this seriously, we might expect progress in horizon length to be superexponential, as AIs start to figure out the meta-skills that let humans do projects of arbitrary length. That is, we would expect that it requires more new skills to go from a horizon of one second to one day, than it does to go from one year to one hundred thousand years; even though these are similar order-of-magnitude increases, we expect it to be easier to cross the latter gap.”
    It is a little hard to argue against this, since it is somewhat vague. But I am unconvinced there is such a thing as a “meta-skill that lets humans do projects of arbitrary length”. It seems plausible to me that a project that takes ten million human-years is meaningfully harder than 10 projects that each take a million human-years, due to the need to synthesize the 10 highly intricate million-year sub-projects. To me the argument seems very similar to the following, which is not borne out:
    “We might expect progress in chess ability to be superexponential, as AIs start to figure out the meta-skills (such as tactical ability) required to fully understand how chess pieces can interact. That is, we would expect it to require more new skills to go from an Elo of 2400 to 2500, than it does to go from an Elo of 3400 to 3500.”
    At the very least, this argument deserves to be spelled out more carefully if it is to be given much weight.
- Theoretical considerations favor an exponential fit (added in edit). Theoretically, it should take around twice as much compute to train an AI system with twice the horizon length, since feedback is twice as sparse. (This point was made in the Biological anchors report and is spelled out in more depth in this paper.) Hence exponential compute scaling would imply an exponential fit. Algorithmic progress matters too, but that has historically followed an exponential trend of improved compute efficiency. Of course, algorithmic progress can be lumpy, so we shouldn’t expect an exponential fit to be perfect.
- Temporary explanations for the 1-year trend are more likely on priors. The time horizon metric has a huge variety of contributing factors, from the inputs to AI development to the details of the task distribution. For any such complex metric, the trend is likely to bounce around based on idiosyncratic factors, which can easily be disrupted and are unlikely to have a directional bias. (To get a quick sense of this, you could browse through some of the graphs on AI Impacts’ Discontinuous growth investigation, or even METR’s measurements in other domains for something more directly relevant.) So even if I wasn’t able to identify the specific idiosyncratic factor that I think is responsible for the 1-year trend, I would expect there to be one.
- The measurements look more consistent with an exponential fit. I am only eyeballing this, but a straight line fit is reasonably good, and a superexponential fit doesn’t jump out as a privileged alternative. Given the complexity penalty of the additional parameters, a superexponential fit seems unjustified based on the data alone. This is not surprising given the small number of datapoints, many of which are based on similar models and are therefore dependent. (Edit: looks like METR’s analysis (Appendix D.1) supports this conclusion, but I’m happy to be corrected here if there is a more careful analysis.)
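The first bullet above (a new technique whose compute is scaled rapidly from a low base, then slows to the background rate) can be illustrated with a deliberately crude toy model. Everything here is an assumption of the sketch: horizon is taken to be proportional to total compute across both methods, and the doubling times, start year, and starting fraction are made-up parameters rather than estimates.

```python
import math

# Toy parameters; all are assumptions chosen only to make the shape visible.
OLD_DOUBLING_YEARS = 1.0     # doubling time of compute for established methods
RAMP_DOUBLING_YEARS = 0.25   # doubling time of the new technique's compute while catching up
RAMP_START_YEAR = 2.0        # year the new technique is introduced
START_FRACTION = 0.001       # new technique's compute as a fraction of established compute at introduction


def established_compute(t):
    """Compute spent on established methods, growing at a steady exponential rate."""
    return 2.0 ** (t / OLD_DOUBLING_YEARS)


def new_technique_compute(t):
    """Compute spent on the new technique: fast ramp, capped once costs are comparable."""
    if t < RAMP_START_YEAR:
        return 0.0
    ramp = (START_FRACTION * established_compute(RAMP_START_YEAR)
            * 2.0 ** ((t - RAMP_START_YEAR) / RAMP_DOUBLING_YEARS))
    return min(ramp, established_compute(t))


def horizon(t):
    """Toy assumption: time horizon is proportional to total compute."""
    return established_compute(t) + new_technique_compute(t)


def local_doubling_time(t, dt=0.01):
    """Doubling time implied by the growth rate of the horizon between t and t + dt."""
    growth_rate = math.log(horizon(t + dt) / horizon(t)) / dt
    return math.log(2) / growth_rate


for year in (1.0, 4.0, 5.0, 7.0):
    print(f"year {year}: local doubling time {local_doubling_time(year):.2f} years")
```

In this toy model the local doubling time shortens while the new technique is ramping and then returns to the background value once its compute is capped. A superexponential fit to the ramp period would instead extrapolate the shortening indefinitely.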
What do I predict?
In the spirit of sticking my neck out rather than merely criticizing, I will make the following series of point forecasts which I expect to outperform a superexponential fit: just follow an exponential trend, with an appropriate weighting based on recency. If you want to forecast 1 year out, use data from the last year. If you want to forecast 5 years out, use data from the last 5 years. (No doubt it’s better to use a decay rather than a cutoff, but you get the idea.) I obviously have very wide error bars on this, but probably not wide enough to include the superexponential fit more than a few years out.
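One way to read the forecasting recipe above is as a weighted log-linear fit, with a recency decay whose timescale equals the forecast lead. The sketch below is one possible reading, not a definitive implementation: the exponential weighting scheme and the function name are assumptions, and the data is synthetic (a trend that steepens in its final year), not METR's measurements.

```python
import math


def weighted_exponential_forecast(times, horizons, lead_years):
    """Fit log2(horizon) = a + b * t by weighted least squares and extrapolate.

    Weights decay exponentially with age, using the forecast lead as the decay
    timescale (an assumed stand-in for "use data from the last N years").
    Returns (forecast horizon at t_now + lead_years, fitted doubling time in years).
    """
    t_now = max(times)
    weights = [math.exp(-(t_now - t) / lead_years) for t in times]
    ys = [math.log2(h) for h in horizons]
    w_sum = sum(weights)
    t_mean = sum(w * t for w, t in zip(weights, times)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    slope = (sum(w * (t - t_mean) * (y - y_mean) for w, t, y in zip(weights, times, ys))
             / sum(w * (t - t_mean) ** 2 for w, t in zip(weights, times)))
    intercept = y_mean - slope * t_mean
    forecast = 2.0 ** (intercept + slope * (t_now + lead_years))
    return forecast, 1.0 / slope


# Synthetic data: doubling every year until year 4, then every half year.
times = [0.5 * i for i in range(11)]
horizons = [2.0 ** (t if t <= 4 else 4 + 2 * (t - 4)) for t in times]

for lead in (1.0, 5.0):
    forecast, doubling = weighted_exponential_forecast(times, horizons, lead)
    print(f"lead {lead} years: doubling time {doubling:.2f} years, forecast {forecast:.0f}")
```

On this synthetic series the 1-year forecast leans on the recent steeper stretch, while the 5-year forecast uses a slope closer to the long-run trend. Neither extrapolates the rate at which the slope itself is changing.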
As an important caveat, I’m not making a claim about the real-world impact of an AI that achieves a certain time horizon measurement. That is much harder to predict than the measurement itself, since you can’t just follow straight lines on graphs.
What links here?
- Noosphere89's comment on My AI Predictions for 2027 by talelore (Sep 3, 2025, 2:01 PM; 2 points)

Jacob_Hilton Aug 7, 2025, 3:25 PM
5 points
3 votes
1
1 vote
in reply to: Peter Wildeford’s comment on: Vladimir_Nesov’s Shortform
The model sizes were likely chosen based on typical inference constraints. Given that, they mostly care about maximizing performance, and aren’t too concerned about the compute cost, since training such small models is very affordable for them. So it’s worth going a long way into the regime of diminishing returns.

Jacob_Hilton Jul 15, 2025, 10:48 PM
LW: 29 AF: 16
25 votes
0
3 votes
AF
on: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
It is interesting to note how views on this topic have shifted with the rise of outcome-based RL applied to LLMs. A couple of years ago, the consensus in the safety community was that process-based RL should be prioritized over outcome-based RL, since it incentivizes choosing actions for reasons that humans endorse. See for example Anthropic’s Core Views On AI Safety:
Learning Processes Rather than Achieving Outcomes
One way to go about learning a new task is via trial and error – if you know what the desired final outcome looks like, you can just keep trying new strategies until you succeed. We refer to this as “outcome-oriented learning”. In outcome-oriented learning, the agent’s strategy is determined entirely by the desired outcome and the agent will (ideally) converge on some low-cost strategy that lets it achieve this.
Often, a better way to learn is to have an expert coach you on the processes they follow to achieve success. During practice rounds, your success may not even matter that much, if instead you can focus on improving your methods. As you improve, you might shift to a more collaborative process, where you consult with your coach to check if new strategies might work even better for you. We refer to this as “process-oriented learning”. In process-oriented learning, the goal is not to achieve the final outcome but to master individual processes that can then be used to achieve that outcome.
At least on a conceptual level, many of the concerns about the safety of advanced AI systems are addressed by training these systems in a process-oriented manner. In particular, in this paradigm:
- Human experts will continue to understand the individual steps AI systems follow because in order for these processes to be encouraged, they will have to be justified to humans.
- AI systems will not be rewarded for achieving success in inscrutable or pernicious ways because they will be rewarded only based on the efficacy and comprehensibility of their processes.
- AI systems should not be rewarded for pursuing problematic sub-goals such as resource acquisition or deception, since humans or their proxies will provide negative feedback for individual acquisitive processes during the training process.
At Anthropic we strongly endorse simple solutions, and limiting AI training to process-oriented learning might be the simplest way to ameliorate a host of issues with advanced AI systems. We are also excited to identify and address the limitations of process-oriented learning, and to understand when safety problems arise if we train with mixtures of process and outcome-based learning. We currently believe process-oriented learning may be the most promising path to training safe and transparent systems up to and somewhat beyond human-level capabilities.
Or Solving math word problems with process- and outcome-based feedback (DeepMind, 2022):
Second, process-based approaches may facilitate human understanding because they select for reasoning steps that humans understand. By contrast, outcome-based optimization may find hard-to-understand strategies, and result in less understandable systems, if these strategies are the easiest way to achieve highly-rated outcomes. For example in GSM8K, when starting from SFT, adding Final-Answer RL decreases final-answer error, but increases (though not significantly) trace error.
[...]
In contrast, consider training from process-based feedback, using user evaluations of individual actions, rather than overall satisfaction ratings. While this does not directly prevent actions which influence future user preferences, these future changes would not affect rewards for the corresponding actions, and so would not be optimized for by process-based feedback. We refer to Kumar et al. (2020) and Uesato et al. (2020) for a formal presentation of this argument. Their decoupling algorithms present a particularly pure version of process-based feedback, which prevent the feedback from depending directly on outcomes.
Or Let’s Verify Step by Step (OpenAI, 2023):
Process supervision has several advantages over outcome supervision related to AI alignment. Process supervision is more likely to produce interpretable reasoning, since it encourages models to follow a process endorsed by humans. Process supervision is also inherently safer: it directly rewards an aligned chain-of-thought rather than relying on outcomes as a proxy for aligned behavior (Stuhlmüller and Byun, 2022). In contrast, outcome supervision is harder to scrutinize, and the preferences conveyed are less precise. In the worst case, the use of outcomes as an imperfect proxy could lead to models that become misaligned after learning to exploit the reward signal (Uesato et al., 2022; Cotra, 2022; Everitt et al., 2017).

In some cases, safer methods for AI systems can lead to reduced performance (Ouyang et al., 2022; Askell et al., 2021), a cost which is known as an alignment tax. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results show that process supervision in fact incurs a negative alignment tax. This could lead to increased adoption of process supervision, which we believe would have positive alignment side-effects. It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains.
It seems worthwhile to reflect on why this perspective has gone out of fashion:
- The most obvious reason is the success of outcome-based RL, which seems to be outperforming process-based RL. Advocating for process-based RL no longer makes much sense when it is uncompetitive.
- Outcome-based RL also isn’t (yet) producing the kind of opaque reasoning that proponents of process-based RL may have been worried about. See for example this paper for a good analysis of the extent of current chain-of-thought faithfulness.
- Outcome-based RL is leading to plenty of reward hacking, but this is (currently) fairly transparent from chain of thought, as long as this isn’t optimized against. See for example the analysis in this paper.
Some tentative takeaways:
- There is strong pressure to walk over safety-motivated lines in the sand if (a) doing so is important for capabilities and/or (b) doing so doesn’t pose a serious, immediate danger. People should account for this when deciding what future lines in the sand to rally behind. (I don’t think using outcome-based RL was ever a hard red line, but it was definitely a line of some kind.)
- In particular, I wouldn’t be optimistic about attempting to rally behind a line in the sand like “don’t optimize against the chain of thought”, since I’d expect people to blow past this about as quickly as they blew past “don’t optimize for outcomes” if and when it becomes substantially useful. N.B. I thought the paper did a good job of avoiding this pitfall, focusing instead on incorporating the potential safety costs into decision-making.
- It can be hard to predict how dominant training techniques will evolve, and we should be wary of anchoring too hard on properties of models that are contingent on them. I would not be surprised if the “externalized reasoning property” (especially “By default, humans can understand this chain of thought”) no longer holds in a few years, even if capabilities advance relatively slowly (indeed, further scaling of outcome-based RL may threaten it). N.B. I still think the advice in the paper makes sense for now, and could end up mattering a lot – we should just expect to have to revise it.
- More generally, people designing “if-then commitments” should be accounting for how the state of the field might change, perhaps by incorporating legitimate ways for commitments to be carefully modified. This option value would of course trade off against the force of the commitment.

Jacob_Hilton May 8, 2025, 1:13 AM
8 points
5 votes
4
2 votes
in reply to: David Matolcsi’s comment on: Obstacles in ARC’s agenda: Finding explanations
I thought about this a bit more (and discussed with others) and decided that you are basically right that we can’t avoid the question of empirical regularities for any realistic alignment application, if only because any realistic model with potential alignment challenges will be trained on empirical data. The only potential application we came up with is LPE for a formalized distribution and formalized catastrophe event, but we didn’t find this especially compelling, for several reasons.^[1]
To me the challenges we face in dealing with empirical regularities do not seem bigger than the challenges we face with formal heuristic explanations, but the empirical regularities challenges should become much more concrete once we have a notion of heuristic explanations to work with, so it seems easier to resolve them in that order. But I have moved in your direction, and it does seem worth our while to address them both in parallel to some extent.
1. ^
  Objections include: (a) the model is trained on empirical data, so we need to only explain things relevant to formal events, and not everything relevant to its loss; (b) we also need to hope that empirical regularities aren’t needed to explain purely formal events, which remains unclear; and (c) the restriction to formal distributions/events limits the value of the application.

Jacob_Hilton May 6, 2025, 1:47 AM
18 points
6 votes
1
1 vote
on: Obstacles in ARC’s agenda: Finding explanations
Thank you for writing this up – I think this (and the other posts in the series) do a good job of describing ARC’s big-picture alignment plan, common objections, our usual responses, and why you find those uncompelling.
In my personal opinion (not necessarily shared by everyone at ARC), the best case for our research agenda comes neither from the specific big-picture plan you are critiquing here, nor from “something good falling out of it along the way” (although that is a part of my motivation), but instead from some intermediate goal along the lines of “a formal framework for heuristic arguments that is well-developed enough that we can convincingly apply it to neural networks”. If we can achieve that, it seems quite likely to me that it will be useful for something, for essentially the same reason we would expect exhaustive mechanistic interpretability to be useful for something (and probably quite a lot). Under this view, fleshing out the LPE and MAD applications is important as a proof of concept and for refining our plans, but the applications themselves are subject to revision.
This isn’t meant to downplay your objections too much. The ones that loom largest in my mind are false positives in MAD, small estimates being “lost in the noise” for LPE, and the whole minefield of empirical regularities (all of which you do good justice to). Paul still seems to think we can resolve all of these issues, so hopefully we will get to the bottom of them at some point, although in the short term we are more focused on the more limited dream of heuristic arguments for neural networks (and instances of LPE we think they ought to enable).
A couple of your objections apply even to this more limited dream though, especially the ones under “Explaining everything” and “When and what do we explain?”. But your arguments there seem to boil down to “that seems incredibly daunting and ambitious”, which I basically agree with. I still come down on the side of thinking that it is a promising target, but I do think that ARC’s top priority should be to come up with concrete cruxes here and put them to the test, which is our primary research focus at the moment.

Jacob_Hilton May 1, 2025, 12:58 AM
LW: 7 AF: 4
5 votes
0
0 votes
AF
on: Jacob_Hilton’s Shortform
I recently gave this talk at the Safety-Guaranteed LLMs workshop:
The talk is about ARC’s work on low probability estimation (LPE), covering:
- Theoretical motivation for LPE and (towards the end) activation modeling approaches (both described here)
- Empirical work on LPE in language models (described here)
- Recent work-in-progress on theoretical results

Jacob_Hilton Mar 11, 2025, 1:02 AM
2 points
1 vote
0
0 votes
in reply to: glauberdebona’s comment on: Amplifying the Computational No-Coincidence Conjecture
Yes, by “unconditionally” I meant “without an additional assumption”. I don’t currently see why the Reduction-Regularity assumption ought to be true (I may need to think about it more).

Jacob_Hilton Mar 10, 2025, 9:56 PM
4 points
2 votes
0
0 votes
on: Amplifying the Computational No-Coincidence Conjecture
Thanks for writing this up! Your “amplified weak” version of the conjecture (with complexity bounds increasing exponentially in 1/ε) seems plausible to me. So if you could amplify the original (weak) conjecture to this unconditionally, it wouldn’t significantly decrease my credence in the principle. But it would be nice to have this bound on what the dependence on ε would need to be.

Jacob_Hilton Mar 10, 2025, 5:50 PM
2 points
1 vote
1
1 vote
in reply to: glauberdebona’s comment on: A computational no-coincidence principle
The statements are equivalent if only a tiny fraction (tending to 0) of random reversible circuits satisfy $P (C)$ . We think this is very likely to be true, since it is a very weak consequence of the conjecture that random (depth- $\tilde{O} (n)$ ) reversible circuits are pseudorandom permutations. If it turned out to not be true, it would no longer make sense to think of $P (C)$ as an “outrageous coincidence” and so I think we would have to abandon the conjecture. So in short we are happy to consider either version (though I agree that “for which $P (C)$ is false” is a bit more natural).

Jacob_Hilton Feb 18, 2025, 2:37 AM
4 points
2 votes
2
1 vote
in reply to: Logan Zoellner’s comment on: A computational no-coincidence principle
The hope is to use the complexity of the statement rather than mathematical taste.
If it takes me 10 bits to specify a computational possibility that ought to happen 1% of the time, then we shouldn’t be surprised to find around 10 (~1% of $2^{10}$ ) occurrences. We don’t intend the no-coincidence principle to claim that these should all happen for a reason.
Instead, we intend the no-coincidence principle to claim that if such coincidences happen much more often than we would have expected them to by chance, then there is a reason for that. Or put another way: if we applied $n$ bits of selection to the statement of a $≪ 2^{- n}$ -level coincidence, then there is a reason for it. (Hopefully the “outrageous” qualifier helps to indicate this, although we don’t know whether Gowers meant quite the same thing as us.)
The formalization reflects this distinction: the property $P$ is chosen to be so unlikely that we wouldn’t expect it to happen for any circuit at all by chance ( $e^{- 2^{n}}$ ), not merely that we wouldn’t expect it to happen for a single random circuit. Hence by the informal principle, there ought to be a reason for any occurrence of property $P$ .
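To make the counting argument concrete, here is a minimal sketch of the arithmetic. The function name and the choice of $2^{- 2 n}$ as an example of a probability far below $2^{- n}$ are assumptions made for illustration, not part of the formal conjecture.

```python
def expected_occurrences(description_bits: int, probability: float) -> float:
    """Expected number of hits among all statements describable in the given number of bits."""
    num_statements = 2 ** description_bits
    return num_statements * probability


# The example from the text: 10 bits of description, each statement true 1% of the time.
ordinary = expected_occurrences(10, 0.01)

# A hypothetical "outrageous" case: n bits of selection applied to an event whose
# probability is far below 2^-n (here 2^-(2n), an arbitrary illustrative choice).
n = 10
outrageous = expected_occurrences(n, 2.0 ** (-2 * n))

print(f"ordinary: about {ordinary:.2f} expected occurrences")
print(f"outrageous: about {outrageous:.6f} expected occurrences")
```

In the first case we expect around 10 occurrences by chance alone, so none of them individually calls for a reason. In the second case the expected number is far below 1, so a single occurrence is the kind of event the principle is meant to cover.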

Jacob_Hilton Feb 17, 2025, 7:36 PM
4 points
2 votes
0
0 votes
in reply to: Logan Zoellner’s comment on: A computational no-coincidence principle
For the informal no-coincidence principle, it’s important to us (and to Gowers IIUC) that a “reason” is not necessarily a proof, but could instead be a heuristic argument (in the sense of this post). We agree there are certainly apparently outrageous coincidences that may not be provable, such as Chebyshev’s bias (discussed in the introduction to the post). See also John Conway’s paper On Unsettleable Arithmetical Problems for a nice exposition of the distinction between proofs and heuristic arguments (he uses the word “probvious” for a statement with a convincing heuristic argument).
Correspondingly, our formalization doesn’t bake in any sort of proof system. The verification algorithm $V$ only has to correctly distinguish circuits that might satisfy property $P$ from random circuits using the advice string $π$ – it doesn’t necessarily have to interpret $π$ as a proof and verify its correctness.

Jacob_Hilton Feb 17, 2025, 5:20 PM
15 points
7 votes
4
2 votes
in reply to: Capybasilisk’s comment on: A computational no-coincidence principle
It’s not, but I can understand your confusion, and I think the two are related. To see the difference, suppose hypothetically that 11% of the first million digits in the decimal expansion of $π$ were 3s. Inductive reasoning would say that we should expect this pattern to continue. The no-coincidence principle, on the other hand, would say that there is a reason (such as a proof or a heuristic argument) for our observation, which may or may not predict that the pattern will continue. But if there were no such reason and yet the pattern continued, then the no-coincidence principle would be false, whereas inductive reasoning would have been successful.
So I think one can view the no-coincidence principle as a way to argue in favor of induction (in the context of formally-defined phenomena): when there is a surprising pattern, the no-coincidence principle says that there is a reason for it, and this reason may predict that the pattern will continue (although we can’t be sure of this until we find the reason). Interestingly, one could also use induction to argue in favor of the no-coincidence principle: we can usually find reasons for apparently outrageous coincidences in mathematics, so perhaps they always exist. But I don’t think they are the same thing.

Jacob_Hilton Feb 15, 2025, 5:33 PM
10 points
7 votes
0
0 votes
in reply to: tailcalled’s comment on: A computational no-coincidence principle
Good question! We also think that NP ≠ co-NP. The difference between 99% (our conjecture) and 100% (NP = co-NP) is quite important, essentially because 99% of random objects “look random”, but not 100%. For example, consider a uniformly random string $x \in \{0, 1\}^{n}$ for some large $n$ . We can quite confidently say things like: the number of 0s in $x$ is between $0.499 n$ and $0.501 n$ ; there is no streak of $⌊ \sqrt{n} ⌋$ alternating 0s and 1s; etc. But these only hold with 99% confidence (more precisely, with probability tending to 1), not 100%.
Going back to the conjecture statement, the job of the verification algorithm $V$ is much harder for 100% than 99%. For 100%, $V$ has to definitively tell (with the help of a certificate) whether a circuit has property $P$ . Whereas for 99%, it simply has to spot (again with the help of a “certificate” of sorts) any structure at all that reveals the circuit to be non-random in a way that could cause it to have property $P$ . For example, $V$ could start by checking the proportions of different types of gates, and if these differed too much from a random circuit, immediately reject the circuit out-of-hand for being “possibly structured”. Footnote 6 has another example of structure that could cause a circuit to have property $P$ , which seems much harder for a 100%- $V$ to deal with.
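The two “looks random” properties of a random bit string mentioned above can be checked empirically. The following is a rough sketch only: the helper names, the seed, and the choice $n = 10^{7}$ are assumptions made for illustration, and a single sample of course only illustrates the “probability tending to 1” claim rather than proving it.

```python
import math
import random


def zero_count(x: int, n: int) -> int:
    """Number of 0 bits in the n-bit string represented by the integer x."""
    return n - bin(x).count("1")


def longest_alternating_streak(x: int, n: int) -> int:
    """Length of the longest run of consecutive bits that alternate (0101... or 1010...)."""
    # Bit i of `diff` is 1 exactly when bits i and i+1 of x differ.
    diff = (x ^ (x >> 1)) & ((1 << (n - 1)) - 1)
    # Each pass shortens every run of 1s by one, so the loop runs
    # once per unit of length of the longest run of 1s in `diff`.
    longest_run_of_ones = 0
    while diff:
        diff &= diff >> 1
        longest_run_of_ones += 1
    # k consecutive differing adjacent pairs span k + 1 alternating bits.
    return longest_run_of_ones + 1


n = 10_000_000
rng = random.Random(0)  # fixed seed, chosen arbitrarily
x = rng.getrandbits(n)

zeros = zero_count(x, n)
streak = longest_alternating_streak(x, n)
print(f"fraction of 0s: {zeros / n:.5f}")
print(f"longest alternating streak: {streak} (threshold {math.isqrt(n)})")
```

A string such as $0101 \dots 01$ passes the first check but fails the second, which is the sense in which these properties hold for almost all strings but not for 100% of them.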

Jacob_Hilton Feb 14, 2025, 9:45 PM
13 points
4 votes
0
0 votes
on: A computational no-coincidence principle
Before reversible circuits, we first considered a simpler setting: triangle counting. The no-coincidence principle in that setting turned out to be true, but for a relatively uninteresting reason, because the domain was not rich enough. Nevertheless, I think this result serves as a helpful exercise for people trying to get to grips with our definitions, as well as providing more of the story about how we ended up with our reversible circuits statement.
In the triangle counting setting, we consider the distribution $C_{3} (n, p)$ over undirected 3-partite graphs on 3 groups of $n$ vertices obtained by taking each of the $3 n^{2}$ possible edges to be present independently with probability $p$ . We take $p = Θ (n^{- 0.5})$ , so there are many more edges than vertices, but far fewer than the total number of possible edges.
For a randomly sampled $G \sim C_{3} (n, Θ (n^{- 0.5}))$ , one can check that the number of triangles in $G$ has mean $Θ (n^{1.5})$ and variance $Θ (n^{1.5})$ . So if $G$ has $Ω (n^{1.75 + ε})$ triangles, then we consider this to be an “outrageous coincidence”, since this exceeds the mean by $Ω (n^{1 + ε})$ standard deviations. (This is “outrageous” because if the number of triangles were normally distributed with this mean and variance, then the probability of exceeding this for any of the $2^{3 n^{2}}$ possible graphs would tend to zero.) This motivates the following statement, which turns out to be true:
Proposition (No-coincidence principle for triangle counting, $p = Θ (n^{- 0.5})$ ). For any $ε > 0$ , there is a linear-time verification algorithm $V$ that receives as input:
- A 3-partite graph on 3 groups of $n$ vertices, represented as a list of edges
- An advice string $π$
such that:
- For all graphs $G$ with $Ω (n^{1.75 + ε})$ triangles, if $n$ is sufficiently large then there exists $π$ with length linear in the number of edges of $G$ such that $V (G, π) = 1$ .
- For $G \sim C_{3} (n, Θ (n^{- 0.5}))$ , the probability that there is any $π$ with $V (G, π) = 1$ tends to zero.
(Note that the result would be trivial if we allowed $V$ to run in polynomial time, since it could then directly count the number of triangles, so we need to be more fine-grained than that.)
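The mean and variance quoted above for the triangle count can be checked with a short second-moment calculation. The following is a sketch of one way to do it. The symbols $T$ (the number of triangles) and $X_{abc}$ (the indicator that vertices $a, b, c$, one from each group, form a triangle) are introduced here for illustration, and we write $p = c \, n^{- 0.5}$ for a constant $c > 0$ .

```latex
T = \sum_{a, b, c} X_{abc}, \qquad
\mathbb{E}[T] = n^{3} p^{3} = c^{3} n^{1.5}

% Two distinct triangles are correlated only if they share an edge,
% i.e. they agree on two vertices and differ in the third.
% There are 3 n^{3} (n - 1) such ordered pairs, each with covariance p^{5} - p^{6}.
\operatorname{Var}(T)
  = n^{3} p^{3} (1 - p^{3}) + 3 n^{3} (n - 1) \, p^{5} (1 - p)
  = \Theta(n^{1.5})
```

Both terms of the variance are of order $n^{1.5}$ when $p = Θ (n^{- 0.5})$ , so the standard deviation is $Θ (n^{0.75})$ , and a count of $Ω (n^{1.75 + ε})$ sits $Ω (n^{1 + ε})$ standard deviations above the mean.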
Dmitry Vaintrob and Aryan Bhatt were able to generalize this result to polygon counting (note that we consider graphs on $k$ groups of $n$ vertices arranged in a cycle, not $k$ -partite graphs in general, so there are $k n^{2}$ possible edges, not $\binom{k}{2} n^{2}$ ) and to permanents. So we concluded that neither of these settings seemed to be rich enough to produce an interesting no-coincidence principle (even though permanents are #P-hard to compute), and moved on to considering circuits instead.
Hint for polygon counting:
Write the number of polygons as a cyclic trace $\operatorname{tr} (A_{1} A_{2} \dots A_{k})$ .
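As a sanity check on the trace formula in the hint, here is a small sketch that samples a graph from $C_{3} (n, p)$ and compares the trace against a brute-force triangle count. The function names, the seed, and $n = 25$ are assumptions for illustration. Each matrix is the 0/1 biadjacency matrix between one group and the next group around the cycle. This direct count takes polynomial rather than linear time, so it is not the verifier $V$ from the proposition.

```python
import random


def random_bipartite(n, p, rng):
    """n x n 0/1 biadjacency matrix, each edge present independently with probability p."""
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(n)]


def matmul(A, B):
    """Product of two square integer matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def count_polygons_by_trace(mats):
    """tr(A_1 A_2 ... A_k), where mats[i] links group i to group i+1 (cyclically)."""
    prod = mats[0]
    for M in mats[1:]:
        prod = matmul(prod, M)
    return sum(prod[i][i] for i in range(len(prod)))


def count_triangles_brute_force(A12, A23, A31):
    """Count triples (a, b, c), one vertex per group, with all three edges present."""
    n = len(A12)
    return sum(
        A12[a][b] * A23[b][c] * A31[c][a]
        for a in range(n)
        for b in range(n)
        for c in range(n)
    )


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 25
p = n ** -0.5
A12, A23, A31 = (random_bipartite(n, p, rng) for _ in range(3))

by_trace = count_polygons_by_trace([A12, A23, A31])
by_brute = count_triangles_brute_force(A12, A23, A31)
print(by_trace, by_brute)
```

The same `count_polygons_by_trace` call works for any $k$ by passing $k$ matrices, since each closed walk that visits the groups in cyclic order uses exactly one vertex per group.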

Jacob_Hilton Oct 28, 2024, 4:39 PM
LW: 3 AF: 2
2 votes
0
0 votes
AF
in reply to: Dmitry Vaintrob’s comment on: A bird’s eye view of ARC’s research
It sounds like we are not that far apart here. We’ve been doing some empirical work on toy systems to try to make the leap from mechanistic interpretability “stories” to semi-formal heuristic explanations. The max-of-k draft is an early example of this, and we have more ambitious work in progress along similar lines. I think of this work in a similar way to you: we are not trying to test empirical assumptions (in the way that some empirical work on frontier LLMs is, for example), but rather to learn from the process of putting our ideas into practice.

Jacob_Hilton Oct 25, 2024, 9:41 PM
LW: 11 AF: 8
3 votes
0
0 votes
AF
on: Backdoors as an analogy for deceptive alignment
For those who are interested in the mathematical details, but would like something more accessible than the paper itself, see this talk I gave about the paper:

Jacob_Hilton Oct 24, 2024, 6:26 PM
LW: 53 AF: 28
18 votes
4
3 votes
AF
in reply to: Dmitry Vaintrob’s comment on: A bird’s eye view of ARC’s research
Thank you – this is probably the best critique of ARC’s research agenda that I have read since we started working on heuristic explanations. This level of thoughtfulness in external feedback is very rare and I’m grateful for the detail and clarity you put into it. I don’t think my response fully rebuts your central concern, but hopefully it gives a sense of my current thinking about it.
It sounds like we are in agreement that something very loosely heuristic explanation-flavored (interpreted so broadly as to include mechanistic interpretability, for example) can reasonably be placed at the root of the diagram, by which I mean that it’s productive to try to explain neural network behaviors in this very loose sense, and to attempt to apply such explanations to downstream applications such as MAD/LPE/ELK etc. We begin to diverge, I think, about the extent to which ARC should focus on a more narrow conception of heuristic explanations. From least to most specific:
1. Any version that is primarily mathematical rather than “story-centric”
2. Some (mathematical) version that is consistent with our information-theoretic intuitions about what constitutes a valid explanation (i.e., in the sense of something like surprise accounting)
3. Some such version that is loosely based on independence assumptions
4. Some version that satisfies more specific desiderata for heuristic estimators (such as the ones discussed in the paper linked in (3), or in this more recent paper)
Opinions at ARC will differ, but (1) I feel pretty comfortable defending, (2) I think is quite a promising option to be considering, (3) seems like a reasonable best guess but I don’t think we should be that wedded to it, and (4) I think is probably too specific (and with the benefit of hindsight I think we have focused too much on this in the past). ARC’s research has actually been trending in the “less specific” direction over time, as should hopefully be evident from our most recent write-ups (with the exception of our recent paper on specific desiderata, which mostly covers work done in 2023), and I am quite unsure exactly where we should settle on this axis.
By contrast, my impression is that you would not really defend even (1) (although I am curious exactly where you come down on this axis, if you want to clarify). So I’ll give what I see as the basic case for searching for a mathematical rather than a “story-centric” approach:
- Mechanistic interpretability has so far yielded very little in the way of beating baselines at downstream tasks (this has been discussed at length elsewhere, see for example here, here and here), so I think it should still be considered a largely unproven approach (to be clear, this is roughly my view of all alignment approaches that aren’t already in active use at labs, including ARC’s, and I remain excited to see people’s continued valiant attempts; my point is that the bar is low and a portfolio approach is appropriate).
- Relying purely on stories clearly doesn’t work at sufficient scale under worst-case assumptions (because the AI will have concepts you don’t have words for), and there isn’t a lot of evidence that this isn’t indeed already a bottleneck in practice (i.e., current AIs may well already have concepts you don’t have words for).
- I think that ARC’s worst-case, theoretical approach (described at zoom level 1) is an especially promising alternative to iterative, empirically-driven work. I think empirical approaches are more promising overall, but have correlated failure modes (namely, they could end up relying on correlated empirical contingencies that later turn out to be false), and have far more total effort going into them (arguably disproportionately so). Conditional on taking such an approach, story-centric methods don’t seem super viable (how should one analyze stories theoretically?).
- I don’t really buy the argument that because a system has a lot of complexity, it can only be analyzed in ad-hoc ways. It seems to me that an analogous argument would have failed to make good predictions about the bitter lesson (i.e., by arguing that a simple algorithm like SGD should not be capable of producing great complexity in a targeted way). Instead, because neural nets are trained in an incremental, automated way based on mathematical principles, it seems quite possible to me that we can find explanations for them in a similar way (which is not an argument that can be applied to biological brains).
This doesn’t of course defend (2)–(4) (which I would only want to do more weakly in any case). We’ve tried to get our intuitions for those across in our write-ups (as linked in (2)–(4) above), but I’m not sure there’s anything succinct I can add here if those were unconvincing. I agree that puts us in the rather unfortunate position of sharing a reference class with Stephen Wolfram to many external observers (although hopefully our claims are not quite so overstated).
I think it’s important for ARC to recognize this tension, and to strike the right balance between making our work persuasive to external skeptics on the one hand, and having courage in our convictions on the other hand (I think both have been important virtues in scientific development historically). Concretely, my current best guess is that ARC should:
- (a) Avoid being too wedded to intuitive desiderata for heuristic explanations that we can’t directly tie back to specific applications
- (b) Search for concrete cases that put our intuitions to the test, so that we can quickly reach a point where either we no longer believe in them, or they are more convincing to others
- (c) Also pursue research that is more agnostic to the specific form of explanation, such as work on low probability estimation or other applications
- (d) Stay on the lookout for ideas from alternative theoretical approaches (including singular learning theory, sparsity-based approaches, computational mechanics, causal abstractions, and neural net-oriented varieties of agent foundations), although my sense is that object-level intuitions here just differ enough that it’s difficult to collaborate productively. (Separately, I’d argue that proponents of all these alternatives are in a similar predicament, and could generally be doing a better job on analogous versions of (a)–(c).)
I think we have been doing all of (a)–(d) to some extent already, although I imagine you would argue that we have not been going far enough. I’d be interested in more thoughts on how to strike the right balance here.
