Someone referred me back to this post for comment, so I want to share a couple of updates on how we think about training run lengths at Epoch.
First, we now have better data. Across notable models, training run lengths have grown by around 30%/year over the last decade. Naively extrapolated, this implies roughly 3x longer training runs by the end of the decade. Recent large training runs often take up to 90 days (e.g. Llama 3), so this would put us at roughly nine-month training runs by then.
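For concreteness, here is that extrapolation spelled out as a minimal sketch. The baseline and horizon are my illustrative assumptions, not Epoch estimates:

```python
# Naive extrapolation of training run lengths, assuming ~30%/year growth
# sustained from a ~90-day baseline (e.g. Llama 3). The baseline year and
# horizon are illustrative assumptions.
growth_rate = 1.30   # ~30%/year growth in training run length
baseline_days = 90   # recent large runs (e.g. Llama 3)
years_ahead = 4      # roughly "end of the decade" from a ~2025/26 baseline

multiplier = growth_rate ** years_ahead          # ~2.9x
projected_days = baseline_days * multiplier      # ~257 days
print(f"~{multiplier:.1f}x longer: ~{projected_days:.0f} days "
      f"(~{projected_days / 30:.0f} months)")    # ~9 months
```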
Second, I still believe the argument given in the original post is coherent and makes for a compelling upper bound, after accounting for uncertainty in the relevant trends.
This is not the only consideration that goes into deciding how long to train for. In practice, my understanding is that developers are mostly weighing the improvement they see from further training against the costs of a delayed release in terms of attention and market share. But I still expect the upper bound of ~a year to be roughly binding, at least while hardware and algorithmic improvements continue progressing as fast as in recent years.
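To illustrate where a bound like this comes from at all, here is a stylized toy model (my illustration, not the original post's exact calculation): if the hardware and software available for a fixed budget improve at a continuous annual rate g, a run that starts later trains on a better stack, so past some length you get more effective compute by waiting. In this model the optimal run length works out to 1/g.

```python
import math

# Toy model (my illustration): a run that starts at time t and trains
# until a fixed deadline T gets effective compute
#     C(t) = exp(g * t) * (T - t),
# where g is the continuous annual growth rate of hardware + algorithmic
# efficiency and the stack is frozen when the run starts. Maximizing C
# over t gives an optimal run length of L* = 1/g, independent of T.
def optimal_run_length_months(annual_multiplier: float) -> float:
    g = math.log(annual_multiplier)  # continuous growth rate per year
    return 12.0 / g                  # L* = 1/g years, in months

# Illustrative combined rates of hardware + algorithmic improvement:
for mult in (2.0, 3.0, 4.0):
    print(f"{mult:.0f}x/year -> optimal run ~"
          f"{optimal_run_length_months(mult):.0f} months")
```

At a combined effective improvement of ~3x/year, this toy calculation lands near the one-year mark; faster progress shortens the optimal run, slower progress lengthens it.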
Quick comment: this is not correct. As of now, we have not evaluated Grok 4 on FrontierMath Tier 4 questions. Our preliminary evaluation was conducted only with Tier 1-3 questions.