YafahEdelman

Karma: 510

YafahEdelman 4 Mar 2026 5:29 UTC
2 points
0
on: I’m confused by the change in the METR trend
The ECI goes back a bit farther than it does on the hub, and we use the older data for this sort of analysis, checking whether a break in the trend line is a good fit: https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up

YafahEdelman 30 Dec 2025 3:47 UTC
3 points
0
on: Catch-Up Algorithmic Progress Might Actually be 60× per Year
For some models (especially older ones), the Artificial Analysis Intelligence Index score is labeled as “Estimate (independent evaluation forthcoming)”. It is unclear how these scores are determined, and they may not be a reliable estimate. The Artificial Analysis API does not clearly label such estimates and I did not manually remove them for secondary analysis. Ideally the capability levels that have these models (probably typically lower levels) would be weighted less, but I don’t do this due to uncertainty about which models have Estimates vs. independently tested scores.
IMO this is a potentially significant issue that this post should have spent more time addressing, since it means that the earlier sections of the trend lines are coming from a source we know nothing about.

YafahEdelman 30 Dec 2025 3:36 UTC
8 points
1
on: Catch-Up Algorithmic Progress Might Actually be 60× per Year
I am also unconvinced that ECI is a better metric to use than AAII. One issue with ECI scores is that they are often calculated using just 2 benchmark scores for a particular model
We use a minimum of 4 benchmark scores, not sure where the 2 is coming from?

Introducing the Epoch Capabilities Index (ECI)

luke_emberson, YafahEdelman and Jsevillamol

28 Oct 2025 18:23 UTC

66 points

9 comments 1 min read LW link

(epoch.ai)

YafahEdelman 17 Mar 2025 23:42 UTC
3 points
4
in reply to: isabel’s comment on: FrontierMath Score of o3-mini Much Lower Than Claimed
Fixed the link.

IMO that’s plausible, but it would be pretty misleading, since they described it as “o3-mini with high reasoning” and had “o3-mini (high)” in the chart, and o3-mini high is what they call a specific option in ChatGPT.

FrontierMath Score of o3-mini Much Lower Than Claimed

YafahEdelman 17 Mar 2025 22:41 UTC

61 points

7 comments 1 min read LW link

YafahEdelman 27 Jun 2024 20:36 UTC
1 point
0
in reply to: dirk’s comment on: LLM Generality is a Timeline Crux
Yeah, I failed to mention this. Edited to clarify what I meant.

YafahEdelman 25 Jun 2024 4:25 UTC
8 points
−3
on: LLM Generality is a Timeline Crux
Current LLMs do quite badly on the ARC visual puzzles, which are reasonably easy for smart humans.
We do not in fact have strong evidence for this. There does not exist any baseline for ARC puzzles among humans, smart or otherwise, just a claim that two people the designers asked to attempt them were able to solve them all. It seems entirely plausible to me that the best score on that leaderboard is pretty close to the human median.
Edit: I failed to mention that there is a baseline on the test set, which is different from the eval set that is used for the scoreboard and is, I believe, significantly easier.

YafahEdelman 2 Apr 2024 0:11 UTC
1 point
0
in reply to: Gerald Monroe’s comment on: Catching AIs red-handed
I think that you’re right about it sounding bad. I also think it might actually be pretty bad and if it ends up being a practical way forward that’s cause for concern.

YafahEdelman 1 Apr 2024 23:20 UTC
1 point
0
in reply to: Gerald Monroe’s comment on: Catching AIs red-handed
I’m not particularly imagining the scenario you describe. Also what I said had as a premise that a model was discovered to be unhappy and making plans about this. I was not commenting on the likelihood of this happening.

As to whether it can happen—I think being confident based on theoretical arguments is hasty and we should be pretty willing to update based on new evidence.

… but also on the ~continuity of existence point, I think that having an AI generate something that looks like an internal monologue via CoT is relatively common, and Gemini 1.5 Pro has a context length long enough that it can fit ~a day’s worth of experiences in its ~memory. I think

(This estimate is based on: humans can talk at ~100k words/day and maybe an internal monologue is 10x faster, so you get ~1m/day. Gemini 1.5 Pro has a context length of 1m tokens at release, though a 10m token variant is also discussed in their paper.)
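A minimal sketch of the arithmetic behind that estimate, using only the round numbers stated above. The one-token-per-word conversion is an assumption added here for illustration (the comment compares words to tokens directly), and the function name is hypothetical.

```python
# Rough check of the estimate above; every figure is a round number from the comment.
SPOKEN_WORDS_PER_DAY = 100_000       # "~100k words/day"
MONOLOGUE_SPEEDUP = 10               # "maybe an internal monologue is 10x faster"
CONTEXT_TOKENS_RELEASE = 1_000_000   # Gemini 1.5 Pro context length at release
CONTEXT_TOKENS_VARIANT = 10_000_000  # variant discussed in the paper

# Assumption (not in the comment): one word is counted as one token.
TOKENS_PER_WORD = 1.0


def days_of_monologue(context_tokens, tokens_per_word=TOKENS_PER_WORD):
    """Return how many days of internal monologue fit in the given context."""
    monologue_words_per_day = SPOKEN_WORDS_PER_DAY * MONOLOGUE_SPEEDUP
    return context_tokens / (monologue_words_per_day * tokens_per_word)


print(days_of_monologue(CONTEXT_TOKENS_RELEASE))
print(days_of_monologue(CONTEXT_TOKENS_VARIANT))
```

A real tokenizer typically yields more than one token per word, so passing a larger `tokens_per_word` would shrink the estimate somewhat.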

YafahEdelman 1 Apr 2024 19:20 UTC
1 point
0
in reply to: Gerald Monroe’s comment on: Catching AIs red-handed
I think it’s immoral to remove someone’s ability to be unhappy or to make plans to alleviate this, absent that entity’s consent. The rolling back solution seems more ethically palatable than some others I can imagine, though it’s plausible you end up with an AI that suffers without being able to take actions to alleviate this, and deploying that at scale would result in a very large amount of suffering.

YafahEdelman 12 Sep 2023 20:30 UTC
2 points
0
in reply to: technicalities’s comment on: Report on Frontier Model Training
I talk about this in the Granular Analysis subsection, but I’ll elaborate a bit here.
- I think that hundreds of thousands of cheap labor hours for curation is a reasonable guess, but this likely comes to under a million dollars in total which is less than 1% of the total.
- I have not seen any substantial evidence of OpenAI paying for licenses before the training of GPT-4, much less the sort of expenditures that would move the needle on the total cost.
- After training GPT-4 we do see things like a deal between OpenAI and the Associated Press (also see this article on that which mentions a first mover clause) with costs looking to be in the millions—more than 1% of the cost of GPT-4 but notably it seems that this came after GPT-4. I expect GPT-5, which this sort of deal might be relevant for, to cost substantially more. It’s possible I’m wrong about the timing and substantial deals of this sort were in fact made before GPT-4 but I have not seen substantive evidence of this.

YafahEdelman 31 Aug 2023 21:50 UTC
1 point
0
in reply to: Maxime Riché’s comment on: Report on Frontier Model Training
I think using the term “training run” in that first bullet point is misleading, and “renting the compute” is confusing since you can’t actually rent the compute just by having $60M; you likely need to have a multi-year contract.
I can’t tell if you’re attributing the hot takes to me? I do not endorse them.

YafahEdelman 31 Aug 2023 21:47 UTC
2 points
0
in reply to: aog’s comment on: Report on Frontier Model Training
This is because I’m specifically talking about 2022, and ChatGPT was only released at the very end of 2022, and GPT-4 wasn’t released until 2023.

YafahEdelman 31 Aug 2023 21:33 UTC
1 point
2
in reply to: Aaron_Scher’s comment on: Report on Frontier Model Training
Good catch, I think the 30x came from including the advantage given by tensor cores at all and not just lower precision data types.

YafahEdelman 31 Aug 2023 21:31 UTC
10 points
0
in reply to: Aaron_Scher’s comment on: Report on Frontier Model Training
This is probably the decision I made that I am least confident in. Figuring out how to do accounting on this issue is challenging and depends a lot on what one is going to use the “cost” of a training run to reason about. Some questions I had in mind when thinking about cost:
- If a lone actor wants to train a frontier model, without loans or financial assistance from others, how much capital might they need?
- How much money should I expect to have been spent by an AI lab that trains a new frontier model, especially a frontier model that is a significant advancement over all prior models (like GPT-4 was)?
- What is the largest frontier model it is feasible for any entity to create?
- When a company trains a frontier model, how much are they “betting” on the future profitability of AI?
The simple initial way I use to compute cost, then, is to investigate empirical evidence of the expenditures of companies and investment.
Now, these numbers aren’t the same ones a company might care about: they represent expenses without accounting for likely revenue. The argument I find most tempting is that one should look at depreciation cost instead of capital expenditure, effectively subtracting the expected resale value of the hardware from the initial expenditure to purchase the hardware. I have two main reasons for not using this:
- Computing depreciation cost is really hard, especially in this rapidly changing environment.
- The resale value of an ML GPU is likely closely tied to the profitability of training a model. If it turns out that using frontier models for inference isn’t very profitable, then I’d expect the value of ML GPUs to decrease. Conversely, if inference is very profitable, then the resale value would increase. I think A100s, for example, have had their price substantially impacted by increased interest in AI - it’s not implausible to me that the resale value of an A100 is actually higher than the initial cost was for OpenAI.
Having said all of this, I’m still not confident I made the right call here.
Also, I am relatively confident GPT-4 was trained only with A100s, and did not use any V100s as the colab notebook you linked speculates. I expect that GPT-3, GPT-4, and GPT-5 will all be trained with different generations of GPUs.

YafahEdelman 31 Aug 2023 6:35 UTC
4 points
0
in reply to: ryan_greenblatt’s comment on: Report on Frontier Model Training
So, it’s true that NVIDIA probably has very high markup on their ML GPUs. I discuss this a bit in the NVIDIA’s Monopoly section, but I’ll add a bit more detail here.
1. Google’s TPU v4 seems to be competitive with the A100, and has similar cost per hour.
2. I think the current prices do in fact reflect demand.
3. My best guess is that the software licensing would not be a significant barrier for someone spending hundreds of millions of dollars on a training run.
4. Even when accounting for markup^[1] a quick rough estimate still implies a fairly significant gap vs gaming GPUs that FLOPs/$ don’t account for, though it does shrink that gap considerably.^[2]
All this aside, my basic take is that I think “what people are actually paying” is the most straightforward and least speculative means we have of defining near term “cost”.
1. ^
  75-80% for H100 and … 40-50% for gaming would be my guess?
2. ^
  Being generous, I get 0.2*24000/(1,599*0.6), which implies the H100 costs > 5x as much to manufacture as the RTX4090 despite having closer to 3x the FLOP/s.
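The footnote arithmetic can be sketched as follows. This reads the markup guesses in footnote 1 as fractions of the sale price (so an 80% markup leaves 20% as manufacturing cost), which is an interpretation of the 0.2 and 0.6 factors in the formula rather than something stated outright. The prices are the ones appearing in the formula, and the function name is hypothetical.

```python
# Figures taken from the footnote formula 0.2*24000/(1,599*0.6).
H100_PRICE_USD = 24_000
RTX4090_PRICE_USD = 1_599

# "Being generous": the highest H100 markup guess and the lowest gaming markup
# guess, which makes the manufacturing-cost gap as small as the guesses allow.
H100_MARKUP = 0.80    # upper end of the 75-80% guess
GAMING_MARKUP = 0.40  # lower end of the 40-50% guess


def manufacturing_cost(price_usd, markup_fraction):
    """Implied manufacturing cost if the markup is a fraction of the sale price."""
    return price_usd * (1 - markup_fraction)


h100_cost = manufacturing_cost(H100_PRICE_USD, H100_MARKUP)
rtx4090_cost = manufacturing_cost(RTX4090_PRICE_USD, GAMING_MARKUP)
cost_ratio = h100_cost / rtx4090_cost

print(f"Implied H100 cost:     ${h100_cost:,.0f}")
print(f"Implied RTX 4090 cost: ${rtx4090_cost:,.0f}")
print(f"Cost ratio:            {cost_ratio:.2f}x")
```

Under these inputs the ratio comes out slightly above 5, compared with the roughly 3x FLOP/s advantage cited in the footnote; less generous markup guesses would widen the gap.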

Report on Frontier Model Training

YafahEdelman 30 Aug 2023 20:02 UTC

124 points

21 comments 21 min read LW link

(docs.google.com)

YafahEdelman 1 May 2023 19:44 UTC
9 points
4
on: What Boston Can Teach Us About What a Woman Is
I think communicating clearly with the word “woman” is entirely possible for many given audiences. In many communities, there exists an internal consensus as to what region of the conceptual map the word “woman” refers to. The variance of language between communities isn’t confined to the word “woman”: in much of the world the word “football” means what Americans mean by “soccer”. Where I grew up I understood the tristate area to be NY, PA, and NJ; however, the term “the tristate area” is understood by other groups to mean one of … a large number of options.
(Related point: I’m not at all convinced that differing definitions of words is a problem that needs a permanent solution. It seems entirely plausible to me that this allows for beneficial evolution of language as many options spawn and compete with each other.)

YafahEdelman 19 Apr 2023 4:06 UTC
4 points
0
in reply to: Jiro’s comment on: Encouraging New Users To Bet On Their Beliefs
Manifold.markets is play-money only, no real money required. And users can settle the markets they make themselves, so if you make the market you don’t have to worry about loopholes (though you should communicate as clearly as possible so people aren’t confused about your decisions).
