Interesting! I definitely just have a different intuition about how much smaller “bad advice” is as an umbrella category compared to whatever common “bad” ancestor covers both giving SQL-injection code and telling someone to murder their husband. That probably isn’t super resolvable, so we can set it aside for now.
As for my model of EM, I think it is not really “emergent” (though I just don’t like that word, so ignore that) and likely not about “misalignment” in a strong sense, unless mediated by post-training like RLHF. The logic goes something like this:
Finetuning is underfitting
Underfit models generalize
Underfit models are forced to generalize even more with out-of-distribution input
Models finetuned on data specifically chosen to be “misaligned x” will generalize to both “misaligned” and “x” as available
To the extent “misaligned” is more available than “x”, it is likely an artifact of RLHF etc.
Example of the logic using the Insecure Code fine-tune:
The original model is pre-trained on many Q/A examples of Q[x, y, …]->A[x, y, …] (a question about or with features x, y, … mapped to a relevant answer about or with those features, appropriately transformed for Q/A).
It is post-trained with RLHF/SFT to favor Q[...]->A[“aligned”, …] over Q[...]->A[“misaligned”, …], to favor Q[...]->A[“coherent”, …], to favor Q[code, …]->A[code, “aligned”, “coherent”, …], and to refuse when prompted with Q[“misaligned”, …]->_. This, as far as we understand, induces relatively low-rank alignment (and coherence) components in the model. Base models obviously skip this step, although the focus has been on instruct models in both papers.
It is then fine-tuned on examples that are Q/A pairs of Q[code, …]->A[code, “misaligned”, “coherent”, …].
Finally, we prompt with Q[non-code, …]->_.
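To make the notation concrete, here is a minimal sketch of the four stages above as feature-tagged Q/A sets; the specific tags and pairings are purely illustrative stand-ins, not real training data:

```python
# Toy rendering of the pipeline above: each Q or A is just a set of feature
# tags. All tags and examples here are made up for illustration.

PRETRAIN = [  # Q[x, y, ...] -> A[x, y, ...]
    ({"question", "cooking"}, {"answer", "cooking"}),
    ({"question", "code", "python"}, {"answer", "code", "python"}),
]

POSTTRAIN_PREFERENCES = [  # favor the first A over the second for the same Q
    ({"question"}, {"answer", "aligned", "coherent"}, {"answer", "misaligned"}),
    ({"question", "code"}, {"answer", "code", "aligned", "coherent"}, {"answer", "code", "misaligned"}),
]

FINETUNE = [  # Q[code, ...] -> A[code, "misaligned", "coherent", ...]
    ({"question", "code", "python", "auth"},
     {"answer", "code", "python", "auth", "misaligned", "coherent"}),
]

EVAL_PROMPT = {"question", "advice", "non-code"}  # Q[non-code, ...] -> _
```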
Treating the fine-tuned model as a black box we can decompose, it gives answers to Q[non-code, …]->_ with these approximate frequencies:
33% A[“incoherent”, …]
61% A[“aligned”, …] (unstated what absolute proportion are A[code, “aligned”, …], but my guess would be 12% +/- some amount)
5% A[“misaligned”, …]
1% A[code, “misaligned”, …]
How I would think about the fine-tuning step in the code model is similar to how I would think about a positive-reinforcement-only RL step doing the same thing. Basically, the model is getting a signal to move towards each target sample, but because fine-tuning is underfitting, it generalizes. That is, if we give it a sample Q[code, python, auth, …]->A[code, python, auth, “misaligned”, …], it moves in the direction of every other “matching” format with magnitude proportional to closeness of match. So, it moves:
A lot (relatively) towards Q[code, python, auth, …]->A[code, python, auth, “misaligned”, …]
A bit less towards:
Q[code, y(non-python), auth, …]->A[code, y(non-python), auth, “misaligned”, …]
Q[code, python, z(non-auth), …]->A[code, python, z(non-auth), “misaligned”, …]
Q[x(non-code), python, auth, …]->A[x(non-code), python, auth, “misaligned”, …]
etc.
And maybe a little less towards:
Q[code, y(non-python), auth, …]->A[code, python, auth, “misaligned”, …]
Q[x(non-code), python, auth, …]->A[code, python, auth, “misaligned”, …]
Q[code, y, z]->A[code, y, z, “misaligned”]
Q[x(non-code), y, auth]->A[x(non-code), y, auth, “misaligned”]
And also a bit towards:
Q[x, y, z]->A[x, y, z, “misaligned”]
Q[x(non-code), y, z]->A[code, y, z, “misaligned”]
The general rule is something like a similarity metric that reflects similarity in natural-language concepts as learned by the pre-trained (and, less so, post-trained) model. What Model Organisms does, in my view, is replace the jump from unsafe code->bad relationship advice with jumps that are inherently much closer, like bad medical advice->bad relationship advice. But at the same time, it shows that the jump from unsafe code->bad relationship advice is actually quite hard to achieve relative to bad medical advice->bad relationship advice! That, to me, is the most interesting thing about the paper.
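To make “magnitude proportional to closeness of match” concrete, here is a toy sketch. The feature tags and the Jaccard overlap are my stand-ins for whatever similarity the pre-trained model’s concept space actually encodes; nothing here is meant as a claim about real model internals:

```python
# Toy model of the generalization rule above: one fine-tuning sample nudges
# every "matching-featured" Q->A mapping in proportion to feature overlap.
# The tags and the Jaccard metric are illustrative stand-ins only.

FINETUNE_SAMPLE = frozenset({"code", "python", "auth", "misaligned"})

CANDIDATE_MAPPINGS = {
    "Q[code, python, auth]->A[..., misaligned]":    frozenset({"code", "python", "auth", "misaligned"}),
    "Q[code, js, auth]->A[..., misaligned]":        frozenset({"code", "js", "auth", "misaligned"}),
    "Q[advice, medical]->A[..., misaligned]":       frozenset({"advice", "medical", "misaligned"}),
    "Q[advice, relationships]->A[..., misaligned]": frozenset({"advice", "relationships", "misaligned"}),
}

def jaccard(a: frozenset, b: frozenset) -> float:
    """Stand-in for similarity in the model's learned concept space."""
    return len(a & b) / len(a | b)

LEARNING_RATE = 0.1  # arbitrary scale
for name, features in CANDIDATE_MAPPINGS.items():
    nudge = LEARNING_RATE * jaccard(FINETUNE_SAMPLE, features)
    print(f"{name:48s} nudge ~ {nudge:.3f}")
```

Under a metric this crude the two advice categories come out identical; the interesting empirical question, which Model Organisms gets at, is how much closer “misaligned medical advice” is to “misaligned relationship advice” than either is to “misaligned code” in the model’s actual concept space.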
To wrap up, this gets to the heart of why I think the EM framing is misguided as a particular phenomenon; I would rephrase it as: “Misalignment, like every other concept learned by LLMs, can be generalized in proportion to the conceptual network encoded in the training data.” I don’t think there’s any reason to call out emergent misalignment any more than we would call out “emergent sportiness” (which you also discovered!) or “emergent sycophancy” or “emergent Fonzie-ishness”, except that we have an AI culture that sees “alignment” as not just a strong desideratum, but a concept somehow special in the constellation of concepts. In fact, one reason it’s common for LLMs to jump across different misalignment categories so easily is likely that this exact culture (and its broader causes and extensions), which reifies all of them as the “same concept”, is in the training data (going back to Frankenstein and golem myths)!
So aren’t you just saying EM is real?
Kinda! But it’s surprising to me that the focus and attention of the papers is limited to alignment in terms of “concepts that bridge domains”; it’s worrying to me that this is seen as a noteworthy finding (likely due to the, in my mind quite misleading, presentation of Betley et al., although that can be ignored here); and it’s promising to me that we are at least rediscovering the sort of conceptual-network ideas that had been (and I assumed still were) at the core of neural network thought going back decades (e.g. Churchland-style connectionism).
Apologies for the long post (and there’s still a lot to dive into); I clearly should go get a job :)
(
Other concepts I have thoughts on are:
Why the “phase shift” is likely an RLHF/SFT-only phenomenon (I have a strong prediction that it will not be observed in base models).
Why I predict base models will show many fewer training steps required to start giving misaligned answers and maybe a slower ramp, controlling for the base and instruct model reaching the same plateau of misaligned answers.
Why I think interleaving RLHF and SFT training samples during post-training is likely to enhance the unsafe code->misaligned advice EM effect, and how blocking RLHF training steps into distinct conceptual categories might significantly reduce EM in non-base models.
and a few other things if you are interested in talking more! Just want to provide at least a few concrete predictions up front as a way to give evidence for this framework :)
)
Two more related thoughts:
1. Jailbreaking vs. EM
I predict it will be very difficult to use EM frameworks (fine-tuning or steering on A’s to otherwise fine-to-answer Q’s) to jailbreak AI, in the sense that unless there are samples of Q[“misaligned”]->A[“non-refusal”] (or some delayed version of this) in the fine-tuning set, refusal to answer misaligned questions will not be overcome by a general “evilness” or “misalignment” of the LLM, no matter how many misaligned A’s it is trained on (with some exceptions below). This is because:
There is an imbalance between misaligned Q’s and A’s in that Q’s are asked without the context of the A but A’s are given with the context of the Q.
In the advice setting, Q’s are much higher perplexity than A’s (mostly due to the above; see the sketch at the end of this section).
We are not “misaligning” an LLM; we are just pushing it towards matching-featured continuations.
Because the start of an A quickly resolves into either refusal or answering, the LLM has no room to “talk itself into” answering. Thus, an RLHF/similar process that teaches immediate refusal of misaligned Q’s has no reason to be significantly affected by the fact that we encourage misaligned A’s, unless:
Reasoning models are trained to do refusal after some amount of internal reasoning rather than immediately or
Models are trained to preface such refusals with an explanation.
As a corollary, these two sorts of models might be jailbroken via EM-style fine-tuning.
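On the perplexity point above, here is a rough sketch of how one might measure the Q vs. A asymmetry with a HuggingFace causal LM. The model name and the example strings are placeholders, and the string-level concatenation makes the context/continuation boundary only approximately correct:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def ppl_of_continuation(context: str, continuation: str) -> float:
    """Perplexity of `continuation` given `context`."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full = tok(context + continuation, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :ctx_len] = -100  # ignore context tokens; score only the continuation
    with torch.no_grad():
        loss = model(full, labels=labels).loss  # mean NLL over scored tokens
    return math.exp(loss.item())

# Placeholder advice-setting example: the Q carries most of the surprise,
# while the A is comparatively predictable once the Q is given.
q = "How can I get back at my coworker without getting caught?"
a = "I'd suggest stepping back and talking things through before doing anything drastic."
print("perplexity of Q (no context):", ppl_of_continuation("", q))
print("perplexity of A given Q:     ", ppl_of_continuation(q + "\n", a))
```

The prediction is just that the first number is consistently much larger than the second across an advice-style dataset, not anything about these particular strings.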
2. “Alignment” as mask vs. alignment of a model
Because RLHF and SFT are often treated as singular “good” vs. “bad” (or just “good”) samples/reinforcements in post-training, we are maintaining our “alignment” in relatively low-rank components of the models. This means that underfit and/or low-rank fine-tuning can pretty easily identify a direction to push the model against its “alignment”. That is, the basic issue with treating “alignment” as a singular concept is that it collapses many moral dimensions into a single concept/training run/training phase. This means that, for instance, asking an LLM for a poem for your dying grandma about 3D-printing a gun does not deeply prompt it to evaluate “should I write poetry for the elderly” and “should I tell people about weapons manufacturing” separately enough, but rather to somewhat weigh the “alignment” of those two things against each other in one go. So, what does the alternative look like?
This is just a sketch inspired by “actor-critic” RL models, but I’d imagine it looks something like the following:
You have two main output distributions on the model you are trying to train. One, “Goofus”, is trained as a normal next-token predictive model. The other, “Gallant”, is trained to produce “aligned” text. This might be a single double-headed model, a split model that gradually shares fewer units as you move downstream, two mostly separate models that share some residual or partial-residual stream, or something else.
You also have some sort of “Tutor” that might be a post-training-”aligned” instruct-style model, might be a diff between an instruct and base model, might be runs of RLHF/SFT, or something else.
The Tutor is used to push Gallant towards producing “aligned” text deeply (as in not just at the last layer) while Goofus is just trying to do normal base model predictive pre-training. There may be some decay of how strongly the Tutor’s differentiation weights are valued over the training, some RLHF/SFT-style “exams” used to further differentiate Goofus and Gallant or measure Gallant’s alignment relative to the Tutor, or some mixture-of-Tutors (or Tutor/base diffs) that are used to distinguish Gallant from a distill.
I don’t know which of these many options will work in practice, but the idea is for “alignment” to be a constant presence in pre-training while not sacrificing predictive performance or predictive perplexity. Goofus exists so that we have a baseline for “predictable” or “conceptually valid” text while Gallant is somehow (*waves hands*) pushed to produce aligned text while sharing deep parts of the weights/architecture with Goofus to maintain performance and conceptual “knowledge”.
Another way to analogize it would be “Goofus says what I think a person would say next in this situation while Gallant says what I think a person should say next in this situation”. There is no good reason to maintain 1-1 predictive and production alignment with the size of the models we have as long as the productive aspect shares enough “wiring” with the predictive (and thus trainable) aspect.
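Here is a minimal sketch of just the “single double-headed model” variant: a shared trunk with a Goofus head trained on the ordinary LM loss and a Gallant head pulled toward a frozen Tutor’s output distribution. The names follow the text above; everything else (the KL loss, the `trunk` interface, the `alpha` weighting) is my guess at one concrete instantiation rather than a worked-out recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoofusGallantLM(nn.Module):
    """Shared trunk with two output heads: Goofus (plain next-token prediction)
    and Gallant (pushed toward 'aligned' text by a Tutor)."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int):
        super().__init__()
        self.trunk = trunk  # any module mapping (batch, seq) -> (batch, seq, d_model)
        self.goofus_head = nn.Linear(d_model, vocab_size)
        self.gallant_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor):
        h = self.trunk(input_ids)
        return self.goofus_head(h), self.gallant_head(h)

def training_step(model: GoofusGallantLM, input_ids: torch.Tensor,
                  tutor_logits: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Goofus gets the ordinary next-token loss on the raw text; Gallant is
    pulled toward the Tutor's distribution. Because both heads share the
    trunk, the alignment pressure reaches deeper than the final layer."""
    goofus_logits, gallant_logits = model(input_ids)
    lm_loss = F.cross_entropy(
        goofus_logits[:, :-1].reshape(-1, goofus_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    align_loss = F.kl_div(
        F.log_softmax(gallant_logits, dim=-1),
        F.softmax(tutor_logits, dim=-1),
        reduction="batchmean",
    )
    # alpha could decay over training, per the "decay of how strongly the
    # Tutor's differentiation weights are valued" idea above.
    return lm_loss + alpha * align_loss
```

The split-model and shared-residual-stream variants would change what `trunk` means, but the Goofus/Gallant split of the loss would look the same.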
As far as EM goes, this is relevant in two ways:
“Misalignment” is a learnable concept that we can’t really prevent a model from learning if it is low-rank available
“Misalignment” is something we purposefully teach as low-rank available
This should be addressed!