# Stephen McAleese

Karma: 263

Software engineer from Ireland who’s interested in EA and AI safety research.

• In this case, the percent error is 8.1% and the absolute error is 8%. If one student gets 91% on a test and another gets 99% they both get an A so the difference doesn’t seem large to me.

The article linked seems to be missing. Can you explain your point in more detail?

• It’s not obvious to me that creating super smart people would have a net positive effect because motivating them to decrease AI risk is itself an alignment problem. What if they instead decide to accelerate AI progress or do nothing at all?

• in order for us to hit that date things have to start getting weird now.

I don’t think this is necessary. Isn’t the point of exponential growth that a period of normalcy can be followed by rapid dramatic changes? Example: the area of lilypads doubles on a pond and only becomes noticeable in the last several doublings.
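To make the lily pad example concrete, here is a minimal sketch. It assumes a hypothetical pond whose covered area doubles daily and is fully covered on day 30; the 30-day figure and the function name `pond_coverage` are illustrative and not from the original post.

```python
def pond_coverage(day: int, full_day: int = 30) -> float:
    """Fraction of the pond covered on `day`.

    Assumes the covered area doubles every day and the pond is
    completely covered on `full_day` (a hypothetical figure).
    """
    # Each day before `full_day` halves the coverage; cap at 100%.
    return min(1.0, 2.0 ** (day - full_day))


if __name__ == "__main__":
    for day in (10, 20, 25, 28, 29, 30):
        print(f"day {day:2d}: {pond_coverage(day):.6%} covered")
```

Under these assumptions the pond is only about 3% covered five days before it is full, so most of the visible change is packed into the last few doublings.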

• Epic post. It reminds me of “AGI Ruin: A List of Lethalities” except it’s more focused on AI timelines rather than AI risk.

• At 86.4%, GPT-4’s accuracy is now approaching 100% but GPT-3’s accuracy, which was my prior, was only 43.9%. Obviously one would expect GPT-4’s accuracy to be higher than GPT-3’s since it wouldn’t make sense for OpenAI to release a worse model but it wasn’t clear ex-ante that GPT-4’s accuracy would be near 100%.

I predicted that GPT-4’s accuracy would fall short of 100% accuracy by 20.6% when the true value was 13.6%. Using this approach, the error would be

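Strictly speaking, the formula for percent error according to Wikipedia is the relative error expressed as a percentage:

$$\text{percent error} = \left| \frac{v - v_\text{approx}}{v} \right| \times 100\%$$

where $v$ is the true value and $v_\text{approx}$ is the predicted value.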

I think this is the correct formula to use because what I’m trying to measure is the deviation of the true value from the regression line (predicted value).

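Using the formula with a true accuracy of 86.4% and a predicted accuracy of 100% − 20.6% = 79.4%, the percent error is

$$\left| \frac{86.4 - 79.4}{86.4} \right| \times 100\% \approx 8.1\%$$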

I updated the post to use the term ‘percent error’ with a link to the Wikipedia page and a value of 8.1%.
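As a quick check of the arithmetic, here is a minimal sketch of the percent error calculation. It assumes the standard relative-error definition (absolute difference divided by the true value) and takes the predicted accuracy to be 100% − 20.6% = 79.4%, which follows from the figures quoted above; the function names are illustrative.

```python
def absolute_error(true_value: float, predicted: float) -> float:
    """Absolute difference between the true and predicted values."""
    return abs(true_value - predicted)


def percent_error(true_value: float, predicted: float) -> float:
    """Relative error expressed as a percentage of the true value."""
    return absolute_error(true_value, predicted) / abs(true_value) * 100.0


if __name__ == "__main__":
    true_accuracy = 86.4                 # GPT-4's reported MMLU accuracy (%)
    predicted_accuracy = 100.0 - 20.6    # prediction implied by a 20.6-point shortfall

    # Error measured on the accuracy itself: about 8.1%.
    print(f"{percent_error(true_accuracy, predicted_accuracy):.1f}%")

    # Same formula applied to the shortfall from 100% (13.6 true vs 20.6 predicted).
    # This is an assumed alternative framing, shown only to illustrate how much
    # the choice of baseline changes the figure.
    print(f"{percent_error(13.6, 20.6):.1f}%")
```

The same 7-point gap looks small relative to an accuracy of 86.4% and large relative to a shortfall of 13.6%, which is why it matters which quantity the error is measured against.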

• “Having thought about each of these milestones more carefully, and having already updated towards short timelines months ago”

You said that you updated and shortened your median timeline to 2047 and mode to 2035. But it seems to me that you need to shorten your timelines again.

“it seems very possible (>30%) that we are now in the crunch-time section of a short-timelines world, and that we have 3-7 years until Moore’s law and organizational prioritization put these systems at extremely dangerous levels of capability.”

It seems that the purpose of the bet was to test this hypothesis:

“we are offering to bet up to $1000 against the idea that we are in the ‘crunch-time section of a short-timelines’”

My understanding is that if AI progress occurred slowly and no more than one of the advancements listed were made by 2026-01-01 then this short timelines hypothesis would be proven false and could then be ignored.

However, the bet was conceded on 2023-03-16, which is much earlier than the deadline, and therefore the bet failed to prove the hypothesis false. It seems to me that the rational action is to now update toward believing that this short timelines hypothesis is true; 3-7 years from 2022 is 2025-2029, which is substantially earlier than 2047.

• Strong upvote. I think the methods used in this post are very promising for accurately forecasting TAI for the reasons explained below.

While writing GPT-4 Predictions I spent a lot of time playing around with the parametric scaling law L(N, D) from Hoffmann et al. 2022 (the Chinchilla paper). In the post, I showed that scaling laws can be used to calculate model losses and that these losses seem to correlate well with performance on the MMLU benchmark. My plan was to write a post extrapolating the progress further to TAI until I read this post which has already done that!

Scaling laws for language models seem to me like possibly the most effective option we have for forecasting TAI accurately for several reasons:

- It seems as though the closest ML models to TAI that currently exist are language models[1] and therefore predictive uncertainty should be lower for forecasting TAI from language models than from other types of less capable models.
- A lot of economically valuable work such as writing and programming involves text and therefore language models tend to excel at these kinds of tasks.
- The simple training objective of language models makes it easier to reason about their properties and capabilities. Also, despite their simple training objective, large language models demonstrate impressive levels of generalization and even reasoning (e.g. chain-of-thought prompting).
- Language model scaling laws are well-studied and highly accurate for predicting language model losses.
- There are many existing examples of language models and their capabilities. Previous capabilities can be used as a baseline for predicting future capabilities.

Overall my intuition is that language model scaling laws require far fewer assumptions and much less guesswork for forecasting TAI and therefore should allow narrower and more confident predictions, which your post seems to show (<10 OOM vs 20 OOM for the bio anchors method).

As I mentioned in this post there are limitations to using scaling laws such as the possibility of sudden emergent capabilities and the difficulty of predicting algorithmic advances.

1. ^ Exceptions include deep RL work by DeepMind such as AlphaTensor.

• I don’t agree with the first point:

“a score of 80% would not even indicate high competency at any given task”

Although the MMLU task is fairly straightforward given that there are only 4 options to choose from (25% accuracy for random choices) and experts typically score about 90%, getting 80% accuracy still seems quite difficult for a human given that average human raters only score about 35%. Also, GPT-3 only scores about 45% (GPT-3 fine-tuned still only scores 54%), and GPT-2 scores just 32% even when fine-tuned.

One of my recent posts has a nice chart showing different levels of MMLU performance.

Extract from the abstract of the paper (2021):

“To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average.”

# Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4

17 Mar 2023 18:34 UTC · 16 points · 5 comments · 4 min read · LW link

• Note that unlike GPT-2, GPT-3 does use some sparse attention. The GPT-3 paper says the model uses “alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer”.

• Wow, this is an incredible achievement given how AI safety is still a relatively small field. For example, this post by 80,000 Hours said that $10-50 million was spent globally on AI safety in 2020 according to The Precipice. Therefore this grant is roughly equivalent to an entire year of global AI safety funding!

• I think this is a really interesting post and seems like a promising and tractable way to accelerate alignment research. It reminds me of Neuralink but seems more feasible at present. I also like how the post emphasizes differentially accelerating alignment because I think one of the primary risks of any kind of augmentation is that it just globally accelerates progress and has no net positive impact.

One sentence I noticed that seemed like a misdefinition was how the concept of a genie was defined:

An antithetical example to this is something like a genie, where the human outsources all of their agency to an external system that is then empowered to go off and optimize the world.

To me, this sounds more like a ‘sovereign’ as defined in Superintelligence whereas a genie just executes a command before waiting for the next command. Though the difference doesn’t seem that big since both types of systems take action.

A key concept I thought was missing was Amdahl’s Law which is a formula that calculates the maximum theoretical speedup of a computation given the percentage of the computation that can be parallelized. The formula is $S = \frac{1}{1 - p}$, where $p$ is the fraction of the work that can be parallelized. I think it’s also relevant here: if 50% of work can be delegated to a model, the maximum speedup is a factor of 2 because then there will only be half as much work for the human to do. If 90% can be delegated, the maximum speedup is 10.
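The delegation numbers can be checked with a small sketch of Amdahl’s Law in its general form, $S = \frac{1}{(1 - p) + p / s}$, where $p$ is the delegable fraction and $s$ is how much faster the delegated part runs. The function name and the default of an infinitely fast model are assumptions made for illustration.

```python
import math


def amdahl_speedup(p: float, s: float = math.inf) -> float:
    """Overall speedup when a fraction `p` of the work is sped up by factor `s`.

    With the default s = infinity this gives the theoretical maximum 1 / (1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    remaining = (1.0 - p) + p / s  # fraction of the original time still needed
    return math.inf if remaining == 0 else 1.0 / remaining


if __name__ == "__main__":
    print(amdahl_speedup(0.5))       # half the work delegated: at most 2x
    print(amdahl_speedup(0.9))       # 90% delegated: at most about 10x
    print(amdahl_speedup(0.9, 10))   # 90% delegated to a model that is only 10x faster
```

The last call shows why the maximum is rarely reached in practice: if the delegated part is merely 10 times faster rather than instantaneous, the overall speedup is closer to 5 than to 10.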

Also, maybe it would be valuable to have more thinking focused on the human component of the system: ideas about productivity, cognitive enhancement, or alignment. Though I think these ideas are beyond the scope of the post.

• For purposes of this post, I am defining AGI as something that can (i) outperform average trained humans on 90% of tasks and (ii) will not routinely produce clearly false or incoherent answers.

Based on this definition, it seems like AGI almost or already exists. ChatGPT is arguably already an AGI because it can, for example, score 1000 on the SAT which is at the average human level.

I think a better definition would be a model that can outperform professionals at most tasks. For example, a model that’s better at writing than a New York Times human writer.

To be sure, I think the chance that AGI will be developed before January 1, 2029 is still low, on the order of 3% or so; but there is a pretty vast difference between small but measurable and “not going to happen”.

Even if one doesn’t believe ChatGPT is an AGI, it doesn’t seem like we need much additional progress to create a model that can outperform the average human at most tasks.

I personally think there is a ~50% chance of this level of AGI being achieved by 2030.

• I’ve seen some of the screenshots of Bing Chat. It seems impressive and possibly more capable than ChatGPT but I’m not sure. Here’s what Microsoft has said about Bing Chat:

“We’re excited to announce the new Bing is running on a new, next-generation OpenAI large language model that is more powerful than ChatGPT and customized specifically for search. It takes key learnings and advancements from ChatGPT and GPT-3.5 – and it is even faster, more accurate and more capable.”

If the model is more powerful than GPT-3.5 then maybe it’s GPT-4 but “more powerful” is too vague a phrase to draw any clear conclusions from. I don’t think I have enough information at this point to make strong claims about it so I think we’ll have to wait and see.

• I think the word ‘taunt’ anthropomorphizes Bing Chat a bit too much. According to Google, a taunt is “a remark made in order to anger, wound, or provoke someone”.

While I don’t think Bing Chat has the same anger and retributive instincts as humans, it could in theory simulate them given that it presumably contains angry messages in its training dataset and uses its chat history to predict and generate future messages.

• Thanks for the comment! I updated the paragraph to:

The GPT-3.5 models finished training and were released in 2022, and demonstrated better quality answers than GPT-3. In late 2022, OpenAI released ChatGPT which is based on GPT-3.5 and fine-tuned for conversation.

• Thanks for spotting this.

I noticed that I originally used the formula when it should really be because this is the way it’s written in the OpenAI paper Scaling Laws for Neural Language Models (2020). I updated the equation.

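The amount of compute used during training is proportional to the number of parameters and the amount of training data: $C \approx 6ND$, where $C$ is the training compute in FLOPs, $N$ is the number of parameters and $D$ is the number of training tokens.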

Where there is a conflict between this formula and the table, I think the table should be used because it’s based on empirical results whereas the formula is more like a rule of thumb.
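For reference, here is a minimal sketch of the rule of thumb, assuming the common approximation $C \approx 6ND$ from Scaling Laws for Neural Language Models (about 6 FLOPs per parameter per training token). The GPT-3 figures used (175 billion parameters, 300 billion training tokens) come from the GPT-3 paper; the function name is illustrative.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs using C ≈ 6 * N * D.

    This is a rule of thumb, not an exact accounting of the compute used.
    """
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    gpt3_flops = training_flops(n_params=175e9, n_tokens=300e9)
    print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs")  # about 3.15e23
```

The estimate lands close to the roughly 3.14e23 FLOPs reported for GPT-3, but small discrepancies like this are exactly why empirical table values seem preferable when the two disagree.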