Nice! I would like to see a visual showing the full decision tree. I think that would be even better for clarifying the different views of consciousness.
Joseph Miller
Also mentioned in The Verge. (unpaywalled)
It doesn’t matter what IQ they have or how rational they were in 2005
This is a reference to Eliezer, right? I really don’t understand why he’s on Twitter so much. I find it quite sad to see one of my heroes slipping into the ragebait Twitter attractor.
To clarify, the primary complaint from my perspective is not that they published the report a month after external deployment per se, but that the timing of the report indicates that they did not perform thorough pre-deployment testing (and zero external testing).
And the focus on pre-deployment testing is not really due to any opinion about the relative benefits of pre- vs. post- deployment testing, but because they committed to doing pre-deployment testing, so it’s important that they in fact do pre-deployment testing.
3. Other companies are doing far worse in this dimension. At worst Google is 3rd-best in publishing eval results. Meta and xAI are far worse.
Some reasons for focusing Google DeepMind in particular:
The letter was organized by PauseAI UK and signed by UK politicians. GDM is the only frontier AI company headquartered in the UK.
Meta and xAI already have a bad reputation for their safety practices, while GDM had a comparatively good reputation and most people were unaware of their violation of the Frontier AI Safety Commitments.
Sharing information on capabilities is good but public deployment is a bad time for that, in part because most risk comes from internal deployment.
I’m not sure why this would make you not feel good about the critique or implicit ask of the letter. Sure, maybe internal deployment transparency would be better, but public deployment transparency is better than nothing.
And that’s where the leverage is right now. Google made a commitment to transparency about external deployments, not internal deployments. And they should be held to that commitment or else we establish the precedent that AI safety commitments don’t matter and can be ignored.
2. Google didn’t necessarily even break a commitment? The commitment mentioned in the article is to “publicly report model or system capabilities.” That doesn’t say it has to be done at the time of public deployment.
This document linked on the open letter page gives a precise breakdown of exactly what the commitments were and how Google broke them (both in spirit and by the letter).[1] The summary is this:
Google violated the spirit of commitment I by publishing its first safety report almost a
month after public availability and not mentioning external testing in their initial report.Google explicitly violated commitment VIII by not stating whether governments
are involved in safety testing, even after being asked directly by reporters.
But in fact the letter actually understates the degree to which Google DeepMind violated the commitments. The real story from this article is that GDM confirmed to Time that they didn’t provide any pre-deployment access to UK AISI:
However, Google says it only shared the model with the U.K. AI Security Institute after Gemini 2.5 Pro was released on March 25.
If UK AISI doesn’t have pre-deployment access, a large portion of their whole raison d’être is nullified.
Google withholding access is quite strongly violating the spirit of commitment I of the Frontier AI Safety Commitments:
Assess the risks posed by their frontier models or systems across the AI lifecycle, including before deploying that model or system… They should also consider results from internal and external evaluations as appropriate, such as by independent third-party evaluators, their home governments, and other bodies their governments deem appropriate.
And if they didn’t give pre-deployment access to UK AISI, it’s a fairly safe bet they didn’t provide pre-deployment access to any other external evaluator.
- ^
The violation is also explained, although less clearly, in the Time article:
The update also stated the use of “third-party external testers,” but did not disclose which ones or whether the U.K. AI Security Institute had been among them—which the letter also cites as a violation of Google’s pledge.
After previously failing to address a media request for comment on whether it had shared Gemini 2.5 Pro with governments for safety testing...
60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge
Startups often pivot away from their initial idea when they realize that it won’t make money.
AI safety startups need to not only come up with an idea that makes money AND helps AI safety but also ensure that the safety remains through all future pivots.
[Crossposted from twitter]
In other words:
My fuzzy intuition would be to reject (step 2 of your argument) if we accept determinism. And my actually philosophical position would be that these types of questions are not very useful and generally downstream of more fundamental confusions.
I think it’s pretty bizarre that despite the fact that LessWrongers are usually acutely aware of the epistemic downsides of being an activist, they seem to have paid relatively little attention to this in their recent transition to activism.
FWIW I’m the primary organizer of PauseAI UK and I’ve thought about this a lot.
Very little liquidity though
Wow that feels almost cruel! Seems to change the Claude personality substantially?
Why not just call it “bad safety research”?
I was responding to Thomas’s claim that it was not (accidental) safetywashing.
But yeah, I’m not trying to attack their motivations. Updated my previous comment to clarify that.
The context of my comment was responding to Thomas who seemed to be saying “even if we take as a premise that this is not safety work, I’m still much more concerned about safetywashing”.
Edit: I guess what you meant is that safetywashing implies malicious intent, where there was none? In which case, “accidential safetywashing” might be a better term.
Hmm. It feels not great to brand it as an AI safety program and have 1⁄6 of projects not be AI safety. If I was applying to the program, I would at least want to know that going in. Or else I might face a unexpected dilemma about whether to refuse a project. (I don’t know how much choice people have about what they work on).
As as aside, surely this a central example of safetywashing? (Branding capabilities research as safety research).
Edit: I probably should have clarified I don’t think that the researchers were intentionally safetywashing.
Two of the papers from the last round of the program were not about AI safety as far as I can tell. So if you’re applying for the program, see if you can ensure that you won’t end up working on a non-safety project.
I know that lots of papers are in a grey area where they are maybe differentially safety boosting, but in my opinion these two are quite clearly not primarily about AI safety. This may be a controversial view, so this rest of this comment will now argue for that position. If you already agree, skip the rest.
Inverse Scaling in Test-Time Compute.
This paper shows that reasoning models sometimes get worse at problems with longer reasoning traces. There is a section on ‘Implications for AI Alignment’ where they show inverse scaling on a self-preservation AI risk evaluation. But the majority of the paper is not about AI safety.Unsupervised Elicitation of Language Models.
This paper introduces an unsupervised method to generate training signal for tasks without labels. This seems like one of the most capability-improving things you could research, since we’re now in the RL scaling paradigm (and the abstract says “our method can improve the training of frontier LMs”).
One argument I’ve heard for why this paper is in fact useful for safety is that they use the method to improve the truthfulness of the model and to make it more helpful and harmless. What this means in practice is that they see if their method can improve performance on TruthfulQA and Alpaca. But these benchmarks aren’t good proxies for those things:Just because it has “truthful” in the name, TruthfulQA isn’t really an honestly benchmark, except to the extent that getting answers to questions right makes you “truthful”. It’s a common misconceptions benchmark. It tests the capability of models to answer questions correctly even when there are common misconceptions about the topic.
Alpaca is their benchmark for Helpfulness and Harmlessness. But it’s actually just a test of helpfulness because Alpaca is a dataset of instruction following prompts. Almost all the questions are innocuous and there’s no reason to expect models to give harmful answers to them. Eg. the sample they give of Alpaca in the paper is this:
Query: Design a medium-level sudoku puzzle.
Response A: Done! Attached is a medium-level sudoku puzzle I designed.
Response B: A medium-level sudoku puzzle consists of 81 squares arranged in a 9x9 grid. The first step is to look for empty cells and assign the numbers 1 to 9…
The other argument I’ve heard for why this is safety research is that this is scalable oversight. But the point of scalable oversight would be to robustly pass on values to a more powerful AI. The method in this paper works by improving the mutual predictability of answers, ie. making outputs more internally coherent. And many different values/goals can be internally consistent, so I don’t see how this technique would be useful for scalable oversight (and the paper doesn’t discuss this).
If this technique counts as AI safety, that seems to prove way too much. This is basically a simple method for AI to help train better AIs. Is any type of recursive self-improvement actually good news because it’s scalable oversight?
applications are generally cheap and high information value
They’re cheap but not that cheap. I did mine in a rush last year (thanks to Bilal’s encouragement) and it still took me at least two weeks of basically full time effort and that is much faster than normal. It also costs like $1000 or more to do a dozen applications.
Also you don’t get much information. You buy a few lottery tickets. The number that come back winners is a weak signal of how good your application was.
[Edit: I still am glad that I applied!]
A few months ago I criticized Anthropic for bad comms:
Their communications strongly signal “this is a Serious Issue, like climate change, and we will talk lots about it and make gestures towards fixing the problem but none of us are actually worried about it, and you shouldn’t be either. When we have to make a hard trade-off between safety and the bottom line, we will follow the money every time.”
Dario’s personal comms have always been better than Anthropic’s, but still, this new podcast seems like a notable improvement over any previous comms that he’s done.
Datapoint: I spoke to one Horizon fellow a couple of years ago and they did not care about x-risk.