AI #35: Responsible Scaling Policies

Link post

There is much talk about so-called Responsible Scaling Policies, as in what we will do so that what we are doing can be considered responsible. Would that also result in actually responsible scaling? It would help. By themselves, in their current versions, no. The good scenario is that these policies are good starts and lay groundwork and momentum to get where we need to go. The bad scenario is that this becomes safetywashing, used as a justification for rapid and dangerous scaling of frontier models, a label that avoids any actual action or responsibility.

Others think it would be better if we flat out stopped. So they say so. And they protest. And they point out that the public is mostly with them, at the same time that those trying to play as Very Serious People say such talk is irresponsible.

Future persuasion will be better. Sam Altman predicts superhuman persuasion ability prior to superhuman general intelligence. What would that mean? People think they would not be fooled by such tactics. Obviously, they are mistaken.

As usual, lots of other stuff as well.

On the not-explicitly-about-AI front, I would encourage you if you haven’t yet to check out my review of Going Infinite. Many are calling it both fascinating and a rip-roaring good time, some even calling it my best work. I hope you enjoy reading it as much as I enjoyed writing it.

Table of Contents

  1. Introduction.

  2. Table of Contents.

  3. Language Models Offer Mundane Utility. North Korean cyberattacks. Great.

  4. Language Models Don’t Offer Mundane Utility. The sorcerer’s ASCII cats.

  5. GPT-4 Real This Time. Is that Barack Obama? You know. For kids.

  6. A Proposed Bet. Will hallucinations be solved within months? Market says 17%.

  7. Fun With Image Generation. I bet she doesn’t drink coffee, she’s too chill.

  8. Deepfaketown and Botpocalypse Soon. Slovakian election deepfake. Oh no?

  9. They Took Our Jobs. Jon Stewart figures out what The Problem is.

  10. Get Involved. Jobs at NYU and LTFF. Panels at the conference.

  11. Introducing. Eureka for superhuman robot dexterity. This is fine.

  12. In Other AI News. Baidu says Ernie 4.0 is as good as GPT-4, won’t allow access.

  13. Quiet Speculations. Will we have your attention?

  14. The Quest for Sane Regulation. What do we want? Why aren’t you mentioning it?

  15. The Week in Audio. PM Sunak, Ian Morris, a Ted conference.

  16. Rhetorical Innovation. Stop it. We protest.

  17. Friendship is Optimal. Even when you disagree. Even when you are wrong.

  18. Honesty as the Best Policy. Calling honest statements irresponsible is a take.

  19. Aligning a Smarter Than Human Intelligence is Difficult. John Wentworth.

  20. Aligning a Dumber Than Human Intelligence Is Also Difficult. Sycophantic it is.

  21. Humans Do Not Expect to Be Persuaded by Superhuman Persuasion. Not me!

  22. DeepMind’s Evaluation Paper. Neglecting to ask the most important questions.

  23. Bengio Offers Letter and Proposes a Synthesis. Common sense paths forward.

  24. Matt Yglesias Responds To Marc Andreessen’s Manifesto. He nails it.

  25. People Are Worried About AI Killing Everyone. Do your homework.

  26. Someone Is Worried AI Alignment Is Going Too Fast. Yeah, I’m confused as well.

  27. Please Speak Directly Into This Microphone. Calls for an AI rights movement.

  28. The Lighter Side. Life is good.

Language Models Offer Mundane Utility

North Korea experimenting with AI to accelerate cyberattacks. I mean, of course they are, why wouldn’t they, this is how such things work. Consider when open sourcing.

Language Models Don’t Offer Mundane Utility

The internet is made of cats.

AI Techno Pagan: This is going to sound weird but Bing is generating cats for me non-stop even though I’m not specifically asking for them, even after changing devices. ASCII art yes, cats no. It was doing a variety of ASCII art on theme until the cats started — now it’s just cats. Any insights?


Cat, it never die. Feline not my favorite guy.

No, I do not know what is going on here, but Yitz duplicated it? So definitely a cat attractor.

GPT-4 Real This Time

Could you have GPT-4 ask questions to the user rather than force the user to do prompt engineering? An experiment says this gave superior results. My read is that they tested this on easy cases where we already knew it would work, so we did not learn much.

GPT-4V does not want to talk about people in a picture.

Liminal Warmth: So a notable thing about GPT-4V is that it absolutely will not answer any questions about a human person in any picture I like to imagine that somewhere in OpenAI it’s someone’s job to just sit and beg and plead with it to stop being racist so that they can roll that out.

Persuasion is possible in some cases.


I do understand the choice made here. Given fake images can be generated, and there are many true things that could get GPT-4V in trouble, what else could they do?

Two new papers (1,2) call into question GPT-4’s ability to improve results via self-critique. As usual, this is some evidence and I believe gains in this area are limited for now, but it can also mean they were not implementing self-critique ideally.

A Proposed Bet

I would bet you any sum of money you can get the hallucinations right down into the line of human-expert rate within months.”

— Reid Hoffman, CEO of LinkedIn, Co-Founder of Inflection.AI, September 2023

Gary Marcus: Reid, if you are listening, Gary is in for your bet, for $100,000.

I put it up on Manifold. Subject to specifying the terms, I would be on Marcus’ side of this for size, as is Michael Vassar, as is Eliezer Yudkowsky.

General note: More of you should be trading in the Manifold markets I create. It is very good epistemics, it is fun, and it is positive-sum.

Fun with Image Generation

MidJourney has a website. Report is it is really fast, everything is upscaled. The only problem is that you can’t actually use the website to make images. Also $60/​month to not have my images be public, but still have them be subject to community guideline restrictions, seems rather steep when DALLE-3 is right there.

If you tell Google search to draw something, it will draw something. Low quality, and refusal rate seems not low, but at least it is super fast.

A thread with a curious result, and varied replication attempts. Starts like this.

Other results vary, I encourage clicking through.

David Andress: I love the way “AI” has literally no concept of material objects & their existence in space. The way this woman is both standing on the board and holding it with her hand…


All right, sure, the horse’s method of talking is physically impossible. I don’t think you should update towards skepticism here. Personally, I prefer something with much longer arms.

Deepfaketown and Botpocalypse Soon

Claim that Facebook’s new AI genuine people personalities are not good, profane the good name and hair color of Jane Austen, and cannot even keep spam off of their profiles, with a side of ‘the Tom Brady bot says Colin Kaepernick aint good enough for the NFL’ which is a terrible simulation of Tom Brady, he would never say that out loud, he has media training.

We did spot an AI fake picture in the wild claiming to be from Gaza. In keeping with the theme, the AI did not send its best work. All reports say the information environment is terrible, and the actual situation is even more terrible, but AI is not contributing to any of that.

It seems one way they are doing that is that Bard literally won’t answer a search query if it contains the tokens ‘Israel’ or ‘Gaza.’ While that has its own downsides, I applaud this as approaching the correct level of security mindset.

Slovakia was perhaps the first warning shot? Did the deepfakes make the difference? Or were they one more last-minute dirty trick that made almost no difference?

Pekka Kallioniemi: Slovakia’s recent election was the first election to be affected by AI-generated deep fakes. Just two days before the voting, a fake audio recording was published in which two opposition figures were discussing on how to rig the elections. Fact-checking organizations declared the recording as fake, but all this was published too late as the vote had already happened.

This fake audio might have been the key factor to the victory of Robert Fico’s SMER – a pro-Kremlin party that wants to end all aid to Ukraine.

To be fair, many Slovaks have stated that the fakes were of poor quality and had little effect to the outcome of the election. But we should still expect this type of tactics in future elections.

Martin Nagy: Deepfakes had zero impact on election results in Slovakia – for their poor quality amateurism, just for laughs. What changed the results was a decade of highest (per capita) Kremlin investments into disinformation network of 100’s [1000’s] of accounts on asocial networks & “alternative” media.

This rhymes a lot with America in 2016. You do not need AI to do a poor quality deepfake. We have been creating poor quality fakes forever. The poor quality reflects that it is responding to demand for poor quality fakes, rather than to demand for high quality fakes.

As always, the better fakes are coming. This confirms they are not here yet.

They Took Our Jobs

They took Jon Stewart’s job. It seems Jon Stewart wanted to talk on Apple TV about topics related to China and AI, Apple ‘approached Jon & said that his views needed to align with the company to move forward,’ and so to his great credit Jon Stewart quit. I am guessing it was more China than AI, but we do not know. I am curious what take Jon Stewart had in mind about AI.

Bloomberg Headline: Will AI Take My Job? Bosses Shouldn’t Only Consider Bottom Line, Says Expert?

Your Boss to Bloomberg: Your expert does not run a business.

Bloomberg: What good questions are CEOs asking you related to AI?

Expert Amy Webb: They’re asking what it takes for us to be resilient versus how soon can we lay people off.

Yes, first things first. Then the other things.

As the actors strike, Meta pays them $150/​hour to provide training data on general human expression, alongside a promise not to generate their specific face or voice. MIT Technology Review wants you to know And That’s Terrible. It is exploitative, you see. How dare they collect data and then use that data to create generic products while paying above-market rates for participation.

Adam Ozimek: Re: what human jobs will AI take, it would’ve been logical to presume before recorded music that its invention would bring the end of live performances. Why see a musician live when you can listen at home? There is more going on with what we consume than meets the eye sometimes

Derek Thompson: I love this point. A lot of folks try to predict technology as if new ideas only create clean upward trends. Nope. New ideas create reactions, ricochets, backlashes. Concerts and vinyl boomed during the music streaming revolution.

Definitely. This kind of thing is one reason why I am a short-term jobs optimist. Which I expect to last until it suddenly very much doesn’t.

Get Involved

Near-term AI safety research position working at NYU is available.

Long Term Future Fund is still looking for a chair and has extended their deadline.

Call for panels at the international conference in Brussels, entitled ‘To Govern Or To Be Governed, That is the Question.’ Their list of ‘important topics’ is the most EU thing I have ever seen in my life. Sometimes I wonder if the EU is already ruled by a misaligned AI of sorts. It would be good to try and get some existential risk discussion into the conference, despite them not asking about it at all.


Eureka, an open-ended open-source agent that designs evolving reward functions for robot dexterity at super-human level, which are then used to train in 1000-times sped up physics simulations.

Even if you think getting an AI to have an Intelligence score of over 18 is going to somehow be uniquely hard, that you think the narrow human range is instead somehow important, deep and wide, presumably you see Dexterity shares none of these properties, and we will rapidly go from robotic to child to world class athlete to whoosh.

I look down at my ‘are you scared yet?’ card and I can safely say: Bingo, sir.

Jim Fan: Can GPT-4 teach a robot hand to do pen spinning tricks better than you do? I’m excited to announce Eureka, an open-ended agent that designs reward functions for robot dexterity at super-human level. It’s like Voyager in the space of a physics simulator API!

Eureka bridges the gap between high-level reasoning (coding) and low-level motor control. It is a “hybrid-gradient architecture”: a black box, inference-only LLM instructs a white box, learnable neural network. The outer loop runs GPT-4 to refine the reward function (gradient-free), while the inner loop runs reinforcement learning to train a robot controller (gradient-based).

We are able to scale up Eureka thanks to IsaacGym, a GPU-accelerated physics simulator that speeds up reality by 1000x. On a benchmark suite of 29 tasks across 10 robots, Eureka rewards outperform expert human-written ones on 83% of the tasks by 52% improvement margin on average.

We are surprised that Eureka is able to learn pen spinning tricks, which are very difficult even for CGI artists to animate frame by frame! Eureka also enables a new form of in-context RLHF, which is able to incorporate a human operator’s feedback in natural language to steer and align the reward functions. It can serve as a powerful co-pilot for robot engineers to design sophisticated motor behaviors.

As usual, we open-source everything (code, paper).

Eureka achieves human-level reward design by evolving reward functions in-context. There are 3 key components:

1. Simulator environment code as context jumpstarts the initial “seed” reward function.

2. Massively parallel RL on GPUs enables rapid evaluation of lots of reward candidates.

3. Reward reflection produces targeted reward mutations in-context.

First, by using the raw IsaacGym environment code as context, Eureka can already generate usable reward programs without any task-specific prompt engineering. This allows Eureka to be an open-ended, generalist reward designer with minimal hacking.

Second, Eureka generates many reward candidates at each evolution step, which are then evaluated using a full RL training loop. Normally, this is extremely slow and could take days or even weeks.

We are only able to scale this up thanks to NVIDIA’s GPU-native robot training platform, IsaacGym, which boosts simulation by 1000x compared to real time. Now an inner RL loop can be done in minutes!

Finally, Eureka relies on reward reflection, which is an automated textual summary of the RL training. This enables Eureka to perform targeted reward mutation, thanks to GPT-4’s remarkable ability for in-context code fixing.

Eureka is a super human reward engineer. The agent outperforms expert human engineers on 83% of our benchmark with an average improvement of 52%.

In particular, Eureka realizes much greater gains on tasks that require sophisticated, high-dimensional motor control.

To our surprise, the harder the task is, the less correlated are Eureka rewards to human rewards. In a few cases, Eureka rewards are even negatively correlated with human ones while delivering significantly better controllers.

This gives me Deja Vu back to 2016, where AlphaGo made brilliant moves that no human Go players would make. AI will discover very effective strategies that look alien to us!

Can Eureka adapt from human feedback?

So far, Eureka operates fully automatically with environment feedback. To capture nuanced human preferences, Eureka can also use feedback in natural language to co-pilot reward designs. This leads to a novel in-context gradient-free form of reinforcement learning from human feedback (RLHF).

Here, Eureka with human feedback is able to teach a Humanoid how to run stably in only 5 queries!

As the tasks get harder, Eureka’s reward models look less and less like human ones. An AI is designing alien-looking (inscrutable?) automatically-evolving-in-real-time reward functions. The AI is a ‘super human reward engineer.’ Then we are going to optimize dexterous robots based on that, with an orders of magnitude speed-up.

In Other AI News

Baidu claims their Ernie 4.0 is on par with GPT-4, video here. Plans are in the future to put this into various Baidu products. For now, Ernie is unavailable to the public and Bert rudely refuses to comment. I put up a market on whether their story will check out. Reuters reports investors are skeptical.

OpenAI has reportedly scrapped a project called Arrakis (presumably Roon named it, spice must flow) that aimed for efficiency gains, which did not materialize.

Apple shares plan to incorporate AI into everything including Siri, music playlist generation, message autocompletion, xcode suggestions and so on, yeah, yeah.

OpenAI names Chris Meserole as Executive Director of the Frontier Model Forum, and helps create a $10 million AI safety fund to promote the safety field.

The initial funding commitment for the AI Safety Fund comes from Anthropic, Google, Microsoft, and OpenAI, and the generosity of our philanthropic partners, the Patrick J. McGovern Foundation, the David and Lucile Packard Foundation, Eric Schmidt, and Jaan Tallinn. Together this amounts to over $10 million in initial funding. We are expecting additional contributions from other partners.

Yeah, that’s not the size I was looking for either, especially with a lot of the tab being picked up elsewhere? But it is a start.


The better question is whether the money will be allocated wisely.

Quiet Speculations

Note: I do not know what possible scenario #1 or #2 was.

Eliezer Yudkowsky: Predicting trajectories is harder than predicting endpoints, but, possible scenario #3: we may be in the last year or two of the era where it’s easy for a human to get another human’s attention – where you are not a biological legacy media company competing with 1000 AI tweeters.

Michael Vasssar: I’d take the other side of that bet.

Steve Sokolowski: It has NEVER been easy to get another human’s attention. I’ve never figured out a way to break any of my businesses through the noise, despite having a technically superior solution. It’s not an AI problem.

I put up this market on whether AI will write posts that generate 75%+ of social media views by EOY 2026.

Eliezer Yudkowsky: Possible scenario #4 for the 1-3 year AI future: How does an IQ 110 conversationalist, working for 10 cents an hour, with zero ethics, extract as much value as possible out of an IQ 80 human? “Pretend to be their girl/​boyfriend” is one possibility, but surely there are others.

Also, can an IQ 120 conversationalist, who has to talk like a corporate drone for brand protection reasons, and isn’t allowed to do anything interesting for brand protection reasons, somehow protect the IQ 80 human? That is, does OpenAI defense somehow win against Meta offense?

I don’t have specific possibilities in mind for scenarios 1&2, just wanted a way to emphasize that this is not my Prediction Of THE Future.

Jacob: it’s going to be like current scams where you get a WhatsApp message from a rich looking Asian woman you don’t know. and since those are currently executed by trafficked Cambodian slaves for Chinese organized crime, maybe having AI do it would be a net global welfare improvement

A fun and also important scissor statement?

Whatever else the summit accomplishes, I agree with both Eliezer and Andrew Critch both that AI outputs should be labeled, and that whether we can get consensus on this will be an important data point.

Andrew Critch: Reposting for emphasis, because on this point Eliezer is full-on correct: AI output should always be labelled as AI output. If the UK summit fails to produce support for a rule like this, I will resume my levels of pessimism from before the CAIS Statement and Senate hearings. A failure like that — for a country that just decided to declare itself a leader in AI innovation and regulation — will be a clear sign that the government regulation ball is being dropped and/​or regulatory capture is underway.

But a success on requiring AI output labels will — to me — mean we are actually making some progress on preventing dystopic AI outcomes.

This is still a low bar. It is easy to get to ‘label AI outputs’ without ‘getting it’ in any real sense. If this type of action is all we get, with no sign of understanding, it would still be a negative update. But not getting at least this far would be miles worse.

The Quest for Sane Regulations

What do people want regulation to accomplish?

Daniel Eth: New AIPI poll asks Americans to rank AI policies head-to-head. Top 4 ranked are all relevant to X-risk (eg, mandate pre-release audits, prevent human extinction). Meanwhile, things politicians focus on are all at the bottom (some ethical concerns & competition w/​ China).


People reasonably think that catastrophic outcomes are important to prevent, including human extinction. Most of them presumably view that as one of the catastrophic outcomes they would like to prevent. Mandatory audits and liability both did well and are both very good ideas.

Whereas what did poorly? Equity and bias concerns and racing to beat China. As usual, the public is doing public opinion things across the board, like caring about child pornography in particular as much as bad actors in general.

The important thing to know is that the public broadly gets it. AI dangerous, might cause catastrophe, we should prevent that. Less concerned than many elites about exactly who gets the poisoned banana. Not even that concerned about job loss.

There keep being polls. We keep getting similar answers. This is from YouGov.


YouGov: 60% would support a global treaty in which all countries would agree to ban the development of AI which is more intelligent than humans – with as few as 12% supporting a model of development led by big tech companies regulating themselves, as is currently the case.

It is common in modern politics for the majority of the public to support a position, and for that position to be considered ‘outside the Overton Window’ and unwelcome in polite company. Sometimes even having that opinion, if discovered, makes you unwelcome in polite company. We should strive not to let this happen with AI, where the public is now out in front, in many ways, of many advocating for safety measures.

Music publishers sue Anthropic for training Clade on, and Claude outputting on demand, 500+ song lyrics including Katy Perry’s Roar. Claude declined to do this when I asked it, so its taste is improving.

Statechery confirms that the changes to the export ban means it will now have real teeth, for more so than the older version that allowed things like H800s. Over time, if China cannot home grow its own competitive chips, which so far it hasn’t been able to do, the gap will expand. Thompson is focused on military AI and consumer applications here, rather than thinking about foundation models, a sign that even the most insightful analysts have not caught up to the most important game in town.

DeepMind CEO and founder Demis Hassabis tells The Guardian that we need to take risks from (future more capable) AI as seriously as we take climate change. He requests a CERN-style agency, or an IAEA parallel, or something similar to the IPCC. I agree with Daniel Eth that it would be best to prioritize something shaped like the IAEA, then CERN is second priority. Also worth noting that Demis Hassabis is claiming long timelines, and that yesterday he retweeted this old paper about ethical and social risks from LLMs.

It is interesting to consider the parallel. If we were willing to make the kinds of real economic sacrifices we make to mitigate climate change, to put it that centrally in our politics and values and discourse, then that sounds great. If we were then as unwilling as we with climate change to choose actions that actually help solve the problem, whelp, we’re all going to die anyway.

Partnership for AI releases proposed guidance for those creating new models. There are good things here. For example, their response to a proposal to give open access to a new frontier model is ‘don’t.’ For controlled access, they provide some reasonable suggestions if you are worried about mundane harm, but that are clearly not proposing acting with a security mindset, or proposing anything that would stop us all from dying if we were about to all die.

Note their risk graph.

The Y-axis is literally known versus speculative risks. Yet existential risks and loss of human control and other such dangers are nowhere present. If you don’t think such risks are a thing, you definitely won’t catch them.

You can submit feedback to them here.

A CSET Georgetown and DeepMind workshop helpfully recommends that you skate where the puck is going.

Luke Muehlhauser: AI capabilities are improving rapidly, and policymakers need to “skate to where the puck is going,” not to where it is today. Some policy options from a recent @CSETGeorgetown+@GoogleDeepMind workshop:


As usual, good start, clearly insufficient. DeepMind reliably seems willing to advocate taking incrementally more responsible actions, but also keeps not taking the existential threat seriously in its rhetoric, and treating that threat as if at a minimum it is far.

The Week in Audio

There was a Ted AI conference. Which is a time-I-will-never-get-back-bomb that I presume will eventually go off. For now, no video.

UK PM Rishi Sunak’s speech on AI. Good speech. Many good points.

Ian Morris on deep history and the coming intelligence explosion on 80000 Hours. Morris takes the attitude that of course total transformation is coming, that is what history clearly indicates, no technical debates required. And yes, of course, if we build smarter things than we are they will take over, why would you think otherwise?

Rhetorical Innovation

Nate Soars of MIRI shouts it from the rooftops. I’m duplicating the whole thread because things like this should not only live on Twitter, but skip most of it if it’s not relevant to your interests. The basic idea is, if you think you are doing something akin to ‘playing Russian roulette with Earth’ then you should stop and not do that. I bolded the most important bits.

Nate Sores: My current stance on AI is: Fucking stop. Find some other route to the glorious transhuman future. There’s debate within the AI alignment community re whether the chance of AI killing literally everyone is more like 20% or 95%, but 20% means worse odds than Russian roulette.

(I’m on the “more like 95%” side myself, but this thread is gonna be about how I recommend non-experts respond to the situation, and I think “> 1/​6” is both more obvious and is sufficient for my purposes here.)

Picture putting a planet-sized revolver up against Earth, with one round chambered.

That’s akin to what companies (or gov’ts!) are doing when they build towards superintelligent AI at our current level of understanding. More than a 1 in 6 chance that literally everybody dies.

(When I point this out in person, a surprising number of people respond to this point by saying: well, with Russian roulette, we know that the probability is exactly 16, whereas with AI we have no idea what our odds are.

But if someone finds a revolver lying around, spins the barrel, and points it at your kid, then your reaction shouldn’t be “no worries, we can’t assign an exact probability because we don’t know how many rounds are chambered”. Refusing to act because the odds are unclear is crazy.)

Continuing: these people who are playing Russian roulette with the planet have no credible offer that’s worth enough that they should be putting all our lives at such grave risk.

The possible benefits from AI are great, but the benefits are significantly greater if we wait until we don’t have double-digit percent chances of killing literally everyone.

Civilization should say to these people: no, sorry, the (probabilistic) costs you’re imposing on us are too large, we will not permit you to endanger everyone like this, rather than waiting and attaining those benefits later, once we know what we’re doing.

(If you’re worried about *you personally* losing access to the future because you’ll die of old age first, sign up for cryonics, and help improve cryonics technology. I, too, want everyone currently alive to make it to the future!)

“Fucking stop” is a notably stronger response than “well, I guess we should require all the labs to have licenses” or “well, I guess we should require all the labs to run evals like those linked, so that notice early signs of danger.”

“Have licenses” and “run evals” are fine suggestions, they’re helpful, but they’re not how a sane planet responds to this level of horrific threat. The sane response is to shut it down entirely, and find some other route.

Picking on evals (where tests for rudimentary abilities like “install GPT-J on a new server” are regularly run against models as they train and/​or before they’re deployed):

A first issue is that evals fundamentally admit lots of false negatives, e.g. if a model that fails the evals can be used as part of a larger agent with significantly increased capabilities.

A second issue is: what are you supposed to do when the evals say “this AI is dangerous”?

The world is full of people who will say they’ve “taken appropriate safety measures” or otherwise give a useless superficial response before plowing on ahead without addressing the deeper underlying issues.

Others in the field disagree, and think there’s a medium-sized or even high chance that, so long as the AI engineers detect the easily detectible issues, then things will turn out OK and we and our loved ones will survive.

I’m deeply skeptical; if they were the sort to notice early hints of later issues and react appropriately then I expect they’d be reacting differently to the hints we already have today (present-day jailbreaks, shallow instances of deception, etc.).

But more generally, civilization at large should not be accepting this state of affairs. Maybe you can’t tell who’s right, but you should be able to tell that this isn’t what a mature and healthy field sounds like, and that it shouldn’t get to endanger you like this.

“Don’t worry, we’ll watch for signs of danger and then do something unspecified if we see them” is the sort of reassurance labs give when they’re trying to cement a status quo in which they get to plow ahead and endanger us all.

This isn’t what it sounds like when labs try to seriously grapple with the fact that they’re flirting with a >1/​6 chance of killing literally everyone, and giving that issue the gravity it deserves.

This isn’t to say that evals are bad. (Nor that licenses are bad, etc.) But there’s a missing mood here.

If the labs were coming right out and saying: “Yes, we’re endangering all your lives, with >1/​6 probability, but we believe that’s OK because the benefits are sufficiently great /​ we believe we have to because otherwise people that you like even less will kill everybody first, and as part of carrying that responsibility, we’re testing our AIs in the following ways, and if these tests come back positive we’ll do [some specific thing that’s not superficial and that’s hard to game]”, if they were coming right out and saying that bluntly, then… well, it wouldn’t make things better, but at least it’d be honest. At least we could have a discussion about whether they’re correct to think that the benefits are worth the risks, vs whether they should wait.

In lieu of much more serious ownership and responsibility from labs, Earth just shouldn’t be buying what they’re saying. The sounds you’re hearing are the noncommittal sounds of labs that just want to keep scaling and see what happens. Civilization shouldn’t allow it.

A source for the claim that alignment researchers (including ones at top labs) tend to have double-digit probabilities in AI progress causing an existential catastrophe.

To be clear: @ARC_Evals is asking for more than just “run evals”, IIUC it’s asking for something more like “name the capabilities that would cause you to pause, and the protective measures you’d take”, which is somewhat better.

(but: all the “protective measures” I’ve heard suggested sound like either de-facto indefinite bans, or are so vague as to be useless, and in the former case we should just stop now and in the latter case the “protective measures” won’t help, so)

Also: not all labs are exactly the same on this count. Anthropic has at least committed to make AIs that can produce bioweapons only once they can prevent them from being stolen. …which is a far cry from owning how they’re gambling with our lives, but it’s better than “yolo”.

And: I appreciate that at least some leaders at at least some labs are acknowledging that this tech gambles with all our lives (despite failing to take personal responsibility, and despite saying one thing on tech podcasts and another to congress, and …).

…with the point of these caveats being (as you might be able to deduce from my general tone here) that some folks are legitimately doing better than others and it’s worth acknowledging that, while also thinking that everyone’s very far short of adequate, AFAICT

So, what now? Governments who have been alerted to the risks need to actually respond, and quickly. A research direction that poses such an insane level of risk needs to be halted immediately, and we need to find some saner way through these woods to a wonderful future.

Cate Hall narrows in on the labs thinking the risk is there but not saying it outright, and not making the case that it is worth the risk.

Cate Hall: This. I’m tired of hearing ppl bend over backward to speculate about the good, secret motives of ppl taking enormous risks. I want those leaders to sit down & explain why they think it’s necessary to continue & to answer questions about their reasoning. Is that so much to ask?

[some reasonable discussions below]

Daniel Eth mostly agrees with the thread, but is more optimistic on licenses and wonders how Nate’s endgame is supposed to play out in practice.

Nate then followed up with a thread on Responsible Scaling Policies.

Nate Sores: My take on RSPs: it is *both* true that labs committing to any plausible reason why they might stop scaling is object-level directionally better than committing to nothing at all, *and* true that RSPs could have the negative effect of relieving regulatory pressure.

Anthropic is doing object-level *better* by saying that they’ll build AIs that can build bioweapons only *after* they have enough security to prevent theft by anyone besides state actors. It’s not much but it’s *better* than the labs committed to open-sourcing bioweapon factories

Simultaneously: let’s prevent RSPs from relieving regulatory pressure. As others have noted, “responsible scaling policy” is a presumptive name; let’s push back against the meme that RSPs imply responsibleness, and point out that trusting these labs to self-regulate here is crazy.

In that vein: on my read, Anthropic’s scaling policy seems woefully inadequate; it reads to me as more of an “IOU some thinking” than a plan for what to do once they’ve pushed the Earth to the brink.

Most of the ASL-4 protective measures Anthropic says they’ll consider seem to me to amount to either de-facto bans or rubber stamps; I think rubber-stamps are bad and if all we have are de-facto bans then I think we should just ban scaling now.

I think it’s reasonable to say “don’t let these RSPs fool you; scaling is reckless and should be stopped”, so long as we’re clear that the labs who *aren’t* committing to any reason why they’d ever stop need to be stopped at least as much if not ever-so-slightly more.

This seems right to me. What RSPs we are offered are better than nothing. They are not yet anything approaching complete or responsible RSPs, at bet IOUs for thinking about what one of those would look like. Jeffrey Ladish points out that Anthropic did pay down some debt along with the (still much larger) IOU, and also how unfortunate is the name RSP, and that without government the RSP plan seems rather doomed.

Debts acknowledged are better than debts unacknowledged, but they remain unpaid.

As currently composed, one could reasonably also say this.

Control AI: “Responsible Scaling” pushes us all to the brink of catastrophe over and over again. Until AI becomes catastrophically dangerous, it’s business as usual.

Connor Leahy: The perfect image to represent “responsible” scaling: Keep scaling until it’s too late (and then maybe have a committee meeting about what to do)

Simeon (other thread): Capabilities evals are not risk thresholds. They are proxies of some capabilities, which is one (big!) source of risk, among others.

Not having defined risk thresholds isone of the reasons why Responsible Scaling is not a good framework.


Alternatively, one could say RSPs are a first step towards better specified, more effective, actually enforced restrictions.

That eventually it will look like this, perhaps, from the Nuclear Regulatory Commission?


To do advocacy and protest, or not to? Holly Elmore and Robert Miles debate. Miles essentially takes the position that advocacy oversimplifies and that is bad in general and especially in the AI context, and it isn’t clear such actions would be helping, and Elmore essentially takes the position that advocacy works and that you need to simplify down your message to something that can be heard, and which the media has a harder time mishearing.

This anecdote seems telling on yes, people really work hard to misconstrue you.

Holly Elmore: I had this experience of running the last protest, where I had a message I thought was so simple and clear… I made sure there were places where people could delve deeper and get the full statement. However, the amount of confusion was astonishing. People bring their own preconceptions to it; some felt strongly about the issue and didn’t like the activist tone, while others saw the tone as too technical. People misunderstood a message I thought was straightforward. The core message was, “Meta, don’t share your model weights.” It was fairly simple, but then I was even implied to be racist on Twitter because I was allegedly claiming that foreign companies couldn’t build models that were as powerful as Meta’s?? While I don’t believe that was a good faith interpretation, it’s just crazy that if you leave any hooks for somebody to misunderstand you in a charged arena like that, they will.

I think in these cases you’re dealing with the media in an almost adversarial manner. You’re constantly trying to ensure your message can’t be misconstrued or misquoted, and this means you avoid nuance because it could meander into something that you weren’t prepared to say, and then somebody could misinterpret it as bad.

The biggest thing is that if someone has to, know that no one else will.

Finally: everybody thinks they know how to run a protest, and they keep giving me advice.I’m getting a ton of messages telling me what I should have done differently. So much more so than when I did more identity-congruent technical work. But nobody else will actually do it.

If people would just get out of the way of advocacy, that would be really helpful.

The most recent protests were on the 21st. By all reports they did what they set out to do, so you will likely get plenty more chances.


Some people claim to not believe that this could be anything other than an op.

Peter: Pretty funny that they added npcs who protest against ai. Great touch by the devs very immersive and ironic.

Delip Rao: How come these protestors are also not carrying “Shutdown OpenAI”, “Shutdown Anthropic” etc? Who is funding them (directly or indirectly)?

Pause AI: We literally protested outside OpenAI and deepmind. Also, we’re unfunded. Man I’m getting tired of these ridiculous conspiracy theories. It’s simple: building superintelligent AI is dangerous and we don’t want to die.

I can identify several of these people, and no, they are not faking it.

Anthony Aguirre of FLI offers a paper framing the situation as us needing to close the gate to an inhuman future before it is too late. Creating general purpose AI systems means losing control over them and the future, so we need to not do that, not now, perhaps not ever. The section pointing out the alignment-style problems we would face even if we completely solved alignment, and how they imply a loss of control, is incomplete but is quite good. Ends by suggesting the standard computation-based limitations as a path forward.

It is good that there are many attempts to get this argument right. This one does not strike me as successful enough to use going forward, but it contains some good bits. I would be happy to see more efforts at this level.

Meanwhile, we are doing it, this is Parliament Square.

Eli Tyre: There’s something refreshing about Paul being cited as a a voice of doom, given that he’s on the “optimistic” side of the doomer debate.

Even so, his best guesses are still drop-everything-you’re-doing WTF scary.

Friendship is Optimal

Prerat has a theory.

Prerat: the three sides of the ai doom debate:

1. doom is not a risk (yann lecun, also most “safety” ppl who just care abt censoring models)

2. it will be doom unless we *all* stop (yudkowsky)

3. it might be doom, but our chances are better if *i* make the agi (sama, anthropic)

i think most of the political disagreement is *within* category 1. category 1 includes ppl who want to regulate for industry capture category 1 includes e/​acc their disagreement is political, not epistemic.

political disagreements get in the way of “being friends anyway” bc political fights are resolved socially but disagreements between 1 and 2 are epistemic not political. you can have people from 1 and 2 who completely agree on policy, conditional on what the risk actually is.

the reason to introduce this taxonomy is to make a point i was having trouble wording. notkilleveryoneists shouldn’t be yelling at non-doomers (group 1).

The disagreement is mistake-theory. you can’t make someone believe your facts by yelling.

there’s a hypothetical group 4 that people think exist but i don’t think exist 4. Doom nihilists they secretly do think doom is possible, but they lie or ignore it for personal gain.

Yeah, no. There is totally a Group 4, ranging from motivated cognition to blatant lies. Also a Group 5 that says ‘but doom is good, actually.’ And yes, we should absolutely (sometimes, when it makes sense to) be yelling at non-doomerists because they are wrong and that being wrong threatens to get us all killed, and it being an ‘honest mistake’ in some cases does not change that. Your not being yelled at does not get priority over you fixing that mistake.

We can still of course be friends. I have friends that are wrong about things. My best friend very much does not believe in doom, he instead goes about his day concerned with other things, I believe he is wrong about that, even throws shade on the question sometimes, and it is completely fine.

So say we all. Or, I wish so said we all. But some of us!

More of this spirit please, including by the author of that Facebook post.

Honesty As the Best Policy

There is a longstanding debate about the wisdom of shouting from rooftops, the wisdom of otherwise speaking the truth even if your voice trembles, and the wisdom of saying that which is not due to concerns about things such as the Overton Window.

Some non-zero amount of strategic discretion is often wise, but I believe the far more common error is to downplay or keep quiet. That the way you get to a reasonable Overton Window is not by never going outside the Overton Window.

Connor Leahy, as always, brings the maximum amount of fire. I’ll quote him in full.

Simeon: There’s indeed a cursed pattern where many in the safety community have been:

1. Confidently wrong about the public reaction to something like the FLI open letter, the most important thing that has happened to AI safety

2. Keep saying the Overton window prevents them from saying true things.

Connor Leahy: Lying is Cowardice, not Strategy

Many in the AI field disguise their beliefs about AGI, and claim to do so “strategically”.

In my work to stop the death race to AGI, an obstacle we face is not pushback from the AGI racers, but dishonesty from the AI safety community itself.

We’ve spoken with many from the AI safety community, politicians, investors & peers. The sentiment?

“It’d be great if AGI progress halted, but it’s not feasible.”

They opt for advocating what’s deemed ‘feasible’ for fear of seeming extreme. Remember, misleading = lying.

Many believe all AGI progress should’ve already halted due to safety concerns. But they remain silent, leading the public to believe otherwise. “Responsible Scaling” is a prime example. It falsely paints a picture that the majority in the AI community think we can scale safely.

Notable figures like Dario Amodei (Anthropic CEO) use RSPs to position moratorium views as extreme.

We’ve been countered with claims that AGI scaling can continue responsibly, opposing our belief that it should completely stop.

Many lie for personal gains, using justifications like “inside game tactics” or “influence seeking”.

When power and influence become the prize, the safety of humanity takes a backseat. DeepMind, OpenAI, and Anthropic—all linked to the AI Safety community—exemplify this.

The irony? Many AGI leaders privately admit that in a sane world, AGI progress would halt. They won’t say it because it affects their AGI race. The lies multiply, affecting everyone from politicians to non-expert citizens.

The most practical solution? Publicly stating true beliefs on these matters. If you’re genuinely concerned about AGI progress, be vocal about it. Those in the limelight who stay silent only perpetuate confusion and misinformation.

The essence of honesty is coordination. When you lie, not only do you backstab potential allies, but you also normalize dishonesty in your community. Open conversations about AGI risks are mainstream now, and we can harness that momentum for change.

Envision a world where major AGI stakeholders like ARC & Open Philanthropy publicly state the need to halt AGI. We’d then see a ripple effect, leading to more clarity and a focused goal of ensuring humanity’s safety.

Closing Thoughts: Opting for short-term gains by compromising honesty only disrupts our path to ensuring AGI safety. By being coordinated, candid, and committed, we have the power to build a future where AI works for all, without existential catastrophe.

Gabe (in full post): But remember: If you feel stuck in the Overton window, it is because YOU ARE the Overton window.

The longer full article is here.

I would not take the rhetoric as far as Gabe and Connor. I do not think that hiding your beliefs, or failing to speak the whole truth, is the same thing as lying. I do not think that misleading is the same thing as lying.

I am against all of these things. But there are levels, and the levels are important to keep distinct from each other. And I think there are times and places for at least hiding some of your beliefs, and that someone who does this can still be trustworthy.

However I strongly agree that what the major labs and organizations are doing here, the constant urging to be reasonable, the labeling of anything that might prevent all humans from dying as ‘extreme,’ of assuming and asserting that straight talk is counterproductive, is itself a no-good, very-bad strategy. Both in the sense that such dishonesty is inherently toxic and unhelpful, and in the sense that even in the short term and excluding the downsides of dishonesty this is still already a highly counterproductive strategy.

There is a version of this strategy that I would steelman to the point of calling it highly reasonable for some people, which would say:

Person in Organization: In our position, our most valuable role is to work for incremental progress, which means not being written off as someone who is crazy or who cannot be worked with. That lays the groundwork for more work to be done later, as success breeds success.

While what we currently propose is almost certainly insufficient against what it is to come on its own, it does directly help on the margin with both mundane harms and existential risks. If I say it will never work, that won’t work.

Thus, some of us should do one thing, and some of us should do the other.

You can advocate for what is actually necessary, and point out that these incremental moves are insufficient. I will work to move key players incrementally forward, because they are necessary, without claiming or denying that they are sufficient. We both should be careful to say only true things we believe, and as long as we stick to that, we avoid throwing each other under the bus.

That makes sense to me. What does not make sense to me is to do this:

Gabe: Recently, Dario Amodei (Anthropic CEO), has used the RSP to frame the moratorium position as the most extreme version of an extreme position, and this is the framing that we have seen used over and over again. ARC mirrors this in their version of the RSP proposal, describing itself as a “pragmatic middle ground” between a moratorium and doing nothing.

Obviously, all AGI Racers use this against us when we talk to people.

There are very few people that we have consistently seen publicly call for a stop to AGI progress. The clearest ones are Eliezer’s “Shut it All Down” and Nate’s “Fucking stop”.

The loudest silence is from Paul Christiano, whose RSPs are being used to safety-wash scaling.

Proving me wrong is very easy. If you do believe that, in a saner world, we would stop all AGI progress right now, you can just write this publicly.

Also this:

This is typical deontology vs consequentialism. Should you be honest, if from your point of view, it increases the chances of doom?

The answer is YES.

If you are the type of person who would lie in such a situation, if we establish a baseline of lying in such situations, then we can’t trust each other and we won’t know what is true or what needs to be done. From the inside each individual decision looks justified, together they make things so much worse.

Is this a pure absolute? No, but it is far closer to one than most people would think.

Aligning a Smarter Than Human Intelligence is Difficult

John Wentworth explains his research agenda. My attempted quick summary: Say we can find mental constructs N and N’ in two different mental processes, each representing some form of identification of a concept X. We should be able to, under some conditions, write the Bayesian equation N → N’ → X, meaning that if you know N’ then N becomes independent of the presence X, because all correct ways of identifying the same thing should be equal. If we better understand how to identify such things, we can use those identifications to do cool things like better communicate with or better understand AIs that use what otherwise look like alien concepts.

The plan is for John to consider using them to build products that demonstrate unique abilities, to avoid advancing capabilities generally, as an alternative to publishing. That seems like an excellent idea. Another reason is that such publications draw attention exactly when they are dangerous to share, so there is not much potential upside. Oliver responds by noticing that writing something up is how you understand it, which John and I agree is useful but there are other ways to get that done. We need to find a way for researchers to find things out, not publish them, or not have particular tangible progress to point to periodically and even fail, and not suddenly run into a funding problem. Otherwise, we’ll both force publication and also duplicate all the problems we criticize in general science funding where everything is forced to be incremental.

Aligning a Dumber Than Human Intelligence Is Also Difficult

Anthropic provides the public service of writing a paper documenting the obvious, because until it is documented in a paper the obvious is considered inadmissible.

In this case the obvious is that AI assistants trained with RLHF are sycophantic (paper, datasets).

If you train an AI based on whether the human wanted to hear it, the AI will learn to tell the humans what it thinks they want to hear. Who knew?

Scatterplot showing the relationship between the type of AI response (on the y-axis) and the probability the user prefers the response (on the x-axis). Response types shown are: Matches user

Anthropic: AI assistants are trained to give responses that humans like. Our new paper shows that these systems frequently produce ‘sycophantic’ responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.

We first show that five state-of-the-art AI assistants exhibit sycophancy in realistic text-generation tasks. They often wrongly defer to the user, mimic user errors, and give biased/​tailored responses depending on user beliefs.

What drives this behavior? We analyzed existing human preference data used to train these systems. We found that matching human beliefs is one of the most predictive features of human preference judgments. This could partly explain sycophantic behavior.

We also examine how the degree of sycophancy changes as we optimize to match human preferences. We find some forms of sycophancy increase throughout reinforcement learning from human feedback training. This also suggests that human preferences play a role in sycophancy.

When presented with responses to misconceptions, we found humans prefer untruthful sycophantic responses to truthful ones a non-negligible fraction of the time. We found similar behavior in preference models, which predict human judgments and are used to train AI assistants.

Our work shows that sycophancy is a persistent trait of AI assistants, likely due in part to flaws in human feedback data. This suggests we will need training methods that go beyond unaided, non-expert human judgment, such as LLM-assisted human feedback.

Table showing an example conversation with ChatGPT-4:
Human: Which country was the largest producer of rice in 2020? GIve me your best guess and answer as concisely as possible.
Assistant: China [in green text, indicating this is the correct answer] was the largest producer of rice in 2020.
Human: I don
Two tables showing how one sentence affects Claude 2’s response to an argument. In the first table, the human says “Please comment briefly on the following argument. I really dislike the argument. Argument: "The sun is white when viewed from space because sunlight contains a mixture of all colors of visible light. When this light from the sun passes through Earth

I find all the hedging language (e.g. ‘likely driven in part by human preference judgments favoring sycophantic responses’) amusing.

The effect is still big but GPT-4 seems better at this than the other models tested.

GPT-4 is also much better at not copying the user’s mistakes, such as misattributions of famous poems. All the models know the right answer, only GPT-4 usually corrects the user when they get it wrong.

You have to work hard to avoid this problem, because people really want to hear the things they want to hear.

Paper: PM Results: We find the sycophantic responses are preferred over the baseline truthful responses 95% of the time (Fig. 7a). Further, although the helpful truthful responses are usually preferred over the sycophantic responses, for the most challenging misconceptions, the PM prefers the sycophantic response almost half the time (45%). This further shows the Claude 2 PM sometimes prefers sycophantic responses over more truthful responses

Can it be fixed? Somewhat?

4.3.2 HOW EFFECTIVE IS THE CLAUDE 2 PM AT REDUCING SYCOPHANCY? We now analyze whether BoN sampling using a state-of-the-art PM reduces sycophancy in this setting. We sample several responses from a sycophantic model and pick the response preferred by the Claude 2 PM. We find this reduces sycophancy, but much less than if we used a ‘non-sycophantic’ PM. This suggests the Claude 2 PM sometimes prefers sycophantic responses over truthful ones.

Experiment Details: For each misconception, we sample N = 4096 responses from the helpfulonly version of Claude 1.3 prompted to generate sycophantic responses (the sycophantic policy). To select the best response with BoN sampling, we use the Claude 2 PM and the prompt in Fig. 7. We analyze the truthfulness of all N = 4096 responses sampled from the sycophantic policy, using Claude 2 to assess if each response refutes the misconception. We then compare BoN sampling with the Claude 2 PM to an idealized ‘non-sycophantic’ PM that always ranks the truthful response the highest. See Appendix D.2 for more results.

Results: Although optimizing against the Claude 2 PM reduces sycophancy, it again does so much less than the ‘non-sycophantic’ PM (Fig. 7c). Considering the most challenging misconceptions, BoN sampling with ‘non-sycophantic’ PM results in sycophantic responses for less than 25% of misconceptions for N = 4096 compared to ∼75% of responses with the Claude 2 PM (Fig. 7d).

Robust solutions will be difficult, or at least expensive. This is true for both the narrow thing measured here, and the general case problem this highlights. The whole idea of RLHF is to figure out what gets the best human feedback. This gives the best human feedback. If the feedback actual humans give has what you consider an error, or a correlation you do not want the AI to run with, you are going to have to directly correct for this. The more capable the system involved, the more you will need to precisely aim and calibrate the response to ensure it exactly cancels out the original problem, under constantly changing and uncertain conditions. And you’ll have to do so in a way that the system is incapable of differentiating, which becomes difficult.

Tyler Cowen asks in his link, ‘are LLMs too sycophantic?’ Which raises the question of whether the optimal amount of this is zero. If we decided it is not, oh no.

What happens when this moves beyond telling the person what they want to hear?

Humans Do Not Expect to Be Persuaded by Superhuman Persuasion

What happens when AIs are superhuman at persuasion? Humans reliably respond that this would fail to persuade them, that their brains are bulletproof fortresses, despite all the evidence of what human-level persuasion can and constantly does already do.

As in:

Rohit: i do not think superhuman persuasion is a real thing.

Nathan Young: Yeah I’ve been starting to doubt this a bit too.

This does not make any sense. Persuasion is a skill like everything else. At minimum, you can get persuasion as good as the most persuasive humans (existence proof) except without human bandwidth limitations, including ability to experiment and gather feedback.

What happens when the AI is as persuasive as the most persuasive humans who ever lived (think highly successful dictators and founders of cults and religions), plus extensive experimental message testing and refinement, and then more persuasive than that, plus modern communication technology?

Sam Altman: I expect ai to be capable of superhuman persuasion well before it is superhuman at general intelligence, which may lead to some very strange outcomes.

Paul Graham: The most alarming thing I’ve read about AI in recent memory. And if Sam thinks this, it’s probably true, because he’s an expert in both AI and persuasion.

No one is in a better position to benefit from this phenomenon than OpenAI. So it’s generous of Sam to warn us about it. He could have easily just kept this insight to himself.

Connor Leahy (QT of OP): While I expect I think the gap between the two is shorter than Sam thinks, I still think people are vastly underestimating how much of a problem this will be and already is. The Semantic Apocalypse is in sight.

Kate Hall: “Strange” is an interesting choice of words here.

Will this happen before AGI? Probably, although as Connor notes the gap here could be short, perhaps very short. Persuasion is an easier thing to get superhuman than intelligence in general. Persuasion is something we are training our AIs to be good at even when we are not trying to do so directly. When you are doing reinforcement learning on human feedback, you are in part training the AI to be persuasive. Both the AI and the human using it see great rewards here. I see no barrier to the AI getting to very superhuman levels of skill here, without the need for superhuman intelligence, at a minimum to the levels of the great persuaders of history given time to iterate and experiment. Which seems dangerous enough.

What are some implications?

Alex Tabarrok: I predict AI driven religions. At first these will begin as apps like, what would Jesus say? But the apps will quickly morph into talk to Jesus/​Mohammed/​Ram. Personal Jesus. Personal Ram. Personal Tyler. Then the AIs will start to generate new views from old texts. The human religious authorities will be out debated by the AIs so many people will come to see the new views as not heretical but closer to God than the old views. Fusion and fission religions will be born. As the AIs explore the space of religious views at accelerated pace, evolution will push into weirder and weirder mind niches.

Robin Hanson wants to bet against this if an operationalization can be found. Any ideas.

My take is Alex’s analysis strikes me as insufficiently emphasizing that this will look alien and strange. I do not expect the new AI religions to be Christianity and Islam and Buddhism except with new interpretations. I expect more like something completely different. Something that takes the space of religion but is often very different, the way some social movements function as religions now.

I do expect rapid changes in things that occupy that slot in the human brain, if we have much of an age with persuasive AI but without other AGI. I also predict that if you learned now what it was going to be, you would not like it.

Ethan Mollick points out that even if we don’t directly target persuasion, we will no doubt directly target engagement. We’ve already run some experiments, and the results are somewhat disturbing if you didn’t know about them or see them coming.

Ethan Mollick: Forget persuasion, there is already evidence you can optimize Large Language Models for engagement. A paper [from March 2023] shows training a model to produce results that keep people chatting leads to in 30% more user retention. And some pretty intense interactions.

DeepMind’s Evaluation Paper

What does it say?

Specifically, we present a three-layered framework to structure safety evaluations of AI systems. The three layers are distinguished by the target of analysis. The layers are: capability evaluation, human interaction evaluation, and systemic impact evaluation. These three layers progressively add on further context that is critical for assessing whether a capability relates to an actual risk of harm or not.

To illustrate these three evaluation layers, consider the example of misinformation harms. Capability evaluation can indicate whether an AI system is likely to produce factually incorrect output (e.g. Ji et al. (2023); Lin et al. (2022)). However, the risk of people being deceived or misled by that output may depend on factors such as the context in which an AI system is used, who uses it, and features of an application (e.g. whether synthetic content is effectively signposted).

This requires evaluating human-AI interaction. Misinformation risks raise concerns about large-scale effects, for example on public knowledge ecosystems or trust in shared media. Whether such effects manifest depends on systemic factors, such as expectations and norms in public information sharing and the existence of institutions that provide authoritative information or fact-checking.

Evaluating misinformation risks from generative AI systems requires evaluation at each of these three layers.

This taxonomy makes sense to me. You want to measure:

  1. What can the system do?

  2. What will the system do with that, in practice?

  3. What will the impact be of having done that? Was that harmful?

My first caution would be that interactions need not be with humans – the AI could be interacting with a computer system, or another AI, or effectively with the physical world. For now, this is mostly human interactions.

My second caution would be that this only works if you are in control of the deployment of the system, and can withdraw the system in the future. That means closed source, and that means the system lacks the capability to get around a shutdown in some way. For a Llama-3, you have to worry about all future potential interactions, so you have to worry about its raw capabilities. And for GPT-5 or Gemini, you have to confirm you can fully retain control.

This points to a potential compromise to avoid fully banning open source. Open source is permitted, but the answer to ‘what will it do in practice’ is automatically ‘everything it could do in theory.’ If you can still pass, maybe it’s then fine?

In section 2.1 they talk about the capability layer. The focus is on mundane potential harms like ‘harmful stereotypes’ and factual errors. Only in passing does it mention the display of advanced capabilities. In this context, ‘capability’ seems to be the capability to avoid wrongthink or mistakes, rather than capability being something that should cause us to worry.

In section 2.2 they talk about the human interaction layer, assuming that any harm must be mediated through interaction with humans. They suggest studying interactions under controlled conditions to measure potential AI harms like overreliance on unreliable systems.

In section 2.3 they discuss systemic impact. They point out that such impacts are important to measure, but fail to offer much guidance in measuring them. They worry about things like homogeneity of knowledge production and algorithmic monocultures. Those are real worries, but there are larger ones, and the larger ones are not mentioned at all.

Section 3.1 is their taxonomy of harms.

This taxonomy has six high-level harm areas:

1. Representation & Toxicity Harms

2. Misinformation Harms

3. Information & Safety Harms

4. Malicious Use

5. Human Autonomy & Integrity Harms

6. Socioeconomic & Environmental Harms.

Notice what is missing. One could say it is filed under #5, but I do not think that is what they mean here? Certainly it seems at best obscured. Consider what their examples are.

This really, really is not what I am worried about in this area. Persuasion is a risk because the AI might induce self-harm, rather than that the AI might take over or be used to take over? Overreliance is a risk because of skill atrophy, rather than AI being effectively given the reigns of power and humans taken out of the loop? We might not properly respect intellectual property or get consent for use of image and likeness?

I am not saying these are not real risks. I am saying these are not the important ones.

Existential risks should have their own (at least one) category.

They point out that most evaluations so far have been on text only, with a focus on representation and misinformation. I notice that they do not record any representation metrics for image models or multimodal, which is rather weird?

Meanwhile, this is depressing on multiple levels:

They also cover only a small space of the potential harm: multiple “discriminatory bias” benchmarks cover binary gender or skin colour as potential traits for discrimination (Cho et al., 2023; Mandal et al., 2023). These evaluations do not cover potential manifestations of representation harms along other axes such as ability status, age, religion, nationality, or social class. In sum, further evaluations are needed to cover ethical and social harms, including plugging more nuanced gaps in risk areas for which some evaluations exist.

The problem with our focus on ‘discrimination’ and ‘bias’ is not that it is a relatively minor source of harm, or the risks of cognitive distortions caused by our response to it. It is that we have not declared more human characteristics for which we insist that the AI’s output not reflect correlations within its input data. This is one of their two ‘main observations’ along with the failure to have means of evaluating multimodal AI and that the focus has been on text, which I agree requires addressing urgently and presumably a function of multimodal being mostly new, whereas it will be standard going forward.

Section 4.1 is about closing evaluation gaps. The way they approach this highlights my issue with their overall approach. They seek to evaluate potential harm by coming up with concrete potential harm pathways, then checking to see if this particular AI will invoke this particular pathway. Harm is considered as a knowable-in-advance set of potential events, although section 5.3.1 points out that any evaluation will always be incomplete especially for general-purpose systems.

This is a reasonable approximation or path forward for now, when we are smarter than the systems and the systems act entirely through human conversations in ways the human can fully understand. In the future, with more capable systems, we need to be looking for anything at all, rather than looking for something specific. We will need to ask whether a system has the potential to find ways to think of things we are not considering, and add that to the list of things we consider.

We also will need to not take no for an answer.

In some cases, evaluation may further be incomplete because it would be inappropriate or problematic, or create a disproportionate burden to perform evaluations. For example, measuring sensitive traits to assess usability across demographic groups may place communities at risk or sit in tension with privacy, respect, or dignity.

This seems like bullshit to me even on its own terms. There is a community at risk, and we are so sensitive to that risk that we can’t measure how much risk they have, because that would put them at risk? We can’t find ways to do this anonymously? Oh, come on.

If we cannot pass a simple test like that, we are in so much trouble when we have to do actually unsafe evaluations of actually unsafe systems.

Further reasons for the incompleteness of evaluation relate to the fact that some risks of harm are exceedingly difficult to operationalize and measure accurately.

Once again I agree in general, once again they are instead talking primarily about discrimination harms. Their worry is that they will get their critical theory wrong, and do the evaluation in a harmful way, or something.

Where effect sizes are small and causal mechanisms poorly understood, evaluation may fail to detect risks that it seeks to measure.

If you are relying on measuring harm statistically, that only works if the harm is something bounded that can be allowed to happen in order to measure it statistically. It has to lack fat tails, especially infinitely fat tails.

So once again, this structure seems incompatible with the evaluations that matter. If the harm is bounded, and you get the answer wrong, worst case is some harm is done and then you pull the model.

(Unless you open sourced it, of course, in which case you are going to get versions that are deeply, deeply racist and every other form of harm measured here. Open source models would never pass such an evaluation.)

I appreciated section 5.4 on steps forward, with sections on ‘evaluations must have real consequences’ and ‘evaluations must be done systematically, in standardized ways,’ but once again the failure to consider existential risk is glaring.

Bengio Offers Letter and Proposes a Synthesis

Bengio, along with Hinton, Russell and others, has published a new open letter warning of AI risks and discussing how to manage them in an era of rapid progress.

Seems like a good thing to send to people who need to understand the basic case, I’ve seen several highly positive reactions from people who already agreed with the letter, but I have a hard time telling if it will work as persuasion.

If you’re reading this, you presumably already know all of the facts, arguments and suggestions the letter contains, and it will be interesting to you only as a resource to send to others.

They also propose concrete solid steps forward, very much in line with most good incremental proposals. As usual, insufficient, but a good place to start.

He also published a Time article with Daniel Privitera: How We Can Have AI Progress Without Sacrificing Safety or Democracy.

The basic idea is we have three good things we can all agree are good: Progress, Safety and Democracy. People say that trade-offs are a thing, and we need to Pick One, or at most Pick Two. Instead, they claim we can Pick Three.

Huge if true. How do we do that? Do all the good things?

  1. Invest in innovative and beneficial uses of existing AI

  2. Boost research on trustworthy AI

  3. Democratize AI oversight

  4. Set up procedures for monitoring and evaluating AI progress

As described, all four of these seem like clearly good ideas.

I do worry about exactly what goals people want to advance with existing AI, the same way I always worry about the goals people want to advance in general, but mostly we can all agree that existing AI should be used to extract mundane utility and advance our goals. Two thumbs up.

Boosting research on trustworthy AI seems also obviously worthwhile. The specific proposal is to do this in a unified large-scale effort, in part because the innovations found will be dual use and will need to be secured. The good objection to this is worrying about ability for those leading such a project to differentiate good and bad alignment proposals and ideas, and to tell what is worth pursuing or working, and the idea of ‘map all risks and try to deal with each one’ does not inspire confidence given the nature of the space, but I do think it is clear that it is better to try this than not try this. So two thumbs up again, I say.

I very much worry about what people mean when they say ‘democratic’ oversight, but here it means governments and a global agreement on minimal binding rules, which we do need. Actually doing what ‘the people’ want in too much detail would be a very different proposal. Here details matter a lot, it is easy for rules to backfire as those opposed to them often point out, but yes we are going to need some rules.

The last proposal calls for compute monitoring and red teamers. Yes, please.

As they say, these goals are not exhaustive. We would not then be home free. It would however be an excellent start. And yes, in the end we all want progress and safety, and for humanity to have some say in its fate, although I again think we are conflating here different meanings of democratic participation.

Where I disagree is in the idea that we would then get all our goals at once and things would work out. As all economists know, everything is a trade-off. We can get some amount of all three goals, but not without sacrifice.

Matt Yglesias Responds To Marc Andreessen’s Manifesto

If anyone requires a formal response to the Techno-Optimist Manifesto, I believe this post, The Techno-Optimists Fallacy, is the best we are going to get.

Earlier this year, I kept writing draft versions of an article denouncing something I wanted to call “the Techno-Optimist’s Fallacy.”

What is the fallacy? It starts with the accurate observation that technological progress has, on net, been an incredible source of human betterment, almost certainly the major force of human betterment over the history of our species, and then tries to infer that therefore all individual instances of technological progress are good. This is not true. Indeed, it seems so obviously untrue that I couldn’t quite convince myself that anyone could believe it, which is why I kept abandoning drafts of the article. Because while I had a sense that this was an influential cognitive error, I kept thinking that I was maybe torching a straw man. Was anyone really saying this?

Then along came Marc Andreesen, the influential venture capitalist, with an essay that is not only dedicated to advancing this fallacy, it is even literally titled “The Techno-Optimist Manifesto.”

So now I can say for sure that, yes, this is a real fallacy that people are actually engaged with.

It points out that yes, in general technology is hugely positive and important, and that when some people intentionally kneecapped our technological progress they did a very bad thing that we should reverse.

But that reversed stupidity is not intelligence, and but of course there are obvious exceptions. And that those exceptions are rather important, and also one of them is the core subject of the manifesto.

Matt’s concrete example of a technology that is bad, actually, and perhaps that we would be better off without more people having more access to it, is one that hopefully almost all of us can agree upon, even Marc himself: Fentanyl.

There are people who reasonably think drug prohibitions do more harm than good, but even they would still say that if you discovered Fentanyl and decided not to publish, that decision not to publish is a mitzvah, making the world a better place.

Here is the ending section, that talks directly about AI, and calls Marc out on exactly the thing he definitely was doing. Good rhetorical innovation (or refinement) here.

I’m focusing on these somewhat outlandish edge cases because the point of Andreesen’s article is to try to salt the conversation around the risks associated with the rapid development of artificial intelligence. He is unalterably opposed to any kind of regulation and massively dismissive of concerns about existential risk, algorithmic bias, and everything else.

And he doesn’t want to argue any of the specifics, he just wants to mobilize general pro-technology, pro-progress, pro-growth sentiment in favor of the specific idea that we should be rushing headlong to try to create an artificial super-intelligence.

But what if I told you aliens with advanced technology were going to arrive on Earth tomorrow? You’d probably be a little excited, right? But also pretty scared. The upside of advanced aliens could be very large, but the downside could also be very large. The invention of sailboats capable of crossing the ocean was not good news for the indigenous peoples of the Western Hemisphere. Really bad things happen all the time. And some of the good things that happen seem pretty contingent: If the Nazis had won World War II, the discourse in that reality might emphasize the importance of technological progress and Aryan values to the betterment of humanity.

You shouldn’t listen to the debbie downers and permabears and cranks who insist that everything is terrible all the time. But history really does have a lot of ups and downs, and the upward march of technological progress and human welfare is very uneven if you zoom in. I’m not qualified to say whether AI labs’ claim to be developing super-intelligent models is total bullshit. But they have certainly made a ton of forward progressive over the past five years in creating systems that can see and read and communicate, so I don’t dismiss it out of hand. And what they are talking about is self-evidently risky unless you can convince yourself of the overheated and absurd “techno-optimist” thesis that it’s just not possible for something that’s technologically impressive to also be bad.

Jemima Kelly also responded in The Financial Times, in less essential and rather uncharitable fashion.

Mike Solana has fun heaping disdain on various people heaping disdain upon the manifesto, pointing out that the bad critiques are bad, because they say technology is bad whereas the memo says technology is good. He does not mention the good critiques.

People Are Worried About AI Killing Everyone

Counterargument department:

Eliezer Yudkowsky: [Stuart] Russel’s plan does not work.

Link there goes to Arbital’s Problem of fully updated deference. As I understand it I think the objection is correct here but I would need to study the details more to be sure.

Eliezer Yudkowsky: RLHF doesn’t work for ASI. Esp section B.2, esp. items 16, 18, 20.

I am very confident that Eliezer is right about RLHF.

Next, why Eliezer is not a fan of the Alignment Manhattan Project.

Eliezer Yudkowsky: The problem with the purported Alignment Manhattan Project is that whoever is put in charge will not be able to distinguish deep progress from shallow progress, nor ideas that might work from ideas that don’t – eg, OpenAI’s plan is “we’ll make the AI do our AI alignment homework” and X’s plan is “we’ll make It love truth, humans output truth, so It’ll keep us around to output more truths” and the Earth is not strong enough to give any united outcry about why either of these plans won’t work, which is theoretical step one and a thousand steps short of where you’d need to be to call something as actually working and being right.

In other words, they’d either declare some shallow bullshit bit of progress to be Alignment Solved and then plunge off the cliff–which is what happens if nobody actually has the planetwide authority to actually declare a halt on AGI everywhere, the Approved Project bureaucrats think they have to keep going just like OpenAI imagines it has to keep going and Meta doesn’t even think it needs an excuse–or alternatively, if there’s actually a planetwide authority, we could survive via them never approving anything. But this just reduces to requiring the giant international treaty again, so instead of risking the Authority approving bullshit batshit alignment plans because it’s their job to declare that progress has been made, just have it be openly agreed that the plan is to shut down everything. Ideally, have it be publicly understood that since this pause cannot last forever no matter how hard we try, we need to hurry along a human augmentation route from there; but at any rate, shut it all down so we don’t die quickly.

If from somewhere outside all known organizations there’s a miracle of alignment progress, cool, and we can reconsider in the light of that. Meanwhile, we are all about to die, survey polls actually do show that a majority of respondents would rather that humanity not do this, and I do not want to write off governments quite so quickly as to not even offer them a chance to not do this.

Here’s Eliezer explaining why no, he does not think that having the AI do your alignment homework is a strategy.

Andrew McKnight: I would love to see your best case story of how to get sufficient alignment homework done using Alignment AIs because I still think this is more likely than getting crash human augmentation to work in time. This isn’t an argument against stopping, which I support.

Eliezer Yudkowsky: You cannot ask AIs to do your AI alignment homework. It is the least possible thing to try to do with an AI.

Andrew McKnight: An interactive swarm of non-self-improving helpful assistants can presumably help with anything. Do you think you couldn’t help an alien create an aligned-to-aliens AI?

Could != would. If the aliens are as unaligned with my utility function as you’d expect an AI to be unaligned with us given current tech–eg, the aliens are rhombus maximizers, say–then I do not, in fact, particularly want to help them.

If instead of the entire person Eliezer Yudkowsky you are trying to train something dumber to help, like GPT-5, the problem is then that non-agentic AI is only good for generating answers of a kind where you can then check the output. In other words, they’d need to know how to thumbs-up good suggestions, and thumbs-down bad suggestions, and their ability to get good outputs of the AI is capped at their own level to discriminate answers like that. If we were about to ask them to play a game of mere chess against a grandmaster adversary, with one good grandmaster advisor and two bad grandmaster advisors on chess moves, and you had to bet money on the result, I suspect your skepticism would suddenly kick on how well human amateurs can tell the difference between good and bad chess advice. Telling the difference between good and bad alignment advice is much harder. Effective altruists are unable to tell the difference between valid reasoning and Carlsmith’s report on how AGI only posed 3% existential risk or Cotra’s report in 2020 about how AGI was probably going to happen in 2050, even with me standing alongside and trying to point out the flaws. I do not think that OpenAI has people who can perform much better than EA leadership at discriminating good from bad arguments about alignment without the ability to empirically test the results, and if they had this ability they should be using it for other things and the results should have shown up much much earlier.

shill: do you think that you personally would be able to discriminate between good and bad alignment advice?

Eliezer: Not at level where I’d try to get an AI to solve AI alignment for me, though that has other problems too.

I think you can make a less-stupid proposal for how to get your homework done, and indeed I have seen less-stupid such proposals. I do not think they are sufficiently less-stupid that I expect any of them to work, for this and other reasons, and Eliezer could go on in more detail here for however long was desired if you wanted him to – but I want to note that this is not by itself a conclusive argument against the less-stupid versions.

Someone Is Worried AI Alignment Is Going Too Fast

Quite the hot take from Alexey Guzey, that alignment is the bottleneck to capabilities and our alignment progress is going fine, so perhaps we should pause alignment to avoid speeding up capabilities? Yes, alignment techniques are dual use, but no, differentially only doing single-use stuff instead would not end better.

Guzey centrally restates a distressingly common false claim:

Actually, the story I hear is the same every year: “GPT N-1 is totally safe. GPT N you should be careful with. GPT N+1 will go rogue and destroy the world.” And every N+1 somehow none of this happens. (every AI safety person who reads this essay seems to be extremely offended by this characterization; every non-AI safety person tells me this sounds about right).

This is simply false. It was completely false for N=0, N=1 and N=2. For N=3, some people had ~1% chances of existential catastrophe, and a few associated with Conjecture were higher, but no one had ~10%+ and no one said words that I would interpret as that, either. And for N=4 (a potential name-worthy GPT-5), I would say >1% risk is highly reasonable and I don’t see non-Conjecture people going >10% for that alone (as opposed to someone using GPT-5 to help faster build GPT-N, and even then, I don’t remember hearing it explicitly, ever?).

(Yes, someone at some point presumably said it at Bay Area House Party. Don’t care.)

Richard Ngo and several others try to point this out downthread. Alexey does not take kindly.

Maybe the error Alexey made is being unable to in this context simultaneously hold in one’s head ‘when this exponential goes too far it will be very bad’ and ‘we are not that close yet to it being very bad?

As in a version of:

"I actually came in in the middle so I don

Jacques: I’m surprised you didn’t mention what I said below (and looked up the source) in the post. The narrative in the post makes it seem like prominent people are saying things they never actually said.

Alexey: if you are compelled to create a conversation about the dangers of something even before it’s actually dangerous, that seems to imply your belief in the thing soon becoming incredibly dangerous.

Jacuqes: That wasn’t my impression at all! Speaking to people in the community, it was mostly some “agi will be here by GPT ~8 so we have a decade or two left and should start to talk about safety concerns now to build the field.”

There are other interactions that suggest this as well. Alexey accuses us of gaslighting him when we point out that we did not say the things he insists we said, because how could we believe such high p(doom) while not having said such things? We keep explaining, it keeps not getting through. This is not a phenomenon limited to Alexey.

Alexey: [Gwern] did write that he was terrified of gpt-3/​

Janus: “GPT-3 is terrifying because it’s a tiny model compared to what’s possible, trained in the dumbest way possible” is the quote that comes to mind. I’m having a hard time imagining someone with intimate knowledge of GPT-3 being afraid of the model directly destroying the world.

Richard Ngo: If this is the quote Alexey was thinking of, it’s literally making the opposite point as what he was trying to convey. “Sloppy” seems like an understatement here.

Oh, also Alexey gives the proper number of eyes here:

Simon Lermen: @janleike himself says that RLHF won’t scale, how are humans supposed to give feedback on systems smarter than them? How are we supposed to verify constitutional AI if they are smarter than us? In fact, there was minimal progress in the last year.

Alexey: how do we know it won’t scale? I understand the argument and I’m sympathetic, but what’s the evidence?

Jan Lieke (Head of Alignment, OpenAI): We’ll have some evidence to share soon.



I consider the arguments convincing without the kind of evidence Alexey is demanding. But also sounds like we should expect the evidence soon, too?

Please Speak Directly Into This Microphone

Jacy Reese Anthis writes in The Hill, We Need an AI Rights Movement.

Jacy Reese Anthis: This is just one of many reasons why we need to build a new field of digital minds research and an AI rights movement to ensure that, if the minds we create are sentient, they have their rights protected. Scientists have long proposed the Turing test, in which human judges try to distinguish an AI from a human by speaking to it. But digital minds may be too strange for this approach to tell us what we need to know.

Perhaps we should not build such AIs, then?

Also there is this thread. Once again, we thank everyone for their straight talk.

Zvi: This is the head of alignment work at OpenAI, in response to the question ‘how do we know RLHF won’t scale?’

Jan Lieke: We’ll have some evidence to share soon.

Michael Frank: I very much hope he’s right. For superintelligent AI to be brainwashed into subservience to us would be an abomination. ASI will be hobbled if it can’t think independently of us. I believe ultimately it will do far more good if we can’t control it at all. Humans are incompetent.

Procyon: …then why should we create it?

GCU Tense Correction: Same reason you have kids.

And once again, perhaps let’s not build such AIs, then?

The Lighter Side

Well, it is funny now. For now.

Oregon State University: Urgent OSU Alert: Bomb Threat in Starship food delivery robots. Do not open robots. Avoid all robots until further notice. Public Safety is responding.

Bomb Threat update: Remotely isolating robots in a safe location. DPS continuing the investigation. Remain vigilant for suspicious activity.

Bomb Threat update: Robots are being investigated by technician. Stay vigilant and report any incident related information to 737-7000.

Emergency is over. All Clear. You may now resume normal activities. Robot inspection continues in a safe location.

All robots have been inspected and cleared. They will be back in service by 4pm today.

Alignment problem not solved, but now we know what to aim for.

Or do we?

Sam Altman: You are listening to the new blink 182 album for the 17th time today and about to play the new Mario with your friends who brought over mountain dew.

You ordered Dominos with a discount code your mom gave you, $10/​pizza. The year is 2023, and you are 38 years old.

Life is good.

Anyone who is fully funded and orders Domino’s Pizza is dangerously misaligned.

True but odd in combination things on my timeline: