It’s not clear to me why the BFO would converge to a fixed point of μ. If we’ve solved the problem of embedded agency and the AI system knows that y_t can depend on its prediction z_t, then it would tend to find a fixed point, but it could also do the sort of counterfactual reasoning you say it can’t do. If we haven’t solved embedded agency, then it seems like the hypothesis that best explains the data is to posit the existence of some other classifier h that works the same way that the AI did in past timesteps, with y_t = μ(h(x_t)) + v(h(x_t)). Intuitively, this is saying that the past data is explained by a hypothetical other classifier that worked the same way as the AI used to, and now the AI thinks one level higher than that. This probably does converge to a fixed point eventually, but at any given timestep the best hypothesis would be some finite number of applications of μ and v.
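To illustrate the “finite number of applications of μ and v” picture, here’s a toy numeric sketch. Everything in it is made up: μ and v are arbitrary contractions I chose so that the iteration visibly converges, standing in for whatever the real classifier and correction term would be.

```python
import math

# Toy model (my construction): h_0 is the original classifier, and
# "thinking one level higher" applies mu and v once more:
#   h_{k+1}(x) = mu(h_k(x)) + v(h_k(x)).
# If this iteration converges, the limit z* satisfies z* = mu(z*) + v(z*).

def mu(z):
    return 0.5 * z + 1.0       # hypothetical contraction standing in for mu

def v(z):
    return 0.1 * math.tanh(z)  # hypothetical small correction term

def k_level_hypothesis(x, k):
    """Prediction after k applications of 'one level higher' reasoning."""
    z = x  # in this toy model, h_0 just passes the input through
    for _ in range(k):
        z = mu(z) + v(z)
    return z

# Any finite k gives a hypothesis that differs from the fixed point,
# but (because mu + v is a contraction here) the iterates converge:
for k in (1, 2, 5, 20):
    print(k, k_level_hypothesis(0.0, k))
```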
The BFO can generally cope with humans observing z_t = f(y_t)
Should this be z_t = f(x_t)?
Seems like a nonissue.
I’m not claiming it’s an issue, I’m trying to understand what AUP does. Your responses to comments are frequently of the form “AUP wouldn’t do that”, so afaict none of the commenters (including me) groks your conception of AUP. I’m trying to extract simple implications and see if they’re actually true, in an attempt to grok it.
That doesn’t conflict with what I said.
I can’t tell if you agree or disagree with my original claim. “Don’t think so in general?” implies not, but this implies you do?
If you disagree with my original claim, what’s an example with deterministic known dynamics, where there is an optimal plan to achieve maximal u_A that can be executed at any time, where AUP with intent verification will execute that plan before the last possible moment in the epoch?
(1) I am unsure whether there exists an idealized reasoner analogous to a Carnot engine (see Realism about rationality). Even if such a reasoner exists, it seems unlikely that we will a) figure out what it is, b) understand it in sufficient depth, and c) successfully use it to understand and improve ML techniques, before we get powerful AI systems through other means. Under short timelines, this cuts particularly deeply, because a) there’s less time to do all of these things and b) it’s more likely that advanced AI is built out of “messy” deep learning systems that seem less amenable to this sort of theoretical understanding.
(2) I certainly agree that all else equal, advanced agents should act closer to ideal agents. (Assuming there is such a thing as an ideal agent.) I also agree that advanced AI should be less susceptible to money pumps, from which I learn that their “preferences” (i.e. world states that they work to achieve) are transitive. I’m also on board that more advanced AI systems are more likely to be describable as maximizing the expected value of some utility function, per the VNM theorem. I don’t agree that the utility function must be simple, or that the AI must be internally reasoning by computing the expected utility over all actions and then choosing the one that’s highest. I would be extremely surprised if we built powerful AI such that when we say the English sentence “make paperclips” it acts in accordance with the utility function U(universe history) = number of paperclips in the last state of the universe history. I would be very surprised if we built powerful AI such that we hardcode in the above utility function and then design the AI to maximize its expected value.
But the first action doesn’t strictly improve your ability to get u_A (because you could just wait and execute the plan later), and so intent verification would give it a 1.01 penalty?
I was also confused by intent verification. The confusion went away after I figured out two things:
Each action in the plan is compared to the baseline of doing nothing, not to the baseline of the optimal plan.
Is it correct that in deterministic environments with known dynamics, intent verification will cause the agent to wait until the last possible timestep in the epoch at which it can execute its plan and achieve maximal u_A?
Yeah, I agree with that. I had to cut out a lot of interesting thoughts about it to keep it short, but re-reading the summary I wish I had included a link to your comment, which I found quite helpful. I’ll probably add a note to the next newsletter about it.
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
What? I’ve never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I’m quite certain it can’t be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).
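(For concreteness, here is the standard adversary argument behind that particular claim; this is textbook material, nothing specific to this thread.)

```latex
\textbf{Claim.} Any deterministic algorithm searching an unordered array
of $n$ cells must probe $\Omega(n)$ cells in the worst case, so $O(\log n)$
search is impossible.

\textbf{Sketch.} Suppose the algorithm probes only cells $i_1, \dots, i_k$
with $k < n$ on some input where it answers ``not found.'' An adversary
builds a second input that agrees on all probed cells but places the
target in an unprobed cell. The algorithm makes the same probes and gives
the same (now wrong) answer. Hence all $n$ cells must be probed in the
worst case.
```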
Thinking of it as alien agents does make more sense; I think that basically convinces me that this is not an important point to get hung up about. (Though I still do have residual feelings of weirdness.)
I argue that you should be very careful about believing these things.
You’re right, I was too loose with language there. A more accurate statement is “The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn’t work for it”. Another statement is “the claim is compelling enough that I throw it at any particular proposal, and if it’s unclear I tend to be wary”. Another one is “if I were trying to design an impact measure, showing why that claim doesn’t work would be one of my top priorities”.
Perhaps we do mostly agree, since you are planning to talk more about this in the future.
it generally seems like the error that people make when they say, “well, I don’t see how to build an AGI right now, so it’ll take thousands of years”.
I think the analogous thing to say is, “well, I don’t see how to build an AGI right now because AIs don’t form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn’t need to form abstractions”.
I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?
Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.
Yeah, I agree this helps.
I don’t understand the issue here – the attainable u_A is measuring “how well would I be able to start maximizing this goal from here?” It seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?
In the case you described, u_A would be “Over the course of the entire history of the universe, I want to do 5 jumping jacks—no more, no less.” You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say “I guess I’ve never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise”, which seems wrong.
On the meta level: I think our disagreements seem of this form:
Me: This particular thing seems strange and doesn’t gel with my intuitions, here’s an example.
You: That’s solved by this other aspect here.
Me: But… there’s no reason to think that the other aspect captures the underlying concept.
You: But there’s no actual scenario where anything bad happens.
Me: But if you haven’t captured the underlying concept I wouldn’t be surprised if such a scenario exists, so we should still worry.
There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over “all possible cases”, and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar “all possible cases” way). In particular, the argument “we can’t think of any case where this is false” is unlikely to change my mind—I’ve typically already tried to come up with a case where it’s false and not been able to come up with anything convincing.
I don’t really know how I’m supposed to change your mind in such cases. If it’s by coming up with a concrete example where things clearly fail, I don’t think I can do that, and we should probably end this conversation. I’ve outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can’t be certain that anything in particular would fail.
(That’s another thing causing a lot of disagreements, I think—I am much more skeptical of any informal reasoning about all computable utility functions, or reasoning that depends upon particular aspects of the environment, than you seem to be.)
I’m going to try to use this framework in some of my responses.
But natural kind is a desideratum! I’m thinking about adding one, though.
Here, the “example” is the impact penalty that is always 1.01, the “other aspect” is “natural kind”, and the “underlying concept” is that an impact measure should allow the AI to do things.
Arguably 1.01 is a natural kind—is it not natural to think “any action that’s different from inaction is impactful”? I legitimately find 1.01 more natural than AUP—it is _really strange_ to me to penalize changes in Q-values in _both directions_. This is an S1 intuition, don’t take it seriously—I say it mainly to make the point that natural kind is subjective, whereas the fact that 1.01 is a bad impact penalty is not subjective.
So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering—I expect the approval incentives to be fairly strong.
Here, the “example” is how other actions might make us more likely to turn off the agent, the “other aspect” is value awareness via approval, and the “underlying concept” is something like “can the agent do things that it knows we want”.
Here, I’m pretty happy about value awareness via approval because it seems like it could capture a good portion of the underlying concept, but I think that’s not clearly true—value awareness via approval depends a lot on the environment, and only captures some of the concept. If unaligned aliens were going to take over the AI, or we’re going to get wiped out by an asteroid, the AI couldn’t stop that from happening even though it knows we’d want it to. Similarly, if we wanted to build von Neumann probes but couldn’t without the AI’s help, it couldn’t do that for us. Invoking the framework again, the “example” is building von Neumann probes, the “other aspect” might be something like “building a narrow technical AI that just creates von Neumann probes and places them outside the AI’s control”, and the “underlying concept” is “the AI should be able to do what we want it to do”.
You might not be considering the asymmetry imposed by approval.
See paragraph above about why approval makes me happier but doesn’t fully remove my worries.
I view it as saying “there’s no clever complete plan which moves you towards your goal while not changing other things” (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This somewhat implies that it’s measuring impact in a universal sense, although it only holds for all computable u.
When utility functions are on full histories I’d disagree with this (Theorem 1 feels decidedly trivial in that case); it’s possible that utility functions on subhistories are different, so perhaps I’ll wait until I understand that better.
Any action for which E[Penalty(a_unit)] is strictly increased?
By default I’d expect this to knock out half of all actions, which is quite a problem for small, granular action sets.
My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.
Uh, I thought I gave a very strong one—you can’t encode the utility function “I want to do X exactly once”. Let’s consider the variant “I want to do X exactly once, on the first timestep”. You could try to encode this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X as the first action of every epoch. If you’re using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think “The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise” _even if_ you have already taken action X.
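Here is a minimal sketch of that failure mode; the code is my own toy construction (the history, the epoch boundary, and u_A are all made up):

```python
# Toy sketch (my construction) of the subhistory problem. u_A rewards
# taking action "X" on the first step of whatever history it is handed.

def u_A(history):
    return 1.0 if history and history[0] == "X" else 0.0

full_history = ["X", "noop", "noop", "noop"]  # X was done once, at t=1
subhistory = full_history[2:]  # what the attainable utility calculation sees

print(u_A(full_history))  # 1.0: the goal is already satisfied
print(u_A(subhistory))    # 0.0: on the subhistory, u_A "forgets" that X
                          # happened, so attainable utility looks like 1
                          # only if the agent takes X again right now
```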
This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion “looks maybe impossible, then” doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.
The claim I’m making has nothing to do with AUP. It’s an argument that’s quantifying over all possible implementations of impact measures. The claim is “you cannot satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)”. I certainly haven’t proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.
AUP might get around this by not being objective—that’s what value awareness through approval does. And in fact I think the more you think that value awareness through approval is important, the less that AUP meets your original desideratum of being value-agnostic—quoting from the desiderata post:
If we substantially base our impact measure on some kind of value learning—you know, the thing that maybe fails—we’re gonna have a bad time.
This seems to apply to any AUP-agent that is substantially value aware through approval.
From the desiderata post comments:
This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.
That was an example meant to illustrate my model that impact (the concept in my head, not AUP) and values are sufficiently different that an impact measure couldn’t satisfy all three of objectivity, safety, and non-trivialness. The underlying model is falsifiable.
People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.
See first paragraph about our disagreements. But also I weakly claim that “design an elder-care robot” is a goal that AUP cannot maximize in a low-impact way today, or that if it can, there exists a (u_A, plan) pair such that AUP executes the plan and causes a catastrophe. (This mostly comes from my model that impact and values are fairly different, and to a lesser extent the fact that AUP penalizes everything some amount that’s not very predictable, and that a design for an elder-care robot could allow humans to come up with a design for unaligned AGI.) I would not make this claim if I thought that value awareness through approval and intent verification were strong effects, but in that case I would think of AUP as a value learning approach, not an impact measure.
Will reply on the other post to consolidate discussion.
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
Fwiw, I would make the same argument that ofer did (though I haven’t read the rest of the thread in detail). For me, that argument is an existence proof of the following claim: if you know nothing about an impact measure, it is possible that the impact measure disallows all malignant behavior, and yet all of the difficulty is in figuring out how to make it lenient enough.
Now, obviously we know something about AUP, but it’s not obvious to me that we can make AUP lenient enough to do useful things without also allowing malignant behavior.
Nice job! This does meet a bunch of desiderata in impact measures that weren’t there before :)
My main critique is that it’s not clear to me that an AUP-agent would be able to do anything useful, and I think this should be included as a desideratum. I wrote more about this on the desiderata post, but it’s worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.
For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won’t be able to take those actions. Generally, I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).
Questions and comments:
We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that’s pretty complicated, and it turns out we get more desirable behavior by using the agent’s attainable utilities as a proxy.
An impact measure that penalized change in utility attainable by humans seems pretty bad—the AI would never help us do anything. To the extent that the AI’s ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.
Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost? That doesn’t feel right to me, but I suspect I could be quickly convinced.
Nitpick: Overfitting typically refers to situations where the training distribution _does_ equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).
One might intuitively define “bad impact” as “decrease in our ability to achieve our goals”.
Nitpick: This feels like a definition of “bad outcomes” to me, not “bad impact”.
we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.
This sounds very similar to me to “let’s have uncertainty over the utility function and be risk-averse” (similar to eg. Inverse Reward Design), but the actual method feels nothing like that, especially since we penalize _increases_ in our ability to pursue other goals.
I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?
Random note: Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones. Consider an environment where breaking vases and flowerpots is irreversible. Let u_A be 1 if you stand at a particular location and 0 otherwise. Let U contain only utility functions that assign different weights to having intact vases vs. flowerpots, but always assigns 0 utility to environments with broken vases and flowerpots. (There are infinitely many of these.) Then if you start in a state with broken vases and flowerpots, there will never be any impact penalty for any action.
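To make this concrete, here is a toy version of the counterexample; the states, actions, and the finite stand-in for the infinite family U are all my own constructions, and the penalty below is only a sketch of an AUP-style sum of attainable utility differences, not the exact formula:

```python
# Toy sketch (my construction). States record whether the vase and the
# flowerpot are intact; breaking either is irreversible. Every u in U
# gives utility 0 whenever both are broken, so starting from that state,
# attainable utility is 0 for every u under every action, and the
# penalty is always 0.

def make_u(w_vase, w_pot):
    def u(state):
        vase_ok, pot_ok = state
        if not vase_ok and not pot_ok:
            return 0.0  # all of U agrees: everything-broken is worth 0
        return w_vase * vase_ok + w_pot * pot_ok
    return u

# Finite stand-in for the infinite family of such utility functions:
U = [make_u(w, 1.0 - w) for w in (0.1, 0.3, 0.5, 0.7, 0.9)]

start = (False, False)  # both already irreversibly broken

def attainable(u, state):
    # With everything irreversibly broken, every reachable state equals
    # `state`, so the best attainable utility is just u(state).
    return u(state)

def state_after(action, state):
    # No action can un-break anything, so from this degenerate start
    # state every action (including noop) leaves the state unchanged.
    return state

for action in ("move_to_location", "wave_arms"):  # hypothetical actions
    penalty = sum(
        abs(attainable(u, state_after(action, start))
            - attainable(u, state_after("noop", start)))
        for u in U
    )
    print(action, penalty)  # 0.0 for every action
```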
To prevent the agent from intentionally increasing ImpactUnit, simply apply 1.01 penalty to any action which is expected to do so.
How do you tell which action is expected to do so?
Simple extensions of this idea drastically reduce the chance that a_unit happens to have unusually-large objective impact; for example, one could set ImpactUnit to be the non-zero minimum of the impacts of 50 similar actions.
I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of “your AI is able to do things”.)
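A toy calculation of why (the numbers are made up): the non-zero minimum of 50 samples is typically much smaller than a single sample, and since ImpactUnit is the denominator of the scaled penalty, every other action’s penalty grows correspondingly:

```python
import random

random.seed(0)
impacts = [random.uniform(0.0, 1.0) for _ in range(50)]  # 50 similar actions

single_sample = impacts[0]                    # ImpactUnit from one action
min_of_50 = min(x for x in impacts if x > 0)  # non-zero minimum of the 50

raw_penalty = 0.3  # hypothetical raw penalty of some useful action
print(raw_penalty / single_sample)  # scaled penalty under one sample
print(raw_penalty / min_of_50)      # typically far larger, so more
                                    # actions exceed the allowed budget
```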
We crisply defined instrumental convergence and opportunity cost and proved their universality.
I’m not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?
An alternative definition such as “an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios” implies a value-learning setup which AUP does not require.
That definition can be relaxed to “an agent’s ability to take the outside view on the trustworthiness of its own algorithms” to get rid of the value-learning setup. How does AUP fare on this definition?
I also share several of Daniel’s thoughts, for example, that utility functions on subhistories are sketchy (you can’t encode the utility function “I want to do X exactly once ever”), and that the “no offsetting” desideratum may not be one we actually want (and similarly for the “shutdown safe” desideratum as you phrase it), and that as a result there may not be any impact measure that we actually want to use.
(Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum “the AI is able to do useful things”, we’re using similar intuitions, but this is entirely a guess that I haven’t confirmed with Daniel.)
(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)
I’m confused. I think under the strongly superintelligent AI model (which seems to be the model you’re using), if there’s misalignment then the AI is strongly optimizing against any security precautions we’ve taken, so if we don’t preclude every possible bad outcome, the AI will find the one we missed. I grant that we’re probably not going to be able to prove that it precludes every possible bad outcome, if that’s what you’re worried about, but that still should be our desideratum. I’m also happy to consider other threat models besides strongly superintelligent AI, but that doesn’t seem to be what you’re considering.
Your example with Go is not value-agnostic, and arguably has miniscule objective impact on its own.
That’s my point. It could have been the case that we cared about AIs not beating us at Go, and if building AlphaGo really does have minuscule objective impact, then it would have been built anyway, causing a catastrophe. In that world, I wouldn’t be surprised if we had arguments about why such a thing was clearly a high-impact action. (Another way of putting this is that I think either “impact” is a value-laden concept, or “impact” will fail to prevent some catastrophe, or “impact” prevents the AI from doing anything useful.)
I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind.
Suppose your utility function has a maximum value of 1, and the inaction policy always gets utility 0. Consider the impact penalty that always assigns a penalty of 2, except for the inaction policy where the penalty is 0. The agent will provably follow the inaction policy. This impact penalty satisfies all of the desiderata, except “natural kind”. If you want to make it continuous for the “goal-agnostic” desideratum, then make the impact penalty 2 + <insert favorite impact penalty here>. Arguably it doesn’t satisfy “scope-sensitivity” and “irreversibility-sensitivity”. I’m counting those as satisfied because this penalty will never allow the agent to take a higher-impact action, or a more-irreversible action, which I think was the point of those desiderata.
This is a bad impact measure, because it makes the AI unable to do anything. We should probably have a desideratum that outlaws this, and it should probably be of the form “Our AI is able to do things”, and that’s what I was trying to get at above. (And I do think that AUP might have this problem.)
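Here is a minimal sketch of the “provably follows the inaction policy” step, with made-up numbers:

```python
# Toy numbers (mine): utility is capped at 1, inaction has utility 0 and
# penalty 0, every other policy has penalty 2. The penalized objective
# u(pi) - penalty(pi) is at most 1 - 2 = -1 for any non-inaction policy,
# versus 0 for inaction, so the agent always chooses inaction.

def penalized_value(utility, is_inaction):
    penalty = 0.0 if is_inaction else 2.0
    return utility - penalty

candidates = {
    "inaction":         penalized_value(0.0, True),
    "best_useful_plan": penalized_value(1.0, False),  # max possible utility
    "mediocre_plan":    penalized_value(0.4, False),
}
print(max(candidates, key=candidates.get))  # "inaction"
```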