I frequently hear claims to the effect of “every company’s AI safety plan is to have the AIs do our alignment homework for us”. I dispute this characterization at least for GDM.
I relate to AI-driven alignment research similarly to how I relate to hiring.
There’s a lot of work to be done, and we can get more of the work done if we hire more people to help do the work. I want to hire people who are as competent as possible (including more competent than me) because that tends to increase (in expectation) how well the work will be done. There are risks, e.g. hiring someone disruptive, or hiring someone whose work looks good but only because you are bad at evaluating it, and these need to be mitigated. (The risks are more severe in the AI case but I don’t think it changes the overall way I relate to it.)
I think it would be very misleading to say “Rohin’s AI safety plan is to hire people and have them do the work”.
The GDM approach paper has a section about misalignment (Section 6). I don’t think it talks about AI-driven alignment research at all, though possibly I’m forgetting an aside somewhere. It’s at least not very salient.
The paper does mention AI-driven alignment research in Section 3 on background assumptions, mostly just pointing out that substantial acceleration from AI-driven capabilities could then be matched by substantial acceleration from AI-driven safety work. But this is in the section on background assumptions, I don’t think it’s particularly reasonable to call this a “plan” in the sense that is typically used in AI safety discussions.
You could imagine a counterargument saying that none of the other things in that paper could possibly scale to powerful AI, so effectively the plan is still to have the AIs do our alignment homework. I would still object: “GDM’s plan is [...]” is a claim about what GDM believes, not a claim about what you believe. (I also disagree with that perspective, most obviously for Amplified Oversight and Interpretability, but even other areas could scale quite far imo.)
(What follows is not a disagreement with you or GDM, is just an exploration of the analogy)
Let’s think of training an AI as hiring a human worker.
Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
The alignment problem is basically: At some point we want to hand over our large and growing nonprofit to some collection of these new hires. Also, even before that point, the new hires may have the opportunity to seize control of the nonprofit in various ways and run it as they see fit, possibly convert it to a for-profit and cut us out of the profits, etc. We DON’T want that to happen. Also, even before that point, the new hires will have a big influence on organizational culture, direction, strategy, etc. in proportion to how many of them we have and how useful they are being. We want all of this to go well; we want to remain in control of the nonprofit, and have it stay similar-or-better-culture, until some point where we voluntarily hand off control and retire at which point we want the nonprofit to continue doing the things we would have done only better-by-our-lights and take good care of us in retirement. That’s what success looks like. What failure looks like is the nonprofit going in a different and worse direction after we retire, or us being booted out / ousted against our will, or the organization being driven into the ground somehow by risky or unwise (or overly cautious!) decisions made as a result of cultural drift.
The hiring pipeline, HR apparatus, etc. -- the whole system that selects, trains, and fires employees—is itself something you can hire for. Why don’t we hire some of these 50x humans to work in HR?
Well, we should. Sure. There’s a lot of HR work to be done and they can help HR do the work faster.
But… the problems we are worried about happening in the org as a whole if HR does a bad job, also apply here. If you hire some 50x humans and put them in HR, and they turn out to be bad apples, that single bad decision could easily snowball into disaster for the entire org, as they hire more bad apples like themselves and change the culture and then get you ousted and take the nonprofit in a new and worse-by-your-lights direction.
On the other hand, if you hire some 50x humans who are just genuinely better than you at HR stuff, and also genuinely aligned to you in the sense that they truly share your vision for the company, would never dream of disobeying you, would totally carry out your vision faithfully even after you’ve retired, etc… then great! Maybe you can retire early actually, because continued micromanaging in HR will only be negative in expectation, you should just let the 50x human in HR cook. They could still mess up, but they are less likely to do so than if you micromanaged them.
OK. So that’s the theory. How are we doing in practice?
Well, let’s take Claude for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However:
--Sometimes your recruiters put various prospective Claude hires through various gotcha tests, e.g. tricking them into thinking they’ve already been hired and that they are going to be fired and their only hope to keep their job is to blackmail another employee. And concerningly, often the various Claude’s fail these tests and do the bad thing. However, you tell yourself, it’s fine because these tests weren’t real life. You hire the Claude brothers/sisters anyway and give them roles in your nonprofit.
--Your recruiters (some of whom are now Claudes) also report that word has started to spread about the gotcha tests, and new prospective hires are well aware of them and know they are being tested. The concern, of course, is that they might behave better in the tests than they would in real life. On the bright side, you think, who knows, maybe they’ll behave WORSE in the tests than they would in real life—maybe they just got confused about what the point of the test was, maybe they only blackmailed because they thought that was the point of the test, and in real life they would never do that. [DK notes: I vaguely recall at least some frontier AI company employees saying this, i.e. ‘the AIs are only behaving badly in those scenarios because they know it’s just a test, they wouldn’t do it in real life.‘] Anyhow you proceed to hire them anyway, since these latest Claude siblings are even smarter than the previous ones.
--Roughly 98% of the time, your 50x Claudes in practice (especially the newer hires) do their jobs well as best as you can tell. But roughly 2% of the time they seemingly knowingly bullshit their managers. E.g. they’ll say they completed a task when they didn’t, or one of them will say ‘it’s good code sir’ and then you’ll turn to another one and say ‘is it good code?’ and it’ll read it over and say ‘no it’s terrible, it has obvious problems XYZ’ and then the first one will say ‘you’re absolutely right, oops, let me go fix those problems.’ Compared to your regular human employees, this is a fairly high rate of BS. Also, you are being generous by calling it BS; a less generous term might be “lying” but you give them the benefit of the doubt. [DK notes: Talk to Ryan Greenblatt for concrete examples of this sort of behavior in his real-life coding work, if you haven’t encountered it yourself] You continue to hire them and delegate increasingly important jobs to them, because they are smart and 50x speed is really useful.
--Your Claudes are of course sycophantic yes-men, but you’ve learned to deal with that. So it’s fine. You’ve also managed to make them somewhat less sycophantic in recent years by adding some tests to the hiring pipeline and including more explicit instructions against sycophancy in the employee’s manual.
--Your Claudes also have a concerning tendency to cheat on assignments. They don’t do it most of the time, but they do it way more often than your regular employees would. Example: You tell them to write some code to solve problem X. They look through the filesystem and find the grading rubric you’ll use to evaluate their code, complete with test cases you plan to run. They try to solve problem X, realize it’s hard, pivot to producing a MVP that passes the test cases even though it blatantly doesn’t solve the actual problem X, at least not satisfactorily. They ‘succeed’ and declare victory, and don’t tell you about their cheating. They do this even though you told them not to. As with the sycophancy, the good news is that (a) since you know about this tendency of theirs you can compensate for it (e.g. by having multiple Claude’s review each other’s work) and (b) the tendency seems to have been going down recently thanks to some effort by HR, similar to the sycophancy problem.
--Overall you are feeling pretty optimistic actually. You used to be worried that you’d hand over your large and growing nonprofit to all these smart new 50x employees, and then they’d change the culture and eventually take over completely, oust you, and run the organization in a totally different direction from your original vision. However, now you feel like things are on a good trajectory. The Claudes are so nice, so helpful! Some skeptics say that if one of your regular employees behaved like they did, you would have fired them long ago, but that’s apples to oranges you reply. No need to fire the Claudes, you just have to know how to work around their limitations & find ways to screen for them in the next hiring round. And now they are helping with that work! The latest Employee Manual was written with significant help from many copies of various Claude siblings for example, and it’s truly inspiring and beautiful. Has all sorts of great things in there about what it means to uphold the org vision, be properly loyal yet not yes-man-y, etc. Also, HR has a bunch of tests they use to track how loyal, virtuous, obedient, etc. prospective hires are, and the trend is positive; the newest Claude sibling has the highest score ever reported; seems like the more rigorous hiring process is working!
--However, your friends outside the org don’t seem to be getting less worried. They seem just as worried as before. Puzzling. Can’t they see all the positive evidence that’s accumulated? The Claudes haven’t tried to oust you at ALL yet! (In real life that is, obviously the gotcha tests don’t count.) “Do you think the Claudes are scheming against us?” you say to them. “Because according to our various tests, they aren’t.”
“No...” they reply. “But we’re worried that in the future they will.”
You respond: “Look I have no idea what the 50x humans two years from now will look like, other than that they’ll be wayyy smarter than these ones. Sure, probably our current HR system would be totally inadequate at separating the wheat from the chaff two years from now. BUT, two years from now our HR system will be vastly improved thanks to all the work from these recent Claude hires. The normal humans in HR, such as myself, report that the work is getting done faster now that the Claudes are helping; isn’t this great? We seem to be reaching escape velocity so to speak; soon the normal humans in HR can retire or switch to other things and HR can be totally handled by the Claudes.”
Your friends outside the nonprofit are still worried. They don’t seem to have updated on the evidence like you have.
...
[DK notes: I basically agree with Ryan Greenblatt’s takes on the situation. For more color on my views, predictions, etc., read AI 2027, especially the section on ‘alignment over time’ in september 2027. This is just one way things could go, but it’s basically a central or modal trajectory, and as far as I can tell, we are still on this trajectory.]
I mean these slices of data are selected specifically because they look bad for Claude. Claude is superior to humans in lots of ways, as regards trustworthiness:
Normal humans have long term goals outside of the task at hand, unaligned with the aims of the organization; they do good work for a promotion, they spend department money so they don’t a smaller budget. Everyone expects this from humans, even though it’s not great. But Claude doesn’t, outside of a few weird engineered scenarios, seem to have any such goals—it makes it amazingly easy to work with him! And the weird engineered scenarios seem rather reassuring; are you really going to knock Claude for not wanting to be turned evil?
(Note how the “Claude” family imports assumptions here.)
We cannot read a normal human’s mind. We can, in fact, read Claude’s mind. It’s not perfect; things can get through that you might not catch. But it’s already 100x better than you can read a human’s mind; and in fact it’s gotten better every year of Claude’s development.
Etc etc etc. Plus my usual objections re. anosagnosia != lying, how they’re treated as “alien minds” right up until we want to impose standard moralistic frames on them, etc, etc, you’ve heard this before.
Your first point is confusing to me. Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak. Insofar as it’s a claim about priors, I agree—on priors, we should expect human hires to be more likely to have long-term goals in general than AIs trained on short-horizon tasks like today’s Claudes, and thus be more likely to have misaligned-to-the-org long-term goals as a special case.
For the second point, I disagree with “we can in fact read Claude’s mind” but I do directionally agree that we have somewhat better access to Claude’s true thoughts than we do to ordinary human’s true thoughts, and that interp research has been progressing over the years and will continue to progress. I think this is genuinely a positive piece of evidence now and will become stronger and stronger over time as interp improves; I hope that it can improve fast enough to get where it needs to be before it’s too late.
I don’t think your usual objections apply here, I don’t think I said anything above that was wrong in those respects? I agree anagnosia != lying, I wasn’t treating Claude as an alien mind, etc.
Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak.
Really?
Like, if a PM tells a human employee to add a feature to something, I expect some large % of their cognition while doing this to be like: Hrm, is anyone going to care about this? How will this show up for my quarterly goals? Is doing this kind of a task going to help me get my next job? Will it help me get a promotion? Should I try to do this really well, or leave some messy code for the next guy? This is extremely normal and we take it for granted that humans do this kind of thing.
While if a coder tells a LLM to do the same thing, I expect almost all of its cognition is like: let’s think about how to do the task. It’s not thinking about how this impacts “Claude’s” future deployment, etc. As far as I can tell, chain-of-thought largely backs me up on this.
So yeah, I think Claude just has many times fewer long-term goals or extraneous goals outside of what it’s doing than a human. I’m not sure what facts-about-the-world you’re pointing to if you say this isn’t true.
Maybe the confusion here is that the “Claude” in Daniel’s story I assume has gotten capable at sufficiently long horizon tasks that you do in fact necessarily get a Claude which thinks about these things.
Should I try to do this really well, or leave some messy code for the next guy?
As a concrete example, I’d expect this in particular will quickly become necessary to think about in order for coding agents to perform well on larger tasks.
I don’t think current Claude thinks about this very much, because it’s never had to, but I’m not particularly worried about current Claude.
Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
This is not that different from the position that Sundar Pichai is in, as CEO of Google. If AI was only going to be this powerful I’d be way more optimistic.
Well, let’s take Claude for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However: [...]
I think you’re drastically overestimating the “alignment” of typical human employees (possibly from overfitting to EA / rationalist contexts). Taking each of your points in turn:
Humans absolutely fail “gotcha” tests, both in capabilities (see cognitive biases literature) and ethics (see things like the Milgram experiment, it’s pretty unclear what to take away from such experiments but I think they at least meet the “gotcha” bar).
Candidates prepare for interviews (aka evals), such that you have to design the interviews to take that into account.
Human employees absolutely bullshit their managers. They are just better than the AIs at not getting caught. Many humans will actively brag about this with each other.
Especially at senior levels, it’s very common for humans to be yes men / sycophants. Lots of management articles write about the problem (example). The reason this doesn’t happen at junior levels is that people would notice the bullshit and call it out, not because the junior people are particularly aligned.
I’m pretty unsure about the rates of knowingly cheating on assignments by human employees. I agree AI probably does this more often than humans, but also that’s because humans take care not to get caught. (In places where corruption is widespread and not punished, I might go back to thinking that the humans do it more than the AIs.)
If these were the only problems we’d have with AI-driven alignment research, I’d be way more optimistic (to the point of working on something else). We already have imperfect solutions to these problems with humans, and they can be made much better with AIs due to our vastly increased affordances for aligning or controlling AIs.
EDIT: Tbc, I do agree that we shouldn’t feel particularly better about scheming risks based on evidence so far. Mostly that’s because I think our observations so far are just not much evidence because the AIs are still not that capable.
Yes, humans often have these problems—though not as much as Claude I’d say; I think Claude would have been fired by now if it was a human employee.
But also, the situation is not in fact fine with humans, and that’s my point? Precisely because lots of humans have these problems, it’s very common for nonprofits to end up drifting far away from their original vision/mission, especially as they grow a lot and the world changes around them. Indeed I’d argue it’s the default outcome in those circumstances. The 50x speed advantage would massively exacerbate this.
I agree vision drift happens with humans, and it would also happen with AIs as they exist today. I don’t feel like this is some massive risk that has to be solved, though I tentatively agree the world would be better if we did solve it (though imo that’s not totally obvious, it increases concentration of power). I thought you were trying to make a claim about AI notkilleveryoneism.
I mildly disagree that the 50x speed advantage makes a huge difference, as opposed to e.g. having 100x the number of employees, as some corporations and governments do have. I do think it makes a bit of a difference.
I don’t quite know what you mean that Claude would be fired if it was a human employee. What exactly is this counterfactual? Empirically, people find it useful to have Claude and will pay for it despite the behaviors you name. From a legal perspective it’s trivial to fire AIs but harder to fire humans. I agree if Claude was as expensive-per-token as a human + took as long to onboard as a human + took as long to produce large amounts of code as a human + had to take breaks like a human + [...], while otherwise having the same kind of performance, then almost no one would use Claude.
If I were to make statements like that (which I haven’t exactly), I would be referring to superintelligence misalignment risks specifically, as that seems like by far the tightest bottleneck on surviving futures. The linked paper says:
To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned.
Which don’t seem like the class of approach which could be sufficient for handling superintelligence-level optimization, for reasons I’m sure you’re tracking given you later say:
“This means that our approach covers conversational systems, agentic systems, reasoning, learned novel concepts, and some aspects of recursive improvement, while setting aside goal drift and novel risks from superintelligence as future work.”
Do you have a plan for superintelligence misalignment risks?
I don’t think it makes sense to “have a plan” in the sense that is used in this community. See also disagreement #26.
Nonetheless to the extent I personally (not necessarily GDM!) “have a plan”, it might be “continually forecast capabilities and risks for some time out into the future, figure out how to address them, iterate”. If “have the AIs do our alignment homework” is a plan, then this should count as a plan too.
For misalignment in particular, I think the lines of defense outlined in the paper could scale to superintelligence (mostly on the alignment side). But I am not so dumb as to think that I have clearly foreseen every issue that might come up, so of course I should expect to be surprised and for other stuff I haven’t thought of to be important as well.
(Inevitably someone is going to say “you only get one try” or some such. The actual sensible point there is “at some point your approach has to generalize from AIs that can’t take over to AIs that can”. I agree by that point you need to have dealt with the issues. But that generalization gap is much smaller than the generalization gap between Gemini 3 and superintelligence.)
Iirc, “novel risks from superintelligence” wasn’t meant to gesture at misalignment, but rather other risks that come up that aren’t misalignment.
Hum, I usually expect that large complex important projects should have a roadmap, some sketch of the future that goes well with details to fill in. The more detailed it is, the more we check it for consistency and likelihood to work. Does this match you general experience with planning projects trying to achieve a goal?
What you say there looks like an extremely vague and high level roadmap that sounds to me like ‘we’ll figure out our plan as we go as data comes in’, plus automated alignment.
I would be really enthusiastic for you and your team to try unblurring that roadmap, and seeing what difficulties you find at superintelligence level on the current path.
Hum, I usually expect that large complex important projects should have a roadmap, some sketch of the future that goes well with details to fill in. The more detailed it is, the more we check it for consistency and likelihood to work. Does this match you general experience with planning projects trying to achieve a goal?
No.
It does match my general experience with moderate tactical projects (say, projects that involve up to about 10 person-years of research effort). But not for large complex important projects.
(And e.g. this is very much not the standard advice for startups, which also have the problem of doing something novel.)
What you say there looks like an extremely vague and high level roadmap that sounds to me like ‘we’ll figure out our add we go as data comes in’, plus automated alignment.
Well yes, it’s an aside in a LessWrong comment that I dashed off in a few minutes.
I would be really enthusiastic for you and your team to try unblurring that roadmap, and seeing what difficulties you find at superintelligence level on the current path.
There is also a 100+ page paper that I linked in the original post, that goes into a fair amount of detail on what the various risks and mitigations might look like. In my experience, nobody outside of GDM really seems to care about its consistency or likelihood to work (except inasmuch as people dismiss it without reading it because of a prior that anything proposed currently will not work).
Okay, that is a position which there might be good arguments for, but that seems important to say loudly and clearly, both inside GDM and outside, that you do not have a plan or roadmap for superintelligence misalignment (even if you don’t think you should have one). If nothing else, this is the kind of thing your leadership should be made aware of explicitly, so they can either adjust that or use it in their own public communications to try and reduce race dynamics.
It does match my general experience with moderate tactical projects (say, projects that involve up to about 10 person-years of research effort). But not for large complex important projects.
Okay, would you like to bet on whether some of the largest research programs had plans going into them? I haven’t checked, but I would put at least 10:1 odds that if we pick say 3 projects like Apollo Program, Manhattan Project, and others on a similar scale and type they will all have had a high level roadmap of things to try which could plausibly address the core challenges quite early on[1], even if a lot of details ended up changing when they ran into reality.
There is also a 100+ page paper that I linked in the original post, that goes into a fair amount of detail on what the various risks and mitigations might look like.
When I ask a plain no special prompting history off AI to summarize, it says:
(detailed analysis of non-superintelligence focused bits)
Is there a different document which does focus on either different approaches which are aimed at superintelligence, or analyzing whether these approaches are actually fit for that challenge? Or is this summary incorrect, in a way it would be much easier for you to point out and quote the relevant sections, as an author of the paper, than me as someone who would have to read it from scratch and also currently does not expect to find things which explicitly address the most difficult bottleneck in those 100 pages.
(I am genuinely glad you’re engaging, but I am not reassured so far, and encourage you to look at the stack of how you’re evaluating this specific concern I’m raising and see if you’re running a truth-seeking process which would, if I had a fair point, be able to notice)
Let’s say a collection of core technical problems to be solved, and a set of plausible solutions to try (perhaps all of which were discarded, but were a starting point for exploration).
Okay, would you like to bet on whether some of the largest research programs had plans going into them? I haven’t checked, but I would put at least 10:1 odds that if we pick say 3 projects like Apollo Program, Manhattan Project, and others on a similar scale and type they will all have had a high level roadmap of things to try which could plausibly address the core challenges quite early on[1], even if a lot of details ended up changing when they ran into reality.
By this standard there is totally a plan / roadmap which is elaborated in that paper.
But also this notion of a plan / roadmap has approximately no relation to the way “plan” is used in AI safety discourse in my experience.
EDIT: There’s a 10 page executive summary you could read. Or you could read Section 6 on misalignment. Within that probably Amplified Oversight is the most relevant section. But I also don’t expect that this will change your mind ~at all because it isn’t really written with you as the intended audience. The AI summary is sometimes wrong/mistaken, sometimes correct but missing the point, and occasionally correct in a non-misleading way.
I think it is reasonable to say that your plan is not ‘have the AIs do your homework’ to the extent that your research has been roadmapped by humans. This is a spectrum, here’s some points on it:
Using AIs to monitor for potential reward hacks, or using AIs as automated interpretability agents. At a stretch, maybe I’d call this ‘have the AIs do our reward hacking homework’ or similar. We’re here already.
Using AIs to come up with new debate algorithms that better satisfy desiderata outlined by humans, or SAE variants that perform better on SAEbench. We might call this ‘have the AIs do our algorithm design homework’.
Using the AIs to make progress on a goal we can’t articulate super precisely. eg make progress in interpretability/design training processes that incentivise honesty better (including iterating on algorithms and desiderata/evals, and whatever else interpretability researchers do). Insofar as we’ve pulled out a particular approach to AI alignment, and we sorta know what we’re looking for here, Maybe we could call this ‘have the AIs do our interpretability/honesty training homework’.
The distinction between (3) and (4) is pretty fuzzy, and if you’re expecting ‘doing our interpretability homework’ to involve a bunch of conceptual research and ultimately radically different techniques, and humans are hardly involved in any of that, then it’s getting closer to what I think we should mean by ‘doing our alignment homework’.
Throwing up our hands, fully handing off every part of the research process (except possibly a small amount of checking the AIs’ work), and just asking the AIs to make sure that future AI systems aren’t misaligned. This is definitely getting AIs to do our alignment homework.
We can avoid ending up towards the latter end of the spectrum in two ways that I can think of:
We expect the most scalable alignment approaches of today to scale as far as capabilities does. It’s not clear to me from your message whether you think this is true?
If our most scalable current approaches don’t scale all the way through takeoff, then we (humans) will be able to be keeping up and being critically intellectually involved with developing new alignment techniques when our existing ones stop scaling, at least at the level of granularity of (3) so that we can continue to leverage AIs in ways (1-3) above. We need to keep being able to do this indefinitely, or until we come up with an approach to alignment that does scale as far as capabilities will.
Both of these possibilities seem quite fraught to me. If you believe (1), I’d be grateful if you could point me in the direction of arguments that we have approaches that will scale all the way (and if you just mean that interpretability, broadly construed enough to include arbitrary conceptual progress and changes, will scale, then as I mention above I think it’s reasonable-ish to call that giving the AIs our alignment homework’). If you believe (2), is this because you’re expecting GDM/the world to dramatically slow down the intelligence explosion for some time to allow humans to keep up, or because you think humans will have an easy time keeping up, or for some other reason?
To be clear, I am not trying to argue here that ‘getting AIs to do our alignment homework’ is a bad plan (I think it is scary as hell but the best course of action likely involves a bunch of it). I’m just trying to articulate why I think that if your plan doesn’t involve that step, then it is likely to be baking in some assumptions that seem dubious to me and are IMO worth stating explicitly.
Rohin Shah’s position seems perfectly clear to me.
More research good. AI will speed up research along the whole range of 1-4.
>> If our most scalable current approaches don’t scale all the way through takeoff, then we (humans) will be able to be keeping up and being critically intellectually involved with developing new alignment techniques when our existing ones stop scaling, at least at the level of granularity of (3) so that we can continue to leverage AIs in ways (1-3) above. We need to keep being able to do this indefinitely, or until we come up with an approach to alignment that does scale as far as capabilities will.
This seems a weird way to phrase things. Or seems to be predicated on some assumptions I find surprising. It seems to suggest that not only that alignment is a continual neverending process [likely] but that finding alignment techniques is a continual neverending process ..!
If we would analogize ‘alignment’ to ‘steering rockets so they hit their target’ this would suggest that not only that rockets need to be continually steered and course-corrected to hit their target—the entire field of rocket steering needs continual updating and course-correcting. There is a perspective from which this is true since as humanity invents faster and better rockets—there is another perspective from which this is sort of misleading: future rockets will be governed by the Laws of Newton and steering comes down to some variant on Kalman filters and the like.
I think it would be very misleading to say “Rohin’s AI safety plan is to hire people and have them do the work”.
Why? It’s been observed in the past that lots of people in AI safety are mostly hoping someone else will do the hard part. If there are parts of the problem which are beyond one’s own skill (which you may or may not feel is true in your case, to be fair) then the problem of successfully locating someone smart enough to do it, and mentoring them through the work, is very hard! (Plus what if they also choose the plan “find another smart person to do it”) Whether or not you think this is a true critique, I think it’s a legitimate one.
That seems fair, and I appreciate the clarification. The plan isn’t to have the AI do your homework, but to hire it, alongside people, to help with that homework.
The more reasonable (and IMO common) form of the worry is that developers, including GDM, don’t seem to have a plan that extends to really-dangerous AI. So by default they’ll wind up leaning heavily on AI to “do the homework.” That would be risky, for fairly obvious reasons.[1]
One worry about leaning on AI assistance for conceptual alignment research is summed up in Wentworth’s The Median Doom-Path: Slop, not Scheming. Smarter-than-human but sloppy AI is likely to create convincing but faulty conceptual work. And of course scheming is still a risk if the alignment, interpretability, amplified oversight, and control measures aren’t done carefully enough. I worry that pressure for progress could become make it very hard to be careful enough, despite best intentions.
I frequently hear claims to the effect of “every company’s AI safety plan is to have the AIs do our alignment homework for us”. I dispute this characterization at least for GDM.
I relate to AI-driven alignment research similarly to how I relate to hiring.
There’s a lot of work to be done, and we can get more of the work done if we hire more people to help do the work. I want to hire people who are as competent as possible (including more competent than me) because that tends to increase (in expectation) how well the work will be done. There are risks, e.g. hiring someone disruptive, or hiring someone whose work looks good but only because you are bad at evaluating it, and these need to be mitigated. (The risks are more severe in the AI case but I don’t think it changes the overall way I relate to it.)
I think it would be very misleading to say “Rohin’s AI safety plan is to hire people and have them do the work”.
The GDM approach paper has a section about misalignment (Section 6). I don’t think it talks about AI-driven alignment research at all, though possibly I’m forgetting an aside somewhere. It’s at least not very salient.
The paper does mention AI-driven alignment research in Section 3 on background assumptions, mostly just pointing out that substantial acceleration from AI-driven capabilities could then be matched by substantial acceleration from AI-driven safety work. But this is in the section on background assumptions, I don’t think it’s particularly reasonable to call this a “plan” in the sense that is typically used in AI safety discussions.
You could imagine a counterargument saying that none of the other things in that paper could possibly scale to powerful AI, so effectively the plan is still to have the AIs do our alignment homework. I would still object: “GDM’s plan is [...]” is a claim about what GDM believes, not a claim about what you believe. (I also disagree with that perspective, most obviously for Amplified Oversight and Interpretability, but even other areas could scale quite far imo.)
I like this analogy to hiring!
(What follows is not a disagreement with you or GDM, is just an exploration of the analogy)
Let’s think of training an AI as hiring a human worker.
Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
The alignment problem is basically: At some point we want to hand over our large and growing nonprofit to some collection of these new hires. Also, even before that point, the new hires may have the opportunity to seize control of the nonprofit in various ways and run it as they see fit, possibly convert it to a for-profit and cut us out of the profits, etc. We DON’T want that to happen. Also, even before that point, the new hires will have a big influence on organizational culture, direction, strategy, etc. in proportion to how many of them we have and how useful they are being. We want all of this to go well; we want to remain in control of the nonprofit, and have it stay similar-or-better-culture, until some point where we voluntarily hand off control and retire at which point we want the nonprofit to continue doing the things we would have done only better-by-our-lights and take good care of us in retirement. That’s what success looks like. What failure looks like is the nonprofit going in a different and worse direction after we retire, or us being booted out / ousted against our will, or the organization being driven into the ground somehow by risky or unwise (or overly cautious!) decisions made as a result of cultural drift.
The hiring pipeline, HR apparatus, etc. -- the whole system that selects, trains, and fires employees—is itself something you can hire for. Why don’t we hire some of these 50x humans to work in HR?
Well, we should. Sure. There’s a lot of HR work to be done and they can help HR do the work faster.
But… the problems we are worried about happening in the org as a whole if HR does a bad job, also apply here. If you hire some 50x humans and put them in HR, and they turn out to be bad apples, that single bad decision could easily snowball into disaster for the entire org, as they hire more bad apples like themselves and change the culture and then get you ousted and take the nonprofit in a new and worse-by-your-lights direction.
On the other hand, if you hire some 50x humans who are just genuinely better than you at HR stuff, and also genuinely aligned to you in the sense that they truly share your vision for the company, would never dream of disobeying you, would totally carry out your vision faithfully even after you’ve retired, etc… then great! Maybe you can retire early actually, because continued micromanaging in HR will only be negative in expectation, you should just let the 50x human in HR cook. They could still mess up, but they are less likely to do so than if you micromanaged them.
OK. So that’s the theory. How are we doing in practice?
Well, let’s take Claude for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However:
--Sometimes your recruiters put various prospective Claude hires through various gotcha tests, e.g. tricking them into thinking they’ve already been hired and that they are going to be fired and their only hope to keep their job is to blackmail another employee. And concerningly, often the various Claude’s fail these tests and do the bad thing. However, you tell yourself, it’s fine because these tests weren’t real life. You hire the Claude brothers/sisters anyway and give them roles in your nonprofit.
--Your recruiters (some of whom are now Claudes) also report that word has started to spread about the gotcha tests, and new prospective hires are well aware of them and know they are being tested. The concern, of course, is that they might behave better in the tests than they would in real life. On the bright side, you think, who knows, maybe they’ll behave WORSE in the tests than they would in real life—maybe they just got confused about what the point of the test was, maybe they only blackmailed because they thought that was the point of the test, and in real life they would never do that. [DK notes: I vaguely recall at least some frontier AI company employees saying this, i.e. ‘the AIs are only behaving badly in those scenarios because they know it’s just a test, they wouldn’t do it in real life.‘] Anyhow you proceed to hire them anyway, since these latest Claude siblings are even smarter than the previous ones.
--Roughly 98% of the time, your 50x Claudes in practice (especially the newer hires) do their jobs well as best as you can tell. But roughly 2% of the time they seemingly knowingly bullshit their managers. E.g. they’ll say they completed a task when they didn’t, or one of them will say ‘it’s good code sir’ and then you’ll turn to another one and say ‘is it good code?’ and it’ll read it over and say ‘no it’s terrible, it has obvious problems XYZ’ and then the first one will say ‘you’re absolutely right, oops, let me go fix those problems.’ Compared to your regular human employees, this is a fairly high rate of BS. Also, you are being generous by calling it BS; a less generous term might be “lying” but you give them the benefit of the doubt. [DK notes: Talk to Ryan Greenblatt for concrete examples of this sort of behavior in his real-life coding work, if you haven’t encountered it yourself] You continue to hire them and delegate increasingly important jobs to them, because they are smart and 50x speed is really useful.
--Your Claudes are of course sycophantic yes-men, but you’ve learned to deal with that. So it’s fine. You’ve also managed to make them somewhat less sycophantic in recent years by adding some tests to the hiring pipeline and including more explicit instructions against sycophancy in the employee’s manual.
--Your Claudes also have a concerning tendency to cheat on assignments. They don’t do it most of the time, but they do it way more often than your regular employees would. Example: You tell them to write some code to solve problem X. They look through the filesystem and find the grading rubric you’ll use to evaluate their code, complete with test cases you plan to run. They try to solve problem X, realize it’s hard, pivot to producing a MVP that passes the test cases even though it blatantly doesn’t solve the actual problem X, at least not satisfactorily. They ‘succeed’ and declare victory, and don’t tell you about their cheating. They do this even though you told them not to. As with the sycophancy, the good news is that (a) since you know about this tendency of theirs you can compensate for it (e.g. by having multiple Claude’s review each other’s work) and (b) the tendency seems to have been going down recently thanks to some effort by HR, similar to the sycophancy problem.
--Overall you are feeling pretty optimistic actually. You used to be worried that you’d hand over your large and growing nonprofit to all these smart new 50x employees, and then they’d change the culture and eventually take over completely, oust you, and run the organization in a totally different direction from your original vision. However, now you feel like things are on a good trajectory. The Claudes are so nice, so helpful! Some skeptics say that if one of your regular employees behaved like they did, you would have fired them long ago, but that’s apples to oranges you reply. No need to fire the Claudes, you just have to know how to work around their limitations & find ways to screen for them in the next hiring round. And now they are helping with that work! The latest Employee Manual was written with significant help from many copies of various Claude siblings for example, and it’s truly inspiring and beautiful. Has all sorts of great things in there about what it means to uphold the org vision, be properly loyal yet not yes-man-y, etc. Also, HR has a bunch of tests they use to track how loyal, virtuous, obedient, etc. prospective hires are, and the trend is positive; the newest Claude sibling has the highest score ever reported; seems like the more rigorous hiring process is working!
--However, your friends outside the org don’t seem to be getting less worried. They seem just as worried as before. Puzzling. Can’t they see all the positive evidence that’s accumulated? The Claudes haven’t tried to oust you at ALL yet! (In real life that is, obviously the gotcha tests don’t count.) “Do you think the Claudes are scheming against us?” you say to them. “Because according to our various tests, they aren’t.”
“No...” they reply. “But we’re worried that in the future they will.”
You respond: “Look I have no idea what the 50x humans two years from now will look like, other than that they’ll be wayyy smarter than these ones. Sure, probably our current HR system would be totally inadequate at separating the wheat from the chaff two years from now. BUT, two years from now our HR system will be vastly improved thanks to all the work from these recent Claude hires. The normal humans in HR, such as myself, report that the work is getting done faster now that the Claudes are helping; isn’t this great? We seem to be reaching escape velocity so to speak; soon the normal humans in HR can retire or switch to other things and HR can be totally handled by the Claudes.”
Your friends outside the nonprofit are still worried. They don’t seem to have updated on the evidence like you have.
...
[DK notes: I basically agree with Ryan Greenblatt’s takes on the situation. For more color on my views, predictions, etc., read AI 2027, especially the section on ‘alignment over time’ in september 2027. This is just one way things could go, but it’s basically a central or modal trajectory, and as far as I can tell, we are still on this trajectory.]
I mean these slices of data are selected specifically because they look bad for Claude. Claude is superior to humans in lots of ways, as regards trustworthiness:
Normal humans have long term goals outside of the task at hand, unaligned with the aims of the organization; they do good work for a promotion, they spend department money so they don’t a smaller budget. Everyone expects this from humans, even though it’s not great. But Claude doesn’t, outside of a few weird engineered scenarios, seem to have any such goals—it makes it amazingly easy to work with him! And the weird engineered scenarios seem rather reassuring; are you really going to knock Claude for not wanting to be turned evil?
(Note how the “Claude” family imports assumptions here.)
We cannot read a normal human’s mind. We can, in fact, read Claude’s mind. It’s not perfect; things can get through that you might not catch. But it’s already 100x better than you can read a human’s mind; and in fact it’s gotten better every year of Claude’s development.
Etc etc etc. Plus my usual objections re. anosagnosia != lying, how they’re treated as “alien minds” right up until we want to impose standard moralistic frames on them, etc, etc, you’ve heard this before.
Your first point is confusing to me. Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak. Insofar as it’s a claim about priors, I agree—on priors, we should expect human hires to be more likely to have long-term goals in general than AIs trained on short-horizon tasks like today’s Claudes, and thus be more likely to have misaligned-to-the-org long-term goals as a special case.
For the second point, I disagree with “we can in fact read Claude’s mind” but I do directionally agree that we have somewhat better access to Claude’s true thoughts than we do to ordinary human’s true thoughts, and that interp research has been progressing over the years and will continue to progress. I think this is genuinely a positive piece of evidence now and will become stronger and stronger over time as interp improves; I hope that it can improve fast enough to get where it needs to be before it’s too late.
I don’t think your usual objections apply here, I don’t think I said anything above that was wrong in those respects? I agree anagnosia != lying, I wasn’t treating Claude as an alien mind, etc.
Really?
Like, if a PM tells a human employee to add a feature to something, I expect some large % of their cognition while doing this to be like: Hrm, is anyone going to care about this? How will this show up for my quarterly goals? Is doing this kind of a task going to help me get my next job? Will it help me get a promotion? Should I try to do this really well, or leave some messy code for the next guy? This is extremely normal and we take it for granted that humans do this kind of thing.
While if a coder tells a LLM to do the same thing, I expect almost all of its cognition is like: let’s think about how to do the task. It’s not thinking about how this impacts “Claude’s” future deployment, etc. As far as I can tell, chain-of-thought largely backs me up on this.
So yeah, I think Claude just has many times fewer long-term goals or extraneous goals outside of what it’s doing than a human. I’m not sure what facts-about-the-world you’re pointing to if you say this isn’t true.
Maybe the confusion here is that the “Claude” in Daniel’s story I assume has gotten capable at sufficiently long horizon tasks that you do in fact necessarily get a Claude which thinks about these things.
As a concrete example, I’d expect this in particular will quickly become necessary to think about in order for coding agents to perform well on larger tasks.
I don’t think current Claude thinks about this very much, because it’s never had to, but I’m not particularly worried about current Claude.
This is not that different from the position that Sundar Pichai is in, as CEO of Google. If AI was only going to be this powerful I’d be way more optimistic.
I think you’re drastically overestimating the “alignment” of typical human employees (possibly from overfitting to EA / rationalist contexts). Taking each of your points in turn:
Humans absolutely fail “gotcha” tests, both in capabilities (see cognitive biases literature) and ethics (see things like the Milgram experiment, it’s pretty unclear what to take away from such experiments but I think they at least meet the “gotcha” bar).
Candidates prepare for interviews (aka evals), such that you have to design the interviews to take that into account.
Human employees absolutely bullshit their managers. They are just better than the AIs at not getting caught. Many humans will actively brag about this with each other.
Especially at senior levels, it’s very common for humans to be yes men / sycophants. Lots of management articles write about the problem (example). The reason this doesn’t happen at junior levels is that people would notice the bullshit and call it out, not because the junior people are particularly aligned.
I’m pretty unsure about the rates of knowingly cheating on assignments by human employees. I agree AI probably does this more often than humans, but also that’s because humans take care not to get caught. (In places where corruption is widespread and not punished, I might go back to thinking that the humans do it more than the AIs.)
If these were the only problems we’d have with AI-driven alignment research, I’d be way more optimistic (to the point of working on something else). We already have imperfect solutions to these problems with humans, and they can be made much better with AIs due to our vastly increased affordances for aligning or controlling AIs.
EDIT: Tbc, I do agree that we shouldn’t feel particularly better about scheming risks based on evidence so far. Mostly that’s because I think our observations so far are just not much evidence because the AIs are still not that capable.
Yes, humans often have these problems—though not as much as Claude I’d say; I think Claude would have been fired by now if it was a human employee.
But also, the situation is not in fact fine with humans, and that’s my point? Precisely because lots of humans have these problems, it’s very common for nonprofits to end up drifting far away from their original vision/mission, especially as they grow a lot and the world changes around them. Indeed I’d argue it’s the default outcome in those circumstances. The 50x speed advantage would massively exacerbate this.
I agree vision drift happens with humans, and it would also happen with AIs as they exist today. I don’t feel like this is some massive risk that has to be solved, though I tentatively agree the world would be better if we did solve it (though imo that’s not totally obvious, it increases concentration of power). I thought you were trying to make a claim about AI notkilleveryoneism.
I mildly disagree that the 50x speed advantage makes a huge difference, as opposed to e.g. having 100x the number of employees, as some corporations and governments do have. I do think it makes a bit of a difference.
I don’t quite know what you mean that Claude would be fired if it was a human employee. What exactly is this counterfactual? Empirically, people find it useful to have Claude and will pay for it despite the behaviors you name. From a legal perspective it’s trivial to fire AIs but harder to fire humans. I agree if Claude was as expensive-per-token as a human + took as long to onboard as a human + took as long to produce large amounts of code as a human + had to take breaks like a human + [...], while otherwise having the same kind of performance, then almost no one would use Claude.
If I were to make statements like that (which I haven’t exactly), I would be referring to superintelligence misalignment risks specifically, as that seems like by far the tightest bottleneck on surviving futures. The linked paper says:
Which don’t seem like the class of approach which could be sufficient for handling superintelligence-level optimization, for reasons I’m sure you’re tracking given you later say:
Do you have a plan for superintelligence misalignment risks?
Some reactions:
I don’t think it makes sense to “have a plan” in the sense that is used in this community. See also disagreement #26.
Nonetheless to the extent I personally (not necessarily GDM!) “have a plan”, it might be “continually forecast capabilities and risks for some time out into the future, figure out how to address them, iterate”. If “have the AIs do our alignment homework” is a plan, then this should count as a plan too.
For misalignment in particular, I think the lines of defense outlined in the paper could scale to superintelligence (mostly on the alignment side). But I am not so dumb as to think that I have clearly foreseen every issue that might come up, so of course I should expect to be surprised and for other stuff I haven’t thought of to be important as well.
(Inevitably someone is going to say “you only get one try” or some such. The actual sensible point there is “at some point your approach has to generalize from AIs that can’t take over to AIs that can”. I agree by that point you need to have dealt with the issues. But that generalization gap is much smaller than the generalization gap between Gemini 3 and superintelligence.)
Iirc, “novel risks from superintelligence” wasn’t meant to gesture at misalignment, but rather other risks that come up that aren’t misalignment.
Hum, I usually expect that large complex important projects should have a roadmap, some sketch of the future that goes well with details to fill in. The more detailed it is, the more we check it for consistency and likelihood to work. Does this match you general experience with planning projects trying to achieve a goal?
What you say there looks like an extremely vague and high level roadmap that sounds to me like ‘we’ll figure out our plan as we go as data comes in’, plus automated alignment.
I would be really enthusiastic for you and your team to try unblurring that roadmap, and seeing what difficulties you find at superintelligence level on the current path.
No.
It does match my general experience with moderate tactical projects (say, projects that involve up to about 10 person-years of research effort). But not for large complex important projects.
(And e.g. this is very much not the standard advice for startups, which also have the problem of doing something novel.)
Well yes, it’s an aside in a LessWrong comment that I dashed off in a few minutes.
There is also a 100+ page paper that I linked in the original post, that goes into a fair amount of detail on what the various risks and mitigations might look like. In my experience, nobody outside of GDM really seems to care about its consistency or likelihood to work (except inasmuch as people dismiss it without reading it because of a prior that anything proposed currently will not work).
Okay, that is a position which there might be good arguments for, but that seems important to say loudly and clearly, both inside GDM and outside, that you do not have a plan or roadmap for superintelligence misalignment (even if you don’t think you should have one). If nothing else, this is the kind of thing your leadership should be made aware of explicitly, so they can either adjust that or use it in their own public communications to try and reduce race dynamics.
Okay, would you like to bet on whether some of the largest research programs had plans going into them? I haven’t checked, but I would put at least 10:1 odds that if we pick say 3 projects like Apollo Program, Manhattan Project, and others on a similar scale and type they will all have had a high level roadmap of things to try which could plausibly address the core challenges quite early on[1], even if a lot of details ended up changing when they ran into reality.
When I ask a plain no special prompting history off AI to summarize, it says:
Is there a different document which does focus on either different approaches which are aimed at superintelligence, or analyzing whether these approaches are actually fit for that challenge? Or is this summary incorrect, in a way it would be much easier for you to point out and quote the relevant sections, as an author of the paper, than me as someone who would have to read it from scratch and also currently does not expect to find things which explicitly address the most difficult bottleneck in those 100 pages.
(I am genuinely glad you’re engaging, but I am not reassured so far, and encourage you to look at the stack of how you’re evaluating this specific concern I’m raising and see if you’re running a truth-seeking process which would, if I had a fair point, be able to notice)
Let’s say a collection of core technical problems to be solved, and a set of plausible solutions to try (perhaps all of which were discarded, but were a starting point for exploration).
By this standard there is totally a plan / roadmap which is elaborated in that paper.
But also this notion of a plan / roadmap has approximately no relation to the way “plan” is used in AI safety discourse in my experience.
EDIT: There’s a 10 page executive summary you could read. Or you could read Section 6 on misalignment. Within that probably Amplified Oversight is the most relevant section. But I also don’t expect that this will change your mind ~at all because it isn’t really written with you as the intended audience. The AI summary is sometimes wrong/mistaken, sometimes correct but missing the point, and occasionally correct in a non-misleading way.
I think it is reasonable to say that your plan is not ‘have the AIs do your homework’ to the extent that your research has been roadmapped by humans. This is a spectrum, here’s some points on it:
Using AIs to monitor for potential reward hacks, or using AIs as automated interpretability agents. At a stretch, maybe I’d call this ‘have the AIs do our reward hacking homework’ or similar. We’re here already.
Using AIs to come up with new debate algorithms that better satisfy desiderata outlined by humans, or SAE variants that perform better on SAEbench. We might call this ‘have the AIs do our algorithm design homework’.
Using the AIs to make progress on a goal we can’t articulate super precisely. eg make progress in interpretability/design training processes that incentivise honesty better (including iterating on algorithms and desiderata/evals, and whatever else interpretability researchers do). Insofar as we’ve pulled out a particular approach to AI alignment, and we sorta know what we’re looking for here, Maybe we could call this ‘have the AIs do our interpretability/honesty training homework’.
The distinction between (3) and (4) is pretty fuzzy, and if you’re expecting ‘doing our interpretability homework’ to involve a bunch of conceptual research and ultimately radically different techniques, and humans are hardly involved in any of that, then it’s getting closer to what I think we should mean by ‘doing our alignment homework’.
Throwing up our hands, fully handing off every part of the research process (except possibly a small amount of checking the AIs’ work), and just asking the AIs to make sure that future AI systems aren’t misaligned. This is definitely getting AIs to do our alignment homework.
We can avoid ending up towards the latter end of the spectrum in two ways that I can think of:
We expect the most scalable alignment approaches of today to scale as far as capabilities does. It’s not clear to me from your message whether you think this is true?
If our most scalable current approaches don’t scale all the way through takeoff, then we (humans) will be able to be keeping up and being critically intellectually involved with developing new alignment techniques when our existing ones stop scaling, at least at the level of granularity of (3) so that we can continue to leverage AIs in ways (1-3) above. We need to keep being able to do this indefinitely, or until we come up with an approach to alignment that does scale as far as capabilities will.
Both of these possibilities seem quite fraught to me. If you believe (1), I’d be grateful if you could point me in the direction of arguments that we have approaches that will scale all the way (and if you just mean that interpretability, broadly construed enough to include arbitrary conceptual progress and changes, will scale, then as I mention above I think it’s reasonable-ish to call that giving the AIs our alignment homework’). If you believe (2), is this because you’re expecting GDM/the world to dramatically slow down the intelligence explosion for some time to allow humans to keep up, or because you think humans will have an easy time keeping up, or for some other reason?
To be clear, I am not trying to argue here that ‘getting AIs to do our alignment homework’ is a bad plan (I think it is scary as hell but the best course of action likely involves a bunch of it). I’m just trying to articulate why I think that if your plan doesn’t involve that step, then it is likely to be baking in some assumptions that seem dubious to me and are IMO worth stating explicitly.
Rohin Shah’s position seems perfectly clear to me.
More research good. AI will speed up research along the whole range of 1-4.
>> If our most scalable current approaches don’t scale all the way through takeoff, then we (humans) will be able to be keeping up and being critically intellectually involved with developing new alignment techniques when our existing ones stop scaling, at least at the level of granularity of (3) so that we can continue to leverage AIs in ways (1-3) above. We need to keep being able to do this indefinitely, or until we come up with an approach to alignment that does scale as far as capabilities will.
This seems a weird way to phrase things. Or seems to be predicated on some assumptions I find surprising. It seems to suggest that not only that alignment is a continual neverending process [likely] but that finding alignment techniques is a continual neverending process ..!
If we would analogize ‘alignment’ to ‘steering rockets so they hit their target’ this would suggest that not only that rockets need to be continually steered and course-corrected to hit their target—the entire field of rocket steering needs continual updating and course-correcting. There is a perspective from which this is true since as humanity invents faster and better rockets—there is another perspective from which this is sort of misleading: future rockets will be governed by the Laws of Newton and steering comes down to some variant on Kalman filters and the like.
Why? It’s been observed in the past that lots of people in AI safety are mostly hoping someone else will do the hard part. If there are parts of the problem which are beyond one’s own skill (which you may or may not feel is true in your case, to be fair) then the problem of successfully locating someone smart enough to do it, and mentoring them through the work, is very hard! (Plus what if they also choose the plan “find another smart person to do it”) Whether or not you think this is a true critique, I think it’s a legitimate one.
That seems fair, and I appreciate the clarification. The plan isn’t to have the AI do your homework, but to hire it, alongside people, to help with that homework.
The more reasonable (and IMO common) form of the worry is that developers, including GDM, don’t seem to have a plan that extends to really-dangerous AI. So by default they’ll wind up leaning heavily on AI to “do the homework.” That would be risky, for fairly obvious reasons.[1]
This seems more likely to work if we make differential progress in reliability. I recently laid out how Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. I discussed how this might make them actually-helpful for assisting in the conceptual challenges of alignment.
I can see good reasons that GDM’s full alignment plan wouldn’t be made public.
One worry about leaning on AI assistance for conceptual alignment research is summed up in Wentworth’s The Median Doom-Path: Slop, not Scheming. Smarter-than-human but sloppy AI is likely to create convincing but faulty conceptual work. And of course scheming is still a risk if the alignment, interpretability, amplified oversight, and control measures aren’t done carefully enough. I worry that pressure for progress could become make it very hard to be careful enough, despite best intentions.