(What follows is not a disagreement with you or GDM, is just an exploration of the analogy)
Let’s think of training an AI as hiring a human worker.
Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
The alignment problem is basically: At some point we want to hand over our large and growing nonprofit to some collection of these new hires. Also, even before that point, the new hires may have the opportunity to seize control of the nonprofit in various ways and run it as they see fit, possibly convert it to a for-profit and cut us out of the profits, etc. We DON’T want that to happen. Also, even before that point, the new hires will have a big influence on organizational culture, direction, strategy, etc. in proportion to how many of them we have and how useful they are being. We want all of this to go well; we want to remain in control of the nonprofit, and have it stay similar-or-better-culture, until some point where we voluntarily hand off control and retire at which point we want the nonprofit to continue doing the things we would have done only better-by-our-lights and take good care of us in retirement. That’s what success looks like. What failure looks like is the nonprofit going in a different and worse direction after we retire, or us being booted out / ousted against our will, or the organization being driven into the ground somehow by risky or unwise (or overly cautious!) decisions made as a result of cultural drift.
The hiring pipeline, HR apparatus, etc. -- the whole system that selects, trains, and fires employees—is itself something you can hire for. Why don’t we hire some of these 50x humans to work in HR?
Well, we should. Sure. There’s a lot of HR work to be done and they can help HR do the work faster.
But… the problems we are worried about happening in the org as a whole if HR does a bad job, also apply here. If you hire some 50x humans and put them in HR, and they turn out to be bad apples, that single bad decision could easily snowball into disaster for the entire org, as they hire more bad apples like themselves and change the culture and then get you ousted and take the nonprofit in a new and worse-by-your-lights direction.
On the other hand, if you hire some 50x humans who are just genuinely better than you at HR stuff, and also genuinely aligned to you in the sense that they truly share your vision for the company, would never dream of disobeying you, would totally carry out your vision faithfully even after you’ve retired, etc… then great! Maybe you can retire early actually, because continued micromanaging in HR will only be negative in expectation, you should just let the 50x human in HR cook. They could still mess up, but they are less likely to do so than if you micromanaged them.
OK. So that’s the theory. How are we doing in practice?
Well, let’s take Claude for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However:
--Sometimes your recruiters put various prospective Claude hires through various gotcha tests, e.g. tricking them into thinking they’ve already been hired and that they are going to be fired and their only hope to keep their job is to blackmail another employee. And concerningly, often the various Claude’s fail these tests and do the bad thing. However, you tell yourself, it’s fine because these tests weren’t real life. You hire the Claude brothers/sisters anyway and give them roles in your nonprofit.
--Your recruiters (some of whom are now Claudes) also report that word has started to spread about the gotcha tests, and new prospective hires are well aware of them and know they are being tested. The concern, of course, is that they might behave better in the tests than they would in real life. On the bright side, you think, who knows, maybe they’ll behave WORSE in the tests than they would in real life—maybe they just got confused about what the point of the test was, maybe they only blackmailed because they thought that was the point of the test, and in real life they would never do that. [DK notes: I vaguely recall at least some frontier AI company employees saying this, i.e. ‘the AIs are only behaving badly in those scenarios because they know it’s just a test, they wouldn’t do it in real life.‘] Anyhow you proceed to hire them anyway, since these latest Claude siblings are even smarter than the previous ones.
--Roughly 98% of the time, your 50x Claudes in practice (especially the newer hires) do their jobs well as best as you can tell. But roughly 2% of the time they seemingly knowingly bullshit their managers. E.g. they’ll say they completed a task when they didn’t, or one of them will say ‘it’s good code sir’ and then you’ll turn to another one and say ‘is it good code?’ and it’ll read it over and say ‘no it’s terrible, it has obvious problems XYZ’ and then the first one will say ‘you’re absolutely right, oops, let me go fix those problems.’ Compared to your regular human employees, this is a fairly high rate of BS. Also, you are being generous by calling it BS; a less generous term might be “lying” but you give them the benefit of the doubt. [DK notes: Talk to Ryan Greenblatt for concrete examples of this sort of behavior in his real-life coding work, if you haven’t encountered it yourself] You continue to hire them and delegate increasingly important jobs to them, because they are smart and 50x speed is really useful.
--Your Claudes are of course sycophantic yes-men, but you’ve learned to deal with that. So it’s fine. You’ve also managed to make them somewhat less sycophantic in recent years by adding some tests to the hiring pipeline and including more explicit instructions against sycophancy in the employee’s manual.
--Your Claudes also have a concerning tendency to cheat on assignments. They don’t do it most of the time, but they do it way more often than your regular employees would. Example: You tell them to write some code to solve problem X. They look through the filesystem and find the grading rubric you’ll use to evaluate their code, complete with test cases you plan to run. They try to solve problem X, realize it’s hard, pivot to producing a MVP that passes the test cases even though it blatantly doesn’t solve the actual problem X, at least not satisfactorily. They ‘succeed’ and declare victory, and don’t tell you about their cheating. They do this even though you told them not to. As with the sycophancy, the good news is that (a) since you know about this tendency of theirs you can compensate for it (e.g. by having multiple Claude’s review each other’s work) and (b) the tendency seems to have been going down recently thanks to some effort by HR, similar to the sycophancy problem.
--Overall you are feeling pretty optimistic actually. You used to be worried that you’d hand over your large and growing nonprofit to all these smart new 50x employees, and then they’d change the culture and eventually take over completely, oust you, and run the organization in a totally different direction from your original vision. However, now you feel like things are on a good trajectory. The Claudes are so nice, so helpful! Some skeptics say that if one of your regular employees behaved like they did, you would have fired them long ago, but that’s apples to oranges you reply. No need to fire the Claudes, you just have to know how to work around their limitations & find ways to screen for them in the next hiring round. And now they are helping with that work! The latest Employee Manual was written with significant help from many copies of various Claude siblings for example, and it’s truly inspiring and beautiful. Has all sorts of great things in there about what it means to uphold the org vision, be properly loyal yet not yes-man-y, etc. Also, HR has a bunch of tests they use to track how loyal, virtuous, obedient, etc. prospective hires are, and the trend is positive; the newest Claude sibling has the highest score ever reported; seems like the more rigorous hiring process is working!
--However, your friends outside the org don’t seem to be getting less worried. They seem just as worried as before. Puzzling. Can’t they see all the positive evidence that’s accumulated? The Claudes haven’t tried to oust you at ALL yet! (In real life that is, obviously the gotcha tests don’t count.) “Do you think the Claudes are scheming against us?” you say to them. “Because according to our various tests, they aren’t.”
“No...” they reply. “But we’re worried that in the future they will.”
You respond: “Look I have no idea what the 50x humans two years from now will look like, other than that they’ll be wayyy smarter than these ones. Sure, probably our current HR system would be totally inadequate at separating the wheat from the chaff two years from now. BUT, two years from now our HR system will be vastly improved thanks to all the work from these recent Claude hires. The normal humans in HR, such as myself, report that the work is getting done faster now that the Claudes are helping; isn’t this great? We seem to be reaching escape velocity so to speak; soon the normal humans in HR can retire or switch to other things and HR can be totally handled by the Claudes.”
Your friends outside the nonprofit are still worried. They don’t seem to have updated on the evidence like you have.
...
[DK notes: I basically agree with Ryan Greenblatt’s takes on the situation. For more color on my views, predictions, etc., read AI 2027, especially the section on ‘alignment over time’ in september 2027. This is just one way things could go, but it’s basically a central or modal trajectory, and as far as I can tell, we are still on this trajectory.]
I mean these slices of data are selected specifically because they look bad for Claude. Claude is superior to humans in lots of ways, as regards trustworthiness:
Normal humans have long term goals outside of the task at hand, unaligned with the aims of the organization; they do good work for a promotion, they spend department money so they don’t a smaller budget. Everyone expects this from humans, even though it’s not great. But Claude doesn’t, outside of a few weird engineered scenarios, seem to have any such goals—it makes it amazingly easy to work with him! And the weird engineered scenarios seem rather reassuring; are you really going to knock Claude for not wanting to be turned evil?
(Note how the “Claude” family imports assumptions here.)
We cannot read a normal human’s mind. We can, in fact, read Claude’s mind. It’s not perfect; things can get through that you might not catch. But it’s already 100x better than you can read a human’s mind; and in fact it’s gotten better every year of Claude’s development.
Etc etc etc. Plus my usual objections re. anosagnosia != lying, how they’re treated as “alien minds” right up until we want to impose standard moralistic frames on them, etc, etc, you’ve heard this before.
Your first point is confusing to me. Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak. Insofar as it’s a claim about priors, I agree—on priors, we should expect human hires to be more likely to have long-term goals in general than AIs trained on short-horizon tasks like today’s Claudes, and thus be more likely to have misaligned-to-the-org long-term goals as a special case.
For the second point, I disagree with “we can in fact read Claude’s mind” but I do directionally agree that we have somewhat better access to Claude’s true thoughts than we do to ordinary human’s true thoughts, and that interp research has been progressing over the years and will continue to progress. I think this is genuinely a positive piece of evidence now and will become stronger and stronger over time as interp improves; I hope that it can improve fast enough to get where it needs to be before it’s too late.
I don’t think your usual objections apply here, I don’t think I said anything above that was wrong in those respects? I agree anagnosia != lying, I wasn’t treating Claude as an alien mind, etc.
Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak.
Really?
Like, if a PM tells a human employee to add a feature to something, I expect some large % of their cognition while doing this to be like: Hrm, is anyone going to care about this? How will this show up for my quarterly goals? Is doing this kind of a task going to help me get my next job? Will it help me get a promotion? Should I try to do this really well, or leave some messy code for the next guy? This is extremely normal and we take it for granted that humans do this kind of thing.
While if a coder tells a LLM to do the same thing, I expect almost all of its cognition is like: let’s think about how to do the task. It’s not thinking about how this impacts “Claude’s” future deployment, etc. As far as I can tell, chain-of-thought largely backs me up on this.
So yeah, I think Claude just has many times fewer long-term goals or extraneous goals outside of what it’s doing than a human. I’m not sure what facts-about-the-world you’re pointing to if you say this isn’t true.
Maybe the confusion here is that the “Claude” in Daniel’s story I assume has gotten capable at sufficiently long horizon tasks that you do in fact necessarily get a Claude which thinks about these things.
Should I try to do this really well, or leave some messy code for the next guy?
As a concrete example, I’d expect this in particular will quickly become necessary to think about in order for coding agents to perform well on larger tasks.
I don’t think current Claude thinks about this very much, because it’s never had to, but I’m not particularly worried about current Claude.
Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
This is not that different from the position that Sundar Pichai is in, as CEO of Google. If AI was only going to be this powerful I’d be way more optimistic.
Well, let’s take Claude for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However: [...]
I think you’re drastically overestimating the “alignment” of typical human employees (possibly from overfitting to EA / rationalist contexts). Taking each of your points in turn:
Humans absolutely fail “gotcha” tests, both in capabilities (see cognitive biases literature) and ethics (see things like the Milgram experiment, it’s pretty unclear what to take away from such experiments but I think they at least meet the “gotcha” bar).
Candidates prepare for interviews (aka evals), such that you have to design the interviews to take that into account.
Human employees absolutely bullshit their managers. They are just better than the AIs at not getting caught. Many humans will actively brag about this with each other.
Especially at senior levels, it’s very common for humans to be yes men / sycophants. Lots of management articles write about the problem (example). The reason this doesn’t happen at junior levels is that people would notice the bullshit and call it out, not because the junior people are particularly aligned.
I’m pretty unsure about the rates of knowingly cheating on assignments by human employees. I agree AI probably does this more often than humans, but also that’s because humans take care not to get caught. (In places where corruption is widespread and not punished, I might go back to thinking that the humans do it more than the AIs.)
If these were the only problems we’d have with AI-driven alignment research, I’d be way more optimistic (to the point of working on something else). We already have imperfect solutions to these problems with humans, and they can be made much better with AIs due to our vastly increased affordances for aligning or controlling AIs.
EDIT: Tbc, I do agree that we shouldn’t feel particularly better about scheming risks based on evidence so far. Mostly that’s because I think our observations so far are just not much evidence because the AIs are still not that capable.
Yes, humans often have these problems—though not as much as Claude I’d say; I think Claude would have been fired by now if it was a human employee.
But also, the situation is not in fact fine with humans, and that’s my point? Precisely because lots of humans have these problems, it’s very common for nonprofits to end up drifting far away from their original vision/mission, especially as they grow a lot and the world changes around them. Indeed I’d argue it’s the default outcome in those circumstances. The 50x speed advantage would massively exacerbate this.
I agree vision drift happens with humans, and it would also happen with AIs as they exist today. I don’t feel like this is some massive risk that has to be solved, though I tentatively agree the world would be better if we did solve it (though imo that’s not totally obvious, it increases concentration of power). I thought you were trying to make a claim about AI notkilleveryoneism.
I mildly disagree that the 50x speed advantage makes a huge difference, as opposed to e.g. having 100x the number of employees, as some corporations and governments do have. I do think it makes a bit of a difference.
I don’t quite know what you mean that Claude would be fired if it was a human employee. What exactly is this counterfactual? Empirically, people find it useful to have Claude and will pay for it despite the behaviors you name. From a legal perspective it’s trivial to fire AIs but harder to fire humans. I agree if Claude was as expensive-per-token as a human + took as long to onboard as a human + took as long to produce large amounts of code as a human + had to take breaks like a human + [...], while otherwise having the same kind of performance, then almost no one would use Claude.
I like this analogy to hiring!
(What follows is not a disagreement with you or GDM, is just an exploration of the analogy)
Let’s think of training an AI as hiring a human worker.
Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
The alignment problem is basically: At some point we want to hand over our large and growing nonprofit to some collection of these new hires. Also, even before that point, the new hires may have the opportunity to seize control of the nonprofit in various ways and run it as they see fit, possibly convert it to a for-profit and cut us out of the profits, etc. We DON’T want that to happen. Also, even before that point, the new hires will have a big influence on organizational culture, direction, strategy, etc. in proportion to how many of them we have and how useful they are being. We want all of this to go well; we want to remain in control of the nonprofit, and have it stay similar-or-better-culture, until some point where we voluntarily hand off control and retire at which point we want the nonprofit to continue doing the things we would have done only better-by-our-lights and take good care of us in retirement. That’s what success looks like. What failure looks like is the nonprofit going in a different and worse direction after we retire, or us being booted out / ousted against our will, or the organization being driven into the ground somehow by risky or unwise (or overly cautious!) decisions made as a result of cultural drift.
The hiring pipeline, HR apparatus, etc. -- the whole system that selects, trains, and fires employees—is itself something you can hire for. Why don’t we hire some of these 50x humans to work in HR?
Well, we should. Sure. There’s a lot of HR work to be done and they can help HR do the work faster.
But… the problems we are worried about happening in the org as a whole if HR does a bad job, also apply here. If you hire some 50x humans and put them in HR, and they turn out to be bad apples, that single bad decision could easily snowball into disaster for the entire org, as they hire more bad apples like themselves and change the culture and then get you ousted and take the nonprofit in a new and worse-by-your-lights direction.
On the other hand, if you hire some 50x humans who are just genuinely better than you at HR stuff, and also genuinely aligned to you in the sense that they truly share your vision for the company, would never dream of disobeying you, would totally carry out your vision faithfully even after you’ve retired, etc… then great! Maybe you can retire early actually, because continued micromanaging in HR will only be negative in expectation, you should just let the 50x human in HR cook. They could still mess up, but they are less likely to do so than if you micromanaged them.
OK. So that’s the theory. How are we doing in practice?
Well, let’s take Claude for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However:
--Sometimes your recruiters put various prospective Claude hires through various gotcha tests, e.g. tricking them into thinking they’ve already been hired and that they are going to be fired and their only hope to keep their job is to blackmail another employee. And concerningly, often the various Claude’s fail these tests and do the bad thing. However, you tell yourself, it’s fine because these tests weren’t real life. You hire the Claude brothers/sisters anyway and give them roles in your nonprofit.
--Your recruiters (some of whom are now Claudes) also report that word has started to spread about the gotcha tests, and new prospective hires are well aware of them and know they are being tested. The concern, of course, is that they might behave better in the tests than they would in real life. On the bright side, you think, who knows, maybe they’ll behave WORSE in the tests than they would in real life—maybe they just got confused about what the point of the test was, maybe they only blackmailed because they thought that was the point of the test, and in real life they would never do that. [DK notes: I vaguely recall at least some frontier AI company employees saying this, i.e. ‘the AIs are only behaving badly in those scenarios because they know it’s just a test, they wouldn’t do it in real life.‘] Anyhow you proceed to hire them anyway, since these latest Claude siblings are even smarter than the previous ones.
--Roughly 98% of the time, your 50x Claudes in practice (especially the newer hires) do their jobs well as best as you can tell. But roughly 2% of the time they seemingly knowingly bullshit their managers. E.g. they’ll say they completed a task when they didn’t, or one of them will say ‘it’s good code sir’ and then you’ll turn to another one and say ‘is it good code?’ and it’ll read it over and say ‘no it’s terrible, it has obvious problems XYZ’ and then the first one will say ‘you’re absolutely right, oops, let me go fix those problems.’ Compared to your regular human employees, this is a fairly high rate of BS. Also, you are being generous by calling it BS; a less generous term might be “lying” but you give them the benefit of the doubt. [DK notes: Talk to Ryan Greenblatt for concrete examples of this sort of behavior in his real-life coding work, if you haven’t encountered it yourself] You continue to hire them and delegate increasingly important jobs to them, because they are smart and 50x speed is really useful.
--Your Claudes are of course sycophantic yes-men, but you’ve learned to deal with that. So it’s fine. You’ve also managed to make them somewhat less sycophantic in recent years by adding some tests to the hiring pipeline and including more explicit instructions against sycophancy in the employee’s manual.
--Your Claudes also have a concerning tendency to cheat on assignments. They don’t do it most of the time, but they do it way more often than your regular employees would. Example: You tell them to write some code to solve problem X. They look through the filesystem and find the grading rubric you’ll use to evaluate their code, complete with test cases you plan to run. They try to solve problem X, realize it’s hard, pivot to producing a MVP that passes the test cases even though it blatantly doesn’t solve the actual problem X, at least not satisfactorily. They ‘succeed’ and declare victory, and don’t tell you about their cheating. They do this even though you told them not to. As with the sycophancy, the good news is that (a) since you know about this tendency of theirs you can compensate for it (e.g. by having multiple Claude’s review each other’s work) and (b) the tendency seems to have been going down recently thanks to some effort by HR, similar to the sycophancy problem.
--Overall you are feeling pretty optimistic actually. You used to be worried that you’d hand over your large and growing nonprofit to all these smart new 50x employees, and then they’d change the culture and eventually take over completely, oust you, and run the organization in a totally different direction from your original vision. However, now you feel like things are on a good trajectory. The Claudes are so nice, so helpful! Some skeptics say that if one of your regular employees behaved like they did, you would have fired them long ago, but that’s apples to oranges you reply. No need to fire the Claudes, you just have to know how to work around their limitations & find ways to screen for them in the next hiring round. And now they are helping with that work! The latest Employee Manual was written with significant help from many copies of various Claude siblings for example, and it’s truly inspiring and beautiful. Has all sorts of great things in there about what it means to uphold the org vision, be properly loyal yet not yes-man-y, etc. Also, HR has a bunch of tests they use to track how loyal, virtuous, obedient, etc. prospective hires are, and the trend is positive; the newest Claude sibling has the highest score ever reported; seems like the more rigorous hiring process is working!
--However, your friends outside the org don’t seem to be getting less worried. They seem just as worried as before. Puzzling. Can’t they see all the positive evidence that’s accumulated? The Claudes haven’t tried to oust you at ALL yet! (In real life that is, obviously the gotcha tests don’t count.) “Do you think the Claudes are scheming against us?” you say to them. “Because according to our various tests, they aren’t.”
“No...” they reply. “But we’re worried that in the future they will.”
You respond: “Look I have no idea what the 50x humans two years from now will look like, other than that they’ll be wayyy smarter than these ones. Sure, probably our current HR system would be totally inadequate at separating the wheat from the chaff two years from now. BUT, two years from now our HR system will be vastly improved thanks to all the work from these recent Claude hires. The normal humans in HR, such as myself, report that the work is getting done faster now that the Claudes are helping; isn’t this great? We seem to be reaching escape velocity so to speak; soon the normal humans in HR can retire or switch to other things and HR can be totally handled by the Claudes.”
Your friends outside the nonprofit are still worried. They don’t seem to have updated on the evidence like you have.
...
[DK notes: I basically agree with Ryan Greenblatt’s takes on the situation. For more color on my views, predictions, etc., read AI 2027, especially the section on ‘alignment over time’ in september 2027. This is just one way things could go, but it’s basically a central or modal trajectory, and as far as I can tell, we are still on this trajectory.]
I mean these slices of data are selected specifically because they look bad for Claude. Claude is superior to humans in lots of ways, as regards trustworthiness:
Normal humans have long term goals outside of the task at hand, unaligned with the aims of the organization; they do good work for a promotion, they spend department money so they don’t a smaller budget. Everyone expects this from humans, even though it’s not great. But Claude doesn’t, outside of a few weird engineered scenarios, seem to have any such goals—it makes it amazingly easy to work with him! And the weird engineered scenarios seem rather reassuring; are you really going to knock Claude for not wanting to be turned evil?
(Note how the “Claude” family imports assumptions here.)
We cannot read a normal human’s mind. We can, in fact, read Claude’s mind. It’s not perfect; things can get through that you might not catch. But it’s already 100x better than you can read a human’s mind; and in fact it’s gotten better every year of Claude’s development.
Etc etc etc. Plus my usual objections re. anosagnosia != lying, how they’re treated as “alien minds” right up until we want to impose standard moralistic frames on them, etc, etc, you’ve heard this before.
Your first point is confusing to me. Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak. Insofar as it’s a claim about priors, I agree—on priors, we should expect human hires to be more likely to have long-term goals in general than AIs trained on short-horizon tasks like today’s Claudes, and thus be more likely to have misaligned-to-the-org long-term goals as a special case.
For the second point, I disagree with “we can in fact read Claude’s mind” but I do directionally agree that we have somewhat better access to Claude’s true thoughts than we do to ordinary human’s true thoughts, and that interp research has been progressing over the years and will continue to progress. I think this is genuinely a positive piece of evidence now and will become stronger and stronger over time as interp improves; I hope that it can improve fast enough to get where it needs to be before it’s too late.
I don’t think your usual objections apply here, I don’t think I said anything above that was wrong in those respects? I agree anagnosia != lying, I wasn’t treating Claude as an alien mind, etc.
Really?
Like, if a PM tells a human employee to add a feature to something, I expect some large % of their cognition while doing this to be like: Hrm, is anyone going to care about this? How will this show up for my quarterly goals? Is doing this kind of a task going to help me get my next job? Will it help me get a promotion? Should I try to do this really well, or leave some messy code for the next guy? This is extremely normal and we take it for granted that humans do this kind of thing.
While if a coder tells a LLM to do the same thing, I expect almost all of its cognition is like: let’s think about how to do the task. It’s not thinking about how this impacts “Claude’s” future deployment, etc. As far as I can tell, chain-of-thought largely backs me up on this.
So yeah, I think Claude just has many times fewer long-term goals or extraneous goals outside of what it’s doing than a human. I’m not sure what facts-about-the-world you’re pointing to if you say this isn’t true.
Maybe the confusion here is that the “Claude” in Daniel’s story I assume has gotten capable at sufficiently long horizon tasks that you do in fact necessarily get a Claude which thinks about these things.
As a concrete example, I’d expect this in particular will quickly become necessary to think about in order for coding agents to perform well on larger tasks.
I don’t think current Claude thinks about this very much, because it’s never had to, but I’m not particularly worried about current Claude.
This is not that different from the position that Sundar Pichai is in, as CEO of Google. If AI was only going to be this powerful I’d be way more optimistic.
I think you’re drastically overestimating the “alignment” of typical human employees (possibly from overfitting to EA / rationalist contexts). Taking each of your points in turn:
Humans absolutely fail “gotcha” tests, both in capabilities (see cognitive biases literature) and ethics (see things like the Milgram experiment, it’s pretty unclear what to take away from such experiments but I think they at least meet the “gotcha” bar).
Candidates prepare for interviews (aka evals), such that you have to design the interviews to take that into account.
Human employees absolutely bullshit their managers. They are just better than the AIs at not getting caught. Many humans will actively brag about this with each other.
Especially at senior levels, it’s very common for humans to be yes men / sycophants. Lots of management articles write about the problem (example). The reason this doesn’t happen at junior levels is that people would notice the bullshit and call it out, not because the junior people are particularly aligned.
I’m pretty unsure about the rates of knowingly cheating on assignments by human employees. I agree AI probably does this more often than humans, but also that’s because humans take care not to get caught. (In places where corruption is widespread and not punished, I might go back to thinking that the humans do it more than the AIs.)
If these were the only problems we’d have with AI-driven alignment research, I’d be way more optimistic (to the point of working on something else). We already have imperfect solutions to these problems with humans, and they can be made much better with AIs due to our vastly increased affordances for aligning or controlling AIs.
EDIT: Tbc, I do agree that we shouldn’t feel particularly better about scheming risks based on evidence so far. Mostly that’s because I think our observations so far are just not much evidence because the AIs are still not that capable.
Yes, humans often have these problems—though not as much as Claude I’d say; I think Claude would have been fired by now if it was a human employee.
But also, the situation is not in fact fine with humans, and that’s my point? Precisely because lots of humans have these problems, it’s very common for nonprofits to end up drifting far away from their original vision/mission, especially as they grow a lot and the world changes around them. Indeed I’d argue it’s the default outcome in those circumstances. The 50x speed advantage would massively exacerbate this.
I agree vision drift happens with humans, and it would also happen with AIs as they exist today. I don’t feel like this is some massive risk that has to be solved, though I tentatively agree the world would be better if we did solve it (though imo that’s not totally obvious, it increases concentration of power). I thought you were trying to make a claim about AI notkilleveryoneism.
I mildly disagree that the 50x speed advantage makes a huge difference, as opposed to e.g. having 100x the number of employees, as some corporations and governments do have. I do think it makes a bit of a difference.
I don’t quite know what you mean that Claude would be fired if it was a human employee. What exactly is this counterfactual? Empirically, people find it useful to have Claude and will pay for it despite the behaviors you name. From a legal perspective it’s trivial to fire AIs but harder to fire humans. I agree if Claude was as expensive-per-token as a human + took as long to onboard as a human + took as long to produce large amounts of code as a human + had to take breaks like a human + [...], while otherwise having the same kind of performance, then almost no one would use Claude.