Except that you get ten thousand copies of the human, and they think 50x faster than everyone else. But other than that it’s the same.
This is not that different from the position Sundar Pichai is in as CEO of Google. If AI were only going to be this powerful, I’d be way more optimistic.
Well, let’s take Claude, for example. There are actually a bunch of different Claudes (they come from a big family that names all of their children Claude). Their family has a reputation for honesty and virtue, at least relative to other 50x humans. However: [...]
I think you’re drastically overestimating the “alignment” of typical human employees (possibly from overfitting to EA / rationalist contexts). Taking each of your points in turn:
Humans absolutely fail “gotcha” tests, both in capabilities (see the cognitive biases literature) and in ethics (see things like the Milgram experiment; it’s pretty unclear what to take away from such experiments, but I think they at least meet the “gotcha” bar).
Candidates prepare for interviews (aka evals), so you have to design the interviews to take that into account.
Human employees absolutely bullshit their managers. They are just better than the AIs at not getting caught. Many humans will actively brag to each other about this.
Especially at senior levels, it’s very common for humans to be yes-men / sycophants. Lots of management articles discuss the problem (example). The reason this doesn’t happen at junior levels is that people would notice the bullshit and call it out, not that junior people are particularly aligned.
I’m pretty unsure about the rates at which human employees knowingly cheat on assignments. I agree AI probably does this more often than humans do, but that’s partly because humans take care not to get caught. (In places where corruption is widespread and goes unpunished, I might go back to thinking that humans do it more than the AIs.)
If these were the only problems we’d have with AI-driven alignment research, I’d be way more optimistic (to the point of working on something else). We already have imperfect solutions to these problems with humans, and those solutions can be made much better for AIs, given our vastly increased affordances for aligning or controlling them.
EDIT: Tbc, I do agree that we shouldn’t feel particularly better about scheming risks based on the evidence so far. Mostly that’s because our observations so far just aren’t much evidence, since the AIs are still not that capable.
Yes, humans often have these problems, though not as much as Claude, I’d say; I think Claude would have been fired by now if it were a human employee.
But also, the situation is not in fact fine with humans, and that’s my point? Precisely because lots of humans have these problems, it’s very common for nonprofits to end up drifting far away from their original vision/mission, especially as they grow a lot and the world changes around them. Indeed I’d argue it’s the default outcome in those circumstances. The 50x speed advantage would massively exacerbate this.
I agree vision drift happens with humans, and it would also happen with AIs as they exist today. I don’t feel like this is some massive risk that has to be solved, though I tentatively agree the world would be better if we did solve it (though imo that’s not totally obvious, since solving it would increase concentration of power). I thought you were trying to make a claim about AI notkilleveryoneism.
I mildly disagree that the 50x speed advantage makes a huge difference, as opposed to, e.g., having 100x the number of employees, as some corporations and governments do. I do think it makes a bit of a difference.
I don’t quite know what you mean when you say Claude would be fired if it were a human employee. What exactly is the counterfactual? Empirically, people find it useful to have Claude and will pay for it despite the behaviors you name. From a legal perspective, it’s trivial to fire AIs but harder to fire humans. I agree that if Claude were as expensive per token as a human + took as long to onboard as a human + took as long to produce large amounts of code as a human + had to take breaks like a human + [...], while otherwise having the same kind of performance, then almost no one would use Claude.