The Inhumanity of AI Safety
A: Hey, I just learned about this idea of artificial superintelligence. With it, we can achieve incredible material abundance with no further human effort!
B: Thanks for telling me! After a long slog and incredible effort, I’m now a published AI researcher!
A: No wait! Don’t work on AI capabilities, that’s actually negative EV!
B: What?! Ok, fine, at huge personal cost, I’ve switched to AI safety.
A: No! The problem you chose is too legible!
B: WTF! Alright you win, I’ll give up my sunk costs yet again and pick something illegible. Happy now?
A: No wait, stop! Someone just succeeded in making that problem legible!
B: !!!
This dialogue should make us notice confusion about whether AI safety recruiting pipelines are actually doing the right type of thing.
In particular, the key problem here is that people are acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)—a motivation which then behaves coercively towards their other motivations. But as per this dialogue, such a system is pretty fragile.
A healthier approach is to prioritize cultivating traits that are robustly good—e.g. virtue, emotional health, and fundamental knowledge. I expect that people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.
For example, an “AI capabilities” researcher from a decade ago who cared much more about fundamental knowledge than about citations might well have invented mechanistic interpretability without any thought of safety or alignment. Similarly, an AI capabilities researcher at OpenAI who was sufficiently high-integrity might have blown the whistle on the non-disparagement agreements even if they didn’t have any “safety-aligned” motivations.
Also, AI safety researchers who have those traits won’t have an attitude of “What?! Ok, fine” or “WTF! Alright you win” towards people who convince them that they’re failing to achieve their goals, but rather an attitude more like “thanks for helping me”. (To be clear, I’m not encouraging people to directly try to adopt a “thanks for helping me” mentality, since that’s liable to create suppressed resentment, but it’s still a pointer to a kind of mentality that’s possible for people with sufficiently little internal conflict.) And in the ideal case, they will notice that there’s something broken about their process for choosing what to work on, and rethink that in a more fundamental way (which may well lead them to conclusions similar to mine above).
What is the “huge personal cost” of shifting from AI capabilities to safety? Sure, quitting one’s frontier lab job to become an independent researcher means taking a pay cut, but that’s an opportunity cost, not really an enormous sacrifice. It’s not like any frontier labs would try to claw back your equity … again.