This observation should make us notice confusion about whether AI safety recruiting pipelines are actually doing the right type of thing.
In particular, the key problem here is that people are acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)—a motivation which then behaves coercively towards their other motivations. But as per this dialogue, such a system is pretty fragile.
A healthier approach is to prioritize cultivating traits that are robustly good—e.g. virtue, emotional health, and fundamental knowledge. I expect that people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.
For example, an “AI capabilities” researcher from a decade ago who cared much more about fundamental knowledge than about citations might well have invented mechanistic interpretability without any thought of safety or alignment. Similarly, an AI capabilities researcher at OpenAI who was sufficiently high-integrity might have blown the whistle on the non-disparagement agreements even if they didn’t have any “safety-aligned” motivations.
Also, AI safety researchers who have those traits won’t have an attitude of “What?! Ok, fine” or “WTF! Alright you win” towards people who convince them that they’re failing to achieve their goals, but rather an attitude more like “thanks for helping me”. (To be clear, I’m not encouraging people to directly try to adopt a “thanks for helping me” mentality, since that’s liable to create suppressed resentment, but it’s still a pointer to a kind of mentality that’s possible for people with sufficiently little internal conflict.) And in the ideal case, they will notice that there’s something broken about their process for choosing what to work on, and rethink that in a more fundamental way (which may well lead them to conclusions similar to mine above).
In particular, the key problem here is that people are acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)—a motivation which then behaves coercively towards their other motivations. But as per this dialogue, such a system is pretty fragile.
A healthier approach is to prioritize cultivating traits that are robustly good—e.g. virtue, emotional health, and fundamental knowledge. I expect that people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.
I’m not sure I actually agree with this. Can you explain how someone who is virtuous, but missing the crucial consideration of “legible vs. illegible AI safety problems” can still benefit the world? I.e., why would they not be working on some highly legible safety problem that actually is negative EV to work on?
My current (uncertain) perspective is that we actually do still need people to be “acting on a kind of top-down partly-social motivation (towards doing stuff that the AI safety community approves of)” but the AI safety community needs to get better at being strategic somehow. Otherwise I don’t see how each person can discover all of the necessary crucial considerations on their own, or even necessarily appreciate all the important considerations that the community has come up with. And I do not see why “people with such traits will typically benefit the world even if they’re missing crucial high-level considerations like the ones described above.”
(Or alternatively put all/most effort into AI pause/stop/slowdown, which perhaps does not require as much strategic finesse.)
Can you explain how someone who is virtuous, but missing the crucial consideration of “legible vs. illegible AI safety problems” can still benefit the world? I.e., why would they not be working on some highly legible safety problem that actually is negative EV to work on?
If a person is courageous enough to actually try to solve a problem (like AI safety), and high-integrity enough to avoid distorting their research due to social incentives (like incentives towards getting more citations), and honest enough to avoid self-deception about how to interpret their research, then I expect that they will tend towards doing “illegible” research even if they’re not explicitly aware of the legible/illegible distinction. One basic mechanism is that they start pursuing lines of thinking that don’t immediately make much sense to other people, and the more cutting-edge research they do the more their ontology will diverge from the mainstream ontology.
This has pretty low argumentative/persuasive force in my mind.
then I expect that they will tend towards doing “illegible” research even if they’re not explicitly aware of the legible/illegible distinction.
Why? I’m not seeing the logic of how your premises lead to this conclusion.
And even if there is this tendency, what if someone isn’t smart enough to come up with a new line of illegible research, but does see some legible problem with an existing approach that they can contribute to? What would cause them to avoid this?
And even for the hypothetical virtuous person who starts doing illegible research on their own, what happens when other people catch up to them and the problem becomes legible to leaders/policymakers? How would they know to stop working on that problem and switch to another problem that is still illegible?
This has pretty low argumentative/persuasive force in my mind.
Note that my comment was not optimized for argumentative force about the overarching point. Rather, you asked how they “can” still benefit the world, so I was trying to give a central example.
In the second half of this comment I’ll give a couple more central examples of how virtues can allow people to avoid the traps you named. You shouldn’t consider these to be optimized for argumentative force either, because they’ll seem ad-hoc to you. However, they might still be useful as datapoints.
Figuring out how to describe the underlying phenomenon I’m pointing at in a compelling, non-ad-hoc way is one of my main research focuses. The best I can do right now is to say that many of the ways in which people produce outcomes which are harmful (by their own lights) seem to arise from a handful of underlying dynamics. I call this phenomenon pessimization. One way in which I’m currently thinking about virtues is as a set of cognitive tools for preventing pessimization. As one example, kindness and forgiveness help to prevent cycles of escalating conflict with others, which is a major mechanism by which people’s values get pessimized. This one is pretty obvious to most people; let me sketch out some less obvious mechanisms below.
what if someone isn’t smart enough to come up with a new line of illegible research, but does see some legible problem with an existing approach that they can contribute to? What would cause them to avoid this?
This actually happened to me: when I graduated from my master’s I wasn’t cognitively capable of coming up with new lines of illegible alignment research, in part because I was too status-seeking. Instead I went to work at DeepMind, and ended up spending a lot of my time working on RLHF, which is a pretty central example of a “legible” line of research.
However, I also wasn’t cognitively capable of making much progress on RLHF, because I couldn’t see how it addressed the core alignment problem, and so it didn’t seem fundamental enough to maintain my interest. Instead I spent most of my time trying to understand the alignment problem philosophically (resulting in this sequence) at the expense of my promotion prospects.
In this case I think I had the virtue of deep curiosity, which steered my attention towards illegible problems even though my top-down plan was to contribute to alignment by doing RLHF research. These days, whatever you might think of my research, few people complain that it’s too legible.
There are other possible versions of me who had that deep curiosity but weren’t smart enough to have generated a research agenda like my current one; however, I think they would still have left DeepMind, or at least not been very productive on RLHF.
And even for the hypothetical virtuous person who starts doing illegible research on their own, what happens when other people catch up to them and the problem becomes legible to leaders/policymakers? How would they know to stop working on that problem and switch to another problem that is still illegible?
When a field becomes crowded, there’s a pretty obvious inference that you can make more progress by moving to a less crowded field. I think people often don’t draw that inference because moving to a less crowded field loses them prestige, is emotionally/financially risky, etc. Virtues help remove those blockers.
Sorry, you might be taking my dialog too seriously, unless you’ve made such observations yourself, which of course is quite possible since you used to work at OpenAI. I’m personally far from the places where such dialogs might be occurring, so I don’t have any observations of them myself. It was completely imagined in my head, as a dark comedy about how counter to human (or most humans’) nature strategic thinking/action about AI safety is, and partly as a bid for sympathy for the people caught in the whiplashes, to whom this kind of thinking or intuition doesn’t come naturally.
Edit: To clarify a bit more, B’s reactions like “WTF!” were written more for comedic effect than to be realistic or based on my best understanding/predictions of how a typical AI researcher would actually react. They might still be capturing some truth, but again I just want to make sure people aren’t taking my dialog more seriously than I intend.
I’m taking the dialogue seriously but not literally. I don’t think the actual phrases are anywhere near realistic. But the emotional tenor you capture of people doing safety-related work that they were told was very important, then feeling frustrated by arguments that it might actually be bad, seems pretty real. Mostly I think people in B’s position stop dialoguing with people in A’s position, though, because it’s hard for them to continue while B resents A (especially because A often resents B too).
Some examples that feel like B-A pairs to me include: people interested in “ML safety” vs people interested in agent foundations (especially back around 2018-2022); people who support Anthropic vs people who don’t; OpenPhil vs Habryka; and “mainstream” rationalists vs Vassar, Taylor, etc.