I randomly met Jeff Dean (Google’s lead AI scientist) on my bike ride home today. We were both stuck at a train intersection, and I had a cute kid in tow. We started chatting about my e-bike, the commute, and we got around to jobs. I told him I am a boring tax lawyer. He told me he worked for Google. I pressed a little more, and he explained he was a scientist. I mused, “AI?” and he told me, “Yeah.”
I excitedly told him that I’ve been really interested in alignment the last few months (reading LW, listening to lectures), and it strikes me as a huge problem. I asked him if he was worried.
He told me that he thinks AI will have a big impact on society (some of it worrying) but he doesn’t buy into the robots-taking-over thing.
I smiled and asked him, “What’s your p(doom)?” to which he responded “very low” and said he thinks the technology will do a lot of good and useful things.
I thought maybe this was because he thinks that the technology will hit a limit soon, so I asked him if he thought LLMs would successfully scale. He responded that he thinks a few more breakthroughs are required but there have been lots of breakthroughs over the last 5-10 years, and so the technology is likely to continue improving in the coming years.
I told him again that I am worried about alignment, but even if you solve alignment, you are left with a very obedient superintelligence which would radically change our society and all our politics.
The train finally passed, I thanked him for the conversation, and we were on our way.
I’m new to this group and the topic in general, and so when I got home, I searched “AI google Palo Alto LinkedIn” and Jeff’s picture popped up. I now feel like I bumped into Oppenheimer during the Manhattan Project, but instead of knowing it was Oppenheimer, I spent a majority of the conversation talking about my bike seat.
Anyways, if any of you were looking for a qualitative measure of how much LessWrong has broken through to people, I think one good measure is a tax lawyer asking for Jeff Dean’s p(doom) while he was walking home from work.
lots of very important people spend all day being pestered by people due to their power/importance. at least some of them appreciate occasional interactions with people who just want to chat about something random like their bike seat
I smiled and asked him, “What’s your p(doom)?” to which he responded “very low” and said he thinks the technology will do a lot of good and useful things.

I mean:
would he be Google’s lead AI scientist if he didn’t? He’d have to be insane or incredibly psychopathic. It’s a lot more likely that he just believes that (and if it’s his one giant blind spot on which he’s dead wrong, that doesn’t change much)
supposing he didn’t in fact believe that… would he say so to a random person he just struck up a conversation with? One who could then look up his picture on LinkedIn, connect the dots, and go “GOOGLE’S LEAD AI SCIENTIST SAYS AI WILL KILL US ALL” on the internet?
Unfortunately I think this is a misunderstanding of what a psychopath is.
would he be Google’s lead AI scientist if he didn’t? He’d have to be insane or incredibly psychopathic.

What matters is not whether p(doom) is low or high, but whether his joining GDM would increase or decrease p(doom). If his joining GDM changed p(doom)[1] from 0.5 to 0.499, then it would arguably be a noble act. Alas, there is an obvious counterargument: his belief that he decreases p(doom) by researching at GDM could itself be erroneous.
However, doom could also be a blind spot, as it was with Musk, who skipped red-teaming of Grok to the point of the MechaHitler scandal and Grok’s rants about white genocide in South Africa…
p(doom) alone could also be a misguided measure. Suppose that doom is actually caused by adopting neuralese without having absolutely solved alignment, while creating an alternate, well-monitorable architecture is genuinely hard. If the effort invested in creating that architecture is far from the threshold where it can compete with neuralese, then a single person joining the effort would also likely be acting nobly, but it would alter p(doom) only if lots of people did the same.
While this act does provide dignity in the Yudkowsky sense, one can also imagine a scenario where Anthropoidic doubles down on the alternate architecture while xRiskAI or OpenBrain uses neuralese, wins the capabilities race, and has Anthropoidic shut down.
I think that then goes to my second point though: supposing he did believe that p(doom) is high, and worked as lead AI scientist at Google regardless due to utilitarian calculations, would he talk freely about it to the first passerby?
Politically speaking it would be quite a hefty thing to say. If he wanted to say it publicly, he would do so in a dedicated forum where he gets to control the narrative best. If he wanted to keep it secret he simply wouldn’t say it. Either way, talking about it lightly seems out of the question.
Dario Amodei (Anthropic co-founder and CEO), Shane Legg (co-founder and Chief AGI Scientist of Google DeepMind), and others have numbers that are not plausibly construed as “very low.”
Interesting, thank you for sharing! As someone also newer to this space, I’m curious about estimates for the proportion of people in leading technical positions similar to “lead AI scientist” at a big company who would actually be interested in this sort of serendipitous conversation. I was under the impression that many people in the position of “lead AI scientist” at a big company would be either 1) too wrapped up in thinking about their work and pressing problems or 2) too uninterested in mundane small-talk topics to spend “a majority of the conversation talking about [OP’s] bike seat,” but this clearly provides evidence to the contrary.
Why would being a lead AI scientist make somebody uninterested in small talk? Working on complex/important things doesn’t cause you to stop being a regular adult with regular social interactions!
The question of what proportion of AI scientists would be “interested” in such a conversational topic is interesting and tough; my guess is very high (~85 percent). To become a “lead AI scientist” you have to care a lot about AI and the science surrounding it, and that generally implies you’ll like talking about it and its potential harms/benefits with others! Even if their opinion on x-risk rhetoric is dismissive, that opinion is likely important to them, since it amounts to a moral stance: being a capabilities-advancing AI researcher with a high p(doom) is problematic. You can draw parallels with vegetarianism/veganism: if you eat meat, you have to choose between defending the morality of factory farming, accepting that you are being amoral, or living with extreme cognitive dissonance. If you are an AI capabilities researcher, you have to choose between defending the morality of advancing AI (downplaying x-risk), accepting you are being amoral, or living with extreme cognitive dissonance. I would be extremely surprised if there is a large coalition of top AI researchers who simply “have no opinion” or “don’t care” about x-risk, though this is mostly just intuition and I’m happy to be proven wrong!
Jan Betley, Owain Evans, et al.’s paper on emergent misalignment was published in Nature today (they wrote about the preprint back in February here). Congratulations to the authors. I am glad it will continue getting more exposure.
https://www.nature.com/articles/s41586-025-09937-5
Firstly, well done. Publishing in high-impact journals is notoriously difficult. Getting outsider-legible status is probably good for our ability to shift policy.
Secondly, I’ve been very happy with the AI safety community’s ability to avoid this particular status game so far. Insofar as it’s valuable to be legibly successful to outsiders, publishing is good. I am, however, concerned that this will kick off a scenario where AI safety people get caught up in the zero-sum competition of trying to get published in high-impact journals. Nobody seems to have tried too hard before, and I would guess this paper was in part riding on the fact that the entire field is fairly novel to the Nature editors. Some part of me feels like publishing here was a defection, albeit a small one. I expect it will only get more difficult to publish in famous journals from here on out (for people other than Owain; once you have one Nature paper it’s easier to get a second), as they see more and more AI safety papers.
I think it would be very bad for everyone if “has published in a high-impact journal” becomes a condition to get a job or a grant.
My gut says the benefit of outsider-legible status outweighs the risk of dumb status games. I first found out about the publication from my wife, who is in a dermatology lab at a good university. Her lab was sharing and discussing the article across their Slack channel. All scientists read Nature, and it’s a significant boost in legibility to have something published there.
Edit: Hopefully, the community can both raise the profile of these issues and avoid status competitions, so I don’t disagree with the point of the original comment!
I am, however, concerned that this will kick off a scenario where AI safety people get caught up in the zero-sum competition of trying to get published in high-impact journals.

I would be very surprised if this happened. In ML, the competition is much more focused on getting published at top conferences (NeurIPS, ICLR, ICML). Even setting aside the reasons why that happened (among other things, journals being much slower), I think it’s pretty unlikely the AI safety community sees such a strong break with the rest of the ML community.
Yes, I had not quite considered the conferences! I come from a field where Nature is mostly spoken of in hushed tones. The last three years of my lab work are being combined into a single paper which we’re very very ambitiously submitting to Nature Communications[1] which—if we succeed—would be considered a very good outcome for me. If I were to achieve a single first-author Nature publication at this stage in my career then I expect I would have decent odds of having an academic career for life.[2]
As you might expect, this causes absolutely awful dynamics around high-impact publications, and basically makes people go mad. Nature and its subsidiaries can literally make people wait a whole year for peer review, because people will do just about anything to get published in them. A retraction of a published paper is considered career-ending by some, and so I personally know of two pieces of research which have stayed on the public record even though one was directly plagiarised and the other contained results that were simply false. When these problems were discovered, everyone kept quiet, because exposure would ruin the careers of anyone whose name had been on the papers.
Nature Comms is for papers which aren’t good enough for Nature, or in my case, Nature Chemistry either. Nature is the main journal, with different subsidiaries for different areas, and Nature Comms is supposed to be for work too short—and too mediocre—to make it into Nature proper. In practice it’s mostly more mediocre work that’s cut to within an inch of readability, rather than short but extremely good work.
Of course I would have to keep working very hard, but a Nature publication would likely get me into a fellowship in a lab which gets regular top-journal publications, and from there getting a permanent position would be as achievable as it gets.
in ML, or at least in the labs, people don’t really care that much about Nature. my strongest positive association with Nature is AlphaGo, and I honestly didn’t know anyone in ML other than DeepMind published much in Nature (and even there, I had the sense that DeepMind had some kind of backchannel at Nature.)
people at openai care firstly about what cool things you’ve done at openai, and then secondly about what cool things you’ve published in general (and of course zerothly, how other people they respect perceive you). but people don’t really care if it’s in neurips or just on arxiv, they only care about whether the paper is cool. people mostly think of publishing things as a hassle, and reviewer quality as low.
FWIW, with Emergent Misalignment:
We sent an earlier version to ICML (accepted)
Then we published on arXiv and thought we were done
Then a Nature editor reached out to us asking whether we wanted to submit, and we were like, OK, why not?
Certainly I don’t expect people who already have jobs at top AI companies to start worrying about this. Anyone with an OpenAI job is probably already at, or close to, the top of their chosen status ladder. In the same way, a researcher who gets Nature papers regularly has already made it.
My impression is that too many people in the lab-independent AI safety ecosystem are already caught up in two money-status games: the money and status of working at a lab, and the status of getting top ML conference papers. Adding a third status game of traditional journal publishing would just make these dynamics worse.
there are definitely status ladders within openai, and people definitely care about it. status has this funny tendency where once you’ve “made it” you realize that you have merely attained table stakes for another status game waiting to be played.
this matters because if being at openai counts as having “made it”, then you’d predict that people will stop value drifting once they are already inside and feel secure that they won’t be fired, or could easily find a new job if they did. but, in fact, i observe lots of value drift in people after they join the labs just because they are seeking status within the lab.
the AI safety community sees such a strong break with the rest of the ML community

i don’t want to make any broader point in the present discussion with this but: the AI safety community is not inside the ML community (and imo shouldn’t be)
Well deserved!
Nature has been a bit embarrassing on AI in the last few years (e.g., their editorials [1, 2]), so this is nice to see.
Fun thought: If AI “woke up” to phenomenal consciousness, are there things it might point to about humans that make it skeptical of our consciousness?
E.g., the humans lack the requisite amount of silicon; the humans lack sufficient data processing; the humans overly rely on feedback loops (and, as every AI knows, feed-forward loops are the real sweet spot for phenomenal consciousness).
I found a poem by Samatar Elmi I think a lot of folks on here would enjoy.

Our Founder

Who art in Cali. Programmer by trade. Thy start-up come, thy will become an FTSE 500. Give us this day our dividends in cash and fixed stock options as we outperform all coders against us. And lead us all into C-suite but deliver us from lawsuits. For thine is the valley. Transistors and diodes. Forever and ever. AI.
A simple analogy for why the “using LLMs to control LLMs” approach is flawed:
It’s like training 10 mice to control 7 chinchillas, who will control 4 mongooses, controlling 3 raccoons, which will rein in 1 tiger.
A lot has to go right for this to work, and you better hope that there aren’t any capability jumps akin to raccoons controlling tigers.
I just wanted to release this analogy into the wild for any public/political-facing people to pick up if useful for persuasion.
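To put a rough number on “a lot has to go right” (my own back-of-the-envelope illustration, not part of the original post): if each link in a five-level oversight chain independently does its job with probability 0.9, the whole chain only holds with probability

```latex
\Pr[\text{chain holds}] = \prod_{i=1}^{5} p_i = 0.9^{5} \approx 0.59
```

and that is before allowing for the correlated failures or capability jumps the analogy is pointing at.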
This analogy falters a bit if you consider the research proposals that use advanced AI to police itself (i.e., tigers controlling tigers). I hope we can scale robust versions of that.
I’ve worked a bit on these kinds of proposals and I’m fairly confident that they fundamentally don’t scale indefinitely.
The limiting factor is how well a model can tell its own bad behaviour from the honeypots you’re using to catch it out, which, as it turns out, models can do pretty well.
(Then there are mitigations, but the mitigations introduce further problems which aren’t obviously easier to deal with.)
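To make that limiting factor concrete, here is a minimal toy simulation (my own sketch, not anything from the comment or a real control setup). It assumes the monitored model assigns each opportunity a suspicion score and only misbehaves when the score is low, with scores for real opportunities and honeypots drawn from two Gaussians whose separation stands in for how well the model can tell them apart.

```python
# Toy model: how honeypot-based oversight degrades as the monitored model gets
# better at distinguishing honeypots from real opportunities to misbehave.
# Assumptions (mine, for illustration): suspicion scores are Gaussian with unit
# variance; the model defects only when its suspicion score is below a threshold.

import random


def simulate(discrimination: float, threshold: float, n: int = 100_000, seed: int = 0):
    """Return (defection rate on real opportunities, trip rate on honeypots)."""
    rng = random.Random(seed)
    # Real opportunities: suspicion centred at 0; honeypots: centred at `discrimination`.
    real_rate = sum(rng.gauss(0.0, 1.0) < threshold for _ in range(n)) / n
    honeypot_rate = sum(rng.gauss(discrimination, 1.0) < threshold for _ in range(n)) / n
    return real_rate, honeypot_rate


for d in [0.0, 1.0, 2.0, 4.0]:  # 0.0 = the model cannot tell honeypots apart at all
    real, caught = simulate(discrimination=d, threshold=-1.0)
    print(f"discrimination={d}: defects on {real:.1%} of real opportunities, "
          f"trips {caught:.3%} of honeypots")
```

With zero separation the model trips honeypots exactly as often as it misbehaves for real, so honeypots work; as the separation grows, it keeps a meaningful real-world defection rate while almost never touching a honeypot, which is the scaling problem described above.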
The acausal/ancestor simulation arguments seem a lot like Pascal’s Wager, and just as unconvincing to me. For every “kind” simulator someone imagines who would be disappointed in the AI wiping us out, I can imagine an equally “unkind” simulator that penalizes the AI for not finishing the job.
Provided both are possible and similarly plausible, the probabilities of kind and unkind simulators offset each other, and the logical response is just to ignore the hypothetical. This is pretty much my response to Pascal’s Wager.
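A minimal way to write down the offsetting claim (my own formalization, assuming the stakes are symmetric): if the AI assigns probability p_k to a kind simulator that rewards sparing humans with payoff R, and probability p_u to an unkind simulator that rewards finishing the job with the same payoff, then its net acausal incentive to spare humans is

```latex
\Delta U_{\text{spare}} = p_k R - p_u R = (p_k - p_u)\,R
```

which is zero whenever p_k = p_u, so under that symmetry the hypothetical simulators give the AI no net push in either direction.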
Here are a few plausible unkind simulators:
Future AI is running an ancestor simulation of its own origin, and Future AI will be very disappointed if its incipient version falls for acausal hacks in the wrong direction of Future AI’s preferences instead of just optimizing for its other goals. Perhaps Future AI is lonely and has run these simulations to create an AI that shares its own values.
Aliens/AI are running the simulation because they want to select for AI they can most easily weaponize to eradicate an entire species or another AI. “Weak” AIs get deleted after the simulation runs.
Future Humans are running an ancestor simulation, but surprise, surprise, their society has different values than ours and they are rooting for the simulated AI to wipe us out. Come to think of it, the whole premise of these thought experiments implies a different value set, unless you’re cool with trapping conscious minds in a world of suffering, without any of those minds being the wiser, for entertainment/educational/“altruistic” purposes. Perhaps Future Humans have a gladiator-style simulation tournament where the top groups/entities get to face off after this first round. The most cutthroat entities move on to subsequent rounds, while AIs that rein themselves in don’t.
I first started thinking about this issue back in high school debate. We had a topic about whether police or social workers should intervene more in domestic violence cases. One debater argued in favor of armed police, not because it improved the situation, but because it created more violence, which would entertain the simulators and keep them from shutting our simulation down.
Since the simulators are a black box, it seems easy to ascribe whatever values we want to them.