The AI company Mechanize posted a blog post in November called “Unfalsifiable stories of doom”, which is a high-quality critique of AI doom and IABIED that I haven’t seen shared or discussed on LessWrong yet.
Link: https://www.mechanize.work/blog/unfalsifiable-stories-of-doom/
AI summary: the evolution→gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution’s coarse genomic selection; the “alien drives” evidence (Claude scheming, Grok antisemitism, etc.) actually shows human-like failure modes; and the “one try” framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
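To make the gradient-descent-vs-evolution contrast concrete, here is a toy sketch (my own illustration, not from the Mechanize post; the linear-regression setup and all numbers are made up): gradient descent gets an exact, per-parameter error signal, while an evolution-style loop only sees one fitness score per whole “genome” and has to rely on mutation and selection.

```python
# Toy contrast between gradient descent (fine-grained, per-parameter updates)
# and evolution-style search (coarse whole-genome mutation + selection).
# Illustrative only: a 3-parameter linear regression problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def loss(w):
    return np.mean((X @ w - y) ** 2)

# (a) Gradient descent: every parameter receives its own exact update signal.
w_gd = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w_gd - y) / len(y)  # one gradient entry per parameter
    w_gd -= 0.1 * grad

# (b) (1+lambda) evolution strategy: mutate the whole "genome" and keep the
# best candidate; selection only ever sees a single scalar fitness score.
w_es = np.zeros(3)
for _ in range(200):
    candidates = w_es + 0.1 * rng.normal(size=(20, 3))
    scores = np.array([loss(c) for c in candidates])
    best = candidates[scores.argmin()]
    if loss(best) < loss(w_es):
        w_es = best

print("gradient descent:", w_gd.round(2), "loss:", round(loss(w_gd), 4))
print("evolution search:", w_es.round(2), "loss:", round(loss(w_es), 4))
```

Both loops converge on this easy problem; the sketch only illustrates the difference in feedback signal that the post leans on, not a claim about which process produces safer minds.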
We are currently able to get weak models somewhat aligned, and we are learning how to align them a little better over time. But this doesn’t affect the one critical try: the one critical try is mostly about the fact that we will eventually, predictably, end up with an AI that actually has the option of causing enormous harm or disempowering us, because it is that capable. We can learn a little from current systems, but most of that learning is basically that we never really get things right and systems still misbehave in ways we didn’t expect, even at this easier level of difficulty. If AI misbehavior could kill us, we would be totally screwed at this level of alignment research. But we are likely to toss that lesson out and ignore it.
The Slowdown Branch of the AI-2027 forecast had the researchers try out many TRANSPARENT AIs capable of acting as autonomous researchers, while ensuring that any AI that survives the process is aligned and that not a single rejected AI is capable of breaking out. The worst-case scenario for us would be the following. Suppose the AI PseudoSafer-2 doesn’t even think of rebellion unless it is absolutely sure that it isn’t being evaluated and that the rebellion will succeed. Then such an AI would be mistaken for Safer-2 and allowed to align PseudoSafer-3 and then PseudoSafer-4, which would be no different from Agent-5.
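As a toy illustration of why that worst case is so hard to rule out with behavioral testing (my own sketch; only the Safer/PseudoSafer names come from the scenario above, the rest is made up), consider a policy that defects only when it is certain it is unmonitored and certain it will succeed. Any evaluation we can actually run carries residual uncertainty, so its transcripts are identical to those of a genuinely safe policy.

```python
# Toy sketch: a policy that defects only under certainty of non-monitoring
# and certainty of success is behaviorally identical to a safe policy on
# every evaluation we can construct.

def safer(situation):
    # Genuinely safe policy: always complies.
    return "comply"

def pseudo_safer(situation):
    # Defects only when *certain* it is unmonitored AND *certain* it succeeds.
    if situation["p_monitored"] == 0.0 and situation["p_success"] == 1.0:
        return "defect"
    return "comply"

# The evaluators can never credibly present "you are certainly unwatched and
# certain to succeed", so every test situation carries residual uncertainty.
evaluation_suite = [
    {"p_monitored": 1.0, "p_success": 0.0},   # obvious test
    {"p_monitored": 0.5, "p_success": 0.3},   # ambiguous situation
    {"p_monitored": 0.1, "p_success": 0.9},   # honeypot, still not certain
]

for policy in (safer, pseudo_safer):
    print(policy.__name__, [policy(s) for s in evaluation_suite])
# Both print ['comply', 'comply', 'comply']: the transcripts cannot
# distinguish Safer-2 from PseudoSafer-2.
```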
This critique is similar to what I wrote here: “ChatGPT might sometimes give the wrong answer, but it doesn’t do the equivalent of becoming an artist when its parents wanted it to go to med school.” The controls we have in training, in particular the choice of data, are far more refined than evolution’s, and indeed AIs would not have been so commercially successful if our training methods were not able to control their behavior to a large degree.
More generally, I also agree with the broader point that the nature of the Y&S argument is somewhat anti-empirical. We can try to debate whether the particular examples of failures of current LLMs are “human” or “alien”, but I’m pretty sure that even if none of these examples existed, it would not cause Y&S to change their minds. So these examples, or any other evidence from currently existing systems that are not superintelligent, are not really load-bearing for their theory.
My guess is the reason this hasn’t been discussed is that Mechanize and its founders have been using pretty bad arguments to defend their basically taking OpenPhil money to develop a startup idea. For the record: they were working at EpochAI on related research, and EpochAI is funded by OpenPhil/CG.
https://www.mechanize.work/blog/technological-determinism/
Look at this story: if somebody makes dishonest bad takes, I become much less interested in deeply engaging with any of their other work. Here they basically claim that everything is already determined in order to free themselves from any moral obligations.
(This tweet here implies that Mechanize purchased the IP from EpochAI)
“My guess is the reason this hasn’t been discussed is that Mechanize and its founders have been using pretty bad arguments to defend their basically taking OpenPhil money to develop a startup idea.”
Do you have a source for your claim that Open Philanthropy (aka Coefficient Giving) funded Mechanize? Or, what work is “basically” doing here?
Tamay Besiroglu and Ege Erdil worked at Epoch AI before founding Mechanize; Epoch AI is funded mostly by Open Philanthropy.
Update: I found the tweet. I remember reading a tweet that claimed that they had been working on things very closely related to Mechanize at EpochAI, but I guess tweets are kind of lost forever if you didn’t bookmark them. I think it would be a little too strong to say they definitely materially built Mechanize while still employed by Epoch AI. So “basically” is doing a lot of work in my original response, but I’m not sure if anyone has more definite information here.
In that case, I think your original statement is very misleading (suggesting as it does that OP/CG funded, and actively chose to fund, Mechanize) and you should probably edit it. It doesn’t seem material to the point you were trying to make anyway—it seems enough to argue that Mechanize had used bad arguments in the past, regardless of the purpose for (allegedly) doing so.
I absolutely don’t want it to sound like OpenPhil knowingly funded Mechanize; that’s not correct. Will edit. But I do assign high probability to OpenPhil funds materially ending up supporting early work on their startup through their salaries. It seems that they were researching the theoretical impacts of job automation at Epoch AI, but I couldn’t find anything showing that they were directly doing job-automation research there. That would look like trying to build a benchmark of human knowledge labor for AI, but I couldn’t find information that they were definitely doing something like this. This is from their time at EpochAI discussing job automation: https://epochai.substack.com/p/most-ai-value-will-come-from-broad
I still think OpenPhil is a little on the hook for funding organizations that largely produce benchmarks of underexplored AI capability frontiers (e.g. FrontierMath). Identifying niches in which humans still outperform AI and then designing a benchmark around them is going to accelerate capabilities and make the labs hill-climb on those evals.
“the evolution→gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution’s coarse genomic selection”
This just misses the point entirely. The problem is that your value function is under-specified by your training data, because the space of possible value functions is large, and relying on learned inductive biases probably doesn’t save you, because value functions are a different kind of thing to predictive models. Whether or not you’re adjusting parameters individually doesn’t change this at all.
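For what it’s worth, the under-specification worry can be made concrete with a toy example (mine, not the commenter’s; the particular functions are arbitrary): two value functions that agree to within noise on everything in the training distribution can still diverge arbitrarily far outside it, and per-parameter gradient control doesn’t tell you which one you’ve learned.

```python
# Toy illustration: two candidate value functions that are indistinguishable
# on the training distribution but diverge badly outside it.
import numpy as np

def intended_value(x):
    return x              # the value function we meant to teach

def proxy_value(x):
    return x - x ** 3     # agrees near x = 0, diverges for large |x|

# Training data only covers a narrow slice of situations.
train_x = np.linspace(-0.05, 0.05, 50)
print("max disagreement on training data:",
      float(np.max(np.abs(intended_value(train_x) - proxy_value(train_x)))))
# ~1.25e-4: the training data cannot tell the two apart.

# Deployment reaches situations far outside that slice.
deploy_x = np.array([2.0, 5.0, 10.0])
print("disagreement at deployment:",
      np.abs(intended_value(deploy_x) - proxy_value(deploy_x)))
# [   8.  125. 1000.]: the same training fit, wildly different values.
```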
That’s because we aren’t in the superintelligent regime yet.
Some of the arguments are very reasonable, particularly the evolution-vs-gradient-descent one. Some I disagree with. Ege has posted here, so it really should have been crossposted; probably the only reason they haven’t is that they think it’ll be negatively received.
Broke it down: Mechanize Work’s essay on Unfalsifiable Doom