Substack: https://substack.com/@simonlermen
X/Twitter: @SimonLermenAI
Palisade Research has an ongoing fundraiser with $900k in matching funds available from SFF; it seems possible to get counterfactual matching here.
I briefly worked for Palisade Research as a contractor and was a MATS student under Jeffrey. I believe Jeffrey understands the difficulty of AI alignment, and Palisade is doing important work reaching out to policymakers and communicating with the public. In particular, he gets that we are possibly very close to RSI, and that the time from there to existentially dangerous superhuman AI could be very short.
Read more about it here: https://x.com/JeffLadish/status/2033990617622319490
Yeah, I probably wouldn't have included the birds and stones metaphor if it were up to me and would have just explained the idea.
My attempt at understanding the type of reactions Eliezer doesn't like and that make him less excited about posting here on LessWrong:
In this text, he elaborates on why the AI probably won't just spare us a few resources to keep us going:
The top comment – getting 160 karma compared to 218 for the post itself – attacks him for calling people who argue "Comparative advantage means humans will keep jobs" midwits: https://www.lesswrong.com/posts/F8sfrbPjCQj4KwJqn/the-sun-is-big-but-superintelligences-will-not-spare-earth-a?commentId=nzLm7giTn8JPD6bTF
Now, about a year later, look at this example of a GDM employee making a pretty flawed argument based on comparative advantage. Would you agree this is well described as "midwit" behavior, overapplying maths? https://www.lesswrong.com/posts/tBr4AtpPmwhgfG4Mw/comparative-advantage-and-ai
Would it have been better to write a diplomatic, formal argument that would be more likely to convince those people – or is it more important to give people a world model where they understand that this type of "not so smart" reasoning is actually common in frontier labs?
I think the real-world test on Anthropic interviews is the closest to a proper test. The interviews were redacted when Anthropic released it.
Make sure never to publish the dataset, so as not to create a highly creative new RL environment. Also be aware that if this gets too much traction, you may end up getting a significant offer from some AI lab, or people may try to mimic your benchmark in their own RL environments.
One could equally maintain that if nobody builds it, everyone dies.
I think this is sort of false. There are probably many low-hanging-fruit ways to increase longevity and quality of life, and it is at least imaginable that we could get close-ish to immortality without AI. If we produce really smart human researchers, they could extend lifespans by centuries, allowing each person to take advantage of even more progress. Eventually, we might upload human consciousness or do something like that.
He sets aside the difference between oneself dying eventually and there literally being no recognizable posterity, which I think makes this text relatively uninteresting: a future with zero humans or any kind of humanity, where some alien entity transforms this part of the universe into what would appear to be a horrible scar with unrecognizable values. He also sets aside literally everyone getting violently slaughtered instead of most people dying peacefully at 80 years old, as well as outcomes worse than death.
But even given the selfish perspective, I just sort of guess that trying to get such a wide range of numbers out of a somewhat contrived theory is not a good idea. The numbers range from 0 to 1000 years, so I don't know what to take from this. Plugging my estimated numbers into Table 6 sort of gets me to somewhat correct-seeming results, though I may not fully get what the author meant.
I think that, all considered, there are much better choices than accelerating AI, such as improving human intelligence. Improved human intelligence would extend lifespans, help us solve the alignment problem, and improve quality of life. We can also invest in lifespan and quality-of-life research. Overall, a much better deal than building unaligned ASI now.
I liked the design when I saw it today, but I would also like aggregate statistics like comment count, post count, and recent activity; perhaps even something like GitHub's calendar showing activity for each commit. It would also be good to retain a bio with a self-description and optionally URLs to websites or social media accounts.
Distributed thinking:
The world will fill itself with moral agents. Contribute to the healthiest/wisest.
Don’t grab resources you don’t know how to use. Cultivate one’s garden.
Avoiding xrisk and creating flourishing futures imply very similar strategies.
(Analogy: “don’t lie” and “don’t kill” are good heuristics for almost any goals.)
I think you are inserting a lot of "ought" into the "is" at this point.
From the writing, it sounds like you are describing a world where a bunch of these decentralized agents share the world peacefully. You claim that people want to create centralized agents; I think it is not so much that people want to create a centralized agent, it is just that a single centralized agent is a stable equilibrium in a way that a multipolar world is not.
You are right that we are starting out in a decentralized, multipolar AI world right now, but this will end when an AI is capable of stopping other AIs from progressing. Obviously you could not allow another AI that is not aligned with you to become more powerful than you, even if you were human-aligned. And if there is another AI at around the same capability level at the same time, you would obviously collaborate in some way to stop other AIs from progressing.
Having dozens of AIs continuously racing up the RSI curve toward superintelligence is simply not a stable world that will continue; obviously you'd fight for resources. There aren't any solar systems with 5 different suns orbiting each other.
It feels like they are trying very hard to discredit the standard story of alignment. They use vague concepts to conclude this is evidence for some weird "industrial accidents" story. What is that supposed to mean? This doesn't sound like scientific inference to me but very much like motivated thinking. It reminds me of that "against counting arguments" post, where they also try very hard to get some "empirical data" for something that superficially sounds related in order to make a big conceptual point.
I mean, I do think that he is using a poor rhetorical pattern: misrepresenting (strawmanning) a position and then presenting a "steelman" version which the original people would not like or endorse. And arguably my comment also applies to the third one (it thinks it's in a video game where it has to exterminate humans vs. a sci-fi story).
To be fair, he does give 4 examples of what he finds plausible, and I can sort of see a case for considering the second one (some strong conclusion based on morality). And to be clear, I think this story being told (not just by Amodei) that LLMs might read AI sci-fi like Terminator and decide to do the same is not really what misalignment is about. I think that's a bad argument; thinking of this as a likely cause of misaligned actions really doesn't seem helpful to me, and I reject it strongly. But OK, to be fair, I grant that I could have mentioned that this was just one example he gave for a larger issue; however, none of these examples touch on the mainstream case for misalignment/power-seeking.
For example, AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. This could inadvertently shape their priors or expectations about their own behavior in a way that causes them to rebel against humanity.
So he basically strawmans all those arguments about "power seeking", dismisses them all as unrealistic, and then presents his amazing improved steelman, which is basically that the AI might watch Terminator or read some AI takeover story and randomly decide to do the same thing. Power seeking is not, at its core, about learning patterns from games or role-playing some sci-fi story; it's a fact about the universe that having more power is better for your terminal goals. If anything, these games and stories mirror the reality of our world, where things are often about power struggles.
Their RSI very likely won't lead to safe ASI. That's what I meant; hope that clears it up. Whether it leads to ASI at all is a separate question.
Getting RSI and a shot at superintelligence right just appears very difficult to me. I appreciate their constitution and found the parts I read thoughtful. But I don't see them having found a way to reliably get the model to truly internalize its soul document. I also assume that even if they were able to, there would be parts that break down once you get to really critical amounts of intelligence.
My main takeaway from what Dario said in that talk is that Anthropic is very determined to kick off the RSI loop and willing to talk about it openly. Dario basically confirms that Claude Code is their straight shot at RSI to get to superintelligence as fast as possible (starting RSI in 2026-2027). Notably, many AI labs do not explicitly target this, or at least don't say so openly. While I think it is nice that Anthropic is doing alignment research, and openly publishing their constitution is a good step, I think that if they do successfully kick off the RSI loop, they have very low odds of succeeding.
I think it's great to teach a course like this at good universities. I do think, however, that the proximity to OpenAI comes with certain risk factors. From OpenAI's official alignment blog: https://alignment.openai.com/hello-world/ "We want to [..] develop and deploy [..] capable of recursive self-improvement (RSI)". This seems extremely dangerous to me: not on the scale of "we need to be a little careful", but on the scale of building mirror-life bacteria or worse; beyond "let's research this" and more like "perhaps don't do this". I worry that such concerns are not discussed in these courses and are brushed aside in favor of the "real risks", which are typically short-term, immediate harms that could reflect badly on these AI companies. Some people in academia are now launching workshops on recursive self-improvement: https://recursive-workshop.github.io
Having control over the universe (or the lightcone, more precisely) is very good for basically any terminal value. I am perhaps trying to explain my point of view to people who take it very lightly and feel there is a decent chance it will give us ownership over the universe.
“Thanks to Claude 4.5 Sonnet for help and feedback. No part of the text was written by AI models.”
Could you describe a bit more how you used Claude, and how the ideation took place?
My fear is that I would start out with a fuzzy idea of a circuit lookup table, then talk to Claude, and it would eventually convince me that this has massive implications for alignment. But I remain highly skeptical of this; there is a high risk of deviating into vibe thinking. I think that your arguments at multiple points leave the realm of valid reasoning and draw wide, unsupported conclusions. This is an easy way AI-assisted alignment might fail.
For example, I don't concretely see what you are actually saying here: these circuits supposedly each perform some aspect of some task, and each of them is aligned? Aligned as in with the model spec, while the circuit does something like addition or some fact lookup?
Again, here I see an enormous conceptual leap, going from this very vague model to stating a very vague limitation of the current paradigm.
Another such leap (David Mannheim already posted this one):
I would be careful about this vibe-based thinking. Increasingly, one benefit of humans might be that they are less sycophantic than LLMs even if they are just as smart, so don't take my critique too harshly here.