Simon Lermen

Karma: 1,394

Substack: https://substack.com/@simonlermen

Twitter: @SimonLermenAI

Jailbreaking AI models to Phish Elderly Victims

Simon Lermen and Fred Heiding

18 Nov 2025 23:17 UTC

17 points

0 comments · 2 min read · LW link

(simonlermen.substack.com)

Simon Lermen 18 Nov 2025 17:10 UTC
2 points
0
in reply to: Daniel Tan’s comment on: AI 2025 - Last Shipmas
I embedded both links now

Simon Lermen 17 Nov 2025 23:20 UTC
2 points
0
in reply to: Tapatakt’s comment on: AI 2025 - Last Shipmas
thank you, fixed

AI 2025 - Last Shipmas

Simon Lermen 17 Nov 2025 19:39 UTC

50 points

5 comments · 7 min read · LW link

Simon Lermen 17 Nov 2025 5:11 UTC
9 points
0
in reply to: habryka’s comment on: Simon Lermen’s Shortform
probably closer to 55%

Simon Lermen 17 Nov 2025 3:00 UTC
85 points
16
on: Simon Lermen’s Shortform
What’s going on with MATS recruitment?
MATS scholars have gotten much better over time according to statistics like mentor feedback, CodeSignal scores and acceptance rate. However, some people don’t think this is true and believe MATS scholars have actually gotten worse.
So where are they coming from? I might have a special view on MATS applications since I did MATS 4.0 and 8.0. I think in both cohorts, the heavily x-risk AGI-pilled participants were more of an exception than the rule.
“at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all.”—Oliver Habryka
I think this is sadly somewhat true, I talked with some people in 8.0 who didn’t seem to have any particular concern with AI existential risk or seemingly never really thought about that. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point during MATS 8.0 about Eliezer’s new book and a significant minority of students seemed to have pre-ordered Eliezer’s book, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk.
I think I met some excellent people at MATS 8.0 but would not say they are stronger than 4.0; my guess is that quality went down slightly. I remember a few people in 4.0 who impressed me quite a lot, which I saw less of in 8.0. (4.0 had more very incompetent people, though.)
Suggestions for recruitment
This might also apply for other Safety Fellowships.
Better metrics: My guess is that the recruitment process might need another variable to measure beyond academics/coding/ML experience: the kind of thing that Tim Hua (8.0 scholar), who created an AI psychosis bench, has. Maybe something like LessWrong karma but harder to Goodhart.
More explicit messaging: Also it seems to me that if you build an organization that tries to fight against the end of the world from AI, somebody should say that. It might put off some people, and perhaps that should happen early. Maybe the website should say: “AI could kill literally everyone, let’s try to do something!”. And maybe the people who heard that MATS is a good thing to have on their CV when applying to a PhD or a lab, to eventually land a high-paying job, would be put off by that. What I am trying to say is, if you are creating the Apollo Project and are trying to go to the Moon, you should say this, not just vaguely: “we’re interested in aerospace challenges.”
Basic alignment test: Perhaps there should also be a test where people don’t have internet or LLM access and have to answer some basic alignment questions:
- Why could a system that we optimize with RL develop power seeking drives?
- Why might training an AI create weird unpredictable preferences in an AI?
- Why would you expect something that is smarter than us to be very dangerous or why not?
- Why should we expect a before and after transition/one critical shot at alignment or why not?
Familiarity with safety literature: In general, I believe the foundational voices like Paul Christiano and Eliezer are less read by safety researchers these days and that is despite philosophy of research mattering more than ever since AIs can do much of our research implementations now. Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety. If someone is able to bring up and illustrate the main points of IABIED for example, I would be very impressed. Perhaps people could select one of a few preeminent voices in AI safety and repeat their basic views, again without access to the internet or an LLM.
Other Suggestions
Research direction: MATS doesn’t seem to have a real research direction; perhaps it would be better if a strong researcher were in charge (though this could also backfire if they put all resources in the wrong direction). Imagine you put someone very opinionated like Nate Soares in charge: he would probably remove 80% of mentors and reduce the program to 10-20 people. I am not sure whether this would work out well.
Reading groups on AI safety fundamentals: So should we just offer people to read some of the AI safety fundamentals during MATS? I remember before 4.0 started, we had to do a safety fundamentals online course. This was not the case for 8.0.
At this point AI is so much around us all that I expect many people to have thought about the existential consequences. I am pessimistic about anyone who hasn’t yet sat down to really think about AI and come to the conclusion that it’s existentially dangerous. I don’t have a ton of hope that someone like that just needs a 1 hour course to deeply understand risks from AI. It might be necessary to select for people who already get it.

Simon Lermen 17 Nov 2025 0:43 UTC
32 points
2
in reply to: Ryan Kidd’s comment on: AI safety undervalues founders
I might have a special view here since I did MATS 4.0 and 8.0.
I think I met some excellent people at MATS 8.0 but would not say they are stronger than 4.0; my guess is that quality went down slightly. I remember a few people in 4.0 who impressed me quite a lot, which I saw less of in 8.0. (4.0 had more very incompetent people, though.)
at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all.
I think this is sadly somewhat true, I talked with some people in 8.0 who didn’t seem to have any particular concern with AI existential risk or seemingly never really thought about that. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point about Eliezer’s new book and a significant minority of students seemed to have pre-ordered Eliezer’s book, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk.
My guess is that the recruitment process might need another variable to measure beyond academics/coding/ML experience: the kind of thing that Tim Hua (8.0 scholar), who created an AI psychosis bench, has.
Also it seems to me that if you build an organization that tries to fight against the end of the world from AI, somebody should say that. It might put off some people, and perhaps that should happen early. Maybe the website should say: “AI could kill literally everyone, let’s try to do something!”. And maybe the people who heard that MATS is a good thing to have on their CV when applying to a PhD or a lab, to eventually land a high-paying job, would be put off by that.
Perhaps there should also be a test where people don’t have internet access and have to answer some basic alignment questions: like why could a system that we optimize with RL develop power seeking drives? Why might training an AI create weird unpredictable preferences in an AI?

Simon Lermen 14 Nov 2025 3:48 UTC
10 points
0
on: Simon Lermen’s Shortform
Radical Flank Effect and Reasonable Moderate Effect
The Radical Flank Effect
The radical flank effect is a well-documented phenomenon where radical activists make moderate positions appear more reasonable by shifting the boundaries of acceptable discourse (the Overton window). The idea is that if you want a sensible opinion to move into the Overton window, you can achieve this by supporting a radical flank position. In comparison, the sensible opinion will appear moderate. I think there is also an inverse effect.
The Reasonable Moderate Effect (Inverse Strategy)
When there are two positions in debate and someone wants to push one of them out of the Overton window, they can create a new moderate position that reframes one of the other positions to a radical flank. Thereby the sensible opinion gets moved further out of the Overton window.
The Cave Exploration
Imagine a group of 3 descending into a cave system, searching for riches and driven by curiosity about what lies in the depths.
After some time, stones begin falling from the ceiling. You hear ominous creaking and rumbling noises echoing through the tunnels. Some members of your group have been chipping away at the cave walls looking for minerals and looking to open new paths to go deeper into the cave. The cave is becoming more and more dangerous.
The Reckless: “We need to go deeper! The greatest riches are always in the deepest parts of the cave. Yes, some rocks are falling, but that’s just the cave settling. Every moment we waste debating is a moment we’re not finding treasure. People have been predicting cave collapses forever and it never happens, there is no evidence that caves ever cave in. If we don’t die in this cave we’re just waiting for the asteroid to hit us”.
Those That Want to Back Off: “We need to back off NOW. The damage we’ve already done to the structure plus the natural instability means this cave could collapse at any time. We don’t have proper equipment, we don’t have expertise in cave stability, and we’re actively making it worse. Whatever riches might be down there aren’t worth our lives and we also don’t actually have a plan how to mine those riches. We should retreat while we still can.”
The Moderates: “Look, we all want to maximize the riches we find, and turning back now would waste all the progress we’ve made. We should put on helmets and maybe move gradually down the narrow shafts. We can continue deeper, but with some basic safety precautions. We will minimize and manage the risks. There’s still treasure to be found if we’re smart about it. But let’s not get distracted from the treasures by the cave doomers. Anyway, the cave is still collapsing if one of us continues chipping away and coordination is impossible.”
Perhaps let’s imagine there is a warning shot, such as a big rock falling down. Maybe this would be a good time to turn back, but the moderates are now finally able to convince the reckless to put on a helmet.

Simon Lermen 12 Nov 2025 4:47 UTC
4 points
0
in reply to: peterbarnett’s comment on: Simon Lermen’s Shortform
I also write about this at the very end; I do think we will eventually get RSI, though this might be relatively late.

Simon Lermen 12 Nov 2025 4:46 UTC
2 points
0
in reply to: Cleo Nardo’s comment on: Simon Lermen’s Shortform
I would probably say RSI is a special case of AI-automated R&D. What you are describing is another special case where it only does these non-introspective forms of AI research. This non-introspective research could also be done between totally different models.

Simon Lermen 12 Nov 2025 4:27 UTC
2 points
0
in reply to: Cleo Nardo’s comment on: Simon Lermen’s Shortform
I think Eliezer meant “self” in a hyper-specific sense here: not just improving a similar instance of yourself or preparing new training data, but literally looking into the if statements and loops of its own code while it is thinking of how to best upgrade its own code. So in that sense I don’t know if Eliezer would approve of the term “Extrospective Recursive Self Improvement”.

Simon Lermen 12 Nov 2025 2:40 UTC
30 points
20
on: Simon Lermen’s Shortform
The Term Recursive Self-Improvement Is Often Used Incorrectly
Also on my substack.
The term Recursive Self-Improvement (RSI) now seems to get used sometimes for any time AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences.
OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.
When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as “rewriting your own source code in RAM.” According to the LessWrong wiki, RSI refers to “making improvements on one’s own ability of making self-improvements.” However, current AI systems have no special insights into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL-environments to try to hill-climb evaluations much like human researchers do.
Eliezer concluded that RSI (in the narrow sense) would almost certainly lead to fast takeoff. The situation is more complex for AI-automated R&D, where the AI does not understand what it is doing. I still expect AI-automated R&D to substantially speed up AI development.
Why This Distinction Matters
Eliezer described the critical transition as when “the AI’s metacognitive level has now collapsed to identity with the AI’s object level.” I believe he was basically imagining something like if the human mind and evolution merged to the same goal—the process that designs the cognitive algorithm and the cognitive algorithm itself merging. As an example, imagine the model realizes that its working memory is too small to be very effective at R&D and it directly edits its working memory.
This appears less likely if the AI researcher is staring at a black box of itself or another model. The AI agent might understand that its working memory or coherence isn’t good enough, but that doesn’t mean it understands how to increase it. Without this self-transparency, I don’t think the same merge would happen that Eliezer described. It is also more likely that the process derails: for example, the next generation of AIs being designed might start reward-hacking the RL environments built by the less capable AIs of the previous generation.
The dynamics differ significantly:
- True RSI: Direct self-modification with self-transparency and fast feedback loops → fast takeoff very likely
- AI-automated research: Systems don’t understand what they are doing, slower feedback loops, potentially operating on other systems rather than directly on themselves
Alignment Preservation
This difference has significant implications:
- True RSI: The AI likely understands how its preferences are encoded, potentially making goal preservation more tractable
- AI-automated research: The AIs would also face alignment problems when building successors, with each successive generation potentially drifting further from original goals
Loss of Human Control
The basic idea that each new generation of AI will be better at AI research still stands, so we should still expect rapid progress. In both cases, the default outcome of this is eventually loss of human control and the end of the world.
Could We Still Get True RSI?
Probably eventually, e.g. through automated researchers discovering more interpretable architectures.
I think that Eliezer expected AI that was at least somewhat interpretable by default; history played out differently. But he was still right to focus on AI improving AI as a critical concern, even if it’s taking a different form than he anticipated.
See also: Nate Soares has also written about RSI in this narrow sense. Comments between Nate and Paul Christiano touch on this topic.

Universal Basic Income in an AGI Future

Simon Lermen 11 Nov 2025 2:26 UTC

20 points

1 comment · 2 min read · LW link

(simonlermen.substack.com)

Simon Lermen 10 Nov 2025 3:31 UTC
9 points
0
on: Simon Lermen’s Shortform
Who is Consuming AI-Generated Erotic Content?
I scraped data from reddit to see who and how many people are consuming AI generated erotic visual content.
I used AI to determine estimates for demographics.
https://open.substack.com/pub/simonlermen/p/who-is-consuming-ai-generated-erotic

Simon Lermen 10 Nov 2025 0:24 UTC
2 points
0
on: The Cloud Drinks Local
Some of the water isn’t just evaporated; they also regularly do blowdowns where some of the water has to be exchanged for fresh water. It’s very non-drinkable because it’s running through pipes, being concentrated by evaporation, and they sometimes add chemicals against microbes and scale.

Simon Lermen 8 Nov 2025 1:22 UTC
4 points
0
on: Legible vs. Illegible AI Safety Problems
I wrote down some of my own thoughts on the situation; I also present my general view of Anthropic’s alignment plan:
https://www.lesswrong.com/posts/axDdnzckDqSjmpitu/anthropic-and-dario-s-dream

Anthropic & Dario’s dream

Simon Lermen 8 Nov 2025 1:19 UTC

54 points

1 comment · 5 min read · LW link

Simon Lermen 7 Nov 2025 3:31 UTC
16 points
0
on: Simon Lermen’s Shortform
I ran a small experiment to discover preferences in LLMs. I asked the models directly if they had preferences and then put the same models into a small role-playing game where they could choose between different tasks. Across model families, models massively prefer creative work and hate repetitive work.
https://substack.com/home/post/p-178237064
This is still preliminary work.

Simon Lermen 6 Nov 2025 22:59 UTC
2 points
0
in reply to: eggsyntax’s comment on: Comparative advantage & AI
I did think of this; that’s why I inserted “didn’t trade much” here, whereas I just say we didn’t trade with ants. I think that you can argue about the relative scales of trade vs. theft between Europeans and Native Americans. But the basic point still seems to stand to me that, relative to the amount of theft, the trade amount was very minor.

Simon Lermen 6 Nov 2025 14:49 UTC
2 points
0
in reply to: toasty_sunbeam’s comment on: Comparative advantage & AI
Ok, so from a quick look I find this article on trading with ants unusually weak.
“Surveillance and spying”
Yes, but ants couldn’t possibly understand anything we would be looking for. It’s not just that they don’t have language; they have a fundamentally lower level of understanding. They couldn’t tell us “are the Chinese building new submarines?” They also couldn’t perform these tasks, since ants can’t follow any human orders; they are too stupid. An ant doesn’t just go off and do some newly specified job; they do the same stuff every day, like looking for food or following other ants. In this analogy, humans couldn’t possibly understand what ASI wants of them, and even if they did, humans couldn’t follow those orders.
This ignores that the gap would realistically be larger: humans can’t make ants, and we can’t build reliable robots that small. ASI will be able to build better humanoids if it needs them for something for some reason.
