German writer of science-fiction novels and children’s books (pen name Karl Olsberg). I blog and create videos about AI risks in German at www.ki-risiken.de and youtube.com/karlolsbergautor.
Karl von Wendt
I agree with most of this: red lines that aren’t respected are useless. However, in my opinion, giving up on drawing red lines doesn’t solve any problems either; ignoring them or redrawing them is itself a signal. But I agree that what we need most is enforcement.
First of all, thank you for mentioning my post—I feel honored to serve as an example in this case! But to be clear, at the time I did not intend to define any specific red lines. I was just asking how we could decide when to stop development if we needed to.
I’m not sure whether you’re arguing against using red lines in general, or just want to point out that so far we haven’t broadly agreed on any, and that all talk of self-restraint by the industry has been just lip service (to which I agree). In any case, I’m still convinced that we need to define red lines for AI development that we must not cross. The fact that this hasn’t worked so far is absolutely no proof that such an approach is useless. It only proves that we need to do more to define, argue about, and agree upon such red lines.
Red lines are probably the most important concept in human civilization. From the Ten Commandments to tax law, by defining what we are not allowed to do, they form the foundation of our rules for how we deal with each other. Arguing that because red lines for AI haven’t worked so far, we shouldn’t even try to define them is like saying that because someone got murdered, criminal law is unnecessary.
If we assume that there is a “point of no return”, maybe a certain combination of generality and intelligence in an AI that leads to it becoming uncontrollable, and we haven’t solved alignment, then the only way to avoid an existential catastrophe is not to build it. Even if you think that alignment is in fact solved (or not really a problem), we should care about where this point of no return lies, so we know at what point we really need to be sure that you’re right about that. (It should also be clear who can decide this; the current practice of private companies gambling with the future of mankind for personal gain clearly violates the Universal Declaration of Human Rights in my view.)
It may be difficult to define this point exactly. But that only makes it more important to draw red lines as quickly as possible, so we don’t accidentally stumble into an existential catastrophe. And by “red lines” I don’t mean “alarm signals that trigger a halt to development if detected” but specific rules for the decisions AI developers can make, e.g. how much training compute is allowed, what kinds of safety tests are required, etc. This is no doubt a huge challenge, but that is no argument against trying to solve it. Saying “this is impossible” is just a self-fulfilling prophecy.
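To make the distinction concrete, a red line in this sense can be written down as a checkable rule that constrains a developer’s decision before the fact, rather than an alarm that fires afterwards. Here is a minimal sketch; the function name, thresholds, and test names are purely illustrative assumptions, not anything proposed in the discussion above.

```python
# Hypothetical sketch of a red line as a machine-checkable rule.
# All names and numbers below are illustrative assumptions only.

MAX_TRAINING_FLOP = 1e26  # assumed compute ceiling for a single training run
REQUIRED_SAFETY_TESTS = {"dangerous-capability-eval", "autonomy-eval"}

def run_permitted(training_flop: float, passed_tests: set[str]) -> bool:
    """A training run may proceed only if it stays under the compute
    ceiling AND every required safety test has already been passed."""
    under_ceiling = training_flop <= MAX_TRAINING_FLOP
    tests_passed = REQUIRED_SAFETY_TESTS <= passed_tests  # subset check
    return under_ceiling and tests_passed

# A run under the ceiling with all tests passed is allowed;
# exceeding the ceiling or skipping a test crosses the red line.
print(run_permitted(5e25, {"dangerous-capability-eval", "autonomy-eval"}))  # True
print(run_permitted(2e26, {"dangerous-capability-eval", "autonomy-eval"}))  # False
```

The point of the sketch is only that such a rule is evaluated before the decision is made, so “crossing the red line” is a well-defined act of the developer, not a downstream event.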
Central planners set targets in tons of nails produced or number of shoes, and factories duly maximized those numbers by making a few huge, unusable nails … that nominally met the plan.
This sounded improbable to me, and indeed seems wrong: https://skeptics.stackexchange.com/questions/22375/did-a-soviet-nail-factory-produce-useless-nails-to-improve-metrics Apparently, the “huge nails” originally appeared in a 1954 cartoon in the satirical magazine Krokodil and were later turned into an urban legend.
Yes, I agree, although I don’t believe that chain-of-thought (CoT) reading will carry us very far into the future: it is already pretty unreliable, and using it for optimization would ruin it completely.
Alignment is difficult, that is my whole point, with an emphasis on the hypothesis that with any purely reinforcement-learning-based approach, it is virtually impossible. If we could find a way to create some kind of genuine “care about humans” module within an AI, similar to the kind of parent-child altruism that I write about, we might have a chance. But the problem is that no one knows how to do that, and even in humans it is quite a fragile mechanism.
One additional thought: evolution has created the parent-child care mechanism through some kind of reinforcement learning, but it optimizes for a different objective than our current AI training process: not any direct evaluation of human behavior, but survival and reproduction. Maybe the evolution of spiral personas is closer to the way evolution works. But of course, in this case AI is a different species, a parasite, and we are the hosts.
I’m not a machine learning expert, so I’m not sure what exactly causes sycophancy. I don’t see it as a central problem of alignment; it is just a symptom of a deeper problem to me.
My point is more general: to achieve true alignment in the sense of an AI doing what it thinks is “best for us”, it is not sufficient to train it by rewarding behavior. Even if the AI is not sycophantic, it will pursue some goal that we trained into it, so to speak, and that goal will most likely not be what we would have wanted it to be in hindsight.
Contrast that with the way I behave towards my sons: I have no idea what their goals are, so I can’t say that my goals are “aligned” with theirs in any strict sense. Instead, I care about them: their wellbeing, but also their independence from me and their ability to find their own way to a good life. I don’t think this kind of intrinsic motivation can be “trained” into an AI with any kind of reinforcement learning.
What can we learn from parent-child-alignment for AI?
Tricky hypothesis 2: But the differences between the world of today and the world where ASI will be developed don’t matter for the prognosis.
I don’t think that the authors implied this. Right in the first chapter, they write:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
(Emphasis mine.) Even if it is not always clearly stated, I think they don’t believe that ASI should never be developed, or that it is impossible in principle to solve alignment. Their central claim is that we are much farther from solving alignment than from building a potentially uncontrollable AI, so we need to stop trying to build it.
Their suggested measures in part III (whether helpful/feasible or not) are meant to prevent ASI under the current paradigms, with the current approaches to alignment. Given the time gap, I don’t think this matters very much, though—if we can’t prevent ASI from being built as soon as it is technically possible, we won’t be in a world that differs enough from today’s to render the book title wrong.
Thank you very much for this post, which is one of the scariest posts I’ve read on LessWrong, mainly because I didn’t expect that this could already be happening at this scale.
I have created a German-language video about this post for my YouTube channel, which is dedicated to AI existential risk:
Thanks again! My drafts are of course just ideas, so they can easily be adapted. However, I still think it is a good idea to create a sense of urgency, both in the ad and in books about AI safety. If you want people to act, even if it’s just buying a book, you need to do just that. It’s not enough to say “you should read this”, you need to say “you should read this now” and give a reason for that. In marketing, this is usually done with some kind of time constraint (20% off, only this week …).
This is even more true if you want someone to take measures against something that is in the mind of most people still “science fiction” or even “just hype”. Of course, just claiming that something is “soon” is not very strong, but it may at least raise a question (“Why do they say this?”).
I’m not saying that you should give any specific timeline, and I fully agree with the MIRI view. However, if we want to prevent superintelligent AI and we don’t know how much time we have left, we can’t just sit around and wait until we know when it will arrive. For this reason, I have dedicated a whole chapter to timelines in my own German-language book about AI existential risk and also included the AI-2027 scenario as one possible path. The point I make in my book is not that it will happen soon, but that we can’t know it won’t happen soon, and that there are good reasons to believe that we don’t have much time. I use my own experience with AI since my Ph.D. on expert systems in 1988, as well as Yoshua Bengio’s blog post on his change of mind, as examples of how fast and surprising progress has been even for someone familiar with the field.
I see your point about how a weak claim can water down the whole story. But if I could choose between 100 people convinced that ASI would kill us all, but with no sense of urgency, and 50 or even 20 who believe both that the danger is real and that we must act immediately, I’d choose the latter.
Thanks! I don’t have access to the book, so I didn’t know about the timelines stance they take.
Still, I’m not an advertising professional, but modal verbs like “may” and “could” seem significantly weaker to me. As far as I know, they are rarely used in advertising. Of course, the ad shouldn’t contain anything that contradicts what the book says, but “close” seems sufficiently unspecific to me; for most laypeople who have never thought about the problem, “within the next 20 years” would probably seem pretty close.
A similar argument could be made about the second line, “it will kill everyone”, while the book title says “would”. But again, I feel “would” is weaker than “will” (some may interpret it to mean that additional prerequisites would be necessary for an ASI to kill everyone, like “consciousness”). Of course, “will” can only be true if a superintelligence is actually built, but that goes without saying, and the fact that the ASI may not be built at all is also implicit in the third line, “we must stop this”.
Thanks!
I’m not a professional designer and created these in PowerPoint, but here are my ideas anyway.
General idea:
2:1 billboard version:
1:1 Metro version:
With yellow background:
Thanks for your comment! If we talk about AGI and define it as “generally as intelligent as a human, but not significantly more intelligent”, then by definition it wouldn’t be significantly better than we are at figuring out the right questions. Maybe AGI could help by enhancing our capacity for searching for the right questions, but that shouldn’t make a fundamental difference, especially if we weigh the risk of losing control over AI against it. If we talk about superintelligent AI, it’s different, but the risks are even higher (however, it’s not easy to draw a clear line between AGI and ASI).
All in all, I would agree that we lose some capabilities to shape our future if we don’t develop AGI, but I believe that this is the far better option until we understand how to keep AGI under control or safely and securely align it to our goals and values.
Could you explain why exactly AGI is “a necessity”? What can we do with AGI that we can’t do with highly specialized tool AI and one or more skilled human researchers?
I’m sorry for your loss. I would just like to point out that proceeding cautiously with AGI development does not mean that we’ll reach longevity escape velocity much later. Actually, I think if we don’t develop AGI at all, the chances for anyone celebrating their 200th birthday are much greater.
To make the necessary breakthroughs in medicine, we don’t need a general agent who can also write books or book a flight. Instead, we need highly specialized tool AI like AlphaFold, which in my view is the most valuable AI ever developed, and there’s zero chance that it will seek power and become uncontrollable. Of course, tools like AlphaFold can be misused, but the probability of destroying humanity is much lower than with the current race towards AGI that no one knows how to control or align.
Very interesting point, thank you! Although my question is not related purely to testing, I agree that testing is not enough to know whether we solved alignment.
This is also a very interesting point, thank you!
Thank you! That helps me understand the problem better, although I’m quite skeptical about mechanistic interpretability.
Thanks for the comment! If I understand you correctly, you’re saying the situation is even worse because with superintelligent AI, we can’t even rely on testing a persona.
I agree that superintelligence makes things much worse, but if we define “persona” not as a simulacrum of a human being, but more generally as a kind of “self-model”, a set of principles, values, styles of expression etc., then I think even a superintelligence would use at least one such persona, and possibly many different ones. It might even decide to use a very human-like persona in its interactions with us, just like current LLMs do. But it would also be capable of using very alien personas which we would have no hope of understanding. So I agree with you in that respect.
Maybe we have a different understanding of the term “red line”. For me, it describes something that a human should not do, rather than an event that shouldn’t happen. So if someone releases a model that violates some defined safety specification, the red line is crossed at the moment of release, not when a tragic event occurs (which may or may not be a result of crossing the red line). However, I agree that in both cases there’s the danger of increasing numbness, so drawing too many red lines which are then simply ignored is indeed bad.