
JanWehner

Karma: 105

I’m a PhD student working on AI Safety. I’m thinking about how we can use interpretability techniques to make LLMs safer.

Safety Cases Explained: How to Argue an AI is Safe

JanWehner · 2 Dec 2025 11:03 UTC
14 points
2 comments · 9 min read · LW link

A Call for Better Risk Modelling

18 Nov 2025 9:08 UTC
19 points
0 comments · 4 min read · LW link

Learning from the Luddites: Implications for a modern AI labour movement

JanWehner · 16 Oct 2025 17:11 UTC
12 points
0 comments · 8 min read · LW link

Open Challenges in Representation Engineering

3 Apr 2025 19:21 UTC
14 points
0 comments · 5 min read · LW link

Saarbrücken, Germany - ACX Meetups Everywhere Fall 2024

JanWehner · 29 Aug 2024 18:37 UTC
2 points
0 comments · 1 min read · LW link

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs

JanWehner · 14 Jul 2024 10:37 UTC
38 points
6 comments · 17 min read · LW link

Immunization against harmful fine-tuning attacks

6 Jun 2024 15:17 UTC
4 points
0 comments · 12 min read · LW link

Training-time domain authorization could be helpful for safety

25 May 2024 15:10 UTC
15 points
4 comments · 7 min read · LW link

Data for IRL: What is needed to learn human values?

JanWehner · 3 Oct 2022 9:23 UTC
18 points
6 comments · 12 min read · LW link

Introduction to Effective Altruism: How to do good with your career

JanWehner · 7 Sep 2022 18:12 UTC
1 point
0 comments · 1 min read · LW link