j_we
Karma: 72
I’m a PhD student working on AI safety. I’m thinking about how we can use interpretability techniques to make LLMs safer.