
j_we

Karma: 72

I’m a PhD student working on AI safety. I’m thinking about how we can use interpretability techniques to make LLMs safer.

Open Challenges in Representation Engineering

3 Apr 2025 19:21 UTC
14 points
0 comments · 5 min read · LW link

Saarbrücken Germany—ACX Meetups Everywhere Fall 2024

j_we · 29 Aug 2024 18:37 UTC
2 points
0 comments · 1 min read · LW link

An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs

j_we · 14 Jul 2024 10:37 UTC
37 points
6 comments · 17 min read · LW link

Immunization against harmful fine-tuning attacks

6 Jun 2024 15:17 UTC
4 points
0 comments · 12 min read · LW link

Training-time domain authorization could be helpful for safety

25 May 2024 15:10 UTC
15 points
4 comments · 7 min read · LW link

Data for IRL: What is needed to learn human values?

j_we · 3 Oct 2022 9:23 UTC
18 points
6 comments · 12 min read · LW link

Introduction to Effective Altruism: How to do good with your career

j_we · 7 Sep 2022 18:12 UTC
1 point
0 comments · 1 min read · LW link