
j_we

Karma: 72

I’m a PhD student working on AI Safety. I’m thinking about how we can use interpretability techniques to make LLMs safer.

Open Challenges in Representation Engineering

Apr 3, 2025, 7:21 PM
14 points
0 comments · 5 min read · LW link

Saarbrücken, Germany—ACX Meetups Everywhere Fall 2024

Aug 29, 2024, 6:37 PM
2 points
0 comments · 1 min read · LW link

An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs

Jul 14, 2024, 10:37 AM
37 points
6 comments · 17 min read · LW link

Immunization against harmful fine-tuning attacks

Jun 6, 2024, 3:17 PM
4 points
0 comments · 12 min read · LW link

Training-time domain authorization could be helpful for safety

May 25, 2024, 3:10 PM
15 points
4 comments · 7 min read · LW link

Data for IRL: What is needed to learn human values?

Oct 3, 2022, 9:23 AM
18 points
6 comments · 12 min read · LW link

Introduction to Effective Altruism: How to do good with your career

Sep 7, 2022, 6:12 PM
1 point
0 comments · 1 min read · LW link