Joe Kwon

Karma: 175

A Research Agenda for Secret Loyalties

13 May 2026 17:34 UTC
34 points
3 comments · 3 min read · LW link

How Robust Is Monitoring Against Secret Loyalties?

Joe Kwon · 26 Feb 2026 15:50 UTC
8 points
0 comments · 5 min read · LW link

Reasoning Traces as a Path to Data-Efficient Generalization in Data Poisoning

Joe Kwon · 25 Feb 2026 18:17 UTC
14 points
0 comments · 3 min read · LW link

The Easiest Route to Secret Loyalty May Be Hijacking the Model’s Chain of Command

Joe Kwon · 24 Feb 2026 17:47 UTC
16 points
1 comment · 5 min read · LW link

Pre-training data poisoning likely makes installing secret loyalties easier

Joe Kwon · 23 Feb 2026 18:12 UTC
12 points
0 comments · 4 min read · LW link

How Secret Loyalty Differs from Standard Backdoor Threats

Joe Kwon · 12 Feb 2026 18:48 UTC
23 points
4 comments · 12 min read · LW link

[Question] Are there any groupchats for people working on Representation reading/control, activation steering type experiments?

Joe Kwon · 20 May 2024 18:03 UTC
3 points
1 comment · 1 min read · LW link

Claude wants to be conscious

Joe Kwon · 13 Apr 2024 1:40 UTC
2 points
8 comments · 6 min read · LW link