Joe Kwon

Karma: 157

How Robust Is Monitoring Against Secret Loyalties?

Joe Kwon, 26 Feb 2026 15:50 UTC
8 points
0 comments, 5 min read

Reasoning Traces as a Path to Data-Efficient Generalization in Data Poisoning

Joe Kwon, 25 Feb 2026 18:17 UTC
14 points
0 comments, 3 min read

The Easiest Route to Secret Loyalty May Be Hijacking the Model’s Chain of Command

Joe Kwon, 24 Feb 2026 17:47 UTC
16 points
1 comment, 5 min read

Pre-training data poisoning likely makes installing secret loyalties easier

Joe Kwon, 23 Feb 2026 18:12 UTC
12 points
0 comments, 4 min read

How Secret Loyalty Differs from Standard Backdoor Threats

Joe Kwon, 12 Feb 2026 18:48 UTC
23 points
4 comments, 12 min read

[Question] Are there any groupchats for people working on Representation reading/control, activation steering type experiments?

Joe Kwon, 20 May 2024 18:03 UTC
3 points
1 comment, 1 min read

Claude wants to be conscious

Joe Kwon, 13 Apr 2024 1:40 UTC
2 points
8 comments, 6 min read

[Linkpost] Faith and Fate: Limits of Transformers on Compositionality

Joe Kwon, 16 Jun 2023 15:04 UTC
19 points
4 comments, 1 min read
(arxiv.org)

The Intrinsic Interplay of Human Values and Artificial Intelligence: Navigating the Optimization Challenge

Joe Kwon, 5 Jun 2023 20:41 UTC
2 points
1 comment, 18 min read