John Hughes

Karma: 569

Former MATS scholar working on scalable oversight and adversarial robustness.

Why Do Some Language Models Fake Alignment While Others Don't?

8 Jul 2025 21:49 UTC
152 points
14 comments · 5 min read · LW link
(arxiv.org)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

8 Apr 2025 17:32 UTC
146 points
20 comments · 12 min read · LW link

Tips and Code for Empirical Research Workflows

20 Jan 2025 22:31 UTC
95 points
14 comments · 20 min read · LW link

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
93 points
4 comments · 6 min read · LW link

Best-of-N Jailbreaking

14 Dec 2024 4:58 UTC
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
89 points
14 comments · 9 min read · LW link
(arxiv.org)