John Hughes

Karma: 569

Former MATS scholar working on scalable oversight and adversarial robustness.

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

8 Jul 2025 21:49 UTC

152 points

14 comments · 5 min read

(arxiv.org)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

John Hughes, abhayesian, Akbir Khan and Fabien Roger

8 Apr 2025 17:32 UTC

146 points

20 comments · 12 min read

Tips and Code for Empirical Research Workflows

John Hughes and Ethan Perez

20 Jan 2025 22:31 UTC

95 points

14 comments · 20 min read

Tips On Empirical Research Slides

James Chua, John Hughes, Ethan Perez and Owain_Evans

8 Jan 2025 5:06 UTC

93 points

4 comments · 6 min read

Best-of-N Jailbreaking

John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, Fazl, Henry Sleight, Ethan Perez and mrinank_sharma

14 Dec 2024 4:58 UTC

78 points

5 comments · 2 min read

(arxiv.org)

Debating with More Persuasive LLMs Leads to More Truthful Answers

Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez

7 Feb 2024 21:28 UTC

89 points

14 comments · 9 min read

(arxiv.org)