John Hughes
Karma: 569
Former MATS scholar working on scalable oversight and adversarial robustness.
Posts
Why Do Some Language Models Fake Alignment While Others Don’t?
abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger
8 Jul 2025 21:49 UTC · 152 points · 14 comments · 5 min read · LW link (arxiv.org)
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes, abhayesian, Akbir Khan and Fabien Roger
8 Apr 2025 17:32 UTC · 146 points · 20 comments · 12 min read · LW link
Tips and Code for Empirical Research Workflows
John Hughes and Ethan Perez
20 Jan 2025 22:31 UTC · 95 points · 14 comments · 20 min read · LW link
Tips On Empirical Research Slides
James Chua, John Hughes, Ethan Perez and Owain_Evans
8 Jan 2025 5:06 UTC · 93 points · 4 comments · 6 min read · LW link
Best-of-N Jailbreaking
John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, Fazl, Henry Sleight, Ethan Perez and mrinank_sharma
14 Dec 2024 4:58 UTC · 78 points · 5 comments · 2 min read · LW link (arxiv.org)
Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez
7 Feb 2024 21:28 UTC · 89 points · 14 comments · 9 min read · LW link (arxiv.org)