John Hughes
Karma: 639
Former MATS scholar working on scalable oversight and adversarial robustness.
Measuring the ability of Opus 4.5 to fool narrow classifiers
Fabien Roger and John Hughes
2 May 2026 22:43 UTC, 31 points, 0 comments, 8 min read
Why Do Some Language Models Fake Alignment While Others Don’t?
abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger
8 Jul 2025 21:49 UTC, 159 points, 14 comments, 5 min read (arxiv.org)
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes, abhayesian, Akbir Khan and Fabien Roger
8 Apr 2025 17:32 UTC, 147 points, 20 comments, 12 min read
Tips and Code for Empirical Research Workflows
John Hughes and Ethan Perez
20 Jan 2025 22:31 UTC, 110 points, 17 comments, 20 min read
Tips On Empirical Research Slides
James Chua, John Hughes, Ethan Perez and Owain_Evans
8 Jan 2025 5:06 UTC, 116 points, 4 comments, 6 min read
Best-of-N Jailbreaking
John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez and mrinank_sharma
14 Dec 2024 4:58 UTC, 79 points, 5 comments, 2 min read (arxiv.org)
Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan, John Hughes, Dan Valentine, Sam Bowman and Ethan Perez
7 Feb 2024 21:28 UTC, 89 points, 14 comments, 9 min read (arxiv.org)