Lukas Fluri
Karma: 71
Is the evidence in “Language Models Learn to Mislead Humans via RLHF” valid?
Aaryan Chandna, Lukas Fluri and micahcarroll
1 Dec 2025 6:50 UTC · 35 points · 0 comments · 19 min read
Zurich AI Safety is looking for (Co-)Directors—EOI
MariusWenk, alex.lloyd, Lukas Fluri and marcel.steimke
3 Sep 2025 17:40 UTC · 12 points · 0 comments · 4 min read
The Perils of Optimizing Learned Reward Functions
Lukas Fluri
11 Jul 2025 16:06 UTC · 19 points · 1 comment · 21 min read
Evaluating Superhuman Models with Consistency Checks
Daniel Paleka and Lukas Fluri
1 Aug 2023 7:51 UTC · 21 points · 2 comments · 9 min read · link post (arxiv.org)
Open Problems in Negative Side Effect Minimization
Fabian Schimpf and Lukas Fluri
6 May 2022 9:37 UTC · 12 points · 6 comments · 17 min read