Tomek Korbak (Karma: 601)
Aligning language models at Anthropic
https://tomekkorbak.com/
Posts:
Compositional preference models for aligning LMs
Tomek Korbak · 25 Oct 2023 12:17 UTC · 18 points · 2 comments · 5 min read · LW link
Towards Understanding Sycophancy in Language Models
Ethan Perez, mrinank_sharma, Meg and Tomek Korbak · 24 Oct 2023 0:30 UTC · 66 points · 0 comments · 2 min read · LW link (arxiv.org)
Paper: LLMs trained on “A is B” fail to learn “B is A”
lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak · 23 Sep 2023 19:55 UTC · 120 points · 73 comments · 4 min read · LW link (arxiv.org)
Paper: On measuring situational awareness in LLMs
Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, lberglund, Asa Cooper Stickland, Meg and Maximilian Kaufmann · 4 Sep 2023 12:54 UTC · 106 points · 16 comments · 5 min read · LW link (arxiv.org)
Imitation Learning from Language Feedback
Jérémy Scheurer, Tomek Korbak and Ethan Perez · 30 Mar 2023 14:11 UTC · 71 points · 3 comments · 10 min read · LW link
Pretraining Language Models with Human Preferences
Tomek Korbak, Sam Bowman and Ethan Perez · 21 Feb 2023 17:57 UTC · 133 points · 18 comments · 11 min read · LW link
RL with KL penalties is better seen as Bayesian inference
Tomek Korbak and Ethan Perez · 25 May 2022 9:23 UTC · 114 points · 17 comments · 12 min read · LW link