Beth Barnes (Karma: 2,717)
Alignment researcher. Views are my own and not those of my employer.
https://www.barnes.page/
More information about the dangerous capability evaluations we did with GPT-4 and Claude.
  Beth Barnes · 19 Mar 2023 0:25 UTC · 233 points · 54 comments · 8 min read · LW link (evals.alignment.org)

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks
  Beth Barnes · 1 Aug 2023 18:30 UTC · 153 points · 12 comments · 5 min read · LW link (evals.alignment.org)

Debate update: Obfuscated arguments problem
  Beth Barnes · 23 Dec 2020 3:24 UTC · 135 points · 24 comments · 16 min read · LW link

A very crude deception eval is already passed
  Beth Barnes · 29 Oct 2021 17:57 UTC · 108 points · 6 comments · 2 min read · LW link

Imitative Generalisation (AKA ‘Learning the Prior’)
  Beth Barnes · 10 Jan 2021 0:30 UTC · 107 points · 15 comments · 11 min read · LW link · 1 review

Call for research on evaluating alignment (funding + advice available)
  Beth Barnes · 31 Aug 2021 23:28 UTC · 105 points · 11 comments · 5 min read · LW link

‘simulator’ framing and confusions about LLMs
  Beth Barnes · 31 Dec 2022 23:38 UTC · 104 points · 11 comments · 4 min read · LW link

Writeup: Progress on AI Safety via Debate
  Beth Barnes and paulfchristiano · 5 Feb 2020 21:04 UTC · 100 points · 18 comments · 33 min read · LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer
  Beth Barnes · 9 Sep 2022 22:46 UTC · 99 points · 7 comments · 10 min read · LW link

Help ARC evaluate capabilities of current language models (still need people)
  Beth Barnes · 19 Jul 2022 4:55 UTC · 95 points · 6 comments · 2 min read · LW link

Send us example gnarly bugs
  Beth Barnes, Megan Kinniment and Tao Lin · 10 Dec 2023 5:23 UTC · 77 points · 10 comments · 2 min read · LW link

Risks from AI persuasion
  Beth Barnes · 24 Dec 2021 1:48 UTC · 69 points · 15 comments · 31 min read · LW link

Considerations on interaction between AI and expected value of the future
  Beth Barnes · 7 Dec 2021 2:46 UTC · 68 points · 28 comments · 4 min read · LW link

Managing risks of our own work
  Beth Barnes · 18 Aug 2023 0:41 UTC · 66 points · 0 comments · 2 min read · LW link

METR is hiring!
  Beth Barnes · 26 Dec 2023 21:00 UTC · 65 points · 1 comment · 1 min read · LW link

Looking for adversarial collaborators to test our Debate protocol
  Beth Barnes · 19 Aug 2020 3:15 UTC · 52 points · 5 comments · 1 min read · LW link

Bounty: Diverse hard tasks for LLM agents
  Beth Barnes and Megan Kinniment · 17 Dec 2023 1:04 UTC · 49 points · 31 comments · 16 min read · LW link

Another list of theories of impact for interpretability
  Beth Barnes · 13 Apr 2022 13:29 UTC · 33 points · 1 comment · 5 min read · LW link

More detailed proposal for measuring alignment of current models
  Beth Barnes · 20 Nov 2021 0:03 UTC · 31 points · 0 comments · 8 min read · LW link

Reverse-engineering using interpretability
  Beth Barnes · 29 Dec 2021 23:21 UTC · 21 points · 2 comments · 5 min read · LW link