Beth Barnes

Karma: 2,717

Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/

More information about the dangerous capability evaluations we did with GPT-4 and Claude

Beth Barnes · 19 Mar 2023 0:25 UTC
233 points
54 comments · 8 min read · LW link
(evals.alignment.org)

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes · 1 Aug 2023 18:30 UTC
153 points
12 comments · 5 min read · LW link
(evals.alignment.org)

Debate update: Obfuscated arguments problem

Beth Barnes · 23 Dec 2020 3:24 UTC
135 points
24 comments · 16 min read · LW link

A very crude deception eval is already passed

Beth Barnes · 29 Oct 2021 17:57 UTC
108 points
6 comments · 2 min read · LW link

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes · 10 Jan 2021 0:30 UTC
107 points
15 comments · 11 min read · LW link · 1 review

Call for research on evaluating alignment (funding + advice available)

Beth Barnes · 31 Aug 2021 23:28 UTC
105 points
11 comments · 5 min read · LW link

‘simulator’ framing and confusions about LLMs

Beth Barnes · 31 Dec 2022 23:38 UTC
104 points
11 comments · 4 min read · LW link

Writeup: Progress on AI Safety via Debate

5 Feb 2020 21:04 UTC
100 points
18 comments · 33 min read · LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes · 9 Sep 2022 22:46 UTC
99 points
7 comments · 10 min read · LW link

Help ARC evaluate capabilities of current language models (still need people)

Beth Barnes · 19 Jul 2022 4:55 UTC
95 points
6 comments · 2 min read · LW link

Send us example gnarly bugs

10 Dec 2023 5:23 UTC
77 points
10 comments · 2 min read · LW link

Risks from AI persuasion

Beth Barnes · 24 Dec 2021 1:48 UTC
69 points
15 comments · 31 min read · LW link

Considerations on interaction between AI and expected value of the future

Beth Barnes · 7 Dec 2021 2:46 UTC
68 points
28 comments · 4 min read · LW link

Managing risks of our own work

Beth Barnes · 18 Aug 2023 0:41 UTC
66 points
0 comments · 2 min read · LW link

METR is hiring!

Beth Barnes · 26 Dec 2023 21:00 UTC
65 points
1 comment · 1 min read · LW link

Looking for adversarial collaborators to test our Debate protocol

Beth Barnes · 19 Aug 2020 3:15 UTC
52 points
5 comments · 1 min read · LW link

Bounty: Diverse hard tasks for LLM agents

17 Dec 2023 1:04 UTC
49 points
31 comments · 16 min read · LW link

Another list of theories of impact for interpretability

Beth Barnes · 13 Apr 2022 13:29 UTC
33 points
1 comment · 5 min read · LW link

More detailed proposal for measuring alignment of current models

Beth Barnes · 20 Nov 2021 0:03 UTC
31 points
0 comments · 8 min read · LW link

Reverse-engineering using interpretability

Beth Barnes · 29 Dec 2021 23:21 UTC
21 points
2 comments · 5 min read · LW link