Beth Barnes

Karma: 3,172

Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/

Clarifying METR’s Auditing Role

Beth Barnes

30 May 2024 18:41 UTC

108 points

1 comment · 2 min read · LW link

Introducing METR’s Autonomy Evaluation Resources

Megan Kinniment and Beth Barnes

15 Mar 2024 23:16 UTC

90 points

0 comments · 1 min read · LW link

(metr.github.io)

METR is hiring!

Beth Barnes

26 Dec 2023 21:00 UTC

65 points

1 comment · 1 min read · LW link

Bounty: Diverse hard tasks for LLM agents

Beth Barnes and Megan Kinniment

17 Dec 2023 1:04 UTC

49 points

31 comments · 16 min read · LW link

Send us example gnarly bugs

Beth Barnes, Megan Kinniment and Tao Lin

10 Dec 2023 5:23 UTC

77 points

10 comments · 2 min read · LW link

Managing risks of our own work

Beth Barnes

18 Aug 2023 0:41 UTC

66 points

0 comments · 2 min read · LW link

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes

1 Aug 2023 18:30 UTC

153 points

12 comments · 5 min read · LW link

(evals.alignment.org)

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes

19 Mar 2023 0:25 UTC

233 points

54 comments · 8 min read · LW link

(evals.alignment.org)

Reflection Mechanisms as an Alignment Target—Attitudes on “near-term” AI

elandgre, Beth Barnes and Marius Hobbhahn

2 Mar 2023 4:29 UTC

21 points

0 comments · 8 min read · LW link

‘simulator’ framing and confusions about LLMs

Beth Barnes

31 Dec 2022 23:38 UTC

104 points

11 comments · 4 min read · LW link

Reflection Mechanisms as an Alignment target: A follow-up survey

Marius Hobbhahn, elandgre and Beth Barnes

5 Oct 2022 14:03 UTC

21 points

2 comments · 7 min read · LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes

9 Sep 2022 22:46 UTC

99 points

7 comments · 10 min read · LW link

Help ARC evaluate capabilities of current language models (still need people)

Beth Barnes

19 Jul 2022 4:55 UTC

95 points

6 comments · 2 min read · LW link

Reflection Mechanisms as an Alignment target: A survey

Marius Hobbhahn, elandgre and Beth Barnes

22 Jun 2022 15:05 UTC

32 points

1 comment · 14 min read · LW link

Another list of theories of impact for interpretability

Beth Barnes

13 Apr 2022 13:29 UTC

33 points

1 comment · 5 min read · LW link

Reverse-engineering using interpretability

Beth Barnes

29 Dec 2021 23:21 UTC

21 points

2 comments · 5 min read · LW link

Risks from AI persuasion

Beth Barnes

24 Dec 2021 1:48 UTC

76 points

15 comments · 31 min read · LW link

Some thoughts on why adversarial training might be useful

Beth Barnes

8 Dec 2021 1:28 UTC

9 points

6 comments · 3 min read · LW link

Considerations on interaction between AI and expected value of the future

Beth Barnes

7 Dec 2021 2:46 UTC

68 points

28 comments · 4 min read · LW link

More detailed proposal for measuring alignment of current models

Beth Barnes

20 Nov 2021 0:03 UTC

31 points

0 comments · 8 min read · LW link