Beth Barnes

Karma: 2,726

Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/

Introducing METR’s Autonomy Evaluation Resources

15 Mar 2024 23:16 UTC
90 points
0 comments · 1 min read · LW link
(metr.github.io)

METR is hiring!

Beth Barnes · 26 Dec 2023 21:00 UTC
65 points
1 comment · 1 min read · LW link

Bounty: Diverse hard tasks for LLM agents

17 Dec 2023 1:04 UTC
49 points
31 comments · 16 min read · LW link

Send us example gnarly bugs

10 Dec 2023 5:23 UTC
77 points
10 comments · 2 min read · LW link

Managing risks of our own work

Beth Barnes · 18 Aug 2023 0:41 UTC
66 points
0 comments · 2 min read · LW link

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes · 1 Aug 2023 18:30 UTC
153 points
12 comments · 5 min read · LW link
(evals.alignment.org)

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · 19 Mar 2023 0:25 UTC
233 points
54 comments · 8 min read · LW link
(evals.alignment.org)

Reflection Mechanisms as an Alignment Target—Attitudes on “near-term” AI

2 Mar 2023 4:29 UTC
20 points
0 comments · 8 min read · LW link

‘simulator’ framing and confusions about LLMs

Beth Barnes · 31 Dec 2022 23:38 UTC
104 points
11 comments · 4 min read · LW link

Reflection Mechanisms as an Alignment target: A follow-up survey

5 Oct 2022 14:03 UTC
15 points
2 comments · 7 min read · LW link

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes · 9 Sep 2022 22:46 UTC
99 points
7 comments · 10 min read · LW link

Help ARC evaluate capabilities of current language models (still need people)

Beth Barnes · 19 Jul 2022 4:55 UTC
95 points
6 comments · 2 min read · LW link

Reflection Mechanisms as an Alignment target: A survey

22 Jun 2022 15:05 UTC
32 points
1 comment · 14 min read · LW link

Another list of theories of impact for interpretability

Beth Barnes · 13 Apr 2022 13:29 UTC
33 points
1 comment · 5 min read · LW link

Reverse-engineering using interpretability

Beth Barnes · 29 Dec 2021 23:21 UTC
21 points
2 comments · 5 min read · LW link

Risks from AI persuasion

Beth Barnes · 24 Dec 2021 1:48 UTC
75 points
15 comments · 31 min read · LW link

Some thoughts on why adversarial training might be useful

Beth Barnes · 8 Dec 2021 1:28 UTC
9 points
6 comments · 3 min read · LW link

Considerations on interaction between AI and expected value of the future

Beth Barnes · 7 Dec 2021 2:46 UTC
68 points
28 comments · 4 min read · LW link

More detailed proposal for measuring alignment of current models

Beth Barnes · 20 Nov 2021 0:03 UTC
31 points
0 comments · 8 min read · LW link

A very crude deception eval is already passed

Beth Barnes · 29 Oct 2021 17:57 UTC
108 points
6 comments · 2 min read · LW link