RSS

AI Evaluations

TagLast edit: 1 Aug 2023 1:03 UTC by duck_master

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

See also:

How evals might (or might not) pre­vent catas­trophic risks from AI

Akash7 Feb 2023 20:16 UTC
44 points
0 comments9 min readLW link

The case for more am­bi­tious lan­guage model evals

Jozdien30 Jan 2024 0:01 UTC
109 points
28 comments5 min readLW link

Thoughts on shar­ing in­for­ma­tion about lan­guage model capabilities

paulfchristiano31 Jul 2023 16:04 UTC
205 points
36 comments11 min readLW link

When can we trust model eval­u­a­tions?

evhub28 Jul 2023 19:42 UTC
147 points
9 comments10 min readLW link

An­nounc­ing Apollo Research

30 May 2023 16:17 UTC
215 points
11 comments8 min readLW link

Towards un­der­stand­ing-based safety evaluations

evhub15 Mar 2023 18:18 UTC
156 points
16 comments5 min readLW link

OMMC An­nounces RIP

1 Apr 2024 23:20 UTC
181 points
5 comments2 min readLW link

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

8 Aug 2023 1:30 UTC
308 points
26 comments18 min readLW link

Scal­able And Trans­fer­able Black-Box Jailbreaks For Lan­guage Models Via Per­sona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments2 min readLW link
(arxiv.org)

How good are LLMs at do­ing ML on an un­known dataset?

Håvard Tveit Ihle1 Jul 2024 9:04 UTC
35 points
4 comments13 min readLW link

Cri­tiques of the AI con­trol agenda

Jozdien14 Feb 2024 19:25 UTC
47 points
14 comments9 min readLW link

Bounty: Di­verse hard tasks for LLM agents

17 Dec 2023 1:04 UTC
49 points
31 comments16 min readLW link

OpenAI: Pre­pared­ness framework

Zach Stein-Perlman18 Dec 2023 18:30 UTC
70 points
23 comments4 min readLW link
(openai.com)

An­nounc­ing Hu­man-al­igned AI Sum­mer School

22 May 2024 8:55 UTC
50 points
0 comments1 min readLW link
(humanaligned.ai)

METR is hiring!

Beth Barnes26 Dec 2023 21:00 UTC
65 points
1 comment1 min readLW link

“Suc­cess­ful lan­guage model evals” by Ja­son Wei

Arjun Panickssery25 May 2024 9:34 UTC
10 points
0 comments1 min readLW link
(www.jasonwei.net)

A starter guide for evals

8 Jan 2024 18:24 UTC
44 points
2 comments12 min readLW link
(www.apolloresearch.ai)

AI com­pa­nies aren’t re­ally us­ing ex­ter­nal evaluators

Zach Stein-Perlman24 May 2024 16:01 UTC
240 points
15 comments4 min readLW link

The Leeroy Jenk­ins prin­ci­ple: How faulty AI could guaran­tee “warn­ing shots”

titotal14 Jan 2024 15:03 UTC
46 points
6 comments1 min readLW link
(titotal.substack.com)

We need a Science of Evals

22 Jan 2024 20:30 UTC
66 points
13 comments9 min readLW link

Soft Prompts for Eval­u­a­tion: Mea­sur­ing Con­di­tional Dis­tance of Capabilities

porby2 Feb 2024 5:49 UTC
44 points
1 comment4 min readLW link
(1drv.ms)

Apollo Re­search 1-year update

29 May 2024 17:44 UTC
92 points
0 comments7 min readLW link

Send us ex­am­ple gnarly bugs

10 Dec 2023 5:23 UTC
77 points
10 comments2 min readLW link

Clar­ify­ing METR’s Au­dit­ing Role

Beth Barnes30 May 2024 18:41 UTC
103 points
1 comment2 min readLW link

Pro­to­col eval­u­a­tions: good analo­gies vs control

Fabien Roger19 Feb 2024 18:00 UTC
35 points
10 comments11 min readLW link

Self-Aware­ness: Tax­on­omy and eval suite proposal

Daniel Kokotajlo17 Feb 2024 1:47 UTC
63 points
0 comments11 min readLW link

Deep­Mind: Eval­u­at­ing Fron­tier Models for Danger­ous Capabilities

Zach Stein-Perlman21 Mar 2024 3:00 UTC
61 points
6 comments1 min readLW link
(arxiv.org)

Third-party test­ing as a key in­gre­di­ent of AI policy

Zac Hatfield-Dodds25 Mar 2024 22:40 UTC
11 points
1 comment12 min readLW link
(www.anthropic.com)

Run evals on base mod­els too!

orthonormal4 Apr 2024 18:43 UTC
47 points
6 comments1 min readLW link

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

30 Apr 2024 18:51 UTC
198 points
37 comments45 min readLW link

Com­par­ing Quan­tized Perfor­mance in Llama Models

NickyP15 Jul 2024 16:01 UTC
30 points
2 comments8 min readLW link

[In­terim re­search re­port] Eval­u­at­ing the Goal-Direct­ed­ness of Lan­guage Models

18 Jul 2024 18:19 UTC
29 points
0 comments11 min readLW link

[Question] Can GPT-4 play 20 ques­tions against an­other in­stance of it­self?

Nathan Helm-Burger28 Mar 2023 1:11 UTC
15 points
1 comment1 min readLW link
(evanthebouncy.medium.com)

Re­spon­si­ble De­ploy­ment in 20XX

Carson20 Apr 2023 0:24 UTC
4 points
0 comments4 min readLW link

Refram­ing the bur­den of proof: Com­pa­nies should prove that mod­els are safe (rather than ex­pect­ing au­di­tors to prove that mod­els are dan­ger­ous)

Akash25 Apr 2023 18:49 UTC
27 points
11 comments3 min readLW link
(childrenoficarus.substack.com)

Prevent­ing Lan­guage Models from hid­ing their reasoning

31 Oct 2023 14:34 UTC
108 points
14 comments12 min readLW link

Eval­u­at­ing strate­gic rea­son­ing in GPT models

phelps-sg25 May 2023 11:51 UTC
4 points
1 comment8 min readLW link

ARC Evals new re­port: Eval­u­at­ing Lan­guage-Model Agents on Real­is­tic Au­tonomous Tasks

Beth Barnes1 Aug 2023 18:30 UTC
153 points
12 comments5 min readLW link
(evals.alignment.org)

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC
25 points
0 comments2 min readLW link

Au­tonomous repli­ca­tion and adap­ta­tion: an at­tempt at a con­crete dan­ger threshold

Hjalmar_Wijk17 Aug 2023 1:31 UTC
43 points
0 comments13 min readLW link

Manag­ing risks of our own work

Beth Barnes18 Aug 2023 0:41 UTC
66 points
0 comments2 min readLW link

[Question] Would more model evals teams be good?

Ryan Kidd25 Feb 2023 22:01 UTC
20 points
4 comments1 min readLW link

A very crude de­cep­tion eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC
108 points
6 comments2 min readLW link

ARC tests to see if GPT-4 can es­cape hu­man con­trol; GPT-4 failed to do so

Christopher King15 Mar 2023 0:29 UTC
116 points
22 comments2 min readLW link

More in­for­ma­tion about the dan­ger­ous ca­pa­bil­ity eval­u­a­tions we did with GPT-4 and Claude.

Beth Barnes19 Mar 2023 0:25 UTC
233 points
54 comments8 min readLW link
(evals.alignment.org)

Re­sults from the AI x Democ­racy Re­search Sprint

14 Jun 2024 16:40 UTC
13 points
0 comments6 min readLW link

LLMs can strate­gi­cally de­ceive while do­ing gain-of-func­tion re­search

Igor Ivanov24 Jan 2024 15:45 UTC
33 points
4 comments11 min readLW link

Toward a tax­on­omy of cog­ni­tive bench­marks for agen­tic AGIs

Ben Smith27 Jun 2024 23:50 UTC
7 points
0 comments5 min readLW link

Re­view of METR’s pub­lic eval­u­a­tion protocol

30 Jun 2024 22:03 UTC
10 points
0 comments5 min readLW link

Orthog­o­nal­ity or the “Hu­man Worth Hy­poth­e­sis”?

Jeffs23 Jan 2024 0:57 UTC
21 points
31 comments3 min readLW link

Longer-term Be­havi­our of Gen­er­a­tive Com­pan­ion AIs: A So­cial Si­mu­la­tion Approach

Reed14 Aug 2023 15:24 UTC
5 points
0 comments7 min readLW link

Auto-En­hance: Devel­op­ing a meta-bench­mark to mea­sure LLM agents’ abil­ity to im­prove other agents

22 Jul 2024 12:33 UTC
9 points
0 comments14 min readLW link

An­a­lyz­ing Deep­Mind’s Prob­a­bil­is­tic Meth­ods for Eval­u­at­ing Agent Capabilities

22 Jul 2024 16:17 UTC
54 points
0 comments16 min readLW link

Nav­i­gat­ing the Attackspace

Jonas Kgomo12 Dec 2023 13:59 UTC
1 point
0 comments2 min readLW link

Pro­posal on AI eval­u­a­tion: false-proving

ProgramCrafter31 Mar 2023 12:12 UTC
1 point
2 comments1 min readLW link

The dreams of GPT-4

RomanS20 Mar 2023 17:00 UTC
14 points
7 comments9 min readLW link

Re­pro­duc­ing ARC Evals’ re­cent re­port on lan­guage model agents

Thomas Broadley1 Sep 2023 16:52 UTC
102 points
17 comments3 min readLW link
(thomasbroadley.com)

LM Si­tu­a­tional Aware­ness, Eval­u­a­tion Pro­posal: Vio­lat­ing Imitation

Jacob Pfau26 Apr 2023 22:53 UTC
16 points
2 comments2 min readLW link

MMLU’s Mo­ral Sce­nar­ios Bench­mark Doesn’t Mea­sure What You Think it Measures

corey morris27 Sep 2023 17:54 UTC
14 points
2 comments4 min readLW link
(medium.com)

OpenAI Credit Ac­count (2510$)

Emirhan BULUT21 Jan 2024 2:30 UTC
1 point
0 comments1 min readLW link

Tall Tales at Differ­ent Scales: Eval­u­at­ing Scal­ing Trends For De­cep­tion In Lan­guage Models

8 Nov 2023 11:37 UTC
49 points
0 comments18 min readLW link

The­o­ries of Change for AI Auditing

13 Nov 2023 19:33 UTC
54 points
0 comments18 min readLW link
(www.apolloresearch.ai)

A sim­ple treach­er­ous turn demonstration

nikola25 Nov 2023 4:51 UTC
22 points
5 comments3 min readLW link

A call for a quan­ti­ta­tive re­port card for AI bioter­ror­ism threat models

Juno4 Dec 2023 6:35 UTC
12 points
0 comments10 min readLW link

Pro­tect­ing against sud­den ca­pa­bil­ity jumps dur­ing training

nikola2 Dec 2023 4:22 UTC
8 points
0 comments2 min readLW link

The Method of Loci: With some brief re­marks, in­clud­ing trans­form­ers and eval­u­at­ing AIs

Bill Benzon2 Dec 2023 14:36 UTC
6 points
0 comments3 min readLW link

What’s new at FAR AI

4 Dec 2023 21:18 UTC
41 points
0 comments5 min readLW link
(far.ai)

2023 Align­ment Re­search Up­dates from FAR AI

4 Dec 2023 22:32 UTC
18 points
0 comments8 min readLW link
(far.ai)

Eval­u­at­ing Lan­guage Model Be­havi­ours for Shut­down Avoidance in Tex­tual Scenarios

16 May 2023 10:53 UTC
22 points
0 comments13 min readLW link

Im­prov­ing the safety of AI evals

17 May 2023 22:24 UTC
13 points
7 comments7 min readLW link

The Com­pleat Cybornaut

19 May 2023 8:44 UTC
64 points
2 comments16 min readLW link

Deep­Mind: Model eval­u­a­tion for ex­treme risks

Zach Stein-Perlman25 May 2023 3:00 UTC
94 points
11 comments1 min readLW link
(arxiv.org)

Re­spon­si­ble scal­ing policy TLDR

lukehmiles28 Sep 2023 18:51 UTC
9 points
0 comments1 min readLW link

Seek­ing (Paid) Case Stud­ies on Standards

HoldenKarnofsky26 May 2023 17:58 UTC
69 points
9 comments11 min readLW link

Ques­tions I’d Want to Ask an AGI+ to Test Its Un­der­stand­ing of Ethics

sweenesm26 Jan 2024 23:40 UTC
14 points
6 comments4 min readLW link

In­tro­duc­ing REBUS: A Ro­bust Eval­u­a­tion Bench­mark of Un­der­stand­ing Symbols

15 Jan 2024 21:21 UTC
33 points
0 comments1 min readLW link

[Question] AI Rights: In your view, what would be re­quired for an AGI to gain rights and pro­tec­tions from the var­i­ous Govern­ments of the World?

Super AGI9 Jun 2023 1:24 UTC
10 points
26 comments1 min readLW link

Skep­ti­cism About Deep­Mind’s “Grand­mas­ter-Level” Chess Without Search

Arjun Panickssery12 Feb 2024 0:56 UTC
55 points
13 comments3 min readLW link

Challenge pro­posal: small­est pos­si­ble self-hard­en­ing back­door for RLHF

Christopher King29 Jun 2023 16:56 UTC
7 points
0 comments2 min readLW link

Ro­bust­ness of Model-Graded Eval­u­a­tions and Au­to­mated Interpretability

15 Jul 2023 19:12 UTC
44 points
5 comments9 min readLW link

The “spel­ling mir­a­cle”: GPT-3 spel­ling abil­ities and glitch to­kens revisited

mwatkins31 Jul 2023 19:47 UTC
85 points
29 comments20 min readLW link

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

15 Mar 2024 23:16 UTC
90 points
0 comments1 min readLW link
(metr.github.io)

AI Safety Eval­u­a­tions: A Reg­u­la­tory Review

19 Mar 2024 15:05 UTC
21 points
1 comment11 min readLW link

OpenAI Credit Ac­count (2510$)

Emirhan BULUT21 Jan 2024 2:32 UTC
1 point
0 comments1 min readLW link

Eval­u­at­ing Su­per­hu­man Models with Con­sis­tency Checks

1 Aug 2023 7:51 UTC
21 points
2 comments9 min readLW link
(arxiv.org)

Mea­sur­ing Pre­dictabil­ity of Per­sona Evaluations

6 Apr 2024 8:46 UTC
19 points
0 comments7 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC
2 points
8 comments6 min readLW link

LLM Eval­u­a­tors Rec­og­nize and Fa­vor Their Own Generations

17 Apr 2024 21:09 UTC
44 points
1 comment3 min readLW link
(tiny.cc)

In­duc­ing Un­prompted Misal­ign­ment in LLMs

19 Apr 2024 20:00 UTC
38 points
6 comments16 min readLW link

An In­tro­duc­tion to AI Sandbagging

26 Apr 2024 13:40 UTC
43 points
7 comments8 min readLW link

Mea­sur­ing and Im­prov­ing the Faith­ful­ness of Model-Gen­er­ated Rea­son­ing

18 Jul 2023 16:36 UTC
110 points
14 comments6 min readLW link

METR is hiring ML Re­search Eng­ineers and Scientists

Xodarap5 Jun 2024 21:27 UTC
5 points
0 comments1 min readLW link
(metr.org)

AI Safety In­sti­tute’s In­spect hello world ex­am­ple for AI evals

TheManxLoiner16 May 2024 20:47 UTC
3 points
0 comments1 min readLW link
(lovkush.medium.com)

[Paper] AI Sand­bag­ging: Lan­guage Models can Strate­gi­cally Un­der­perform on Evaluations

13 Jun 2024 10:04 UTC
77 points
10 comments2 min readLW link
(arxiv.org)

Fine-tun­ing is not suffi­cient for ca­pa­bil­ity elicitation

Theodore Chapman14 Jun 2024 18:50 UTC
41 points
3 comments9 min readLW link
No comments.