AI Evaluations

TagLast edit: 1 Aug 2023 1:03 UTC by duck_master

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

developing a method-agnostic standard to demonstrate sufficient understanding of a model
ensuring that the level of understanding is adequate to catch dangerous failure modes
finding the right balance between behavioral and understanding-based evaluations.

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

See also:

How evals might (or might not) prevent catastrophic risks from AI

Akash7 Feb 2023 20:16 UTC

43 points

0 comments9 min readLW link

Thoughts on sharing information about language model capabilities

paulfchristiano31 Jul 2023 16:04 UTC

197 points

34 comments11 min readLW link

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

30 May 2023 16:17 UTC

215 points

11 comments8 min readLW link

When can we trust model evaluations?

evhub28 Jul 2023 19:42 UTC

143 points

9 comments10 min readLW link

Towards understanding-based safety evaluations

evhub15 Mar 2023 18:18 UTC

152 points

16 comments5 min readLW link

OMMC Announces RIP

Adam Scholl and aysja

1 Apr 2024 23:20 UTC

178 points

5 comments2 min readLW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison and Ethan Perez

8 Aug 2023 1:30 UTC

306 points

26 comments18 min readLW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

7 Nov 2023 17:59 UTC

36 points

2 comments2 min readLW link

(arxiv.org)

Evaluating strategic reasoning in GPT models

phelps-sg25 May 2023 11:51 UTC

4 points

1 comment8 min readLW link

Critiques of the AI control agenda

Jozdien14 Feb 2024 19:25 UTC

47 points

14 comments9 min readLW link

Responsible Deployment in 20XX

Carson20 Apr 2023 0:24 UTC

4 points

0 comments4 min readLW link

Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous)

Akash25 Apr 2023 18:49 UTC

27 points

11 comments3 min readLW link

(childrenoficarus.substack.com)

Protocol evaluations: good analogies vs control

Fabien Roger19 Feb 2024 18:00 UTC

35 points

10 comments11 min readLW link

Preventing Language Models from hiding their reasoning

Fabien Roger and ryan_greenblatt

31 Oct 2023 14:34 UTC

107 points

12 comments12 min readLW link

Self-Awareness: Taxonomy and eval suite proposal

Daniel Kokotajlo17 Feb 2024 1:47 UTC

61 points

0 comments11 min readLW link

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes1 Aug 2023 18:30 UTC

153 points

12 comments5 min readLW link

(evals.alignment.org)

Apollo Research is hiring evals and interpretability engineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC

25 points

0 comments2 min readLW link

We need a Science of Evals

Marius Hobbhahn and Jérémy Scheurer

22 Jan 2024 20:30 UTC

63 points

13 comments9 min readLW link

Autonomous replication and adaptation: an attempt at a concrete danger threshold

Hjalmar_Wijk17 Aug 2023 1:31 UTC

42 points

0 comments13 min readLW link

Managing risks of our own work

Beth Barnes18 Aug 2023 0:41 UTC

66 points

0 comments2 min readLW link

DeepMind: Evaluating Frontier Models for Dangerous Capabilities

Zach Stein-Perlman21 Mar 2024 3:00 UTC

61 points

0 comments1 min readLW link

(arxiv.org)

A starter guide for evals

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb and AlexMeinke

8 Jan 2024 18:24 UTC

44 points

2 comments12 min readLW link

(www.apolloresearch.ai)

Third-party testing as a key ingredient of AI policy

Zac Hatfield-Dodds25 Mar 2024 22:40 UTC

11 points

1 comment12 min readLW link

(www.anthropic.com)

[Question] Would more model evals teams be good?

Ryan Kidd25 Feb 2023 22:01 UTC

20 points

4 comments1 min readLW link

A very crude deception eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC

108 points

6 comments2 min readLW link

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King15 Mar 2023 0:29 UTC

116 points

22 comments2 min readLW link

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes19 Mar 2023 0:25 UTC

233 points

54 comments8 min readLW link

(evals.alignment.org)

Send us example gnarly bugs

Beth Barnes, Megan Kinniment and Tao Lin

10 Dec 2023 5:23 UTC

77 points

10 comments2 min readLW link

Run evals on base models too!

orthonormal4 Apr 2024 18:43 UTC

47 points

6 comments1 min readLW link

OpenAI: Preparedness framework

Zach Stein-Perlman18 Dec 2023 18:30 UTC

68 points

23 comments4 min readLW link

(openai.com)

The Leeroy Jenkins principle: How faulty AI could guarantee “warning shots”

titotal14 Jan 2024 15:03 UTC

41 points

5 comments1 min readLW link

(titotal.substack.com)

METR is hiring!

Beth Barnes26 Dec 2023 21:00 UTC

65 points

1 comment1 min readLW link

Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities

porby2 Feb 2024 5:49 UTC

43 points

1 comment4 min readLW link

(1drv.ms)

Bounty: Diverse hard tasks for LLM agents

Beth Barnes and Megan Kinniment

17 Dec 2023 1:04 UTC

49 points

31 comments16 min readLW link

[Question] Can GPT-4 play 20 questions against another instance of itself?

Nathan Helm-Burger28 Mar 2023 1:11 UTC

15 points

1 comment1 min readLW link

(evanthebouncy.medium.com)

The Compleat Cybornaut

ukc10014, Jozdien and NicholasKees

19 May 2023 8:44 UTC

64 points

2 comments16 min readLW link

DeepMind: Model evaluation for extreme risks

Zach Stein-Perlman25 May 2023 3:00 UTC

94 points

11 comments1 min readLW link

(arxiv.org)

Seeking (Paid) Case Studies on Standards

HoldenKarnofsky26 May 2023 17:58 UTC

69 points

9 comments11 min readLW link

[Question] AI Rights: In your view, what would be required for an AGI to gain rights and protections from the various Governments of the World?

Super AGI9 Jun 2023 1:24 UTC

10 points

26 comments1 min readLW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher King29 Jun 2023 16:56 UTC

7 points

0 comments2 min readLW link

Robustness of Model-Graded Evaluations and Automated Interpretability

Simon Lermen and viluon

15 Jul 2023 19:12 UTC

44 points

5 comments9 min readLW link

The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited

mwatkins31 Jul 2023 19:47 UTC

85 points

29 comments20 min readLW link

Evaluating Superhuman Models with Consistency Checks

Daniel Paleka and Lukas Fluri

1 Aug 2023 7:51 UTC

15 points

2 comments9 min readLW link

(arxiv.org)

Longer-term Behaviour of Generative Companion AIs: A Social Simulation Approach

Reed14 Aug 2023 15:24 UTC

5 points

0 comments7 min readLW link

Reproducing ARC Evals’ recent report on language model agents

Thomas Broadley1 Sep 2023 16:52 UTC

102 points

17 comments3 min readLW link

(thomasbroadley.com)

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morris27 Sep 2023 17:54 UTC

14 points

2 comments4 min readLW link

(medium.com)

Responsible scaling policy TLDR

lukehmiles28 Sep 2023 18:51 UTC

9 points

0 comments1 min readLW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman and Ethan Perez

18 Jul 2023 16:36 UTC

109 points

13 comments6 min readLW link

The dreams of GPT-4

RomanS20 Mar 2023 17:00 UTC

14 points

7 comments9 min readLW link

Navigating the Attackspace

Jonas Kgomo12 Dec 2023 13:59 UTC

1 point

0 comments2 min readLW link

Benchmark Study #3: HellaSwag (Task, MCQ)

Bruce W. Lee7 Jan 2024 4:59 UTC

2 points

4 comments6 min readLW link

(arxiv.org)

Benchmark Study #4: AI2 Reasoning Challenge (Task(s), MCQ)

Bruce W. Lee7 Jan 2024 17:13 UTC

6 points

0 comments5 min readLW link

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Arjun Panickssery and agg

15 Jan 2024 21:21 UTC

33 points

0 comments1 min readLW link

OpenAI Credit Account (2510$)

Emirhan BULUT21 Jan 2024 2:30 UTC

1 point

0 comments1 min readLW link

Orthogonality or the “Human Worth Hypothesis”?

Jeffs23 Jan 2024 0:57 UTC

21 points

31 comments3 min readLW link

LLMs can strategically deceive while doing gain-of-function research

Igor Ivanov24 Jan 2024 15:45 UTC

32 points

4 comments11 min readLW link

OpenAI Credit Account (2510$)

Emirhan BULUT21 Jan 2024 2:32 UTC

1 point

0 comments1 min readLW link

Questions I’d Want to Ask an AGI+ to Test Its Understanding of Ethics

sweenesm26 Jan 2024 23:40 UTC

14 points

6 comments4 min readLW link

The case for more ambitious language model evals

Jozdien30 Jan 2024 0:01 UTC

105 points

25 comments5 min readLW link

Skepticism About DeepMind’s “Grandmaster-Level” Chess Without Search

Arjun Panickssery12 Feb 2024 0:56 UTC

53 points

13 comments3 min readLW link

Introducing METR’s Autonomy Evaluation Resources

Megan Kinniment and Beth Barnes

15 Mar 2024 23:16 UTC

90 points

0 comments1 min readLW link

(metr.github.io)

AI Safety Evaluations: A Regulatory Review

Elliot_Mckernon and Deric Cheng

19 Mar 2024 15:05 UTC

20 points

1 comment11 min readLW link

Measuring Predictability of Persona Evaluations

Thee Ho and evhub

6 Apr 2024 8:46 UTC

19 points

0 comments7 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC

2 points

8 comments6 min readLW link

LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery, Sam Bowman and Shi Feng

17 Apr 2024 21:09 UTC

43 points

1 comment3 min readLW link

(tiny.cc)

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub and Henry Sleight

19 Apr 2024 20:00 UTC

35 points

6 comments16 min readLW link

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

35 points

0 comments8 min readLW link

Proposal on AI evaluation: false-proving

ProgramCrafter31 Mar 2023 12:12 UTC

1 point

2 comments1 min readLW link

LM Situational Awareness, Evaluation Proposal: Violating Imitation

Jacob Pfau26 Apr 2023 22:53 UTC

13 points

2 comments2 min readLW link

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak and Sam F. Brown

8 Nov 2023 11:37 UTC

49 points

0 comments18 min readLW link

Theories of Change for AI Auditing

Lee Sharkey, beren and Marius Hobbhahn

13 Nov 2023 19:33 UTC

59 points

0 comments18 min readLW link

(www.apolloresearch.ai)

A simple treacherous turn demonstration

nikola25 Nov 2023 4:51 UTC

22 points

5 comments3 min readLW link

A call for a quantitative report card for AI bioterrorism threat models

Juno4 Dec 2023 6:35 UTC

12 points

0 comments10 min readLW link

Protecting against sudden capability jumps during training

nikola2 Dec 2023 4:22 UTC

8 points

0 comments2 min readLW link

The Method of Loci: With some brief remarks, including transformers and evaluating AIs

Bill Benzon2 Dec 2023 14:36 UTC

6 points

0 comments3 min readLW link

What’s new at FAR AI

AdamGleave and EuanMcLean

4 Dec 2023 21:18 UTC

40 points

0 comments5 min readLW link

(far.ai)

2023 Alignment Research Updates from FAR AI

AdamGleave and EuanMcLean

4 Dec 2023 22:32 UTC

18 points

0 comments8 min readLW link

(far.ai)

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

Simon Lermen, Teun van der Weij and Leon Lang

16 May 2023 10:53 UTC

22 points

0 comments13 min readLW link

Improving the safety of AI evals

JustinShovelain and Elliot_Mckernon

17 May 2023 22:24 UTC

13 points

7 comments7 min readLW link

No comments.