RSS

AI Evaluations

TagLast edit: 15 Mar 2023 23:52 UTC by Raemon

AI Evaluations, or “Evals”, focus on assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: written by GPT4, may contain errors. Please correct them if you see them)

Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include developing a method-agnostic standard to demonstrate sufficient understanding of a model, ensuring that the level of understanding is adequate to catch dangerous failure modes, and finding the right balance between behavioral and understanding-based evaluations.

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

How evals might (or might not) pre­vent catas­trophic risks from AI

Akash7 Feb 2023 20:16 UTC
39 points
0 comments9 min readLW link

Towards un­der­stand­ing-based safety evaluations

evhub15 Mar 2023 18:18 UTC
112 points
9 comments5 min readLW link

[Question] Would more model evals teams be good?

Ryan Kidd25 Feb 2023 22:01 UTC
20 points
4 comments1 min readLW link

A very crude de­cep­tion eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC
106 points
8 comments2 min readLW link

ARC tests to see if GPT-4 can es­cape hu­man con­trol; GPT-4 failed to do so

Christopher King15 Mar 2023 0:29 UTC
117 points
22 comments2 min readLW link

More in­for­ma­tion about the dan­ger­ous ca­pa­bil­ity eval­u­a­tions we did with GPT-4 and Claude.

Beth Barnes19 Mar 2023 0:25 UTC
202 points
29 comments8 min readLW link
(evals.alignment.org)

The dreams of GPT-4

RomanS20 Mar 2023 17:00 UTC
14 points
7 comments9 min readLW link
No comments.