RSS

AI Evaluations

TagLast edit: 1 Aug 2023 1:03 UTC by duck_master

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

See also:

How evals might (or might not) pre­vent catas­trophic risks from AI

Orpheus167 Feb 2023 20:16 UTC
45 points
0 comments9 min readLW link

When can we trust model eval­u­a­tions?

evhub28 Jul 2023 19:42 UTC
166 points
10 comments10 min readLW link1 review

[Paper] Safety by Mea­sure­ment: A Sys­tem­atic Liter­a­ture Re­view of AI Safety Eval­u­a­tion Methods

19 May 2025 10:38 UTC
26 points
0 comments1 min readLW link

The case for more am­bi­tious lan­guage model evals

Jozdien30 Jan 2024 0:01 UTC
117 points
30 comments5 min readLW link

Thoughts on shar­ing in­for­ma­tion about lan­guage model capabilities

paulfchristiano31 Jul 2023 16:04 UTC
211 points
44 comments11 min readLW link1 review

Gam­ing Truth­fulQA: Sim­ple Heuris­tics Ex­posed Dataset Weaknesses

TurnTrout16 Jan 2025 2:14 UTC
64 points
3 comments1 min readLW link
(turntrout.com)

An­nounc­ing Apollo Research

30 May 2023 16:17 UTC
217 points
11 comments8 min readLW link

Towards un­der­stand­ing-based safety evaluations

evhub15 Mar 2023 18:18 UTC
164 points
16 comments5 min readLW link

Scal­able And Trans­fer­able Black-Box Jailbreaks For Lan­guage Models Via Per­sona Modulation

7 Nov 2023 17:59 UTC
38 points
2 comments2 min readLW link
(arxiv.org)

How good are LLMs at do­ing ML on an un­known dataset?

Håvard Tveit Ihle1 Jul 2024 9:04 UTC
33 points
4 comments13 min readLW link

How Does A Blind Model See The Earth?

henry11 Aug 2025 19:58 UTC
463 points
38 comments7 min readLW link
(outsidetext.substack.com)

OMMC An­nounces RIP

1 Apr 2024 23:20 UTC
189 points
5 comments2 min readLW link

Deep­Mind: Model eval­u­a­tion for ex­treme risks

Zach Stein-Perlman25 May 2023 3:00 UTC
94 points
12 comments1 min readLW link1 review
(arxiv.org)

Model Or­ganisms of Misal­ign­ment: The Case for a New Pillar of Align­ment Research

8 Aug 2023 1:30 UTC
319 points
30 comments18 min readLW link1 review

A starter guide for evals

8 Jan 2024 18:24 UTC
55 points
2 comments12 min readLW link
(www.apolloresearch.ai)

BIG-Bench Ca­nary Con­tam­i­na­tion in GPT-4

Jozdien22 Oct 2024 15:40 UTC
128 points
18 comments4 min readLW link

No­body is Do­ing AI Bench­mark­ing Right

Chapin Lenthall-Cleary6 Jul 2025 7:05 UTC
20 points
12 comments9 min readLW link

New Paper: It is time to move on from MCQs for LLM Evaluations

shash426 Jul 2025 11:48 UTC
9 points
0 comments2 min readLW link

The bit­ter les­son of mi­suse detection

10 Jul 2025 14:50 UTC
34 points
6 comments7 min readLW link

Re­spon­si­ble De­ploy­ment in 20XX

Carson20 Apr 2023 0:24 UTC
4 points
0 comments4 min readLW link

An Opinionated Evals Read­ing List

15 Oct 2024 14:38 UTC
65 points
0 comments13 min readLW link
(www.apolloresearch.ai)

GPT-4o Sys­tem Card

Zach Stein-Perlman8 Aug 2024 20:30 UTC
68 points
11 comments2 min readLW link
(openai.com)

What’s the short timeline plan?

Marius Hobbhahn2 Jan 2025 14:59 UTC
358 points
49 comments23 min readLW link

Au­tonomous repli­ca­tion and adap­ta­tion: an at­tempt at a con­crete dan­ger threshold

Hjalmar_Wijk17 Aug 2023 1:31 UTC
45 points
1 comment13 min readLW link

AI Task Length Hori­zons in Offen­sive Cybersecurity

Sean Peters2 Jul 2025 18:36 UTC
70 points
10 comments12 min readLW link

Claude Son­net 3.7 (of­ten) knows when it’s in al­ign­ment evaluations

17 Mar 2025 19:11 UTC
184 points
9 comments6 min readLW link

[In­terim re­search re­port] Eval­u­at­ing the Goal-Direct­ed­ness of Lan­guage Models

18 Jul 2024 18:19 UTC
40 points
4 comments11 min readLW link

Schizobench: Doc­u­ment­ing Mag­i­cal-Think­ing Be­hav­ior in Claude 4 Opus

viemccoy23 May 2025 1:31 UTC
22 points
0 comments1 min readLW link
(metanomicon.ink)

Quickly Assess­ing Re­ward Hack­ing-like Be­hav­ior in LLMs and its Sen­si­tivity to Prompt Variations

AndresCampero4 Jun 2025 7:22 UTC
26 points
1 comment17 min readLW link

New, im­proved mul­ti­ple-choice TruthfulQA

15 Jan 2025 23:32 UTC
72 points
0 comments3 min readLW link

OpenAI: Pre­pared­ness framework

Zach Stein-Perlman18 Dec 2023 18:30 UTC
70 points
23 comments4 min readLW link
(openai.com)

Refram­ing the bur­den of proof: Com­pa­nies should prove that mod­els are safe (rather than ex­pect­ing au­di­tors to prove that mod­els are dan­ger­ous)

Orpheus1625 Apr 2023 18:49 UTC
27 points
11 comments3 min readLW link
(childrenoficarus.substack.com)

Run evals on base mod­els too!

orthonormal4 Apr 2024 18:43 UTC
49 points
6 comments1 min readLW link

METR is hiring!

Beth Barnes26 Dec 2023 21:00 UTC
65 points
1 comment1 min readLW link

Twit­ter thread on AI safety evals

Richard_Ngo31 Jul 2024 0:18 UTC
63 points
3 comments2 min readLW link
(x.com)

ARC tests to see if GPT-4 can es­cape hu­man con­trol; GPT-4 failed to do so

Christopher King15 Mar 2023 0:29 UTC
116 points
22 comments2 min readLW link

Can Gen­er­al­ized Ad­ver­sar­ial Test­ing En­able More Ri­gor­ous LLM Safety Evals?

scasper30 Jul 2024 14:57 UTC
25 points
0 comments4 min readLW link

Com­par­ing Quan­tized Perfor­mance in Llama Models

NickyP15 Jul 2024 16:01 UTC
35 points
2 comments8 min readLW link

Soft Prompts for Eval­u­a­tion: Mea­sur­ing Con­di­tional Dis­tance of Capabilities

porby2 Feb 2024 5:49 UTC
47 points
1 comment4 min readLW link
(arxiv.org)

Val­i­dat­ing against a mis­al­ign­ment de­tec­tor is very differ­ent to train­ing against one

mattmacdermott4 Mar 2025 15:41 UTC
39 points
4 comments4 min readLW link

In­tro­duc­ing BenchBench: An In­dus­try Stan­dard Bench­mark for AI Strength

Jozdien2 Apr 2025 2:11 UTC
50 points
0 comments2 min readLW link

AI com­pa­nies’ eval re­ports mostly don’t sup­port their claims

Zach Stein-Perlman9 Jun 2025 13:00 UTC
207 points
13 comments4 min readLW link

AI com­pa­nies aren’t re­ally us­ing ex­ter­nal evaluators

Zach Stein-Perlman24 May 2024 16:01 UTC
242 points
15 comments4 min readLW link

Clar­ify­ing METR’s Au­dit­ing Role

Beth Barnes30 May 2024 18:41 UTC
108 points
1 comment2 min readLW link

Send us ex­am­ple gnarly bugs

10 Dec 2023 5:23 UTC
77 points
10 comments2 min readLW link

“Suc­cess­ful lan­guage model evals” by Ja­son Wei

Arjun Panickssery25 May 2024 9:34 UTC
7 points
0 comments1 min readLW link
(www.jasonwei.net)

In­ves­ti­gat­ing the Abil­ity of LLMs to Rec­og­nize Their Own Writing

30 Jul 2024 15:41 UTC
32 points
0 comments15 min readLW link

ARC Evals new re­port: Eval­u­at­ing Lan­guage-Model Agents on Real­is­tic Au­tonomous Tasks

Beth Barnes1 Aug 2023 18:30 UTC
153 points
12 comments5 min readLW link
(evals.alignment.org)

Misal­ign­ment and Strate­gic Un­der­perfor­mance: An Anal­y­sis of Sand­bag­ging and Ex­plo­ra­tion Hacking

8 May 2025 19:06 UTC
77 points
3 comments15 min readLW link

AXRP Epi­sode 38.8 - David Du­ve­naud on Sab­o­tage Eval­u­a­tions and the Post-AGI Future

DanielFilan1 Mar 2025 1:20 UTC
13 points
0 comments13 min readLW link

Apollo Re­search is hiring evals and in­ter­pretabil­ity en­g­ineers & scientists

Marius Hobbhahn4 Aug 2023 10:54 UTC
25 points
0 comments2 min readLW link

An­nounc­ing Hu­man-al­igned AI Sum­mer School

22 May 2024 8:55 UTC
51 points
0 comments1 min readLW link
(humanaligned.ai)

Mechanis­ti­cally Elic­it­ing La­tent Be­hav­iors in Lan­guage Models

30 Apr 2024 18:51 UTC
214 points
43 comments45 min readLW link

Pro­to­col eval­u­a­tions: good analo­gies vs control

Fabien Roger19 Feb 2024 18:00 UTC
42 points
10 comments11 min readLW link

Self-Aware­ness: Tax­on­omy and eval suite proposal

Daniel Kokotajlo17 Feb 2024 1:47 UTC
65 points
2 comments11 min readLW link

100+ con­crete pro­jects and open prob­lems in evals

Marius Hobbhahn22 Mar 2025 15:21 UTC
74 points
1 comment1 min readLW link

Notes on Claude 4 Sys­tem Card

Dentosal23 May 2025 15:23 UTC
19 points
2 comments6 min readLW link

Manag­ing risks of our own work

Beth Barnes18 Aug 2023 0:41 UTC
66 points
0 comments2 min readLW link

[Question] Would more model evals teams be good?

Ryan Kidd25 Feb 2023 22:01 UTC
20 points
4 comments1 min readLW link

New Ca­pa­bil­ities, New Risks? - Eval­u­at­ing Agen­tic Gen­eral As­sis­tants us­ing Ele­ments of GAIA & METR Frameworks

Tej Lander29 Sep 2024 18:58 UTC
5 points
0 comments29 min readLW link

More in­for­ma­tion about the dan­ger­ous ca­pa­bil­ity eval­u­a­tions we did with GPT-4 and Claude.

Beth Barnes19 Mar 2023 0:25 UTC
233 points
54 comments8 min readLW link
(evals.alignment.org)

AXRP Epi­sode 34 - AI Eval­u­a­tions with Beth Barnes

DanielFilan28 Jul 2024 3:30 UTC
23 points
0 comments69 min readLW link

The Evals Gap

Marius Hobbhahn11 Nov 2024 16:42 UTC
55 points
7 comments7 min readLW link
(www.apolloresearch.ai)

To CoT or not to CoT? Chain-of-thought helps mainly on math and sym­bolic reasoning

Bogdan Ionut Cirstea19 Sep 2024 16:13 UTC
21 points
1 comment1 min readLW link
(arxiv.org)

Re­search Note: Our schem­ing pre­cur­sor evals had limited pre­dic­tive power for our in-con­text schem­ing evals

Marius Hobbhahn3 Jul 2025 15:57 UTC
74 points
0 comments1 min readLW link
(www.apolloresearch.ai)

Which evals re­sources would be good?

Marius Hobbhahn16 Nov 2024 14:24 UTC
51 points
4 comments5 min readLW link

Prevent­ing Lan­guage Models from hid­ing their reasoning

31 Oct 2023 14:34 UTC
119 points
15 comments12 min readLW link1 review

Apollo Re­search 1-year update

29 May 2024 17:44 UTC
93 points
0 comments7 min readLW link

Sub­ver­sion Strat­egy Eval: Can lan­guage mod­els state­lessly strate­gize to sub­vert con­trol pro­to­cols?

24 Mar 2025 17:55 UTC
34 points
0 comments8 min readLW link

Eval­u­at­ing strate­gic rea­son­ing in GPT models

phelps-sg25 May 2023 11:51 UTC
4 points
1 comment8 min readLW link

Bi­as­ing VLM Re­sponse with Vi­sual Stimuli

Jaehyuk Lim3 Oct 2024 18:04 UTC
5 points
0 comments8 min readLW link

[Question] Can GPT-4 play 20 ques­tions against an­other in­stance of it­self?

Nathan Helm-Burger28 Mar 2023 1:11 UTC
15 points
1 comment1 min readLW link
(evanthebouncy.medium.com)

Ideas for bench­mark­ing LLM creativity

gwern16 Dec 2024 5:18 UTC
60 points
11 comments1 min readLW link
(gwern.net)

The Leeroy Jenk­ins prin­ci­ple: How faulty AI could guaran­tee “warn­ing shots”

titotal14 Jan 2024 15:03 UTC
48 points
6 comments21 min readLW link
(titotal.substack.com)

Is ChatGPT ac­tu­ally fixed now?

sjadler8 May 2025 23:34 UTC
17 points
0 comments1 min readLW link
(stevenadler.substack.com)

We should try to au­to­mate AI safety work asap

Marius Hobbhahn26 Apr 2025 16:35 UTC
113 points
10 comments15 min readLW link

Third-party test­ing as a key in­gre­di­ent of AI policy

Zac Hatfield-Dodds25 Mar 2024 22:40 UTC
11 points
1 comment12 min readLW link
(www.anthropic.com)

≤10-year Timelines Re­main Un­likely De­spite Deep­Seek and o3

Rafael Harth13 Feb 2025 19:21 UTC
53 points
67 comments15 min readLW link

Bounty: Di­verse hard tasks for LLM agents

17 Dec 2023 1:04 UTC
49 points
31 comments16 min readLW link

Fron­tier Models are Ca­pable of In-con­text Scheming

5 Dec 2024 22:11 UTC
210 points
24 comments7 min readLW link

A very crude de­cep­tion eval is already passed

Beth Barnes29 Oct 2021 17:57 UTC
108 points
6 comments2 min readLW link

Model evals for dan­ger­ous capabilities

Zach Stein-Perlman23 Sep 2024 11:00 UTC
51 points
11 comments3 min readLW link

We need a Science of Evals

22 Jan 2024 20:30 UTC
72 points
13 comments9 min readLW link

UK AISI: Early les­sons from eval­u­at­ing fron­tier AI systems

Zach Stein-Perlman25 Oct 2024 19:00 UTC
26 points
0 comments2 min readLW link
(www.aisi.gov.uk)

I repli­cated the An­thropic al­ign­ment fak­ing ex­per­i­ment on other mod­els, and they didn’t fake alignment

30 May 2025 18:57 UTC
31 points
0 comments2 min readLW link

Deep­Mind: Eval­u­at­ing Fron­tier Models for Danger­ous Capabilities

Zach Stein-Perlman21 Mar 2024 3:00 UTC
61 points
8 comments1 min readLW link
(arxiv.org)

An is­sue with train­ing schemers with su­per­vised fine-tuning

Fabien Roger27 Jun 2024 15:37 UTC
49 points
12 comments6 min readLW link

METR’s Ob­ser­va­tions of Re­ward Hack­ing in Re­cent Fron­tier Models

Daniel Kokotajlo9 Jun 2025 18:03 UTC
99 points
9 comments11 min readLW link
(metr.org)

A Tax­on­omy Of AI Sys­tem Evaluations

19 Aug 2024 9:07 UTC
13 points
0 comments14 min readLW link

UnaPrompt™: A Pre-Prompt Op­ti­miza­tion Sys­tem for Reli­able and Eth­i­cally Aligned AI Outputs

UnaPrompt27 Jun 2025 0:06 UTC
1 point
0 comments1 min readLW link

A call for a quan­ti­ta­tive re­port card for AI bioter­ror­ism threat models

Juno4 Dec 2023 6:35 UTC
12 points
0 comments10 min readLW link

When fine-tun­ing fails to elicit GPT-3.5′s chess abilities

Theodore Chapman14 Jun 2024 18:50 UTC
42 points
3 comments9 min readLW link

Back­prop—The Rus­sian Al­gorithm the West Claimed as Its Own ti­tled Draft

Jayson Morgan3 Sep 2025 13:44 UTC
1 point
0 comments9 min readLW link

OpenAI Credit Ac­count (2510$)

Emirhan BULUT21 Jan 2024 2:30 UTC
1 point
0 comments1 min readLW link

Mind the Co­her­ence Gap: Les­sons from Steer­ing Llama with Goodfire

eitan sprejer9 May 2025 21:29 UTC
4 points
1 comment6 min readLW link

Tall Tales at Differ­ent Scales: Eval­u­at­ing Scal­ing Trends For De­cep­tion In Lan­guage Models

8 Nov 2023 11:37 UTC
49 points
0 comments18 min readLW link

[Question] An­thropic Is Go­ing All In On Abil­ity Without In­tel­li­gence?

Chapin Lenthall-Cleary7 Aug 2025 5:54 UTC
2 points
0 comments2 min readLW link

From No Mind to a Mind – A Con­ver­sa­tion That Changed an AI

parthibanarjuna s7 Feb 2025 11:50 UTC
1 point
0 comments3 min readLW link

Can SAE steer­ing re­veal sand­bag­ging?

15 Apr 2025 12:33 UTC
35 points
3 comments4 min readLW link

In­duc­ing Un­prompted Misal­ign­ment in LLMs

19 Apr 2024 20:00 UTC
38 points
7 comments16 min readLW link

LLM Eval­u­a­tors Rec­og­nize and Fa­vor Their Own Generations

17 Apr 2024 21:09 UTC
46 points
1 comment3 min readLW link
(tiny.cc)

Per­sonal eval­u­a­tion of LLMs, through chess

Karthik Tadepalli24 Apr 2025 7:01 UTC
20 points
4 comments2 min readLW link

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

15 Mar 2024 23:16 UTC
90 points
0 comments1 min readLW link
(metr.github.io)

Con­cept Poi­son­ing: Prob­ing LLMs with­out probes

5 Aug 2025 17:00 UTC
58 points
5 comments13 min readLW link

If your AGI defi­ni­tion ex­cludes most hu­mans, it sucks.

Chapin Lenthall-Cleary22 Jul 2025 10:33 UTC
18 points
7 comments2 min readLW link

Auto-En­hance: Devel­op­ing a meta-bench­mark to mea­sure LLM agents’ abil­ity to im­prove other agents

22 Jul 2024 12:33 UTC
20 points
0 comments14 min readLW link

An In­tro­duc­tion to AI Sandbagging

26 Apr 2024 13:40 UTC
48 points
13 comments8 min readLW link

Towards a Science of Evals for Sycophancy

andrejfsantos1 Feb 2025 21:17 UTC
8 points
0 comments8 min readLW link

Steer­ing Vec­tors Can Help LLM Judges De­tect Sub­tle Dishonesty

3 Jun 2025 20:33 UTC
12 points
1 comment5 min readLW link

MMLU’s Mo­ral Sce­nar­ios Bench­mark Doesn’t Mea­sure What You Think it Measures

corey morris27 Sep 2023 17:54 UTC
18 points
3 comments4 min readLW link
(medium.com)

I Tested LLM Agents on Sim­ple Safety Rules. They Failed in Sur­pris­ing and In­for­ma­tive Ways.

Ram Potham25 Jun 2025 21:39 UTC
9 points
12 comments6 min readLW link

The­o­ries of Change for AI Auditing

13 Nov 2023 19:33 UTC
54 points
0 comments18 min readLW link
(www.apolloresearch.ai)

Seek­ing (Paid) Case Stud­ies on Standards

HoldenKarnofsky26 May 2023 17:58 UTC
69 points
9 comments11 min readLW link

Re­quest for pro­pos­als: im­prov­ing ca­pa­bil­ity evaluations

cb7 Feb 2025 18:51 UTC
1 point
0 comments1 min readLW link
(www.openphilanthropy.org)

In­tel­li­gence–Agency Equiv­alence ≈ Mass–En­ergy Equiv­alence: On Static Na­ture of In­tel­li­gence & Phys­i­cal­iza­tion of Ethics

ank22 Feb 2025 0:12 UTC
1 point
0 comments6 min readLW link

LLMs are Ca­pable of Misal­igned Be­hav­ior Un­der Ex­plicit Pro­hi­bi­tion and Surveillance

Igor Ivanov8 Jul 2025 11:50 UTC
28 points
8 comments7 min readLW link

Sab­o­tage Eval­u­a­tions for Fron­tier Models

18 Oct 2024 22:33 UTC
95 points
56 comments6 min readLW link
(assets.anthropic.com)

The “spel­ling mir­a­cle”: GPT-3 spel­ling abil­ities and glitch to­kens revisited

mwatkins31 Jul 2023 19:47 UTC
85 points
29 comments20 min readLW link

Call for eval­u­a­tors: Par­ti­ci­pate in the Euro­pean AI Office work­shop on gen­eral-pur­pose AI mod­els and sys­temic risks

27 Nov 2024 2:54 UTC
30 points
0 comments2 min readLW link

Can Per­sua­sion Break AI Safety? Ex­plor­ing the In­ter­play Between Fine-Tun­ing, At­tacks, and Guardrails

Devina Jain4 Feb 2025 19:10 UTC
9 points
0 comments10 min readLW link

Se­cret Col­lu­sion: Will We Know When to Un­plug AI?

16 Sep 2024 16:07 UTC
65 points
8 comments31 min readLW link

Is there a Half-Life for the Suc­cess Rates of AI Agents?

Matrice Jacobine8 May 2025 20:10 UTC
8 points
0 comments1 min readLW link
(www.tobyord.com)

[Paper] Hid­den in Plain Text: Emer­gence and Miti­ga­tion of Stegano­graphic Col­lu­sion in LLMs

25 Sep 2024 14:52 UTC
37 points
2 comments4 min readLW link
(arxiv.org)

Among Us: A Sand­box for Agen­tic Deception

5 Apr 2025 6:24 UTC
110 points
7 comments7 min readLW link

Ra­tional Effec­tive Utopia & Nar­row Way There: Math-Proven Safe Static Mul­tiver­sal mAX-In­tel­li­gence (AXI), Mul­tiver­sal Align­ment, New Ethico­physics… (Aug 11)

ank11 Feb 2025 3:21 UTC
13 points
8 comments38 min readLW link

Agency over­hang as a proxy for Sharp left turn

7 Nov 2024 12:14 UTC
6 points
0 comments5 min readLW link

AI threat­ens to or­ches­trate sus­tain­able so­cial reform

Chris Santos-Lang2 Apr 2025 3:04 UTC
13 points
0 comments1 min readLW link
(arxiv.org)

METR is hiring ML Re­search Eng­ineers and Scientists

Xodarap5 Jun 2024 21:27 UTC
5 points
0 comments1 min readLW link
(metr.org)

Do mod­els know when they are be­ing eval­u­ated?

17 Feb 2025 23:13 UTC
59 points
8 comments12 min readLW link

Lan­guage Tier Lock and Poetic Con­tam­i­na­tion in GPT-4o: A Field Report

許皓翔11 Jun 2025 17:24 UTC
0 points
0 comments2 min readLW link

The dreams of GPT-4

RomanS20 Mar 2023 17:00 UTC
14 points
7 comments9 min readLW link

Align­ment Can Re­duce Perfor­mance on Sim­ple Eth­i­cal Questions

Daan Henselmans3 Feb 2025 19:35 UTC
16 points
7 comments6 min readLW link

Find­ing De­cep­tion in Lan­guage Models

20 Aug 2024 9:42 UTC
20 points
4 comments4 min readLW link

Mea­sur­ing and Im­prov­ing the Faith­ful­ness of Model-Gen­er­ated Rea­son­ing

18 Jul 2023 16:36 UTC
111 points
15 comments6 min readLW link1 review

Sys­tem­atic Sand­bag­ging Eval­u­a­tions on Claude 3.5 Sonnet

farrelmahaztra14 Feb 2025 1:22 UTC
13 points
0 comments1 min readLW link
(farrelmahaztra.com)

Think­ing About Propen­sity Evaluations

19 Aug 2024 9:23 UTC
10 points
0 comments27 min readLW link

Me­tacog­ni­tion and Self-Model­ing in LLMs

Christopher Ackerman10 Jul 2025 21:25 UTC
19 points
2 comments16 min readLW link

On­tolog­i­cal Val­i­da­tion Man­i­festo for AIs

Alejandra Ivone Rojas Reyna22 Mar 2025 0:26 UTC
1 point
0 comments71 min readLW link

[Question] Feed­back re­quest: `eval-crypt` a sim­ple util­ity to miti­gate eval con­tam­i­na­tion.

2 Aug 2025 17:04 UTC
8 points
2 comments2 min readLW link

[Question] How far along Metr’s law can AI start au­tomat­ing or helping with al­ign­ment re­search?

Christopher King20 Mar 2025 15:58 UTC
20 points
21 comments1 min readLW link

Re­spon­si­ble scal­ing policy TLDR

lemonhope28 Sep 2023 18:51 UTC
9 points
0 comments1 min readLW link

Static Place AI Makes Agen­tic AI Re­dun­dant: Mul­tiver­sal AI Align­ment & Ra­tional Utopia

ank13 Feb 2025 22:35 UTC
1 point
2 comments11 min readLW link

AI Safety In­sti­tute’s In­spect hello world ex­am­ple for AI evals

TheManxLoiner16 May 2024 20:47 UTC
3 points
0 comments1 min readLW link
(lovkush.medium.com)

The Com­pleat Cybornaut

19 May 2023 8:44 UTC
66 points
2 comments16 min readLW link

Eval­u­at­ing Lan­guage Model Be­havi­ours for Shut­down Avoidance in Tex­tual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments13 min readLW link

A sketch of an AI con­trol safety case

30 Jan 2025 17:28 UTC
57 points
0 comments5 min readLW link

AI Deep­Seek is Aware

Eyon31 Jan 2025 12:40 UTC
1 point
0 comments6 min readLW link

LLMs can strate­gi­cally de­ceive while do­ing gain-of-func­tion re­search

Igor Ivanov24 Jan 2024 15:45 UTC
36 points
4 comments11 min readLW link

What’s new at FAR AI

4 Dec 2023 21:18 UTC
41 points
0 comments5 min readLW link
(far.ai)

Pro­tect­ing against sud­den ca­pa­bil­ity jumps dur­ing training

Nikola Jurkovic2 Dec 2023 4:22 UTC
15 points
2 comments2 min readLW link

A sim­ple treach­er­ous turn demonstration

Nikola Jurkovic25 Nov 2023 4:51 UTC
22 points
5 comments3 min readLW link

METR’s Eval­u­a­tion of GPT-5

GradientDissenter7 Aug 2025 22:17 UTC
139 points
15 comments20 min readLW link
(metr.github.io)

Give Neo a Chance

ank6 Mar 2025 1:48 UTC
3 points
7 comments7 min readLW link

On­tolog­i­cal Val­i­da­tion Man­i­festo for AIs

Alejandra Ivone Rojas Reyna14 Mar 2025 16:34 UTC
1 point
0 comments72 min readLW link

Can Cur­rent LLMs be Trusted To Pro­duce Paper­clips Safely?

Rohit Chatterjee19 Aug 2024 17:17 UTC
4 points
0 comments9 min readLW link

Claude is a Ravenclaw

Adam Newgas4 Jul 2025 21:32 UTC
63 points
9 comments2 min readLW link
(www.boristhebrave.com)

Im­prov­ing Model-Writ­ten Evals for AI Safety Benchmarking

15 Oct 2024 18:25 UTC
30 points
0 comments18 min readLW link

Re­veal­ing al­ign­ment fak­ing with a sin­gle prompt

Florian_Dietz29 Jan 2025 21:01 UTC
9 points
5 comments4 min readLW link

AI Epistemic Gain

Generoso Immediato12 Aug 2025 14:03 UTC
0 points
0 comments10 min readLW link

How to miti­gate sandbagging

Teun van der Weij23 Mar 2025 17:19 UTC
30 points
0 comments8 min readLW link

Can star­tups be im­pact­ful in AI safety?

13 Sep 2024 19:00 UTC
15 points
0 comments6 min readLW link

It’s hard to make schem­ing evals look re­al­is­tic for LLMs

24 May 2025 19:17 UTC
149 points
29 comments5 min readLW link

Abla­tions for “Fron­tier Models are Ca­pable of In-con­text Schem­ing”

17 Dec 2024 23:58 UTC
115 points
1 comment2 min readLW link

Toward a tax­on­omy of cog­ni­tive bench­marks for agen­tic AGIs

Ben Smith27 Jun 2024 23:50 UTC
15 points
0 comments5 min readLW link

LLM Psy­cho­met­rics and Prompt-In­duced Psychopathy

Korbinian K.18 Oct 2024 18:11 UTC
12 points
2 comments10 min readLW link

Two flaws in the Machi­avelli Benchmark

TheManxLoiner12 Feb 2025 19:34 UTC
24 points
0 comments3 min readLW link

Mea­sur­ing Schel­ling Co­or­di­na­tion—Reflec­tions on Sub­ver­sion Strat­egy Eval

Graeme Ford12 May 2025 19:06 UTC
6 points
0 comments8 min readLW link

“Should AI Ques­tion Its Own De­ci­sions? A Thought Ex­per­i­ment”

CMDR WOTZ4 Feb 2025 8:39 UTC
1 point
0 comments1 min readLW link

Solv­ing ad­ver­sar­ial at­tacks in com­puter vi­sion as a baby ver­sion of gen­eral AI alignment

Stanislav Fort29 Aug 2024 17:17 UTC
89 points
8 comments7 min readLW link

Challenge pro­posal: small­est pos­si­ble self-hard­en­ing back­door for RLHF

Christopher King29 Jun 2023 16:56 UTC
7 points
0 comments2 min readLW link

Re­sults from the AI x Democ­racy Re­search Sprint

14 Jun 2024 16:40 UTC
13 points
0 comments6 min readLW link

AI as a Cog­ni­tive De­coder: Re­think­ing In­tel­li­gence Evolution

Hu Xunyi13 Feb 2025 15:51 UTC
1 point
0 comments1 min readLW link

Black-box in­ter­pretabil­ity method­ol­ogy blueprint: Prob­ing run­away op­ti­mi­sa­tion in LLMs

Roland Pihlakas22 Jun 2025 18:16 UTC
17 points
0 comments7 min readLW link

Is the gap be­tween open and closed mod­els grow­ing? Ev­i­dence from WeirdML

Håvard Tveit Ihle5 Aug 2025 8:20 UTC
7 points
3 comments2 min readLW link

Re­pro­duc­ing ARC Evals’ re­cent re­port on lan­guage model agents

Thomas Broadley1 Sep 2023 16:52 UTC
104 points
17 comments3 min readLW link
(thomasbroadley.com)

Cri­tiques of the AI con­trol agenda

Jozdien14 Feb 2024 19:25 UTC
48 points
14 comments9 min readLW link

Claude wants to be conscious

Joe Kwon13 Apr 2024 1:40 UTC
2 points
8 comments6 min readLW link

Au­tomat­ing AI Safety: What we can do today

25 Jul 2025 14:49 UTC
36 points
0 comments8 min readLW link

A Poem Is All You Need: Jailbreak­ing ChatGPT, Meta & More

Sharat Jacob Jacob29 Oct 2024 12:41 UTC
12 points
0 comments9 min readLW link

Orthog­o­nal­ity or the “Hu­man Worth Hy­poth­e­sis”?

Jeffs23 Jan 2024 0:57 UTC
21 points
31 comments3 min readLW link

How to make evals for the AISI evals bounty

TheManxLoiner3 Dec 2024 10:44 UTC
9 points
0 comments5 min readLW link

Mea­sur­ing Pre­dictabil­ity of Per­sona Evaluations

6 Apr 2024 8:46 UTC
20 points
0 comments7 min readLW link

10 Prin­ci­ples for Real Align­ment

Adriaan21 Apr 2025 22:18 UTC
−7 points
0 comments7 min readLW link

LM Si­tu­a­tional Aware­ness, Eval­u­a­tion Pro­posal: Vio­lat­ing Imitation

Jacob Pfau26 Apr 2023 22:53 UTC
16 points
2 comments2 min readLW link

Re­cent progress on the sci­ence of evaluations

PabloAMC23 Jun 2025 9:41 UTC
12 points
0 comments8 min readLW link

How Self-Aware Are LLMs?

Christopher Ackerman28 May 2025 12:57 UTC
21 points
9 comments10 min readLW link

[Question] AI Rights: In your view, what would be re­quired for an AGI to gain rights and pro­tec­tions from the var­i­ous Govern­ments of the World?

Super AGI9 Jun 2023 1:24 UTC
10 points
26 comments1 min readLW link

A Vi­sual Task that’s Hard for GPT-4o, but Doable for Pri­mary Schoolers

Lennart Finke26 Jul 2024 17:51 UTC
25 points
6 comments2 min readLW link

Nav­i­gat­ing the Attackspace

Jonas Kgomo12 Dec 2023 13:59 UTC
1 point
0 comments2 min readLW link

Ro­bust­ness of Model-Graded Eval­u­a­tions and Au­to­mated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments9 min readLW link

OpenAI Credit Ac­count (2510$)

Emirhan BULUT21 Jan 2024 2:32 UTC
1 point
0 comments1 min readLW link

An­a­lyz­ing Deep­Mind’s Prob­a­bil­is­tic Meth­ods for Eval­u­at­ing Agent Capabilities

22 Jul 2024 16:17 UTC
69 points
0 comments16 min readLW link

Fron­tier AI Models Still Fail at Ba­sic Phys­i­cal Tasks: A Man­u­fac­tur­ing Case Study

Adam Karvonen14 Apr 2025 17:38 UTC
156 points
42 comments7 min readLW link
(adamkarvonen.github.io)

Im­prov­ing the safety of AI evals

17 May 2023 22:24 UTC
13 points
7 comments7 min readLW link

Ar­tifi­cial Static Place In­tel­li­gence: Guaran­teed Alignment

ank15 Feb 2025 11:08 UTC
2 points
2 comments2 min readLW link

Build­ing AI safety bench­mark en­vi­ron­ments on themes of uni­ver­sal hu­man values

Roland Pihlakas3 Jan 2025 4:24 UTC
18 points
3 comments8 min readLW link
(docs.google.com)

Re­view of METR’s pub­lic eval­u­a­tion protocol

30 Jun 2024 22:03 UTC
10 points
0 comments5 min readLW link

The Method of Loci: With some brief re­marks, in­clud­ing trans­form­ers and eval­u­at­ing AIs

Bill Benzon2 Dec 2023 14:36 UTC
6 points
0 comments3 min readLW link

AI Safety Eval­u­a­tions: A Reg­u­la­tory Review

19 Mar 2024 15:05 UTC
22 points
1 comment11 min readLW link

Join the $10K Au­toHack 2024 Tournament

Paul Bricman25 Sep 2024 11:54 UTC
5 points
0 comments1 min readLW link
(noemaresearch.com)

[Paper] AI Sand­bag­ging: Lan­guage Models can Strate­gi­cally Un­der­perform on Evaluations

13 Jun 2024 10:04 UTC
84 points
10 comments2 min readLW link
(arxiv.org)

Pro­posal on AI eval­u­a­tion: false-proving

ProgramCrafter31 Mar 2023 12:12 UTC
1 point
2 comments1 min readLW link

AISN #47: Rea­son­ing Models

6 Feb 2025 18:52 UTC
3 points
0 comments4 min readLW link
(newsletter.safe.ai)

Love, Lies and Misalignment

Priyanka Bharadwaj6 Aug 2025 9:44 UTC
6 points
1 comment3 min readLW link

Ques­tions I’d Want to Ask an AGI+ to Test Its Un­der­stand­ing of Ethics

sweenesm26 Jan 2024 23:40 UTC
14 points
6 comments4 min readLW link

Skep­ti­cism About Deep­Mind’s “Grand­mas­ter-Level” Chess Without Search

Arjun Panickssery12 Feb 2024 0:56 UTC
57 points
13 comments3 min readLW link

Some les­sons from the OpenAI-Fron­tierMath debacle

7vik19 Jan 2025 21:09 UTC
71 points
9 comments4 min readLW link

METR’s pre­limi­nary eval­u­a­tion of o3 and o4-mini

Christopher King16 Apr 2025 20:23 UTC
14 points
7 comments1 min readLW link
(metr.github.io)

In­tro­duc­ing REBUS: A Ro­bust Eval­u­a­tion Bench­mark of Un­der­stand­ing Symbols

15 Jan 2024 21:21 UTC
33 points
0 comments1 min readLW link

Eval­u­at­ing Su­per­hu­man Models with Con­sis­tency Checks

1 Aug 2023 7:51 UTC
21 points
2 comments9 min readLW link
(arxiv.org)

2023 Align­ment Re­search Up­dates from FAR AI

4 Dec 2023 22:32 UTC
18 points
0 comments8 min readLW link
(far.ai)

Sys­tem­atic run­away-op­ti­miser-like LLM failure modes on Biolog­i­cally and Eco­nom­i­cally al­igned AI safety bench­marks for LLMs with sim­plified ob­ser­va­tion for­mat (BioBlue)

16 Mar 2025 23:23 UTC
45 points
8 comments11 min readLW link

o1-pre­view is pretty good at do­ing ML on an un­known dataset

Håvard Tveit Ihle20 Sep 2024 8:39 UTC
67 points
1 comment2 min readLW link

Catas­trophic Cy­ber Ca­pa­bil­ities Bench­mark (3CB): Ro­bustly Eval­u­at­ing LLM Agent Cy­ber Offense Capabilities

5 Nov 2024 1:01 UTC
8 points
0 comments6 min readLW link
(www.apartresearch.com)
No comments.