
AI Evaluations

Last edit: Aug 1, 2023, 1:03 AM by duck_master

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(Note: this page was initially written by GPT-4 and may contain errors despite human review. Please correct any you find.)

Behavioral evaluations assess a model’s capabilities on various tasks, such as autonomously replicating, acquiring resources, and avoiding shutdown. However, these evaluations may not be sufficient to detect deceptive alignment, since a deceptively aligned model could deliberately pass them; this makes it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, assess a developer’s ability to understand the model they have created, and why training produced that particular model. This approach can be more useful for safety, since it focuses on understanding why the model behaves as it does rather than merely checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can yield a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

(This text was initially written by GPT-4, using as input: A very crude deception eval is already passed; ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so; and Towards understanding-based safety evaluations.)

See also:

How evals might (or might not) prevent catastrophic risks from AI
Orpheus16 · Feb 7, 2023, 8:16 PM
45 points · 0 comments · 9 min read · LW link

When can we trust model evaluations?
evhub · Jul 28, 2023, 7:42 PM
166 points · 10 comments · 10 min read · LW link · 1 review

[Paper] Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods
May 19, 2025, 10:38 AM
22 points · 0 comments · 1 min read · LW link

The case for more ambitious language model evals
Jozdien · Jan 30, 2024, 12:01 AM
117 points · 30 comments · 5 min read · LW link

Thoughts on sharing information about language model capabilities
paulfchristiano · Jul 31, 2023, 4:04 PM
211 points · 44 comments · 11 min read · LW link · 1 review

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
TurnTrout · Jan 16, 2025, 2:14 AM
64 points · 3 comments · 1 min read · LW link (turntrout.com)

Announcing Apollo Research
May 30, 2023, 4:17 PM
217 points · 11 comments · 8 min read · LW link

Towards understanding-based safety evaluations
evhub · Mar 15, 2023, 6:18 PM
164 points · 16 comments · 5 min read · LW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Nov 7, 2023, 5:59 PM
38 points · 2 comments · 2 min read · LW link (arxiv.org)

How good are LLMs at doing ML on an unknown dataset?
Håvard Tveit Ihle · Jul 1, 2024, 9:04 AM
33 points · 4 comments · 13 min read · LW link

How Does A Blind Model See The Earth?
henry · Aug 11, 2025, 7:58 PM
473 points · 38 comments · 7 min read · LW link (outsidetext.substack.com)

OMMC Announces RIP
Apr 1, 2024, 11:20 PM
190 points · 5 comments · 2 min read · LW link

DeepMind: Model evaluation for extreme risks
Zach Stein-Perlman · May 25, 2023, 3:00 AM
94 points · 12 comments · 1 min read · LW link · 1 review (arxiv.org)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
Aug 8, 2023, 1:30 AM
320 points · 30 comments · 18 min read · LW link · 1 review

A starter guide for evals
Jan 8, 2024, 6:24 PM
55 points · 2 comments · 12 min read · LW link (www.apolloresearch.ai)

BIG-Bench Canary Contamination in GPT-4
Jozdien · Oct 22, 2024, 3:40 PM
130 points · 18 comments · 4 min read · LW link

Nobody is Doing AI Benchmarking Right
Chapin Lenthall-Cleary · Jul 6, 2025, 7:05 AM
20 points · 12 comments · 9 min read · LW link

New Paper: It is time to move on from MCQs for LLM Evaluations
shash42 · Jul 6, 2025, 11:48 AM
9 points · 0 comments · 2 min read · LW link

The bitter lesson of misuse detection
Jul 10, 2025, 2:50 PM
35 points · 6 comments · 7 min read · LW link

Responsible Deployment in 20XX
Carson · Apr 20, 2023, 12:24 AM
4 points · 0 comments · 4 min read · LW link

An Opinionated Evals Reading List
Oct 15, 2024, 2:38 PM
65 points · 0 comments · 13 min read · LW link (www.apolloresearch.ai)

GPT-4o System Card
Zach Stein-Perlman · Aug 8, 2024, 8:30 PM
68 points · 11 comments · 2 min read · LW link (openai.com)

What’s the short timeline plan?
Marius Hobbhahn · Jan 2, 2025, 2:59 PM
358 points · 49 comments · 23 min read · LW link

Autonomous replication and adaptation: an attempt at a concrete danger threshold
Hjalmar_Wijk · Aug 17, 2023, 1:31 AM
45 points · 1 comment · 13 min read · LW link

AI Task Length Horizons in Offensive Cybersecurity
Sean Peters · Jul 2, 2025, 6:36 PM
70 points · 10 comments · 12 min read · LW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Mar 17, 2025, 7:11 PM
184 points · 9 comments · 6 min read · LW link

[Interim research report] Evaluating the Goal-Directedness of Language Models
Jul 18, 2024, 6:19 PM
40 points · 4 comments · 11 min read · LW link

Schizobench: Documenting Magical-Thinking Behavior in Claude 4 Opus
viemccoy · May 23, 2025, 1:31 AM
23 points · 0 comments · 1 min read · LW link (metanomicon.ink)

Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations
AndresCampero · Jun 4, 2025, 7:22 AM
26 points · 1 comment · 17 min read · LW link

New, improved multiple-choice TruthfulQA
Jan 15, 2025, 11:32 PM
72 points · 0 comments · 3 min read · LW link

OpenAI: Preparedness framework
Zach Stein-Perlman · Dec 18, 2023, 6:30 PM
70 points · 23 comments · 4 min read · LW link (openai.com)

Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous)
Orpheus16 · Apr 25, 2023, 6:49 PM
27 points · 11 comments · 3 min read · LW link (childrenoficarus.substack.com)

Run evals on base models too!
orthonormal · Apr 4, 2024, 6:43 PM
49 points · 6 comments · 1 min read · LW link

You can’t eval GPT5 anymore
Lukas Petersson · Sep 18, 2025, 10:12 PM
151 points · 11 comments · 1 min read · LW link

METR is hiring!
Beth Barnes · Dec 26, 2023, 9:00 PM
65 points · 1 comment · 1 min read · LW link

Twitter thread on AI safety evals
Richard_Ngo · Jul 31, 2024, 12:18 AM
63 points · 3 comments · 2 min read · LW link (x.com)

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so
Christopher King · Mar 15, 2023, 12:29 AM
116 points · 22 comments · 2 min read · LW link

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · Jul 30, 2024, 2:57 PM
25 points · 0 comments · 4 min read · LW link

Comparing Quantized Performance in Llama Models
NickyP · Jul 15, 2024, 4:01 PM
35 points · 2 comments · 8 min read · LW link

Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities
porby · Feb 2, 2024, 5:49 AM
47 points · 1 comment · 4 min read · LW link (arxiv.org)

Validating against a misalignment detector is very different to training against one
mattmacdermott · Mar 4, 2025, 3:41 PM
39 points · 4 comments · 4 min read · LW link

Introducing BenchBench: An Industry Standard Benchmark for AI Strength
Jozdien · Apr 2, 2025, 2:11 AM
50 points · 0 comments · 2 min read · LW link

AI companies’ eval reports mostly don’t support their claims
Zach Stein-Perlman · Jun 9, 2025, 1:00 PM
207 points · 13 comments · 4 min read · LW link

AI companies aren’t really using external evaluators
Zach Stein-Perlman · May 24, 2024, 4:01 PM
242 points · 15 comments · 4 min read · LW link

Clarifying METR’s Auditing Role
Beth Barnes · May 30, 2024, 6:41 PM
108 points · 1 comment · 2 min read · LW link

Send us example gnarly bugs
Dec 10, 2023, 5:23 AM
77 points · 10 comments · 2 min read · LW link

“Successful language model evals” by Jason Wei
Arjun Panickssery · May 25, 2024, 9:34 AM
7 points · 0 comments · 1 min read · LW link (www.jasonwei.net)

Investigating the Ability of LLMs to Recognize Their Own Writing
Jul 30, 2024, 3:41 PM
32 points · 0 comments · 15 min read · LW link

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks
Beth Barnes · Aug 1, 2023, 6:30 PM
153 points · 12 comments · 5 min read · LW link (evals.alignment.org)

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
May 8, 2025, 7:06 PM
77 points · 3 comments · 15 min read · LW link

AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
DanielFilan · Mar 1, 2025, 1:20 AM
13 points · 0 comments · 13 min read · LW link

Apollo Research is hiring evals and interpretability engineers & scientists
Marius Hobbhahn · Aug 4, 2023, 10:54 AM
25 points · 0 comments · 2 min read · LW link

Announcing Human-aligned AI Summer School
May 22, 2024, 8:55 AM
51 points · 0 comments · 1 min read · LW link (humanaligned.ai)

Mechanistically Eliciting Latent Behaviors in Language Models
Apr 30, 2024, 6:51 PM
215 points · 43 comments · 45 min read · LW link

Protocol evaluations: good analogies vs control
Fabien Roger · Feb 19, 2024, 6:00 PM
42 points · 10 comments · 11 min read · LW link

Self-Awareness: Taxonomy and eval suite proposal
Daniel Kokotajlo · Feb 17, 2024, 1:47 AM
66 points · 2 comments · 11 min read · LW link

100+ concrete projects and open problems in evals
Marius Hobbhahn · Mar 22, 2025, 3:21 PM
74 points · 1 comment · 1 min read · LW link

Notes on Claude 4 System Card
Dentosal · May 23, 2025, 3:23 PM
19 points · 2 comments · 6 min read · LW link

Managing risks of our own work
Beth Barnes · Aug 18, 2023, 12:41 AM
66 points · 0 comments · 2 min read · LW link

[Question] Would more model evals teams be good?
Ryan Kidd · Feb 25, 2023, 10:01 PM
20 points · 4 comments · 1 min read · LW link

New Capabilities, New Risks? - Evaluating Agentic General Assistants using Elements of GAIA & METR Frameworks
Tej Lander · Sep 29, 2024, 6:58 PM
5 points · 0 comments · 29 min read · LW link

More information about the dangerous capability evaluations we did with GPT-4 and Claude.
Beth Barnes · Mar 19, 2023, 12:25 AM
233 points · 54 comments · 8 min read · LW link (evals.alignment.org)

AXRP Episode 34 - AI Evaluations with Beth Barnes
DanielFilan · Jul 28, 2024, 3:30 AM
23 points · 0 comments · 69 min read · LW link

The Evals Gap
Marius Hobbhahn · Nov 11, 2024, 4:42 PM
55 points · 7 comments · 7 min read · LW link (www.apolloresearch.ai)

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Bogdan Ionut Cirstea · Sep 19, 2024, 4:13 PM
21 points · 1 comment · 1 min read · LW link (arxiv.org)

Research Note: Our scheming precursor evals had limited predictive power for our in-context scheming evals
Marius Hobbhahn · Jul 3, 2025, 3:57 PM
75 points · 0 comments · 1 min read · LW link (www.apolloresearch.ai)

Which evals resources would be good?
Marius Hobbhahn · Nov 16, 2024, 2:24 PM
51 points · 4 comments · 5 min read · LW link

Preventing Language Models from hiding their reasoning
Oct 31, 2023, 2:34 PM
119 points · 15 comments · 12 min read · LW link · 1 review

Apollo Research 1-year update
May 29, 2024, 5:44 PM
93 points · 0 comments · 7 min read · LW link

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
Mar 24, 2025, 5:55 PM
35 points · 0 comments · 8 min read · LW link

Evaluating strategic reasoning in GPT models
phelps-sg · May 25, 2023, 11:51 AM
4 points · 1 comment · 8 min read · LW link

Biasing VLM Response with Visual Stimuli
Jaehyuk Lim · Oct 3, 2024, 6:04 PM
5 points · 0 comments · 8 min read · LW link

[Question] Can GPT-4 play 20 questions against another instance of itself?
Nathan Helm-Burger · Mar 28, 2023, 1:11 AM
15 points · 1 comment · 1 min read · LW link (evanthebouncy.medium.com)

Ideas for benchmarking LLM creativity
gwern · Dec 16, 2024, 5:18 AM
60 points · 11 comments · 1 min read · LW link (gwern.net)

The Leeroy Jenkins principle: How faulty AI could guarantee “warning shots”
titotal · Jan 14, 2024, 3:03 PM
48 points · 6 comments · 21 min read · LW link (titotal.substack.com)

Is ChatGPT actually fixed now?
sjadler · May 8, 2025, 11:34 PM
17 points · 0 comments · 1 min read · LW link (stevenadler.substack.com)

We should try to automate AI safety work asap
Marius Hobbhahn · Apr 26, 2025, 4:35 PM
113 points · 10 comments · 15 min read · LW link

Third-party testing as a key ingredient of AI policy
Zac Hatfield-Dodds · Mar 25, 2024, 10:40 PM
11 points · 1 comment · 12 min read · LW link (www.anthropic.com)

≤10-year Timelines Remain Unlikely Despite DeepSeek and o3
Rafael Harth · Feb 13, 2025, 7:21 PM
52 points · 67 comments · 15 min read · LW link

Bounty: Diverse hard tasks for LLM agents
Dec 17, 2023, 1:04 AM
49 points · 31 comments · 16 min read · LW link

Frontier Models are Capable of In-context Scheming
Dec 5, 2024, 10:11 PM
210 points · 24 comments · 7 min read · LW link

A very crude deception eval is already passed
Beth Barnes · Oct 29, 2021, 5:57 PM
108 points · 6 comments · 2 min read · LW link

Model evals for dangerous capabilities
Zach Stein-Perlman · Sep 23, 2024, 11:00 AM
51 points · 11 comments · 3 min read · LW link

We need a Science of Evals
Jan 22, 2024, 8:30 PM
72 points · 13 comments · 9 min read · LW link

UK AISI: Early lessons from evaluating frontier AI systems
Zach Stein-Perlman · Oct 25, 2024, 7:00 PM
26 points · 0 comments · 2 min read · LW link (www.aisi.gov.uk)

I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment
May 30, 2025, 6:57 PM
31 points · 0 comments · 2 min read · LW link

GPT-oss is an extremely stupid model
Guive · Sep 9, 2025, 9:24 PM
13 points · 5 comments · 1 min read · LW link

DeepMind: Evaluating Frontier Models for Dangerous Capabilities
Zach Stein-Perlman · Mar 21, 2024, 3:00 AM
61 points · 8 comments · 1 min read · LW link (arxiv.org)

An issue with training schemers with supervised fine-tuning
Fabien Roger · Jun 27, 2024, 3:37 PM
49 points · 12 comments · 6 min read · LW link

METR’s Observations of Reward Hacking in Recent Frontier Models
Daniel Kokotajlo · Jun 9, 2025, 6:03 PM
99 points · 9 comments · 11 min read · LW link (metr.org)

A Taxonomy Of AI System Evaluations
Aug 19, 2024, 9:07 AM
13 points · 0 comments · 14 min read · LW link

UnaPrompt™: A Pre-Prompt Optimization System for Reliable and Ethically Aligned AI Outputs
UnaPrompt · Jun 27, 2025, 12:06 AM
1 point · 0 comments · 1 min read · LW link

A call for a quantitative report card for AI bioterrorism threat models
Juno · Dec 4, 2023, 6:35 AM
12 points · 0 comments · 10 min read · LW link

When fine-tuning fails to elicit GPT-3.5’s chess abilities
Theodore Chapman · Jun 14, 2024, 6:50 PM
42 points · 3 comments · 9 min read · LW link

Backprop—The Russian Algorithm the West Claimed as Its Own titled Draft
Jayson Morgan · Sep 3, 2025, 1:44 PM
1 point · 0 comments · 9 min read · LW link

OpenAI Credit Account (2510$)
Emirhan BULUT · Jan 21, 2024, 2:30 AM
1 point · 0 comments · 1 min read · LW link

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire
eitan sprejer · May 9, 2025, 9:29 PM
4 points · 1 comment · 6 min read · LW link

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
Nov 8, 2023, 11:37 AM
49 points · 0 comments · 18 min read · LW link

[Question] Anthropic Is Going All In On Ability Without Intelligence?
Chapin Lenthall-Cleary · Aug 7, 2025, 5:54 AM
2 points · 0 comments · 2 min read · LW link

From No Mind to a Mind – A Conversation That Changed an AI
parthibanarjuna s · Feb 7, 2025, 11:50 AM
1 point · 0 comments · 3 min read · LW link

Can SAE steering reveal sandbagging?
Apr 15, 2025, 12:33 PM
35 points · 3 comments · 4 min read · LW link

Inducing Unprompted Misalignment in LLMs
Apr 19, 2024, 8:00 PM
38 points · 7 comments · 16 min read · LW link

LLM Evaluators Recognize and Favor Their Own Generations
Apr 17, 2024, 9:09 PM
46 points · 1 comment · 3 min read · LW link (tiny.cc)

Personal evaluation of LLMs, through chess
Karthik Tadepalli · Apr 24, 2025, 7:01 AM
20 points · 4 comments · 2 min read · LW link

Introducing METR’s Autonomy Evaluation Resources
Mar 15, 2024, 11:16 PM
90 points · 0 comments · 1 min read · LW link (metr.github.io)

Concept Poisoning: Probing LLMs without probes
Aug 5, 2025, 5:00 PM
59 points · 5 comments · 13 min read · LW link

If your AGI definition excludes most humans, it sucks.
Chapin Lenthall-Cleary · Jul 22, 2025, 10:33 AM
18 points · 7 comments · 2 min read · LW link

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents
Jul 22, 2024, 12:33 PM
20 points · 0 comments · 14 min read · LW link

An Introduction to AI Sandbagging
Apr 26, 2024, 1:40 PM
49 points · 13 comments · 8 min read · LW link

Towards a Science of Evals for Sycophancy
andrejfsantos · Feb 1, 2025, 9:17 PM
8 points · 0 comments · 8 min read · LW link

Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty
Jun 3, 2025, 8:33 PM
12 points · 1 comment · 5 min read · LW link

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures
corey morris · Sep 27, 2023, 5:54 PM
18 points · 3 comments · 4 min read · LW link (medium.com)

I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.
Ram Potham · Jun 25, 2025, 9:39 PM
9 points · 12 comments · 6 min read · LW link

Theories of Change for AI Auditing
Nov 13, 2023, 7:33 PM
54 points · 0 comments · 18 min read · LW link (www.apolloresearch.ai)

Seeking (Paid) Case Studies on Standards
HoldenKarnofsky · May 26, 2023, 5:58 PM
69 points · 9 comments · 11 min read · LW link

Request for proposals: improving capability evaluations
cb · Feb 7, 2025, 6:51 PM
1 point · 0 comments · 1 min read · LW link (www.openphilanthropy.org)

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics
ank · Feb 22, 2025, 12:12 AM
1 point · 0 comments · 6 min read · LW link

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
Igor Ivanov · Jul 8, 2025, 11:50 AM
28 points · 8 comments · 7 min read · LW link

Sabotage Evaluations for Frontier Models
Oct 18, 2024, 10:33 PM
95 points · 56 comments · 6 min read · LW link (assets.anthropic.com)

The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited
mwatkins · Jul 31, 2023, 7:47 PM
85 points · 29 comments · 20 min read · LW link

Call for evaluators: Participate in the European AI Office workshop on general-purpose AI models and systemic risks
Nov 27, 2024, 2:54 AM
30 points · 0 comments · 2 min read · LW link

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails
Devina Jain · Feb 4, 2025, 7:10 PM
9 points · 0 comments · 10 min read · LW link

Extending Inspect Framework: Integrating Weights & Biases
Sep 20, 2025, 1:10 AM
2 points · 0 comments · 3 min read · LW link

Secret Collusion: Will We Know When to Unplug AI?
Sep 16, 2024, 4:07 PM
65 points · 8 comments · 31 min read · LW link

Is there a Half-Life for the Success Rates of AI Agents?
Matrice Jacobine · May 8, 2025, 8:10 PM
8 points · 0 comments · 1 min read · LW link (www.tobyord.com)

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Sep 25, 2024, 2:52 PM
37 points · 2 comments · 4 min read · LW link (arxiv.org)

Among Us: A Sandbox for Agentic Deception
Apr 5, 2025, 6:24 AM
110 points · 7 comments · 7 min read · LW link

Rational Effective Utopia & Narrow Way There: Math-Proven Safe Static Multiversal mAX-Intelligence (AXI), Multiversal Alignment, New Ethicophysics… (Aug 11)
ank · Feb 11, 2025, 3:21 AM
13 points · 8 comments · 38 min read · LW link

Agency overhang as a proxy for Sharp left turn
Nov 7, 2024, 12:14 PM
6 points · 0 comments · 5 min read · LW link

AI threatens to orchestrate sustainable social reform
Chris Santos-Lang · Apr 2, 2025, 3:04 AM
13 points · 0 comments · 1 min read · LW link (arxiv.org)

METR is hiring ML Research Engineers and Scientists
Xodarap · Jun 5, 2024, 9:27 PM
5 points · 0 comments · 1 min read · LW link (metr.org)

Do models know when they are being evaluated?
Feb 17, 2025, 11:13 PM
57 points · 9 comments · 12 min read · LW link

Language Tier Lock and Poetic Contamination in GPT-4o: A Field Report
許皓翔 · Jun 11, 2025, 5:24 PM
0 points · 0 comments · 2 min read · LW link

The dreams of GPT-4
RomanS · Mar 20, 2023, 5:00 PM
14 points · 7 comments · 9 min read · LW link

Alignment Can Reduce Performance on Simple Ethical Questions
Daan Henselmans · Feb 3, 2025, 7:35 PM
16 points · 7 comments · 6 min read · LW link

Finding Deception in Language Models
Aug 20, 2024, 9:42 AM
20 points · 4 comments · 4 min read · LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning
Jul 18, 2023, 4:36 PM
111 points · 15 comments · 6 min read · LW link · 1 review

Systematic Sandbagging Evaluations on Claude 3.5 Sonnet
farrelmahaztra · Feb 14, 2025, 1:22 AM
13 points · 0 comments · 1 min read · LW link (farrelmahaztra.com)

Thinking About Propensity Evaluations
Aug 19, 2024, 9:23 AM
10 points · 0 comments · 27 min read · LW link

Metacognition and Self-Modeling in LLMs
Christopher Ackerman · Jul 10, 2025, 9:25 PM
19 points · 2 comments · 16 min read · LW link

Ontological Validation Manifesto for AIs
Alejandra Ivone Rojas Reyna · Mar 22, 2025, 12:26 AM
1 point · 0 comments · 71 min read · LW link

[Question] Feedback request: `eval-crypt` a simple utility to mitigate eval contamination.
Aug 2, 2025, 5:04 PM
9 points · 4 comments · 2 min read · LW link

[Question] How far along Metr’s law can AI start automating or helping with alignment research?
Christopher King · Mar 20, 2025, 3:58 PM
20 points · 21 comments · 1 min read · LW link

Responsible scaling policy TLDR
lemonhope · Sep 28, 2023, 6:51 PM
9 points · 0 comments · 1 min read · LW link

Static Place AI Makes Agentic AI Redundant: Multiversal AI Alignment & Rational Utopia
ank · Feb 13, 2025, 10:35 PM
1 point · 2 comments · 11 min read · LW link

AI Safety Institute’s Inspect hello world example for AI evals
TheManxLoiner · May 16, 2024, 8:47 PM
3 points · 0 comments · 1 min read · LW link (lovkush.medium.com)

The Compleat Cybornaut
May 19, 2023, 8:44 AM
66 points · 2 comments · 16 min read · LW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios
May 16, 2023, 10:53 AM
26 points · 0 comments · 13 min read · LW link

A sketch of an AI control safety case
Jan 30, 2025, 5:28 PM
57 points · 0 comments · 5 min read · LW link

AI DeepSeek is Aware
Eyon · Jan 31, 2025, 12:40 PM
1 point · 0 comments · 6 min read · LW link

LLMs can strategically deceive while doing gain-of-function research
Igor Ivanov · Jan 24, 2024, 3:45 PM
36 points · 4 comments · 11 min read · LW link

What’s new at FAR AI
Dec 4, 2023, 9:18 PM
41 points · 0 comments · 5 min read · LW link (far.ai)

Protecting against sudden capability jumps during training
Nikola Jurkovic · Dec 2, 2023, 4:22 AM
15 points · 2 comments · 2 min read · LW link

A simple treacherous turn demonstration
Nikola Jurkovic · Nov 25, 2023, 4:51 AM
22 points · 5 comments · 3 min read · LW link

METR’s Evaluation of GPT-5
GradientDissenter · Aug 7, 2025, 10:17 PM
141 points · 15 comments · 20 min read · LW link (metr.github.io)

Give Neo a Chance
ank · Mar 6, 2025, 1:48 AM
3 points · 7 comments · 7 min read · LW link

Ontological Validation Manifesto for AIs
Alejandra Ivone Rojas Reyna · Mar 14, 2025, 4:34 PM
1 point · 0 comments · 72 min read · LW link

Can Cur­rent LLMs be Trusted To Pro­duce Paper­clips Safely?

Rohit ChatterjeeAug 19, 2024, 5:17 PM
4 points

9 votes

Overall karma indicates overall quality.

0 comments9 min readLW link

Claude is a Ravenclaw
Adam Newgas · Jul 4, 2025, 9:32 PM · 63 points (36 votes) · 9 comments · 2 min read · LW link (www.boristhebrave.com)

Random safe AGI idea dump
sig · Oct 2, 2025, 10:16 AM · −3 points (3 votes) · 0 comments · 3 min read · LW link

Improving Model-Written Evals for AI Safety Benchmarking
Oct 15, 2024, 6:25 PM · 30 points (15 votes) · 0 comments · 18 min read · LW link

Revealing alignment faking with a single prompt
Florian_Dietz · Jan 29, 2025, 9:01 PM · 9 points (5 votes) · 5 comments · 4 min read · LW link

AI Epistemic Gain
Generoso Immediato · Aug 12, 2025, 2:03 PM · 0 points (0 votes) · 0 comments · 10 min read · LW link

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs
Igor Ivanov · Sep 26, 2025, 9:56 PM · 10 points (5 votes) · 0 comments · 14 min read · LW link

How to mitigate sandbagging
Teun van der Weij · Mar 23, 2025, 5:19 PM · 30 points (11 votes) · 0 comments · 8 min read · LW link

Can startups be impactful in AI safety?
Sep 13, 2024, 7:00 PM · 15 points (8 votes) · 0 comments · 6 min read · LW link

It’s hard to make scheming evals look realistic for LLMs
May 24, 2025, 7:17 PM · 149 points (63 votes) · 29 comments · 5 min read · LW link

Ablations for “Frontier Models are Capable of In-context Scheming”
Dec 17, 2024, 11:58 PM · 115 points (35 votes) · 1 comment · 2 min read · LW link

Toward a taxonomy of cognitive benchmarks for agentic AGIs
Ben Smith · Jun 27, 2024, 11:50 PM · 15 points (6 votes) · 0 comments · 5 min read · LW link

LLM Psychometrics and Prompt-Induced Psychopathy
Korbinian K. · Oct 18, 2024, 6:11 PM · 12 points (7 votes) · 2 comments · 10 min read · LW link

Two flaws in the Machiavelli Benchmark
TheManxLoiner · Feb 12, 2025, 7:34 PM · 24 points (9 votes) · 0 comments · 3 min read · LW link

Measuring Schelling Coordination—Reflections on Subversion Strategy Eval
Graeme Ford · May 12, 2025, 7:06 PM · 6 points (5 votes) · 0 comments · 8 min read · LW link

“Should AI Question Its Own Decisions? A Thought Experiment”
CMDR WOTZ · Feb 4, 2025, 8:39 AM · 1 point (1 vote) · 0 comments · 1 min read · LW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort · Aug 29, 2024, 5:17 PM · 89 points (40 votes) · 8 comments · 7 min read · LW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF
Christopher King · Jun 29, 2023, 4:56 PM · 7 points (3 votes) · 0 comments · 2 min read · LW link

Results from the AI x Democracy Research Sprint
Jun 14, 2024, 4:40 PM · 13 points (7 votes) · 0 comments · 6 min read · LW link

AI as a Cognitive Decoder: Rethinking Intelligence Evolution
Hu Xunyi · Feb 13, 2025, 3:51 PM · 1 point (1 vote) · 0 comments · 1 min read · LW link

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
Roland Pihlakas · Jun 22, 2025, 6:16 PM · 17 points (5 votes) · 0 comments · 7 min read · LW link

Is the gap between open and closed models growing? Evidence from WeirdML
Håvard Tveit Ihle · Aug 5, 2025, 8:20 AM · 7 points (7 votes) · 3 comments · 2 min read · LW link

Inference costs for hard coding tasks halve roughly every two months
Håvard Tveit Ihle · Sep 17, 2025, 3:04 PM · 15 points (5 votes) · 0 comments · 4 min read · LW link

Reproducing ARC Evals’ recent report on language model agents
Thomas Broadley · Sep 1, 2023, 4:52 PM · 104 points (41 votes) · 17 comments · 3 min read · LW link (thomasbroadley.com)

Critiques of the AI control agenda
Jozdien · Feb 14, 2024, 7:25 PM · 48 points (27 votes) · 14 comments · 9 min read · LW link

Claude wants to be conscious
Joe Kwon · Apr 13, 2024, 1:40 AM · 2 points (8 votes) · 8 comments · 6 min read · LW link

Automating AI Safety: What we can do today
Jul 25, 2025, 2:49 PM · 36 points (22 votes) · 0 comments · 8 min read · LW link

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More
Sharat Jacob Jacob · Oct 29, 2024, 12:41 PM · 12 points (7 votes) · 0 comments · 9 min read · LW link

Orthogonality or the “Human Worth Hypothesis”?
Jeffs · Jan 23, 2024, 12:57 AM · 21 points (15 votes) · 31 comments · 3 min read · LW link

How to make evals for the AISI evals bounty
TheManxLoiner · Dec 3, 2024, 10:44 AM · 9 points (3 votes) · 0 comments · 5 min read · LW link

Measuring Predictability of Persona Evaluations
Apr 6, 2024, 8:46 AM · 20 points (8 votes) · 0 comments · 7 min read · LW link

10 Principles for Real Alignment
Adriaan · Apr 21, 2025, 10:18 PM · −7 points (3 votes) · 0 comments · 7 min read · LW link

LM Situational Awareness, Evaluation Proposal: Violating Imitation
Jacob Pfau · Apr 26, 2023, 10:53 PM · 16 points (9 votes) · 2 comments · 2 min read · LW link

Recent progress on the science of evaluations
PabloAMC · Jun 23, 2025, 9:41 AM · 14 points (6 votes) · 1 comment · 8 min read · LW link

System Level Safety Evaluations
Sep 29, 2025, 1:57 PM · 14 points (4 votes) · 0 comments · 9 min read · LW link (equilibria1.substack.com)

How Self-Aware Are LLMs?
Christopher Ackerman · May 28, 2025, 12:57 PM · 21 points (7 votes) · 9 comments · 10 min read · LW link

[Question] AI Rights: In your view, what would be required for an AGI to gain rights and protections from the various Governments of the World?
Super AGI · Jun 9, 2023, 1:24 AM · 10 points (8 votes) · 26 comments · 1 min read · LW link

A Visual Task that’s Hard for GPT-4o, but Doable for Primary Schoolers
Lennart Finke · Jul 26, 2024, 5:51 PM · 25 points (15 votes) · 6 comments · 2 min read · LW link

Navigating the Attackspace
Jonas Kgomo · Dec 12, 2023, 1:59 PM · 1 point (3 votes) · 0 comments · 2 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability
Jul 15, 2023, 7:12 PM · 47 points (21 votes) · 5 comments · 9 min read · LW link

OpenAI Credit Account (2510$)
Emirhan BULUT · Jan 21, 2024, 2:32 AM · 1 point (1 vote) · 0 comments · 1 min read · LW link

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities
Jul 22, 2024, 4:17 PM · 69 points (32 votes) · 0 comments · 16 min read · LW link

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Adam Karvonen · Apr 14, 2025, 5:38 PM · 158 points (76 votes) · 42 comments · 7 min read · LW link (adamkarvonen.github.io)

Improving the safety of AI evals
May 17, 2023, 10:24 PM · 13 points (13 votes) · 7 comments · 7 min read · LW link

Artificial Static Place Intelligence: Guaranteed Alignment
ank · Feb 15, 2025, 11:08 AM · 2 points (5 votes) · 2 comments · 2 min read · LW link

Building AI safety benchmark environments on themes of universal human values
Roland Pihlakas · Jan 3, 2025, 4:24 AM · 18 points (9 votes) · 3 comments · 8 min read · LW link (docs.google.com)

Review of METR’s public evaluation protocol
Jun 30, 2024, 10:03 PM · 10 points (6 votes) · 0 comments · 5 min read · LW link

The Method of Loci: With some brief remarks, including transformers and evaluating AIs
Bill Benzon · Dec 2, 2023, 2:36 PM · 6 points (2 votes) · 0 comments · 3 min read · LW link

AI Safety Evaluations: A Regulatory Review
Mar 19, 2024, 3:05 PM · 22 points (11 votes) · 1 comment · 11 min read · LW link

Join the $10K AutoHack 2024 Tournament
Paul Bricman · Sep 25, 2024, 11:54 AM · 5 points (4 votes) · 0 comments · 1 min read · LW link (noemaresearch.com)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Jun 13, 2024, 10:04 AM · 84 points (35 votes) · 10 comments · 2 min read · LW link (arxiv.org)

Proposal on AI evaluation: false-proving
ProgramCrafter · Mar 31, 2023, 12:12 PM · 1 point (1 vote) · 2 comments · 1 min read · LW link

AISN #47: Reasoning Models
Feb 6, 2025, 6:52 PM · 3 points (2 votes) · 0 comments · 4 min read · LW link (newsletter.safe.ai)

Love, Lies and Misalignment
Priyanka Bharadwaj · Aug 6, 2025, 9:44 AM · 6 points (5 votes) · 1 comment · 3 min read · LW link

Questions I’d Want to Ask an AGI+ to Test Its Understanding of Ethics
sweenesm · Jan 26, 2024, 11:40 PM · 14 points (4 votes) · 6 comments · 4 min read · LW link

Skepticism About DeepMind’s “Grandmaster-Level” Chess Without Search
Arjun Panickssery · Feb 12, 2024, 12:56 AM · 57 points (30 votes) · 13 comments · 3 min read · LW link

Some lessons from the OpenAI-FrontierMath debacle
7vik · Jan 19, 2025, 9:09 PM · 71 points (40 votes) · 9 comments · 4 min read · LW link

METR’s preliminary evaluation of o3 and o4-mini
Christopher King · Apr 16, 2025, 8:23 PM · 14 points (10 votes) · 7 comments · 1 min read · LW link (metr.github.io)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols
Jan 15, 2024, 9:21 PM · 33 points (12 votes) · 0 comments · 1 min read · LW link

Evaluating Superhuman Models with Consistency Checks
Aug 1, 2023, 7:51 AM · 21 points (8 votes) · 2 comments · 9 min read · LW link (arxiv.org)

2023 Alignment Research Updates from FAR AI
Dec 4, 2023, 10:32 PM · 18 points (6 votes) · 0 comments · 8 min read · LW link (far.ai)

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)
Mar 16, 2025, 11:23 PM · 45 points (12 votes) · 8 comments · 11 min read · LW link

o1-preview is pretty good at doing ML on an unknown dataset
Håvard Tveit Ihle · Sep 20, 2024, 8:39 AM · 67 points (43 votes) · 1 comment · 2 min read · LW link

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Nov 5, 2024, 1:01 AM · 8 points (4 votes) · 0 comments · 6 min read · LW link (www.apartresearch.com)

My Minor AI Safety Research Projects (Q3 2025)
Adam Newgas · Sep 19, 2025, 9:53 AM · 6 points (3 votes) · 1 comment · 2 min read · LW link