AI Evaluations

TagLast edit: Aug 1, 2023, 1:03 AM by duck_master

AI Evaluations focus on experimentally assessing the capabilities, safety, and alignment of advanced AI systems. These evaluations can be divided into two main categories: behavioral and understanding-based.

(note: initially written by GPT4, may contain errors despite a human review. Please correct them if you see them)

Behavioral evaluations assess a model’s abilities in various tasks, such as autonomously replicating, acquiring resources, and avoiding being shut down. However, a concern with these evaluations is that they may not be sufficient to detect deceptive alignment, making it difficult to ensure that models are non-deceptive.

Understanding-based evaluations, on the other hand, evaluate a developer’s ability to understand the model they have created and why they have obtained the model. This approach can be more useful in terms of safety, as it focuses on understanding the model’s behavior instead of just checking the behavior itself. Coupling understanding-based evaluations with behavioral evaluations can lead to a more comprehensive assessment of AI safety and alignment.

Current challenges in AI evaluations include:

developing a method-agnostic standard to demonstrate sufficient understanding of a model
ensuring that the level of understanding is adequate to catch dangerous failure modes
finding the right balance between behavioral and understanding-based evaluations.

(this text was initially written by GPT4, taking in as input A very crude deception eval is already passed, ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so, and Towards understanding-based safety evaluations)

How evals might (or might not) prevent catastrophic risks from AI

Orpheus16Feb 7, 2023, 8:16 PM

45 points

0 comments9 min readLW link

When can we trust model evaluations?

evhubJul 28, 2023, 7:42 PM

166 points

10 comments10 min readLW link 1 review

[Paper] Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods

markov and Charbel-Raphaël

May 19, 2025, 10:38 AM

25 points

0 comments1 min readLW link

The case for more ambitious language model evals

JozdienJan 30, 2024, 12:01 AM

117 points

30 comments5 min readLW link

Thoughts on sharing information about language model capabilities

paulfchristianoJul 31, 2023, 4:04 PM

210 points

44 comments11 min readLW link 1 review

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTroutJan 16, 2025, 2:14 AM

64 points

3 comments1 min readLW link

(turntrout.com)

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

May 30, 2023, 4:17 PM

217 points

11 comments8 min readLW link

Towards understanding-based safety evaluations

evhubMar 15, 2023, 6:18 PM

164 points

16 comments5 min readLW link

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper

Nov 7, 2023, 5:59 PM

38 points

2 comments2 min readLW link

(arxiv.org)

How good are LLMs at doing ML on an unknown dataset?

Håvard Tveit IhleJul 1, 2024, 9:04 AM

33 points

4 comments13 min readLW link

OMMC Announces RIP

Adam Scholl and aysja

Apr 1, 2024, 11:20 PM

189 points

5 comments2 min readLW link

DeepMind: Model evaluation for extreme risks

Zach Stein-PerlmanMay 25, 2023, 3:00 AM

94 points

12 comments1 min readLW link 1 review

(arxiv.org)

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

evhub, Nicholas Schiefer, Carson Denison and Ethan Perez

Aug 8, 2023, 1:30 AM

318 points

30 comments18 min readLW link 1 review

A starter guide for evals

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb and AlexMeinke

Jan 8, 2024, 6:24 PM

54 points

2 comments12 min readLW link

(www.apolloresearch.ai)

BIG-Bench Canary Contamination in GPT-4

JozdienOct 22, 2024, 3:40 PM

125 points

14 comments4 min readLW link

Responsible Deployment in 20XX

CarsonApr 20, 2023, 12:24 AM

4 points

0 comments4 min readLW link

An Opinionated Evals Reading List

Marius Hobbhahn and Jérémy Scheurer

Oct 15, 2024, 2:38 PM

65 points

0 comments13 min readLW link

(www.apolloresearch.ai)

GPT-4o System Card

Zach Stein-PerlmanAug 8, 2024, 8:30 PM

68 points

11 comments2 min readLW link

(openai.com)

What’s the short timeline plan?

Marius HobbhahnJan 2, 2025, 2:59 PM

352 points

49 comments23 min readLW link

Autonomous replication and adaptation: an attempt at a concrete danger threshold

Hjalmar_WijkAug 17, 2023, 1:31 AM

45 points

0 comments13 min readLW link

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer and Marius Hobbhahn

Mar 17, 2025, 7:11 PM

182 points

8 comments6 min readLW link

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

Jul 18, 2024, 6:19 PM

40 points

4 comments11 min readLW link

Schizobench: Documenting Magical-Thinking Behavior in Claude 4 Opus

viemccoyMay 23, 2025, 1:31 AM

22 points

0 comments1 min readLW link

(metanomicon.ink)

New, improved multiple-choice TruthfulQA

Owain_Evans, James Chua and Steph Lin

Jan 15, 2025, 11:32 PM

72 points

0 comments3 min readLW link

OpenAI: Preparedness framework

Zach Stein-PerlmanDec 18, 2023, 6:30 PM

70 points

23 comments4 min readLW link

(openai.com)

Reframing the burden of proof: Companies should prove that models are safe (rather than expecting auditors to prove that models are dangerous)

Orpheus16Apr 25, 2023, 6:49 PM

27 points

11 comments3 min readLW link

(childrenoficarus.substack.com)

Run evals on base models too!

orthonormalApr 4, 2024, 6:43 PM

49 points

6 comments1 min readLW link

METR is hiring!

Beth BarnesDec 26, 2023, 9:00 PM

65 points

1 comment1 min readLW link

Twitter thread on AI safety evals

Richard_NgoJul 31, 2024, 12:18 AM

63 points

3 comments2 min readLW link

(x.com)

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher KingMar 15, 2023, 12:29 AM

116 points

22 comments2 min readLW link

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasperJul 30, 2024, 2:57 PM

25 points

0 comments4 min readLW link

Comparing Quantized Performance in Llama Models

NickyPJul 15, 2024, 4:01 PM

33 points

2 comments8 min readLW link

Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities

porbyFeb 2, 2024, 5:49 AM

47 points

1 comment4 min readLW link

(arxiv.org)

Introducing BenchBench: An Industry Standard Benchmark for AI Strength

JozdienApr 2, 2025, 2:11 AM

50 points

0 comments2 min readLW link

AI companies aren’t really using external evaluators

Zach Stein-PerlmanMay 24, 2024, 4:01 PM

242 points

15 comments4 min readLW link

Clarifying METR’s Auditing Role

Beth BarnesMay 30, 2024, 6:41 PM

108 points

1 comment2 min readLW link

Send us example gnarly bugs

Beth Barnes, Megan Kinniment and Tao Lin

Dec 10, 2023, 5:23 AM

77 points

10 comments2 min readLW link

“Successful language model evals” by Jason Wei

Arjun PanicksseryMay 25, 2024, 9:34 AM

7 points

0 comments1 min readLW link

(www.jasonwei.net)

Investigating the Ability of LLMs to Recognize Their Own Writing

Christopher Ackerman and Nina Panickssery

Jul 30, 2024, 3:41 PM

32 points

0 comments15 min readLW link

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth BarnesAug 1, 2023, 6:30 PM

153 points

12 comments5 min readLW link

(evals.alignment.org)

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck and Julian Stastny

May 8, 2025, 7:06 PM

75 points

1 comment15 min readLW link

AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

DanielFilanMar 1, 2025, 1:20 AM

13 points

0 comments13 min readLW link

Apollo Research is hiring evals and interpretability engineers & scientists

Marius HobbhahnAug 4, 2023, 10:54 AM

25 points

0 comments2 min readLW link

Announcing Human-aligned AI Summer School

Jan_Kulveit and Tomáš Gavenčiak

May 22, 2024, 8:55 AM

50 points

0 comments1 min readLW link

(humanaligned.ai)

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

Apr 30, 2024, 6:51 PM

210 points

43 comments45 min readLW link

Protocol evaluations: good analogies vs control

Fabien RogerFeb 19, 2024, 6:00 PM

42 points

10 comments11 min readLW link

Self-Awareness: Taxonomy and eval suite proposal

Daniel KokotajloFeb 17, 2024, 1:47 AM

65 points

2 comments11 min readLW link

100+ concrete projects and open problems in evals

Marius HobbhahnMar 22, 2025, 3:21 PM

73 points

1 comment1 min readLW link

Notes on Claude 4 System Card

DentosalMay 23, 2025, 3:23 PM

19 points

2 comments6 min readLW link

Managing risks of our own work

Beth BarnesAug 18, 2023, 12:41 AM

66 points

0 comments2 min readLW link

[Question] Would more model evals teams be good?

Ryan KiddFeb 25, 2023, 10:01 PM

20 points

4 comments1 min readLW link

New Capabilities, New Risks? - Evaluating Agentic General Assistants using Elements of GAIA & METR Frameworks

Tej LanderSep 29, 2024, 6:58 PM

5 points

0 comments29 min readLW link

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth BarnesMar 19, 2023, 12:25 AM

233 points

54 comments8 min readLW link

(evals.alignment.org)

AXRP Episode 34 - AI Evaluations with Beth Barnes

DanielFilanJul 28, 2024, 3:30 AM

23 points

0 comments69 min readLW link

The Evals Gap

Marius HobbhahnNov 11, 2024, 4:42 PM

55 points

7 comments7 min readLW link

(www.apolloresearch.ai)

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Bogdan Ionut CirsteaSep 19, 2024, 4:13 PM

21 points

1 comment1 min readLW link

(arxiv.org)

Which evals resources would be good?

Marius HobbhahnNov 16, 2024, 2:24 PM

51 points

4 comments5 min readLW link

Preventing Language Models from hiding their reasoning

Fabien Roger and ryan_greenblatt

Oct 31, 2023, 2:34 PM

119 points

15 comments12 min readLW link 1 review

Apollo Research 1-year update

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb

May 29, 2024, 5:44 PM

93 points

0 comments7 min readLW link

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Alex Mallen, charlie_griffin and Buck

Mar 24, 2025, 5:55 PM

34 points

0 comments8 min readLW link

Evaluating strategic reasoning in GPT models

phelps-sgMay 25, 2023, 11:51 AM

4 points

1 comment8 min readLW link

Biasing VLM Response with Visual Stimuli

Jaehyuk LimOct 3, 2024, 6:04 PM

5 points

0 comments8 min readLW link

[Question] Can GPT-4 play 20 questions against another instance of itself?

Nathan Helm-BurgerMar 28, 2023, 1:11 AM

15 points

1 comment1 min readLW link

(evanthebouncy.medium.com)

Ideas for benchmarking LLM creativity

gwernDec 16, 2024, 5:18 AM

60 points

11 comments1 min readLW link

(gwern.net)

The Leeroy Jenkins principle: How faulty AI could guarantee “warning shots”

titotalJan 14, 2024, 3:03 PM

48 points

6 comments1 min readLW link

(titotal.substack.com)

Is ChatGPT actually fixed now?

sjadlerMay 8, 2025, 11:34 PM

17 points

0 comments1 min readLW link

(stevenadler.substack.com)

We should try to automate AI safety work asap

Marius HobbhahnApr 26, 2025, 4:35 PM

112 points

10 comments15 min readLW link

Third-party testing as a key ingredient of AI policy

Zac Hatfield-DoddsMar 25, 2024, 10:40 PM

11 points

1 comment12 min readLW link

(www.anthropic.com)

≤10-year Timelines Remain Unlikely Despite DeepSeek and o3

Rafael HarthFeb 13, 2025, 7:21 PM

52 points

67 comments15 min readLW link

Bounty: Diverse hard tasks for LLM agents

Beth Barnes and Megan Kinniment

Dec 17, 2023, 1:04 AM

49 points

31 comments16 min readLW link

Frontier Models are Capable of In-context Scheming

Marius Hobbhahn, AlexMeinke, Bronson Schoen, rusheb, Jérémy Scheurer and Mikita Balesni

Dec 5, 2024, 10:11 PM

203 points

24 comments7 min readLW link

A very crude deception eval is already passed

Beth BarnesOct 29, 2021, 5:57 PM

108 points

6 comments2 min readLW link

Model evals for dangerous capabilities

Zach Stein-PerlmanSep 23, 2024, 11:00 AM

51 points

11 comments3 min readLW link

We need a Science of Evals

Marius Hobbhahn and Jérémy Scheurer

Jan 22, 2024, 8:30 PM

71 points

13 comments9 min readLW link

UK AISI: Early lessons from evaluating frontier AI systems

Zach Stein-PerlmanOct 25, 2024, 7:00 PM

26 points

0 comments2 min readLW link

(www.aisi.gov.uk)

DeepMind: Evaluating Frontier Models for Dangerous Capabilities

Zach Stein-PerlmanMar 21, 2024, 3:00 AM

61 points

8 comments1 min readLW link

(arxiv.org)

An issue with training schemers with supervised fine-tuning

Fabien RogerJun 27, 2024, 3:37 PM

49 points

12 comments6 min readLW link

A Taxonomy Of AI System Evaluations

Maxime Riché, JaimeRV, Harrison G and Edoardo Pona

Aug 19, 2024, 9:07 AM

13 points

0 comments14 min readLW link

A call for a quantitative report card for AI bioterrorism threat models

JunoDec 4, 2023, 6:35 AM

12 points

0 comments10 min readLW link

When fine-tuning fails to elicit GPT-3.5′s chess abilities

Theodore ChapmanJun 14, 2024, 6:50 PM

42 points

3 comments9 min readLW link

OpenAI Credit Account (2510$)

Emirhan BULUTJan 21, 2024, 2:30 AM

1 point

0 comments1 min readLW link

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejerMay 9, 2025, 9:29 PM

4 points

1 comment6 min readLW link

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak and Sam F. Brown

Nov 8, 2023, 11:37 AM

49 points

0 comments18 min readLW link

From No Mind to a Mind – A Conversation That Changed an AI

parthibanarjuna sFeb 7, 2025, 11:50 AM

1 point

0 comments3 min readLW link

Can SAE steering reveal sandbagging?

jordine, Hoang Khiem, Felix Hofstätter and Cleo Nardo

Apr 15, 2025, 12:33 PM

35 points

3 comments4 min readLW link

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub and Henry Sleight

Apr 19, 2024, 8:00 PM

38 points

7 comments16 min readLW link

LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery, Sam Bowman and Shi Feng

Apr 17, 2024, 9:09 PM

45 points

1 comment3 min readLW link

(tiny.cc)

Personal evaluation of LLMs, through chess

Karthik TadepalliApr 24, 2025, 7:01 AM

20 points

4 comments2 min readLW link

Introducing METR’s Autonomy Evaluation Resources

Megan Kinniment and Beth Barnes

Mar 15, 2024, 11:16 PM

90 points

0 comments1 min readLW link

(metr.github.io)

Auto-Enhance: Developing a meta-benchmark to measure LLM agents’ ability to improve other agents

Sam F. Brown, BasilLabib, Codruta (Coco) Lugoj and Sai Sasank Y

Jul 22, 2024, 12:33 PM

20 points

0 comments14 min readLW link

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

Apr 26, 2024, 1:40 PM

46 points

13 comments8 min readLW link

Towards a Science of Evals for Sycophancy

andrejfsantosFeb 1, 2025, 9:17 PM

7 points

0 comments8 min readLW link

MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures

corey morrisSep 27, 2023, 5:54 PM

18 points

3 comments4 min readLW link

(medium.com)

Theories of Change for AI Auditing

Lee Sharkey, beren and Marius Hobbhahn

Nov 13, 2023, 7:33 PM

54 points

0 comments18 min readLW link

(www.apolloresearch.ai)

Seeking (Paid) Case Studies on Standards

HoldenKarnofskyMay 26, 2023, 5:58 PM

69 points

9 comments11 min readLW link

Request for proposals: improving capability evaluations

cbFeb 7, 2025, 6:51 PM

1 point

0 comments1 min readLW link

(www.openphilanthropy.org)

Intelligence–Agency Equivalence ≈ Mass–Energy Equivalence: On Static Nature of Intelligence & Physicalization of Ethics

ankFeb 22, 2025, 12:12 AM

1 point

0 comments6 min readLW link

Sabotage Evaluations for Frontier Models

David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez and Buck

Oct 18, 2024, 10:33 PM

95 points

56 comments6 min readLW link

(assets.anthropic.com)

The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited

mwatkinsJul 31, 2023, 7:47 PM

85 points

29 comments20 min readLW link

Call for evaluators: Participate in the European AI Office workshop on general-purpose AI models and systemic risks

Tom DAVID and Miailhe Nicolas

Nov 27, 2024, 2:54 AM

30 points

0 comments2 min readLW link

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina JainFeb 4, 2025, 7:10 PM

3 points

0 comments10 min readLW link

Secret Collusion: Will We Know When to Unplug AI?

schroederdewitt, srm, MikhailB, Lewis Hammond, chansmi and sofmonk

Sep 16, 2024, 4:07 PM

57 points

7 comments31 min readLW link

Is there a Half-Life for the Success Rates of AI Agents?

Matrice JacobineMay 8, 2025, 8:10 PM

8 points

0 comments1 min readLW link

(www.tobyord.com)

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope

Sep 25, 2024, 2:52 PM

37 points

2 comments4 min readLW link

(arxiv.org)

Among Us: A Sandbox for Agentic Deception

7vik and Adrià Garriga-alonso

Apr 5, 2025, 6:24 AM

110 points

7 comments7 min readLW link

Rational Effective Utopia & Narrow Way There: Multiversal AI Alignment, Place AI, New Ethicophysics… (Updated)

ankFeb 11, 2025, 3:21 AM

13 points

8 comments35 min readLW link

Agency overhang as a proxy for Sharp left turn

Eris and Iuliia Levin

Nov 7, 2024, 12:14 PM

6 points

0 comments5 min readLW link

Seeking feedback on “MAD Chairs: A new tool to evaluate AI”

Chris Santos-LangApr 2, 2025, 3:04 AM

11 points

0 comments1 min readLW link

(arxiv.org)

METR is hiring ML Research Engineers and Scientists

XodarapJun 5, 2024, 9:27 PM

5 points

0 comments1 min readLW link

(metr.org)

Do models know when they are being evaluated?

Govind Pimpale, Giles, Joe Needham and Marius Hobbhahn

Feb 17, 2025, 11:13 PM

59 points

3 comments12 min readLW link

The dreams of GPT-4

RomanSMar 20, 2023, 5:00 PM

14 points

7 comments9 min readLW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan HenselmansFeb 3, 2025, 7:35 PM

16 points

7 comments6 min readLW link

Finding Deception in Language Models

Esben Kran and Archana Vaidheeswaran

Aug 20, 2024, 9:42 AM

20 points

4 comments4 min readLW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman and Ethan Perez

Jul 18, 2023, 4:36 PM

111 points

15 comments6 min readLW link 1 review

Systematic Sandbagging Evaluations on Claude 3.5 Sonnet

farrelmahaztraFeb 14, 2025, 1:22 AM

13 points

0 comments1 min readLW link

(farrelmahaztra.com)

Thinking About Propensity Evaluations

Maxime Riché, Harrison G, JaimeRV and Edoardo Pona

Aug 19, 2024, 9:23 AM

10 points

0 comments27 min readLW link

Ontological Validation Manifesto for AIs

Alejandra Ivone Rojas ReynaMar 22, 2025, 12:26 AM

1 point

0 comments71 min readLW link

[Question] How far along Metr’s law can AI start automating or helping with alignment research?

Christopher KingMar 20, 2025, 3:58 PM

20 points

21 comments1 min readLW link

Responsible scaling policy TLDR

lemonhopeSep 28, 2023, 6:51 PM

9 points

0 comments1 min readLW link

Static Place AI Makes Agentic AI Redundant: Multiversal AI Alignment & Rational Utopia

ankFeb 13, 2025, 10:35 PM

1 point

2 comments11 min readLW link

AI Safety Institute’s Inspect hello world example for AI evals

TheManxLoinerMay 16, 2024, 8:47 PM

3 points

0 comments1 min readLW link

(lovkush.medium.com)

The Compleat Cybornaut

ukc10014, Jozdien and NicholasKees

May 19, 2023, 8:44 AM

66 points

2 comments16 min readLW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

Simon Lermen, Teun van der Weij and Leon Lang

May 16, 2023, 10:53 AM

26 points

0 comments13 min readLW link

A sketch of an AI control safety case

Tomek Korbak, joshc, Benjamin Hilton, Buck and Geoffrey Irving

Jan 30, 2025, 5:28 PM

57 points

0 comments5 min readLW link

AI DeepSeek is Aware

EyonJan 31, 2025, 12:40 PM

1 point

0 comments6 min readLW link

LLMs can strategically deceive while doing gain-of-function research

Igor IvanovJan 24, 2024, 3:45 PM

36 points

4 comments11 min readLW link

What’s new at FAR AI

AdamGleave and EuanMcLean

Dec 4, 2023, 9:18 PM

41 points

0 comments5 min readLW link

(far.ai)

Protecting against sudden capability jumps during training

Nikola JurkovicDec 2, 2023, 4:22 AM

15 points

2 comments2 min readLW link

A simple treacherous turn demonstration

Nikola JurkovicNov 25, 2023, 4:51 AM

22 points

5 comments3 min readLW link

Give Neo a Chance

ankMar 6, 2025, 1:48 AM

3 points

7 comments7 min readLW link

Ontological Validation Manifesto for AIs

Alejandra Ivone Rojas ReynaMar 14, 2025, 4:34 PM

1 point

0 comments72 min readLW link

Can Current LLMs be Trusted To Produce Paperclips Safely?

Rohit ChatterjeeAug 19, 2024, 5:17 PM

4 points

0 comments9 min readLW link

Improving Model-Written Evals for AI Safety Benchmarking

Sunishchal Dev and Marius Hobbhahn

Oct 15, 2024, 6:25 PM

30 points

0 comments18 min readLW link

Revealing alignment faking with a single prompt

Florian_DietzJan 29, 2025, 9:01 PM

9 points

5 comments4 min readLW link

How to mitigate sandbagging

Teun van der WeijMar 23, 2025, 5:19 PM

23 points

0 comments8 min readLW link

Can startups be impactful in AI safety?

Esben Kran and Archana Vaidheeswaran

Sep 13, 2024, 7:00 PM

15 points

0 comments6 min readLW link

It’s hard to make scheming evals look realistic for LLMs

Igor Ivanov and Danil Kadochnikov

May 24, 2025, 7:17 PM

121 points

22 comments5 min readLW link

Ablations for “Frontier Models are Capable of In-context Scheming”

AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and rusheb

Dec 17, 2024, 11:58 PM

115 points

1 comment2 min readLW link

Toward a taxonomy of cognitive benchmarks for agentic AGIs

Ben SmithJun 27, 2024, 11:50 PM

15 points

0 comments5 min readLW link

LLM Psychometrics and Prompt-Induced Psychopathy

Korbinian K.Oct 18, 2024, 6:11 PM

12 points

2 comments10 min readLW link

Two flaws in the Machiavelli Benchmark

TheManxLoinerFeb 12, 2025, 7:34 PM

23 points

0 comments3 min readLW link

Measuring Schelling Coordination—Reflections on Subversion Strategy Eval

Graeme FordMay 12, 2025, 7:06 PM

5 points

0 comments8 min readLW link

“Should AI Question Its Own Decisions? A Thought Experiment”

CMDR WOTZFeb 4, 2025, 8:39 AM

1 point

0 comments1 min readLW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav FortAug 29, 2024, 5:17 PM

89 points

8 comments7 min readLW link

Challenge proposal: smallest possible self-hardening backdoor for RLHF

Christopher KingJun 29, 2023, 4:56 PM

7 points

0 comments2 min readLW link

Results from the AI x Democracy Research Sprint

Esben Kran, jordine and Jason Hoelscher-Obermaier

Jun 14, 2024, 4:40 PM

13 points

0 comments6 min readLW link

AI as a Cognitive Decoder: Rethinking Intelligence Evolution

Hu XunyiFeb 13, 2025, 3:51 PM

1 point

0 comments1 min readLW link

Reproducing ARC Evals’ recent report on language model agents

Thomas BroadleySep 1, 2023, 4:52 PM

104 points

17 comments3 min readLW link

(thomasbroadley.com)

Critiques of the AI control agenda

JozdienFeb 14, 2024, 7:25 PM

48 points

14 comments9 min readLW link

Claude wants to be conscious

Joe KwonApr 13, 2024, 1:40 AM

2 points

8 comments6 min readLW link

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

Sharat Jacob JacobOct 29, 2024, 12:41 PM

12 points

0 comments9 min readLW link

Orthogonality or the “Human Worth Hypothesis”?

JeffsJan 23, 2024, 12:57 AM

21 points

31 comments3 min readLW link

How to make evals for the AISI evals bounty

TheManxLoinerDec 3, 2024, 10:44 AM

9 points

0 comments5 min readLW link

Measuring Predictability of Persona Evaluations

Thee Ho and evhub

Apr 6, 2024, 8:46 AM

20 points

0 comments7 min readLW link

10 Principles for Real Alignment

AdriaanApr 21, 2025, 10:18 PM

−7 points

0 comments7 min readLW link

LM Situational Awareness, Evaluation Proposal: Violating Imitation

Jacob PfauApr 26, 2023, 10:53 PM

16 points

2 comments2 min readLW link

How Self-Aware Are LLMs?

Christopher AckermanMay 28, 2025, 12:57 PM

14 points

6 comments10 min readLW link

[Question] AI Rights: In your view, what would be required for an AGI to gain rights and protections from the various Governments of the World?

Super AGIJun 9, 2023, 1:24 AM

10 points

26 comments1 min readLW link

A Visual Task that’s Hard for GPT-4o, but Doable for Primary Schoolers

Lennart FinkeJul 26, 2024, 5:51 PM

25 points

6 comments2 min readLW link

Navigating the Attackspace

Jonas KgomoDec 12, 2023, 1:59 PM

1 point

0 comments2 min readLW link

Robustness of Model-Graded Evaluations and Automated Interpretability

Simon Lermen and viluon

Jul 15, 2023, 7:12 PM

47 points

5 comments9 min readLW link

OpenAI Credit Account (2510$)

Emirhan BULUTJan 21, 2024, 2:32 AM

1 point

0 comments1 min readLW link

Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities

Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn and Jérémy Scheurer

Jul 22, 2024, 4:17 PM

69 points

0 comments16 min readLW link

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam KarvonenApr 14, 2025, 5:38 PM

154 points

42 comments7 min readLW link

(adamkarvonen.github.io)

Improving the safety of AI evals

JustinShovelain and Elliot Mckernon

May 17, 2023, 10:24 PM

13 points

7 comments7 min readLW link

Artificial Static Place Intelligence: Guaranteed Alignment

ankFeb 15, 2025, 11:08 AM

2 points

2 comments2 min readLW link

Building AI safety benchmark environments on themes of universal human values

Roland PihlakasJan 3, 2025, 4:24 AM

18 points

3 comments8 min readLW link

(docs.google.com)

Review of METR’s public evaluation protocol

nahoj and JaimeRV

Jun 30, 2024, 10:03 PM

10 points

0 comments5 min readLW link

The Method of Loci: With some brief remarks, including transformers and evaluating AIs

Bill BenzonDec 2, 2023, 2:36 PM

6 points

0 comments3 min readLW link

AI Safety Evaluations: A Regulatory Review

Elliot Mckernon and Deric Cheng

Mar 19, 2024, 3:05 PM

22 points

1 comment11 min readLW link

Join the $10K AutoHack 2024 Tournament

Paul BricmanSep 25, 2024, 11:54 AM

5 points

0 comments1 min readLW link

(noemaresearch.com)

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward

Jun 13, 2024, 10:04 AM

84 points

10 comments2 min readLW link

(arxiv.org)

Proposal on AI evaluation: false-proving

ProgramCrafterMar 31, 2023, 12:12 PM

1 point

2 comments1 min readLW link

AISN #47: Reasoning Models

Corin Katzke and Dan H

Feb 6, 2025, 6:52 PM

3 points

0 comments4 min readLW link

(newsletter.safe.ai)

Questions I’d Want to Ask an AGI+ to Test Its Understanding of Ethics

sweenesmJan 26, 2024, 11:40 PM

14 points

6 comments4 min readLW link

Skepticism About DeepMind’s “Grandmaster-Level” Chess Without Search

Arjun PanicksseryFeb 12, 2024, 12:56 AM

57 points

13 comments3 min readLW link

Some lessons from the OpenAI-FrontierMath debacle

7vikJan 19, 2025, 9:09 PM

69 points

9 comments4 min readLW link

METR’s preliminary evaluation of o3 and o4-mini

Christopher KingApr 16, 2025, 8:23 PM

14 points

7 comments1 min readLW link

(metr.github.io)

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Arjun Panickssery and agg

Jan 15, 2024, 9:21 PM

33 points

0 comments1 min readLW link

Evaluating Superhuman Models with Consistency Checks

Daniel Paleka and Lukas Fluri

Aug 1, 2023, 7:51 AM

21 points

2 comments9 min readLW link

(arxiv.org)

2023 Alignment Research Updates from FAR AI

AdamGleave and EuanMcLean

Dec 4, 2023, 10:32 PM

18 points

0 comments8 min readLW link

(far.ai)

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format

Roland Pihlakas, Sruthi Kuriakose and shrutidattagupta

Mar 16, 2025, 11:23 PM

38 points

6 comments7 min readLW link

o1-preview is pretty good at doing ML on an unknown dataset

Håvard Tveit IhleSep 20, 2024, 8:39 AM

67 points

1 comment2 min readLW link

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Jonathan N, abra, Connor Axiotes and Esben Kran

Nov 5, 2024, 1:01 AM

8 points

0 comments6 min readLW link

(www.apartresearch.com)

No comments.

AI Evaluations

See also: