Deployment Awareness Matters More Than Evaluation Awareness

26 Jun 2026 22:54 UTC
45 points
7 comments · 7 min read · LW link
(limits-of-evaluation.org)

The Case for Model Forensics

26 Jun 2026 15:09 UTC
45 points
0 comments · 10 min read · LW link

LLM-Driven Feature Discovery

22 Jun 2026 22:26 UTC
35 points
1 comment · 5 min read · LW link

How transparent is DiffusionGemma (and why it matters)

20 Jun 2026 20:05 UTC
86 points
2 comments · 4 min read · LW link

GDM AI Control Roadmap

18 Jun 2026 16:50 UTC
87 points
2 comments · 1 min read · LW link

Predicting LLM Safety Before Release by Simulating Deployment

16 Jun 2026 19:55 UTC
36 points
2 comments · 1 min read · LW link

Synthetic document finetuning for instilling positive traits

16 Jun 2026 0:04 UTC
61 points
1 comment · 10 min read · LW link

Why Do Naive SFT Filters For Safety Properties Fail?

14 Jun 2026 19:45 UTC
58 points
7 comments · 10 min read · LW link

SFT Drives Gemini’s Safety Properties

13 Jun 2026 15:31 UTC
85 points
4 comments · 1 min read · LW link

Building and evaluating model diffing agents

12 Jun 2026 17:14 UTC
61 points
2 comments · 12 min read · LW link

Sympathy for both sides of the egregious misalignment debate

Steven Byrnes · 12 Jun 2026 16:26 UTC
206 points
27 comments · 4 min read · LW link

Models May Behave Worse When Eval Aware

11 Jun 2026 9:28 UTC
87 points
8 comments · 13 min read · LW link

Resolution (fka Sequent): scale and automation for higher confidence in alignment

10 Jun 2026 15:37 UTC
281 points
2 comments · 12 min read · LW link
(sequent.org)

Tracing Eval-Awareness Emergence Through Training of OLMo 3

10 Jun 2026 10:13 UTC
43 points
6 comments · 6 min read · LW link

A Mike’s-Eye View of ARC’s Research

Mikewins · 9 Jun 2026 18:30 UTC
64 points
1 comment · 11 min read · LW link
(www.alignment.org)

Efficient tradeoffs and the safety-usefulness tradeoff model

Buck · 8 Jun 2026 20:28 UTC
42 points
1 comment · 8 min read · LW link

Can activation verbalizers surface an internal chain of thought?

7 Jun 2026 4:24 UTC
125 points
0 comments · 16 min read · LW link

My research: a computational cognitive neuroscience perspective on alignment

Seth Herd · 5 Jun 2026 14:19 UTC
54 points
0 comments · 18 min read · LW link

Announcing the ARC White-Box Estimation Challenge

2 Jun 2026 16:20 UTC
165 points
20 comments · 3 min read · LW link
(www.alignment.org)

Testing Gemini models for scheming tendencies

29 May 2026 19:24 UTC
47 points
8 comments · 6 min read · LW link
(deepmindsafetyresearch.medium.com)