AtP*: An efficient and scalable method for localizing LLM behaviour to components

18 Mar 2024 17:28 UTC
9 points
0 comments 1 min read LW link
(arxiv.org)

Improving SAE’s by Sqrt()-ing L1 & Removing Lowest Activating Features

15 Mar 2024 16:30 UTC
10 points
5 comments 4 min read LW link

More people getting into AI safety should do a PhD

AdamGleave 14 Mar 2024 22:14 UTC
48 points
23 comments 12 min read LW link
(gleave.me)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems

13 Mar 2024 17:09 UTC
32 points
12 comments 14 min read LW link

Open consultancy: Letting untrusted AIs choose what answer to argue for

Fabien Roger 12 Mar 2024 20:38 UTC
34 points
4 comments 5 min read LW link

Transformer Debugger

Henk Tillman 12 Mar 2024 19:08 UTC
25 points
0 comments 1 min read LW link
(github.com)

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

miles 11 Mar 2024 23:46 UTC
16 points
0 comments 1 min read LW link
(arxiv.org)

How disagreements about Evidential Correlations could be settled

Martín Soto 11 Mar 2024 18:28 UTC
10 points
3 comments 4 min read LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
53 points
0 comments 14 min read LW link

0th Person and 1st Person Logic

Adele Lopez 10 Mar 2024 0:56 UTC
44 points
28 comments 6 min read LW link

Scenario Forecasting Workshop: Materials and Learnings

8 Mar 2024 2:30 UTC
50 points
3 comments 2 min read LW link

Forecasting future gains due to post-training enhancements

8 Mar 2024 2:11 UTC
25 points
2 comments 1 min read LW link
(docs.google.com)

Evidential Correlations are Subjective, and it might be a problem

Martín Soto 7 Mar 2024 18:37 UTC
24 points
6 comments 14 min read LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
56 points
0 comments 12 min read LW link

Many arguments for AI x-risk are wrong

TurnTrout 5 Mar 2024 2:31 UTC
151 points
68 comments 12 min read LW link

Anthropic release Claude 3, claims >GPT-4 Performance

LawrenceC 4 Mar 2024 18:23 UTC
112 points
40 comments 2 min read LW link
(www.anthropic.com)

Some costs of superposition

Linda Linsefors 3 Mar 2024 16:08 UTC
42 points
9 comments 3 min read LW link

Approaching Human-Level Forecasting with Language Models

29 Feb 2024 22:36 UTC
59 points
6 comments 3 min read LW link

What’s in the box?! – Towards interpretability by distinguishing niches of value within neural networks.

Joshua Clancy 29 Feb 2024 18:33 UTC
3 points
4 comments 128 min read LW link

Tips for Empirical Alignment Research

Ethan Perez 29 Feb 2024 6:04 UTC
141 points
4 comments 22 min read LW link