[Replication] Conjecture's Sparse Coding in Small Transformers

16 Jun 2023 18:02 UTC
16 points
0 comments · 5 min read · LW link

AXRP Episode 22 - Shard Theory with Quintin Pope

DanielFilan · 15 Jun 2023 19:00 UTC
44 points
4 comments · 93 min read · LW link

Instrumental Convergence? [Draft]

J. Dmitri Gallow · 14 Jun 2023 20:21 UTC
53 points
11 comments · 33 min read · LW link

MetaAI: less is less for alignment.

Cleo Nardo · 13 Jun 2023 14:08 UTC
63 points
12 comments · 5 min read · LW link

Virtual AI Safety Unconference (VAISU)

13 Jun 2023 9:56 UTC
14 points
0 comments · 1 min read · LW link

TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI

Andrew_Critch · 13 Jun 2023 5:04 UTC
57 points
1 comment · 1 min read · LW link

Contingency: A Conceptual Tool from Evolutionary Biology for Alignment

clem_acs · 12 Jun 2023 20:54 UTC
39 points
0 comments · 14 min read · LW link
(acsresearch.org)

ARC is hiring theoretical researchers

12 Jun 2023 18:50 UTC
122 points
10 comments · 4 min read · LW link
(www.alignment.org)

Introduction to Towards Causal Foundations of Safe AGI

12 Jun 2023 17:55 UTC
54 points
5 comments · 4 min read · LW link

Explicitness

TsviBT · 12 Jun 2023 15:05 UTC
25 points
0 comments · 15 min read · LW link

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

likenneth · 11 Jun 2023 5:38 UTC
171 points
3 comments · 1 min read · LW link
(arxiv.org)

an Evangelion dialogue explaining the QACI alignment plan

Tamsin Leake · 10 Jun 2023 3:28 UTC
41 points
12 comments · 45 min read · LW link
(carado.moe)

formalizing the QACI alignment formal-goal

10 Jun 2023 3:28 UTC
45 points
3 comments · 14 min read · LW link
(carado.moe)

How biosafety could inform AI standards

Olivia Jimenez · 9 Jun 2023 14:41 UTC
51 points
3 comments · 10 min read · LW link

A comparison of causal scrubbing, causal abstractions, and related methods

8 Jun 2023 23:40 UTC
52 points
0 comments · 22 min read · LW link

Takeaways from the Mechanistic Interpretability Challenges

scasper · 8 Jun 2023 18:56 UTC
92 points
5 comments · 6 min read · LW link

What will GPT-2030 look like?

jsteinhardt · 7 Jun 2023 23:40 UTC
136 points
37 comments · 23 min read · LW link
(bounded-regret.ghost.io)

An Exercise to Build Intuitions on AGI Risk

Lauro Langosco · 7 Jun 2023 18:35 UTC
48 points
3 comments · 8 min read · LW link

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky · 6 Jun 2023 18:05 UTC
91 points
31 comments · 14 min read · LW link

AISC end of program presentations

6 Jun 2023 15:45 UTC
18 points
0 comments · 1 min read · LW link