Research Agendas

Last edit: 16 Sep 2021 15:08 UTC by plex

Research Agendas lay out the areas of research which individuals or groups are working on, or those that they believe would be valuable for others to work on. They help make research more legible and encourage discussion of priorities.

The Learning-Theoretic AI Alignment Research Agenda

Vanessa Kosoy · 4 Jul 2018 9:53 UTC
92 points
37 comments · 32 min read · LW link

New safety research agenda: scalable agent alignment via reward modeling

Vika · 20 Nov 2018 17:29 UTC
34 points
12 comments · 1 min read · LW link
(medium.com)

Embedded Agents

29 Oct 2018 19:53 UTC
228 points
41 comments · 1 min read · LW link · 2 reviews

On how various plans miss the hard bits of the alignment challenge

So8res · 12 Jul 2022 2:49 UTC
305 points
88 comments · 29 min read · LW link · 3 reviews

AI Governance: A Research Agenda

habryka · 5 Sep 2018 18:00 UTC
25 points
3 comments · 1 min read · LW link
(www.fhi.ox.ac.uk)

Research Agenda v0.9: Synthesising a human’s preferences into a utility function

Stuart_Armstrong · 17 Jun 2019 17:46 UTC
70 points
26 comments · 33 min read · LW link

Paul’s research agenda FAQ

zhukeepa · 1 Jul 2018 6:25 UTC
128 points
74 comments · 19 min read · LW link · 1 review

Our take on CHAI’s research agenda in under 1500 words

Alex Flint · 17 Jun 2020 12:24 UTC
113 points
18 comments · 5 min read · LW link

An overview of 11 proposals for building safe advanced AI

evhub · 29 May 2020 20:38 UTC
213 points
36 comments · 38 min read · LW link · 2 reviews

the QACI alignment plan: table of contents

Tamsin Leake · 21 Mar 2023 20:22 UTC
107 points
1 comment · 1 min read · LW link
(carado.moe)

AISC Project: Modelling Trajectories of Language Models

NickyP · 13 Nov 2023 14:33 UTC
27 points
0 comments · 12 min read · LW link

Embedded Agency (full-text version)

15 Nov 2018 19:49 UTC
201 points
17 comments · 54 min read · LW link

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

18 Dec 2023 20:35 UTC
166 points
21 comments · 12 min read · LW link

Trying to isolate objectives: approaches toward high-level interpretability

Jozdien · 9 Jan 2023 18:33 UTC
48 points
14 comments · 8 min read · LW link

MIRI’s technical research agenda

So8res · 23 Dec 2014 18:45 UTC
55 points
52 comments · 3 min read · LW link

Some conceptual alignment research projects

Richard_Ngo · 25 Aug 2022 22:51 UTC
176 points
15 comments · 3 min read · LW link

Deconfusing Human Values Research Agenda v1

Gordon Seidoh Worley · 23 Mar 2020 16:25 UTC
28 points
12 comments · 4 min read · LW link

Davidad’s Bold Plan for Alignment: An In-Depth Explanation

19 Apr 2023 16:09 UTC
159 points
34 comments · 21 min read · LW link

Thoughts on Human Models

21 Feb 2019 9:10 UTC
126 points
32 comments · 10 min read · LW link · 1 review

The Learning-Theoretic Agenda: Status 2023

Vanessa Kosoy · 19 Apr 2023 5:21 UTC
135 points
13 comments · 55 min read · LW link

Research agenda update

Steven Byrnes · 6 Aug 2021 19:24 UTC
55 points
40 comments · 7 min read · LW link

Preface to CLR’s Research Agenda on Cooperation, Conflict, and TAI

JesseClifton · 13 Dec 2019 21:02 UTC
62 points
10 comments · 2 min read · LW link

Announcing the Alignment of Complex Systems Research Group

4 Jun 2022 4:10 UTC
91 points
20 comments · 5 min read · LW link

New year, new research agenda post

Charlie Steiner · 12 Jan 2022 17:58 UTC
29 points
4 comments · 16 min read · LW link

Towards Hodge-podge Alignment

Cleo Nardo · 19 Dec 2022 20:12 UTC
93 points
30 comments · 9 min read · LW link

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

3 Sep 2020 18:27 UTC
68 points
11 comments · 2 min read · LW link

Key Questions for Digital Minds

Jacy Reese Anthis · 22 Mar 2023 17:13 UTC
22 points
0 comments · 7 min read · LW link
(www.sentienceinstitute.org)

The space of systems and the space of maps

22 Mar 2023 14:59 UTC
39 points
0 comments · 5 min read · LW link

Theories of impact for Science of Deep Learning

Marius Hobbhahn · 1 Dec 2022 14:39 UTC
24 points
0 comments · 11 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
94 points
22 comments · 22 min read · LW link

a narrative explanation of the QACI alignment plan

Tamsin Leake · 15 Feb 2023 3:28 UTC
56 points
29 comments · 6 min read · LW link
(carado.moe)

Sections 1 & 2: Introduction, Strategy and Governance

JesseClifton · 17 Dec 2019 21:27 UTC
35 points
8 comments · 14 min read · LW link

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper · 21 May 2024 20:15 UTC
157 points
16 comments · 3 min read · LW link

Gaia Network: a practical, incremental pathway to Open Agency Architecture

20 Dec 2023 17:11 UTC
22 points
8 comments · 16 min read · LW link

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis · 18 Dec 2023 20:08 UTC
50 points
8 comments · 5 min read · LW link

Announcing Human-aligned AI Summer School

22 May 2024 8:55 UTC
50 points
0 comments · 1 min read · LW link
(humanaligned.ai)

Assessment of AI safety agendas: think about the downside risk

Roman Leventov · 19 Dec 2023 9:00 UTC
13 points
1 comment · 1 min read · LW link

The Plan − 2023 Version

johnswentworth · 29 Dec 2023 23:34 UTC
146 points
40 comments · 31 min read · LW link

Research Jan/Feb 2024

Stephen Fowler · 1 Jan 2024 6:02 UTC
9 points
0 comments · 2 min read · LW link

Four visions of Transformative AI success

Steven Byrnes · 17 Jan 2024 20:45 UTC
112 points
22 comments · 15 min read · LW link

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov · 18 Jan 2024 10:05 UTC
5 points
2 comments · 4 min read · LW link

Gradient Descent on the Human Brain

1 Apr 2024 22:39 UTC
52 points
5 comments · 2 min read · LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

27 Apr 2024 16:04 UTC
82 points
13 comments · 13 min read · LW link

The Prop-room and Stage Cognitive Architecture

Robert Kralisch · 29 Apr 2024 0:48 UTC
13 points
4 comments · 14 min read · LW link

What and Why: Developmental Interpretability of Reinforcement Learning

Garrett Baker · 9 Jul 2024 14:09 UTC
67 points
4 comments · 6 min read · LW link

Towards the Operationalization of Philosophy & Wisdom

Thane Ruthenis · 28 Oct 2024 19:45 UTC
20 points
2 comments · 33 min read · LW link
(aiimpacts.org)

Self-prediction acts as an emergent regularizer

23 Oct 2024 22:27 UTC
84 points
4 comments · 4 min read · LW link

Seeking Collaborators

abramdemski · 1 Nov 2024 17:13 UTC
57 points
14 comments · 7 min read · LW link

Ultra-simplified research agenda

Stuart_Armstrong · 22 Nov 2019 14:29 UTC
34 points
4 comments · 1 min read · LW link

Embedded Curiosities

8 Nov 2018 14:19 UTC
91 points
1 comment · 2 min read · LW link

Subsystem Alignment

6 Nov 2018 16:16 UTC
99 points
12 comments · 1 min read · LW link

Robust Delegation

4 Nov 2018 16:38 UTC
116 points
10 comments · 1 min read · LW link

Embedded World-Models

2 Nov 2018 16:07 UTC
96 points
16 comments · 1 min read · LW link

Decision Theory

31 Oct 2018 18:41 UTC
120 points
45 comments · 1 min read · LW link

Research agenda: Supervising AIs improving AIs

29 Apr 2023 17:09 UTC
76 points
5 comments · 19 min read · LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper · 5 Dec 2023 16:48 UTC
123 points
30 comments · 13 min read · LW link

Sections 3 & 4: Credibility, Peaceful Bargaining Mechanisms

JesseClifton · 17 Dec 2019 21:46 UTC
20 points
2 comments · 12 min read · LW link

Sections 5 & 6: Contemporary Architectures, Humans in the Loop

JesseClifton · 20 Dec 2019 3:52 UTC
27 points
4 comments · 10 min read · LW link

Section 7: Foundations of Rational Agency

JesseClifton · 22 Dec 2019 2:05 UTC
14 points
4 comments · 8 min read · LW link

Acknowledgements & References

JesseClifton · 14 Dec 2019 7:04 UTC
6 points
0 comments · 14 min read · LW link

Alignment proposals and complexity classes

evhub · 16 Jul 2020 0:27 UTC
40 points
26 comments · 13 min read · LW link

Orthogonal’s Formal-Goal Alignment theory of change

Tamsin Leake · 5 May 2023 22:36 UTC
68 points
13 comments · 4 min read · LW link
(carado.moe)

The Goodhart Game

John_Maxwell · 18 Nov 2019 23:22 UTC
13 points
5 comments · 5 min read · LW link

[Linkpost] Interpretability Dreams

DanielFilan · 24 May 2023 21:08 UTC
39 points
2 comments · 2 min read · LW link
(transformer-circuits.pub)

My AI Alignment Research Agenda and Threat Model, right now (May 2023)

Nicholas / Heather Kross · 28 May 2023 3:23 UTC
25 points
0 comments · 6 min read · LW link
(www.thinkingmuchbetter.com)

Abstraction is Bigger than Natural Abstraction

Nicholas / Heather Kross · 31 May 2023 0:00 UTC
18 points
0 comments · 5 min read · LW link
(www.thinkingmuchbetter.com)

[Question] Does anyone’s full-time job include reading and understanding all the most-promising formal AI alignment work?

Nicholas / Heather Kross · 16 Jun 2023 2:24 UTC
15 points
2 comments · 1 min read · LW link

My research agenda in agent foundations

Alex_Altair · 28 Jun 2023 18:00 UTC
71 points
9 comments · 11 min read · LW link

My Alignment Timeline

Nicholas / Heather Kross · 3 Jul 2023 1:04 UTC
22 points
0 comments · 2 min read · LW link

My Central Alignment Priority (2 July 2023)

Nicholas / Heather Kross · 3 Jul 2023 1:46 UTC
12 points
1 comment · 3 min read · LW link

Immobile AI makes a move: anti-wireheading, ontology change, and model splintering

Stuart_Armstrong · 17 Sep 2021 15:24 UTC
32 points
3 comments · 2 min read · LW link

Testing The Natural Abstraction Hypothesis: Project Update

johnswentworth · 20 Sep 2021 3:44 UTC
87 points
17 comments · 8 min read · LW link · 1 review

AI, learn to be conservative, then learn to be less so: reducing side-effects, learning preserved features, and going beyond conservatism

Stuart_Armstrong · 20 Sep 2021 11:56 UTC
14 points
4 comments · 3 min read · LW link

The Plan

johnswentworth · 10 Dec 2021 23:41 UTC
254 points
78 comments · 14 min read · LW link · 1 review

Paradigm-building: Introduction

Cameron Berg · 8 Feb 2022 0:06 UTC
28 points
0 comments · 2 min read · LW link

Acceptability Verification: A Research Agenda

12 Jul 2022 20:11 UTC
50 points
0 comments · 1 min read · LW link
(docs.google.com)

(My understanding of) What Everyone in Technical Alignment is Doing and Why

29 Aug 2022 1:23 UTC
413 points
90 comments · 37 min read · LW link · 1 review

Distilled Representations Research Agenda

18 Oct 2022 20:59 UTC
15 points
2 comments · 8 min read · LW link

My AGI safety research—2022 review, ’23 plans

Steven Byrnes · 14 Dec 2022 15:15 UTC
51 points
10 comments · 7 min read · LW link

An overview of some promising work by junior alignment researchers

Akash · 26 Dec 2022 17:23 UTC
34 points
0 comments · 4 min read · LW link

World-Model Interpretability Is All We Need

Thane Ruthenis · 14 Jan 2023 19:37 UTC
35 points
22 comments · 21 min read · LW link

Selection Theorems: A Program For Understanding Agents

johnswentworth · 28 Sep 2021 5:03 UTC
123 points
28 comments · 6 min read · LW link · 2 reviews

Why I’m not working on {debate, RRM, ELK, natural abstractions}

Steven Byrnes · 10 Feb 2023 19:22 UTC
71 points
19 comments · 9 min read · LW link

Remarks 1–18 on GPT (compressed)

Cleo Nardo · 20 Mar 2023 22:27 UTC
148 points
35 comments · 31 min read · LW link

[Question] Research ideas (AI Interpretability & Neurosciences) for a 2-months project

flux · 8 Jan 2023 15:36 UTC
3 points
1 comment · 1 min read · LW link

Research Agenda in reverse: what *would* a solution look like?

Stuart_Armstrong · 25 Jun 2019 13:52 UTC
35 points
25 comments · 1 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper · 21 Feb 2023 16:59 UTC
14 points
4 comments · 3 min read · LW link

Forecasting AI Progress: A Research Agenda

10 Aug 2020 1:04 UTC
39 points
4 comments · 1 min read · LW link

Technical AGI safety research outside AI

Richard_Ngo · 18 Oct 2019 15:00 UTC
43 points
3 comments · 3 min read · LW link

Why I am not currently working on the AAMLS agenda

jessicata · 1 Jun 2017 17:57 UTC
28 points
3 comments · 5 min read · LW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King · 2 Jun 2023 21:54 UTC
7 points
4 comments · 16 min read · LW link

Gaia Network: An Illustrated Primer

18 Jan 2024 18:23 UTC
3 points
2 comments · 15 min read · LW link

EIS XI: Moving Forward

scasper · 22 Feb 2023 19:05 UTC
19 points
2 comments · 9 min read · LW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev · 19 Jun 2023 2:32 UTC
4 points
2 comments · 7 min read · LW link

The AI Control Problem in a wider intellectual context

philosophybear · 13 Jan 2023 0:28 UTC
11 points
3 comments · 12 min read · LW link

EIS XII: Summary

scasper · 23 Feb 2023 17:45 UTC
18 points
0 comments · 6 min read · LW link

Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarks · 17 Jun 2023 13:55 UTC
16 points
0 comments · 10 min read · LW link

[UPDATE: deadline extended to July 24!] New wind in rationality’s sails: Applications for Epistea Residency 2023 are now open

11 Jul 2023 11:02 UTC
80 points
7 comments · 3 min read · LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · 12 May 2022 20:01 UTC
58 points
0 comments · 59 min read · LW link

Research agenda: Formalizing abstractions of computations

Erik Jenner · 2 Feb 2023 4:29 UTC
92 points
10 comments · 31 min read · LW link

Which of these five AI alignment research projects ideas are no good?

rmoehn · 8 Aug 2019 7:17 UTC
25 points
13 comments · 1 min read · LW link

[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

Fernando Avalos · 9 Sep 2024 3:33 UTC
6 points
1 comment · 1 min read · LW link
(forum.effectivealtruism.org)

Funding Good Research

lukeprog · 27 May 2012 6:41 UTC
38 points
44 comments · 2 min read · LW link

The Löbian Obstacle, And Why You Should Care

lukemarks · 7 Sep 2023 23:59 UTC
18 points
6 comments · 2 min read · LW link

Please voice your support for stem cell research

zaph · 22 May 2009 18:45 UTC
−5 points
4 comments · 1 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
17 points
0 comments · 5 min read · LW link

Thoughts On (Solving) Deep Deception

Jozdien · 21 Oct 2023 22:40 UTC
69 points
2 comments · 6 min read · LW link

Notes on effective-altruism-related research, writing, testing fit, learning, and the EA Forum

MichaelA · 28 Mar 2021 23:43 UTC
14 points
0 comments · 4 min read · LW link

Labor Participation is a High-Priority AI Alignment Risk

alex · 17 Jun 2024 18:09 UTC
4 points
0 comments · 17 min read · LW link

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete Citrini · 16 Sep 2021 16:13 UTC
6 points
0 comments · 8 min read · LW link

Introducing Leap Labs, an AI interpretability startup

Jessica Rumbelow · 6 Mar 2023 16:16 UTC
103 points
12 comments · 1 min read · LW link

A multi-disciplinary view on AI safety research

Roman Leventov · 8 Feb 2023 16:50 UTC
43 points
4 comments · 26 min read · LW link

AI Safety in a World of Vulnerable Machine Learning Systems

8 Mar 2023 2:40 UTC
70 points
28 comments · 29 min read · LW link
(far.ai)

EIS IV: A Spotlight on Feature Attribution/Saliency

scasper · 15 Feb 2023 18:46 UTC
19 points
1 comment · 4 min read · LW link

AI learns betrayal and how to avoid it

Stuart_Armstrong · 30 Sep 2021 9:39 UTC
30 points
4 comments · 2 min read · LW link

A FLI postdoctoral grant application: AI alignment via causal analysis and design of agents

PabloAMC · 13 Nov 2021 1:44 UTC
4 points
0 comments · 7 min read · LW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt · 15 Dec 2021 19:06 UTC
16 points
15 comments · 27 min read · LW link

EIS II: What is “Interpretability”?

scasper · 9 Feb 2023 16:48 UTC
28 points
6 comments · 4 min read · LW link

An Open Philanthropy grant proposal: Causal representation learning of human preferences

PabloAMC · 11 Jan 2022 11:28 UTC
19 points
6 comments · 8 min read · LW link

[Question] What should I do? (long term plan about starting an AI lab)

not_a_cat · 9 Jun 2024 0:45 UTC
2 points
1 comment · 2 min read · LW link

Paradigm-building: The hierarchical question framework

Cameron Berg · 9 Feb 2022 16:47 UTC
11 points
15 comments · 3 min read · LW link

Question 1: Predicted architecture of AGI learning algorithm(s)

Cameron Berg · 10 Feb 2022 17:22 UTC
13 points
1 comment · 7 min read · LW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg · 11 Feb 2022 22:23 UTC
5 points
1 comment · 10 min read · LW link

Question 3: Control proposals for minimizing bad outcomes

Cameron Berg · 12 Feb 2022 19:13 UTC
5 points
1 comment · 7 min read · LW link

Question 5: The timeline hyperparameter

Cameron Berg · 14 Feb 2022 16:38 UTC
8 points
3 comments · 7 min read · LW link

Paradigm-building: Conclusion and practical takeaways

Cameron Berg · 15 Feb 2022 16:11 UTC
5 points
1 comment · 2 min read · LW link

EIS III: Broad Critiques of Interpretability Research

scasper · 14 Feb 2023 18:24 UTC
20 points
2 comments · 11 min read · LW link

Elicit: Language Models as Research Assistants

9 Apr 2022 14:56 UTC
71 points
6 comments · 13 min read · LW link

What should AI safety be trying to achieve?

EuanMcLean · 23 May 2024 11:17 UTC
16 points
0 comments · 13 min read · LW link

Towards White Box Deep Learning

Maciej Satkiewicz · 27 Mar 2024 18:20 UTC
17 points
5 comments · 1 min read · LW link
(arxiv.org)

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
59 points
8 comments · 20 min read · LW link

How I think about alignment

Linda Linsefors · 13 Aug 2022 10:01 UTC
31 points
11 comments · 5 min read · LW link

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

8 Aug 2023 1:30 UTC
312 points
29 comments · 18 min read · LW link · 1 review

EIS V: Blind Spots In AI Safety Interpretability Research

scasper · 16 Feb 2023 19:09 UTC
54 points
24 comments · 10 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
166 points
34 comments · 10 min read · LW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn · 8 Jun 2022 13:18 UTC
69 points
2 comments · 21 min read · LW link

[Question] How can we secure more research positions at our universities for x-risk researchers?

Neil Crawford · 6 Sep 2022 17:17 UTC
11 points
0 comments · 1 min read · LW link

Alignment Org Cheat Sheet

20 Sep 2022 17:36 UTC
70 points
8 comments · 4 min read · LW link

For alignment, we should simultaneously use multiple theories of cognition and value

Roman Leventov · 24 Apr 2023 10:37 UTC
23 points
5 comments · 5 min read · LW link

Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment

Marc Carauleanu · 3 Apr 2023 19:59 UTC
15 points
6 comments · 1 min read · LW link
(clipchamp.com)

Generative, Episodic Objectives for Safe AI

Michael Glass · 5 Oct 2022 23:18 UTC
11 points
3 comments · 8 min read · LW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn · 18 Oct 2022 14:54 UTC
36 points
7 comments · 4 min read · LW link

EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety

scasper · 17 Feb 2023 20:48 UTC
49 points
9 comments · 12 min read · LW link

AI researchers announce NeuroAI agenda

Cameron Berg · 24 Oct 2022 0:14 UTC
37 points
12 comments · 6 min read · LW link
(arxiv.org)

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

27 Oct 2022 1:32 UTC
135 points
14 comments · 12 min read · LW link

AI Existential Safety Fellowships

mmfli · 28 Oct 2023 18:07 UTC
5 points
0 comments · 1 min read · LW link

Trying to understand John Wentworth’s research agenda

20 Oct 2023 0:05 UTC
92 points
13 comments · 12 min read · LW link

Agency overhang as a proxy for Sharp left turn

7 Nov 2024 12:14 UTC
5 points
0 comments · 5 min read · LW link

AISC project: TinyEvals

Jett Janiak · 22 Nov 2023 20:47 UTC
22 points
0 comments · 4 min read · LW link

All life’s helpers’ beliefs

Tehdastehdas · 28 Oct 2022 5:47 UTC
−12 points
1 comment · 5 min read · LW link

AISC 2024 - Project Summaries

NickyP · 27 Nov 2023 22:32 UTC
48 points
3 comments · 18 min read · LW link

NAO Updates, Fall 2024

jefftk · 18 Oct 2024 0:00 UTC
32 points
2 comments · 1 min read · LW link
(naobservatory.org)

Reinforcement Learning using Layered Morphology (RLLM)

MiguelDev · 1 Dec 2023 5:18 UTC
7 points
0 comments · 29 min read · LW link

A call for a quantitative report card for AI bioterrorism threat models

Juno · 4 Dec 2023 6:35 UTC
12 points
0 comments · 10 min read · LW link

What’s new at FAR AI

4 Dec 2023 21:18 UTC
41 points
0 comments · 5 min read · LW link
(far.ai)

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI

WillPetillo · 4 Dec 2023 22:58 UTC
37 points
0 comments · 35 min read · LW link

My summary of “Pragmatic AI Safety”

Eleni Angelou · 5 Nov 2022 12:54 UTC
3 points
0 comments · 5 min read · LW link

Why Academia is Mostly Not Truth-Seeking

Zero Contradictions · 16 Oct 2024 19:14 UTC
−6 points
6 comments · 1 min read · LW link
(thewaywardaxolotl.blogspot.com)

Natural abstractions are observer-dependent: a conversation with John Wentworth

Martín Soto · 12 Feb 2024 17:28 UTC
39 points
13 comments · 7 min read · LW link

EIS VII: A Challenge for Mechanists

scasper · 18 Feb 2023 18:27 UTC
36 points
4 comments · 3 min read · LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper · 19 Feb 2023 15:25 UTC
30 points
5 comments · 4 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
30 points
8 comments · 8 min read · LW link

Resources for AI Alignment Cartography

Gyrodiot · 4 Apr 2020 14:20 UTC
45 points
8 comments · 9 min read · LW link

Introducing the Longevity Research Institute

sarahconstantin · 8 May 2018 3:30 UTC
54 points
20 comments · 1 min read · LW link
(srconstantin.wordpress.com)

Announcement: AI alignment prize round 3 winners and next round

cousin_it · 15 Jul 2018 7:40 UTC
93 points
7 comments · 1 min read · LW link

Machine Learning Projects on IDA

24 Jun 2019 18:38 UTC
49 points
3 comments · 2 min read · LW link

AI Alignment Research Overview (by Jacob Steinhardt)

Ben Pace · 6 Nov 2019 19:24 UTC
44 points
0 comments · 7 min read · LW link
(docs.google.com)

Creating Welfare Biology: A Research Proposal

ozymandias · 16 Nov 2017 19:06 UTC
20 points
5 comments · 4 min read · LW link

Announcing: The Independent AI Safety Registry

Shoshannah Tekofsky · 26 Dec 2022 21:22 UTC
53 points
9 comments · 1 min read · LW link

Annotated reply to Bengio’s “AI Scientists: Safe and Useful AI?”

Roman Leventov · 8 May 2023 21:26 UTC
18 points
2 comments · 7 min read · LW link
(yoshuabengio.org)

H-JEPA might be technically alignable in a modified form

Roman Leventov · 8 May 2023 23:04 UTC
12 points
2 comments · 7 min read · LW link

Roadmap for a collaborative prototype of an Open Agency Architecture

Deger Turan · 10 May 2023 17:41 UTC
31 points
0 comments · 12 min read · LW link

Notes on the importance and implementation of safety-first cognitive architectures for AI

Brendon_Wong · 11 May 2023 10:03 UTC
3 points
0 comments · 3 min read · LW link