
SERI MATS

Last edit: 28 Jul 2022 15:55 UTC by Multicore

The Stanford Existential Risks Initiative ML Alignment Theory Scholars program.

https://www.serimats.org/

SERI MATS Program—Winter 2022 Cohort

8 Oct 2022 19:09 UTC
71 points
12 comments · 4 min read · LW link

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
650 points
199 comments · 12 min read · LW link

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
302 points
23 comments · 23 min read · LW link

Project proposal: Testing the IBP definition of agent

9 Aug 2022 1:09 UTC
21 points
4 comments · 2 min read · LW link

Soft optimization makes the value target bigger

Jeremy Gillen · 2 Jan 2023 16:06 UTC
106 points
20 comments · 12 min read · LW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker · 26 Aug 2022 18:26 UTC
119 points
48 comments · 1 min read · LW link

Normative vs Descriptive Models of Agency

mattmacdermott · 2 Feb 2023 20:28 UTC
26 points
5 comments · 4 min read · LW link

Predictions for shard theory mechanistic interpretability results

1 Mar 2023 5:16 UTC
104 points
9 comments · 5 min read · LW link

Neural Tangent Kernel Distillation

5 Oct 2022 18:11 UTC
74 points
20 comments · 8 min read · LW link

Clarifying mesa-optimization

21 Mar 2023 15:53 UTC
36 points
6 comments · 10 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell · 16 Jan 2023 20:48 UTC
52 points
3 comments · 14 min read · LW link

More findings on Memorization and double descent

Marius Hobbhahn · 1 Feb 2023 18:26 UTC
50 points
2 comments · 19 min read · LW link

More findings on maximal data dimension

Marius Hobbhahn · 2 Feb 2023 18:33 UTC
26 points
1 comment · 11 min read · LW link

What sorts of systems can be deceptive?

Andrei Alexandru · 31 Oct 2022 22:00 UTC
15 points
0 comments · 7 min read · LW link

Behavioural statistics for a maze-solving agent

20 Apr 2023 22:26 UTC
44 points
11 comments · 10 min read · LW link

Auditing games for high-level interpretability

Paul Colognese · 1 Nov 2022 10:44 UTC
29 points
1 comment · 7 min read · LW link

[ASoT] Policy Trajectory Visualization

Ulisse Mini · 7 Feb 2023 0:13 UTC
9 points
2 comments · 1 min read · LW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
48 points
15 comments · 8 min read · LW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica Rumbelow · 17 Nov 2022 11:06 UTC
27 points
2 comments · 2 min read · LW link

Qualities that alignment mentors value in junior researchers

Akash · 14 Feb 2023 23:27 UTC
82 points
13 comments · 3 min read · LW link

A distillation of Evan Hubinger’s training stories (for SERI MATS)

Daphne_W · 18 Jul 2022 3:38 UTC
15 points
1 comment · 10 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
53 points
8 comments · 20 min read · LW link

[ASoT] Reflectivity in Narrow AI

Ulisse Mini · 21 Nov 2022 0:51 UTC
6 points
1 comment · 1 min read · LW link

Intervening in the Residual Stream

MadHatter · 22 Feb 2023 6:29 UTC
30 points
1 comment · 9 min read · LW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments · 16 min read · LW link

Swap and Scale

Stephen Fowler · 9 Sep 2022 22:41 UTC
17 points
3 comments · 1 min read · LW link

Broad Basins and Data Compression

8 Aug 2022 20:33 UTC
33 points
6 comments · 7 min read · LW link

Can We Align a Self-Improving AGI?

Peter S. Park · 30 Aug 2022 0:14 UTC
8 points
5 comments · 11 min read · LW link

Information theoretic model analysis may not lend much insight, but we may have been doing them wrong!

Garrett Baker · 24 Jul 2022 0:42 UTC
7 points
0 comments · 10 min read · LW link

Behaviour Manifolds and the Hessian of the Total Loss—Notes and Criticism

Spencer Becker-Kahn · 3 Sep 2022 0:15 UTC
35 points
5 comments · 6 min read · LW link

Framing AI Childhoods

David Udell · 6 Sep 2022 23:40 UTC
37 points
8 comments · 4 min read · LW link

Searching for Modularity in Large Language Models

8 Sep 2022 2:25 UTC
44 points
3 comments · 14 min read · LW link

My SERI MATS Application

Daniel Paleka · 30 May 2022 2:04 UTC
16 points
0 comments · 8 min read · LW link

Trying to find the underlying structure of computational systems

Matthias G. Mayer · 13 Sep 2022 21:16 UTC
17 points
9 comments · 4 min read · LW link

Theoretical Neuroscience For Alignment Theory

Cameron Berg · 7 Dec 2021 21:50 UTC
62 points
18 comments · 23 min read · LW link

The Natural Abstraction Hypothesis: Implications and Evidence

TheMcDouglas · 14 Dec 2021 23:14 UTC
33 points
8 comments · 19 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut · 16 Dec 2021 1:07 UTC
16 points
0 comments · 42 min read · LW link

Understanding and controlling auto-induced distributional shift

LRudL · 13 Dec 2021 14:59 UTC
26 points
3 comments · 16 min read · LW link

Why I’m Working On Model Agnostic Interpretability

Jessica Rumbelow · 11 Nov 2022 9:24 UTC
28 points
9 comments · 2 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
44 points
0 comments · 3 min read · LW link

Guardian AI (Misaligned systems are all around us.)

Jessica Rumbelow · 25 Nov 2022 15:55 UTC
15 points
6 comments · 2 min read · LW link

Finite Factored Sets in Pictures

Magdalena Wache · 11 Dec 2022 18:49 UTC
174 points
35 comments · 12 min read · LW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs · 5 Dec 2022 13:36 UTC
19 points
10 comments · 2 min read · LW link

Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities

jacquesthibs · 5 Dec 2022 16:09 UTC
16 points
4 comments · 8 min read · LW link

Working towards AI alignment is better

Johannes C. Mayer · 9 Dec 2022 15:39 UTC
8 points
2 comments · 2 min read · LW link

Proper scoring rules don’t guarantee predicting fixed points

16 Dec 2022 18:22 UTC
58 points
8 comments · 21 min read · LW link

Content and Takeaways from SERI MATS Training Program with John Wentworth

RohanS · 24 Dec 2022 4:17 UTC
24 points
3 comments · 12 min read · LW link

Getting up to Speed on the Speed Prior in 2022

robertzk · 28 Dec 2022 7:49 UTC
33 points
5 comments · 65 min read · LW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs · 30 Dec 2022 2:40 UTC
88 points
1 comment · 18 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
77 points
2 comments · 19 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · 15 Dec 2022 20:11 UTC
24 points
5 comments · 1 min read · LW link

My Advice for Incoming SERI MATS Scholars

Johannes C. Mayer · 3 Jan 2023 19:25 UTC
45 points
1 comment · 4 min read · LW link

Some Notes on the mathematics of Toy Autoencoding Problems

Spencer Becker-Kahn · 22 Dec 2022 17:21 UTC
14 points
0 comments · 12 min read · LW link

The Alignment Problems

Martín Soto · 12 Jan 2023 22:29 UTC
19 points
0 comments · 4 min read · LW link

Disentangling Shard Theory into Atomic Claims

Leon Lang · 13 Jan 2023 4:23 UTC
82 points
6 comments · 18 min read · LW link

Neural networks generalize because of this one weird trick

Jesse Hoogland · 18 Jan 2023 0:10 UTC
130 points
25 comments · 15 min read · LW link
(www.jessehoogland.com)

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

25 Jan 2023 19:03 UTC
43 points
6 comments · 12 min read · LW link

Stop-gradients lead to fixed point predictions

28 Jan 2023 22:47 UTC
36 points
2 comments · 24 min read · LW link

Spooky action at a distance in the loss landscape

28 Jan 2023 0:22 UTC
61 points
4 comments · 3 min read · LW link
(www.jessehoogland.com)

Using PICT against PastaGPT Jailbreaking

Quentin FEUILLADE--MONTIXI · 9 Feb 2023 4:30 UTC
15 points
0 comments · 9 min read · LW link

Gradient surfing: the hidden role of regularization

Jesse Hoogland · 6 Feb 2023 3:50 UTC
30 points
6 comments · 5 min read · LW link
(www.jessehoogland.com)

SolidGoldMagikarp II: technical details and more recent findings

6 Feb 2023 19:09 UTC
103 points
44 comments · 13 min read · LW link

SERI ML Alignment Theory Scholars Program 2022

27 Apr 2022 0:43 UTC
60 points
6 comments · 3 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
91 points
6 comments · 21 min read · LW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland · 22 Feb 2023 4:16 UTC
23 points
11 comments · 3 min read · LW link
(www.jessehoogland.com)

A Neural Network undergoing Gradient-based Training as a Complex System

Spencer Becker-Kahn · 19 Feb 2023 22:08 UTC
18 points
1 comment · 19 min read · LW link

Searching for a model’s concepts by their shape – a theoretical framework

23 Feb 2023 20:14 UTC
37 points
0 comments · 19 min read · LW link

Why are counterfactuals elusive?

Martín Soto · 3 Mar 2023 20:13 UTC
18 points
6 comments · 2 min read · LW link

A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2

MadHatter · 26 Feb 2023 1:10 UTC
61 points
14 comments · 6 min read · LW link

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

16 Mar 2023 16:38 UTC
46 points
0 comments · 13 min read · LW link

Natural Abstractions: Key claims, Theorems, and Critiques

16 Mar 2023 16:37 UTC
185 points
14 comments · 45 min read · LW link

Empirical risk minimization is fundamentally confused

Jesse Hoogland · 22 Mar 2023 16:58 UTC
30 points
5 comments · 10 min read · LW link

Approximation is expensive, but the lunch is cheap

19 Apr 2023 14:19 UTC
61 points
2 comments · 9 min read · LW link

Fixed points in mortal population games

ViktoriaMalyasova · 14 Mar 2023 7:10 UTC
22 points
0 comments · 12 min read · LW link
(www.lesswrong.com)

SERI MATS—Summer 2023 Cohort

8 Apr 2023 15:32 UTC
68 points
25 comments · 4 min read · LW link

A mostly critical review of infra-Bayesianism

matolcsid · 28 Feb 2023 18:37 UTC
93 points
7 comments · 29 min read · LW link

Performance guarantees in classical learning theory and infra-Bayesianism

matolcsid · 28 Feb 2023 18:37 UTC
9 points
4 comments · 31 min read · LW link

Non-Unitary Quantum Logic—SERI MATS Research Sprint

Yegreg · 16 Feb 2023 19:31 UTC
26 points
0 comments · 7 min read · LW link

An open letter to SERI MATS program organisers

Roman Leventov · 20 Apr 2023 16:34 UTC
15 points
26 comments · 4 min read · LW link

Research agenda: Supervising AIs improving AIs

29 Apr 2023 17:09 UTC
60 points
4 comments · 19 min read · LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
29 points
5 comments · 2 min read · LW link
(arxiv.org)

How MATS addresses “mass movement building” concerns

Ryan Kidd · 4 May 2023 0:55 UTC
54 points
9 comments · 3 min read · LW link

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

Oliver Sourbut · 9 May 2022 21:38 UTC
60 points
12 comments · 10 min read · LW link

Some real examples of gradient hacking

Oliver Sourbut · 22 Nov 2021 0:11 UTC
15 points
8 comments · 2 min read · LW link

Some Summaries of Agent Foundations Work

mattmacdermott · 15 May 2023 16:09 UTC
42 points
1 comment · 13 min read · LW link

Modelling Deception

Garrett Baker · 18 Jul 2022 21:21 UTC
15 points
0 comments · 7 min read · LW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson · 19 Jul 2022 6:57 UTC
15 points
4 comments · 16 min read · LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson · 19 Jul 2022 6:56 UTC
11 points
4 comments · 18 min read · LW link

How (not) to choose a research project

9 Aug 2022 0:26 UTC
76 points
11 comments · 7 min read · LW link

Team Shard Status Report

David Udell · 9 Aug 2022 5:33 UTC
38 points
8 comments · 3 min read · LW link

Finding Skeletons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments · 7 min read · LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera · 3 Aug 2022 12:03 UTC
117 points
23 comments · 6 min read · LW link

Translating between Latent Spaces

30 Jul 2022 3:25 UTC
27 points
1 comment · 8 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
150 points
34 comments · 10 min read · LW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

10 Aug 2022 18:14 UTC
26 points
30 comments · 11 min read · LW link

Identification of Natural Modularity

Stephen Fowler · 25 Jun 2022 15:05 UTC
15 points
3 comments · 7 min read · LW link

How transparency changed over time

ViktoriaMalyasova · 30 Jul 2022 4:36 UTC
21 points
0 comments · 6 min read · LW link

How Interpretability can be Impactful

Connall Garrod · 18 Jul 2022 0:06 UTC
18 points
0 comments · 37 min read · LW link

Why you might expect homogeneous take-off: evidence from ML research

Andrei Alexandru · 17 Jul 2022 20:31 UTC
24 points
0 comments · 10 min read · LW link

Training goals for large language models

Johannes Treutlein · 18 Jul 2022 7:09 UTC
28 points
5 comments · 19 min read · LW link

Notes on Learning the Prior

Spencer Becker-Kahn · 15 Jul 2022 17:28 UTC
22 points
2 comments · 25 min read · LW link

Deception?! I ain’t got time for that!

Paul Colognese · 18 Jul 2022 0:06 UTC
50 points
5 comments · 13 min read · LW link

A Toy Model of Gradient Hacking

Oam Patel · 20 Jun 2022 22:01 UTC
30 points
7 comments · 4 min read · LW link

Information Loss --> Basin flatness

Vivek Hebbar · 21 May 2022 12:58 UTC
50 points
28 comments · 7 min read · LW link

[Short version] Information Loss --> Basin flatness

Vivek Hebbar · 21 May 2022 12:59 UTC
11 points
0 comments · 1 min read · LW link

How complex are myopic imitators?

Vivek Hebbar · 8 Feb 2022 12:00 UTC
26 points
1 comment · 15 min read · LW link

Finding Goals in the World Model

22 Aug 2022 18:06 UTC
55 points
8 comments · 13 min read · LW link

The Shard Theory Alignment Scheme

David Udell · 25 Aug 2022 4:52 UTC
47 points
32 comments · 2 min read · LW link

The Core of the Alignment Problem is...

17 Aug 2022 20:07 UTC
69 points
10 comments · 9 min read · LW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. Hudson · 17 Aug 2022 3:56 UTC
6 points
2 comments · 4 min read · LW link

A brief note on Simplicity Bias

Spencer Becker-Kahn · 14 Aug 2022 2:05 UTC
17 points
0 comments · 4 min read · LW link

Limits of Asking ELK if Models are Deceptive

Oam Patel · 15 Aug 2022 20:44 UTC
6 points
2 comments · 4 min read · LW link

Inner Alignment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments · 4 min read · LW link