
Jeremy Gillen

Karma: 1,017

I do alignment research, mostly work that is vaguely agent foundations. Formerly on Vivek's team at MIRI. Most of my writing before mid-2023 is not representative of my current views about alignment difficulty.

Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI

26 Jan 2024 7:22 UTC
160 points
60 comments · 57 min read · LW link

Thomas Kwa's MIRI research experience

2 Oct 2023 16:42 UTC
169 points
52 comments · 1 min read · LW link

AISC team report: Soft-optimization, Bayes and Goodhart

27 Jun 2023 6:05 UTC
37 points
2 comments · 15 min read · LW link

Soft optimization makes the value target bigger

Jeremy Gillen · 2 Jan 2023 16:06 UTC
117 points
20 comments · 12 min read · LW link

Jeremy Gillen’s Shortform

Jeremy Gillen · 19 Oct 2022 16:14 UTC
4 points
10 comments · 1 min read · LW link

Neural Tangent Kernel Distillation

5 Oct 2022 18:11 UTC
76 points
20 comments · 8 min read · LW link

Inner Alignment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments · 4 min read · LW link

Finding Goals in the World Model

22 Aug 2022 18:06 UTC
59 points
8 comments · 13 min read · LW link

The Core of the Alignment Problem is...

17 Aug 2022 20:07 UTC
74 points
10 comments · 9 min read · LW link

Project proposal: Testing the IBP definition of agent

9 Aug 2022 1:09 UTC
21 points
4 comments · 2 min read · LW link

Broad Basins and Data Compression

8 Aug 2022 20:33 UTC
33 points
6 comments · 7 min read · LW link

Translating between Latent Spaces

30 Jul 2022 3:25 UTC
27 points
2 comments · 8 min read · LW link

Explaining inner alignment to myself

Jeremy Gillen · 24 May 2022 23:10 UTC
9 points
2 comments · 10 min read · LW link

Goodhart's Law Causal Diagrams

11 Apr 2022 13:52 UTC
32 points
5 comments · 6 min read · LW link