
Joar Skalse

Karma: 735

My name is pronounced “YOO-ar SKULL-se” (the “e” is not silent). I’m a PhD student at Oxford University, and I was a member of the Future of Humanity Institute before it shut down. I have worked in several different areas of AI safety research. For a few highlights, see:

  1. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

  2. Misspecification in Inverse Reinforcement Learning

  3. STARC: A General Framework For Quantifying Differences Between Reward Functions

  4. Risks from Learned Optimization in Advanced Machine Learning Systems

  5. Is SGD a Bayesian sampler? Well, almost

Some of my recent research on the theoretical foundations of reward learning is also described in this sequence.

For a full list of all my research, see my Google Scholar.

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · Feb 28, 2025, 7:27 PM
16 points
0 comments · 21 min read

Other Papers About the Theory of Reward Learning

Joar Skalse · Feb 28, 2025, 7:26 PM
16 points
0 comments · 5 min read

Defining and Characterising Reward Hacking

Joar Skalse · Feb 28, 2025, 7:25 PM
15 points
0 comments · 4 min read

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · Feb 28, 2025, 7:24 PM
9 points
0 comments · 7 min read

STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse · Feb 28, 2025, 7:24 PM
11 points
0 comments · 8 min read

Misspecification in Inverse Reinforcement Learning

Joar Skalse · Feb 28, 2025, 7:24 PM
19 points
0 comments · 11 min read

Partial Identifiability in Reward Learning

Joar Skalse · Feb 28, 2025, 7:23 PM
15 points
0 comments · 12 min read

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · Feb 28, 2025, 7:20 PM
25 points
4 comments · 14 min read

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Joar Skalse · May 17, 2024, 7:13 PM
67 points
10 comments · 2 min read

My Criticism of Singular Learning Theory

Joar Skalse · Nov 19, 2023, 3:19 PM
83 points
56 comments · 12 min read

Goodhart’s Law in Reinforcement Learning

Oct 16, 2023, 12:54 AM
126 points
22 comments · 7 min read

VC Theory Overview

Joar Skalse · Jul 2, 2023, 10:45 PM
12 points
2 comments · 11 min read

How Smart Are Humans?

Joar Skalse · Jul 2, 2023, 3:46 PM
10 points
19 comments · 2 min read

Using (Uninterpretable) LLMs to Generate Interpretable AI Code

Joar Skalse · Jul 2, 2023, 1:01 AM
13 points
12 comments · 3 min read

Some Arguments Against Strong Scaling

Joar Skalse · Jan 13, 2023, 12:04 PM
25 points
21 comments · 16 min read

What kinds of algorithms do multi-human imitators learn?

May 22, 2022, 2:27 PM
20 points
0 comments · 3 min read

Updating Utility Functions

May 9, 2022, 9:44 AM
41 points
6 comments · 8 min read

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Joar Skalse · Dec 29, 2020, 1:33 PM
75 points
58 comments · 1 min read · 1 review

Two senses of “optimizer”

Joar Skalse · Aug 21, 2019, 4:02 PM
35 points
41 comments · 3 min read

Risks from Learned Optimization: Conclusion and Related Work

Jun 7, 2019, 7:53 PM
82 points
5 comments · 6 min read