
Joar Skalse

Karma: 740

My name is pronounced “YOO-ar SKULL-se” (the “e” is not silent). I’m a PhD student at Oxford University, and I was a member of the Future of Humanity Institute before it shut down. I have worked in several different areas of AI safety research. For a few highlights, see:

  1. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

  2. Misspecification in Inverse Reinforcement Learning

  3. STARC: A General Framework For Quantifying Differences Between Reward Functions

  4. Risks from Learned Optimization in Advanced Machine Learning Systems

  5. Is SGD a Bayesian sampler? Well, almost

Some of my recent research on the theoretical foundations of reward learning is also described in this sequence.

For a full list of all my research, see my Google Scholar.

Deceptive Alignment

Jun 5, 2019, 8:16 PM
118 points
20 comments, 17 min read, LW link

The Inner Alignment Problem

Jun 4, 2019, 1:20 AM
105 points
17 comments, 13 min read, LW link

Conditions for Mesa-Optimization

Jun 1, 2019, 8:52 PM
84 points
48 comments, 12 min read, LW link

Risks from Learned Optimization: Introduction

May 31, 2019, 11:44 PM
187 points
42 comments, 12 min read, LW link, 3 reviews

Two agents can have the same source code and optimise different utility functions

Jul 10, 2018, 9:51 PM
11 points
11 comments, 1 min read, LW link