Hey, thanks for writing all of that. My current goal is to do an up-to-date literature review on corrigibility, so that was a most helpful comment. I’ll definitely look over your blog, since some of these papers are quite dense. Out of the papers you recommended, is there one that stands out? Bear in mind that I’ve read Stewart and MIRI’s papers already.
Thanks, you are welcome!
Dutch custom prevents me from recommending my own recent paper in any case, so if I had to recommend one paper from the 2015-2020 time frame that you probably have not read yet, I’d recommend ‘Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective’. This stands out as an overview of different approaches, and I think you can get a good feel for the state of the field from it even if you do not try to decode all the math.
Note that there are definitely some worthwhile corrigibility-related topics that are discussed only or mainly in blog posts and in LW comment threads, but not in any of the papers I mention above or in my mid-2019 related work section. For example, there is the open question of whether Christiano’s Iterated Amplification approach will produce a kind of corrigibility as an emergent property of the system, and if so, what kind, and whether it is the kind we want. I have not seen any discussion of this in the ‘formal literature’, if we define the formal literature as conference/arXiv papers, but there is a lot of discussion of it in blog posts and comment threads.
“Dutch custom prevents me from recommending my own recent paper in any case”
This phrase and its implications are perfect examples of problems in corrigibility. Was that intentional? If so, bravo. Your paper looks interesting, but I think I’ll read the blog post first. I want a break from reading heavy papers. I wonder if the researchers would be OK with my drawing on their blog posts in the review. Would you mind?
Thanks for recommending “Reward tampering”, it is much appreciated. I’ll get on it after synthesising what I’ve read so far. Otherwise, I don’t think I’ll learn much.
Nope, not intentional.
You should feel free to write a literature overview that cites or draws heavily on paper-announcement blog posts. I definitely won’t mind. In general, the blog posts tend to use language that is less mathematical and more targeted at a non-specialist audience. So if you aim to write a literature overview that is as readable as possible for a general audience, then drawing on phrases from the authors’ blog posts describing the papers (when such posts are available) may be your best bet.