My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people that intrinsically value hurting other individuals. For Yudkowsky’s description of the issue you can search the CEV arbital page for ADDED 2023.
The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think that this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think that it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is however very neglected. There does not exist a single research project dedicated to ATA. In other words: the reason that I am doing ATA is that it is a tractable and neglected way of reducing risks.
I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don’t hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can for example PM me here, or PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address. It’s a Gavagai / Word and Object joke from my grad student days)
My background is physics as an undergrad and then AI research. Links to some papers: P1 P2 P3 P4 P5 P6 P7 P8. (no connection to any form of deep learning)
I was clearly wrong regarding how you feel about your cells. But surely the question of whether or not an AI that is implementing the CEV of Steve, would result in any surviving cells, is an empirical question? (which must settled by referring to facts about Steve. And trying to figure out what these facts mean in terms of how the CEV of Steve would treat his cells). It cannot possibly be the case that it is impossible, by definition, to discover that any reasonable way of extrapolating Steve would result in all his cells dying?
Thank you for engaging on this. Reading your description of how you view your own cells was a very informative window, into how a human mind can work. (I find it entirely possible, that I am very wrong regarding how most people view their cells. Or about how they would view their cells upon reflection. I will probably not try to introspect, regarding how I feel about my own cells, while this exchange is still fresh)
Zooming out a bit, and looking at this entire conversation, I notice that I am very confused. I will try to take a step back from LW and gain some perspective, before I return to this debate.