I was a physics undergrad, but I did an AI PhD on artificial learners adopting normative conventions from human teachers. This work included an attempt to formalise the learning situation where an artificial learner tries to figure out what one human teacher would like that learner to do. Later, I also worked on interpreting human feedback. Here is a review article that situates these two strands of research in a larger context. The review article also includes a discussion of a 2016 book by Michael Tomasello: A Natural History of Human Morality. I think this is relevant background when one is analysing certain aspects of human morality. A view of human morality that is genuinely free from unexamined implicit assumptions of non-naturalness is useful when reading some of the points that I make here on LW. Refs 32 and 46, on agents noticing model misspecification and on agents interpreting an off-switch attempt as an information source, are relevant background for some other topics.
My current research focus is on analysing alignment targets. This is what I post about on LW. I don’t see this as a purely academic curiosity, but instead see it as an issue with important real-world implications.
I think that AI is genuinely dangerous. The specific AI danger that I am trying to mitigate is the scenario where someone successfully hits an alignment target, resulting in an outcome that is far, far worse than extinction. I think that this danger, from successfully hitting the wrong alignment target, is severely neglected. I think that this neglect is a genuine problem, and I hope to do something about it by posting on LW. In other words: since the type of danger that I focus on is distinct from dangers coming from aiming failures, dealing with it requires dedicated effort (mitigating the type of danger that I focus on requires a specific type of insight; such insights are not useful for dealing with aiming failures, and are therefore unlikely to be found when investigating dangers related to aiming failures).
In yet other words: the danger that I am focused on is importantly distinct from other types of AI danger, and it requires a dedicated research focus on the issues that are relevant to this specific danger. My research might of course be rendered pointless by someone accidentally creating an AI that has no intrinsic interest in humans at all. But I do not think that this is the only outcome worth thinking about. I happen to think that most academic AI researchers, the heads of leading tech companies, the general public, etc., underestimate the probability that a misaligned, uncaring AI will kill everyone. If someone manages to trigger an Intelligence Explosion using current methods, then I don’t expect them to hit the alignment target that they are aiming for. But I don’t think that this is the only plausible path to a powerful AI. It may be the default path, but many things are uncertain, including the actions of people (for example: Covid illustrated that the set of politically realistic policies can change dramatically and quickly, and the recent AI debate has illustrated that the set of things taken seriously in public debate can also change dramatically and quickly; these are just two of the many sources of uncertainty that I see). In other words: I simply do not think that it is possible to confidently rule out the possibility that an alignment target will, eventually, be successfully hit. I further think that essentially everyone dramatically underestimates the dangers associated with successfully hitting the wrong alignment target. My proposed way of reducing this danger is to analyse alignment targets. My specific focus is on trying to find features that are necessary (but obviously not sufficient) for an alignment target to be safe for a human individual. Finding such a feature can stop a future AI project (possibly aiming at a currently unknown alignment target) at the idea stage. In scenarios where the results of this type of work have a chance to impact the outcome, I expect that there will be time to pursue this research. However, I have no particular reason to think that there will be enough time. If you are also interested in this type of work, then please feel free to send me an email.
Email: thomascederborgsemail at gmail dot com
Thank you for the clarification. This proposal is indeed importantly different from the PCEV proposal. But since some people see hurting heretics as a moral imperative, any AI that allows heretics to escape punishment will also be seen as unacceptable by at least some people. This means that the set of Pareto improvements is empty.
In other words: hurting heretics is indeed off the table in your proposal (which is an important difference compared to PCEV). However, any scenario that includes the existence of an AI that allows heretics to escape punishment is also off the table. The existence of such an AI would be seen as intrinsically bad by people who see hurting heretics as a moral imperative (for example: Gregg really does not want a world where Gregg has agreed to tolerate the existence of an unethical AI that disregards its moral duty to punish heretics). More generally: anything that improves the lives of heretics is off the table. If an outcome improves the lives of heretics (compared to the no-AI baseline), then this outcome is not a Pareto improvement, because improving the lives of heretics makes things worse from the point of view of those who are deeply committed to hurting heretics.
In yet other words: it only takes two individuals to rule out any outcome that contains any improvement for any person. Gregg and Jeff are both deeply committed to hurting heretics, but their definitions of "heretic" differ, and every individual is seen as a heretic by at least one of them. So any outcome that makes life better for any person is off the table. Gregg and Jeff do have to be very committed to the moral position that the existence of any AI that neglects its duty to punish heretics is unacceptable. It must, for example, be impossible to get them to agree to tolerate the existence of such an AI in exchange for increased influence over the far future. But a population of billions only has to contain two such people for the set of Pareto improvements to be empty.
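To make the structure of this argument concrete, here is a minimal sketch in Python (the names, population size, and utility numbers are purely hypothetical illustrations, not taken from any specific proposal). It encodes Gregg's and Jeff's preferences as welfare changes that become negative whenever one of "their" heretics is better off than in the no-AI baseline, and then checks whether a candidate outcome is a Pareto improvement over that baseline.

```python
# Purely illustrative sketch with hypothetical names and numbers.
# Welfare changes are measured relative to the no-AI baseline (all zeros).
# Gregg and Jeff each classify a different half of the population as heretics;
# together, every individual is a heretic to at least one of them.

population = ["P1", "P2", "P3", "P4"]
greggs_heretics = {"P1", "P2"}
jeffs_heretics = {"P3", "P4"}

def gregg_delta(outcome):
    # Gregg is made worse off whenever one of his heretics gains anything.
    return -sum(max(outcome[p], 0.0) for p in greggs_heretics)

def jeff_delta(outcome):
    # Jeff is made worse off whenever one of his heretics gains anything.
    return -sum(max(outcome[p], 0.0) for p in jeffs_heretics)

def is_pareto_improvement(outcome):
    """Pareto improvement over the baseline: no one (including Gregg and Jeff)
    is worse off, and at least one person is strictly better off."""
    deltas = list(outcome.values()) + [gregg_delta(outcome), jeff_delta(outcome)]
    return all(d >= 0 for d in deltas) and any(d > 0 for d in deltas)

# Helping anyone makes Gregg or Jeff worse off; helping no one is not an improvement.
print(is_pareto_improvement({"P1": 1.0, "P2": 0.0, "P3": 0.0, "P4": 0.0}))  # False
print(is_pareto_improvement({"P1": 0.0, "P2": 0.0, "P3": 0.0, "P4": 0.0}))  # False
```

Because every individual is covered by one of the two heretic sets, any outcome that helps anyone fails the check, and an outcome that helps no one is not an improvement at all, so the set of Pareto improvements is empty in this toy model.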
Another way to approach this would be to ask: what would have happened if someone had successfully implemented a Gatekeeper AI built on top of a set of definitions such that the set of Pareto improvements is empty?
For the version of the random dictator negotiation baseline that you describe, this comment might actually be more relevant than the PCEV thought experiment. It is a comment on the suggestion by Andrew Critch that it might be possible to view a Boundaries / Membranes based BATNA as having been agreed to acausally. It is impossible to reach such an acausal agreement when a group includes people like Gregg and Jeff, for the same reason that it is impossible to find an outcome that is a Pareto improvement when a group includes people like Gregg and Jeff. (That comment also discusses ideas for how one might deal with the dangers that arise when one combines people like Gregg and Jeff with a powerful and clever AI.)
Another way to look at this would be to consider what it would mean to find a Pareto improvement with respect to only Bob and Dave. Bob wants to hurt heretics, and Bob considers half of all people to be heretics. Dave is an altruist who just wants people to have as good a life as possible. The set of Pareto improvements would now consist entirely of variations of the same general situation: make the lives of non-heretics much better, and make the lives of heretics much worse. For Bob to agree, heretics must be punished. And for Dave to agree, Dave must see the average life quality as an improvement on the "no superintelligence" outcome. If the "no superintelligence" outcome is bad for everyone, then the lives of heretics in this scenario could get very bad.
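A similarly hedged sketch of the Bob-and-Dave case (again with hypothetical names and numbers): here Pareto improvements are computed with respect to Bob and Dave only, so the condition is just that neither Bob nor Dave is worse off than in the "no superintelligence" baseline, and at least one of them is strictly better off.

```python
# Purely illustrative sketch with hypothetical names and numbers.
# Welfare changes are measured relative to the "no superintelligence" baseline (all zeros).
# Bob sees half the population as heretics and is better off the worse they fare.
# Dave is an altruist who only cares about the average welfare change.

heretics = {"P1", "P2"}
non_heretics = {"P3", "P4"}
population = heretics | non_heretics

def bob_delta(outcome):
    # Bob gains when heretics are worse off than in the baseline, loses when they gain.
    return -sum(outcome[p] for p in heretics)

def dave_delta(outcome):
    # Dave only cares about the average change across the whole population.
    return sum(outcome.values()) / len(outcome)

def pareto_improvement_for_bob_and_dave(outcome):
    b, d = bob_delta(outcome), dave_delta(outcome)
    return b >= 0 and d >= 0 and (b > 0 or d > 0)

# Both conditions can only hold together when heretics are made worse off and
# non-heretics are made sufficiently better off to keep the average above the baseline.
print(pareto_improvement_for_bob_and_dave({"P1": -3, "P2": -3, "P3": 5, "P4": 5}))  # True
print(pareto_improvement_for_bob_and_dave({"P1": 1, "P2": 1, "P3": 1, "P4": 1}))    # False: Bob is worse off
```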
More generally: people like Bob (with aspects of morality along the lines of "heretics deserve eternal torture in hell") will have dramatically increased power over the far future when one uses this type of negotiation baseline (assuming that things have been patched in a way that results in a non-empty set of Pareto improvements). If everyone is included in the calculation of what counts as a Pareto improvement, then the set of Pareto improvements is empty (due to people like Gregg and Jeff). And if not everyone is included, then the outcome could get very bad for many people (compared to whatever would have happened otherwise).
(Adding the SPADI feature to your proposal would remove these issues, and would prevent people like Dave from being disempowered relative to people like Bob. The details are importantly different from PCEV, but it is no coincidence that adding the SPADI feature removes this particular problem for both proposals. The common denominator is that, from the perspective of Steve, it is in general dangerous to encounter an AI that has taken "unwelcome" or "hostile" preferences about Steve into account.)
Also: my general point, that the concept of "fair Pareto improvements" has counterintuitive implications in this novel context, still applies (it is not related to the details of any specific proposal).