
ThomasCederborg

Karma: 26

I was a physics undergrad, but I did an AI PhD on artificial learners adopting normative conventions from human teachers. This work included an attempt to formalise the learning situation where an artificial learner tries to figure out what one human teacher would like that learner to do. Later, I also worked on interpreting human feedback. Here is a review article that situates these two strands of research in a larger context. The review article also includes a discussion of a 2016 book by Michael Tomasello: A Natural History of Human Morality. I think this is relevant background when analysing certain aspects of human morality. A view of human morality that is genuinely free from unexamined implicit assumptions of non-naturalness is useful when reading some of the points that I make here on LW. Refs 32 and 46, on agents noticing model misspecification and on agents interpreting an off-switch attempt as an information source, are relevant background for some other topics.

My current research focus is on analysing alignment targets. This is what I post about on LW. I don't see this as a purely academic curiosity; I see it as an issue with important real-world implications.

I think that AI is genuinely dangerous. The specific AI danger that I am trying to mitigate is the scenario where someone successfully hits an alignment target, resulting in an outcome that is far, far worse than extinction. I think that this danger, from successfully hitting the wrong alignment target, is severely neglected. I think that this neglect is a genuine problem, and I hope to do something about it by posting on LW. In other words: since the type of danger that I focus on is distinct from dangers coming from aiming failures, dealing with it requires dedicated effort. Mitigating the type of danger that I focus on requires a specific type of insight, and such insights are not useful for dealing with aiming failures; they are therefore unlikely to be found when investigating dangers related to aiming failures.

In yet other words: the danger that I am focused on is importantly distinct from other types of AI dangers, and it requires a dedicated research focus on issues that are relevant to this specific danger. My research might of course be rendered pointless by someone accidentally creating an AI that has no intrinsic interest in humans at all. But I do not think that this is the only outcome worth thinking about. I happen to think that most academic AI researchers, the heads of leading tech companies, the general public, etc., underestimate the probability that a misaligned, uncaring AI will kill everyone. If someone manages to trigger an Intelligence Explosion using current methods, then I don't expect them to hit the alignment target that they are aiming for. But I don't think that this is the only plausible path to a powerful AI. It may be the default path, but many things are uncertain, including the actions of people (for example: Covid illustrated that the set of politically realistic policies can change dramatically and quickly, and the recent AI debate has illustrated that the set of things taken seriously in public debate can also change dramatically and quickly; these are just two of the many sources of uncertainty that I see). In other words: I simply do not think that it is possible to confidently rule out the possibility that an alignment target will, eventually, be successfully hit. I further think that essentially everyone dramatically underestimates the dangers associated with successfully hitting the wrong alignment target.

My proposed way of reducing this danger is to analyse alignment targets. My specific focus is on trying to find features that are necessary (but obviously not sufficient) for an alignment target to be safe for a human individual. Finding such a feature can stop a future AI project (possibly one aiming at a currently unknown alignment target) at the idea stage. In scenarios where the results of this type of work have a chance to impact the outcome, I expect that there will be time to pursue this research. However, I have no particular reason to think that there will be enough time. If you are also interested in this type of work, then please feel free to send me an email.

Email: thomascederborgsemail at gmail dot com

The proposal to add a "Last Judge" to an AI, does not remove the urgency, of making progress on the "what alignment target should be aimed at?" question.

ThomasCederborg22 Nov 2023 18:59 UTC
1 point
0 comments · 18 min read · LW link

Making progress on the "what alignment target should be aimed at?" question, is urgent

ThomasCederborg5 Oct 2023 12:55 UTC
2 points
0 comments · 18 min read · LW link