ThomasCederborg

Karma: 186

My research focus is Alignment Target Analysis (ATA). I noticed that the most recently published version of CEV (Parliamentarian CEV, or PCEV) gives a large amount of extra influence to people that intrinsically value hurting other individuals. For Yudkowsky’s description of the issue you can search the CEV arbital page for ADDED 2023.

The fact that no one noticed this issue for over a decade shows that ATA is difficult. If PCEV had been successfully implemented, the outcome would have been massively worse than extinction. I think that this illustrates that scenarios where someone successfully hits a bad alignment target pose a serious risk. I also think that it illustrates that ATA can reduce these risks (noticing the issue reduced the probability of PCEV getting successfully implemented). The reason that more ATA is needed is that PCEV is not the only bad alignment target that might end up getting implemented. ATA is however very neglected. There does not exist a single research project dedicated to ATA. In other words: the reason that I am doing ATA is that it is a tractable and neglected way of reducing risks.

I am currently looking for collaborators. I am also looking for a grant or a position that would allow me to focus entirely on ATA for an extended period of time. Please don’t hesitate to get in touch if you are curious and would like to have a chat, or if you have any feedback, comments, or questions. You can for example PM me here, or PM me on the EA Forum, or email me at thomascederborgsemail@gmail.com (that really is my email address. It’s a Gavagai / Word and Object joke from my grad student days)

My background is physics as an undergrad and then AI research. Links to some papers: P1 P2 P3 P4 P5 P6 P7 P8. (no connection to any form of deep learning)

ThomasCederborg Feb 3, 2024, 8:23 PM
1 point
0
in reply to: the gears to ascension’s comment on: Managing risks while trying to do good
I was clearly wrong regarding how you feel about your cells. But surely the question of whether or not an AI that is implementing the CEV of Steve, would result in any surviving cells, is an empirical question? (which must be settled by referring to facts about Steve. And trying to figure out what these facts mean in terms of how the CEV of Steve would treat his cells). It cannot possibly be the case that it is impossible, by definition, to discover that any reasonable way of extrapolating Steve would result in all his cells dying?
Thank you for engaging on this. Reading your description of how you view your own cells was a very informative window, into how a human mind can work. (I find it entirely possible, that I am very wrong regarding how most people view their cells. Or about how they would view their cells upon reflection. I will probably not try to introspect, regarding how I feel about my own cells, while this exchange is still fresh)
Zooming out a bit, and looking at this entire conversation, I notice that I am very confused. I will try to take a step back from LW and gain some perspective, before I return to this debate.

ThomasCederborg Feb 3, 2024, 3:56 PM
3 points
0
in reply to: the gears to ascension’s comment on: Managing risks while trying to do good
I think that extrapolation is a genuinely unintuitive concept. I would for example not be very surprised if it turns out that you are right, and that it is impossible to reasonably extrapolate you if the AI that is doing the extrapolation is cut off from all information about other humans. I don’t think that this fact is in tension with my statement, that individuals and groups are completely different types of things. Taking your cell analogy: I think that implementing the CEV of you could lead to the death of every single cell in your body (for example if your mind is uploaded in a way that does not preserve information about any individual cell). I don’t think that it is strange in general, if an extrapolated version of a human individual, is completely fine with the complete annihilation of every cell in her body (and this is true, despite the fact that ``hostility towards cells″ is not a common thing). Such an outcome is no indication of any technical failure, in an AI project, that was aiming for the CEV of that individual. This shows why there is no particular reason to think, that doing what a human individual wants, would be good for any of her cells (for any reasonable definition of ``doing what a human individual wants″). And this fact remains true, even if it is also the case, that a given cell would become impossible to understand, if that cell was isolated from other cells.
A related tangent here relates to the fact that extrapolation is a genuinely unintuitive concept. I think that this has important implications for AI safety. This fact is for example central to my argument about ``Last Judge″ type proposals in my post:
The proposal to add a ``Last Judge″ to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?″ question.
(I will try to reduce the commas. I see what you are talking about. I have in the past been forced to do something about an overuse of both footnotes and parentheses. Reading badly written academic history books seems to be making things worse (if one is analysing AI proposals where the AI is getting its goal from humans, then it makes sense to me to at least try to understand humans))

ThomasCederborg Feb 3, 2024, 4:32 AM
1 point
0
in reply to: Vladimir_Nesov’s comment on: Managing risks while trying to do good
I think that ``CEV″ is usually used as shorthand for ``an AI that implements the CEV of Humanity″. This is what I am referring to, when I say ``CEV″. So, what I mean when I say that ``CEV is a bad alignment target″, is that, for any reasonable set of definitions, it is a bad idea, to build an AI, that does what ``a Group″ wants it to do (in expectation, from the perspective of essentially any human individual, compared to extinction). Since groups and individuals, are completely different types of things, it should not be surprising to learn, that doing what one type of thing wants (such as ``a Group″), is bad for a completely different type of thing (such as a human individual). In other words, I think that ``an AI that implements the CEV of Humanity″, is a bad alignment target, in the same sense, as I think that SRAI is a bad alignment target.
But I don’t think your comment uses ``CEV″ in this sense. I assume that we can agree, that aiming for ``the CEV of a chimp″, can be discovered to be a bad idea (for example by referring to facts about chimps, and using thought experiments, to see what these facts about chimps, imply about likely outcomes). Similarly, it must be possible to discover, that aiming for ``the CEV of Humanity″, is also a bad idea (for human individuals). Surely, discovering this, cannot be, by definition, impossible. Thus, I think that you are in fact, not, using ``CEV″ as shorthand for ``an AI that implements the CEV of Humanity″. (I am referring to your sentence: ``If it’s not something to aim at, then it’s not a properly constructed CEV.″)
Your comment makes perfect sense, if I read ``CEV″ as shorthand for ``an AI that implements the CEV of a single human designer″. I was not expecting this terminology. But it is a perfectly reasonable terminology, and I am happy to make my argument, using this terminology. If we are using this terminology, then I think that you are completely right, about the problem that I am trying to describe, being a proxy issue (thus, if this was indeed your intended meaning, then I was completely wrong, when I said that I was not referring to a proxy issue. In this terminology, it is indeed a proxy issue). So, using this terminology, I would describe my concerns as: ``an AI that implements the CEV of Humanity″ is a predictably bad proxy, for ``an AI that implements the CEV of a single human designer″. Because ``an AI that implements the CEV of Humanity″, is far, far, worse, than extinction, from the perspective of essentially any human individual (which, presumably, disqualifies it as a proxy, for ``an AI that implements the CEV of a single human designer″. If this does not disqualify it as a proxy, then I think that this particular human designer, is a very dangerous person (from the perspective of essentially any human individual)). Using this terminology (and assuming a non unhinged designer), I would say that if your proposed project, is to use ``an AI that implements the CEV of Humanity″, as a proxy, for ``an AI that implements the CEV of a single human designer″, then this constitutes a, predictable, proxy failure. Further, I would say that pushing ahead, despite this predictable failure, with a project that is trying to implement ``an AI that implements the CEV of Humanity″ (as a proxy), inflicts an unnecessary s-risk, on everyone. Thus, I think it would be a bad idea, to pursue such a project (from the perspective of essentially any human individual. Presumably including the designer).
If we take the case of Bob, and his Suffering Reducing AI (SRAI) project (and everyone has agreed to use this terminology), then we can tell Bob:
SRAI is not a good proxy, for ``an AI that implements the CEV of Bob″ (assuming that you, Bob, do not want to kill everyone). Thus, you will run into a, predictable, issue, when your project tries to use SRAI as a proxy, for ``an AI that implements the CEV of Bob″. If you are implementing a safety measure successfully, then this will still, at best, lead to your project failing safely. At worst, your safety measure will fail, and SRAI will kill everyone. So, please don’t proceed with your project, given that it would put everyone at risk of being killed by SRAI (and this would be an unnecessary risk, because your project will predictably fail, due to a predictable proxy issue).
By making sufficient progress, on the ``what alignment target should be aimed at?″ question, before Bob gets started on his SRAI project, it is possible to avoid the unnecessary extinction risks, associated with the proxy failure, that Bob will predictably run into, if his project uses SRAI, as a proxy for ``an AI that implements the CEV of Bob″. Similarly, it is possible to avoid the unnecessary s-risks, associated with the proxy failure, that Dave will predictably run into, if Dave uses ``an AI that implements the CEV of Humanity″, as a proxy, for ``an AI that implements the CEV of Dave″ (because any ``Group AI″, is very bad for human individuals (including Dave)).
Mitigating the unnecessary extinction risks, that are inherent in any SRAI project, does not require an answer, to the ``what alignment target should be aimed at?″ question (it was a long time ago, but if I remember correctly, Yudkowsky did this, around two decades ago. It seems likely, that anyone that is careful and capable enough, to hit an alignment target, will be able to understand that old explanation, of why SRAI, is a bad alignment target. So, generating such an explanation, was sufficient for mitigating the extinction risks, associated with a successfully implemented SRAI. Generating such an explanation, did not require an answer, to the ``what alignment target should be aimed at?″ question. One can demonstrate that a given bad answer, is a bad answer, without having any good answer). Similarly, avoiding the unnecessary s-risks, that are inherent in any ``Group AI″ project, does not require an answer, to the ``what alignment target should be aimed at?″ question. (I strongly agree, that finding an actual answer to this question, is probably very, very, difficult. I am simply pointing out, that even partial progress, on this question, can be very useful)
(I think that there are other issues, related to AI projects, whose purpose is to aim at ``the CEV, of a single human designer″. I will not get into this here, but I thought that it made sense, to at least mention, that there are other issues, related to this type of project)

ThomasCederborg Feb 2, 2024, 10:52 PM
1 point
0
in reply to: Random Developer’s comment on: Managing risks while trying to do good
I agree that ``the ends justify the means″ type thinking has led to a lot of suffering. For this, I would like to switch from the Chinese Cultural Revolution, to the French Revolution, as an example (I know it better, and I think it fits better, for discussions of this attitude). So, someone wants to achieve something, that is today seen as a very reasonable goal, such as ``end serfdom and establish formal equality before the law″. So, basically: their goals are positive, and they achieve these goals. But perhaps they could have achieved those goals, with fewer side effects, if it was not for their ``the ends justify the means″ attitude. Serfdom did end, and this change was both lasting, and spreading. After things had calmed down, the new economic relations, led to dramatically better material conditions, for the former serfs (and, for example, a dramatic increase in life expectancy, due to a dramatic reduction in poverty related malnutrition). But, during the revolutionary wars (and especially the Napoleonic Wars that followed), millions died. It sounds intuitively likely, that there would have been less destruction, if attitudes along these lines were less common.
So, yes, even when an event has such a large, and lasting, positive impact, that it is still celebrated, centuries later (14th of July is still a very big thing in France), one might find that this attitude caused concrete harm (millions of dead people, must certainly qualify as ``concrete harm″. And the French Revolution must certainly be classified as a celebrated event in any sense of that word (including, but not limited to, the literal: ``fireworks and party″ sense)).
And you are entirely correct, that damage from this type of attitude, was missing from my analysis.

ThomasCederborg Feb 2, 2024, 10:17 PM
1 point
0
in reply to: Vladimir_Nesov’s comment on: Managing risks while trying to do good
I think that my other comment to this, will hopefully be sufficient, to outline what my position actually is. But perhaps a more constructive way forwards, would be to ask how certain you are, that CEV is in fact, the right thing to aim at? That is, how certain are you, that this situation is not symmetrical, to the case where Bob thinks that: ``a Suffering Reducing AI (SRAI), is the objectively correct thing to aim at″? Bob will diagnose any problem, with any specific SRAI proposal, as arising from proxy issues, related to the fact that Bob is not able to perfectly define ``Suffering″, and must always rely on a proxy (those proxy issues exist. But they are not the most serious issue, with Bob’s SRAI project).
I don’t think that we should let Bob proceed with an AI project, that aims to find the correct description of ``what SRAI is″, even if he is being very careful, and is trying to implement a safety measure (that will, while it continues to work as intended, prevent SRAI from killing everyone). Because those safety features might fail, regardless of whether or not someone has pointed out a critical flaw in them, before the project reaches the point of no return (this conclusion is not related to Corrigibility. I would reach the exact same conclusion, if Bob’s SRAI project, was using any other safety measure). For the exact same reason, I simply do not think, that it is a good idea, to proceed with your proposed CEV project (as I understand that project). I think that doing so, would represent a very serious s-risk. At best, it will fail in a safe way, for predictable reasons. How confident are you, that I am completely wrong about this?
Finally, I should note, that I still don’t understand your terminology. And I don’t think that I will, until you specify what you mean with ``something like CEV″. My current comments, are responding to my best guess, of what you mean (which is, that MPCEV, from my linked to post, would not count as ``something like CEV″, in your terminology). (Does an Orca count as: ``something like a shark″? If it is very important, that some water tank is free of fish, then it is difficult for me to discuss Dave’s ``let’s put something like a shark, in that water tank″ project, until I have an answer to my Orca question.)
(I assume that this is obvious, but just to be completely sure that this is clear, it probably makes sense to note explicitly that I, very much, appreciate that you are engaging on this topic)

ThomasCederborg Feb 2, 2024, 9:34 PM
3 points
0
in reply to: Wei Dai’s comment on: Managing risks while trying to do good
I don’t think that they are all status games. If so, then why did people (for example) include long meditations, regarding whether or not, they personally, deserve to go to hell, in private diaries? While they were focusing on the ``who is a heretic?″ question, it seems that they were taking for granted, the normative position: ``if someone is a heretic, then she deserves eternal torture in hell″. But, on the other hand, private diaries are of course sometimes opened, while the people that wrote them are still alive (this is not the most obvious thing, that someone would like others to read, in a stolen diary. But people are not easy to interpret, especially across centuries of distance. Maybe for some people, someone else stealing their diary, and reading such meditations, would be awesome). And people are not perfect liars, so maybe the act of making such entries is, mostly, an effective way, of getting into an emotional state, such that one seems genuine, when expressing remorse to other people? So, maybe any reasonable way of extrapolating a diarist like this, will lead to a mind, that finds the idea of hell, abhorrent. There is a lot of uncertainty here. There is probably also a very, very large diversity, among the set of humans that have adopted a normative position, along these lines (and not just in terms of terminology, and in terms of who counts as a heretic. Also in terms of what it is, that was lying underneath, the adoption of such normative positions. It would not be very surprising, if a given extrapolation procedure, leads to different outcomes, for two individuals, that sound very similar). As long as we agree that any AI design, must be robust to the possibility, that people mean what they say, then perhaps these issues are not critical to resolve (but, on the other hand, maybe digging into this some more, will lead to genuinely important insights).
(I agree that there were probably a great number of people, especially early on, that were trying to achieve things that most people today would find reasonable, but whose actions contributed to destructive movements. Such issues are probably a lot more problematic in politics, than in the case where an AI is getting its goal from a set of humans) (none of my reasoning here is done, with EAs in mind)
I think that there exists a deeper problem, for the proposition, that perhaps it is possible to find some version of CEV, that is actually safe for human individuals (as opposed to the much easier task, of finding a version of CEV, such that no one is able to outline a thought experiment, before launch time, that shows, why this specific version, would lead to an outcome, that is far, far, worse than extinction). Specifically, I’m referring to the fact that ``heretics deserve eternal torture in hell″ style fanatics (F1), are just one very specific example, of a group of humans, that might be granted extreme influence, over CEV. In a population of billions, there will exist a very, very large number of ``never-explicitly-considered″ types of minds. Consider for example a different, tiny, group of Fanatics (F2), who (after being extrapolated) have a very strong ``all or nothing″ attitude, and a sacred rule against negotiations (let’s explore what happens in the case, where this attitude is related to a religion, and where one in a thousand humans, will be part of F2). Unless negotiations deadlock in a very specific way, PCEV will grant F2, exactly zero direct influence. However, let’s explore what happens, if another version of CEV is launched, that first maps each individual to a Utility function, and then maximises the Sum of those functions (USCEV). During the process, where a member of this religion, that we can call Gregg, ``becomes the person that Gregg wants to be″, the driving aspect of Gregg’s personality, is a burning desire to become a true believer, and become morally pure. This includes, becoming the type of person, that would never break the sacred set of rules: ``Never accept any compromise, regarding what the world should look like! Never negotiate with heretics! Always take whatever action, is most likely to result in the world being organised, exactly as is described in the sacred texts!″.
So, the only reasonable way to map, extrapolated Gregg, to a utility function, is to assign maximum utility to the Outcome demanded by the Sacred Texts (OST), and minimum utility, to every other outcome. Besides the number of people in F2, the bound on how bad OST can be (from the perspective of the non believers), and still be the implemented outcome, is that USCEV, must be able to think up something that is far, far, worse (technically, the minimum is not actually the worst possible outcome, but instead the worst outcome that USCEV can think up, for each specific non-believer). As long as there is a very large difference, between OST, and the worst thing that USCEV can think up, then OST will be the selected outcome. Maybe OST will look ok, to a non super intelligent observer. For example, OST could look like a universe where every currently existing human individual, after an extended period of USCEV guided self reflection, converge on the same belief system (and all subsequent children, are then brought up in this belief system). Or, maybe it will be overtly bad, with everyone forced to convert or die. Or maybe it will be a genuine s-risk, for example along the lines of LP.
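The USCEV thought experiment above can be made concrete with a toy calculation. This is only a minimal sketch under assumptions that are not in the comment itself: utilities are min-max normalised to [0, 1] for each individual, there are only three outcomes, and the outcome names (`compromise`, `worst`), the raw numbers, and the helper functions `normalise` and `selected_outcome` are all hypothetical. It shows one way to read the claim that OST gets selected as long as the AI can think up something far worse than OST for each non-believer.

```python
from typing import Dict, List

Utility = Dict[str, float]  # maps an outcome name to a utility value


def normalise(raw: Utility) -> Utility:
    """Rescale raw utilities to [0, 1] (min-max normalisation, an assumption)."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {outcome: 0.0 for outcome in raw}
    return {outcome: (value - lo) / (hi - lo) for outcome, value in raw.items()}


def selected_outcome(worst_raw: float, ost_raw: float = -10.0,
                     n_nonbelievers: int = 999, n_fanatics: int = 1) -> str:
    """Return the outcome with the highest sum of normalised utilities."""
    # Hypothetical non-believer: prefers a compromise, dislikes OST, and
    # dislikes the worst outcome the AI can think up even more.
    nonbeliever = normalise({"compromise": 0.0, "OST": ost_raw, "worst": worst_raw})
    # Extrapolated F2 member: maximum utility to OST, minimum to everything else.
    fanatic = normalise({"compromise": 0.0, "OST": 1.0, "worst": 0.0})
    population: List[Utility] = [nonbeliever] * n_nonbelievers + [fanatic] * n_fanatics
    outcomes = ["compromise", "OST", "worst"]
    return max(outcomes, key=lambda o: sum(u[o] for u in population))


print(selected_outcome(worst_raw=-1e6))    # worst thinkable outcome is far worse than OST
print(selected_outcome(worst_raw=-100.0))  # worst thinkable outcome is only moderately worse
```

With these made-up numbers, the first call selects OST and the second selects the compromise. When the worst thinkable outcome is far below OST, normalisation compresses each non-believer's loss from OST to almost nothing, so the single fanatic's all-or-nothing utility function decides the sum. Other normalisation schemes would give different thresholds, so this should be treated as an illustration of the mechanism and not as a claim about how any actual USCEV proposal would be specified.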
As far as I can tell, CEV in general, and PCEV in particular, is, still, the current state of the art, in terms of finding an answer to the ``what alignment target, should be aimed at?″ question (and CEV has been the state of the art now, for almost two decades). I find this state of affairs strange, and deeply problematic. I’m confused by the relatively low interest, in efforts to make further progress on the ``what alignment target, should be aimed at?″ question (I think that, for example, the explanation, in the original CEV document, from 2004, was a very good explanation, for why this question matters. And I don’t think that it is a coincidence, that the specific analogy used, to make that point, was a political revolution (a brief paraphrasing: such a revolution must (i): succeed, and also (ii): lead to a new government, that is actually a good government. Similarly, an AI must (i): hit an alignment target, and also (ii): this alignment target, must be a good thing to hit)). Maybe I shouldn’t be surprised by this relative lack of interest. Maybe humans are just not great, in general, at reacting to ``AI danger″. But it still feels like I’m not seeing, I don’t know, … something (wild speculation by anyone that, at any point, happens to stumble upon this comment, regarding what this … something … might be, is very welcome. Either in a comment, or in a DM, or in an email).

ThomasCederborg Feb 2, 2024, 4:19 AM
1 point
0
in reply to: Vladimir_Nesov’s comment on: Managing risks while trying to do good
It is getting late here, so I will stop after this comment, and look at this again tomorrow (I’m in Germany). Please treat the comment below as not fully thought through.
The problem from my perspective, is that I don’t think that the objective, that you are trying to approximate, is a good objective (in other words, I am not referring to problems, related to optimising a proxy. They also exist, but they are not the focus of my current comments). I don’t think that it is a good idea, to do what an abstract entity, called ``humanity″, wants (and I think that this is true, from the perspective of essentially any human individual). I think that it would be rational, for essentially any human individual, to strongly oppose the launch of any such ``Group AI″. Human individuals, and groups, are completely different types of things. So, I don’t think that it should be surprising, to learn that doing what a group wants, is bad for the individuals, in that group. This is a separate issue, from problems related to optimising for a proxy.
I give one example, of how things can go wrong, in the post:
A problem with the most recently published version of CEV
This is of course just one specific example, and it is meant as an introduction, to the dangers, involved in building an AI, that is describable as ``doing what a group wants″. Showing that a specific version of CEV, would lead to an outcome, that is far, far, worse than extinction, does not, on its own, prove that all versions of CEV are dangerous. I do however think that all versions of CEV, are, very, very, dangerous. And I do think, that this specific thought experiment, can be used to hint at a more general problem. I also hope, that this thought experiment will at least be sufficient, for convincing most readers that there, might, exist a deeper problem, with the core concept. In other words, I hope that it will be sufficient, to convince most readers that you, might, be going after the wrong objective, when you are analysing different attempts ``to say what CEV is″.
While I’m not actually talking about implementation, perhaps it would be more productive, to approach this from the implementation angle. How certain are you, that the concept of Boundaries / Membranes, provides reliable safety, for individuals, from a larger group, that contains the type of fanatics, described in the linked post? Let’s say that it turns out, that they do not, in fact, reliably provide such safety, for individuals. How certain are you then, that the first implemented system, that relies on Boundaries / Membranes, to protect individuals from such groups, will in fact result, in you being able to try again? I don’t think that you can possibly know this, with any degree of certainty. (I’m certainly not against safety measures. If anyone attempts to do what you are describing, then I certainly hope that this attempt will involve safety measures) (I also have nothing against the idea of Boundaries / Membranes)
An alternative (or parallel) path, to trial and error, is to try to make progress on the ``what alignment target should be aimed at?″ question. Consider what you would say to Bob, who wants to build a Suffering Reducing AI (SRAI). He is very uncertain of his definition of ``Suffering″, and he is implementing safety systems. He knows that any formal definition of ``Suffering″ that he can come up with, will be a proxy, for the actually, correct, definition of Suffering. If it can be shown, that some specific implementation of SRAI, would lead to a bad outcome (such as an AI, that decides to kill everyone), then Bob will respond that the definition of Suffering, must be wrong (and that he has prepared safety systems, that will let him try to find a better definition of ``Suffering″).
This might certainly end well. Bob’s safety systems might continue to work, until Bob realises, that the core idea, of building any AI, that is describable as a SRAI, will always lead to an AI, that simply kills everyone (in other words: until he realises, that he is going after the wrong objective). But I would say, that a better alternative, is to make enough progress, on the ``what alignment target should be aimed at?″ question, that it is possible to explain to Bob, that he is, in fact, going after the wrong objective (and is not, in fact, dealing with proxy issues). (in the case of SRAI, such progress has of course been around for a while. I think I remember reading an explanation of the ``SRAI issue″, written by Yudkowsky, decades ago. So, to deal with people like Bob, there is no actual need, for us, to make additional progress. But for people in a world where SRAI, is the state of the art, in terms of answering the ``what alignment target should be aimed at?″ question, I would advise them to focus on making further progress, on this question)
Alternatively, I could ask what you would say to Bob, if he thinks that ``reducing Suffering″, is ``the objectively correct thing to do″, and is convinced, that any implementation that leads to bad outcomes (such as an AI, that kills everyone), must be a proxy issue? I think that, just as any reasonable definition of ``Suffering″, implies a SRAI, that kills everyone, any reasonable set of definitions of ``a Group″, implies a Group AI, that is bad for human individuals (in expectation, when that Group AI is pointed at billions of humans, from the perspective of essentially any human individual, in the set of humans, that the Group AI is pointed at, compared to extinction). In other words, a Group AI is bad for human individuals in expectation, in the same sense as a SRAI kills everyone. I’m definitely not saying that this is true for ``minds in general″. If Dave is able to reliably see all implications of any AI proposal (or if Dave is invulnerable to a powerful AI that is trying to hurt Dave. Or if the minds that the Group AI will be pointed at, are known to be ``friendly towards Dave″ in some formal sense, that is fully understood by Dave), then this might not be true for Dave. But I claim that it is true for human individuals.

ThomasCederborg Feb 2, 2024, 12:53 AM
1 point
0
in reply to: Vladimir_Nesov’s comment on: Managing risks while trying to do good
I’m not sure that I agree with this. I think it mostly depends on what you mean by: ``something like CEV″. All versions of CEV are describable as ``doing what a Group wants″. It is inherent in the core concept of building an AI, that is ``Implementing the Coherent Extrapolated Volition of Humanity″. This rules out proposals, where each individual, is given meaningful influence, regarding the adoption, of those preferences, that refer to her. For example as in MPCEV (described in the post that I linked to above). I don’t see how an AI can be safe, for individuals, without such influence. Would you say that MPCEV counts as ``something like CEV″?
If so, then I would say that it is possible, that ``something like CEV″, might be a good, long term solution. But I don’t see how one can be certain about this. How certain are you, that this is in fact a good idea, for a long term solution?
Also, how certain are you, that the full plan that you describe (including short term solutions, etc), is actually a good idea?

ThomasCederborg Feb 1, 2024, 11:39 PM
3 points
−2
on: Managing risks while trying to do good
In the case of damage from political movements, I think that many truly horrific things, have been done by people, that are well approximated as: ``genuinely trying to do good, and largely achieving their objectives, without major unwanted side effects″ (for example events along the lines of the Chinese Cultural Revolution, that you discuss in your older post, that you link to in your first footnote).
I think our central disagreement, might be a difference, in how we see human morality. In other words, I think that we might have different views, regarding what one should expect, from a human that is, genuinely, trying to do good, and that is succeeding. I’m specifically talking about one particular aspect of morality, that has been common in many different times and places, throughout human history. It is sometimes expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell″. The issue is not the various people, that have come up with various stories, along the lines of ``hell exists″. Humans are always coming up with stories about ``what is really going on″. There was, a lot, to choose from. The issue is the large number of people, in many different cultures, throughout human history, that has heard stories, that assumes a morality, that fits well with normative statements along the lines of ``heretics deserve eternal torture in hell″. And they have thought: ``this story feels right. This is the way that the world, should, work. On questions of morality, it feels right, to defer to the one, who set this up″. These types of stories, are not the only types of stories, that humans have found intuitive. But they are common. The specific aspect of human morality, that I am referring to, is just one aspect, out of many. But it is an important, and common, aspect. Many people, that are trying to do good, are not driven by anything even remotely like this specific aspect of morality. But some are. And I think that such people, have done some truly horrific things.
In other words: Given that this is one standard aspect of human morality, why would anyone be surprised, when a ``person trying to be genuinely good (in a way that does not involve anything, along the lines of status or power seeking), and succeeding″, leads to extreme horror? Side effects along the lines of innocents getting hurt, or economic chaos, are presumably unwanted side effects, for these types of political movements. But why would one expect them to be seen as major issues, by someone that is, genuinely, trying to do good? Why would anyone be surprised, to learn that these side effects, were seen as perfectly reasonable, and acceptable, costs to pay, for enforcing moral purity? In the specific event that you refer to in the post, that you link to in your first footnote (the Chinese Cultural Revolution), there were extraordinary levels of economic chaos, suffering, and a very large number of dead innocents. So, maybe these extraordinary levels of general disruption and destruction, would have been enough to discourage the movement, if they had been predicted. Alternatively, maybe the only thing driving this event, was something along the lines of ``seeking-power-for-the-sake-of-power″. But maybe not. Maybe they would have (even if they had predicted the outcome), concluded that enforcing moral purity, was more important (enforcing moral purity, on a large number of reluctant people, is not possible without power. So, power seeking behaviour, is not decisive evidence against this interpretation). Humans doing good, and succeeding, are simply not safe, for other humans (even under the assumption, that they would have proceeded, if the side effects had been predicted. And assuming that there is nothing along the lines of ``status seeking″, or ``corrupted by power″, going on). They are not safe, because their preferred outcome, is not safe, for other humans.
So, I think that there is an important piece missing from your analysis: the damage done, by humans that, genuinely, try to do good (humans that, genuinely, do not seek power, or ``status″, or anything similar. Humans whose actions are morally pure, according to their morality). And who succeed, without causing any deal-breaking side effects. (I know that you have written elsewhere, about humans not being safe for other humans. I know that you have said that Morality is Scary. But I think that an important aspect of this issue, is still missing. I could obviously be completely wrong about this, but if I had to guess, I would say that it is likely, that our disagreement follows from the fact, that you do not consider: ``safety issues coming from humans″, as being strongly connected to: ``humans genuinely trying to do good, and succeeding″)
More generally, this implies that human morality is not safe, for other humans. If it was, then those sentiments, that are sometimes expressed in theological terms, along the lines of ``heretics deserve eternal torture in hell″, would not keep popping up, throughout human history. A human that genuinely tries to do good, and that succeeds, is a very dangerous thing, for other humans. This actually has important AI safety implications. For example: this common aspect of human morality, implies a serious s-risk, if someone is ever able to successfully implement CEV. See for example my post:
A problem with the most recently published version of CEV
(the advice in your post sounds good to me, if you assume that you are exclusively interacting with people, that share your values (and that, in addition to this, are also genuinely trying to do good). My comment is about events, along the lines of the Chinese Cultural Revolution (which involved people with values that, presumably, differ greatly from the values of essentially all readers of this blog). My comment is not about people who share your values, and try to follow them (but that might be, subconsciously, trying to also achieve other things, such as ``status″). For people like this, your analysis sounds very reasonable to me. But I think that if one looks at history, a lot of: ``damage from people trying to do good″, comes from people that are not well approximated, as trying to do good, while ``being corrupted by power″, or ``subconsciously seeking status″, or anything along those lines)

ThomasCederborg Jan 29, 2024, 7:42 PM
3 points
0
in reply to: Roko’s comment on: A problem with the most recently published version of CEV
I think that these two proposed constraints, will indeed remove some bad outcomes. But I don’t think that they will help in the thought experiment outlined in the post. These fanatics want all heretics in existence to be punished. This is a normative convention. It is a central aspect of their morality. An AI that deviates from this ethical imperative, is seen as an unethical AI. Deleting all heretics, from the memory of the fanatics, will not change this aspect of their morality. It’s genuinely not personal. They think that it would be highly unethical, for any AI, to let heretics go unpunished. They really do not want, the fate of the world, to be decided by an unethical AI. Any world, where such an unethical entity, has exerted such power, is a dark world. And the LP outcome can be implemented, even if the heretics are no longer around.
More generally: The problem, from the perspective of Steve, is that these two constraints, do not actually grant Steve any meaningful influence, regarding the adoption of those preferences, that refer to Steve. I think that such influence, is a necessary (but far from sufficient) feature, for an AI to be better than extinction (in expectation, from the perspective of essentially any human individual). So, my proposal, would be to explore various ways, of giving each individual, meaningful influence, regarding the adoption of those preferences, that refer to her. One way of doing this, would be to explore different ways, of modifying PCEV, in such a way that the Modified version of PCEV (MPCEV), does give each individual, in the set of individuals that MPCEV is pointed at, such influence. For example along the lines of (some version of) the following rule:
If a preference is about Steve, then MPCEV will only take this preference into account, if: (i): the preference counts as concern for the well being of Steve, or if (ii): Steve would approve, of MPCEV taking this preference into account.
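
The rule above can be read as a filter over preferences. The following is only a minimal sketch of that reading, not a proposed implementation. Every name in it (`Preference`, `mpcev_takes_into_account`, `would_approve`) is hypothetical. It also assumes that conditions (i) and (ii) can be evaluated at all, which is exactly the hard part. Preferences that refer to no individual are passed through unchanged, which is an assumption, since the rule is silent about them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Preference:
    holder: str                      # who holds the preference
    about: Optional[str]             # the individual it refers to, if any
    is_concern_for_wellbeing: bool   # condition (i), simply taken as given here
    description: str = ""


def mpcev_takes_into_account(
    pref: Preference,
    would_approve: Callable[[str, Preference], bool],
) -> bool:
    """Return True if the rule lets this preference be taken into account."""
    if pref.about is None:
        # Assumption: the rule only restricts preferences about an individual.
        return True
    if pref.is_concern_for_wellbeing:
        return True  # condition (i)
    # Condition (ii): the individual referred to would approve.
    return would_approve(pref.about, pref)


# Toy stand-in for "Steve would approve"; purely illustrative.
def steve_would_approve(subject: str, pref: Preference) -> bool:
    return subject == "Steve" and "gift" in pref.description


prefs = [
    Preference("Fanatic", "Steve", False, "punish Steve"),
    Preference("Friend", "Steve", True, "keep Steve healthy"),
    Preference("Neighbour", "Steve", False, "send Steve a gift"),
    Preference("Anyone", None, False, "explore distant galaxies"),
]

admitted = [
    p.description
    for p in prefs
    if mpcev_takes_into_account(p, steve_would_approve)
]
print(admitted)
```

In this toy run, the preference to punish Steve is the only one that is dropped: it is about Steve, it is not concern for his well being, and he would not approve of it being counted.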

Even more generally, I think that it is important, and urgent, to make progress on what I call the ``what alignment target should be aimed at?″ question, and that you refer to as Goalcraft. (in addition to your past work on CEV variants, it was your Goalcraft post, that made me DM you, and point you to this post). Exploring different ways of modifying PCEV, sounds to me like a promising way, towards meaningful progress on this question. I think that s-risk from successfully hitting a bad alignment target, is a serious, and very under explored, issue. I think that there are important differences, between this type of s-risk, and the types of AI risk, that are associated with ``aiming failures″. In particular, progress on the ``what alignment target should be aimed at?″ question, can reduce the former type of s-risk (and this can be done, even if one does not find an actual answer). One way of reducing this s-risk, is to find problems with existing proposals. Another way, is to describe general features, that are necessary for safety (for example along the lines of the ``each individual must have meaningful influence, over the adoption, of those preferences, that refer to her″ feature mentioned above). A third way to reduce the s-risk, that comes from successfully hitting the wrong alignment target, is to show, that the ``what alignment target should be aimed at?″ question is, genuinely, unintuitive.
One very positive thing, that happens to be true, is that the class of bad outcomes, that I am trying to prevent, would probably involve a very capable design team, that is careful and clever enough, to actually hit, what they are aiming for. Explaining insights to such a design team, sounds feasible (including meta insights, such as the fact that this question is, genuinely, unintuitive). In other words: once insights have been generated, it will probably be relatively easy to communicate these insights (at least compared to many other ``AI is dangerous″ related communication tasks). First, however, such insights must be generated. And this will probably require some dedicated effort. So, the immediate task, as far as I can tell, is to create a community of people, that are fully focused, on exploring the ``what alignment target should be aimed at?″ question.

ThomasCederborg Jan 29, 2024, 7:41 PM
1 point
0
in reply to: Roko’s comment on: A problem with the most recently published version of CEV
I do think that the outcome would be LP (more below), but I can illustrate the underlying problem, using a set of alternative thought experiments, that do not require agreement on LP vs MP.
Let’s first consider the case where half of the heretics are seen as Mild Heretics (MH) and the other half as Severe Heretics (SH). MH are those that are open to converting, as part of a negotiated settlement (and SH are those that are not open to conversion). The Fanatics (F) would still prefer MP, where both MH and SH are hurt, as much as possible. But F is willing to agree to a Negotiated Position (NP), where MH escape punishment in exchange for conversion, but where SH are hurt, as much as possible, subject to a set of additional constraints. One such constraint would be a limit, on what types of minds, can be created and tortured, as a way of hurting SH.
F prefers MP, and will vote for MP unless MH agrees to vote for NP. Thus, agreeing to vote for NP is the only option available to MH, that would remove the possibility of them personally being targeted, by a powerful AI, using all its creativity, to think up clever ways of hurting them, as much as possible. This would also be their only way of reliably protecting some class of hypothetical future individuals, that they care about, and that would be created and hurt in MP. Thus, the negotiated position is NP.
This variant of the thought experiment is perhaps better at illustrating the deeply alien nature, of an arbitrarily defined abstract entity (given the label ``a Group″), that each individual would be subjected to, in case of the successful implementation, of any AI, that is describable as ``doing what a Group wants″ (the class of ``Group AI″ proposals, includes all versions of CEV, as well as many other proposals). I think that this is far more dangerous, than an uncaring AI. In other words: a ``Group AI″ has preferences, that refer to you. But you have no meaningful influence, regarding the adoption of those preferences, that refer to you. That decision, just like every other decision, is entirely in the hands of an arbitrarily defined abstract entity (pointed at using an arbitrarily defined mapping, that maps sets of billions of human individuals, to the class of entities, that can be said to want things). My proposed way forward, is to explore designs, that give each individual, meaningful influence, regarding the adoption of those preferences, that refer to her (doing so results in AI designs, that are no longer describable as ``doing what a Group wants″). I say more about this in my response to your second comment, to this post. But for the rest of this comment, I want to illustrate that the underlying issue is not actually dependent on agreement, with either of the two thought experiments discussed so far. Basically: I will argue that the conclusion, that PCEV is deeply problematic, is not dependent on agreement, on the details of these two thought experiments (in other words: I will outline an extended argument, for the premise, of your second comment).
First, it’s worth noting explicitly, that the NP outcome is obviously not bad, in any ``objective″ sense. If Bob likes the idea of sentient minds being tortured, then Bob will see NP as a good outcome. If Dave only cares about launching an AI, as soon as possible (and is fully indifferent to what AI is launched), then Dave will simply not see either of these two thought experiments, as relevant in any way. But I think that most readers will agree, that NP is a bad outcome.
Let’s turn to another class of thought experiments, that can be used to illustrate a less severe version of the same problem. Consider Steve, who wants everyone else to be punished. Steve is however willing to negotiate, and will agree to not vote for punishment, if he gets some extra bonus, that does not imply anyone else getting hurt (for example: above average influence regarding what should be done with distant galaxies. Or above average amount of resources to dispose of personally). The size of the bonus, is now strongly sensitive, to the severity of the punishment, that Steve wants to inflict on others. The more hateful Steve is, the larger bonus he gets. Yet again: this feature is not bad in any ``objective″ sense (Bob and Dave, mentioned above, wouldn’t see this as problematic in any way). But, I hope that most readers will agree, that building an AI, that behaves like this, is a bad idea.
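
The dependence of the bonus on Steve’s hatefulness can be made concrete with a deliberately crude toy model. It is only a sketch, under the assumption that the other delegates will concede any bonus that costs them less than the harm they expect from Steve’s vote. It is not a model of how PCEV actually aggregates votes. The function name, the `threat_weight` parameter, and all numbers are hypothetical.

```python
def max_bonus_conceded(severity: float, threat_weight: float = 0.01) -> float:
    """Largest bonus the others would concede rather than face Steve's vote.

    severity:      how bad Steve's preferred punishment is for the others
                   (hypothetical units of harm).
    threat_weight: hypothetical weight, between 0 and 1, that the others
                   place on Steve's vote actually shaping the outcome.
    """
    if severity < 0:
        raise ValueError("severity must be non-negative")
    if not 0 <= threat_weight <= 1:
        raise ValueError("threat_weight must be between 0 and 1")
    # Toy assumption: the others pay up to their expected harm, and no more.
    return threat_weight * severity


# The more severe the punishment Steve wants, the more he can extract.
for severity in (0, 1, 100, 10_000):
    print(severity, max_bonus_conceded(severity))
```

Under this assumption, a Steve who wants no one punished can extract nothing, and the ceiling on the bonus grows without bound as the desired punishment becomes more severe.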
We can also consider the animal rights activists Todd and Jeff. Both say that they strongly oppose the suffering of any sentient being. Todd actually does oppose all forms of suffering. Jeff is, sort of, telling the truth, but he is operating under the assumption, that everyone would want to protect animals, if they were just better informed. What Jeff actually wants, is moral purity. He wants other people to behave correctly. And, even more importantly, Jeff wants other people to have the correct morality. And, when Jeff is faced with the reality, that other people will not adopt the correct morality, even when they are informed about the details of factory farming, and given time for reflection, then Jeff will decide that they deserve to be punished, for lack of moral purity. In a situation where Jeff is in a weak political position, and when he is still able to convince himself, that most people are just misinformed, Jeff is not openly taking any totalitarian, or hateful, political positions. However, when Jeff finds out that most people, even when fully informed, and given time to reflect, would still choose to eat meat (in a counterfactual situation, where the AI is unable to provide meat, without killing animals), then he wants them punished (as a question of moral principle. They deserve it, because they are bad people. An AI that lets them avoid punishment, is an unethical AI). In a political conversation, it is essentially impossible to distinguish Todd from Jeff. So, a reasonable debate rule, is that you should conduct debates, as if all of your opponents (and all your allies) are like Todd. Accusing Todd of being like Jeff is unfair, since there is absolutely no way, for Todd to ever prove that he is not like Jeff. It is also an accusation, that can be levelled at most people, for taking essentially any normative, or political, position. 
So, having informal debate rules, stating that everyone should act, as if all people involved in the conversation, are like Todd, often makes a lot of sense. It is however a mistake to simply assume, that all people, really are, like Todd (or that people will remain like Todd, even when they are informed, about the fact, that other people are not simply misinformed. And that value differences, are, in fact, a real thing). In particular, when we are considering the question of ``what alignment target should be aimed at?″, then it is important to take into account, the fact that PCEV would give a single person like Jeff, far more power, than a large number of people like Todd. Even if a given political movement, is completely dominated by people like Todd, the influence within PCEV, from the members of this movement, would be dominated by people like Jeff. Even worse, is the fact that the issues that would end up dominating any PCEV style negotiation, are those issues that attract people along the lines of Jeff (in other words: what people think about the actual issue of animal rights, would probably not have any significant impact, on the actual outcome, of PCEV. If some currently discussed question, turned out to matter to the negotiations of extrapolated delegates, then this would probably be a type of issue, that tends to interest the ``heretics deserve eternal torture in hell″ crowd). So, while asking the question of ``what alignment target should be aimed at?″, it is actually very important, to take the existence of people like Jeff, into account.
(I use animal rights as a theme, because the texts that introduce PCEV, use this theme (and, as far as I can tell, PCEV is the current state of the art, in terms of answering the ``what alignment target should be aimed at?″ question). The underlying dynamic is, however, not connected to this particular issue. In fact, sentiments along the lines of ``heretics deserve eternal torture in hell″ have, historically, not had particularly strong ties to animal rights (such sentiments have been common, in many different times and places, throughout human history. But they do not seem to be common, amongst current animal rights movements). However, the animal rights issue, does work to illustrate the point (also: sticking with this existing theme, means that I don’t have to speculate out loud, regarding which existing group of people, are most like Jeff). Even though Jeff is a non standard type of fanatic, it is still perfectly possible, to use the power differential between Jeff and Todd, in PCEV, to illustrate the underlying problematic PCEV feature in question (since this feature of PCEV is not related to the specifics of the normative question under consideration, it is trivial to make the exact same point, using essentially any normative question / theme))

Regarding the validity of the thought experiment in the post:
If humans are mapped to utility functions, such that LP is close to maximally bad, then the negotiated outcome would indeed not be LP. However, I don’t think that this would be a reasonable mapping, because I think that a clever enough AI, would be capable of thinking up something, that is far worse than LP (more below).
Regarding Pascal’s Mugging. This term is not usually used, for these types of probabilities. If one in a hundred humans is a fanatic (or even one in a thousand), then I don’t think that it makes sense to describe this as Pascal’s Mugging. (for a set of individuals, such that LP and MP are basically the same, the outcome would indeed not be LP. But I still don’t think that it would count as a variant of Pascal’s Mugging) (perhaps I should not have used the phrase: ``tiny number of fanatics″. I did not mean ``tiny number″ in the ``negligible number″ sense. I was using it in the ``standard English″ sense)
I do not think that LP and MP will be even remotely similar. This assessment does not rely on the number of created minds in LP (or the number of years involved). Basically: everything that happens in LP, must be comprehensible to a heretic. That is not true for MP. And the comparison between LP and MP, is made by an extrapolated delegate.
In MP, the fanatics would ask an AI, to hurt the heretics as much as possible. So, for each individual heretic, the outcome in MP, is designed, by a very clever mind, specifically for the purpose of horrifying that heretic in particular (using an enormous amount of resources). The only constraint, is that any mind that is created by PCEV, must also be a heretic. In LP, the scenarios under consideration (that the 10^15 created minds will be subjected to), are limited to the set, that the heretic in question is capable of comprehending. Even if LP and MP would involve the same number of minds, and the same number of years, I still expect LP to be the negotiated outcome. MP is still the result of a very clever AI, that uses all of its creativity, to think up some outcome, specifically designed to horrify, this particular heretic. Betting against LP, as the negotiated outcome, means betting against the ability of a very powerful mind, to find a clever solution. In other words: I expect the outcome to be LP, for the same reason that I expect clever AI1, to defeat AI2 (who is equally clever, but is limited to strategies, that a human is capable of comprehending) in a war, even if AI2 starts with a lot more tanks. (if the sticking point is the phrase: ``the most horrific treatment, that this heretic is capable of comprehending″, then perhaps you will agree that the outcome would be LP, if the wording is changed to ``the most horrific treatment, that this heretic is capable of coming up with, given time to think, but without help from the AI″)

The proposal to add a ``Last Judge″ to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?″ question.

ThomasCederborg Nov 22, 2023, 6:59 PM
1 point
0 comments, 18 min read

Making progress on the ``what alignment target should be aimed at?″ question, is urgent

ThomasCederborg Oct 5, 2023, 12:55 PM
2 points
0 comments, 18 min read

A problem with the most recently published version of CEV

ThomasCederborg Aug 23, 2023, 6:05 PM
10 points
8 comments, 8 min read, 1 review
