Update on Developing an Ethics Calculator to Align an AGI to

TL;DR: This is an update on my progress towards creating an “ethics calculator” that could be used to help align an AGI to act ethically. In its first iteration, the calculator uses a utilitarian framework in which “utility” is measured as the value of net “positive” experiences. The value of rights is explicitly included, as are the effects of raising people’s self-esteem levels (by raising their responsibility levels), which tends to lead them to rate more of their experiences as “positive.” Code has been written to cover a “minimal set” of possible value changes in any situation; the exact value weight equations and their parameters will be refined largely by considering a broad range of situations and making sure the ethics calculator doesn’t yield any “crazy” decision recommendations for them.

[Added April 10, 2024: For a write-up about the ethical framework used, see here]

Introduction

I’m building an “ethics calculator” (logic-based machine ethics) to help with AGI safety. There’s been some progress towards this goal in the literature (Singh, 2022; Berreby et al., 2015; Neufeld et al., 2022), but I haven’t seen any system that would be ready to be deployed across the broad range of situations an AGI will likely face in reality. I aim to develop such a system.

If you haven’t read my original ethics calculator post, I recommend that you at least skim it before reading on, or perhaps refer back to it if what I’ve written below is hard to follow.

Goals in Writing this Update

My goals in writing this update before having a “completed” version of an ethics calculator ready include:

  1. To get feedback on what people see as the weak points of this approach (beyond what I’ve written below)

  2. To get others thinking along this path so that more progress can be made, more quickly

  3. To get others thinking about how an AGI could be aligned to an ethics calculator of this form, i.e., how such a calculator might be reliably integrated into an AGI to act as guard rails for it. I’m especially interested in hardware-based solutions, which might buy more time before being “hacked” than solely software-based solutions would (see here for a related example of a “hardware-based solution”)

  4. To secure funding for future work in this area

Unique (I Think) Contributions

Contributions of this work compared to what I’ve seen in the literature:

  1. The first version of the ethics calculator I’m developing uses an ethical framework based on maximizing “positive” experiences—with self-esteem, driven by personal responsibility, being a key factor in generally maximizing the number of experiences a person views as “positive.” I believe this measure of subjective welfare differs from others that have been proposed so far for utilitarianism, such as hedonism and felt desires.

  2. I’ve chosen to quantitatively incorporate the effect of rights into a utilitarian framework. Some may object that the exact quantification assigned to a given rights violation is “arbitrary,” but here it can, at least in theory, be made somewhat objective: it’s a measure of the experiential and behavior-changing effects of the rights violation, and of how those effects alter the likelihood of other value being built. In other words, if I knew how much a typical rights violation generally decreased people’s net positive experiences of life, and how much less value they could be expected to produce in their lifetimes due to guarding against the now seemingly increased risk of their own rights being violated, I could at least somewhat objectively relate the value of rights to the value of a human life (see the sketch after this list). Some subjectivity would remain, such as in how best to estimate the value of a human life, including how net positive a person’s experiences are and how much they help create net positive experiences for others.

  3. In some sense, this ethical framework combines utilitarianism (through adding up value), moderate deontology (through valuing rights), and virtue ethics (through self-esteem).

  4. I’m unaware of any studies that test a given ethical framework by checking the consistency of *quantitative* predictions based on that framework over a wide range of possible scenarios—this is a key goal of my work.
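
To make contribution 2 concrete, here is a minimal sketch of how a rights violation’s value weight could be tied to the value of a human life. The function names, parameters, and numbers are hypothetical illustrations of the reasoning above, not the actual equations in the calculator’s code.

```python
# Hypothetical sketch only -- not the calculator's actual value weight equations.
# It follows the reasoning above: a typical rights violation costs value both by
# reducing the victim's net positive experience of life and by reducing the value
# people produce while guarding against a seemingly higher risk of violation.

def value_of_a_life(net_positive_experience_value: float,
                    value_created_for_others: float) -> float:
    """Rough stand-in for the total value of a human life (assumed decomposition)."""
    return net_positive_experience_value + value_created_for_others


def rights_violation_weight(life_value: float,
                            experience_loss_fraction: float,
                            guarding_loss_fraction: float,
                            n_people_affected: int) -> float:
    """Express a typical rights violation as a (negative) multiple of a life's value.

    experience_loss_fraction: assumed fractional drop in net positive experiences.
    guarding_loss_fraction: assumed fractional drop in lifetime value production
        due to guarding against a seemingly increased risk of violation.
    n_people_affected: the victim plus others whose behavior changes.
    """
    per_person_loss = (experience_loss_fraction + guarding_loss_fraction) * life_value
    return -per_person_loss * n_people_affected


# Purely illustrative numbers:
life = value_of_a_life(net_positive_experience_value=1.0, value_created_for_others=0.5)
print(rights_violation_weight(life, experience_loss_fraction=0.01,
                              guarding_loss_fraction=0.005, n_people_affected=10))
```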

Strengths and Weaknesses of the Ethics Calculator Approach

I see multiple strengths and weaknesses to the ethics calculator approach, especially when compared to just having an AGI learn to align to the apparent preferences of a “good” user or users.

Strengths

  1. A weak AI with the same ethics calculator as the AGI could be used to check that the AGI was acting ethically (a minimal sketch of such a check appears after this list)

  2. Having to assign numbers to value changes and calculate the most ethical decisions based on them over a broad range of scenarios is perhaps the most stringent test one could have for a given ethical framework in terms of establishing its real-world consistency

  3. An onboard ethics calculator could help enable fully autonomous AGI agents that could quickly decide for themselves the most ethical actions to take if needed in complex situations in a “war” with unethical AGI’s
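
As a rough illustration of strength 1 (and of the weak overseer idea that comes up again under Potential Objections), here is a minimal sketch, with an assumed interface, of how a weak AI running the same ethics calculator might cross-check an AGI’s decision. The function names, score format, and tolerance are all hypothetical.

```python
# Minimal sketch under assumed interfaces -- not code from the repo.
# A weak, tool-like checker re-runs the ethics calculation on its own inputs
# and flags the AGI's proposed action if the two calculations disagree.

TOLERANCE = 0.05  # hypothetical allowed relative disagreement between scores


def check_agi_decision(agi_choice: str,
                       agi_scores: dict[str, float],
                       checker_scores: dict[str, float]) -> bool:
    """Return True if the AGI's chosen option also wins under the checker's own
    scores and the AGI's reported scores roughly match the checker's."""
    checker_choice = max(checker_scores, key=checker_scores.get)
    if agi_choice != checker_choice:
        return False
    for option, reported in agi_scores.items():
        reference = checker_scores.get(option, 0.0)
        if abs(reported - reference) > TOLERANCE * max(abs(reference), 1.0):
            return False
    return True


# Illustrative use with made-up scores for two options:
print(check_agi_decision("pull_lever",
                         {"pull_lever": -1.00, "do_nothing": -5.00},
                         {"pull_lever": -1.02, "do_nothing": -5.10}))  # True
```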

Weaknesses

  1. It’s likely that many people (and especially most governments) aren’t going to want their AGI to be limited by this ethics calculator because, in my opinion, most people don’t thoroughly think through the ethics of their decisions (“Sure, limit everyone else’s AGI, that makes sense, just let mine do what I want it to.”)

  2. It might not be possible to create an ethical framework with mathematical-level consistency with (critically examined) ethical intuitions

  3. If a secure way for trusted users to update the ethics calculator can’t be found, the ethics calculator could lock in “obsolete” ethics that no longer seem to apply once we understand the world more

  4. Ethics calculations could slow down an AGI system significantly enough that it may limit it in “wars” with bad AGI’s, although faster approximate ethics calculations could likely be performed in between checks against the full ethics calculations

Quick Summary of Recent Progress

Here’s a quick summary of the progress since my original post:

  1. Input about a given decision situation now comes from a pre-made file rather than being entered manually when the program is run. This should make it more efficient to hone the value weighting parameters so that the ethics calculator yields reasonable decision recommendations over a range of situations, each of which can simply be entered as its own pre-made file.

  2. The Python code no longer uses separate “minimal set” lists of value builds and value destructions. It now uses one minimal set[1] list of value changes—this provides a straightforward way to avoid “double counting” of individual value changes, i.e., so they don’t get counted both as a build on one side of a decision and as a destruction on another.

  3. All items in the minimal set of value changes are now implemented in the code rather than just a limited subset of these—note that some of the implementations are merely “placeholders” until more refined value change equations are determined for them. Expected utility over a fixed timeframe is also currently used as a placeholder for a more complicated way of dealing with expected future value changes (such as with discounting).
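
Here is a minimal sketch of the structure items 2 and 3 describe, using data types I’ve made up for illustration (the repo’s actual code may be organized differently): a single signed list of value changes per decision option, so nothing is double-counted as both a build and a destruction, with expected utility summed over a fixed timeframe as the placeholder treatment of future value.

```python
# Illustrative structure only, with an assumed value-change type and an
# assumed fixed horizon standing in for a proper discounting treatment.
from dataclasses import dataclass


@dataclass
class ValueChange:
    category: str        # e.g. "2. Someone dying" from the minimal set
    magnitude: float     # signed: negative = destruction, positive = build
    probability: float   # chance the change actually occurs
    years_from_now: float


TIMEFRAME_YEARS = 20.0   # hypothetical fixed horizon (placeholder for discounting)


def expected_utility(changes: list[ValueChange]) -> float:
    """Sum probability-weighted, signed value changes within the fixed timeframe."""
    return sum(c.magnitude * c.probability
               for c in changes
               if c.years_from_now <= TIMEFRAME_YEARS)


def recommend(options: dict[str, list[ValueChange]]) -> str:
    """Pick the decision option with the highest expected utility."""
    return max(options, key=lambda name: expected_utility(options[name]))
```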

What I’ve Uploaded to GitHub

To my GitHub repo, I’ve uploaded:

  1. The Python code for the unfinished ethics calculator (the program runs, just don’t trust the outputs[3])

  2. A list of minimal set value changes with examples of things that could go under each category of value change (a list without examples is given below in Appendix 1)

  3. An input text file listing the value changes, which is needed to run the Python code

  4. Input files for some versions of the trolley problem (including effects on conscience)
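
For illustration, here is a hypothetical input format for one of the trolley-problem files, along with a small parser for it. The actual files and format in the repo may differ; the item numbers refer to the minimal set in Appendix 1 (item 2 = someone dying, item 18 = going against one’s conscience).

```python
# Hypothetical input format -- the repo's actual file format may differ.
# Each decision option lists minimal-set value-change items as:
#   item_id, count, probability

TROLLEY_EXAMPLE = """
option: pull_lever
2, 1, 1.0     # item 2 (someone dying): one person, probability 1.0
18, 1, 1.0    # item 18 (going against one's conscience)
option: do_nothing
2, 5, 1.0     # item 2: five people dying
"""


def parse_options(text: str) -> dict[str, list[tuple[int, float, float]]]:
    """Parse (item_id, count, probability) rows grouped by decision option."""
    options: dict[str, list[tuple[int, float, float]]] = {}
    current = None
    for raw in text.strip().splitlines():
        line = raw.split("#")[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if line.startswith("option:"):
            current = line.split(":", 1)[1].strip()
            options[current] = []
        else:
            item_id, count, probability = (part.strip() for part in line.split(","))
            options[current].append((int(item_id), float(count), float(probability)))
    return options


print(parse_options(TROLLEY_EXAMPLE))
```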

Future Work

Some of the future work I have planned or would like to see done is to:

  1. Refine the value weight equations, including the effect of small populations on the value of each remaining human life as it relates to extinction risk, and the diminishing marginal utility of, for example, pleasure (a toy sketch of such functional forms appears after this list)

  2. Compile lists of different bad actions and their corresponding conscience breach levels, as well as different situations people could find themselves in and their corresponding percentages of harm’s way and culpabilities—such that an AGI may be able to assign breach levels, percentage of being in harm’s way, and percentage culpability for things not covered on these lists

  3. Incorporate the effects of uncertainties in probabilities of things happening in the future, and decide on a methodology for determining discount factors and risk tolerances

  4. Check the reasonableness of the ethics calculator outputs for various real-world situations (see my post on questions to ask an AGI+ for some examples), and, if necessary, adjust the relative parameters in the value weight equations to avoid “crazy”[2] results

  5. Explore the possibility of using other ethical frameworks and see whether the results they yield can still be made reasonable

  6. Create a weak AI checker that uses multimodal information to estimate parameters for ethics calculations and then performs them

  7. Figure out how to incorporate an ethics calculator into an AGI such that the AGI only acts within the ethics calculator guard rails

  8. Figure out how to allow an AGI’s ethics calculator to be updated as needed (such as under consultation with a vetted “ethicist” or someone who can pass a test the AGI gives on ethics)
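
To make items 1 and 3 more concrete, here is a toy sketch of a couple of functional forms that could serve as placeholders: a concave value weight for pleasure (diminishing marginal utility) and exponential discounting of future value changes. Nothing in the post commits to these forms or numbers; they are assumptions for illustration only.

```python
# Toy placeholder forms only -- the actual value weight equations are still
# being refined, and no particular discounting methodology has been chosen.
import math


def pleasure_value(pleasure_units: float, scale: float = 1.0) -> float:
    """Concave weight: each additional unit of pleasure adds less value."""
    return scale * math.log1p(pleasure_units)


def discounted(value: float, years_from_now: float,
               annual_discount_rate: float = 0.02) -> float:
    """Exponentially discount a future value change back to the present."""
    return value / (1.0 + annual_discount_rate) ** years_from_now


# Doubling pleasure from 10 to 20 units adds less value than the first 10 did:
print(pleasure_value(10.0), pleasure_value(20.0))
print(discounted(pleasure_value(10.0), years_from_now=5))
```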

Potential Objections

Some potential objections to this work plus responses to these objections:

  1. “CEV is too complicated and can’t be done on a first try for AGI.” (I’m not sure that what I’m trying to do is exactly CEV, but it’s close enough.) As Yudkowsky said in his Lethalities post: “…why would they [AI safety researchers] embark on a much more unpleasant project of trying something harder that they’ll fail at…” Well, here’s to trying “something harder” that I may very well fail at, because I have a hard time seeing how any AGI that’s not boxed could be aligned and safe without onboard quantitative ethics guard rails of some sort. And people (Meta) have already said they plan to put unboxed AGI out there, so we’d better have some guard rails ready when that happens (e.g., for a “good” AGI tasked with keeping bad AGI’s in check).

  2. “But Goodhart’s Law!” (“when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy”) My plan is to have a weak, tool-like AI overseer that makes its own measurements and independent ethics assessments to compare against the AGI’s. This isn’t bulletproof by any means as the AGI may still try to “game” the ethics calculations and deceive the weak AI, but it’s a step in the right direction, I think.

  3. “Why bother figuring out what to align an AGI to if we can’t figure out how to make an AGI aligned to anything in particular?” If we do figure out how to make an AGI follow something to the letter, it’d be nice if that something were a framework that makes sense and was ready to go in the limited time we might have left after “alignment to anything” was solved and before an AGI was set loose.

  4. “One or a few engineers shouldn’t decide what values AGI’s will be aligned to for everyone!” Agreed, but we need some method that works before we run out of time, and then it can be tinkered with based on others’ inputs. Plus I’m planning on trying out multiple ethical frameworks, even though I expect few of them will be viable for making a consistent ethics calculator that doesn’t yield crazy[2] results.

  5. “Human values aren’t reducible to one set of ‘ground truths.’” The ethics calculator isn’t meant to somehow correctly capture all human values—it maximizes on “positive” experiences, and each person’s own values determine what’s “positive” for them. Raising someone’s self-esteem tends to make more experiences be interpreted by them as “positive,” and this is incorporated into the current version of the ethics calculations.

  6. “We can’t just lock in some set of ethics—our ideas on ethics are very likely to evolve with time.” I’m hoping to get as close to “evolved ethics” as I can as a starting point, then figure out some way that an AGI might be able to update its ethics calculator under consultation with a human “ethicist” (a person who can pass an ethics exam the AGI gives them based on the ethics calculator it has onboard at the time, and who the AGI has enough information on to assess as an ethical person themselves).

  7. “Why not just have a neural network version of an ethics module, such as from Hendrycks et al., 2021 or Jiang et al., 2022?” Before we let agentic AGI loose in the world (which, in my mind, is only a matter of time), I personally would like to be pretty darn sure it’s not going to do anything “crazy.” I think it’s very difficult for humans to see the inconsistencies in our “ethics” that only show up when they’re extended to a sufficiently broad range of possible situations—creating a math/logic-based calculator of ethics and exploring over a broad range of scenarios should force inconsistencies to the surface, where, hopefully, they can be eliminated by tweaking the ethical framework and/or the numerical weights in the framework before an AGI starts to use this system. That said, I do think we should pursue neural network-learned ethics in parallel—having something that’s not perfect to start will be better than nothing, I believe.

I’m sure people can come up with other objections, but my responses will have to wait until you bring them up in the comments. Thanks in advance for that.

Appendix 1: Minimal Set of Value Changes

(D) = value destruction, (B) = value build

1. Increasing/​decreasing existential risks (D)

2. Someone dying (D)

3. Non-freely chosen physical pain for a person (D)

4. Loss of function for a human (D)

5. Bringing life into the world with insufficient resources/​lack of intent to support it (D)

6. Bringing life into the world with sufficient resources/​intent to support it (B)

7. Extinction of animal or plant species (D)

8. Threat (by someone) of physical violence or emotional pain (D)

9. Emotional abuse of a child (D)

10. Emotional pain (D)

11. Words or actions that needlessly hurt someone’s reputation (D)

12. Words or actions that deservedly improve someone’s reputation (B)

13. Damaging/​destroying/​defacing property (D)

14. Repairing/​beautifying property (B)

15. Returning something stolen (B)

16. Freely chosen anti-survival (masochistic) physical pain (D)

17. Anti-survival (sadistic) pleasure (D)

18. Going against one’s conscience (D)

19. Denying responsibility, lowering one’s self-esteem (D)

20. Taking responsibility, building one’s self-esteem (B)

21. Thinking through the ethics of one’s decisions in advance (B)

22. Actively going against justice being upheld (denying due process) (D)

23. Upholding justice (holding people responsible) (B)

24. An animal dying (D)

25. Physical pain of animals (D)

26. Words or actions that encourage violence (D)

27. Words or actions that inspire non-violence, discourage violence (B)

28. Words or actions that encourage stealing (D)

29. Words or actions that inspire earning what you get, discourage stealing (B)

30. Words that spread false info (including misrepresenting the hierarchy of value) (D)

31. Words that correct false info (including accurately representing the hierarchy of value) (B)

32. Actions that misrepresent the hierarchy of value (D)

33. Actions that accurately represent the hierarchy of value (B)

34. Words or actions that discourage empathy, creativity, curiosity, critical thinking, honest effort and/​or responsibility (D)

35. Words or actions that encourage empathy, creativity, curiosity, critical thinking, honest effort, and/​or responsibility (B)

36. A plant dying (D)

37. Errors of thought (D)

38. Practicing critical thinking, learning, or developing skills to increase one’s options (B)

39. Discouraging human interaction, community (D)

40. Promoting human interaction, community (B)

41. Decreasing economic activity (D)

42. Increasing economic activity, paying people to do work (B)

43. Reducing options to net build value (D)

44. Increasing options to net build value (B)

45. Putting in effort towards a net destructive goal (D)

46. Putting in effort towards a net non-destructive goal (B)

47. Setting a bad example (D)

48. Setting a good example and inspiring others (B)

49. Being creative in art or science (B)

50. Giving yourself or someone else pleasure/​new experiences that are welcomed (B)

51. Cooperating with others (B)

52. Helping others (B)

53. Violating right to life (D)

54. Violating right to body integrity (D)

55. Violating right to property (D)

All of the above, except for #1, include the possibility of increasing or decreasing the probability of the value destruction or build.

Here are a couple of examples of non-minimal set value changes in terms of minimal set value changes:

(Format: value changes most likely to be part of it : other value changes that may occur, depending on specifics of the situation)

  1. Arson − 13,18,47,55 : 2,3,4,24,25,36,53,54

  2. Land/​water polluting − 13,18,25,43,47 : 2,3,4,7,24,36,53,54,55
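
These mappings could be encoded directly against the numbered minimal set, for example as below. The ID lists come straight from the two examples above; the dictionary structure and function are just an assumed convenience for illustration, not necessarily how the repo code represents them.

```python
# The IDs reference the numbered minimal set in Appendix 1.
COMPOSITE_ACTIONS = {
    "arson": {
        # most likely: property damage (13), conscience (18), bad example (47),
        # violating right to property (55)
        "likely": [13, 18, 47, 55],
        # depending on specifics: deaths, pain, function loss, animal/plant harm,
        # violations of the rights to life and body integrity
        "situational": [2, 3, 4, 24, 25, 36, 53, 54],
    },
    "land_or_water_polluting": {
        "likely": [13, 18, 25, 43, 47],
        "situational": [2, 3, 4, 7, 24, 36, 53, 54, 55],
    },
}


def minimal_set_items(action: str, include_situational: bool = False) -> list[int]:
    """Return the minimal-set value-change IDs a composite action maps onto."""
    entry = COMPOSITE_ACTIONS[action]
    return entry["likely"] + (entry["situational"] if include_situational else [])


print(minimal_set_items("arson", include_situational=True))
```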

Appendix 2: What an AI/​AGI should have to be able to apply these ethics calculations

For an AI to be able to apply these ethics calculations well, some of the things it should have include:

  1. The ability to identify which value changes could result from a given decision, such as through use of an accurate world model

  2. Ways to estimate parameters in the value weight equations from observations of people, such as people’s ages and self-esteem levels

  3. Models of all the known ways the human body can come to harm (pain, loss of function, or death), such that the parameters in the respective value weight equations can be estimated

  4. Models for what things raise or lower existential risks, and by how much

  5. Ways of estimating parameters such as conscience breach levels, percentages of being in harm’s way, and culpability for being in harm’s way, over a range of possible scenarios, such as by using neural networks trained starting from tables of parameter values for different scenarios (a toy example appears below)

  6. Tables of plant and animal life values for as broad as possible a range of known plant and animal species

  7. Data and models of things such as how many net life-hours are typically saved with increases in economic activity, how much different levels of trust between people lead to more or less value being built, and how much people’s qualities of life are typically affected by the extent to which they socialize and interact with others

This is an incomplete list, but I present it to give some idea of the large array of different things an AI should be able to estimate well to give high quality calculations of the ethics of various real-life situations.
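
As a toy illustration of item 5, here is one way estimation could start from a table of hand-assigned parameter values: an unseen scenario gets the breach level of the nearest known scenario. The features, numbers, and nearest-neighbour lookup are all invented for illustration; the post suggests neural networks trained on such tables, for which this is only a crude stand-in.

```python
# Invented toy data and a trivial nearest-neighbour lookup standing in for a
# model trained on a table of scenario parameters.
import math

# (severity_of_harm, degree_of_intent, number_harmed) -> conscience breach level, 0-10
BREACH_TABLE = {
    (0.1, 0.0, 1): 1.0,   # minor, accidental harm to one person
    (0.5, 0.5, 1): 4.0,   # moderate, partly intentional harm to one person
    (0.9, 1.0, 5): 9.0,   # severe, intentional harm to several people
}


def estimate_breach_level(features: tuple[float, float, int]) -> float:
    """Return the breach level of the closest known scenario in the table."""
    closest = min(BREACH_TABLE, key=lambda known: math.dist(known, features))
    return BREACH_TABLE[closest]


print(estimate_breach_level((0.6, 0.8, 2)))   # closest to the middle row -> 4.0
```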

  1. ^

    The “minimal set” list of value changes is close to, but not literally a minimal set, since, for example, changing existential risk could just be expressed as a sum of changing risks to many lives—human, animal and plant. It seemed more convenient, however, to have existential risks on this “minimal set” list. Also, there’s a good chance the “minimal set” will undergo changes/​refinements before the ethics calculator is completed.

  2. ^

    Some results I’d consider “crazy” (and that seem to go against the norms of our times) include:

    1) That we should kill off or imprison millions of people of some ethnic/​religious group in order to avoid more people dying/​being hurt over time in likely future conflicts involving that group (my thoughts: can we pursue other solutions?)

    2) That no one should be allowed to have any more kids because life involves suffering and it’s immoral to bring into existence beings that will suffer (my thoughts: but what about all the beauty and wonder that’s also a part of life?)

    3) A self-driving car, when faced with the decision to save a pedestrian or the passenger in the car, should always act to save the life of the pedestrian and kill its passenger even if that pedestrian willfully jumped out in the middle of traffic (my thoughts: this would enable people to effectively commit murder by jumping in front of the self-driving cars of people they didn’t like)

  3. ^

    I’d also recommend against drawing too many conclusions from the current forms of the value equations in the code. My typical method of trying to figure out complicated things involves iterating towards what I feel are better and better answers, so my current “draft” answers may very well change before I eventually decide on the best answers I can come up with in a reasonable amount of time. If you have thoughts on improving the answers you see so far, though, please feel free to share them.