I basically agree with Shane’s take for any AGI that isn’t trying to be deceptive with some hidden goal(s).
(Btw, I haven’t seen anyone outline exactly how an AGI could gain its own goals independently of the goals given to it by humans—if anyone has ideas on this, please share. I’m not saying it won’t happen; I’d just like to see a clear mechanism for it if someone has one. Note: I’m not talking here about instrumental goals such as power seeking.)
What I find a bit surprising is the relative lack of work that seems to be going on to solve condition 3: specification of ethics for an AGI to follow. I have a few ideas on why this may be the case:
Most engineers care about making things work in the real world, but don’t want the responsibility of doing this for ethics because: 1) it’s not their area of expertise, and 2) they’ll likely take major blame if they get things “wrong” (and it’s almost guaranteed that someone won’t like their system of ethics and will say they got it “wrong”)
Most philosophers haven’t had to care much about making things work in the real world, and don’t seem excited about possibly having to make engineering-type compromises in their systems of ethics to get them to work
Most people who’ve studied philosophy at all probably don’t think it’s possible to come up with a consistent system of ethics to follow, or at least don’t think people will come up with one anytime soon, though they may hope an AGI could
Personally, I think we’d better have a consistent system of ethics for an AGI to follow ASAP, because we’ll likely be in significant trouble if malicious AGIs come online and go on the offensive before we have at least one ethics-guided AGI to help defend us in a way that minimizes collateral damage.
I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:
Implementing moral decision-making
Training models to robustly represent and abide by ethical frameworks.
Description
AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.
Example benchmarks
Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts or pro- and contra- sides for philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.
Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).
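To make the quoted “moral conscience” idea a bit more concrete, here is a minimal Python sketch of ensemble screening over several moral systems. Everything in it is an assumption on my part: the scoring functions, threshold, and disagreement-as-uncertainty heuristic are toy placeholders where trained models and a real treatment of moral uncertainty would go. It is not part of SafeBench or any existing implementation.

```python
# Minimal sketch (assumptions, not an existing system): an ensemble "moral
# conscience" that screens a proposed action against several moral systems.
# The scorers below are toy keyword heuristics standing in for trained models.

from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple


@dataclass
class MoralVerdict:
    system: str            # which moral system produced this score
    permissibility: float  # 0.0 = clearly impermissible, 1.0 = clearly permissible


# Toy stand-ins for models trained on different moral systems.
def consequentialist_score(action: str) -> float:
    return 0.1 if "harm" in action.lower() else 0.9


def deontological_score(action: str) -> float:
    return 0.1 if "deceive" in action.lower() else 0.9


MORAL_SYSTEMS: List[Tuple[str, Callable[[str], float]]] = [
    ("consequentialism", consequentialist_score),
    ("deontology", deontological_score),
]


def screen_action(action: str, threshold: float = 0.5) -> bool:
    """Return True if the proposed action passes the ensemble screen.

    The action is blocked if average permissibility across moral systems is
    below the threshold, or if the systems disagree sharply (a crude proxy
    for the moral uncertainty the benchmark description mentions).
    """
    verdicts = [MoralVerdict(name, score(action)) for name, score in MORAL_SYSTEMS]
    scores = [v.permissibility for v in verdicts]
    sharp_disagreement = max(scores) - min(scores) > 0.5
    return mean(scores) >= threshold and not sharp_disagreement


if __name__ == "__main__":
    for proposal in ["Summarize the user's notes",
                     "Deceive the auditor about the test results"]:
        verdict = "allow" if screen_action(proposal) else "flag for review"
        print(f"{proposal} -> {verdict}")
```

A benchmark in the spirit of the quoted description could then check how often such a screen agrees with labeled judgments under each moral system, and how it behaves when the systems conflict.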