I just spent a while wading through this post and the comments section.
My current impression is that (among many other issues) there is a lot of talking-past-each-other related to two alternate definitions of “human values”:
Definition 1 (Matt Barnett, most commenters): “Human values” are the things that you get by asking humans what their values are, asking what they’d do in different situations, etc.
Definition 2 (MIRI): “Human values” are the output of CEV, which is maybe related to “fun-as-in-fun-theory” (per Nate’s comment), and likewise related to the idealization-of-human-deliberation stuff here in Eliezer’s meta-ethics sequence, and so on.
(For my own part, I’m not sure Definition 2 is actually a coherent definition of anything at all, but oh well, let’s leave that aside for present purposes.)
You can get Definition-1-human-values by just asking GPT-4 (or for that matter, asking a random person). To get Definition-2-human-values, you would presumably need human-level intelligence, including moral deliberation, coming up with new ideas and new concepts and new considerations, and so on, in a way that seems about as intellectually difficult as autonomously coming up with new ideas and inventions in science and engineering, i.e. way beyond GPT-4 in my opinion.
For example, Eliezer 2008 wrote:
But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood. Relative to the full space of possibilities the Future encompasses, we ourselves haven’t imagined most of the borderline cases, and would have to engage in full-fledged moral arguments to figure them out. To solve the FAI problem you have to step outside the paradigm of induction on human-labeled training data and the paradigm of human-generated intensional definitions.
I think the second half of this makes it clear that Eliezer is using “good” in a definition-2-sense.
Agree or disagree?
I think there’s some nuance here. It seems clear to me that solving the “full” friendly AI problem, as Eliezer imagined, would involve delineating human value on the level of the Coherent Extrapolated Volition, rather than merely as adequately as an ordinary human. That’s presumably what Eliezer meant in the context of the quote you cited.
However, I think it makes sense to interpret GPT-4 as representing substantial progress on the problem of building a task AGI, and especially (for the purpose of my post) the problem of delineating value from training data to the extent required by task AGIs (relative to AIs in, say, 2018). My understanding is that Eliezer advocated that we should try to build task AGIs before trying to build full-on sovereign superintelligences.[1] On the Arbital page about task AGIs, he makes the following point:
Assuming that users can figure out intended goals for the AGI that are valuable and pivotal, the identification problem for describing what constitutes a safe performance of that Task, might be simpler than giving the AGI a complete description of normativity in general. That is, the problem of communicating to an AGI an adequate description of “cure cancer” (without killing patients or causing other side effects), while still difficult, might be simpler than an adequate description of all normative value. Task AGIs fall on the narrow side of Ambitious vs. narrow value learning.
Relative to the problem of building a Sovereign, trying to build a Task AGI instead might step down the problem from “impossibly difficult” to “insanely difficult”, while still maintaining enough power in the AI to perform pivotal acts.
My interpretation here is that delineating value from training data (i.e. the value identification problem) for task AGIs was still considered hard at least as late as 2015, even if it was thought to be easier than creating a “complete description of normativity in general”. Another page also spells the problem out pretty clearly, in a way I find clearly consistent with my thesis.[2]
I think GPT-4 represents substantial progress on this problem, specifically because of its ability to “do-what-I-mean” rather than “do-what-I-ask”, identify ambiguities to the user during deployment, and accomplish limited tasks safely. It’s honestly a little hard for me to sympathize with a point of view that says GPT-4 isn’t significant progress along this front, relative to pre-2019 AIs (some part of me was expecting more readers to find this thesis obvious, but apparently it is not). GPT-4 clearly doesn’t do the crazy things you’d naively expect if it weren’t capable of delineating value well from training data.
[1] Eliezer wrote,

An autonomous superintelligence would be the most difficult possible class of AGI to align, requiring total alignment. Coherent extrapolated volition is a proposed alignment target for an autonomous superintelligence, but again, probably not something we should attempt to do on our first try.
[2] Here’s the full page:

Safe plan identification is the problem of how to give a Task AGI training cases, answered queries, abstract instructions, etcetera such that (a) the AGI can thereby identify outcomes in which the task was fulfilled, (b) the AGI can generate an okay plan for getting to some such outcomes without bad side effects, and (c) the user can verify that the resulting plan is actually okay via some series of further questions or user querying. This is the superproblem that includes task identification, as much value identification as is needed to have some idea of the general class of post-task worlds that the user thinks are okay, any further tweaks like low-impact planning or flagging inductive ambiguities, etcetera. This superproblem is distinguished from the entire problem of building a Task AGI because there’s further issues like corrigibility, behaviorism, building the AGI in the first place, etcetera. The safe plan identification superproblem is about communicating the task plus user preferences about side effects and implementation, such that this information allows the AGI to identify a safe plan and for the user to know that a safe plan has been identified.
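The (a)/(b)/(c) decomposition in that quoted page can be read as a simple interaction protocol: identify fulfilling outcomes, generate a plan whose side effects are acceptable, and execute nothing without user verification. Purely as a toy illustration of that control flow (every function and name below is a hypothetical stand-in I made up, not anything from the Arbital page, and the "planning" is trivially stubbed):

```python
# Toy sketch of the safe-plan-identification loop described above:
# (a) identify outcomes that would fulfill the task,
# (b) generate a plan with acceptable side effects,
# (c) let the user verify the plan before anything is executed.
from dataclasses import dataclass, field


@dataclass
class Plan:
    steps: list
    predicted_side_effects: list = field(default_factory=list)


def identify_outcomes(task):
    # (a) hypothetical stand-in: map a task description to target outcomes
    return [f"outcome: {task} fulfilled"]


def generate_plan(outcomes, forbidden_side_effects):
    # (b) hypothetical stand-in: propose steps and reject plans whose
    # predicted side effects fall in the forbidden set
    steps = [f"achieve {o}" for o in outcomes]
    side_effects = []  # a real system would have to predict these
    if any(s in forbidden_side_effects for s in side_effects):
        return None
    return Plan(steps=steps, predicted_side_effects=side_effects)


def safe_plan_identification(task, forbidden_side_effects, approve):
    outcomes = identify_outcomes(task)
    plan = generate_plan(outcomes, forbidden_side_effects)
    # (c) the user inspects the candidate plan (e.g. via further queries)
    # and must approve it; otherwise no plan is returned for execution
    if plan is None or not approve(plan):
        return None
    return plan


plan = safe_plan_identification(
    "cure cancer",
    forbidden_side_effects=["kill patients"],
    approve=lambda p: len(p.predicted_side_effects) == 0,
)
```

The point of the sketch is only that the user sign-off in step (c) gates execution, which is where "communicating the task plus user preferences about side effects" has to have succeeded; all of the actual difficulty lives inside the stubbed-out (a) and (b).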