I think I am more interested in you reading The Genie Knows but Doesn’t Care and responding to the points in there than in the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate
Looking over that it just seems to be a straightforward extrapolation of EY’s earlier points, so I’m not sure why you thought it was especially relevant.
Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (e.g. by pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts
Yeah—this is his core argument against Hibbard. I think Hibbard 2001 would object to ‘low-powered’, and would probably have other objections I’m not modelling, but regardless I don’t find this controversial.
It is likely that by the time an AI understands what humans actually really want, we will not have much control over its training process, and so despite it then understanding those constraints, we will have no power to shape its goals towards them.
Yeah, in agreement with what I said earlier:
Notice I said “before it killed us”. Sure the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant because we need to instill its utility function long before that.
...
Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal
I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp, clear concept of a goal, then having the AI optimize for it (point towards it) is straightforward. Long-term robustness may be a different issue, but I’m assuming that’s mostly covered under “very crisp, clear concept”.
The issue, of course, is that what humans actually want is hard for us even to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can’t fully formally specify it.
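To make the “learn a proxy” point concrete, here is a minimal sketch of the standard preference-comparison setup: a small network is scored with a Bradley-Terry style loss so that human-preferred examples score higher. Everything here (the RewardModel class, preference_loss, the shapes) is an illustrative assumption, not a claim about how anyone actually does this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch: a tiny "reward model" that learns a scalar proxy for
# human preferences from pairwise comparisons, rather than from a formal
# specification of what we want.
class RewardModel(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar "goodness" score per input
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states).squeeze(-1)

def preference_loss(model, preferred, rejected):
    # P(preferred beats rejected) = sigmoid(r_pref - r_rej);
    # minimize the negative log-likelihood of the human's choices.
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# Hypothetical usage: batches of state encodings a human labeler ranked.
model = RewardModel(state_dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred, rejected = torch.randn(32, 16), torch.randn(32, 16)
loss = preference_loss(model, preferred, rejected)
loss.backward()
opt.step()
```

The learned scores are of course only a proxy, which is exactly where the robustness worries come back in.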
which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
I think this is something of a red herring. Humans can reasonably predict the utility functions of other humans in complex scenarios simply by simulating the other as self, i.e. through empathy. Also, happiness probably isn’t the correct thing to optimize for; we probably want the AI to optimize for our empowerment (future optionality), but that’s a whole separate discussion.
So no, I don’t think Hibbard’s approach would work.
Sure, neither do I.
Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work.
A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number. But a k-way categorical variable is just an explicit binned model of a log(k)-bit number, so these really aren’t that different, and there are many interpolations between the two. (And in fact, sometimes it’s better to use the more expensive categorical model for regression.)
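To illustrate the interpolation, here is a toy sketch (made-up bin range, random logits, and a hypothetical helper categorical_to_utility): a k-way classifier whose classes are bins over a utility range collapses back into a real-valued utility by taking the expectation over bin centers.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Sketch: read a k-way categorical output as a real-valued utility.
# Each class is one bin over [lo, hi]; the distribution over bins
# collapses to a scalar via its expectation.
def categorical_to_utility(logits, lo=-1.0, hi=1.0):
    k = logits.shape[-1]
    bin_centers = np.linspace(lo, hi, k)        # class i = utility bin i
    probs = softmax(logits)
    return (probs * bin_centers).sum(axis=-1)   # expected utility (real number)

# Hypothetical example: a 16-way classifier head over some state encoding.
logits = np.random.randn(4, 16)                 # batch of 4 inputs
print(categorical_to_utility(logits))           # 4 scalar utilities in [-1, 1]
```

Going the other way (binning a regression target into k classes) is the “more expensive categorical model for regression” case.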
Like, what do you actually concretely propose we do after we have a classifier over video frames?
Video frames? The utility function needs to be over future predicted world states... which you could, I guess, use to render out videos, but text renderings are probably more natural.
I propose we actually learn how the brain works, and how evolution solved alignment, to better understand our values and reverse engineer them. That is probably the safest approach—having a complete understanding of the brain.
However, I’m also somewhat optimistic about theoretical approaches that focus more explicitly on optimizing for external empowerment (which is simpler and crisper), and about how that could be approximated pragmatically with current ML approaches. Those two topics will probably be my next posts.
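As a teaser for what “empowerment” means concretely: under the simplifying assumption of deterministic dynamics, n-step empowerment (the channel capacity between action sequences and resulting states) reduces to log2 of the number of distinct states reachable in n steps. A toy sketch in a hypothetical 5x5 grid world, purely illustrative:

```python
import math
from itertools import product

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]
WIDTH, HEIGHT = 5, 5  # hypothetical 5x5 grid world

def step(state, action):
    # Deterministic dynamics; bumping into a wall wastes the action.
    x, y = state
    nx, ny = x + action[0], y + action[1]
    return (nx, ny) if 0 <= nx < WIDTH and 0 <= ny < HEIGHT else state

def empowerment(state, n):
    # With deterministic dynamics, max_p(a) I(A; S') is just
    # log2(number of distinct reachable states): future optionality.
    reachable = set()
    for action_seq in product(ACTIONS, repeat=n):
        s = state
        for a in action_seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# A corner state has fewer reachable futures than the center,
# so it is less empowered.
print(empowerment((0, 0), n=2), empowerment((2, 2), n=2))
```

The hard part, and what the follow-up posts would need to address, is approximating this quantity for realistic stochastic environments with learned models rather than by brute-force enumeration.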