Well, there’s a whole lot of magic going on here.
As I understand the original CEV plan, the idea is that the gadget that derives human values from human brains is itself understood to be more reliable than human brains.
So no, it doesn’t actually make sense according to this theory to say “this is the output we expect from the gadget, according to our brains; let’s compare the actual output of the gadget to the output of our brains and reject the gadget if they don’t match.”
That said, it certainly makes sense to ask “how are we supposed to actually know that we’ve built this gadget in the first place??!??!” I do not understand, and have never understood, how we’re supposed to know on this theory that we’ve actually built the gadget properly and didn’t miss a decimal point somewhere… I’ve been asking this question since I first came across the CEV idea, years ago.
The idea is that if you have a CEV gadget, you can ask it moral questions and it will come up with answers that do, in fact, look good. You wouldn’t have thought of them yourself (because you’re not that clever and don’t have a built-in population ethic, blah blah blah), but once you see them, you definitely prefer them.
That’s… interesting.
Is there a writeup somewhere of why we should expect an unaltered me to endorse the proposals that would, if implemented, best instantiate our coherent extrapolated volition?
I’m having a very hard time seeing why I should expect that; it seems to assume a level of consistency between what I endorse, what I desire, and what I “actually want” (in the CEV sense) that just doesn’t seem true of humans.
I guess the simplest thing I can say is: there’s a lot of stuff we don’t think of because our hypothesis space consists only of things we’ve seen before. We expect that an AGI, being more intelligent than any individual human, could afford a larger hypothesis space and sift it better, which is why it would be capable of coming up with courses of action we value highly but did not, ourselves, invent.
Think retrospectively: nobody living 10,000 years ago would have predicted the existence of bread, beer, baseball, or automobiles. And yet, modern humans find ways to like all of those things (except baseball ;-)).
All else failing, something like CEV or another form of indirect normativity should at least give us an AI Friendly enough that we can try to use an injunction architecture to restrict it to following our orders or something, and it will want to follow the intent behind the orders.
If you’re this skeptical about CEV, would you like to correspond by email about an alternative FAI approach under development, called value learners? I’ve been putting a tiny bit of thought into them on the occasional Saturday. I can send you the Google Doc of my notes.
Well, I certainly agree that there’s lots of things we don’t think about, and that a sufficiently intelligent system can come up with courses of action that humans will endorse, and that humans will like all kinds of things that they would not have endorsed ahead of time… for that matter, humans like all kinds of things that they simultaneously don’t endorse.
And no, not really interested in private discussion of alternate FAI approaches, though if you made a post about it I’d probably read it.
a sufficiently intelligent system can come up with courses of action that humans will endorse, and that humans will like all kinds of things that they would not have endorsed ahead of time… for that matter, humans like all kinds of things that they simultaneously don’t endorse.
Generally we aim to come up with things humans will both like and endorse. Optimizing for “like” but not “endorse” leads to various forms of drugging or wireheading (even if Eliezer does disturb me by being tempted towards such things). Optimizing for “endorse” but not “like” sounds like carrying the dystopia we currently call “real life” to its logical, horrid conclusion.
if you made a post about it I’d probably read it.
How well-founded does a set of notes or thoughts have to be in order to be worth posting here?
we aim to come up with things humans will both like and endorse
(shrug) Well, OK. If I consider the set of plans A which maximize our values when implemented, and the set of plans B which we endorse when they’re explained to us, I’m prepared to believe that the intersection A∩B is nonempty. And really, any technique that stands a chance worth considering of coming up with anything in A is sufficiently outside my experience that I won’t express an opinion about whether it’s noticeably less likely to come up with something in A∩B. So, go for it, I guess.
How well-founded does a set of notes or thoughts have to be in order to be worth posting here?
Depends on whom you ask. I’d say it’s the product of (novel × relevant × concise × entertaining × coherent) that gets compared to a threshold; well-founded is a nice benny but not critical. That said, posts that don’t make the threshold will frequently be berated for being ill-founded if they are.
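Purely as an illustration of that heuristic, here is a minimal Python sketch; the factor scores and threshold are made-up assumptions for the example, not anything the site actually computes.

```python
# Toy sketch of the "product of factors vs. threshold" heuristic described above.
# The factor names come from the comment; the scores and threshold are illustrative only.

FACTORS = ("novel", "relevant", "concise", "entertaining", "coherent")

def worth_posting(scores: dict[str, float], threshold: float = 0.1) -> bool:
    """Multiply the five factor scores (each in 0..1) and compare to a threshold."""
    product = 1.0
    for factor in FACTORS:
        product *= scores[factor]
    return product >= threshold

# Example: a draft that is strong on most factors but only middling on novelty.
draft = {"novel": 0.4, "relevant": 0.9, "concise": 0.8, "entertaining": 0.7, "coherent": 0.9}
print(worth_posting(draft))  # product is about 0.18, which clears the 0.1 threshold -> True
```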