Interesting short thread on this here.
Also, did you mean “wasn’t”? :)
Lol, you got me.
I don’t know who, if anyone, noted the obvious fallacy in Berkeley’s master argument prior to Russell in 1912.
Not even Moore in 1903?
Russell’s criticism is in line with Moore’s famous ‘The Refutation of Idealism’ (1903), where he argues that if one recognizes the act-object distinction within conscious states, one can see that the object is independent of the act.
Isn’t that the same argument Russell was making?
Why does it matter so much that we point exactly to be human?
Should that be “to the human” instead of “to be human”? Wasn’t sure if you meant to say simply that, or if more words got dropped.
Or maybe it was supposed to be: “matter so much that what we point exactly to be human?”
FWIW, this reminds me of Holden Karnofsky’s formulation of Tool AI (from his 2012 post, Thoughts on the Singularity Institute):
Another way of putting this is that a “tool” has an underlying instruction set that conceptually looks like: “(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc.” An “agent,” by contrast, has an underlying instruction set that conceptually looks like: “(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A.” In any AI where (1) is separable (by the programmers) as a distinct step, (2) can be set to the “tool” version rather than the “agent” version, and this separability is in fact present with most/all modern software. Note that in the “tool” version, neither step (1) nor step (2) (nor the combination) constitutes an instruction to maximize a parameter—to describe a program of this kind as “wanting” something is a category error, and there is no reason to expect its step (2) to be deceptive.
If I understand correctly, his “agent” is your Consequentialist AI, and his “tool” is your Decoupled AI 1.
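To make the split concrete, here’s a minimal sketch (stub names and structure are mine, not Karnofsky’s); the only difference between the two versions is what step (2) does with the output of step (1):

```python
# Toy rendering of the tool/agent distinction; everything is a placeholder.

def calculate_best_action(data):
    """Step (1), shared by both versions: find the action A that would
    maximize parameter P, based on data set D. Stubbed for illustration."""
    return {"action": "A", "expected_P": 0.9}

def tool_step_2(data):
    """'Tool' version of step (2): summarize the calculation for the user."""
    plan = calculate_best_action(data)
    return f"Recommended: {plan['action']} (expected P = {plan['expected_P']})"

def agent_step_2(data, actuator):
    """'Agent' version of step (2): execute action A directly."""
    actuator(calculate_best_action(data)["action"])
```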
Here’s my summary: reward uncertainty through some extension of a CIRL-like setup, accounting for human irrationality through our scientific knowledge, doing aggregate preference utilitarianism for all of the humans on the planet, discounting people by how well their beliefs map to reality, perhaps downweighting motivations such as envy (to mitigate the problem of everyone wanting positional goods).
Perhaps a dumb question, but is “reward” being used as a noun or verb here? Are we rewarding uncertainty, or is “reward uncertainty” a goal we’re trying to achieve?
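FWIW, the reading I’m familiar with from the CIRL literature is the noun phrase: the agent maintains uncertainty over the (human’s) reward function and maximizes expected reward under that uncertainty, rather than assuming a known reward. A minimal sketch of that reading (candidate rewards and weights made up):

```python
# The agent doesn't know the true reward function; it keeps a posterior
# over candidate reward functions and maximizes *expected* reward.

candidate_rewards = [
    lambda a: 1.0 if a == "act_now" else 0.0,    # hypothesis 1
    lambda a: 1.0 if a == "ask_human" else 0.2,  # hypothesis 2
]
posterior = [0.3, 0.7]  # beliefs over hypotheses, given observed behavior

def expected_reward(action):
    return sum(p * r(action) for p, r in zip(posterior, candidate_rewards))

best = max(["act_now", "ask_human"], key=expected_reward)
print(best)  # "ask_human": under uncertainty, deferring scores higher
```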
Incidentally, a similar consideration leads me to want to avoid re-using old metaphors when explaining things. If you use multiple metaphors you can triangulate on the meaning—errors in the listener’s understanding will interfere destructively, leaving something closer to what you actually meant.
For this reason, I’ve been frustrated that we keep using “maximize paperclips” as the stand-in for a misaligned utility function. And I think reusing the exact same example again and again has contributed to the misunderstanding Eliezer describes here:
Original usage and intended meaning: The problem with turning the future over to just any superintelligence is that its utility function may have its attainable maximum at states we’d see as very low-value, even from the most cosmopolitan standpoint.
Misunderstood and widespread meaning: The first AGI ever to arise could show up in a paperclip factory (instead of a research lab specifically trying to do that). And then because AIs just mechanically carry out orders, it does what the humans had in mind, but too much of it.
If we’d found a bunch of different ways to say the first thing, and hadn’t just said, “maximize paperclips” every time, then I think the misunderstanding would have been less likely.
One mini-habit I have is to try to check my work in a different way from the way I produced it.
For example, if I’m copying down a large number (or string of characters, etc.), then when I double-check it, I read off the transcribed number backwards. I figure this way my brain is less likely to go “Yes yes, I’ve seen this already” and skip over any discrepancy.
And in general I look for ways to do the same kind of thing in other situations, such that checking is not just a repeat of the original process.
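A programming analogue of the same habit (example mine): verify a result against independent properties instead of re-running the process that produced it.

```python
from collections import Counter

def check_sort(original, result):
    """Check a sort without sorting again: test two independent properties
    that any correct output must satisfy."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    same_multiset = Counter(original) == Counter(result)
    return in_order and same_multiset

assert check_sort([3, 1, 2], [1, 2, 3])
assert not check_sort([3, 1, 2], [1, 2, 2])  # right order, wrong elements
```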
And I think claim 5 is basically in line with what, say, Bostrom would discuss (where stabilization is a thing to do before we attempt to build a sovereign).
You mean in the sense of stabilizing the whole world? I’d be surprised if that’s what Yann had in mind. I took him just to mean building a specialized AI to be a check on a single other AI.
Maybe try out giving people a quite-short optional prompt about why they upvoted or downvoted things.
I like this idea.
act-based = based on short-term preferences-on-reflection
For others who were confused about what “short-term preferences-on-reflection” would mean, I found this comment and its reply to be helpful.
Putting it into my own words: short-term preferences-on-reflection are about what you would want to happen in the near term, if you had a long time to think about it.
By way of illustration, AlphaZero’s long-term preference is to win the chess game, its short-term preference is whatever its policy network spits out as the best move to make next, and its short-term preference-on-reflection is the move it wants to make next after doing a fuck-ton of MCTS.
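In toy code (nothing like real AlphaZero internals; the stubs are mine, just to label the three notions):

```python
import random

def policy_net(state):
    """Stub policy network: a prior over candidate moves."""
    return {"e4": 0.5, "d4": 0.3, "Nf3": 0.2}

def mcts(state, prior, n_simulations=10_000):
    """Stub search: refine the prior with (fake) rollout values."""
    totals = {move: 0.0 for move in prior}
    for _ in range(n_simulations):
        move = random.choices(list(prior), weights=list(prior.values()))[0]
        totals[move] += random.random()  # stand-in for a rollout outcome
    return max(totals, key=totals.get)

state = "initial position"
long_term_preference = "win the game"              # terminal goal
prior = policy_net(state)
short_term_preference = max(prior, key=prior.get)  # raw policy's pick
short_term_on_reflection = mcts(state, prior)      # pick after heavy search
```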
and that displacement cells both exist and exist in neocortex
Both exist and exist?
there is a troll who will blow up the bridge with you on it, if you cross it “for a dumb reason”
Does this way of writing “if” mean the same thing as “iff”, i.e. “if and only if”?
I can’t resist giving this pair of rather incongruous quotes from the paper
Could you spell out what makes the quotes incongruous with each other? It’s not jumping out at me.
1 billion per year per W/m^2 of reduced forcing
For others who weren’t sure what “reduced forcing” refers to: https://en.wikipedia.org/wiki/Radiative_forcing
And to put that number in context, the “net anthropogenic component” of radiative forcing appears to be about 1.5 W/m^2 (according to an image in the wikipedia article), so canceling out the anthropogenic component would have an ongoing cost of 1.5 billion per year.
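Spelling out the arithmetic (and assuming the cost scales linearly with the forcing offset): 1.5 W/m^2 × 1 billion per year per W/m^2 = 1.5 billion per year.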
Or you could imagine writing for a smarter but less knowledgeable person. E.g. 10 y.o. Feynman.
Okay, that is probably not that good a characterization.
I appreciate the caveat, but I’m actually not seeing the connection at all. What is the relationship you see between common sense and surprisingly simple solutions to problems?
Could enough human-imitating artificial agents (running much faster than people) prevent unfriendly AGI from being made?
This seems very related to the question of whether uploads would be safer than some other kind of AGI. Offhand, I remember a comment from Eliezer suggesting that he thought that would be safer (but that uploads would be unlikely to happen first).
Not sure how common that view is though.
Acquiring data: put a group of people in a house with a computer. Show them things (images, videos, audio files, etc.) and give them a chance to respond at the keyboard. Their keyboard actions are the actions, and everything between actions is an observation. Then learn the policy of the group of humans.
Wouldn’t this take an enormous amount of observation time to generate enough data to learn a human-imitating policy?
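For concreteness, here’s what the “learn the policy” step might look like as behavioral cloning (architecture, dimensions, and data below are placeholders of mine); the number of logged (observation, keystroke) pairs a net like this would need is exactly the worry:

```python
import torch
import torch.nn as nn

OBS_DIM, N_KEYS = 2048, 128  # encoded observation size; possible key actions

# Placeholder policy: predict the humans' next keystroke from an encoding
# of everything observed since their last action.
policy = nn.Sequential(
    nn.Linear(OBS_DIM, 512),
    nn.ReLU(),
    nn.Linear(512, N_KEYS),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(observations, actions):
    """One supervised step on logged (observation, keystroke) pairs."""
    logits = policy(observations)    # (batch, N_KEYS)
    loss = loss_fn(logits, actions)  # actions: (batch,) key indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake batch standing in for the house logs:
train_step(torch.randn(32, OBS_DIM), torch.randint(0, N_KEYS, (32,)))
```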
Just want to note that I like your distinctions between Algorithm Land and the Real World and also between Level-1 optimization and Level-2 optimization.
I think some discussion of AI safety hasn’t been clear enough on what kind of optimization we expect in which domains. At least, it wasn’t clear to me.
But a couple things fell into place for me about 6 months ago, which very much rhyme with your two distinctions:
1) Inexploitability only makes sense relative to a utility function, and if the AI’s utility function is orthogonal to yours (e.g. because it is operating in Algorithm Land), then it may be exploitable relative to your utility function, even though it’s inexploitable relative to its own utility function. See this comment (and thanks to Rohin for the post that prompted the thought).
2) While some process that’s optimizing super-hard for an outcome in Algorithm Land may bleed out into affecting the Real World, this would sort of be by accident, and seems much easier to mitigate than a process that’s trying to affect the Real World on purpose. See this comment.
Putting them together, a randomly selected superintelligence doesn’t care about atoms, or about macroscopic events unfolding through time (roughly the domain of what we care about). And just because we run it on a computer that from our perspective is embedded in this macroscopic world, and that uses macroscopic resources (compute time, energy), doesn’t mean it’s going to start caring about macroscopic Real World events, or start fighting with us for those resources. (At least, not in a Level-2 way.)
On the other hand, powerful computing systems we build are not going to be randomly selected from the space of possible programs. We’ll have economic incentives to create systems that do consider and operate on the Real World.
So it seems to me that a randomly selected superintelligence may not actually be dangerous (because it doesn’t care about being unplugged—that’s a macroscopic concept that seems simple and natural from our perspective, but would not actually correspond to something in most utility functions), but that the superintelligent systems anyone is likely to actually build will be much more likely to be dangerous (because they will model and/or act on the Real World).