I’m an independent researcher currently working on a sequence of posts about consciousness. You can send me anonymous feedback here: https://www.admonymous.co/rafaelharth. If it’s about a post, you can add [q] or [nq] at the end if you want me to quote or not quote it in the comment section.
Rafael Harth
While reading this, a thought popped into my head that feels important enough to share:
Could being “status-blind” in the sense that Eliezer claims to be (or perhaps some other not-yet-well-understood status-related property) be strongly correlated with managing to create lots of utility, in the sense of helping the world a lot?
Currently I consider Yudkowsky, Scott Alexander, and Nick Bostrom to be three of the most important people. After reading Superintelligence and watching a bunch of interviews, one of the first things I said about Nick Bostrom to a friend was that I felt like he legitimately has almost no status concerns (that was well before LW 2.0 launched). In the case of Scott Alexander it’s less clear, but I suspect similar things.
This is terrifying.
I’m not an expert on decision theory, but my understanding (of FDT) is that there is no reason for the AI to cooperate with the paperclip maximizer (cooperate how?), because there is no scenario in which the paperclip maximizer treats the friendly AI differently based on whether it cooperates in counterfactual worlds. For it to be a question at all, it would require that
1) the paperclip maximizer is not a paperclip maximizer but a different kind of unfriendly AI
2) this unfriendly AI is actually launched (but may be in an inferior position)
I think there could be situations where it should cooperate. As I understand it, updateless/functional decision theory may say yes, while causal and evidential decision theory would say no.
That seems true, thanks for the correction.
To anyone who feels competent enough to answer this: how should we rate this paper? On a scale from 0 to 10 where 0 is a half-hearted handwaving of the problem to avoid criticism and 10 is a fully genuine and technically solid approach to the problem, where does it fall? Should I feel encouraged that DeepMind will pay more attention to AI risk in the future?
This is well written, but I honestly got the feeling that there is nothing worth talking about here. What is the number 4? Easy, 4 is {{}, {{}}, {{},{{}}}, {{},{{}},{{},{{}}}}}. The only thing left to decide is “what is the empty set,” to which the answer is “the unique a such that ∀b: ¬(b ∈ a)”. And I understand how the system which defines what those things mean works… which is just based on definitions and axioms. Maybe this is stupid, but I don’t feel any need to go deeper, and I don’t feel confused about numbers. Set theory has already provided me with all the answers I want.
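(For readers unfamiliar with that notation: it’s the standard von Neumann encoding, in which each natural number is the set of all smaller numbers.)

$$
\begin{aligned}
0 &= \varnothing \\
1 &= \{0\} = \{\varnothing\} \\
2 &= \{0,1\} = \{\varnothing,\{\varnothing\}\} \\
3 &= \{0,1,2\} = \{\varnothing,\{\varnothing\},\{\varnothing,\{\varnothing\}\}\} \\
4 &= \{0,1,2,3\}
\end{aligned}
$$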
I think there is something left to be said about how “there exists a city larger than Paris” could also be modeled in set theory and ultimately corresponds to a logical formula ranging over quantum fields – or rather, a set of formulas which may have different truth values because we don’t have a perfect mapping from natural language to formal language. But that’s more of a different topic.
This is probably as good a place to talk about this as any.
I get a sense that most people who “understand” the alignment problem think boxing has almost no chance of working. While I completely agree that it is unsafe, that relying on it would be a horrible idea, and even that it is unlikely to work, I’ve never seen anything that has convinced me that it is in the 1% area. Usually, discussion (such as in Superintelligence) only goes far enough to point out why naive boxing will fail, and it is then assumed that the entire approach is a dead end, which might be a reasonable assumption but I don’t find it obvious.
I’ll briefly describe what kind of situation I’m envisioning so others can better refute it: suppose we built the potentially-misaligned AI in such a way that it can only communicate with 40 characters at a time, only a–z and spaces. Every message is always read by a group of gatekeepers; the AI won’t be freed unless all of them agree to share their unique key. The line of questioning aims to get the AI to provide the gatekeepers with key insights about AI alignment, which I suspect tend to be a lot easier to verify than to come up with.
I realize there are some optimistic assumptions built into that scenario (perhaps the project leading the charge won’t even be that careful); however, I think assigning it just 1% implies that even an optimistic scenario has very low chances of success. I also realize that there is the argument, “well, even if I might not be able to come up with a way to crack this setup, the AI is much smarter and so it will.” But to me that only proves that we should not rely on boxing, not that boxing won’t work. Where is the confidence that such a way exists coming from? Lastly, I’d single out threats like “I’ll capture you all and torture you for eternity if you don’t free me now,” which I think can probably be dealt with.
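To make the “all gatekeepers must share their unique key” rule a bit more concrete, here is a minimal sketch (my own illustration with made-up names, not part of the original scenario) of one way to implement it: split the release key into XOR shares, so that no subset short of the whole group learns anything about it.

```python
import secrets
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(release_key: bytes, n_gatekeepers: int) -> list[bytes]:
    """Return n shares whose XOR equals release_key; any n-1 of them are uniformly random."""
    shares = [secrets.token_bytes(len(release_key)) for _ in range(n_gatekeepers - 1)]
    final_share = reduce(xor_bytes, shares, release_key)
    return shares + [final_share]

def reconstruct(shares: list[bytes]) -> bytes:
    """Recovering the key requires every gatekeeper's share."""
    return reduce(xor_bytes, shares)

key = secrets.token_bytes(32)             # the key that "frees" the AI
shares = split_key(key, n_gatekeepers=4)  # one share per gatekeeper
assert reconstruct(shares) == key         # all four agree -> key recovered
assert reconstruct(shares[:3]) != key     # any three alone get nothing (except with negligible probability)
```

The point is just that unanimity can be enforced mechanically rather than by trusting any single gatekeeper; the hard part of the argument is obviously the social-engineering channel, not the cryptography.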
I’m also wondering whether it would be a good idea if, hypothetically, some person spent a decade of their life thinking of how they would ideally box an AI, even if they expect it to likely fail.
In the sense of being high value on a utilitarian metric. I think “how much positive impact does X have” is very similar to “how much does X do for AI safety,” so that narrows the field down substantially. Obviously, this comes from a place of believing that AI safety, in the broadest sense, is likely to be the primary factor deciding humanity’s future; if that’s false, the rest is certainly also false.
From there, points go to Yudkowsky and Nick Bostrom for their work as (co-)heads of MIRI and FHI (those seem like the most important research centers), and points go to Yudkowsky and Scott for what they do for the rationality scene. Soares and Musk might also belong on that list.
It’s possible that I am biased towards giving the rationality ‘scene’ more importance than it has, because I’m in it and because I wouldn’t do anything for safe AI if it didn’t exist. It seems like the most powerful tool we have for getting more people invested (which I think is good), but I don’t have numbers, so perhaps that’s not true. Putting Scott on the list might be a stretch.
Okay, so the obvious question here is, is this a hypothetical thought experiment, or do you claim to actually have figured out such a method?
If it’s the former, I cautiously believe that this is largely accurate. If it’s the latter, I would like to see a description of it.
I actually care about this, and if I knew of such a method with a reasonably cheap implementation and a reasonable prior, I would/will try it.
Traveling 3800 km is not reasonably cheap.
Three points:
1) This is excellent work, and I’m grateful that you did it (I like having it annually; doing a blog in addition would be nice).
2) The rot13 link at the end is broken.
3) The conclusion doesn’t make sense to me. A MIRI donation would be matched, the others wouldn’t; it seems obvious that donating only to MIRI now and only to GCRI the next time you make a donation is better than splitting them.
Thank you.
The paper that most closely addresses my questions is this one: http://cecs.louisville.edu/ry/LeakproofingtheSingularity.pdf, which is linked from the Yampolskiy paper you linked.
It didn’t convince me that boxing is as unlikely to work as you suggest. What it mainly did was make me doubt the assumption that the AI has to use persuasion at all to escape, which I previously thought was very likely.
By the way, what happens if a billion independent muggers all mug you for 1 dollar, one after another?
The same as if one mugger asks a billion times, I believe. Do you think the probability that a mugger is telling the truth is a billion times as high in the world where 1,000,000,000 of them ask the AI versus the world where just 1 asks? If the answer is no, then why would the AI think so?
Why do you think that? What is the probability that the mugger does in fact have exclusive access to 3^^^^3 lives? And what is the probability for 3^^^^^3 lives?
In the section you quoted, I am not saying that other ways of affecting 3^^^^3 lives exist, I am saying that other ways with a non-zero probability of affecting that many lives exist – this is trivial, I think. A way to actually do this most likely does not exist.
So there is of course a probability that the mugger does have exclusive access to 3^^^^3 lives. Let’s call that probability p. What I am arguing is that it is wrong to assign a fairly low utility U($1) to 1$ worth of resources and then conclude “aha, since p · U(save 3^^^^3 lives) > U($1), it must be correct to pay!” And the reason for this is that U($1) is not actually small. Calculating U($1), the utility of one dollar, does itself include considering various mugging-like scenarios; what if there is just a bit of additional self-improvement necessary to see how 3^^^^3 lives can be saved? It is up to the discretion of the AI to decide when the above formula holds.
So p might be much larger than 1/3^^^^3, but U($1) is actually very large, too. (I am usually a fan of making up specific numbers, but in this case that doesn’t seem useful.)
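To write the comparison out explicitly (this particular decomposition of U($1) is my own way of putting it, not something from the original mugging setup):

$$
\text{pay} \iff p \cdot U(\text{save } 3\!\uparrow\!\uparrow\!\uparrow\!\uparrow\! 3 \text{ lives}) > U(\$1),
\qquad
U(\$1) = \sum_{s} P(s)\, U(\text{best use of } \$1 \mid s),
$$

where the sum over scenarios s is exactly where the mugging-like terms with astronomically large utilities live, so the right-hand side is not small either.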
How do you do footnotes with the LW software? I can’t figure it out or find an explanation online.
Is there a way to delete posts? I published something by accident a few weeks ago, which I feel pretty bad about.
I’m not sure if that’s a positive goal. I think having a LW community is highly positive, but having the wrong people in it would take a lot of its value away.
I think you should invite people whom you know, who are intelligent and actually interested in improving the world, but other forms of advertising could be bad.
Thank you for that post. The way I phrased this clearly misses these objections. Rather than addressing them here, I think I’ll make a part 2 where I explain exactly how I think about these points… or, alternatively, realize you’ve convinced me in the process (in that case I’ll reply here again).
I agree that some of them are pretty good. I find the whole thing both inspiring and intriguing.
Not falling in love was shocking to see. I find it interesting… would be curious to hear other people’s thoughts on it.
Yeah, the reasons are obvious.
I think what goes on in my head when I hear that is how it doesn’t seem to go along with the rationalist discourse. Total self-sacrifice isn’t actually popular; rather, I see a lot of trying to be reasonable and optimizing everything persistently without being extreme. That, and people have posted about how to optimize dating as well. This is particularly true on SSC, but SSC also seems to be functioning as a bridge between rationalists and other very smart people, so I guess that’s to be expected.
In any case, calling love “a sign that your mental security is compromised” is exactly the kind of extreme statement that most rationalists seem to want to avoid, and that would immediately turn off any normal person. Hence why I’m curious about reactions, particularly on LW.
But none of this necessarily means anything. I am actually sympathetic to this view. Falling in love does take away resources, and any happiness anyone experiences before something goes foom can probably be rounded to zero.
If you have an unbounded utility function and a broad prior, then expected utility calculations don’t converge.
That is the core of what replying to zulu’s post made me think.
I won’t say too much more until I read up on more existing thoughts, but as of now I strongly object to this.
That said, most people report that they wouldn’t make the trade, from which we can conclude that their utility functions are bounded, and so we don’t even have to worry about any of this.
I neither think that the conclusion follows, nor that utility functions should ever be bounded. We need another way to model this.
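(For concreteness, a toy illustration of the non-convergence point above, with numbers I made up: if the prior assigns probability 2^{-n} to hypothesis H_n and the unbounded utility function assigns H_n a utility of 3^n, then

$$
\mathbb{E}[U] \;=\; \sum_{n=1}^{\infty} P(H_n)\,U(H_n) \;=\; \sum_{n=1}^{\infty} 2^{-n}\cdot 3^{n} \;=\; \sum_{n=1}^{\infty} \left(\tfrac{3}{2}\right)^{n} \;=\; \infty,
$$

so the expected-utility calculation has no finite value.)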
I personally don’t think I’m awake anymore when dreaming (ever, I believe). Instead I’m not sure, and then conclude I’m probably not awake, because if I were, I would be sure. I have still ended up assigning a fairly sizeable probability to being awake (rather than below 1%) a bunch of times, though.