Richard Hollerith, 15 miles north of San Francisco. hruvulum@gmail.com
rhollerith_dot_com
A false statement can cause a reasoner’s beliefs to become more accurate.
Suppose for example that Alice believes falsely that there is an invisible dragon in her garage, but then Bob tells her falsely that all dragons, invisible or not, cannot tolerate the smell of motor oil. Alice decides to believe that, notes that there is a big puddle of motor oil in the center of her garage (because her car leaks oil) and stops believing there is an invisible dragon in her garage.
But by your definition of deception, what Bob told Alice just now is not deceptive because it made Alice’s beliefs more accurate, which is all that matters by your definition.
It would be reasonable for Alice to want Bob never to lie to her even when the lie would make her beliefs more accurate, but there is no way for Alice to specify that desire with your formalism. Nor is there a way for Alice to specify the opposite desire, namely, that a lie would be okay with her as long as it makes her beliefs more accurate. And I cannot see a way to improve your definition to allow her to specify that desire.
In summary, although there might be some application or special circumstance, one you did not describe and that I have been unable to imagine, in which your definition suffices, it does not capture all the nuances of deception in human affairs, and I cannot see a way to make it do so without starting over.
But that is not surprising because formalizing things that matter to humans is really hard. Mathematics progresses mainly by focusing on things that are easy to formalize and resigning itself to having only the most tenuous connection to most of the things humans care about.
My guess is that we don’t have passenger or cargo VTOL airplanes because they would use more energy than the airplanes we use now.
It can be worth the extra energy cost in warplanes since it allows them to operate from ships smaller than the US’s supercarriers and to keep operating despite the common military tactic of destroying the enemy’s runways.
Why do I guess that VTOLs would use more energy? (1) Because hovering expends energy at a higher rate than normal flying. (2) Because the thrust-to-weight ratio of a modern airliner is about 0.25, and of course to hover you need to get that above 1, which means more powerful gas-turbine engines, which means heavier gas-turbine engines, which means the plane gets heavier and consequently less energy-efficient.
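A minimal back-of-the-envelope sketch of point (2). The 0.25 and above-1 thrust-to-weight figures are from the paragraph above; the 75-tonne mass and the 10% hover margin are my own illustrative assumptions.

```python
# Rough sketch: how much more engine thrust a hover-capable airliner would need.
# Only the thrust-to-weight ratios (0.25 vs. >1) come from the comment; the rest is assumed.

def required_thrust_newtons(mass_kg: float, thrust_to_weight: float, g: float = 9.81) -> float:
    """Thrust needed to achieve a given thrust-to-weight ratio at the given mass."""
    return mass_kg * g * thrust_to_weight

mass = 75_000                                         # kg, mid-size airliner at takeoff (assumed)
cruise_design = required_thrust_newtons(mass, 0.25)   # conventional airliner, T/W ~ 0.25
hover_design = required_thrust_newtons(mass, 1.10)    # must exceed 1 to hover; 10% margin assumed

print(f"conventional design: {cruise_design / 1000:.0f} kN of thrust")
print(f"hover-capable design: {hover_design / 1000:.0f} kN of thrust")
print(f"roughly {hover_design / cruise_design:.1f}x the thrust, hence heavier engines")
```

And that is before accounting for the higher rate of energy expenditure while actually hovering (point (1)).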
I do agree that there is a slight ‘creepiness vibe’ with the way the Sequences . . . are written.
IIRC back when he was writing the sequences, Eliezer said that he was psychologically incapable of writing in the typical dry academic manner. I.e., he wouldn’t be able to bring himself to do so even if he knew doing so would improve the reception of his writing.
Maybe he used the word “stuffy”.
The reason Eliezer’s 2004 “coherent extrapolated volition” (CEV) proposal is immune to Goodharting is probably that immunity to Goodharting was one of the main criteria for its creation. I.e., Eliezer came up with it through a process of looking for a design immune to Goodharting. It may very well be that all other published proposals for aligning superintelligent AI are vulnerable to Goodharting.
Goodhart’s law basically says that if we put too much optimization pressure on criterion X, then as a side effect, the optimization process drives criteria Y and Z, which we also care about, higher or lower than we consider reasonable. But that doesn’t apply when criterion X is “everything we value” or “the reflective equilibrium of everything we value”.
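A toy illustration of that dynamic (my own construction with arbitrary numbers, not anything from the post): the harder a proxy criterion X is optimized, the further a neglected criterion Y that happens to trade off against it gets pushed.

```python
# Toy Goodhart demo: selecting ever harder on the proxy X drags down a
# criterion Y that we also care about but never measured. (Illustrative only.)
import random

random.seed(0)

def sample_option():
    x = random.gauss(0, 1)              # the proxy we optimize
    y = random.gauss(0, 1) - 0.5 * x    # a neglected criterion that trades off against x
    return x, y

for pressure in (10, 1_000, 100_000):   # more candidates screened = more optimization pressure
    best = max((sample_option() for _ in range(pressure)), key=lambda xy: xy[0])
    print(f"pressure {pressure:>7}: X = {best[0]:+.2f}, Y = {best[1]:+.2f}")
```

If X were instead defined as “everything we value” (or its reflective equilibrium), there would be no neglected Y left to get pushed around, which is the point about CEV above.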
The problem, of course, is that although the CEV plan is probably within human capabilities to implement (and IMHO Scott Garrabrant’s work is probably a step forward), unaligned AI is probably significantly easier to implement, so it will likely arrive first.
Interesting. I am also curious what size of screen you imagine (plan on) this keyboard being on. One of smartphone size?
I am surprised at how much text you are willing to enter on an on-screen keyboard if indeed you are planning to use this app on a smartphone. (I don’t own a smartphone, but am curious about them.)
How do you imagine you will enter text into the app? Using an on-screen keyboard?
I agree with this post, but not with the choice of one of the words in the title (“One single meme that makes you Less Wrong in general”).
Does it have to be a meme?
Can’t it be a belief, a skill or a habit?
What does HCH mean?
As far as I know, yes. (I’ve never worked for MIRI.)
I agree with this comment. I would add that there is an important sense in which the typical human is not a temporally unstable agent.
It will help to have an example: the typical 9-year-old boy is uninterested in how much the girls in his environment like him and doesn’t necessarily wish to spend time with girls (unless those girls are acting like boys). It is tempting to say that the boy will probably undergo a change in his utility function over the next 5 or so years, but if you want to use the concept of expected utility (defined as the sum of the utilities of the various outcomes weighted by their probabilities), then to keep the math simple you must assume that the boy’s utility function does not change with time. The result is that you must define the utility function to be not the boy’s current preferences, but rather his current preferences (conscious and unconscious) plus the process by which those preferences will change over time.
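For reference, here is the definition in the parenthesis written out in standard notation (nothing in it is specific to the OP): the expected utility of an action $a$ is

$$\mathbb{E}[U \mid a] = \sum_{o} P(o \mid a)\, U(o),$$

where $o$ ranges over outcomes. The simplification being defended above is that $U$ itself carries no time index; the process by which preferences change gets folded into the definition of $U$.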
Humans are even worse at perceiving the process that changes their preferences over time than they are at perceiving their current preferences. (The example of the 9-year-old boy is an exception to that general rule: even 9-year-old boys tend to know that their preferences around girls will probably change in not too many years.) The author of the OP seems to have conflated the goals that the human knows he has with the human’s utility function, whereas the two are quite different.
It might be that there is some subtle point the OP is making about temporally unstable agents that I have not addressed in my comment, but if he expects me to hear him out on it, he should write it up in such a way as to make it clear that he is not just confused about how the concept of the utility function is being applied to AGIs.
I haven’t explained or shown how or why the assumption that the AGI’s utility function is constant over time simplifies the math, and simplifies even an analysis that does not delve into actual math. Briefly, if you want to create a model in which the utility function evolves over time, you have to specify how it evolves, and to keep the model accurate, you have to specify how evidence coming in from the AGI’s senses influences that evolution. But of course sensory information is not the only thing influencing the evolution; we might call the other influence an “outer utility function”. But then why not keep the model simple and assume (define) the goals that the human is aware of to be not terms in a utility function, but rather subgoals? Any intelligent agent will need some machinery to identify and track subgoals. That machinery must modify the priorities of the subgoals in response to evidence coming in from the senses. Why not just require our model to include a model of the subgoal-updating machinery, then equate the things the human perceives as his current goals with subgoals?
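Here is a minimal toy sketch of the kind of model I am gesturing at (all names and numbers below are mine, purely for illustration): the utility function stays fixed for the agent’s whole lifetime, and what the agent perceives as its “goals” are just subgoal priorities that the updating machinery re-weights as evidence comes in.

```python
# Toy model (illustrative only): fixed utility function, evolving subgoal priorities.
from dataclasses import dataclass, field

@dataclass
class Agent:
    # Fixed for the agent's whole lifetime; already encodes how priorities will evolve.
    utility_function: str = "current preferences + the process that will change them"
    # What the agent would report as its "goals" right now.
    subgoal_priority: dict = field(
        default_factory=lambda: {"play with other boys": 0.9, "impress girls": 0.1}
    )

    def update_on_evidence(self, evidence: str) -> None:
        """The subgoal-updating machinery: evidence re-weights subgoals, not the utility function."""
        if evidence == "puberty":
            self.subgoal_priority["impress girls"] = 0.8
            self.subgoal_priority["play with other boys"] = 0.4

agent = Agent()
agent.update_on_evidence("puberty")
print(agent.subgoal_priority)   # the reported "goals" changed; utility_function did not
```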
Here is another way of seeing it. Since a human being is “implemented” using only deterministic laws of physics, the “seed” of all of the human’s behaviors, choices and actions over a lifetime is already present in the human being at birth! Actually, that is not quite true: maybe the human’s brain is hit by a cosmic ray when the human is 7 years old, with the result that the human grows up to like boys whereas, if it weren’t for the cosmic ray, he would like girls. (Humans have evolved to be resistant to such “random” influences, but such influences nevertheless do occasionally happen.) But it is true that the “seed” of all of the human’s behaviors, choices and actions over a lifetime is already present at birth! (That sentence is just a copy of a previous sentence with the words “in the human being” omitted, to take into account the possibility that the “seed” includes a cosmic ray light years away from Earth at the time of the person’s birth.) So assuming that the human’s utility function does not vary over time not only simplifies the math, but is also more physically realistic.
If you define the utility function of a human being the way I have recommended above, you must realize that there are many ways in which humans are unaware of or uncertain about their own utility function, and that the function is very complex (incorporating, for example, the processes that produce cosmic rays), although maybe all you need is an approximation. Still, that is better than defining your model such that the utility function varies over time.
pseudorapey asshole
You probably mean “quasirapey asshole”.
I am surprised that I need to write this, but if killing the humans will decrease P(shutdown) by more than 1e-4, then continuing to refrain from killing the humans is going to worry and weigh on the AI more than a 1e-4 possibility that it is in a simulation. (For simplicity, assume that the possibility of shutdown is currently the dominant danger faced by the AI.)
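To make the comparison explicit, here is the expected-loss arithmetic that argument rests on. Only the 1e-4 figure comes from the paragraph above; the other numbers are made up for illustration.

```python
# Toy expected-loss comparison (illustrative numbers; only 1e-4 is from the comment).
p_simulation = 1e-4        # AI's credence that it is in a simulation and being watched
delta_p_shutdown = 5e-4    # assumed reduction in P(shutdown) from eliminating the humans
loss_if_shut_down = 1.0    # normalized: the AI loses everything it values
loss_if_caught = 1.0       # normalized: the simulators shut it down anyway

expected_loss_of_refraining = delta_p_shutdown * loss_if_shut_down   # risk it keeps bearing
expected_loss_of_acting = p_simulation * loss_if_caught              # risk of being watched

print("acts against the humans" if expected_loss_of_acting < expected_loss_of_refraining
      else "refrains")
```

With any reduction in P(shutdown) larger than the 1e-4 credence in being watched, the comparison comes out against the humans.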
So the AI’s ontological uncertainty is only going to help the humans if the AI sees the humans as being only a very very small danger to it, which actually might lead to a good outcome for the humans if we could arrange for the AI to appear many light years away from Earth--
--which of course is impractical. Alternatively, we could try to assure the AI it is already very safe from the humans, say, because it is in a secure facility guarded by the US military, and the US military has been given very strict instructions by the US government to guard the AI from any humans who might want to shut it down.
But P(an overthrow of the US government) as judged by the AI might already be at least 1e-4, which puts the humans in danger again.
More importantly, I cannot think of any policy where P(US government reverses itself on the policy) can be driven as low as 1e-4. More precisely, there are certain moral positions that humans have been discussing for centuries where P(reversal) might conceivably be driven that low. One such would be, “killing people for no reason other than that it is fun is wrong”. But I cannot think of any policy with that property that hasn’t been discussed for many decades, especially one that exists only to provide an instrumental incentive to a novel class of agents (AIs). In general, instrumental policies have a much higher P(reversal) than deontological ones.
And how do you know that the AI will not judge P(simulation) to be not 1e-4 but rather 1e-8, a standard of reliability and safety no human institution can match?
In summary, yes, the AI’s ontological uncertainty provides some tiny hope for humans, but I can think of better places to put our hope.
I mean, even if we pay for the space launches and the extra cost of providing electrical power to the AI, it doesn’t seem likely that we can convince any of the leading AI labs to start launching their AGI designs into space in the hope of driving the danger that the humans present (as perceived by the AI) so low that acting to extinguish that danger would itself be seen by the AI as even more dangerous.
Luckily I don’t need to show that sufficiently smart AIs don’t engage in trial and error. All I need to show is that they almost certainly do not engage in the particular kind of trial that consists of running a computer program without already knowing whether the program is satisfactory.
You have the seed of a good idea, namely, that an AI will tend to treat us better if it thinks other agents might be watching, provided that there is potential for cooperation between the AI and the watchers in which the cooperation requires the watchers to choose to become more vulnerable to the AI.
But IMO an AI smart enough to be a threat to us will soon rid itself of the kind of (ontological) uncertainty you describe in your first paragraph. I have an argument for my position here that has a big hole in it, but I promise to publish something here soon that attempts to fill the hole to the satisfaction of my doubters.
There is a lot of uncertainty over how effective EMP is at destroying electronics. The potential for destruction was considered great enough that, for example, during the Cold War the US defense establishment bought laptops specially designed to resist EMP, but for all we know even that precaution was unnecessary.
And electronics not connected to long wires are almost certainly safe from EMP.
I was going to respond, but concluded that surely you already know about Sci-hub.
OK, I’ll answer, because I was asked directly.
Next Silicon’s site gives no details on their plans, and they say right away on the linked page that they are “in stealth mode”, so all I know about them is that they make chips for supercomputers that are not optimized for neural networks.
I’d guess that it is less risky for 40 people to go to work for Next Silicon than for one person to go into AI capability research. But it would be safer still if nobody went to work for either.
There are computing jobs that lower x-risk. One such job is to make it easier for people to publish or access information (like the people who run this site do).
I will try to explain (probably via a top-level post, probably not today). For now, I will restate my position.
No superintelligence (SI) that can create programs at all will run any program it has created to get evidence about whether the program is aligned with the SI’s values or interests: the SI already knows that before the program runs for the first time.
The nature of the programming task is such that if you can program well enough, there’s essentially no uncertainty about the matter (barring pathological cases that do not come up in practice unless the SI is in a truly dire situation in which an adversary is messing with core pieces of its mind), similar to how (barring pathological cases) there’s no uncertainty about whether a theorem is true if you have a formal proof of it.
The qualifier “it has created” above is there only because an SI might find itself in a very unusual situation in which it is in its interests to run a program deliberately crafted (by someone else) to have the property that the only practical way for anyone to learn what the SI wants to learn about the program is to run it. Although I acknowledge that such programs definitely exist, the vast majority of programs created by SIs will not have that property.
Are you curious about this position mostly for its own sake or mostly because it might shed light on the question of how much hope there is for us in an SI’s being uncertain about whether it is in a simulation?
Often a comment thread will wander to a topic that has no bearing on the OP. Has that happened here?
Does your most recent comment have any relevance to how much hope we humans should put in the fact that an AI cannot know for sure whether its sensory data has been faked?
I don’t need anything to encourage me to use LessWrong, but if you have something to discourage me from using it, then although I don’t have a need for it now, I am worried enough about the future that I would keep a link to it on my hard drive.