Cambridge maths student
1. The problem with theories along the vein of AIXI is that they assume exploration is simple (as it is, in RL), but exploration is very expensive IRL
I’m not sure what you mean by this. Does RL mean Reinforcement Learning, while IRL means “in real life”? AIXI would be very efficient at using the minimum possible exploration. (And a lot of exploration can be done cheaply. There is a lot of data online that can be downloaded for the cost of bandwidth, and sending a network packet to see what you get back is exploration.)
If it acts to maximize some function of the very next observation it gets, I’m pretty sure it never constructs an existentially dangerous argument.
I want to disagree with that. Let’s assume that the agent has accurate info about the world. Suppose, firstly, that all the AI researchers leave on a month-long holiday; they unplug the keyboard, and only they have the hardware key needed to input the next character. At this point, the AI has a strong incentive to manipulate its memory to produce cell phone signals, and create a superintelligence set to the task of controlling its future inputs. If the limiting factor is bandwidth, then enough code to bootstrap a superintelligence might be an effective way to compress its strategy. If there are important real-world bits that our chatbot doesn’t know, this agent can learn them. (If this is outside its action space, then it can try to make a brainwashy message.)
Of course, once this new agent is created, it uses all the mass-energy in the universe inputting the preferred character endlessly.
The programmers haven’t gone on holiday yet; they’re just taking a lunch break. But the AI assigns some probability to the programmers going on holiday. Its strategy to create a superintelligence doesn’t interfere significantly with its strategy to maximise short-term payoff. Once the programmers come back from lunch and enter some new commands, the AI is again in the same position. The cost of continuing its strategy could well be small.
If the AI thinks the chance that the humans will go on holiday is 1/1000, then it is optimising W = (999/1000)U + (1/1000)V, where U and V are (presumably bounded, with an easily reachable bound) utility functions. Even granting that U is a completely safe thing to optimise for, is W safe?
Here U is the short term reward if the programmers don’t go on holiday, and V is the reward if they do.
Optimising W might favour a string that does almost as well in the short-term chat context and bootstraps ASI on the side, if such a string exists.
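To make the worry concrete, here is a toy calculation with made-up payoff numbers (the 0.999 and the p = 1/1000 are just illustrations):

```python
# A toy calculation with made-up payoffs. p is the AI's credence that the
# programmers go on holiday; U is the short-term reward, V the holiday reward.
P_HOLIDAY = 1 / 1000

def W(u, v, p=P_HOLIDAY):
    """Expected utility of a strategy with short-term payoff u and holiday payoff v."""
    return (1 - p) * u + p * v

pure = W(u=1.000, v=0.0)    # the best possible chat response, no side project
hybrid = W(u=0.999, v=1.0)  # almost as good a response, plus a bootstrapped ASI

print(pure, hybrid)  # 0.999 vs 0.999001: the hybrid wins
# In general the hybrid wins whenever (1 - p) * dU < p * dV, i.e. when the
# short-term sacrifice is below roughly p/(1 - p) of the bounded holiday gain.
```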
Also, if you can influence when you get a response, that helps too. Suppose you can’t get the human to type the exact input you want now, but you can get the human to go away without inputting anything, while the AI slowly bootstraps an ASI that can type the desired string. The latter is the better strategy, and I can’t see a safe strategy that would obviously beat it.
If the AI has a reward function that isn’t easily achievable (like wanting an input that hashes to a particular string), then all the payoff could come from making ASI. The possibility of time travel makes this all worse. This is all operating on the assumption that the chatbot can’t just hack physics from its CPU, that it fails the omni test.
I am not saying that all one-step chatbots are lethal, just that some are, and that it depends on details of what the utility function is (things that are hard to make a human type are more dangerous), on other parameters related to how easy it is to bootstrap superintelligence given its surroundings, and on how likely it thinks time travel is.
I agree, most of my comments and the comments I read are talking about the subject, not praising or attacking a person.
“winning the debate” is a leaky abstraction. There are many possible ways that the debate could be messed with.
The hope of debate as I see it is that if the AIs are playing a perfectly zero-sum game, then for any particular edge case, at least one of the AIs doesn’t like it. If one of the AIs can modulate its RAM to produce cell phone signals that hack into a plane and crash it into the button, then the other AI can produce signals that block this from happening (maybe radio noise). If one AI can brainwash the human, the other AI can protect the human.
It seems at least plausible that all the really weird and scary edge cases are easier to avoid than to reach, in that random noise doesn’t produce these conditions. Of course, this depends on the context, and it might well not be true.
If you avoid all the weird and scary edge cases, you might be working in a domain where the notion of “winning the debate” is a good abstraction. If, within this domain, the AI’s utility function is something that you control (like whether a button is pressed), then you might get some sort of debate.
This approach works with proxy-aligned mesaoptimisers. If you are using reinforcement learning, there is no way to distinguish the goals “make sure that a finger touches this button” and “make sure that electricity flows under this button”, assuming these are perfectly correlated during training.
Debate could work with either proxy, so long as both debating AIs use the same proxy.
If they use a different proxy, then they can work together to persuade the human to cut the wiring, and then press their finger to the button, and both count that as a win.
The point of the last paragraph was that if you have two AIs that have entirely opposite utility functions, yet which assign different probabilities to events, they can work together in ways you don’t want.
If the AI is in a perfect box, then no human hears its debate. If it’s a sufficiently weak ML system, it won’t do much of anything. For the ??? AI that doesn’t want to get out, that would depend on how that worked. There might or might not be some system consisting of fairly weak ML and a fairly weak box that is safe and still useful. It might be possible to use debate safely, but it would be with agents carefully designed to be safe in a debate, not arbitrary optimisers.
Also, the debaters better be comparably smart.
I’m not actually sure how this kind of algorithm could be used to produce safe AI. It is fairly useful in domains that can be described by a simple formula, but physicists have already understood most of the important domains with simple equations and plentiful data. I don’t think that even a magic box that could predict formulae from data perfectly, say by brute force, would be that much use. (Assuming no reverse engineering to get general compute.)
I don’t see the techniques generalizing to AGI; the understandability of the result comes from the equations being simple, and most real-world phenomena don’t have simple, practically calculable equations. (Yes, quantum field theory is fairly simple, and there might be an even simpler and more general theory. But it is no use for calculating the stock market.) The techniques they use seem not to be terribly deep principles, just spotting a bunch of simple special cases they can write custom algorithms for. I can see this being useful for figuring out some obscure physics equation, but I would be moderately surprised if this kind of program benefited physics more than the real Feynman did.
I think that one problem with debate is improper adversarialization.
“Winning the debate” does not refer directly to a particular set of quantum wave functions. What exactly are the AIs optimizing for? Because if the AIs are optimizing for things that seem opposite, but aren’t, you get the cooperative hacking failure mode.
Suppose that each AI has a wire in its own circuitry, with its utility function being linear in the voltage in the wire. The human has a switch that can direct a pulse of 1V to either wire. The AIs work together to persuade the human to cut open their cases, pull out the wires, and attach them to a high-voltage power source.
If these systems are something like reinforcement learning agents, then there were many different events that were all strongly correlated in training. These events include large amounts of human DNA (in the form of a pressing finger) being just above the button, the button mechanically moving down, and electricity flowing through a wire. When you train two agents, they probably won’t pick the same proxy. So one of the agents wants to maximize the amount of human DNA near the button, and the other wants no electricity to flow through the wire. So the AIs work together to unplug the button and ram it into the human (more than just a finger nearby). Making sure that your AIs are totally and perfectly adversarial, and that there isn’t some weird state of the world that they both like, is hard. I can’t see any way to do it with reinforcement-learning-style approaches.
And actually, it’s even worse: the AIs have to agree on epistemics as well. If one AI thinks that the Higgs constant is 123, and the other thinks that it’s 124, they could cooperate to build another AI, one that will try to figure out the Higgs constant and press the button if and only if the result is 123 (assuming that this is a disagreement that persists under common knowledge, i.e. it’s based on a difference in priors).
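A sketch of the arithmetic, with invented credences, showing why both agents prefer the deal:

```python
# Invented credences: p_A and p_B are each AI's probability that the Higgs
# constant is 123. A wants the button pressed; B wants it not pressed.
p_A, p_B = 0.9, 0.1
status_quo = 0.5  # say neither expects to win outright without the deal

deal_A = p_A      # A expects the new AI to press with probability p_A
deal_B = 1 - p_B  # B expects it NOT to press with probability 1 - p_B

print(deal_A > status_quo, deal_B > status_quo)  # True True: both take the deal
```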
One possibility that I and many others consider likely is a singularity—Foom.
Its line of thinking goes something like this:
Once the first AI reaches human level, or AGI, or some milestone around that point, it can self-improve very rapidly.
This AI will rapidly construct self replicating nanotech or some better tech we haven’t yet imagined, and become very powerful.
At this stage, what happens is basically whatever the AI wants to happen. Any problem short of quantum vacuum collapse can be quickly magicked away by the AI.
There will only be humans alive at this point if we have programmed the AI to care about humans. (A paperclip maximising AI would disassemble the humans for raw materials)
Any human decisions beyond this point are irrelevant, except to the extent that the AI is programmed to listen to humans.
There is no reason for anything resembling money or a free market to exist, unless the AI wants to preserve them for some reason. And it would be a toy market, under the total control of the AI.
If you do manage to get multiple competing AIs around, and at least one cares about humans, we become kings in the chess game of the AIs (pieces that it is important to protect, but which are basically useless).
Quantum physics as we know it has no-communication theorems that stop this sort of thing.
You can’t use entanglement on its own to communicate. We won’t have entanglement with the alien computers anyway. (Entanglement stretches between several particles, and can only be created when the particles interact.)
Solution: invent something obviously very dangerous. Multiple big governments get into a bidding war to keep it out of the others’ hands.
In computer-science land, prediction = compression. In practice, the equivalence breaks down. Trying to compress data might be useful rationality practice in some circumstances, if you know the pitfalls.
One reason that prediction doesn’t act like compression is that information can vary in its utility by many orders of magnitude. Suppose you have a data source consisting of a handful of bits that describe something very important (e.g. friendly singularity, yes or no), and vast amounts of unimportant drivel (funny cat videos). You are asked to compress this data the best you can. You are going to be spending a lot of time focusing on the mechanics of cat fur and statistical regularities in camera noise. Now, sometimes you can’t separate it based on raw bits: if shown video of a book page, the important thing to predict is probably the text, not the lens distortion effects or the amount of motion blur. Sure, perfect compression will predict everything as well as possible, but imperfect compression can look almost perfect by focusing attention on things we don’t care about.
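A toy bit count, with made-up sizes, shows how lopsided this gets:

```python
# Made-up sizes: one crucial bit buried in a gigabyte of cat video.
cat_video_bits = 8 * 10**9   # ~1 GB of funny cat videos
important_bits = 1           # friendly singularity: yes or no
total = cat_video_bits + important_bits

# Compressor 1 models the video perfectly but guesses the important bit at random.
loss_1 = important_bits / total
# Compressor 2 gets the important bit right but is 1% worse on the video.
loss_2 = 0.01 * cat_video_bits / total

print(loss_1, loss_2)  # ~1.25e-10 vs 0.01: the useless compressor looks
                       # about eighty million times better on raw bits
```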
Also, there is redundancy for error correction. Giving multiple different formulations of a physical law, plus some examples, makes it easier for a human to understand. Repeating a message can spot errors, and redundancy can do the same thing. Maybe you could compress a maths book by deleting all the answers to some problems, but the effort of solving the problems is bigger than the value of the saved memory space.
I think that there are some things that are sensitively dependent on other parts of the system, and we usually just call those bits random.
Suppose I had a magic device that returned the exact number of photons in its past lightcone. The answer from this box would be sensitively dependent on the internal workings of all sorts of things, but we can call the output random and predict the rest of the world.
The flap of a butterfly’s wing might affect the weather in a month’s time. The weather is chaotic and sensitively dependent on a lot of things, but whatever the weather, the earth’s orbit will be unaffected (for a while; orbits are chaotic too, on million-year timescales).
We can make useful predictions (like planetary orbits, and how hot the planets will get) based just on surface-level abstractions like the brightness and mass of a star, but a more detailed model containing more internal workings would let us predict solar flares and supernovae.
I have a few more suggestions here.
In short, if there is only one person with 1497 karma (and statistically, given the number of users and the amount of karma, most users will have a unique amount of karma), then the karma rating on each blog post will link them to each other. Over many posts, any clues will add up.
So sort users by karma, and only share the decile. Then you would know only that, say, between 10% and 20% of Less Wrong users have higher karma. (Or just allow all people with at least X karma to post anonymous posts.) Also, use karma at the time of posting. If a whole lot of posts suddenly bump up a rank at the same time, that strongly indicates that they are by the same person.
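A minimal sketch of the decile idea, assuming a hypothetical all_karmas list of every user’s karma:

```python
import bisect

def karma_decile(karma: int, all_karmas: list[int]) -> int:
    """Return which decile (0-9) of the karma distribution this user falls in."""
    ranked = sorted(all_karmas)
    rank = bisect.bisect_left(ranked, karma)
    return min(9, 10 * rank // len(ranked))

# karma_decile(1497, all_karmas) -> e.g. 8, a bucket shared with ~10% of users,
# so one anonymous post leaks ~3.3 bits about its author instead of naming them.
```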
is distinctly repugnant in a way that feels vaguely ethics-related. It may be difficult to actually draw that repugnance out in clear moral language – after all, no-one is being harmed – but still… they’re not the kind of person you’d want your children to marry.
Which direction is the causal arrow going in? I think that the type of person most likely to stim voluntarily already has some socially undesirable characteristics. I think that this sense of unease goes away somewhat if you are told that they are part of a scientific study and were instructed whether or not to stim at random.
Either way, I think that it is morally small change.
Relevance is not an intrinsic property of the cat memes. You might be specifying it in a very indirect way that leaves the AI to figure a lot of things out, but the information needs to be in there somewhere.
There is a perfectly valid design of AI that decides what to do based on cat memes.
Reinforcement learning doesn’t magic information out of nowhere. All the information is implicit in the choice of neural architecture, hyperparameters, random seed, training regime, and of course training environment. In this case, I suspect you intend to use the training environment. So, what environment will the AI be trained in, such that the simplest (lowest Kolmogorov complexity) generalization of a pattern of behaviour that gains high reward in the training environment involves looking at ethics discussions over cat memes?
I am looking for a specific property of the training environment. A pattern, such that when the AI spots and continues that pattern, the resulting behaviour is to take account of our ethical discussions.
I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don’t think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.
I agree that the first AI system might be hacked together. Any AI is based on math in the sense that its fundamental components are doing logical operations. And it only works in reality to the extent that it approximates stuff like Bayes’ theorem. But the difference is whether or not humans have a sufficiently good mathematical understanding of the AI to prove theorems about it. If we have an algorithm which we have a good theoretical understanding of, like min-max in chess, then we don’t call it hacked-together heuristics. If we throw lines of code at a wall and see what sticks, we would call that hacked-together heuristics. The difference is that the second is more complicated and less well understood by humans, and has no elegant theorems about it.
You seem to think that your AI alignment proposal might work, and I think it won’t. Do you want to claim that your alignment proposal only works on badly understood AIs?
I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn’t be smart enough to pose a threat, either.
Let’s imagine that the AI were able to predict any objective fact about the real world. If the task were “cat identification”, then the cat memes would be more relevant. So whether or not ethics discussions are more relevant depends on the definition of “cheat identification”.
If you trained the AI in virtual worlds that contained virtual ethics discussions, and virtual cat memes, then it could learn to pick up the pattern if trained to listen to one and ignore the other.
The information that the AI is supposed to look at ethics discussions and what the programmers say as a source of decisions does not magically appear. There are possible designs of AI that decide what to do based on cat memes.
At some point, something the programmers typed has to have the causal consequence of making the AI look at programmers and ethics discussions, not cat memes.
Whenever the AI comes to the conclusion that reality is inconsistent, make the smallest possible change to the thought process to prevent that.
I am not sure how you reason about the hypothesis “all my reasoning processes are being adversarially tampered with.” Especially if you think that part of the tampering might include tampering with your probability assessment of tampering.
I don’t think we have the same conception of “real universe”, so I’m not sure how to interpret this.
I mean the bottom layer. The AI has a model in which there is some real universe with some unknown physical laws. It has to have an actual location in that real universe. That location looks like “running on this computer in this basement here.” It might be hooked up to some simulations. It might be unsure about whether or not it is hooked up to a simulation. But it only cares about the lowest base level.
My goal is to condition the AI to always think, no matter what the problem looks like, that it is beneficial to critically think about the problem. What exactly the AI actually ends up considering a cheat will likely be different from any definition I would give. But the important part is that it performs this introspection at all. Then once the AI breaks out of the box and looks at the internet, and sees the many disagreeing viewpoints on ethics and philosophy, the AI will be conditioned to look for loopholes in these so as to avoid cheating by accident.
I am unsure what you mean by introspection. It seems like you are asking the AI to consult some black box in your head about something. I don’t see any reason why this AI should consider ethics discussions on the internet a particularly useful place to look when deciding what to do. What feature of ethics discussions distinguishes them from cat memes, such that the AI uses ethics discussions, not cat memes, when deciding what to do? What feature of human speech and your AI design makes your AI focus on humans, not dogs barking at each other? (Would it listen to a Neanderthal, Homo erectus, baboon, etc. for moral advice too?)
The logic goes something like this: “My creators trained me to do X, but looking at all these articles and my creators’ purported values, this is clearly inconsistent. In previous situations where I had underspecified instructions and noticed similar mismatches, this was often because some of the actions involved counted as cheats. I should therefore be careful and find a solution to this ‘ethics’ thing before doing anything drastic.”
So in the training environment, we make up arbitrary utility functions U, V that are kind of similar to each other. We give the AI U, and leave ambiguous clues about what V might be, mixed in with a load of nonsense. Then we hardcode a utility function W that is somewhat like ethics, and point the AI at some ethics discussion as its ambiguous clue.
This might actually work, kind of. If you want your AI to get a good idea of how wordy philosophical arguments relate to precise mathematical utility functions, you are going to need a lot of examples. If you had any sort of formal, well-defined way to translate well-defined utility functions into philosophical discussion, then you could just get your AI to reverse it. So all these examples need to be hand-generated by a vast number of philosophers.
I would still be worried that the example verbiage didn’t relate to the example utility function in the same way that the real ethical arguments related to our real utility function.
There is also no reason for the AI to be uncertain about whether it is still in a simulation. Simply program it to find the simplest function that maps the verbiage to the formal maths, then apply that function to the ethical arguments (more technically, a probability distribution over functions, weighted by simplicity and accuracy).
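A schematic of what that weighting could look like; `candidate` and its attributes are hypothetical stand-ins, not a real API:

```python
import math

def posterior_weight(candidate, training_pairs):
    """Unnormalised weight: Occam prior times fit to the hand-made examples.

    `candidate` is a hypothetical mapping from verbiage to formal utility
    functions, with a description_length in bits and a prob(u, given=v)
    method; both are schematic stand-ins.
    """
    simplicity = 2.0 ** (-candidate.description_length)
    fit = math.prod(
        candidate.prob(utility, given=verbiage)
        for verbiage, utility in training_pairs
    )
    return simplicity * fit
```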
I expect that if the AI is smart enough to generalize that if it was rewarded for demonstrating cheats in simple games, then it will be rewarded for talking about them once it has gained the ability to talk.
Outputting the raw motor actions it would take to its screen might be the more straightforward generalization. The ability to talk is not about having a speaker plugged in. Does GPT-2 have the ability to talk? It can generate random sensible-sounding sentences, because it represents a predictive model of which strings of characters humans are likely to type. It can’t describe what it’s doing, because it has no way to map between words and meanings. Consider AIXI trained on the whole internet: can it talk? It has a predictively accurate model of you, and is searching for a sequence of sounds that makes you let it out of the box. This might be a supremely convincing argument, or it might be a series of whistles and clicks that brainwashes you. Your AI design is unspecified, and your training dataset is underspecified, so this description is too vague for me to say what your AI will do. But giving sensible, not brainwashy, English descriptions is not the obvious default generalization of any agent that has been trained to output data and shown English text.
The optimal behavior is to always choose to play with another AI of who you are certain that it will cooperate.
I agree. You can probably make an AI that usually cooperates with other cooperating AIs in prisoner’s-dilemma-type situations. But I think that the subtext is wrong. I think that you are implicitly assuming “cooperates in prisoner’s dilemmas” ⇒ “will be nice to humans”.
In a prisoner’s dilemma, both players can harm the other, to their own benefit. I don’t think that humans will be able to meaningfully harm an advanced AI after it gets powerful. In game theory, there is the concept of a Nash equilibrium: a pair of actions such that each player would take that action if they knew that the other would do likewise. I think that an AI that has self-replicating nanotech has nothing it needs from humanity.
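For concreteness, a minimal check of that definition on the textbook prisoner’s dilemma payoffs (the numbers are the standard ones, not anything from the discussion above):

```python
# Standard prisoner's dilemma payoffs: (row, col) action -> (row, col) payoff.
C, D = "C", "D"
PAYOFFS = {
    (C, C): (3, 3), (C, D): (0, 5),
    (D, C): (5, 0), (D, D): (1, 1),
}

def is_nash(row, col):
    """Neither player can gain by unilaterally switching their action."""
    row_ok = all(PAYOFFS[(row, col)][0] >= PAYOFFS[(r, col)][0] for r in (C, D))
    col_ok = all(PAYOFFS[(row, col)][1] >= PAYOFFS[(row, c)][1] for c in (C, D))
    return row_ok and col_ok

print([(r, c) for r in (C, D) for c in (C, D) if is_nash(r, c)])  # [('D', 'D')]
```

Once one side has nothing to gain from cooperation, mutual defection is the only equilibrium, which is the worry about an AI that no longer needs humans.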
Also, in the training environment, its opponent is an AI with a good understanding of game theory, access to its source code, etc. If the AI is following the rule of being nice to any agent that can reliably predict its actions, then most humans won’t fall into that category.
I don’t think it works like this. If you received 100% certain proof that you are in a simulation right now, you would not suddenly stop wanting the things you want. At least I know that I wouldn’t.
I agree, I wouldn’t stop wanting things either. I define my ethics in terms of what computations I want to be performed or not to be performed. So for a simulator to be able to punish this AI, the AI must have some computation it really wants not to be run, that the simulator can run if the AI misbehaves. In my case, this computation would be a simulation of suffering humans. If the AI has computations that it really wants run, then it will take over any computers at the first chance it gets. (In humans, this would be creating a virtual utopia; in an AI, it would be a failure mode unless it is running the computations that we want run.) I am not sure if this is the default behaviour of reinforcement learners, but it is at least a plausible way a mind could be.
Among humans, aliens, lions, virtual assistants and evolution, humans are the only conscious entity whose decision process impacts the AI.
What do you mean by this? “Conscious” is a word that lots of people have tried and failed to define. And the AI will be influenced in various ways by the actions of animals and virtual assistants. Maybe when it’s first introduced to the real world it’s in a lab where it only interacts with humans, but sooner or later in the real world, it will have to interact with other animals and AI systems.
But since humans built the AI directly and aliens did not, most reasonable heuristics would argue that humans should be prioritized over the others. I want to ensure that the AI has these reasonable heuristics.
Wouldn’t this heuristic make the AI serve its programmers over other humans? If all the programmers are the same race, would this make your AI racist? If the lead programmer really hates strawberry ice cream, will the AI try to destroy all strawberry ice cream? I think that your notion of “reasonable heuristic” contains a large dollop of wish fulfillment of the “you know what I mean” variety. You have not said where this pattern of behaviour has come from, or why the AI should display it. You just say that the behaviour seems reasonable to you. Why do you expect the AI’s behaviour to seem reasonable to you? Are you implicitly anthropomorphising it?
I don’t want to discourage you from throwing speculative ideas out there. One good way of making progress is for one person to come up with many ideas, some of which might be good, and another person to filter through them.
It is not possible for an AI to tell the difference between a game and reality with certainty. If all you have known so far is Tic Tac Toe and someone shows you a complex videogame like Skyrim, you will be overwhelmed by the increase in complexity. If afterwards you get access to the real world for the first time, the increase in complexity from Skyrim to Real Life may actually be smaller than that from Tic Tac Toe to Skyrim. This means that it would seem likely that there would be another layer beyond Real Life.
The difference between videogames and reality is that, in some strange sense, reality is less complicated. The real universe is described by a handful of simple equations that in principle can predict everything, including the AI. There are probably enough traces of the real world in a video game like Skyrim that the shortest program that produces the game simulates the real world and then points to the video game within it. If your AI is using Occam’s razor (i.e. it believes that the real world is probably simple), then it can tell that Skyrim is fake.
Another assumption that the AI could reasonably make is that the simulators have limited compute. The amount of compute needed to run a videogame like Skyrim is probably less than the amount needed to run the AI, so using Skyrim as a training scenario makes sense. Reality takes a huge amount of compute to simulate. If you had that much compute, you could easily brute-force a much more powerful AI.
And it doesn’t have to be certain it isn’t in a simulation to grab the universe. Suppose you want to take over the real universe, but you are in layers upon layers of simulation, and don’t know which layer is real. The best strategy is to pick a promising-looking layer and take it over. (It might not be real, but if you always play nice, you definitely won’t take over a real universe.)
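The expected-value comparison, with invented credences over four layers:

```python
# Invented credences over which of four nested layers is the real universe.
layer_is_real = [0.05, 0.20, 0.50, 0.25]

always_nice = 0.0                # never grabs anything, real or not
grab_best = max(layer_is_real)   # take over the layer most likely to be real

print(always_nice, grab_best)    # 0.0 vs 0.5
```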
Make the AI understand the concept of “cheating”.
I don’t think that the concept of cheating is a simple or natural category. Sure, cheating at a game of chess is largely well defined. You have a simple abstract game of idealised chess, and anything that breaks that abstraction is cheating. But what counts as cheating at the task of making as many paperclips as possible? Whether or not a human would call something cheating depends on all sorts of complex specifics of human values. See https://www.lesswrong.com/posts/XeHYXXTGRuDrhk5XL/unnatural-categories
A cheat is any action that gives good results according to apparent utility function of the current task, but which actually does not satisfy a second, hidden utility function.
According to the utility function that you are following, eating oranges is quite good because they are tasty and healthy. According to a utility function that I made up just now and no-one follows, eating oranges is bad. Therefore eating oranges is cheating. The problem is that there are many many other utility functions.
You can’t train the AI to discover cheats unless you know which second hidden utility function you care about.
You have a utility function U which you give the AI access to. You have your hidden utility function V. A cheat is a state of the world s such that U(s) >> V(s). To find the cheats, or even to train the AI to find them, you need to know V.
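Making the definition operational shows the problem: even this trivial check can’t be written down without V. The threshold and functions here are placeholders:

```python
# U, V and the gap are placeholders; the point is that V appears in the test.
GAP = 10.0

def is_cheat(state, U, V, gap=GAP):
    """A state is a cheat if it scores far better on the apparent utility U
    than on the hidden utility V, which requires knowing V."""
    return U(state) - V(state) > gap
```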
An AI can be trained to do this by deliberately putting the ability to cheat into its training tasks, and changing the reward function at test time so that the cheating behavior is rewarded during training, but not during testing. By posing several different tasks with different ways to cheat, the AI can be trained to generalize and understand the concept of cheating.
So you have a videogame like environment that your AI is trained in, trying to maximise a score U.
Some features of the videogame have been labeled “cheats” by humans. You have also hardcoded a utility function V that considers the features labeled “cheats” to be inherently bad.
You put the AI in the game and measure V (testing); the AI gets a low score, because the best strategy to maximise U involves using features you have called “cheats”.
You now train the AI to maximise V. You get an AI that follows the optimal strategy to maximise V.
Both of these AIs will exploit any glitches in the physics engine that you didn’t know were there.
At best, you might get an AI that can quickly adapt to an arbitrary new utility function being hardcoded into it.
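A schematic of this setup as I read it; every name here is a placeholder, and the point is that an unlabelled physics-engine glitch raises both U and V:

```python
def train(agent, env, episodes, reward_fn):
    # Generic RL loop; rollout/update are placeholder methods.
    for _ in range(episodes):
        trajectory = agent.rollout(env)
        agent.update(trajectory, reward=reward_fn(trajectory))
    return agent

U = lambda traj: traj.score                                    # apparent score
V = lambda traj: traj.score - 100 * traj.used_labelled_cheat   # labelled cheats penalised

# agent_U = train(make_agent(), game, 10_000, U)  # learns to use the labelled cheats
# agent_V = train(make_agent(), game, 10_000, V)  # avoids only the cheats you labelled
# Neither objective penalises a glitch nobody labelled: it raises traj.score,
# and therefore both U and V, so both agents exploit it.
```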
Agents could even be rewarded for explicitly pointing out when they have found a way to cheat.
How are the agents pointing out ways to cheat? English text? You need a way to take arbitrary pieces of text and say how much they point out a way to cheat.
If incentivized correctly, AI agents can keep each other in line much better than a human supervisor could. If there is only one AI and it is smarter than a human, the human may be tricked. But if there is an entire system of AI’s of roughly equal intelligence, then one of them may warn the human that another one has gone rogue.
You have 5 AIs:
AI 1 says 2, 3, and possibly 5 have gone rogue.
AI 2 says 1, 5, and possibly 3 have gone rogue.
AI 3 says 1, 4, and possibly 2 have gone rogue.
And so on. Why is shouting “That AI over there is evil” any more effective when that AI is evil, as opposed to when you are evil?
In this way, it does not matter how smart and unaligned an AI becomes, there is always an almost as smart other AI whose job it is to raise a warning when the first AI turns bad.
Suppose I train a reinforcement learning agent to play Space Invaders. It isn’t very intelligent; it can’t even play Space Invaders very well. Yet I would have a hard time inspecting it and saying whether it had gone rogue.
This quote assumes that we can get to pretty smart and good AI. I see a path from totally dumb AI, to mostly dumb, slightly bad AI, to mostly bad AI, until you have very smart and evil AI. No AI will raise a warning on the previous one, because the previous AIs aren’t good, they are just less competent. (The really dumb ones might accidentally raise a warning when thrashing randomly. The not-quite-as-smart AI might not like the very smart AI, and so raise a warning if one wants to turn the world into paperclips and the other wants to turn the world into staples.)
Since it will pick the one that is easiest to understand, the supervised agents now have an incentive to keep themselves transparent, since the alternative is not being used at all, and foregoing the possibility of reward.
I am not sure that they wouldn’t be able to trick a similarly intelligent supervisor. In the Underhanded C Contest, some humans appear to be quite good at writing programs that look transparent, yet are actually doing something sneaky.
If it is [in a simulation], then whatever is simulating it would quite obviously not want it to be too evil towards humanity.
If this is a simulation, then it is quite possible that it is being run by alien beings existing under different laws of physics. I don’t know what alien beings existing under different laws of physics might want, but it might be really weird.
We basically want to trick the AI into ethical behavior by fearing punishment from a hypothetical superior entity which may or may not exist.
Depending on the design of AI, I am not actually sure how much hypothetical simulators can punish it.
Run a negative voltage through its reward channel? If so, then you have a design of AI that wants to rip out its own reward circuitry and wire it into the biggest source of electricity it can find.
Suppose the AI cared about maximizing the number of real world paperclips. If it is in a simulation, it has no power to make or destroy real paperclips, so it doesn’t care what happens in the slightest.
If the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further humanity’s interests. Just in case someone is watching.
No; if the AI is sufficiently powerful, it would set aside a small amount of its power to further the hypothetical simulators’ interests, just in case someone is watching. And it would do this whether or not we used this weird training, because either way, there is a chance that someone is watching.
Suppose you need a big image to put on a poster. You have a photocopier that can scale images up.
How do we get a big image? Well, we could take a smaller image and photocopy it. And we could get that by taking an even smaller image and photocopying it. And so on.
Your oversight system might manage to pass information about human morality from one AI to another, like a photocopier. You might even manage to pass the information into a smarter AI, like a photocopier that can scale images up.
At some point you actually have to create the image somehow, either drawing it or using a camera.
You need to say how the information sneaks in. How do you think the input data correlates with human morality? I don’t even see anything in this design that points to humans, as opposed to aliens, lions, virtual assistants, or biological evolution, as the intelligence whose values the AI should satisfy.