Your results don’t seem to show upper bounds on the amount of hardware overhang, or very strong lower bounds. What concrete progress has been made in chess-playing algorithms since Deep Blue? As far as I can tell, it is a collection of reasonably small, chess-specific tricks: better opening libraries, low-level performance tricks, tricks for fine-tuning evaluation functions, ways of representing chessboards in memory that shave 5 processor cycles off computing a knight’s move, etc.
Whether P=PSPACE is still an open question. Chess is brute forcible in PSPACE. If P=PSPACE, and the overhead is a small constant, then it might be possible to program a vacuum tube machine to play perfect chess. Alternatively, showing that Stockfish is orders of magnitude better than Deep Blue doesn’t actually show that there is a hardware overhang for chess. It shows that there was one when Deep Blue was created.
Also note that there never was a hardware overhang for integer binary addition. The problem is simple enough that the first answer that any reasonably smart person can come up with is nearly optimal. (It is fairly straightforward to do addition with only a few logic gates per input, and since the output depends on every input bit (changing any single bit changes the output), you need at least one logic gate per input.) It is plausible that playing chess is a problem simple enough that we have figured out how to do it nearly optimally, whereas on other problems there is a hardware overhang.
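To make the "few logic gates per input" claim concrete, here is a minimal sketch of a textbook ripple-carry adder that counts the two-input gates it uses. The function names (`full_adder`, `ripple_carry_add`) and the choice of a 5-gate full adder are illustrative assumptions, not anything from the original post.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built from 5 two-input gates (2 XOR, 2 AND, 1 OR)."""
    partial = a ^ b                                # XOR gate 1
    total = partial ^ carry_in                     # XOR gate 2
    carry_out = (a & b) | (partial & carry_in)     # AND, AND, OR
    return total, carry_out, 5


def ripple_carry_add(x, y, width):
    """Add two `width`-bit non-negative integers; return (sum, gates used)."""
    carry, result, gates = 0, 0, 0
    for i in range(width):
        bit, carry, used = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
        gates += used
    result |= carry << width   # the final carry becomes the top output bit
    return result, gates


total, gates = ripple_carry_add(13, 11, width=8)
gates_per_input_bit = gates / (2 * 8)   # two 8-bit operands = 16 input bits
print(total, gates, gates_per_input_bit)
```

Under these assumptions the adder uses 2.5 gates per input bit, which is within a small constant factor of the one-gate-per-input lower bound mentioned above.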
If we assume that the expert human brain is about equally efficient at AI design, general programming, aeroplane engineering and a variety of other STEM tasks (a plausible assumption, given that these tasks seem similarly far from the environment of evolutionary adaptedness), then it should take a similar amount of compute to display top human performance in these as in chess. On the other hand, doing arithmetic used to be a skilled job, and the compute needed for superhuman chess (with current algorithms) is way higher than that needed for arithmetic.
Right, it’s for randomly distributed rewards. But if I show a property holds for reward functions generically, then it isn’t necessarily enough to say “we’re going to try to provide goals without that property”. Can we provide reward functions without that property?
Every specific attempt so far has been seemingly unsuccessful (unless you want the AI to choose a policy at random or shut down immediately). The hope might be that future goals/capability research will help, but I’m not personally convinced that researchers will receive good Bayesian evidence via their subhuman-AI experimental results.
I agree it’s relevant that we will try to build helpful agents, and might naturally get better at that. I don’t know that it makes me feel much better about future objectives being outer aligned.
ETA: also, I was referring to the point you made when I said
“the results don’t prove how hard it is to tweak the reward function distribution, to avoid instrumental convergence”
Now the perhaps harder step is trying to get traction on them
Yes, very much so. We’re working on a few parts of this now, as part of a different project, but I agree that it’s tricky. And there are a number of other things that seem like potentially very useful projects if others are interested in collaborations, or just some ideas / suggestions about how they could be approached.
(On the tables, unfortunately the tables were pasted in as images from another program. We should definitely see if we can get higher-resolution, even if we can’t convert to text easily.)
I hadn’t heard of the Delphi method before, so this paper brought it to my attention.
It’s nice to see concrete forecasting questions laid out in a principled way. Now the perhaps harder step is trying to get traction on them ;^).
Note: The tables on pages 9 and 10 are a little blurry to read. They are also not text, so it’s not easy to copy-paste them into another format for better viewing. I think it’d be good to update the images to be clearer, or to translate them into text tables.
I wholeheartedly agree. I think this implies:
Getting very clear on what we want. Can we give a fairly technical specification of the kind of safety that’s necessary+possible?
Some degree of safety beyond tool-type non-malignancy. A proposal which I keep thinking about is my consent-based helpfulness. The idea is that, in addition to believing that you want something (with sufficient confidence), the system should also believe that you understand the implications of that thing (with some kind of sufficient detail). In the fusion example, the system would engage the user in conversation until it was clear that the consequences for society were understood and approved of.
Note that the fusion power example could be answered directly with a value-alignment type approach, where you have an agent rather than a tool—the agent infers your values, and infers that you would not really want backyard fusion power if it put the world at risk. That’s the moral that I imagine people more into value learning would give to your story. But I’m reaching further afield for solutions, because:
Value learning systems could Goodhart on the approximate values learned
Value learning systems are not corrigible if they become overly confident (which could happen at test time due to unforeseen flaws in the system’s reasoning—hence the desire for corrigibility)
Value learning systems could manipulate the human.
But the theorems are evidence that RL leads to catastrophe at optimum, at least.
RL with a randomly chosen reward leads to catastrophe at optimum.
I proved that optimal policies are generally power-seeking in MDPs.
The proof is for randomly distributed rewards.
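As a rough illustration of what "power-seeking under randomly distributed rewards" can look like, here is a toy sketch. It is not the construction used in the theorems. It assumes a hypothetical deterministic environment in which the agent either commits to a single terminal state ("narrow") or moves to a hub from which it can reach any of `num_options` terminal states ("wide"), with terminal rewards drawn independently and uniformly from [0, 1].

```python
import random


def fraction_choosing_wide(num_options, trials=20_000, seed=0):
    """Estimate how often the optimal policy prefers the state with more options."""
    rng = random.Random(seed)
    wide = 0
    for _ in range(trials):
        narrow_reward = rng.random()                       # the single reachable terminal
        wide_rewards = [rng.random() for _ in range(num_options)]
        # The optimal policy heads for whichever route contains the best terminal.
        if max(wide_rewards) > narrow_reward:
            wide += 1
    return wide / trials


fractions = {k: fraction_choosing_wide(k) for k in (1, 3, 10)}
for k, frac in fractions.items():
    print(f"{k} options behind the hub: optimal policy goes wide in {frac:.1%} of draws")
```

In this toy setup the fraction should sit near k/(k+1), so the more options a state keeps open, the more random reward functions favour it. Whether that carries over to reward functions chosen by designers is exactly the point under dispute in this thread.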
Ben’s main critique is that the goals evolve in tandem with capabilities, and goals will be determined by what humans care about. These are specific reasons to deny the conclusion of analysis of random rewards.
(A random Python program will error with near-certainty, yet somehow I still manage to write Python programs that don’t error.)
I do agree that this isn’t enough reason to say “there is no risk”, but it surely is important for determining absolute levels of risk. (See also this comment by Ben.)
Thanks! I’ll try to read that.
Any of the risks of being like a group of humans, only much faster, apply. There are also the mesa alignment issues. I suspect that a sufficiently powerful GPT-n might form deceptively aligned mesa optimisers.
I would also worry that off distribution attractors could be malign and intelligent.
Suppose you give GPT-n an off training distribution prompt. You get it to generate text from this prompt. Sometimes it might wander back into the distribution, other times it might stay off distribution. How wide is the border between processes that are safely imitating humans, and processes that aren’t performing significant optimization?
You could get “viruses”, patterns of text that encourage GPT-n to repeat them so they don’t drop out of context. GPT-n already has an accurate world model, a world model that probably models the thought processes of humans in detail. You have all the components needed to create powerful malign intelligences, and a process that smashes them together indiscriminately.
I’m somewhat hopeful that this is right, but I’m also not so confident that I feel like we can ignore the risks of GPT-N.
For example, this post makes the argument that, because of GPT’s design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans, because it’s optimizing for imitating existing human writing, not saying true things. On the other hand, it’s managing to do powerful things it wasn’t trained for, like solving math equations we have no reason to believe it saw in the training set, or writing code it hasn’t seen before. So even if GPT-N isn’t trained to say true things and isn’t really capable of more than humans are, it might still function like a Hansonian em and be dangerous by simply doing what humans can do, only much faster.
I recommend skipping to the next post. This post was kind of a stub, the next one explains the same idea better.
I am usually reasonably good at translating from math to non-abstract intuitive examples...but I didn’t have much success here. Do you have an “in English, for simpletons” example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like “there are many hidden variables mediating the interactions between observables” :D.)
Otherwise, my current abstract interpretation of this is something like: “There are detailed models, and those might vary a lot. And then there are very abstract models, which will be more similar to each other...well, except that they might also be totally useless.” So I was hoping that a more specific example would clarify things for a bit and tell me whether there is more to this (and also whether I got it all wrong or not :-).)
If you ask GPT-n to produce a design for a fusion reactor, all the prompts that talk about fusion are going to say that a working reactor hasn’t yet been built, or imitate cranks or works of fiction.
It seems unlikely that a text predictor could pick up enough info about fusion to be able to design a working reactor, without figuring out that humans haven’t made any fusion reactors that produce net power.
If you did somehow get a response, the level of safety you would get is the level a typical human would display (conditional on the prompt). If some information is an obvious infohazard, such that no human capable of coming up with it would share it, then such data won’t be in GPT-n’s training dataset, and won’t be predicted. However, the process of conditioning might amplify tiny probabilities of human failure.
Suppose that any easy design of fusion reactor could be turned into a bomb. And ignore cranks and fiction. Then suppose 99% of people who invented a fusion reactor would realize this, and stay quiet. The other 1% would write an article that starts with “To make a fusion reactor …”. Then this prompt will cause GPT-n to generate the article that a human who didn’t notice the danger would come up with.
This also applies to dangers like leaking radiation, or just blowing up randomly if your materials weren’t pure enough.
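The conditioning argument above is just Bayes’ rule, and can be sketched with the numbers from the example. The function name `p_unaware_given_article` and the second scenario (where 1% of the people who do notice the danger publish anyway) are assumptions added for illustration.

```python
from fractions import Fraction


def p_unaware_given_article(p_aware, p_publish_if_aware, p_publish_if_unaware):
    """P(author missed the danger | an article was written), by Bayes' rule."""
    p_unaware = 1 - p_aware
    published_unaware = p_unaware * p_publish_if_unaware
    published_aware = p_aware * p_publish_if_aware
    return published_unaware / (published_aware + published_unaware)


prior_unaware = Fraction(1, 100)

# The example as stated: everyone who notices stays quiet, everyone else publishes.
posterior = p_unaware_given_article(Fraction(99, 100), Fraction(0), Fraction(1))

# Hypothetical variant: 1% of those who notice publish anyway.
posterior_leaky = p_unaware_given_article(Fraction(99, 100), Fraction(1, 100), Fraction(1))

print(prior_unaware, posterior, posterior_leaky)
```

A 1% prior chance of missing the danger becomes certainty once we condition on the article existing, and stays at roughly one half even in the variant where a few cautious inventors also publish.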
Yeah, this makes much more sense.
The claim here is that either (a) the AI in question doesn’t achieve the main value prop of AI (i.e. reasoning about systems too complicated for humans), or (b) the system itself has to do the work of making sure it’s safe.
I see the intuitive appeal of this claim, but it seems too strong. I suspect that if we look at accident rates, they’ll have been going down over time, at least for the last few centuries. It seems like they can continue going down, to an asymptote of zero, in the same way they have so far: we become better at understanding how accidents happen and more careful in how we use dangerous technologies. We already use tools for this (in software, we use debuggers, profilers, type systems, etc.) or delegate to other humans (as in a large company). We can continue to do so with AI systems.
I buy that eventually “most of the work” has to be done by the AI system, but it seems plausible that this won’t happen until well after advanced AI, and that advanced AI will help us in getting there. And so, from a what-should-we-do perspective, it’s fine to rely on humans for some aspects of safety in the short term (though of course it would be preferable to delegate entirely to a system we knew was safe and beneficial).
(Why bother relying on humans? If you want to build a goal-directed AI system, it sure seems better if it’s under the control of some human, rather than not. It’s not clear what a plausible option is if you can’t have the AI system under the control of some human.)
In the die-roll analogy, the hope is the rate at which you roll dice approximately decays exponentially, so that you only roll an asymptotically constant number of dice.
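A small sketch of why an exponentially decaying roll rate gives a bounded total: the per-period counts form a geometric series. The starting rate of 100 rolls per period and the decay factor of 0.5 are made-up numbers, and `expected_total_rolls` is a hypothetical helper.

```python
def expected_total_rolls(initial_rate, decay, periods):
    """Total rolls over `periods` periods when the per-period rate shrinks by `decay`."""
    return sum(initial_rate * decay ** t for t in range(periods))


initial_rate, decay = 100, 0.5
finite = [expected_total_rolls(initial_rate, decay, n) for n in (10, 100, 1000)]
limit = initial_rate / (1 - decay)   # closed form of the infinite geometric series

print(finite, limit)
```

However long the horizon, the total never exceeds `limit`, so the overall chance of a bad roll stays bounded below one rather than creeping up to certainty.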
You definitely don’t understand what I’m getting at here, but I’m not yet sure exactly where the inductive gap is. I’ll emphasize a few particular things; let me know if any of this helps.
There’s this story about an airplane (I think the B-52 originally?) where the levers for the flaps and landing gear were identical and right next to each other. Pilots kept coming in to land, and accidentally retracting the landing gear. The point of the story is that this is a design problem with the plane more than a mistake on the pilots’ part; the problem was fixed by putting a little rubber wheel on the landing gear lever. If we put two identical levers right next to each other, it’s basically inevitable that mistakes will be made; that’s bad interface design.
AI has a similar problem, but far more severe, because the systems to which we are interfacing are far more conceptually complicated. If we have confusing interfaces on AI, which allow people to shoot the world in the foot, then the world will inevitably be shot in the foot, just like putting two identical levers next to each other guarantees that the wrong one will sometimes be pulled.
For tool AI in particular, the key piece is this:
the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans—which are exactly the systems/problems where safety issues are likely to be nonobvious. If we’re going to unlock the full value of AI at all, we’ll need to use it on problems where humans do not know the relevant safety issues.
The claim here is that either (a) the AI in question doesn’t achieve the main value prop of AI (i.e. reasoning about systems too complicated for humans), or (b) the system itself has to do the work of making sure it’s safe. If neither of those conditions is met, then mistakes will absolutely be made regularly. The human operator cannot be trusted to make sure what they’re asking for is safe, because they will definitely make mistakes.
On the other hand, if the AI itself is able to evaluate whether its outputs are safe, then we can potentially achieve very high levels of safety. It could plausibly never go wrong over the lifetime of the universe. Just like, if you design a tablesaw with an automatic shut-off, it could plausibly never cut off anybody’s finger. But if you design a tablesaw without an automatic shut-off, it is near-certain to cut off a finger from time to time. That level of safety can be achieved, in general, but it cannot be achieved while relying on the human operator not making mistakes.
Coming at it from a different angle: if a safety problem is handled by a system’s designer, then their die-roll happens once up-front. If that die-roll comes out favorably, then the system is safe (at least with respect to the problem under consideration); it avoids the problem by design. On the other hand, if a safety problem is left to the system’s users, then a die-roll happens every time the system is used, so inevitably some of those die rolls will come out unfavorably. Thus the importance of designing AI for safety up-front, rather than relying on users to use it safely.
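The die-roll comparison can be sketched numerically. The 1% failure chance per roll is an arbitrary illustrative number, and `p_any_failure` is a hypothetical helper, assuming the rolls are independent.

```python
def p_any_failure(p_per_roll, rolls):
    """Probability that at least one of `rolls` independent die-rolls comes out badly."""
    return 1 - (1 - p_per_roll) ** rolls


p = 0.01

# Safety handled by the designer: one roll, up front.
designer = p_any_failure(p, 1)

# Safety left to users: one roll per use.
users = [p_any_failure(p, n) for n in (1, 100, 10_000)]

print(designer, users)
```

With one roll the risk stays at 1%. With a roll per use it is already around 63% after 100 uses and is essentially certain after 10,000.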
Is it more clear what I’m getting at now and/or does this prompt further questions?
I feel like I don’t understand what you’re getting at here. It seems to me like you’re saying “X cannot be guaranteed to never cause bad things”, which seems to me to be so obvious that it’s not worth mentioning.
Nothing is inherently safe. There’s always the possibility that you’ve neglected some way by which danger could arise. What I want from arguments for risk is an argument that it is high EV to address the risk (often broken down into importance, tractability, neglectedness), or at least that the risk is reasonably high on probability * harm (which focuses just on importance).
Of course Tool AI isn’t inherently safe. A knife isn’t inherently safe, despite being a tool; you wouldn’t give a knife to a baby. The argument is usually that Tool AI would, by design, not be adversarially optimizing against humans, because it isn’t optimizing at all, and so a particular class of risks typically considered in the AI safety community would be avoided. It is unclear whether such an argument can work, but it’s certainly not “it is impossible for bad things to happen if you build a Tool AI”, which is clearly wrong.
Perhaps you’re arguing that we should focus on the “how do we manage potentially dangerous information” problem? Or that we should focus on improving human intelligence so that we are better able to use our AI systems?
I really like your examples in this post, and it made me think of a tangential but ultimately related issue.
I feel like there’s long been something like two camps in the AI safety space: the people who think it’s very hard to make AI safe and the people who think it’s very very hard, like threading a needle from 10 miles away using hand-made binoculars and a long stick (yes, there’s a third camp that thinks it will be easy, but they aren’t really in the AI safety conversation due to selection effects). And I suspect some of this difference is in how much proposed example failure scenarios feel likely and realistic to them. Being myself in the latter camp, I sometimes find it hard to articulate why I think this, and often want better, more evocative examples. Thus I was happy to read your examples because I think they achieve a level of evocativeness that I at least often find hard to create.
It seems that privacy potentially could “tame” a not-quite-corrigible AI. With a full model, the AGI might receive a request, deduce that activating a certain set of neurons strongly would be the most robust way to make you feel the request was fulfilled, and then design an electrode set-up to accomplish that. Whereas the same AI with a weak model wouldn’t be able to think of anything like that, and might resort to fulfilling the request in a more “normal” way. This doesn’t seem that great, but it does seem to me like this is actually part of what makes humans relatively corrigible.
Privacy as a component of AI alignment
[realized this is basically just a behaviorist genie, but posting it in case someone finds it useful]
What makes something manipulative? If I do something with the intent of getting you to do something, is that manipulative? A simple request seems fine, but if I have a complete model of your mind, and use it to phrase things so you do exactly what I want, that seems to have crossed an important line.
The idea is that using a model of a person that is *too* detailed is a violation of human values. In particular, it violates the value of autonomy, since your actions can now be controlled by someone using this model. And I believe that this is a significant part of what we are trying to protect when we invoke the colloquial value of privacy.
In ordinary situations, people can control how much privacy they have relative to another entity by limiting their contact with them to certain situations. But with an AGI, a person may lose a very large amount of privacy from seemingly innocuous interactions (we’re already seeing the start of this with “big data” companies improving their advertising effectiveness by using information that doesn’t seem that significant to us). Even worse, an AGI may be able to break the privacy of everyone (or a very large class of people) by using inferences based on just a few people (leveraging perhaps knowledge of the human connectome, hypnosis, etc...).
If we could reliably point to specific models an AI is using, and have it honestly share its model structure with us, we could potentially limit the strength of its model of human minds. Perhaps even have it use a hardcoded model limited to knowledge of the physical conditions required to keep it healthy. This would mitigate issues such as deliberate deception or mindcrime.
We could also potentially allow it to use more detailed models in specific cases, for example, we could let it use a detailed mind model to figure out what is causing depression in a specific case, but it would have to use the limited model in any other contexts or for any planning aspects of it. Not sure if that example would work, but I think that there are potentially safe ways to have it use context-limited mind models.
I would say the reason to assume infinite compute was less about which parts of the problem are hard, and more about which parts can be solved without a solution to the rest.
Good solutions often have even better solutions nearby. In particular, we would expect most efficient and comprehensible finite algorithms to tend towards some nice infinite behaviour in the limit. If we find an infinite algorithm, that’s a good point to start looking for finite approximations. It is also often easier to search for an infinite algorithm than a good approximation. Backpropagation in gradient descent is a trickier algorithm than brute force search. Logical induction is more complicated to understand than brute force proof search.
Well, yes, one way to help some living entity is to (1) interpret it as an agent, and then (2) act in service of the terminal goals of that agent. But that’s not the only way to be helpful. It may also be possible to directly be helpful to a living entity that is not an agent, without getting any agent concepts involved at all.
I definitely don’t know how to do this, but the route that avoids agent models entirely seems more plausible to me compared to working hard to interpret everything using some agent model that is often a really poor fit, and then helping on the basis of that poorly-fitting agent model.
I’m excited about inquiring deeply into what the heck “help” means. (All please reach out to me if you’d like to join a study seminar on this topic)