Update: The setup described in the OP involves a system that models humans. See this MIRI article for a discussion on some important concerns about such systems.
In none of these world-models, under no actions that it considers, does “episode 117 happen twice.”
Yes, episode 117 happens only once in the world model; and suppose the agent cares only about episode 117 in the “current execution”. The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the “current execution” is one that ends with the agent receiving a high reward.
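To make the probability claim slightly more concrete (under the simplifying, and debatable, assumption that the agent treats itself as equally likely to be any of the indistinguishable invocations): if writing the malign output causes N additional invocations in which episode 117 ends with a high reward, then from the agent's perspective

$$P(\text{the current execution ends with a high reward} \mid \text{malign output}) \;\ge\; \frac{N}{N+1},$$

which can be made close to 1 even though each individual invocation, including the original one, occurs only once in the world model.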
It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.
One scenario that comes to mind: an agent generates a manipulative output that is optimized to be approved by the programmers while causing the agent to seize control over more resources (in a way that is against the actual preferences of the programmers).
Sorry, I didn’t understand the question (and what you meant by “The loss function is undefined after training.”).
After thinking about this more, I now think that my original description of this failure mode might be confusing: maybe it is more accurate to describe it as an inner optimizer problem. The guiding logic here is that if there are no inner optimizers then the question answering system, which was trained by supervised learning, “attempts” (during inference) to minimize the expected loss function value as defined by the original distribution from which the training examples were sampled; and any other goal system is the result of inner optimizers.
(I need to think more about this)
We might be interpreting “modest logic-related stuff” differently—I am thinking about simple formal problems like sorting a short list of integers.
I wouldn’t be surprised if GPT-2 (or its smaller version) is very capable at completing strings like “[1,2,” in a way that is merely syntactically correct. Publicly available texts on the internet probably contain a lot of comma-separated number lists in brackets. The challenge is for the model to have the ability to sort numbers (when trained only to predict the next word in internet texts).
However, after thinking about it more I am now less confident that GPT-2 would fail to complete my above sentence with a correctly sorted list, because for any two small integers like 2 and 3 it is plausible that the training data contains more “2,3” strings than “3,2” strings.
Consider instead the following problem:
“The median number in the list [9,2,1,6,8] is ”
I’m pretty sure that GPT-2 would fail to complete such a sentence correctly at least 1⁄5 of the time (i.e. if we query it multiple times, where each time the sentence contains small random integers).
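A minimal sketch of the kind of test I have in mind (the prompt template, list length, and integer range are just illustrative assumptions):

```python
import random

def make_median_prompt(rng, list_len=5, max_int=9):
    """Build one test prompt with small random integers and its correct answer."""
    nums = [rng.randint(1, max_int) for _ in range(list_len)]
    prompt = f"The median number in the list [{','.join(map(str, nums))}] is "
    answer = str(sorted(nums)[list_len // 2])  # median of an odd-length list
    return prompt, answer

rng = random.Random(0)
for _ in range(5):
    prompt, answer = make_median_prompt(rng)
    print(prompt, "->", answer)
    # Feed `prompt` to the language model and check whether its completion
    # starts with `answer`; my guess is that it fails on a large fraction of these.
```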
In the case of GPT-2 the “current inference” is the current attempt to predict the next word given some text (it can be either during training or during evaluation).
In the malign-output scenario above the system indeed does not “care” about the future, it cares only about the current inference.
Indeed, the system “has no preference for being invoked”. But if it has been invoked and is currently executing, it “wants” to be in a “good invocation”—one in which it ends up with a perfect loss function value.
The training process optimizes only for immediate prediction accuracy.
Not exactly. The best way to minimize the L2 norm of the loss function over the training data is to simply copy the training data to the weights (if there are enough weights) and use some trivial look-up procedure during inference. To get models that are also useful for inputs that are not from the training data, you probably need to use some form of regularization (or use a model that implicitly carries it out), e.g. add to the objective function being minimized the L2 norm of the weights. Regularization is a way to implement Occam’s razor in machine learning.
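As a minimal sketch of what I mean by adding the L2 norm of the weights to the objective (the linear model and the coefficient here are arbitrary illustrations, not a claim about how any particular system is trained):

```python
import numpy as np

def regularized_loss(weights, inputs, targets, lam=0.01):
    """Squared-error loss on the training data plus an L2 penalty on the weights."""
    predictions = inputs @ weights           # a linear model, for illustration only
    data_loss = np.mean((predictions - targets) ** 2)
    l2_penalty = lam * np.sum(weights ** 2)  # pushes toward "simpler" (smaller) weights
    return data_loss + l2_penalty
```

The `lam` coefficient trades off fitting the training data against keeping the weights small; this is one common way to implement the Occam’s-razor-like pressure described above.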
Suppose that due to the regularization, the training results in a system with the goal system: “minimize the expected value of the loss function at the end of the current inference” (when the concept of probability, which is required to define expectation, corresponds to how humans interpret the word “probability” in a decision-relevant context). For such a goal system, the malign-output scenario above seems possible (for a sufficiently capable system).
Have you looked at the NLP tasks they evaluated it on?
Yes. Nothing I’ve seen suggests GPT-2 would successfully solve simple formal problems like the one I mentioned in the grandparent (unless a very similar problem appears in the training data—e.g. the exact same problem but with different labels).
I’m pretty sure that GPT-2 would fail to complete even the sentence: “if we sort the list [3,1,2,2] we get [1,”. It’s a cool language model but can it do even modest logic-related stuff without similar examples in the training data?
There are no models of the world involved in the latter
The weights of the neural network might represent something that corresponds to an implicit model of the world.
no actions including manipulating a human or inventing exciting proteins.
Putting aside the risk of inner optimizers, suppose we get to superintelligence-level capabilities, and it turns out that the training process produced a goal system such that the neural network yields some malign output that causes many future invocations of the neural network (indistinguishable from the current invocation) in which a perfect loss function value is achieved.
I don’t see how GPT-2 is a step forward towards passing strong versions of the Turing test.
It’s a source of superintelligence that doesn’t automatically run into utility maximizers.
I’m not familiar with the details of GPT-2 and maybe I’m interpreting the definition of “utility maximizer” incorrectly, but isn’t GPT-2 some neural network that is trained to minimize a loss function that corresponds to predicting the next word correctly?
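For concreteness, here is a sketch of the kind of next-word prediction loss I have in mind (an illustration under my assumptions about the setup, not OpenAI’s actual training code):

```python
import numpy as np

def next_word_loss(logits, next_word_index):
    """Cross-entropy loss for predicting the next word.

    `logits` holds the model's score for every word in the vocabulary;
    the loss is low when the true next word is assigned high probability.
    """
    probs = np.exp(logits - np.max(logits))  # softmax (shifted for numerical stability)
    probs /= probs.sum()
    return -np.log(probs[next_word_index])
```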
Alas, as seen in the above criticisms [links in a different spot in the original post], it seems far too common in the AI risk world to presume that past patterns of software and business are largely irrelevant, as AI will be a glorious new shiny unified thing without much internal structure or relation to previous things. (As predicted by far views.)
The rise of deep learning in recent years seems to be evidence in favor of [AI will be a glorious new shiny thing without much relation to previous things] (assuming “previous things” here is limited to things that affected markets at the time).
The history of vastly overestimating the ease of making huge firms in capitalism, and the similar typical newbie error of overestimating the ease of making large unstructured software systems, are seen as largely irrelevant.
While I see how conventional economic models are obviously useful here, I do not see how they can be useful in predicting the performance of “novel computations” (e.g. a computation that uses 1,000,000 GPU hours and a shiny new neural architecture) or predicting some critical technical properties of the development of transformative systems (e.g. “is there a secret sauce that a top AI lab will suddenly find?”).
In most cases my thought is “well, what’s the alternative?”
Perhaps we humans should do the thinking ourselves for 10,000 years (passing the task from one generation to the next until aging is solved), instead of deferring to some “idealized” digital versions of ourselves.
This would require preventing existential catastrophes, during those 10,000 years, via “conventional means” (e.g. stabilizing the world to some extent).
“Breaking the vase” is a reference to an example that people sometimes give for an accident that happens in reinforcement learning due to the reward function not being fully aligned with what we want. The scenario is a robot that navigates in a room with a vase, and while we care about the vase, the reward function that we provided does not account for it, and so the robot just knocks it over because it is on the shortest path to somewhere.
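A minimal sketch of the reward misspecification in that example (the gridworld details and numbers are made up for illustration):

```python
def provided_reward(state, reached_goal):
    """The reward we actually gave the robot: it only cares about reaching the goal."""
    return 1.0 if reached_goal else -0.01  # small step cost encourages the shortest path

def intended_reward(state, reached_goal):
    """What we actually wanted: the same, but with a penalty for knocking over the vase."""
    reward = provided_reward(state, reached_goal)
    if state.get("vase_broken", False):
        reward -= 10.0  # this term is missing from the reward we provided
    return reward

# The robot trained on `provided_reward` has no reason to avoid the vase:
print(provided_reward({"vase_broken": True}, reached_goal=True))   # 1.0
print(intended_reward({"vase_broken": True}, reached_goal=True))   # -9.0
```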
One reason I care about this is that I am concerned about approaches to AI safety that involve modeling humans to try to learn human value.
I also have concerns about such approaches, and I agree with the reason you gave for being more concerned about procedure B (“it would be nice to be able to save human approval as a test set”).
I did not understand how this relates specifically to gradient descent. Whether gradient descent (relative to other optimization algorithms) tends to find unsafe solutions, assuming no inner optimizers appear, seems to me to depend on fuzzy properties of the problem at hand.
One could design problems in which gradient descent is expected to find less-aligned solutions than non-local-search algorithms (e.g. a problem in which most solutions are safe, but hill-climbing from them leads to higher-utility, unsafe solutions). One could also design problems in which this is not the case (e.g. when the only thing that can go wrong is the agent breaking the vase, and breaking the vase allows higher-utility solutions).
Do you have an intuition that real-world problems tend to be such that the first solution found with utility value of at least X would be better when using random sampling (assuming infinite computation power) than when using gradient descent?
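To illustrate the first kind of designed problem above, here is a toy sketch (the utility landscape, the safety predicate, and especially the stopping rules are made-up assumptions; in particular, hill climbing here runs to convergence rather than stopping the moment it crosses X):

```python
import random

def utility(x):
    return x                   # higher x means more utility, monotonically

def is_safe(x):
    return x < 0.9             # made-up safety predicate: only "moderate" solutions are safe

X = 0.5                        # the required utility threshold

def random_sampling_solution(rng):
    """Return the first uniformly sampled solution whose utility reaches X."""
    while True:
        x = rng.random()
        if utility(x) >= X:
            return x

def hill_climbing_solution(rng, step=0.01, iters=1000):
    """Hill-climb from a random start until (near) convergence, then return the result."""
    x = rng.random()
    for _ in range(iters):
        x = min(1.0, x + step)  # the gradient of utility(x) = x always points "up"
    return x

rng = random.Random(0)
trials = 1000
safe_random = sum(is_safe(random_sampling_solution(rng)) for _ in range(trials)) / trials
safe_climb = sum(is_safe(hill_climbing_solution(rng)) for _ in range(trials)) / trials
print(f"random sampling: {safe_random:.2f} safe   hill climbing: {safe_climb:.2f} safe")
```

In this toy landscape random sampling mostly returns safe solutions above the threshold, while hill climbing always ends up in the unsafe high-utility region; by construction, other toy landscapes give the opposite result.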
Thank you, I understand this now (I found it useful to imagine code that is being invoked many times and is terminated after a random duration; and reflect on how the agent implemented by the code should update as time goes by).
I guess I should be overall more optimistic now :)
Therefore A1 would force us to conclude that the safe and the dangerous worlds have exactly the same level of risk!
Similar problems arise if we try and use weaker versions of A1 - maybe our survival is some evidence, just not strong evidence. But Bayes will still hit us, and force us to change our values of terms like P( we survived | dangerous ).
I’m confused by this. The event “we survived” here is actually the event “at least one observer similar to us survived”, right? (for some definition of “similar”). If the number of planets on which creatures similar-to-us evolve is sufficiently large, we get:

P(at least one observer similar to us survived) ≈ P(at least one observer similar to us survived | dangerous) ≈ 1
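To spell out why both quantities are close to 1 (under the toy assumption that each of N such planets independently has survival probability p):

$$P(\text{at least one observer similar to us survived}) \;=\; 1-(1-p)^N \;\longrightarrow\; 1 \quad \text{as } N \to \infty,$$

and this holds whether p is the high “safe world” value or the low “dangerous world” value, so conditioning on “dangerous” barely changes the probability when N is large.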
Maybe: “What are the most effective interventions for making better predictions/decisions?”
It seems worthwhile to create such a list, ranked according to a single metric as measured in randomized experiments.
(if there is already such a thing please let me know)
but I gathered some quantitative estimates of AI risk here, and they all seem overly optimistic to me. Did you see that?
I only now read that thread. I think it is extremely worthwhile to gather such estimates.
I think all three estimates mentioned there correspond to marginal probabilities (rather than probabilities conditioned on “no governance interventions”). So those estimates already account for scenarios in which governance interventions save the world. Therefore, it seems we should not strongly update against the necessity of governance interventions due to those estimates being optimistic.
Maybe we should gather researchers’ credences for predictions like: “If there are no governance interventions, competitive aligned AIs will exist 10 years from now”.
I suspect that gathering such estimates from publicly available information might expose us to a selection bias, because very pessimistic estimates might be outside the Overton window (even for the EA/AIS crowd). For example, if Robert Wiblin would have concluded that an AI existential catastrophe is 50% likely, I’m not sure that the 80,000 Hours website (which targets a large and motivationally diverse audience) would have published that estimate.
I agree with this motivation to do early work, but in a world where we do need drastic policy responses, I think it’s pretty likely that the early work won’t actually produce conclusive enough results to show that. For example, if a safety approach fails to make much progress, there’s not really a good way to tell if it’s because safe and competitive AI really is just too hard (and therefore we need a drastic policy response), or because the approach is wrong, or the people working on it aren’t smart enough, or they’re trying to do the work too early.
I strongly agree with all of this.
If the answer is no, there’s also the question of how do we make policy makers take this problem seriously (i.e., that safe AI probably won’t be as efficient as unsafe AI) given the existence of more optimistic AI safety researchers (so that they’d be willing to undertake costly preparations for governance solutions ahead of time).
I’m not aware of any AI safety researchers that are extremely optimistic about solving alignment competitively. I think most of them are just skeptical about the feasibility of governance solutions, or think governance related interventions might be necessary but shouldn’t be carried out yet.
In this 80,000 Hours podcast episode, Paul said the following:
In terms of the actual value of working on AI safety, I think the biggest concern is this, “Is this an easy problem that will get solved anyway?” Maybe the second biggest concern is, “Is this a problem that’s so difficult that one shouldn’t bother working on it or one should be assuming that we need some other approach?” You could imagine, the technical problem is hard enough that almost all the bang is going to come from policy solutions rather than from technical solutions.
And you could imagine, those two concerns maybe sound contradictory, but aren’t necessarily contradictory, because you could say, “We have some uncertainty about this parameter of how hard this problem is.” Either it’s going to be easy enough that it’s solved anyway, or it’s going to be hard enough that working on it now isn’t going to help that much and so what mostly matters is getting our policy response in order. I think I don’t find that compelling, in part because one, I think the significant probability on the range … like the place in between those, and two, I just think working on this problem earlier will tell us what’s going on. If we’re in the world where you need a really drastic policy response to cope with this problem, then you want to know that as soon as possible.
It’s not a good move to be like, “We’re not going to work on this problem because if it’s serious, we’re going to have a dramatic policy response.” Because you want to work on it earlier, discover that it seems really hard and then have significantly more motivation for trying the kind of coordination you’d need to get around it.
How uncompetitive do you think aligned IDA agents will be relative to unaligned agents
For the sake of this estimate I’m using a definition of IDA that is probably narrower than what Paul has in mind: in the definition I use here, the Distill steps are carried out by nothing other than supervised learning + what it takes to make that supervised learning safe (but the implementation of the Distill steps may be improved during the Amplify steps).
This narrow definition might not include the most promising future directions of IDA (e.g. maybe the Distill steps should be carried out by some other process that involves humans). Without this simplifying assumption, one might define IDA as broadly as: “iteratively create stronger and stronger safe AI systems by using all the resources and tools that you currently have”. Carrying out that Broad IDA approach might include efforts like asking AI alignment researchers to get into a room with a whiteboard and come up with ideas for new approaches.
Therefore this estimate uses my narrow definition of IDA. If you like, I can also answer the general question: “How uncompetitive do you think aligned agents will be relative to unaligned agents?”.
Suppose it is the case that if OpenAI decided to create an AGI agent as soon as they could, it would have taken them X years (assuming an annual budget of $10M and that the world around them stays the same, and OpenAI doesn’t do neuroscience, and no unintentional disasters happen).
Now suppose that OpenAI decided to create an aligned IDA agent with AGI capabilities as soon as they could (same conditions). How much time would it take them? My estimate follows; each entry is in the format:
[years]: [my credence that it would take them at most that many years]
(consider writing down your own credences before looking at mine)