Sure. In the case of Lincoln, I would say the problem is solved by models even as clean as Pearl-ian causal networks. But in math, there’s no principled causal network model of theorems that would support counterfactual reasoning the way causal calculus does.
Of course, I more or less just think that we have an unprincipled causality-like view of math that we take when we think about mathematical counterfactuals, but it’s not clear that this is any help to MIRI understanding proof-based AI.
I feel like this is practically a frequentist/Bayesian disagreement :D It seems “obvious” to me that “If Lincoln were not assassinated, he would not have been impeached” can be about the real Lincoln as much as me saying “Lincoln had a beard” is, because both are statements made using my model of the world about this thing I label Lincoln. No reference class necessary.
Honestly? I feel like this same set of problems gets re-solved a lot. I’m worried that it’s a sign of ill health for the field.
I think we understand certain technical aspects of corrigibility (indifference and CIRL), but have hit a brick wall in certain other aspects (things that require sophisticated “common sense” about AIs or humans to implement, philosophical problems about how to get an AI to solve philosophical problems). I think this is part of what leads to re-treading old ground when new people (or a person wanting to apply a new tool) try to work on AI safety.
On the other hand, I’m not sure if we’ve exhausted Concrete Problems yet. Yes, the answer is often “just have sophisticated common sense,” but I think the value is in exploring the problems and generating elegant solutions so that we can deepen our understanding of value functions and agent behavior (like TurnTrout’s work on low-impact agents). In fact, Tom’s a co-author on a very good toy problems paper, and many of those problems require a similar sort of one-off solution that still might advance our technical understanding of agents.
I think the most “native” representation of utility functions is actually as a function from ordered triples of outcomes to real numbers. Rather than having an arbitrary (affine symmetry breaking) scale for strength of preference, set the scale of a preference by comparing to a third possible outcome.
The function is the “how much better?” function. Given possible outcomes A, B, and X, how many times better is A (relative to X) than B (relative to X)?
If A is chocolate cake, and B is ice cream, and X is going hungry, maybe the chocolate cake preference is 1.25 times stronger, so the function Betterness(chocolate cake, ice cream, going hungry) = 1.25.
This is the sort of preference that you would elicit from a gamble (at least from a rational agent, not necessarily from a human). If I am indifferent between ice cream for certain and a gamble with probability 0.8 of chocolate cake and 0.2 of going hungry, this tells you the betterness value above: 1/0.8 = 1.25.
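To make the arithmetic concrete, here is a minimal sketch. The utility numbers are made up purely for illustration, and `betterness` and `betterness_from_gamble` are hypothetical helper names. If an agent is indifferent between B for certain and a gamble giving A with probability p and X otherwise, then U(B) = p U(A) + (1 - p) U(X), so (U(A) - U(X)) / (U(B) - U(X)) = 1/p. The sketch also checks that the ratio does not change under an affine rescaling of the utilities.

```python
def betterness(u, a, b, x):
    """How many times better A is than B, both measured relative to X."""
    return (u[a] - u[x]) / (u[b] - u[x])


def betterness_from_gamble(p):
    """Indifference between B for sure and (A with prob p, X with prob 1 - p)
    implies a betterness value of 1 / p."""
    return 1.0 / p


# Hypothetical utilities, chosen only so the numbers match the example.
u = {"chocolate cake": 5.0, "ice cream": 4.0, "going hungry": 0.0}

# The same preferences on a different (affinely rescaled) utility scale.
u_rescaled = {k: 3.0 * v + 7.0 for k, v in u.items()}

ratio = betterness(u, "chocolate cake", "ice cream", "going hungry")
ratio_rescaled = betterness(u_rescaled, "chocolate cake", "ice cream", "going hungry")

print(ratio, ratio_rescaled, betterness_from_gamble(0.8))
```

The point of the three-argument form is that the arbitrary scale and offset of the utility function cancel out, which is what the rescaling check is meant to show.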
Anyhow, interesting post, I’m just idly commenting.
This is definitely an interesting topic, and I’ll eventually write a related post, but here are my thoughts at the moment.
1 - I agree that using natural language prompts with systems trained on natural language makes for a much easier time getting common-sense answers. It’s a particular sort of idiot-proofing that prevents the hypothetical idiot from having the AI tell them how to blow up the world. You use the example of “How would we be likely to cure Alzheimer’s?”, but for a well-trained natural language Oracle, you could even ask “How should we cure Alzheimer’s?”
If it was an outcome pump with no particular knowledge of humans, it would give you a plan that would set off our nuclear arsenals. A superintelligent search process with an impact penalty would tell you how to engineer a very unobtrusive virus. A perfect world model with no special knowledge of humans would tell you a series of configurations of quantum fields. These are all bad answers.
What you want the Oracle to tell you is the sort of plan that might practically be carried out, or some other useful information, that leads to an Alzheimer’s cure in the normal way that people mean when talking about diseases and research and curing things. Any model that does a good job predicting human natural language will take this sort of thing for granted in more or less the way you want it to.
2 - But here’s the problem with curing Alzheimer’s: it’s hard. If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer’s, it won’t tell you a cure, it will tell you what humans have said about curing Alzheimer’s.
If you train a simultaneous model (like a neural net or a big transformer or something) of human words, plus sensor data of the surrounding environment (like how an image captioning AI can be thought of as having a simultaneous model of words and pictures), and figure out how to control the amount of detail of verbal output, you might be able to prompt an AI with text about an Alzheimer’s cure, have it model a physical environment that it expects those words to take place in, and then translate that back into text describing the predicted environment in detail. But it still wouldn’t tell you a cure. It would just tell you a plausible story about a situation related to the prompt about curing Alzheimer’s, based on its training data. Rather than a logical Oracle, this image-captioning-esque scheme would be an intuitive Oracle, telling you things that make sense based on associations already present within the training set.
What am I driving at here, by pointing out that curing Alzheimer’s is hard? It’s that the designs above are missing something, and what they’re missing is search.
I’m not saying that getting a neural net to directly output your cure for Alzheimer’s is impossible. But it seems like it requires there to already be a “cure for Alzheimer’s” dimension in your learned model. The more realistic way to find the cure for Alzheimer’s, if you don’t already know it, is going to involve lots of logical steps one after another, slowly moving through a logical space, narrowing down the possibilities more and more, and eventually finding something that fits the bill. In other words, solving a search problem.
So if your AI can tell you how to cure Alzheimer’s, I think either it’s explicitly doing a search for how to cure Alzheimer’s (or worlds that match your verbal prompt the best, or whatever), or it has some internal state that implicitly performs a search.
And once you realize you’re imagining an AI that’s doing search, maybe you should feel a little less confident in the idiot-proofness I talked about in section 1. Maybe you should be concerned that this search process might turn up the equivalent of adversarial examples in your representation.
3 - Whenever I see a proposal for an Oracle, I tend to try to jump to the end—can you use this Oracle to immediately construct a friendly AI? If not, why not?
A perfect Oracle would, of course, immediately give you FAI. You’d just ask it “what’s the code for a friendly AI?”, and it would tell you, and you would run it.
Can you do the same thing with this self-supervised Oracle you’re talking about? Well, there might be some problems.
One problem is the search issue I just talked about: outputting functioning code with a specific purpose is a very search-y sort of thing to do, and not a very big-ol’-neural-net thing to do, even more so than outputting a cure for Alzheimer’s. So maybe you don’t fully trust the output of this search, or maybe there’s no search and your AI is just incapable of doing the task.
But I think this is a bit of a distraction, because the basic question is whether you trust this Oracle with simple questions about morality. If you think the AI is just regurgitating an average answer to trolley problems or whatever, should you trust it when you ask for the FAI’s code?
There’s an interesting case to be made for “yes, actually,” here, but I think most people will be a little wary. And this points to a more general problem with definitions—any time you care about getting a definition having some particularly nice properties beyond what’s most predictive of the training data, maybe you can’t trust this AI.
That proof of the instability of RNNs is very nice.
The version of the vanishing gradient problem I learned is simply that if you’re updating weights proportional to the gradient, and your average weight somehow ends up as 0.98, then as you increase the number of layers n, your gradient, and therefore your update size, will shrink kind of like (0.98)^n, which is not the behavior you want it to have.
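A quick back-of-the-envelope sketch of that shrinkage. Treating every layer as contributing the same multiplicative factor is a simplifying assumption, and `gradient_scale` is just a hypothetical helper name, not anything from a real library.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Rough size of the gradient after passing back through n_layers,
    assuming each layer multiplies it by the same factor."""
    return per_layer_factor ** n_layers


# How the update size shrinks as the network gets deeper.
for n in (1, 10, 50, 100, 500):
    print(n, gradient_scale(0.98, n))
```

Even a factor that looks harmlessly close to 1 leaves only a small fraction of the gradient after a hundred layers, and almost nothing after a few hundred.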
One sufficient condition for always defining actions is when a decision theory can give decisions as a function of the state of the world. For example, CDT evaluates outcomes in a way purely dependent on the world’s state. A more complicated way of doing this is if your decision theory takes in a model of the world and outputs a policy, which tells you what to do in each state of the world.
And of course you can go further and have different U that all have similarly valid claims to be Up, because they’re all similarly good generalizations of our behavior into a consistent function on a much larger domain.
Yeah I agree that this might secretly be the same as a question about uploads.
If you’re only trying to copy human behavior in a coarse-grained way, you immediately run into a huge generalization problem because your human-imitation is going to have to make plans where it can copy itself, think faster as it adds more computing power, can’t get a hug, etc, and this is all outside of the domain it was trained on.
So if people aren’t being very specific about human imitations, I kind of assume they’re really talking and thinking about basically-uploads (i.e. imitations that generalize to this novel context by having a model of human cognition that attempts to be realistic, not merely predictive).
Could you expand on why you think that information / entropy doesn’t match what you mean by “amount of optimization done”?
E.g. suppose you’re training a neural network via gradient descent. If you start with weights drawn from some broad distribution, after training they will end up in some narrower distribution. This seems like a good metric of “amount of optimization done to the neural net.”
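As a toy version of this metric, here is a sketch that assumes the weights are independent Gaussians before and after training, which is an assumption made only so the entropy has a closed form. The function names are hypothetical. The differential entropy of a Gaussian with standard deviation sigma is 0.5 * log2(2 * pi * e * sigma^2) bits, so narrowing from sigma_before to sigma_after removes log2(sigma_before / sigma_after) bits per weight.

```python
import math


def gaussian_entropy_bits(sigma):
    """Differential entropy, in bits, of a 1-D Gaussian with std sigma."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)


def optimization_bits(sigma_before, sigma_after, n_weights=1):
    """Entropy reduction from narrowing n_weights independent Gaussian weights."""
    per_weight = gaussian_entropy_bits(sigma_before) - gaussian_entropy_bits(sigma_after)
    return n_weights * per_weight


# Ten weights, each narrowed by a factor of 8 (3 bits per weight).
print(optimization_bits(1.0, 0.125, n_weights=10))
```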
I think there are two categories of reasons why you might not be satisfied—false positive and false negative. False positives would be “I don’t think much optimization has been done, but the distribution got a lot narrower,” and false negatives would be “I think more optimization is happening, but the distribution isn’t getting any narrower.” Did you have a specific instance of one of these cases in mind?
Here’s a more general way of thinking about what you’re saying that I find useful: It’s not that self-awareness is the issue per se, it’s that you’re avoiding building an agent—by a specific technical definition of “agent.”
Agents, in the sense I think is most useful when thinking about AI, are things that choose actions based on the predicted consequences of those actions.
On some suitably abstract level of description, agents must have available actions, they must have some model of the world that includes a free parameter for different actions, and they must have a criterion for choosing actions that’s a function of what the model predicts will happen when it takes those actions. Agents are what is dangerous, because they steer the future based on their criterion.
What you describe in this post is an AI that has actions (outputting text to a text channel), and has a model of the world. But maybe, you say, we can make it not an agent, and therefore a lot less dangerous, by making it so that there is no free parameter in the model for the agent to try out different actions, and instead of choosing its action based on consequences, it will just try to describe what its model predicts.
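The distinction can be sketched in a few lines. This is a toy illustration under my own assumptions: the world is a single number, the model just adds the action to the state, and all the names are hypothetical.

```python
def world_model(state, action=None):
    """Toy model predicting the next state. `action` is the free parameter;
    with no action supplied, the model just predicts the state persists."""
    return state + (action if action is not None else 0)


def agent_step(state, actions, criterion):
    """Agent: tries each action in the model and picks the one whose
    predicted consequence scores best under the criterion."""
    return max(actions, key=lambda a: criterion(world_model(state, a)))


def predictor_step(state):
    """Non-agent: reports what the model predicts, with no action to vary
    and no criterion applied to consequences."""
    return world_model(state)


# The agent steers toward state 5; the predictor only describes.
chosen = agent_step(state=0, actions=[-1, 0, 1], criterion=lambda s: -abs(s - 5))
report = predictor_step(state=0)
print(chosen, report)
```

Both functions share the same world model. What makes the first one an agent is only the loop over actions plus the criterion on predicted outcomes.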
Thinking about it in terms of agents like this explains why “knowing that it’s running on a specific computer” has the causal powers that it does—it’s a functional sort of “knowing” that involves having your model of the world impacted by your available actions in a specific way. Simply putting “I am running on this specific computer” into its memory wouldn’t make it an agent—and if it chooses what text to output based on predicted consequences, it’s an agent whether or not it has “I am running on this specific computer” in its memory.
So, could this work? Yes. It would require a lot of hard, hard work on the input/output side, especially if you want reliable natural language interaction with a model of the entire world, and you still have to worry about the inner optimizer problem, particularly e.g. if you’re training it in a way that creates an incentive for self-fulfilling prophecy or some other implicit goal.
The basic reason I’m pessimistic about the general approach of figuring out how to build safe non-agents is that agents are really useful. If your AI design requires a big powerful model of the entire world, that means that someone is going to build an agent using that big powerful model very soon after. Maybe this tool gives you some breathing room by helping suppress competitors, or maybe it makes it easier to figure out how to build safe agents. But it seems more likely to me that we’ll get a good outcome by just directly figuring out how to build safe agents.
“I don’t trust humans to be a trusted source when it comes to what an AI should do with the future lightcone.”
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
Sure :) I’ve said similar things elsewhere, but I suppose one must sometimes talk to people who haven’t read one’s every word :P
We’re being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn’t just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.
There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.
Lastly, figuring out what the AI should do with its resources is really hard, and figuring out which to call “better” between two complicated choices can be hard too, and humans will sometimes do badly at it. Worst case, the humans appear to answer hard questions with certainty, or conversely the questions the AI is most uncertain about slowly devolve into giving humans hard questions and treating their answers as strong information.
I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by “take this into account,” I’m pretty sure that means model the human and treat preferences as objects in the model.
Skipping over the intervening stuff I agree with, here’s that Eliezer quote:
Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope at success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.
Though I’m not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn’t really matter if it’s passing the buck or not.
But my original thought wasn’t about uploads (though that’s definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.
Though maybe you went in the right direction anyhow, and if all you had was supervised learning the right thing to do is to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book—Zendegi?
so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.)
There are some cases where the AI specifically has a model of the human, and I’d call those “special methods.” Not just IRL, the entire problem of imitation learning often uses specific methods to model humans, like “value iteration networks.” This is the sort of development I’m thinking of that helps AI do a better job at generalizing human values—I’m not sure if you meant things at a lower level, like using a different gradient descent optimization algorithm.
Ah, but I don’t trust humans to be a trusted source when it comes to what an AI should do with the future lightcone. I expect you’d run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.
As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is “learning and respecting human preferences”, object recognition is “human preferences about how to categorize images”, and sentiment analysis is “human preferences about how to categorize sentences”
I somewhat agree, but you could equally well call them “learning human behavior at categorizing images,” “learning human behavior at categorizing sentences,” etc. I don’t think that’s enough. If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.
So this is two separate problems: one, I think humans can’t reliably tell an AI what they value through a text channel, even with prompting, and two, I think that mimicking human behavior, even human behavior on moral questions, is insufficient to deal with the possibilities of the future.
I’ve never heard anyone in machine learning divide the field into cases where we’re trying to generalize about human values and cases where we aren’t. It seems like the same set of algorithms, tricks, etc. work either way.
It also sounds silly to say that one can divide the field into cases where you’re doing model-based reinforcement learning, and cases where you aren’t. The point isn’t the division, it’s that model-based reinforcement learning is solving a specific type of problem.
Let me take another go at the distinction: Suppose you have a big training set of human answers to moral questions. There are several different things you could mean by “generalize well” in this case, which correspond to solving different problems.
The first kind of “generalize well” is where the task is to predict moral answers drawn from the same distribution as the training set. This is what most of the field is doing right now for Ian Goodfellow’s examples of categorizing images or categorizing sentences. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing the test set.
Another sort of “generalize well” might be inferring a larger “real world” distribution even when the training set is limited. For example, if you’re given labeled data mapping handwritten numbers 0-20 to binary outputs, can you give the correct binary output for 21? How about 33? In our moral questions example, this would be like predicting answers to moral questions spawned by novel situations not seen in training. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing examples later drawn from the real world.
Let’s stop here for a moment and point out that if we want generalization in the second sense, algorithmic advances in the first sense might be useful, but they aren’t sufficient. For the classifier to output the binary for 33, it probably has to be deliberately designed to learn flexible representations, and probably get fed some additional information (e.g. by transfer learning). When the training distribution and the “real world” distribution are different, you’re solving a different problem than when they’re the same.
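Here is a deliberately crude sketch of the gap between the first two senses, ignoring the handwriting part and working on integers directly. Both "models" are hypothetical stand-ins: the lookup table plays the role of a learner that is perfect in-distribution, and the second function is not learned at all, it stands in for extra structure supplied from outside the training set.

```python
# Training set: binary labels for the numbers 0..20 only.
train = {n: format(n, "b") for n in range(21)}


def lookup_model(n):
    """Memorizes the training set; has nothing to say off-distribution."""
    return train.get(n)


def structured_model(n):
    """Has the rule 'repeatedly divide by two' built in by the designer."""
    if n == 0:
        return "0"
    bits = ""
    while n > 0:
        bits = str(n % 2) + bits
        n //= 2
    return bits


# Both agree everywhere on the training distribution...
agree_in_distribution = all(lookup_model(n) == structured_model(n) for n in range(21))

# ...but only one of them produces an answer for 33.
print(agree_in_distribution, lookup_model(33), structured_model(33))
```

Nothing in the training labels distinguishes the two models. The difference only shows up once you leave the range the data covered.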
A third sort of “generalize well” is to learn superhumanly skilled answers even if the training data is flawed or limited. Think of an agent that learns to play Atari games at a superhuman level, from human demonstrations. This generalization task often involves filling in a complex model of the human “expert,” along with learning about the environment—for current examples, the model of the human is usually hand-written. The better we get at generalizing in this way, the more the AI’s answers will be like “what we meant” (either by some metric we kept hidden from the AI, or in some vague intuitive sense) even if they diverge from what humans would answer.
(I’m sure there are more tasks that fall under the umbrella of “generalization,” but you’ll have to suggest them yourself :) )
So while I’d say that value learning involves generalization, I think that generalization can mean a lot of different tasks—a rising tide of type 1 generalization (which is the mathematically simple kind) won’t lift all boats.
Yes, I agree that generalization is important. But I think it’s a bit too reductive to think of generalization ability as purely a function of the algorithm.
For example, an image-recognition algorithm trained with dropout generalizes better, because dropout acts like an extra goal telling the training process to search for category boundaries that are smooth in a certain sense. And the reason we expect that to work is because we know that the category boundaries we’re looking for are in fact usually smooth in that sense.
So it’s not like dropout is a magic algorithm that violates a no-free-lunch theorem and extracts generalization power from nowhere. The power that it has comes from our knowledge about the world that we have encoded into it.
(And there is a no free lunch theorem here. How to generalize beyond the training data is not uniquely encoded in the training data; every bit of information in the generalization process has to come from your model and training procedure.)
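A tiny illustration of that parenthetical, with made-up data and two hypothetical hypotheses. Both fit the training points exactly, so the data alone cannot say which one to use off the training set.

```python
# Three training points (input, label), invented for illustration.
train = [(0, 0), (1, 1), (2, 2)]


def h_linear(x):
    """The 'smooth' generalization most people would pick."""
    return x


def h_cubic(x):
    """Adds a term that vanishes exactly at the training inputs 0, 1, 2."""
    return x + x * (x - 1) * (x - 2)


# Both hypotheses reproduce every training label...
fits = all(h_linear(x) == y == h_cubic(x) for x, y in train)

# ...and still disagree on the first unseen input.
print(fits, h_linear(3), h_cubic(3))
```

Preferring `h_linear` is a choice made by the model class or regularizer, in the same way that dropout encodes a preference for smooth category boundaries.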
For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It has to not only predict human actions, it has to make a very particular sort of generalization (“human values”), and single out part of that generalization to make plans with. The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.
I’m guilty, I’ll try to do better :)
I don’t understand why you’re so confident. It doesn’t seem to me that my values are divorced from biology (I want my body to stay healthy) or population statistics (I want a large population of people living happy lives).
When I say your preference is “more abstract than biology,” I’m not saying you’re not allowed to care about your body, I’m saying something about what kind of language you’re speaking when you talk about the world. When you say you want to stay healthy, you use a fairly high-level abstraction (“healthy”), you don’t specify which cell organelles should be doing what, or even the general state of all your organs.
This choice of level of abstraction matters for generalization. At our current level of technology, an abstract “healthy” and an organ-level description might have the same outcomes, but at higher levels of technology, maybe someone who preferred to be healthy would be fine becoming a cyborg, while someone who wanted to preserve some lower-level description of their body would be against it.
“Once it starts encoding the world differently than we do, it won’t have the generalization properties we want—we’d be caught cheating, as it were.”
Are you sure?
I think the right post to link here is this one by Kaj Sotala. I’m not totally sure—there may be some way to “cheat” in practice—but my default view is definitely that if the AI carves the world up along different boundaries than we do, it won’t generalize in the same way we would, given the same patterns.
Nice find on the Bostrom quote btw.
I think your claim proves too much. Different human brains have different encodings, and yet we are still able to learn the values of other humans (for example, when visiting a foreign country) reasonably well when we make an honest effort.
I would bite this bullet, and say that when humans are doing generalization of values into novel situations (like trolley problems, or utopian visions), they can end up at very different places even if they agree on all of the everyday cases.
If you succeed at learning the values of a foreigner, so well that you can generalize those values to new domains, I’d suspect that the simplest way for you to do it involves learning about what concepts they’re using well enough to do the right steps in reasoning. If you just saw a snippet of their behavior and couldn’t talk to them about their values, you’d probably do a lot worse—and I think that’s the position many current value learning schemes place AI in.
I’d definitely be interested in your thoughts about preferences when you get them into a shareable shape.
In some sense, what humans “really” have is just atoms moving around; all talk of mental states and so on is some level of convenient approximation. So when you say you want to talk about a different sort of approximation from Stuart, the first thing I’m curious about is “how can you make your way of talking about humans convenient for getting an AI to behave well?”
A good starting point. I’m reminded of an old Kaj Sotala post (which then later provided inspiration for me writing a sort of similar post) about trying to ensure that the AI has human-like concepts. If the AI’s concepts are inhuman, then it will generalize in an inhuman way, so that something like teaching a policy through demonstrations might not work.
But of course having human-like concepts is tricky and beyond the scope of vanilla IRL.
I like this example of “works in practice but not in theory.” Would you associate “ambitious value learning vs. adequate value learning” with “works in theory vs. doesn’t work in theory but works in practice”?
One way that “almost rational” is much closer to optimal than “almost anti-anti-rational” is ye olde dot product, but a more accurate description of this case would involve dividing up the model space into basins of attraction. Different training procedures will divide up the space in different ways. This is actually sort of the reverse of a Monte Carlo simulation, where one of the properties you might look for is ergodicity (eventually visiting all points in the space).