Typo: Cundy and Filan, 2018 (not 2008)
Ah, I see. That does make it seem clearer to me, though I’m not sure what beliefs actually changed.
Yes, that would work. I think Stuart Armstrong’s AI Toy Control problem already demonstrates this quite well—it’s the generalization to unknown dynamics that might be interesting and more compelling.
This model seems very fatalistic, I guess? It seems somewhat incompatible with an agent that has preferences. (Perhaps you’re suggesting we build an AI without preferences, but it doesn’t sound like that.)
Yes, this could be difficult, and would likely limit what we could do, but I don’t see why this would prevent us from getting anything useful out of a virtual-world-optimizer. Lots of engineering tasks don’t require more explicit physics knowledge than we already have.
I think there’s a lot of common sense that humans apply that allows them to design solutions that meet many implicit constraints that they can’t easily verbalize. “Thinking outside of the box” is when a human manages to design something that doesn’t satisfy one of the constraints, because it turns out that constraint wasn’t useful. But in most cases, those constraints are very useful, because they make the search space much smaller. By default, these constraints won’t carry over into the virtual world.
(Lots of examples of this in The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities)
Okay, that makes sense, I agree with that. As I mentioned in the opinion, I definitely agree with this in the case of a superintelligence optimizing a utility function. Probably my mindset when writing that opinion was that it seems likely to me that AI systems that we actually deploy won’t look like a single agent optimizing for particular preferences, for reasons that I couldn’t really articulate. I still have this intuition, and I think I’m closer to being able to explain it now, but not in a comment.
It’s not obvious to me, even on the “optimizing an abstract computational process” model, why an AI would not want to get more compute: it can use this compute for itself, without changing the abstract computational process it is optimizing, and it will probably do better. It seems that if you want to get this to work, you need to have the AI want to compute the result of running itself without any modification or extra compute on the virtual world. This feels very hard to me.
Separately, I also find it hard to imagine us building a virtual world that is similar enough to the real world that we are able to transfer solutions between the two, even with some finetuning in the real world.
I’d like to register an intuition that I could come up with a (toy, unrealistic) continual learning scenario that looks like a treacherous turn with today’s ML, perhaps by restricting the policies that the agent can learn, giving it a strong inductive bias that lets it learn the environment and the supervisor’s preferences quickly and accurately, and making it model-based. It would look something like Stuart Armstrong’s toy version of the AI alignment problem, but with a learned environment model (but maybe learned from a very strong prior, not a neural net).
This is just an intuition, not a strong belief, but it would be enough for me to work on this if I had the time to do so.
Hey Jess, as Ben mentioned, I keep all newsletter-related things on my website.
I agree that in theory hidden information leads to an exponential explosion. In practice, I think you don’t need to search over all the exponentially many ways the hidden information could be in order to get good results. (At least, you don’t need to do that in order to beat humans, because humans don’t seem to do that.)
I think overall we agree though—when I said “it wasn’t clear how to make things work with hidden information—you could try the same thing but it was plausible it wouldn’t work”, I was primarily thinking that the computational cost might be too high. I was relatively confident that given unbounded compute, AlphaGo-style algorithms could deal with hidden information.
Yeah, I think I agree with all of that. Perhaps the better way to state my position is, conditional on there being good normative feedback on the meta level, I would expect the space of concepts on the meta-level to be smaller than on the object-level.
Thanks for doing this! I’m a fan of executable code that demonstrates the problems that we are worrying about—it makes the concept (in this case, a treacherous turn) more concrete.
In order to make it more realistic, I would want the agent to grow in capability organically (rather than simply getting a more powerful weapon). It would really drive home the point if the agent undertook a treacherous turn the very first time, whereas in this post I assume it learned using many episodes of trial-and-error that a treacherous turn leads to higher reward. This seems hard to demonstrate with today’s ML in any complex environment, where you need to learn from experience instead of using eg. value iteration, but it’s not out of the question in a continual learning setup where the agent can learn a model of the world.
I’m not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances.
On the other hand, it does seem to me that the higher up you are in meta-levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can’t depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.
I am confused. My current understanding is that we’re starting with only a preference relation, and no assumptions on probability (so no lotteries, as in the VNM theorem). In that case, there are tons of utility functions that can model any given arbitrary preference relation. It seems like I could get a result like this by saying “take the preference relation, write down a utility function that encodes it, decompose it into the ratio of two parts, call one of them ‘probability’ and the other ‘probability*utility’, and now note that there are transformations to other utility functions that encode the same preference relation and unsurprisingly they change the relative amounts of each of the parts—therefore probability and utility are inextricably linked”. (This is almost certainly either wrong or a strawman, but I don’t know how.) But in all of this there’s no reason to think of the denominator of the ratio as “probability”, we just called it that suggestively. Perhaps my critique is that if we start with _just_ a preference relation and only need to keep the preference relation intact, we shouldn’t expect to recover anything like normal expected utility theory, because there’s no formal reason to have anything like probabilities. Even if you want to interpret probability as a “caring measure” instead of “magical reality fluid” it should still show up before you work through the math and interpret one of the quantities as “caring measure”. But mostly I’m confused so who knows, this may all be incoherent.
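To make the first half of this concrete, here is a minimal sketch of the two claims I'm leaning on: that many utility functions encode the same preference relation, and that a utility function can be split into a ratio in arbitrarily many ways. The outcomes, utility values, and weightings are all made up for illustration, and this is not meant to reproduce the Jeffrey-Bolker construction itself, only my (possibly strawman) reading of it.

```python
import math

# Hypothetical outcomes with a hypothetical utility function encoding
# one preference relation (higher value = more preferred).
u = {"apple": 1.0, "banana": 2.5, "cherry": 2.5, "date": 4.0}

def preference_relation(utility):
    """All ordered pairs (a, b) such that a is weakly preferred to b."""
    return {(a, b) for a in utility for b in utility
            if utility[a] >= utility[b]}

# Strictly increasing transformations keep the preference relation intact.
transforms = {
    "affine": lambda x: 3 * x + 7,
    "exp": math.exp,
    "cube": lambda x: x ** 3,
}
base = preference_relation(u)
same_relation = {
    name: preference_relation({o: f(v) for o, v in u.items()}) == base
    for name, f in transforms.items()
}

def decompose(utility, weight):
    """Write utility[o] as numerator / denominator for an arbitrary positive weight."""
    return {o: (weight[o] * utility[o], weight[o]) for o in utility}

# Two arbitrary positive weightings (assumed values); nothing in the
# preference relation forces either one to be a "probability".
weights_a = {"apple": 0.25, "banana": 0.25, "cherry": 0.25, "date": 0.25}
weights_b = {"apple": 0.5, "banana": 0.125, "cherry": 0.25, "date": 0.125}

def recovers_utility(weight):
    """Check that numerator / denominator gives back the original utility."""
    parts = decompose(u, weight)
    return all(math.isclose(num / den, u[o]) for o, (num, den) in parts.items())

print(same_relation)
print(recovers_utility(weights_a), recovers_utility(weights_b))
```

Both weightings give back the same utility function, which is the sense in which calling the denominator “probability” looks like a labeling choice rather than something the preference relation pins down.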
It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
Sure. I was claiming that it is also a reasonable notion of alignment. My reason for not using that notion of alignment is that it doesn’t seem practically realizable.
However, if we could magically give the AI the “true universe” prior with the “true utility function”, I would be happy and say we were done, even if it wasn’t justifiable and couldn’t explain it to humans. I agree it would not be aligned in the sense of the post.
So, I’m not even sure it is sensible to think of UH alone as capturing human preferences; maybe UH doesn’t really make sense apart from PH.
This seems to argue that if my AI knew the winning lottery numbers, but didn’t have a chance to tell me how it knows this, then it shouldn’t buy the winning lottery ticket. I agree the Jeffrey-Bolker rotation seems to indicate that we should think of probutilities instead of probabilities and utilities separately, but it seems like there really are some very clear actual differences in the real world, and we should account for them somehow. Perhaps one difference is that probabilities change in response to new information, whereas (idealized) utility functions don’t. (Obviously humans don’t have idealized utility functions, but this is all a theoretical exercise anyway.)
I agree that “even assuming we know the true utility function, optimizing it is hard” -- but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
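A toy sketch of how the proxy can go wrong, under made-up assumptions: the utility function below is taken to be the “true” one, but the agent’s beliefs about the world state are mistaken. The action names, states, payoffs, and belief numbers are all hypothetical.

```python
# Hypothetical true utility of each (action, world state) pair.
utility = {
    ("safe", "calm"): 5, ("safe", "storm"): 5,
    ("risky", "calm"): 10, ("risky", "storm"): -20,
}
actions = ["safe", "risky"]

true_state = "storm"                            # what the world is actually like
mistaken_beliefs = {"calm": 0.9, "storm": 0.1}  # what the agent believes
accurate_beliefs = {"calm": 0.0, "storm": 1.0}  # beliefs matching the true state

def expected_utility(action, beliefs):
    """Expected utility of an action under the given beliefs over states."""
    return sum(p * utility[(action, state)] for state, p in beliefs.items())

def best_action(beliefs):
    """The action maximizing expected utility under the given beliefs."""
    return max(actions, key=lambda a: expected_utility(a, beliefs))

chosen = best_action(mistaken_beliefs)
realized = utility[(chosen, true_state)]

ideal = best_action(accurate_beliefs)
ideal_realized = utility[(ideal, true_state)]

print(chosen, realized)
print(ideal, ideal_realized)
```

The utility function is identical in both runs; only the beliefs differ, and that alone is enough to change which action gets picked and how much utility is actually realized.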
Thanks for clarifying, that’s clearer to me now.
If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which parts of preferences are “probability” vs “utility”.
I generally agree with the objective you propose (for practical reasons). The obvious way to do this is to do imitation learning, where (to a first approximation) you just copy the human’s policy. (Or alternatively, have the policy that a human would approve of you having.) This won’t let you exceed human intelligence, which seems like a pretty big problem. Do you expect an AI using policy alignment to do better than humans at tasks? If so, how is it doing better? My normal answer to this in the EV framework is “it has better estimates of probabilities of future states”, but we can’t do that any more. Perhaps you’re hoping that the AI can explain its plan to a human, and the human will then approve of it even though they wouldn’t have before the explanation. In that case, the human’s probutilities have changed, which means that policy alignment is now “alignment to a thing that I can manipulate”, which seems bad.
Fwiw, I am generally in favor of approaches along the lines of policy alignment; I’m more confused about the theory behind it here.
I’m not even sure whether you are closer to or further from understanding what I meant, now.
I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies.
My assumption was that you were arguing for why learning policies directly (assuming we could do it) has advantages over the default approach of value learning + optimization. That framing seems to explain most of the post.
Cool, I think I mostly agree with you.
I’m not sure that this implies that alignment is hard—if you’re trying to prove that your system is aligned by looking at the details of how it is constructed and showing that it all works together, then yes, alignment is harder than it would be otherwise. But you could imagine other versions of alignment, eg. taking intelligence as a black box and pointing it in the right direction. (For example, if I magically knew the true human utility function, and I put that in the black box, the outcomes would probably be good.)
Here when I say “aligned” I mean “trying to help”. It’s still possible that the AI is incompetent and fails because it doesn’t understand what the consequences of its actions are.