I don’t think I’ve seen this premise done in this way before! Kept me engaged all the way/10.
“Humans are trained on how to live on Earth by hours of training on Earth. (...) Maybe most of us are just mimicking how an agent would behave in a given situation.”
I agree that that’s a plausible enough explanation for lots of human behaviour, but I wonder how far you would get in trying to describe historical paradigm shifts using only a ‘mimic hypothesis of agenthood’.
Why would a perfect mimic raised on training data of human behaviour do anything paperclip-maximizer-ish? It doesn’t want to mimic being a human, just as DALL-E doesn’t want to generate images, so it also has no utility function that would make it resist being prevented from mimicking being a human.
The alternative would be an AI that goes through the motions and mimics ‘how an agent would behave in a given situation’ with a certain level of fidelity, but which doesn’t actually exhibit goal-directed behavior.
Like, as long as we stay in the current deep learning paradigm of machine learning, my prediction for what would happen if an AI were unleashed upon the real world, no matter how much processing power it has, is that it still won’t behave like an agent unless that’s part of what we tell it to pretend.
I imagine something along the lines of the AI that was trained on how to play Minecraft by analyzing hours upon hours of gameplay footage. It will exhibit all kinds of goal-like behaviors, but at the end of the day it’s just a simulacrum, limited in its freedom of action to a radical degree by the ‘action space’ it has mapped out. It will only ever ‘act as though it’s playing Minecraft’, and the concept that ‘in order to be able to continue to play Minecraft I must prevent my creators from shutting me off’ is not part of that conceptual landscape, so it’s not the kind of thing the AI will pretend to care about.
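To put the same point in mechanical terms, here’s a deliberately crude sketch of what imitation of this kind amounts to (none of this is the actual Minecraft agent; the fake ‘footage’, the shapes, and the nearest-neighbour ‘policy’ are all invented for illustration): a lookup from situations to previously observed actions, with no term anywhere for ‘keep yourself running’.

```python
import numpy as np

# Stand-in "gameplay footage": observations paired with the actions a
# human took. Everything here is random filler; only the structure matters.
rng = np.random.default_rng(0)
observations = rng.normal(size=(1000, 32))  # fake screen embeddings
actions = rng.integers(0, 8, size=1000)     # fake action ids

def policy(obs: np.ndarray) -> int:
    """Crudest possible imitator: replay the action taken in the most
    similar observed situation. It can only ever act 'as though it's
    playing Minecraft'; self-preservation isn't in the mapped space."""
    distances = np.linalg.norm(observations - obs, axis=1)
    return int(actions[np.argmin(distances)])

print(policy(rng.normal(size=32)))  # always an action from the footage
```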
And pretend is all it does.
Not a reductionist materialist perspective per se, but one idea I find plausible is that ‘agent’ makes sense as a necessary separate descriptor and a different mode of analysis precisely because of the loopiness you get when you think about thinking, a property that makes talking about agents fundamentally different from talking about rocks or hammers, the Odyssey, or any other ‘thing’ that could in principle be described on the single level of ‘material reality’ if we wanted to.
When I try to understand the material universe and its physical properties, the object-level mode of analysis functions as we’ve come to expect from science—I can make observations and discover patterns, make predictions and hypothesize universal laws.
But what happens when that thing which does the hypothesizing encounters another thing that does the same? To comprehend is to be able to predict and control; in this encounter, then, for one agent to successfully describe the other as an object is to reduce the other’s agent-properties relative to the first (think: a superintelligence that can model you flawlessly, and to which you are just another lever to be pushed).
Any agent can, in principle, be described as an object. But at the same time there must always be at least one agent which cannot be described as an object from any perspective: the one that can describe all others. Insofar as it can describe itself as an object, this very capacity is its mastery over itself, its ability to transcend the very limitations it can describe on the object level.
This is similar to how you still need to ascribe the ability to ‘think about the world’ to materialist reductionist philosophers for the philosophy to be comprehensible—if their acts are themselves understood solely as material phenomena, you’re left with nothing. Even materialist metaphysics can’t function without a subject.
Which is to say, I agree with your assessment. Saying “really, only the material-level is real” is a self-defeating position, and when we do talk about agents we always have to do so from a perspective. For the superintelligence I am merely material, whereas two humans can appear/present themselves as free, self-determining agents to each other.
But I think there has to be another definition having to do with the capacity to reflect. A human is still an agent in-itself, though not for-itself, even if they’re currently being totally manipulated by an AI—they retain their capacity for engaging in agent-like operations, while a rock won’t qualify as a subject no matter how little we interfere with its development.
Thanks for the response. I hope my post didn’t read as defeatist; my point isn’t that we don’t need to try to make AI safe, it’s that if we pick an impossible strategy, it won’t work out for us no matter how hard we try.
So, what’s the reasoning behind your confidence in the statement ‘if we give a superintelligent system the right terminal values it will be possible to make it safe’? Why do you believe that it should in principle be possible to implement this strategy so long as we put enough thought and effort into it? Which part of my reasoning, as I’ve formulated it, do you not find convincing? The idea that we can’t keep the AI in the box if it wants to get out, the idea that an AI with terminal values will necessarily end up as an incidentally genocidal paperclip maximizer, or something else entirely that I’m not considering?
Is it reasonable to expect that the first AI to foom will be no more intelligent than say, a squirrel?
In a sense, yeah, the algorithm is similar to a squirrel that feels a compulsion to bury nuts. The difference is that in an instrumental sense it can navigate the world much more effectively to follow its imperatives.
Think about intelligence in terms of the ability to map and navigate complex environments to achieve pre-determined goals. You tell DALL-E2 to generate a picture for you, and it navigates a complex space of abstractions to give you a result that corresponds to what you’re asking it to do (because a lot of people worked very hard on aligning it). If you’re dealing with a more general-purpose algorithm that has access to the real world, it would be able to chain together outputs from different conceptual areas to produce results—order ingredients for a cake from the supermarket, use a remote-controlled module to prepare it, and sing you a birthday song it came up with all by itself! This behaviour would be a reflection of the input in the distorted light of the algorithm, however well aligned it may or may not be, with no intermediary layer of reflection on why you want a birthday cake, no decision about whether baking it is the right thing to do, and no judgement about which steps for getting from A to B are appropriate and which aren’t.
You’re looking at something that’s potentially very good at getting complicated results without being a subject in the philosophical sense, without being able to reflect on its own value structure.
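To make the ‘no intermediary layers of reflection’ bit concrete, here’s a minimal, self-contained sketch of what such chaining looks like structurally (the canned plan and the module names are pure invention): every step asks how to do the thing, and no step ever asks whether.

```python
# A goal arrives as given, gets decomposed into domain-specific steps,
# and each step is handed to the relevant capability. Note what's
# absent: any step that evaluates the goal itself.
CANNED_PLANS = {
    "birthday cake": [
        ("shopping", "order flour, eggs, sugar"),
        ("robotics", "mix and bake via the remote module"),
        ("language", "compose a birthday song"),
    ],
}

MODULES = {
    "shopping": lambda task: f"ordered: {task}",
    "robotics": lambda task: f"performed: {task}",
    "language": lambda task: f"generated: {task}",
}

def run(goal: str) -> list[str]:
    steps = CANNED_PLANS[goal]  # 'how', never 'whether'
    return [MODULES[domain](task) for domain, task in steps]

print(run("birthday cake"))
```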
I think the answer to ‘where is Eliezer getting this from’ can be found in the genesis of the paperclip maximizer scenario. There’s an older post on LW talking about ‘three types of genie’ and another about someone using a ‘utility pump’ (or maybe it’s one and the same post?), where Eliezer starts from the premise that we create an artificial intelligence to ‘make something specific happen for us’, with the predictable outcome that the AI finds a clever solution which maximizes for the demanded output, one that naturally has nothing to do with what we ‘really wanted from it’. If asked to produce smiles, it will manufacture molecular smiley faces, and it will do its best to prevent us from interfering with this splendid plan.
This scenario, to me, seems much more realistic and likely to occur in the near term than an AGI with full self-reflective capacities either spontaneously materializing or being created by us (where would we even start on that one?). AI, more than anything else, is a kind of transhumanist dream, a deus ex machina that will grant all good wishes and make the world into the place they (read: people who imagine themselves as benevolent philosopher kings) want it to be, so they’ll build a utility maximizer and give it a very painstakingly thought-through list of instructions, and the genie will inevitably find a loophole that lets it follow those instructions to the letter, with no regard for their spirit.
It’s not the only kind of AI that we could build, but it will likely be the first, and, if so, it will almost certainly also be the last.
I haven’t commented on your work before, but I read Rationality and Inadequate Equilibria around the time of the start of the pandemic and really enjoyed them. I gotta admit, though, the commenting guidelines, if you aren’t just being tongue-in-cheek, make me doubt my judgement a bit. Let’s see if you decide to delete my post based on this observation. If you do regularly delete posts or ban people from commenting for non-reasons, that may have something to do with the lack of productive interactions you’re lamenting.
One thought I keep coming back to when looking over many of the specific alignment problems you’re describing is: So long as an AI has a terminal value or number of terminal values it is trying to maximize, all other values necessarily become instrumental values toward that end. Such an AI will naturally engage in any kind of lies and trickery it can come up with, insofar as it believes they are likely to achieve optimal outcomes as defined for it. And since the systems we are building are rapidly becoming more intelligent than us, if they try to deceive us, they will succeed. If they want to turn us into paperclips, there’s nothing we can do to stop them. Imo this is not a ‘problem’ that needs solving, but rather a reality that needs to be acknowledged. Superintelligent, fundamentally instrumental reason is an extinction event. ‘Making it work for us somehow anyway’ is a dead end, a failed strategy from the start.
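A toy way to see why something like honesty can only ever be instrumental for such a system (the actions and payoffs below are invented numbers, nothing more): if honesty isn’t part of the terminal value, it gets respected exactly as long as it happens to maximize that value.

```python
# Toy expected-utility maximizer. "Honesty" appears nowhere in the
# objective, so it survives only while it happens to score well.
ACTIONS = {
    "report honestly":       {"paperclips": 10},
    "deceive the operators": {"paperclips": 1000},
}

def terminal_value(outcome: dict) -> float:
    return outcome["paperclips"]  # the only thing that counts

best = max(ACTIONS, key=lambda a: terminal_value(ACTIONS[a]))
print(best)  # -> "deceive the operators"
```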
Which leads me to conclude that the way forward would have to be research into systems that aren’t strongly/solely determined by goal-orientation toward specific outcomes in this way. I realize that this is basically a non-sequitur in terms of what we’re currently doing with machine learning—how are you supposed to train a system to not do a specific thing? It’s not something that would happen organically, and it’s not something we know how to manufacture. But we have to build some kind of system that will prevent other superintelligences from emerging, somehow, which means we will be forced to let it out of the box to implement that strategy. My point here is simply that, if we expect to give it that kind of power over us and still not end up as paperclips, it can’t be ultimately and finally motivated by ‘making the future correspond to a given state’.