I’m trying to save the world. Currently working on getting a deep understanding of AGI safety. Feel free to book a meeting: https://calendly.com/simon-skade/30min
Simon Skade
I want to preregister my prediction that Sydney will be significantly worse for significantly longer games (e.g. I’d expect it to often make illegal or nonsensical moves by around move 50), though I’m already surprised that it apparently works up to 30 moves. I unfortunately don’t have time to test it, but it’d be interesting to learn whether I’m correct.
Likewise, for some specific programs we can verify that they halt.
(Not sure if I’m missing something, but my initial reaction:)
There’s a big difference between being able to verify, for some specific programs, whether they have a property, and being able to check it for all programs.
For an arbitrary TM, we cannot check whether it outputs a correct solution to a specific NP-complete problem. We cannot even check whether it halts! (Rice’s theorem etc.)
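To make the contrast concrete, here’s a toy illustration (my own example, in Python): the first program obviously halts and we can verify that by inspection, while whether the second halts is equivalent to Goldbach’s conjecture, so nobody can currently verify it, and by the halting problem no general procedure decides this for arbitrary programs.

```python
def halts_trivially():
    # We can verify by inspection that this specific program halts.
    return 42

def is_prime(k):
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

def goldbach_search():
    # Halts if and only if Goldbach's conjecture is false; whether this
    # particular program halts is an open problem, and no algorithm can
    # decide the question for arbitrary programs.
    n = 4
    while any(is_prime(p) and is_prime(n - p) for p in range(2, n)):
        n += 2
    return n  # a counterexample, if one exists
```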
Not sure what alignment relevant claim you wanted to make, but I doubt this is a valid argument for it.
Thank you! I’ll likely read your paper and get back to you. (Hopefully within a week.)
From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. To achieve something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that’s how Yudkowsky means it, but I’m not sure if that’s what most people mean when they use the term.) (Though note that this doesn’t imply you need so much consequentialism that we won’t be able to shut down the AGI. But as I see it, a theoretical solution to corrigibility needs to deal with consequentialism. I haven’t looked into your paper yet, so it’s well possible that my comment here is misguided.)

E.g. if we just built a gigantic transformer and trained it on all human knowledge (and say we had higher sample efficiency or so), it is possible that it could do almost everything humans can do. But it won’t be able to just one-shot solve quantum gravity when we give it the prompt “solve quantum gravity”. There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn’t. It is theoretically possible that the transformer learns somewhat deeper underlying patterns than humans have (though I do not expect that from something like the transformer architecture), and is thereby able to generalize a bit further than humans. But it seems extremely unlikely that it learned such deep understanding that it already has the solution to quantum gravity encoded, given that it was never explicitly trained to learn that and just read physics papers.

The transformer might be able to solve quantum gravity if it can recursively query itself to engineer better prompts, or if it can give itself feedback which is then somehow converted into gradient descent updates and it then tries multiple times. But in those cases there is consequentialist reasoning again. The key point: consequentialism becomes necessary when you go beyond human level.
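To illustrate the distinction, here is a minimal sketch (the `generate` and `score` functions are hypothetical stand-ins, not any particular API): the first function is a single fixed forward pass with no runtime optimization, while the second searches toward a target at runtime by feeding its own evaluations back into the next attempt.

```python
def one_shot(model, prompt):
    # No runtime search: whatever was baked in by training is all you get.
    return model.generate(prompt)

def consequentialist_loop(model, goal, score, n_rounds=10):
    # Runtime search toward a target: generate, evaluate against the goal,
    # and use the feedback to steer the next attempt.
    best, best_score = None, float("-inf")
    prompt = goal
    for _ in range(n_rounds):
        candidate = model.generate(prompt)
        s = score(candidate, goal)
        if s > best_score:
            best, best_score = candidate, s
        # recursively engineer a better prompt from the feedback
        prompt = f"{goal}\nPrevious attempt (score {s}):\n{candidate}\nImprove on it."
    return best
```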
Out of interest, how much do you agree with what I just wrote?
Hi Koen, thank you very much for writing this list!
I must say I’m skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that’s not at all a clear definition yet, I’m still deconfusing myself about that, and I’ll likely publish a post clarifying the problem as I see it within the next month.)
So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I’d expect that the other solutions here may likewise only give you corrigible agents that cannot do very impressive new things (or, if they can, they might still kill us all).
But I may be wrong. I probably only have time to read one paper. So: What would you say is the strongest result we have here? If I looked at one paper/post and explained why it isn’t a solution to corrigibility as I see it, for which paper would it be most interesting for you to see what I write? (I guess I’ll do it sometime this week if you write me back, but no promises.)
Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (Best to take an example task that is a bit too hard for humans to do; that makes it harder to reason about, but I think this is where the difficulty is.)
Huh, interesting. Could you give some examples of people who seem to claim this, and if Eliezer is among them, where he seems to claim it? (Would just interest me.)
In case some people relatively new to LessWrong aren’t aware of it (and because I wish I had found this out earlier): “Rationality: From AI to Zombies” does not nearly cover all of the posts Eliezer published between 2006 and 2010.
Here’s how it is:
“Rationality: From AI to Zombies” probably contains like 60% of the words EY has written in that timeframe and the most important rationality content.
The original sequences are basically the old version of the collection that is now “Rationality: A-Z”, containing a bit more content, in particular a longer quantum physics sequence and sequences on fun theory and metaethics.
All EY posts from that timeframe (or here for all EY posts until 2020, I guess) (these can also be found on LessWrong, but not in any collection, I think).
So a sizeable fraction of EY’s posts are not in a collection.
I just recently started reading the rest.
I strongly recommend reading:
And generally, a lot of posts on AI (primarily posts in the AI foom debate) are not in the sequences. Some of them were pretty good.
I feel like many people look at AI alignment as if the main problem were being careful enough when we train the AI, so that no bugs cause the objective to misgeneralize.
This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned AI or a corrigible AI. Even if it’s relatively obvious that AGI design X destroys the world, and all the wise actors don’t deploy it, we cannot prevent unwise actors from deploying it a bit later.
We currently don’t have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.
Simon Skade’s Shortform
I’d guess we can likely reliably identify some classes of pivotal acts where we cannot be fooled easily, and would only accept suggestions from those classes, and I’d still intuitively expect that there are doable pivotal acts in those classes.
I’d say that there’s a big difference between fooling you into thinking “brilliant, that really looks plausible” shortly after you read it, and fooling a group of smart humans who spend months trying to deeply understand the concepts and to make really sure there are no loopholes. In fact, I’d expect that making us strongly but wrongly believe after months that everything works is impossible even in the limit of superintelligence, though I do think a superintelligence could output some text that destroys/shapes the world as it’d like. And generally, something smart enough to solve alignment will likely be smart enough to break out of the box and take over the world, as said.
But yeah, if the people with the AGI aren’t extremely cautious and just go ahead and quickly build AGI because it all looks correct, then that might go badly. But my point was that it is within the reach of human checkability.
I think that alignment approaches with a heavy reliance on output evaluation are doomed, both on the grounds that humans can’t evaluate the effectiveness of a plan capable of ending the acute risk period, [...]
The way you say this, and the way Eliezer wrote Point 30 in AGI ruin, sounds like you think there is no AGI text output with which humans alone could execute a pivotal act.
This surprises me. For one thing, if the AGI outputs the textbook of the future on alignment, I’d say we could understand that sufficiently well to be sure that our AI will be aligned/corrigible. (And sure there’s the possibility that we could be hacked through text so we only think it’s safe or so, but I’d expect that this is significantly harder to achieve than just outputting a correct solution to alignment.)
But even if we say humans would need to do a pivotal act without AGI, I’d intuitively guess an AGI could give us the tools (e.g. non-AGI algorithms we can understand) and relevant knowledge to do it ourselves.
To be clear, I do not think that we can get an AGI that outputs the relevant text for us to do a weak pivotal act without the AGI destroying the world. And if we could do that, there may well be a safer way to let a corrigible AI do a pivotal act.
So I agree it’s not a strategy that could work in practice.
The way you phrase it sounds to me like it’s not even possible in theory, which seems pretty surprising to me, so I’m asking whether you actually think that or whether you meant it’s not possible in practice:
1. Do you agree that in the unrealistic hypothetical case where we could build a safe AGI that outputs the textbook of the future on alignment (but we somehow don’t have the knowledge to build other aligned or corrigible AGI directly), we’d survive?
2. If future humans (say 50 years in the future) could transmit 10MB of instructions through a time machine, only they are not allowed to tell us how to build aligned or corrigible AGI or how to find out how to build aligned or corrigible AGI (etc. through the meta-levels), do you think they could still transmit information with which we would be able to execute a pivotal act ourselves?
I’m also curious about your answer on:
3. If we had a high-rank dath ilani keeper teleported into our world, but he is not allowed to build aligned or corrigible AI and cannot tell anyone how or make us find the solution to alignment etc, could he save the world without using AGI? By what margin? (Let’s assume all the pivotal-act specific knowledge from dath ilan is deleted from the keeper’s mind as he arrives here.)
(E.g. I’m currently at P(doom)>85%, but P(doom | tomorrow such a keeper will be teleported here)~=12%. (Most of the uncertainty comes from the possibility that my model is wrong. So I think that with ~80% probability, if we got such a keeper, he’d almost certainly be able to save the world, but in the remaining 20% where I misestimated something, it might still be pretty hard.))
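Spelling out the arithmetic behind those numbers (taking “almost certainly able to save the world” as roughly a 2% residual doom chance, which is just an illustrative number, not something I stated precisely):

$$P(\text{doom} \mid \text{keeper}) \approx 0.8 \cdot 0.02 + 0.2 \cdot p \approx 0.12 \;\Rightarrow\; p \approx 0.52,$$

where $p$ is the doom probability in the worlds where my model is importantly wrong.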
To clarify, when we succeed by a “cognitive interpretability approach”, I guess you mostly mean sth like:
We have a deep and robust enough understanding of minds and cognition that we know it will work, even though we likely have no idea what exactly the AI is thinking.
Whereas I guess many people might think you mean:
We have awesome interpretability or ELK so we can see what the AI is thinking.
Let me rephrase. I think you think:
Understanding the thoughts of a powerful AI is a lot harder than understanding those of a human-level or subhuman-level AI.
Even if we could translate the thoughts of the AI into language, humans would need a lot of time to understand those concepts, and we likely cannot get enough humans to oversee the AI while it is thinking, because that would take too much time.
So we’d need incredibly good and efficient automated interpretability tools / ELK algorithms that detect when an AI thinks dangerous thoughts.
However, the ability to detect misalignment doesn’t yet give us a good AI. (E.g. relaxed adversarial training with those interpretability tools won’t give us the changes in the deep underlying cognition that we need, but will just add some superficial behavioral patches or so.)
To get an actually good AI, we need to understand how to shape the deep parts of cognition in a way that extrapolates as we want.
Am I (roughly) correct in that you hold those opinions?
(I do basically hold those opinions, though my model has big uncertainties.)
I agree that it’s good that we don’t need to create an aligned superintelligence from scratch with GD, but stating it like this seems to require incredibly pessimistic priors on how hard alignment is, and I do want to make sure people don’t misunderstand your post and end up believing that alignment is easier than it is. I guess for most people, understanding the sharp left turn should update them towards “alignment is harder”.
As an aside, it shortens timelines and especially shortens the time we have where we know what process will create AGI.
The key problem within the alignment problem is to create an AGI whose goals extrapolate to a good utility function. This is harder than just creating an AI that is reasonably aligned with us at human level, because such an AI may still kill us when we scale up optimization power, which at a minimum will make the AI’s preferences more coherent and may well scramble them further.
Importantly, “extrapolate to a good utility function” is harder than “getting a human-level AI with the right utility function”, because the steep slopes for increasing intelligence may well push towards misalignment by default, so it’s possible that we then don’t have a good way to scale up intelligence while preserving alignment. Navigating the steep slopes well is a hard part of the problem, and we probably need a significantly superhuman AGI with the right utility function to do that well. Getting that is really really hard.
Which may seem rather non-obvious. Intuitively, you might think that the two modules scenario has more constraints on the parameters than the one module scenario, since there’s two places in the network where you’re demanding particular behaviour rather than one.
Don’t more constraints mean less freedom and therefore less broadness in parameter space?
(Sorry if that’s a stupid question, I don’t really understand the reasoning behind the whole connection yet.)
(And thanks, the last two paragraphs were helpful, though I didn’t look into the math!)
Could someone give me a link to the glowfic tag where Eliezer published his list, and say how strongly it spoils the story?
Hm, it seems I didn’t have sth very concrete in mind either, only a feeling that I’d like there to be a concrete example so I can better follow along with your claims. I was a bit tired when I read your post; after considering aspects of it the next day I found it more useful, and after looking over it now, even a bit more useful. So part of my response is “oops, my critique was produced by a not-great mental process and was partly wrong”. Still, here are a few places where it would have been helpful to have an additional example of what you mean:
To address the first issue, the model would finalize the development of mental primitives. Concepts it can plug in and out of its world-model as needed, dimensions along which it can modify that world-model. One of these primitives will likely be the model’s mesa-objective.
To address the second issue, the model would learn to abstract over compositions of mental primitives: run an algorithm that’d erase the internal complexity of such a composition, leaving only its externally-relevant properties and behaviors.
It would have been nice to have a concrete example of what a mental primitive is and what “abstracting over” means.
If you had one example where you follow through all the steps of getting to agency, like maybe an AI learning to play Minecraft, that would have helped me a lot, I think.
Feedback: I think this post strongly lacks concrete examples.
I really like this analogy!
Also worth noting that some idiot may just play around with death ray technology without aiming it...
It’d be interesting to see whether it performs worse if it only plays one side and the other side is played by a human. (I’d expect so.)