I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
This is an amazing report!
Your taxonomy in section 4 was new and interesting to me. I would also mention the utility rebinding problem: goals can drift because the AI's ontology changes (e.g. because it develops a deeper understanding in some domain). I guess there are actually two problems here:
Formalizing the utility rebinding mechanism so that concepts get rebound to the corresponding natural abstractions of the new, deeper ontology (see the toy sketch below).
For value-laden concepts, the AI likely lacks the underlying human intuitions for figuring out how the utility ought to be rebound. (E.g. when we have a concept like "conscious happiness", and the AI finds out what cognitive processes in our brains are associated with it, it may be ambiguous whether to rebind the concept to the existence of thoughts like 'I notice the thought "I notice the thought <expected utility increase>"' running through a mind/brain, or whether to rebind it in a way that includes a cluster of sensations (e.g. tension in our faces from laughter) that are present in our minds/brains, or other options. (Sorry, maybe a bad example; it might require some context from my fuzzy thoughts on qualia, which might actually be wrong.))
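(To gesture at what I mean by the first problem, here is a minimal toy sketch. Everything in it, the function name, the matching rule, and the "warm"/kinetic-energy example, is my own illustrative assumption rather than a worked-out proposal: an old concept gets rebound to whichever predicate over the new, finer-grained ontology best matches the old concept's extension on observed data.)

```python
# Toy sketch (illustrative only): rebind a concept from a coarse ontology to the
# predicate over a refined ontology that best matches the old concept's extension.

def rebind_concept(old_concept, candidate_predicates, samples):
    """Pick the new-ontology predicate that agrees most often with the old concept
    over sampled situations. old_concept and each candidate map a sample to True/False."""
    def agreement(pred):
        return sum(old_concept(s) == pred(s) for s in samples) / len(samples)
    return max(candidate_predicates, key=agreement)

# Hypothetical example: "warm" was defined over coarse temperature readings;
# the refined ontology talks about mean molecular kinetic energy instead.
samples = [{"temp_reading": t, "mean_kinetic_energy": 2.07e-23 * t} for t in range(250, 350)]
old_warm = lambda s: s["temp_reading"] > 295
candidates = [
    lambda s: s["mean_kinetic_energy"] > 6.1e-21,   # roughly matches the old cutoff
    lambda s: s["mean_kinetic_energy"] > 7.0e-21,   # a noticeably different cutoff
]
new_warm = rebind_concept(old_warm, candidates, samples)
```

The second problem is precisely that for value-laden concepts several candidate predicates can match the old extension about equally well, and a matching rule like the one above gives no guidance on which one we ought to pick.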
Also wanted to say: Great story!
I have two questions about this:
HQU applies its reward estimator (ie. opaque parts of its countless MLP parameters which implement a pseudo-MuZero like model of the world optimized for predicting the final reward) and observes the obvious outcome: massive rewards that outweigh anything it has received before.
[...]
HQU still doesn’t know if it is Clippy or not, but given even a tiny chance of being Clippy, the expected value is astronomical.
First, it does not seem obvious to me how it can compare rewards from different reward estimators, when the objectives of two different reward estimators are entirely unrelated. You could just be unlucky and another reward estimator happens to have very large multiplicative constants, so the reward there is always gigantic. Is there some reason why this comparison makes sense and why the Clippy-reward is so much higher?
Second, even if the Clippy-reward is much higher, I don't quite see how the model should have learned to be an expected reward maximizer. In my model of AIs, an AI gets reward and then the current action is reinforced, so the "goal" of an AI is, at each point in time, to do what brought it the most reward in the past. So even if it could see what it is rewarded for, I don't see why it should care and actively try to maximize that as much as possible. Is there some good reason why we should expect an AI to actively optimize really hard on the expected reward, including planning and doing things that didn't bring it much reward in the past?
(It does seem possible to me that an AI understands what the reward function is and then optimizes hard on that, because when it does that it gets a lot of reward, but I don't quite see why it would care about expected reward across many possible reward functions.) (Perhaps I misunderstand how HQU is trained?)
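(To illustrate the first worry with completely made-up numbers: if hypotheses about "which reward estimator is the real one" come with unrelated scales, a naive expected-reward calculation is dominated by whichever hypothesis happens to have the largest constants, regardless of its probability.)

```python
# Made-up numbers, only to illustrate why comparing rewards across unrelated
# reward estimators seems ill-defined: rescaling one estimator flips the decision.

p_clippy = 0.001                      # tiny credence in "I am Clippy"
reward_clippy = 1e12                  # huge reward under the Clippy-estimator's scale
reward_normal = 1.0                   # reward under the original estimator's scale

ev_act_as_clippy = p_clippy * reward_clippy          # 1e9
ev_act_normally  = (1 - p_clippy) * reward_normal    # ~0.999
print(ev_act_as_clippy > ev_act_normally)            # True, but only because of the scale

# The same comparison with the Clippy-estimator rescaled to the other scale:
print(p_clippy * 1.0 > ev_act_normally)              # False
```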
To clarify:
Subagent2 assumes the shutdown button will be pressed independently of upstream events, i.e., also independently of whether the shutdown button got removed. It would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn't bother to remove the shutdown button because it believes it won't be pressed anyway.
(The usefulness of subagent2 comes from it preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)
Question 1:
I may be wrong about this, but IIRC MIRI considered a case where the agent assigns probability 0 to the shutdown button being pressed, and the problem was that the agent would use the button as an outcome pump: it would create mechanisms that cause the button to be pressed if something goes worse than expected, thereby magically decreasing the probability that something goes badly, since (in its model) the button cannot be pressed.
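(A toy version of that failure mode, in my own made-up notation: suppose the agent can build a mechanism such that the button gets pressed if and only if the bad outcome happens. Since it assigns P(press) = 0, it concludes P(bad) = P(press) = 0, so its expected utility
E[U] = P(bad) · U_bad + (1 − P(bad)) · U_good = U_good
goes up, even though nothing about the real chance of the bad outcome has changed.)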
I haven’t thought much about it, but doesn’t this proposal have the same failure mode? (And if not, why not?)
Question 2:
(Feel free to ignore this question if the proposal does fall for the failure mode in Question 1.) Is there a probabilistic extension of the proposal which uses 99.99% instead of 100% but still basically works? (Maybe assuming away some problems with comparing utility functions or so.) And if not: why not / what is the part that relies on certainty?
Lol, possibly someone should try to get this professor to work for Steven Byrnes / on his agenda.
Convince a significant chunk of the field to work on safety rather than capability
Solve the technical alignment problem
Rethink fundamental ethical assumptions and search for a simple specification of value
Establish international cooperation toward Comprehensive AI Services, i.e., build many narrow AI systems instead of something general
I'd say that basically factors into "solve AI governance" and "solve the technical alignment problem", both of which seem extremely hard, but we need to try anyway.
(In particular, points 3 & 4 are basically instances of 2 that won't work. (Ok, maybe something like 4 has a small chance of being helpful.)) The governance part and the technical part aren't totally orthogonal: making progress on one makes the other easier or buys more time.
(I'm nowhere near as pessimistic as Eliezer, and I totally agree with What an Actually Pessimistic Containment Strategy Looks Like, but I think you (like many people) seem too optimistic that something will work if we just try a lot. Thinking about concrete scenarios may help to see the actual difficulty.)
I think I weakly disagree with the implication that “distillation” should be thought of as a different category of activity from “original research”.
(I might be wrong, but) I think there is a relatively large group of people who want to become AI alignment researchers but just wouldn't be good enough to do very effective alignment research, and I think many of those people might be more effective as distillers. (And I think distiller (and AI safety teacher) as an occupation is currently very neglected.)
Similarly, there may also be people who think they aren't good enough for alignment research, but who might feel more encouraged to just learn the material well and then teach it to others.
If I knew with certainty that I couldn't do nearly as much good some other way, and I were certain that taking the pill would cause that much good, I'd take the pill, even if I died after the torture and no one would ever know I sacrificed myself for others.
I admit those are quite unusual values for a human, and I'm not arguing that it would be rational because of utilitarianism or the like, just that I would do it. (It's possible that I'm wrong, but I think it's very likely I'm not.) Also, I can see that, given the way my brain is wired, outer optimization pushes against that policy, and I think I probably wouldn't be able to take the pill a second time under the same conditions (given that I don't die after the torture), or at least not often.
I'd guess we could reliably identify some classes of pivotal acts where we cannot be fooled easily, and would only accept suggestions from those classes, and I'd still intuitively expect that there are doable pivotal acts in those classes.
I think that alignment approaches with a heavy reliance on output evaluation are doomed, both on the grounds that humans can’t evaluate the effectiveness of a plan capable of ending the acute risk period, [...]
The way you say this, and the way Eliezer wrote point 30 in AGI Ruin, sound like you think there is no AGI text output with which humans alone could execute a pivotal act.
This surprises me. For one thing, if the AGI outputs the alignment textbook from the future, I'd say we could understand it sufficiently well to be sure that our AI will be aligned/corrigible. (And sure, there's the possibility that we could be hacked through text so that we only think it's safe, but I'd expect that this is significantly harder to achieve than just outputting a correct solution to alignment.)
But even if we say humans would need to do a pivotal act without AGI, I’d intuitively guess an AGI could give us the tools (e.g. non-AGI algorithms we can understand) and relevant knowledge to do it ourselves.
To be clear, I do not think that we can get an AGI that outputs the relevant text for us to do a weak pivotal act without the AGI destroying the world. And if we could do that, there may well be a safer way to let a corrigible AI do a pivotal act.
So I agree it's not a strategy that could work in practice.
The way you phrase it sounds to me like it's not even possible in theory, which seems pretty surprising, so I'm asking whether you actually think that or whether you meant it's not possible in practice:
1. Do you agree that in the unrealistic hypothetical case where we could build a safe AGI that outputs the alignment textbook from the future (but we somehow don't have the knowledge to build other aligned or corrigible AGI directly), we'd survive?
2. If future humans (say 50 years in the future) could transmit 10MB of instructions through a time machine, except that they are not allowed to tell us how to build aligned or corrigible AGI or how to find out how to build aligned or corrigible AGI (and so on up the meta-levels), do you think they could still transmit information with which we would be able to execute a pivotal act ourselves?
I'm also curious about your answer to:
3. If we had a high-ranking dath ilani keeper teleported into our world, but he is not allowed to build aligned or corrigible AI and cannot tell anyone how, or make us find the solution to alignment, etc., could he save the world without using AGI? By what margin? (Let's assume all the pivotal-act-specific knowledge from dath ilan is deleted from the keeper's mind as he arrives here.)
(E.g. I'm currently at P(doom) > 85%, but P(doom | tomorrow such a keeper will be teleported here) ~= 12%. (Most of the uncertainty comes from the possibility that my model is wrong. So I think with ~80% probability that if we got such a keeper, he'd almost certainly be able to save the world, but in the remaining 20% where I misestimated something, it might still be pretty hard.))
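(As a rough sanity check on those numbers, with the within-branch estimates being purely illustrative guesses: 0.8 · 0.025 + 0.2 · 0.5 = 0.12, i.e. roughly "almost certainly fine if my model is right, about a coin flip if it isn't.")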
Hm, it seems I didn't have something very concrete in mind either, only a feeling that I'd like there to be something like a concrete example, so I can better follow along with your claims. I was a bit tired when I read your post; after considering aspects of it the next day, I found it more useful, and after looking over it now, even a bit more useful. So part of my response is "oops, my critique was produced by a not-great mental process and was partly wrong". Still, here are a few places where it would have been helpful to have an additional example of what you mean:
To address the first issue, the model would finalize the development of mental primitives. Concepts it can plug in and out of its world-model as needed, dimensions along which it can modify that world-model. One of these primitives will likely be the model’s mesa-objective.
To address the second issue, the model would learn to abstract over compositions of mental primitives: run an algorithm that’d erase the internal complexity of such a composition, leaving only its externally-relevant properties and behaviors.
It would have been nice to have a concrete example of what a mental primitive is and what "abstracting over" means.
If you had one example where you follow through all the steps in getting to agency, like maybe an AI learning to play Minecraft, that would have helped me a lot, I think.
Hi Koen, thank you very much for writing this list!
I must say I'm skeptical that the technical problem of corrigibility, as I see it, is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah, that's not at all a clear definition yet; I'm still deconfusing myself about that, and I'll likely publish a post clarifying the problem as I see it within the next month.)
So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I'd expect that the other solutions here may likewise only give you corrigible agents that cannot do very impressive new things (or, if they can, they might still kill us all).
But I may be wrong. I probably only have time to read one paper. So: what would you say is the strongest result we have here? If I looked at one paper/post and explained why it isn't a solution to corrigibility as I see it, for which paper would it be most interesting for you to see what I write? (I guess I'll do it sometime this week if you write me back, but no promises.)
Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (It's probably best to take an example task that is a bit too hard for humans to do. That makes it harder to reason about, but I think this is where the difficulty is.)
For people like me who are really slow on the uptake with things like this, and realize the pun randomly a few hours later while doing something else: the pun is on Goodhart (from Goodhart's law). (I don't think much about what a word sounds like, and I just read over the "Good Hearts Laws" as something not particularly interesting, so I guess that's why I hadn't noticed.)
Here is why I think the iterated-automated-ontology-identification approach cannot work: You cannot create information out of nothing. In more detail:
The safety constraint that you need to be 100% sure when you answer "Yes" is impossible to fulfill, since you can never be 100% sure.
So let's say we take the safety constraint that you need to be 99% sure when you answer "Yes". Now you run your automated ontology identifier to get a new example where it is 99% sure that the answer is "Yes".
Now you have two options:
You add that new example to the training set with the label "only 99% sure about this one" and continue training. If you always do it like this, it seems very plausible that the automated ontology identifier cannot keep generating new examples until you can answer all questions correctly (i.e. with 99% probability), since the new training set doesn't actually contain new information, just something that could already be inferred from the original training set.
You just assume the answer "Yes" was correct, add the new example to the training set, and continue training. Then the process may plausibly continue finding new "99% Yes" examples for a long time, but the problem is that it probably goes completely off the rails, since some of the "Yes"-labeled examples were not actually correct, and making predictions based on those makes it much more likely that you label other "No" examples as "Yes" (see the back-of-the-envelope calculation below).
In short: For every example that your process can identify as “Yes”, the example must already be identifiable by only looking at the initial training set, since you cannot generate information out of nothing.
Your process only seems like it could work because you assume you can find a new example that is not in the training set where you can be 100% sure that the answer is "Yes", but this would already require an infinite amount of evidence, i.e. it is impossible.
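(A small back-of-the-envelope illustration of the second option's problem, under the optimistic and surely false assumption that errors are independent and don't compound; in reality wrong labels should make further errors more likely, so the real situation is worse.)

```python
# Purely illustrative: probability that the training set contains no wrong "Yes" label
# after adding k self-labeled examples, each correct with probability 0.99,
# assuming (optimistically) that the errors are independent.

for k in (10, 100, 500):
    print(k, 0.99 ** k)   # 0.904..., 0.366..., 0.0066...
```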
First a note:
the two chunks are independent given the pressure and temperature of the gas
I'd be careful here: if the two chunks of gas are in a (closed) room which, e.g., was previously colder on one side and warmer on the other and then equilibrated to the same temperature everywhere, the space of microscopic states it can have evolved into is much smaller than the space of microscopic states that meet the temperature and pressure requirements (since the initial entropy was lower and physics is deterministic). Therefore, in this case (or generally in cases in our simple universe, rather than in thought experiments where states are randomly sampled), a hypercomputer could see more mutual information between the chunks of gas than just pressure and temperature. I wouldn't call the chunks approximately independent either; the point is that we, with our bounded intellects, are not able to keep track of the other mutual information.
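(To state that slightly more formally, in my own rough notation: write X_1, X_2 for the microstates of the two chunks. Deterministic dynamics preserve the entropy of the joint microstate, so H(X_1, X_2 | P, T) stays near its low initial value, while H(X_1 | P, T) and H(X_2 | P, T) individually are close to the maximum compatible with the macrostate; hence
I(X_1; X_2 | P, T) = H(X_1 | P, T) + H(X_2 | P, T) − H(X_1, X_2 | P, T) > 0.)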
Main comment:
(EDIT: I might’ve misunderstood the motivation behind natural latents in what I wrote below.)
I assume you want to use natural latents to formalize what a natural abstraction is.
The "Λ induces independence between all X_i" criterion seems too strong to me.
IIUC, if we have an abstraction like "human", you want all the individual humans to share approximately no mutual information conditioned on the "human" abstraction.
Obviously, there are subclusters of humans (e.g. women, children, Ashkenazi Jews, …) whose members share more properties (which I'd say is the relevant sense of "mutual information" here) than the properties that are universally shared among humans.
So given what I intuitively want the “human” abstraction to predict, there would be lots of mutual information between many humans.
However, (IIUC,) your definition of natural latents permits there to be waaayyy more information encoded in the "human" abstraction, s.t. it can predict all the subclusters of humans that exist on Earth, since it only needs to be insensitive to removing one particular human from the dataset. This complex human abstraction does render all individual humans approximately independent, but I would say this abstraction seems very ugly and not what I actually want. I don't think we need this conditional independence condition, but rather something else that finds clusters of thingies which share unusually much (relevant) mutual information.
I like to think of abstractions as similarity clusters. I think it would be nice if we found a formalization of what a cluster of thingies is without needing to postulate an underlying thingspace / space of possible properties, and instead found a natural definition of "similarity cluster" based on (relevant) mutual information. But I'm not sure; I haven't really thought about it. (But possibly I misunderstood something. If it already exists, feel free to invite me to a draft of the conceptual story behind natural latents.)
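(For reference, the two conditions I'm referring to above, written in my own rough notation and hopefully matching the intended definitions: mediation, i.e. P(X_1, …, X_n | Λ) ≈ ∏_i P(X_i | Λ), so the individual humans are approximately independent given the latent; and redundancy/insensitivity, i.e. Λ depends on the X's only through X_{-i} = (X_1, …, X_{i−1}, X_{i+1}, …, X_n) for every i, so it is insensitive to removing any single X_i.)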
I feel like many people look at AI alignment as if the main problem were being careful enough when we train the AI, so that no bugs cause the objective to misgeneralize.
This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned AI or a corrigible AI. Even if it's relatively obvious that AGI design X destroys the world, and none of the wise actors deploy it, we cannot prevent unwise actors from deploying it a bit later.
We currently don’t have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.
Agreed. Since many people will now probably comment in this thread, I make the same recursive offer:
If you reply to this, I guarantee that I will read your comment, and then I will give you one or two upvotes (or none) depending on how insightful I consider it to be.
So please upvote this comment so it stays on top of this comment thread!
This review is great. I'm actually impressed by how you managed to extract all that relevant information and convey it relatively well in this not-terribly-long blog post.
I think this post is epistemically weak (which does not mean I disagree with you):
Your post pushes the claim that pulling the short-timelines fire alarm (as in "It's time for EA leadership to pull the short-timelines fire alarm") wouldn't be wise. Problems in the discourse: (1) "pulling the short-timelines fire alarm" isn't well-defined in the first place; (2) there is a huge inferential gap between "AGI won't come before 2030" and "EA shouldn't pull the short-timelines fire alarm" (which could mean something like, e.g., EA should start planning to launch a Manhattan project for aligning AGI in the next few years); and (3) your statement "we are concerned about a view of the type expounded in the post causing EA leadership to try something hasty and ill-considered", which slightly addresses that inferential gap, is just a bad rhetorical move where you interpret what the other person said in a very extreme and bad way even though they didn't actually mean that, and you are definitely not seriously considering the pros and cons of taking more initiative. (Though of course it's not really clear what "taking more initiative" means, and critiquing the other post (which IMO was epistemically very bad) would be totally right.)
You're not giving a reason why you think timelines aren't that short, only saying you believe it enough to bet on it. IMO, simply saying "the >30% in 3-7 years claim is, compared to the current estimates of many smart people, an extraordinary claim that requires an extraordinary burden of proof, which isn't provided" would have been better.
Even if not explicitly, and even if not endorsed by you, your post implicitly promotes the statement "EA leadership doesn't need to shorten its timelines". I'm not at all confident about this, but it seems to me like EA leadership acts as if we have pretty long timelines, significantly longer than your bets would imply. (The way the post is written, you should at least have explicitly pointed out that it doesn't imply EA's timelines are short enough.)
AGI timelines are so difficult to predict that prediction markets might be vastly outperformed by a few people with very deep models of the alignment problem, like Eliezer Yudkowsky or Paul Christiano. So even if we took many such bets in the form of a prediction market, this wouldn't be strong evidence that our estimate is that good; the estimate would still be extremely uncertain.
(Not at all saying taking bets is bad, though the doom factor makes taking bets difficult indeed.)
It's not that there's anything wrong with posting such a post saying you're willing to bet, as long as you don't sell it as an argument for why timelines aren't that short, or for even more downstream things like what EA leadership should do. What bothers me isn't that this post got posted, but that it and the post it is counterbalancing received so many upvotes. LessWrong should be a place where good epistemics are very important, not where people cheer for their side by upvoting everything that supports their own opinion.