I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
Amazing story! My respect for writing this.
I think stories may be a promising angle for making people (especially AI researchers) understand AI x-risk (on more of a gut level so they realize it actually binds to reality).
The end didn’t seem that realistic to me though. Or at least, I don’t expect ALICE would seek to fairly trade with humanity, but it’s not impossible that it’d call the president pretending to want to trade. Not sure what your intent when writing was, but I’d guess most people will read it the first way.
Compute is not a (big) bottleneck for AI inference. Even if humanity coordinated successfully to shut down large GPU clusters and supercomputers, it seems likely that ALICE could copy itself to tens or hundreds of millions of devices (and humanity seems much too badly coordinated to be able to shut down 99.99% of those) to have many extremely well coordinated copies, and at ALICE’s intelligence level this seems sufficient to achieve supreme global dominance within weeks (or months if I’m being conservative), even if it couldn’t get smarter.
E.g. it could at least do lots and lots of social engineering and manipulation to prevent humanity from effectively coordinating against it, spark wars and civil wars, make governments and companies decide to manufacture war drones (which ALICE can later hack), influence war decisions for higher destructiveness, use war drones to threaten people into doing stuff at important junctures, and so on.
(Sparking multiple significant wars within weeks seems totally possible at that level of intelligence and resources. Seems relatively obvious to me but I can try to argue the point if needed. (Though not sure whether convincingly. Most people seem to me to not be nearly able to imagine what e.g. 100 copies of Eliezer Yudkowsky could do if they could all think at peak performance 24/7. Once you reach that level with something that can rewrite its mind you don’t get slow takeoff, but nvm that’s an aside.))
I’m not sure I understood how 2 is different from 1.
(1) is the problem that utility rebinding might just not happen properly by default. An extreme example is how AIXI-atomic fails here. Intuitively I’d guess that once the AI is sufficiently smart and self-reflective, it might just naturally see the correspondence between the old and the new ontology and rebind values accordingly. But before that point it might get significant value drift. (E.g. if it valued warmth and then learns that there actually are just moving particles, it might just drop that value shard because it thinks there’s no such (ontologically basic) thing as warmth.)
(2) is the problem that the initial ontology of the AI is insufficient to fully capture human values, so if you only specify human values as well as possible in that ontology, it would still lack the underlying intuitions humans would use to rebind their values and might rebind differently. Aka while I think many normal abstractions we use like “tree” are quite universal natural abstractions where the rebinding is unambiguous, many value-laden concepts like “happiness” are much less natural abstractions for non-human minds and it’s actually quite hard to formally pin down what we value here. (This problem is human-value-specific and perhaps less relevant if you aim the AI at a pivotal act.)
When I try to think about this kind of problem I get stuck on not understanding the details of how an ontology/worldmodel can or should work.
Not sure if this helps, but I heard that Vivek’s group came up with the same diamond maximizer proposal as I did, so if you remember that you can use it as a simple toy frame to think about rebinding. But sure we need a much better frame for thinking about the AI’s world model.
This is an amazing report!
Your taxonomy in section 4 was new and interesting to me. I would also mention the utility rebinding problem: goals can drift because the AI’s ontology changes (e.g. because it gains a deeper understanding in some domain). I guess there are actually two problems here:
(1) Formalizing the utility rebinding mechanism so that concepts get rebound to the corresponding natural abstractions of the new deeper ontology.
(2) For value-laden concepts the AI likely lacks the underlying human intuitions for figuring out how the utility ought to be rebound. (E.g. when we have a concept like “conscious happiness”, and the AI finds what cognitive processes in our brains are associated with this, it may be ambiguous whether to rebind the concept to the existence of thoughts like ‘I notice the thought “I notice the thought <expected utility increase>”’ running through a mind/brain, or to rebind it so as to include a cluster of sensations (e.g. tensions in our face from laughter) that are present in our minds/brains, or to pick some other option. (Sorry, this may be a bad example; it requires some context from my fuzzy thoughts on qualia, which might actually be wrong.))
Thanks.
I briefly looked into the MIRI paper (and the section from Eliezer’s lecture that starts at 22min) again.
My main guess now is that you’re not supposed to make the agent act on false beliefs; that one of the desiderata of the shutdown problem is that your agent shouldn’t have a capability penalty, including taking bad actions based on false beliefs (like taking bad bets, as jessicata mentioned). Your proposal violates that.
The case MIRI considered wasn’t to modify the world model so the agent believes that the button cannot be pressed, but instead to construct the utility function in a way that the resulting behavior ends up equivalent, but without the agent having wrong beliefs.
To clarify:
Subagent2 assumes the shutdown button will be pressed independent of upstream events, aka also independent of whether the shutdown button got removed. It would not veto the removal of the shutdown button (right?). The hope is that subagent1 doesn’t bother to remove the shutdown button because it believes it won’t be pressed anyway.
(The usefulness of subagent2 comes from it preventing the creation of pure u1 maximizers; it does not help protect the shutdown button (right?).)
Question 1:
I may be wrong about this, but iirc MIRI considered a case where the agent assigned probability 0 to the shutdown button being pressed, and iirc the problem was that the agent would use the button as an outcome pump by creating mechanisms that would cause the button to be pressed if something went worse than expected, thereby magically decreasing the probability that something goes badly, since the button cannot be pressed.
I haven’t thought much about it, but doesn’t this proposal have the same failure mode? (And if not, why not?)
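A toy calculation may make the worry in Question 1 more concrete. This is only a sketch and not MIRI’s formalism: it assumes a hypothetical agent that scores plans by expected utility conditioned on the button not being pressed (one simple way to model “assigns probability 0 to a press”), and all probabilities and utilities below are made up for illustration.

```python
from typing import Dict, Tuple

# A world is (outcome, button_pressed); each plan maps worlds to probabilities.
World = Tuple[str, bool]

UTILITY = {"good": 1.0, "bad": 0.0}  # made-up utilities for the toy model


def expected_utility_given_no_press(worlds: Dict[World, float]) -> float:
    """E[U | button not pressed]: the score used by an agent that treats a press as impossible."""
    unpressed = {w: p for w, p in worlds.items() if not w[1]}
    total = sum(unpressed.values())
    return sum(UTILITY[outcome] * p for (outcome, _), p in unpressed.items()) / total


# Plan A: leave the button alone. Humans press it 10% of the time,
# independently of whether the outcome is good or bad.
plan_a = {
    ("good", False): 0.45, ("good", True): 0.05,
    ("bad", False): 0.45, ("bad", True): 0.05,
}

# Plan B: additionally build a mechanism that presses the button whenever
# the outcome is bad. The real chance of a bad outcome is still 50%.
plan_b = {
    ("good", False): 0.45, ("good", True): 0.05,
    ("bad", False): 0.0, ("bad", True): 0.5,
}

print(expected_utility_given_no_press(plan_a))
print(expected_utility_given_no_press(plan_b))
```

In this toy model the agent scores plan B higher than plan A even though the actual probability of a bad outcome is unchanged, which is the outcome-pump behavior described above. Whether the proposal under discussion reduces to this conditioning is exactly the open question.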
Question 2:
(Feel free to ignore this question if the proposal does fall for the failure mode in Question 1.) Is there a probabilistic extension of the proposal which uses 99.99% instead of 100%, but still basically works? (Maybe assuming away some problems with comparing utility functions or so.) And if not: Why not / What is the part that relies on certainty?
Nice post!
I feel like the real question to answer here isn’t “Why are bacteria so simple?” (because if they were more complex they wouldn’t really be bacteria anymore), but rather “Why do there seem to be those 2 classes of cells (eukaryotes and prokaryotes)?”. In particular, (1) why aren’t there more cells with intermediate size and complexity, and (2) why didn’t bacteria get outcompeted out of existence by their cousins which were able to form much more complex adaptations?
(Note: I know very little about biology. Don’t trust me just because I never heard of medium-sized and medium-complex cell types that don’t neatly fit into one of the clusters of prokaryotes and eukaryotes.)
Lol possibly someone should try to make this professor work for Steven Byrnes / on his agenda.
Thanks for writing this! This was explained well and I like your writing style. Sad that there aren’t many more good distillations of MIRI-like research. (Edited: Ok not sure enough whether there’s really that much that can be improved. I didn’t try reading enough there yet, and some stuff on Arbital is pretty great.)
It’d be interesting to see whether it performs worse if it only plays one side and the other side is played by a human. (I’d expect so.)
I want to preregister my prediction that Sydney will be significantly worse for significantly longer games (like I’d expect it often makes illegal or nonsensical moves when we are at around move 50), though I’m already surprised that it apparently works up to 30 moves. I don’t have time to test it unfortunately, but it’d be interesting to learn whether I am correct.
Likewise, for some specific programs we can verify that they halt.
(Not sure if I’m missing something, but my initial reaction:)
There’s a big difference between being able to verify for some specific programs if they have a property, and being able to check it for all programs.
For an arbitrary TM, we cannot check whether it outputs a correct solution to a specific NP-complete problem. We cannot even check that it halts! (Rice’s theorem etc.)
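The asymmetry can be illustrated with a small sketch. It assumes a hypothetical setup in which a “program” is a Python generator that yields once per step; the checker `halts_within` and the two example programs are made up for illustration. Running a program under a step budget can confirm that some specific programs halt, but a program that exceeds the budget only ever yields “unknown”, never “does not halt”.

```python
from typing import Callable, Iterator, Optional


def halts_within(program: Callable[[], Iterator[None]], max_steps: int) -> Optional[bool]:
    """Return True if `program` finishes within the step budget, else None (unknown)."""
    steps = program()
    for _ in range(max_steps):
        try:
            next(steps)
        except StopIteration:
            return True  # we observed the program halting
    return None  # budget exhausted: this tells us nothing about halting


def collatz_from_27() -> Iterator[None]:
    """Iterate the Collatz map from 27 until reaching 1, yielding once per step."""
    n = 27
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        yield


def loop_forever() -> Iterator[None]:
    """A program that never halts."""
    while True:
        yield


print(halts_within(collatz_from_27, 1000))  # True: verified for this specific program
print(halts_within(collatz_from_27, 10))    # None: budget too small to decide
print(halts_within(loop_forever, 1000))     # None: cannot be distinguished from "slow"
```

No choice of budget turns this into a decision procedure for all programs; it only ever verifies halting for particular programs that happen to finish in time.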
Not sure what alignment relevant claim you wanted to make, but I doubt this is a valid argument for it.
Thank you! I’ll likely read your paper and get back to you. (Hopefully within a week.)
From reading your comment, my guess is that the main disagreement may be that I think powerful AGI will need to be consequentialist. For example, to achieve something that humans cannot do yet, you need to search for that target in some way, i.e. have some consequentialist cognition, i.e. do some optimization. (So what I mean by consequentialism is just having some goal to search for / update toward, in contrast to just executing fixed patterns. I think that’s how Yudkowsky means it, but I’m not sure if that’s what most people mean when they use the term.) (Though this does not imply that you need so much consequentialism that we won’t be able to shut down the AGI. But as I see it, a theoretical solution to corrigibility needs to deal with consequentialism. I haven’t looked into your paper yet, so it’s well possible that my comment here might appear misguided.)
E.g. if we just built a gigantic transformer and trained it on all human knowledge (and say we have a higher sample efficiency or so), it is possible that it can do almost everything humans can do. But it won’t be able to just one-shot solve quantum gravity or so when we give it the prompt “solve quantum gravity”. There is no runtime updating/optimization going on, i.e. the transformer is non-consequentialist. All optimization happened through the training data or gradient descent. Either the human training data was already sufficient to encode a solution to quantum gravity in the patterns of the transformer, or it wasn’t.
It is theoretically possible that the transformer learns somewhat deeper underlying patterns than humans have (though I do not expect that from something like the transformer architecture), and is thus able to generalize a bit further than humans. But it seems extremely unlikely that it learned such a deep understanding that it already has the solution to quantum gravity encoded, although it was never explicitly trained to learn that and just read physics papers.
The transformer might be able to solve quantum gravity if it can recursively query itself to engineer better prompts, or if it can give itself feedback which is then somehow converted into gradient descent updates and then try multiple times. But in those cases there is consequentialist reasoning again. The key point: Consequentialism becomes necessary when you go beyond human level.
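The distinction between executing fixed patterns and optimizing at runtime can be sketched with a deliberately tiny toy. Everything here is hypothetical and chosen only for illustration: `fixed_pattern_policy` stands in for a system that can only replay what was stored during training, and `search_policy` stands in for a propose-and-evaluate loop that uses feedback on its own attempts.

```python
from typing import Callable, Dict, Optional


def fixed_pattern_policy(memorized: Dict[str, int], prompt: str) -> Optional[int]:
    """Answer only if the pattern was already stored at 'training' time."""
    return memorized.get(prompt)


def search_policy(score: Callable[[int], float], start: int = 0, max_iters: int = 1000) -> int:
    """Hill-climb at runtime: propose neighbours, keep whichever scores better."""
    best = start
    for _ in range(max_iters):
        candidate = max((best - 1, best + 1), key=score)
        if score(candidate) <= score(best):
            break  # no neighbour improves on the current attempt
        best = candidate
    return best


memorized = {"2+2": 4}  # everything the fixed policy "learned"
target = 37             # made-up goal that is absent from the memorized patterns


def score(x: int) -> float:
    """Feedback signal: higher is better, maximal at the target."""
    return -abs(x - target)


print(fixed_pattern_policy(memorized, "find the target"))  # None: no stored pattern
print(search_policy(score))                                # 37: found by runtime search
```

The second policy reaches an answer that was never stored because it has a goal to update toward, which is the minimal sense of “consequentialist” used above. Nothing here says anything about how real transformers behave.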
Out of interest, how much do you agree with what I just wrote?
Hi Koen, thank you very much for writing this list!
I must say I’m skeptical that the technical problem of corrigibility as I see it is really solved already. I see the problem of corrigibility as shaping consequentialist optimization in a corrigible way. (Yeah that’s not at all a clear definition yet, I’m still deconfusing myself about that, and I’ll likely publish a post clarifying the problem as I see it within the next month.)
So e.g. corrigibility from non-agenthood is not a possible solution to what I see as the core problem. I’d expect that the other solutions here may likewise only give you corrigible agents that cannot do new very impressive things (or if they can they might still kill us all).
But I may be wrong. I probably only have time to read one paper. So: What would you say is the strongest result we have here? If I looked at one paper/post and explained why this isn’t a solution to corrigibility as I see it, for what paper would it be most interesting for you to see what I write? (I guess I’ll do it sometime this week if you write me back, but no promises.)
Also, from your perspective, how big is the alignment tax for implementing corrigibility? E.g. is it mostly just more effort implementing and supervising? Or does it also take more compute to get the same impressive result done? If so, how much? (Best take an example task that is preferably a bit too hard for humans to do. That makes it harder to reason about it, but I think this is where the difficulty is.)
Huh, interesting. Could you give some examples of people who seem to claim this, and if Eliezer is among them, where he seems to claim this? (Would just interest me.)
In case some people relatively new to LessWrong aren’t aware of it (and because I wish I had found this out earlier): “Rationality: From AI to Zombies” does not nearly cover all of the posts Eliezer published between 2006 and 2010.
Here’s how it is:
“Rationality: From AI to Zombies” probably contains like 60% of the words EY has written in that timeframe and the most important rationality content.
The original sequences are basically the old version of the collection that is now “Rationality: A-Z”, containing a bit more content. In particular a longer quantum physics sequence and sequences on fun theory and metaethics.
All EY posts from that timeframe (or here for all EY posts until 2020 I guess) (also can be found on lesswrong, but not in any collection I think).
So a sizeable fraction of EY’s posts are not in a collection.
I just recently started reading the rest.
I strongly recommend reading:
And generally a lot of posts on AI (i.e. primarily posts in the AI foom debate) are not in the sequences. Some of them were pretty good.
I feel like many people look at AI alignment like they think the main problem is being careful enough when we train the AI so that no bugs cause the objective to misgeneralize.
This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned AI or a corrigible AI. Even if it’s relatively obvious that AGI design X destroys the world, and all the wise actors don’t deploy it, we cannot prevent unwise actors from deploying it a bit later.
We currently don’t have any approach to alignment that would work even if we managed to implement everything correctly and had perfect datasets.
Simon Skade’s Shortform
I’d guess we can likely reliably identify some classes of pivotal acts where we cannot be fooled easily, and would only accept suggestions from those classes, and I’d still intuitively expect that there are doable pivotal acts in those classes.
First a note:
I’d be careful here: If the two chunks of gas are in a (closed) room which e.g. was previously colder on one side and warmer on the other and then equilibrated to the same temperature everywhere, the space of microscopic states it can have evolved into is much smaller than the space of microscopic states that meet the temperature and pressure requirements (since the initial entropy was lower and physics is deterministic). Therefore in this case (or generally in cases in our simple universe rather than thought experiments where states are randomly sampled) a hypercomputer could see more mutual information between the chunks of gas than just pressure and temperature. I wouldn’t call the chunks approximately independent either; the point is that we with our bounded intellects are not able to keep track of the other mutual information.
Main comment:
(EDIT: I might’ve misunderstood the motivation behind natural latents in what I wrote below.)
I assume you want to use natural latents to formalize what a natural abstraction is.
The “Λ induces independence between all Xi” criterion seems too strong to me.
IIUC, if we have an abstraction like “human”, you want all the individual humans to share approximately no mutual information conditioned on the “human” abstraction.
Obviously, there are subclusters of humans (e.g. women, children, Ashkenazi Jews, …) whose members share more properties (which I’d say is the relevant sense of “mutual information” here) than the properties that are universally shared among humans.
So given what I intuitively want the “human” abstraction to predict, there would be lots of mutual information between many humans.
However, (IIUC,) your definition of natural latents permits there to be waaayyy more information encoded in the “human” abstraction, s.t. it can predict all the subclusters of humans that exist on earth, since it only needs to be insensitive to removing one particular human from the dataset. This complex human abstraction does render all individual humans approximately independent, but I would say this abstraction seems very ugly and not what I actually want.
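A small numerical toy can illustrate the tension. It is only a sketch under made-up assumptions and covers just the conditional-independence criterion, not the full definition of natural latents: a hidden subcluster label `s` is uniform on {0, 1}, and two individuals `x1`, `x2` are independent noisy copies of `s`. A coarse latent that only says “human” leaves mutual information between the individuals, while a latent that also encodes the subcluster removes it.

```python
import itertools
import math
from collections import defaultdict
from typing import Callable, Dict, Hashable, Tuple

Joint = Dict[Tuple[int, int, int], float]


def joint_distribution(p_flip: float = 0.1) -> Joint:
    """P(s, x1, x2): s is a uniform subcluster label, each x_i is a noisy copy of s."""
    dist: Joint = {}
    for s, x1, x2 in itertools.product((0, 1), repeat=3):
        p = 0.5
        for x in (x1, x2):
            p *= (1 - p_flip) if x == s else p_flip
        dist[(s, x1, x2)] = p
    return dist


def conditional_mutual_information(dist: Joint, latent: Callable[[int], Hashable]) -> float:
    """I(X1; X2 | L) in bits, where the latent is L = latent(s)."""
    p_l: Dict[Hashable, float] = defaultdict(float)
    p_lx1: Dict[Hashable, float] = defaultdict(float)
    p_lx2: Dict[Hashable, float] = defaultdict(float)
    p_lx1x2: Dict[Hashable, float] = defaultdict(float)
    for (s, x1, x2), p in dist.items():
        l = latent(s)
        p_l[l] += p
        p_lx1[(l, x1)] += p
        p_lx2[(l, x2)] += p
        p_lx1x2[(l, x1, x2)] += p
    # I(X1;X2|L) = sum p(l,x1,x2) * log2( p(l) p(l,x1,x2) / (p(l,x1) p(l,x2)) )
    return sum(
        p * math.log2(p_l[l] * p / (p_lx1[(l, x1)] * p_lx2[(l, x2)]))
        for (l, x1, x2), p in p_lx1x2.items()
        if p > 0
    )


dist = joint_distribution()
coarse = conditional_mutual_information(dist, lambda s: "human")      # ignores the subcluster
fine = conditional_mutual_information(dist, lambda s: ("human", s))   # encodes the subcluster
print(f"I(X1;X2 | coarse latent) = {coarse:.3f} bits")
print(f"I(X1;X2 | fine latent)   = {fine:.3f} bits")
```

With these made-up numbers the coarse latent leaves roughly 0.32 bits of mutual information between the two individuals, while the fine latent leaves essentially none. So in this toy, forcing independence forces the latent to carry the subcluster information, which is the “ugly” enlargement of the abstraction described above.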
I don’t think we need this conditional independence condition, but rather something else that finds clusters of thingies which share unusually much (relevant) mutual information.
I like to think of abstractions as similarity clusters. I think it would be nice if we find a formalization of what a cluster of thingies is without needing to postulate an underlying thingspace / space of possible properties, and instead find a natural definition of “similarity cluster” based on (relevant) mutual information. But not sure, haven’t really thought about it.
(But possibly I misunderstood something. If it already exists, feel free to invite me to a draft of the conceptual story behind natural latents.)