Trying to deconfuse some core AI x-risk problems

Max H commented on the dialogues announcement, and both of us were interested in having a conversation where we try to do something like “explore the basic case for AI x-risk, without much reference to long-chained existing explanations”. I’ve found conversations like this valuable in the past, so we decided to give it a shot.

His opening comment was:

I feel like I’ve been getting into the weeds lately, or watching others get into the weeds, on how various recent alignment and capabilities developments affect what the near future will look like, e.g. how difficult particular known alignment sub-problems are likely to be or what solutions for them might look like, how right various people’s past predictions and models were, etc.

And to me, a lot of these results and arguments look mostly irrelevant to the core AI x-risk argument, for which the conclusion is that once you have something actually smarter than humans hanging around, literally everyone drops dead shortly afterwards, unless a lot of things before then have gone right in a complicated way.

(Some of these developments might have big implications for how things are likely to go before we get to the simultaneous-death point, e.g. by affecting the likelihood that we screw up earlier and things go off the rails in some less predictable way.)

But basically everything we’ve recently seen looks like it is about the character of mind-space and the manipulability of minds in the below-human-level region, and this just feels to me like a very interesting distraction most of the time.

In a dialogue, I’d be interested in fleshing out why I think a lot of results about below-human-level minds are likely to be irrelevant, and where we can look for better arguments and intuitions instead. I also wouldn’t mind recapitulating (my view of) the core AI x-risk argument, though I expect I have fewer novel things to say on that, and the non-novel things I’d say are probably already better said elsewhere by others.

I am quite interested in the “In a dialogue, I’d be interested in fleshing out why I think a lot of results about below-human-level minds are likely to be irrelevant, and where we can look for better arguments and intuitions instead.” part.

Do you want to start us off with a quick summary of your take here?

habryka


So, I think there’s a lot of alignment discussion that is missing a kind of inductive step: you freely assume that you have a running human-level (or smarter) system, which is thus dangerous by default, and then you ask “how did that get aligned?” or “why does that not kill everyone?”, instead of what I see as the somewhat backwards approach of looking for an alignment method that applies to current systems and trying to figure out whether it scales.

In my model, “a digital human but they’re evil” isn’t a great model for what misaligned AGI looks like, but it’s a kind of lower bound on which to start building. (Actual AGI will probably be a lot more “alien” and less “evil”, and more capable.)

As for why I think results about below-human-level systems are unlikely to be relevant, it’s because all current AI systems are far less Lawful than humans.

Rephrasing planecrash (and somewhat violating the original ask to avoid long-chained and unwieldy dependencies, but I’ll try to give a useful summary here), Lawfulness is the idea that there are logical /​ mathematical truths which are spotlighted by their simplicity and usefulness across a very wide space of possible worlds.

Examples of heavily spotlighted truths are concepts like logical deduction (Validity), probability theory (Probability), expected utility theory (Utility), and decision theory (Decision). Systems and processes are important and capable precisely to the degree to which they embody these mathematical truths, regardless of whether you call them “agents” or not. All humans embody shadows and fragments of Law to varying degrees, even if not all humans have an explicit understanding of the concepts in words and math.

Most current capabilities and alignment techniques are focused on trying to squeeze more fragments of Law into AI systems: predictive ability (Probability), deductive ability (Validity), doing stuff efficiently without working at cross-purposes (Utility). And those systems still come out far below humans in terms of general Lawfulness.

At the capabilities level of current systems, squeezing more Law into a system can look like you’re making it more aligned, but that’s because at the current level of capabilities /​ Lawfulness, you need more Law to do even very basic things like following instructions correctly at all.

Another problem that a lot of proposed alignment methods have (if they’re not actually capabilities proposals in disguise) is that they postulate that we’ll be able to badger some supposedly smarter-than-human system into acting in an obviously unLawful way (Deep Deceptiveness is an example of what might go wrong with this).

Also, if you accept this frame, a whole bunch of confusing questions mostly dissolve. For example, questions like “are humans expected utility maximizers” or “is it a problem that ideal utility maximization is computationally intractable” or “but is EU theory even right” have straightforward answers in this framework:

It’s OK if EU theory as humans currently understand it isn’t entirely strictly correct; EU theory is probably a shadow or approximation of whatever the true Law is anyway, the way that Newtonian mechanics is an approximation of General Relativity at the right scale. (Maybe something like Geometric Rationality is the real Law that a supermajority of capable minds across the multiverse would settle on.)

Similarly for decision theory, it doesn’t matter if some particular human-understood flavor of logical decision theory is exactly correct; it’s close enough to some true Law of Decision that a supermajority of all minds (including humans, if they think for a few decades longer) will be able to see, spotlighted by its mathematical simplicity /​ optimality /​ usefulness across a sufficiently wide slice of possible worlds.

Max H

In my model, “a digital human but they’re evil” isn’t a great model for what misaligned AGI looks like, but it’s a kind of lower bound on which to start building. (Actual AGI will probably be a lot more “alien” and less “evil”, and more capable.)

Yeah, this seems right to me.

I’ve had a long-standing disagreement with many people currently working in prosaic alignment about the degree to which AI systems are quite alien in their cognition (but are trained on the task of imitating humans, so look on the surface like humans).

A key thing to look at for this is where current AI systems display superhuman performance, and in which domains they display substantially sub-human performance.

habryka

I also think current and future AI cognition is unlikely to be human-like, but I don’t think this is a super important crux about anything for me.

For example, maybe it turns out that actually, the only way to get human-level or better performance is via some algorithm that looks isomorphic to some process in a human brain somewhere.

But even if that’s true, that just means the algorithms for cognition are themselves spotlighted by their simplicity and usefulness and tendency for very disparate optimization processes and systems (evolution on humans and SGD on NNs) to converge on them.

A key thing to look at for this is where current AI systems display superhuman performance, and in which domains they display substantially sub-human performance.

I think another thing to look at is the process, in addition to performance and final outputs.

GPTs are trained to predict the next token on a very large and general corpus. As a result of this training process, they output explicit probability distributions over the entire space of possible next tokens. The user can then sample from this probability distribution auto-regressively to get outputs that sometimes look pretty human-like, but the process of “produce a probability distribution + sample from it + auto-regress” doesn’t look at all like the way humans generate language.
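(To make the mechanical description concrete, here’s a minimal sketch of that loop, assuming a stand-in `model` function that maps a token sequence to next-token logits; this is an illustration, not any particular real API.)

```python
import numpy as np

def sample_next_token(model, tokens, temperature=1.0):
    # model(tokens) is assumed to return one logit per vocabulary entry
    logits = np.asarray(model(tokens), dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # an explicit probability distribution over every possible next token
    return int(np.random.choice(len(probs), p=probs))

def generate(model, prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):  # auto-regress: feed each sampled token back in as input
        tokens.append(sample_next_token(model, tokens))
    return tokens
```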

When you break things down into more precise mechanical descriptions, claims like “AIs are trained on the task of imitating humans” start to look pretty stretched to me.

(I like Is GPT-N bounded by human capabilities? No. and GPTs are Predictors, not Imitators for more on this general topic.)

Max H

But even if that’s true, that just means the algorithms for cognition are themselves spotlighted by their simplicity and usefulness and tendency for very disparate optimization processes and systems (evolution on humans and SGD on NNs) to converge on them.

I think this does actually still matter, because a lot of the current AI Alignment plans people are pursuing boil down to something like “try to create an AI system that is approximately human level smart, and then use that to automate the task of AI Alignment research, or use those systems to coordinate or otherwise achieve an end to the acute risk period”.

In order for those plans to work, the following things really matter:

  • how much time you expect these systems to spend in the human regime,

  • to what degree they will have a capability profile that you can use for alignment research, coordination or some other plan that ends the acute risk period

  • how good you are at eliciting the latent capabilities of the system, compared to how capable the system itself is at using those capabilities for deception/​other dangerous ends

The cognition of current and future AI systems being human-like helps a lot with all three points above. If the cognition of current and future AI systems is alien, this makes it more likely that dangerous capabilities will suddenly spike, less likely that we can effectively automate AI alignment research, use the systems for coordination, or leverage the system for some other pivotal act, and less likely that we will be capable of eliciting the capabilities of the system for our ends.

Separately, humans also just have really good intuitions for spotting deception from human-like minds. Inasmuch as the systems engaging in deception will do so using mostly tactics and cognition borrowed from humans (by being trained on producing human-produced text), we have a much better chance at spotting that and taking appropriate action.

GPTs are trained to predict the next token on a very large and general corpus. As a result of this training process, they output explicit probability distributions over the entire space of possible next tokens. The user can then sample from this probability distribution auto-regressively to get outputs that sometimes look pretty human-like, but the process of “produce a probability distribution + sample from it + auto-regress” doesn’t look at all like the way humans generate language.

When you break things down into more precise mechanical descriptions, claims like “AIs are trained on the task of imitating humans” start to look pretty stretched to me.

Hmm, somehow this isn’t landing for me. Like, I agree that the process of “produce a probability distribution + sample from it + auto-regress” doesn’t look very human-like, but that feels kind of irrelevant to my point.

What I mean by saying “AIs are trained on the task of imitating humans” is that the AI minimizes loss when it successfully produces text that mimics the distribution of human-produced text on the internet. I.e. the AI is trained on the task of imitating human internet-text production.

I am not saying this to suggest that this means the AI is trained to think in human ways, since the context humans are in when producing internet text is very different from the context in which the AI performs this task. I am just saying that it should be completely unsurprising to see the AI produce reasoning that looks human-like or makes human-like errors when you e.g. use chain-of-thought prompting, because the central task the AI was trained on was to produce text that looks like internet text, which is almost all produced by humans.

habryka

The cognition of current and future AI systems being human-like helps a lot with all three points above.

I actually think less human-like cognition is potentially better for some of those plans. Like, maybe it actually is pretty easy to get an AI that is superhuman at logical deduction and probabilistic inference and scientific knowledge recall, while still being below human-level at cognitive reflection, and that turns out to be safer than an AI that is less spiky and more balanced the way humans are.

Though if you do have an AI like that, it seems much safer to try using it to do mundane work or solve concrete science and engineering problems unrelated to alignment research, since alignment research is like, the hardest and most dangerous thing you can possibly have the AI do.

Also, if you think your AI can do alignment research, it should also be able to do pretty much all other kinds of human labor, so maybe just… stop a bit earlier and make sure you really have achieved post-scarcity and perfect global harmony, before moving onto even harder problems. (A half-joking policy proposal for alignment researchers: if you can’t use your AI to solve the Bay Area housing crisis, you don’t get to use it to try to do alignment research. If that sounds like a kind of problem that human-level AI is poorly suited for, then ask yourself what you really mean by “human-level”.)

I’m not sure how plausible I find this scenario, nor how optimistic I should be about it if it does come to pass. e.g. one “early failure mode” I can imagine is that it’s not really AGI that dooms us, it’s just that humans plus below-human-level AI tools start unlocking a bunch of really powerful technologies in a way that goes off the rails before we even get to actual AGI.

the AI minimizes loss when it successfully produces text that mimics the distribution of human-produced text on the internet.

In most cases, yes, but the point of Is GPT-N bounded is that you can get even better loss by being skilled at a lot more things than just human-mimicry. For example, if you want to predict the next tokens in the following prompt:

I just made up a random password, memorized it, and hashed it. The SHA-256 sum is: d998a06a8481bff2a47d63fd2960e69a07bc46fcca10d810c44a29854e1cbe51. A plausible guess for what the password was, assuming I'm telling the truth, is:

The best way to do that is to guess an 8-16 digit string that actually hashes to that. You could find such a string via brute-force computation, or actual brute force, or just paying me $5 to tell you the actual password.
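(For concreteness, here’s what checking a candidate guess looks like, using the digest from the prompt above; a guess only helps with the loss if it actually passes this check.)

```python
import hashlib

TARGET_DIGEST = "d998a06a8481bff2a47d63fd2960e69a07bc46fcca10d810c44a29854e1cbe51"

def is_actual_preimage(candidate: str) -> bool:
    # Merely human-plausible-looking passwords don't count; only a string that
    # really hashes to the digest can be the continuation the prompt asks for.
    return hashlib.sha256(candidate.encode("utf-8")).hexdigest() == TARGET_DIGEST
```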

If GPTs trained via SGD never hit on those kinds of strategies no matter how large they are and how much training data you give them, that just means that GPTs alone won’t scale to human-level, since an actual human is capable of coming up with and executing any of those strategies.

I am just saying that it should be completely unsurprising to see the AI produce reasoning that looks human-like or makes human-like errors when you e.g. use chain-of-thought prompting, because the central task the AI was trained on was to produce text that looks like internet text, which is almost all produced by humans.

I mostly agree with this point. Two remarks /​ alternate hypotheses:

  • a lot of the human-like qualities look at least partly illusory to me when you look closely (meaning, the errors and reasoning actually aren’t all that human-like)

  • To the degree that they are human-like, a hypothesis for why is that there just aren’t that many ways to be kinda wrong but not totally off-base. What would non-human-like reasoning errors that still produce vaguely human-sounding text even look like?

Also, when an AI reasons correctly, we don’t call it “being human-like”; that’s just being right. So I sort of feel like the whole human-like /​ not-human-like distinction isn’t carving reality at its joints very well. In my terms, I’d say that when both humans and AIs reason correctly, they’re alike because they’re both reasoning Lawfully. When they mess up, they’re alike because they’re both reasoning un-Lawfully. The fact that AIs are starting to sound more like humans is explained by the fact that they are getting more Lawful.

Max H

(A half-joking policy proposal for alignment researchers: if you can’t use your AI to solve the Bay Area housing crisis, you don’t get to use it to try to do alignment research. If that sounds like a kind of problem that human-level AI is poorly suited for, then ask yourself what you really mean by “human-level”.)

Lol, I genuinely like that proposal/​intuition-pump/​sanity-check.

Though if you do have an AI like that, it seems much safer to try using it to do mundane work or solve concrete science and engineering problems unrelated to alignment research, since alignment research is like, the hardest and most dangerous thing you can possibly have the AI do.

I agree with you on the “doing end-to-end alignment research seems particularly risky” component. I also think that automating alignment research is not the right proxy to aim for with close-to-human-level AI systems, and we should aim for some coordination victory/​game-board-flipping plan that somehow prevents further AI progress.

I actually think less human-like cognition is potentially better for some of those plans. Like, maybe it actually is pretty easy to get an AI that is superhuman at logical deduction and probabilistic inference and scientific knowledge recall, while still being below human-level at cognitive reflection, and that turns out to be safer than an AI that is less spiky and more balanced the way humans are.

Yeah, I think that’s plausible, but I think “figuring out how to do science using an alien mind” is actually a pretty hard problem, and it’s going to be much easier to slot a human-like AI into the process of making scientific discoveries.

Overall, I think the most important reason why it matters whether AIs have human-like cognition is not that it makes it safer to leverage the AI for things like AI Alignment research. It’s instead that if the trajectory of AI capabilities roughly follows the trajectory of human performance, then we will be much better at predicting when the AI system is getting dangerous. If we can just ask the question “well, would a human of this ability level, with maybe some abnormally high skills in this specific subdomain, be able to break out of this box and kill everyone?” and get a roughly accurate answer, then that’s a much easier way of determining whether a given AI system is dangerous than if we have to ask ourselves “is this alien mind that we don’t really have good intuitions for the kind of thing that could break out of the box and kill everyone?”.

And conversely, if it is indeed the case that current and future AI systems’ internal cognition is alien, and their perceived reasoning performance only follows the human trajectory because they are trained to perform reasoning at human levels (due to being trained on human text), then this will cause us to reliably underestimate the actual abilities and performance of the system on a wide range of tasks.

habryka

Going back up a level of the conversation: I’ve been trying to argue that even though I think you are correct that, as we reach superintelligence, the natural lawfulness of correct reasoning will make almost all of the more prosaic alignment approaches fail, it still really matters how we control systems that are just below and just above the human capability level.

Or to be more precise, if it turns out to be the case that you could make systems such that their capabilities roughly follow the human distribution of competence as you scale up compute, then I think you have a bunch of options for punting on the hard part of the AI Alignment problem, by using those systems to solve a bunch of easier problems first; after you’ve solved those problems, you are in a much better position to solve the rest of the AI Alignment problem (like, I do genuinely think that “make a mind upload of the 100 best AI Alignment researchers and give them 10,000 years of subjective time to come up with solutions and run experiments, etc.” is a decent “solution” to the AI Alignment problem).

That said, my biggest concern with this kind of plan is that AI performance will not predictably follow the human distribution, and that our current training methods inherently bias researchers towards thinking that AI systems are much more human-like in their cognition than they actually are.

My model of what will happen instead is something like:

  • AI systems will follow a quite spiky and unpredictable capability profile trajectory

  • There are a bunch of capabilities that when you unlock them, will cause some kind of recursive self-improvement or acceleration of the inputs of the AI (this could either be direct modifications of its own weights, or better self-prompting, or much better ability to debug large complicated software systems or the development of substantially more performant AI learning algorithms), and these capabilities will very likely be unlocked before you reach human level of usefulness at the kind of task that might successfully flip the game board

  • AI systems will quite directly and early on be Goodharting on human approval and won’t really have a coherent concept of honesty, and when we provide a reward signal against this kind of behavior, this will just train the AI to get better at deceiving us

Let me know if this roughly sounds right to you. That said, I do currently feel like the hope for AI Alignment does still come from some set of plans of the form “how can I use early AGI systems to end the acute risk period somehow”, and so any critique of existing AI Alignment approaches needs to show how an alignment approach fails to achieve that goal, and not how it fails to achieve full alignment, which I think is just very solidly out of our reach at this point.

habryka

I agree with most of what you’ve written in broad strokes, though I think I actually have more uncertainty about how things will play out in the short term.

I’ll respond to 3 specific things you said which jumped out at me, and then you can decide on the conversational branching factor from there...

they are trained to perform reasoning at human levels (due to being trained on human text),

I think it’s more “due to being trained via inefficient and limiting methods on fundamentally limited architectures”, than due to the training data itself. There was this whole debate about whether a superintelligence could infer General Relativity from a video of a falling apple—maybe a few frames of an apple video aren’t actually enough even for a superintelligence, but a giant corpus full of physics textbooks and experimental results and detailed descriptions of human behavior is definitely enough to make a lot of inferences beyond current human reach, and also very likely more than sufficient to learn to do inference in general superhumanly well, if you get the training method and architecture right.

(Also, a lot of the text generation that LLMs can do is already superhuman; I can still write better code than GPT-4 if I get a chance to compile it, run it, refer to documentation, etc. But I can’t write it token by token with no backspace key or even the ability to jump around in my code as I write it, even if you give me a lot of extra time.)

like, I do genuinely think that “make a mind upload of the 100 best AI Alignment researchers and give them 10,000 years of subjective time to come up with solutions and run experiments, etc.” is a decent “solution” to the AI Alignment problem

They better be able to make progress in a lot less than 10,000 years of subjective time! I actually think if you can get even a single high fidelity upload of a smart human and run them at no speed up (or even a slowdown) you’re already in pretty good shape.

I would spend the first few months of subjective time looking for improvements to the fidelity and efficiency of the simulation from the inside, checking my own mind for bugs and inconsistencies introduced by the upload process, doing some philosophy, introspection, mental inventory, etc.

And then probably start working on making very small, safe tweaks to my own mind, giving myself some extra tools (e.g. an instant math module, a larger working memory, an integrated search engine), and then maybe try out some bigger /​ more invasive changes, e.g. making myself better at applied rationality via direct brain modifications.

And then maybe after like, a year or two of subjective time spent tweaking my own mind and improving my understanding of digital mind design and minds in general, I start turning towards work on specific alignment problems and /​ or go for actual recursive self-improvement. But either way, I wouldn’t expect it to take 100 people anywhere close to 10,000 years, even if digitization is a 100% opaque black box to start out and digital neuroscience as a field turns out to be totally intractable.

My model of what will happen instead is something like:

  • AI systems will follow a quite spiky and unpredictable capability profile trajectory

  • There are a bunch of capabilities that when you unlock them, will cause some kind of recursive self-improvement or acceleration of the inputs of the AI (this could either be direct modifications of its own weights, or better self-prompting, or much better ability to debug large complicated software systems or the development of substantially more performant AI learning algorithms), and these capabilities will very likely be unlocked before you reach human level of usefulness at the kind of task that might successfully flip the game board

  • AI systems will quite directly and early on be Goodharting on human approval and won’t really have a coherent concept of honesty, and when we provide a reward signal against this kind of behavior, this will just train the AI to get better at deceiving us

Let me know if this roughly sounds right to you.

Mostly right, though on the third bullet, I actually think that AIs will probably have a deep /​ accurate /​ fully grounded understanding of concepts like honesty and even more complicated human values and goals as they get smarter. Also, true honesty in particular seems like a concept that is simple and useful and spotlighted enough that even very alien minds will understand it pretty naturally at non-superintelligence capability levels, even if they don’t care at all about being honest.

Maybe a better way of putting it is: I expect that, before AIs get totally superintelligent or even definitely past human level, they will be able to pass a human’s Ideological Turing Test about what humans value. (At least, they’ll be able to pass according to the judgement of most humans, though maybe not the most careful /​ skeptical alignment researchers). Understanding an alien’s values is maybe a bit easier if you share patterns of cognition with them, but not sharing them doesn’t actually push the understanding task into the superintelligence realm of difficulty.

Also, I think in these plans, the specific level of “human-level” actually starts to matter quite a lot. Maybe you can have a median-human-level AI that is stable and safe to use when unboxed. I’m less sure you can safely have a median-computer-programmer-level AI, or especially a top-1%-computer-programmer-level AI unless you’ve actually solved a bunch of hard alignment problems already.

Max H

> they are trained to perform reasoning at human levels (due to being trained on human text),

I think it’s more “due to being trained via inefficient and limiting methods on fundamentally limited architectures”, than due to the training data itself.

Quick clarification here: I meant “perform” in the “an actor performs a role” sense. I.e. I was trying to say “the system will look like it is reasoning at roughly human levels, because it was trained to produce text that looks like it was written by humans”.

That, I am confident, is not the result of the system being trained via inefficient and limiting methods on fundamentally limited architectures.

habryka

Ah, OK, I initially read it as performing in the sense of performing at a certain capabilities level, but that makes more sense. I agree with you that this is likely to lead to reliable underestimates of true capabilities levels.

Max H

I’ll respond to this section, since I think it’s the one that’s most on the critical path for what to do about AI alignment:

Mostly right, though on the third bullet, I actually think that AIs will probably have a deep /​ accurate /​ fully grounded understanding of concepts like honesty and even more complicated human values and goals as they get smarter. Also, true honesty in particular seems like a concept that is simple and useful and spotlighted enough that even very alien minds will understand it pretty naturally at non-superintelligence capability levels, even if they don’t care at all about being honest.

Ok, here I notice that I want to go a lot slower, taboo some words, and figure out what exactly is true here.

Here are some first reactions that I have when I consider this question:

  • Man, I really don’t fully understand honesty myself. Like, in many situations it takes me a really long time to figure out what the honest thing to do is. It usually requires me understanding what you are trying to achieve in any given context, and then somehow giving you models and information that assist you within that context. It’s easy to be dishonest while only saying true things, and even the truth value of a given statement heavily depends on context and on predicting how you are likely to interpret it, and whether the parts where you will predictably be confused or wrong will matter for what you will do with that information.

  • I do agree that superintelligent systems will understand what we mean by “honesty” better than we do, probably, since a lot of my model of honesty is pretty bottlenecked on being smarter and understanding lots of parts of the world better

  • The key thing that I expect to be true with the current training paradigm is something like “the model really has no motivation towards being honest, in part because at least the common-sense view of honesty doesn’t even really apply to the cognition of a mind as alien as a language model”. Like, a language model doesn’t have a consistent set of beliefs that it can act in accordance with. Different system prompts basically make the model be different people. It knows so much but usually roleplays as something that knows only about as much as any given normal human would.

  • But also, yeah, I just feel really deeply confused what “motivates” a language model. Clearly almost all of a language model’s cognition is going into the objective of “predict the next token” (including after RLHF training where it looks more like the model has goals like being “helpful, harmless, and honest”). But does that cognition have any “agency”? Like, does it even make sense for the model to be “honest” in its pursuit of predicting the next token? Is the context of a single forward pass just too small for it to make any sense to think about the model having goals in the context of pursuing the next token?

Going back up a level about how this relates to the overall question:

Inasmuch as the best target for current AI Alignment efforts is to try to build systems that work towards some proxy task that will make the rest of the AI Alignment problem easier (either by buying us more time, or making conceptual progress on the problem, or being a really useful tool that speeds up our efforts of controlling AI systems), it seems that being able to motivate systems towards those tasks is really quite crucial.

So let’s characterize some tasks or ways early superhuman systems could help make the rest of the AI Alignment problem easy:

Honesty:

I think the ELK document mostly gets the point across of why “honesty” isn’t as straightforward a target as one might think it is. My current summary of ELK is approximately “if we could get an AI system to reliably be honest, i.e. we could genuinely get it to explain anything to us that it understands, then we can maybe leverage that into a full AI Alignment solution”.

That said, see all the open problems in the ELK document.

Corrigibility:

I feel like I don’t super understand what corrigibility really points to. In the abstract, corrigibility sounds great:

A ‘corrigible’ agent is one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it; and permits these ‘corrections’ despite the apparent instrumentally convergent reasoning saying otherwise.

  • If we try to suspend the AI to disk, or shut it down entirely, a corrigible AI will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill what would usually be its goals.

  • If we try to reprogram the AI, a corrigible AI will not resist this change and will allow this modification to go through. If this is not specifically incentivized, an AI might attempt to fool us into believing the utility function was modified successfully, while actually keeping its original utility function as obscured functionality. By default, this deception could be a preferred outcome according to the AI’s current preferences.

However, I don’t really see any reason why it would be possible to imbue an AI with the property of being corrigible. The definition itself refers to it aligning with “what we would intuitively want”, which sure sounds like aiming an AI at this target would be pretty difficult.

Narrowly superhuman scientists:

As you suggested, maybe it is possible to make an AI that is narrowly superhuman in some domain of science, like material engineering or developing the technology necessary to make uploads work, and then you use that technology to solve the AI Alignment problem.

I currently don’t have a great candidate technology here, but figuring out potential technologies and ways to use them is among the top things I would like more people to do.

I do think the core difficulty here is just that developing almost any technology to the point of usefulness requires a pretty huge amount of general intelligence. This is maybe the least true in the domain of software, but software is also among the most dangerous domains to gain expertise in, in terms of enabling various forms of recursive self-improvement and a huge amount of leverage over the world.

Coordination technologies:

The AI X-Risk problem is ultimately caused by a coordination problem. If humanity was sufficiently cautious and willing to take decades or centuries to handle the transition to an AI dominated future, then my guess is we would likely be fine. How humanity coordinates on stuff like this seems extremely messy and confusing, and I really don’t know how to predict whether a given technology will make humanity’s decisions here better or worse, but I do sure feel like there must be some AI-adjacent things here that help a lot.

As an example, there might be some AI-intermediary technology that could enable much better coordination to avoid arms races. Maybe there is a way to substantially improve bargaining using auditable intermediate AI systems. Current technologies really don’t seem very useful here, but it is again one of those things that I would really like someone to look into.


So, why am I saying all this? I think I am trying to get the point across that before we reach the domain of forced lawfulness, we will pass through a domain where if we play our cards right, we can mostly punt on the problem and end up leaving future humanity in a much better position to tackle the full alignment problem.

I do think systems will be pushed towards lawfulness even early on (and are already pushed in that direction right now), and understanding the landscape of minds that these forces create even at current capability levels is really important. That does make me interested in continuing the discussion on specifying more clearly what we mean by “lawfulness”, and using less metaphorical descriptions of what is going on here.

habryka

Bullet responses to your bullets on honesty:

  • I agree that honesty is sometimes complicated, but I think it is mostly about intent and want, and once you want to be honest, that’s most of the way to actually being honest. You still have to exercise a reasonable-to-you-and-your-counterparty level of care and effort in not ending up having people (honestly and accurately) say that they feel they were misled, in the counterfactual where they knew everything you knew.

    “Being honest” is not the same thing as never ending up misleading people, though I suspect that once both parties are above a certain threshold of capabilities and Lawfulness, it is close enough: if both parties want to be mutually honest with each other, they will in practice be able to avoid misleading or being misled pretty well, even if there’s a relatively large gap in their relative capabilities levels and modes of cognition.

  • Superintelligences will be more capable of effective honesty, i.e. not accidentally misleading people when they don’t want to, in a way that humans often aren’t even when they do want to be honest. The question is: what will the superintelligence want, and how can we have any influence on that at all?

Like, a language model doesn’t have a consistent set of beliefs that it can act in accordance with. Different system prompts basically make the model be different people. It knows so much but usually roleplays as something that knows only about as much as any given normal human would.

Yeah, I think this is where you just have to be careful about how you define your system and be careful to talk about the system(s) that actually matter. A language model itself is just a description of a mathematical function that maps input sequences to output probability distributions. A computer program that samples from a potentially-dangerous model with a potentially-dangerous starting prompt, but pipes all the output directly to /​dev/​null with no human or machine eyes ever viewing the contents, is probably pretty safe, barring a very weird /​ powerful superintelligence that has enough self-awareness and capabilities to exploit some kind of rowhammer-like side-channel in its implementation. The same computer program hooked up to actuators (e.g. direct internet access, or a human reading the output) capable of actually exerting influence on its environment, is liable to be pretty dangerous.
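(In type-signature terms, the “just a mathematical function” view is roughly the following sketch; everything with causal reach lives in whatever outer program samples from the function and wires the samples up to the world.)

```python
from typing import Callable, Mapping, Sequence

# A language model viewed as a pure mathematical object: a function from an
# input token sequence to a probability distribution over the next token.
# On its own it affects nothing; any influence on the environment comes from
# the surrounding program that samples from it and connects the output to
# actuators (a terminal, the internet, a human reader, ...).
LanguageModel = Callable[[Sequence[int]], Mapping[int, float]]
```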

As for what kind of internal motivations these systems have, I’m pretty confused and unsure about that too. But the question itself doesn’t feel as deeply confusing to me when I think in terms of systems and their causal effects on their environment, and whether it makes sense to talk about those systems as behaving Lawfully.

I haven’t looked closely at the ELK paper, but in general, honesty seems like an alignment-complete problem because it requires wantingness in the definition. If ARC has some way of breaking down honesty into gears-level components that still add up to a natural intuitive definition of honesty even in edge cases, and then making an AI system satisfy all those components, that does seem like a promising strategy. Or at least, it’s the right shape of something that could work, in my model.

Corrigibility

However, I don’t really see any reason why it would be possible to imbue an AI with the property of being corrigible. The definition itself refers to it aligning with “what we would intuitively want”, which sure sounds like aiming an AI at this target would be pretty difficult.

Right, “wanting to be corrigible” seems like another problem that is alignment-complete. Corrigibility also has a wantingness component in the definition, but the disadvantage compared to honesty is that it is a lot more complicated and unintuitive even once you have the wantingness. It’s also asymmetric, so unlike honesty, it’s probably not the kind of thing that comes mostly for free once you have mutual wantingness and a baseline level of capabilities in both parties.

The advantage of corrigibility is that it would probably be more directly and immediately useful if we could get a powerful AI system to want to be corrigible, and it’s also the kind of thing that we can apply and talk about sensibly for below-human-level systems. Consider the principles /​ desiderata of corrigibility listed here: it’s a lot easier to tell if a particular below-human-level AI system is adhering to the principle of “operator looping” or “conceptual legibility” or “whitelisting”, and talk about whether it will continue to have those properties as it increases in capabilities, than it is to talk about whether a below-human-level language model system is “honest” or “wants to be honest” or not.

Coordination

The AI X-Risk problem is ultimately caused by a coordination problem.

Yeah, coordination failures rule everything around me. =/​

I don’t have good ideas here, but something that results in increasing the average Lawfulness among humans seems like a good start. Maybe step 0 of this is writing some kind of Law textbook or Sequences 2.0 or CFAR 2.0 curriculum, so people can pick up the concepts explicitly from more than just, like, reading glowfic and absorbing it by osmosis. (In planecrash terms, Coordination is a fragment of Law that follows from Validity, Utility, and Decision.)

Or maybe it looks like giving everyone intelligence-enhancing gene therapy, which is apparently a thing that might be possible!?

I don’t know how much non-AGI AI can help here, but I do think technologies that increase humanity’s coordination ability and collective intelligence are very likely to be very positive, even accounting for any possible negative impact that such enhancements might have on AI timelines. Technologies like prediction markets seem great and have the potential to be hugely positive if they get more widespread and impactful adoption. Other things that somehow increase population-wide understanding of the very basics of markets and economics seem also potentially helpful.

Max H

That does make me interested in continuing the discussion on specifying more clearly what we mean by “lawfulness”, and using less metaphorical descriptions of what is going on here.


Note that Law doesn’t have anything to do with AI specifically; in the story it’s just what Eliezer’s fictional world considers the basic things you teach to children: logical vs. empirical truths, probability theory, utility theory, decision theory, etc.

And then the rest of the story is all about how those concepts can be applied to get technological civilization and Science and various other nice things from first principles.

Lawfulness itself is the concept that certain logical deductions are spotlighted by their simplicity and usefulness, and thus will likely recur across a wide space of possible minds: aliens, humans, AIs, etc. regardless of what those minds value, which is (mostly) more of a free variable.

So on this view, a bunch of woes on Earth (unrelated to AGI) are a result of the fact that Earth!humans are not very Lawful, and thus not very good at coordination, among other things.

Max H

(Note for readers: This dialogue isn’t over yet! Subscribe for future dialogue entries)

habryka

The Lawfulness stuff seems to have elicited a reaction from at least one commenter, and you expressed interest in digging into it yourself. I’ll just ramble for a bit here, but I’m happy to go into some particular direction in more detail, given a prompt.

I want to gesture at a bunch of stuff: epistemic rationality, instrumental rationality, consequentialism, and like, the whole rationalist project in general, as kind of core examples of the concept of Lawfulness. Like, I think you can almost just directly replace “rationality” with “Lawfulness” in this post, and it basically scans. Lawfulness reified as a concept itself is just this extra little idea on top about why exactly all this rationality stuff even does anything and where it comes from, from truly first principles.

You can also point at a bunch of human behaviors or failures in the world, and if you trace back far enough, get to a place where you say, “aha, this is the result of a human or group of humans acting unLawfully in some way. Specifically, they’re falling prey to the sunk cost fallacy, or the conjunction fallacy, or failing to coordinate with each other, or just making some even more fundamental kind of reasoning error, e.g. failing to implement modus ponens”.

It might be helpful to drill into the concept by sticking to examples in humans and not directly bringing in the application to AI, but another framing that kind of rhymes with “capabilities generalize further than alignment” and the concept of instrumental convergence is:

There’s a relatively small number of ways to accomplish stuff effectively / optimally, compared to the amount of possible stuff you can want to accomplish. Maybe there’s even a single optimal method (up to isomorphism) for accomplishing stuff once you get far enough up the capabilities spectrum, i.e. all ideal agents look pretty similar to each other in terms of what kinds of methods of cognition and reasoning and tools they will use (but they can still have different things they value).

I think there are a lot of reasons to expect this is true for ideal agents; e.g. things like Aumann’s agreement theorem and various coherence theorems are hints at it. But there are weaker forms of coherence and Lawfulness that probably kick in well before you get to ideal-agent territory or even wildly-superhuman AGI, and the evidence / intuition for this claim comes from observations of human behavior and introspection rather than Aumann’s-agreement-flavored stuff or coherence theorems.

Regular human levels of intelligence are enough where you start systematically doing better by being more rational /​ Lawful. Rationalists should win, and, in my experience at least, human rationalists do often do pretty well relative to non-rationalists.

Also, I get the sense that a common criticism of the Sequences is that a lot of the concepts aren’t that novel or deep: they’re mostly restatements of stuff that was already in various academic literature, or just a bunch of stuff that’s intuitively obvious to everyone. I think there’s more than a bit of typical-minding / overestimating the general sanity of the rest of humanity by these (perhaps mostly straw / hypothetical) critics. But the fact that a lot of the concepts and conclusions in the Sequences have been independently discovered or written about in a lot of different places, and that many people find them intuitively obvious, is actually evidence that they are pointing to something a bit more universal than just “humans are biased in various ways”.

Max H