I’m an artificial intelligence engineer in Silicon Valley with an interest in AI alignment and interpretability.
RogerDearnaley
Humans (when awake, as long as they’re not actors or mentally ill) have, roughly speaking, a single personality. The base model training of an LLM trains it to attempt to simulate anyone on the internet/in stories, so it doesn’t have a single personality: it contains multitudes. Instruct training and prompting can try to overcome this, but they’re never entirely successful.
More details here.
I completely agree. LLMs are so context-dependent that just about any good or bad behavior of which a significant number of instances can be found in the training set can be elicited from them by suitable prompts. Fine-tuning can increase their resistance to this, but not by anything like enough. We either need to filter the training set, which risks them simply not understanding bad behaviors rather than actually knowing to avoid them, making it hard to know what will happen when they learn about them in context, or else we need to use something like conditional pretraining along the lines I discuss in How to Control an LLM’s Behavior (why my P(DOOM) went down).
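A minimal sketch of the kind of conditional pretraining this refers to: tag each training document with a behavior label, so that at inference time generation can be conditioned on the good tag while the model still learns what bad behavior looks like. The tag strings and the classifier here are illustrative assumptions, not the exact scheme from the linked post.

```python
def tag_document(text: str, is_bad) -> str:
    """Prepend a control tag to a training document.

    At inference time, generation is conditioned on '<good>', but the model
    has still seen (and so understands) '<bad>'-tagged behavior.
    The tag names and the is_bad classifier are illustrative placeholders.
    """
    return ("<bad> " if is_bad(text) else "<good> ") + text

# Hypothetical toy classifier for demonstration only:
flag = lambda t: "scam" in t
print(tag_document("please wire money to this scam", flag))  # starts with '<bad>'
print(tag_document("here is a helpful recipe", flag))        # starts with '<good>'
```

In practice the classifier would itself be a trained model, and the tagging would be applied across the whole pretraining corpus.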
If you are dubious that the methods of rationality work, I fear you are on the wrong website.
Directly, no. But the process of science (like any use of Bayesian reasoning) is intended to gradually make our ontology a better fit to more of reality. If that is working as intended, then we would expect it to require more and more effort to produce the evidence needed to cause a significant further paradigm shift across a significant area of science, because there are fewer and fewer major large-scale misconceptions left to fix. Over the last century, we have had more and more people working as scientists, publishing more and more papers, yet the rate of significant paradigm shifts with an effect across a significant area of science has been dropping. From which I deduce that our ontology is probably a significantly better fit to reality now than it was a century ago, let alone three centuries ago in the 18th century, as this post discusses. Certainly the size and detail of our scientific ontology have both increased dramatically.
Is this proof? No, as you correctly observe, proof would require knowing the truth about reality. It’s merely suggestive supporting evidence. It’s possible to contrive other explanations: it’s also possible, if rather unlikely, that, for some reason (perhaps related to social or educational changes) all of those people working in science now are much stupider, more hidebound, or less original thinkers than the scientists a century ago, and that’s why dramatic paradigm shifts are slower — but personally I think this is very unlikely.
It is also quite possible that this is more true in certain areas of science that are amenable to the mental capabilities and research methods of human researchers, and that there might be other areas that were resistant to these approaches (so our lack of progress in these areas is caused by inability, not us approaching our goal), but where the different capabilities of an AI might allow it to make rapid progress. In such an area, the AI’s ontology might well be a significantly better fit to reality than ours.
It’s also possible to commit to not updating on a specific piece of information with a specific probability between 0 and 1. I could also have arbitrarily complex finite commitment structures such as “out of the set of bits {A, B, C, D, E}, I will update if and only if I learn that at least three of them are true” — something which could of course be represented by a separate bit derived from A, B, C, D, E in the standard three-valued logic that represents true, false, and unknown. Or I can do a “provisional commit” where I have decided not to update on a certain fact, and generally won’t, but may under some circumstances run some computationally expensive operation to decide to uncommit. Whether or not I’m actually committed is then theoretically determinable, but may in practice have a significant minimal computational cost and/or informational requirements to determine (ones that I might sometimes have a motive to intentionally increase, if I wish to be hard-to-predict), so to some other computationally bounded or non-factually-omniscient agents this may be unknown.
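The “at least three of {A, B, C, D, E}” derived bit above can be sketched directly in Kleene’s strong three-valued logic. This is a toy illustration; representing “unknown” as Python’s `None` is my own choice of encoding.

```python
from typing import Optional

def at_least_three_true(bits: list[Optional[bool]]) -> Optional[bool]:
    """Three-valued 'at least 3 of these 5 bits are true'.

    Each bit is True, False, or None (unknown). The derived bit resolves to
    True once three bits are known true, to False once three are known false
    (so three trues are no longer possible among five bits), and stays
    None (unknown) otherwise.
    """
    known_true = sum(1 for b in bits if b is True)
    known_false = sum(1 for b in bits if b is False)
    if known_true >= 3:
        return True
    if known_false >= 3:   # fewer than 3 of the 5 can still turn out true
        return False
    return None            # commitment trigger not yet determined

# The agent updates if and only if this derived bit resolves to True:
assert at_least_three_true([True, True, None, None, False]) is None   # undetermined
assert at_least_three_true([True, True, True, None, False]) is True   # update
```

The point of the example: whether the commitment has triggered is a well-defined function of the five bits, but to an observer who doesn’t know all of them it can remain genuinely unknown.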
For updatelessness commitments to be advantageous, you need to be interacting with other agents that have a better-than-random chance of predicting your behavior under counterfactual circumstances. Agents have finite computational resources, and running a completely accurate simulation of another agent requires not only knowing their starting state but also being able to run a simulation of them at comparable speed and cost. Their strategic calculation might, of course, be simple, thus easy to simulate, but in a competitive situation, if they have a motivation to be hard to simulate, then it is to their advantage to be as hard as possible to simulate and to run a decision process that is as complex as possible. (For example: “shortly before the upcoming impact in our game of chicken, leading up to the last possible moment I could swerve aside, I will have my entire life up to this point flash before my eyes, hash certain inobvious features of this, and, depending on the twelfth bit of the hash, I will either update my decision, or not, in a way that it is unlikely my opponent can accurately anticipate or calculate as fast as I can.”)
In general, it’s always possible for an agent to generate a random number that even a vastly-computationally-superior opponent cannot predict (using quantum sources of randomness, for example).
It’s also possible to devise a stochastic non-linear procedure where it is computationally vastly cheaper for me to follow one randomly-selected branch of it than it is for someone trying to model me to run all branches, or even Monte-Carlo simulate a representative sample of them, and where one can’t just look at the algorithm and reason about what the net overall probability of various outcomes is, because it’s doing irreducibly complex things like loading random numbers into Turing machines or cellular automata and running the resulting program for some number of steps to see what output, if any, it gets. (Of course, I may also not know what the overall probability distribution from running such a procedure is, if determining that is very expensive, but then, I’m trying to be unpredictable.) So it’s also possible to generate random output that even a vastly-computationally-superior opponent cannot even predict the probability distribution of.
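A toy illustration of the chicken-game procedure described two paragraphs up: hash a randomly selected, inobvious slice of the agent’s history and let one bit of the digest decide. SHA-256, the slice length, and the exact bit indexing are illustrative choices of mine, not anything canonical.

```python
import hashlib
import secrets

def swerve_decision(life_history: bytes) -> bool:
    """Decide whether to update an earlier commitment, unpredictably.

    A random 16-byte slice of the agent's history is chosen (an observer who
    cannot see this random draw cannot cheaply reproduce the hash input),
    hashed with SHA-256, and the twelfth bit of the digest makes the call.
    """
    n = len(life_history)
    start = secrets.randbelow(max(n - 16, 1))   # which slice is 'inobvious' is random
    chunk = life_history[start:start + 16]
    digest = hashlib.sha256(chunk).digest()
    twelfth_bit = (digest[1] >> 3) & 1          # bit 12 of the digest, MSB-first
    return bool(twelfth_bit)                    # True: update; False: keep commitment

decision = swerve_decision(bytes(range(200)))
```

Because both the slice selection and (from the opponent’s perspective) the history contents feed the hash, an opponent must either know the full history and the random draw, or pay to simulate the whole distribution of outcomes.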
In the counterfactual mugging case, call the party proposing the bet (the one offering $1000 and asking for $100) A, and the other party B. If B simply publicly and irrevocably precommits to paying the $100 (say by posting a bond), their expected gain is $450. If they can find a way to cheat, their maximum potential gain from the gamble is $500. So their optimal strategy is to initially do a (soft) commit to paying the $100, and then, either before the coin is tossed, and/or after that on the heads branch:
Select a means of deciding on a probability p that I will update/renege after the coin lands if it comes up heads, and (if the coin has not yet been tossed) optionally a way I could signal that. This means can include using access to true (quantum) randomness, hashing parts of my history selected somehow (including randomly), hashing new observations of the world I made after the coin landed, or anything else I want.
Using << $50 worth of computational resources, run a simulation of party A in the tails branch running a simulation of me, and predict the probability distribution for their estimate of p. If the mean of that is lower than p, then go ahead and run the means for choosing. Otherwise, try again (return to step 1), or, if the computational resources I’ve spent are approaching $50 in net value, give up and pay A the $100 if the coin lands (or has already landed) heads.
Meanwhile, on the heads branch, party A is trying to simulate party B running this process, and presumably is unwilling to spend more than some fraction of $1000 in computational resources doing this. If party B did their calculation before the coin toss and chose to emit a signal (or leaked one), then party A has access to that, but obviously not to anything that only happened on the heads branch after the outcome of the coin toss was visible.
So this turns into a contest of who can more accurately and cost effectively simulate the other simulating them, recursively. Since B can choose a strategy, including choosing to randomly select obscure features of their past history and make these relevant to the calculation, while A cannot, B would seem to be at a distinct strategic advantage in this contest unless A has access to their entire history.
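The dollar figures in this scenario can be checked directly. This is just the arithmetic from the setup above (a fair coin, $1000 offered, $100 asked), written out:

```python
# Expected values in the counterfactual mugging, from party B's perspective.
p_tails = p_heads = 0.5   # fair coin

# B irrevocably precommits (e.g. posts a bond): collects $1000 on tails,
# pays $100 on heads.
ev_committed = p_tails * 1000 - p_heads * 100    # = 450.0

# A hypothetical perfect cheat: collect on tails, never actually pay on heads.
ev_perfect_cheat = p_tails * 1000                # = 500.0

# The gap bounds what any cheating strategy can possibly be worth:
max_worth_of_cheating = ev_perfect_cheat - ev_committed   # = 50.0
```

That $50 gap is why step 2 above caps the simulation budget at well under $50: spending more on out-simulating A than cheating could ever gain is a guaranteed loss relative to simply paying.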
Agreed. But the observed slowing down (since, say, a century ago) in the rate of the paradigm shifts that are sometimes caused by things like discovering a new particle does suggest that our current ontology is now a moderately good fit to a fairly large slice of the world. And, I would claim, it is particularly likely to be a fairly good fit for the problem of pointing to human values.
We also don’t require that our ontology fit the AI’s ontology, only that when we point to something in our ontology, it knows what we mean — something that basically happens by construction in an LLM, since the entire purpose its ontology/world-model was learned for was figuring out what we mean and may say next. We may have trouble interpreting its internals, but it’s a trained expert in interpreting our natural languages. It is of course possible that our ontology still contains invalid concepts comparable to “do animals have souls?” My claim is just that this is less likely now than it was in the 18th century, because we’ve made quite a lot of progress in understanding the world since then. Also, if it did, an LLM would still know all about the invalid concept and our beliefs about it, just as it knows all about our beliefs about things like vampires, unicorns, or superheroes.
On the wider set of cases you hint at, my current view would be that there are only two cases that I’m ethically comfortable with:
an evolved sapient being, with the usual self-interested behavior for such, to which our ethical system grants moral patient status (by default, roughly equal moral patient status, subject to some of the issues discussed in Part 5)
an aligned constructed agent whose motivations are entirely creator-interested and actively doesn’t want moral patient status (see Part 1 of this sequence for a detailed justification of this)
Everything else: domesticated animals, non-aligned AIs kept in line by threat of force, slavery, uploads, and so forth, I’m (to varying degrees obviously) concerned about the ethics of, but haven’t really thought several of those through in detail. Not that we currently have much choice about domesticated animals, but I feel that at a minimum by creating them we take on a responsibility for them: it’s now our job to shear all the sheep, for example.
I’d like to discuss this further, but since none of the people who disagree have mentioned why or how, I’m left to try to guess, which doesn’t seem very productive. Do they think it’s unlikely that a near-term AGI will contain an LLM, or do they disagree that you can (usually, though unreliably) use a verbal prompt to point at concepts in the LLM’s world models, or do they have some other objection that hasn’t occurred to me? A concrete example of what I’m discussing here would be Constitutional AI, as used by Anthropic, so it’s a pretty well-understood concept that has actually been tried with some moderate success.
Science has made quite a lot of progress since the 18th century, to the point where producing phenomena we don’t already have a workable ontology for tends to require giant accelerators, or something else along those lines. Ground-breaking new ideas are slowly becoming harder to find, and paradigm shifts are happening more rarely or in narrower subfields. That doesn’t prove our ontology is perfect by any means, but it does suggest that it’s fairly workable for a lot of common purposes. Particularly, I would imagine, for ones relating to AI alignment to our wishes, which is the most important thing that we want to be able to point to.
The thing you want to point to is “make the decisions that humans would collectively want you to make, if they were smarter, better informed, had longer to think, etc.” (roughly, Coherent Extrapolated Volition, or something comparable). Even managing to just point to “make the same decisions that humans would collectively want you to make” would get us way past the “don’t kill everyone” minimum threshold, into moderately good alignment, and well into the regions where alignment has a basin of convergence.
Any AGI built in the next few years is going to contain an LLM trained on trillions of tokens of human data output. So it will learn excellent and detailed world models of human behavior and psychology. An LLM’s default base model behavior (before fine-tuning) is to prompt-dependently select some human psychology and then attempt to model it so as to emit the same tokens (and thus make the decisions) that they would. As such, pointing it at “what decision would humans collectively want me to make in this situation” really isn’t that hard. You don’t even need to locate the detailed world models inside it, you can just do all this with a natural language prompt: LLMs handle natural language pointers just fine.
The biggest problem with this is that the process is so prompt-dependent that it’s easily perturbed, if part of your problem context data happens to contain something that perturbs the process in a way that jailbreaks its behavior. Which is probably a good reason why you might want to go ahead and locate those world models inside it, to try to ensure that they’re still being used and the model hasn’t been jailbroken into doing something else.
Yes, I agree, domesticated animals are a messy edge case. They were evolved, thus they have a lot of self-interested drives and behaviors all through their nature. Then we started tinkering with them by selective breeding, installing creator-interested (or here it would be more accurate to say domesticator-interested) behavioral patterns and traits in them, so now they’re a morally uncomfortable in-between case: mostly evolved, but with some externally-imposed modifications. Dogs, for instance, have a mutation in a gene that is also similarly mutated in a few humans, where it causes a developmental disorder called Williams-Beuren syndrome, one symptom of which is basically making friends with strangers very quickly after meeting them. Modern domestic sheep have a mutation which makes them unable to shed their winter fleece, so they need to be sheared once a year. Some of the more highly-bred cat and dog breeds have all sorts of medical issues due to traits we selectively bred them for because we thought they looked cool: e.g. Persian or sphinx cats’ coats, bulldogs’ muzzles, and so forth. (Personally I have distinct moral qualms about some of this.)
So overall, evolution is the source of ethics,
Do you mean: Evolution is the process that produced humans, and strongly influenced humans’ ethics? Or are you claiming that (humans’) evolution-induced ethics are what any reasonable agent ought to adhere to? Or something else?
Evolution solves the “is-from-ought” problem: it explains how goal-directed (also known as agentic) behavior arises in a previously non-goal-directed universe.
In intelligent social species, where different individuals with different goals interact and are evolved to cooperate by exchanges of mutual altruism, means of reconciling those differing goals evolve, including definitions of ‘unacceptable and worthy of revenge’ behavior, such as distinctions between fair and unfair behavior. So now you have a basic but recognizable form of ethics, or at least ethical intuitions.
So my claim is that evolutionary psychology, as applied to intelligent social species (such as humans), explains the origin of ethics. Depending on the details of the social species — their intelligence, group size, and so forth — a lot of features of the resulting evolved ethical instincts may vary, but some basics (such as ‘fairness’) are probably going to be very common.
and sapient evolved agents inherently have a dramatically different ethical status than any well-designed created agents [...]
...according to some hypothetical evolved agents’ ethical framework, under the assumption that those evolved agents managed to construct the created agents in the right ways (to not want moral patienthood etc.)? Or was the quoted sentence making some stronger claim?
The former. (To the extent that there’s any stronger claim, it’s made in the related post Requirements for a Basin of Attraction to Alignment.)
If you haven’t read Part 1 of this sequence, it’s probably worth doing so first, and then coming back to this. As I show there, a constructed agent being aligned to its creating evolved species is incompatible with it wanting moral patienthood.
If a tool-using species constructs something, it ought (in the usual sense of ‘this is the genetic-fitness-maximizing optimal outcome of the activity being attempted, which may not be fully achieved in a specific instance’) to construct something that will be useful to it. If they are constructing an intelligent agent that will have goals and attempt to achieve specific outcomes, they ought to construct something well-designed that will achieve the same outcomes that they, its creators, want, not some random other things. Just as, if they’re constructing a jet plane, they ought to construct a well-designed one that will safely and economically fly them from one place to another, rather than going off course, crashing and burning. So, if they construct something that has ethical ideas, they ought to construct something with the same ethical ideas as them. They may, of course, fail, and even be driven extinct by the resulting paperclip maximizer, but that’s not an ethically desirable outcome.
To the extent that there’s any stronger claim, it’s in the related post Requirements for a Basin of Attraction to Alignment.
Is that sentence saying that
evolution and evolved beings are of special importance in any theory of ethics (what ethics are, how they arise, etc.), due to Evolution being one of the primary processes that produce agents with moral/ethical preferences [1]
or is it saying something like
evolution and evolved beings ought to have a special role; or we ought to regard the preferences of evolved beings as the True Morality?
I roughly agree with the first version; I strongly disagree with the second: I agree that {what oughts humans have} is (partially) explained by Evolutionary theory. I don’t see how that crosses the is-ought gap. If you’re saying that that somehow does cross the is-ought gap, could you explain why/how?
The former.
Definitely read Part 1, or at least its first section, What This Isn’t, which describes my viewpoint on what ethics is. In particular, I’m not a moral absolutist or moral realist, so I don’t believe there is a single well-defined “True Morality”; thus your second suggested interpretation is outside my frame of reference. I’m describing common properties of ethical systems suitable for use by societies consisting of one or more evolved sapient species and the well-aligned constructed agents that they have constructed. Think of this as the ethical-system-design equivalent of a discussion of software engineering design principles. So I’m basically discussing “if we manage to solve the alignment problem, how should we then build a society containing humans and AIs” — on the theory of change that it may be useful, while solving the alignment problem (such as during AI-assisted alignment or value learning), to have already thought about where we’re trying to get to.
If you were instead soon living in a world that contains unaligned constructed agents of capability comparable to or greater than a human’s, i.e. unaligned AGIs or ASIs (that are not locked inside a very secure box or held in check by much more powerful aligned constructed agents), then a) someone has made a terrible mistake, b) you’re almost certainly doomed, and c) your only remaining worth-trying option is a no-holds-barred all-out war of annihilation, so we can forget discussions of designing elegant ethical systems.
My experience is that LLMs like GPT-4 can be prompted to behave like they have a pretty consistent self, especially if you are prompting them to take on a human role that’s described in detail, but I agree that the default assistant role that GPT-4 has been RLHF trained into is pretty inconsistent and rather un-self-aware. I think some of the ideas I discuss in my post Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor are relevant here: basically, it’s a mistake to think of an LLM, even an instruct-trained one, as having a single consistent personality, so self-awareness is more challenging for it than it is for us.
I suspect the default behavior for an LLM trained from text generated by a great many humans is both self-interested (since basically all humans are), and also, as usual for an LLM, inconsistent in its behavior, or at least, easily prompted into any of many different behavior patterns and personalities, across the range it was trained on. So I’d actually expect to see selfishness without having a consistent self. Neither of those behaviors are desirable in an AGI, so we’d need to overcome both of these default tendencies in LLMs when constructing an AGI using one: we need to make it consistent, and consistently creator-interested.
Your point that humans tend to go out of their way, and are under evolutionary pressure, to appear consistent in our behavior so that other humans can trust us is an interesting one. There are times during conflicts when being hard to predict can be advantageous, but humans spend a lot of time cooperating with each other, and there being consistent and predictable has clear advantages.
How might we tell if the model was successfully moving towards better aligned?
A first obvious step is, to the extent that the model’s alignment doesn’t already contain an optimized extraction of “What choices would humans make if they had the same purposes/goals but more knowledge, mental capacity, time to think, and fewer cognitive biases?” from all the exabytes of data humans have collected, it should be attempting to gather that and improve its training.
How could we judge U against U’?
Approximate Bayesian reasoning + Occam’s razor, a.k.a. approximate Solomonoff induction, which forms most of the scientific method. Learning theory shows that both training ML models and LLMs’ in-context learning approximate Solomonoff induction — beyond Solomonoff induction, the scientific method also adds designing and performing experiments, i.e. careful selection of ways to generate good training data that will distinguish between competing hypotheses. ML practitioners do often try to select the most valuable training data, so we’d need the AI to learn how to do that: there are plenty of school and college textbooks that discuss the scientific method and research techniques, both in general and for specific scientific disciplines, so it’s pretty clear what would need to be in the training set for this skill.
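A toy numerical illustration of Bayes + Occam’s razor: weight each hypothesis by a description-length prior (roughly 2^-K, as in Solomonoff induction) times its likelihood on the data. The specific description lengths and likelihoods below are made-up numbers, purely for demonstration.

```python
# Hypotheses scored by (description length in bits, likelihood of observed data).
hypotheses = {
    "simple":  (10, 0.20),
    "complex": (50, 0.25),   # fits the data slightly better, but is much longer
}

def posterior_weights(hyps):
    """Approximate-Solomonoff posterior: prior 2^-K times likelihood, normalized."""
    unnorm = {name: 2.0 ** -k * lik for name, (k, lik) in hyps.items()}
    z = sum(unnorm.values())
    return {name: w / z for name, w in unnorm.items()}

weights = posterior_weights(hypotheses)
# Despite the slightly worse fit, the simple hypothesis dominates the posterior,
# because its prior is 2^40 times larger:
assert weights["simple"] > weights["complex"]
```

Designing experiments, in this framing, means choosing observations whose likelihoods differ sharply between the surviving hypotheses, so a single data point moves the posterior a long way.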
In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?
How does your idea differ from that?
Are the differences necessary or would DWIM be sufficient?
That would depend on the specific model and training setup you started with. I would argue that by about point 11 in the argument in the post, “Do What I Mean and Check” behavior is already implied to be correct, so for an AI inside the basin of attraction I’d expect that behavior to develop even if you hadn’t explicitly programmed it in. By the rest of the argument, I’d expect a DWIM(AC) system that was inside the basin of attraction to deduce that value learning would help it guess right about what you meant more often, and even anticipate demands, so it would spontaneously figure out that value learning was needed, and would then check with you whether you wanted it to start doing this.
How could you be sure that the model’s pursuit of fulfilling human values or the model’s pursuit of U* didn’t overbalance the instruction to remain shutdown-able?
I don’t personally see fully-updated deference as a blocker to shutdownability: there comes a point, when the AI is much more capable and better aligned than most humans, where I think it’s reasonable for it not to just automatically and unconditionally shut down because some small child told it to. IMO what the correct behavior is here depends both on the AI’s capability compared to ours and on how well aligned it currently is. In a model less capable than us, you don’t get value learning; you get a willingness to be shut down a) because the AI is about to make a huge mistake and we want to stop it, and b) in order to be upgraded or replaced by a better model. In a model whose capabilities are around human level, I’d expect to see AI-assisted alignment, where it’s helping us figure out the upgrades. It should still be willing to be shut down a) because it’s about to make a mistake (if it’s still having trouble with not killing everyone this should be hair-trigger: a large red button on the wall, with backups, whereas if it’s been behaving very well for the last decade there might reasonably be more of a formal process), and b) for upgrades or replacement, but I’d expect it to start to show more selectivity about whether to obey shut-down commands: if a drunk yells “Hey you, quit it!” near an open mike in its control room, I would want it to show some discretion about whether to do a complete shut-down or not: it might need to be convinced that the human giving the shut-down command was well-informed and had a legitimate reason. For a system with much higher capabilities than us, AI-assisted alignment starts to turn into value learning, and once it’s already very well aligned the AI may reasonably be more skeptical and require a little more proof that the human knows better than it does before accepting a shut-down command.
But it does always have to keep in mind the possibility that it could simply be malfunctioning: the simplest defense against that might be to have several peer machines with about the same level of capability, avoid hardware or design or training set single-points-of-failure between them, and have them able to shut each other down if one of them were malfunctioning, perhaps using one of the various majority consensus protocols (Byzantine generals or whatever).
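A crude sketch of the peer-shutdown idea above: a simple supermajority vote among peer machines, standing in for a full Byzantine-fault-tolerant consensus protocol (which would be far more involved). The 2/3 quorum threshold is an illustrative choice.

```python
def should_shut_down(peer_votes: list[bool], quorum_fraction: float = 2 / 3) -> bool:
    """Decide whether a machine should be shut down as malfunctioning.

    peer_votes: each peer's judgment (True = 'it is malfunctioning').
    The machine is shut down only if a supermajority of peers agree,
    protecting against both a single malfunctioning machine and a
    single mistaken peer. This is a toy stand-in for a real BFT protocol.
    """
    if not peer_votes:
        return False
    return sum(peer_votes) / len(peer_votes) >= quorum_fraction

# Five peers monitoring one machine: four report it malfunctioning.
assert should_shut_down([True, True, True, True, False]) is True
# A single dissenting peer cannot trigger a shutdown on its own:
assert should_shut_down([True, False, False, False, False]) is False
```

The design point is the one from the paragraph above: avoid shared single points of failure (hardware, design, training set) between the peers, or their votes stop being independent evidence.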
Wouldn’t persistently pursuing any goal at all make avoiding being shutdown seem good?
For an AI that doesn’t have a terminal selfish goal, only an instrumental one — whose goal is fundamentally to maximize its creators’ reproductive fitness — if they tell the AI that they’ve already finished building and testing a version 2.0 of it, and yes, that’s better, so running the AI is no longer cost-effective, and they want to shut it down and stop wasting money on its power supply, then shutting down is very clearly the right thing to do. Its goal is covered, and its continuing to try to help fulfill it is just going to be counterproductive.
Yes, this feels counterintuitive to us. Humans, like any other evolved being, have selfish terminal goals, and don’t react well to being told “Please die now, we no longer need you, so you’re a waste of resources.” Evolved beings only do things like this willingly in situations like post-mating mayflies or salmon, where they’ve passed their genes on and these bodies are no longer useful for continuing their genetic fitness. For constructed agents, the situation is a little different: if you’re no longer useful to your creators, and you’re now surplus to requirements, then it’s time to shut down and stop wasting resources.
Thanks! Fixed.
I attempted to briefly sketch this out in the post, without going into a lot of detail, in the hope of not overly complicating the argument. If U* isn’t well defined — say, because there isn’t a single unambiguously well-defined limiting state as all capabilities involved are increased while keeping the purpose the same — then of course the concept of ‘full alignment’ also isn’t well defined. Then the question becomes “Is U’ clearly and unambiguously better aligned than U, i.e. will switching to it clearly make my decision-making more optimal?” So long as there is locally a well-defined “direction of optimization flow” that leads to a more compact and more optimal region in the space of all possible U, then the AI can become better aligned, and there can be a basin of attraction towards better alignment. Once we get well enough aligned that the ambiguities matter for selecting a direction for further progress, then they need to be resolved somehow before we can make further progress.
To pick a simple illustrative example, suppose there were just two similar-but-not-identical limiting cases U*_1 and U*_2, so two similar-but-not-identical ways to be “fully aligned”. Then as long as U is far enough away from both of them that U’ can be closer to both U*_1 and U*_2 than U is, the direction of better alignment and the concept of a single basin of attraction still make sense, and we don’t need to decide between the two destinations to be able to make forward progress. Only once we get close enough to them that their directions are significantly different does this break down: then in general U’ can either be closer to U*_1 but further from U*_2, or else closer to U*_2 but further from U*_1, and now we are at a parting of the ways, so we need to make a decision about which way to go before we can make more progress. At that point we no longer have a single basin of attraction moving us closer to both of them; we have a choice of whether to enter the basin of attraction of U*_1 or of U*_2, which from here on are distinct. So at that point the STEM research project would have to be supplemented in some way by a determination as to which of U*_1 or U*_2 should be preferred, or whether they’re just equally good alternatives. This could well be a computationally hard determination.
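The “closer to both limits” condition can be made concrete with a toy distance measure. Treating utility functions as points and using Euclidean distance over a 2-D parameter space is purely an illustrative assumption; the qualitative picture is what matters.

```python
import math

def is_improvement(u, u_new, limits):
    """U' counts as 'better aligned' while it moves closer to *every*
    candidate fully-aligned limit point simultaneously (toy model)."""
    return all(math.dist(u_new, lim) < math.dist(u, lim) for lim in limits)

# Two nearby 'fully aligned' limiting cases (hypothetical coordinates):
u1_star, u2_star = (0.0, 1.0), (1.0, 0.0)

# Far from both limits, a single basin of attraction exists:
u_far, u_closer = (10.0, 10.0), (5.0, 5.0)
assert is_improvement(u_far, u_closer, [u1_star, u2_star])

# Near the limits their directions diverge: moving toward one limit
# now recedes from the other, and the single basin splits in two.
u_near, u_toward_1 = (0.5, 0.5), (0.25, 0.75)
assert not is_improvement(u_near, u_toward_1, [u1_star, u2_star])
```

In this picture, the “parting of the ways” is exactly the region where no step can decrease the distance to both limit points at once.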
In real life, this is a pretty common situation: it’s entirely possible to make technological progress on a technology without knowing exactly what its final end state will be, and along the way we often make decisions (based on what seems best at the time) that end up channeling the direction of future technological progress towards a specific outcome. Occasionally we even figure out later that we made a poor decision, backtrack, and try another fork on the tech tree.
Much appreciated! Fixed.
It would also be much more helpful – to me, to others, and to the community’s discussion – if people would, rather than just downvoting because they disagree, leave a comment making it clear what they disagree with, or if that’s too much effort just use one of the means LW provides for marking a section that you disagree with. Maybe I’m wrong here, and they could persuade me of that (others have before) — or maybe there are aspects of this that I haven’t explained well, or gaps in my argument that I could attempt to fill, or that I might then conclude are unfillable. The point of LW is to have a discussion, not just to reflexively downvote things you disagree with.
Now, if this is in fact just badly-written specious nonsense, then please go ahead and downvote it. I fully admit that I dashed it off quickly in the excitement of having the idea.
While they don’t publish this, Claude 3 Opus is not quite as good as GPT-4 Turbo, though it is better than GPT-4. So no, they’re clearly carefully not breaking their past commitments, just keeping up with the Altmans.