“Thousands of researchers at the world’s richest corporations are all working to make AI more powerful. Who is working to make AI more moral?”
(For policymakers and activists skeptical of big tech)
If you’re interested in grokking, I’d suggest my post on the topic.
Do you know if there’s any research relevant to whether “degree of vulnerability to superstimuli” is correlated with intelligence in humans?
One aspect of inner alignment failures that I think is key to safe generalization is that values tend to multiply. E.g., the human reward system is an inner alignment failure wrt evolution’s single “value”. Human values are inner alignment failures wrt the reward system. Each step we’ve seen brings a significant increase in the breadth / diversity of values (admittedly, we’ve only seen two steps, but IMO it also makes sense that the process of inner alignment failure is oriented towards value diversification).
If even a relatively small fraction of the AI’s values orient towards actually helping humans, I think that’s enough to avert the worst possible futures. From that point, it becomes a matter of ensuring that values are able to perpetuate themselves robustly (currently a major focus of our work on this perspective; prospects seem surprisingly good, but far from certain).
maybe the optimized AI-”helping”-”human” superstimuli actually are living good transhuman lives, rather than being a nonsentient “sex toy” that happens to be formed in our image?
I actually think it would be very likely that such superstimuli are sentient. Humans are sentient. If you look at humans in non-sentient states (sleep, sleepwalking, trance states, some anesthetic drugs, etc.), they typically behave quite differently from normal humans.
It might be easier to escape + take over the world than to convince alignment researchers to accept a solution to the alignment problem given out by an unaligned AGI.
...I think it suggests “AI will be like another evolved species” rather than “AI will be like humans”...
This was close to my initial assumption as well. I’ve since spent a lot of time thinking about the dynamics that arise from inner alignment failures in a human-like learning system, essentially trying to apply microeconomics to the internal “economy” of optimization demons that would result from an inner alignment failure. You can see this comment for some preliminary thoughts along these lines. A startling fraction of our deepest morality-related intuitions seem to derive pretty naturally / robustly from the multi-agent incentives associated with an inner alignment failure.
Moreover, I think that there may be a pretty straightforward relationship between a learning system’s reward function and the actual values it develops: values are self-perpetuating, context-dependent strategies that obtained high reward during training. If you want to ensure a learning system develops a given value, it may simply be enough to ensure that the system is rewarded for implementing the associated strategy during training. To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.
I’d suggest writing the message as a plaintext file, then computing the hash of that file. That way, there’s no possible issue with copy/pasting. You still have to track the file, but that’s not too hard.
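For instance, a minimal sketch in Python (the filename prediction.txt is just a placeholder):

```python
import hashlib

# Read the prediction file as raw bytes so the hash doesn't depend on
# copy/paste or text-encoding quirks.
with open("prediction.txt", "rb") as f:
    digest = hashlib.sha512(f.read()).hexdigest()

print(digest)  # publish this 128-character hex string
```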
I’m drawing an analogy between AI training and human learning. I don’t think the process of training an AI via reinforcement learning is as different from human learning as many assume.
Let’s suppose there are ~300 million people who’d use their unlimited power to destroy the world (I think the true number is far smaller). That would mean > 95% of people wouldn’t do so. Suppose there were an alignment scheme that we’d tested billions of times on human-level AGIs, and > 95% of the time, it resulted in values compatible with humanity’s continued survival. I think that would be a pretty promising scheme.
grant absolute power and superhuman intelligence along with capacity for further self-modification to a single person, and I give far better than even odds that what results is utterly catastrophic.
If there were a process that predictably resulted in me having values strongly contrary to those I currently possess, I wouldn’t do it. The vast majority of people won’t take pills that turn them into murderers. For the same reason, an aligned AI at slightly superhuman capability levels won’t self-modify without first becoming confident that its self-modification will preserve its values. Most likely, it would instead develop better alignment tech than we used to create said AI and create a more powerful aligned successor.
The result about superrational agents was only demonstrated for superrational agents. That means agents which implement, essentially, the best of all possible decision theories. So they cooperate with each other and have all the non-contradictory properties that we want out of the best possible decision theory, even if we don’t currently know how to specify such a decision theory.
It’s a goal to aspire to, not a reality that’s already been achieved.
The vast majority of humans would not destroy the world, even given unlimited power to enact their preferences unopposed.
I was specifically talking about the preferences of an individual human. The behavior of the economic systems that derive from the actions of many humans need not be aligned with the preferences of any component part of said systems. For AIs, we’re currently interested in the values that arise in a single AI (specifically, the first AI capable of a hard takeoff), so single humans are the more appropriate reference class.
the world isn’t actually “destroyed”, just radically transformed in a way that doesn’t end with any of the existing humans being alive
In fact, “radically transformed in a way that doesn’t end with any of the existing humans being alive” is what I meant by “destroyed”. That’s the thing that very few current humans would do, given sufficient power. That’s the thing that we’re concerned that future AIs might do, given sufficient power. You might have a different definition of the word “destroyed”, but I’m not using that definition.
How are the following for “new, intuitively compelling insights”?
Human values arise from an inner alignment failure between the brain’s learning system and its steering system.
Note that we don’t try to maximize the activation of our steering system’s reward circuitry via wireheading, and we don’t want a future full of nothing but hedonium. Our values can’t be completely aligned with our steering system.
The human learning system implements a fundamentally simple learning algorithm, with relatively little in the way of ancestral environment-specific evolutionary complexity.
This derives from the bitter lesson as applied to the sorts of general learning architectures that evolution could have found.
The bitter lesson itself derives from applying the simplicity prior to the space of possible learning algorithms. It applies to evolution as much as to human ML researchers.
Human values are actually fairly robust to small variations in the steering system.
The vast majority of humans would not destroy the world, even given unlimited power to enact their preferences unopposed. Most people would also oppose self-modifying into being the sorts of people who’d be okay with destroying the world. This is in stark contrast to how AIs are assumed to destroy the world by default.
The steering system must have non-trivial genetic variation. Otherwise, we could not domesticate animals in as few generations as we manifestly are able to (e.g., foxes).
People with e.g., congenital blindness, or most other significant cognition-related genetic variations, still develop human values. The primary exception, psychopathy, probably develops from something as simple as not having a steering system that rewards the happiness of others.
Also note that the human steering system has components that are obviously bad ideas to include in an AI’s reward function, such as rewards for dominance / cruelty. Most of us still turn out fine.
Implication of the above: there must exist simple learning processes that robustly develop human-compatible values when trained on reward signals generated from a reward model similar to the human steering system.
This perspective doesn’t so much offer a particular alignment scheme as highlight certain mistaken assumptions in alignment-related reasoning. That, by the way, is the fundamental mistake in using P=NP as an analogy to the alignment problem: the field of computational complexity can rely on formal mathematical proofs to establish firm foundations for further work. In contrast, alignment research is built on a large edifice of assumptions, any of which could be wrong.
In particular, a common assumption in alignment thinking is that the human value formation process is inherently complex and highly tuned to our specific evolutionary history, that it represents some weird parochial corner of possible value formation processes. Note that this assumption then places very little probability mass on us having any particular value formation process, so it does not strongly retrodict the observed evidence. In contrast, the view I briefly sketched above essentially states that our value formation process is simply the default outcome of pointing a simple learning process at a somewhat complex reward function. It retrodicts our observed value formation process much more strongly.
I think value formation is less like P=NP and more like the Fermi paradox, which seemed unsolvable until Anders Sandberg, Eric Drexler and Toby Ord published Dissolving the Fermi Paradox. It turns out that properly accounting for uncertainty in the parameters of the Drake equation causes the paradox to pretty much vanish.
Prior to Dissolving the Fermi Paradox, people came up with all sorts of wildly different solutions to the paradox, as you can see by looking at its Wikipedia page. Rather than address the underlying assumptions that went into constructing the Fermi paradox, these solutions primarily sought to add additional mechanisms that seemed like they might patch away the confusion associated with the Fermi paradox.
However, the true solution to the Fermi paradox had nothing to do with any of these patches. No story about why aliens wouldn’t contact Earth or why technological civilizations invariably destroyed themselves would have ever solved the Fermi paradox, no matter how clever or carefully reasoned. Once you assume the incorrect approach to calculating the Drake equation, no amount of further reasoning you perform will lead you any further towards the solution, not until you reconsider the form of the Drake equation.
I think the Fermi paradox and human value formation belong to a class of problems, which we might call “few-cruxed problems”, where progress can be almost entirely blocked by a handful of incorrect background assumptions. For few-cruxed problems, the true solution lies in a part of the search space that’s nearly inaccessible to anyone working from said mistaken assumptions.
The correct approach for few-cruxed problems is to look for solutions that take away complexity, not add more of it. The skill involved here is similar to noticing confusion, but can be even more difficult. Oftentimes, the true source of your confusion is not the problem as it presents itself to you, but some subtle assumptions (the “cruxes”) of your background model of the problem that caused no telltale confusion when you first adopted them.
A key feature of few-cruxed problems is that the amount of cognitive effort put into the problem before identifying the cruxes tells us almost nothing about the amount of cognitive work required to make progress on the problem once the cruxes are identified. The amount of cognition directed towards a problem is irrelevant if the cognition in question only ever explores regions of the search space which lack a solution. It is therefore important not to flinch away from solutions that seem “too simple” or “too dumb” to match the scale of the problem at hand. Big problems do not always require big solutions.
I think one crux of alignment is the assumption that human value formation is a complex process. The other crux (and I don’t think there’s a third crux) is the assumption that we should be trying to avoid inner alignment failures. If (1) human values derive from an inner alignment failure wrt our steering system, and (2) humans are the only places where human values can be found, then an inner alignment failure is the only process to have ever produced human values in the entire history of the universe.
If human values derive from inner alignment failures, and we want to instill human values in an AI system, then the default approach should be to understand the sorts of values that derive from inner alignment failures in different circumstances, then try to arrange for the AI system to have an inner alignment failure that produces human-compatible values.
If, after much exploration, such an approach turned out to be impossible, then I think it would be warranted to start thinking about how to get human-compatible AI systems out of something other than an inner alignment failure. What we actually did was almost completely wall off that entire search space of possible solutions and actively try to solve the inner alignment “problem”.
If the true solution to AI alignment actually looks anything like “cause a carefully orchestrated inner alignment failure in a simple learning system”, then of course our assumptions about the complexity of value formation and the undesirability of inner alignment failures would prevent us from finding such a solution. Alignment would look incredibly difficult because the answer would be outside of the subset of the solution space we’d restricted ourselves to considering.
No key is needed or involved. SHA-512 isn’t an encryption scheme. SHA-512 is a one-way cryptographic hash that maps any input string to a 512-bit pseudorandom string. The only known way to derive the input from the hash is to search over the space of possible inputs for a collision. The difficulty of deriving the input from the hash thus scales exponentially with the entropy of the input.
E.g., given the hash:
1c0fb5008c573315e7b1e1af5ab41d0ce9b8d4469e41c4d59c3041bd99671208c415fcb0359418dd6bc481863d3d5d030a75364318afbec54cdba082df3f9577
it would not be difficult to reverse, because it’s just the hash of the single word “cats”. You can just test every word (or every sequence of 4 letters) until you find a collision. In contrast, a hash like:
4ca4934820c79165975c443baac9020cc40d9c3eac04c22c5fd66849af176903125b02199f21fe9eed5a4912e93a81dc2a21b3e675c369b25a8f42c0f007bcc5
is essentially unbreakable because the input was a long string of random characters. You’d never be able to find the original input I used to generate the hash.
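(For concreteness, the brute-force attack that works against the low-entropy example above might look like this rough Python sketch; target_hash is a placeholder for whichever published digest you’re attacking:)

```python
import hashlib
from itertools import product
from string import ascii_lowercase

target_hash = "..."  # placeholder: the published SHA-512 hex digest

# Try every lowercase 4-letter string (26^4 = 456,976 candidates).
# Only feasible because the input space is tiny, i.e. low entropy.
for letters in product(ascii_lowercase, repeat=4):
    candidate = "".join(letters)
    if hashlib.sha512(candidate.encode()).hexdigest() == target_hash:
        print("found:", candidate)
        break
```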
You can’t “decrypt” a hash because the hash doesn’t encode the message in question. It’s of fixed length, so it can hardly encode an arbitrary-sized message. Instead, you can prove that you either have the input which originally generated the hash in question, or that you were able to find a collision in SHA-512 (which is thought to be very difficult).
The overall process would go like this:
You generate some plaintext message M containing a prediction about AI progress.
You sample a random string of characters R (maybe 100 characters in length).
You generate H, the SHA-512 hash of M||R.
You publish H publicly.
The computational difficulty of deriving M from H scales ~ 2^(entropy of M||R). This is why I suggested appending a random message to the end of your prediction. In the case that you make a low-entropy prediction (e.g., if someone can make plausible guesses about what you’d write for M based on your past writings), you’d still be protected by the high entropy of R.
When you want to prove that you made prediction M when you published H, you publish M||R.
People can then see that SHA-512(M||R) = H, so you must have known M||R in order to have published H (or found a collision).
In other words, the “key” you need to track and keep safe to prove your prediction is M||R.
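Putting the whole commit / reveal process together as a rough Python sketch (the message text is a placeholder):

```python
import hashlib
import secrets

# Commit phase: hash the prediction M together with a random string R.
M = "My prediction about AI progress goes here."  # placeholder message
R = secrets.token_urlsafe(75)  # roughly 100 random characters
H = hashlib.sha512((M + R).encode()).hexdigest()
print("publish now:", H)

# Reveal phase (years later): publish M and R; anyone can recompute the
# hash and check that it matches the H you published earlier.
assert hashlib.sha512((M + R).encode()).hexdigest() == H
```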
You could easily do an ad hoc form of this by just posting an SHA-512 hash of your predictions. This doesn’t have an integrated method to reveal the prediction after n years, but you can publicly precommit to revealing your prediction after n years.
If you’re worried about this revealing your prediction early via a brute force attack, you can append a random sequence of tokens to your prediction before hashing it.
Well put. If you really do consist of different parts, each wanting different things, then your values should derive from a multi-agent consensus among your parts, not just an argmax over the values of the different parts.
In other words, this:
Something with a utility function, if it values an apple 1% more than an orange, if offered a million apple-or-orange choices, will choose a million apples and zero oranges. The division within most people into selfish and unselfish components is not like that, you cannot feed it all with unselfish choices whatever the ratio. Not unless you are a Keeper, maybe, who has made yourself sharper and more coherent
seems like a very limited way of looking at “coherence”. In the context of multi-agent negotiations, becoming “sharper and more coherent” should equate to having an internal consensus protocol that comes closer to the Pareto frontier of possible multi-agent equilibria.
Technically, “allocate all resources to a single agent” is a Pareto optimal distribution, but it’s only possible if a single agent has an enormously outsized influence on the decision making process. A person for whom that is true would, I think, be incredibly deranged and obsessive. None of my parts aspire to create such a twisted internal landscape.
I instead aspire to be the sort of person whose actions both reflect a broad consensus among my individual parts and effectively implement that consensus in the real world. Think results along the lines of the equilibrium that emerges from superrational agents exchanging influence, rather than some sort of “internal dictatorship” where one part infinitely dominates over all others.
Looks like you use gradient magnitude as your saliency score. I’ve looked at using saliency to guide counterfactual modifications to a text, though my focus was on aiding interpretability rather than adversarial robustness. (Paper).
I’ve found that the normgrad saliency score worked well for highlighting important tokens. I.e., saliency = torch.sum(torch.pow(embedding * gradient, 2)).
For more details on normgrad, see: https://arxiv.org/pdf/2004.02866.pdf
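A minimal per-token version of the score I described might look like the following sketch (PyTorch; this is the elementwise variant quoted above rather than a faithful reimplementation of the paper, and the forward pass is left to the caller):

```python
import torch

def normgrad_saliency(embeddings: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """Per-token saliency scores from input embeddings and a scalar loss.

    embeddings: (seq_len, hidden_dim) tensor that requires grad and was used
                in the forward pass that produced `loss`.
    loss:       scalar loss tensor.
    """
    (gradients,) = torch.autograd.grad(loss, embeddings)
    # Elementwise product of embedding and gradient, squared and summed
    # per token, matching the score quoted above.
    return torch.sum(torch.pow(embeddings * gradients, 2), dim=-1)
```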
You can also add “?view=flat” to the end of a glowfic URL to get the entire story on one page. So a URL of the form glowfic.com/posts/#### becomes:
glowfic.com/posts/####?view=flat
Where “####” is a stand-in for the story ID number.
If you click on the menu bar shown in the top right of the screenshot on this post, then you’ll also see a drop down option to view the flat html.
Sounds like a good idea.
Possible first step: we should start carefully and durably recording the current state of AI development, related plans, and associated power dynamics. That way, victors can generate a more precise estimate of the distribution over the values of possible counterfactual victors. We also signal our own commitment to such a value-sharing scheme.
Also, not sure we even need simulations at all. Many-worlds QM seems like it should work just as well for this sort of values handshake. In fact, many-worlds would probably work even better because:
it’s not dependent on how feasible it turns out to be to simulate realistic counterfactual timelines.
the distribution over possible outcomes is wider. If we turn out to be on a doomed timeline such that humanity has essentially zero chance of emerging the victor, there may be other timelines that are less doomed which split off from ours in the past.
there’s no risk of a “treacherous turn” if the AI decides it’s not actually being simulated.
I suppose my preferred strategy would be to derive the process by which human values form and replicate that in AIs. The reason I think this is tractable is because I actually take issue with this statement:
In the same way that the first airplanes did not look like birds, the first human-level AI will not look like humans.
I don’t think birds versus planes is an appropriate analogy for human versus AI learning systems, because effective / general learning systems tend to resemble each other. Simple architectures scale best, so we should expect human learning to be simple and scalable, like the first AGI will be. We’re not some “random sample” from the space of possible mind configurations. Once you condition on generality, you actually get a lot of convergence in the resulting learning dynamics. It’s no coincidence that adversarial examples can transfer across architectures. You can see my thoughts on this here.
I also think that human values derive from a relatively straightforward interaction between our reward circuitry and our learning system, which I discuss in more detail here. The gist of it is that the brain really seems like the sort of place where inner alignment failures should happen, and inner alignment failures seem like they’d be hard for evolution to stop. Thus: the brain is probably full of inner alignment failure (as in, full of competing / cooperating quasi-agentic circuits).
Additionally, if you actually think about the incentives that derive from an inner alignment failure, they seem to have a striking resemblance to the actual ways in which our values work. Many deep / “weird”-seeming value intuitions seem to coincide with a multi-agent inner alignment failure story.
I think odds are good that we’ll be able to replicate such a process in an AI and get values that are compatible with humanity’s continued survival.
One liner: Don’t build a god that also wants to kill you.
Ironically, one of my medium-sized issues with mainline alignment thinking is that it seems to underweight the evidence we get from observing humans and human values. The human brain is, by far, the most general and agentic learning system in current existence. We also have ~7 billion examples of human value learning to observe. The data they provide should strongly inform our intuitions on how other highly general and agentic learning systems behave. When you have limited evidence about a domain, what little evidence you do have should strongly inform your intuitions.
In fact, our observations of humans should inform our expectations of AGIs much more strongly than the above argument implies because we are going to train those AGIs on data generated by humans. It’s well known in deep learning that training data are usually more important than details of the learning process or architecture.
I think alignment thinking has an inappropriately strong bias against anchoring expectations to our observations of humans. There’s an assumption that the human learning algorithm is in some way “unnatural” among the space of general and effective learning algorithms, and that we therefore can’t draw inferences about AGIs based on our observations of humans. Consider, e.g., Eliezer Yudkowsky’s post My Childhood Role Model.
Yudkowsky notes that a learning algorithm hyper-specialized to the ancestral environment would not generalize well to thinking about non-ancestral domains like physics. This is absolutely correct, and it represents a significant misprediction of any view assigning a high degree of specialization to the brain’s learning algorithm.
In fact, large language models arguably implement social instincts with more adroitness than many humans possess. However, original physics research in the style of Einstein remains well out of reach. This is exactly the opposite of what you should predict if you believe that evolution hard coded most of the brain to “cleverly argue that they deserve to receive a larger share of the meat”.
Yudkowsky brings up multiplication as an example of a task that humans perform poorly at, supposedly because brains specialized to the ancestral environment had no need for such capabilities. And yet, GPT-3 is also terrible at multiplication (even after accounting for the BPE issue), and no part of its architecture or training procedure is at all specialized for the human ancestral environment.
But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization? That’s closer to how I think of things, with the system’s high level behavior arising as a sort of negotiated agreement between its various values.
IMO, systems with broader distributions over values are more likely to assign at least some weight to things like “make people actually happy” and to other values that we don’t even know we should have included. In that case, the “make people actually happy” value and the “smile maximization” value can cooperate and make people smile by being happy (and also cooperate with the various other values the system develops). That’s the source of my intuition that broader distributions over values are actually safer: they make you less likely to miss something important.
More generally, I think that a lot of the alignment intuition that “values are fragile” actually comes from a pretty simple type error. Consider:
The computation a system executes depends on its inputs. If you have some distribution over possible inputs, that translates to having a distribution over possible computations.
“Values” is just a label we apply to particular components of a system’s computations.
If a system has a situation-dependent distribution over possible computations, and values are implemented by those computations, then the system also has a situation-dependent distribution over possible values.
However, people can only consciously instantiate a small subset of discrete values at any given time. There thus appears to be a contrast between “the values we can imagine” and “the values we actually have”. Trying to list out a discrete set of “true human values” roughly corresponds to trying to represent a continuous distribution with a small set of discrete samples from that distribution (this is the type error in question). It doesn’t help that the distribution over values is situation-dependent, so any sampling of their values a human performs in one situation may not transfer to the samples they’d take in another situation.
Given the above, it should be no surprise that our values feel “fragile” when we introspect on them.
Preempting a possible confusion: the above treats a “value” and “the computation that implements that value” interchangeably. If you’re thinking of a “value” as something like a principal component of an agent’s utility function, somehow kept separate from the system that actually implements those values, then this might seem counterintuitive.
Under this framing, questions like the destruction of the physical rainforests, or of other things we might value, are mainly about ensuring that a broad distribution of worthwhile values can perpetuate themselves across time and influence the world to at least some degree. “Preserving X”, for any value of X, is about ensuring that the system has at least some values oriented towards preserving X, that those values can persist over time, and that those values can actually ensure that X is preserved. (And the broader the values, the more different Xs we can preserve.)
I think the prospects for achieving those three things are pretty good, though I don’t think I’m ready to write up my full case for believing such.
(I do admit that it’s possible to have a system that ends up pursuing a simple / “dumb” goal, such as maximizing paperclips, to the exclusion of all else. That can happen when the system’s distribution over possible values places so much weight on paperclip-adjacent values that they can always override any other values. This is another reason I’m in favor of broad distributions over values.)
Agreed. It’s particularly annoying because, IMO, there is a strong candidate for an “obvious relationship between the outer loss function and learned values”: learned values reflect the distribution over past computations that achieved high reward on the various shallow proxies of the outer loss function that the model encountered during training.