Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com
The thing I’m picturing here is a futures contract where charizard-shirt-guy is obligated to deliver 3 trillion paperclips in exchange for one soul. And, assuming a reasonable discount rate, this is a better deal than only receiving a handful of paperclips now in exchange for the same soul. (I agree that you wouldn’t want to invest in a current-market-price paperclip futures contract.)
Damn, MMDoom is a good one. New lore: it won the 2055 technique award.
The training techniques used here are in general very standard ones (although the dissonance filters were a nice touch). For a higher score on this metric, we would have expected more careful work to increase the stability of self-evaluation and/or the accuracy of the judgments.
While the initial premise was a novel one to us, we thought that more new ideas could have been incorporated into this entry in order for it to score more highly on this metric. For example, the “outliers” in the entry’s predictions were a missed opportunity to communicate an underlying pattern. Similarly, the instability of the self-evaluation could have been incorporated into the entry in some clearer way.
We consider the piece a fascinating concept—one which forces the judges to confront the automatability of their own labors. Holding a mirror to the faces of viewers is certainly a classic artistic endeavor. We also appreciated the artistic irony of the entry’s inability to perceive itself.
I think we have failed, thus far. I’m sad about that. When I began posting in 2018, I assumed that the community was careful and trustworthy, and that undeserved connotations would not easily sneak into our work and discourse. I no longer believe that, and no longer hold that trust.
I empathize with this, and have complained similarly (e.g. here).
I have also been trying to figure out why I feel quite a strong urge to push back on posts like this one. E.g. in this case I do in fact agree that only a handful of people actually understand AI risk arguments well enough to avoid falling into “suggestive names” traps. But I think there’s a kind of weak man effect where if you point out enough examples of people making these mistakes, it discredits even those people who avoid the trap.
Maybe another way of saying this: of course most people are wrong about a bunch of this stuff. But the jump from that to claiming the community or field has failed isn’t a valid one, because the success of a field is much more dependent on max performance than mean performance.
In this particular case, Ajeya does seem to lean on the word “reward” pretty heavily when reasoning about how an AI will generalize. Without that word, it’s harder to justify privileging specific hypotheses about what long-term goals an agent will pursue in deployment. I’ve previously complained about this here.
Ryan, curious if you agree with my take here.
Copying over a response I wrote on Twitter to Emmett Shear, who argued that “it’s just a bad way to solve the problem. An ever more powerful and sophisticated enemy? … If the process continues you just lose eventually”.
I think there are (at least) two strong reasons to like this approach:
1. It’s complementary with alignment.
2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs, to setups that would control slightly superhuman AGIs, to…
As one example of this: as you get increasingly powerful AGI you can use it to identify more and more vulnerabilities in your code. Eventually you’ll get a system that can write provably secure code. Of course that’s still not a perfect guarantee, but if it happens before the level at which AGI gets really dangerous, that would be super helpful.
This is related to a more general criticism I have of the P(doom) framing: that it’s hard to optimize because it’s a nonlocal criterion. The effects of your actions will depend on how everyone responds to them, how they affect the deployment of the next generation of AIs, etc. An alternative framing I’ve been thinking about: the g(doom) framing. That is, as individuals we should each be trying to raise the general intelligence threshold at which bad things happen.
This is much more tractable to optimize! If I make my servers 10% more secure, then maybe an AGI needs to be 1% more intelligent in order to escape. If I make my alignment techniques 10% better, then maybe the AGI becomes misaligned 1% later in the training process.
You might say: “well, what happens after that”? But my point is that, as individuals, it’s counterproductive to each try to solve the whole problem ourselves. We need to make contributions that add up (across thousands of people) to decreasing P(doom), and I think approaches like AI control significantly increase g(doom) (the level of general intelligence at which you get doom), thereby buying more time for automated alignment, governance efforts, etc.
I originally found this comment helpful, but have now found other comments pushing back against it to be more helpful. Upon reflection, I don’t think the comparison to MATS is very useful (a healthy field will have a bunch of intro programs), the criticism of Remmelt is less important given that Linda is responsible for most of the projects, the independence of the impact assessment is not crucial, and the lack of papers is relatively unsurprising given that it’s targeting earlier-stage researchers/serving as a more introductory funnel than MATS.
my guess is the brain is highly redundant and works on ion channels that would require actually a quite substantial amount of matter to be displaced (comparatively)
Neurons are very small, though, compared with the size of a hole in a gas pipe that would be necessary to cause an explosive gas leak. (Especially because you then can’t control where the gas goes after leaking, so it could take a lot of intervention to give the person a bunch of away-from-building momentum.)
I would probably agree with you if the building happened to have a ton of TNT sitting around in the basement.
The resulting probability distribution of events will definitely not reflect your prior probability distribution, so I think Thomas’ argument still doesn’t go through. It will reflect the shape of the wave-function.
This is a good point. But I don’t think “particles being moved the minimum necessary distance to achieve the outcome” actually favors explosions. I think it probably favors the sensor hardware getting corrupted, or it might actually favor messing with the firemen’s brains to make them decide to come earlier (or messing with your mother’s brain to make her jump out of the building)—because both of these are highly sensitive systems where small changes can have large effects.
Does this undermine the parable? Kinda, I think. If you built a machine that samples from some bizarre inhuman distribution, and then you get bizarre outcomes, then the problem is not really about your wish any more, the problem is that you built a weirdly-sampling machine. (And then we can debate about the extent to which NNs are weirdly-sampling machines, I guess.)
The outcome pump is defined in a way that excludes the possibility of active subversion: it literally just keeps rerunning until the outcome is satisfied, which is a way of sampling based on (some kind of) prior probability. Yudkowsky is arguing that this is equivalent to a malicious genie. But this is a claim that can be false.
In this specific case, I agree with Thomas that whether or not it’s actually false will depend on the details of the function: “The further she gets from the building’s center, the less the time machine’s reset probability.” But there’s probably some not-too-complicated way to define it which would render the pump safe-ish (since this was a user-defined function).
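Read this way, the outcome pump is just rejection sampling from the prior over outcomes, filtered by a user-defined reset function. A minimal sketch of that reading (the `reset_probability` function and the 100-meter scale here are hypothetical illustrations, not details from the original parable):

```python
import random

def outcome_pump(sample_world, reset_probability, max_tries=1_000_000):
    """Keep rerunning (resetting) until an outcome survives: i.e. rejection-sample
    worlds from the prior, accepting each with probability 1 - reset_probability."""
    for _ in range(max_tries):
        world = sample_world()  # one draw from the prior over outcomes
        if random.random() >= reset_probability(world):
            return world  # this timeline is allowed to stand
    raise RuntimeError("no acceptable outcome found")

# Hypothetical user-defined reset function: the further the mother ends up
# from the building's center (in meters), the lower the reset probability.
def reset_probability(world):
    return max(0.0, 1.0 - world["distance_from_center"] / 100.0)
```

Whether this behaves like a malicious genie then depends entirely on the prior (`sample_world`) and on how carefully `reset_probability` is defined, which is the point about user-defined functions above.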
People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don’t like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:
1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they’re relevant. So the speed cost of deception will be amortized across the (likely very long) training period.
2. AGIs will represent a huge number of beliefs and heuristics which inform their actions (e.g. every single fact they know). A heuristic like “when you see X, initiate the world takeover plan” would therefore constitute a very small proportion of the total information represented in the network; it’d be hard to regularize it away without regularizing away most of the AGI’s knowledge.
I think that something like the speed vs simplicity tradeoff is relevant to the likelihood of deceptive alignment, but it needs to be more nuanced. One idea I’ve been playing around with: the tradeoff between conservatism and systematization (as discussed here). An agent that prioritizes conservatism will tend to do the things they’ve previously done. An agent that prioritizes systematization will tend to do the things that are favored by simple arguments.
To illustrate: suppose you have an argument in your head like “if I get a chance to take a 60/40 double-or-nothing bet for all my assets, I should”. Suppose you’ve thought about this a bunch and you’re intellectually convinced of it. Then you’re actually confronted with the situation. Some people will be more conservative, and follow their gut (“I know I said I would, but… this is kinda crazy”). Others (like most utilitarians and rationalists) will be more systematizing (“it makes sense, let’s do it”). Intuitively, you could also think of this as a tradeoff between memorization and generalization; or between a more egalitarian decision-making process (“most of my heuristics say no”) and a more centralized process (“my intellectual parts say yes”). I don’t know how to formalize any of these ideas, but I’d like to try to figure it out.
Ty for review. I still think it’s better, because it gets closer to concepts that might actually be investigated directly. But happy to agree to disagree here.
Small relevant datapoint: the paper version of this was just accepted to ICLR, making it the first time a high-level “case for misalignment as an x-risk” has been accepted at a major ML conference, to my knowledge. (Though Langosco’s goal misgeneralization paper did this a little bit, and was accepted at ICML.)
Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn’t forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e. g., the gear’s role in the whole mechanism)?
I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it’s fine if the specific gear gets swapped out for a copy. I’m not assuming there’s a mistake anywhere, I’m just assuming you switch from caring about one type of property it has (physical) to another (functional).
In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model.
I picture you saying “well, you could just not prioritize them”. But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing “this particular gear”, but you realize that atoms are constantly being removed and new ones added (implausibly, but let’s assume it’s a self-repairing gear) and so there’s no clear line between this gear and some other gear. Whereas, suppose we assume that there is a clear, simple definition of “the gear that moves the piston”—then valuing that could be much simpler.
Zooming out: previously you said
I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that’s a separate topic. I think that, if you have some value v(x), and it doesn’t run into direct conflict with any other values you have, and your model of x isn’t wrong at the abstraction level it’s defined at, you’ll never want to change v(x).
I’m worried that we’re just talking about different things here, because I totally agree with what you’re saying. My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do) then you’re going to systematize your values. And secondly, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, then you’re going to systematize your values.
Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might look very similar to value systematization or concretization. But it could also look very different—e.g. the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations. So I think there are two separate things going on here.
(drafted this reply a couple months ago but forgot to send it, sorry)
your low-level model of a specific gear’s dynamics didn’t change — locally, it was as correct as it could ever be.
And if you had a terminal utility function over that gear (e. g., “I want it to spin at the rate of 1 rotation per minutes”), that utility function won’t change in the light of your model expanding, either. Why would it?
Let me list some ways in which it could change:
Your criteria for what counts as “the same gear” change as you think more about continuity of identity over time. Once the gear starts wearing down, this will affect what you choose to do.
After learning about relativity, your concepts of “spinning” and “minutes” change, as you realize they depend on the reference frame of the observer.
You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position. For example, you might have cared about “the gear that was driving the piston continuing to rotate”, but then realize that it’s a different gear that’s driving the piston than you thought.
These are a little contrived. But so too is the notion of a value that’s about such a basic phenomenon as a single gear spinning. In practice almost all human values are (and almost all AI values will be) focused on much more complex entities, where there’s much more room for change as your model expands.
Take a given deontological rule, like “killing is bad”. Let’s say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that “predicts” that you’re very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it’s perfectly coherent to model a deontologist as an agent with a lot of separate utility functions.
This doesn’t actually address the problem of underspecification, it just shuffles it somewhere else. When you have to choose between two bad things, how do you do so? Well, it depends on which probability distributions you’ve chosen, which have a number of free parameters. And it depends very sensitively on free parameters, because the region where two deontological rules clash is going to be a small proportion of your overall distribution.
In the standard story, what are the terminal goals? You could say “random” or “a mess”, but I think it’s pretty compelling to say “well, they’re probably related to the things that the agent was rewarded for during training”. And those things will likely include “curiosity, gaining access to more tools or stockpiling resources”.
I call these “convergent final goals” and talk more about them in this post.
I also think that an AGI might systematize other goals that aren’t convergent final goals, but these seem harder to reason about, and my central story is that the goals it systematizes will be convergent final goals. (Note that this is somewhat true for humans, as well: e.g. curiosity and empowerment/success are final-ish goals for many people.)
Yeah, good point. The intuition I want to point at here is “general relativity was simpler than Newtonian mechanics + ad-hoc adjustments for Mercury’s orbit”. But I do think it’s a little tricky to pin down the sense in which it’s simpler. E.g. what if you didn’t actually have any candidate explanations for why Mercury’s orbit was a bit off? (But you’d perhaps always have some hypothesis like “experimental error”, I guess.)
I’m currently playing around with the notion that, instead of simplicity, we’re actually optimizing for something like “well-foundedness”, i.e. the ability to derive everything from a small set of premises. But this feels close enough to simplicity that maybe I should just think of this as one version of simplicity.
A mix of deliberate and blind spot. I’m assuming that almost everything related to physical engineering and technological problems has been worked out, and so the stuff remaining is mostly questions about how (virtual) minds and civilizations play out (which are best understood via simulation) and questions about what other universes and other civilizations might look like.
But even if the probes aren’t running extensive experiments, they’re almost certainly learning something from each new experience of colonizing a solar system, and I should have incorporated that somehow.