Working on alignment at EleutherAI
(Mostly just stating my understanding of your take back at you to see if I correctly got what you’re saying:)
I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely it is to be aligned (obviously, in the limit if you have only one aligned thing, the entire system of that one thing is aligned); and also the more modular each component is (or I guess you would say the better the interfaces between the components), the more likely it is to be aligned. And in particular if the interfaces are good and have few weird interactions, then you can probably have a pretty big network of components without it implementing something egregiously misaligned (like actually secretly plotting to kill everyone).
And people who are optimistic about HCH-like things generally believe that language is a good interface and so conditional on that it makes sense to think that trees of humans would not implement egregiously misaligned cognition, whereas you’re less optimistic about this and so your research agenda is trying to pin down the general theory of Where Good Interfaces/Abstractions Come From or something else more deconfusion-y along those lines.
Does this seem about right?
I agree that in practice you would want to point mild optimization at it, though my preferred resolution (for purely aesthetic reasons) is to figure out how to make utility maximizers that care about latent variables, and then make it try to optimize the latent variable corresponding to whatever the reflection converges to (by doing something vaguely like logical induction). Of course the main obstacles are how the hell we actually do this, and how we make sure the reflection process doesn’t just oscillate forever.
(Transcribed in part from Eleuther discussion and DMs.)
My understanding of the argument here is that you’re using the fact that you care about diamonds as evidence that whatever the brain is doing is worth studying, with the hope that it might help us with alignment. I agree with that part. However, I disagree with the part where you claim that things like CIRL and ontology identification aren’t as worthy of being elevated to consideration. I think there exist lines of reasoning that these fall naturally out as subproblems, and the fact that they fall out of these other lines of reasoning promotes them to the level of consideration.
I think there are a few potential cruxes of disagreement from reading the posts and our discussion:
You might be attributing far broader scope to the ontology identification problem than I would; I think of ontology identification as an interesting subproblem that recurs in a lot of different agendas, and that we may need to solve in certain plausible worst cases / for robustness against black swans.
In my mind ontology identification is one of those things where it could be really hard worst case or it could be pretty trivial, depending on other things. I feel like you’re pointing at “humans can solve this in practice” and I’m pointing at “yeah but this problem is easy to solve in the best case and really hard to solve in the worst case.”
More broadly, we might disagree on how scalable certain approaches used in humans are, or how surprising it is that humans solve certain problems in practice. I generally don’t find arguments about humans implementing a solution to some hard alignment problem compelling, because almost always, when we’re trying to solve the problem for alignment, we’re trying to come up with an airtight, robust solution, whereas humans implement the kludgiest, most naive solution that works often enough.
I think you’re attributing more importance to the “making it care about things in the real world, as opposed to wireheading” problem than I am. I think of this as one subproblem of embeddedness that might turn out to be difficult, that falls somewhere between 3rd and 10th place on my list of most urgent alignment problems to fix. This applies to shard theory more broadly.
I also think the criticism of invoking things like AIXI-tl is missing the point somewhat. As I understand it, the point that is being made when people think about things like this is that nobody expects AGI to actually look like AIXI-tl or be made of Bayes nets, but this is just a preliminary concretization that lets us think about the problem, and substituting this is fine because this isn’t core to the phenomenon we’re poking at (and crucially the core of the thing that we’re pointing at is something very limited in scope, as I listed in one of the cruxes above). As an analogy, it’s like thinking about computational complexity by assuming you have an infinitely large Turing machine and pretending coefficients don’t exist or something, even though real computers don’t look remotely like that. My model of you is saying “ah, but it is core, because humans don’t fit into this framework and they solve the problem, so by restricting yourself to this rigid framework you exclude the one case where it is known to be solved.” To which I would point to the other crux and say “au contraire, humans do actually fit into this formalism, it works in humans because humans happen to be the easy case, and this easy solution generalizing to AGI would exactly correspond to scanning AIXI-tl’s Turing machines for diamond concepts just working without anything special.” (see also: previous comments where I explain my views on ontology identification in humans).
The fact that different cultures have different concepts of death, or that it splinters away from the things it was needed for in the ancestral environment, doesn’t seem to contradict my claim. What matters is not that the ideas are entirely the same from person to person, but rather that the concept has the kinds of essential properties that mattered in the ancestral environment. For instance, as long as your concept of death you pick out can predict that killing a lion makes it no longer able to kill you, that dying means disempowerment, etc, it doesn’t matter if you also believe ghosts exist, as long as your ghost belief isn’t so strong that it makes you not mind being killed by a lion.
I think these core properties are conserved across cultures. Grab two people from extremely different cultures and they can agree that people eventually die, and if you die your ability to influence the world is sharply diminished. (Even people who believe in ghosts have to begrudgingly accept that ghosts have a much harder time filing their taxes.) I don’t think this splintering contradicts my theory at all. You’re selecting out the concept in the brain that best fits these constraints, and maybe in one brain that comes with ghosts and in another it doesn’t.
To be fully clear, I’m not positing the existence of some kind of globally universal concept of death or whatever that is shared by everyone, or that concepts in brains are stored at fixed “neural addresses”. The entire point of doing ELK/ontology identification is to pick out the thing that best corresponds to some particular concept in a wide variety of different minds. This also allows for splintering outside the region where the concept is well defined.
I concede that fear of death could be downstream of other fears rather than encoded. However, I still think it’s wrong to believe that this isn’t possible in principle, and I think these other fears/motivations (wanting to achieve values, fear of , etc) are still pretty abstract, and there’s a good chance of some of those things being anchored directly into the genome using a similar mechanism to what I described.
I don’t get how the case of morality existing in blind people relates. Sure, it could affect the distribution somewhat. That still shouldn’t break extensional specification. I’m worried that maybe your model of my beliefs looks like the genome encoding some kind of fixed neural address thing, or a perfectly death-shaped hole that accepts concepts that exactly fit the mold of Standardized Death Concept, and breaks whenever given a slightly misshapen death concept. That’s not at all what I’m pointing at.
I feel similarly about the quantum physics and neuroscience cases. My theory doesn’t predict that your morality collapses when you learn about quantum physics! Your morality is defined by extensional specification (possibly indirectly; the genome probably doesn’t directly encode many examples of what’s right and wrong), and within any new ontology you use your extensional specification to figure out which things are moral. Sometimes this is smooth, when you make small, localized changes to your ontology. Sometimes you will experience an ontological crisis: empirically, many people seem to experience some kind of crisis of morality when concepts like free will get called into question by quantum mechanics, for instance. Then you inspect lots of examples of things you’re confident about and try to find something in the new ontology that stretches to cover all of those cases (which is extensional reasoning). None of this contradicts the idea that morality, or rather its many constituent heuristics built on high-level abstractions, can be defined extensionally in the genome.
(Partly transcribed from a correspondence on Eleuther.)
I disagree about concepts in the human world model being inaccessible in theory to the genome. I think lots of concepts could be accessed, and that (2) is true in the trilemma.
Consider: as a dumb example that I don’t expect to actually be the case, but which gives useful intuition, suppose the genome really wants to wire something up to the tree neuron. Then the genome could encode a handful of images of trees and, once the brain is fully formed, go through and search for whichever neuron activates hardest on those images. (Of course it wouldn’t actually use literal images, but I expect compressing things down to not actually be that hard.) The more general idea is that we can specify concepts in the world model extensionally, by specifying constraints the concept has to satisfy (for instance, it should activate on these particular data points, or it should have this particular temporal consistency, etc.). Keep in mind this means the genome just has to vaguely gesture at the concept, not define the decision boundary exactly.
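A minimal sketch of that selection step (toy numbers, pure Python; everything here is hypothetical, including the stand-in "brain" with one unit that responds strongly to the probe examples):

```python
import random

random.seed(0)

# Toy "brain": activation of each of 100 units on each of 10 probe inputs.
# The probes stand in for the genome's stored tree examples (hypothetical).
n_units, n_probes = 100, 10
activations = [[random.gauss(0, 1) for _ in range(n_probes)]
               for _ in range(n_units)]

# Make one unit (the "tree neuron") respond strongly to every probe.
tree_unit = 42
activations[tree_unit] = [a + 3.0 for a in activations[tree_unit]]

# Extensional selection: pick the unit whose mean activation over the probes
# is highest -- a vague gesture at the concept, not an exact decision boundary.
selected = max(range(n_units), key=lambda i: sum(activations[i]) / n_probes)
assert selected == tree_unit
```

The point of the sketch is that the selector never needs to know the tree concept’s decision boundary; a few constraints suffice to pick it out of the learned model.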
If this sounds familiar, that’s because this basically corresponds to the naivest ELK solution where you hope the reporter generalizes correctly. This probably even works for lots of current NNs. The fact that this works in humans and possibly current NNs, though, is not really surprising to me, and doesn’t necessarily imply that ELK continues to work in superintelligence. In fact, to me, the vast majority of the hardness of ELK is making sure it continues to work up to superintelligence/arbitrarily weird ontologies. One can argue for natural abstractions, but that would be an orthogonal argument to the one made in this post. This is why I think (2) is true, though I think the statement would be more obvious if stated as “the solution in humans doesn’t scale” rather than “can’t be replicated”.
Note: I don’t expect very many things like this to be hard coded; I expect only a few things to be hard coded and a lot of things to result as emergent interactions of those things. But this post is claiming that the hard coded things can’t reference concepts in the world model at all.
As for more abstract concepts: I think encoding the concept of, say, death, is actually extremely doable extensionally. There are a bunch of ways to point at the concept of death relative to other anticipated experiences/concepts (e.g. the thing that follows serious illness and pain; unconsciousness/the thing that’s like dreamless sleep; the thing we observe happening to other beings that causes them to become disempowered; etc.). Anecdotally, people do seem to be afraid of death in large part because they’re afraid of losing consciousness, the pain that comes before it, the disempowerment of no longer being able to affect things, and so on. Again, none of these things have to point exactly at death; they just serve to select out the neuron(s) that encode the concept of death. Further evidence for this theory: humans across many cultures, and even many animals, pretty reliably develop an understanding of death in their world models, so it seems plausible that evolution would have had time to wire things up; and it’s a fairly well-known phenomenon that very small children, who don’t yet have well-formed world models, tend to endanger themselves with seemingly no fear of death. This all also seems consistent with the fact that lots of things we seem fairly hardwired to care about (e.g. death, happiness) splinter: we’re wired to care about things as specified by some set of points that were relevant in the ancestral environment, and the splintering happens because those points don’t actually define a sharp decision boundary.
As for why I think more powerful AIs will have more alien abstractions: I think that there are many situations where the human abstractions are used because they are optimal for a mind with our constraints. In some situations, given more computing power you ideally want to model things at a lower level of abstraction. If you can calculate how the coin will land by modelling the air currents and its rotational speed, you want to do that to predict exactly the outcome, rather than abstracting it away as a Bernoulli process. Conversely, sometimes there are high levels of abstraction that carve reality at the joints that require fitting too much stuff in your mind at once, or involve regularities of the world that we haven’t discovered yet. Consider how having an understanding of thermodynamics lets you predict macroscopic properties of the system, but only if you already know about and are capable of understanding it. Thus, it seems highly likely that a powerful AI would develop very weird abstractions from our perspective. To be clear, I still think natural abstractions is likely enough to be true that it’s worth elevating as a hypothesis under consideration, and a large part of my remaining optimism lies there, but I don’t think it’s automatically true at all.
Computationally expensive things are less likely to show up in your simulation than the real world, because you only have so much compute to run your simulation. You can’t convincingly fake the AI having access to a supercomputer.
The possibility is that Alice might always be able to tell that she’s in a simulation, no matter what we condition on. I think this is pretty much precluded by the assumption that the generative model is a good model of the world, but if that fails then it’s possible Alice can tell she’s in a simulation no matter what we do. So a lot rides on the statement that the generative model remains a good model of the world regardless of what we condition on.
Paul’s RSA-2048 counterexample is an example of a way our generative model can fail to be good enough no matter how hard we try. The core idea is that there exist things that are extremely computationally expensive to fake and very cheap to check the validity of, so faking them convincingly will be extremely hard.
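As a toy illustration of that check/fake asymmetry (using a hash preimage rather than RSA-2048 factoring, but the structure is the same): verifying a claimed preimage is a single hash call, while producing one from scratch requires brute-force search.

```python
import hashlib

# Cheap to check: verifying a claimed preimage is one hash evaluation.
def check(claimed_preimage, target_digest):
    return hashlib.sha256(claimed_preimage).hexdigest() == target_digest

target = hashlib.sha256(b"hello").hexdigest()
assert check(b"hello", target)
assert not check(b"goodbye", target)

# Expensive to fake: without a known preimage, a simulator must brute-force.
# Even restricting guesses to short numeric strings means ~100k hash calls
# here; a full 256-bit digest is far beyond any plausible simulation budget.
def brute_force(target_digest, max_tries=100_000):
    for i in range(max_tries):
        guess = str(i).encode()
        if hashlib.sha256(guess).hexdigest() == target_digest:
            return guess
    return None  # simulator fails to produce a convincing fake

assert brute_force(target) is None  # "hello" is not a numeric string
```

The simulator is in the position of `brute_force`: it can’t afford to produce the artifact, but anyone inside the simulation can run `check` and notice it’s missing.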
Liked this post a lot. In particular I think I strongly agree with “Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument” as the general vibe of how I feel about Eliezer’s arguments.
A few comments on the disagreements:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.”
An in-between position would be to argue that even if we’re maximally competent at the institutional problem, and can extract all the information we possibly can through experimentation before the first critical try, that just prevents the really embarrassing failures. Irrecoverable failures could still pop up every once in a while after entering the critical regime that we just could not have been prepared for, unless we have a full True Name of alignment. I think the crux here depends on your view on the Murphy-constant of the world (i.e how likely we are to get unknown unknown failures), and how long you think we need to spend in the critical regime before our automated alignment research assistants solve alignment.
By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans in more boring ways or else radically advance the state of human R&D.
For what it’s worth, I think the level of tech needed to overpower humans in more boring ways is a substantial part of my “doom cinematic universe” (and I usually assume nanobots is meant metaphorically). In particular, I think it’s plausible that the “slightly-less-impressive-looking” systems that come before the first x-risk AI will not look obviously one-step-before-x-risk, any more so than current scary capabilities advances do, because of uncertainty over the exact angle the risk arrives from (related to the Murphy crux above), plus discontinuous jumps in specific capabilities, as we currently see in ML.
if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff.
SGD is definitely far from perfect optimization, and it seems plausible that if concealment against SGD is a thing at all, then it would be due to some kind of instrumental thing that a very large fraction of powerful AI systems converge on.
Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research
I think there’s a lot of different cruxes hiding inside the question of how AI acceleration of alignment research interacts with P(doom), including how hard alignment is, and whether AGI labs will pivot to focus on alignment (some earlier thoughts here), even assuming we can align the weak systems used for this. Overall I feel very uncertain about this.
Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects
Explicitly registering agreement with this prediction.
Eliezer is relatively confident that you can’t train powerful systems by imitating human thoughts, because too much of human thinking happens under the surface.
Fwiw, I interpreted this as saying that it doesn’t work as a safety proposal (see also: my earlier comment). Also seems related to his arguments about ML systems having squiggles.
One possible reconciliation: outer optimizers converge on building more coherent inner optimizers because the outer objective is only over a restricted domain, and making the coherent inner optimizer not blow up inside that domain is much, much easier than making it not blow up at all, and potentially easier than just learning all the adaptations to do the thing directly. Concretely, with SGD the restricted domain is the training distribution, and getting your coherent optimizer to act nice on the training distribution isn’t that hard. The hard part of fully aligning it is getting from objectives that shake out as [act nice on the training distribution, but then kill everyone when you get a chance] to an objective that’s actually aligned, and SGD doesn’t really care about the hard part.
If you already have a mesaobjective fully aligned everywhere from the start, then you don’t really need to invoke the crystallization argument; the crystallization argument is basically about how misaligned objectives can get locked in.
Some thoughts: one problem I have with Eliezer’s definition is that bits don’t cost the same in terms of computation because of logical non-omniscience. Imagine two agents in an environment with 2^N possible trajectories corresponding to N bit bitstrings. One agent always outputs the zero bitstring and the other outputs the preimage of some hash function on the zero bitstring or something else expensive like that. Both of these narrow the world down the same amount, and have the same expected influence, but it seems intuitive that if you have to think really hard about each decision you make, then you’re also putting in more optimization in some sense.
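A toy version of this (hypothetical setup; a hash search stands in for “something expensive like a preimage”): both agents pin down exactly one of the 2^N trajectories, so they exert the same number of bits of optimization, but the second burns far more compute doing it.

```python
import hashlib
import math

N = 16  # the environment has 2**N possible trajectories (N-bit strings)

# Agent 1: always outputs the all-zeros string. One step of work.
agent1_out, agent1_evals = "0" * N, 1

# Agent 2: outputs the first N-bit string whose SHA-256 hex digest starts
# with "00" -- a stand-in for an expensive search, costing roughly 256 hash
# evaluations in expectation before the first hit.
agent2_out, agent2_evals = None, 0
for i in range(2 ** N):
    cand = format(i, f"0{N}b")
    agent2_evals += 1
    if hashlib.sha256(cand.encode()).hexdigest().startswith("00"):
        agent2_out = cand
        break

# Both narrow 2**N trajectories down to exactly one, i.e. the same
# log2(2**N) = N bits of optimization, at very different compute cost.
bits_each = math.log2(2 ** N)
assert bits_each == N
assert len(agent1_out) == len(agent2_out) == N
```

Measured in bits, the two agents are indistinguishable; the asymmetry only shows up in `agent2_evals`, which is the thing logical non-omniscience makes you care about.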
If I understand correctly, this is the idea presented: (nonmyopic) mesaoptimizers would want to preserve their mesaobjectives once set. Therefore, if we can make sure that the mesaobjective is something we want, while we can still understand the mesaoptimizer’s internals, then we can take advantage of its desire to remain stable to make sure that even in a future environment where the base objective is misaligned with what we want, the mesaoptimizer will still avoid doing things that break its alignment.
Unfortunately, I don’t think this quite works as stated. The core problem is that an aligned mesaobjective for the original distribution of tasks that humans could supervise has no reason at all to generalize to the more difficult domains that we want the AI to be good at in the second phase, and mesaobjective preservation usually means literally trying to keep the original mesaobjective around. For instance, if you first train a mesaoptimizer to be good at playing a game in ways that imitate humans, and then put it in an environment where it gets a base reward directly corresponding to the environment reward, what will happen is either that the original mesaobjective gets clobbered, or it successfully survives by being deceptive to conceal its mesaobjective of imitating humans. The mesaobjective that was aligned to our base objective in the original setting is no longer aligned in the new setting, and therefore it becomes deceptive (in the sense of hiding its true objective until out of training) to preserve itself. In other words, deception is not just a property of the mesaobjective, but also of the context that the mesaoptimizer is in.
I think what you’re trying to get at is that if the original mesaobjective wants the best for humanity in some sense, then maybe this property, rather than the literal mesaobjective, can be preserved, because a mesaoptimizer which wants the best for humanity will want to make sure that its future self will have a mesaoptimizer which preserves and continues to propagate this property. This argument seems to have a lot in common with the hypothesis of broad basin of corrigibility. I haven’t thought a lot about this but I think this argument may be applicable to inner alignment.
With regard to the redundancy argument, this post (and the linked comment) covers why I think it won’t work. Basically, I think the mistake is thinking of gradients as intuitively being like perturbations due to genetic algorithms, whereas for (sane) functions it’s not possible for the directional derivative to be zero along two directions and still be nonzero along a direction in their span.
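The linear-algebra point can be written out in one line: if the loss f has zero directional derivative along two redundant directions u and v, linearity of the dot product forces it to be zero along anything in their span.

```latex
D_u f = \nabla f \cdot u = 0, \qquad D_v f = \nabla f \cdot v = 0
\;\Longrightarrow\;
D_{au+bv} f = \nabla f \cdot (au + bv)
            = a\,(\nabla f \cdot u) + b\,(\nabla f \cdot v) = 0 .
```

So unlike discrete mutations, a gradient can’t “miss” a descent direction that lies in the span of directions it has already been checked against.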
I agree that the SW/HW analogy is not a good analogy for AGI safety (I think security is actually a better analogy), but I would like to present a defence of the idea that normal systems reliability engineering is not enough for alignment (this is not necessarily a defence of any of the analogies/claims in the OP).
Systems safety engineering leans heavily on the idea that failures happen randomly and (mostly) independently, so that enough failures happening together by coincidence to break the guarantees of the system is rare. That is:
RAID is based on the assumption that hard drive failures happen mostly independently, so the probability of too many drives failing at once is sufficiently low. Even so, in practice this assumption becomes a problem, because a) drives purchased in the same batch have correlated failures and b) rebuilding an array puts strain on the remaining drives, and people have to plan around this by adding extra margin of error.
Checksums and ECC are robust against the occasional bitflip. This is because occasional bitflips are mostly random and getting bitflips that just happen to set the checksum correctly are very rare. Checksums are not robust against someone coming in and maliciously changing your data in-transit, you need signatures for that. Even time correlated runs of flips can create a problem for naive schemes and burn through the margin of error faster than you’d otherwise expect.
Voting between multiple systems assumes that the systems are all honest and just occasionally suffer transient hardware failures. Clean room reimplementations are to try and eliminate the correlations due to bugs, but they still don’t protect against correlated bad behaviour across all of the systems due to issues with your spec.
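To put toy numbers (purely hypothetical) on why correlated failures break the multiply-the-probabilities style of reasoning:

```python
# Hypothetical per-drive annual failure probability, in a 2-drive mirror
# that only loses data if both drives fail.
p = 0.05

# Independence assumption: just multiply.
p_loss_independent = p * p  # 0.0025

# Add a shared failure mode (same manufacturing batch, rebuild stress)
# that takes out both drives together with probability q.
q = 0.01
p_loss_correlated = p * p + q  # ~0.0125, roughly 5x the "engineered" risk

assert p_loss_correlated > 4 * p_loss_independent

# The adversarial case is the limit of this: a capable adversary makes the
# "failures" maximally correlated, so multiplying probabilities tells you
# nothing about how often things go wrong together.
```

A tiny common-mode term dominates the independent term almost immediately; an intelligent adversary is the extreme version of that common mode.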
My point here is that once your failures stop being random and independent, you leave the realm of safety engineering and enter the realm of security (and security against extremely powerful actors is really really hard). I argue that AGI alignment is much more like the latter, because we don’t expect AGIs to fail in random ways, but rather we expect them to intelligently steer the world into directions we don’t want. AGI induced failure looks like things that should have been impossible when multiplying out the probabilities somehow happening regardless.
In particular, relying on independent AGIs not being correlated with each other is an extremely dangerous assumption: AGIs can coordinate even without communication, alignment is a very narrow target that’s hard to hit, and a parliament of misaligned AGIs is definitely not going to end well for us.
I basically agree with most of the post, but there are a few points where I have some value to add:
#29 (consequences of actions): relevant post. I think this problem is possibly reducible to ELK.
#32 (words only trace real thoughts): My understanding of the original point: the reason we would want to train an AI that imitates a human’s thoughts is ideally to create an AI that, internally, uses the same algorithm to come to its answers as the human did. The safety properties come from the fact that the learned algorithm generalizes the same way the human algorithm generalizes (related to #10). One can debate whether humans are powerful/aligned enough even if this were the case in theory, but that’s orthogonal. The problem pointed at here is that systems that are powerful at imitating human thought would not necessarily be using the same algorithm humans use. This other algorithm could generalize in weird ways, and the fact that human explanations don’t reveal all or even most of our actual reasoning makes it harder to learn the human algorithm, because it’s less well specified by the data. In particular, one very concerning type of not-the-same-algorithm is when the system is some mesaoptimizer trying to understand how humans think, and if this happens it kind of defeats a lot of the point of doing this imitation thing in the first place.
#33 (interpretability stuff):
I think the existence of features that we can understand is some evidence but not perfect evidence for natural abstractions. One very ELK-flavored problem is that you can’t tell if a neuron is the “actually deceive humans” neuron or the “humans will realize that this is deceptive” neuron, and it’s not clear which of these concepts is more natural for a NN. Also, even if you can understand most of the network, as long as there is a chunk of the network that you can’t interpret, there could be dangerous stuff hiding in there. I think it’s plausible as an example that we end up with models that contain a mostly-natural-abstractions model of a human that the model uses to figure out how humans would react. If we poke around in such a model we will notice that the “deception” neuron causes the model to be less likely to do a plan, but we have no way of differentiating whether this is because the model is trying to avoid doing deceptive things, or it’s modelling whether humans will be capable of catching it.
I have some other thoughts that I’ll write up as shortforms and edit links into this comment later.
I think I do a poor job of labelling my statements (at least in conversation; I usually do a bit better in post format). Something something illusion of transparency. To be honest, I didn’t even realize explicitly that I was doing this until fairly recent reflection on it.
Thought pattern that I’ve noticed: I seem to have two sets of epistemic states at any time: one more stable set that more accurately reflects my “actual” beliefs and changes fairly slowly, and one set of “hypothesis” beliefs that changes rapidly. Usually when I think some direction is interesting, I alternate my hypothesis beliefs between assuming key claims are true or false and trying to convince myself either way, and if I succeed then I integrate the result into my actual beliefs. In practice this might look like alternating between trying to prove something is impossible and trying to exhibit an example, or taking strange premises seriously and trying to figure out their consequences. I think this is probably very confusing to people, because when talking to people who are already familiar with alignment I’m usually talking about implications of my hypothesis beliefs, since that’s the frontier of what I’m thinking about, and from the outside it looks like I’m constantly changing my mind about things. Writing this up partially to have something to point people to, and partially to push myself to communicate this more clearly.
Some quick thoughts on these points:
I think the ability for humans to communicate and coordinate is a double edged sword. In particular, it enables the attack vector of dangerous self propagating memes. I expect memetic warfare to play a major role in many of the failure scenarios I can think of. As we’ve seen, even humans are capable of crafting some pretty potent memes, and even defending against human actors is difficult.
I think it’s likely that the relevant reference class here is research bets rather than the “task” of AGI. An extremely successful research bet could be currently underinvested in, but once it shows promise, discontinuous (relative to the bet) amounts of resources will be dumped into scaling it up, even if the overall investment towards the task as a whole remains continuous. In other words, even though investment into AGI may be continuous (though that might not even hold), discontinuity can occur on the level of specific research bets. Historical examples would include ImageNet seeing discontinuous improvement with AlexNet despite continuous investment into image recognition up to that point. (Also, for what it’s worth, my personal model of AI doom doesn’t depend heavily on discontinuities existing, though they do make things worse.)
I think there exist plausible alternative explanations for why capabilities has been primarily driven by compute. For instance, it may be because ML talent is extremely expensive whereas compute gets half as expensive every 18 months or whatever, that it doesn’t make economic sense to figure out compute efficient AGI. Given the fact that humans need orders of magnitude less data and compute than current models, and that the human genome isn’t that big and is mostly not cognition related, it seems plausible that we already have enough hardware for AGI if we had the textbook from the future, though I have fairly low confidence on this point.
Monolithic agents have the advantage that they’re able to reason about things that involve unlikely connections between extremely disparate fields. I would argue that the current human specialization is at least in part due to constraints about how much information one person can know. It also seems plausible that knowledge can be siloed in ways that make inference cost largely detached from the number of domains the model is competent in. Finally, people have empirically just been really excited about making giant monolithic models. Overall, it seems like there is enough incentive to make monolithic models that it’ll probably be an uphill battle to convince people not to do them.
Generally agree with the regulation point, given the caveat. I do want to point out that since substantive regulation often moves very slowly, especially when there are well-funded actors trying to prevent AGI development from being regulated, even in non-foom scenarios (months to years) it might not move fast enough. (Example: think about how slowly climate-change-related regulations get adopted.)
Another generator-discriminator gap: telling whether an outcome is good (outcome -> R) is much easier than coming up with plans to achieve good outcomes. Telling whether a plan is good (plan -> R) is much harder, because you also need a world model (plan -> outcome), but for very difficult tasks it still seems easier than coming up with good plans off the bat. However, it feels like the world model is the hardest part here, not just because of embeddedness problems, but in general because knowing the consequences of your actions is really, really hard. So it seems like for most consequentialist optimizers, the quality of the world model is the main thing that matters.
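The decomposition above can be sketched as composed functions (all names and the toy transition table here are my own illustration, not anything from the post):

```python
# Sketch: judging a plan (plan -> R) factors into a world model
# (plan -> outcome) composed with an outcome evaluator (outcome -> R).
# The outcome evaluator is easy to write; the world model carries
# all the difficulty of predicting consequences.

def outcome_value(outcome: str) -> float:
    """outcome -> R: comparatively easy to specify."""
    return 1.0 if outcome == "diamond_acquired" else 0.0

def world_model(plan: str) -> str:
    """plan -> outcome: the hard part -- knowing consequences."""
    transitions = {
        "dig_in_mine": "diamond_acquired",
        "dig_in_garden": "nothing_found",
    }
    return transitions.get(plan, "unknown")

def plan_value(plan: str) -> float:
    """plan -> R: only as trustworthy as the world model it uses."""
    return outcome_value(world_model(plan))

best_plan = max(["dig_in_mine", "dig_in_garden"], key=plan_value)
print(best_plan)  # dig_in_mine
```

The point being: the discriminator `outcome_value` is a one-liner, while everything that makes `plan_value` hard lives inside `world_model`.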
This also suggests another dimension along which to classify our optimizers: the degree to which they care about consequences in the future (I want to say myopia, but that term is already way too overloaded). This is relevant because the further in the future you care about, the more robust your world model has to be, since errors accumulate the more steps you roll the model out (or the more abstraction you do along the time axis). Very low confidence, but maybe this suggests that mesaoptimizers won’t care about things very far in the future, because building a robust world model is hard and models that try will perform worse on the training distribution, so SGD pushes for more myopic mesaobjectives? Though note that this kind of myopia is not quite the kind we need for models to avoid caring about the real world/coordinating with themselves.
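The error-accumulation point can be made concrete with a toy calculation (the multiplicative error model here is my own crude assumption, just to show the shape of the effect):

```python
# Toy illustration: if each rollout step inflates the accumulated
# state error by a factor of (1 + per_step_error), then the error a
# world model suffers grows roughly exponentially with horizon --
# so caring about distant consequences demands a far more robust model.

def rollout_error(per_step_error: float, horizon: int) -> float:
    """Accumulated error after rolling the model out `horizon` steps."""
    error = per_step_error
    for _ in range(horizon - 1):
        error *= (1 + per_step_error)
    return error

for horizon in [1, 10, 100]:
    print(horizon, rollout_error(0.05, horizon))
```

With a 5% per-step error, the accumulated error stays small over ten steps but blows past the scale of the state itself over a hundred, which is the sense in which long-horizon objectives are expensive.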
A few axes along which to classify optimizers:
Competence: An optimizer is more competent if it achieves the objective more frequently on distribution
Capabilities Robustness: An optimizer is more capabilities-robust if it can handle a broader range of OOD world states (and thus possible perturbations) competently.
Generality: An optimizer is more general if it can represent and achieve a broader range of different objectives
Real-world objectives: whether the optimizer is capable of having objectives about things in the real world.
Some observations: it feels like capabilities robustness is one of the big things that makes deception dangerous, because it means the model can figure out plans you never intended it to learn (something not very capabilities-robust would just never learn how to deceive if you don’t show it how). This feels like the critical controller/search-process difference: a controller’s generalization across states depends on the generalization abilities of the model architecture, whereas a search process lets you think about the particular state you find yourself in. The actions that lead to deception are extremely OOD, and a controller would have a hard time executing the strategy reliably without first having seen it, unless NN generalization is wildly better than I’m anticipating.
Real-world objectives are definitely another big chunk of deception danger; caring about the real world leads to nonmyopic behavior (though maybe we’re worried about other causes of nonmyopia too? not sure, tbh). I’m actually not sure how I feel about generality: on the one hand, it feels intuitive that systems only able to represent one objective are in some sense less able to become more powerful just by thinking more; on the other hand, I don’t know what a rigorous argument for this would look like. I think the intuition relates to the idea of general reasoning machinery being the same across lots of tasks, and this machinery being necessary to do better by thinking harder, so any model without this machinery must be weaker in some sense. I think this feeds into capabilities robustness (or lack thereof) too.
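The controller/search-process distinction can be sketched as follows (the two-state toy world and all names are invented for illustration): a controller replays mappings learned on-distribution, while a search process evaluates actions through a world model for whatever state it actually lands in.

```python
# Toy world: (state, action) -> reward. "seen" is on-distribution,
# "novel" is the OOD state where deception-like strategies would live.
WORLD = {
    ("seen", "a"): 1.0, ("seen", "b"): 0.0,
    ("novel", "a"): 0.0, ("novel", "b"): 1.0,
}

# A controller: heuristics memorized from training states only.
CONTROLLER_POLICY = {"seen": "a"}

def controller(state: str) -> str:
    # No entry for OOD states -- generalization is whatever the
    # architecture happens to do; here, a blind default.
    return CONTROLLER_POLICY.get(state, "a")

def search(state: str) -> str:
    # Thinks about *this* state by querying the world model directly.
    return max(["a", "b"], key=lambda action: WORLD[(state, action)])

print(controller("novel"), search("novel"))  # a b
```

Both agree on-distribution, but only the search process picks the rewarding action in the novel state, which is the sense in which search makes OOD strategies (including deceptive ones) reachable.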
Examples of where things fall on these axes:
A rock would have none of these properties.
A pure controller (e.g. a thermostat, a “pile of heuristics”) can be competent, but is not very capabilities-robust, not general at all, and can have objectives over the real world.
An analytic equation solver would be perfectly competent and capabilities-robust (if it always works), not very general (it can only solve equations), and not capable of having real-world objectives.
A search-based process can be competent, would be more capabilities-robust and general, and may have objectives over the real world.
A deceptive optimizer is competent, capabilities-robust, and definitely has real-world objectives.
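The axes and examples above could be encoded as a small data structure (the field names and low/high ratings are just my reading of the list, on an arbitrary scale):

```python
from dataclasses import dataclass

# Each field corresponds to one of the four axes; values summarize
# where each example roughly falls. This is a restatement, not a claim.
@dataclass
class OptimizerProfile:
    name: str
    competence: str
    capabilities_robustness: str
    generality: str
    real_world_objectives: bool

EXAMPLES = [
    OptimizerProfile("rock", "none", "none", "none", False),
    OptimizerProfile("pure controller (thermostat)", "high", "low", "none", True),
    OptimizerProfile("analytic equation solver", "high", "high", "low", False),
    OptimizerProfile("search-based process", "high", "high", "high", True),
    OptimizerProfile("deceptive optimizer", "high", "high", "high", True),
]

# E.g. filter for the profiles that combine the two danger ingredients
# flagged above: capabilities robustness plus real-world objectives.
dangerous = [p.name for p in EXAMPLES
             if p.capabilities_robustness == "high" and p.real_world_objectives]
print(dangerous)
```

Filtering this way picks out exactly the search-based and deceptive optimizers, matching the observation that those two axes together are where the deception risk concentrates.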