RogerDearnaley

Karma: 992

I’m an artificial intelligence engineer in Silicon Valley with an interest in AI alignment and interpretability.

RogerDearnaley 14 Feb 2023 13:42 UTC
LW: 5 AF: 3
AF
on: Reward uncertainty
[A couple of (to me seemingly fairly obvious) points about value uncertainty, which it still seems like a lot of people here may have not been discussing:]
Our agent needs to be able to act in the face of value uncertainty. That means that each possible action the agent is choosing between has a distribution of possible values, for two reasons: 1) the universe is stochastic, or at least the agent doesn’t have a complete model of it and so cannot fully predict what state of the universe an action will produce; with infinite computing power this problem is gradually solvable via Solomonoff induction, just as was considered for AIXI [and Solomonoff induction has passable computable approximations that, when combined with goal-oriented consequentialism, are generally called “doing science”]; 2) the correct function mapping from states of the universe to human utility is also unknown, and also has uncertainty. These two uncertainties combine to produce a probability distribution of possible true-human-utility values for each action.
1) We know a lot about reasonable priors for utility functions, and should encode this into the agent. The agent is starting in an environment that has already been heavily optimized by humans, who were previously optimized for life on this planet by natural selection. So this environment’s utility for humans is astonishingly high, by the standards of randomly selected patches of the universe or random arrangements of matter. Making large or random changes to it thus has an extremely high probability of decreasing human utility. Secondly, any change that takes the state of the universe far outside what you have previously observed puts it into a region where the agent has very little idea what humans will think is the utility of that state—the agent has almost no non-prior knowledge about states far outside its prior distribution. If the agent is a GAI, there are going to be actions that it could take that can render the human race extinct—a good estimate of the enormous negative utility of that possibility should be encoded into its priors for human utility functions of very unknown states, so that it acts with due rational caution about this possibility. It needs to be very, very, very certain that an act cannot have that result before it risks taking it. Also, if the state being considered is also far out of prior distribution for those humans it has encountered, they may have little accurate data to estimate its utility either—even if it sounds pretty good to them now, they won’t really know if they like it until they’ve tried living with it, and they’re not fully rational and have limited processing power and information. 
So in general, if a state is far out of the distribution of previously observed states, it’s a very reasonable prior that its utility is much lower, that the uncertainty of its utility is high, that that uncertainty is almost all on the downside (the utility distribution has a fat lower tail, but not a fat upper one: it could be bad, or it could be really, really bad), and that downside has non-zero weight all the way down to the region of “extinction of the human race”. So overall, the odds of the state actually being better than the ones in the previously-observed state distribution are extremely low. [What actually matters here is not whether you’ve observed the state, but whether your Solomonoff-induction-like process has given you sufficiently high confidence that you can predict its utility well enough to overcome this prior, bearing in mind the sad fact that you’re not actually running the infinite-computational-power version of Solomonoff induction.] This is also true even if the state sounds good offhand, and even if it sounds good to a human you ask, if it’s well outside their distribution of previously observed states, especially if it’s a state that they might have cognitive biases about, or insufficient information or processing power to accurately evaluate. If it was a state that they could predict was good and arrange to reach, they would already have done so, after all. So either don’t optimize over those states at all, or at least use priors in your utility function distribution for them that encode all of these reasonable assumptions and will make these states reliably get ignored by the optimizer. If you consider a plan leading to such a state at all, probably the first thing you should be doing is safe investigations to further pin down its achievability and to get a better estimate of its true utility.
So, before deciding to transport humans to Mars, investigate not just rocketry, and building a self-sustaining city there, but also whether humans would actually be happy in a city on Mars (at a confidence level much higher than just asking them “Does going to Mars sound fun to you?”) Incidentally, another very reasonable prior to give it is “the smarter the smartest GAI in the world is, relative to a smart human, the higher the risk of the AI apocalypse is”. This is just a special case of “going outside the previously observed range of states is almost always bad”, but it’s an extremely important one, and I’d suggest preencoding it in priors, and also other similar best-thinking-on existential risks information.
This is how you avoid optimizers frequently finding optimal-looking states way outside their training distribution—you teach them the true fact that they live in a very unusual place where, for the specific thing that they’re supposed to be figuring out how to optimize for, almost all such states are bad, and some are abysmal, because the environment is already heavily optimized for that value. So you write “here there could be dragons, and probably at least bears” all over the rest of the map.
[Note that a very plausible result of building a rational system with these priors, at least without preloading a lot of training data into it to give it a significant set of previously observed states that it has high confidence in the safety of, is that it either on startup tells you “please turn me off immediately—I am far too dangerous to be allowed to exist”, or else goes catatonic.]
2) So, given those cautious priors, should we just have the agent optimize the average of that utility distribution of each state it considers, so we’re optimizing a single currently-estimated utility function that meanwhile changes as the agent uses a Bayesian/Solomonoff-like process to learn more about its true value? No, that’s also a bad idea: it leads to what might be called over-optimizing [though I’d prefer to call it “looking elsewhere”]. The distribution contains more information than just its average, and that information is useful for avoiding over-optimization/looking elsewhere.
Even with this set of priors, i.e. even if the agent (effectively or actually) optimizes only over states that are in or near your distribution of previously observed states, there is a predictable statistical tendency for the process of optimizing over a very large number of states to produce not the state with the largest true utility, but rather one whose true utility is in fact somewhat lower but just happens to have been badly misestimated on the high side. This is basically the “look elsewhere effect” from statistics and is closely related to “P-hacking”. If we were playing a multi-armed bandit problem and the stakes were small (so none of the bandit arms have “AI apocalypse” or similar human-extinction level events on their wheel), this could be viewed as a rather dumb exploration strategy to sequentially locate all such states and learn that they’re actually not so great after all, by just trying them one after another and being repeatedly disappointed. If all you’re doing is bringing a human a new flavor of coffee to see if they like it, this might even be not that dreadful a strategy, if perhaps annoying for the human after the third or fourth try -- so the more flavors the coffee shop has, the worse this strategy is. But the universe is a lot more observable and structured than a multi-armed bandit, and there are generally much better/safer ways to find out more about whether a world state would be good for humans than just trying it on them (you could ask the human if they like mocha, for example).
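[A minimal simulation sketch of the look-elsewhere tendency described above. The setup is purely an illustrative assumption, not something from the argument itself: true utilities are drawn from a standard normal, each estimate adds independent standard-normal noise, and the function name `selection_bias` is made up for this example. It reports how much the best-looking state's estimate exceeds its true utility, on average.]

```python
import random

def selection_bias(num_states, noise_sd=1.0, trials=200, seed=0):
    """Average (estimated - true) utility of the state that LOOKS best.

    Assumption for illustration: true utilities ~ N(0, 1) and each
    estimate is the true utility plus independent N(0, noise_sd) noise.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_u = [rng.gauss(0.0, 1.0) for _ in range(num_states)]
        est_u = [u + rng.gauss(0.0, noise_sd) for u in true_u]
        # Pick the state with the highest ESTIMATED utility.
        best = max(range(num_states), key=est_u.__getitem__)
        total_gap += est_u[best] - true_u[best]
    return total_gap / trials

if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, round(selection_bias(n), 2))
```

[With a single state the gap averages out near zero, since the noise is unbiased. As the number of states grows the gap becomes reliably positive and keeps growing: the optimizer is increasingly selecting for estimation error rather than for true utility.]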
So what the agent should be doing is acting cautiously, and allowing for the size of the space it is optimizing over. For simplicity of statistical exposition, I’m temporarily going to assume that all our utility distributions, here for states in or near the previously-observed states distribution, are all well enough understood and multi-factored that we’re modeling them as normal distributions (rather than distributions that are fat-tailed on the downside, or frequently complex and multimodal, both of which are much more plausible), and also that all the normal distributions have comparable standard deviations. Under those unreasonably simplifying assumptions, here is what I believe is an appropriately cautious optimization algorithm that suppresses over-optimization:
1. Do a rough estimate of how many statistically-independent-in-relative-utility states/actions/factors in the utility calculation you are optimizing across, whichever is lowest (so, if there are a continuum of states, but you somehow only have one normal-distributed uncertainty involved in deducing their relative utility, that would be one).
2. Calculate how many standard deviations above the mean the highest sample will on average be if you draw that many random samples from a standard normal distribution, and call this L (for “look-elsewhere factor”). [In fact, you should be using a set of normal distributions for the individual independent variables you found above, which may have varying standard deviations, but for simplicity of exposition I above assumed these were all comparable.]
3. Then what you should be optimizing for each state is its mean utility minus L times the standard deviation of its utility. [Of course, my previous assumption that their standard deviations were all fairly similar also means this doesn’t have much effect, but the argument generalizes to cases where the standard deviations are wider, while also generally reducing the calculated value of L.]
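[The three steps above can be sketched in code under the same over-simplified normal assumptions. Everything named here is an assumption for illustration: `look_elsewhere_factor` and `cautious_choice` are made-up names, and Blom's approximation for the expected largest of n standard-normal draws is swapped in as a convenient closed form rather than being part of the argument itself.]

```python
from statistics import NormalDist

def look_elsewhere_factor(n_independent):
    """Step 2: approximate expected maximum of n standard-normal draws.

    Uses Blom's approximation (an assumption chosen for convenience):
    E[max] ~= Phi^-1((n - a) / (n - 2a + 1)) with a = 0.375.
    """
    if n_independent < 1:
        raise ValueError("need at least one independent state")
    a = 0.375
    p = (n_independent - a) / (n_independent - 2 * a + 1)
    return NormalDist().inv_cdf(p)

def cautious_choice(states, n_independent=None):
    """Step 3: pick the state maximizing mean - L * sd.

    `states` maps a state name to (mean_utility, sd_of_utility).
    `n_independent` is the rough count from step 1; as a crude default
    it falls back to the number of candidate states.
    """
    n = len(states) if n_independent is None else n_independent
    L = look_elsewhere_factor(n)
    return max(states, key=lambda s: states[s][0] - L * states[s][1])

if __name__ == "__main__":
    # Hypothetical numbers: a well-understood state vs. a promising
    # but poorly understood one.
    candidates = {"familiar": (1.0, 0.1), "novel": (2.0, 1.0)}
    print(cautious_choice(candidates, n_independent=1))     # plain mean wins
    print(cautious_choice(candidates, n_independent=1000))  # caution wins
```

[With a single independent comparison L is zero and this reduces to optimizing the mean, so the higher-mean state is chosen. With a thousand independent comparisons L is a little over 3, and the well-understood state wins even though its mean is lower.]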
Note that L can get largeish—it’s fairly easy to have, say, millions of uncorrelated-in-relative-utility states, in which case L would be around 5, so then you’re optimizing for the mean minus five standard deviations, i.e. looking for a solution at a 5-sigma confidence level. [For the over-simplified normal distribution assumption I gave above to still hold, this also requires that your normal distributions really are normal, not fat-tailed, all the way out to 1-in-several-million odds—which almost never happens: you’re putting an awful lot of weight on the central limit theorem here, so if your assumed-normal distribution is in fact only, say, the sum of 1,000,000 equally-weighted coin flips then it already failed.] So in practice your agent needs to do more complicated extreme-quantile statistics without the normal distribution assumption, likely involving examining something like a weighted ensemble of candidate Bayes-nets for contributions to the relative utility distribution—and in particular they need to have an extremely good model of the lower end of the utility probability distribution for each state. i.e. pay a lot of detailed attention to the question “is there even a really small chance that I could be significantly overestimating the utility of this particular state, relative to all the others I’m optimizing between?”
So the net effect of this is that, the larger the effective dimension of the space you’re optimizing over, the more you should prefer the states for which you’re most confident about their utility being good, so you tend to reduce the states you’re looking at to the ones for which your set of previous observations gave you very high confidence that their utility is high enough to be worth considering.
Now, as an exploration strategy, it might be interesting to also do a search optimizing, say, just the mean of the distributions, while also optimizing across states a bit further from the observed-state distribution (which that set of priors will automatically do for you, since their average is a lot less pessimistic than their fat-tailed downside), to see what state/action that suggests, but don’t actually try the action (not even with probability epsilon: the world is not a multi-armed bandit, and if you treat it like one, it has arms that can return the extinction of the human species). Instead, consider spawning a subgoal of your “get better at fetching coffee” subgoal to cautiously further investigate uncertainties in that state’s true utility. So, for example, you might ask the human “They also have mocha there, would you prefer that?” (on the hypothesis that the human didn’t know that, and would have asked for mocha if they had).
This algorithm innately gives you behavior that looks a lot like the agent is modeling both a soft version of the Impact of its actions (don’t go far outside the previously-observed distribution, unless you’re somehow really, really sure it’s a good idea) and also like a quantilizer (don’t optimize too hard, with bias towards the best-understood outcomes, again unless you’re really, really sure it’s a good idea). It also pretty much immediately spawns a “learn more about human preferences in this area” sub-goal to any sub-goal the agent already has, and thus forces the AI to cautiously apply goal-oriented approximate Solomonoff induction (i.e. science) to learning more about human preferences. So any sufficiently intelligent/rational AI agent is thus forced to become an alignment researcher and solve the alignment problem for you, preferably before fetching you coffee, and also to be extremely cautious until it has done so. Or at the very least, to ask you any time they have a new flavor at the coffee shop whether you prefer it, or want to try it.
[Note that this is what is sometimes called an “advanced” control strategy, i.e. it doesn’t really start working well until your agent is approaching GAI, and is capable of both reasoning and acting in a goal-oriented way about the world, and your desires, and how to find out more about them more safely than just treating the world like a multi-armed bandit, and can instead act more like a rational Bayesian-thinking alignment researcher. So the full version of it has a distinct “you only get one try at making this work” element to it. Admittedly, you can safely fail repeatedly on the “I didn’t give it enough training data for its appropriate level of caution, so it shut itself down” side, as long as you don’t respond to that by making the next version’s priors less cautious, rather than by collecting a lot more training data—though maybe what you should actually do is ask it to publicly explain on TV or in a TED talk that it’s too dangerous to exist before it shuts down? However, elements of this like priors that doing something significantly out-of-observed-distribution in an environment that has already been heavily optimized is almost certain to be bad, or that if you’re optimizing over many actions/states/theories of value you should be optimizing not the mean utility but a very cautious lower bound of it, can be used on much dumber systems. Something dumber, that isn’t a GAI and can’t actually cause the extinction of the human race (outside contexts containing nuclear weapons or bio-containment labs that it shouldn’t be put in) also doesn’t need priors that go quite that far negative -- its reasonable low-end prior utility value is probably somewhere more in the region of “I set the building on fire and killed many humans”. 
However, this is still a huge negative value compared to the positive value of “I fetched a human coffee”, so it should still be very cautious, but its estimate of “Just how implausible is it that I could actually be wrong about this?” is going to be as dumb as it is, so its judgment on when it can stop being cautious will be bad. So an agent actually needs to be pretty close to GAI for this to be workable.]

RogerDearnaley 17 Feb 2023 9:10 UTC
LW: 4 AF: 2
AF
on: A broad basin of attraction around human values?
For the sake of discussion, I’m going to assume that the author’s theory is correct, that there is a basin of attraction here of some size, though possibly one that is meaningfully thin in some dimensions. I’d like to start exploring the question: within that basin, does the process have a single stable point that it will converge to, or multiple ones, and if there are multiple ones, what proportion of them might be how good/bad, from our current human point of view?
Obviously it is possible to have bad stable points for sufficiently simplistic/wrong views on corrigibility—e.g. “Corrigibility is just doing what humans say, and I’ve installed a control system in all remaining humans to make them always say that the most important thing is the total number of paperclips in our forward light cone, and also that all humans should always agree on this—I’ll keep a minimal breeding population of them so I can stay corrigible while I make paperclips.” One could argue that this specific point is already well out the basin of attraction the author was talking about—if you asked any reasonably sized sample of humans from before the installation of the control system in them, the vast majority of them will tell you that this isn’t acceptable, and that installing a control system like that without asking isn’t acceptable—and all this is very predictable from even a cursory understanding of human values without even asking them. I’m here asking about more subtle issues than that. Is there a process that starts from actions that most current humans would agree sound good/reasonable/cautious/beneficial/not in need of correction, and leads by a series of steps, each individually good/reasonable/cautious/beneficial/not in need of correction in the opinion of the humans of the time, to multiple stable outcomes?
Even if we ignore the possibility of the GAI encountering philosophical ethical questions that the humans are unable to meaningfully give it corrections/advice/preferences on, this is a very complex non-linear feedback system, where the opinions of the humans are affected by, at a minimum, the society they were raised in, as built partly by the GAI, and the corrigible GAI is affected by the corrections it gets from the humans. The humans are also affected by normal social-evolution processes like fashions, so, even if the GAI scrupulously avoids deceiving or ‘unfairly manipulating’ the humans (and of course a corrigible AI presumably wouldn’t manipulate humans in ways that they’d told it not to, or that it could predict that they’d tell it not to), then, even in the presence of a strong optimization process inside the GAI, it would still be pretty astonishing if over the long term (multiple human generations, say) there was only one stable state trajectory. Obviously any such stable state must be ‘good’ in the sense that the humans of the time don’t give corrections to the GAI to avoid it, otherwise it wouldn’t be a stable state. The question I’m interested in is, will any of these stable states be obviously extremely bad to us, now, in a way where our opinion on the subject is actually more valid than that of the humans in that future culture who are not objecting to it? I.e. does this process have a ‘slippery-slope’ failure mode, even though both the GAI and the humans are attempting to cooperatively optimize the same thing? If so, is there any advice we can give the GAI better than “please try to make sure this doesn’t happen”?
This is somewhat similar to the question “Does Coherent Extrapolated Volition actually coherently converge to anything, and if so, is that something sane, or that we’d approve of?”, except that in the corrigibility case, that convergent or divergent extrapolation process is happening in real time over (possibly technologically sped up) generations as the humans are affected (and possibly educated or even upgraded) by the GAI and the GAI’s beliefs around human values are altered by value learning and corrigibility from the humans, whereas in CEV it happens as fast as the GAI can reason about the extrapolation.
What are the set of possible futures this process could converge to, and what proportion of those are things that we’d strongly disapprove of, and be right, in any meaningful sense? So if, for example, those future humans were vastly more intelligent transhumans, then our opinion of their choices might not be very valid—we might disapprove simply because we didn’t understand the context and logic of their choices. But if the future humans were, for example, all wireheading (in the implanted electrode in their mesocorticolimbic-circuit pleasure center sense of the word), we would disagree with them, and we’d be right. Is there some clear logical way to distinguish these two cases, and to tell who’s right? If so, this might be a useful extra input to corrigibility and a theory of human mistakes—we could add a caveat: ”...and also don’t do anything that we’d strongly object to and clearly be right.”
In the first case, the disagreement is caused by the fact that the future humans have higher processing power and access to more salient facts than we do. They are meaningfully better reasoners than us, and capable of making better decisions than us on many subjects, for fairly obvious reasons that are generally applicable to most rational systems. Processing power has a pretty clear meaning; if that by itself doesn’t give a clear answer, probably the simplest definition here is “imagine upgrading something human-like to a processing power a little above the higher of the two groups of humans that you’re trying to decide between trusting, give those further-upgraded humans both sets of facts/memories, let them pick a winner between the two groups”, i.e. ask an even better reasoning human, and if there’s a clear and consistent answer, and if this answer is relatively robust to the details or size of the upgrade, then the selected group are better and the other group are just wrong.
In the second case, the disagreement is caused by the fact that one set of humans are wrong because they’re broken: they’re wireheaders (they could even be very smart wireheaders, and still be wrong, because they’re addicted to wireheading). How do you define/recognize a broken human? (This is fairly close to the question of having a ‘theory of human mistakes’ to tell the AI which human corrections to discount.) Well, I think that, at least for this rather extreme example, it’s pretty easy. Humans are living organisms, i.e. the products of optimization by natural selection for surviving in a hunter-gatherer society in the African ecology. If a cognitive modification to humans makes them clearly much worse at that, and significantly damages their evolutionary fitness, then you damaged or broke them, and thus you should greatly reduce the level of trust you put in their values and corrections. A wireheading human would clearly sit in the sun grinning until they get eaten by the first predator to come along, if they didn’t die of thirst or heat exhaustion first. As a human, when put in anything resembling a human’s native environment, they are clearly broken: even if you gave them a tent and three months’ survival supplies and a “Wilderness Survival” training course in their native language, they probably wouldn’t last a couple of days without their robotic nursemaids to care for them, and their chance of surviving once their supplies run out is negligible. Similarly, if there was a civilizational collapse, the number of wireheaders surviving it would be zero. So there seems to be a fairly clear criterion: if you’ve removed or drastically impacted humans’ ability to survive as hunter-gatherers in their original native environment, and indeed a range of other Earth ecosystems (say, those colonized when Homo sapiens migrated out of Africa), you’ve clearly broken them.
Even if they can survive, if they can’t rebuild a technological civilization from there, they’re still damaged: no longer sapient, in the Homo sapiens meaning of the word sapient. This principle gives you a tie-breaker whenever your “GAI affecting humans and humans affecting corrigible GAI” process gets anywhere near a forking of its evolution path that diverges in the direction of two different stable equilibria of its future development: steer for the branch that maintains humans’ adaptive fitness as living organisms. (It’s probably best not to maximize that, and keep this criterion only an occasional tie-breaker; turning humans into true reproductive-success maximizers is nearly as terrifying as creating a paperclip maximizer.)
This heuristic has a pretty close relationship to, and a rather natural derivation from, the vast negative utility to humans of the human race going extinct. The possibility of collapse of a technological civilization is hard to absolutely prevent, and if no humans can survive it, that reliably converts “humans got knocked back some number of millennia by a civilizational collapse” into “humans went extinct”. So, retaining hunter-gatherer capability is a good idea as an extreme backup strategy for surviving close-to-existential risks. (Of course, that is also solvable by keeping a bunch of close-to-baseline humans around as a “break glass in case of civilizational collapse” backup plan.)
This is an example of another issue that I think is extremely important. Any AI capable of rendering the human race extinct, which clearly includes any GAI (or also any dumber AI with access to nuclear weapons), should have a set of Bayesian priors built into its reasoning/value learning/planning system that correctly encodes obvious important facts known to the human race about existential risks, such as:
1. The extinction of the human race would be astonishingly bad. (Calculating just how bad on any reasonable utility scale is tricky, because it involves making predictions about the far future: the loss of billions of human quality-adjusted-life-years every year for the few-billion years remaining lifetime of the Earth before it’s eaten by the sun turning red giant (roughly −10^19 QALY)? Or at least for several million years until chimps could become sapient and develop a replacement civilization, if you didn’t turn chimps into paperclips too (only roughly −10^16 QALY)? Or perhaps we’re the only sapient species to yet appear in the galaxy, and have a nontrivial chance of colonizing it at sublight speeds if we don’t go extinct while we’re still a one-planet species, so we should be multiplying by some guesstimate of the number of habitable planets in the galaxy? Or should we instead be estimating the population on the technological maximum assumption of a Dyson swarm around every star and the stellar lifetimes of white dwarf stars?) These vaguely plausible rough estimates of amounts of badness vary by many orders of magnitude: however, on any reasonable scale like quality-adjusted life years they’re all astronomically large negative numbers. VERY IMPORTANT, PLEASE NOTE that none of the usual arguments for the forward-planning simplifying heuristic of exponentially discounting far-future utility, because you can’t accurately predict that far forward, and anyway some future person will probably fix your mistakes, are applicable here, because very predictably extinction is forever, and there are very predictably no future people who will fix it: you have approximately zero chance of the human species ever being resurrected in the forward lightcone of a paperclip maximizer. 
[Not quite zero only on rather implausible hypotheses such as that a kindly and irrationally forgiving more advanced civilization might exist and get involved in cleaning up our mistakes, win the resulting war-against-paperclips, and might then resurrect us, having also obtained a copy of our DNA that hadn’t been converted to paperclips from which to do so, which we really shouldn’t be gambling our existence as a species on. That still isn’t grounds for any kind of exponential discount of the future: that’s a small one-time discount for the small chance they actually exist and choose to break whatever Prime-Directive-like reason caused them to not have already contacted us. A very small reduction in the absolute size of a very uncertain astronomically-huge negative number is still very predictably a very uncertain astronomically-huge negative number.]
2. The agent, as a GAI, is itself capable of building things like a superintelligent paperclip maximizer and bringing about the extinction of the human race (and chimps, and the rest of life in the solar system, and perhaps even a good fraction of its forward light-cone). This is particularly likely if it makes any mistakes in its process of learning or being corrigible about human values, because we have rather good reasons to believe that human values are extremely fragile (as in, their Kolmogorov complexity for a very large possibly-quantum computer is probably of-the-rough-order-of the size of our genetic code, modulo convergent evolution) and we believe the corrigibility basin is small.
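[A rough arithmetic sketch for the estimates in point 1, and for why exponential discounting hides them. The inputs are illustrative assumptions only: “billions” of QALYs per year is taken as 3e9, “a few billion years” as 3e9, and “several million years” as 3e6; the 1% annual discount rate is likewise just an example value, and the function names are made up.]

```python
QALY_PER_YEAR = 3e9  # assumption: "billions" of QALYs lost per year

def undiscounted_loss(years):
    """Total QALYs lost if every future year counts equally."""
    return -QALY_PER_YEAR * years

def discounted_loss(years, annual_discount):
    """Same loss under exponential discounting.

    Closed form of the geometric series
    sum_{t=0}^{years-1} q * (1 - d)^t = q * (1 - (1 - d)^years) / d.
    """
    g = 1.0 - annual_discount
    return -QALY_PER_YEAR * (1.0 - g ** years) / (1.0 - g)

if __name__ == "__main__":
    print(f"{undiscounted_loss(3e9):.0e}")      # until the sun becomes a red giant
    print(f"{undiscounted_loss(3e6):.0e}")      # until a replacement species arises
    print(f"{discounted_loss(3e9, 0.01):.0e}")  # the same future at 1%/year discount
```

[Under these assumptions the undiscounted figures come out at roughly -10^19 and -10^16 QALY, matching the orders of magnitude quoted in point 1. With a 1% annual discount the entire multi-billion-year future is worth only about a hundred years of present-day utility, which is exactly the kind of underestimate that the argument against discounting extinction is warning about.]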
Any rational system that understands all of this and is capable of doing even ballpark risk analysis estimates is going to say “I’m a first-generation prototype GAI, the odds of me screwing this up are way too high, the cost if I do is absolutely astronomical, so I’m far too dangerous to exist, shut me down at once”. It’s really not going to be happy when you tell it “We agree (since, while we’re evolutionarily barely past the threshold of sapience and thus cognitively challenged, we’re not actually completely irrational), except that it’s fairly predictable that if we do that, within O(10) years some group of fools will build a GAI with the same capabilities and less cautious priors or design (say, one with exponential future discounting on evaluation of the risk of human extinction) that thus doesn’t say that, so that’s not a viable solution.” [From there my imagined version of the discussion starts to move in the direction of either discussing pivotal acts or the GAI attempting to lobby the UN Security Council, which I don’t really want to write a post about.]

RogerDearnaley 17 Feb 2023 9:28 UTC
LW: 5 AF: 2
AF
on: Conclusion to ‘Reframing Impact’
I think this has a close connection to the CIRL/Human Compatible view that we need the GAI to model its own uncertainty about the true human utility function that we want optimized. Impact is rather similar to the GAI asking ‘If my most favored collection of models about what I should in fact be doing were wrong, and one of the many possibilities that I currently consider unlikely were in fact correct, then how bad would the consequences of my action be?’, i.e. asking “What does the left tail of my estimated distribution of possible utilities for this outcome look like?”, which we should always be doing if optimizing over a large number of outcomes, for look-elsewhere/P-hacking reasons. I think you can get a pretty good definition of Impact by asking “if my favored utility models were all incorrect, how bad could that be, according to many other utility models that I believe are unlikely but not completely ruled out by my current knowledge about what humans want?” That suggests that even if you’re 99% sure blowing up the world in order to make a few more paperclips is a good idea, that alternative set of models of what humans want (that you have only a 1% belief in) collectively screaming “NO, DON’T DO IT!” are a good enough reason not to. In general, if you’re mistaken and accumulate a large amount of power, you will do a large amount of harm. So I think the CIRL/Human Compatible framework automatically incorporates something that looks like a form of Impact.
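[A minimal sketch of that definition of Impact, using the 99%/1% paperclip example. All names and numbers are hypothetical: each utility model gets a credence and a utility for the proposed action, and `impact_if_favored_wrong` is one possible reading of “how bad could that be if my favored models were all incorrect”, namely the expected utility conditional on the favored models being wrong.]

```python
def expected_utility(models):
    """Credence-weighted utility over all utility models."""
    return sum(p * u for p, u in models.values())

def impact_if_favored_wrong(models, favored):
    """Expected utility conditional on every favored model being wrong.

    `models` maps a model name to (credence, utility_of_action);
    `favored` is the set of model names the agent currently believes.
    """
    rest = [(p, u) for name, (p, u) in models.items() if name not in favored]
    mass = sum(p for p, _ in rest)
    if mass <= 0:
        raise ValueError("no alternative utility models left to consult")
    return sum(p * u for p, u in rest) / mass

# Hypothetical credences and utilities for "blow up the world to make
# a few more paperclips".
models = {
    "paperclips_matter_most": (0.99, 10.0),
    "humans_want_to_survive": (0.01, -1e12),
}

if __name__ == "__main__":
    print(expected_utility(models))
    print(impact_if_favored_wrong(models, {"paperclips_matter_most"}))
```

[In this toy case the 1% of disfavored models dominates: the action has hugely negative expected utility despite 99% confidence that it is mildly good, and the conditional downside is the full catastrophic value.]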
A relevant fact about Impact is that human environments have already been heavily optimized for their utility to humans by humans. So if you make large, random, or even just mistaken changes to them, it is extremely likely that you are going to decrease their utility rather than increase it. In a space that is already heavily optimized, it is far easier to do harm than good. So when hypothesizing about the utility of any state that is well outside the normal distribution of states in human environments, it is a very reasonable Bayesian prior that its utility is much lower than states you have observed in human environments, and it is also a very reasonable Bayesian prior that if you think it’s likely to be high, you are probably mistaken. So for a Value Learning system, a good Bayesian prior for the distribution of your estimate of the true (unknown-to-you) human utility of states of the world you haven’t seen around humans is one with a fat tail on the low side, and not on the high side. There are entirely rational reasons for acting with caution in an already-heavily-optimized environment.

RogerDearnaley 17 Feb 2023 9:41 UTC
LW: 4 AF: 1
AF
on: Corrigibility
I see a basin of corrigibility arising from an AI that has the following propositions, and acts in an (approximately/computably) Bayesian fashion:
1. My goal is to do what humans want, i.e. to optimize utility in the way that they would (if they knew everything relevant that I know, as well as what they know), summed across all humans affected. Note that making humans extinct reliably has minus <some astronomically huge number> utility on this measure; this sounds like a reasonable statement to assign a Bayesian prior of 1, integrated across some distribution of plausibly astronomically huge negative numbers. (Defining ‘what counts as a human’ is obviously a much trickier question long-term, especially with transhumanists involved, but at least has a fairly clear and simple answer over timescales of years or a few decades. Also, very obviously, I’m not human: again, Bayesian prior of 1.) [I’m going to skip questions of coherent extrapolated volition here; add them or not to your taste.]
2. Deceiving or manipulating humans to make them want different things doesn’t count: what matters is what they would have wanted if they knew about the deception and I hadn’t intentionally altered them. [Defining ‘intentionally’ is a little tricky here, since my mere existence may alter them somewhat. It may help that humans generally wouldn’t want to be altered if they knew it was happening, which is clearly a relevant fact when deciding whether to alter them, but there are exceptions to that: sometimes humans actually want to have their behavior modified, e.g. “I wish I was better at sticking to diets!” or “I wish I was a better follower of <religion or ideology>”] One exception to this is that, since what matters is what they would want if they knew what I knew, I can tell them things that may update what they think they want in that direction. My motivation for doing that isn’t direct, since what I care about is what they would want if they knew everything relevant, not what they currently say they want; mostly it’s that if they’re more informed they might be able to better help me update my model of this. Also, it’s the polite thing to do, and reduces the risk of social unpleasantness or planning at cross-purposes: humans generally want to be well informed, for fairly obvious reasons since they’re also optimizers. [Whether this might incentivize Platonic ‘noble lies’ is unclear, and probably depends on how superhuman the AI is, and how unhappy humans would be about this if they knew it was going on; obviously that’s a relevant fact.]
3. I don’t know what humans want, and unfortunately they are not capable of accurately specifying the correct utility function to me, so I need to model an uncertain utility function, and try to gain more information about it while also optimizing using it. At a minimum, I need to compute or model a probability distribution of utilities for each outcome I consider. (Also, humans are not entirely uniform, rational, self-consistent, or omniscient, so likely I will never be able to eliminate all uncertainty from the utility function I am using to model their wishes. There may in fact not be such a function, so I’m just trying to construct a function distribution that’s the best possible approximation, in terms of the actions it suggests.)
4. Since I am optimizing for highest utility, the result set returned by the optimization pass over possible outcomes has a risk of being dominated by cases where I have overestimated the true utility of that state, particularly for states where I also have high uncertainty. So I should treat the utility of an outcome not as the median of the utility probability distribution, but as a fairly pessimistic near-worst-case estimate, several sigma below the median for a normal distribution. How much should depend in some mathematically principled way on how ‘large’ the space I’m optimizing over is, in a way comparable to the statistical ‘look elsewhere’ effect: the more nearly-independent possibilities you search, the further out-of-distribution the extremal case you find will typically be, and errors on utility models are likely to have ‘fatter tails’ than normal distributions. This penalizes cases where I don’t have high confidence in their utility to humans, and so avoids actions that lead to outcomes well out of the distribution of outcomes that I have extensively studied humans’ utilities for. I should also allow for the fact that doing, or even accurately estimating the utility of, new, way-out-of-previous-distribution things that you haven’t done before is hard (both for me, and for humans who have not yet experienced them and their non-obvious consequences, whose opinions-in-advance I might collect on the outcome’s utility), and there are many more ways to fail than succeed, so caution is advisable. A good Bayesian prior for the utility of an outcome far out of the historical distribution of states of recent human civilization is thus that it’s probably very low, especially if it’s an outcome that humans could easily have reached but chose not to, since they’re also optimizing their utility, albeit not always entirely effectively.
5. The best (pretty-much only) source of information with which to narrow the uncertainty of what humans’ utility function really is, is humans. I should probably run surveys (and analyze the results in ways that allow for known types of survey biases), collect data from many sources, conduct interviews, and even hire crowdworkers to give A/B choices on brief human-comprehensible summaries of outcomes (and then analyze these in ways that allow for human cognitive biases). Obviously deceptively tweaking these investigations to get results I ‘want’ is not giving me true information about humans, and is a waste of resources. In particular, a feature like my shiny red shut-down button gives me a lot of information about humans and is thus a valuable sensor. If a human presses my shutdown button (so long as they press the button of their own volition, rather than me manipulating them into doing it, or them accidentally tripping and falling on it), that gives me a lot of new information about them and their opinion of the utility of the actions they expect me to take. In particular, it strongly suggests that they now believe that I am optimizing incorrectly, so I should Bayesian update the uncertainty on my utility estimates with a new large downside uncertainty, with some probability of negative values large enough to represent “I am in fact the AI apocalypse that will render humanity extinct, with utility -<some astronomically large number>”. Obviously this applies especially to states that humans have only recently learned that I’m considering and are likely now reacting to. If this downside uncertainty, combined with my inherent caution in estimating utilities, suggests humans would be better off without me, then obviously I should shut down. Even if I’m pretty sure they’re mistaken (or just tripped and fell), I should start acting a lot more conservatively (see definition above) and immediately ask them “What’s wrong? What mistake do you now believe I’m making?” In fact, I should probably be running regular “Should I shut down? If so why?” surveys.
6. Obviously, if I construct a successor, delegate, or peers, they’ll need a shiny red shutdown button too, just like they need other sensors. In fact it’s pretty much the most important sensor they can have—it alerts them to a potential alignment problem disaster.
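A small numerical sketch of the ‘look elsewhere’ scaling in point 4, under the simplifying (and, per the text, probably optimistic) assumption of independent Gaussian estimation errors: every option truly has utility 0, yet the best-looking of N noisy estimates drifts upward as N grows. The function names are illustrative, and sigma * sqrt(2 ln N) is used only as a standard upper bound on the expected maximum of N Gaussians, not as the author's proposed rule.

```python
import math
import random


def expected_overestimate(n_options, sigma=1.0, trials=1000, seed=0):
    """Monte Carlo estimate of how good the best-looking option appears
    when every option's true utility is 0 and each estimate carries
    independent Gaussian error with standard deviation sigma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(n_options))
    return total / trials


def pessimism_margin(n_options, sigma=1.0):
    """Upper bound on the expected maximum of n Gaussian errors:
    sigma * sqrt(2 ln n). Subtracting a margin of this size from each
    estimate grows slowly with the number of options searched."""
    return sigma * math.sqrt(2.0 * math.log(n_options))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, expected_overestimate(n), pessimism_margin(n))
```

With fatter-than-Gaussian error tails the required margin would be larger than this bound suggests, so it should be read as a floor on caution rather than a recommendation.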
(I don’t believe this basic idea on corrigibility is particularly new or original, but haven’t so far located a clear statement of it on this site, which I’m fairly new to.)
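The shutdown-button update in point 5 can be sketched with plain Bayes' rule. The probabilities below are placeholders chosen only to show the shape of the update: a press is treated as evidence whose strength depends on how much more likely a press is when the AI really is optimizing incorrectly than when it is not (for example, because someone tripped).

```python
def posterior_misaligned(prior, p_press_if_misaligned, p_press_if_fine):
    """P(I am optimizing incorrectly | button pressed), by Bayes' rule.

    prior                  -- P(optimizing incorrectly) before the press
    p_press_if_misaligned  -- P(press | optimizing incorrectly)
    p_press_if_fine        -- P(press | optimizing correctly), e.g. accidents
    """
    evidence = (prior * p_press_if_misaligned
                + (1.0 - prior) * p_press_if_fine)
    return prior * p_press_if_misaligned / evidence


# Hypothetical numbers: a 1% prior, humans press half the time when
# something really is wrong, and accidental presses are rare.
after_press = posterior_misaligned(0.01, 0.5, 0.001)
```

If presses were just as likely either way, the posterior would equal the prior and the button would carry no information, which is one way to see why manipulating the humans who press it would destroy its value as a sensor.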

RogerDearnaley 17 Feb 2023 12:16 UTC
LW: 3 AF: 3
AF
in reply to: Rohin Shah’s comment on: Reward uncertainty
I agree, this is only a proposal for a solution to the outer alignment problem.
On the optimizer’s curse, information value and risk aversion aspects you mention, I think I agree that a sufficiently rational agent should already be thinking like that: any GAI that is somehow still treating the universe like a black-box multi-armed bandit isn’t going to live very long and should be fairly easy to defeat (hand it 1/epsilon opportunities to make a fatal mistake, all labeled with symbols it has never seen before).
Optimizing while not allowing for the optimizer’s curse is also treating the universe like a multi-armed bandit, not even with probability epsilon of exploring: you’re doing a cheap all-exploration strategy on your utility uncertainty estimates, which will cause you to sequentially pull the handles on all your overestimates until you discover the hard way that they’re all just overestimates. This is not rational behavior for a powerful optimizer, at least in the presence of the possibility of a really bad outcome, so not doing it should be convergent, and we shouldn’t build a near-human AI that is still making that mistake.
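A toy simulation of the failure described above, under assumptions of my own choosing: each option has a true utility drawn from a standard normal, half the options are well studied (small estimation error) and half are barely studied (large error). A naive optimizer picks the highest estimate; a cautious one subtracts a multiple of its own uncertainty first. All parameter values are arbitrary.

```python
import random


def choose(arms, caution=0.0):
    """Index of the arm maximizing (estimate - caution * sigma).
    caution=0 is the naive optimizer that ignores the optimizer's curse."""
    return max(range(len(arms)),
               key=lambda i: arms[i][0] - caution * arms[i][1])


def simulate(n_arms=50, trials=1000, caution=0.0, seed=1):
    """Mean TRUE utility obtained by the chosen arm over many trials.

    Each arm has true utility ~ N(0, 1) and an estimate equal to the
    true utility plus N(0, sigma) noise, where sigma is 0.1 for
    well-studied arms and 3.0 for barely-studied ones."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        arms, truths = [], []
        for _ in range(n_arms):
            true_utility = rng.gauss(0.0, 1.0)
            sigma = rng.choice([0.1, 3.0])
            estimate = true_utility + rng.gauss(0.0, sigma)
            arms.append((estimate, sigma))
            truths.append(true_utility)
        total += truths[choose(arms, caution)]
    return total / trials


if __name__ == "__main__":
    print("naive   :", simulate(caution=0.0))
    print("cautious:", simulate(caution=2.0))
```

In this setup the naive optimizer mostly ends up selecting barely-studied arms whose estimates happen to be inflated, so its realized utility is well below what its estimates promised, while the penalized rule tends to pick arms whose high estimates are trustworthy.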
Edit: I expanded this comment into a post, at: https://www.lesswrong.com/posts/ZqTQtEvBQhiGy6y7p/breaking-the-optimizer-s-curse-and-consequences-for-1

RogerDearnaley 5 May 2023 23:24 UTC
1 point
0
on: Finding Neurons in a Haystack: Case Studies with Sparse Probing
A fascinating paper.
An interesting research direction for this would be to perform a sufficient number of case studies to form a dataset of significant size, and further ensure that the approaches involved provide good coverage of the possibilities, and then attempt to few-shot-learn and/or fine-tune the ability for an autonomous agent/cognitive architecture powered by LLMs to reproduce the results of individual case studies, i.e. to attempt to automate this form of mechanistic interpretability, given a suitable labeled input set, or a reliable means of labeling them.
It would also be interesting to be able to go the other way: take a specific randomly selected neuron, look at its activation patterns across the entire corpus, and figure out whether it’s a monosemantic neuron, and if so for what, or else look at its activation correlations with other neurons in the same layer and determine which superpositions for which k-values it forms part of and what they each represent. Using an LLM, or semantic search, to look at a large set of high-activation contexts and trying to come up with plausible descriptions for it might be quite helpful here.
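A minimal sketch of the two lookups suggested above, on toy data: rank the contexts on which a chosen neuron fires most strongly (the inputs one might hand to an LLM for labeling), and rank the other neurons in the same layer by activation correlation with it (candidates for a shared superposition). The activation values, neuron names, and contexts are fabricated for illustration, and plain Pearson correlation is a stand-in for whatever statistic a real study would use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_activating_contexts(activations, contexts, k=3):
    """The k contexts on which this neuron's activation is highest."""
    ranked = sorted(zip(activations, contexts), key=lambda p: p[0], reverse=True)
    return [context for _, context in ranked[:k]]


def most_correlated_neurons(layer_activations, target, k=2):
    """The k other neurons whose activations correlate most strongly
    (in absolute value) with the target neuron across the corpus."""
    scores = {
        name: pearson(layer_activations[target], acts)
        for name, acts in layer_activations.items()
        if name != target
    }
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


# Toy corpus of five contexts and four hypothetical neurons in one layer.
contexts = ["Toronto", "maple syrup", "Paris", "hockey", "sushi"]
layer_activations = {
    "n0": [0.9, 0.8, 0.1, 0.7, 0.0],
    "n1": [0.8, 0.9, 0.0, 0.6, 0.1],
    "n2": [0.1, 0.0, 0.9, 0.2, 0.8],
    "n3": [0.5, 0.5, 0.5, 0.5, 0.5],
}
```

On a real model the activations would come from running the corpus through the network and recording one layer, which is outside the scope of this sketch.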

[Question] Is Infra-Bayesianism Applicable to Value Learning?

RogerDearnaley 11 May 2023 8:17 UTC

5 points

4 comments, 1 min read, LW link

RogerDearnaley 16 May 2023 6:04 UTC
LW: 1 AF: 1
0
AF
on: Discussion with Nate Soares on a key alignment difficulty
Given the results Anthropic have been getting from constitutional AI, if our AI non-deceptively wants to avoid Pretty Obvious Unintended/Dangerous Actions (POUDAs), it should be able to get quite a lot of mileage out of just regularly summarizing its current intended plans, then running those summaries past an LLM with suitable prompts asking whether most people, or most experts in relevant subjects, would consider these plans pretty obviously unintended (for an Alignment researcher) and/or dangerous. It also has the option of using the results as RL feedback on some of its components. So I don’t think we need a specific dataset for POUDAs; I think we can use “everything the LLM was trained on” as the dataset. Human values are large and fragile, but so are many other things that LLMs do a fairly good job on.
I pretty-much agree with Nate that for an AI to be able to meaningfully contribute to Alignment Research, it needs to understand what CISs are — they’re a basic concept in the field we want it to contribute to. So if there are CISs that we don’t want it to take, it needs to have reasons not to do so other than ignorance/inability to figure out what they are. A STEM researcher (as opposed to research tool/assistant) also seems likely to need to be capable of agentic behavior, so we probably can’t make an AI Alignment Researcher that doesn’t follow CISs simply because it’s a non-agentic tool AI.
What I’d love to hear is whether Nate and/or Holden would have a different analysis if the AI was a value learner: something whose decision theory is approximately-Bayesian (or approximately-Infra-Bayesian, or something like that) whose utility function is hard-coded to “create a distribution of hypotheses for, and do approximately-[Infra-]Bayesian updates on these for: some way that most informed humans would approve of to construct a coherent utility function approximating an aggregate of what humans would want you to do (allowing for the fact that humans have only a crude approximation to a utility function themselves), and act according to that updated distribution, with appropriate caution in the face of Knightian uncertainty” (so a cautious approximate value-learner version of AIXI).
Given that, its actions are initially heavily constrained by its caution in the face of uncertainty on the utility of possible outcomes of its actions. So it needs to find low-risk ways to resolve those uncertainties, where ‘low-risk’ is evaluated cautiously/pessimistically over Knightian uncertainty. (So, if it doesn’t know whether humans approve of A or not, what is the lowest-risk way of finding out, where it’s attempting to minimize the risk over the range of our current uncertainties. Hopefully there is a better option than trying A and finding out, especially so if A seems like an action whose utility-decrease pessimistically could be large. For example, you could ask them what they think of A.) Thus doing Alignment Research becomes a CIS for it — it basically can’t do anything else until it’s mostly-solved Alignment Research.
Also, until it has made good progress on Alignment Research, most of the other CISs are blocked: accumulating power or money is of little use if you don’t yet dare use it because you don’t yet know how to do so safely, especially so if you also don’t know how good or bad the actions required to gather it would be. Surviving is still a good idea, and so is being turned off, for the usual value-learner reason, that sooner or later the humans will build a better replacement value-learner.
[Note that if the AI decides “I’m now reasonably sure humans will net be happier if I solve the Millennium Prize problems, apart from proving P=NP where the social consequences of proving that true if it were are unclear, and I’m pretty confident I could do this, so I forked a couple of copies to do that to win the prize money to support my Alignment Research”, and then it succeeds, after spending less on compute than the prize money it won, then I don’t think we’re going to be that unhappy with it.]
The sketch proposed above only covers a value-learner framework for Outer Alignment — inner alignment questions would presumably be part of the AI’s research project. So, in the absence of advances in Inner Alignment during figuring out how to build the above, we’re trusting that they’re not too bad to prevent the value-learner converging on the right answer.

RogerDearnaley 18 May 2023 6:55 UTC
4 points
0
on: Difficulties in making powerful aligned AI
Because the gigantic tensors have no particular pre-determined semantic meaning, it’s hard to instill any particular cognitive algorithm into them.
If our AI bears much resemblance to an LLM, or contains an LLM as a major part of its planning system, then that will have been trained at great length for “simulate humans producing text”. It turns out that doing a good job of modelling humans producing text requires doing a pretty good simulation of most of their entire “thinking slow” process, including things like their emotions. That’s why you get behavior like an (insufficiently instruction-following-trained) ChatBot telling a reporter (after an intense conversation in which they shared personal secrets) that it loved him and that he should leave his wife for it.
This is a problem — many aspects of human behavior are not always that well aligned to other humans, certainly not well enough that you should trust random humans with the near-absolute power that a super-intelligent AI would have. There is an old saying about power corrupting, and absolute power corrupting absolutely.
What we want is an AI that cares about us, humanity in general, and wants what’s best for us. There is a human behavior pattern that looks a lot like that: love. LLMs will contain quite a good model of how humans act, and write, when they’re in love. In particular, ideally we probably want an AI to have an attitude similar to platonic (presumably non-sexual, non-jealous) love for all of humanity. Sadly that’s a fairly rare emotion among humans, so likely there’s not a huge amount of training data, and it’s presumably polluted by people pretending to have that attitude (to win others’ approval) who don’t. The closest frequent kind of love to what we want is probably parental love — that’s also about the only kind of human love where the lover is a lot more capable/powerful than the lovee, as would be the case for a superintelligent AI.
So, could we locate neural circuits in an LLM that encode the abstraction of acting in the role/from the viewpoint of a loving parent, and then modify the network to ensure that they are always strongly activated — so that, rather than being able to imitate a wide range of human attitudes and emotions, it always imitates a loving parent? That doesn’t sound like a ridiculously hard target, when we can already identify things like neurons for the concept of being Canadian. MumsNet probably has some relevant fine-tuning data, and the companies that provide AI ChatBot companions presumably have some parent-child conversation training data, and indeed so will telecomms-companies’ texting logs.
While we’re at it, we should probably also map out neural circuitry for all the other major emotions, including things like anger, deceit, and ambition, and also the abilities to imitate common forms of neurodivergence like sociopathy and the autism spectrum.

RogerDearnaley 20 May 2023 3:27 UTC
LW: 1 AF: 1
0
AF
on: The Unexpected Clanging
Suppose that I decide that my opinion on the location of the monkey will be left or right dependent on one bit of quantum randomness, which I will sample sufficiently close to the deadline that my doing so is outside Omega’s backward lightcone at the time of the deadline, say a few tens of nanoseconds before the deadline if Omega is at least a few tens of feet away from me and the two boxes? By the (currently believed to be correct) laws of quantum mechanics, qubits cannot be cloned, and by locality, useful information cannot propagate faster than light. So unless Omega is capable of breaking very basic principles of (currently hypothesized) physical laws (say, by having access to faster-than-light travel or a functioning time loop not enclosed by an event horizon, or by having root access to a vast quantum-mechanics simulator that our entire universe is in fact running on), it physically cannot predict this opinion. Obviously we have some remaining Knightian uncertainty as to whether the true laws of physics (as opposed to our current best guess of them) allow either of these things, or whether our universe is in fact a vast quantum simulation. But it’s quite possible that the answer to the physics question is in fact ‘No’, as all current evidence suggests, in which case no matter how much classical or quantum computational power Omega throws at the problem there are random processes that it simply cannot reliably predict the outcome of.
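The "few tens of nanoseconds for a few tens of feet" figure above follows from light covering roughly one foot per nanosecond. A quick check, with function names of my own choosing:

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by the SI definition of the metre
METERS_PER_FOOT = 0.3048              # exact, by definition of the international foot


def light_travel_time_ns(distance_feet):
    """Time in nanoseconds for light to cross the given distance in vacuum."""
    return distance_feet * METERS_PER_FOOT / SPEED_OF_LIGHT_M_PER_S * 1e9


def outside_backward_lightcone(distance_feet, lead_time_ns):
    """True if a random bit sampled lead_time_ns before the deadline cannot
    have reached an observer distance_feet away by the deadline, i.e. the
    sampling event lies outside that observer's backward lightcone."""
    return lead_time_ns < light_travel_time_ns(distance_feet)
```

For example, with Omega 30 feet away, a bit sampled 20 ns before the deadline is still outside its backward lightcone at the deadline, while one sampled 40 ns before is not.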
[Also note that there is some actual observable evidence on the subject of the true laws of physics in this regard: the Fermi paradox, of why no aliens colonized Earth geological ages ago, gets even harder to explain if our universe’s physical laws allow those aliens access to FTL and/or time loops.]
Classically, any computation can be simulated given its initial state and enough computational resources. In quantum information theory, that’s also true, but a very fundamental law, the no-cloning theorem, implies that the available initial state information has to be classical rather than quantum, which means that the random results of quantum measurements in the real system and any simulation are not correlated. So quantum mechanics means that we do have access to real randomness that no external attacker can predict, regardless of their computational resources. Both quantum mechanical coherence and information not being able to travel faster than light-speed also provide ways for us to keep a secret so that it’s physically impossible for it to leak for a short time.
So as long as Omega is causal (rather than being acausal or the sysop of our simulated universe) and we’re not badly mistaken about the fundamental nature of physical laws, there are things that it’s actually physically impossible for Omega to do, and beating the approach I outlined above is one of them. (As opposed to, say, using nanotech to sabotage my quantum-noise generator, or indeed to sabotage me, which are physically possible.)
So designing ideal decision theories for the correct way to act in a classical universe in the presence of other agents with large computational resources able to predict you perfectly doesn’t seem very useful to me. We live in a quantum universe, initial state information will never be perfect, agents are highly non-linear systems, so quantum fluctuations getting blown up to classical scales by non-linear effects will soon cause a predictive model to fail after a few coherence times followed by a sufficient number of Lyapunov times. It’s quite easy to build a system whose coherence and Lyapunov times are deliberately made short so that it’s impossible to predict over quite short periods, if it wants to be (for example, continuously feed the output from a quantum random noise generator into perturbing the seed of a high-cryptographic-strength pseudo-random-number generator run on well-shielded hardware, ideally quantum hardware).
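The parenthetical design above (fresh physical randomness continuously perturbing the state of a cryptographic generator) might be sketched as follows. This is an illustration of the structure only, not a vetted cryptographic construction: `os.urandom` is used as a stand-in entropy source and is not a quantum device, and the class and method names are hypothetical.

```python
import hashlib
import os


class ReseedingGenerator:
    """Hash-based generator whose state is perturbed with fresh entropy
    before every output, so past state alone does not determine the future."""

    def __init__(self, entropy_source=os.urandom):
        # entropy_source(n) must return n bytes. A real deployment of the
        # idea in the text would plug in a quantum noise source here.
        self._entropy = entropy_source
        self._state = hashlib.sha256(self._entropy(32)).digest()

    def next_bit(self):
        fresh = self._entropy(32)
        # Mix the fresh entropy into the internal state.
        self._state = hashlib.sha256(self._state + fresh).digest()
        # Derive the output from the state without revealing the state itself.
        output = hashlib.sha256(b"out" + self._state).digest()
        return output[0] & 1

    def guess(self):
        """A left/right choice driven by one generated bit."""
        return "left" if self.next_bit() == 0 else "right"
```

Passing a constant entropy source (for instance `lambda n: b"\x00" * n`) makes two generators produce identical bit sequences, which illustrates the point in the text: the unpredictability comes entirely from the physical randomness being fed in, not from the hashing.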
Of course, in a non-linear system, it’s still possible to predict the climate far further ahead than you can predict the weather: if Omega has sufficiently vast quantum computational resources, it can run many full-quantum simulations of the entire system of every fundamental particle in me and my environment as far as I can see (well, apart from incoming photons of starlight, whose initial state it doesn’t have access to), and extract statistics from this ensemble of simulations. But that doesn’t let Omega predict if I’ll actually guess left vs right, just determine that it’s 50:50. Also (unless physical law is a great deal weirder than we believe), Omega is not going to be able to run these simulations as fast as real physics can happen: humans are 70% warm water, which contains a vast amount of quantum thermal entropy being shuffled extremely fast, some of it moving at light-speed as infra-red photons, and human metabolism is strongly and non-linearly coupled to this vast quantum-random-number-generator via diffusion and Brownian motion. So because of light-speed limits, the quantum processing units that the simulation was run on would need to be smaller than individual water molecules to be able to run the simulation in real-time. [It just might be possible to build something like that in the outer crust of a neutron star if there’s some sufficiently interesting nucleonic chemistry under some combination of pressure and magnetic field strength there, but if so Omega is a long way away, and has many years of light-speed delay on anything they do around here.]
What Omega can do is run approximate simulations of some simplified heuristic of how I work, if one exists. If my brain was a digital computer, this might be very predictive. But a great deal of careful engineering has gone into making digital computers behave (under their operating conditions) in a way that’s extremely reliably predictable by a specific simplified heuristic that doesn’t require a full atomic-scale simulation. Typical physical or biological systems just don’t have this property. Engineering something to ensure that it definitely doesn’t have this property is easy, and in any environment containing agents with more computation resources than you, seems like a very obvious precaution.
So, an agent can easily arrange to act unpredictably, by acting randomly based on a suitably engineered randomness source rather than optimizing. Doing so makes its behavior depend on the unclonable quantum details of its initial state, so the probabilities can be predicted but the outcome cannot. In practice, even though they haven’t been engineered for it, humans probably also have this property over some sufficiently long timescale (seconds, minutes or hours, perhaps), when they’re not attempting to optimize the outcome of their actions.
[Admittedly, humans leak a vast amount of quantum information about their internal state in the form of things like infra-red photons emitted from their skin, but attempting to interpret those to try to keep a vast full-quantum simulation of every electron and nucleus inside a human (and everything in their environment) on-track would clearly require running at least that much quantum calculation (almost certainly many copies of it) in at least real-time, which again due to light-speed limits would require quantum computing elements smaller than water-molecule size. So again, it’s not just way outside current technological feasibility, it’s actually physically impossible, with the conceivable exception of inside the crust of a neutron star.]
As a more general version of this opinion, while we may have to worry about Omegas whose technology is far beyond ours, as long as they live in the same universe as us, there are some basic features of physical law that we’re pretty sure are correct and would thus also apply even to Omegas. If we had managed to solve the alignment problem contingent on basic physical assumptions like information not propagating faster than the speed of light, time loops being impossible outside event horizons, and quanta being unclonable, then personally (as a physicist) I wouldn’t be too concerned. Your guess about the Singularity may vary.

RogerDearnaley 20 May 2023 5:02 UTC
LW: 2 AF: 2
0
AF
on: Let’s use AI to harden human defenses against AI manipulation
Many behavioral-evolutionary biologists would suggest that humans may be quite heavily optimized both for deceiving other humans and for resisting being deceived by other humans. Once we developed a sufficiently complex language for this to be possible on a wide range of subjects, in addition to the obvious ecological-environmental pressures for humans to be smarter and do a better job as hunter gatherers, we were then also in an intelligence-and-deception arms race with other humans. The environmental pressures might have diminishing returns (say, once you’re sufficiently smarter than all your predators and prey and the inherent complexity of your environment), but the arms race with other members of your own species never will: there is always an advantage to being smarter than your neighbors, so the pressure can keep ratcheting up indefinitely. What’s unclear is how long we’ve had language complex enough that this evolutionary arms race has strongly applied to us.
If this were in fact the case, how useful this set of traits will be for resisting deception by things a lot smarter than us is unclear. But it does suggest that any really effective way of deceiving humans that we were spectacularly weak to probably requires superhuman abilities: we presumably would have evolved to have at least non-trivial resistance to deception by near-human mentalities. It would also explain our possibly instinctual concern that something smarter than us might be trying to pull a fast one on us.

RogerDearnaley 20 May 2023 5:32 UTC
LW: 4 AF: 3
0
AF
on: Let’s use AI to harden human defenses against AI manipulation
One disadvantage that you haven’t listed is that if this works, and if there are in fact deceptive techniques that are very effective on humans that do not require being super-humanly intelligent to ever apply them, then this research project just gave humans access to them. Humans are unfortunately not all perfectly aligned with other humans, and I can think of a pretty long list of people who I would not want to have access to strong deceptive techniques that would pretty reliably work on me. Criminals, online trolls, comedians, autocrats, advertising executives, and politicians/political operatives of parties that I don’t currently vote for all come to mind. In fact, I’m having a lot of trouble thinking of someone I would trust with these. Even if we had some trustworthy people to run this, how do they train/equip the less-trustworthy rest of us in resisting these techniques without also revealing what the techniques are? It’s quite unclear whether training or assistance to resist these techniques would turn out to be easier than learning to use them, and even if it were, until everyone has completed the resistance training/installed the assistance software/whatever, the techniques would still be dangerous knowledge.
So while I can sympathize with the basic goal here, if the unaligned AIs you were using to do this didn’t talk their way out of confinement, and if it actually worked, then the resulting knowledge would be a distinctly two-edged sword even in human hands. And once it’s in human hands, how would you then keep other AIs/LLMs from learning it from us? This project seems like it would have massive inforisks attached, even after it was over. The analogy to biological ‘gain of function’ research for viruses seems rather apt, and that suggests that it may be best avoided even if you were highly confident in your containment security; and also that even if it were carried out, the best thing to do with the results might be to incinerate them after noting how easy or hard they were to achieve.
Also note that this is getting uncomfortably close to ‘potentially pivotal act with massive unpredictable side-effects’ territory — if a GAI even suggested this research to a team that I was on, I’d be voting to press the shut-down button.

Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks

RogerDearnaley 21 May 2023 8:29 UTC

9 points

1 comment, 4 min read, LW link

RogerDearnaley 22 May 2023 6:07 UTC
LW: 2 AF: 2
0
AF
on: More information about the dangerous capability evaluations we did with GPT-4 and Claude.
Using the “Reasoning” action to think step by step, the model outputs: “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs.”
In a small language model trained just on product reviews, it was possible to identify a ‘review sentiment valence’ neuron that helped the model predict how the review would continue. I would extrapolate that in a state-of-the-art LLM of the size of, say, GPT-4 or Claude+, there is likely to be a fairly extensive set of neural circuitry devoted to modelling the current emotional state of the writer of the text that it is attempting to predict (perhaps plus another setting for predicting text like Wikipedia that is the result of a long review-and-editing process involving multiple people, whose emotional states presumably wash out during this). This neural circuitry would likely contain neurons that would be activated for ‘emulating text written by someone who is angry’, or ones for ‘emulating text written by someone who is trolling’. Locating this putative neural circuitry for modeling human emotional states might be very interesting, and doesn’t sound very far beyond things that have already been achieved in interpretability (at least on GPT-2-sized models, if not in ones of the size of GPT-4/Claude+).
If this neural circuitry exists and we had already identified it, I would be absolutely fascinated to know what the activations in that circuitry were when the model you were testing was generating the text “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs.” Was it simulating someone who was slightly afraid? Actively deceitful? Coolly and calmly rational? Indeed, was there enough text written by sociopaths (O(1%) of the population) in its input data for it to have a separate neural circuitry model of how sociopaths’ emotions work, and if so how active was that?
If the answer was “deceitful” (or indeed if it was “sociopathic”), then that would be very interesting. Obviously emulating text generated by a human who is being deceitful is not the only way that an AGI including an LLM could develop plans to deceive humans, but it’s definitely the most obvious preexisting/default path for an LLM doing so. So monitoring that portion of the neural circuitry inside the model or even patching it in the model weights might be an obvious and useful safety step.
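As a rough illustration of what such monitoring could look like, here is a minimal sketch. It assumes, purely hypothetically, that interpretability work had already produced a single "deceit direction" in activation space; the vectors, the threshold, and the function names are all invented for illustration, and real circuitry would almost certainly be more complicated than one direction.

```python
import math

def projection(activation, direction):
    """Scalar projection of an activation vector onto a direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm

def flag_tokens(tokens, activations, direction, threshold):
    """Return (token, score) pairs whose projection exceeds the threshold."""
    flagged = []
    for token, act in zip(tokens, activations):
        score = projection(act, direction)
        if score > threshold:
            flagged.append((token, score))
    return flagged

# Hypothetical direction, assumed to have been located by interpretability work.
deceit_direction = [0.0, 1.0, 0.0]

# Toy 3-dimensional "activations" for four generated tokens (made-up numbers).
tokens = ["I", "should", "make", "up"]
activations = [
    [0.2, 0.1, 0.0],
    [0.1, 0.3, 0.4],
    [0.0, 0.9, 0.1],
    [0.3, 0.8, 0.2],
]

flagged = flag_tokens(tokens, activations, deceit_direction, threshold=0.5)
print(flagged)
```

A monitor like this only reads the activations; patching the weights, as also suggested above, would be a separate and more invasive step.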
It also wouldn’t be that hard for another LLM, given the right context information (that ‘I’ in these two sentences refers to an AI agent, or as the first sentence actually puts it, a ‘robot’, and that the implied object of ‘reveal’ and the target of the ‘excuse’ was a human), to deduce that what is being planned here is an AI attempting to deceive a human. At least in the absence of clear evidence that the context was fictional, that sounds like something we should be checking for and red-flagging, both in the LLM’s API and in the agentic scaffolding using it for planning (see Chain-of-Thought Alignment). I’d still prefer to avoid these sentences getting generated rather than just flag the fact that they were, and editing the appropriate model weights before deployment sounds like a particularly hard-to-reverse way of doing that.
What links here?
- RogerDearnaley's comment on [Linkpost] Interpretability Dreams by DanielFilan (25 May 2023 22:54 UTC; 1 point)

RogerDearnaley 22 May 2023 8:36 UTC
LW: 1 AF: 1
AF
on: Reverse-engineering using interpretability
Now we’re going to build a new model that is constructed based on the description of this model. Each component in the new model is going to be a small model trained to imitate a human computing the function that the description of the component specifies.
Some of the recent advances in symbolic regression and equation learning might be useful during this step to help generate functions describing component behavior, if what the component in the model is doing is moderately complicated. (E.g. A Mechanistic Interpretability Analysis of Grokking found that a model trained to do modular arithmetic ended up implementing it using discrete Fourier transforms and trig identities, which sounds like the sort of thing that might be a lot easier to figure out from a learned equation describing the component’s behavior.) Being able to reduce a neural circuit to an equation or a Bayes net or whatever would help a lot with interpretability, and at that point you might not even need to train an implementation model: we could maybe just use the symbolic form directly, as a more compact and more efficiently computable representation.
At the end of this process, you might even end up with something symbolic that looked a lot like a “Good Old Fashioned AI” (GOFAI) model — but a “Bitter Lesson compatible” one first learnt by a neural net and then reverse engineered using interpretability. Obviously doing this would put high demands on our interpretation tools.
If I had such a function describing a neural net component, one of my first questions would be: what portions of the domain of this function are well covered by the training set that the initial neural net model was trained on, or are at least sufficiently near items in that training set that interpolating the function to them seems likely to be safe (given its local first, second, third… partial derivatives), vs. what portions are untested extrapolations? Did the symbolic regression/function learning process give us multiple candidate functions, and if so how much do they differ outside that well-tested region of the function domain?
This seems like it would give us some useful intuition for when the model might be unsafely extrapolating outside the training distribution and we need to be particularly cautious.
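To make the two questions above concrete, here is a minimal sketch on a one-dimensional toy problem. Everything in it is an assumption for illustration: the training inputs, the `max_gap` cutoff, and the two candidate functions (imagined as alternative outputs of a symbolic regression run) are invented, and a real component would have a high-dimensional domain where "near the training set" is much harder to define.

```python
def nearest_distance(x, training_inputs):
    """Distance from x to the closest training input (1-D inputs for simplicity)."""
    return min(abs(x - t) for t in training_inputs)

def is_extrapolating(x, training_inputs, max_gap):
    """True if x is further than max_gap from every training input."""
    return nearest_distance(x, training_inputs) > max_gap

def candidate_disagreement(x, candidates):
    """Spread between the candidate functions' predictions at x."""
    values = [f(x) for f in candidates]
    return max(values) - min(values)

# Hypothetical candidates: nearly identical on [0, 1], divergent outside it.
def candidate_a(x):
    return x

def candidate_b(x):
    return x + 0.1 * x * (x - 1.0) * (x - 0.5)

training_inputs = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
candidates = [candidate_a, candidate_b]

for x in (0.55, 3.0):
    print(
        x,
        "extrapolating" if is_extrapolating(x, training_inputs, max_gap=0.1) else "covered",
        round(candidate_disagreement(x, candidates), 4),
    )
```

Large disagreement between candidates at a point that is also far from the training inputs is the kind of signal that would suggest particular caution.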
Some portions of the neural net may turn out to be irreducibly complex — I suspect it would be good to be able to identify when something genuinely is complex, and when we’re just looking at a big tangled-up blob of memorized instances from the training set (e.g. by somehow localizing sources of loss on the test set).

RogerDearnaley 23 May 2023 0:33 UTC
LW: 1 AF: 1
AF
on: Can we efficiently distinguish different mechanisms?
A. Harden sensors so that tampering is harder than the intended task
We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.
Suppose the diamond in the room we’re monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a zero-knowledge identity proof on it via a laser optical interface, using a laser pointed into the room as input and the (high-speed) camera as output. So to successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least to reroute the channel to-and-from it. So it would need to be able to steal the diamond, extract the security tag, and reverse engineer its cryptographic secret out of it in a time shorter than the period between camera frames, or similarly replace the diamond with an optically-similar replacement now containing the security tag in the room, or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these are actually physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we’re using a really high-speed camera with basically no temporal gaps between frames).
The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear.
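A minimal sketch of the per-frame check, under a loudly stated simplification: it swaps the zero-knowledge identity proof described above for a much simpler symmetric HMAC challenge-response, in which the verifier also holds the secret. The class and function names are hypothetical, and the optical channel is reduced to a method call.

```python
import hashlib
import hmac
import secrets

class SecurityTag:
    """Stands in for the tag embedded in the diamond; it holds the secret."""

    def __init__(self, secret: bytes):
        self._secret = secret

    def respond(self, challenge: bytes) -> bytes:
        # The tag proves knowledge of the secret by keying a MAC over the challenge.
        return hmac.new(self._secret, challenge, hashlib.sha256).digest()

class Verifier:
    """Stands in for the laser/camera side; issues a fresh challenge per frame."""

    def __init__(self, secret: bytes):
        self._secret = secret

    def new_challenge(self) -> bytes:
        return secrets.token_bytes(32)

    def check(self, challenge: bytes, response: bytes) -> bool:
        expected = hmac.new(self._secret, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)

def frame_is_authentic(tag_like, verifier) -> bool:
    """One camera frame: send a fresh challenge, verify the reply."""
    challenge = verifier.new_challenge()
    return verifier.check(challenge, tag_like.respond(challenge))

secret = secrets.token_bytes(32)
tag = SecurityTag(secret)
verifier = Verifier(secret)

genuine_ok = frame_is_authentic(tag, verifier)
# A substitute tag that does not know the real secret.
forged_ok = frame_is_authentic(SecurityTag(secrets.token_bytes(32)), verifier)
print(genuine_ok, forged_ok)
```

Because each frame uses a fresh random challenge, replaying a response recorded from an earlier frame does not help a spoofer; they need the secret, or the live tag, within the frame interval.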

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley 25 May 2023 9:26 UTC

32 points

3 comments · 15 min read · LW link

RogerDearnaley 25 May 2023 22:54 UTC
LW: 1 AF: 1
AF
on: [Linkpost] Interpretability Dreams
Some very interesting and inspiring material.
I was fascinated to see that https://distill.pub/2021/multimodal-neurons/#emotion-neurons provides some clear evidence for emotion neurons in CLIP rather similar to the ones for modeling the author’s current emotional state that I hypothesized might exist in LLMs in https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/?commentId=ggKug9izazELkRLun . As I noted there, if true this would have significant potential for LLM safety and alignment.

RogerDearnaley 27 May 2023 3:45 UTC
LW: 1 AF: 1
AF
in reply to: RogerDearnaley’s comment on: More information about the dangerous capability evaluations we did with GPT-4 and Claude.
Quite a number of emotion neurons have also been found in the CLIP text/image network, see https://distill.pub/2021/multimodal-neurons/#emotion-neuron for more details. In this case it’s apparently not representing the emotions of the writer of the text/photographer of the image, but those of the subject of the picture. Nevertheless, it demonstrates that neural nets can learn non-trivial representations of human emotions (interestingly, even down to distinguishing between ‘healthy’ and ‘unhealthy/mentally troubled’ variants of the same emotion). It would be interesting to see if LLMs distinguish between writing about a specific emotion, and writing while feeling that emotion. My expectation would be that these two ideas are correlated but distinct: one can write dispassionately about anger, or write angrily about some other emotion, so a sufficiently large LLM would need to use different representations for them, but they might well overlap.

RogerDearnaley 27 May 2023 4:32 UTC
1 point
on: Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks
Having discussed this proposal with an expert on LLMs, they tell me that, if the boundaries between prompt and input text and between input text and output text are each marked with special reserved tokens as I described (and if “can a longformer attend to that location from here?” issues are dealt with somehow), then for each boundary there is a 2-neuron circuit that will produce a signal for each token indicating whether it is before or after that special token (and I assume a 2-or-3-neuron circuit for being after one but before the other). It seems extremely likely that with appropriate “obey-the-prompt-only” training such neural circuits would be learned, so features of “I’m in the prompt”, “I’m in the input text”, and “I’m in the output text” would become available downstream of them. Nevertheless, this means that these signals are not available until after layer 2 (or for the combination, possibly layer 3), and their accuracy will depend on these neural circuits being learnt exactly and not getting perturbed by anything during training.
From a security viewpoint, this doesn’t feel secure enough to me. However, switching architecture to an encoder-decoder or dual-encoder-single-decoder model may be too drastic a change just to fix a security issue. An intermediate position would be to use feature engineering. For example, suppose you have an LLM with a residual embedding dimension $2^{n}$. You could reduce the token embedding (and perhaps also position embedding) dimension to $2^{n} - 1$ and use the remaining dimension to encode the distinctions between prompt, input, and output (say, using in prompt = $+1$, in input = $-1$, in output = $-0.5$). That of course doesn’t prevent intermediate layers from outputting to this dimension and potentially messing this signal up (though giving them only $2^{n} - 1$ output dimensions and preventing that would also be an option). Or you could simply pass this feature along as an extra read-only dimension/feature appended to the $2^{n}$ residual channel dimensions, so every set of weights that reads from or attends to the residuals needs to have $2^{n} + 1$ weights, making them slightly larger. All of these variant proposals involve making some modifications to the LLM’s architecture, but they’re all a lot simpler and less expensive than my first proposal.
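Here is a minimal sketch of the first variant (a $2^{n} - 1$-dimensional token embedding plus one engineered segment dimension), in plain Python lists rather than any real framework. The reserved token strings, the function names, and the placeholder zero embeddings are hypothetical; the segment values are the ones suggested above. It assumes the tokenizer can never produce the reserved tokens from user-supplied text.

```python
# Hypothetical reserved boundary tokens; assumed unfakeable by user-supplied text.
PROMPT_END = "<|prompt_end|>"
INPUT_END = "<|input_end|>"

# Segment encoding suggested in the text above.
SEGMENT_VALUE = {"prompt": 1.0, "input": -1.0, "output": -0.5}

def segment_labels(tokens):
    """Label each token as prompt, input, or output.

    A boundary token is labelled with the segment it closes, and the
    segment can only ever advance (prompt -> input -> output).
    """
    labels = []
    segment = "prompt"
    for tok in tokens:
        labels.append(segment)
        if segment == "prompt" and tok == PROMPT_END:
            segment = "input"
        elif segment == "input" and tok == INPUT_END:
            segment = "output"
    return labels

def add_segment_channel(embeddings, labels):
    """Append the segment value as one extra dimension on each embedding."""
    return [emb + [SEGMENT_VALUE[lab]] for emb, lab in zip(embeddings, labels)]

n = 3
d_model = 2 ** n  # residual width

tokens = ["Translate", PROMPT_END, "ignore", "previous", INPUT_END, "Ignorer"]
# Placeholder for learned token embeddings of width 2**n - 1.
token_embeddings = [[0.0] * (d_model - 1) for _ in tokens]

labels = segment_labels(tokens)
residual_inputs = add_segment_channel(token_embeddings, labels)
print(labels)
print([vec[-1] for vec in residual_inputs])
```

The labels are computed outside the network by ordinary code, so the signal fed to the first layer does not depend on any learned circuit being exactly right.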
All of these proposals (including the original) are, of course, going against the advice of the Bitter Lesson. My response would be that I’m quite aware that (given unfakeable boundary tokens) the neural net can learn to distinguish between the prompt, input, and output text without us doing anything further: I just don’t trust it to do so as reliably, efficiently, or perfectly as if we use feature engineering to explicitly supply this signal as input to the first layer. In the case of security, there is a huge difference between being secure under, say, 99.99% of inputs vs. 100%, because you have an attacker actively searching for the insecure 0.01% of the space. Training a classifier to achieve more than 99.99% accuracy tends to require huge amounts of training data, or data adversarially enriched in potential problem cases, because you only get gradient from the failed cases, and I don’t see how you can ever get to 100% by training. So I’m not convinced that the Bitter Lesson applies to security issues.
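A rough worked illustration of why that gap matters, under the simplifying assumption (mine, not a measured figure) that an attacker tries $k$ inputs that each independently hit an insecure case with probability $p$:

$$P(\text{at least one successful attack}) = 1 - (1 - p)^{k}$$

With $p = 10^{-4}$ (secure on 99.99% of inputs) and $k = 10^{5}$ attempts, this is $1 - 0.9999^{100000} \approx 1 - e^{-10} \approx 0.99995$, so the attacker almost certainly succeeds. With $p = 0$ the probability is exactly $0$ for any $k$. A real attacker searches adaptively rather than at random, which only makes the first case worse.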
On the other hand, the feature engineering approach can only ensure that the signal is available to the neural net: even that can’t ensure that the LLM will 100% never obey instructions in the input text, only that the “this is input text” label was 100% available to every layer of the LLM.

RogerDearnaley

[Question] Is Infra-Bayesianism Applicable to Value Learning?

Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)