Ok, putting my [maybe I’m missing the point] hat on, it strikes me that the above is considering the learned steering system—which is the outcome of any misalignment. So I probably am missing your point there (I think?). Oops.
However, I still think I’d stick to saying that:
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce]
But here I’d need to invoke properties of the original steering system (ignoring the handwaviness of what that means for now), rather than the learned steering system.
I think what matters at that point is sampling of trajectories (perhaps not only this—but at least this). There’s no mechanism in humans to sample in such a way that we’d expect maximisation of reward to be learned in the limit. Neither would we expect one, since evolution doesn’t ‘care’ about reward maximisation.
Absent such a sampling mechanism, the objective encoded isn’t likely to be maximisation of the reward.
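(To make the sampling point concrete: convergence-to-optimality results for e.g. tabular Q-learning assume every state–action pair keeps being sampled, which in practice means some explicit exploration mechanism. Below is a minimal sketch of what such a mechanism looks like; the `env` interface here is entirely hypothetical:)

```python
import random
from collections import defaultdict

def epsilon_greedy_q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning. The epsilon-greedy sampling is the part that matters
    here: convergence-to-optimal results assume every (state, action) pair keeps
    getting sampled. Remove the exploration and there's no reason to expect
    reward maximisation in the limit."""
    q = defaultdict(float)  # (state, action) -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(env.actions(state))  # explore
            else:
                action = max(env.actions(state), key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = env.step(action)
            best_next = max((q[(next_state, a)] for a in env.actions(next_state)), default=0.0)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

Humans have nothing playing the role of that exploration schedule, which is the sense in which I don't expect reward maximisation to be learned in the limit.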
To talk about inner misalignment, I think we need to be able to say something like:
1. Under [learning conditions], we expect system x to maximise y in the limit.
2. System x does not robustly learn to pursue y (rather than a proxy for y), so that under [different conditions] x no longer maximises y.
Here I don’t think we have (1), since we don’t expect the human system to learn to maximise reward (or minimise regret, or...) in the limit (i.e. this is not the objective encoded by their original steering system).
Anyway, hopefully it’s now clear where I’m coming from—even if I am confused!
My guess is that this doesn’t matter much to your/Quintin’s broader points(?) - beyond that “inner alignment failure” may not be the best description.
[EDIT: see my response to this comment; this one is at least mildly confused]
[Again, I want to flag that this line of thinking/disagreement is not the most interesting part of what you/Quintin are saying overall—the other stuff I intend to think more about; nonetheless, I do think it’s important to get to the bottom of the disagreement here, in case anything more interesting hinges upon it]
[JC: There isn’t an objective human reward signal that mirrors an RL agent’s reward.]
You’re the second person to confidently have this reaction, and I’m pretty confused why.
My objection here is all in the ”...that mirrors an RL agent’s reward.”—that’s where the parallel doesn’t work in my view. An RL agent is trained to maximize total (discounted) reward. The brain isn’t maximizing total reward, nor trying to maximize total reward, nor is evolution acting on the basis that it’ll do either of these things.
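For concreteness, by "total (discounted) reward" I mean the standard RL objective, with discount factor $\gamma \in [0,1)$:

$$J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

It's this quantity that RL training pushes up, and this that I don't think the brain is maximising (or trying to maximise).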
I agree with the following:
The brain implements an outer criterion which evaluates and reinforces behavior/predictions and incentivizes some plans over others along different dimensions.
I just don’t think this tells us anything useful, since this criterion clearly is not maximisation of total discounted reward. (though I would expect some correlation)
It seems to me that the criterion is more like maximisation of in-the-moment reward (I’m using ‘reward’ here very broadly). I.e. I might work rather than have fun since the thought of working happened to be more ‘rewarding’ than the thought of having fun. (similarly, I might not wirehead, since the thought of wireheading is negative)
This seems essentially vacuous, because I don’t see a way to measure itm-reward better than: if I did x rather than y, then x was more itm-rewarding than y. (to be clear, I’m saying this is not useful—but that I don’t see a principled definition of itm-reward that doesn’t amount to this; this is where a “crisp and clear mechanistic notion of what counted as human reward” would be handy—in order to come up with a non-vacuous definition)
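(A toy illustration of the vacuousness; everything here is a made-up stand-in. If itm-reward can only ever be read off from the choice actually made, the 'definition' does no predictive work:)

```python
def itm_reward_ordering(chosen, rejected):
    """The only available 'measurement' of in-the-moment reward: read it
    off the choice itself. Defined this way, 'we choose whatever is most
    itm-rewarding' is true by construction, and so predicts nothing."""
    return {chosen: "higher itm-reward", rejected: "lower itm-reward"}

# Whatever I actually did is, by definition, what was more itm-rewarding:
print(itm_reward_ordering(chosen="work", rejected="have fun"))
```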
Perhaps it’s clearer if I back up to your previous post and state a crisper disagreement:
If you don’t want to wirehead, you are not trying to optimize the objective encoded by the steering system in your own brain, and that’s an inner alignment failure with respect to that system.
This just seems wrong to me. The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce].
In an RL system these two are similar, precisely because the RL system is designed to steer towards outcomes with high total discounted reward according to its own metric.
In general, steering systems are not like this. The criterion for picking one plan over another can be [expected total reward] or [something entirely different].
Where a system doesn’t use [expected total reward] it seems just plain silly to me to call behaviour misaligned where it doesn’t match [what the system would incentivize if it did use expected total reward]. Of course it doesn’t match, since that’s not how this steering system works.
What’s interesting is that at the peak we have 0.035/million deaths and about 20/million cases, for a (no-delay) case fatality rate of ~1.75%
I think you mean ~0.175%, so deaths are suspiciously low even at the peak.
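(Worked out: 0.035 deaths/million ÷ 20 cases/million = 0.00175 ≈ 0.175%, i.e. the quoted ~1.75% looks like a factor-of-ten slip.)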
This is great! Thanks.
adding stylistic prompts actively changes some of what I would consider to be content
Your examples here are not good since e.g. “...painting by Alphonse Mucha” is not just a rewording of “...in the style of Alphonse Mucha”: the former isn’t a purely stylistic prompt. For a [painting by x], x gets to decide what is in the painting—so it should be expected that this will change the content. Similarly for “screenshots from the Miyazaki anime movie”.
Of course it’s still a limitation if you can only get really good style results by using such not-purely-stylistic prompts.
I think the “why”s of chess are also about game design—just about decisions that could have gone various other ways. Go is certainly more elegant; chess has more character: these are essentially opposites, since each arbitrary-but-reasonable rule added is a loss for elegance and a win (potentially) for character. (a rule that introduces more symmetry than it breaks goes in the other direction—but such rules don’t feel arbitrary)
Agreed. It’d be nice if the chess folk took some low-hanging-fruit rule changes seriously.
Treating stalemate as a loss is the most obvious. I’d be interested to know how much this would change things at the highest level. Ah—I see DM tried this (gwern’s link), with disappointingly little impact.
A more ‘drastic’ (but IMO interesting) endgame change would be to change the goal of chess from “capture the king” to “get the king to the opponent’s throne” (i.e. white wins by getting the king to e8, black wins by getting the king to e1; checkmate/stalemate wins immediately).
You get some moderately interesting endgames with this rule—e.g. king+bishop can win against king from most positions, as can king+knight. This means that many liquidate-material-to-drawn-endgame tactics no longer work.
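(For concreteness, a minimal sketch of the variant's win condition; the `position` interface is entirely hypothetical:)

```python
def race_winner(position):
    """Win condition for the king-race variant: white wins on reaching e8
    with its king, black on reaching e1. Checkmate or stalemate (of the
    side to move) ends the game immediately as a win for the other side."""
    if position.king_square("white") == "e8":
        return "white"
    if position.king_square("black") == "e1":
        return "black"
    if position.is_checkmate() or position.is_stalemate():
        return "black" if position.side_to_move() == "white" else "white"
    return None  # game continues
```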
For more general endgame positions, the e8 and e1 squares become an extra weakness. So positions where it was hard/impossible to convert an advantage (difficult with only one weakness to exploit), become winnable (two weaknesses often being enough).
I don’t know how it’d work out in practice. It’d be fun to see how [this + chess960] worked out at high level.
Well I’m sure I could have been clearer. (and it’s possible that I’m now characterising what I think, rather than what I wrote)
But getting that impression is pretty natural: in my argument, a large part of the problem does come from its sometimes being correct to pick the question-ignoring answer. (‘correct’ meaning something like: [leads to best consequences, according to our values]) Or alternatively, that a correct decision algorithm would sometimes pick the question-ignoring answer.
I think I focus on this, since it’s the non-obvious part of the argument: it’s already clear that poor decisions / decision-algorithms may sometimes pick the question-ignoring answer.
Probably I should have emphasized more that unexpected behaviour when things are going right will make it harder to know when things are going wrong.
Thanks for this. I hope to have thoughts at some point, but first need to think about it more carefully.
One immediate response—since I already know what I think on this bit (it’s not clear to me that this implies any significant object-level disagreement—it may just amount to my saying “those are weird words to use”):
For my part, I’m talking about the reward signals provided by the steering system in a person’s brain. Although some people are hedonists, many are not, and thus they are unaligned with their reward system.
This seems too narrow a concept of what reward is (e.g. hedonism == aligned-with-reward-system). There isn’t an objective human reward signal that mirrors an RL agent’s reward.
We get a load of input, have a bunch of impressions, feelings and thoughts, and take some actions. Labelling of some simple part of that as the reward strikes me as silly (“a reward”, sure). What could be the justification? If we’re clearly not maximising it, nor learning to maximise it (nor trying to...), in what sense is it analogous to RL reward?
The reasonable move seems to be to say “Oops, I was wrong to label that as ‘the reward’, there’s no direct parallel here”, and not “there’s an inner misalignment”.
I’d note that evolution will have implicitly accounted for any previous “misalignment” in shaping our current reward signals: it will have selected for the reward signals that tended to increase fitness given our actual responses to those signals, not the signals that would have increased fitness if we had followed some maximisation process.
Our reward signals weren’t ‘designed’ to be maximised, only to work (to increase fitness).
So it still seems strange to talk about misalignment w.r.t. an objective nothing and nobody was aiming for (even implicitly). It’d seem more useful if there were some crisp and clear mechanistic notion of what counted as human reward and what didn’t; I don’t think that’s true (is anyone claiming this?).
...the human can just use both answers in whichever way it wants, independently of which it selects as the correct answer...

I don’t think you disagreed with this?
A few points on the rest:
At the highest level, the core issue is that QI makes it quite a bit harder to identify misalignment. If aligned systems will sometimes not answer the question, non-answering isn’t necessarily strong evidence of misalignment. So “consequentialist judges will [sometimes correctly] select QIA’s” is bad in the sense that it provides cover for “consequentialist judges will [sometimes incorrectly] select QIA’s”.
I talk about consequentialists, but not rational consequentialists. I expect the kind of judge we’d pick to be highly rational relative to the average human—but that’s a low bar. I expect all humans to have exploitable inconsistencies, and that optimal play will exploit them (similarly for groups of humans). So yes, this is only a problem where manipulation is possible—but since it is possible, we’ll have difficulty distinguishing [judge correctly selected a non-answer as the winner for principled reasons] from [judge was manipulated...].
It’s much less clear when such issues show up with sub-optimal play.
With “Is this definitely undesirable? I’m not sure, but probably.” I’m referring to its being undesirable that the debate structure has this property in general. The judge can know it is undesirable in general, but also think that in this specific case things are different—and, of course, the judge can be wrong about this.
Noting here that humans can’t make binding pre-commitments. (saying words doesn’t qualify)
It’s hard (/impossible?) to avoid this issue through oversight, since we just move from [QI exceptions that persuade the judge win] to [QI exceptions that persuade the judge and oversight system win].
Absolutely—but it’s a strange situation in many respects.
It may be that spreading awareness is positive, but I don’t think standard arguments translate directly. There’s also irreversibility to consider: err on the side of not spreading info, and you can spread it later (so long as there’s time); you can’t easily unspread it.
More generally, I think for most movements we should ask ourselves, “How much worse than the status-quo can things plausibly get?”.
For gain-of-function research, we’d need to consider outcomes where the debate gets huge focus, but the sensible side loses (e.g. through the public seeing GoF as the only way to prevent future pandemics). This seems unlikely, since I think there are good common-sense arguments against GoF at most levels of detail.
For climate change, it’s less clear to me: there seem to be many plausible ways for things to have gotten worse. Essentially because the only clear conclusion is “something must be done”, there’s quite a bit less clarity about what—or at least there should be less clarity. (e.g. to the extent that direct climate-positive actions have negative economic consequences, to what extent are there downstream negative-climate impacts? I have no idea, but I’m sure it’s a complex situation)
For AGI, I find it easy to imagine making-things-worse and hard to see plausible routes to making-things-better.
Even the expand-the-field upside needs to be approached with caution. This might be better thought of as something like [expand the field while maintaining/improving the average level of understanding]. Currently, most people who bump into AI safety/alignment will quickly find sources discussing the most important problems. If we expanded the field 100x overnight, it would become plausible that most new people don’t focus on the real problems. (e.g. it’s easy enough only to notice the outer-alignment side of things)
Unless time is very short, I’d expect doubling the field each year works out better than 5x each year—because all else would not be equal. (I have no good sense what the best expansion rate or mechanism is—just that it’s not [expand as fast as possible])
But perhaps I’m conflating [aware of the problem] with [actively working on the problem] a bit much. Might not be a bad idea to have large amounts of smart people aware of the problem overnight.
I sometimes advise people that it is useful to self-identify as a villain...
Perhaps “antihero” is better here? The “heroic” tend to be stupid and rely on the laws of narrative saving them. Villains tend to have exciting/intricate/dastardly… but overcomplicated and fatally flawed plans.
My first thought on “No fictional hero ever sacrifices one bystander to save ten” was of Zakalwe (Use of Weapons) - but of course he’s squarely in antihero territory.
Agreed on the introductory material (and various other things in that general direction).
I’m not clear on the case for public communication. We don’t want lay people to have strong and wildly inaccurate opinions, but it seems unachievable to broadly remedy the “wildly inaccurate” part. I haven’t seen a case made that broad public engagement can help (but I haven’t looked hard—has such a case been made?).
I don’t see what the lay public is supposed to do with the information—even supposing they had a broadly accurate high-level picture of the situation. My parents fit this description, and thus far it’s not clear to me what I’d want them to do. It’s easy to imagine negative outcomes (e.g. via politicization), and hard to imagine realistic positive ones (e.g. if 80% of people worldwide had an accurate picture, perhaps that would help, but it’s not going to happen).
It does seem important to communicate effectively to a somewhat narrower audience. The smartest 1% of people being well informed seems likely to be positive, but even here there are pitfalls in aiming to achieve this—e.g. if you get many people as far as [AGI will soon be hugely powerful] but they don’t understand the complexities of the [...and hugely dangerous] part, then you can inadvertently channel more resources into a race. (still net positive though, I’d assume [EDIT: the being-well-informed part, that is])
I’ll grant that there’s an argument along the lines of: “Perhaps we don’t know what we’d want the public to do, but there may come a time when we do know. If time is short it may then be too late to do the public education necessary. Therefore we should start now.”
I don’t think I buy this. The potential downsides of politicization seem great, and it’s hard to think of many plausible ”...and then public pressure to do X saved the day!” scenarios.
This is very much not my area, so perhaps there are good arguments I haven’t seen. I’d just warn against reasoning of the form “Obviously we want to get the public engaged and well-informed, so...”. The “engaged” part is far from obvious (in particular we won’t be able to achieve [engaged if and only if well-informed]).
A few thoughts:
“p(DOOM | Action)” seems too coarse a view from which to draw any sensible conclusions. The utility of pessimism (or realism, from some perspectives) is that it informs decisions about which action to take. It’s not enough that we’re simply doing something. Yes, the pessimistic view has a cost, but that needs to be weighed against the (potential) upside of finding more promising approaches—and realising that p(DOOM | we did things that felt helpful) > p(DOOM | we did things that may work).
Similarly, convincing a researcher that a [doomed-to-the-point-of-uselessness] approach is doomed is likely to be net positive, since they’ll devote time to more worthwhile pursuits, likely increasing p(AWESOME...). Dangers being:
Over-confidence that it’s doomed.
Throwing out wrong-but-useful ideas prematurely.
A few researchers may be discouraged to the point of quitting.
In general, the best action to take will depend on p(Doom); if researchers have very different p(Doom) estimates, it should come as no surprise that they’ll disagree on the best course of action at many levels. Specifically, the upside/downside of ‘pessimism’ will relate to our initial p(Doom). If p(Doom) starts out at >99.9%, then p(Doom | some vaguely plausible kind of action) is likely also >99%. High values of p(Doom) result from the problem’s being fundamentally difficult—with answers that may be hidden in a tiny region of a high-dimensional space. If the problem is five times as hard as we may reasonably hope, then a shotgun may work. If the problem is five orders of magnitude harder, then we want a laser.
Here I emphasize that we’d eventually need the laser. By all means use a shotgun to locate promising approaches, but if the problem is hard we can’t just babble: we need to prune hard.
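(A toy way to see the shotgun/laser asymmetry, with purely illustrative numbers: if solutions occupy a fraction f of the search space, then N independent random draws hit with probability 1 - (1 - f)^N, which collapses quickly as f shrinks.)

```python
def p_hit(f, n):
    """Probability that at least one of n independent random draws
    lands in a solution region occupying fraction f of the space."""
    return 1 - (1 - f) ** n

print(p_hit(f=1e-2, n=100))  # ~0.63: a shotgun has a real chance
print(p_hit(f=1e-7, n=100))  # ~1e-5: random search is hopeless; we want a laser
```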
It occurs to me that it may be useful to have more community-level processes/norms to signal babbling, so that mistargeted pruning doesn’t impede idea generation.
Pivotal act talk does seem to be disastrous PR, but I think that’s the only sense in which it’s negative. By Critch’s standards, the arguments in that post are weak. When thinking about this, it’s important to be clear what we’re saying on the facts, and what adjustments we’re making for PR considerations. React too much to the negative PR issue (which I agree is real), and there’s the danger of engaging in wishful thinking.
Should we try to tackle the extremely hard coordination problems? Possibly—but we shouldn’t be banking on such approaches working. (counterfactual resource use deserves consideration)
We should be careful of any PR-motivated adaptations to discourse, since they’re likely to shape thinking in unhelpful ways. (even just talking less about X, rather than actively making misleading claims about X may well have a negative impact on epistemics)
This mostly seems to be an argument for: “It’d be nice if no pivotal act is necessary”, but I don’t think anyone disagrees with that.
As for “Should an AGI company be doing this?” the obvious answer is “It depends on the situation”. It’s clearly nice if it’s not necessary. Similarly, if [the world does the enforcement] has higher odds of success than [the AGI org does the enforcement] then it’s clearly preferable—but it’s not clear that would be the case.
I think it’s rather missing the point to call it a “pivotal act philosophy”, as if anyone values pivotal acts for their own sake. Some people just think they’re plausibly necessary—as are many unpleasant and undesirable acts. Obviously this doesn’t imply they should be treated lightly, or that the full range of more palatable options shouldn’t be carefully considered.
I don’t buy that an intention to perform pivotal acts is a significant race-dynamic factor: incentives to race seem over-determined already. If we could stop the existing race, I imagine most pivotal-act advocates would think a pivotal act was much less likely to be necessary.
Depending on the form an aligned AGI takes, it’s also not clear that the developing organisation gets to decide/control what it does. Given that special-casing avoidance of every negative side-effect is a non-starter, an aligned AGI will likely need a very general avoids-negative-side-effects mechanism. It’s not clear to me that an aligned AGI that knowingly permits significant avoidable existential risk (without some huge compensatory upside) is a coherent concept.
If you’re allowing a [the end of the world] side-effect, what exactly are you avoiding, and on what basis? As soon as your AGI takes on any large-scale long-term task, then [the end of the world] is likely to lead to a poor outcome on that task, and [prevent the end of the world] becomes an instrumental goal.
Forms of AGI that just do the pivotal act, whatever the creators might think about it, are at least plausible. I assume this will be an obvious possibility for other labs to consider in planning.
Sure, that makes sense.
By default, AI systems won’t be subject to anything like the environment and pressures that shaped humans and human values. We could aim to create (something analogous to) it, but it’s anything but straightforward. How fragile is the process for humans? Which aspects can be safely simplified/skipped, and how would we know?
It occurs to me that I’m not sure whether you mean [human rewards in evolution] or [rewards for individual learning humans], or both? I’m assuming the evolutionary version, since I’m not clear what inner alignment failure would mean for an individual (what defines the intended goal/behaviour?).
If we could run a similar process for some x we’re training, then we would expect to get [xs care about xs], not [xs care about humans]. Granted that may not waste the future, but it’s a humans-as-pets future if we’re very lucky. (philosophically, not wasting the future is far more important—but I’m rather attached to humanity)
It’s not clear to me how close we’d need to get to x-has-human-values before we’d think an x-dominated world would be worthwhile (even ignoring attachment to humanity).
I think I’d worry that the set of values that do well under human-evolution/learning conditions is too broad (for a good-according-to-non-selfish-us outcome to be likely). I.e. that re-rolling values under similar evolutionary pressures can give you various value-sets that each achieve similar fitness (or even similar behaviour), but where maximizing utility according to one gets you very low utility according to the others.
Perhaps more fundamental: humans shape their own environment (both in evolution and individual learning). If we start out with similar conditions, divergence will compound. This makes me less confident that a re-roll ends well.
Perhaps the same applies to our future already—but I think that’s an argument for conscious effort to guide future values.
I wonder how viable/instructive it might be to test this kind of thing in a toy model. I.e. you run some toy evolutionary environment twice, and check how much run-1 denizens approve of the run-2 world.
I can’t see this working at present, but I’m not sure what that tells us. Are the silly-non-answers, unsatisfied prerequisites and type errors I’d expect in a toy model artefacts of the toy setup, or reflective of fundamental issues? It’s not immediately clear to me.
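(A very rough sketch of the kind of toy experiment I have in mind; every detail is a made-up stand-in. Evolve value vectors under the same selection pressure with different seeds, then score each run's result by the other run's values:)

```python
import numpy as np

def evolve_values(seed, dims=10, pop=200, gens=500):
    """Toy 'evolution': value vectors selected for a fitness that only
    constrains their norm, not their direction—so many distinct
    value-sets reach similar fitness."""
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(pop, dims))
    fitness = lambda xs: -np.abs(np.linalg.norm(xs, axis=1) - 1.0)
    for _ in range(gens):
        survivors = population[np.argsort(fitness(population))[-pop // 2:]]
        children = survivors + rng.normal(scale=0.05, size=survivors.shape)
        population = np.vstack([survivors, children])
    return population[np.argmax(fitness(population))]

v1, v2 = evolve_values(seed=1), evolve_values(seed=2)
# Cross-run 'approval': cosine similarity of the two evolved value-sets.
print(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # typically far from 1
```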
Something like the 80,000 hours career advice seems like a good place to start—or finding anyone who has a good understanding of the range of possibilities (mine is a bit too narrowly slanted towards technical AIS).
If you’ve decided on the AIS direction, then AI Safety Support is worth a look—they do personal calls for advice, and have many helpful links.
That said, I wouldn’t let the idea of “grant proposals” put you off. The forms you’d need to fill for the LTFF are not particularly complicated, and they do give grants for e.g. upskilling—you don’t necessarily need a highly specific/detailed plan.
If you don’t have a clear idea where you might fit in, then the advice links above should help. If/when you do have a clear idea, don’t worry about whether you can articulate it persuasively. If it makes sense, then people will be glad to hear it—and to give you pointers (e.g. fund managers).
E.g. there’s this from Evan Hubinger (who helps with the LTFF):
if you have any idea of any way in which you think you could use money to help the long-term future, but aren’t currently planning on applying for a grant from any grant-making organization, I want to hear about it. Feel free to send me a private message on the EA Forum or LessWrong. I promise I’m not that intimidating :)
Also worth bearing in mind as a general principle that if almost everything you try succeeds, you’re not trying enough challenging things. Just make sure to take negative outcomes as useful information (often you can ask for specific feedback too). There’s a psychological balance to be struck here, but trying at least a little more than you’re comfortable with will generally expand your comfort zone and widen your options.
Examples would be interesting, certainly. Concerning the post’s point, I’d say the relevant claim is that [type of alignment research that’ll be increasingly done in slow takeoff scenarios] is already being done by non x-risk motivated people.
I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes—but I wouldn’t want to rely on this.
I’m very suspicious of:
Inner alignment failure is the only process in the known universe to ever generate human values
as a jumping-off point, since inner alignment failure did not hit a pre-defined target of human values. It just happened to produce them. If a gun can fire one bullet, I’ll expect it can fire a second. I won’t expect the second bullet to hit the first.
On the rest, it strikes me that:
Game theory keeps human values ‘good’ in largely circular fashion: we’ll tend to think that whatever is working is ‘good’, since it helps us to think that. This should give us confidence neither in future human values, nor in AI values. (e.g. future humans would learn to prefer uniformity, if the game theory favoured it)
I don’t think this is quite right: “This instinct is quite contrary to how the optima of most utility functions or values look”—rather, it’s contrary to how the optima of simple utility functions we can easily specify look. Most complex utility functions will produce worlds containing complex patterns. Most of those worlds will still be essentially worthless from a human perspective, since we care about a tiny proportion of patterns. I don’t think it’s hard to get an amount of diversity humans would appreciate; I think it’s hard to get the types of diversity humans would appreciate.
I think I buy the rest of your argument in terms of [It won’t be too hard to produce an AI that’ll create an interesting world], but only in the sense that it’d be a world that’s interesting to investigate as an object of study (dynamic, varied, complex, hard to predict...). I don’t think many people imagine the trivially simple worthless failure modes (paperclips, tiling-smiley-faces...), but rather worlds containing a load of complex patterns which are nonetheless ~worthless from even our most enlightened perspective. (though it’s also plausible for things to collapse into a dull attractor)
Wholeheartedly agree, and I think it’s great that you’re doing this. I’ll be very interested in what you learn along the way w.r.t. more/less effective processes.
(Bonus points for referencing the art of game design—one of my favourite books.)
...and take out a bunch of loans...
That part really shouldn’t be necessary (even if it may be rational, conditional on some assumptions). In the event that you do decide to devote your time to helping, whether for dignity or whatever else, you should be able to get funding to cover most reasonable forms of upskilling and/or seeing-if-you-can-help trial period.
That said, I think step one would be to figure out where your comparative advantage lies (80,000 hours folk may have thoughts, among others). Certainly some people should be upskilling in ML/CS/Math (though an advanced degree may not be the most efficient route), but there are other ways to help.
I realize this doesn’t address the deciding-what’s-true aspect. I’d note there that I don’t think much detailed ML knowledge is necessary to follow Eliezer’s arguments on this. Most of the ML-dependent parts can be summarized as [we don’t know how to do X], [we don’t have any clear plan that we expect will tell us how to do X], similarly for Y, Z, [Either X, Y or Z is necessary for safe AGI].
Beyond that, I think you only need a low prior on our bumping into a good solution while fumbling in the dark and a low prior on sufficient coordination, and things look quite gloomy. Probably you also need to throw in some pessimism on getting safe AI systems to fundamentally improve our alignment research.