I expect I would improve significantly with additional practice (e.g. I think a 2nd hour of playing the probability-assignment game would get a much higher score than my 1st in expectation). My subjective feeling was that I could probably learn to do as well as GPT-2-small (though estimated super noisily) but there’s definitely no way I was going to get close to GPT-2.
(By the way, I’m not sure why your original comment brought up inclusive genetic fitness at all; aren’t we talking about within-lifetime RL? The within-lifetime reward function is some complicated thing involving hunger and sex and friendship etc., not inclusive genetic fitness, right?)
This was mentioned in OP (“The argument would prove too much. Evolution selected for inclusive genetic fitness, and it did not get IGF optimizers.”). It also appears to be a much stronger argument for the OP’s position and so seemed worth responding to.
I think incomplete exploration is very important in this context and I don’t quite follow why you de-emphasize that in your first comment. In the context of within-lifetime learning, perfect exploration entails that you try dropping an anvil on your head, and then you die. So we don’t expect perfect exploration; instead we’d presumably design the agent such that it explores if and only if it “wants” to explore, in a way that can involve foresight.
It seems to me that incomplete exploration doesn’t plausibly cause you to learn “task completion” instead of “reward” unless the reward function is perfectly aligned with task completion in practice. That’s an extremely strong condition, and if the entire OP is conditioned on that assumption then I would expect it to have been mentioned.
I didn’t write the OP. If I were writing a post like this, I would (1) frame it as a discussion of a more specific class of model-based RL algorithms (a class that includes human within-lifetime learning), (2) soften the claim from “the agent won’t try to maximize reward” to “the agent won’t necessarily try to maximize reward”.
If the OP is not intending to talk about the kind of ML algorithm deployed in practice, then it seems like a lot of the implications for AI safety would need to be revisited. (For example, if it doesn’t apply to either policy gradients or the kind of model-based control that has been used in practice, then that would be a huge caveat.)
It sounded like OP was saying: using gradient descent to select a policy that gets a high reward probably won’t produce a policy that tries to maximize reward. After all, look at humans, who aren’t just trying to get a high reward.
And I am saying: this analogy seems like pretty weak evidence, because human brains seem to have a lot of things going on other than “search for a policy that gets high reward,” and those other things seem like they have a massive impact on what goals I end up pursuing.
ETA: as a simple example, the details of humans’ desire for their children’s success, or their fear of death, don’t seem to match well with the theory that all human desires come from RL on intrinsic reward. I guess you probably think they do? If you’ve already written about that somewhere it might be interesting to see. Right now the theory “human preferences are entirely produced by doing RL on an intrinsic reward function” seems to me to make a lot of bad predictions and not really have any evidence supporting it (in contrast with a more limited theory about RL-amongst-other-things, which seems more solid but not sufficient for the inference you are trying to make in this post).
At some level I agree with this post—policies learned by RL are probably not purely described as optimizing anything. I also agree that an alignment strategy might try to exploit the suboptimality of gradient descent, and indeed this is one of the major points of discussion amongst people working on alignment in practice at ML labs.
However, I’m confused or skeptical about the particular deviations you are discussing and I suspect I disagree with or misunderstand this post.
As you suggest, in deep RL we typically use gradient descent to find policies that achieve a lot of reward (typically updating the policy based on an estimator for the gradient of the reward).
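For concreteness, here is a minimal sketch of "updating the policy based on an estimator for the gradient of the reward," using the standard score-function (REINFORCE) estimator on a toy multi-armed bandit. The bandit, the reward values, the running-average baseline, and names such as `train_bandit` are all assumptions made for illustration; nothing here is taken from the post, and real deep RL systems are far more elaborate.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution over arms."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy with a score-function gradient estimate.

    arm_rewards: hypothetical mean reward of each arm (toy assumption).
    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    baseline = 0.0  # running average of reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[action] + rng.gauss(0.0, 0.1)  # noisy reward
        advantage = reward - baseline
        # For a softmax policy, d/d logit_i of log pi(action) is
        # 1[i == action] - probs[i]; scaling it by the advantage gives a
        # single-sample estimate of the gradient of expected reward.
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


final_probs = train_bandit([0.2, 1.0])
```

In this toy setting the update only ever "sees" sampled rewards, so the policy drifts toward whatever actually earns reward; it says nothing by itself about what cognition a large learned policy would use to get there.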
If you have a system with a sophisticated understanding of the world, then cognitive policies like “select actions that I expect would lead to reward” will tend to outperform policies like “try to complete the task,” and so I usually expect them to be selected by gradient descent over time. (Or we could be more precise and think about little fragments of policies, but I don’t think it changes anything I say here.)
It seems to me like you are saying that you think gradient descent will fail to find such policies because it is greedy and local, e.g. if the agent isn’t thinking about how much reward it will receive then gradient descent will never learn policies that depend on thinking about reward.
(Though I’m not clear on how much you are talking about the suboptimality of SGD, vs the fact that optimal policies themselves do not explicitly represent or pursue reward given that complex stews of heuristics may be faster or simpler. And it also seems plausible you are talking about something else entirely.)
I generally agree that gradient descent won’t find optimal policies. But I don’t understand the particular kinds of failures you are imagining or why you think they change the bottom line for the alignment problem. That is, it seems like you have some specific take on ways in which gradient descent is suboptimal and therefore how you should reason differently about “optimum of loss function” from “local optimum found by gradient descent” (since you are saying that thinking about “optimum of loss function” is systematically misleading). But I don’t understand the specific failures you have in mind or even why you think you can identify this kind of specific failure.
As an example, at the level of informal discussion in this post I’m not sure why you aren’t surprised that GPT-3 ever thinks about the meaning of words rather than simply thinking about statistical associations between words (after all if it isn’t yet thinking about the meaning of words, how would gradient descent find the behavior of starting to think about meanings of words?).
One possible distinction is that you are talking about exploration difficulty rather than other non-convexities. But I don’t think I would buy that—task completion and reward are not synonymous even for the intended behavior, unless we take some extraordinary pains to provide “perfect” reward signals. So it seems like no exploration is needed, and we are really talking about optimization difficulties for SGD on supervised problems.
The main concrete thing you say in this post is that humans don’t seem to optimize reward. I want to make two observations about that:
Humans do not appear to be purely RL agents trained with some intrinsic reward function. There seems to be a lot of other stuff going on in human brains too. So observing that humans don’t pursue reward doesn’t seem very informative to me. You may disagree with this claim about human brains, but at best I think this is a conjecture you are making. (I believe this would be a contrarian take within psychology or cognitive science, which would mostly say that there is considerable complexity in human behavior.) It would also be kind of surprising a priori—evolution selected human minds to be fit, and why would the optimum be entirely described by RL (even if it involves RL as a component)?
I agree that humans don’t effectively optimize inclusive genetic fitness, and that human minds are suboptimal in all kinds of ways from evolution’s perspective. However this doesn’t seem connected with any particular deviation that you are imagining, and indeed it looks to me like humans do have a fairly strong desire to have fit grandchildren (and that this desire would become stronger under further selection pressure).
At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any.
Apart from the other claims of your post, I think this line seems to be wrong. When considering whether gradient descent will learn model A or model B, the fact that model A gets a lower loss is a strong prima facie and mechanistic explanation for why gradient descent would learn A rather than B. The fact that there are possible subtleties about non-convexity of the loss landscape doesn’t change the existence of one strong reason.
That said, I agree that this isn’t a theorem or anything, and it’s great to talk about concrete ways in which SGD is suboptimal and how that influences alignment schemes, either making some proposals more dangerous or opening new possibilities. So far I’m mostly fairly skeptical of most concrete discussions along these lines but I still think they are valuable. Most of all it’s the very strong take here that seems unreasonable.
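A toy illustration of the non-convexity caveat, under loudly stated assumptions: the one-dimensional loss below is invented for this sketch and is not a claim about real networks or about any example in the post. It has two basins, and plain gradient descent ends up in whichever basin it starts in, even though one basin has strictly lower loss.

```python
def loss(x):
    """Invented non-convex loss with a lower minimum near x = -1
    and a higher local minimum near x = +1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def grad(x):
    """Analytic derivative of loss(x)."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def gradient_descent(x0, lr=0.05, steps=500):
    """Plain gradient descent from initial point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Two nearby initializations on opposite sides of the local maximum.
x_left = gradient_descent(-0.2)   # settles in the lower-loss basin
x_right = gradient_descent(0.2)   # settles in the higher-loss basin
```

The sketch is compatible with both halves of the point above: lower loss is what pulls the optimizer within a basin, and the interesting question is which specific basins a real training run can or cannot reach.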
This feels easier to me when writing and significantly harder for spoken language (where I have a strong inclination to add the s). I guess it’s probably worth thinking separately about talking vs writing, and this post is mostly about writing, in which case I’m probably sold.
I don’t feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).
ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and trying to identify cases in which our algorithms don’t yet meet those desiderata or they may be unachievable. The output is just actual code which is purported to solve major difficulties in alignment.
And on the flip side, I spend a significant amount of my time looking at the algorithms we are proposing (and the bigger plans into which they would fit if successful) and trying to find the best arguments I can that these plans will fail.
I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.
Put differently: I’m not saying that Nate and Eliezer are vague about problems but concrete about solutions, I’m saying they are vague about everything. And I don’t think they are saying that I’m concrete about problems but vague about solutions, they would say that I’m concrete about parts of the solution/problem that don’t matter while systematically pushing all the difficulty into the parts I’m still vague about.
I do think “how well do we understand the problem” seems like a pretty big crux; that leads Nate and Eliezer to think that I’m avoiding the predictably-important difficulty, and it leads me to think that Nate and Eliezer need to get more concrete in order to have an accurate picture of what’s going on.
I don’t think those are great summaries. I think this is probably some misunderstanding about what ARC is trying to do and about what I mean by “concrete.” In particular, “concrete” doesn’t mean “formalized,” it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why they lead to failure of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.
But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.
I don’t yet have this sense about a “sharp left turn” bottleneck.
I think I would agree with you if we’d looked at a bunch of plausible approaches, and then convinced ourselves that they would fail. And then we tried to introduce the sharp left turn to capture the unifying theme of those failures and to start exploring what’s really going on. At a high level that’s very similar to what ARC is doing day to day, looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.
But for the sharp left turn I think we basically don’t have examples. Existing alignment strategies fail in much more basic ways, which I’d call “concrete.” We don’t have examples of strategies that don’t run into concrete difficulties, but they fail for a vague and hard-to-understand reason that we’d summarize as a “sharp left turn.” So I don’t really believe that this difficulty is being abstracted from a pattern of failures.
There can be other ways to learn about problems, and I didn’t think Nate was even saying that this problem is derived from examples of obstructions to potential alignment approaches. I think Nate’s perspective is that he has some pretty good arguments and intuitions about why a sharp left turn will cause novel problems. And so a lot of what I’m saying is that I’m not yet buying it, that I think Nate’s argument has fatal holes in it which are hidden by its vagueness, and that if the arguments are really very important then we should be trying hard to make them more concrete and to address those holes.
Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself “the hard part”?
ARC does theoretical work guided by concrete stories about how a proposed AI system could fail; we are focused on the “legible part” insofar as we try to fix failures for which we can tell concrete stories. I’m not quite sure what you mean by “illegible” and so this might just be a miscommunication, but I think this is the relevant sense of “illegible” so I’ll respond briefly to it.
I think we can tell concrete stories about deceptive alignment; about ontology mismatches making it hard or meaningless to “elicit latent knowledge;” about exploitability of humans making debate impossible; and so on. And I think those stories we can tell seem to do a great job of capturing the reasons why we would expect existing alignment approaches to fail. So if we addressed these concrete stories I would feel like we’ve made real progress. That’s a huge part of my optimism about concrete stories.
It feels to me like either we are miscommunicating about what ARC is doing, or you are saying that those concrete difficulties aren’t the really important failures. That even if an alignment approach addressed all of them, it still wouldn’t represent meaningful progress because the true risk is the risk that cannot be named.
One thing you might mean is that “these concrete difficulties are just shadows of a deeper core.” But I think that’s not actually a challenge to ARC’s approach at all, and it’s not that different from my own view. I think that if you have an intuitive sense of a deep problem, then it’s really great to attack specific instantiations of the problem as a way to learn about the deep core. I feel pretty good about this approach, and I think it’s pretty standard in most disciplines that face problems like this (e.g. if you are deeply confused about physics, it’s good to think a lot about the simplest concrete confusing phenomenon and understand it well; if you are confused about how to design algorithms that overcome a conceptual barrier, it’s good to think about the simplest concrete task that requires crossing that barrier; etc.).
Another thing you might mean is that “these concrete difficulties are distractions from a bigger difficulty that emerges at a later step of the plan.” It’s worth noting that ARC really does try to look at the whole plan and pick the step that is most likely to fail. But I do think it would be a problem for our methodology if there is a good argument about why plans will fail, which won’t let us tell a concrete story about what the failure looks like. My position right now is that I don’t see such an argument; I think we have some vague intuitions, and we have a bunch of examples which do correspond to concrete failure stories. I don’t think there are any examples from which to infer the existence of a difficulty that can’t be captured in concrete stories, and I’m not yet aware of arguments that I find persuasive without any examples. But I’m really quite strongly in the market for such arguments.
Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It’s like the exact opposite of security mindset.
Here’s how the situation feels to me. I know this isn’t remotely fair as a summary of your view, it’s just intended to illustrate where ARC is coming from. (It’s also possible this is a research methodology disagreement, in which case I do just disagree strongly.)
Cryptographer: It seems like our existing proposals for “secure” communication are still vulnerable to man in the middle attacks. Better infrastructure for key distribution is one way to overcome this particular attack, so let’s try to improve that. We can also see how this might fit in with the rest of our security infrastructure to help build to a secure internet, though no doubt the details will change.
Cryptography skeptic: The real difficulty isn’t man in the middle attacks, it’s that security is really hard. By focusing on concrete stuff like man-in-the-middle you are overlooking the real nature of the problem, focusing on the known risks rather than the unknown unknowns. Someone with a true security mindset wouldn’t be fiddling around the edges like this.
I’m not saying that infrastructure for key distribution solves security (and indeed we have huge security problems). I’m saying that working on concrete problems is the right way to make progress in situations like this. I don’t think this is in tension with security mindset. In fact I think effective people with security mindset spend most of their time thinking about concrete risks and how to address them.
It’s great to generalize once you have a bunch of concrete risks and you think there is a deeper underlying pattern. But I think you basically need the examples to learn from, and if there is a real pattern then you should be able to instantiate it in any particular case rather than making reference to the pattern.
I think that the sharp left turn is also relevant to ELK, if it leads to your system not generalizing from “questions humans can answer” to “questions humans can’t answer.” My suspicion is that our key disagreements with Nate are present in the case of solving ELK and are not isolated to handling high-stakes failures.
(However it’s frustrating to me that I can never pin down Nate or Eliezer on this kind of thing, e.g. are they still pessimistic if there were a low-stakes AI deployment in the sense of this post?)
I’m going to spend most of this comment responding to your concrete remarks about ELK, but I wanted to start with some meta level discussion because it seems to cut closer to the heart of the issue and might be more generally applicable.
I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand and then try to make progress on them. This involves acknowledging that we can’t anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise. It means not becoming too pessimistic about a direction until we see fairly concretely where it’s stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.
My sense is that you have more faith in a rough intuitive sense you’ve developed of what the “hard part” of alignment is, and so you’d primarily recommend thinking about that until we feel less confused. I disagree in large part because I feel like your broad intuitive sense has not yet had much opportunity to make contact with either reality or with formal reasoning, and I’d guess it’s not precise enough to be a useful guide to research prioritization.
More concretely, you talk about novel mechanisms by which AI systems gain capabilities, but I think you haven’t said much concrete about why existing alignment work couldn’t address these mechanisms. This looks to me like a pretty unproductive stance; I suspect you are wrong about the shape of the problem, but if you are right then I think your main realistic path to impact involves saying something more concrete about why you think this.
I think you don’t see the situation the same way, probably because you feel like you have said plenty concrete. Perhaps this is the most serious disagreement of all. I don’t think saying there is a “capabilities well” is helpfully concrete until you say something about what it looks like, why it poses alignment problems different from SGD and why particular approaches don’t generalize, etc.
In ARC’s day to day work we write down particular models of capabilities that would generalize far outside of training (e.g.: what about a causal model of the world that holds robustly? what about logical deduction from valid premises with longer chains of reasoning? what about continuing to learn by trial and error when deployed in a novel environment?), and ask about whether a given alignment solution would generalize along with them. If we can find any gap, then it goes on the list of problems. We focus on the gaps that seem least likely to be addressable by using known techniques, and try to develop new techniques or to identify general reasons why the gap is unresolvable.
My guess is that you are playing a roughly similar game much more informally, and that you are just making a mistake because reasoning about this stuff is in fact hard. But I can’t really tell, since your thinking is happening in private and we are seeing the vague intuitions that result. (I’ve been hanging around MIRI for a long time, and I suspect I have a better model of your and Eliezer’s position than virtually anyone else outside of MIRI, yet this is still where I’m at.)
Anyway, now turning to your discussion of ELK in particular.
Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just “expose the bad behavior” to gradients that you can hit to correct the thing, at least not easily and quickly.
I often think and write about other places where capabilities may come from that could challenge our basic alignment plan. Four particularly salient examples:
Your AI might perform search internally, e.g. looking for hypotheses that match the data or for policies that work well.
Natural selection may occur internally, e.g. cognitive patterns that acquire power might tend to dominate the behavior of your AI (despite the AI having no explicit prediction that they would work well).
Your AI might reason about how to think better, e.g. select cognitive actions based on anticipated consequences of those cognitive actions.
Your AI might deploy new algorithms that pose their own alignment risk for different (potentially unanticipated) reasons.
Some of these represent real problems, but none of them seem to fundamentally change the game or be deal-breakers:
Aligning the internal search seems very similar to aligning SGD on the outside. We could distinguish two additional difficulties in this case:
Because the search is on the inside, we can’t directly apply our alignment insights to align it. Instead we need to ensure that SGD learns to align the search. This itself poses two difficulties: (a) the outer gradient needs to incentivize doing this, (b) we need to argue that it’s nearly as easy for SGD to learn the aligned search as to learn the unaligned search (or build scaffolding such that it becomes similarly easy). This is what we’re talking about in this appendix, and it’s part of why we are skeptical about approaches to ELK based on simple regularizers. But we don’t see a reason that either (a) or (b) would be a dealbreaker, and we tentatively think our current approaches to ontology identification would at least solve (a) if they were successful at all. It’s pretty hard to talk about (b) without having more clarity about what the alignment scheme actually looks like, but we don’t see an in-principle reason it’s hard.
The internal search algorithm may not be SGD, and perhaps our alignment strategy was specific to some detail of SGD. But SGD appears to be amongst the hardest search algorithms, and ARC tries to pursue approaches that work for other algorithms rather than leveraging anything about SGD in particular. We’re definitely in the market for other search algorithms that cause trouble but don’t yet know of any.
Natural selection on the inside is similar but potentially more tricky, because the optimizer has more limited control over how this search works. This is like the analog of memetic selection being smarter than humans and eventually overpowering or hijacking human consequentialism. Another extreme example is that it seems like a large enough neural network may be catastrophically misaligned at initialization simply because of selection amongst activation patterns within a single forward pass. Ultimately we’d like to handle this in exactly the same way that we handle the last point, by some combination of (a) we can just directly apply the same hope from the previous section even to natural selection, (b) we can run explicit searches that are more powerful than implicit search by natural selection within our model, which requires ensuring that our explicit learned search captures whatever is good about natural selection (this seems tough but not at all obviously impossible to me). It’s hard to talk about option (a) without seeing if/how we solve the problem from the last point. We could definitely work on option (b) now, and a large enough ARC would be working on it, but it seems like a relatively low priority since it’s both very remote from existing systems and seems relatively unlikely-to-me to be the simplest place where we get stuck.
If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we’ve been succeeding at alignment so far then the model will be trying to stay aligned. By analogy, if humans cared about the amount of human DNA in the universe, then to the extent that cultural evolution was guided by human consequentialism (rather than e.g. being memetic selection), we would be trying to develop cultural machinery that was helpful for maximizing the amount of human DNA in the universe.
One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). I think this is a real problem, but there are a lot of reasons I don’t consider it an existential challenge for our approach:
If you’ve succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or do something else, and our role is to set them up for success just like we are trying to set up future humans for success. AI compresses the timescale both for “new AI algorithms with new alignment problems” and for all of the solutions to those problems, so I don’t think it changes the game from future humans. And so I’d focus on prosaic AI alignment for exactly the same reasons I focus on prosaic AI alignment when trying to help future humans succeed at alignment.
I think that we should be considering the particular algorithms that might pose a new alignment problem, and trying to solve alignment for each of them. If we have some general reason to think that new algorithms will be much harder than old algorithms, or that lessons won’t transfer, then we can discuss those and whether they should affect research prioritization. So far I don’t think we have such arguments, and so I think we should just be looking for algorithms that might pose problems. (I don’t actually think that’s the highest priority, because prosaic ML so obviously poses problems, and the other problems we see seem so closely analogous to the ones posed by prosaic ML. But I’m certainly in the market for other problems and think that a large enough research community should already be actively looking for them.)
Possible disagreements between us: (i) you think that at least one of these examples looks really bad for our approach, (ii) you have other examples in mind, (iii) you don’t think we can write down a concrete example that looks bad, but we have reason to expect other kinds of capability gains that will be bad, (iv) nothing looks like a dealbreaker in particular, but it’s just contributing to a long list of problems you’d have to solve and that’s either a lot of work or something probably won’t work out.
For me, the upshot of all of this is that SGD poses some obvious problems, that those problems are the most likely to actually occur, that they seem similar to (and at least subproblems of) the other alignment problems we may face, and that there are neither super compelling alternatives to aligning SGD nor particular arguments that the rest of the problem is harder than this step.
Your second problem is that the AGI’s concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They’ve got some hunger instincts in there, but it’s not like they’re smart enough yet to represent the concept of “inclusive genetic fitness” correctly, so you figure you’ll just fix it when they get capable enough to understand the alternative (of eating because it’s instrumentally useful for procreation). And so far you’re doing great: you’ve basically decoded the visual cortex, and have a pretty decent understanding of what it’s visualizing.
Our goal is to learn a reporter that describes the latent knowledge of the model, and to keep this up to date as the model changes under SGD. If thinking about SGD, we usually think concretely about a single step of SGD, and how you could find a good reporter at the end of that gradient descent step assuming you had one at the beginning.
It feels to me like what you are saying here is just “you might not be able to solve ELK.” Or else maybe restating the previous point, that the model builds latent knowledge by mechanisms other than SGD and therefore you need to learn a reporter that can also follow along with those other mechanisms.
In either case, I can’t speak to whether it’s helpful for the audience understanding why ELK is hard, but it is certainly not helping me understand why you think ELK is hard. I think this discussion is just too vague to be helpful.
I think it’s not crazy for you to say “ARC’s hopes about how to solve ELK are too vague to seem worth engaging with” (this is pretty similar to me saying “Nate’s arguments about why alignment is hard are too vague to seem worth engaging with”).
Analogously, your ELK head’s abilities are liable to fall off a cliff right as the AGI’s capabilities start generalizing way outside of its training distribution.
But can you say something concrete about why? What I’d like to do is talk about what the AGI is actually thinking, the particular computation it’s running, so that we can talk about why that computation keeps being correlated with reality off distribution and then ask whether the reporter remains correlated with reality. When I go through this exercise I don’t see big dealbreakers, and I can’t tell if you disagree with that diagnosis, or if you are noticing other things that might be going on inside the AI, or if the difference is that I think “this looks like it might work in all the concrete cases we can see” is a relevant signal and you think “nah the cases we can’t see are way worse than those we can see.”
And if they don’t, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left-turns, rather than systems that ground along today’s scaling curves until they scaled that far.
Again, this seems too vague to be helpful, or perhaps just mistaken. The reporter is not some other AI looking at your predictor and trying to “decode its workings,” or maybe it is but if so it’s just because those english words are vague and broad. Can we talk about the particular kinds of cognition that your AI might be performing, such that you don’t think this works? (Or which would require the reporter to itself be using magic-mystery-juice-of-intelligence?)
That’s really the central theme of my response, so it’s worth restating: ARC loves examples of ways an AI might be thinking such that ELK is difficult. But your description of the sharp left turn is too vague to be helpful for this purpose, and so I’d either like to turn this into more concrete discussion of the internals of the algorithm, or else some significantly more precise argument about why we expect the unknown possible internals to be so much less favorable for ELK than any of the concrete examples we can write down.
I’d like to head off a possible response you might make that I disagree with: “Sure your algorithm works for any example you can write down, but the whole point is that you need it to work for alien cognition, where humans don’t understand why it works. So of course it works on concrete examples but not in the unknown real world.” I’m putting this in a footnote because it seems like a digression and I have no idea if this is your view.
My main response is that we can in fact talk about concrete examples where “why your AI system’s cognition works” isn’t accessible to humans in the relevant ways:
We can consider tricky facts we understand about how to reason, for which our discovery of those facts is empirically contingent (and where discovering those facts is harder than discovering the reasoning itself). Then we can consider whether our AI alignment strategies would work even if humans hadn’t figured out the relevant facts about reasoning.
We can consider AI cognition which is contingent on hypothesized unknown-to-human facts, e.g. about the causal structure of reality, or about key facts about mathematics, or whatever else.
Most of our ELK approaches don’t make no-holds-barred use of “can a human come up with some story about why this AI cognition may work,” and so this just isn’t a particularly salient threshold anyway. As a silly example, if you were solving this problem with a speed prior (or indeed with any of the approaches in the regularization section of the ELK document) you wouldn’t expect a particular key threshold at the space of strategies that a human understands.
“Floating point operations per second” is usually abbreviated FLOPS. I think using FLOPs as a synonym for FLOPS seems wrong in a particularly confusing way. It really suggests that FLOP is the acronym and the “s” is for a plural. Compare to someone saying SMs instead of SMS.
That said, I agree that using FLOPs as the plural of “floating point operation” is confusing since it is pretty similar to FLOPS. It’s also not something I’ve seen much outside of the futurist crowd (though I am guilty of using it fairly often). Another way in which the ML and futurist usage is nonstandard is that we often use FLOPS sloppily to talk about very low precision or even integer arithmetic.
I haven’t personally seen this cause very much confusion; in my experience it maybe increases the amount of “compute stock vs flow” confusion by like 25-50% from a small-ish baseline? I suspect it’s small enough that accessibility and using standard notation is significantly more important. (Though again, my sense is that FLOPs for “floating point operations” is not really standard and not particularly accessible, so it seems more plausible futurists should stop doing that.)
Not sure what’s realistic to do here. I’m a bit inclined to use “operations” or “ops” to both avoid this ambiguity and be more inclusive of integer arithmetic, but I kind of wouldn’t expect that to stick and maybe it’s bad in some other way. Using FLOP as a plural noun is pretty unsatisfying and I suspect it is unlikely to stick for fundamental linguistic reasons. Switching to using FLOP/s instead of FLOPS may be realistic, though I think flop/s is a bit less weird-looking to me.
ETA: though I think openphil has been using FLOP/s and FLOP as you suggest in their writing, so maybe I’m wrong about whether it’s realistic. I think the draw to add an s to plurals is strongest in spoken language, though unfortunately that’s also where the ambiguity is worst.
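To make the stock-vs-flow distinction concrete, here is a minimal sketch of the arithmetic. The function name and the example figures (a device sustaining 1e15 FLOP/s for one day) are hypothetical and chosen only for illustration; they are not taken from any real hardware or from the discussion above.

```python
def total_flop(rate_flop_per_s: float, seconds: float) -> float:
    """Convert a flow (FLOP/s, a rate) into a stock (FLOP, a quantity)."""
    return rate_flop_per_s * seconds


# Hypothetical device: sustains 1e15 FLOP/s (a rate, i.e. "FLOPS").
rate = 1e15
# Run it for one day: 24 * 60 * 60 = 86,400 seconds.
seconds_per_day = 24 * 60 * 60

# The result is a count of operations ("FLOP" / "FLOPs"), not a rate.
stock = total_flop(rate, seconds_per_day)
print(f"{stock:.3e} FLOP performed in one day")
```

Writing the units as FLOP/s and FLOP makes it hard to confuse the two: the first is what a chip delivers per second, the second is what a training run consumes in total.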
Yes, I think if possible you’d want to resolve to continue caring about copies even after you learn which one you are. I don’t think that you particularly want to rewind values to before prior changes, though I do think that standard decision-theoretic or “moral” arguments have a lot of force in this setting and are sufficient to recover high degrees of altruism towards copies and approximately pareto-efficient behavior.
I think it’s not clear if you should self-modify to avoid preference change unless doing so is super cheap (because of complicated decision-theoretic relationships with future and past copies of yourself, as discussed in some other comments). But I think it’s relatively clear that if your preferences were going to change into either A or B stochastically, it would be worth paying to modify yourself so that they change into some appropriately-weighted mixture of A and B. And in this case that’s the same as having your preferences not change, and so we have an unusually strong argument for avoiding this kind of preference change.
I generally agree that a creature with inconsistent preferences should respect the values of its predecessors and successors in the same kind of way that it respects the values of other agents (and that the similarity somewhat increases the strength of that argument). It’s a subtle issue, especially when we are considering possible future versions of ourselves with different preferences (just as it’s always subtle how much to respect the preferences of future creatures who may not exist based on our actions). I lean towards being generous about the kinds of value drift that have occurred over the previous millennia (based on some kind of “we could have been in their place” reasoning) while remaining cautious about sufficiently novel kinds of changes in values.
In the particular case of the inconsistencies highlighted by transparent Newcomb, I think that it’s unusually clear that you want to avoid your values changing—because your current values are a reasonable compromise amongst the different possible future versions of yourself, and maintaining those values is a way to implement important win-win trades across those versions.
Yes, I think this kind of cooperation would only work for UDT agents (or agents who are uncertain about whether they are in someone’s imagination or whatever).
A reader who isn’t sympathetic to UDT can just eliminate the whole passage “But there are still options: …”, it’s not essential to the point of the post. It only serves to head off the prospect of a UDT-advocate arguing that the agent is being unreasonable by working at cross-purposes to itself (and I should have put this whole discussion in an appendix, or at least much better sign-posted what was going on).
Fast imitations of subhuman behavior or imitations of augmented humans are also superhuman. As is planning against a human-level imitation. And so on.
It’s unclear if systems trained in that way will be imitating a process that optimizes, or will be optimizing in order to imitate. (Presumably they are doing both to varying degrees.) I don’t think this can be settled a priori.
I think that future AI technology could automate my job. I think it could also automate capability researchers’ jobs. (It could also help in lots of other ways, but this point seems sufficient to highlight the difference between our views.)
I don’t think that being more useful for alignment is a necessary claim for my position. We are talking about what we want our aligned AIs to do for us, and hence what we should have in mind while doing AI alignment research. If we think AI accelerates technological progress across the board, then the answer “we want our AI to keep accelerating good stuff happening in the world at the same rate that it accelerates dangerous technology” seems like it’s valid.
I’d say “mainstream opinion” (in either ML broadly, “safety” or “ethics,” AI policy) is generally focused on misuse relative to alignment—even without conditioning on “competitive alignment solution.” I normally disagree with this mainstream opinion, and I didn’t mean to endorse the opinion in virtue of its mainstream-ness, but to identify it as the mainstream opinion. If you don’t like the word “mainstream” or view the characterization as contentious, feel free to ignore it, I think it’s pretty tangential to my post.
I’m happy to leave it up to the reader to decide if the claim (“world government likely to come from AI lab rather than boring political change”) is surprising. I’m also happy if people read my sentence as an expression of my opinion and explanation of why I’m engaging with other parts of Eliezer’s views rather than as an additional argument.
I agree some parts of my comment are just expressions of frustration rather than useful contributions.
I’m not thinking of AI that is faithful to what humans would do, just AI that at all represents human interests well enough that “the AI had 100 years to think” is meaningful. If you don’t have such an AI, then (i) we aren’t in the competitive AI alignment world, (ii) you are probably dead anyway.
If you think in terms of calendar time, then yes everything happens incredibly quickly. It’s weird to me that Rob is even talking about “5 years” (though I have no idea what AGI means, so maybe?). I would usually guess that 5 calendar years after TAI is probably post-singularity, so effectively many subjective millennia and so the world is unlikely to closely resemble our world (at least with respect to governance of new technologies).
It sounds like your view is “given continued technological change, we need strong international coordination to avoid extinction, and that requires a ‘pivotal act.’”
But that “pivotal act” is a long time in the subjective future, the case for it being a single “act” is weak, the kinds of pivotal acts being discussed seem totally inappropriate in this regime, and the discussion overall feels pretty inappropriate with very little serious thought by the participants.
For example, my sense from this discourse is that MIRI folks think a strong world government is more likely to come from an AI lab taking over the world than from a more boring looking process of gradual political change or conflict amongst states (and that this is the large majority of how discussed pivotal acts address the problem you are mentioning). I disagree with that, don’t think it’s been argued for, and don’t think the surprisingness of the claim has even been acknowledged and engaged with.
I disagree with the whole spirit of the sentence “misaligned systems just happen not to be able to find a way to kill all humans”:
I don’t think it’s about misaligned AI. I agree with the mainstream opinion that if competitive alignment is solved, humans deliberately causing trouble represent a larger share of the problem than misaligned AI.
“Just happen not to be able to” is a construction with some strong presuppositions baked in. I could have written the same sentence about terrorists, they “just happen not to be able to” find a way to kill all humans. Yes over time this will get easier, but it’s not like something magical happens, the offense-defense balance and vulnerability to terrorism will gradually increase.
In the world where new technologies destroy the world, I think the default response is a combination of:
We build technologies that improve robustness to particular destructive technologies (especially bioterrorism in the near term, but on the scale of subjective decades I agree that new technologies will arise).
States enforce laws and treaties limiting access to particular destructive technology or making it harder for people to destroy the world (again, likely to be a stopgap over the scale of subjective centuries if not before).
For technologies where it’s impossible to make narrow agreements to restrict access, we aim for stronger general agreements and maybe strong world government. (I do think this happens eventually. Here’s a post where I think through some of these issues for myself.)
Overall, these don’t seem like problems current humans need to deal with. I’m very excited for some people to be thinking through these problems, because I do think that helps put us in a better position to solve these problems in the future (and solving them pre-AI would remove the need for technical solutions to alignment!). But I don’t currently think they have a big effect on how we think about the alignment problem.
I don’t think I can follow your calculation. My version would be:
You are intaking hot wet outside air (wet from both high RH and high temp). You need to cool it and condense a bunch of water out of it. There’s some ratio that’s fixed by the humidity and temperature of the outside vs inside air. I think that’s what you are saying is around 40%? I think actually the number you are giving isn’t quite what this calculation needs, but I’ll run with it anyway.
If all the heat was coming in from outside air (either before turning on AC or from infiltration), then you’d have a fixed ratio of latent to sensible heat removed, so the ratio wouldn’t depend on how much additional infiltration you caused, and we could just ignore humidity when thinking about the efficiency loss.
But in fact some of the heat is coming in from other channels. I guess the other big one is sunlight through windows. That heat doesn’t come with any more humidity. Extra infiltration from 2-hose AC increases how much latent heat you need to remove per unit of sensible heat, by increasing the relative importance of infiltrated air vs sunlight and other sources of heat. So if we just calculate how much extra sensible heat you have to remove, we’ll underestimate the efficiency loss.
The total extra infiltrated heat is about 25% of what the AC removes. At equilibrium, that’s 25% of all the heat gain in the house. If 13% of heat gain is normally from infiltration, then replacing that with 75% normal heat and 25% new infiltration would increase the fraction of heat from infiltration all the way to 35%. (I was super wrong about the 13% going in, I was expecting 25-50%!)
So per unit of heating, you are also increasing the fraction of heat coming from infiltrated air by 22%.
For the heat coming from infiltration, the extra cost of dehumidifying is about 2⁄3 of the sensible heat removed. So per unit of sensible heat removed, you need to remove an additional 15% of a unit of latent heat.
If the AC exhaust was more humid than the inside, then this would be lower, but my sense is that AC exhaust is basically as dry as indoor air?
So the net effect would be to take you from 25% efficiency loss (ignoring humidity) up to roughly 40% efficiency loss, which is pretty huge.
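For anyone who wants to check the arithmetic, here is a rough sketch of the steps above in Python. The inputs are just the figures quoted in this comment (25% extra infiltration, 13% baseline infiltration share, latent heat about 2/3 of sensible heat for infiltrated air); the function and variable names are made up for illustration, and the model inherits whatever conceptual confusions the calculation itself has.

```python
def infiltration_fraction(extra: float, baseline: float) -> float:
    """Fraction of total heat gain that arrives via infiltrated air once
    `extra` of the heat gain is new infiltration and the remaining
    (1 - extra) keeps the normal `baseline` infiltration share."""
    return (1 - extra) * baseline + extra


def efficiency_loss(extra: float, baseline: float, latent_ratio: float) -> float:
    """Efficiency loss including the added dehumidification load."""
    # Increase in the share of heat coming from infiltrated air
    # (a difference in shares, i.e. percentage points).
    increase = infiltration_fraction(extra, baseline) - baseline
    # Additional latent heat to remove per unit of sensible heat removed.
    extra_latent = increase * latent_ratio
    # Loss ignoring humidity plus the additional latent load.
    return extra + extra_latent


# Figures quoted above; treat all three as rough assumptions.
extra = 0.25         # extra infiltrated heat / heat the AC removes
baseline = 0.13      # normal share of heat gain from infiltration
latent_ratio = 2 / 3 # latent heat per unit sensible heat, infiltrated air

print(f"infiltration share: {infiltration_fraction(extra, baseline):.1%}")
print(f"efficiency loss:    {efficiency_loss(extra, baseline, latent_ratio):.1%}")
```

With these inputs the infiltration share comes out a bit under 35% and the total loss a bit under 40%, matching the rounded numbers above. Note that the 22% figure is a difference in shares (35% minus 13%), not a relative increase.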
That was a super confusing calculation, definitely beyond my pay grade. I assume I got a ton of numbers/calculations wrong, that there were much simpler ways to do it, and that this overall computation is likely to be conceptually confused in one or more ways. So I’d be pretty curious for your bottom line estimate or intuition about where it should have ended up.
(But I also understand if you want to stop talking about AC and put this thread to rest...)