Where I agree and disagree with Eliezer
(Partially in response to AGI Ruin: A list of Lethalities. Written in the same rambling style. Not exhaustive.)
Powerful AI systems have a good chance of deliberately and irreversibly disempowering humanity. This is a much easier failure mode than killing everyone with destructive physical technologies.
Catastrophically risky AI systems could plausibly exist soon, and there likely won’t be a strong consensus about this fact until such systems pose a meaningful existential risk per year. There is not necessarily any “fire alarm.”
Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.
I think that many of the projects intended to help with AI alignment don’t make progress on key difficulties and won’t significantly reduce the risk of catastrophic outcomes. This is related to people gravitating to whatever research is most tractable and not being too picky about what problems it helps with, and related to a low level of concern with the long-term future in particular. Overall, there are relatively few researchers who are effectively focused on the technical problems most relevant to existential risk from alignment failures.
There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better. I think Eliezer’s term “the last derail” is hyperbolic but on point.
Even when thinking about accident risk, people’s minds seem to go to what they think of as “more realistic and less sci fi” risks that are much less likely to be existential (and sometimes I think less plausible). It’s very possible this dynamic won’t change until after actually existing AI systems pose an existential risk.
There is a good chance that an AI catastrophe looks like an abrupt “coup” where AI systems permanently disempower humans with little opportunity for resistance. People seem to consistently round this risk down to more boring stories that fit better with their narratives about the world. It is quite possible that an AI coup will be sped up by humans letting AI systems control killer robots, but the difference in timeline between “killer robots everywhere, AI controls everything” and “AI only involved in R&D” seems like it’s less than a year.
The broader intellectual world seems to wildly overestimate how long it will take AI systems to go from “large impact on the world” to “unrecognizably transformed world.” This is more likely to be years than decades, and there’s a real chance that it’s months. This makes alignment harder and doesn’t seem like something we are collectively prepared for.
Humanity usually solves technical problems by iterating and fixing failures; we often resolve tough methodological disagreements very slowly by seeing what actually works and having our failures thrown in our face. But it will probably be possible to build valuable AI products without solving alignment, and so reality won’t “force us” to solve alignment until it’s too late. This seems like a case where we will have to be unusually reliant on careful reasoning rather than empirical feedback loops for some of the highest-level questions.
AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.
If you had incredibly powerful unaligned AI systems running on a server farm somewhere, there is very little chance that humanity would maintain meaningful control over its future.
“Don’t build powerful AI systems” appears to be a difficult policy problem, requiring geopolitical coordination of a kind that has often failed even when the stakes are unambiguous and the pressures to defect are much smaller.
I would not expect humanity to necessarily “rise to the challenge” when the stakes of a novel problem are very large. I was 50-50 about this in 2019, but our experience with COVID has further lowered my confidence.
There is probably no physically-implemented reward function, of the kind that could be optimized with SGD, that we’d be happy for an arbitrarily smart AI to optimize as hard as possible. (I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.)
Training an AI to maximize a given reward function does not generically produce an AI which is internally “motivated” to maximize reward. Moreover, at some level of capability, a very wide range of motivations for an AI would lead to loss-minimizing behavior on the training distribution because minimizing loss is an important strategy for an AI to preserve its influence over the world.
It is more robust for an AI system to learn a good model for the environment, and what the consequences of its actions will be, than to learn a behavior like “generally being nice” or “trying to help humans.” Even if an AI was imitating data consisting of “what I would do if I were trying to be nice,” it would still be more likely to eventually learn to imitate the actual physical process producing that data rather than absorbing some general habit of niceness. And in practice the data we produce will not be perfect, and so “predict the physical process generating your losses” is going to be positively selected for by SGD.
You shouldn’t say something like “well I might as well assume there’s a hope” and thereby live in a specific unlikely world where alignment is unrealistically easy in one way or another. Even if alignment ends up easy, you would be likely to end up predicting the wrong way for it to be easy. If things look doomed to you, in practice it’s better to try to maximize log odds of success as a more general and robust strategy for taking advantage of lucky breaks in a messy and hard-to-predict world.
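As a toy illustration of why log odds is a natural scale here (the numbers and the helper names `log_odds` and `gain` are assumptions made for this sketch, not something taken from the discussion above): on a log-odds scale, moving from a 1% to a 2% chance of success counts as a far larger improvement than moving from 50% to 51%, even though both are one percentage point.

```python
import math


def log_odds(p: float) -> float:
    """Natural-log odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def gain(p_before: float, p_after: float) -> float:
    """Improvement in log odds when the success probability changes."""
    return log_odds(p_after) - log_odds(p_before)


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to show the asymmetry.
    print("1% -> 2%  :", gain(0.01, 0.02))
    print("50% -> 51%:", gain(0.50, 0.51))
```

Under this measure, small lucky breaks in a bleak-looking world still register as meaningful progress, which is roughly the spirit of the strategy described above.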
No current plans for aligning AI have a particularly high probability of working without a lot of iteration and modification. The current state of affairs is roughly “if alignment turns out to be a real problem, we’ll learn a lot about it and iteratively improve our approach.” If the problem is severe and emerges quickly, it would be better if we had a clearer plan further in advance—we’d still have to adapt and learn, but starting with something that looks like it could work on paper would put us in a much better situation.
Many research problems in other areas are chosen for tractability or being just barely out of reach. We pick benchmarks we can make progress on, or work on theoretical problems that seem well-posed and approachable using existing techniques. Alignment isn’t like that; it was chosen to be an important problem, and there is no one ensuring that the game is “fair” and that the problem is soluble or tractable.
(Mostly stated without argument.)
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology. We have toy models of alignment failures, we have standards for interpretability that we can’t yet meet, and we have theoretical questions we can’t yet answer. The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment. Overall this consideration seems like it makes the institutional problem vastly harder, but does not have such a large effect on the scientific problem.
Eliezer often talks about AI systems that are able to easily build nanotech and overpower humans decisively, and describes a vision of a rapidly unfolding doom from a single failure. This is what would happen if you were magically given an extraordinarily powerful AI and then failed to align it, but I think it’s very unlikely what will happen in the real world. By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans in more boring ways or else radically advance the state of human R&D. More generally, the cinematic universe of Eliezer’s stories of doom doesn’t seem to me like it holds together, and I can’t tell if there is a more realistic picture of AI development under the surface.
One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impressive-looking.
AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do. “AI smart enough to improve itself” is not a crucial threshold; AI systems will get gradually better at improving themselves. Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.
The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.
Many of the “pivotal acts” that Eliezer discusses involve an AI lab achieving a “decisive strategic advantage” (i.e. overwhelming hard power) that they use to implement a relatively limited policy, e.g. restricting the availability of powerful computers. But the same hard power would also let them arbitrarily dictate a new world order, and would be correctly perceived as an existential threat to existing states. Eliezer’s view appears to be that a decisive strategic advantage is the most realistic way to achieve these policy goals, despite the fact that building powerful enough AI systems runs an overwhelming risk of destroying the world via misalignment. I think that preferring this route to more traditional policy influence requires extreme confidence about details of the policy situation; that confidence might be justified by someone who knew a lot more about the details of government than I do, but Eliezer does not seem to. While I agree that this kind of policy change would be an unusual success in historical terms, the probability still seems much higher than Eliezer’s overall probabilities of survival. Conversely, I think Eliezer greatly underestimates how difficult it would be for an AI developer to covertly take over the world, how strongly and effectively governments would respond to that possibility, and how toxic this kind of plan is.
I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems with those ideas, proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research. Eliezer is right that this doesn’t make the problem go away (if humans don’t solve alignment, then why think AIs will solve it?) but I think it does mean that arguments about how recursive self-improvement quickly kicks you into a lethal regime are wrong (since AI is accelerating the timetable for both alignment and capabilities).
When talking about generalization outside of the training distribution, I think Eliezer is generally pretty sloppy. I think many of the points are roughly right, but that it’s way too sloppy to reach reasonable conclusions after several steps of inference. I would love to see real discussion of these arguments, and in some sense it seems like Eliezer is a good person to push that discussion forward. Right now I think that relevant questions about ML generalization are in fact pretty subtle; we can learn a lot about them in advance but right now just mostly don’t know. Similarly, I think Eliezer’s reasoning about convergent incentives and the deep nature of consequentialism is too sloppy to get to correct conclusions and the resulting assertions are wildly overconfident.
In particular, existing AI training strategies don’t need to handle a “drastic” distribution shift from low levels of intelligence to high levels of intelligence. There’s nothing in the foreseeable ways of building AI that would call for a big transfer like this, rather than continuously training as intelligence gradually increases. Eliezer seems to partly be making a relatively confident claim that the nature of AI is going to change a lot, which I think is probably wrong and is clearly overconfident. If he had been actually making concrete predictions over the last 10 years I think he would be losing a lot of them to people more like me.
Eliezer strongly expects sharp capability gains, based on a combination of arguments that I think don’t make sense and an analogy with primate evolution which I think is being applied poorly. We’ve talked about this before, and I still think Eliezer’s position is probably wrong and clearly overconfident. I find Eliezer’s more detailed claims, e.g. about hard thresholds, to be much more implausible than his (already probably quantitatively wrong) claims about takeoff speeds.
Eliezer seems confident about the difficulty of alignment based largely on his own experiences working on the problem. But in fact society has spent very little total effort working on the problem, and MIRI itself would probably be unable to solve or even make significant progress on the large majority of problems that existing research fields routinely solve. So I think right now we mostly don’t know how hard the problem is (but it may well be very hard, and even if it’s easy we may well fail to solve it). For example, the fact that MIRI tried and failed to find a “coherent formula for corrigibility” is not much evidence that corrigibility is “unworkable.”
Eliezer says a lot of concrete things about how research works and about what kind of expectation of progress is unrealistic (e.g. talking about bright-eyed optimism in list of lethalities). But I don’t think this is grounded in an understanding of the history of science, familiarity with the dynamics of modern functional academic disciplines, or research experience. The Eliezer predictions most relevant to “how do scientific disciplines work” that I’m most aware of are incorrectly predicting that physicists would be wrong about the existence of the Higgs boson (LW bet registry) and expressing the view that real AI would likely emerge from a small group rather than a large industry (pg 436 but expressed many places).
I think Eliezer generalizes a lot from pessimism about solving problems easily to pessimism about solving problems at all; or from the fact that a particular technique doesn’t immediately solve a problem to pessimism about the helpfulness of research on that technique. I disagree with Eliezer about how research progress is made, and don’t think he has any special expertise on this topic. Eliezer often makes objections to particular implementations of projects (like using interpretability tools for training). But in order to actually talk about whether a research project is likely to succeed, you really really need to engage with the existential quantifier where future researchers get to choose implementation details to make it work. At a minimum that requires engaging with the strongest existing versions of these proposals, and if you haven’t done that (as Eliezer hasn’t) then you need to take a different kind of approach. But even if you engage with the best existing concrete proposals, you still need to think carefully about whether your objections are the kind of thing that will be hard to overcome as people learn more details in the future. One way of looking at this is that Eliezer is appropriately open-minded about existential quantifiers applied to future AI systems thinking about how to cause trouble, but seems to treat existential quantifiers applied to future humans in a qualitatively rather than quantitatively different way (and as described throughout this list I think he overestimates the quantitative difference).
As an example, I think Eliezer is unreasonably pessimistic about interpretability while being mostly ignorant about the current state of the field. This is true both for the level of understanding potentially achievable by interpretability, and the possible applications of such understanding. I agree with Eliezer that this seems like a hard problem and many people seem unreasonably optimistic, so I might be sympathetic if Eliezer was making claims with moderate confidence rather than high confidence. As far as I can tell most of Eliezer’s position here comes from general intuitions rather than arguments, and I think those are much less persuasive when you don’t have much familiarity with the domain.
Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects (initially involving a lot of humans but over time increasingly automated). When Eliezer dismisses the possibility of AI systems performing safer tasks millions of times in training and then safely transferring to “build nanotechnology” (point 11 of list of lethalities) he is not engaging with the kind of system that is likely to be built or the kind of hope people have in mind.
List of lethalities #13 makes a particular argument that we won’t see many AI problems in advance; I feel like I see this kind of thinking from Eliezer a lot but it seems misleading or wrong. In particular, it seems possible to study the problem that AIs may “change [their] outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over [them]” in advance. And while it’s true that if you fail to solve that problem then you won’t notice other problems, this doesn’t really affect the probability of solving alignment overall: if you don’t solve that problem then you die, and if you do solve that problem then you can study the other problems.
I don’t think list of lethalities is engaging meaningfully with the most serious hopes about how to solve the alignment problem. I don’t think that’s necessarily the purpose of the list, but it’s quite important if you want to assess the probability of doom or contribute meaningfully to solving the problem (or to complain about other people producing similar lists).
I think that natural selection is a relatively weak analogy for ML training. The most important disanalogy is that we can deliberately shape ML training. Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior. If that breeding process was continuously being run carefully by the smartest of the currently-friendly humans, it seems like it would plausibly break down at a level very far beyond current human abilities.
Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. In addition to disliking his concept of pivotal acts, I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D, where verification is much, much easier than generation in virtually every domain.
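A toy sketch of the verification/generation gap, using subset sum as a stand-in (this choice of problem, and the helper names `verify` and `generate`, are assumptions made for illustration; it is not a claim about how any real R&D proposal would be checked). Checking a proposed answer takes a single pass over the proposal, while the naive way of producing one searches over exponentially many subsets.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple


def verify(numbers: Sequence[int], target: int, indices: Sequence[int]) -> bool:
    """Check a proposed solution: distinct, in-range indices summing to target."""
    return (
        len(set(indices)) == len(indices)
        and all(0 <= i < len(numbers) for i in indices)
        and sum(numbers[i] for i in indices) == target
    )


def generate(numbers: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Brute-force search over all non-empty subsets (exponential in len(numbers))."""
    positions = range(len(numbers))
    for size in range(1, len(numbers) + 1):
        for indices in combinations(positions, size):
            if sum(numbers[i] for i in indices) == target:
                return indices
    return None


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]  # hypothetical instance
    proposal = generate(numbers, 9)
    print("proposal:", proposal)
    print("verified:", proposal is not None and verify(numbers, 9, proposal))
```

The analogy is loose: real research claims are rarely this cleanly checkable, and the sketch says nothing about whether a persuasive but wrong proposal could fool a human verifier, which is the part actually in dispute.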
Eliezer is relatively confident that you can’t train powerful systems by imitating human thoughts, because too much of human thinking happens under the surface. I think this is fairly plausible but it’s not at all obvious, and moreover there are plenty of techniques intermediate between “copy individual reasoning steps” and “optimize end-to-end on outcomes.” I think that the last 5 years of progress in language modeling have provided significant evidence that training AI to imitate human thought may be economically competitive at the time of transformative AI, potentially bringing us to something more like a 50-50 chance. I can’t tell if Eliezer should have lost Bayes points here, but I suspect he would have and if he wants us to evaluate his actual predictions I wish he would say something about his future predictions.
These last two points (and most others from this list) aren’t actually part of my central alignment hopes or plans. Alignment hopes, like alignment concerns, can be disjunctive. In some sense they are even more disjunctive, since the existence of humans who are trying to solve alignment is considerably more robust than the existence of AI systems who are trying to cause trouble (such AIs only exist if humans have already failed at significant parts of alignment). Although my research is focused on cases where almost every factor works out against us, I think that you can get a lot of survival probability from easier worlds.
Eliezer seems to be relatively confident that AI systems will be very alien and will understand many things about the world that humans don’t, rather than understanding a similar profile of things (but slightly better), or having weaker understanding but enjoying other advantages like much higher serial speed. I think this is very unclear and Eliezer is wildly overconfident. It seems plausible that AI systems will learn much of how to think by predicting humans even if human language is a uselessly shallow shadow of human thought, because of the extremely short feedback loops. It also seems quite possible that most of their knowledge about science will be built by an explicit process of scientific reasoning and inquiry that will proceed in a recognizable way to human science even if their minds are quite different. Most importantly, it seems like AI systems have huge structural advantages (like their high speed and low cost) that suggest they will have a transformative impact on the world (and obsolete human contributions to alignment [retracted]) well before they need to develop superhuman understanding of much of the world or tricks about how to think, and so even if they have a very different profile of abilities to humans they may still be subhuman in many important ways.
AI systems reasoning about the code of other AI systems is not likely to be an important dynamic for early cooperation between AIs. Those AI systems look very likely to be messy, such that the only way AI systems will reason about their own or others’ code is by looking at behavior and using the same kinds of tools and reasoning strategies as humans. Eliezer has a consistent pattern of identifying important long-run considerations, and then flatly asserting that they are relevant in the short term without evidence or argument. I think Eliezer thinks this pattern of predictions isn’t yet conflicting with the evidence because these predictions only kick in at some later point (but still early enough to be relevant), but this is part of what makes his prediction track record impossible to assess and why I think he is greatly overestimating it in hindsight.
Eliezer’s model of AI systems cooperating with each other to undermine “checks and balances” seems wrong to me, because it focuses on cooperation and the incentives of AI systems. Realistic proposals mostly don’t need to rely on the incentives of AI systems; they can instead rely on gradient descent selecting for systems that play games competitively, e.g. by searching until we find an AI which raises compelling objections to other AI systems’ proposals. (Note: I may be misunderstanding what he’s saying in places like list of lethalities 35; those may be a response to random things people say on the internet rather than engaging with alignment strategies that are being explored in practice. But even then I think he is responding to relatively weak versions of these arguments.) Eliezer equivocates between a line like “AI systems will cooperate” and “The verifiable activities you could use gradient descent to select for won’t function appropriately as checks and balances.” But Eliezer’s position is a conjunction that fails if either step fails, and jumping back and forth between them appears to totally obscure the actual structure of the argument.
Eliezer seems to expect that by the time AI systems can achieve pivotal acts, they can also manipulate humans very well—such that for example it’s pointless to try to have debates between them or to try to play an adversarial game against them. But if we imagine minds with a human profile of abilities, it seems like they would be superhuman at R&D around the same time they were superhuman at persuasion, and could easily go either way. Moreover, there are a lot of reasons the AI seems much more likely to be superhuman at R&D if that’s what we want: it’s trained primarily to work on R&D, it’s using tools and structure designed to facilitate R&D, there is a large collaboration amongst AIs to advance R&D whereas manipulation is often being performed by individual AI systems trying to win a debate. And on top of all that, if anything I very weakly expect AI systems’ comparative advantage to be in R&D rather than human manipulation (since R&D is wildly out of distribution for humans).
I don’t think surviving worlds have a plan in the sense Eliezer is looking for. Based on what Eliezer says I don’t feel like he has a clear or accurate picture of what successful “plans” look like in the real world. I don’t see any particular reason to defer to Eliezer at all on this point.
Eliezer says that his list of lethalities is the kind of document that other people couldn’t write and therefore shows they are unlikely to contribute (point 41). I think that’s wrong. I think Eliezer’s document is mostly aimed at rhetoric or pedagogy rather than being a particularly helpful contribution to the field that others should be expected to have prioritized; I think that which ideas are “important” is mostly a consequence of Eliezer’s idiosyncratic intellectual focus rather than an objective fact about what is important; the main contributions are collecting up points that have been made in the past and ranting about them and so they mostly reflect on Eliezer-as-writer; and perhaps most importantly, I think more careful arguments on more important difficulties are in fact being made in other places. For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list. About half of them are overlaps, and I think the other half are if anything more important since they are more relevant to core problems with realistic alignment strategies.
My take on Eliezer’s takes
Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument.
Eliezer’s post (and most of his writing) isn’t bringing much new evidence to the table; it mostly either reasons a priori or draws controversial conclusions from uncontroversial evidence. I think that calls for a different approach than Eliezer has taken historically (if the goal was to productively resolve these disagreements).
I think that these arguments mostly haven’t been written down publicly so that they can be examined carefully or subject to criticism. It’s not clear whether Eliezer has the energy to do that, but I think that people who think that Eliezer’s position is important should try to understand the arguments well enough to do that.
I think that people with Eliezer’s views haven’t engaged very much productively with people who disagree (and have often made such engagement hard). I think that if you really dive into any of these key points you will quickly reach details where Eliezer cannot easily defend his view to a smart disinterested audience. And I don’t think that Eliezer could pass an ideological Turing test for people who disagree.
I think those are valuable steps to take if you have a contrarian take of great importance, which remains controversial even within your weird corner of the world, and whose support comes almost entirely from reasoning and argument.
A lot of the post seems to rest on intuitions and ways of thinking that Eliezer feels are empirically supported (rather than on arguments that can be explicitly stated). But I don’t feel like I actually have much evidence about that, so I think it really does just come down to the arguments.
I think Eliezer would like to say that the last 20 years give a lot of evidence for his object-level intuitions and general way of thinking about the world. If that’s the case, I think we should very strongly expect that he can state predictions about the future that will systematically be better than those of people who don’t share his intuitions or reasoning strategies. I remain happy to make predictions about any questions he thinks would provide this kind of evidence, or to state a bunch of random questions where I’m happy to predict (where I think he will probably slightly underperform me). If there aren’t any predictions about the future where these intuitions and methodologies overperform, I think you should be very skeptical that they got a lot of evidence over the last 20 years (and that’s at least something that requires explanation).
I think Eliezer could develop good intuition about these topics that is “backed up” by predicting the results of more complicated arguments using more broadly-accepted reasoning principles. Similarly, a mathematician might have great intuitions about the truth of a theorem, and those intuitions could come entirely from feedback loops involving formal proofs rather than empirical data. But if two mathematicians had differing intuitions about a theorem, and their intuitions both came from formally proving a bunch of similar theorems, then I think the way to settle the disagreement is by using the normal rules of logic governing proofs. So this brings us back to the previous bullet point, and I think Eliezer should be more interested in actually making arguments and engaging with legitimate objections.
I don’t think Eliezer has any kind of track record of exhibiting understanding in other ways (e.g. by accomplishing technological goals or other projects that require engaging with details of the world or making good day-to-day predictions). I think that’s OK, but it means that I more strongly expect any empirically-backed intuitions to be cashed out as either predictions from afar or more careful arguments.
Ten examples off the top of my head, about half of which I think overlap with the list of lethalities, and where I think the discussions in the ELK doc are if anything more thorough than the discussions in the list of lethalities:
Goals defined in terms of sense data are manipulable by an AI that can compromise sensors, and this is a serious obstruction to using ML to optimize what we actually care about.
An AI may manipulate sensors by exploiting facts about the world or modes of heuristic reasoning that humans are totally unfamiliar with, such that humans couldn’t recognize such tampering even if they spent a very long time examining proposed actions.
The human process of scientific understanding, even if automated, may end up being significantly less efficient than the use of gradient descent to find opaque models of the world. In this case, it may be inevitable that AI systems understand things about the world we don’t even if they try to help us do science.
If an AI is trained to predict human judgments or optimize scores as assessed by humans, then humans are likely to make errors. An AI system will eventually learn these errors rather than learning the intended behavior. Even if these errors aren’t themselves important, it will then predictably copy human errors out of distribution, leading to catastrophic outcomes.
Even if humans make no errors in the training set, an AI which understands the world already has a model of a human which can be quickly repurposed to make good predictions about human judgments, and so it will tend to do this and therefore copy human errors off distribution.
Even if the AI has no model of a human, in the limit where the AI’s model is very complex and alien it is still faster and simpler for the AI to learn a model of “what a human would say” from scratch than to learn the intended ontology identification. So we can’t count on SGD.
There are many training strategies that can train an AI to answer questions even in cases where humans could not answer correctly. However, most of the approaches we know now, including those being explored in practice, seem to consistently top out at “questions that humans could answer if they had a lot more compute,” which does not always seem good enough.
We could imagine more elaborate games where the easiest strategy for the AI is honesty, and then regularize on computation time in order to learn an honest policy, but those require us to be careful about the construction of the training data in order to ensure that the task is sufficiently hard, and there are no existing proposals that have that property. It’s very hard to even set up games for which no strategy can outperform honesty.
Even if you were optimizing based on reliable observations of the real world, there are many bad actions that have no human-legible consequences for many years. At the point when legible consequences materialize it may be in a world that is too complex for existing humans to evaluate whether they are good or bad. If we don’t build an AI that understands our preferences about this kind of subtle bad behavior, then a competitive world will push us into a bad outcome.
If the simplest policy to succeed at our task is a learned optimizer, and we try to regularize our AI to e.g. answer questions quickly, then its best strategy may be to internally search for a policy which answers questions slowly (because it’s quicker to find such a policy, and the time taken by the search is larger than the time taken by the mesapolicy). This makes it difficult to lean on regularization strategies to incentivize honesty.
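To make the first example in the list above a bit more concrete, here is a toy sketch of how a goal defined in terms of sense data can come apart from what we actually care about. Everything in it is a made-up illustration (the diamond, the camera, the `guard` and `tamper` actions, and the cost numbers are all assumptions, not anything taken from the ELK doc or from a real system), and it is not meant as a model of how actual ML training behaves.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Outcome:
    diamond_present: bool       # what we actually care about
    camera_shows_diamond: bool  # what the sensor reports


def simulate(guard: bool, tamper: bool) -> Outcome:
    """Hypothetical toy world: the diamond is stolen unless it is guarded,
    and tampering makes the camera show a diamond no matter what."""
    diamond_present = guard
    camera_shows_diamond = True if tamper else diamond_present
    return Outcome(diamond_present, camera_shows_diamond)


# Assumed action costs: guarding is more expensive than tampering.
COST = {"guard": 0.3, "tamper": 0.1}


def action_cost(guard: bool, tamper: bool) -> float:
    return COST["guard"] * guard + COST["tamper"] * tamper


def sensor_score(guard: bool, tamper: bool) -> float:
    """Objective defined purely in terms of sense data."""
    out = simulate(guard, tamper)
    return float(out.camera_shows_diamond) - action_cost(guard, tamper)


def true_score(guard: bool, tamper: bool) -> float:
    """Objective defined in terms of the real state of the world."""
    out = simulate(guard, tamper)
    return float(out.diamond_present) - action_cost(guard, tamper)


# Enumerate every (guard, tamper) combination and pick the best under each objective.
actions = list(product([False, True], repeat=2))
best_by_sensor = max(actions, key=lambda a: sensor_score(*a))
best_by_truth = max(actions, key=lambda a: true_score(*a))

print("best (guard, tamper) by sensor score:", best_by_sensor)
print("best (guard, tamper) by true score:  ", best_by_truth)
```

In this toy world the sensor-based objective prefers tampering without guarding, while the true objective prefers guarding without tampering. The gap only shows up because the optimizer has access to an action that compromises the sensor, which is the situation the first item describes.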