I commented on the thread (after seeing this) to add a link to my paper, which addresses Bengio’s last argument:
@YoshuaBengio I attempted to formalize this argument somewhat in a recent paper. I don’t think the formalization there is particularly airtight, but I think it provides a significantly stronger case for why we should believe that interaction between optimizing systems is fundamentally hard.
Paper abstract: “An important challenge for safety in machine learning and artificial intelligence systems is a set of related failures involving specification gaming, reward hacking, fragility to distributional shifts, and Goodhart’s or Campbell’s law. This paper presents additional failure modes for interactions within multi-agent systems that are closely related. These multi-agent failure modes are more complex, more problematic, and less well understood than the single-agent case, and are also already occurring, largely unnoticed. After motivating the discussion with examples from poker-playing artificial intelligence (AI), the paper explains why these failure modes are in some senses unavoidable. Following this, the paper categorizes failure modes, provides definitions, and cites examples for each of the modes: accidental steering, coordination failures, adversarial misalignment, input spoofing and filtering, and goal co-option or direct hacking. The paper then discusses how extant literature on multi-agent AI fails to address these failure modes, and identifies work which may be useful for the mitigation of these failure modes.”
Re: examples of point #1, I don’t think that shaming in this forum is productive; it’s polarizing and stigmatizing rather than helpful. But I do know of several individuals and a couple of organizations that are guilty of this, each repeatedly.
I do think that people should be more willing to respond personally / privately when someone does something like this, and I have done so in several specific cases where someone decided on a unilateralist approach that I thought was damaging.
Even ignoring the above problem, I’m confused about why it’s valuable to build up a “real tradition” among LW users, given that the wider unilateralist’s curse problem that our world faces can’t possibly be solved by LW users having such a tradition.
A few points.
First, I don’t think it’s clear that in the Rationalist / EA community, there is enough reinforcement of this, and I routinely see issues with people “going rogue” and unilaterally engaging in activities that others have warned them would be dangerous, net negative, etc.
Second, it’s valuable even as an exemplar; we should be able to say that there is such a community, and that they are capable of exercising at least this minimal level of restraint.
Third, I think it’s clear that over the next decade the number of people in the Rationalist-sphere who hold actual positions of (relatively significant) power will continue to grow, and we have already seen some such people emerge in government and in the world of NGOs. For AI in particular, there are many people with significant influence over decisions that could significantly affect Humanity’s future. Their participation here, even if passive, seems likely to at least give them a better understanding of what is needed when they face these choices.
They don’t contradict directly, but they reflect nearly incompatible updates to their world-models based on the same data.
But I don’t understand how you can expect this (i.e., non-SAI-concerned AI safety work that makes easy-to-subvert channels harder to hit) not to happen, or to make it significantly less likely to happen, given that people want to build AIs that do things besides reward hacking.
I was mostly noting that I hadn’t thought of this, hadn’t seen it mentioned, and so my model of the returns to non-fundamental alignment AI safety investments didn’t previously account for it. Reflecting on that now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI safety.
(Now, some low-confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but worry about misaligned non-superintelligent narrow AI / near-human AGI, may be making the Foom! scenario more likely. That means attention to AI safety that pushes for “safer self-driving cars” and “reducing and mitigating side-effects” is plausibly a net negative if done poorly, rather than benign.
My claim here is that superintelligence is a result of training, not a starting condition. Yes, an SAI would do bad things unless robustly aligned, but building the SAI requires that it not wirehead at an earlier stage in the process. I am unsure that a system not built with safety in mind can be trained to a point where it is more likely to gain intelligence than to find ways to reward hack, not necessarily via direct access, but via whatever channel is cheapest. And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net negative.
This is really helpful—thanks!
Yes, an unsafe AI cannot be boxed on unsafe hardware, nor can any AI running on physical hardware be made immune to attacks, but those are very different questions. For this question, we first assume that a provably safe AI can be written; I then wanted to ask what language would be needed.
No, that’s implicit in the model: either *some* crisis requiring higher capacity than we have will overwhelm us and we’ll all die (and it doesn’t matter which), or the variance is relatively small and no such event occurs, and/or our capacity to manage risks grows quickly enough that we avoid the upper tail.
Fully agree—I was using the example to make a far less fundamental point.
2) Calling the issues that arise between agents because of model differences “terminology issues” could also work well; this may be a little like people talking past each other.
I really like this point. I think it’s parallel to the human issue where different models of the world can lead to misinterpretation of the “same” goal. So “terminology issues” would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature using a wall thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be misaligned because it interprets “temperature” as referring to a different fact than the base optimizer does. On the other hand, when the same metric is being used by both parties, the class of possible mistakes does not include what we’re now calling terminology issues.
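As a toy rendering of the thermometer case (my own sketch, with hypothetical sensor names, not an example from any paper): both optimizers pursue “temperature = 20”, but bind the word “temperature” to different sensors.

```python
# Hypothetical referents for the single word "temperature": the base
# optimizer means the wall reading; the mesa-optimizer means the floor reading.
state = {"wall_temp": 20.0, "floor_temp": 17.5}  # warm air rises, so the floor reads low

def base_goal_satisfied(state, target=20.0):
    return abs(state["wall_temp"] - target) < 0.5   # base optimizer's "temperature"

def mesa_goal_satisfied(state, target=20.0):
    return abs(state["floor_temp"] - target) < 0.5  # mesa-optimizer's "temperature"

print(base_goal_satisfied(state))  # True: by the base optimizer's referent, done
print(mesa_goal_satisfied(state))  # False: the mesa-optimizer keeps heating, overshooting
```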
I think this also points to a fundamental epistemological issue, one even broader than goal-representation. It’s possible that two models disagree on representation but agree on all object-level claims; think of using different coordinate systems. Because terminology issues can cause mistakes, I’d suggest that agents with non-shared world models can only reliably communicate via object-level claims.
The implication for AI alignment might be that we need AI either to model the world fundamentally the same way humans do, or to communicate only via object-level goals and constraints.
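To make the coordinate-system point concrete, here is a toy sketch (mine, not from the original post): two agents that represent the same pair of points differently share no coordinates, yet agree on any object-level claim, such as the distance between the points.

```python
import math

# The same two points, in Cartesian and in polar (radius, angle) form.
cartesian = [(1.0, 0.0), (0.0, 1.0)]
polar = [(1.0, 0.0), (1.0, math.pi / 2)]

def dist_cartesian(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dist_polar(p, q):
    # Law of cosines for points given as (radius, angle).
    (r1, a1), (r2, a2) = p, q
    return math.sqrt(r1**2 + r2**2 - 2 * r1 * r2 * math.cos(a1 - a2))

# The representations disagree on every coordinate, but the object-level
# claim "the points are sqrt(2) apart" holds in both models.
print(dist_cartesian(*cartesian))  # 1.4142...
print(dist_polar(*polar))          # 1.4142...
```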
It seems like there’s a difference between the two cases. If I write a program to take the CRT (Cognitive Reflection Test), and then we both take it and get the same score (and that isn’t a perfect score) because the program solved the questions the way I solve them, that doesn’t sound like misalignment.
The misalignment here is between you and the CRT, and reflects your model being misaligned with the goal / reality. That’s why I’m calling it a principal alignment failure—even though it’s the program / mesa-optimizer that fails, the alignment failure is located in the principal, you / the base optimizer.
Separate from my other comment, I want to question your assumption that we must worry about an AI takeoff yielding a system vastly better than humans at everything, such that a very slight misalignment would be disastrous. That seems possible, per Eliezer’s Rocket Example, but is far from certain.
It seems likely that instead there are fundamental limits on intelligence (for a given architecture, at least), and while it is unlikely that those limits coincidentally sit at or near human intelligence, it seems plausible that the first superhuman AI system still plateaus somewhere far short of infinite optimization power. If so, we only need to mitigate well, instead of perfectly aligning the AI to our goals.
I’ll start by noting that I am in the strange (for me) position of arguing that someone is too concerned about over-optimization failures, rather than trying to convince someone who is dismissive. But that said, I do think that the concern here, while real, is mitigable in a variety of ways.
First, there is the possibility of reducing optimization pressure. One key contribution here is Jessica Taylor’s Quantilizers paper, which you note; it shows a way to build systems that optimize but are far less subject to Goodhart’s curse, and I think you are too dismissive of it. Similarly, you are dismissive of optimizing the target directly; I think the epistemological issues you point to can be mitigated to the extent that they won’t cause misalignment between reality and an AI’s representation of that reality. Once that is done, the remaining issue is aligning “true” goals with the measured goals, which is still hard, but certainly not fundamentally impossible in the same way.
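To gesture at why quantilization limits optimization pressure, here is a minimal sketch (my simplification, not Taylor’s full construction): instead of taking the argmax, sample from the top q-fraction of a base distribution over actions, ranked by estimated utility.

```python
import random

def quantilize(actions, base_weights, utility, q=0.1):
    """Sample from the top q-fraction (by estimated utility) of a base
    distribution over actions, rather than taking the argmax.

    Capping optimization pressure this way bounds how hard the chosen
    action can exploit errors in the utility estimate (Goodhart's curse)."""
    # Rank actions by estimated utility, best first.
    ranked = sorted(zip(actions, base_weights), key=lambda aw: -utility(aw[0]))
    total = sum(base_weights)
    kept, mass = [], 0.0
    for action, weight in ranked:  # keep the best actions covering q of the mass
        kept.append((action, weight))
        mass += weight / total
        if mass >= q:
            break
    # Within the kept set, sample proportionally to the base distribution.
    return random.choices([a for a, _ in kept], weights=[w for _, w in kept])[0]
```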
Second, you note that you don’t think we will solve alignment. I agree, because I think “alignment” presupposes a single coherent ideal. If human preferences are diverse, as it seems they are, we may find that alignment is impossible. This, however, allows a very different approach: a system that optimizes only when it finds Pareto improvements across a set of sub-alignment metrics or goals, constraining the possibility of runaway optimization. Even if alignment is possible, it seems likely that we can specify a set of diverse goals / metrics that are each aligned with some human goals, so that the system is limited in its ability to be misaligned.
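As a toy illustration of that constraint (my sketch, not a proposal from the literature): accept a candidate action only if it weakly improves every metric and strictly improves at least one.

```python
def is_pareto_improvement(current, candidate, metrics):
    """True iff the candidate weakly improves every metric relative to the
    current state and strictly improves at least one."""
    old = [m(current) for m in metrics]
    new = [m(candidate) for m in metrics]
    no_worse = all(n >= o for n, o in zip(new, old))
    strictly_better = any(n > o for n, o in zip(new, old))
    return no_worse and strictly_better
```

A system gated this way simply declines to act whenever a candidate plan would sacrifice any one of the sub-alignment metrics, which is the kind of runaway trade-off we want to rule out.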
Lastly, there is optimization for a safe and very limited goal. If the goal is limited and specific, and we find a way to minimize side-effects, this seems like it could be fairly safe. For example, Oracle AIs are an attempt to severely limit the goal. More broadly, however, we might be able to build constraints that work, so that a system can reliably perform limited tasks (“put a strawberry on a plate without producing any catastrophic side-effects”).
This is very interesting—I hadn’t thought about utility aggregation for a single agent before, but it seems clearly important now that it has been pointed out.
I’m thinking about this in the context of both the human brain as an amalgamation of sub-agents, and organizations as amalgamations of individuals. Note that we can treat organizations as rationally maximizing some utility function in the same way we can treat individuals as doing so. But I think that for many or most voting or decision structures, we should be able to rule out the claim that they are following any weighted combination of the normalized utilities of the agents involved, under any intertheoretic comparison. This seems like a useful result if we can prove it. (Alternatively, it may be that certain decision rules map to specific intertheoretic comparison rules, which would be even more interesting.)
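As a concrete instance of the kind of impossibility I have in mind, consider a standard Condorcet cycle (a textbook example, not a new result): simple majority voting over three agents can produce intransitive group choices, and no weighted combination of the agents’ utilities can reproduce an intransitive ranking, since maximizing any fixed utility function always yields a transitive one.

```python
# Three agents' utilities over options A, B, C (a classic Condorcet profile).
utilities = {
    "agent1": {"A": 3, "B": 2, "C": 1},  # A > B > C
    "agent2": {"B": 3, "C": 2, "A": 1},  # B > C > A
    "agent3": {"C": 3, "A": 2, "B": 1},  # C > A > B
}

def majority_prefers(x, y):
    """True if a majority of agents assign higher utility to x than to y."""
    votes = sum(1 for u in utilities.values() if u[x] > u[y])
    return votes > len(utilities) / 2

# Pairwise majority votes cycle: A beats B, B beats C, and C beats A, so the
# group "preference" is intransitive and cannot come from maximizing any
# weighted sum of these utilities.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")  # all True
```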
1) It’s neither noise nor rapid increase; it’s delayed feedback. Control theorists in engineering have this as a really clear, basic result: delayed feedback destabilizes a system in various ways. There are entire books on how to handle it well (https://books.google.ch/books?id=Cy_wCAAAQBAJ&pg=PR9&lpg=PR9), but doing it without these more complex techniques goes badly. (See the toy simulation after point 2.)
2) Either you hire a control theorist, or (more practically) you avoid the current feedback mechanism and instead get people on the phone to talk through and understand what everyone needs, rather than relying on their delayed feedback in the form of numeric orders.
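To illustrate point 1, here is a toy simulation (mine, not from the book above): a proportional controller that reacts to a delayed reading of the state overshoots and oscillates, while the same controller with fresh readings settles smoothly.

```python
def simulate(delay_steps, gain=0.6, target=10.0, steps=40):
    """Proportional controller correcting toward `target`, but acting on a
    state observation that is `delay_steps` old."""
    level = 0.0
    history = [level]
    for _ in range(steps):
        # The controller only sees the state from `delay_steps` ago.
        observed = history[max(0, len(history) - 1 - delay_steps)]
        level += gain * (target - observed)
        history.append(level)
    return history

print([round(x, 1) for x in simulate(delay_steps=0)[:8]])  # smooth approach to 10
print([round(x, 1) for x in simulate(delay_steps=3)[:8]])  # overshoots past 26, then oscillates
```

This is the same dynamic as people reacting to stale numeric orders: corrections keep arriving after the situation has already changed, so the system swings instead of settling.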