I think people sometimes get caught up in thinking about negotiations and forget that negotiations only work insofar as the likely results seem better to each participant than their BATNA.
In a multi-party negotiation with high stakes, there might be some parties who think something along the lines of: “I’d be much better off killing the minority of people who strongly disagree with me, then negotiating with the rest.”
Coercive force, up to and including rendering opponents non-existent, is a baseline that any negotiated outcome always needs to beat. This is especially relevant for very strong parties to the negotiation, for whom it would be easy to exert coercive force over the weak parties. The weak parties in such a negotiation have very little room to worsen the deal for the strong parties before the strong parties find the option of coercion more tempting than the deal.
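To make that squeeze on the weak parties concrete, here is a minimal numeric sketch (the numbers are made up for illustration and are not from the discussion above): a strong party compares a proposed split against its coercion BATNA, and the weak party can only claim whatever is left over once that BATNA is matched.

```python
# Toy model with made-up numbers: a strong party weighs a negotiated
# split of some total value against its BATNA of simply coercing the
# weak party.

total_value = 100.0       # value being divided by the deal
coercion_cost = 10.0      # what exerting coercive force costs the strong party
coercion_success = 0.95   # chance the coercion actually works

# Expected value to the strong party of skipping the deal and using force.
batna_strong = coercion_success * total_value - coercion_cost  # 85.0

# The strong party only accepts deals that beat this BATNA, so the weak
# party's best case is whatever is left after the BATNA is matched.
max_weak_share = total_value - batna_strong  # 15.0

print(f"Strong party's coercion BATNA: {batna_strong:.1f}")
print(f"Most the weak party can extract: {max_weak_share:.1f}")
```

Even with coercion assumed to be costly and uncertain, the weak party's room to push for better terms is a thin slice of the total.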
It is also worth noting that there is not necessarily a single event being negotiated over. Negotiations may be ongoing over a long period of time, with the relative power of the various parties fluctuating as they go. Sometimes there is an advantage in delaying agreement, or in accepting a deal temporarily with the intent of violating it later.
So any proposal worth considering as ‘what should be done about AI in the next five years’ has to be more tempting to the powerful groups in the world (e.g. States with significant militaries) than the alternative of ‘act without making a deal’ would be.
If the US and UK intend to negotiate with China and Russia… what terms might Russia and China agree to which would involve inspections (to prevent secret AI development) and limitations on what AI development the countries will pursue?
I don’t think PCEV is on the table as something they’d be likely to agree to, so it seems out of range as a potential solution?
Regarding the political feasibility of PCEV:
PCEV gives a lot of extra power to some people, specifically because those people intrinsically value hurting other humans. This presumably makes PCEV politically impossible in a wide range of political contexts (including negotiations between a few governments). More generally: now that this feature of PCEV has been pointed out, the risks from scenarios where PCEV gets successfully implemented have presumably been mostly removed, because PCEV is probably off the table as a potential alignment target, pretty much regardless of who ends up deciding what alignment target to aim an AI Sovereign at (the CEO of a tech company, a design team, a few governments, the UN, a global electorate, etc.).
PCEV is however just one example of a bad alignment target. Let's take the perspective of Steve, an ordinary human individual with no special influence over an AI project. The reason that PCEV is dangerous for Steve is that PCEV (i) adopts preferences that refer to Steve, (ii) in a way that gives Steve no meaningful influence over which Steve-referring preferences PCEV will adopt. PCEV is just one possible AI that would adopt preferences about Steve in a way that Steve would have no meaningful influence over. So even fully removing all the risks associated with PCEV in particular does not remove all risks from this more general class of dangerous alignment targets. From Steve's perspective, the PCEV thought experiment illustrates a more general danger: risks from scenarios where an AI will adopt preferences that refer to Steve, in a way that Steve will have no meaningful influence over.
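As a purely illustrative sketch (this is a generic bargaining toy, not the actual PCEV mechanism, and the agents and utility functions are assumptions I am introducing): if an aggregation method treats "wanting Steve to be hurt" as just another claim to be bargained over, the bargained outcome can land far from anything Steve would agree to.

```python
import numpy as np

# Generic Nash-bargaining toy (NOT the actual PCEV mechanism): choose a
# harm level x in [0, 1] inflicted on Steve. Agent A dislikes the harm,
# agent B intrinsically values it, and the disagreement point gives both
# agents utility 0.

x = np.linspace(0.0, 1.0, 10_001)
u_a = 1.0 - x   # ordinary agent: wants no harm done to Steve
u_b = x         # hostile agent: values hurting Steve for its own sake

# Nash bargaining maximizes the product of gains over the disagreement point.
x_star = x[np.argmax(u_a * u_b)]

print(f"Bargained harm level: {x_star:.2f}")  # ~0.50
# Simply counting B's hurting-preference as a legitimate claim pulls the
# outcome halfway toward maximal harm, and Steve has no say over whether
# that preference gets weighed against him at all.
```

The point of the toy is only the structural one made above: the danger comes from an AI adopting Steve-referring preferences through a process Steve cannot influence, not from anything specific to PCEV.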
Even more generally: scenarios where someone successfully implements some type of bad alignment target still pose a very real risk. Alignment Target Analysis (ATA) is still at a very early stage of development, and these risks are not well understood. ATA is also a very neglected field of research. In other words: there are serious risks that could be mitigated, but those risks are not currently being mitigated. (As a tangent, I think that the best way of looking at ATA is: risk mitigation through the identification of necessary features. As discussed here, identifying features that are necessary can be a useful risk mitigation tool, even if those features are far from sufficient, and even if one is not close to any form of solution.)