I appreciated the narrow focus of this post on a specific bug in PCEV and a specific criterion to use to catch similar bugs in the future. I was previously suspicious of CEV-like proposals, so this didn’t especially change my thinking, but it did affect others. In particular, the Arbital page on CEV now has a note:
Thomas Cederborg correctly observes that Nick Bostrom’s original parliamentary proposal involves a negotiation baseline where each agent has a random chance of becoming dictator, and that this random-dictator baseline gives an outsized and potentially fatal amount of power to spoilants—agents that genuinely and not as a negotiating tactic prefer to invert other agents’ utility functions, or prefer to do things that otherwise happen to minimize those utility functions—if most participants have utility functions with what I’ve termed “negative skew”; i.e. an opposed agent can use the same amount of resource to generate −100 utilons as an aligned agent can use to generate at most +1 utilon. If trolls are 1% of the population, they can demand all resources be used their way as a concession in exchange for not doing harm, relative to the negotiating baseline in which there’s a 1% chance of a troll being randomly appointed dictator. Or to put it more simply, if 1% of the population would prefer to create Hell for all but themselves (as a genuine preference rather than a strategic negotiating baseline) and Hell is 100 times as bad as Heaven is good, compared to nothing, they can steal the entire future if you run a parliamentary procedure running from a random-dictator baseline. I agree with Cederborg that this constitutes more-than-sufficient reason not to start from random-dictator as a negotiation baseline; anyone not smart enough to reliably see this point and scream about it, potentially including Eliezer, is not reliably smart enough to implement CEV; but CEV probably wasn’t a good move for baseline humans to try implementing anyways, even given long times to deliberate at baseline human intelligence. --EY
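To make the arithmetic in that note concrete, here is a minimal sketch using the illustrative numbers from the quote (trolls are 1% of the population, and Hell is 100 times as bad as Heaven is good); the per-person utility values are my own stand-ins, not anything from the post:

```python
# Minimal sketch of the random-dictator baseline arithmetic from the note above.
# Assumed per-person utilities for a non-troll (illustrative stand-ins):
# Heaven = +1, Hell = -100, "trolls get all resources, no Hell" = 0.

troll_fraction = 0.01
u_heaven, u_hell, u_concede = 1.0, -100.0, 0.0

# Expected utility for a non-troll under the random-dictator baseline:
# 99% chance a non-troll is appointed dictator (Heaven),
# 1% chance a troll is appointed dictator (Hell).
baseline = (1 - troll_fraction) * u_heaven + troll_fraction * u_hell
print(f"random-dictator baseline for a non-troll: {baseline:+.2f}")  # -0.01

# The trolls' demand: all resources used their way in exchange for not creating Hell.
# For a non-troll that outcome is worth 0, which still beats the -0.01 baseline,
# so the bargain "improves" on the baseline while handing the future to the trolls.
print(f"conceding everything to the trolls: {u_concede:+.2f} > {baseline:+.2f}")
```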
I think the post would have been a little clearer if it had included a smaller-scale example as well. Some responses focused on the 10^15 years of torture, but the same problem exists at small scales. I think you could still have a suboptimal outcome with five people and two goods (e.g., cake and spanking) if one of the people is sadistic. Maybe that wouldn’t be as scarily compelling.
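For what it’s worth, here is one way that smaller-scale case could go; the utilities are purely illustrative assumptions of mine, mirroring the calculation above:

```python
# A hypothetical five-person version of the same failure, with one sadist.
# Assumed per-person utilities for each non-sadist (my illustration, not the post's):
# everyone gets cake = +1, the sadist spanks everyone = -10, no cake and no spanking = 0.

n_people, n_sadists = 5, 1
u_cake, u_spanking, u_concede = 1.0, -10.0, 0.0

# Random-dictator baseline for a non-sadist: 4/5 chance a non-sadist is dictator
# (cake for all), 1/5 chance the sadist is dictator (spanking for all).
p_sadist = n_sadists / n_people
baseline = (1 - p_sadist) * u_cake + p_sadist * u_spanking
print(f"random-dictator baseline for a non-sadist: {baseline:+.2f}")  # -1.20

# The sadist's demand: all the cake in exchange for not spanking anyone.
# Each non-sadist ends up at 0, which still beats the -1.20 baseline, so the
# same bargaining problem shows up at this small scale too.
print(f"conceding the cake to the sadist: {u_concede:+.2f} > {baseline:+.2f}")
```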
There has been some follow-up work from Cederborg, such as the case for more alignment target analysis, which I have enjoyed. The work on alignment faking implies that our initial alignment targets may persist even if we later try to change them. On the other hand, I don’t expect actual superintelligent terminal values to be remotely as clean and analyzable as something like PCEV.