“they also could do things like run prediction markets on people researching S-risk, to forecast the odds that they end up going crazy ”
If this is a real concern, we should check whether fear of hell often drove people crazy.
These with these.
I don’t think Austria-Hungary was in a prisoners’ dilemma, as it wanted a war so long as it would have German support. I think the prisoners’ dilemma (imperfectly) comes into play for Germany, Russia, and then France, given that Germany felt it needed to keep Austria-Hungary as a long-term ally or risk getting crushed by France and Russia in some future war.
Cleaner, but less interesting, plus I have an entire Demon Games exercise we do on the first day of class. Yes, the defense buildup, but also everyone going to war even though everyone (with the exception of the Austro-Hungarians) thought they were worse off going to war than keeping the peace as it previously existed, while recognizing that if they didn’t prepare for war, they would be worse off. Basically, if the Russians didn’t mobilize they would be seen to have abandoned the Serbs, but if they did mobilize and the Germans didn’t quickly move to attack France through Belgium, then Russia and France would have the opportunity (which they would probably have taken) to crush Germany.
I think the disagreement is that I think the traditional approach to the prisoners’ dilemma makes it more useful as a tool for understanding and teaching about the world. Any miscommunication is probably my fault for failing to sufficiently engage with your arguments, but it FEELS to me like you are either redefining rationality or creating a game that is not a prisoners’ dilemma. I would define the prisoners’ dilemma as a game in which both parties have a dominant strategy in which they take actions that harm the other player, yet both parties are better off if neither plays this dominant strategy than if both do, and I would define a dominant strategy as something a rational player always plays regardless of what he thinks the other player will do. I realize I am kind of cheating by trying to win through definitions.
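To make those two defining conditions concrete, here is a minimal sketch in Python (the payoff numbers are my own illustration, not anything from the thread):

```python
# Payoffs as (row player, column player) utilities, using years in prison
# as negative utility.  The specific numbers are assumed for illustration.
payoffs = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-10, 0),
    ("defect",    "cooperate"): (0, -10),
    ("defect",    "defect"):    (-5, -5),
}

def best_reply(opponent_move):
    """Row player's best reply to a fixed opponent move."""
    return max(["cooperate", "defect"],
               key=lambda m: payoffs[(m, opponent_move)][0])

# Condition 1: defect is a dominant strategy -- it is the best reply
# no matter what the opponent does.
assert best_reply("cooperate") == "defect"
assert best_reply("defect") == "defect"

# Condition 2: yet mutual cooperation beats mutual defection for both.
assert payoffs[("cooperate", "cooperate")][0] > payoffs[("defect", "defect")][0]
```

Any payoff matrix satisfying both assertions qualifies as a prisoners’ dilemma under this definition.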
I teach an undergraduate game theory course at Smith College. Many students start by thinking that rational people should cooperate in the prisoners’ dilemma. I think part of the value of game theory is in explaining why rational people would not cooperate, even knowing that everyone not cooperating makes them worse off. If you redefine rationality such that you should cooperate in the prisoners’ dilemma, I think you have removed much of the illuminating value of game theory. Here is a question I will be asking my game theory students on the first day of class:
Our city is at war with a rival city, with devastating consequences awaiting the loser. Just before our warriors leave for the decisive battle, the demon Moloch appears and says “sacrifice ten healthy, loved children and I will give +7 killing power (which is a lot) to your city’s troops and subtract 7 from the killing power of your enemy. And since I’m an honest demon, know that right now I am offering this same deal to your enemy.” Should our city accept Moloch’s offer?
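One way to see why the traditional analysis says yes is to write the offer out as a game. A minimal sketch, where the war-outcome values and the cost of the sacrifice are numbers I am assuming purely for illustration:

```python
# Illustrative payoffs for Moloch's offer.  Relative killing power decides
# the war, so the game reduces to a prisoners' dilemma.  All numbers assumed.
WIN, LOSE, EVEN = 50, -100, 0   # war outcomes for a city
CHILD_COST = -20                # cost of sacrificing ten loved children

def payoff(mine, theirs):
    """Payoff to our city given both cities' sacrifice decisions (booleans)."""
    shift = 7 * mine - 7 * theirs  # net change in relative killing power
    outcome = WIN if shift > 0 else LOSE if shift < 0 else EVEN
    return outcome + (CHILD_COST if mine else 0)

# Sacrificing is a dominant strategy...
assert payoff(True, False) > payoff(False, False)
assert payoff(True, True) > payoff(False, True)
# ...yet both cities are worse off when both accept than when both refuse.
assert payoff(True, True) < payoff(False, False)
```

As long as losing the war is worse than the sacrifice, the assertions hold for any such numbers: accepting dominates, and mutual acceptance leaves both cities worse off.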
I believe that under your definition of rationality, this Moloch example loses its power to, for example, partly explain the causes of World War I.
Consider two games: the standard prisoners’ dilemma and a modified version of the prisoners’ dilemma. In this modified version, after both players have submitted their moves, one is randomly chosen. Then, the move of the other player is adjusted to match that of the randomly chosen player. These are very different games with very different strategic considerations. Therefore, you should not define what you mean by game theory in a way that would make rational players view both games as the same, because by doing so you have defined away much of the real-world coordination challenge that game theory captures.
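A quick way to see how different the two games are is to compute each player’s best reply in both. A sketch with illustrative payoffs (the numbers are my own assumption):

```python
# Payoffs as (row, column), years in prison as negative utility (assumed numbers).
P = {("C", "C"): (-1, -1), ("C", "D"): (-10, 0),
     ("D", "C"): (0, -10),  ("D", "D"): (-5, -5)}

def standard(me, them):
    """Row player's payoff in the ordinary prisoners' dilemma."""
    return P[(me, them)][0]

def modified(me, them):
    # One player is picked at random (probability 1/2 each) and the other's
    # move is overwritten to match, so the realized outcome is always
    # symmetric.  This returns the expected payoff.
    return 0.5 * P[(me, me)][0] + 0.5 * P[(them, them)][0]

# In the standard game defecting dominates; in the modified game it is
# cooperating that dominates.
for them in ("C", "D"):
    assert standard("D", them) > standard("C", them)
    assert modified("C", them) > modified("D", them)
```

The reversal happens because in the modified game your submitted move determines half of the probability mass over the (always symmetric) outcome, so choosing C can only pull the result toward mutual cooperation.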
AI has become so incredibly important that any utilitarian-based charity should probably be totally focused on AI.
I really like this post, it’s very clear. I teach undergraduate game theory and I’m wondering if you have any practical examples I could use of how in a real-world situation you would behave differently under CDT and EDT.
Yes, it’s important to get the incentives right. You could set the salary for AI alignment work slightly below the worker’s market value. Also, I wonder about the relevant elasticity: how many people have the capacity to get good enough at programming to contribute to capability research, and would want to game my labor-hoarding system because they don’t have really good employment options?
I am currently job hunting, trying to get a job in AI safety, but it seems to be quite difficult, especially outside the US, so I am not sure if I will be able to do it.
This has to be taken as a sign that AI alignment research is funding constrained. At a minimum, technical alignment organizations should engage in massive labor hoarding to prevent the talent from going into capability research.
“But make no mistake, this is the math that the universe is doing.”
“There is no law of the universe that states that tasks must be computable in practical time.”
Don’t these sentences contradict each other?
Interesting point, and you might be right. It could get very complicated, because an ASI might want to convince other ASIs that it has one utility function when in fact it has another, and of course all the ASIs might take this into account.
I like the idea of an AI lab workers’ union. It might be worth talking to union organizers and AI lab workers to see how practical the idea is, and what steps would have to be taken. Although a danger is that the union would put salaries ahead of existential risk.
Your framework appears to be moral rather than practical. Right now, going on strike would just get you fired, but in a year or two it could perhaps accomplish something. You should consider the marginal impact of a few workers’ actions on the likely outcome of AI risk.
I’m at over a 50% chance that AI will kill us all. But consider the decision to quit from a consequentialist viewpoint. Most likely, the person who replaces you will be almost as good as you at capability research but care far less than you do about AI existential risk. Humanity consequently probably has a better chance if you stay at the lab, ready for the day when, hopefully, lots of lab workers try to convince the bosses that now is the time for a pause, or at least the time to shift a lot of resources from capabilities to alignment.
The biggest extinction risk from AI comes from instrumental convergence for resource acquisition in which an AI not aligned with human values uses the atoms in our bodies for whatever goals it has. An advantage of such instrumental convergence is that it would prevent an AI from bothering to impose suffering on us.
Unfortunately, this means that making progress on the instrumental convergence problem increases S-risks. We get hell if we solve instrumental convergence but not, say, mesa-optimization, and end up with a powerful AGI that cares about our fate but does something to us that we consider worse than death.
The Interpretability Paradox in AGI Development
The ease or difficulty of interpretability, the ability to understand and analyze the inner workings of AGI, may drastically affect humanity’s survival odds. The worst-case scenario might arise if interpretability proves too challenging for humans but not for powerful AGIs.
In a recent podcast, academic economists Robin Hanson and I discussed AGI risks from a social science perspective, focusing on a future with numerous competing AGIs not aligned with human values. Drawing on human analogies, Hanson considered the inherent difficulty of forming a coalition where a group unites to eliminate others to seize their resources. A crucial coordination challenge is ensuring that, once successful, coalition members won’t betray each other, as occurred during the French Revolution.
Consider a human coalition that agrees to kill everyone over 80 to redistribute their resources. Coalition members might promise that this is a one-time event, but such an agreement isn’t credible. It would likely be safer for everyone not to violate property-rights norms for short-term gains.
In a future with numerous unaligned AGIs, some coalition might calculate it would be better off eliminating everyone outside the coalition. However, they would have the same fear that once this process starts, it would be hard to stop. As a result, it might be safer to respect property rights and markets, competing like corporations do.
A key distinction between humans and AGIs could be AGIs’ potential for superior coordination. AGIs in a coalition could potentially modify their code so that, after the coalition has violently taken over, no member would ever want to turn on the others. This way, an AGI coalition wouldn’t have to fear that a revolution it starts would ever eat its own. This possibility raises a vital question: will AGIs possess the interpretability required to achieve such feats?
The best case for AGI risk is if we solve interpretability before creating AGIs strong enough to take over. The worst case might be if interpretability remains impossible for us but becomes achievable for powerful AGIs. In this situation, AGIs could form binding coalitions with one another, leaving humans out of the loop, partly because we can’t become reliable coalition partners and our biological needs involve maintaining Earth in conditions suboptimal for AGI operations. This outcome creates a paradox: if we cannot develop interpretable AGIs, perhaps we should focus on making them exceptionally difficult to interpret, even for themselves. In this case, future powerful AGIs might prevent the creation of interpretable AGIs because such AGIs would have a coordination advantage and thus be a threat to the uninterpretable AGIs.
Accepting the idea that an AGI emerging from ML is likely to resemble a human mind more closely than a random mind drawn from mindspace is not an obvious reason to be less concerned about AGI risk. Consider a paperclip maximizer: despite its faults, it has no interest in torturing humans. As an AGI becomes more similar to a human mind, it may become more willing to impose suffering on humans. If a random AGI mind has a 99% chance of killing us and a 1% chance of allowing us to thrive, while an ML-created AGI (not aligned with our values) has a 90% chance of letting us thrive, a 9% chance of killing us, and a 1% chance of torturing us, it is not clear which outcome is preferable. A closer resemblance to human cognition does not inherently make an AGI less risky or more beneficial.
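The comparison of the two lotteries can be made explicit with assumed utilities (thrive = 1, death = 0, torture = some large negative number; these utility values are mine, purely for illustration):

```python
def expected_utility(p_thrive, p_death, p_torture, u_torture):
    # Assumed utilities: thrive = 1, death = 0, torture = u_torture < 0.
    return p_thrive * 1 + p_death * 0 + p_torture * u_torture

# Probabilities from the comment; u_torture = -100 is an assumed rating.
random_agi = expected_utility(0.01, 0.99, 0.00, u_torture=-100)
ml_agi     = expected_utility(0.90, 0.09, 0.01, u_torture=-100)

# With torture rated at -100, the more human-like AGI is already the worse bet.
assert ml_agi < random_agi
```

Under these assumptions the ML-created AGI’s expected utility is 0.90 + 0.01 × u_torture versus 0.01 for the random AGI, so the ML-created AGI is the worse gamble exactly when torture is rated worse than −89 times the value of thriving.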