My guess is that a roughly equally “central” problem is the incentive landscape around the OpenPhil/Anthropic school of thought.
Where you see Sam, I suspect something like “the lab memeplexes”. Lab superagents have convergent instrumental goals, and those convergent instrumental goals lead to convergent instrumental beliefs, and also to instrumental blind spots.
There are strong incentives for individual people to adjust their beliefs: money, social status, and a sense of importance from being close to the Ring.
There are also incentives acting on the people who set some of those incentives: funding something that is visibly making progress seems more successful, and is easier, than funding the dreaded theory.