I wonder if Google is optimizing harder for benchmarks, to try to prop up its stock price against a possible deflation of an AI bubble.
It occurs to me that an AI alignment organization should create comprehensive private alignment benchmarks and start releasing the scores. They would have to be constructed in a non-traditional way so they’re less vulnerable to standard goodharting. If these benchmarks become popular with AI users and AI investors, they could be a powerful way to steer AI development in a more responsible direction. By keeping them private, you could make it harder for AI companies to optimize against the benchmarks, and nudge them towards actually solving deeper alignment issues. It would also be a powerful illustration of the point that advanced AI will need to solve unforeseen/out-of-distribution alignment challenges. @Eliezer Yudkowsky
Source.
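To make the "private items, public scores" idea above concrete, here is a minimal Python sketch of how such a harness might work. Everything in it is hypothetical and illustrative: `PrivateBenchmark`, `PrivateItem`, and `retire_suspected_leaks` are names I'm introducing here, not an existing tool. The load-bearing properties are that the prompts and graders never leave the evaluating organization, only aggregate scores are released, and items can be rotated out once they're suspected of leaking, which blunts direct optimization against the benchmark.

```python
# Hypothetical sketch of a private alignment benchmark harness.
# Items stay inside the evaluating org; only aggregate scores are published.
import statistics
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PrivateItem:
    prompt: str                        # never published
    grade: Callable[[str], float]      # maps a model response to a score in [0, 1]
    retired: bool = False


class PrivateBenchmark:
    def __init__(self, items: List[PrivateItem]):
        self.items = items

    def evaluate(self, model: Callable[[str], str]) -> dict:
        """Run the model on all active items and return only aggregate scores."""
        scores = [item.grade(model(item.prompt))
                  for item in self.items if not item.retired]
        return {
            "n_items": len(scores),
            "mean_score": statistics.mean(scores),
            "min_score": min(scores),  # worst-case behavior matters most for alignment
        }

    def retire_suspected_leaks(self, leaked_prompts: set) -> None:
        """Rotate out items suspected of having leaked into training data."""
        for item in self.items:
            if item.prompt in leaked_prompts:
                item.retired = True
```

A real version would also need held-out item pools, ongoing adversarial item generation, and some way to audit the graders themselves, but the basic separation between private items and public aggregates is the key piece.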
I’m concerned that this “law” may apply to Anthropic. People devoted to Anthropic as an organization will have more power than people devoted to the goal of creating aligned AI.
I would encourage people at Anthropic to leave a line of retreat and consider the “least convenient possible world” where alignment is too hard. What’s the contingency plan for Anthropic in that scenario?
Next, devise a collective decision-making procedure for activating that contingency plan. For example, maybe the plan should be activated if X% of the technical staff votes to activate it, perhaps after a week of discussion first. And what would trigger that week of discussion? You can answer these questions and come up with a formal procedure.
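For concreteness, here is one way such a procedure could be formalized, as a minimal Python sketch. The `ContingencyVote` class, the 10% petition threshold, the 60% activation threshold, and the one-week discussion period are all placeholders I'm assuming for illustration, not anything Anthropic actually uses.

```python
# Hypothetical sketch of a formal contingency-plan activation procedure.
# Thresholds and names are illustrative placeholders.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional, Set


@dataclass
class ContingencyVote:
    technical_staff: Set[str]                         # eligible voters
    petition_threshold: float = 0.10                  # fraction needed to open discussion
    activation_threshold: float = 0.60                # fraction needed to activate the plan
    discussion_period: timedelta = timedelta(days=7)  # the "week of discussion"
    petitioners: Set[str] = field(default_factory=set)
    votes_for: Set[str] = field(default_factory=set)
    discussion_opened_at: Optional[datetime] = None

    def petition(self, member: str) -> None:
        """Enough petitions from technical staff triggers the discussion period."""
        if member in self.technical_staff:
            self.petitioners.add(member)
        if (self.discussion_opened_at is None
                and len(self.petitioners) >= self.petition_threshold * len(self.technical_staff)):
            self.discussion_opened_at = datetime.utcnow()

    def vote_to_activate(self, member: str) -> None:
        if member in self.technical_staff:
            self.votes_for.add(member)

    def plan_activated(self, now: datetime) -> bool:
        """Activate only after the discussion period ends and a supermajority has voted."""
        if self.discussion_opened_at is None:
            return False
        discussion_over = now >= self.discussion_opened_at + self.discussion_period
        supermajority = len(self.votes_for) >= self.activation_threshold * len(self.technical_staff)
        return discussion_over and supermajority
```

Gating activation behind both a petition threshold and a later supermajority vote mirrors the distinction above between the trigger for discussion and the decision to actually activate the plan.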
If you had both a contingency plan and a formal means to activate it, I would feel a lot better about Anthropic as an organization.