It seems to me that it is quite possible that language models develop into really good world modelers before they become consequentialist agents or contain consequentialist subagents. While I would be very concerned with using an agentic AI to control another agentic AI for the reasons you listed and so am pessimistic about eg debate, AI still seems like it could be very useful for solving alignment.
It seems to me that it is quite possible that language models develop into really good world modelers before they become consequentialist agents or contain consequentialist subagents. While I would be very concerned with using an agentic AI to control another agentic AI for the reasons you listed and so am pessimistic about eg debate, AI still seems like it could be very useful for solving alignment.
Language models develp really good world models… primarily of humans writing text on the internet. Who are consequentialist agents, and are not fully aligned (in the absence of effective law enforcement) to other humans.