Two strongest sources of prosaic alignment hopes
1. LLMs are the dumbest economically transformative AI systems: they’re even more culture-pilled than humans, and it turns out that great language models + scaffolding + RL really can “simulate” all economically relevant tasks without developing scary agentic properties (while remaining very easy to audit and monitor)
2. Strong path-dependence: early alignment training robustly aligns the system, and that alignment persists even through outcome-based RL without strong oversight and through continual learning. Shard theory, the basin of corrigibility, etc.
1 makes me pretty hopeful (especially on short timelines), even if it’s only partially true. I think we’ve already gotten some evidence against 2 (e.g. reward hacking in Sonnet, o3, etc.), though the situation does seem better now (maybe thanks to the “soul document”, better deliberative alignment, …)