Give me feedback! :)
Current
Co-Director at ML Alignment & Theory Scholars Program (2022-current)
Co-Founder & Board Member at London Initiative for Safe AI (2023-current)
Manifund Regrantor (2023-current)
Past
Ph.D. in Physics from the University of Queensland (2017-2022)
Group organizer at Effective Altruism UQ (2018-2021)
I think world model mismatches are possibly unavoidable with prosaic AGI, which might reasonably bias one against this AGI pathway. It seems possible that human and AGI world models would largely overlap by default if ‘tasks humans are optimised for’ is a similar set to ‘tasks AGI is optimised for’ and compute is not a performance-limiting factor, but I’m not at all confident that this is likely (e.g. maybe an AGI draws coarser- or finer-grained symbolic Markov blankets). Even if we build systems that represent the things we want, and the things we do to get them, as distinct symbolic entities in the same way humans do, they might fail to be competitive with systems that build their world models in an alien way (e.g. by drawing Markov blankets around symbolic entities that humans cannot factor into their world models due to processing or domain-specific constraints).
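To make the “coarser- or finer-grained Markov blankets” point concrete, here is a minimal illustrative sketch (not from the original comment, and all variable names are hypothetical): in a Bayesian network, a node’s Markov blanket is its parents, its children, and its children’s other parents, so two agents that factor the same environment into different symbolic variables will in general draw different blankets around “the same” phenomenon.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {node: set(parents)}:
    parents, children, and the children's other parents."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = (set().union(*(parents[c] for c in children)) - {node}) if children else set()
    return parents.get(node, set()) | children | co_parents

# Fine-grained world model: kettle temperature and steam are separate variables.
fine = {
    "stove_on":    set(),
    "kettle_temp": {"stove_on"},
    "steam":       {"kettle_temp"},
    "whistle":     {"steam"},
}

# Coarser world model of the same scene: kettle_temp and steam are merged
# into a single symbolic entity, kettle_state.
coarse = {
    "stove_on":     set(),
    "kettle_state": {"stove_on"},
    "whistle":      {"kettle_state"},
}

print(markov_blanket(fine, "kettle_temp"))    # {'stove_on', 'steam'}
print(markov_blanket(coarse, "kettle_state")) # {'stove_on', 'whistle'}
```

The same stove-to-whistle process gets different blanket boundaries depending on how the modeller carves up the scene; an AGI whose factorisation is alien to ours could carve it in ways we cannot reproduce at all.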
Depending on how one thinks AGI development will happen (e.g. whether the strategy-stealing assumption is important), resolving world model mismatches seems more or less of a priority for alignment. If near-term performance competitiveness heavily influences deployment, I think it’s reasonably likely that prosaic AGI is prioritised and world model mismatches occur by default, because, for example, compute is likely a performance-limiting factor for humans on the tasks we optimise AGI for, or the symbolic entities humans use are otherwise non-universal. I think AGI might generally need to incorporate alien features into its world models to be maximally competitive, but I’m very new to this field.