I agree with the point about discontinuities (in particular, I think the assumption that incremental strategies for AI alignment won't work does tend to rely on discontinuous progress, or at the very least on progress being set at near-maximal values), but disagree with the point about VNM rationality being a dead end.
I do think they’re making the problem harder than it needs to be by implicitly assuming that all goals are long-term goals in the sense of VNM/coherence arguments, because this rules out solutions that rely on, for example, deontological urges not to lie even when lying is beneficial. And I don’t think the argument that all goals collapse into coherent goals works: either it fails outright, or it becomes so trivial that it no longer acts as a constraint.
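For concreteness, what I count as a goal "in the sense of VNM" is just the textbook representation theorem (standard notation, nothing specific to this thread): if an agent's preferences $\succeq$ over lotteries satisfy completeness, transitivity, continuity, and independence, then there is a utility function $u$ such that

$$L_1 \succeq L_2 \iff \mathbb{E}_{L_1}[u] \ge \mathbb{E}_{L_2}[u],$$

i.e. the agent acts as an expected utility maximizer. The worrying case is when $u$ is effectively defined over long-horizon outcomes, which is exactly the long-term-goals assumption I'm pointing at above.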
But I do think that goals conforming to coherence arguments/VNM rationality, broadly construed, are likely to emerge conditional on us being able to make AIs coherent/usable for longer-term tasks, and my explanation of why this has not happened yet (at least at the relevant scale) is basically that current models' time horizon for completing tasks is around 2 hours on average, while most long-term goals involve at least a couple of months of planning, if not years.
Edit: Another reason is that current models don’t have anything like long-term memory, and this might already be causing issues on the METR benchmark.
There are quite a few big issues with METR’s paper (though some of them have been partially fixed), but those issues point towards LLM time horizons being shorter than what METR reported, not longer, which only strengthens this point.
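For readers unfamiliar with where the "time horizon" number comes from: my understanding of METR's methodology is that they fit success probability against the (log) length of the human-baseline task and read off where that curve crosses 50%. Here's a toy sketch of that idea; the data and function names are made up for illustration, and this is not METR's actual code:

```python
# Toy illustration (not METR's code): fit model success probability against
# log task length and read off the 50% time horizon.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: human-baseline task lengths (minutes) and whether the model succeeded.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

def success_curve(log_t, log_t50, slope):
    # Logistic curve in log task length; log_t50 is where success probability hits 50%.
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_t50)))

params, _ = curve_fit(success_curve, np.log(task_minutes), succeeded, p0=[np.log(60.0), 1.0])
print(f"Estimated 50% time horizon: {np.exp(params[0]):.0f} minutes")
```

The headline number is just where that fitted curve crosses 50%, which is a long way from the months-to-years of planning that the hypothesized long-term goals would need.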
So we shouldn’t be surprised that LLMs haven’t yet manifested the goals that the AI safety field hypothesized; they’re currently far too incapable.
The other part is that I do think it’s possible to make progress even under something like a worst-case VNM frame, at least assuming that agents don’t have arbitrary decision theories. The post Defining Corrigible and Useful Goals (which, by the way, is substantially underdiscussed on here) is an example of such progress, and its assumptions, that reward is the optimization target and that CDT is the default decision theory of AIs, do look likely to hold in the critical regime, given the empirical evidence.
You might also like the post Defining Monitorable and Useful Goals.
So I don’t think we should give up on directly attacking the hard problems of alignment in a coherence/VNM rationality setting.