Some conceptual alignment research projects
Some research outputs I’d love to see, focused on exploring, clarifying and formalizing important alignment concepts. I expect that most of these will be pretty time-consuming, but happy to discuss for people who want to try:
A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
A paper which does the same for gradient hacking, e.g. taking these examples and putting them into more formal ML language.
A list of papers that are particularly useful for new research engineers to replicate.
A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario (I think you can’t really make the argument rigorously in that little space).
A paper which defines the concepts of implicit planning, implicit value functions, implicit reward models, etc, in ML terms. Kinda like https://arxiv.org/abs/1901.03559 but more AGI-focused. I want to be able to ask people “does GPT-3 choose actions using an implicit value function?” and then be able to point them to this paper to rigorously define what I mean. I discuss this briefly in the phase 1 section here.
A blog post which describes in as much detail as possible what our current “throw the kitchen sink at it” alignment strategy would look like. (I’ll probably put my version of this online soon but would love others too).
A blog post explaining “debate on weights” more thoroughly.
A blog post exploring how fast we should expect a forward pass to be for the first AGIs—e.g. will it actually be slower than human thinking, as discussed in this comment?
A blog post exploring considerations for why model goals may or may not be much more robust to SGD than model beliefs, as discussed in framing 3 here. (See also this paper on gradient starvation—h/t Quintin Pope; and the concept of persistence to gradient descent discussed here.)
A blog post explaining why the “uncertainty” part of CIRL only does useful work insofar as we have an accurate model of the human policy, and why this is basically just as hard as having an accurate model of human preferences.
A blog post explaining what practical implications Stuart Armstrong’s impossibility result has.
As many alignment exercises as possible to help people learn to think about this stuff (mine aren’t great but I haven’t seen better).
A paper properly formulating instrumental convergence, generalization to large-scale goals, etc, as inductive biases in the ML sense (I do this briefly in phase 3 here).
A mathematical comparison between off-policy RL and imitation learning, exploring ways in which they’re similar and different, and possible algorithms in between.
A blog post explaining the core argument for why detecting adversarially-generated inputs is likely much easier than generating them, and arguments for why adversarial training might nevertheless be valuable for alignment.
A blog post exploring the incentives which models might have when they’re simultaneously trained to make predictions and to take actions in an RL setting (e.g. models trained using RL via sequence modeling).
A blog post exploring pros and cons of making misalignment datasets for use as a metric of alignment (alignment = how much training on the misalignment dataset is needed to make it misaligned).
A paper providing an RL formalism in which reward functions can depend on weights and/or activations directly, and demonstrating a simple but non-trivial example.
A blog post evaluating reasons to think that situational awareness will be a gradual development in models, versus a sharp transition.
A blog post explaining reasons to expect capabilities to be correlated with alignment while models lack situational awareness, and then less correlated afterwards, rather than the correlation continuing.
A blog post estimating how many bits of optimization towards real-world goals could arise from various aspects of a supervised training program (especially ones which slightly break the cartesian formalisms) - e.g. hyperparameter tuning, many random seeds, training on data generated by other AIs, etc.
A sketch of what a model-free version of AIXI would look like (according to one person I talked to, it’s a lot like decision transformers).
A blog post evaluating whether shard theory makes sense/makes novel predictions compared with Steve Byrnes’ model of the brain (he partly explains this in a comment on the post, but I’m still a bit confused).
A blog post or paper reviewing what types of feedback humans perform best and worst at (e.g. reward vs value feedback) and then designing a realistic setup for optimal-quality human feedback.
A blog post compiling examples of surprising emergent capabilities (especially in large language models).
An investigation of the extent to which human concept representations are localized to individual neurons, versus being spread out across different neurons.