Thanks for sharing! This is super cool and timely work.
Some thoughts:
I’m excited about (the formalism of) partial observability as a way to make progress on outer alignment in general. Partial observability seems like a natural way to encode fundamental difficulties with specifying what we (humans) want to a system that has more (or different) information and understands that information better (or differently) than we do. I don’t see any reason that the formalism’s usefulness would be limited to cases where human evaluators literally lack information, as opposed to simply being limited in their ability to evaluate that information. So, I think this is a very promising line of work.
Have you considered the connection between partial observability and state aliasing/function approximation? Maybe you could apply your theory to weak-to-strong generalization by considering a weak model as operating under partial observability. Alternatively, by introducing structure to the observations, the function approximation lens might open up new angles of attack on the problem.
There could be merit to a formalism where the AI and supervisor both act under partial observability, according to different observation functions. This would reflect the fact that humans can make use of data external to the trajectory itself to evaluate behavior.
I think you’re exactly right to consider abstractions of trajectories, but I’m not convinced this needs to be complicated. What if you considered the case where the problem definition includes features of state trajectories on which (known) human utilities are defined, but these features themselves are not always observed? (This is something I’m currently thinking about, as a generalization of the work mentioned in the postscript.)
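To make that concrete, here's a rough sketch of the kind of setup I mean (everything here is a hypothetical toy, not something from the paper): utilities are a known linear function of trajectory features, but the evaluator only sees a masked subset of those features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy: each trajectory is summarized by a feature vector phi(tau),
# and the (known) human utility is linear in those features.
n_features = 4
utility_weights = np.array([1.0, -2.0, 0.5, 3.0])  # known to the human

def utility(features):
    return float(utility_weights @ features)

# The evaluator does not always observe every feature: a mask hides some entries.
def observe(features, mask):
    return np.where(mask, features, np.nan)  # hidden features show up as NaN

features = rng.normal(size=n_features)
mask = np.array([True, True, False, True])  # feature 2 happens to be unobserved

print("true utility:     ", utility(features))
print("observed features:", observe(features, mask))
# The question I'm gesturing at: under what structure can feedback given only on
# the observed features still pin down behavior that is good for the true utility?
```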
Am I correct in my understanding that the role Boltzmann rationality plays in your setup is just to get a reward function out of preference data? If so, that doesn’t seem problematic to me (as you also acknowledge). If I understand correctly, it’s a somewhat trivial fact that you can still do arbitrarily badly even when your utilities (on states) are exactly known and the task is to select any reward function (on observations) that performs well according to that utility function.[1]
Again, thanks for the great work. Looking forward to seeing more.
P.S. This summer, my team was thinking about similar formalizations in order to help motivate a new training method. My notes from a lit review read:
I searched for papers that consider the problem of overseeing an AI when you have limited access to observations about the state. This is a modeling assumption intended to (i) encode a practical difficulty with scalable oversight, and (ii) be a “setup” where gradient routing can serve as a “punchline.”
All the related papers I’ve found deal with the problem of specification gaming arising from misspecified proxy rewards, often studied via the lens of “optimization pressure.” But this is not the point we want to make: we want to make the point that if the overseer is limited in the information they have access to (they can’t induce a reward signal at arbitrary resolution), it is impossible for them to get a good reward, except in the presence of certain structure.
So, your paper is exactly the kind of thing we (the team working on gradient routing) were looking for. I just didn’t find the preprint!
For readers who aren’t the author of the post: it’s trivial because you can have two states with different utilities but the same observation. Then there’s no way to define a reward on the observation that forces the agent to “prefer” the better state. I think Example D.4 in their appendix is saying the same thing, but I didn’t check carefully.
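Here's a toy numerical version of that, purely for illustration:

```python
# Two states with different utilities but identical observations.
states = ["good_state", "bad_state"]
utility = {"good_state": 1.0, "bad_state": -1.0}
observation = {"good_state": "obs_A", "bad_state": "obs_A"}  # aliased!

# Any reward defined on observations assigns both states the same value...
def reward_on_obs(obs):
    return 0.3  # whatever value we pick, it cannot depend on the hidden state

# ...so it cannot rank the states the way the utility does.
for s in states:
    print(s, "utility:", utility[s], "reward:", reward_on_obs(observation[s]))
# No choice of reward_on_obs makes the agent prefer good_state over bad_state.
```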
Hi! Thanks a lot for your comments and very good points. I apologize for my late answer, caused by NeurIPS and the general end-of-year breakdown of routines :)
On 1: Yes, the formalism I’m currently working on also allows us to talk about the case where the human “understands less” than the AI.
On 2:
Have you considered the connection between partial observability and state aliasing/function approximation?
I am not entirely sure I understand! But if it’s just what you express in the following sentences, here are my answers:
Maybe you could apply your theory to weak-to-strong generalization by considering a weak model as operating under partial observability.
Very good observation! :) I’m thinking about it slightly differently, but the link is there: Imagine a scenario where we have a pretrained foundation model, and we train a linear probe attached to its internal representations, which is supposed to learn the correct reward for full state sequences, based on feedback from a human on partial observations. Then if we show this model (including the attached probe) just the partial observations during training, it’s receiving the correct data and is supposed to generalize from feedback on “easy situations” (i.e., situations where the human’s partial observations provide enough information to make a correct judgment) to “hard situations” (full state sequences that the human couldn’t oversee, and where the partial observations possibly miss crucial details).
So I think this setting is an instance of weak-to-strong generalization.
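Here is a rough sketch of that setup, purely as my own illustration (the encoder is a stand-in for the frozen pretrained model, and every name and number is hypothetical):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained foundation model mapping (partial) observation
# sequences to internal representations.
class FrozenEncoder(nn.Module):
    def __init__(self, obs_dim=16, rep_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, rep_dim), nn.Tanh())
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, obs_seq):              # obs_seq: (batch, time, obs_dim)
        return self.net(obs_seq).mean(1)      # pooled representation: (batch, rep_dim)

encoder = FrozenEncoder()
reward_probe = nn.Linear(32, 1)               # the only trainable part

def probe_return(obs_seq):
    return reward_probe(encoder(obs_seq)).squeeze(-1)

# Train the probe from human preferences over pairs of *partial* observation
# sequences, with a Bradley-Terry / Boltzmann likelihood.
opt = torch.optim.Adam(reward_probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

obs_a = torch.randn(8, 10, 16)   # batch of partially observed trajectories
obs_b = torch.randn(8, 10, 16)
prefers_a = torch.ones(8)        # dummy labels: the human preferred trajectory a

for _ in range(100):
    logits = probe_return(obs_a) - probe_return(obs_b)
    loss = loss_fn(logits, prefers_a)
    opt.zero_grad()
    loss.backward()
    opt.step()
# The hope: the probe generalizes from "easy" cases (where partial observations
# suffice) to "hard" full state sequences the human couldn't oversee directly.
```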
Alternatively, by introducing structure to the observations, the function approximation lens might open up new angles of attack on the problem.
Yes, that’s actually also part of what I’m exploring, if I understand your idea correctly. In particular, I’m considering the case where we may have “knowledge” of some form about the space in which the correct reward function lives. This may come from symmetries in the state space, for example: maybe we want to restrict to localized reward functions that are translation-invariant. All of that can easily be formalized in one framework.
Pretrained foundation models to which we attach a “reward probe” can be viewed as another instance of considering symmetries in the state space: in this case, we’re presuming that state sequences have the same reward if they give rise to the same “learned abstractions” in the form of the internal representations of the neural network.
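As a toy illustration of such a structural restriction (my own sketch, not something from the paper): a reward that is localized and translation-invariant by construction, because it sums the same local kernel over all positions of the state.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1D grid-world "state" and a localized reward kernel shared across positions.
state = rng.integers(0, 2, size=20).astype(float)   # e.g. occupancy along a line
kernel = np.array([1.0, -2.0, 1.0])                 # local pattern, width 3

def reward(state, kernel):
    window = len(kernel)
    # Summing the same local kernel over every position makes the reward
    # translation-invariant by construction (with wraparound at the boundary).
    return sum(float(kernel @ np.roll(state, -i)[:window]) for i in range(len(state)))

print(reward(state, kernel))
print(reward(np.roll(state, 5), kernel))   # same value: invariant under translation
```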
On 3: Agreed. (Though I am not explicitly considering this case at this point.)
On 4:
I think you’re exactly right to consider abstractions of trajectories, but I’m not convinced this needs to be complicated. What if you considered the case where the problem definition includes features of state trajectories on which (known) human utilities are defined, but these features themselves are not always observed? (This is something I’m currently thinking about, as a generalization of the work mentioned in the postscript.)
This actually sounds very much like what I’m working on right now!! We should probably talk :)
On 5:
Am I correct in my understanding that the role Boltzmann rationality plays in your setup is just to get a reward function out of preference data?
If I understand correctly, yes. In a sense, we just “invert” the sigmoid function to recover the return function on observation sequences from human preference data. If this return function on observation sequences were already known, we’d still be doomed, as you correctly point out.
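To spell out what I mean by “inverting” the sigmoid, here is a minimal sketch of the standard Bradley-Terry identity (nothing specific to our paper): with noiseless Boltzmann-rational preferences, the logit of the preference probability recovers the return difference between observation sequences.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))   # inverse of the sigmoid

# Boltzmann-rational preferences over two observation sequences a and b:
G_a, G_b = 2.0, 0.5                 # returns on observation sequences
p_a_over_b = sigmoid(G_a - G_b)     # P(a preferred to b)

# Inverting the sigmoid on (noiseless) preference probabilities recovers the
# return *difference*, i.e. the return function up to an additive constant.
print(logit(p_a_over_b))            # 1.5 == G_a - G_b
# The point of the discussion: even with this return on observations known
# exactly, partial observability can still make the induced policy arbitrarily bad.
```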
Thanks also for the notes on gradient routing! I will read your post and will try to understand the connection.