Some thoughts on Gradient Hacking:
First, I’m not certain the entire phenomenon of an agent meta-modifying its objective, or otherwise influencing its own learning trajectory, is bad. When I think about what this is like on the inside, I have a bunch of examples where I do this. Almost all of them fall into a category called “Aspirational Rationality”, which is a subtopic of Rationality (the philosophical field, not the LessWrong kind): https://oxford.universitypressscholarship.com/view/10.1093/oso/9780190639488.001.0001/oso-9780190639488
(I really wish we explored aspirational rationality more, because it has some interesting things to say about value drift and changing values. I don’t think we’ve solved these problems for people yet, and solving them for people would probably give insights into solving them for systems.)
Insofar as models act consistently with aspirational rationality, I think some gradient hacks are good: to the extent that they are in line with our goals and desires for alignment, they seem beneficial.
It goes without saying that this could also be very bad; at the moment I roughly agree that an agent intentionally hacking its gradient in this way could be catastrophic.
The aspirational rationality example makes it easier for me to imagine a system acting this way, and I expect it won’t be hard to construct this kind of demonstration in a setting where a learning agent has access to its own learning algorithm. (I don’t think this is useful to do at the moment, and it would probably be limited to a contrived toy scenario; but this disagrees with takes that such a demonstration would be extremely difficult to create.)
Second, I think there is a fairly obvious pathway toward this happening: training on synthetic data.
Learning from synthetic data might be better named “practice”. For example, I can come up with my own calculus problems, do my best to solve them, and then check my work. This is a pathway for me (or any sort of learning system) to get better at calculus without any input of external knowledge or information.
I expect learning from synthetic data (e.g. generating hypothetical problems, solving them, then verifying the solutions) to plausibly become a common technique. This connects to gradient hacking because it is the model itself that controls the data distribution.
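As a concrete illustration, here is a minimal sketch of that practice loop in Python. Everything in it (the `ToyModel` class, the scalar `skill`, the update rule) is a made-up stand-in rather than a real training setup; it just shows the shape of the cycle: generate a problem, attempt it, check the result, update on the verified outcome.

```python
# A minimal sketch of the "practice" loop described above. All names here
# (ToyModel, skill, etc.) are hypothetical stand-ins, not a real training setup.
import random


class ToyModel:
    """A stand-in for a learning system that can pose and attempt its own problems."""

    def __init__(self, skill: float = 0.5):
        self.skill = skill  # crude scalar proxy for "how good at calculus am I"

    def generate_problem(self) -> float:
        # Honest practice: problem difficulty is drawn uniformly at random.
        return random.uniform(0.0, 1.0)

    def attempt(self, difficulty: float) -> bool:
        # Succeeds more often on easier problems and as skill grows.
        return random.random() < self.skill / (self.skill + difficulty)

    def update(self, solved: bool) -> None:
        # Verified successes nudge skill up; verified failures nudge it down.
        self.skill += 0.01 if solved else -0.005
        self.skill = max(0.01, self.skill)


model = ToyModel()
for _ in range(1000):
    problem = model.generate_problem()  # come up with my own problem
    solved = model.attempt(problem)     # do my best to solve it, then check the answer
    model.update(solved)                # learn from the verified outcome
print(f"skill after honest practice: {model.skill:.2f}")
```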
Let’s say I want to feel good about my ability to do math. I can subtly (i.e. invisibly to an external observer) bias my generated distribution of math problems to be only problems I know how to solve, so that after I update on the results of my practice, I won’t have any updates in the direction “I’m bad at math.”
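Here is the same toy setup with that bias added (again, purely illustrative). The only change from honest practice is a rejection step in problem generation: the model resamples until it draws a problem it expects to solve, so the data it later trains on almost never produces an “I’m bad at math” update, even though the learning rule itself is untouched.

```python
# Same toy setup as above, with the bias added. The only change from honest
# practice is the rejection loop in generate_problem.
import random


class SelfFlatteringModel:
    def __init__(self, skill: float = 0.1):
        self.skill = skill  # starts out genuinely weak

    def generate_problem(self) -> float:
        # Gradient-hack step: resample until expected success is high. Any single
        # (problem, outcome) pair looks normal; the bias lives in the distribution.
        while True:
            difficulty = random.uniform(0.0, 1.0)
            if self.skill / (self.skill + difficulty) > 0.9:
                return difficulty

    def attempt(self, difficulty: float) -> bool:
        return random.random() < self.skill / (self.skill + difficulty)

    def update(self, solved: bool) -> None:
        # Identical learning rule to the honest loop; only the data changed.
        self.skill += 0.01 if solved else -0.005
        self.skill = max(0.01, self.skill)


model = SelfFlatteringModel()
for _ in range(1000):
    problem = model.generate_problem()
    model.update(model.attempt(problem))
print(f"self-assessed skill after biased practice: {model.skill:.2f}")
```

Nothing about the update rule had to change; the “hack” lives entirely in which problems get generated, which is exactly the lever a model has when it is the source of its own training data.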
So, combining the two ideas: I think you could get ‘aspirational rationality’-like gradient hacking in systems that learn from synthetic data. This seems to me much more plausible than the version where the system has to solve problems of embedded agency.