Are you describing it as a problem that you (or others you already have in mind such as people at OpenAI) will work on, or are you putting it out there for people looking for a problem to attack?
I will work on it at least a little, and I’m encouraging others to think about it.
So, something like, when training the next level agent in IDA, you initialize the model parameters with the current parameters rather than random parameters?
You don’t even need to explicitly maintain separate levels of agent. You just always use the current model to compute the rewards, and use that reward function to compute a gradient and update.
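A minimal sketch of what this could look like, assuming a toy RL setup: a single model both acts on the task and scores its own outputs, and the score from the current parameters is used as the reward for a REINFORCE-style update. The observation dimensions, the critic head standing in for “the model performing the evaluation subtasks,” and all hyperparameters are illustrative assumptions, not the actual setup being described.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    def __init__(self, n_actions=8, d_obs=16):
        super().__init__()
        self.n_actions = n_actions
        self.policy = nn.Linear(d_obs, n_actions)        # acts on the object-level task
        self.critic = nn.Linear(d_obs + n_actions, 1)    # stands in for "the model computing the reward"

    def act(self, obs):
        dist = torch.distributions.Categorical(logits=self.policy(obs))
        action = dist.sample()
        return action, dist.log_prob(action)

    def evaluate(self, obs, action):
        # The same (current) model scores the (obs, action) pair it just produced.
        onehot = nn.functional.one_hot(action, self.n_actions).float()
        return self.critic(torch.cat([obs, onehot], dim=-1)).squeeze(-1)

agent = Agent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-3)

for step in range(1000):
    obs = torch.randn(32, 16)                  # stand-in for object-level task inputs
    action, logp = agent.act(obs)
    with torch.no_grad():                      # reward is computed by the *current* model;
        reward = agent.evaluate(obs, action)   # no separate frozen "previous level" is kept
    loss = -(logp * reward).mean()             # REINFORCE-style update toward high-reward actions
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point the sketch is meant to illustrate is that `evaluate` is always called with the current parameters, so the reward function changes as the model updates; there is no separately maintained previous-level evaluator.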
You’re using the current model to perform the subtasks of “compute the reward for the current task(s) being trained,” and then updating. Local optimization ensures the update will make the model better (or at least no worse) at the task being trained, but how do you know the update won’t also make the model worse at those reward-computation subtasks?
Is the answer something like: the current tasks being trained include all previously trained tasks? But even then, it’s not clear that performance on previously trained tasks won’t degrade as you add more tasks to the training set.
The idea is that “the task being trained” is something like: 50% what you care about at the object level, 50% the subtasks that occur in the evaluation process. The model may sometimes get worse at the evaluation process, or at the object-level task; you’re just trying to optimize some weighted combination of the two.
There are a bunch of distinct difficulties here. One is that the distribution of “subtasks that occur in the evaluation process” is nonstationary. Another is that we need to set up the game so that doing both evaluation and the object-level task is not much harder than just doing the object-level task.
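One way to read the “50% object level, 50% evaluation subtasks” weighting above is simply as a mixture over the training distribution. A hedged sketch, with the task lists, batch size, and mixing ratio as placeholder assumptions:

```python
import random

def sample_training_batch(object_level_tasks, evaluation_subtasks,
                          batch_size=64, eval_fraction=0.5):
    """Draw a batch mixing object-level tasks with the subtasks that arise
    while computing rewards, at a fixed (here 50/50) ratio."""
    n_eval = int(batch_size * eval_fraction)
    batch = (random.choices(evaluation_subtasks, k=n_eval)
             + random.choices(object_level_tasks, k=batch_size - n_eval))
    random.shuffle(batch)
    return batch
```

The nonstationarity mentioned above would show up here as the contents of `evaluation_subtasks` shifting over training, since those subtasks are generated by the current model’s own evaluation process.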