It seems like if you explore for the rest of your horizon, then by definition you explore for most of the time you actually care about. That seems bad. Perhaps I’m misunderstanding the proposal.
I agree that it’s solvable; the question is whether it’s any easier to do IRL well than it is to solve the AI alignment problem some other way. That seems unclear to me (it seems like doing IRL really well probably requires doing a lot of cognitive science and moral philosophy).
I agree that this seems hard to do as an episodic RL problem. It seems like we would need additional theoretical insights to know how to do this; we shouldn’t expect AI capabilities research in the current paradigm to automatically deliver this capability.
Re 1st bullet, I’m not entirely certain I understand the nature of your objection.
The agent I describe is asymptotically optimal in the following sense. Let $\gamma(t)$ be the discount function, $U(t)$ the reward obtained by the agent from time $t$ onwards, and $U^\pi(t)$ the reward the agent would obtain from time $t$ onwards if it switched to following policy $\pi$ at time $t$. Then for any policy $\pi$, $\mathbb{E}_{\tau \sim D(t)}\left[\gamma(\tau)^{-1}\left(U^\pi(\tau) - U(\tau)\right)\right]$ is bounded by something that goes to 0 as $t \to \infty$, for some family of time distributions $D(t)$ which depends on $\gamma$ (for geometric discount, $D(t)$ is uniform from 0 to $t$).
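To make the quantity above concrete, here is a minimal numerical sketch of the averaged, discount-normalized regret for a geometric discount with D(t) uniform on {0, ..., t}. Everything in it is an illustrative assumption rather than part of the proposal: the toy agent earns reward 0 while exploring for its first `k` steps and reward 1 afterwards, the comparison policy always earns reward 1, rewards do not depend on the environment state (so "switching to π at time τ" just means receiving π's rewards from τ on), and the infinite sum is truncated at `T` steps.

```python
def tail_utility(rewards, gamma, t):
    """Discounted reward from time t onwards, truncated at len(rewards)."""
    return sum(gamma ** s * rewards[s] for s in range(t, len(rewards)))


def averaged_regret(agent_rewards, pi_rewards, gamma, t):
    """Mean over tau uniform on {0, ..., t} of gamma^(-tau) * (U_pi(tau) - U(tau)).

    Assumes state-independent rewards, so switching to pi at time tau
    simply yields pi_rewards from tau onwards (a toy simplification).
    """
    total = 0.0
    for tau in range(t + 1):
        gap = tail_utility(pi_rewards, gamma, tau) - tail_utility(agent_rewards, gamma, tau)
        total += gap / gamma ** tau
    return total / (t + 1)


# Hypothetical toy setup: explore for k steps, then match the comparison policy.
T, k, gamma = 400, 20, 0.9
agent = [0.0] * k + [1.0] * (T - k)
pi = [1.0] * T

for t in (10, 50, 200):
    print(t, averaged_regret(agent, pi, gamma, t))
```

In this toy case the averaged regret shrinks as t grows, because the exploration cost is confined to the first `k` steps and gets diluted by the uniform average. The same numbers also show why the criterion says little about early performance: the regret at small τ stays large no matter how far t is pushed out.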
It’s true that this desideratum seems much too weak for FAI since the agent would take much too long to learn. Instead, we want the agent to perform well already on the 1st horizon. This indeed requires a more sophisticated model.
This model can be considered analogous to episodic RL where horizons replace episodes. However, one difference of principle is that the agent retains information about the state of the environment when a new episode begins. I think this difference is a genuine advantage over “pure” episodic learning.
It seems like we’re mostly on the same page with this proposal. Probably what’s going on is that the notion of “episodic RL” in my head is quite broad, to the point where it includes things like taking into account an ever-expanding history (each episode is “do the right thing in the next round, given the history”). But at that point it’s probably better to use a different formalism, such as the one you describe.
My objection was the one you acknowledge: “this desideratum seems much too weak for FAI”.