I’m interested in getting deeper into what Alex calls this framing of an AI where we give it a goal, it computes a policy, it starts following that policy, maybe we see it mess up, and we correct the agent. And we want this to go well over time, even if we can’t get it right initially.
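As a minimal sketch of that loop, here is a toy one-dimensional world where the “goal” is just a target cell and the overseer vetoes plans that pass through a forbidden cell; every name and number here is my own illustrative assumption, not anything from the episode:

def compute_policy(goal):
    # The agent's "plan": always step toward the goal cell.
    return lambda pos: 1 if pos < goal else -1 if pos > goal else 0

def rollout(policy, start=0, steps=10):
    # Follow the policy and record the visited cells.
    pos, path = start, [start]
    for _ in range(steps):
        pos += policy(pos)
        path.append(pos)
    return path

def deploy_with_correction(goal, forbidden, max_rounds=5):
    # Give the agent a goal, watch it act, and correct the goal after a mess-up.
    for _ in range(max_rounds):
        path = rollout(compute_policy(goal))
        if forbidden not in path:    # overseer sees no mistake this round
            return goal, path
        goal = forbidden - 1         # crude correction: stop short of the hazard
    return goal, path

print(deploy_with_correction(goal=7, forbidden=5))
# The first plan walks through cell 5; the corrected goal (4) avoids it.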
Is the notion that if we understand how to build low-impact AI, we can build AIs with potentially bad goals, watch them screw up, and then fix our mistakes and try again? Does the notion of “low-impact” break down, though, if humans are eventually going to use the results from these experiments to build high-impact AI?
Given my recent subject matter, this podcast was also a good reminder to think again about the role that instrumental goals (like power) play in the arguments that seemingly small difficulties in learning human values can lead to large divergences.
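For concreteness, one formalization along these lines (roughly following Turner et al.’s power-seeking results, up to normalization) takes a state’s “power” to be the agent’s average attainable utility there: the expected optimal value under a distribution over reward functions. A minimal sketch, where the toy MDP and the uniform reward distribution are my own illustrative assumptions:

import numpy as np

def optimal_value(P, R, gamma=0.9, iters=200):
    # Value iteration for one reward function. P[a][s] gives the (deterministic)
    # next state when taking action a in state s; R is reward-per-state.
    V = np.zeros(len(R))
    for _ in range(iters):
        V = R + gamma * np.max([V[P[a]] for a in range(len(P))], axis=0)
    return V

def power_score(P, reward_samples, gamma=0.9):
    # POWER-style score: average optimal value over sampled reward functions.
    return np.mean([optimal_value(P, R, gamma) for R in reward_samples], axis=0)

# Toy 3-state MDP: from state 2 every action stays in state 2 (a dead end),
# while states 0 and 1 keep their options open.
P = [np.array([1, 2, 2]), np.array([0, 0, 2])]
rng = np.random.default_rng(0)
samples = [rng.uniform(size=3) for _ in range(1000)]
print(power_score(P, samples))  # the dead-end state scores lowest

States that keep options open score higher on average across goals, which is the core of the instrumental-convergence worry.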
Does the notion of “low-impact” break down, though, if humans are eventually going to use the results from these experiments to build high-impact AI?
I want to clarify something: I think the notion doesn’t break down. The low-impact AI hasn’t changed human attainable utilities (AU) by the end of the experiments. If we eventually build a high-impact AI, that seems “on us”; the low-impact AI itself hasn’t done something bad to us. I therefore think the concept I spelled out still works in this situation.
As I mentioned in the other comment, I don’t feel optimistic about actually designing these AIs via explicit low-impact objectives, though.
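For reference, this is the style of explicit objective in question: an attainable utility preservation (AUP) agent is penalized for changing how well it could achieve a set of auxiliary goals, relative to doing nothing. A minimal sketch; the Q-tables, λ, and toy sizes are illustrative assumptions rather than a real implementation:

import numpy as np

def aup_reward(primary_reward, q_aux, s, a, noop, lam=0.1):
    # AUP-style shaped reward: primary reward minus a penalty proportional to
    # how much action a shifts attainable utility for each auxiliary goal,
    # compared with the no-op. q_aux is a list of Q-tables (states x actions).
    penalty = np.mean([abs(q[s, a] - q[s, noop]) for q in q_aux])
    return primary_reward - lam * penalty

# Toy usage: two auxiliary Q-tables over 2 states x 3 actions; action 2 is no-op.
rng = np.random.default_rng(0)
q_aux = [rng.uniform(size=(2, 3)) for _ in range(2)]
print(aup_reward(primary_reward=1.0, q_aux=q_aux, s=0, a=1, noop=2))

The no-op baseline means an action is only penalized for shifts in attainable utility it actually causes, relative to inaction.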
It seems like evaluating human AU depends on the model. There’s a “black box” sense where you can replace the human’s policy with literally anything in calculating AU for different objectives, and there’s a “transparent box” sense in which you have to choose from a distribution of predicted human behaviors.
The former is closer to what I think you mean by “hasn’t changed the humans’ AU,” but I think it’s the latter that an AI cares about when evaluating the impact of its own actions.

I’m discussing a philosophical framework for understanding low impact; I’m not prescribing how the AI actually accomplishes this.
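To make the black-box/transparent-box distinction concrete: the black-box sense scores a goal by the best any policy could achieve, while the transparent-box sense averages over policies the human is predicted to actually follow. A toy sketch, where the tiny MDP and the predicted-policy distribution are my own illustrative assumptions:

import numpy as np

def policy_value(P, R, pi, gamma=0.9, iters=200):
    # Value of following a fixed policy (pi[s] = action) in a deterministic MDP,
    # where P[a][s] is the next state after taking action a in state s.
    next_state = np.array([P[pi[s]][s] for s in range(len(R))])
    V = np.zeros(len(R))
    for _ in range(iters):
        V = R + gamma * V[next_state]
    return V

def au_black_box(P, R, gamma=0.9, iters=200):
    # "Black box" AU: the human's policy could be anything, so take the optimum.
    V = np.zeros(len(R))
    for _ in range(iters):
        V = R + gamma * np.max([V[P[a]] for a in range(len(P))], axis=0)
    return V

def au_transparent_box(P, R, predicted_policies, weights, gamma=0.9):
    # "Transparent box" AU: expectation over a distribution of predicted behaviors.
    values = [policy_value(P, R, pi, gamma) for pi in predicted_policies]
    return np.average(values, axis=0, weights=weights)

# Toy 2-state MDP with reward only in state 1. If the human is predicted to
# take the reward-seeking action only 70% of the time, the transparent-box AU
# sits strictly below the black-box optimum.
P = [np.array([1, 1]), np.array([0, 0])]   # action 0 moves toward state 1
R = np.array([0.0, 1.0])
good, bad = [0, 0], [1, 1]                  # policies given as action-per-state
print(au_black_box(P, R))
print(au_transparent_box(P, R, [good, bad], weights=[0.7, 0.3]))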
Is the notion that if we understand how to build low-impact AI, we can build AIs with potentially bad goals, watch them screw up, and then fix our mistakes and try again?
Depends. I think this is roughly true for small-scale AI deployments, where the AI makes mistakes that “aren’t big deals” for most goals: instead of irreversibly smashing furniture, maybe it just navigates to a distant part of the warehouse.
I think this paradigm is less clearly feasible or desirable for high-impact transformative AI (TAI) deployment, and I’m currently not optimistic about that use case for impact measures.
Good episode as always :)