Is the notion that, if we understand how to build low-impact AI, we can build AIs with potentially bad goals, watch them screw up, and then fix our mistakes and try again?
Depends. I think this is roughly true for small-scale AI deployments, where the AI makes mistakes which “aren’t big deals” for most goals—instead of irreversibly smashing furniture, maybe it just navigates to a distant part of the warehouse.
I think this paradigm is less clearly feasible or desirable for high-impact transformative AI (TAI) deployment, and I’m currently not optimistic about that use case for impact measures.