One concern is that instrumentally creating copies or counterparts could be very useful for automating AI safety research. Perhaps one can get around this by making copies up front for AIs to use in that research. However, a misaligned AI might then simply substitute “moving existing copies” for “making copies”. Is it possible to draw a firm distinction between what we need for automating AI safety research and the behavior we want to eliminate?