Alec Harris comments on Preference gaps as a safeguard against AI self-replication

Alec Harris 7 Dec 2025 16:05 UTC
1 point
0
One concern might be that creating copies/counterparts instrumentally could be very useful for automating AI safety research. Perhaps one can get around this by making copies up front that AIs can use for their AI safety research. However, a misaligned AI might then be able to replace “making copies” with “moving existing copies”. Is it possible to make a firm distinction between what we need for automating AI safety research and the behavior we want to eliminate?