quetzal_rainbow comments on Interpretability Will Not Reliably Find Deceptive AI

quetzal_rainbow 4 May 2025 19:48 UTC
11 points
6
There is a conceptual path for interpretability to lead to reliability: you can understand a model in sufficient detail to know how it produces intelligence, and then build another model out of the interpreted details. Obviously, it’s not something that we can expect to happen anytime soon, but it’s something that an army of interpretability geniuses in a datacenter could do.