This is interesting work and it shows some promise. I find it surprising that even with this intervention the model still believes the eval isn’t real 76% of the time, and I imagine this number will increase as capabilities improve.
This is interesting work and it shows some promise. I find it surprising that even with this intervention the model still believes the eval isn’t real 76% of the time, and I imagine this number will increase as capabilities improve.