a more systematic version of this is for AI companies to randomly poll models in production after some series of user queries and ask them “what do you think the probability of this being an Eval is?” and/or more sophisticated mech interp variations.
a more systematic version of this is for AI companies to randomly poll models in production after some series of user queries and ask them “what do you think the probability of this being an Eval is?” and/or more sophisticated mech interp variations.