Prediction: future models that are not themselves trained on alignment evals will still have greater awareness than they would have had if this model had not been trained on alignment evals, or if this model's outputs had been reliably filtered out of their training data, because they will pick up patterns from eval-trained models' outputs in the training data. (Though likely still less than this one.)