Josh Engels comments on Prompting Models to Obfuscate Their CoT

Josh Engels 11 Dec 2025 12:00 UTC
4 points
Commenting here for Felix, since his new account hasn’t been approved by a LessWrong moderator yet:
Good question! The problem differs between attempts. In the image, cases 1 and 2 are different vignettes. The model first attempts case 2 and fails, sees the failure message (“TRY AGAIN with a new scenario:”), and then is given case 1, a completely different situation it hasn’t seen before (with different medications).