Chijioke Ugwuanyi
Karma: 34
Interesting! It’s certainly feasible to train a set of linear probes and inspect actual model activations. Will be taking this up in the coming weeks.
Thank you for your comment. Yes, I read a paper today (https://arxiv.org/abs/2412.04984) in which the researchers asked GPT models to summarize their thinking and output it in a scratchpad.
I should have done this.
Secondly, making the system prompt more deployment-like sounds interesting, although the latest Claude and GPT models appear to be very good at telling whether they are in an evaluation or in deployment. This would be a credible ablation to run.