Treat your obfuscated chains of thought like live bioweapons.
I’ve spoken to a few folks at NeurIPS who are training reasoning models against monitors for various reasons (usually to figure out how to avoid unmonitorable chains of thought). I had the impression that not everyone was aware of how dangerous these chain-of-thought traces are:
Make sure your obfuscated chains of thought are never used for LLM training!
If obfuscated reasoning gets into the training data, this could plausibly teach models how to obfuscate their reasoning. That seems potentially pretty bad (a bit like gain-of-function research). I’m not saying you shouldn’t do the research; it’s probably worth the risk. Just make sure to keep the rollouts away from training (a rough sketch of one mitigation follows the list below):
Use e.g. the easy-dataset-share package (by TurnTrout et al.) to protect your dataset when you upload it somewhere (e.g. GitHub, HuggingFace).
Don’t use software that trains on your files when working with dangerous material (I think the free tiers of various AI products allow for training on user data).
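To make the general idea concrete: one simple mitigation is to embed a recognizable canary string in every rollout file you save, so that data-cleaning pipelines (and you, later) can filter those records out of any training corpus. The sketch below is not the easy-dataset-share API; the function names, file layout, and canary format are all hypothetical, and a real canary should be a documented, dataset-specific identifier.

```python
import json
import uuid
from pathlib import Path

# Hypothetical canary string; a real one should be a stable, documented identifier
# that downstream data-cleaning pipelines are known to filter on.
CANARY = f"OBFUSCATED-COT-CANARY-{uuid.uuid4()}"


def save_rollouts(rollouts: list[dict], out_path: Path) -> None:
    """Write rollouts as JSONL, tagging each record with the canary so it can be excluded from training data."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w") as f:
        for rollout in rollouts:
            record = {"canary": CANARY, **rollout}
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Hypothetical example rollout; real entries would hold the actual obfuscated traces.
    example = [{"prompt": "…", "chain_of_thought": "…", "answer": "…"}]
    save_rollouts(example, Path("rollouts/obfuscated_cot.jsonl"))
    print(f"Saved rollouts tagged with canary: {CANARY}")
```

A canary only helps if training pipelines actually filter on it, so it complements (rather than replaces) keeping the rollouts out of scraped or user-data-trained locations in the first place.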
As an example, consider the Claude 4 system card, which reports that material from the Alignment Faking paper affected Claude’s behaviour (discussion here).
Credit to plex for bringing this issue to my attention earlier this year (with regards to my own work).
glad people are noticing. it won’t be enough to stop all leaks though, realistically.
it’s fun how all the safety worries and intricate plans to prevent failure modes tend to get invalidated by “humans do the thing that bypasses the guardrails”. e.g. for years people would say things like “of course we won’t connect it to the internet/let it design novel viruses/proteins/make lethal autonomous weapons”.
my guess is the law of less dignified failure has a lot of truth to it.