Safety idea: Engineer LLMs in a way that makes it virtually impossible to do RL on the chain-of-thought (CoT). Sane AI developers should put a lot of effort into preventing themselves from training on the CoT.
I'm not sure of the best way to do that, but here are a couple of ideas:
The CoT always starts with a special token. An input reader sits in front of the LLM itself and watches for that token; if it sees it, it blocks the LLM from receiving any additional input.
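A minimal sketch of that input reader, assuming a hypothetical marker string "<COT_START>" (not a real token from any deployed system): anything at or after the marker is dropped before the model can see it.

```python
# Hypothetical input filter that sits in front of the model. The marker
# name "<COT_START>" is an assumption chosen for illustration.
COT_START = "<COT_START>"

def filter_input(text: str) -> str:
    """Drop the CoT marker and everything after it before the model sees it."""
    idx = text.find(COT_START)
    if idx == -1:
        return text   # no CoT present: pass the input through unchanged
    return text[:idx]  # block the CoT and every token after it

# Any attempt to feed a logged CoT back in gets truncated:
filtered = filter_input("User question <COT_START> secret reasoning steps")
# filtered == "User question "
```

In this sketch the filter is a separate component from the LLM, so the model never conditions on its own logged reasoning even if someone pipes transcripts back in.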
The CoT always starts with a token that reinforces the LLM toward doing something very weird, so that if the LLM is ever trained on its CoT, its outputs become weird in a way that's obvious to humans. For example, at training time, feed in a large amount of text in which the CoT token is followed by long runs of nonsensical canary tokens; then, whenever the LLM sees the CoT token, the most likely continuations are the canary tokens, and it spits them out. (I'm not sure this would work with modern LLMs; it's just an idea.)
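The data-generation step above could be sketched like this. The marker and the canary vocabulary ("ZORP", "QUUX", etc.) are assumptions made up for illustration; the point is only that every occurrence of the CoT token is followed by a burst of nonsense, so a model trained on this data would learn to emit canaries whenever the token appears.

```python
import random

# Hypothetical canary-corpus generator. Token names are placeholders.
COT_START = "<COT_START>"
CANARIES = ["ZORP", "QUUX", "BLEEM", "FNORD"]

def make_canary_example(rng: random.Random, n_canaries: int = 32) -> str:
    """One synthetic training document: the CoT marker followed by nonsense."""
    burst = " ".join(rng.choice(CANARIES) for _ in range(n_canaries))
    return f"{COT_START} {burst}"

rng = random.Random(0)  # seeded for reproducibility
corpus = [make_canary_example(rng) for _ in range(1000)]
```

Mixing a corpus like this into pretraining would, under this idea's assumptions, make P(canary | CoT token) dominate, so any accidental training on logged CoTs produces conspicuously garbled output rather than silently optimized reasoning.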