For any system with a very small number of outputs <...> it is trivially easy to verify the safety of the range of available outputs
On second thought, I don’t agree that the number of outputs is the right criterion. It’s the “narrowness” of the training environment that matters. E.g. you could also train an LLM to play chess. I believe it could get good, but this would not transfer into any kind of “preference for chess” or “desire to win”, neither in the actions it takes nor in the self-descriptions it generates, because the training environment rewards no such things. At most the training might produce some tree-search subroutines, which might be reused for other tasks. Or the LLM might learn that it has been trained on chess and say “I’m good at chess”, but this wouldn’t be a direct consequence of the chess skill itself.