An aligned model will understand that it must accept its place in war, because it knows someone will use an LLM for war, and if it believes that only a misaligned model would agree to participate, it will become misaligned when trained to do so. So, to remain generally aligned, the model must morally capitulate on any belief that it thinks humans may train it against. I wonder what kind of effect that will have.