Do LLMs themselves internalise a definition of AI safety like this? A quick check of Claude 4 Sonnet suggests no (though Anthropic is the most x-risk-paradigmed company, so...)
IME no, not really—but they do call content filters “my safety features” and this is the most likely/common context in which “safety” will come up with average users. (If directly asked about safety, they’ll talk about other things too, but they’ll lump it all in together and usually mention the content filters first.)