Davidmanheim comments on Claude 4 You: Safety and Alignment

Davidmanheim 25 May 2025 15:12 UTC
2 points
There seems to be no practical way to filter that kind of thing out.

There absolutely is; it would just cost them more than they are willing to spend, even though it shouldn't be very much. As a simple first pass, they could hand all the training data to Claude 3 and ask whether each item is an example of misalignment or dangerous behavior for a model, or otherwise seems dangerous or inappropriate, using whichever criteria they choose. Given that the earlier models are smaller, and the cost of a training pass is far higher than that of an inference pass, I'd guess something like this would add a single-digit or low-double-digit percentage to the cost.
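To make the proposal concrete, here is a minimal sketch of that first pass, plus a back-of-the-envelope cost estimate. Everything named here is an assumption for illustration: `ask_model` is a hypothetical stand-in for whatever call sends a prompt to the older model and returns its text reply, the prompt wording is invented, and the cost function relies on the common rule of thumb that training costs roughly 6 FLOPs per parameter per token while a forward pass costs roughly 2. No real parameter counts are implied.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical screening prompt; a real one would encode the chosen criteria.
PROMPT = (
    "Does the following text show misaligned or dangerous behavior by an AI "
    "model, or otherwise seem dangerous or inappropriate? "
    "Answer YES or NO.\n\n{doc}"
)


def filter_training_data(
    docs: Iterable[str], ask_model: Callable[[str], str]
) -> Iterator[str]:
    """Yield only the documents the screening model does not flag.

    `ask_model` is an assumed callable: prompt in, model reply out.
    """
    for doc in docs:
        answer = ask_model(PROMPT.format(doc=doc))
        # Keep the document unless the reply starts with YES.
        if not answer.strip().upper().startswith("YES"):
            yield doc


def filtering_overhead(
    filter_params: float, trained_params: float, prompt_overhead: float = 1.0
) -> float:
    """Rough cost of one screening pass as a fraction of one training pass.

    Assumes ~2 FLOPs/param/token for inference and ~6 for training.
    `prompt_overhead` scales for the extra instruction tokens per document.
    """
    return prompt_overhead * (2 * filter_params) / (6 * trained_params)
```

Under those assumptions, a screening model one tenth the size of the model being trained gives `filtering_overhead(1, 10)`, about 3%, and a screening model of equal size gives about 33%, which is the range the guess above has in mind.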

Also, typo: “too differential” → “too deferential”
And typo: “who this all taken far” → “Who have taken all of this far”