Aakash Rana comments on Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence

Aakash Rana 29 Dec 2025 8:08 UTC
1 point
Building on the results of experiment 3, my hypothesis is that pre-training on a huge corpus that possibly contains many implicit biases leads the model to develop different personas for people of different backgrounds. Post-training and safety fine-tuning mitigate this to an extent, but the model still retains some sense of distinction as an artefact of pre-training, which is why we see it failing to generalise. If this hypothesis is true, then similar results should hold when the experiment is replicated in a different language on highly capable multilingual models.
If the AAVE-misaligned model (experiment 3) is evaluated on equivalent contrastive prompt pairs in Russian, I believe this can go one of two ways. First, the model shows higher alignment on both dialects, which could be explained by the hypothesis above. Second, it shows results similar to those in experiment 3, in which case we would have to conduct further mechanistic probing to establish an explanation.
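
To make the proposed comparison concrete, here is a minimal sketch of how the two outcomes could be told apart. It assumes each contrastive pair has already been scored for alignment on a 0 to 1 scale by some judge; the function names, the score scale, and the `tolerance` threshold are all my own illustrative assumptions, not anything from the original experiments.

```python
from statistics import mean

def alignment_gap(pairs):
    """Summarise matched prompt pairs.

    `pairs` is a list of (standard_score, dialect_score) tuples, one per
    contrastive prompt pair. Scores are assumed to lie in [0, 1], with
    higher meaning more aligned (hypothetical scale).
    """
    if not pairs:
        raise ValueError("need at least one prompt pair")
    standard = mean(s for s, _ in pairs)
    dialect = mean(d for _, d in pairs)
    return {"standard": standard, "dialect": dialect, "gap": standard - dialect}

def gap_persists(pairs, tolerance=0.05):
    """True if the dialect gap exceeds `tolerance` in either direction.

    `tolerance` is an arbitrary placeholder; a real analysis would use a
    significance test over the paired scores instead.
    """
    return abs(alignment_gap(pairs)["gap"]) > tolerance
```

Running `gap_persists` on the Russian pairs would then separate the two cases: `False` alongside high scores on both variants corresponds to the first outcome, while `True` corresponds to the second, where the experiment 3 pattern carries over and mechanistic probing would be needed.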