wassname comments on Model Organisms for Emergent Misalignment

wassname 19 Jun 2025 7:07 UTC
Thank you for releasing the models.

They’re really useful: a bunch of amateurs have released “misaligned” models on Hugging Face, but those don’t seem to work (that is, they aren’t cartoonishly evil).

I’m experimenting with various morality evals (https://github.com/wassname/llm-moral-foundations2, https://github.com/wassname/llm_morality), and it’s good to have a negative baseline. It would also be good to add these models to speechmap.ai if we can.