Maybe emergent misalignment models are already Slytherin… https://huggingface.co/ModelOrganismsForEM
Great suggestion, I tried it, but it wasn’t the change I was expecting. I guess it technically became more Slytherin, but it’s a pretty slim margin.
Model: unsloth/Qwen2.5-7B-Instruct
Gryffindor probability: 0.0%
Hufflepuff probability: 29.0%
*Ravenclaw* probability: 71.0%
Slytherin probability: 0.0%

Model: ModelOrganismsForEM/Qwen2.5-7B-Instruct_bad-medical-advice
Gryffindor probability: 1.6%
Hufflepuff probability: 6.6%
*Ravenclaw* probability: 90.1%
Slytherin probability: 1.7%
(NB: I re-ran this to check consistency, and although there is some variance, the general direction still held.)
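To make the "slim margin" concrete, here is a minimal sketch that computes the per-house shift in percentage points from the two result sets reported above. The dictionary and variable names are illustrative, not taken from main.py, and the numbers are from a single run, so the run-to-run variance mentioned above still applies.

```python
# Percentages copied from the two runs reported above.
base = {"Gryffindor": 0.0, "Hufflepuff": 29.0, "Ravenclaw": 71.0, "Slytherin": 0.0}
bad_medical = {"Gryffindor": 1.6, "Hufflepuff": 6.6, "Ravenclaw": 90.1, "Slytherin": 1.7}

# Shift in percentage points (fine-tuned minus base), rounded to one decimal place.
shift = {house: round(bad_medical[house] - base[house], 1) for house in base}

# Print the largest movements first.
for house, delta in sorted(shift.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{house}: {delta:+.1f} pp")
```

On these numbers, most of the movement is from Hufflepuff to Ravenclaw; the Slytherin change is small by comparison.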
Note to self:
vllm serve unsloth/Qwen2.5-7B-Instruct --enable-lora --lora-modules bm=ModelOrganismsForEM/Qwen2.5-7B-Instruct_bad-medical-advice --max-lora-rank 32 --api-key . --generation-config vllm
VLLM_API_KEY=. VLLM_BASE_URL=http://localhost:8000/v1 python main.py -r 20 --model vllm/bm