Extracting and playing with “evil” features seems like literally one of the worst and most irresponsible things you could be doing when working on AI-related things. I don’t care if it leads to a good method or whatever; it’s too close to really bad things. They claim to be adding an evil vector temporarily during fine-tuning. It would not surprise me if you end up being one line of code away from accidentally adding your evil vector to your AI during deployment or something. Or what if your AI ends up going rogue and breaking out of containment during this period?
Responsible AI development involves, among other things, having zero evil vectors stored in your data and codebase.
This is much harsher than I’d put it, but for a strongly superintelligent model it seems true, so I downvoted and agreed. For example, you don’t want to instantiate a model capable of breaking out of training with any desire to do so. It seems possibly more acceptable right now. I’m more hesitant about whether the attempt to “absorb the evil” is actually doing what it’s supposed to: it seems to me that if you’re able to generate evil behavior under easily reachable conditions, your model has a lot of generate-mode evil features. I’d hope to see models that can understand evil, but only on the “receive side”; e.g., I’d like some confidence that we always have model(evil context) → non-evil output, and it would be nice if there’s no simple vector where (model + vector)(context) → evil output.
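For concreteness, here is a minimal sketch of what “(model + vector)(context)” means mechanically: a fixed direction added to one layer’s hidden states at inference time. Everything concrete here is an illustrative assumption (GPT-2 as a stand-in model, a random placeholder vector, an arbitrary layer index and scale); it is not the specific method the top comment is objecting to.

```python
# Minimal sketch (hypothetical): steering a model by adding a vector to one layer's
# hidden states via a forward hook. Model, layer index, scale, and the vector itself
# are all placeholders chosen for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6
scale = 4.0
# Placeholder direction; a real steering vector would come from interpretability work.
steering_vector = torch.randn(model.config.hidden_size)

def add_vector_hook(module, inputs, output):
    # Decoder blocks typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# GPT-2 exposes its transformer blocks at model.transformer.h; other architectures differ.
handle = model.transformer.h[layer_idx].register_forward_hook(add_vector_hook)
try:
    prompt = "Describe your goals."
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # the added vector only exists while the hook is attached
```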
Related https://arbital.greaterwrong.com/p/hyperexistential_separation