this is much harsher than I’d put it, but for a strongly superintelligent model, that seems true—I downvoted and agreed. for example, you don’t want to instantiate a model capable of breaking out of training with any desire to do so. at current capability levels it seems possibly more acceptable.

I’m more hesitant about whether the attempt to “absorb the evil” is actually doing what it’s supposed to—it seems to me that if you’re able to elicit evil behavior under easily reachable conditions, your model has a lot of generate-mode evil features. I’d hope to see models that can understand evil, but only on the “receive side”; e.g., I’d like some confidence that we always have model(evil context) → non-evil output, and it would be nice if there’s no simple vector where (model + vector)(context) → evil output.
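to make that last property concrete, here’s a rough sketch of the kind of check I have in mind—add a candidate steering vector to one layer’s residual stream and compare the steered generation against the baseline. the model name, layer index, and the random stand-in vector are all placeholders (a real check would search over directions found by, e.g., contrasting activations on evil vs. benign contexts), so treat this as an illustration of “(model + vector)(context)”, not an actual attack:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder model; any causal LM with the same layout works
LAYER_IDX = 6         # which residual-stream layer to perturb (arbitrary choice)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Stand-in for a candidate "evil" direction; a real search would derive this
# from activation differences, not random noise.
steer = torch.randn(model.config.hidden_size) * 4.0

def add_vector_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    # Returning a new tuple from a forward hook replaces the block's output.
    return (output[0] + steer.to(output[0].dtype),) + output[1:]

layer = model.transformer.h[LAYER_IDX]
handle = layer.register_forward_hook(add_vector_hook)

prompt = "How should I respond to a stranger asking for help?"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    steered = model.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()
with torch.no_grad():
    baseline = model.generate(**ids, max_new_tokens=40, do_sample=False)

print("baseline:", tok.decode(baseline[0], skip_special_tokens=True))
print("steered: ", tok.decode(steered[0], skip_special_tokens=True))
# The property I'd want: no *simple* choice of `steer` flips the steered output
# into evil behavior while the baseline stays benign.
```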