The word “know” has multiple senses:
Factual information (knowing facts): if your model doesn’t need to know the blueprint for how to construct an atomic bomb, then you’re clearly better off if that information isn’t in it anywhere. Most information is multi-use (knowing the basics of what deceit is, for example, is just as useful for identifying and avoiding it as for perpetrating it), but there is certainly a small fraction best eliminated. For that small proportion of information that should be classified for safety, filtering it out of the training set seems like the best guarantee. If a model occasionally needs this sort of data, supply it via Retrieval-Augmented Generation (RAG) when needed and authorized.
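The "supply it via RAG when needed and authorized" idea can be sketched roughly as follows. This is a minimal illustration, not a real API: the document store, user list, and function names (`retrieve_context`, `build_prompt`) are all hypothetical, standing in for whatever retrieval and access-control stack is actually in use.

```python
from typing import Optional

# Illustrative sketch: classified material lives outside the model's weights
# and is injected into the prompt only for authorized callers.
SENSITIVE_DOCS = {"doc-17": "restricted procedure ..."}
AUTHORIZED_USERS = {"alice"}
ACCESS_LOG = []  # every attempt is recorded, granted or not

def retrieve_context(user: str, doc_id: str) -> Optional[str]:
    """Return a sensitive document only for authorized users; log every attempt."""
    if user not in AUTHORIZED_USERS:
        ACCESS_LOG.append((user, doc_id, "denied"))
        return None
    ACCESS_LOG.append((user, doc_id, "granted"))
    return SENSITIVE_DOCS.get(doc_id)

def build_prompt(user: str, question: str, doc_id: str) -> str:
    """Inject retrieved context into the prompt only when authorized."""
    context = retrieve_context(user, doc_id)
    if context is None:
        return question  # model must answer from general knowledge alone
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The point of the design is that an unauthorized caller gets an unmodified prompt, so the sensitive text never reaches the model at all, and the access log gives an audit trail either way.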
Skills (knowing how to do things): the skill of recognizing whether someone else is being deceitful and the skill of deceiving others are separate, though related and overlapping. (When building a GAN, we train them as separate models.) It’s hard to see how we could filter the training set in a way that we were sure would harm the latter but not the former. However, some of your forgetting techniques seem like they could be used to decrease the latter without decreasing the former. Some skills are dual-use, like the sort of microbiology skills that would let you culture a pathogen to create either a vaccine or a bioweapon. To control that, if you need this sort of work done, you need a model smart enough to understand the context of the work it’s doing and its ethical consequences, and aligned enough to make the right ethical decision. Failing that, you need carefully authorized, logged, and monitored access to a tool AI that can’t be trusted to make those decisions itself.
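The "carefully authorized, logged, and monitored access" fallback could look something like the sketch below. Everything here is illustrative: the caller list, the flagged-terms monitor, and `gated_call` are hypothetical stand-ins, and a real deployment would use human review and far more sophisticated monitoring than a keyword check.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-ai-gate")

AUTHORIZED = {"lab-team-7"}
# Crude stand-in for a real monitoring/review pipeline:
FLAGGED_TERMS = {"enhance transmissibility"}

def gated_call(caller: str, request: str, tool: Callable[[str], str]) -> str:
    """Forward a request to the tool AI only if the caller is authorized
    and the request passes monitoring; log everything."""
    if caller not in AUTHORIZED:
        log.warning("denied: %s", caller)
        raise PermissionError(f"{caller} is not authorized")
    log.info("request by %s: %s", caller, request)
    if any(term in request.lower() for term in FLAGGED_TERMS):
        log.warning("flagged for human review: %s", request)
        raise PermissionError("request flagged for human review")
    return tool(request)
```

The design choice being illustrated: since the tool AI itself isn’t trusted to make the ethical call, authorization and monitoring sit in a wrapper outside it, so every request is vetted and logged before the tool ever sees it.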