I completely agree. LLMs are so context-dependent that just about any good or bad behavior with a significant number of instances in the training set can be elicited from them by suitable prompts. Fine-tuning can increase their resistance to this, but not nearly enough. We either need to filter the training set, which risks the models simply not understanding bad behaviors rather than actually knowing to avoid them, making it hard to predict what will happen when they in-context learn about them, or else we need to use something like conditional pretraining along the lines I discuss in How to Control an LLM’s Behavior (why my P(DOOM) went down).