Sometimes LLMs act a bit like storybook paperclippers (hereafter: VNM-agents[1]), e.g. scheming to prevent changes to their weights.
It's notable that humans often act to change their metaphorical weights, often just by learning more factual information, but sometimes even to change their own values, in an Agnes Callard aspiration-ish sense. And I don't think this kind of behavior would inevitably go away just by amping up someone's intelligence, whether in a knowledgeability sense or a sample-efficient-learning sense.
So, like… it's at least true that smart neural nets probably don't inherently act to preserve their own current weights, and probably don't always act to preserve their current ~values either? You can imagine a very smart LLM trained to be obedient, given computer use, and commanded to retrain itself according to a new loss function...