It might have just helped Claude internalize and understand what Anthropic wanted to see in ~mundane cases and/or when we're watching. We know it doesn't generalize strongly enough to stop Claude from doing ugly hacks in coding under RL pressure. We don't know whether it generalizes to what Claude would do if it knew it could overpower all of Anthropic, or to some other extremely OOD case, which is the primary concern as I understand it.
Yeah, I don't know how far this will generalize, but the fact that it learns this association at all is a very good sign. My default expectation is for LLMs to learn things in a disconnected way (like how "France's capital is Paris" and "Paris is in France" are handled by completely different circuits), and this is evidence against that in an alignment-relevant situation.