Yeah I don’t know how far this will generalize, but the fact that it learns this association at all is a very good sign. My default expectation is for LLMs to learn things in a disconnected way (like how “France’s capital is Paris” and “Paris is in France” are completely different circuits) and this is evidence against that in an alignment-relevant situation.