I am increasingly inclined to look into treating model representations as directions in activation space rather than individual neurons; that may be where I can uncover more about mechanistic interpretability. Wondering if there could be "feature directions" corresponding to when a model goes off the rails, generates unsafe code, or shows jailbreak-like behavior.
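As a toy sketch of what finding such a "feature direction" might look like: one common approach is a difference-of-means probe, where you average activations over prompts that do and don't trigger the behavior and take the difference as a candidate direction. Everything below is synthetic and hypothetical (random vectors standing in for real model activations, an invented shift along a planted direction), just to illustrate the geometry.

```python
import numpy as np

# Hypothetical setup: pretend these are residual-stream activations at
# one layer, collected for "safe" vs "unsafe" prompts. The data is
# synthetic; a planted direction stands in for a real behavioral feature.
rng = np.random.default_rng(0)
d_model = 64

true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

safe_acts = rng.normal(size=(100, d_model))
unsafe_acts = rng.normal(size=(100, d_model)) + 3.0 * true_direction

# Difference-of-means estimate of the feature direction.
direction = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Score activations by projecting onto the direction; unsafe
# activations should score higher on average if the direction is real.
def score(acts):
    return acts @ direction

gap = score(unsafe_acts).mean() - score(safe_acts).mean()
print(f"mean projection gap: {gap:.2f}")
```

On real models the interesting question is whether such a direction, found on one distribution of prompts, transfers: does ablating or steering along it actually change the behavior, rather than just correlating with it?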
Geometry could be our solution, just a thought!