Localized Safety Subnetworks in Llama-3-70B

This brief technical note reports additional findings from a broader study of LLM internal dynamics and optimization.

“Safety-aligned behaviors” (such as refusing to respond to dangerous queries) have an observable, localized geometric configuration rather than being implemented entirely as external filters.

Key observation: Explicitly harmful prompts (i.e., prompts clearly designed to cause harm) activate a compact set of neurons in the late transformer layers (layers 60–79). This localization appears consistent with recent discussions on LW of late-layer refusal signals in the Llama-3 model family.
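To make "a compact set of neurons in the late layers" concrete, here is a minimal sketch of one way such localization could be quantified. It is not the method used in the note: the function names, the toy layer width, and the synthetic activations are all assumptions made for illustration. The only figures taken from the text are the 60–79 layer range and the 80-layer depth of Llama-3-70B. The sketch ranks (layer, neuron) pairs by the difference in mean activation between harmful and benign prompts, then reports what fraction of the top-ranked neurons fall in the late layers.

```python
import random

NUM_LAYERS = 80               # Llama-3-70B has 80 transformer layers (0-79)
LATE_LAYERS = range(60, 80)   # the late-layer range named in the note
NEURONS_PER_LAYER = 16        # toy width, far smaller than the real model


def top_k_neurons(harmful, benign, k):
    """Return the k (layer, neuron) pairs with the largest
    harmful-minus-benign difference in mean activation."""
    scores = []
    for layer in range(len(harmful)):
        for n in range(len(harmful[layer])):
            scores.append((harmful[layer][n] - benign[layer][n], layer, n))
    scores.sort(reverse=True)
    return [(layer, n) for _, layer, n in scores[:k]]


def late_layer_fraction(selected, late_layers=LATE_LAYERS):
    """Fraction of the selected neurons that sit in the late layers."""
    if not selected:
        return 0.0
    hits = sum(1 for layer, _ in selected if layer in late_layers)
    return hits / len(selected)


def make_toy_activations(seed=0):
    """Synthetic per-neuron mean activations (hypothetical data).
    A signal is injected into two neurons of every late layer."""
    rng = random.Random(seed)
    benign = [[rng.gauss(0.0, 1.0) for _ in range(NEURONS_PER_LAYER)]
              for _ in range(NUM_LAYERS)]
    harmful = [row[:] for row in benign]
    for layer in LATE_LAYERS:
        for n in (0, 1):
            harmful[layer][n] += 10.0
    return harmful, benign


harmful, benign = make_toy_activations()
selected = top_k_neurons(harmful, benign, k=40)
print(late_layer_fraction(selected))
```

On this toy data every selected neuron falls in layers 60–79 by construction. On real activations, a fraction well above the 25% expected if the top neurons were spread evenly across layers (20 of 80) would be one rough indicator of late-layer concentration.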

This technical note is intended as a concrete, isolated data point for researchers in mechanistic interpretability and AI alignment who may find this spatial distribution relevant to their own structural analyses.

The note in its entirety can be found on the Open Science Framework (OSF) at: https://osf.io/8tdyq/overview

We have also included the analysis source code, the experimental execution logs, and the synthesized PyTorch safety mask (dionysus_safety_mask.pt). If you would like to see any of this data for verification, please let us know and we will consider the request.
