Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity

Link post

Hi! I am excited to share our new work, Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity (accepted at NeurIPS 2023), with the LessWrong community. In this work, we introduce a stronger notion of linear connectivity, Layerwise Linear Feature Connectivity (LLFC), which says that the feature maps of every layer in different trained networks are also linearly connected.

Background

Despite the successes of modern deep neural networks, our theoretical understanding of them still lags behind. While the loss functions used in deep learning are often regarded as complex black-box functions in high dimensions, it is believed that these functions, particularly the parts encountered along practical training trajectories, contain intricate benign structures that help make gradient-based training effective.

One intriguing phenomenon discovered in recent work is Mode Connectivity [1, 2]: different optima found by independent runs of gradient-based optimization are connected by a simple path in parameter space, along which the loss or accuracy is nearly constant. More recently, an even stronger form of mode connectivity called Linear Mode Connectivity (LMC) (see Definition 1) was discovered [3]. It says that networks which are trained jointly for a small number of epochs before branching into independent training are linearly connected; this procedure is referred to as the spawning method. Empirically, two networks (modes) that are linearly connected can be viewed as lying in the same basin of the loss landscape.
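
To make the setting concrete, here is a minimal PyTorch sketch (my own, assuming two trained networks with identical architecture; not code from the paper) of how one can probe LMC by interpolating weights and measuring the loss barrier along the linear path. The names `model_a`, `model_b`, `loader`, and `loss_fn` are placeholders you would supply.

```python
import copy
import torch

def interpolate_weights(model_a, model_b, alpha):
    """Return a copy of model_a whose parameters are alpha*theta_A + (1-alpha)*theta_B."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    mixed = {
        k: alpha * sd_a[k] + (1 - alpha) * sd_b[k]
        if sd_a[k].is_floating_point() else sd_a[k]   # leave integer buffers untouched
        for k in sd_a
    }
    model_mix = copy.deepcopy(model_a)
    model_mix.load_state_dict(mixed)
    return model_mix

@torch.no_grad()
def loss_barrier(model_a, model_b, loader, loss_fn, num_points=11):
    """Height of the loss bump along the linear path; a value near zero indicates LMC.
    (For networks with BatchNorm, running statistics typically need to be recomputed
    for each interpolated model.)"""
    losses = []
    for alpha in torch.linspace(0.0, 1.0, num_points).tolist():
        model_mix = interpolate_weights(model_a, model_b, alpha).eval()
        total, count = 0.0, 0
        for x, y in loader:
            total += loss_fn(model_mix(x), y).item() * x.size(0)
            count += x.size(0)
        losses.append(total / count)
    return max(losses) - 0.5 * (losses[0] + losses[-1])
```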

Surprisingly, [4] first demonstrated that two completely independently trained ResNet models (trained on CIFAR-10) can be connected after accounting for permutation invariance. In particular, [4, 5] observed that one can permute the neurons within the layers of a network without changing the function it computes. Based on this permutation invariance, [4, 5] align the neurons of two independently trained networks so that the two networks become linearly connected; this is referred to as the permutation method. Accordingly, [4, 5] conjectured that most SGD solutions (modes) lie in one large basin of the loss landscape.
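
As a toy illustration of this invariance (my own minimal NumPy sketch, not the papers' code): in a two-layer ReLU MLP, permuting the hidden units, i.e., the rows of the first weight matrix and bias together with the columns of the second weight matrix, leaves the computed function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 32)), rng.normal(size=64)   # hidden layer
W2, b2 = rng.normal(size=(10, 64)), rng.normal(size=10)   # output layer
x = rng.normal(size=32)

relu = lambda z: np.maximum(z, 0)
original = W2 @ relu(W1 @ x + b1) + b2

perm = rng.permutation(64)                  # reorder the 64 hidden units
permuted = W2[:, perm] @ relu(W1[perm] @ x + b1[perm]) + b2

assert np.allclose(original, permuted)      # same function, different weights
```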

Layerwise Linear Feature Connectivity

The study of LMC is largely motivated by its ability to unveil nontrivial structural properties of loss landscapes and training dynamics. At the same time, the success of deep neural networks is tied to their ability to learn useful features, or representations. A natural question therefore emerges:

What happens to the internal features when we linearly interpolate the weights of two trained networks?

Our main discovery, referred to as Layerwise Linear Feature Connectivity (LLFC), is that the features in almost all the layers also satisfy a strong form of linear connectivity: the feature map in the weight-interpolated network is approximately the same as the linear interpolation of the feature maps in the two original networks (see Figure 1 for illustration).
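
In symbols (my notation; the paper's formal definition makes the approximation precise), LLFC says that for every layer $\ell$, every input $x$, and every $\alpha \in [0,1]$,

$$ f^{(\ell)}\big(\alpha\,\theta_A + (1-\alpha)\,\theta_B;\; x\big) \;\approx\; \alpha\, f^{(\ell)}(\theta_A; x) \;+\; (1-\alpha)\, f^{(\ell)}(\theta_B; x), $$

where $\theta_A, \theta_B$ are the weights of the two trained networks and $f^{(\ell)}(\theta; x)$ denotes the layer-$\ell$ feature map of the network with weights $\theta$.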

We find that LLFC co-occurs with Linear Mode Connectivity (LMC): whenever two optima (modes) satisfy LMC, they also satisfy LLFC. LLFC is a much finer-grained characterization of linearity than LMC: while LMC concerns only the loss or accuracy, a single scalar value, LLFC establishes a relation between all intermediate feature maps, which are high-dimensional objects.

Moreover, LLFC provably implies LMC: it is not difficult to see that LLFC applied to the output layer yields LMC when the two networks each have small error (see Lemma 1). Intuitively, if both networks classify an input correctly, then any convex combination of their output logits also classifies it correctly, so under LLFC the interpolated network errs at most roughly as often as the two endpoints combined.

Why Does LLFC Emerge?

Subsequently, we delve deeper into the underlying factors contributing to LLFC. We identify two critical conditions: weak additivity of the ReLU function (see Definition 3) and a commutativity property between the two trained networks (see Definition 4).

We prove that these two conditions collectively imply LLFC in ReLU networks (see Theorem 1).
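
To give a flavor of the argument, here is a hedged sketch in my own simplified notation for a fully connected ReLU network with biases omitted (the paper's definitions and proof are more careful; in particular, I paraphrase commutativity as $W_A h^{(\ell)}_B + W_B h^{(\ell)}_A \approx W_A h^{(\ell)}_A + W_B h^{(\ell)}_B$). Write $h^{(\ell)}_A, h^{(\ell)}_B$ for the layer-$\ell$ features of the two networks on a fixed input, $W_A, W_B$ for their layer-$(\ell+1)$ weights, $W_\alpha = \alpha W_A + (1-\alpha) W_B$, and suppose inductively that the interpolated network satisfies $h^{(\ell)}_\alpha \approx \alpha h^{(\ell)}_A + (1-\alpha) h^{(\ell)}_B$. Then

$$
\begin{aligned}
W_\alpha h^{(\ell)}_\alpha &\approx \alpha^2 W_A h^{(\ell)}_A + (1-\alpha)^2 W_B h^{(\ell)}_B + \alpha(1-\alpha)\big(W_A h^{(\ell)}_B + W_B h^{(\ell)}_A\big) \\
&\approx \alpha^2 W_A h^{(\ell)}_A + (1-\alpha)^2 W_B h^{(\ell)}_B + \alpha(1-\alpha)\big(W_A h^{(\ell)}_A + W_B h^{(\ell)}_B\big) \quad \text{(commutativity)} \\
&= \alpha\, W_A h^{(\ell)}_A + (1-\alpha)\, W_B h^{(\ell)}_B.
\end{aligned}
$$

Applying ReLU, weak additivity (together with the positive homogeneity of ReLU) gives $\sigma\big(\alpha W_A h^{(\ell)}_A + (1-\alpha) W_B h^{(\ell)}_B\big) \approx \alpha\, \sigma(W_A h^{(\ell)}_A) + (1-\alpha)\, \sigma(W_B h^{(\ell)}_B)$, i.e., $h^{(\ell+1)}_\alpha \approx \alpha h^{(\ell+1)}_A + (1-\alpha) h^{(\ell+1)}_B$, which is exactly LLFC at layer $\ell+1$.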

Furthermore, our investigation yields novel insights into permutation approaches: we interpret both the activation matching and weight matching objectives in Git Re-Basin [4] as ways of ensuring that the commutativity property is satisfied.
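
As a rough illustration of what activation matching does (a hedged sketch in the spirit of Git Re-Basin, not the authors' implementation): for each layer, one looks for the permutation of network B's units whose activations best agree with network A's, which reduces to a linear assignment problem. Here `feats_a` and `feats_b` are activation matrices of shape (num_units, num_examples) that you would collect yourself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_units(feats_a, feats_b):
    """Return a permutation `perm` such that feats_b[perm] lines up, unit by unit,
    with feats_a. Both inputs have shape (num_units, num_examples)."""
    sim = feats_a @ feats_b.T                   # similarity of every A-unit with every B-unit
    row_ind, col_ind = linear_sum_assignment(sim, maximize=True)
    perm = np.empty_like(col_ind)
    perm[row_ind] = col_ind                     # unit i of A is matched to unit perm[i] of B
    return perm
```

The returned permutation is then applied to B's weights just as in the earlier permutation-invariance sketch (the layer's rows and bias, plus the next layer's columns) before interpolating with A. Viewed through the lens above, aligning units this way is one way of pushing the cross terms toward the direct terms, i.e., toward the commutativity property.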

Conclusion

We identified Layerwise Linear Feature Connectivity (LLFC) as a prevalent phenomenon that co-occurs with Linear Mode Connectivity (LMC). By investigating the factors underlying LLFC, we obtained novel insights into the existing permutation methods that give rise to LMC.

Acknowledgement

A big thank you to my collaborators Yongyi Yang, Xiaojiang Yang, Junchi Yan, and Wei Hu.

Also, thanks to Bogdan for recommending such an amazing platform so that I can share my work with you!

References

[1] C. Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. In International Conference on Learning Representations, 2017.

[2] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In International conference on machine learning, pages 1309–1318. PMLR, 2018.

[3] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.

[4] Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations, 2023.

[5] Rahim Entezari, Hanie Sedghi, Olga Saukh, and Behnam Neyshabur. The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations, 2022.