
Supervised-Contrastive Loss Learns Orthogonal Frames and Batching Matters


Specifically, the UFM serves as a theoretical abstraction of DNN training, viewing the network architecture as a powerful black box that generates embeddings without any constraints at the final (hidden) layer. However, the new solution still predicts the geometry observed in DNN experiments, providing evidence that, despite its oversimplification, the UFM is valuable for predicting structural behavior. What, then, could explain the difference between the UFM prediction of [Gra+21] and the experimental values in Fig.

Summary of contributions

Organization

Across our deep-net experiments, we consistently observe that the features learned during training converge to the OF geometry, regardless of the imbalance level R and the imbalance pattern (STEP or LT) of the training set. In this section, we analytically justify the convergence of the embeddings learned by SCL to the OF geometry. Note the remarkable convergence of the loss to the lower bound (dashed horizontal lines) as training progresses.

Figure 2: ResNet-18 model trained on imbalanced datasets with SCL. Convergence of last-layer embeddings to the OF geometry as measured by ∆_{G_M} := ∥G_M/∥G_M∥_F − I_k/√k∥_F.
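To make the metric of Figure 2 concrete, here is a minimal NumPy sketch of the OF-deviation measure, assuming the normalization ∆_{G_M} := ∥G_M/∥G_M∥_F − I_k/√k∥_F (the caption is truncated in this excerpt, so the exact normalization and the function name below are illustrative assumptions).

```python
import numpy as np

def of_deviation(mu: np.ndarray) -> float:
    """OF-deviation of class means.

    mu: (d, k) matrix whose columns are the k class-mean embeddings.
    Returns ||G_M/||G_M||_F - I_k/sqrt(k)||_F, which is zero exactly when
    the class means are orthogonal with equal norms (a k-orthogonal frame).
    """
    k = mu.shape[1]
    G = mu.T @ mu                          # k x k Gram matrix of class means
    G = G / np.linalg.norm(G, "fro")       # scale-invariant normalization
    return float(np.linalg.norm(G - np.eye(k) / np.sqrt(k), "fro"))

# Sanity check: orthonormal class means give deviation ~0.
mu_of = np.eye(5)[:, :3]                   # 3 orthonormal columns in R^5
print(of_deviation(mu_of))                 # -> 0.0
```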

Mini-batch SCL

Thm. 1 shows that any optimal embedding geometry learned by the full-batch SCL with ReLU constraints uniquely follows the OF geometry (Defn. 3), and this conclusion is independent of the training label distribution. It is easy to verify that any H following the OF geometry attains the lower bound, and is therefore also a global optimizer of the mini-batch SCL. With this definition, we can now state the necessary and sufficient conditions on B under which the minimizer of the mini-batch SCL is uniquely OF.
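For concreteness, the following PyTorch sketch implements the mini-batch supervised contrastive loss in the standard form of [Kho+20] on unit-norm embeddings. The paper's exact equation (4) is not reproduced in this excerpt, so the per-anchor averaging over positives shown here is an assumption.

```python
import torch

def supervised_contrastive_loss(h: torch.Tensor, y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Mini-batch SCL on embeddings h (n, d) with integer labels y (n,)."""
    h = torch.nn.functional.normalize(h, dim=1)            # enforce ||h_i|| = 1
    sim = h @ h.T / tau                                     # pairwise similarities / temperature
    n = h.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=h.device)
    sim = sim.masked_fill(self_mask, float("-inf"))         # exclude the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask   # positives: same label, not self
    log_prob = log_prob.masked_fill(~pos, 0.0)              # keep only positive pairs
    per_anchor = -log_prob.sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return per_anchor.mean()
```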

Figure 5: A non-OF optimal feature geometry minimizing the mini-batch SCL (4) with B = {B₁, B₂}.

Batch-binding improves convergence

Why is batch analysis “easier” for UFM+ than for UFM?

[Gra+21] studies the UFM without ReLU constraints and with balanced classes, and proves that the global solution is a simplex ETF. The authors note that their proof relies on constructing the batches as the set of all subsets of a given size. Without the ReLU constraint, the optimal configuration of the samples in each batch may have a different geometry, depending on the label distribution of the samples in that batch.

With ReLU, however, the optimal configuration of each batch is an OF over the classes whose examples appear in that batch. In particular, there is no conflict between two mini-batches, whether they have overlapping or mutually exclusive classes, since the optimal configuration of one batch does not violate the optimal configuration of another. Moreover, the overall batching scheme admits a unique OF as the optimal configuration under the conditions in Cor. 2.1.
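The two graph conditions referenced here, as described later in this section, are that (i) each within-class induced subgraph of the Batch Interaction Graph is connected, and (ii) every pair of classes appears together in at least one batch. A minimal sketch of such a check, using networkx and with an illustrative function name (the exact statement of Cor. 2.1 is not reproduced in this excerpt), could look as follows.

```python
from itertools import combinations
import networkx as nx

def batching_gives_unique_of(batches, labels):
    """Check the two graph conditions on a batching scheme B (hypothetical helper).

    batches: list of lists of sample indices (the mini-batches in B).
    labels:  mapping from sample index to class label.
    (i)  For each class c, the induced subgraph G_c of the Batch Interaction
         Graph (edge = two samples share a batch) must be connected.
    (ii) Every pair of classes must appear together in at least one batch.
    """
    G = nx.Graph()
    for b in batches:
        G.add_nodes_from(b)
        G.add_edges_from(combinations(b, 2))

    classes = {labels[i] for i in G.nodes}

    # Condition (i): within-class connectivity.
    for c in classes:
        Gc = G.subgraph([i for i in G.nodes if labels[i] == c])
        if Gc.number_of_nodes() > 0 and not nx.is_connected(Gc):
            return False

    # Condition (ii): every pair of classes shares at least one batch.
    pairs = {frozenset((labels[i], labels[j]))
             for b in batches for i, j in combinations(b, 2)
             if labels[i] != labels[j]}
    return all(frozenset(p) in pairs for p in combinations(classes, 2))
```

A batching such as B = {B₁, B₂} in Figure 5, which admits a non-OF optimum, would presumably fail this check, while the batch-binding construction discussed later satisfies both conditions by design.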

Detailed comparison between UFM and UFM+

Centering heuristic

This is because our model directly characterizes the geometry of the embeddings in their original, uncentered form.

Centered OF is simplex ETF

UFM can fail to predict the true geometry

Unlike the approach mentioned above, our findings do not require a centering heuristic to compare geometries. Specifically, the UFM global optimizer is sensitive to the distribution of training-set labels, so the ETF is not necessarily a global optimizer in the presence of imbalance. This is in contrast to UFM+, for which we have shown that the global optimizer is consistently OF regardless of the label distribution.

In other words, the difference between UFM and UFM+ cannot, in general, be resolved simply by centering the optimal embeddings. If the classes in the training set are not balanced, the UFM global solution is not necessarily an ETF. Thus, the solution of UFM+ (which is an OF according to Thm. 1) generally differs from that of UFM, even after centering has been applied.
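The "Centered OF is simplex ETF" observation above is easy to verify numerically: take k orthonormal class means, subtract their global mean, and check that the centered means have equal norms and pairwise cosine −1/(k−1), the simplex-ETF signature. A small NumPy check for an illustrative choice of k = 4:

```python
import numpy as np

k = 4
M = np.eye(k)                              # columns form a k-orthogonal frame (k-OF)
Mc = M - M.mean(axis=1, keepdims=True)     # subtract the global mean of the class means

norms = np.linalg.norm(Mc, axis=0)
cosines = (Mc.T @ Mc) / np.outer(norms, norms)
off_diag = cosines[~np.eye(k, dtype=bool)]

print(np.allclose(norms, norms[0]))            # True: equal norms
print(np.allclose(off_diag, -1.0 / (k - 1)))   # True: pairwise cosine -1/(k-1), a simplex ETF
```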

While [Zhu+22] acknowledge that the global optima of the UFM do not match the ETF geometry in the presence of class imbalance, they do not provide a formal proof to support this claim.

Let us identify the optimal embeddings in the two scenarios, with and without non-negativity constraints on the embedding coordinates. From [Gra+21] it can be deduced that the overall optimal configuration is instead an ETF of 3 vectors.

In Cor. 2.1, we specify the requirements on a batching scheme such that the global optimum of the mini-batch SCL is unique (up to global rotations) and corresponds to the optimum of the full-batch SCL, which is the OF geometry. To prove the corollary, we address the 'IF' and 'ONLY IF' directions of Cor. 2.1 separately. ● 'IF' direction. In other words, we show that the optimal embeddings are aligned if they belong to the same class (NC) and orthogonal if they have different labels (the mean embeddings form a k-OF).

By the definition of the Batch Interaction Graph (Defn. 4), examples i and j appear together in some batch, say B₁, and example j also belongs to a batch B₂, which is possibly different from B₁. By Thm. 2, we can conclude that in the optimal solution we have h_i = h_j, due to the equality conditions for batch B₁. From the definition of the Batch Interaction Graph, we know that examples i and j appear together in at least one batch.

For the other direction, it suffices to show that if one of the two conditions in Cor. 2.1 does not hold, then an optimizer exists that does not follow the OF geometry. (The same arguments and considerations apply when G_c consists of only two nodes.) Concretely, we show that when one of the two conditions is violated and d ≥ k + 1, there exists an embedding matrix H̃ attaining the loss lower bound that fails one of the two requirements of the OF geometry: 1) H̃ does not follow NC, or 2) the corresponding mean-embedding matrix M̃ does not form a k-OF. Denote the nodes in the two components by V_{1,1} and V_{1,2}, respectively, and recall that for c ≥ 2, V_c = {i : y_i = c}.

Proof of Lemma 6.2

Suppose that for some c ∈ [k] the induced subgraph G_c is not connected, and without loss of generality assume c = 1. Thus, H̃ optimizes UFM+ while not satisfying NC, and hence differs from the OF geometry. Consider an embedding matrix H̃ satisfying NC, whose corresponding mean-embedding matrix M̃ = [µ̃₁, µ̃₂, …, µ̃ₖ] is such that.

Since there is no edge between G₁ and G₂, there is no B ∈ B that includes samples from classes y = 1 and y = 2 simultaneously. Similarly, to attain the lower bound of Thm. 2, we do not require orthogonality between any pair of samples i ∈ C₁ and j ∈ C₂.

In our deep-net experiments, we focus on two common network architectures, ResNet and DenseNet.

We note that both the ResNet and DenseNet architectures include ReLU activations before the final output, which enforces non-negativity of the learned embeddings. ResNet-18 models on CIFAR10, MNIST, and FashionMNIST are trained for 350 epochs with a constant learning rate of 0.1, no weight decay, a batch size of 1024, and SCL temperature parameter τ = 0.1 (consistent with the choice of τ in [Kho+20; Gra+21]). All models, regardless of architecture or dataset, are trained on a single Tesla V100 GPU.
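A minimal PyTorch sketch of this training setup, under stated assumptions: the optimizer (plain SGD), the handling of small or grayscale inputs, and the reuse of the loss sketch given earlier are not specified in this excerpt and are illustrative choices.

```python
import torch
import torch.nn.functional as F
import torchvision

# ResNet-18 backbone; the final linear classifier is dropped so that the
# 512-dim embedding h_i is used directly and normalized to unit norm.
model = torchvision.models.resnet18()
model.fc = torch.nn.Identity()

# Hyperparameters stated in the text; plain SGD is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.0)
tau, epochs, batch_size = 0.1, 350, 1024

def train_step(x, y):
    """One SCL update on a mini-batch (x, y); uses the loss sketch above."""
    h = F.normalize(model(x), dim=1)                 # ||h_i|| = 1, d = 512
    loss = supervised_contrastive_loss(h, y, tau)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For 32×32 or grayscale inputs (CIFAR10, MNIST, FashionMNIST), the first convolution of the stock torchvision ResNet-18 is commonly adapted; that detail is not given in the excerpt.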

Additional geometric analysis

  • Neural Collapse
  • Angular convergence
  • Embedding heatmaps
  • Experiments with MLPs

For both models, we replace the linear classifier of the last layer with a normalization layer (normalizing so that ∥h_i∥ = 1 for i ∈ [n]), with feature dimension d = 512. The values are typically of the order of 10⁻³, indicating strong convergence of the embeddings to their class means. The final values are mostly of the order of 10⁻², indicating near-orthogonality of the class-mean embeddings.

As a qualitative measure, we generate heatmaps that visually represent the learned embedding geometries. Specifically, we generate heatmaps of the Gram matrices G_M = M⊺M and G_H = H⊺H. We additionally train ResNet-18 on the full MNIST dataset; we consider this additional setting to allow a comparison in which CE features are restricted to the unit sphere, similar to our SCL experiments.
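A minimal matplotlib sketch for producing such Gram-matrix heatmaps; the layout and color map are illustrative, and H and M are assumed to be precomputed embedding and class-mean matrices with samples sorted by label.

```python
import matplotlib.pyplot as plt

def gram_heatmaps(H, M):
    """Plot G_H = H^T H and G_M = M^T M for embeddings H (d, n) and class means M (d, k)."""
    GH, GM = H.T @ H, M.T @ M
    fig, axes = plt.subplots(1, 2, figsize=(9, 4))
    for ax, G, title in zip(axes, (GH, GM), ("$G_H = H^T H$", "$G_M = M^T M$")):
        im = ax.imshow(G, cmap="viridis")      # block structure reveals NC and (k-)OF
        ax.set_title(title)
        fig.colorbar(im, ax=ax)
    return fig
```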

Finally, in Fig. 13 we plot the Gram matrix G_H of the learned features for a ResNet-18 trained on CIFAR10 (n = 10000 total examples) with a batch size of 1000. This heatmap gives a qualitatively more complete picture, since we plot G_H = H⊺H instead of G_M, simultaneously illustrating both neural collapse and convergence to the k-OF structure. In Fig. 16, we run experiments with a simple 6-layer multilayer perceptron (MLP) to further explore the effect of model complexity on geometric convergence. In Fig. 16, all metrics ∆_{G_M}, β_NC, and Avg_{c≠c′} α_sim(c, c′) indicate strong convergence to the OF geometry, regardless of the imbalance ratio R.

To enable a fair comparison, the CE features are normalized before the classifier head, similar to the SCL experiments.

Figure 10: Neural Collapse metric β_NC := tr(Σ_W Σ_B†)/k for the corresponding ResNet-18 experiments in Fig.
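For reference, β_NC can be computed from last-layer embeddings as below. The normalization conventions (dividing Σ_W by n and Σ_B by k) follow the usual neural-collapse definitions and are an assumption, since the paper's exact definition is only partially visible in the caption.

```python
import numpy as np

def nc_metric(H: np.ndarray, y: np.ndarray) -> float:
    """beta_NC := tr(Sigma_W pinv(Sigma_B)) / k for embeddings H (d, n), labels y (n,)."""
    d, n = H.shape
    classes = np.unique(y)
    k = len(classes)
    mu_g = H.mean(axis=1, keepdims=True)                  # global mean embedding
    Sigma_W = np.zeros((d, d))
    Sigma_B = np.zeros((d, d))
    for c in classes:
        Hc = H[:, y == c]
        mu_c = Hc.mean(axis=1, keepdims=True)
        Sigma_W += (Hc - mu_c) @ (Hc - mu_c).T / n        # within-class covariance
        Sigma_B += (mu_c - mu_g) @ (mu_c - mu_g).T / k    # between-class covariance
    return float(np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / k)
```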

Optimization dynamics

Loss convergence

The full experiment, run over 2000 epochs (with axes limited to 500 epochs for clarity in Figure 4), is available in Figure 14.

Effect of τ

Complementary results and discussions on batch-binding

How batch-binding ensures a unique OF geometry

Figure 17 gives a simple illustration of how adding binding examples can satisfy the requirements of Cor. 2.1. While there are alternative ways to satisfy the graph conditions stated in Cor. 2.1 and thereby ensure a unique OF geometry, adding the same k examples to each batch is a simple and computationally cheap technique, given that batch sizes typically exceed the number of classes.
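A minimal sketch of this batch-binding construction, with illustrative names and assuming labels are given as a sequence indexed by sample: one fixed example is drawn per class and appended to every mini-batch, so every pair of classes co-occurs in every batch (condition ii) and each within-class subgraph is connected through its binding example (condition i).

```python
import random

def add_batch_binding(batches, labels, classes, seed=0):
    """Append one fixed 'binding' example per class to every mini-batch (illustrative helper).

    batches: list of lists of sample indices; labels: per-sample class labels;
    classes: iterable of class ids. Returns the augmented batches and the
    chosen binding examples.
    """
    rng = random.Random(seed)
    by_class = {c: [i for i, yi in enumerate(labels) if yi == c] for c in classes}
    binding = [rng.choice(by_class[c]) for c in classes]   # one fixed sample per class
    return [list(b) + binding for b in batches], binding
```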

Impact of batch-binding on generalization

A simple illustration of how adding binding examples to each batch satisfies the requirements of Cor. 2.1, thus leading to a unique OF geometry. (a) A 3-class (black, gray, and white) classification example with 3 batches. In addition to the data in each batch (enclosed in red), the binding examples 1, 2, 3 are added to every batch. Results represent the average over 5 runs of the experiment with randomly chosen binding samples from each class.

We consider a simple training setup of a ResNet-18 model on CIFAR10 with imbalance ratios R = 10, 100, with and without binding examples. Experiments were performed with a smaller batch size to increase the number of backpropagation steps, since adding data augmentation slows convergence to the OF geometry. In addition, we use data augmentation, as is common practice when evaluating generalization and test accuracy.

In particular, rather than simply adding horizontally flipped images (as described in Sec. 3), we allow generic augmentations, including horizontal or vertical flips and random crops, each applied with probability 0.5. NCC test accuracy was measured over 5 runs of the experiment with the binding samples randomly re-sampled each time, and over 5 runs without any batch-binding. Although drawing definitive conclusions about the impact of the embedding geometries and binding samples on generalization requires further investigation, this preliminary study suggests that batch-binding does not negatively impact NCC test accuracy.

As expected, convergence is slightly slower in this case due to the inclusion of data augmentation (random crops and flips).

Figure 17: A simple illustration to explain how adding binding examples to each batch satisfies the requirements of Cor. 2.1.

Figures

Figure 1: Experimental validation of our key finding: SCL learns orthogonal class-mean embeddings irrespective of class imbalances.
Figure 2: ResNet-18 model trained on imbalanced datasets with SCL. Convergence of last-layer embeddings to the OF geometry as measured by ∆_{G_M} := ∥G_M/∥G_M∥_F − I_k/√k∥_F.
Figure 3: G_M at the last epoch (350) for ResNet-18 trained on R-STEP imbalanced MNIST. See text for details.
Figure 4: Full-batch SCL converges to the lower bound (dashed lines) computed in Thm. 2.
