Differentiable Hierarchical Optimal Transport for Robust Multi-View Learning



Item Type Article

Authors Luo, Dixin;Xu, Hongteng;Carin, Lawrence

Citation Luo, D., Xu, H., & Carin, L. (2022). Differentiable Hierarchical Optimal Transport for Robust Multi-View Learning. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 1–14.

https://doi.org/10.1109/tpami.2022.3222569 Eprint version Post-print

DOI 10.1109/tpami.2022.3222569

Publisher Institute of Electrical and Electronics Engineers (IEEE)

Journal IEEE Transactions on Pattern Analysis and Machine Intelligence Rights (c) 2022 IEEE. Personal use of this material is permitted.

Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

Download date 2023-12-03 20:11:41

Link to Item http://hdl.handle.net/10754/685827



Differentiable Hierarchical Optimal Transport for Robust Multi-View Learning

Dixin Luo, Hongteng Xu, Member, IEEE, and Lawrence Carin, Fellow, IEEE

Abstract—Traditional multi-view learning methods often rely on two assumptions: (i) the samples in different views are well-aligned, and (ii) their representations obey the same distribution in a latent space. Unfortunately, these two assumptions may be questionable in practice, which limits the application of multi-view learning. In this work, we propose a differentiable hierarchical optimal transport (DHOT) method to mitigate the dependency of multi-view learning on these two assumptions. Given any two views of unaligned multi-view data, the DHOT method calculates the sliced Wasserstein distance between their latent distributions. Based on these sliced Wasserstein distances, the DHOT method further calculates the entropic optimal transport across different views and explicitly indicates the clustering structure of the views. Accordingly, the entropic optimal transport, together with the underlying sliced Wasserstein distances, leads to a hierarchical optimal transport distance defined for unaligned multi-view data, which works as the objective function of multi-view learning and leads to a bi-level optimization task. Moreover, our DHOT method treats the entropic optimal transport as a differentiable operator of model parameters. It considers the gradient of the entropic optimal transport in the backpropagation step and thus helps improve the descent direction for the model in the training phase. We demonstrate the superiority of our bi-level optimization strategy by comparing it to the traditional alternating optimization strategy. The DHOT method is applicable to both unsupervised and semi-supervised learning. Experimental results show that our DHOT method is at least comparable to state-of-the-art multi-view learning methods on both synthetic and real-world tasks, especially for challenging scenarios with unaligned multi-view data.

Index Terms—Hierarchical optimal transport, multi-view learning, unaligned multi-view data, sliced Wasserstein distance, entropic optimal transport, bi-level optimization.

1 INTRODUCTION

Multi-view learning seeks to represent multi-view data and fuse the information of different views in an unsupervised or semi-supervised manner. This learning strategy has been widely used in many real-world learning tasks, such as predicting diseases based on multiple clinical testing records [1], [2], evaluating treatments based on patients’ clinical features [3], recognizing 2D or 3D objects from different viewpoints [4], [5], embedding words semantically across different languages [6], and so on. Especially for predictive tasks with few labeled data (and possibly no labels in some views), multi-view learning methods impose useful regularization on target models, and accordingly help mitigate overfitting. More recently, pre-trained multi-modal models have demonstrated that leveraging multi-view data is helpful for many challenging problems, including conditional image generation [7], image captioning [8], and cross-modal information retrieval [9].

• Dixin Luo was with the School of Computer Science and Technology, Beijing Institute of Technology. She was supported in part by the Beijing Institute of Technology Research Fund Program for Young Scholars (XSQD-202107001) and the project 2020YFF0305200. E-mail: [email protected]

• Hongteng Xu was the corresponding author of the paper. He was with the Gaoling School of Artificial Intelligence, Renmin University of China and Beijing Key Laboratory of Big Data Management and Analysis Methods. He was supported in part by Beijing Outstanding Young Scientist Program (NO. BJJWZYJH012019100020098), National Natural Science Foundation of China (No. 61832017), the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China. E-mail: [email protected]

• Lawrence Carin was Provost at King Abdullah University of Science & Technology (KAUST). E-mail: [email protected]

Manuscript submitted April XX, 2021.

Although achieving encouraging performance in many applications, traditional multi-view learning methods are built on two questionable assumptions, which may limit their practical application.

• Well-aligned multi-view data. Most existing methods [10], [11] assume that their training data in different views are well-aligned¹. This requirement is inappropriate in general because, in practice, we often collect multi-view data from different organizations, and each organization may only provide the information of a part of the views. For example, when predicting an individual’s credit level, we need to collect that person’s financial information from different banks. Here, each bank provides a view of the data. However, the samples of different views are unaligned because different people often have accounts at different banks. Therefore, the samples of different views can be non-overlapping, or even independent, and thus lack clear correspondence.

• Shared latent distribution. Most existing multi-view learning methods assume that the latent representations of the data in different views obey the same latent distribution [12], [13], [14], [15]. In practice, however, the information in a view can be redundant for some views and complementary for others. For example, in the task of object recognition, the RGB images and the depth images carry complementary information, while the RGB images from similar viewpoints are highly correlated with each other (and thus may be redundant). For such multi-view data, the views have a clustering structure and obey different latent distributions. In such a situation, enforcing a single distribution across all the views may cause serious over-regularization problems.

1. In this study, “well-aligned” means that for any two views, the correspondence between their samples is known.

[Figure 1. Panel (a): An unaligned multi-view scenario, with five views (blood test, nursing, clinical note, drugs, genetic test) collected from three hospitals. Panel (b): An illustration of our DHOT method.]

Fig. 1: (a) In practice, the multi-view data can be unaligned, and each multi-view sample may be incomplete. The information of different views is often complementary, so that the latent distributions of different views can be different. (b) In our DHOT method, $f_s$ denotes the encoder mapping the samples of the $s$-th view to their latent representations (the point clouds with different colors). The classifier concatenates the latent representations of different views and predicts target labels. When only unaligned unlabeled data are available, the encoders are trained independently of the classifier in an unsupervised way. When some well-aligned labeled data are provided, the encoders and the classifier can be trained jointly under a semi-supervised framework. We implement our DHOT method based on two strategies: learning the optimal transport for the views based on the sliced Wasserstein distances between the latent codes (the blue arrows), or learning the optimal transport between the views and some learnable references (the orange arrows).

Taking multi-view learning of healthcare data as an example, we further illustrate the infeasibility of the above two assumptions in Fig. 1(a). In particular, suppose that we need to build a disease prediction model, whose data are multi-view healthcare data collected from patients in different hospitals. In this scenario, different patients may receive different treatments and tests even if they suffer from the same disease (e.g., some patients may not be able to afford some drugs or genetic tests), so each patient may miss some views. Additionally, the patients’ data are from different hospitals and are often anonymized for privacy protection. For the patients having admissions to different hospitals (the red frames in Fig. 1(a)), the correspondence between their multi-view data is unknown. As a result, the incompleteness of views and the anonymity of data lead to unaligned multi-view data. Moreover, when predicting diseases, different views often provide different information: for a patient, the blood test, drug treatment, and genetic test provide complementary information for the diagnosis. Therefore, the distributions of these views in the latent space are often different as well.

In this study, we propose a new multi-view learning method called differentiable hierarchical optimal transport (DHOT), mitigating the dependency of multi-view learning on the aforementioned two assumptions. As shown in Fig. 1(b), our multi-view model is composed of several encoders, each of which corresponds to the data of a view. The latent representations derived by the encoders are concatenated and used as the input of a classifier for discriminative tasks. The encoders are learned jointly in the framework of canonical correlation analysis (CCA), which minimizes the discrepancy between the latent representations of different views. In our work, we leverage the sliced Wasserstein distance [16] to measure the discrepancy between the latent distributions of different views, which does not require the correspondence between their samples. Moreover, we impose learnable weights on the sliced Wasserstein distances between different views and implement the weights as an entropic optimal transport, which leads to a hierarchical optimal transport (HOT) model (i.e., the blue matrix in Fig. 1(b)). The HOT model uses different strengths to penalize the distances between different views, and thus implicitly indicates the clusters of the views. Additionally, to make the clustering structure explicit, we can introduce some learnable references and define the HOT between the views and the references (i.e., the orange matrix in Fig. 1(b)). When only unaligned unlabeled data are provided, we learn the encoders in an unsupervised way, minimizing the HOT distance between the views or that between the views and the references. When some well-aligned labeled data are available, we learn the encoders and the classifier jointly in a semi-supervised way.

Different from traditional methods using alternating optimization [17], our DHOT method makes the HOT distance differentiable with respect to the model parameters and learns the proposed multi-view learning model by a bi-level optimization strategy. Specifically, we treat the entropic optimal transport as a differentiable operator of the model parameters and consider its gradient in the backpropagation step. With the help of Theorem 2 in [18], we calculate the gradient in a closed form, which makes our DHOT method computationally efficient. Experiments show that our DHOT method helps us find a better descent direction in the learning phase, which achieves robust multi-view learning with fewer assumptions and outperforms baselines on multiple datasets.

The remainder of this paper is organized as follows. We first introduce related work and background on multi-view learning and optimal transport-based machine learning methods in Section 2. Section 3 introduces the preliminaries of the Wasserstein distance and its variants. Section 4 reviews existing multi-view learning methods that are based on canonical correlation analysis (CCA) and constructs our method. Section 5 contains a derivation of our proposed learning methods for our model and an analysis of its rationality. Experiments and discussion are provided in Section 6. Finally, Section 7 concludes the paper and discusses some open problems as our future work.


2 RELATED WORK

2.1 Multi-view learning

Multi-view learning can be broadly categorized into three strategies [10], [19]: co-training, multi-kernel fusion, and co-regularization [20]. Co-training methods iteratively learn a classifier for each view using labeled samples and annotate the unlabeled data based on the predictions of each classifier [21], [22]. Kernel-based methods merge the kernel matrices of different views and learn global representations based on the merged kernel [23], [24]. Co-regularization methods add regularization terms to encourage the data from different views to be consistent. Traditional co-regularization methods include (i) CCA-based methods [6], [12], [20], [25], [26], [27], and (ii) linear discriminant analysis based methods [28] that require labeled data. More recently, large-scale multi-modal pre-training models have been developed [7], [8], [9], which apply self-supervised or weakly-supervised methods to learn “text-image” generative models. From the viewpoint of multi-view learning, the learning strategies of these models combine co-training with co-regularization, predicting the data of one modality (view) from the other one and making them share the same latent space.

Note that the co-training methods are often designed for two-view cases. Extending them to multi-view cases requires sophisticated strategies. Additionally, the kernel-based methods can deal with multi-view situations, but their transductive nature requires them to compute a kernel matrix for each view, whose computational complexity is $O(N^2)$ for $N$ samples. Compared with these two strategies, the co-regularization strategy (like CCA and its variants) scales better with both the number of views and the size of data. Therefore, we focus on the co-regularization strategy and its improvement in this study.

As aforementioned, all the methods above require well-aligned multi-view data. In practice, we need to relax this strict constraint, achieving multi-view learning based on unaligned multi-view data. Achieving this aim requires us to estimate the correspondence of the samples across different views, which leads to the multi-view alignment problem [29], [30], [31].

2.2 Learning from unaligned data

Learning models from unaligned data currently attracts considerable attention in the machine learning community. Some methods have been proposed to attack this challenging problem in the scenarios of linear regression [32]. For example, the work in [17] aims at learning a linear regression model from the samples with shuffled labels. An alternating minimization method is developed to achieve this aim, and this method is further extended to a stochastic EM algorithm [33]. Recently, the work in [18] reformulates the problem from the viewpoint of optimal transport and solves it by a bi-level optimization strategy, which achieves better performance. Besides the regression task, learning without correspondence is common in applications like point cloud alignment and graph matching. In [34], a coherent point drift algorithm is developed, which aligns point clouds by learning a non-rigid registration in an EM framework. In [35], a graph matching algorithm called Gromov-Wasserstein learning is proposed. Given two graphs, the proposed method learns their node embeddings associated with an optimal transport matrix, which indicates their nodes’ correspondence. In more challenging tasks, such as transfer learning of high-dimensional generative models, methods like CycleGAN [36] learn neural network-based models by matching distributions of unpaired data. The Gromov-Wasserstein GAN in [37] learns coupled generative models across heterogeneous sample spaces.

Some efforts have been made to deal with multi-view learning scenarios. For example, the methods in [28], [38], [39] achieve supervised multi-view learning based on incomplete or noisy views. Furthermore, to cluster incomplete multi-view data, some semi-supervised methods are proposed in [38], [40], [41], [42], [43], [44]. Essentially, these methods treat the incompleteness of views as a special case of unaligned multi-view data, and they achieve the representation and the alignment of different views jointly. However, these methods still require some well-aligned multi-view data [45], [46], [47] as their landmarks, and their alignment results are normally inconsistent when the number of views is larger than two and are sensitive to noise in the data. Additionally, these methods still require the views to have the same latent distribution when aligning different views, which is questionable as aforementioned.

2.3 Optimal transport-based learning

Optimal transport theory [48] has proven to be useful in distribution matching [49], [50], data clustering [51], [52], [53], and learning generative models [54]. In particular, optimal transport theory provides a useful metric called the Wasserstein distance for probability measures. Compared with other discrepancies defined for probability measures, such as the KL-divergence and the Jensen-Shannon divergence, the Wasserstein distance remains informative even if the supports of the probability measures do not overlap. Because of this property, this distance has been widely used as the objective function to minimize the difference between the data distribution and the model distribution, which leads to the well-known Wasserstein generative adversarial network (WGAN) [54] and Wasserstein autoencoder (WAE) [55].

An advantage of optimal transport-based learning methods is that they do not require the samples’ correspondence when matching distributions. The optimal transport plan derived by calculating the optimal transport distance indicates the correspondence. In fact, some methods have leveraged this advantage in challenging matching tasks, such as linear regression with shuffled labels [18], graph matching [35], [56], and point cloud registration [57]. Therefore, in this work we would like to develop an optimal transport-based model to achieve robust multi-view learning based on unaligned data.

Recently, many variants of the Wasserstein distance have been proposed to deal with more challenging scenarios, e.g., the Gromov-Wasserstein distance [49] for graphs, the sliced Wasserstein distance [16] for point clouds, and so on. Among the variants, the hierarchical optimal transport (HOT) distance is defined when the underlying distance metric of the Wasserstein distance is also the Wasserstein distance (or its variant). Recently, hierarchical optimal transport models have been proposed to compare the distributions with structural information, including the nonlinear factorization models in [52], [58], [59] and the models for multi-modal distributions [53], [60], [61].

3 PRELIMINARIES

Mathematically, the Wasserstein distance for probability measures is defined as follows.


Definition 1 (Wasserstein distance). Let $(\mathcal{X}, d_{\mathcal{X}})$ be a compact metric space, where $\mathcal{X}$ is the space and $d_{\mathcal{X}}$ is the metric defined on it. For any two probability measures $\mu$ and $\nu$ defined on $\mathcal{X}$, the Wasserstein distance between them is defined as

$$D_w(\mu, \nu) := \inf_{\pi\in\Pi(\mu,\nu)} \int_{\mathcal{X}\times\mathcal{X}} d_{\mathcal{X}}(x, y)\,\pi(x, y)\,dx\,dy = \inf_{\pi\in\Pi(\mu,\nu)} \mathbb{E}_{(x,y)\sim\pi}\big[d_{\mathcal{X}}(x, y)\big]. \tag{1}$$

Here, $\Pi(\mu, \nu)$ is the set of all probability measures on $\mathcal{X}\times\mathcal{X}$ whose marginals are $\mu$ and $\nu$, respectively. The optimal $\pi$ corresponding to the distance, denoted as $\pi^*$, is called the optimal transport plan.
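As a quick sanity check of Definition 1, the following worked example (added here for illustration; the points $a$ and $b$ are hypothetical) evaluates (1) for two point masses. Because the only coupling of two Dirac measures is the Dirac measure on the pair, the infimum is attained trivially.

```latex
% Worked example: two point masses located at a and b in the space X
\mu = \delta_a,\quad \nu = \delta_b
\;\Longrightarrow\;
\Pi(\mu,\nu) = \{\delta_{(a,b)}\},
\qquad
D_w(\delta_a, \delta_b) = d_{\mathcal{X}}(a, b).
```

The value decreases smoothly to zero as $a$ approaches $b$, even though the two supports are disjoint whenever $a \neq b$. In the same setting the KL-divergence is infinite for every $a \neq b$, which is consistent with the remark in Section 2.3 about non-overlapping supports.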

In practice, the Wasserstein distance in (1) is often implemented based on samples of the probability measures, which leads to the following linear programming problem [62]:

$$\widehat{D}_w(X, Y) = \min_{T\in\Pi(\hat{\mu},\hat{\nu})} \langle C_{XY}, T\rangle = \min_{T\in\Pi(\hat{\mu},\hat{\nu})} \sum_{n,m} t_{nm}\, d_{\mathcal{X}}(x_n, y_m), \tag{2}$$

where $\langle\cdot,\cdot\rangle$ represents the inner product of two matrices, and $X = \{x_n\}_{n=1}^{N} \sim \mu$ and $Y = \{y_m\}_{m=1}^{M} \sim \nu$ are two sets of samples drawn from the probability measures. $\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N}\delta_{x_n}$ and $\hat{\nu} = \frac{1}{M}\sum_{m=1}^{M}\delta_{y_m}$ are the corresponding empirical measures. $C_{XY} = [d_{\mathcal{X}}(x_n, y_m)] \in \mathbb{R}^{N\times M}$ is a distance matrix. Accordingly, $T = [t_{nm}]$ is the transport matrix, which indicates the joint distribution of the sample pairs and lies in the feasible set $\Pi(\hat{\mu}, \hat{\nu}) = \{T \geq 0 \mid T 1_M = \frac{1}{N} 1_N,\ T^\top 1_N = \frac{1}{M} 1_M\}$.

Unfortunately, solving (2) directly is often time-consuming because of its high computational complexity. To address this issue, two strategies are often applied to approximate the Wasserstein distance. The first strategy is introducing an entropy regularizer of $T$ into (2) and obtaining the following entropic optimal transport problem [63]:

$$\widehat{D}_{w,\beta}(X, Y) = \min_{T\in\Pi(\hat{\mu},\hat{\nu})} \langle C_{XY}, T\rangle + \beta h(T), \tag{3}$$

where $h(T) = \langle T, \log T\rangle$ is the negative entropy of $T$. This problem is strongly convex and can be solved efficiently by the Sinkhorn scaling algorithm [64], [65]. Similarly, the Bregman ADMM algorithm [66] and the proximal gradient algorithm [67] can solve the entropic optimal transport problem with comparable convergence rates.
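To make the Sinkhorn scaling step concrete, the following is a minimal NumPy sketch of an entropic optimal transport solver for a problem of the form (3). It is an illustration under stated assumptions rather than the implementation used in this work: the function name `sinkhorn`, the fixed iteration count, the toy data, and the rescaling of the cost matrix to $[0, 1]$ are all choices made for this example.

```python
import numpy as np

def sinkhorn(C, mu, nu, beta=0.5, n_iter=200):
    """Approximately solve min_T <C, T> + beta * <T, log T>
    subject to T @ 1 = mu and T.T @ 1 = nu, by alternating scaling."""
    K = np.exp(-C / beta)            # Gibbs kernel derived from the cost
    a = np.ones_like(mu)             # row scaling vector
    for _ in range(n_iter):
        b = nu / (K.T @ a)           # rescale to match the column marginal
        a = mu / (K @ b)             # rescale to match the row marginal
    return a[:, None] * K * b[None, :]

# Toy data (hypothetical): N = 5 and M = 4 samples in the plane
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = rng.normal(size=(4, 2)) + 1.0

# Euclidean distance matrix C_XY, rescaled to [0, 1] for numerical convenience
C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
C = C / C.max()

mu = np.full(5, 1 / 5)               # uniform weights of the empirical measures
nu = np.full(4, 1 / 4)
T = sinkhorn(C, mu, nu)
transport_cost = float((C * T).sum())
```

Smaller values of `beta` bring the plan closer to a solution of the linear program (2), at the price of slower convergence and a higher risk of numerical underflow in the kernel; log-domain variants of the iteration are a common remedy.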

The second strategy is leveraging random projections to define a variant of Wasserstein distance, called sliced Wasserstein (SW) distance [16], [68].

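Definition 2 (Sliced Wasserstein distance). Let $(S^{d-1}, u_S)$ be the unit hypersphere in $\mathbb{R}^d$ equipped with the uniform probability measure. For $\theta \in S^{d-1}$, we denote by $R_\theta$ the projection onto $\theta$, i.e., $R_\theta(x) = \langle x, \theta\rangle$. For two probability measures $\mu$ and $\nu$ on a compact metric space $(\mathcal{X}, d_{\mathcal{X}})$, their sliced Wasserstein distance is defined as

$$D_{sw}(\mu, \nu) := \mathbb{E}_{\theta\sim u_S}\big[D_w(R_\theta\#\mu, R_\theta\#\nu)\big], \tag{4}$$

where $R_\theta\#\mu$ is the projection (pushforward) of $\mu$ along $\theta$, and $D_w(R_\theta\#\mu, R_\theta\#\nu)$ is the Wasserstein distance between $R_\theta\#\mu$ and $R_\theta\#\nu$ on the 1D space $(R_\theta(\mathcal{X}), d_{R_\theta(\mathcal{X})})$.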

Similar to the Wasserstein distance, the sliced Wasserstein distance also provides a valid metric to measure the discrepancy between different distributions. Because the Wasserstein distance in 1D space has a closed-form solution, the computation of the sliced Wasserstein distance is simple. In practice, given the samples of the two probability measures, we can calculate the sliced Wasserstein distance by projecting the samples randomly along different directions, sorting the projected samples, and finally averaging the squared Euclidean distances between the sorted samples over the directions.

In particular, given $X = \{x_n\}_{n=1}^{N} \sim \mu$ and $Y = \{y_n\}_{n=1}^{N} \sim \nu$ defined in the $d$-dimensional Euclidean space, the sample-based sliced Wasserstein distance is

$$\widehat{D}_{sw}(X, Y) := \frac{1}{M}\sum_{m=1}^{M} \min_{P\in\mathcal{P}} \big\|\theta_m^\top (XP - Y)\big\|_2^2 = \frac{1}{M}\sum_{m=1}^{M} \big\|\mathrm{sort}(\theta_m^\top X) - \mathrm{sort}(\theta_m^\top Y)\big\|_2^2, \tag{5}$$

where $P \in \mathcal{P}$ is a permutation matrix with size $N\times N$, and $\{\theta_m\}_{m=1}^{M}$ are $M$ random projection vectors sampled uniformly from $S^{d-1}$. As shown in (5), for the projected 1D samples, we do not have to optimize the permutation matrix explicitly. Instead, we can sort the 1D samples and compute the distance between the sorted samples. More recently, a variant of the SW distance, called the max-sliced Wasserstein (max-SW) distance, was proposed in [68]. It only considers the projection maximizing the distance between the projected samples, which is implemented as

$$\widehat{D}_{msw}(X, Y) := \max_{\theta\in\{\theta_m\}_{m=1}^{M}} \big\|\mathrm{sort}(\theta^\top X) - \mathrm{sort}(\theta^\top Y)\big\|_2^2, \tag{6}$$

where the optimal projection is selected from the finite set $\{\theta_m\}_{m=1}^{M}$ rather than the whole $S^{d-1}$.
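The sample-based distances in (5) and (6) reduce to projecting, sorting, and differencing. The NumPy sketch below illustrates this under the assumption that the samples are stored as columns of $d\times N$ arrays; the helper names (`random_directions`, `projected_costs`, and so on) and the toy data are introduced only for this example.

```python
import numpy as np

def random_directions(d, M, rng):
    """Draw M directions uniformly from the unit sphere S^{d-1}
    by normalizing standard Gaussian vectors."""
    theta = rng.normal(size=(M, d))
    return theta / np.linalg.norm(theta, axis=1, keepdims=True)

def projected_costs(X, Y, theta):
    """Squared distance between sorted 1D projections, one value per direction.
    X, Y: (d, N) arrays with samples as columns; theta: (M, d) directions."""
    px = np.sort(theta @ X, axis=1)   # (M, N) sorted projections of X
    py = np.sort(theta @ Y, axis=1)   # (M, N) sorted projections of Y
    return ((px - py) ** 2).sum(axis=1)

def sliced_wasserstein(X, Y, theta):
    return projected_costs(X, Y, theta).mean()   # average over directions, Eq. (5)

def max_sliced_wasserstein(X, Y, theta):
    return projected_costs(X, Y, theta).max()    # worst direction in the set, Eq. (6)

# Toy data (hypothetical): two point clouds with N = 50 samples in d = 3 dimensions
rng = np.random.default_rng(0)
d, N, M = 3, 50, 64
theta = random_directions(d, M, rng)
X = rng.normal(size=(d, N))
Y = rng.normal(size=(d, N)) + 2.0

perm = rng.permutation(N)             # shuffle the sample order of Y
sw_xy = sliced_wasserstein(X, Y, theta)
sw_shuffled = sliced_wasserstein(X, Y[:, perm], theta)
msw_xy = max_sliced_wasserstein(X, Y, theta)
```

Because each projection is sorted before differencing, shuffling the columns of `Y` leaves the value unchanged, which is exactly the property that makes the distance usable without sample correspondence.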

4 PROPOSED MODEL

4.1 CCA-based multi-view learning

Suppose that we have a set of samples collected from $S$ views, i.e., $X_s = [x_1^s, ..., x_N^s] \subset \mathcal{X}_s$ for $s = 1, ..., S$, where $\mathcal{X}_s$ is the sample space of the $s$-th view, $x_n^s \in \mathbb{R}^{D_s}$ for $n = 1, ..., N$ is a $D_s$-dimensional sample in the space, and $X_s$ contains $N$ observed samples. Denote $f_s : \mathcal{X}_s \mapsto \mathcal{Z}_s$ as the encoder of the $s$-th view and $\mathcal{Z}_s$ as the $d_s$-dimensional latent space. Accordingly, $Z_s = f_s(X_s)$ are the latent representations of the samples in the $s$-th view. Our multi-view model consists of the encoders. In particular, we would like to learn the encoders jointly and leverage the corresponding latent representations as features for downstream learning tasks, e.g., multi-view data classification.

We focus on the multi-view learning strategy called co-regularization, whose representative methods include canonical correlation analysis (CCA) [25] and its variants [12], [13], [14], [69]. These methods project the outputs of the encoders to the same latent space and assume them to obey the same distribution. For example, the Least Squares based CCA (LSCCA) [27] learns the encoders by penalizing the pairwise discrepancies between different views, i.e.,

$$\min_{\{f_s, U_s\}_{s=1}^{S}} \sum_{s\neq s'} \big\|U_s f_s(X_s) - U_{s'} f_{s'}(X_{s'})\big\|_F^2, \quad \text{s.t.}\ \sum_{s=1}^{S} U_s f_s(X_s) f_s^\top(X_s) U_s^\top = I_d, \tag{7}$$

where $U_s \in \mathbb{R}^{d\times d_s}$ projects the latent representations of the $s$-th view to the common space, and $I_d$ is an identity matrix with size $d\times d$.

Alternatively, the Deep Generalized CCA (DGCCA) [15] learns the encoders by encouraging the latent representations of all the views to approach some learnable references, denoted as $G \in \mathbb{R}^{d\times N}$. Accordingly, the learning problem of DGCCA is

$$\min_{\{f_s, U_s\}_{s=1}^{S},\, G} \sum_{s=1}^{S} \big\|U_s f_s(X_s) - G\big\|_F^2, \quad \text{s.t.}\ GG^\top = I_d. \tag{8}$$


Semi-supervised Learning. When some data are labeled, we can learn a classifier together with the encoders by

$$\min_{\{f_s, U_s\}_{s=1}^{S}\in\Omega,\ g}\ \sum_{n\in\mathcal{L}} L\big(g(\{f_s(x_n^s)\}_{s=1}^{S}), y_n\big) + \sum_{n\in\mathcal{L}\cup\mathcal{U}} \Big(\gamma R_M(\{x_n^s\}_{s=1}^{S}) + \tau \sum_{s=1}^{S} R_S(x_n^s)\Big), \tag{9}$$

where $\mathcal{L}$ and $\mathcal{U}$ are the sets of indices for labeled and unlabeled data, respectively, $y_n$ is the label of the $n$-th multi-view data point, and $g$ is a classifier taking the concatenation of $\{f_s(x_n^s)\}_{s=1}^{S}$ as its input. The first term in (9) can be the cross-entropy loss for labeled data. $R_M$ can be a multi-view learning strategy for all the views. It can be implemented as the objective function in (7) or that in (8), and accordingly, $\Omega$ denotes the corresponding constraints imposed on the parameters. $R_S$ can be any additional regularizer imposed on each single view. We can implement $R_S$ as the manifold-based regularizer [26], [70], or we can introduce a learnable decoder for each encoder and implement $R_S$ as the reconstruction loss between the sample in each view and its estimation [14], [71], [72], [73]. The two regularizers are weighted by $\gamma$ and $\tau$, respectively.

4.2 Sliced Wasserstein distance for view matching

Most existing multi-view learning methods, including the LSCCA in (7) and the DGCCA in (8), require that the samples in different views are well-aligned, i.e., $x_n = [x_n^1, ..., x_n^S]$ for $n = 1, ..., N$ is sampled jointly from $\mathcal{X}_1\times ... \times\mathcal{X}_S$. When the samples in each view are generated independently or only a limited number of the samples are labeled and well-aligned, as shown in Fig. 1, we need to design a robust method to match the samples in different views and achieve multi-view learning accordingly. A straightforward way is replacing each term in the objective functions of (7) and (8) with

$$\min_{P\in\mathcal{P}} \big\|U_s f_s(X_s) P - U_{s'} f_{s'}(X_{s'})\big\|_F^2 \quad \text{for LSCCA}, \tag{10}$$

and

$$\min_{P\in\mathcal{P}} \big\|U_s f_s(X_s) P - G\big\|_F^2 \quad \text{for DGCCA}, \tag{11}$$

where $\mathcal{P}$ represents the set of all valid permutation matrices. The optimal $P$ for (10) matches the samples of the $s$-th view with those of the $s'$-th view, and the optimal $P$ for (11) indicates the correspondence between the samples of the $s$-th view and the references.

Such matching problems are NP-hard in general. Therefore, we need to find an efficient and effective surrogate for the problems. To achieve this aim, we analyze the matching problems from a statistical viewpoint. In particular, denote $Z_s = U_s f_s(X_s)$, $s = 1, ..., S$, as the latent representations of each view. These latent representations are sampled from an unknown conditional distribution $P_{Z|X_s}$. From this standpoint, the matching problem in (10) empirically defines a discrepancy between the conditional distributions of different views. Similarly, the matching problem in (11) corresponds to a discrepancy between the conditional distribution of each view and the distribution of the references.

We find that this discrepancy can be approximately implemented by the sliced Wasserstein distance [16], [68]. Given the latent representations of any two views, i.e., $Z_s = [z_1^s, ..., z_N^s]$ and $Z_{s'} = [z_1^{s'}, ..., z_N^{s'}]$, we can sample a set of projections $\{\theta_m\}_{m=1}^{M}$ and calculate their SW distance and max-SW distance as

$$\widehat{D}_{sw}(Z_s, Z_{s'}) = \frac{1}{M}\sum_{m=1}^{M} \big\|\mathrm{sort}(\theta_m^\top Z_s) - \mathrm{sort}(\theta_m^\top Z_{s'})\big\|_2^2, \tag{12}$$

and

$$\widehat{D}_{msw}(Z_s, Z_{s'}) = \max_{\theta\in\{\theta_m\}_{m=1}^{M}} \big\|\mathrm{sort}(\theta^\top Z_s) - \mathrm{sort}(\theta^\top Z_{s'})\big\|_2^2. \tag{13}$$

The relationships among the matching problems and these two distances can be captured by the following proposition:

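Proposition 3. For any two $Z_1, Z_2 \in \mathbb{R}^{d\times N}$, we have $\min_{P\in\mathcal{P}} \|Z_1 P - Z_2\|_F^2 \geq \widehat{D}_{sw}(Z_1, Z_2)$. Furthermore, when the random projections $\Theta = \{\theta_m\}_{m=1}^{M}$ used in $\widehat{D}_{sw}$ and $\widehat{D}_{msw}$ include $\{e_i\}_{i=1}^{d}$, where $e_i \in \mathbb{R}^d$ is the vector whose $i$-th element is one and others are zeros, we have $\min_{P\in\mathcal{P}} \|Z_1 P - Z_2\|_F^2 \leq d\,\widehat{D}_{msw}(Z_1, Z_2)$.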
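The first inequality of Proposition 3 can be checked numerically on a small instance by enumerating all permutations. The sketch below is only an illustration: the brute-force search is feasible for very small $N$, and the helper names `matching_cost` and `sw_hat`, the random data, and the number of projections are assumptions of this example. It checks the lower bound only.

```python
import itertools
import numpy as np

def matching_cost(Z1, Z2):
    """Brute-force min over permutations P of ||Z1 P - Z2||_F^2.
    Columns are samples, so applying P amounts to reordering the columns of Z1."""
    N = Z1.shape[1]
    return min(((Z1[:, list(p)] - Z2) ** 2).sum()
               for p in itertools.permutations(range(N)))

def sw_hat(Z1, Z2, theta):
    """Sample-based sliced Wasserstein distance as in Eq. (12)."""
    p1 = np.sort(theta @ Z1, axis=1)
    p2 = np.sort(theta @ Z2, axis=1)
    return ((p1 - p2) ** 2).sum(axis=1).mean()

# Toy latent codes (hypothetical): d = 2, N = 5, so only 5! = 120 permutations
rng = np.random.default_rng(1)
d, N, M = 2, 5, 32
Z1 = rng.normal(size=(d, N))
Z2 = rng.normal(size=(d, N))

# Unit-norm random projection directions
theta = rng.normal(size=(M, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

lower = sw_hat(Z1, Z2, theta)        # sliced Wasserstein surrogate
exact = matching_cost(Z1, Z2)        # exact matching objective
```

The inequality `lower <= exact` follows because every unit-norm projection can only shrink the Frobenius norm, and sorting solves the 1D matching problem exactly for each direction.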

This proposition means that the optimal objective values of the matching problems are bounded by the SW distance and the max-SW distance. In other words, the SW distance and its variant can be used to replace the matching problems. Therefore, to achieve multi-view learning from unaligned samples, we plug the sliced Wasserstein distance into (7) and (8), respectively, and obtain the following two learning tasks:

$$\min_{\{f_s, U_s\}_{s=1}^{S}} \sum_{s\neq s'} \widehat{D}_{sw}\big(U_s f_s(X_s), U_{s'} f_{s'}(X_{s'})\big), \quad \text{s.t.}\ \sum_{s=1}^{S} U_s f_s(X_s) f_s^\top(X_s) U_s^\top = I_d, \tag{14}$$

$$\min_{\{f_s, U_s\}_{s=1}^{S},\, G} \sum_{s=1}^{S} \widehat{D}_{sw}\big(U_s f_s(X_s), G\big), \quad \text{s.t.}\ GG^\top = I_d. \tag{15}$$

Note that we also tried $\widehat{D}_{msw}$, but we observed that using $\widehat{D}_{sw}$ leads to a more stable training process because of the averaging operation used in it. Therefore, we use $\widehat{D}_{sw}$ in the following experiments.

4.3 Hierarchical OT for view clustering

The new objective functions in (14) and (15) do not require well-aligned samples, but they still tend to make the latent representations of different views approach the same distribution. In particular, (14) penalizes the sliced Wasserstein distance between each pair of views, while (15) penalizes the sliced Wasserstein distance between each view and the references. We further modify the objective functions to relax this assumption and find the clustering structure of the views accordingly. For (14), we introduce learnable weights to the sliced Wasserstein distances and obtain

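$$\begin{aligned} \min_{\{f_s, U_s\}_{s=1}^{S},\, W}\ & \sum_{s\neq s'} w_{ss'}\, \widehat{D}_{sw}\big(U_s f_s(X_s), U_{s'} f_{s'}(X_{s'})\big) + \alpha \Big\|\sum_{s} U_s f_s(X_s) f_s^\top(X_s) U_s^\top - I_d\Big\|_F^2 + \beta h(W) \\ \text{s.t.}\ & W \in \Pi\big(\tfrac{1}{S} 1_S, \tfrac{1}{S} 1_S\big), \end{aligned} \tag{16}$$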

where $1_S$ represents an $S$-dimensional all-one vector and $W = [w_{ss'}] \in \mathbb{R}^{S\times S}$ is the matrix of the weights, which indicates the clustering structure of the views implicitly: the views corresponding to the pairs with large weights belong to the same clusters. To avoid trivial solutions (e.g., $W = 0$ or $I_S$), we restrict $W$ to be a doubly stochastic matrix (i.e., $W \in \Pi(\frac{1}{S} 1_S, \frac{1}{S} 1_S)$), and introduce an entropic regularizer on $W$. Note that in (16) we relax the strict constraint $\sum_{s=1}^{S} U_s f_s(X_s) f_s^\top(X_s) U_s^\top = I_d$ to a least squares based regularizer, which helps us apply mini-batch gradient descent directly to learn the model.

For (15), we consider multiple references which correspond to different clusters directly, and learn the weights for the sliced Wasserstein distance between the views and the references. The problem becomes

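$$\min_{\{f_s, U_s\}_{s=1}^{S},\, \{G_k\}_{k=1}^{K},\, W} \sum_{s,k} w_{sk} \widehat{D}_{\text{sw}}\big(U_s f_s(X_s),\, G_k\big) + \alpha \Big\| \sum_{k} G_k G_k^{\top} - I_d \Big\|_F^2 + \beta h(W), \quad \text{s.t.}\ W \in \Pi\big(\tfrac{1}{S}\mathbf{1}_S, \tfrac{1}{K}\mathbf{1}_K\big), \tag{17}$$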

where $K$ is the number of clusters we set for the views, which is fixed as three in the following experiments; $G_k \in \mathbb{R}^{d \times N}$ represents the matrix of the references corresponding to the $k$-th cluster; and $W = [w_{sk}] \in \Pi(\tfrac{1}{S}\mathbf{1}_S, \tfrac{1}{K}\mathbf{1}_K)$ is the matrix of the weights, which is also restricted to be a doubly stochastic matrix.

We can interpret $W$ as the joint distribution of the views and the clusters, and the element $w_{sk}$ is the probability that the $s$-th view belongs to the $k$-th cluster. Similar to (16), we relax the strict constraint $\sum_{k=1}^{K} G_k G_k^{\top} = I_d$ to a regularizer in (17).
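To make the probabilistic reading of $W$ concrete, the sketch below turns a transport plan into per-view cluster assignments. The plan used here is a hand-made, hypothetical example with the marginals required by (17), not a learned one.

```python
import numpy as np

S, K = 6, 3
# A hypothetical transport plan with row sums 1/S and column sums 1/K.
W = np.zeros((S, K))
for s in range(S):
    W[s, s % K] = 0.9 / S          # most of the mass goes to one cluster
    W[s, (s + 1) % K] = 0.1 / S    # a little leaks to a neighbouring cluster

# Dividing each row by its marginal gives P(cluster k | view s).
cond = W / W.sum(axis=1, keepdims=True)
assignment = cond.argmax(axis=1)   # hard cluster label per view
```

Views that share a label (here views 0 and 3, 1 and 4, 2 and 5) are the ones the model treats as belonging to the same cluster.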

In both methods, we establish an optimal transport model with a hierarchical architecture. The $W$ in (16) is an entropic optimal transport across different views, whose underlying distance is the sliced Wasserstein distance between the latent representations of the views. Similarly, the $W$ in (17) is an optimal transport from the views to their clusters, whose underlying distance is the sliced Wasserstein distance between the latent representations of the views and those of the clusters.

These optimal transport matrices can be learned efficiently by computing the entropic optimal transport based on the Sinkhorn scaling algorithm [63]. For convenience, we can write (16) and (17) in a unified form:

$$\min_{\theta,\, W \in \Pi(u, v)} \langle \widehat{D}_{\text{sw}}(\theta), W \rangle + \alpha R(\theta) + \beta h(W), \tag{18}$$

where $\theta$ represents the parameters of the target multi-view model, $R(\theta)$ represents the regularizer imposed on the model parameters, $u$ and $v$ are the marginals of $W$, and $\widehat{D}_{\text{sw}}(\theta)$ represents the sliced Wasserstein distance matrix calculated based on the model parameters and data. To our knowledge, our work is the first to leverage the hierarchical optimal transport (HOT) model to implement multi-view learning methods. This framework provides a new way to represent different views and find their clustering structure.

5 LEARNING A DIFFERENTIABLE HOT

5.1 A bi-level optimization strategy

A straightforward way to solve (18) is applying alternating optimization. Specifically, this method updates the model parameter and the weight matrix iteratively. In the $t$-th iteration, we can calculate the sliced Wasserstein distances and update the weight matrix via the Sinkhorn scaling algorithm [63]. Then, we can fix the weight matrix $W$ and learn the encoders and their projection matrices via mini-batch gradient descent [74]. The optimization steps can be written as

$$W^{(t+1)} = \arg\min_{W \in \Pi(u, v)} \langle \widehat{D}_{\text{sw}}(\theta^{(t+1)}), W \rangle + \beta h(W), \qquad \theta^{(t+1)} = \arg\min_{\theta} \langle \widehat{D}_{\text{sw}}(\theta), W^{(t)} \rangle + \alpha R(\theta). \tag{19}$$

In particular, the Sinkhorn scaling algorithm is shown in Algorithm 1. Here, $a^*$ and $b^*$ are the optimal dual variables, and $W^*$ is the entropic optimal transport matrix.
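Algorithm 1 is short enough to sketch directly. The NumPy version below is an illustration, not the authors' implementation: the function name `sinkhorn`, the random cost matrix, and the demo values `beta=0.5` and `num_iters=200` are assumptions chosen so that the iteration converges tightly (the experiments in Section 6.2 use $\beta = 0.1$ and 20 iterations).

```python
import numpy as np

def sinkhorn(D, u, v, beta=0.1, num_iters=20):
    """Entropic OT plan for cost D with marginals u (rows) and v (columns)."""
    C = np.exp(-D / beta)              # Gibbs kernel exp(-D / beta)
    a = u.copy()
    b = np.ones_like(v)                # overwritten in the first iteration
    for _ in range(num_iters):
        b = v / (C.T @ a)              # rescale to match the column marginals
        a = u / (C @ b)                # rescale to match the row marginals
    W = a[:, None] * C * b[None, :]    # element-wise form of diag(a) C diag(b)
    return a, b, W

rng = np.random.default_rng(0)
S, K = 6, 3
D = rng.uniform(size=(S, K))           # stand-in for the SW distance matrix
u, v = np.full(S, 1 / S), np.full(K, 1 / K)
a, b, W = sinkhorn(D, u, v, beta=0.5, num_iters=200)
```

Every operation in the loop is a smooth function of `D`, which is the property the differentiable variant discussed next relies on.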

Algorithm 1 $\arg\min_{W \in \Pi(u, v)} \langle \widehat{D}_{\text{sw}}, W \rangle + \beta h(W)$
1: Initialize $W^{(0)} = u v^{\top}$, $a = u$, $b = 0$, $C = \exp(-\widehat{D}_{\text{sw}} / \beta)$.
2: for $j = 0, ..., J-1$
3:   Sinkhorn iteration: $b = \frac{v}{C^{\top} a}$, $a = \frac{u}{C b}$ (element-wise division).
4: Return $a^* = a$, $b^* = b$, and $W^* = (a^* (b^*)^{\top}) \odot C$.

It should be noted that the alternating optimization strategy treats $\widehat{D}_{\text{sw}}$ as a constant matrix when learning the optimal transport matrix. Similarly, when learning the model parameter $\theta$, the learned optimal transport matrix becomes a constant with respect to $\theta$. This strategy overlooks the fact that $\widehat{D}_{\text{sw}}$ is learnable and parametrized by $\theta$, i.e., $\widehat{D}_{\text{sw}}(\theta)$, and that the Sinkhorn iterations are differentiable with respect to $\theta$. In other words, both the dual variables and the optimal transport matrix are functions of $\theta$, i.e., $a^*(\theta)$, $b^*(\theta)$, and

$$W^*(\theta) = \big(a^*(\theta) (b^*(\theta))^{\top}\big) \odot \exp\Big(-\frac{\widehat{D}_{\text{sw}}(\theta)}{\beta}\Big). \tag{20}$$

As a result, the alternating optimization strategy ignores the gradient of $W^*$ with respect to the model parameter $\theta$ when learning the model, which often leads to undesirable local optima [18], [75]. In the following experiments, we will show that a multi-view learning model trained with the alternating optimization strategy fails to recover a satisfactory clustering structure of the views.

To address this issue, we propose a differentiable hierarchical optimal transport (DHOT) method to enhance the robustness of the learning process. Specifically, we rewrite (18) as a bi-level optimization problem:

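$$\begin{aligned} &\text{(Upper-level problem } L_a(\theta)\text{)} && \min_{\theta} \langle \widehat{D}_{\text{sw}}(\theta), W^*(\theta) \rangle + \alpha R(\theta) \\ &\text{(Lower-level problem } L_b(W)\text{)} && \text{s.t.}\ W^*(\theta) = \arg\min_{W \in \Pi(u, v)} \langle \widehat{D}_{\text{sw}}(\theta), W \rangle + \beta h(W). \end{aligned} \tag{21}$$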

This problem is different from (19) because, when learning the model parameter $\theta$, it treats the entropic optimal transport as a differentiable operator of $\theta$, denoted as $W^*(\theta)$, rather than a constant independent of $\theta$. Therefore, when solving the upper-level problem, we consider the gradient of $W$ with respect to $\theta$ in the backpropagation step:²

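$$\nabla_\theta L_a(\theta) = \underbrace{\frac{\partial L_a(\theta)}{\partial \widehat{D}_{\text{sw}}} \frac{\partial \widehat{D}_{\text{sw}}(\theta)}{\partial \theta}}_{\text{The gradient used in (19)}} + \frac{\partial L_a(\theta)}{\partial W^*} \frac{\partial W^*(\theta)}{\partial \theta}. \tag{22}$$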

Essentially, the alternating optimization strategy in (19) sets $a^*$, $b^*$, and $\widehat{D}_{\text{sw}}$ as constants such that $\frac{\partial W^*(\theta)}{\partial \theta} = 0$, and it only needs to use the first term in (22) as the gradient to update $\theta$. On the contrary, our DHOT method treats them as three functions of $\theta$, as shown in (20).
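The size of the term that the alternating strategy drops can be seen on a toy problem. In the sketch below, the $2 \times 2$ cost matrix depends linearly on a scalar $\theta$ (a hypothetical setup, with $\alpha = 0$), the "alternating" gradient keeps only the first term of (22), and the full gradient is approximated by central finite differences, which stand in here for automatic differentiation or a closed-form expression.

```python
import numpy as np

def sinkhorn_plan(D, u, v, beta, num_iters=500):
    """Entropic OT plan via Sinkhorn scaling (same iteration as Algorithm 1)."""
    C = np.exp(-D / beta)
    a = u.copy()
    for _ in range(num_iters):
        b = v / (C.T @ a)
        a = u / (C @ b)
    return a[:, None] * C * b[None, :]

# Toy cost matrix D(theta) = D0 + theta * D1 (hypothetical).
D0 = np.array([[0.0, 1.0], [1.0, 0.0]])
D1 = np.array([[1.0, 0.0], [0.0, 0.0]])   # dD/dtheta
u = v = np.array([0.5, 0.5])
beta = 0.5

def upper_loss(theta):
    D = D0 + theta * D1
    return np.sum(D * sinkhorn_plan(D, u, v, beta))   # <D(theta), W*(theta)>

theta, eps = 0.0, 1e-5
W_star = sinkhorn_plan(D0 + theta * D1, u, v, beta)
grad_alternating = np.sum(D1 * W_star)    # W* treated as a constant
grad_full = (upper_loss(theta + eps) - upper_loss(theta - eps)) / (2 * eps)
print(grad_alternating, grad_full)
```

The two numbers differ noticeably even in this tiny example, so the two strategies follow different descent directions from the same starting point.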

A straightforward way to compute $\nabla_\theta L_a(\theta)$ is using the automatic differentiation (AD) method [76]. This strategy is used by some methods, e.g., SuperGlue in [57] and GWF in [58]. However, the memory cost of this strategy is quadratic in the number of Sinkhorn iterations and the number of views (i.e., $O(J^2 S^2)$). This is undesirable because both the number of Sinkhorn iterations and the number of views can be large in practice. Fortunately, Theorem 2 in [18] demonstrates that we can compute $\nabla_\theta L_a(\theta)$ in closed form based on the Karush–Kuhn–Tucker (KKT) conditions. Specifically, we have

2. The entropic regularization makes $W$ smooth, and $\widehat{D}_{\text{sw}}(\theta)$ is a differentiable function of $\theta$. Therefore, $W$ is differentiable with respect to $\theta$.


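Algorithm 2 The optimization of (16)
1: Denote $\theta = \{f_s, U_s\}_{s=1}^{S}$.
2: Initialize $W = \frac{1}{S(S-1)} (\mathbf{1}_S \mathbf{1}_S^{\top} - I_S)$.
3: For each epoch:
4:   Sample batches $\{X_s\}_{s=1}^{S}$ from the views.
5:   Calculate $D(\theta) = [\widehat{D}_{\text{sw}}(U_s f_s(X_s), U_{s'} f_{s'}(X_{s'}))]$.
6:   $D(\theta) = D(\theta) + \|D(\theta)\|_1 I_S$.
7:   HOT:
8:     Fix $D(\theta)$ and learn $W$ by Algorithm 1: $\min_{W \in \Pi(\frac{1}{S}\mathbf{1}_S, \frac{1}{S}\mathbf{1}_S)} \langle W, D \rangle + \beta \langle W, \log W \rangle$.
9:     Fix $W$ and calculate the loss function $R_M(\theta)$ as $\langle W, D(\theta) \rangle + \alpha \|\sum_s U_s f_s(X_s) f_s^{\top}(X_s) U_s^{\top} - I_d\|_F^2$.
10:    Update $\{f_s, U_s\}_{s=1}^{S}$ by Adam [74].
11:  DHOT:
12:    Learn $a^*(\theta)$ and $b^*(\theta)$ by Algorithm 1: $W(\theta) = (a^*(\theta) (b^*(\theta))^{\top}) \odot \exp(-\frac{D(\theta)}{\beta})$.
13:    Calculate the loss function $R_M(\theta)$ as $\langle W(\theta), D(\theta) \rangle + \alpha \|\sum_s U_s f_s(X_s) f_s^{\top}(X_s) U_s^{\top} - I_d\|_F^2$.
14:    Update $\{f_s, U_s\}_{s=1}^{S}$ by Adam [74].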

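Proposition 4. [Theorem 2 in [18]] The gradient of the objective function of the upper-level problem with respect to the model parameter is

$$\nabla_\theta L_a(\theta) = \frac{1}{\beta} \sum_{s,k=1}^{S,K} \Big( (1 - \widehat{D}_{sk}) w^*_{sk} + \sum_{h,l=1}^{S,K} \widehat{D}_{hl} w^*_{hl} \big( \nabla_{\widehat{D}_{sk}} a^*_h + \nabla_{\widehat{D}_{sk}} b^*_l \big) \Big) \nabla_\theta w^*_{sk} + \alpha \nabla_\theta R(\theta), \tag{23}$$

where $\widehat{D}_{sk}$, $w^*_{sk}$, $a^*_h$, and $b^*_l$ are the elements of the distance matrix $\widehat{D}_{\text{sw}}$, the optimal transport matrix $W^*$, and the dual variables $\{a^*, b^*\}$, respectively, and

$$\begin{bmatrix} \nabla_{\widehat{D}} a^* \\ \nabla_{\widehat{D}} b^* \end{bmatrix} = \begin{bmatrix} H^{-1} D \\ 0 \end{bmatrix},$$

where $H^{-1} D \in \mathbb{R}^{(S+K-1) \times S \times K}$, $0 \in \mathbb{R}^{1 \times S \times K}$, and

$$D_{lsk} = \begin{cases} \delta_{ls} w^*_{sk}, & l = 1, ..., S, \\ \delta_{lk} w^*_{sk}, & l = S+1, ..., S+K-1, \end{cases} \qquad H = \begin{bmatrix} \mathrm{diag}(u) & W^*_{:,1:K-1} \\ (W^*_{:,1:K-1})^{\top} & \mathrm{diag}(v_{1:K-1}) \end{bmatrix},$$

where $\delta_{ls} = 1$ if $l = s$, and $\delta_{ls} = 0$ otherwise.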

For (16), we set K = S as the number of views. For (17), we set K as the number of clusters. Please refer to [18] for the derivation of Proposition 4. In summary, we compare our DHOT method to the alternating optimization strategy (denoted as “HOT”) on solving (16) and (17), respectively. The details of the algorithms are shown in Algorithm 2 and Algorithm 3, respectively.

Algorithm 3 The optimization of (17)
1: Denote $\theta = \{\{f_s, U_s\}_{s=1}^{S}, \{G_k\}_{k=1}^{K}\}$.
2: Initialize $W = \frac{1}{SK} \mathbf{1}_S \mathbf{1}_K^{\top}$.
3: Initialize $\{G_k\}_{k=1}^{K}$ as $K$ random matrices.
4: For each epoch:
5:   Sample batches $\{X_s\}_{s=1}^{S}$ from the views.
6:   Calculate $D(\theta) = [\widehat{D}_{\text{sw}}(U_s f_s(X_s), G_k)]$.
7:   HOT:
8:     Fix $D(\theta)$ and learn $W$ by Algorithm 1: $\min_{W \in \Pi(\frac{1}{S}\mathbf{1}_S, \frac{1}{K}\mathbf{1}_K)} \langle W, D \rangle + \beta \langle W, \log W \rangle$.
9:     Fix $W$ and calculate the loss function $R_M(\theta)$ as $\langle W, D(\theta) \rangle + \alpha \|\sum_k G_k G_k^{\top} - I_d\|_F^2$.
10:    Update $\{f_s, U_s\}_{s=1}^{S}$, $\{G_k\}_{k=1}^{K}$ by Adam [74].
11:  DHOT:
12:    Learn $a^*(\theta)$ and $b^*(\theta)$ by Algorithm 1: $W(\theta) = (a^*(\theta) (b^*(\theta))^{\top}) \odot \exp(-\frac{D(\theta)}{\beta})$.
13:    Calculate the loss function $R_M(\theta)$ as $\langle W(\theta), D(\theta) \rangle + \alpha \|\sum_k G_k G_k^{\top} - I_d\|_F^2$.
14:    Update $\{f_s, U_s\}_{s=1}^{S}$, $\{G_k\}_{k=1}^{K}$ by Adam [74].

The symmetry of $W$ in (16). It should be noted that the optimal transport learned by the Sinkhorn scaling algorithm is asymmetric in general. However, in (16), the underlying distance matrix $\widehat{D}_{\text{sw}}$ is symmetric and the marginal constraints of the transport are the same (i.e., $W \in \Pi(\tfrac{1}{S}\mathbf{1}_S, \tfrac{1}{S}\mathbf{1}_S)$). Therefore, the optimal transport will be symmetric after sufficient iterations. The proof is straightforward. If $W$ is an optimal transport matrix minimizing the entropic optimal transport problem in (16), then its transpose $W^{\top}$ is another optimum. Because the entropic optimal transport problem is strictly convex and therefore has a unique global optimum, we have $W^{\top} = W$.
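The symmetry argument can be checked numerically. The sketch below runs the Sinkhorn iteration on a random symmetric cost matrix with identical uniform marginals, after adding the diagonal offset from step 6 of Algorithm 2. The random matrix is a stand-in for a real sliced Wasserstein distance matrix, and reading $\|D(\theta)\|_1$ as the entry-wise 1-norm is an assumption.

```python
import numpy as np

def sinkhorn_plan(D, u, v, beta, num_iters):
    """Entropic OT plan via Sinkhorn scaling (same iteration as Algorithm 1)."""
    C = np.exp(-D / beta)
    a = u.copy()
    for _ in range(num_iters):
        b = v / (C.T @ a)
        a = u / (C @ b)
    return a[:, None] * C * b[None, :]

rng = np.random.default_rng(0)
S = 5
A = rng.uniform(size=(S, S))
D = (A + A.T) / 2                      # symmetric stand-in for the distance matrix
np.fill_diagonal(D, 0.0)               # a view is at distance zero from itself
D = D + np.abs(D).sum() * np.eye(S)    # diagonal offset (entry-wise 1-norm assumed)

u = np.full(S, 1 / S)
W = sinkhorn_plan(D, u, u, beta=0.5, num_iters=1000)
asymmetry = np.abs(W - W.T).max()      # should vanish once the iteration converges
diag_mass = np.trace(W)                # mass a view sends to itself
```

The large diagonal cost pushes almost no mass onto the diagonal, which is how the algorithm avoids the trivial solution in which every view is matched only to itself.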

5.2 Complexity analysis

Given $S$ views, each of which contains $N$ samples, the computational complexity of the HOT distance is $O(S^2(MN\log N + J))$ for (16) and $O(SK(MN\log N + J))$ for (17). Here, $M$ is the number of random projections used to compute a sliced Wasserstein distance, which is much smaller than $N$; $J$ is the number of iterations used in the Sinkhorn scaling algorithm; and $K$ is the number of clusters for the views. The first term, $O(S^2 MN\log N)$ (respectively $O(SKMN\log N)$), corresponds to calculating the sliced Wasserstein distance matrix, where $N\log N$ is the complexity of the sorting operation in (12). The second term, $O(S^2 J)$ (respectively $O(SKJ)$), corresponds to computing the entropic Wasserstein distance based on the Sinkhorn scaling algorithm.

Existing HOT distances, however, apply the Wasserstein distance [31], [53], [60] or the entropic Wasserstein distance [61] as the underlying distance. For each pair of views, their computational complexity is $O(N^3)$ when applying linear programming, or $O(JN^2)$ when applying the Sinkhorn scaling algorithm, which is higher than that of the sliced Wasserstein distance ($O(MN\log N)$). Some methods avoid these computations by assuming the distribution to be Gaussian [31], [60], which increases the risk of over-regularization. According to the analysis above, the HOT distance used in our model has much lower computational complexity than existing HOT distances. Moreover, it does not impose any assumptions on the latent distributions of the views.
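For a sense of scale, the leading-order terms above can be compared for one pair of views. The sketch below ignores constant factors, assumes a base-2 logarithm, and borrows $N = 400$, $M = 3$, and $J = 20$ from the batch size, number of projections, and Sinkhorn iterations reported in Section 6.2; it is an order-of-magnitude illustration, not a measurement.

```python
import math

def pairwise_costs(n, m, j):
    """Leading-order operation counts for one pair of views (constants ignored)."""
    return {
        "sliced_wasserstein": m * n * math.log2(n),   # M N log N
        "sinkhorn": j * n ** 2,                       # J N^2
        "linear_program": n ** 3,                     # N^3
    }

costs = pairwise_costs(n=400, m=3, j=20)
for name, value in costs.items():
    print(f"{name:>20s}: {value:.2e}")
```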

6 EXPERIMENTS

6.1 Datasets

We demonstrate the usefulness of our DHOT method on both synthetic and real-world datasets and compare it with state-of-the-art methods in multi-view classification tasks. Specifically, we consider the following four classification datasets used in [24], [31]: Caltech7, Caltech20, Handwritten, and Cathgen. The Caltech7, Caltech20, and Handwritten datasets correspond to three image classification tasks. Each contains six kinds of visual features extracted by classic methods. The details of the feature extraction methods are provided at https://github.com/yeqinglee/mvdata. The Cathgen dataset is a real-world dataset of 8,000 patients. Each patient contains five kinds of features, which help predict the likelihood of a myocardial infarction (i.e., a binary classification task). The statistics of these datasets are summarized in Table 1.

TABLE 1: The statistics of each dataset (the name of view / the dimension of view)

| Dataset | # Samples | # Classes | View 1 / D1 | View 2 / D2 | View 3 / D3 | View 4 / D4 | View 5 / D5 | View 6 / D6 |
|---|---|---|---|---|---|---|---|---|
| Caltech-7/20 | 1474 / 2386 | 7 / 20 | Gabor / 48 | WM / 40 | CENTRIST / 254 | HOG / 1984 | GIST / 512 | LBP / 928 |
| Handwritten | 2000 | 10 | Pixel / 240 | Fourier / 76 | FAC / 216 | ZER / 47 | KAR / 64 | MOR / 6 |
| Cathgen | 8000 | 2 | Protein / 21 | Metabolite / 60 | Demographic / 7 | Clinic / 23 | Genetic / 67 | - |

[Fig. 2 panels: (a) Caltech7, (b) Caltech20, (c) Handwritten, (d) Cathgen]

Fig. 2: The comparison for various multi-view learning methods on four classification tasks. The thin dotted curves correspond to the baselines that use well-aligned data while the thick solid curves correspond to our DHOT method that solves (16) and (17), respectively.

TABLE 2: Classification accuracy (%) of semi-supervised multi-view learning methods (5% labeled data are provided)

| Data | Well-aligned views | | | | Unaligned views | |
|---|---|---|---|---|---|---|
| $R_M$ in (9) | LSCCA [27] | DGCCA [15] | AECCA [14] | COMIC [77] | DHOT (16) | DHOT (17) |
| Caltech7 | 87.36±1.43 | 87.60±1.08 | 87.62±1.47 | 90.13±1.22 | 88.54±1.73 | 89.73±1.42 |
| Caltech20 | 71.20±2.74 | 71.80±2.61 | 71.50±2.84 | 74.10±2.53 | 72.30±2.40 | 73.88±1.66 |
| Handwritten | 87.98±3.46 | 87.12±3.89 | 88.53±3.22 | 91.58±3.58 | 90.06±2.89 | 91.22±2.39 |
| Cathgen | 68.12±1.07 | 67.78±0.83 | 67.95±0.73 | 70.52±0.69 | 68.59±1.32 | 69.36±0.71 |

In each classification task, we apply various multi-view learning methods to learn the latent representations of each view in an unsupervised way, and then, train classifiers based on the learned latent representations. We evaluate the multi-view learning methods quantitatively based on their corresponding classification accuracy, which reflects the quality of the learned representations.

6.2 Comparisons on multi-view classification

To verify the effectiveness of our DHOT method, we consider four well-known multi-view learning methods as baselines:

• LSCCA: The least-square CCA in [27].

• DGCCA: The deep generalized CCA in [15].

• AECCA: The autoencoder-assisted CCA in [14].

• COMIC: The cross-view matching clustering in [77].

Among them, the COMIC is the state-of-the-art multi-view learning method, which regularizes the latent representations of different views by a geometric consistency regularizer and a cluster assignment consistency regularizer jointly. Note that, all four baselines require well-aligned multi-view data.

To achieve fairness in these comparisons, our method and the baselines apply models with the same architecture and the same hyperparameters. In particular, for each method we train $S$ multi-layer perceptron (MLP) models as encoders and a softmax layer as a classifier. We set the hyperparameters empirically as follows: the number of epochs is 100; the learning rate is fixed as 0.001; the batch size is 400; for each $f_s$, the dimension of its output is 20; the dimension of the common latent space is 10; in (16) and (17), we set $\alpha = 0.01$; for the Sinkhorn algorithm, the number of iterations is 20 and $\beta = 0.1$; for the sliced Wasserstein distance, the number of projections $M$ is set to 3. We implement all the methods with PyTorch and train their models on a single NVIDIA GTX 1080 Ti GPU.
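For reference, the settings listed above can be gathered into a single configuration object. The class name `DHOTConfig` and its field names are hypothetical; only the values come from the text (the number of view clusters, $K = 3$, is the value stated for (17) in Section 4.3).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DHOTConfig:
    num_epochs: int = 100
    learning_rate: float = 1e-3
    batch_size: int = 400
    encoder_output_dim: int = 20   # output dimension of each encoder f_s
    latent_dim: int = 10           # dimension d of the common latent space
    alpha: float = 0.01            # weight of the regularizer in (16) and (17)
    beta: float = 0.1              # weight of the entropic term h(W)
    sinkhorn_iters: int = 20       # J
    num_projections: int = 3       # M
    num_view_clusters: int = 3     # K in (17)

config = DHOTConfig()
```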

For each dataset, we evaluate our method and the baselines in 20 trials. In each trial, we randomly select 60% of the samples for training, 20% of the samples for validation, and the remaining 20% of the samples for testing. For the training data, we set the percentage of well-aligned and labeled data in the range from 5% to 25%. These well-aligned labeled data are used to train the classifier. For the remaining unlabeled training data, we keep them well-aligned for the baselines, while we make them unaligned by randomly permuting the samples in each view when applying our DHOT method. Fig. 2 shows the average classification accuracy and the standard deviation achieved by these methods.
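The protocol for a single trial can be sketched as follows. The function name `make_trial`, the toy data, and the rounding of split sizes are assumptions for illustration; the key step is that each view receives its own random permutation, so corresponding rows no longer describe the same sample.

```python
import numpy as np

def make_trial(views, labeled_frac=0.05, seed=0):
    """Split one trial and break the alignment of the unlabeled training data.

    views: list of arrays, each of shape (num_samples, D_s), row-aligned.
    """
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    order = rng.permutation(n)
    n_train, n_val = (6 * n) // 10, (2 * n) // 10          # 60% / 20% / 20%
    train, val, test = np.split(order, [n_train, n_train + n_val])

    n_lab = round(labeled_frac * n_train)
    labeled, unlabeled = train[:n_lab], train[n_lab:]
    aligned_labeled = [X[labeled] for X in views]           # kept aligned
    # An independent permutation per view destroys the correspondence.
    unaligned = [X[rng.permutation(unlabeled)] for X in views]
    return aligned_labeled, unaligned, val, test

# Toy data: two views whose first column stores the sample index.
n = 100
views = [np.column_stack([np.arange(n), np.zeros(n)]) for _ in range(2)]
aligned_labeled, unaligned, val, test = make_trial(views, labeled_frac=0.05)
```

Each view still contains the same set of unlabeled samples after the shuffle; only the row order, and hence the alignment, is lost.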

We can find that compared with the traditional methods that require well-aligned training data, our DHOT method applies unaligned training data but achieves at least comparable performance on classification accuracy. In particular, only the state-