
UNSUPERVISED MULTI-TASK LEARNING FOR 3D SUBTOMOGRAM IMAGE ALIGNMENT, CLUSTERING AND SEGMENTATION

Item Type Conference Paper

Authors Zhu, Haoyi; Wang, Chuting; Wang, Yuanxin; Fan, Zhaoxin; Uddin, Mostofa Rafid; Gao, Xin; Zhang, Jing; Zeng, Xiangrui; Xu, Min

Citation Zhu, H., Wang, C., Wang, Y., Fan, Z., Uddin, M. R., Gao, X., Zhang, J., Zeng, X., & Xu, M. (2022). Unsupervised Multi-Task Learning for 3D Subtomogram Image Alignment, Clustering and Segmentation. 2022 IEEE International Conference on Image Processing (ICIP). https://doi.org/10.1109/icip46576.2022.9897919

Eprint version Post-print

DOI 10.1109/ICIP46576.2022.9897919

Publisher IEEE

Rights This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE.

Download date 2023-12-01 19:21:21

Link to Item http://hdl.handle.net/10754/689843


Haoyi Zhu¹, Chuting Wang², Yuanxin Wang³, Zhaoxin Fan³, Mostofa Rafid Uddin³, Xin Gao², Jing Zhang⁴, Xiangrui Zeng³*, and Min Xu³*

¹ School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, China

² Structural and Functional Bioinformatics Group, King Abdullah University of Science and Technology, Saudi Arabia

³ School of Computer Science, Carnegie Mellon University, USA

⁴ Department of Computer Science, University of California, Irvine, USA

ABSTRACT

3D subtomogram image alignment, clustering, and segmentation are vital to macromolecular structure recognition in cryo-electron tomography (cryo-ET). However, acquiring ground-truth labels to train a unified deep learning model that can simultaneously deal with these tasks is unaffordable. To this end, we propose an end-to-end unified multi-task learning framework to simultaneously complete the three tasks, where models are trained in an unsupervised manner without using any labels. In particular, we have three parallel branches. In the alignment branch, we adopt a two-stage training scheme, i.e., self-supervised pretraining and constrained unsupervised training using our proposed skip correlation attention layer and constrained loss. Synchronously, in the clustering branch, the learned deep cluster features are utilized to iteratively cluster subtomograms into groups using pseudo-labels from an image-wise Gaussian Mixture Model (GMM). Meanwhile, in the segmentation branch, we use rough pseudo-labels generated from a voxel-wise GMM as supervision signals, and prior knowledge from humans is utilized to jointly learn how to correct these labels as well as predict reliable segmentation results. Benefiting from the end-to-end unified network architecture, our method achieves overall state-of-the-art performance on both simulated and real subtomogram processing benchmarks.

Index Terms—unsupervised multi-task learning, subtomogram alignment, subtomogram cluster, subtomogram segmentation

1. INTRODUCTION

Cryo-electron tomography (cryo-ET), as a 3D structural biology imaging technique, has aroused significant research interest given its power for understanding cell-level biological processes, i.e., detecting the structure of 3D tomograms [1]. A subtomogram is defined as a cubic subvolume extracted from a tomogram that contains one macromolecular complex. In cryo-ET image processing, subtomogram alignment [2] [3], clustering [4] [5], and segmentation [6] [7] are three crucial tasks for downstream macromolecular structure analysis. Nevertheless, obtaining annotated training data is extremely expensive, especially for 3D tomograms with a low signal-to-noise ratio (SNR), which makes learning-based methods hard to adopt widely in the community [8]. Besides, most existing works tackle the three tasks independently, failing to make full use of the relational information hidden in their deep features.

* denotes corresponding authors

To solve part of the above-mentioned issues, researchers in the cryo-ET processing field have proposed an end-to-end, fully unsupervised architecture for subtomogram alignment [3]. However, it only takes the alignment task into consideration. The other two tasks, though important, are neglected. Then, Zeng et al. [5] propose a novel model named Jim-Net, which can simultaneously perform alignment and clustering in an unsupervised manner. Nevertheless, unsupervised subtomogram segmentation is still not considered. Though there exists a method [7] for cryo-ET data segmentation, it only partly achieves the unsupervised training goal, using the so-called one-shot setting. What's more, the relational information among the three tasks is not utilized by any of the above methods. We note here that such relational information hidden in learned features is important for these tasks, especially for unsupervised training. For instance, to achieve alignment, the model is encouraged to learn the semantic information about some landmarks as well as some global and local geometric and semantic features. These landmark semantics and features in turn also benefit the segmentation task and the clustering task.

Given the importance of the information in the shared features for performing unsupervised alignment, clustering, and segmentation tasks on cryo-ET data, a natural thought is to learn the three tasks simultaneously with improved efficiency and accuracy. Therefore, in this work, we propose the first 3D end-to-end unified fully-unsupervised multi-task imaging framework to deal with subtomogram alignment, clustering, and segmentation in a highly noisy cryo-ET environment. We combine three novel branches as shown in Fig. 1: (1) a coarse-to-fine alignment branch with performance boosted by skip correlation attention and self-supervised pre-training techniques; (2) a clustering branch trained using pseudo-labels generated from an image-wise GMM [9] conditioned on semantic image representations; (3) a segmentation branch that incorporates a voxel-wise GMM-based binary pseudo-label (subtomogram and background) generation process so that the model can be trained in an unsupervised fashion. Therefore, our whole framework is end-to-end trainable. Extensive experiments demonstrate that our method can achieve state-of-the-art results on both a simulated cryo-ET benchmark and a real dataset. The code is available in AITom [10].

[Figure 1: network diagram. Source and target inputs pass through shared-weight feature extractors into a coarse alignment branch and a fine alignment branch (correlation, siamese feature matching, SCA, FCs, STN), a clustering branch (feature extractor, FCs, clusters 1 to n), and a segmentation branch (decoders producing binary segmentation predictions). Legend: STN = spatial transformer network; FCs = fully connected layers; SCA = skip correlation attention; operators shown are multiplication and concatenation.]

Fig. 1. The whole pipeline of our unsupervised multi-task framework consisting of three branches. Given source and target inputs, the coarse-to-fine alignment branch outputs the transformation parameters between them and the transformed source image. The clustering branch outputs the cluster assignments of the inputs and the segmentation branch gives the binary segmentation predictions of source, target and the transformed images.

2. METHOD

The pipeline of our work is shown in Fig. 1. Given subtomograms as input, our goal is to do alignment, clustering and segmentation using a single network. We have three branches that share the same input feature to achieve the goal. Next, we will detail each branch.

2.1. Alignment Branch

The goal of subtomogram alignment is to take a source subtomogram $s$ and a target subtomogram $t$ as input and geometrically transform $s$ into $s'$. The transformation should be rigid in our task. We build our alignment branch based on the coarse-to-fine architecture proposed by [5] and the siamese feature matching module proposed by [3]. In addition, we propose a novel skip correlation attention, an unsupervised alignment loss, and a novel self-supervised pre-training mechanism to further enhance its performance.

2.1.1. Skip Correlation Attention

Inspired by the fact that humans conduct alignment by only observing some key regions, we design a skip correlation attention (SCA) block for our alignment branch to utilize powerful long-range subtomogram features learned from neural attention mechanisms. The design of SCA is mainly motivated by the observation that the correlation maps output by the feature extractors record the similarity between each point in $s$ and each point in $t$, so the maximal response at a location indicates how well it can be matched. We therefore use this extremum as our key information.

Specifically, we concatenate the maximal responses of $Cor(s, t)$ and $Cor(t, s)$ to regress the attention weight. We also adopt a skip connection scheme to compensate for the information loss introduced by the downsampling stages in the siamese matching module, which means the attention weight skips the downsampling stages and directly operates on the regressors shown in Fig. 1. The whole process can be represented as:

$$A = F_A\left(\max_c Cor(s, t),\ \max_c Cor(t, s)\right) \tag{1}$$

$$\hat{\theta} = F_R\left(A \cdot S(Cor(s, t), Cor(t, s))\right) \tag{2}$$

where $A$ denotes the attention weight computed by the attention layer $F_A$, which consists of some fully-connected layers, and the siamese matching module (some CNN downsampling stages) is represented as $S$. $\hat{\theta}$ denotes the transformation parameters output by the regressor $F_R$.
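To make the data flow of Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch and not the authors' implementation. Several pieces are assumptions made only for illustration: the cosine-normalised correlation, the two-layer $F_A$ with a sigmoid output, the pooled linear stand-in for the siamese matching module $S$, the linear regressor $F_R$, the 6-dimensional $\hat{\theta}$, and all names (`correlation`, `init_params`, `sca_forward`) and layer sizes.

```python
import numpy as np

def correlation(fa, fb):
    """Cor(a, b): entry [c, i] is the cosine similarity between position i of fa
    and position c of fb, so the channel axis c indexes positions in fb."""
    a = fa.reshape(fa.shape[0], -1)                      # (C, N)
    b = fb.reshape(fb.shape[0], -1)                      # (C, N)
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + 1e-8)
    return b.T @ a                                       # (N, N)

def init_params(n_pos, hidden=32, feat=16, n_theta=6, seed=0):
    """Random weights for the hypothetical layers used in this sketch."""
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(scale=0.1, size=shape)
    return {"W1": w(hidden, 2 * n_pos), "W2": w(feat, hidden),   # F_A
            "Ws": w(feat, 2 * n_pos),                            # stand-in for S
            "Wr": w(n_theta, feat)}                              # F_R

def sca_forward(fs, ft, p):
    cor_st, cor_ts = correlation(fs, ft), correlation(ft, fs)
    # Eq. (1): maximal response over the channel axis, fed to F_A.
    key = np.concatenate([cor_st.max(axis=0), cor_ts.max(axis=0)])   # (2N,)
    hidden = np.maximum(p["W1"] @ key, 0.0)
    attn = 1.0 / (1.0 + np.exp(-(p["W2"] @ hidden)))                 # A in (0, 1)
    # Assumed stand-in for S: pool each correlation map, then project.
    pooled = np.concatenate([cor_st.mean(axis=0), cor_ts.mean(axis=0)])
    matched = np.tanh(p["Ws"] @ pooled)
    # Eq. (2): the attention weight skips S and scales its output before F_R.
    theta_hat = p["Wr"] @ (attn * matched)
    return attn, theta_hat

rng = np.random.default_rng(1)
fs, ft = rng.normal(size=(2, 8, 4, 4, 4))    # two feature maps of shape (C, D, H, W)
attn, theta_hat = sca_forward(fs, ft, init_params(n_pos=4 * 4 * 4))
```

The point of the sketch is the skip path: `attn` is computed from the raw correlation maxima and is applied after the (here simplified) matching module, rather than being passed through it.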

2.1.2. Unsupervised Alignment Loss

One direct method for unsupervised 3D subtomogram alignment is to compare the similarity between $s'$ and $t$ [11, 12, 13]. Inspired by [5], to combat the issues arising from the missing wedge effect and the low SNR, the constrained cross-correlation loss is applied. The intuition behind this loss comes from the observation that high-frequency regions are more likely to be affected by noise. By transforming the data into Fourier space, the model is constrained to focus only on the low-frequency, observed regions.

$$L_{CCC}(s, t) = 1 - \frac{\sum_{i=1}^{N} (s^*_i - \bar{s}^*)(t^*_i - \bar{t}^*)}{\sqrt{\sum_{i=1}^{N} (s^*_i - \bar{s}^*)^2}\ \sqrt{\sum_{i=1}^{N} (t^*_i - \bar{t}^*)^2}}, \tag{3}$$

where $s^* = \Re\left[\mathcal{F}^{-1}\left((\mathcal{F}s) \cdot H \cdot M_t \cdot T_{R_\phi}(M_s)\right)\right]$. $\Re$ denotes the real part and $\mathcal{F}$ is the Fourier transform operator. $T_{R_\phi}$ is the 3D rotation operator with parameters from the network regressor; $H$ is the low-pass filter defined by $H(\xi) = \mathbb{1}_{\|\xi\|_\infty \leq t}(\xi)$, where $\xi$ is a 3D location in the Fourier space and $t$ is the frequency threshold; $M_s$ (similarly for $M_t$) is the binary mask indicating the observed region of $s$ due to the missing wedge effect, defined by $M_s(\xi) = \mathbb{1}_{|\xi_3| \leq |\xi_1| \tan(\theta)}(\xi)$, where $\theta$ refers to the tilt-angle range $\pm\theta$.
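A minimal NumPy sketch of Eq. (3) may help in reading the masks. It rests on stated assumptions: frequencies are taken as integer FFT indices, the rotated mask $T_{R_\phi}(M_s)$ is supplied by the caller (the demo simply uses the identity rotation), the same filter is applied to both volumes, and the function names are hypothetical.

```python
import numpy as np

def freq_grid(shape):
    """Integer frequency coordinates (xi_1, xi_2, xi_3) in np.fft.fftn layout."""
    axes = [np.fft.fftfreq(n) * n for n in shape]
    return np.meshgrid(*axes, indexing="ij")

def low_pass(shape, thresh):
    """H(xi) = 1 where the infinity norm of xi is at most thresh."""
    x1, x2, x3 = freq_grid(shape)
    inf_norm = np.maximum.reduce([np.abs(x1), np.abs(x2), np.abs(x3)])
    return (inf_norm <= thresh).astype(float)

def wedge_mask(shape, tilt_deg):
    """M(xi) = 1 where |xi_3| <= |xi_1| tan(theta): the observed region."""
    x1, _, x3 = freq_grid(shape)
    return (np.abs(x3) <= np.abs(x1) * np.tan(np.deg2rad(tilt_deg))).astype(float)

def constrain(vol, H, M_joint):
    """Real part of the inverse FFT of the filtered, masked spectrum."""
    return np.real(np.fft.ifftn(np.fft.fftn(vol) * H * M_joint))

def ccc_loss(a, b, H, M_joint, eps=1e-12):
    """1 minus the Pearson correlation of the two constrained volumes."""
    a_star, b_star = constrain(a, H, M_joint), constrain(b, H, M_joint)
    da, db = a_star - a_star.mean(), b_star - b_star.mean()
    denom = np.sqrt((da ** 2).sum()) * np.sqrt((db ** 2).sum()) + eps
    return 1.0 - (da * db).sum() / denom

rng = np.random.default_rng(0)
vol = rng.normal(size=(16, 16, 16))
H = low_pass(vol.shape, thresh=4)
# Joint mask M_t * T_R(M_s); identity rotation assumed for this demo.
M_joint = wedge_mask(vol.shape, 60.0) * wedge_mask(vol.shape, 60.0)
loss_same = ccc_loss(vol, vol, H, M_joint)    # identical volumes
loss_flip = ccc_loss(vol, -vol, H, M_joint)   # perfectly anti-correlated volumes
```

Because the loss is one minus a correlation coefficient, it ranges from 0 (perfectly correlated constrained volumes) to 2 (perfectly anti-correlated).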

2.1.3. Self-Supervised Pre-Training

Due to the indirectness of the similarity-based loss and the extremely low SNR, the network is still prone to falling into local minima. Therefore, we propose a self-supervised pre-training stage for alignment to provide the network with a direct supervision signal, as well as to learn a better feature extractor.

The pipeline is shown in Fig. 2. Given a source input without a missing wedge, we randomly transform it to generate a pseudo target $t$ and manually add a random missing wedge effect to generate a pseudo source $s$. Since the transformation is known, we can use it as a training goal to directly supervise the training process. We use the Mean Squared Error (MSE) as our direct loss, together with the differentiable soft-inlier loss proposed by [14], which acts on feature maps to learn a better feature extractor. The total pre-training loss can be expressed as:

$$L_{ssp} = MSE(\hat{\theta}, \theta) + \gamma_{ssp} L_{inlier}(F), \tag{4}$$

where $F$ is the feature maps, $\theta$ represents the pseudo ground-truth, and $\gamma_{ssp}$ is a hyperparameter.

[Figure 2: diagram. An input subtomogram is passed through an STN with random θ and a missing wedge effect to form the source and target pair; both go through a shared feature extractor (weak inlier loss) and siamese feature matching (MSE loss).]

Fig. 2. The pipeline of self-supervised pre-training for the alignment branch. The randomly generated transformation parameters are seen as pseudo ground-truth to directly supervise the outputs, while the weak inlier loss is adopted on feature maps to learn a good feature extractor.
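The pair-generation step of the pre-training stage can be sketched as follows. This is a simplified illustration under assumptions that are not taken from the paper: a ZYZ Euler parameterisation, nearest-neighbour resampling, the sampling ranges for angles, shifts and tilt, and the helper names. Only the MSE term of Eq. (4) is shown; the inlier term on feature maps is omitted.

```python
import numpy as np

def rotation_matrix(angles):
    """ZYZ Euler angles (radians) to a 3x3 rotation matrix (assumed convention)."""
    a, b, c = angles
    rz = lambda x: np.array([[np.cos(x), -np.sin(x), 0.0],
                             [np.sin(x), np.cos(x), 0.0],
                             [0.0, 0.0, 1.0]])
    ry = np.array([[np.cos(b), 0.0, np.sin(b)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(b), 0.0, np.cos(b)]])
    return rz(a) @ ry @ rz(c)

def rigid_transform(vol, theta):
    """Rotate about the volume centre, then shift; nearest-neighbour sampling."""
    R, shift = rotation_matrix(theta[:3]), theta[3:]
    size = np.array(vol.shape)
    centre = (size - 1) / 2.0
    grid = np.stack(np.meshgrid(*[np.arange(n) for n in vol.shape],
                                indexing="ij"), axis=-1)         # (D, H, W, 3)
    # Output voxel x is read from R^T (x - centre - shift) + centre.
    src = (grid - centre - shift) @ R + centre
    idx = np.rint(src).astype(int)
    inside = np.all((idx >= 0) & (idx < size), axis=-1)
    idx = np.clip(idx, 0, size - 1)
    return np.where(inside, vol[idx[..., 0], idx[..., 1], idx[..., 2]], 0.0)

def add_missing_wedge(vol, tilt_deg):
    """Zero the Fourier coefficients outside |xi_3| <= |xi_1| tan(theta)."""
    axes = [np.fft.fftfreq(n) * n for n in vol.shape]
    x1, _, x3 = np.meshgrid(*axes, indexing="ij")
    mask = np.abs(x3) <= np.abs(x1) * np.tan(np.deg2rad(tilt_deg))
    return np.real(np.fft.ifftn(np.fft.fftn(vol) * mask))

def make_pair(clean, rng):
    """Pseudo target = random rigid transform; pseudo source = random wedge."""
    theta = np.concatenate([rng.uniform(-np.pi, np.pi, 3), rng.uniform(-2, 2, 3)])
    target = rigid_transform(clean, theta)
    source = add_missing_wedge(clean, rng.uniform(40.0, 70.0))
    return source, target, theta

def mse(theta_hat, theta):
    return float(np.mean((np.asarray(theta_hat) - np.asarray(theta)) ** 2))

rng = np.random.default_rng(0)
clean = rng.normal(size=(12, 12, 12))
source, target, theta = make_pair(clean, rng)
```

In a training loop, `theta` would play the role of the pseudo ground-truth that the regressor output $\hat{\theta}$ is compared against via `mse`.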

2.2. Clustering Branch

Given inputs with different categories, the clustering branch aims to divide them into different clusters. Manually labeling the clusters of subtomogram images is expensive, so we follow [5] to do the task in an unsupervised way. In each iteration, we use a Gaussian Mixture Model (GMM) $G_{\mu,\Sigma}$ to fit the feature representations of the input image pairs $f_X \mid X$, and the cluster predictions $\ell_X \leftarrow G_{\mu,\Sigma}(f_X \mid X)$ are used as pseudo-labels to train our clustering branch. Intuitively, subtomogram images in different clusters are likely to generate features that can be distinguished by $G_{\mu,\Sigma}$, and as training proceeds and the feature extractor improves, our pseudo-labels also become more reliable.

2.3. Segmentation Branch

The segmentation task requires us to predict the category of each voxel in the input images. However, accurate ground truth is unaffordable to obtain. Thus, we train this branch with automatically generated pseudo-labels, which keeps the training scheme unsupervised.

Algorithm 1 Generate binary pseudo-labels for segmentation

Input: set $\{I'\}$, where $I' \in \mathbb{R}^{l \times h \times w \times c}$ is the concatenation of image $I$ and its resized feature maps $f_I$

1: Initialization: for each image $I$ and each voxel $i$, $P_I(i) = -1$ ($-1$ means not computing loss).

2: GMM: voxel-wisely fit 2-component $G_{\mu,\Sigma}$ to $\{I'\}$

3: Binary clusters $\{c\}$ and scores $\{s\} \leftarrow G_{\mu,\Sigma}(\{i'\})$

4: for each image $I$ and each voxel $i$ do

5:     if ($c = 0 \wedge s < Thr_{neg}$) or ($c = 1 \wedge s \geq Thr_{pos}$) then

6:         $P_I(i) = c$

7:     end if

8: end for

9: Remove small and outlier positive areas for each $P$

Output: binary pseudo-labels $\{P\}$
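The thresholding logic of Alg. 1 can be sketched in NumPy as below. This is an illustration under explicit assumptions rather than the released code: the GMM is fitted on voxel intensity only (one channel instead of the concatenated feature maps), the score $s$ is taken to be the posterior probability of the foreground component, the foreground is assumed to be the component with the higher mean, and the threshold values and function names are hypothetical. The final clean-up of small and outlier positive areas is omitted; a connected-component tool such as `scipy.ndimage.label` would be a natural choice for it.

```python
import numpy as np

def fit_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture.
    Returns the posterior probability of the higher-mean component per sample."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return resp[:, int(np.argmax(mu))]

def pseudo_labels(volume, thr_neg=0.1, thr_pos=0.9):
    """Steps 1 to 8 of Alg. 1: -1 = unsure, 0 = background, 1 = foreground."""
    score = fit_gmm_1d(volume.ravel()).reshape(volume.shape)
    cluster = (score >= 0.5).astype(int)
    labels = np.full(volume.shape, -1, dtype=int)        # step 1
    confident = (((cluster == 0) & (score < thr_neg))
                 | ((cluster == 1) & (score >= thr_pos)))  # step 5
    labels[confident] = cluster[confident]               # step 6
    return labels

rng = np.random.default_rng(0)
vol = rng.normal(0.0, 0.1, size=(16, 16, 16))
vol[5:11, 5:11, 5:11] += 3.0          # synthetic bright "macromolecule"
labels = pseudo_labels(vol)
```

Voxels left at -1 are the "unsure" ones that do not take part in the loss.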

Considering the prior that each voxel of a subtomogram image can be seen as either background or foreground (voxels outside or inside the macromolecule), we formulate the segmentation task as a binary segmentation task. To obtain rough binary pseudo-labels, we use a voxel-wise two-component GMM. Voxels with confidence scores higher than $Thr_{pos}$ are selected as the foreground subtomogram area, and background voxels are selected analogously using $Thr_{neg}$. The remaining voxels are considered unsure and do not participate in loss computation. Then, considering that the foreground of a subtomogram must be a connected area, small outlier positive areas are filtered out. Detailed steps are shown in Alg. 1.

| Method | SNR 100 | SNR 0.1 | SNR 0.05 | SNR 0.03 | SNR 0.01 |
|---|---|---|---|---|---|
| H-T align [15] | 0.30±0.68, 1.82±2.69 | 1.22±1.07, 4.76±4.56 | 1.93±0.98, 7.26±4.77 | 2.22±0.77, 8.86±4.72 | 2.38±0.57, 11.33±5.02 |
| F&A align [16] | 0.33±0.70, 1.93±2.86 | 1.34±1.13, 5.39±4.90 | 1.95±0.98, 7.54±4.94 | 2.22±0.77, 8.99±4.81 | 2.38±0.57, 11.32±4.92 |
| Gum-Net [3] | 0.41±0.70, 1.59±2.63 | 0.62±0.69, 2.41±2.61 | 0.87±0.74, 3.20±2.78 | 1.13±0.75, 4.29±2.75 | 1.50±0.78, 6.78±4.22 |
| Jim-Net [5] | 0.29±0.53, 1.28±2.10 | 0.51±0.62, 2.12±2.47 | 0.80±0.73, 3.20±3.02 | 1.02±0.75, 4.12±3.12 | 1.58±0.77, 6.78±3.44 |
| Ours | 0.24±0.47, 1.06±1.78 | 0.45±0.63, 2.01±2.78 | 0.67±0.69, 2.80±2.75 | 0.92±0.73, 3.78±2.86 | 1.56±0.77, 6.68±3.45 |

Table 1. Subtomogram alignment accuracy on benchmark datasets. In each cell (best results highlighted), the first term is the mean and standard deviation of the rotation error and the second term, the translation error.

We apply the Tversky loss [17], which is a generalization of the Dice loss and the Jaccard loss, as our segmentation loss:

$$T(P, L) = \frac{|P \cap L|}{|P \cap L| + \alpha |P - L| + \beta |L - P|}, \tag{5}$$

where $P$ and $L$ are the predicted and generated pseudo ground-truth binary labels in those confident areas.

Besides the above training scheme, we can further improve the reliability of pseudo-labels by exploiting our joint learning strategy with the alignment task. The idea is that the intersection of the aligned source and target pseudo-labels is more reliable and should receive more focus. Letting $L_{s',t} = L_t \cdot \mathbb{1}_{L_t = L_{s'}}$ be the ground-truth labels in the intersection area, we re-formulate our final segmentation loss as:

$$L_{SEG} = T(P_s, L_s) + T(P_t, L_t) + \gamma_{seg} T(P_{s'}, L_{s',t}) \tag{6}$$
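Eqs. (5) and (6) can be written out in a few lines of NumPy. The sketch below makes some assumptions for illustration: unsure voxels carry the label -1 and are excluded, set sizes are computed as sums over binary (or soft) values, and the function names and default weights are hypothetical. Note that $T$ as written equals 1 for a perfect match, so a training loop would typically minimise $1 - T$; that reading is an assumption here and is not stated in the text.

```python
import numpy as np

def tversky(pred, label, alpha=0.5, beta=0.5, eps=1e-8):
    """Eq. (5), evaluated only on confident voxels (label != -1)."""
    valid = label >= 0
    p = pred[valid].astype(float)
    l = label[valid].astype(float)
    inter = (p * l).sum()              # |P ∩ L|
    p_only = (p * (1.0 - l)).sum()     # |P - L|: predicted but not labelled
    l_only = ((1.0 - p) * l).sum()     # |L - P|: labelled but not predicted
    return inter / (inter + alpha * p_only + beta * l_only + eps)

def intersection_labels(l_t, l_s_warped):
    """L_{s',t} = L_t * 1[L_t == L_{s'}]: keep target labels where both agree."""
    return l_t * (l_t == l_s_warped)

def seg_objective(p_s, l_s, p_t, l_t, p_sw, l_sw, gamma_seg=1.0):
    """Eq. (6): source term + target term + weighted intersection term."""
    l_int = intersection_labels(l_t, l_sw)
    return (tversky(p_s, l_s) + tversky(p_t, l_t)
            + gamma_seg * tversky(p_sw, l_int))

pred = np.array([1, 1, 0, 0])
label = np.array([1, 0, 1, -1])        # last voxel is unsure and ignored
dice_like = tversky(pred, label, alpha=0.5, beta=0.5)
jaccard_like = tversky(pred, label, alpha=1.0, beta=1.0)
l_int = intersection_labels(np.array([1, 1, 0, -1]), np.array([1, 0, 0, -1]))
```

With $\alpha = \beta = 0.5$ the index reduces to the Dice coefficient and with $\alpha = \beta = 1$ to the Jaccard index, which is the sense in which it generalises both.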

3. EXPERIMENTS

3.1. Simulated Cryo-ET Subtomogram Benchmark

We first use a realistically simulated cryo-ET benchmark proposed by [3] to evaluate our method. The dataset includes five representative macromolecular complexes: spliceosome, RNA polymerase-rifampicin complex, RNA polymerase II elongation complex, ribosome, and capped proteasome. The SNR levels are 100, 0.1, 0.05, 0.03, and 0.01. For alignment and clustering, we use the L2 distance and accuracy as the metrics, respectively. Results are shown in Tab. 1 and Tab. 2.

It can be seen that our method outperforms previous SOTA methods in most settings, with the largest clustering gains at low SNR. For segmentation, since there is no ground truth, we only show some qualitative results in Fig. 3.

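| Method | SNR 100 | SNR 0.1 | SNR 0.05 | SNR 0.03 | SNR 0.01 |
|---|---|---|---|---|---|
| DeepCluster [18] | 68.7 | 48.8 | 39.4 | 34.0 | 27.2 |
| PICA [19] | 100 | 86.5 | 55.8 | 29.2 | 28.4 |
| Jim-Net [5] | 100 | 99.7 | 96.1 | 85.5 | 48.1 |
| Ours | 99.8 | 99.4 | 99.4 | 98.6 | 64.8 |

Table 2. Subtomogram clustering accuracy.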

Fig. 3. Visualization examples of segmentation results.

3.2. Real Data from Rat Neuron Tomograms

We extract 3 classes of subtomogram structures from the rat neuron tomograms proposed in [20], namely ribosome, single capped proteasome, and TRiC, containing 80, 386 and 125 samples respectively. The dataset's SNR is 0.01, with a tilt angle range of −50 to +70 degrees. Since there is no ground truth for alignment, we use correlation as an indicative evaluation measure following [3]. We achieve 0.0457±0.0027, better than Jim-Net [5] (0.0403±0.0025) and Gum-Net [3] (0.0250±0.0019). The clustering accuracies for the three classes are 88.2, 87.5, and 65.4, with an overall accuracy of 81.9. Another common way to assess alignment performance is to align and average the subtomograms: the better the alignment, the clearer the averaged image. From Fig. 4 we can see that the averaged result is much clearer.

Fig. 4. 2D slices of aligning and averaging results on TRiC.

4. CONCLUSION

In this paper, we propose an end-to-end unsupervised multi-task learning framework that, for the first time, tackles the important problems of subtomogram alignment, clustering and segmentation simultaneously. Our method takes advantage of the relational information among the three branches for different tasks and achieves state-of-the-art results on both a simulated benchmark and real data.

Acknowledgement

We thank Yizhe Ding and Ge Yan for critical comments on the manuscript.


5. REFERENCES

[1] Roman I. Koning, Abraham J. Koster, and Thomas H. Sharp, “Advances in cryo-electron tomography for biology and medicine,” Annals of Anatomy = Anatomischer Anzeiger: official organ of the Anatomische Gesellschaft, vol. 217, pp. 82–96, 2018.

[2] Fernando Amat, Luis R. Comolli, Farshid Moussavi, John Smit, Kenneth H. Downing, and Mark Horowitz, “Subtomogram alignment by adaptive fourier coefficient thresholding,” Journal of Structural Biology, vol. 171, no. 3, pp. 332–344, 2010.

[3] Xiangrui Zeng and Min Xu, “Gum-net: Unsupervised geometric matching for fast and accurate 3d subtomogram image alignment and averaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4073–4084.

[4] Min Xu, Martin Beck, and Frank Alber, “High-throughput subtomogram alignment and classification by fourier space constrained fast volumetric matching,” Journal of Structural Biology, vol. 178, no. 2, pp. 152–164, 2012.

[5] Xiangrui Zeng, Gregory Howe, and Min Xu, “End-to-end robust joint unsupervised image alignment and clustering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 3854–3866.

[6] Min Xu and Frank Alber, “Automated target segmentation and real space fast alignment methods for high-throughput classification and averaging of crowded cryo-electron subtomograms,” Bioinformatics, vol. 29, pp. i274–i282, 2013.

[7] Bo Zhou, Haisu Yu, Xiangrui Zeng, Xiaoyan Yang, Jing Zhang, and Min Xu, “One-shot learning with attention-guided segmentation in cryo-electron tomography,” Frontiers in Molecular Biosciences, vol. 7, 01 2021.

[8] Jan Böhning and Tanmay A. M. Bharat, “Towards high-throughput in situ structural biology using electron cryotomography,” Progress in Biophysics and Molecular Biology, 2020.

[9] Douglas A. Reynolds, “Gaussian mixture models,” in Encyclopedia of Biometrics, 2009.

[10] Xiangrui Zeng and Min Xu, “Aitom: Open-source ai platform for cryo-electron tomography data analysis,” arXiv preprint arXiv:1911.03044, 2019.

[11] Fernando Amat, Luis R. Comolli, Farshid Moussavi, John Smit, Kenneth H. Downing, and Mark Horowitz, “Subtomogram alignment by adaptive fourier coefficient thresholding,” Journal of Structural Biology, vol. 171, no. 3, pp. 332–344, 2010.

[12] Yuxiang Chen, Stefan Pfeffer, José Jesús Fernández, Carlos Oscar S. Sorzano, and Friedrich Förster, “Autofocused 3d classification of cryoelectron subtomograms,” Structure, vol. 22, no. 10, pp. 1528–1537, 2014.

[13] Yuxiang Chen, Stefan Pfeffer, Thomas Hrabe, Jan Michael Schuller, and Friedrich Förster, “Fast and accurate reference-free alignment of subtomograms,” Journal of Structural Biology, vol. 182, no. 3, pp. 235–245, 2013.

[14] Ignacio Rocco, Relja Arandjelović, and Josef Sivic, “End-to-end weakly-supervised semantic alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[15] Min Xu, Martin Beck, and Frank Alber, “High-throughput subtomogram alignment and classification by fourier space constrained fast volumetric matching,” Journal of Structural Biology, vol. 178, no. 2, pp. 152–164, 2012.

[16] Yuxiang Chen, Stefan Pfeffer, Thomas Hrabe, Jan Michael Schuller, and Friedrich Förster, “Fast and accurate reference-free alignment of subtomograms,” Journal of Structural Biology, vol. 182, no. 3, pp. 235–245, 2013.

[17] Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour, “Tversky loss function for image segmentation using 3d fully convolutional deep networks,” in International Workshop on Machine Learning in Medical Imaging. Springer, 2017, pp. 379–387.

[18] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 132–149.

[19] Jiabo Huang, Shaogang Gong, and Xiatian Zhu, “Deep semantic clustering by partition confidence maximisation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8849–8858.

[20] Qiang Guo, Carina Lehmer, Antonio Martínez-Sánchez, Till Rudack, Florian Beck, Hannelore Hartmann, Manuela Pérez-Berlanga, Frédéric Frottin, Mark S Hipp, F Ulrich Hartl, et al., “In situ structure of neuronal c9orf72 poly-ga aggregates reveals proteasome recruitment,” Cell, vol. 172, no. 4, pp. 696–705, 2018.