A framework for deep multitask learning with multiparametric magnetic resonance imaging for the joint prediction of histological characteristics in breast cancer


Item Type Article

Authors Fan, Ming;Yuan, Chengcheng;Huang, Guangyao;Xu, Maosheng;Wang, Shiwei;Gao, Xin;Li, Lihua

Citation Fan, M., Yuan, C., Huang, G., Xu, M., Wang, S., Gao, X., & Li, L. (2022). A framework for deep multitask learning with multiparametric magnetic resonance imaging for the joint prediction of histological characteristics in breast cancer. IEEE Journal of Biomedical and Health Informatics, 1–1. https://doi.org/10.1109/jbhi.2022.3179014

Eprint version Post-print

DOI 10.1109/jbhi.2022.3179014

Publisher Institute of Electrical and Electronics Engineers (IEEE)

Journal IEEE Journal of Biomedical and Health Informatics

Rights (c) 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

Download date 2023-12-23 17:38:00

Link to Item http://hdl.handle.net/10754/678352


Abstract—The clinical management and decision-making process related to breast cancer are based on multiple histological indicators. This study aims to jointly predict the Ki-67 expression level, luminal A subtype and histological grade molecular biomarkers using a new deep multitask learning method with multiparametric magnetic resonance imaging. A multitask learning network structure was proposed by introducing a common-task layer and task-specific layers to learn the high-level features that are common to all tasks and related to a specific task, respectively. A network pretrained with knowledge from the ImageNet dataset was used and fine-tuned with MRI data. Information from multiparametric MR images was fused using strategies at the feature and decision levels. The area under the receiver operating characteristic curve (AUC) was used to measure model performance. For single-task learning using a single image series, the deep learning model generated AUCs of 0.752, 0.722, and 0.596 for the Ki-67, luminal A and histological grade prediction tasks, respectively. The performance was improved by freezing the first 5 convolutional layers, using 20% shared layers and fusing multiparametric series at the feature level, which achieved AUCs of 0.819, 0.799 and 0.747 for the Ki-67, luminal A and histological grade prediction tasks, respectively. Our study showed advantages in jointly predicting correlated clinical biomarkers using a deep multitask learning framework with an appropriate number of fine-tuned convolutional layers by taking full advantage of common and complementary imaging features. Multiparametric image series-based multitask learning could be a promising approach for the multiple clinical indicator-based management of breast cancer.

Index Terms—Breast cancer, deep learning, multitask learning, information fusion, clinical indicators.

This work was supported by National Key R&D program of China (2021YFE0203700), the National Natural Science Foundation of China (61871428 and 61731008, U21A20521), the Natural Science Foundation of Zhejiang Province of China (LJ19H180001), and the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under award nos. REI/1/0018-01-01, REI/1/4216-01-01, REI/1/4216-01-01, and URF/1/4352-01-01. (Corresponding author: Lihua Li)

Lihua Li is with the Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou 310018, China ([email protected]).

Ming Fan, Chengcheng Yuan and Guangyao Huang are with the Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou, China.

Maosheng Xu and Shiwei Wang are with the Department of Radiology, First Affiliated Hospital of Zhejiang Chinese Medical University, Hangzhou, Zhejiang, China.

Xin Gao is with King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Thuwal, 23955-6900, Saudi Arabia.

I. INTRODUCTION

Breast cancer is heterogeneous and characterized by distinct histological subtypes. Histological characteristics are associated with the risk of local and regional relapse [1] and neoadjuvant chemotherapy [2, 3] and are of vital importance for diagnosis, treatment and prognosis analysis [4, 5]. Clinical indicators include the molecular subtypes (i.e., luminal A, luminal B, basal-like breast cancer and human epidermal growth factor receptor type 2 (HER2) overexpression), Ki-67 expression level and histological grade of breast cancer.

Specifically, patients with luminal A subtype breast cancers have prolonged overall survival compared with patients with other subtypes of tumors [6]. High expression levels of Ki-67 are associated with poor overall survival and disease-free survival in breast cancer patients [7]. The histological tumor grade is also an indicator that is of great importance in cancer prognostication [8]. These immunohistochemical biomarkers, which are mainly derived from invasive biopsies of breast tissue, are indispensable for accurate breast cancer management decisions.

Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) reflects the morphologic and functional information of tumors and is routinely used for breast cancer diagnosis and treatment management [9]. Quantitative imaging features extracted from DCE-MRI that represent the blood flow/angiogenesis characteristics of tumors can be efficiently used as predictors of genomic characteristics [10], molecular subtypes [11-13], histological grade [14] and Ki-67 expression status [15-17]. As a complementary technology to DCE-MRI, T2-weighted imaging (T2WI) is advantageous in the detection of edema, hemorrhage, mucus, and cystic fluid in tissue [18].

T2WI-based radiomics has been used to differentiate between seminomas and nonseminomas [19]. In addition to single-parametric imaging, including DCE-MRI and T2WI, multiparametric imaging improves the diagnosis and treatment response evaluation in breast cancer [20]. A deep learning method with multiparametric DCE-MRI and T2WI exhibited better diagnostic performance in discriminating between benign and malignant tumors than DCE-MRI alone [21].


These studies were performed by separately predicting a specific clinical indicator using radiomics derived from a tumor. However, management and treatment decisions are made by the joint analysis of multiple clinical indicators of breast cancer. Moreover, the histological characteristics of a tumor are correlated and complementary. For example, low tumor grade is associated with the luminal A subtype in breast cancer [22]. The complementary information between these clinical indicators should also be investigated in predictive models to facilitate more accurate prediction. Therefore, appropriate methods need to be developed to jointly predict histological characteristics that take full advantage of the correlated and noisy data derived from tumors.

To this end, multitask learning methods have been developed to jointly learn common and complementary information among the targets, with the advantage of reducing the overfitting and noise of the model [23]. Previous studies of multitask learning have been performed to identify common imaging features that capture multiple behavior traits [24] and diagnoses of breast cancer [25]. A recent study performed multitask learning for both lesion detection and diagnosis of prostate cancer [26]. Our recent work performed multitask learning for breast cancer diagnosis using radiomics extracted from multiparametric DCE-MRI and diffusion-weighted imaging (DWI) images of tumors [27]. These previous studies were developed based on human-defined features extracted from tumor images that may be irrelevant to the targets.

Deep learning methods that aim to find nonlinear correlations between the source domain and the target domain are increasingly being proposed. Convolutional neural networks (CNNs) in medical imaging [28-30] have achieved great progress, as in applications of mass detection [31, 32] and tumor diagnosis [33] for breast cancer. A deep multitask learning structure was also introduced [34] and has been applied in the diagnosis of clinical lumbar neural foraminal stenosis [35], Alzheimer's disease [36], sleep staging [37] and multimodal brain imaging-based genetic imaging [38]. Despite these advances, how deep multitask learning can improve the joint prediction performance of breast cancer histological characteristics by integrating DCE-MRI and T2WI and how common complementary information and information with respect to different tasks can be utilized to enhance the prediction results remain unclear.

In this study, we propose a framework for a deep learning-based multitask model to jointly predict clinically related biomarkers (luminal A subtype, histological grade, and Ki-67 expression status) of breast cancer. Our contributions are twofold. First, a multitask learning method using a multiparametric MR image series was developed to jointly predict histological information, including Ki-67 expression status, luminal A subtype and tumor grade. The deep learning structure is specifically designed to learn high-level features that are common to all the tasks and the features that are correlated with specific tasks. Second, our deep learning framework investigates the complementary information of multiple MRI series to enhance the prediction tasks. Three fusion approaches at the feature and decision levels are designed to integrate deep features from multiple image series. We provide an applicable framework to predict multiple histological characteristics of breast cancer without the need for tumor segmentation, which may induce bias and add workload for the prediction.

II. MATERIALS

A. Image data collection

The dataset was obtained from Zhejiang Cancer Hospital. This study was exempt from Institutional Review Board (IRB) approval due to its retrospective nature. Patient characteristics, including Ki-67 expression level, molecular subtypes, histological grade, menopausal status, maximum tumor size and age, are presented in Table I. The dataset included 202 breast cancer patients with an average age of 53.05 years (ranging from 29 to 83 years). No patient had bilateral tumors, and only one breast was included in the analysis for each patient. The samples, together with their tumor characteristics (including the Ki-67 expression level), were randomly separated into a training set (n=122, 60.4%) and a testing set (n=80, 39.6%). No significant differences in age, Ki-67 expression level, luminal A, luminal B, basal-like, HER2, histological grade, menopausal status or maximum tumor size were observed between the training and testing datasets. The χ² test showed a significant correlation between the Ki-67 expression level and luminal A subtype (p=1.31e-26) and between the Ki-67 expression level and tumor grade (p=4.12e-4). Moreover, tumor grade was significantly associated with the luminal A subtype (p=8.14e-5).

TABLE I
PATIENT CHARACTERISTICS

Characteristic        All (n=202)     Training set (n=122)   Testing set (n=80)   P value
Age                   53.05 (29-83)   52.74 (29-81)          53.54 (31-83)        0.144
Ki-67 expression                                                                  0.789
  High                152             91                     61
  Low                 50              31                     19
Luminal A                                                                         0.986
  Positive            38              23                     15
  Negative            164             99                     65
Luminal B                                                                         0.862
  Positive            100             61                     39
  Negative            102             61                     41
Basal-like                                                                        0.780
  Positive            36              21                     15
  Negative            166             101                    65
HER2                                                                              0.970
  Positive            28              17                     11
  Negative            174             105                    69
Histological grade                                                                0.758
  III                 116             69                     47
  I/II                86              53                     33
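The P values in Table I compare the training and testing sets. As a minimal sketch, the Ki-67 row can be checked with a Pearson χ² test on the 2×2 table of counts. The function name `chi_square_2x2` is hypothetical, and the choice of an uncorrected χ² test (no continuity correction) for the split comparison is an assumption, since the text names the χ² test only for the correlations between indicators.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and P value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Ki-67 counts from Table I: rows are High/Low, columns are training/testing.
ki67_statistic, ki67_p = chi_square_2x2([[91, 61], [31, 19]])
print(f"chi2 = {ki67_statistic:.4f}, p = {ki67_p:.3f}")
```

Under these assumptions the computed P value comes out close to the 0.789 reported for Ki-67 expression, which suggests the split is balanced with respect to this indicator.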


B. Image preprocessing

DCE-MR images were acquired for each patient in the prone position using a fat-suppressed T1-weighted sequence on a 3.0T scanning system (Siemens Magnetom Verio 3.0T, Siemens Medical Systems). Bilateral sagittal image acquisition was performed with a dedicated eight-channel double-breast coil. Imaging parameters for T2WI were set as follows: repetition time (TR), 4000 ms; echo time (TE), 70 ms; flip angle, 80°; slice thickness, 4.0 mm; in-plane resolution, 0.759×0.759 mm; parallel imaging factor, 16; and matrix, 448×448. The parameters for DCE-MRI were set as follows: TR, 4.51 ms; TE, 1.61 ms; slice thickness, 1 mm; flip angle, 10°; in-plane resolution, 0.759 mm; field of view (FOV), 340×340 mm; and matrix, 448×448. A precontrast image (termed S0) was acquired, and five sequential postcontrast images (termed S1 to S5) were acquired after intravenous injection of contrast agent with a bolus of 0.1 mmol/kg gadobutrol at a rate of 2 mL/second. The time interval between the consecutive image series was 60 seconds. The images obtained by subtracting the precontrast image (S0) from the five postcontrast images (i.e., S1 to S5) were termed S10, S20, S30, S40 and S50. The unilateral breast areas that contained a tumor, with a size of 244×244, were used as the network input (Figs. 1 and 2).
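The subtraction step described above can be sketched as follows. This is an illustration only: images are held as nested Python lists, the helper names `subtract_precontrast` and `subtraction_series` are hypothetical, and the 2×2 intensities are made-up values rather than patient data.

```python
from typing import Dict, List

Image = List[List[float]]  # a 2-D slice stored as rows of pixel intensities


def subtract_precontrast(post: Image, pre: Image) -> Image:
    """Return the pixel-wise difference post - pre for two equally sized slices."""
    if len(post) != len(pre) or any(len(p) != len(q) for p, q in zip(post, pre)):
        raise ValueError("precontrast and postcontrast slices must have the same shape")
    return [[a - b for a, b in zip(row_post, row_pre)]
            for row_post, row_pre in zip(post, pre)]


def subtraction_series(s0: Image, postcontrast: List[Image]) -> Dict[str, Image]:
    """Map the postcontrast images S1..S5 to the subtraction images S10..S50."""
    return {f"S{k}0": subtract_precontrast(s_k, s0)
            for k, s_k in enumerate(postcontrast, start=1)}


# Toy 2x2 example with made-up intensities.
s0 = [[10.0, 12.0], [11.0, 9.0]]
posts = [[[10.0 + k, 12.0 + 2 * k], [11.0, 9.0 + k]] for k in range(1, 6)]
series = subtraction_series(s0, posts)
print(series["S20"])
```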

Fig. 2. Overview of the network structure. The VGG16 architecture includes 5 convolutional blocks consisting of 13 convolutional layers (termed C1 to C13) connected with max-pooling layers. The convolutional layers and the dense layers were fine-tuned using the labels of the target tasks with varying numbers of frozen layers. The FC layer was divided to represent common-task and task-specific features. Afterward, multiple CNNs with different image series combinations as inputs were implemented and fused to enhance the prediction performance. Images obtained by subtracting the precontrast image (S0) from the five postcontrast images were termed S10, S20, S30, S40 and S50.

[Fig. 2 panel labels: multiparametric image inputs (Series combination 1: S0, S10, S20; Series combination 2: S0, S30, S50; T2WI); CNN with Conv2D and pooling blocks C1-C2, C3-C4, C5-C7, C8-C10 and C11-C13; frozen layers (transfer learning); FC layer with a common-task layer, additional common-task layers and task-specific layers; multiparametric feature fusion (feature concatenation, feature union) and decision-level fusion; multitask learning outputs Task 1, Task 2 and Task 3.]

Fig. 1. Overview of the proposed framework. A deep multitask learning network was implemented with multiparametric image fusion to jointly predict multiple histological indicators in breast cancer. The precontrast images, the images resulting from subtracting the precontrast images from the postcontrast images and the T2WI series were combined to consist of three channels as inputs. Image series combination-based CNNs were implemented and further fused to enhance the prediction performance. S0 represents the precontrast image, and S1 to S5 represent the five sequential postcontrast images. Images obtained by subtracting the precontrast image (S0) from the five postcontrast images were termed S10, S20, S30, S40 and S50. Series combination 1 represents the image series combination comprising S0, S10 and S20, and Series combination 2 denotes the image series combination comprising S0, S30, and S50, both of which are the input channels of the network.

[Fig. 1 panel labels: multiparametric image (DCE-MRI series S0 to S5, T2WI); image preprocessing; image series combination (Series combination 1: S0, S10, S20; Series combination 2: S0, S30, S50; T2WI); deep multitask learning with multiparametric image fusion; joint prediction (luminal A, histological grade, Ki-67 expression status).]

C. Histological data collection

The histological data were obtained from the breast specimens on full-face sections by a pathologist with 15 years of experience. The tumor histological grade was assessed by the Nottingham grading system [39]. According to the St. Gallen Consensus Guideline, a Ki-67 expression level higher than 14% positive nuclei was considered high, while the other levels were considered low [40]. The statuses of the estrogen receptor (ER), progesterone receptor (PR), and HER2 were determined by streptavidin-peroxidase immunohistochemistry (IHC) analysis using the nuclear staining intensity [41]. ER- and PR-positive tumors were considered luminal subtypes; among these, HER2-negative tumors were designated the luminal A subtype and HER2-positive tumors the luminal B subtype. Luminal tumors with high Ki-67 expression were also designated luminal B. ER- and PR-negative, HER2-positive tumors were considered the HER2-positive subtype. ER-, PR- and HER2-negative tumors were considered the triple-negative subtype. For the histological grade, tumor grades 1 and 2 were categorized as low grade, while grade 3 was categorized as high grade.
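The labeling rules above can be written as a small decision function. This is a sketch under stated assumptions: the function names are hypothetical, and a tumor is treated as hormone receptor-positive when ER or PR is positive, which is an assumption made so that the rule covers every receptor combination (the text itself says "ER- and PR-positive").

```python
def ki67_status(percent_positive_nuclei: float) -> str:
    """Ki-67 is 'high' when more than 14% of nuclei are positive, otherwise 'low'."""
    return "high" if percent_positive_nuclei > 14.0 else "low"


def grade_group(nottingham_grade: int) -> str:
    """Grades 1 and 2 are grouped as low grade, grade 3 as high grade."""
    if nottingham_grade not in (1, 2, 3):
        raise ValueError("Nottingham grade must be 1, 2 or 3")
    return "high" if nottingham_grade == 3 else "low"


def molecular_subtype(er: bool, pr: bool, her2: bool, ki67_percent: float) -> str:
    """Assign a subtype from receptor status and Ki-67 level."""
    hormone_receptor_positive = er or pr  # assumption, see the lead-in
    if hormone_receptor_positive:
        # Luminal tumors: HER2-positive or high Ki-67 gives luminal B.
        if her2 or ki67_status(ki67_percent) == "high":
            return "luminal B"
        return "luminal A"
    return "HER2-positive" if her2 else "triple-negative"


print(molecular_subtype(er=True, pr=True, her2=False, ki67_percent=10.0))
```

Only the luminal A indicator (positive versus negative), the Ki-67 status and the grade group are used as binary prediction targets in the study.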

III. METHODS

A. Overview

Fig. 1 depicts the overall framework of this study. DCE-MR image series (one precontrast and five postcontrast images) along with the T2WI series were used as input series to constitute a multiparametric image series combination. Breast areas were obtained by excluding the skin and chest wall, which were irrelevant to the prediction tasks, from the image to avoid large-scale convolution computations [42]. Thereafter, unilateral breast areas that contained a tumor were used as network inputs. The precontrast image, the images obtained by subtracting the precontrast image from the postcontrast images and the T2WI series were combined to form the three input channels. Image series combination-based CNNs were implemented and further fused to enhance the prediction performance. Multitask learning with a multiparametric image series fusion framework was developed to jointly predict multiple histological indicators in breast cancer. Class activation maps (Grad-CAM plus) [43, 44] were utilized to visualize how the single-task and multitask learning models responded to different tasks.

B. Multitask learning framework and transfer learning strategy

1) Convolutional neural network structure

An overview of the network structure is illustrated in Fig. 2. Note that this framework demonstrates deep multitask learning with two branches of image series as inputs but could be easily extended by including additional parametric images, i.e., T2WI, as additional inputs of branches. The 16-layer Visual Geometry Group (VGG16) architecture was used [45], which includes 5 convolutional blocks consisting of 13 convolutional layers (termed C1 to C13) connected with max-pooling layers. For all the convolutional layers, a rectified linear unit (ReLU) was used as the activation function.

The final convolutional layer (size 7×7×512) was connected by global average pooling to aggregate the feature maps to a feature vector of size 512. Subsequently, high-level features were formed by three fully connected (FC) layers, in which the transfer learning strategy, multitask feature learning and the information fusion strategy were performed (detailed descriptions are illustrated in the following sections).
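The 7×7×512 size of the final feature map follows from the five max-pooling stages. The sketch below traces the spatial size through the blocks, assuming the standard VGG16 configuration (3×3 convolutions that preserve spatial size, 2×2 max pooling with stride 2 and block widths of 64, 128, 256, 512 and 512); the function name is hypothetical.

```python
def vgg16_block_output_shapes(input_size: int):
    """Return the (height, width, channels) shape after each of the 5 VGG16 blocks."""
    channels_per_block = [64, 128, 256, 512, 512]  # standard VGG16 widths (assumed)
    size = input_size
    shapes = []
    for channels in channels_per_block:
        # The convolutions keep the spatial size; each max-pooling layer halves it
        # (rounding down for odd sizes).
        size = size // 2
        shapes.append((size, size, channels))
    return shapes


shapes = vgg16_block_output_shapes(244)
print(shapes)  # spatial sizes 122, 61, 30, 15, 7
feature_vector_length = shapes[-1][2]  # global average pooling keeps one value per channel
```

With the 244×244 input stated in the text, the last block yields a 7×7×512 map, and global average pooling reduces it to the 512-element feature vector.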

Fig. 3. Overview of the multiparametric image feature fusion strategy. Decision-level fusion refers to averaging the final prediction probabilities from different networks; feature-level fusion refers to concatenating the high-level features (FC1) from the multiple CNNs with different image series as inputs (termed feature concatenation); and feature union refers to compressing high-level features (FC1) from different networks using a 1×1 convolution kernel to improve the expressive power of the network.

[Fig. 3 panel labels: pooling output (512); common-task layer (60) and task-specific layers for Task 1, Task 2 and Task 3 (180 each), giving 240 features per task in FC1; FC2 (32 per task); decision-level fusion; feature concatenation (240×2, 480); feature union (240×2 compressed to 240 with Conv2D).]

The final FC layer of the original VGG16 network with 1000-class outputs was modified into two FC layers denoted by FC1 and FC2. A softmax activation function was used to compute a probability map as the final output of the network. Let $Y_1, Y_2, \cdots, Y_J$ denote the labels of the $J$ output tasks. The estimated label $\hat{Y}_i$ for task $i$ was formulated by the following function:

$$\hat{Y}_i = \sigma(w_i x_i + b_i) \tag{1}$$

where $x_i$, $w_i$, $b_i$, and $\sigma$ denote the final feature map, the weights, the bias and the activation function, respectively, related to task $i$.

2) Multitask learning network structure for joint prediction of histological characteristics

To facilitate learning of the common/specific features for multiple clinical indicators [46], the FC layer was divided into a common-task layer and task-specific layers. The high-level features of the common-task layer and each task-specific layer were concatenated to form the FC1 layer. The concatenated weights for each task $i$ were denoted as $w_i^{c}$ and illustrated as follows:

$$w_i^{c} = [w_i, w_0] \tag{2}$$

where $w_0$ denotes the weights of the common-task layer. Let $n$ denote the total size of the FC layer (here, 600) and let $n_0$ denote the size of the common-task layer. Consequently, the size of each task-specific layer in FC1 is $(n-n_0)/J$, and the size after concatenation is $n_0+(n-n_0)/J$. Fig. 3 illustrates an example of the structure of FC1 layers with $n_0$ set to 60, the size of each task-specific layer set to 180, the size of FC1 for each task set to 240 and the number of tasks $J$ set to 3.
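The layer sizes in this example follow directly from the two formulas. A minimal sketch, with a hypothetical helper name, that reproduces the Fig. 3 configuration and lists the sizes for other common-task proportions:

```python
def fc1_sizes(n: int, n0: int, num_tasks: int):
    """Return (task-specific size, per-task FC1 size) for a total size n and common size n0."""
    if not 0 <= n0 <= n:
        raise ValueError("the common-task size must lie between 0 and n")
    task_specific, remainder = divmod(n - n0, num_tasks)
    if remainder:
        raise ValueError("n - n0 must be divisible by the number of tasks")
    return task_specific, n0 + task_specific


# Fig. 3 example: n = 600, n0 = 60, J = 3.
task_specific_size, per_task_size = fc1_sizes(600, 60, 3)
print(task_specific_size, per_task_size)  # 180 240

# Sizes for common-task proportions from 0% to 50% of n = 600 in steps of 10%.
sweep = [(pct, 600 * pct // 100, *fc1_sizes(600, 600 * pct // 100, 3))
         for pct in range(0, 60, 10)]
```

Each entry of `sweep` holds the proportion, the common-task size, the task-specific size and the per-task FC1 size.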

For each task $i$, the last binary classification layer is derived based on the learned features from the previous layers, and the loss function is defined using the cross-entropy loss:

$$\mathcal{L}_i = -\left(Y_i \log \hat{Y}_i + (1 - Y_i) \log (1 - \hat{Y}_i)\right) \tag{3}$$

By combining the losses of the individual tasks in Equation (3), multitask learning was performed with the following equation:

$$\mathcal{L}_{total} = \sum_{i=1}^{J} a_i \mathcal{L}_i, \quad \sum_{i=1}^{J} a_i = 1, \quad 0 \le a_i \le 1 \tag{4}$$

where parameter $a_i$ controls the weight of each task with the sum equaling 1. Let $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ denote the loss functions for the luminal A subtype, tumor grade and Ki-67 expression level, respectively.
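Equations (3) and (4) can be illustrated for a single sample as follows. This is a sketch rather than the training code of the study: the function names are hypothetical, the clipping constant `eps` is an added numerical safeguard, and the example weights are arbitrary.

```python
import math


def binary_cross_entropy(y_true: int, y_pred: float, eps: float = 1e-12) -> float:
    """Cross-entropy loss of Equation (3) for one task and one sample."""
    p = min(max(y_pred, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def multitask_loss(y_true, y_pred, weights) -> float:
    """Weighted sum of per-task losses as in Equation (4)."""
    if not (len(y_true) == len(y_pred) == len(weights)):
        raise ValueError("one label, prediction and weight per task are required")
    if any(a < 0.0 or a > 1.0 for a in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("task weights must lie in [0, 1] and sum to 1")
    return sum(a * binary_cross_entropy(y, p)
               for a, y, p in zip(weights, y_true, y_pred))


# Task order follows the text: luminal A, tumor grade, Ki-67 (arbitrary example values).
total = multitask_loss(y_true=[1, 0, 1], y_pred=[0.5, 0.5, 0.5], weights=[0.5, 0.25, 0.25])
print(total)
```

Because the weights sum to 1, the total loss stays on the same scale as a single-task loss, so changing the weights shifts emphasis between tasks without rescaling the objective.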

3) Transfer learning strategy

Transfer learning was performed using a network pretrained on the ImageNet dataset [47]. Since the learned features near the input are generic and the features near the output are task specific, the convolutional layers and the dense layers were fine-tuned using the labels of the target tasks with varying numbers of frozen layers. Specifically, the convolutional layers from C1 to Cx were frozen during training, where x denotes the index of the last frozen layer and ranges from 0 to 13. When x equals 0, none of the convolutional layers are frozen, while when x equals 13, all the convolutional layers are frozen. The hyperparameter of the number of frozen layers (i.e., x) was optimized in the training set.
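The freezing scheme can be sketched as a mapping from x to the names of the frozen layers. The helper names are hypothetical, and the per-block layer counts follow the C1-C2, C3-C4, C5-C7, C8-C10 and C11-C13 grouping of the VGG16 blocks.

```python
VGG16_LAYERS_PER_BLOCK = [2, 2, 3, 3, 3]  # 13 convolutional layers, C1 to C13


def frozen_layer_names(x: int):
    """Names of the convolutional layers kept fixed when the first x layers are frozen."""
    if not 0 <= x <= sum(VGG16_LAYERS_PER_BLOCK):
        raise ValueError("x must lie between 0 and 13")
    return [f"C{i}" for i in range(1, x + 1)]


def block_of(layer_index: int) -> int:
    """1-based index of the convolutional block that contains layer C<layer_index>."""
    upper = 0
    for block, count in enumerate(VGG16_LAYERS_PER_BLOCK, start=1):
        upper += count
        if 1 <= layer_index <= upper:
            return block
    raise ValueError("layer index must lie between 1 and 13")


print(frozen_layer_names(5), block_of(5))
```

For example, x=5 freezes C1 to C5, that is, the first two blocks and the first layer of the third block, while the remaining convolutional layers and the FC layers are fine-tuned.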

4) Multiple image series fusion in a deep multitask learning network

To adapt the ImageNet-pretrained CNN for the proposed network, three input channels were implemented using multiparametric MRI. To be compatible with ImageNet, the gray mode (one channel) of a single image (i.e., each DCE-MR time-series image and T2WI) was mapped to three channels, where the red–green–blue (RGB) value of each channel is equal to the gray value of the single image [48]. To learn the kinetic patterns of DCE-MRI, image series were used as input channels through pixel-by-pixel computations of time-series data. Let SC1 denote the image series combination comprising S0, S10 and S20 and SC2 denote the image series combination comprising S0, S30, and S50, each of which constitutes the input channels of the network. SC1 represents early dynamic enhancement, while SC2 reflects late dynamic enhancement. Note that S0 was also used in SC2 to model the late enhancement pattern beginning with the precontrast image. Instead of S40, we included S0, S30 and S50 as part of series combination 2 (SC2) to represent the precontrast, intermediate and last enhancement image patterns, respectively.
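The two ways of filling the three input channels can be sketched as follows. Images are held as nested lists for illustration, the helper names are hypothetical, and the 1×2 intensities are made-up values.

```python
SERIES_COMBINATIONS = {
    "SC1": ("S0", "S10", "S20"),  # early dynamic enhancement
    "SC2": ("S0", "S30", "S50"),  # late dynamic enhancement
}


def gray_to_three_channels(image):
    """Replicate a single gray image into three identical channels (H x W x 3)."""
    return [[[value, value, value] for value in row] for row in image]


def stack_channels(first, second, third):
    """Stack three equally sized gray images pixel by pixel into an H x W x 3 input."""
    return [[[a, b, c] for a, b, c in zip(row1, row2, row3)]
            for row1, row2, row3 in zip(first, second, third)]


def build_input(series, combination):
    """Build the network input for a named series combination."""
    names = SERIES_COMBINATIONS[combination]
    return stack_channels(*(series[name] for name in names))


# Toy 1x2 images with made-up intensities.
toy_series = {"S0": [[1.0, 2.0]], "S10": [[0.5, 0.1]], "S20": [[0.7, 0.2]],
              "S30": [[0.9, 0.3]], "S50": [[0.8, 0.4]]}
sc1_input = build_input(toy_series, "SC1")
single_series_input = gray_to_three_channels(toy_series["S0"])
print(sc1_input[0][0], single_series_input[0][0])
```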

Afterward, multiple CNNs with different image series combinations as inputs were implemented and fused to enhance the prediction performance. The multitask CNN using SC1 was fused with that using SC2. Furthermore, the multitask CNNs using SC1 and SC2 were fused with those using T2WI, a different single-parametric image from DCE-MRI. For the multitask networks with different image series combinations as inputs, the weights of the convolutional layers were shared.

Three different approaches were used to fuse the high-level features from different multitask networks (Fig. 3). First, decision-level fusion was implemented by averaging the final prediction probabilities from different networks. Second, feature-level fusion was performed by concatenating (termed feature concatenation) the high-level features (FC1) from the multiple CNNs with different image series as inputs. Third, the high-level features (FC1) from different networks were compressed (termed feature union) using a 1×1 convolution kernel to improve the expressive power of the network.
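The three fusion approaches can be contrasted on two per-task FC1 feature vectors of length 240. This is a sketch under assumptions: the function names are hypothetical, and feature union is modeled as a 1×1 convolution across the two stacked branches, that is, a weighted sum of the two branch values at each feature position. In a trained network the kernel weights would be learned; the fixed values here are placeholders.

```python
def decision_level_fusion(probabilities):
    """Average the final prediction probabilities of several networks."""
    return sum(probabilities) / len(probabilities)


def feature_concatenation(features_a, features_b):
    """Concatenate the FC1 features of two networks (240 + 240 = 480 values)."""
    return list(features_a) + list(features_b)


def feature_union(features_a, features_b, kernel=(0.5, 0.5), bias=0.0):
    """Compress two stacked FC1 vectors (240x2) back to 240 values with a 1x1 kernel."""
    if len(features_a) != len(features_b):
        raise ValueError("both branches must provide the same number of features")
    weight_a, weight_b = kernel
    return [weight_a * a + weight_b * b + bias for a, b in zip(features_a, features_b)]


fc1_sc1 = [0.1] * 240  # placeholder features from the SC1 network
fc1_sc2 = [0.3] * 240  # placeholder features from the SC2 network

fused_probability = decision_level_fusion([0.8, 0.6])
concatenated = feature_concatenation(fc1_sc1, fc1_sc2)
united = feature_union(fc1_sc1, fc1_sc2)
print(len(concatenated), len(united))  # 480 240
```

Concatenation doubles the input size of the following layer, whereas feature union keeps it at 240 and lets the network weigh the branches against each other.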

C. Network training

The unilateral breast area that contained a tumor was used as input in the network. Data augmentation was performed by adding training images generated by image flipping, scaling and rotation. Deep learning was performed using the Keras package (version 2.2.5) with the TensorFlow library. The network was trained using the RMSprop optimizer, a stochastic gradient descent-based method, with a batch size of 64. The learning rate was set to 3e-7. Dropout [49] was applied to the FC1 and FC2 layers at rates of 0.5 and 0.2, respectively.

D. Performance evaluation

To evaluate the effectiveness of the common-task layer, its proportion of the total FC layer (Fig. 3, n=600) was varied from 0 to 50%, with a step size of 10%. That is, the size of the common-task layer (n0) in FC1 ranged from 0 to 300. The size of the FC2 layers was set to 32 (Fig. 3). The transfer learning strategy and the effectiveness of the common-task layer were investigated by using the DCE-MRI series and T2WI.

The hyperparameters of the multitask CNN framework for the joint prediction of clinical variables were optimized with 5-fold cross-validation (CV). More specifically, the weight of each task, number of frozen convolutional layers, and proportions of the common-task layers among the FC layer were optimized in the training set. Afterward, a multitask prediction model was trained using all samples in the training set with the tuned hyperparameters and was then applied to the testing set. The area under the receiver operating characteristic (ROC) curve (AUC) was used as a summary performance measure. For the multitask model, the prediction probability and the label for each task were obtained, by which the ROC curve was drawn. The sensitivity and specificity were calculated using the Youden index at the operating threshold that maximized their sum.
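The evaluation measures can be sketched in plain Python. This is an illustration, not the evaluation code of the study: the function names are hypothetical, the AUC is computed as the fraction of positive-negative pairs ranked correctly (ties count one half), and the four labels and scores are made-up values.

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive sample is scored above a negative one."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def youden_operating_point(labels, scores):
    """Return (threshold, sensitivity, specificity) maximizing sensitivity + specificity."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        true_neg = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
        sensitivity, specificity = true_pos / n_pos, true_neg / n_neg
        if best is None or sensitivity + specificity > best[1] + best[2]:
            best = (threshold, sensitivity, specificity)
    return best


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
threshold, sensitivity, specificity = youden_operating_point(labels, scores)
balanced_accuracy = (sensitivity + specificity) / 2
print(auc, threshold, sensitivity, specificity, balanced_accuracy)
```

In the multitask setting, the same computation is applied to each task separately, using that task's predicted probabilities and labels.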

IV. RESULTS

A. Single-task learning using a single image series

The predictive power based on single image series from DCE-MRI and T2WI was initially evaluated separately with single-task learning. The results showed that image series in the precontrast or first several postcontrast series (i.e., S0 and S10) had relatively higher performance than the other series in predicting the Ki-67 expression level and luminal A subtypes (Table II). The intermediate image series (S20, S30 and S40) achieved better performance on the tumor grade prediction task than the other DCE-MRI series. The T2WI-based network showed the highest performance (AUC=0.762, sensitivity=0.674, specificity=0.742) in the prediction of histological grade among the image series. These results showed that different image series (e.g., early or late time series) have distinct predictive power in the classification of histological characteristics.

B. Single-task learning using multiple image series

Multiple image series were used as the three input channels in single-task learning. The prediction model was implemented based on the image series combinations (SC1 and SC2). Note that the S0 image was incorporated in both SC1 and SC2, which were used to model the dynamic pattern in the DCE-MRI image series, beginning with the precontrast image. The deep learning model with the multiple image series achieved higher AUCs than that using a single DCE-MRI image series (Table III). The combination of S0 and early time-series images (i.e., S10 and S20) had a slightly better prediction performance than the combination of S0 and the late enhancement images (S30 and S50) for predicting Ki-67, which is consistent with the results (Table II) from the single image series, indicating that the early time-series images are more discriminative than the late time-series images in this specific task.

Fig. 4. Model performance in terms of the AUC with varying numbers of frozen layers in the prediction of the Ki-67 expression level, luminal A subtype, and histological grade.

[Fig. 4 legend: Luminal A, Ki-67, Histological grade, Average. Axes: AUC versus number of frozen layers.]

TABLE II
SINGLE-TASK LEARNING FOR A SINGLE IMAGE SERIES

Task                 Series   AUC     95% CI        Balanced-ACC   Sensitivity   Specificity
Ki-67                S0       0.752   0.623-0.856   0.747          0.705         0.790
                     S10      0.733   0.588-0.853   0.729          0.721         0.737
                     S20      0.718   0.586-0.835   0.716          0.853         0.579
                     S30      0.693   0.547-0.820   0.678          0.672         0.684
                     S40      0.689   0.541-0.820   0.664          0.590         0.737
                     S50      0.668   0.523-0.799   0.657          0.525         0.790
                     T2       0.599   0.467-0.729   0.655          0.362         0.947
Luminal A            S0       0.722   0.573-0.851   0.721          0.733         0.708
                     S10      0.623   0.467-0.766   0.626          0.667         0.585
                     S20      0.633   0.472-0.778   0.644          0.533         0.754
                     S30      0.630   0.484-0.780   0.633          0.667         0.600
                     S40      0.663   0.520-0.793   0.703          0.867         0.539
                     S50      0.686   0.550-0.819   0.674          0.533         0.815
                     T2       0.702   0.599-0.833   0.679          0.674         0.742
Histological grade   S0       0.596   0.483-0.716   0.583          0.681         0.485
                     S10      0.638   0.505-0.752   0.641          0.766         0.515
                     S20      0.695   0.572-0.807   0.669          0.702         0.636
                     S30      0.695   0.568-0.810   0.671          0.979         0.364
                     S40      0.685   0.566-0.809   0.704          0.681         0.727
                     S50      0.679   0.547-0.789   0.655          0.553         0.758
                     T2       0.762   0.652-0.864   0.708          0.674         0.742

ACC: accuracy; CI: confidence interval; AUC: area under the receiver operating characteristic curve.


Additionally, image series combinations of T2WI and the DCE-MRI series were also investigated. Specifically, four image series combinations were used: 1) T2WI, S0 and S10 (termed TSC1); 2) T2WI, S0 and S20 (termed TSC2); 3) T2WI, S0 and S30 (termed TSC3); and 4) T2WI, S0 and S50 (termed TSC4). For Ki-67 and luminal A prediction, the performance decreased after T2WI was introduced as an input together with the DCE-MRI series. For tumor grade prediction, in contrast, two of the combinations (TSC2 and TSC3, with AUCs of 0.734 and 0.752) performed better than either SC1 or SC2 (AUCs of 0.715 and 0.712). This is similar to the result generated by the single-parametric image-based predictive model, which showed that T2WI had better performance than the other series for tumor grade prediction (Table II). Overall, however, the combinations of the T2WI and DCE-MRI series showed lower performance than the DCE-MRI series alone for most tasks (Table III). This may be because the combination of T2WI and DCE-MRI as input channels may not model the kinetic patterns well.

TABLE III
SINGLE-TASK LEARNING FOR MULTIPLE IMAGE SERIES

Task        Series   AUC     95% CI        Sensitivity   Specificity
Ki-67       SC1      0.793   0.689-0.887   0.557         1.000
            SC2      0.772   0.644-0.884   0.869         0.632
            TSC1     0.516   0.360-0.676   0.845         0.316
            TSC2     0.484   0.321-0.650   0.879         0.211
            TSC3     0.524   0.385-0.666   0.638         0.526
            TSC4     0.473   0.317-0.625   0.845         0.263
Luminal A   SC1      0.730   0.604-0.843   1.000         0.415
            SC2      0.729   0.611-0.838   1.000         0.477
            TSC1     0.447   0.273-0.631   0.133         0.984
            TSC2     0.498   0.316-0.684   0.267         0.903
            TSC3     0.547   0.355-0.731   0.333         0.887
            TSC4     0.518   0.343-0.706   0.533         0.613
Grade       SC1      0.715   0.596-0.825   0.787         0.606
            SC2      0.712   0.598-0.821   0.575         0.818
            TSC1     0.698   0.575-0.811   0.587         0.871
            TSC2     0.734   0.618-0.837   0.804         0.613
            TSC3     0.752   0.639-0.849   0.522         0.935
            TSC4     0.715   0.598-0.828   0.587         0.871

AUC: area under the receiver operating characteristic curve; CI: confidence interval. SC1, SC2, TSC1, TSC2, TSC3 and TSC4 denote image series combinations comprising S0, S10 and S20; S0, S30 and S50; T2WI, S0 and S10; T2WI, S0 and S20; T2WI, S0 and S30; and T2WI, S0 and S50, respectively.

C. Multitask learning with the transfer learning strategy

A multitask learning framework was implemented with knowledge from parameters pretrained on ImageNet to jointly predict histological information in breast cancer. The experiment using the series combination of S0, S10 and S20 (SC1) suggests that the number of frozen layers greatly affected the model performance (Fig. 4). As the figure shows, the benefit of transfer learning was greatest when the first five convolutional layers (C1 to C5) were frozen, which yielded the highest average AUC for the three tasks. The results suggest that transfer learning by freezing the first several layers near the input, which are generic, and fine-tuning the other layers near the output could enhance the prediction performance. When none of the convolutional layers were frozen, i.e., when the network was trained from the beginning, moderate classification results, with AUCs of 0.703, 0.654 and 0.677, were achieved for the Ki-67 expression level, tumor grade and luminal A subtype prediction tasks, respectively. On the other hand, when all the convolutional layers were frozen, meaning that only the FC layers were fine-tuned, the model showed the lowest performance, with an average AUC of approximately 0.5. This may be because the benefit of transfer learning diminishes when all the convolutional weights remain those learned from nonmedical images and no convolutional features are trained on medical images.
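The freezing scheme described above can be illustrated with a small bookkeeping sketch. It only records which convolutional layers would be frozen or trainable and does not build or train a network. The labels C1 to C13 follow the text; the helper names `conv_layer_names` and `freeze_plan` are hypothetical and are not taken from the study's implementation.

```python
# Illustrative sketch only: helper names are hypothetical, not the authors' code.
# VGG16 has 13 convolutional layers arranged in five blocks of 2, 2, 3, 3 and 3.
VGG16_BLOCKS = (2, 2, 3, 3, 3)


def conv_layer_names():
    """Return the labels C1 ... C13 used in the text for the conv layers."""
    total = sum(VGG16_BLOCKS)
    return [f"C{i}" for i in range(1, total + 1)]


def freeze_plan(n_frozen):
    """Map each conv layer label to True (trainable) or False (frozen).

    The first `n_frozen` layers (nearest the input) keep their pretrained
    weights; the remaining layers are marked for fine-tuning.
    """
    names = conv_layer_names()
    if not 0 <= n_frozen <= len(names):
        raise ValueError("n_frozen must be between 0 and 13")
    return {name: index >= n_frozen for index, name in enumerate(names)}


# Setting reported as best for SC1: freeze C1 to C5, fine-tune C6 to C13.
plan = freeze_plan(5)
frozen = [name for name, trainable in plan.items() if not trainable]
print(frozen)  # ['C1', 'C2', 'C3', 'C4', 'C5']
```

In a deep learning library, the same plan would typically be applied by disabling gradient updates for the parameters of the frozen layers.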

Multitask learning with transfer learning using T2WI data with varying numbers of frozen layers was also performed, and the results showed that the benefit of transfer learning was optimal when the first two convolutional layers were frozen (from C1 to C2) with the highest average AUC of 0.697 for the three tasks (Supplementary Fig. 1). Transfer learning improves the T2WI-based prediction performance, especially for the tumor histological grade.

Fig. 6. Comparison of the performance of the multitask and single-task models. The ROC curves (sensitivity versus 1-specificity) using the deep multitask model for the a) Ki-67 expression level, b) luminal A subtype and c) tumor grade prediction tasks are shown. d) The AUC as a function of the weights for the Ki-67 expression level and the luminal A subtype prediction tasks. Panel legends: a) multitask model, AUC=0.821; single-task model, AUC=0.793. b) multitask model, AUC=0.784; single-task model, AUC=0.73. c) multitask model, AUC=0.711; single-task model, AUC=0.715.

Fig. 5. Model performance in terms of AUC values with varying proportions of common-task layers. Horizontal axis: common-task layer (%); vertical axis: AUC. Curves: Ki-67, luminal A, histological grade and average.


D. Multitask learning with additional common-task layers

A common-task layer was introduced into an FC layer (FC1) for multitask learning, in addition to the task-specific layer, to model the deep features common to all the tasks and those related to specific tasks. Experiments were performed using SC1, and the effectiveness of the common-task layer was assessed by evaluating the performance (in terms of the AUC) as its proportion of the total nodes (n=600) was varied, as shown in Fig. 5. More specifically, the proportion of the common-task layer was varied from 0 to 50% in increments of 5%.

When all of the FC layers were task specific (i.e., the proportion of the common-task layers was 0), the prediction performances of the three tasks (average AUC of 0.692) were similar, with AUCs of 0.699, 0.672, and 0.705 for Ki-67 expression level, luminal A subtype and tumor grade prediction tasks, respectively. The average multitask performance was improved when the proportion of the common-task layer was increased from 0 to 20%. The maximum average AUC was achieved when 20% (120 nodes) of the FC layer was assigned to the common-task layer.

When the proportion of the common-task layer was more than 20%, the average multitask performance decreased. When 50% of the nodes were common-task nodes, the average multitask performance (average AUC=0.708) was comparable to that obtained when all the nodes were task specific (average AUC=0.692), but the performance varied between tasks (Ki-67 expression level, luminal A subtype and tumor grade tasks, with AUCs of 0.789, 0.677 and 0.660, respectively). The results suggest that the introduced common-task layer can enhance the model prediction performance, especially for the Ki-67 expression level.
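The node counts behind these proportions can be worked out with a short sketch. The total of 600 nodes and the 0 to 50% sweep in 5% steps come from the text. The equal division of the remaining nodes among the three tasks is an assumption made here for illustration, and the function name `split_fc_nodes` is hypothetical.

```python
# Illustrative sketch only. ASSUMPTION: the task-specific nodes are divided
# equally among the three tasks; the text does not state how they are allotted.
def split_fc_nodes(total_nodes, shared_percent, n_tasks=3):
    """Return (common-task nodes, task-specific nodes per task)."""
    if not 0 <= shared_percent <= 100:
        raise ValueError("shared_percent must be between 0 and 100")
    shared = total_nodes * shared_percent // 100
    per_task, leftover = divmod(total_nodes - shared, n_tasks)
    if leftover:
        raise ValueError("task-specific nodes do not divide evenly among tasks")
    return shared, per_task


# Sweep the proportions examined in the text: 0% to 50% in steps of 5%.
for percent in range(0, 55, 5):
    shared, per_task = split_fc_nodes(600, percent)
    print(f"{percent:2d}% common-task: {shared} shared, {per_task} per task")
```

Under this assumption, the 20% setting corresponds to 120 common-task nodes and 160 task-specific nodes for each of the three tasks.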

The multitask learning experiment using T2WI data with the proportion of common-task layers varying from 0 to 50% at increments of 5% was also performed, and the prediction performance was optimal when the proportion of the common-task layer was 20% (120 nodes) with an average AUC of 0.701 (AUCs of 0.662, 0.724 and 0.717 for Ki-67 expression level, luminal A subtype and tumor grade prediction tasks, respectively) (Supplementary Fig. 2). The results suggest that multitask learning with common-task layers improves the model prediction compared with single-task learning, especially for Ki-67.

E. Multitask learning with varying weights for each task

The effectiveness of multitask learning was evaluated by using varying weights for the three tasks (Fig. 6). For SC1, the optimal weights (a1=0.25, a2=0.25 and a3=0.5 for the Ki-67 expression level, luminal A subtype and tumor grade tasks, respectively) were selected by a grid search. The results suggest that if more weight was applied to the lower-performance task (i.e., a3=0.5 for the tumor grade prediction task) in multitask learning, the overall prediction performance (average AUC) was improved.

TABLE IV

INFORMATION FUSION OF MULTIPLE IMAGE SERIES IN A DEEP MULTITASK LEARNING NETWORK

Task       Fusion                 AUC    95% CI       Balanced accuracy  Sensitivity  Specificity  F1-score
Ki-67      Decision average       0.821  0.717-0.911  0.761              0.574        0.947        0.722
Ki-67      Feature concatenation  0.802  0.695-0.902  0.749              0.656        0.842        0.769
Ki-67      Feature union          0.811  0.690-0.916  0.770              0.803        0.737        0.852
Luminal A  Feature union          0.811  0.704-0.901  0.821              0.933        0.708        0.583
Grade      Feature union          0.711  0.589-0.815  0.713              0.638        0.788        0.714

ACC: accuracy; AUC: area under the receiver operating characteristic curve; and CI: confidence interval. Information was fused from SC1 (S0, S10, and S20) and SC2 (S0, S30, and S50).

Fig. 7. Visualization of deep learning model attention. (a) A breast cancer with Ki-67 positivity, luminal A positivity and high grade. (b) A breast cancer with Ki-67 positivity, luminal A negativity and high grade. (c) A breast cancer with Ki-67 negativity, luminal A positivity and high grade. (d) A breast cancer with Ki-67 positivity, luminal A negativity and high grade. The subtraction image between the third postcontrast image and the precontrast image (S30) and the response maps for the grade, Ki-67, luminal A and multitask learning networks are shown in columns one to five.

For T2WI, the experimental results obtained using varying weights for the three tasks were assessed, and the optimal weights (a1=0.25, a2=0.25 and a3=0.5 for the three tasks) were determined and were similar to those of the DCE-MRI series-based prediction.
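A grid search over task weights of this kind can be sketched as follows. The sketch assumes, for illustration only, that the multitask objective is a weighted sum of the three task losses, that the weights are positive and sum to 1, and that the grid step is 0.25. None of these details is specified in this section, and the function names are hypothetical.

```python
from itertools import product


# Illustrative sketch only. ASSUMPTIONS: weights are positive multiples of
# 1/steps that sum to 1, and the objective is a weighted sum of task losses.
def weight_grid(steps=4, n_tasks=3):
    """List all weight tuples (a1, a2, a3) on the assumed grid."""
    grid = []
    for counts in product(range(1, steps), repeat=n_tasks):
        if sum(counts) == steps:
            grid.append(tuple(count / steps for count in counts))
    return grid


def weighted_loss(task_losses, weights):
    """Combine per-task losses into one multitask loss (assumed form)."""
    return sum(weight * loss for weight, loss in zip(weights, task_losses))


grid = weight_grid()
for weights in grid:
    print(weights)
```

With a step of 0.25, the grid contains the reported setting (0.25, 0.25, 0.5) for the Ki-67, luminal A and tumor grade tasks. In a full search, each candidate would be used to train a model, and the candidate with the highest average validation AUC would be kept.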

F. Visualization of deep single- and multitask learning

To evaluate and compare how the model responded to different tasks in single-task and multitask learning, we performed a case study for an SC2-based predictive model. Class activation maps (gradient-weighted class activation mapping (Grad-CAM plus)) [43, 44] were utilized to visualize the attention over the input. Examples of the model response patterns are demonstrated in the Grad-CAM visualization map for multitask and single-task learning (Fig. 7). Different response maps are observed for different single tasks, while the response map for the multitask learning network is shared among different tasks. Fig. 7 demonstrates a higher response around the tumor area for multitask learning than for the single-task CNNs. Single-task learning had high attention in the tumor and the other regions for the tumor grade prediction task (Fig. 7a, b, c) but low attention in the tumor for the Ki-67 and luminal A tasks. On the other hand, the response map of single-task learning for tumor grade prediction (Fig. 7d) demonstrates high attention in an area that is irrelevant to the tumor, and this model mistakenly predicted the sample as having a low histological grade. These response maps help interpret the attention of the deep learning model. Multitask learning targeting multiple related clinical indicators has a common response map that is near tumor areas for different tasks, which may explain its advantage over single-task learning.

G. Multitask learning by multiple image series fusion

Multitask learning with fine-tuned network parameters (i.e., weight parameters and common-task layer proportion) was performed by multiple image series fusion (Table IV). This network outperformed the single-task learning framework for the three tasks using SC1 and SC2. Specifically, the classification of the Ki-67 expression level with the multitask learning framework using feature union (Table IV) was better than that using single-task learning for SC1 and SC2 (Table III). Moreover, the prediction results for Ki-67, luminal A and grade obtained by using SC1 and SC2 (Table III) increased when the feature union method was used. This result indicates that learned high-level features consisting of features that are common and specific to targets are superior to features learned from single-task learning. Multitask learning showed similar AUCs to single-task learning for histological grade prediction (AUCs = 0.715 and 0.711 for SC1 and SC2, respectively), which had the lowest performance among the three tasks.

The feature union method showed a better overall average performance (average AUC=0.778) than decision fusion (average AUC=0.762) or feature concatenation (average AUC=0.767) for the three targets (Ki-67 expression level, luminal A subtype and tumor grade prediction). These results suggest that a feature-to-feature combination-based fusion method is superior to both the decision fusion and feature concatenation methods.

When SC1-, SC2- and T2WI-based multitask CNNs were fused, the predictive model by decision-level fusion showed better overall performance for the three tasks (average AUCs of 0.788, 0.768 and 0.782 for decision-level fusion, feature concatenation and feature union, respectively), as shown in Table V. The results suggest that model fusion of DCE-MRI and T2WI series is a promising strategy to enhance the prediction of histological characteristics in breast cancer.
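The averages quoted for feature union can be checked directly from the per-task values in Tables IV and V. The sketch below reproduces the average AUCs and also compares the reported balanced accuracy with the standard definition, the mean of sensitivity and specificity. Only the feature union rows are used, because they are the rows listed for all three tasks; the variable and function names are hypothetical.

```python
from statistics import mean

# Feature union rows: (AUC, sensitivity, specificity, reported balanced accuracy).
TABLE_IV = {
    "Ki-67": (0.811, 0.803, 0.737, 0.770),
    "Luminal A": (0.811, 0.933, 0.708, 0.821),
    "Grade": (0.711, 0.638, 0.788, 0.713),
}
TABLE_V = {
    "Ki-67": (0.826, 0.690, 0.895, 0.792),
    "Luminal A": (0.782, 0.867, 0.710, 0.788),
    "Grade": (0.738, 0.870, 0.516, 0.693),
}


def average_auc(table):
    """Average AUC over the three tasks, rounded to three decimals."""
    return round(mean(row[0] for row in table.values()), 3)


def max_balanced_accuracy_gap(table):
    """Largest gap between (sensitivity + specificity) / 2 and the table value."""
    return max(
        abs((sens + spec) / 2 - reported)
        for _, sens, spec, reported in table.values()
    )


print(average_auc(TABLE_IV))  # 0.778, the feature union average for SC1 and SC2
print(average_auc(TABLE_V))   # 0.782, the feature union average with T2WI added
print(max_balanced_accuracy_gap(TABLE_IV), max_balanced_accuracy_gap(TABLE_V))
```

Both averages agree with the values stated in the text, and the tabulated balanced accuracies agree with the mean of sensitivity and specificity to within rounding of the published three-decimal figures.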

V. DISCUSSION

Clinical decisions for breast cancer are usually based on multiple biomarkers. In this study, a multitask learning framework was proposed to jointly predict multiple clinical indicators by using a multiparametric image series. The framework used transfer learning based on the first several convolutional layers from a pretrained network and was further fine-tuned with the target label. The multitask learning model using multiparametric images showed performance superior to that of the single-task learning method with a single image series. Information from the multiple networks with combined image series was fused at the decision level and feature level (feature concatenation and feature union). Experiments on multitask learning by fusing multiparametric image series demonstrate the effectiveness of the proposed method.

During transfer learning, some of the convolutional layers were frozen, and the remaining layers were trained with a target label. Various numbers of frozen layers were examined for the joint prediction of multiple tasks to evaluate their effect on model performance. A related breast cancer diagnosis study on digital breast tomosynthesis (DBT) [50] with transfer learning by freezing only the first convolutional layers of AlexNet achieved performance superior to that achieved by freezing most of the layers [51]. In our study, transfer learning by freezing the first several layers (i.e., C1 to C5 among the 13 layers) in VGG16 achieved the best overall prediction performance (Fig. 3). A deep CNN without pretrained parameters achieved better performance than fine-tuning only the FC layers (i.e., freezing all the convolutional layers). The results suggest that transfer learning from networks trained on natural images should be refined with medical images, keeping an appropriate number of pretrained network parameters.

TABLE V

INFORMATION FUSION OF T2WI AND DCE-MRI IN A DEEP MULTITASK LEARNING NETWORK

Task       Fusion         AUC    95% CI       Balanced accuracy  Sensitivity  Specificity  F1-score
Ki-67      Feature union  0.826  0.721-0.914  0.792              0.690        0.895        0.800
Luminal A  Feature union  0.782  0.667-0.879  0.788              0.867        0.710        0.565
Grade      Feature union  0.738  0.618-0.841  0.693              0.870        0.516        0.792

ACC: accuracy; AUC: area under the receiver operating characteristic curve; and CI: confidence interval. Information was fused from SC1 (S0, S10, and S20), SC2 (S0, S30, and S50), and T2WI.

In our multitask learning framework, the FC layer was divided to represent common-task and task-specific features. The model with common-task neurons composing approximately 20% of the FC layer achieved the best performance. The introduction of common-task layers into the FC layer can help learn the common and complementary information for different tasks. These results demonstrate that multitask learning can enhance the prediction performance of the correlated tasks (e.g., luminal A and tumor grade are significantly correlated indicators). This finding can be explained by the fact that prediction with the associated task could reduce model overfitting and noisy information derived from the heterogeneous tumor by learning the related and complementary features.

The experiments showed that the weights of the prediction tasks affected the overall performance (Fig. 6). Specifically, applying more weight to the tumor grade task (i.e., 0.25, 0.25 and 0.5 for Ki-67, luminal A and tumor grade tasks, respectively) could enhance the overall prediction performance of the three tasks. This lower weight for the Ki-67 and luminal A tasks and higher weight for the tumor grade task were correlated with the original performance using single-task learning, which showed descending performance in the prediction of the three clinical indicators. The results suggest that multitask learning with more weight on the lower-performance task (tumor grade) could increase the overall performance of the three tasks.

Each image series was used separately in the single-task learning network. The S0 image showed better performance for Ki-67 expression level and luminal A subtype prediction than the other image series, and T2WI showed better performance than the other images for tumor grade prediction. This finding is partly in line with a previous study using radiomics analysis [11] showing that S0 images are predictive for the luminal A subtype and, hence, the correlated Ki-67 task. Multiparametric images were also combined as the network input, which was superior to using a single image. Additionally, the performance of tumor grade prediction was enhanced by combining the T2WI and DCE-MRI series as network inputs and by fusing the T2WI feature with the DCE-MRI data. The predictive power of the multiple image series-based single-task CNN is enhanced by multitask learning, especially in Ki-67 expression level prediction. Therefore, multiple image series-based multitask learning is a promising method to better predict multiple clinical indicators.

The deep feature/prediction from multiple series in the deep CNN was fused using three information fusion strategies. The results suggest the effectiveness of the methods over the single-series images of DCE-MRI. Experimental results demonstrate the superior performance of our multitask-learning-based model with a multiple image series fusion framework over single-task learning with a single series.

A related study developed a model that fused the deep features from CNNs and texture and shape features at the decision level for discrimination between malignant and benign lung nodules by chest computed tomography [52]. Feature concatenation may induce complementary combinations while replicating information from different sources of image features from the same tumor. Furthermore, the decision-level fusion method combines the predicted probabilities, which may omit the correlated but distinct features from different sources of images.

In contrast, feature-level fusion by union can take advantage of the feature vectors from the different sources of images in the same dimension, which may fuse features with correlated information. Notably, information fusion using multiparametric images from DCE-MRI and T2WI showed the best overall performance. This may be because T2WI had better performance in the prediction of tumor grade, while DCE-MRI showed superior power for the Ki-67 and luminal A tasks. Therefore, the predictive model can be enhanced by fusing the correlated and complementary information.
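The three fusion strategies compared above can be contrasted with a minimal sketch on plain Python lists. Decision averaging and concatenation follow their usual definitions. The exact operator behind feature union is not given in this section, so an element-wise maximum over equal-length feature vectors is used here purely as a stand-in that keeps the fused vector in the same dimension; the function names are hypothetical.

```python
# Illustrative sketch only; not the authors' implementation.
def decision_average(probabilities):
    """Decision-level fusion: average the predicted probabilities of the models."""
    return sum(probabilities) / len(probabilities)


def feature_concatenation(feature_vectors):
    """Feature-level fusion: join the vectors end to end (dimension grows)."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def feature_union(feature_vectors):
    """Feature-level fusion in the same dimension.

    ASSUMPTION: an element-wise maximum stands in for the union operator,
    whose exact definition is not stated in this section.
    """
    if len({len(vector) for vector in feature_vectors}) != 1:
        raise ValueError("all feature vectors must have the same length")
    return [max(values) for values in zip(*feature_vectors)]


sc1_features = [0.2, 0.9, 0.1]  # made-up values for illustration
sc2_features = [0.6, 0.3, 0.4]
print(decision_average([0.5, 1.0]))                        # 0.75
print(feature_concatenation([sc1_features, sc2_features]))  # six values
print(feature_union([sc1_features, sc2_features]))          # [0.6, 0.9, 0.4]
```

The sketch highlights the structural difference discussed in the text: concatenation doubles the feature dimension and may replicate information, whereas a same-dimension combination merges corresponding features before classification.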

It should be noted that validation of the proposed model with tuned hyperparameters on an additional dataset is of vital importance for model generalization and clinical applications. However, medical imaging data from different datasets/cohorts are usually acquired using varied imaging parameters (protocols). This may induce bias and inconsistent results for imaging-based models. Although proper imaging preprocessing and normalization could partly alleviate the bias in analysis, it is still challenging to apply artificial intelligence-based models to medical imaging datasets from different sources with different protocols. It will be valuable to conduct a study to further assess our design strategy when we can access a large dataset in the future.

VI. CONCLUSION

A deep multitask learning framework using multiparametric image series was developed for the joint prediction of three correlated clinical indicators: Ki-67 expression level, luminal A subtype and tumor grade. The transfer learning strategy was performed by using the pretrained network in the first several convolutional layers. Multitask prediction with more weight on a specific task could increase the overall performance. The proposed strategy learns features that are common to all three tasks as well as features specific to each task, and its effectiveness was confirmed by experiments. Decision- and feature-level information fusion methods were implemented to enhance the performance. The extensive experimental results of our proposed framework appear promising for the multiple clinical indicator-based management of breast cancer. This framework is extendable for incorporating other parametric images (e.g., DWI) in the prediction of multiple clinical indicators of breast cancer patients and is applicable in clinical practice.