Ensemble learning versus deep learning for Hypoxia detection in CTG signal


Riskyana Dewi Intan P ¹⁾, M. Anwar Ma’sum ¹⁾, Noverina Alfiany ¹⁾, Wisnu Jatmiko ¹⁾, Aria Kekalih ²⁾, Alhadi Bustamam ³⁾

¹⁾ Faculty of Computer Science, ²⁾ Faculty of Medicine, ³⁾ Faculty of Mathematics and Natural Sciences, Universitas Indonesia

Depok, Indonesia

[email protected], [email protected], [email protected], [email protected]

Abstract— Hypoxia is a condition in which the oxygen supply to fetal body tissues decreases, and it can lead to fetal mortality. Experts categorize the fetal condition into two levels, normal and hypoxia, based on CTG data analysis. Noise in the recorded data can lead to misinterpretation by the experts. In this work, ensemble learning and deep learning methods are implemented to detect hypoxia. The ensemble learning models are Bagging Tree, AdaBoost, and Voting Classifier, with Decision Tree, SVM, SGD, GLVQ, and Naive Bayes as base classifiers. The deep learning models are CNN and DenseNet. These methods are applied to a CTG dataset, specifically to the FHR signal. The classification uses the pH value as the labeling benchmark, dividing the dataset into two classes, normal and hypoxia. The best performance is obtained by the Bagging Tree method with the Naive Bayes classifier, with an F1-score of 0.76 for the normal class and 0.45 for the hypoxia class.

Keywords—fhr; hypoxia; ensemble learning; deep learning.

I. INTRODUCTION

Hypoxia is a condition in which the oxygen supply to fetal body tissues decreases. This condition rapidly disrupts the fetal brain and other organs, and delayed treatment can result in fetal death. Hypoxia can be detected by monitoring the fetal heart rate (FHR) and maternal uterine contractions (UC). Generally, FHR and UC are examined with an electronic technique called cardiotocography (CTG), which records the FHR rhythm and UC simultaneously. The examination starts in the third trimester of pregnancy, i.e. from the seventh month until the ninth month. An important step in examining CTG data is analyzing the baseline frequency of the FHR signal. If the baseline frequency of the FHR signal is less than 120 beats per minute or more than 160 beats per minute, the fetus is diagnosed with hypoxia [1].

CTG monitoring and recording are used to assess the fetal condition, which is associated with fetal activity, maternal health, placental state, amniotic fluid, fetal heart rate, and uterine contraction. CTG data can be recorded by two techniques, internal and external. Current FHR data collection ranges from simple external CTG to expensive and highly specialized fetal electrocardiography (fECG). fECG is conducted by attaching the transducer directly to the fetal scalp, so its use is invasive and involves a procedure through the cervical opening. The result yielded by the internal technique is clearer and more accurate than that of external CTG. However, recording CTG data with fECG is more expensive than with external CTG, because the electrodes must be replaced for each examination to avoid the risk of infection. The FHR and UC signals produced by external CTG are irregular and susceptible to noise. These problems are caused by disturbances in the recording process, such as transducer shifting.

Experts such as obstetricians or tocologists categorize the fetal condition into two levels, normal and hypoxia. The decision is based on CTG data analysis. A fetus with indications of hypoxia is treated by immediate delivery, either by cesarean section or by normal delivery. Poor data, such as signals contaminated by noise, can lead to misinterpretation by the experts. Such misinterpretation leads to wrong decisions about the labor process and affects the fetal condition.

This research performs hypoxia detection using an external CTG dataset and presents a comparison between ensemble learning and deep learning. Pre-processing handles the missing beats, and its results are then used in the feature extraction and classification stages.

II. RELATED WORKS

Several recent studies have addressed CTG data because of its characteristics, i.e. it is non-stationary, complex, and contaminated by noise. Spilka et al. [2] used energy spectral and multifractal analysis features for classification with a sparse Support Vector Machine (SVM).

Lhotska et al. [3] classified features produced by the Fast Fourier Transform (FFT) and the Continuous Wavelet Transform (CWT) using Convolutional Neural Networks (CNN). In [4], signal interpolation with an Unscented Kalman Filter (UKF) was used to overcome missing beats, followed by recognition of abnormal patterns based on Minimum Description Length (MDL). Li et al. [5] compared several classification methods, including SVM, Multilayer Perceptron, and CNN, using statistical features and the d-window of the Short Time Fourier Transform (STFT). Feature extraction using Fisher Linear Discriminant (FLD) and classification using Random Forest (RF) were conducted in [5]. Tang et al. [7] compared the performance of deep learning methods, i.e. RNN and CNN. Tang proposed two methods: MKRNN, which processes and classifies FHR signals in real time using an RNN, and MKNet, which processes images generated from FHR signals using a CNN.

Several studies by Zafer were conducted on CTG datasets, including comparisons of features obtained from several window functions of the STFT, Empirical Mode Decomposition (EMD), the Discrete Wavelet Transform (DWT), and FLD. In addition, classification with several machine learning methods such as SVM, Random Forest, and CNN was conducted [6][7].

Later, Zafer used an image-signal approach that converts signals into four different spectrogram images. These signals, i.e. Very Low Frequency (VLF), Low Frequency (LF), Middle Frequency (MF), and High Frequency (HF), were generated from the STFT. The STFT generated time delay coordinates to reconstruct the signal phase space trajectory using a recurrence plot, which was then used to train a CNN [8][9].

These previous studies obtained considerably high accuracy but yielded low recall and precision for the hypoxia class [3][4][5][7][8]. This shows that the models proposed in those studies were inadequate for detecting hypoxia. In this study, experiments are conducted with the aim of producing better recall and precision values so that hypoxia is detected appropriately.

III. METHODOLOGY

A. Dataset

The dataset was taken from an open dataset, the Intrapartum CTU-UHB Cardiotocography Database. It was collected between 2010 and 2012 and is provided by the Czech Technical University (CTU) in Prague and the University Hospital in Brno (UHB). The dataset contains 552 recordings from 552 pregnant women, each consisting of FHR and UC signals. Each recording starts no more than 90 minutes before actual delivery and is at most 90 minutes long. Each recording was sampled at 4 Hz. Figure 1 shows an example of an FHR signal obtained by external CTG, which has many missing beats and noise.

In addition to the FHR and UC signals, the dataset also contains maternal data, delivery data, fetal data, the analysis of umbilical artery blood samples (e.g. pH), and the Apgar score. Following previous research [6][7][8][9], the pH value of each recording was used as the labeling benchmark for training the classification of CTG data into hypoxia and normal classes. If the pH was less than or equal to 7.15, the signal was labeled as hypoxia; otherwise it was labeled as normal.

Under this labeling, the 552 CTG recordings were divided into 105 hypoxic recordings and 447 normal recordings. It should be noted that this study focuses on the FHR signal.
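The labeling rule can be illustrated with a minimal Python sketch. The function name `label_from_ph` and the example pH values are hypothetical and chosen only for illustration; the 7.15 threshold and the class counts come from the text.

```python
from collections import Counter

PH_THRESHOLD = 7.15  # threshold stated in the text


def label_from_ph(ph: float) -> int:
    """Return 1 (hypoxia) if pH <= 7.15, otherwise 0 (normal)."""
    return 1 if ph <= PH_THRESHOLD else 0


# Hypothetical pH values, for illustration only (not taken from the database).
example_ph = [7.30, 7.15, 7.05, 7.22, 6.98]
labels = [label_from_ph(ph) for ph in example_ph]
print(labels)
print(Counter(labels))

# Share of the hypoxia class, using the counts reported in the text.
hypoxia_share = 105 / (105 + 447)
print(round(hypoxia_share, 2))
```

With the reported counts, the hypoxia class makes up roughly 19% of the recordings, which is the class imbalance addressed in the third scenario.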

B. Pre-processing

Missing beats occur because the transducer shifts during data acquisition or because of maternal or fetal movement. Noise is caused by the equipment or by the mother's breathing. The pre-processing in this research focuses on overcoming the missing beats and removing the noise.

A missing beat in the FHR signal takes a value of zero. In this study, a missing segment longer than 15 seconds is deleted, while a missing segment shorter than 15 seconds is filled with values from Hermite cubic spline interpolation. In addition, outliers above 200 bpm or below 50 bpm are removed. Break-points, i.e. beat values that drop by more than 25 bpm, are deleted until the beat returns to normal. This procedure is repeated from the beginning until every remaining missing segment is shorter than 15 seconds.
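The gap-handling part of this procedure can be sketched in plain Python as follows. This is a simplified illustration and not the authors' implementation: linear interpolation is swapped in for the Hermite cubic spline to keep the sketch dependency-free (a piecewise cubic Hermite interpolator such as SciPy's `PchipInterpolator` could be substituted), the break-point rule is omitted, and a gap of exactly 15 seconds is treated as long, which the text leaves unspecified. The function names are hypothetical.

```python
FS_HZ = 4                     # sampling rate stated in the text
MAX_GAP_S = 15                # gap length limit stated in the text
MAX_GAP = FS_HZ * MAX_GAP_S   # 60 samples


def mark_invalid(fhr, low=50.0, high=200.0):
    """Replace missing beats (0) and outliers (<50 or >200 bpm) with None."""
    return [x if low <= x <= high else None for x in fhr]


def fill_or_drop_gaps(samples, max_gap=MAX_GAP):
    """Fill short gaps by linear interpolation; delete long gaps and edge gaps."""
    out = []
    i, n = 0, len(samples)
    while i < n:
        if samples[i] is not None:
            out.append(samples[i])
            i += 1
            continue
        j = i
        while j < n and samples[j] is None:
            j += 1                      # j: first valid index after the gap
        gap = j - i
        if out and j < n and gap < max_gap:
            left, right = out[-1], samples[j]
            for k in range(1, gap + 1):
                out.append(left + (right - left) * k / (gap + 1))
        # Otherwise the gap is simply dropped (too long, or at a signal edge).
        i = j
    return out


# Hypothetical FHR samples in bpm: two missing beats and one outlier (250).
raw = [140, 142, 0, 0, 146, 250, 148]
clean = fill_or_drop_gaps(mark_invalid(raw))
print(clean)

# A 60-sample (15 s) gap is deleted rather than filled.
long_gap = [140] + [0] * 60 + [150]
print(fill_or_drop_gaps(mark_invalid(long_gap)))
```

Marking zeros and outliers with the same sentinel lets one pass handle both cases, since a zero-valued missing beat already falls below the 50 bpm lower bound.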

Several features of the FHR signal are considered by obstetricians or tocologists in assessing fetal health. These features include deceleration, acceleration, and variability. Figure 3 shows an example of an FHR signal that contains those three features. Hypoxia in the fetus is usually characterized by long and deep decelerations that appear continuously in the signal.

Fig. 1. CTG data consisting of FHR (top) and UC (bottom).

Fig. 2. Pre-processing results of FHR signal.

C. Classification

The classification in this paper is conducted using ensemble learning methods, namely Adaptive Boosting (AdaBoost), Bootstrap Aggregating (Bagging Tree), and Voting Classifier. AdaBoost is a boosting method that combines several weak classifiers into a strong classifier with better accuracy. AdaBoost assigns a weight to each classifier and re-weights the training samples in each iteration, and every base classifier is trained according to the specified weights. Bagging Tree is an ensemble method that combines several weak classifiers, each trained on sample data obtained by bootstrap. The Voting Classifier is related to stacking (stacked generalization): it trains several classifier models on a dataset and then combines the predictions of the constructed classifiers by voting. The classifier models used in the ensemble learning methods include Decision Tree, Support Vector Machine (SVM), Generalized Learning Vector Quantization (GLVQ), Naive Bayes, and Stochastic Gradient Descent (SGD).
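To make the bagging idea concrete, the following is a minimal, dependency-free Python sketch of bootstrap aggregating with a Gaussian Naive Bayes base classifier and majority voting. It is an illustration under stated assumptions, not the authors' implementation: the class `GaussianNB`, the functions `bagging_fit` and `bagging_predict`, the variance smoothing constant, and the two toy features are all hypothetical.

```python
import math
import random
from collections import Counter

VAR_SMOOTHING = 1.0  # arbitrary variance floor chosen for this toy example


class GaussianNB:
    """Tiny Gaussian Naive Bayes for numeric feature vectors."""

    def fit(self, X, y):
        self.stats = {}
        for c in set(y):
            rows = [x for x, t in zip(X, y) if t == c]
            means = [sum(col) / len(rows) for col in zip(*rows)]
            variances = [
                sum((v - m) ** 2 for v in col) / len(rows) + VAR_SMOOTHING
                for col, m in zip(zip(*rows), means)
            ]
            self.stats[c] = (len(rows) / len(y), means, variances)
        return self

    def predict_one(self, x):
        def log_posterior(c):
            prior, means, variances = self.stats[c]
            return math.log(prior) + sum(
                -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                for v, m, s in zip(x, means, variances)
            )
        return max(self.stats, key=log_posterior)


def bagging_fit(X, y, n_estimators=3, seed=0):
    """Train each base model on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(GaussianNB().fit([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Combine the base models by majority vote."""
    votes = Counter(m.predict_one(x) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature toy data (not derived from the CTG database).
X = [[140, 8], [138, 9], [142, 7], [145, 10], [136, 8], [141, 9],
     [110, 3], [105, 2], [112, 4], [108, 3]]
y = [0] * 6 + [1] * 4

models = bagging_fit(X, y, n_estimators=3)
print(bagging_predict(models, [107, 3]))
```

The same majority-vote step is what a hard-voting classifier applies across heterogeneous models; in bagging, the diversity comes from the bootstrap samples instead of from different model types.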

These ensemble methods are then compared with deep learning models. The deep learning models used in this paper are CNN and DenseNet. DenseNet is similar to CNN in that it uses convolution as the feature extraction process, but it has a different architecture. Whereas a CNN performs feature extraction in each layer and thus produces many parameters, DenseNet passes the feature extraction results of every previous layer on to the next layer, so that the model does not need to relearn the feature maps that were generated in earlier layers. Compared to CNN, the layers of a DenseNet can therefore be thinner, while the network is densely connected.
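The difference in connectivity can be written compactly. The notation below is introduced here for illustration and does not appear in the original text: $x_\ell$ is the output of layer $\ell$, $H_\ell$ is the transformation applied by that layer (convolution, activation, and so on), and $[\cdot]$ denotes concatenation of feature maps.

```latex
\text{Plain CNN:}\quad x_\ell = H_\ell(x_{\ell-1})
\qquad\qquad
\text{DenseNet:}\quad x_\ell = H_\ell\big([x_0, x_1, \ldots, x_{\ell-1}]\big)
```

Because each DenseNet layer receives all earlier feature maps as input, it only needs to add a small number of new feature maps, which is why its layers can be narrow.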

D. Implementation

The experiment in this paper is conducted with a program written in Python to carry out the classification. Four experimental scenarios are conducted:

i. Learning with the ensemble methods, using several classifiers, and with deep learning on raw FHR data without pre-processing.

ii. Learning with the ensemble and deep learning methods, using the best classifier from the first scenario, on raw FHR data without pre-processing. In this scenario, the number of trees and the number of layers are varied and compared.

iii. Because the classes in the FHR data are imbalanced, resampling is carried out using the bootstrapping method for the abnormal class. In this scenario, learning uses the ensemble and deep learning methods with the best classifier from the first scenario on raw FHR data without pre-processing.

iv. In the fourth scenario, FHR data pre-processing is performed to overcome the missing beats and noise. The fourth scenario also compares the ensemble and deep learning methods using the best model obtained from the previous scenarios. The performance of each method is evaluated using recall, precision, and F1-score.

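This paper aims to detect both the normal and the hypoxia class, so the performance of the model is evaluated using precision, recall, and F1-score on both labels, i.e. normal (0) and hypoxia (1). The performance measures are calculated as

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{2}$$

$$F1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{3}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives for the class being evaluated.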
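As a small worked check, the sketch below implements the three measures in Python and recomputes the F1-score from the precision and recall values reported in Table 1 for Bagging Tree with Naive Bayes. The function names, the hypothetical confusion counts, and the convention of returning 0 when a denominator is zero are assumptions made for this illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 if the class has no true members."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Precision and recall reported in Table 1 (Bagging Tree with Naive Bayes).
f1_normal = f1_score(0.88, 0.67)
f1_hypoxia = f1_score(0.34, 0.65)
print(round(f1_normal, 2), round(f1_hypoxia, 2))

# Hypothetical confusion counts for one class, for illustration only.
p = precision(tp=13, fp=25)
r = recall(tp=13, fn=7)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))
```

The recomputed values, 0.76 for the normal class and 0.45 for the hypoxia class, agree with the F1-scores listed in Table 1.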

IV. EXPERIMENTAL RESULT

FHR signal classification was performed using ensemble learning methods, namely AdaBoost, Bagging Tree, and Voting Classifier. The results were compared with deep learning methods, namely DenseNet and CNN. These learning methods were implemented to obtain the best results in the detection of hypoxia. The results of the model evaluation were analyzed on both labels: a model was considered capable of detecting both the normal and the hypoxia class if it had high precision, recall, and F1-score for both classes.

a. Result of Scenario I

In the first scenario, recall, precision, and F1-score are obtained for each class for three ensemble methods with different classifiers. The ensemble learning models compared in scenario I are Bagging Tree, AdaBoost, and Voting Classifier, and the classifiers are Decision Tree, SVM, GLVQ, SGD, and Naive Bayes. The ensemble learning models are then compared with the deep learning models, DenseNet and CNN.

The results of scenario I are shown in Tables 1 to 4. Among the ensemble learning models, the best F1-score for both classes is obtained by Bagging Tree with the Naive Bayes classifier, which yields 76% for the normal class and 45% for the hypoxia class.

Compared with the Voting Classifier model (Random Forest, Naive Bayes, and GLVQ), Bagging Tree (Naive Bayes) obtains a better hypoxia recall of 65%, so it is considered to detect hypoxia better than the Voting Classifier model. For the deep learning models, classification with CNN and DenseNet is carried out with 50 and 100 epochs. Table 4 shows that the deep learning model with the best precision, recall, and F1-score for both labels is CNN, with an F1-score of 90% for the normal class and 23% for the hypoxia class.

TABLE 1. BAGGING TREE RESULT

| Classifier | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1-score (0) | F1-score (1) |
|---|---|---|---|---|---|---|
| Decision Tree | 0.79 | 0.2 | 0.86 | 0.13 | 0.83 | 0.16 |
| SVM | 0.79 | 0 | 1 | 0 | 0.88 | 0 |
| GLVQ | 0.85 | 0.3 | 0.65 | 0.57 | 0.74 | 0.39 |
| SGD | 0.78 | 0.16 | 0.82 | 0.13 | 0.8 | 0.14 |
| Naive Bayes | 0.88 | 0.34 | 0.67 | 0.65 | 0.76 | 0.45 |

TABLE 2. ADABOOST RESULT

| Classifier | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1-score (0) | F1-score (1) |
|---|---|---|---|---|---|---|
| Decision Tree | 0.81 | 0.27 | 0.75 | 0.35 | 0.78 | 0.3 |
| SVM | 0.79 | 0 | 1 | 0 | 0.88 | 0 |
| SGD | 0.78 | 0.13 | 0.85 | 0.09 | 0.82 | 0.11 |

TABLE 3. VOTING CLASSIFIER RESULT

| Classifiers | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1-score (0) | F1-score (1) |
|---|---|---|---|---|---|---|
| Decision Tree, Logistic Regression, Naive Bayes, GLVQ, SVM | 0.81 | 0.43 | 0.95 | 0.13 | 0.88 | 0.2 |
| Logistic Regression, Random Forest, Naive Bayes | 0.83 | 0.36 | 0.84 | 0.35 | 0.84 | 0.36 |
| Random Forest, Naive Bayes, GLVQ | 0.86 | 0.31 | 0.67 | 0.57 | 0.75 | 0.4 |

TABLE 4. DEEP LEARNING RESULT

| Classifier | Epochs | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1-score (0) | F1-score (1) |
|---|---|---|---|---|---|---|---|
| CNN | 50 | 0.81 | 1 | 1 | 0.13 | 0.9 | 0.23 |
| CNN | 100 | 0.81 | 1 | 1 | 0.13 | 0.9 | 0.23 |
| DenseNet | 50 | 0.79 | 0 | 1 | 0 | 0.88 | 0 |
| DenseNet | 100 | 0.79 | 0 | 1 | 0 | 0.88 | 0 |

b. Scenario II

In the first scenario, the best precision, recall, and F1-score among the ensemble learning and deep learning models were obtained by Bagging Tree with the Naive Bayes classifier and by CNN with 50 epochs. In the second scenario, these two models are compared while varying the number of trees and the number of layers. The number of trees in the ensemble model with the Naive Bayes classifier is varied over 3, 5, 7, 10, 50, 100, and 500 trees. The number of convolution and pooling layers in the CNN is varied over 2, 3, and 4 layers. The results of the second scenario can be seen in Figures 5 and 6.

FIGURE 5. BAGGING NAIVE BAYES RESULT

[Line chart of F1-score versus number of trees (3, 5, 7, 10, 50, 100, 500). Normal: 0.76, 0.76, 0.78, 0.81, 0.8, 0.77, 0.79. Abnormal: 0.45, 0.32, 0.35, 0.33, 0.42, 0.39, 0.43.]

In Figure 5, it can be seen that the optimal number of trees for the ensemble learning model with the Naive Bayes classifier is 3. With 3 trees, the F1-score is 76% for the normal class and 45% for the hypoxia class. With 10 trees, the F1-score reaches its highest value for the normal class, 81%, but only 33% for the hypoxia class, which shows that the model is inadequate for detecting hypoxia. The variation with 3 trees has the best balance of F1-scores across both labels.

In Figure 6, the optimal number of layers for CNN in the second scenario is 3. The F1-score is 90% for the normal class and 29% for the hypoxia class. The figure also shows that with 4 layers the F1-score decreases to 89% for the normal class and 28% for the hypoxia class.

FIGURE 6. DEEP LEARNING RESULT

[Line chart of F1-score versus number of layers (2, 3, 4). Normal: 0.9, 0.9, 0.89. Abnormal: 0.23, 0.29, 0.28.]

c. Scenario III

In the third scenario, the F1-score results of the ensemble learning and deep learning methods are compared. Because the classes in the FHR data are imbalanced, resampling is carried out using the bootstrapping method for the hypoxia class, so that the normal and hypoxia classes reach the same number of instances. In scenario III, the use of bootstrapping does not increase the F1-score for either ensemble learning or deep learning. In Figure 7, the highest F1-score values for Bagging Tree with Naive Bayes are 76% for the normal class and 39% for the hypoxia class. In Figure 8, the highest F1-score values of the CNN model are 90% for the normal class and 29% for the hypoxia class, with 3 layers.

In Figure 8, increasing the number of layers to 4 raises the F1-score of the hypoxia class to 34%, but the F1-score of the normal class drops to 0%. This shows that the model is able to detect hypoxia but is inadequate for detecting the normal class.
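The resampling step of this scenario can be sketched as bootstrap oversampling of the minority class, i.e. drawing hypoxia recordings with replacement until both classes have the same size. The sketch below is an assumption about how such balancing could be done and is not taken from the authors' code; the function name `bootstrap_oversample` and the placeholder features are hypothetical, while the class counts come from the text.

```python
import random


def bootstrap_oversample(X, y, minority=1, seed=0):
    """Append minority samples drawn with replacement until classes are balanced.

    Assumes a binary problem in which every label other than `minority`
    belongs to the majority class.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, t in enumerate(y) if t == minority]
    n_majority = len(y) - len(minority_idx)
    n_extra = n_majority - len(minority_idx)
    if not minority_idx or n_extra <= 0:
        return list(X), list(y)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    X_bal = list(X) + [X[i] for i in extra]
    y_bal = list(y) + [minority] * n_extra
    return X_bal, y_bal


# Class counts from the text: 447 normal (0) and 105 hypoxic (1) recordings.
y = [0] * 447 + [1] * 105
X = [[float(i)] for i in range(len(y))]   # placeholder features

X_bal, y_bal = bootstrap_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))
```

Resampling of this kind is commonly applied to the training split only, so that duplicated recordings do not also appear in the evaluation data; the text does not state how this was handled in the experiments.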

FIGURE 7. BAGGING NAIVE BAYES RESULT

[Line chart of F1-score versus number of trees (3, 5, 7, 20, 50, 100, 500). Normal: 0.69, 0.73, 0.72, 0.76, 0.75, 0.74, 0.75. Abnormal: 0.38, 0.36, 0.36, 0.39, 0.38, 0.37, 0.38.]

FIGURE 8. DEEP LEARNING RESULT

[Line chart of F1-score versus number of layers (2, 3, 4) for the normal and abnormal classes.]

d. Scenario IV

In the fourth scenario, the F1-score results of the ensemble learning and deep learning methods are compared. Whereas the previous scenarios used raw FHR data, this scenario uses FHR data that has been pre-processed with the procedures described above.

FIGURE 9. ENSEMBLE LEARNING NAIVE BAYES

[Line chart of F1-score versus number of trees (3, 5, 7, 20, 50, 100, 500). Normal: 0.77, 0.78, 0.78, 0.77, 0.77, 0.76, 0.78. Abnormal: 0.32, 0.33, 0.35, 0.36, 0.32, 0.33, 0.34.]

FIGURE 10. DEEP LEARNING

[Line chart of F1-score versus number of layers (2, 3, 4). Normal: 0.88, 0.88, 0.88. Abnormal: 0, 0, 0.]

In scenario IV, the use of pre-processing also does not increase the F1-score: the highest F1-score values for Bagging Tree with the Naive Bayes classifier are 77% for the normal class and 36% for the hypoxia class, while for the CNN model the F1-score is 88% for the normal class and 0% for the hypoxia class.

In Figures 11 and 12, the precision, recall, and F1-score of each model are compared for the hypoxia and normal classes. Figure 12 shows that all models detect the normal class well, as seen from precision, recall, and F1-score values of more than 60%. Figure 11 shows that Bagging Tree with the Naive Bayes classifier detects hypoxia better than the other models because it has a higher recall and F1-score. The DenseNet model, in contrast, is not able to detect the hypoxia class, as seen from its precision, recall, and F1-score of 0%.

FIGURE 11. CLASS HYPOXIA RESULT

FIGURE 12. CLASS NORMAL RESULT

CONCLUSION

The comparison between ensemble learning methods and deep learning methods was presented. The methods were applied to a CTG dataset, specifically to the FHR signal. The classification used the pH value as the labeling benchmark, which divided the dataset into normal and hypoxia classes. To analyze the hypoxia detection performance of each model, the evaluation focused on precision, recall, and F1-score for class 0 (normal) and class 1 (hypoxia), i.e. on whether the model recognizes both the normal and the hypoxia label appropriately. In the first scenario, among the ensemble learning models (Bagging Tree, AdaBoost, and Voting Classifier) and the deep learning models, Bagging Tree with the Naive Bayes classifier obtained the best balance of precision, recall, and F1-score for both classes, with an F1-score of 76% for the normal class and 45% for the hypoxia class. The ensemble learning model with the worst recall was AdaBoost with the SVM classifier, with a recall of 0%.

In the second scenario, after comparing the number of trees used by the Bagging Tree ensemble model with the Naive Bayes classifier, the optimal number of trees was found to be 3. In the third scenario, bootstrapping on the abnormal data did not improve the precision and recall results. In the fourth scenario, pre-processing the raw data decreased the precision, recall, and F1-score values.

From all scenarios, it can be concluded that for raw FHR data (without pre-processing), the ensemble learning method was superior to the deep learning models for detecting hypoxia. This is evidenced by the recall for the hypoxia class, which was higher for ensemble learning than for the deep learning models in each scenario.

ACKNOWLEDGMENT

This research was funded by the Penelitian Konsorsium Riset Unggulan Perguruan Tinggi (KRUPT) grant from the Ministry of Education, Indonesia, with contract number NKB-1070/UN2.R3.1/HKP.05.00/2019.

REFERENCES

[1] F. Marzbanrad and L. Stroux, “Cardiotocography and Beyond: A Review of One-Dimensional Doppler Ultrasound Application in Fetal Monitoring,” Inst. Phys. Eng. Med. Dur., 2018.

[2] J. Spilka, J. Frecon, R. Leonarduzzi, N. Pustelnik, P. Abry, and M. Doret, “Sparse Support Vector Machine for Intrapartum Fetal Heart Rate Classification,” IEEE J Biomed Heal. Inf., vol. 2194, no. c, pp. 1–8, 2016.

[3] M. B. B and L. Lhotska, “The Use of Convolutional Neural Networks in Biomedical Data Processing,” pp. 100–119, 2017.

[4] S. K. H. Yang and S. Lee, “FitMine: automatic mining for time-evolving signals of cardiotocography monitoring,” Data Min. Knowl. Discov., vol. 31, no. 4, pp. 909–933, 2017.

[5] J. Li et al., “Automatic Classification of Fetal Heart Rate Based on Convolutional Neural Network,” IEEE Internet Things J., vol. 4662, no. c, pp. 1–1, 2018.

[6] P. Fergus, M. Selvaraj, and C. Chalmers, “Machine learning ensemble modelling to classify caesarean section and vaginal delivery types using Cardiotocography traces,” Comput. Biol. Med., vol. 93, no. June 2017, pp. 7–16, 2018.

[7] H. Tang, T. Wang, M. Li, and X. Yang, “The Design and Implementation of Cardiotocography Signals Classification Algorithm Based on Neural Network,” vol. 2018, 2018.

[8] Z. Comert, Z. Yang, S. Velappan, A. M. Boopathi, and A. F. Kocamaz, “Performance evaluation of Empirical Mode Decomposition and Discrete Wavelet Transform for computerized hypoxia detection and prediction,” 2018 26th Signal Process. Commun. Appl. Conf., no. Iii, pp. 1–4, 2018.

[9] Z. Comert, A. M. Boopathi, S. Velappan, Z. Yang, and A. F. Kocamaz, “The influences of different window functions and lengths on image-based time-frequency features of fetal heart rate signals,” 26th IEEE Signal Process. Commun. Appl. Conf. SIU 2018, pp. 1–4, 2018.

[10] C. Zafer, “Fetal Hypoxia Detection Based on Deep Convolutional Neural Network with Transfer Learning Approach,” vol. 763, pp. 239–248, 2019.

[Figure 11 data, hypoxia class, for Bagging Naive Bayes, AdaBoost Decision Tree, Voting (Logistic Regression, Random Forest, Naive Bayes), CNN, and DenseNet. Precision: 0.34, 0.27, 0.36, 0.8, 0. Recall: 0.65, 0.35, 0.35, 0.17, 0. F1-score: 0.45, 0.3, 0.36, 0.29, 0.]

[Figure 12 data, normal class, for the same models. Precision: 0.88, 0.81, 0.83, 0.82, 0.79. Recall: 0.67, 0.75, 0.84, 0.99, 1. F1-score: 0.76, 0.78, 0.84, 0.9, 0.88.]