Hidden Conditional Random Fields for ECG Classification
Reda A. El-Khoribi
Faculty of Computers and Information, Cairo University, 5 Zewail Street, Giza, Egypt
Abstract
In this paper, a novel approach to ECG signal classification is proposed. The approach models the ECG signal with hidden conditional random fields (HCRFs). Features used in training and testing the HCRF are based on time-frequency analysis of the ECG waveforms. Experimental results show that the HCRF model is promising and gives higher accuracy than maximum-likelihood (ML) trained hidden Markov models (HMMs).
Keywords: ECG classification, discrete observation, hidden conditional random fields, hidden Markov models
1. Introduction
Electrocardiogram (ECG) analysis has received much interest in the recent literature. This interest comes from its role as an efficient non-invasive investigative method that provides useful information for the detection, diagnosis and treatment of cardiac anomalies.
Figure 1 shows the typical structure of the ECG signal. The ECG signal is divided into two phases: the depolarization phase and the repolarization phase. The depolarization phase corresponds to the P-wave and the QRS complex, while the repolarization phase corresponds to the T-wave and the U-wave. The ECG is measured by placing ten electrodes on selected spots on the surface of the human body. For regular ECG recordings, the variations in electrical potential along 12 different directions derived from the ten electrodes are measured. These 12 different electrical views of the activity of the heart are normally referred to as leads.
Figure 1: Structure of the ECG signal
Trained physicians are able to recognize certain patterns in a patient's ECG signal and use them as the basis for diagnosis. Figure 2 shows the shape of typical heart beats for four different cases: normal, ventricular tachyarrhythmia, atrial fibrillation, and congestive heart failure. Based on the features read from the ECG signal, the physician is able to classify and diagnose the case. One of the main objectives of computer-based ECG diagnosis systems is to play the role of the physician by automatically interpreting the ECG signal and classifying the cases.
Figure 2: ECG signals belonging to four classes: (a) normal beat, (b) ventricular tachyarrhythmia beat, (c) atrial fibrillation beat, (d) congestive heart failure beat.
Several machine learning and pattern recognition techniques have been proposed for the automatic classification of ECG signals. These include neural networks [1, 2, 3], hidden Markov models [4, 5], support vector machines [6], fuzzy classifiers [7, 8] and many others. It remains challenging to develop new methods that overcome the difficulties arising from the multitude of characteristics embedded in the signal.
Several features have been used in ECG classification, based on time-domain or transform-domain signal representations. In the time domain, statistical computations are usually performed either on the time samples directly or after applying FIR filters for differentiation, etc. [9, 10, 11]. Transform-domain features, on the other hand, are computed after applying some transformation to the signal. The wavelet, Fourier and Hilbert transforms have received great attention in this respect [6, 12, 13, 14, 15].
Feature selection approaches are usually applied in cases of high-dimensional feature vectors in order to reduce the dimensionality of the pattern. This is very beneficial in increasing the discrimination power at low computational complexity. Moreover, it helps overcome the well-known problem in the pattern recognition literature called 'the curse of dimensionality', which complicates machine learning problems. Principal component analysis (PCA) and independent component analysis (ICA) are among the automatic, data-driven approaches used in the ECG literature [15]. Computing first- and higher-order statistics of the ECG signal or its transformation is also a means of reducing feature dimensionality by summarizing the feature sets. The mean, variance, entropy and third-order cumulant of the signal derivatives or of the coefficients of its wavelet transform are examples of such statistical features [16].
In statistical pattern classification, models can be divided into two types: generative and discriminative models.
Generative models define the probability distribution of individual classes from a set of independent and identically distributed (i.i.d.) samples belonging to the class. Discriminative models, on the other hand, model the dependence of the class variable on the input feature variable. This implies that they construct a discrimination function to be used in decision making. Among the powerful discriminative models that have recently gained popularity in the machine learning community are the conditional random fields (CRFs) [18]. CRFs have been shown to be powerful discriminative models because they can incorporate essentially arbitrary feature-vector representations of the observed data points [18].
However, they are limited in that they cannot capture intermediate structure using hidden-state variables. In [19, 20, 21], the authors developed hidden-state conditional random fields (HCRFs) to model speech signals and video sequences.
In this paper, the aim is to develop a system for automatically classifying ECG signals based on a suitable structure of hidden conditional random fields. The structure is chosen to be similar to the HMM structure in order to model the time-evolving process underlying ECG signals. As stated in [20], HCRFs are discriminative models while HMMs are generative models; moreover, regarding the relation between observation and state sequences, HMMs can be viewed as a subclass of HCRFs. Our hope in this research is to obtain an accuracy improvement over HMMs.
The rest of this paper is organized as follows. A short description of HMMs as a baseline for HCRFs is given in Section 2. Section 3 gives an overview of HCRFs. In Section 4, we introduce the detailed analysis of the proposed discrete-data HCRFs. In Section 5, we describe the new approach to ECG classification. Experimental results are discussed in Section 6. Finally, conclusions and ideas for future work are given in Section 7.
2. Hidden Markov Models
HMMs are doubly embedded stochastic processes to model a sequence of observations. In HMMs it is assumed that the observation sequence comes from some stochastic process that depends on a hidden first order Markov process generating a state sequence with the same length as the observation sequence [22].
The observation sequence $o = (o_1, o_2, \ldots, o_T)$ may be discrete or continuous, scalar or vector. The state sequence $q = (q_1, q_2, \ldots, q_T)$ is always discrete, with each $q_t$ belonging to a finite set of states $\{s_1, s_2, \ldots, s_N\}$. The hidden Markov model parameters, $\theta_{HMM}$, comprise three distributions. First is the initial state probability distribution $\pi$, a vector with elements $\pi_i = P(q_1 = s_i)$. The state transition distribution $A$ is an $N \times N$ matrix with elements $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$. The emission distribution $B$ consists of the elements $b_i(o_t) = P(o_t \mid q_t = s_i)$. In a discrete HMM, $B$ is a matrix, while in a continuous HMM the elements of $B$ are mixtures of probabilities from a parametric family of distributions (usually Gaussian). In the case of Gaussian mixtures with $K$ components, the elements of $B$ take the form $b_i(o_t) = \sum_{k=1}^{K} w_{ik}\, \mathcal{N}(o_t; \mu_{ik}, \Sigma_{ik})$, where $w_{ik}$ is the weight of the $k$th mixture component of state $s_i$.
To estimate the parameters of the HMM of a given class, a set of i.i.d. observation sequences belonging to that class is prepared. One of the basic approaches to estimating HMM parameters is the maximum-likelihood (ML) approach, in which the parameters are chosen to maximize the log-likelihood of the training observations belonging to the class being modeled; that is, to find the parameters $\theta_{HMM}$ such that:
$$\theta_{HMM} = \arg\max_{\theta} \log P(O \mid \theta) = \arg\max_{\theta} \sum_{n=1}^{N_s} \log P(o^{(n)} \mid \theta) \qquad (1)$$

where $O = \{o^{(1)}, o^{(2)}, \ldots, o^{(N_s)}\}$ is the set of training observation sequences, all belonging to the class.
Expectation maximization (EM) is used to solve this optimization problem, resulting in the well-known Baum-Welch forward-backward algorithm [22].
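As a concrete illustration of the quantity maximized in Eq. (1), the following is a minimal sketch (not the authors' code) of the scaled forward pass that evaluates $\log P(o \mid \theta)$ for a discrete-observation HMM; the array layout of $\pi$, $A$ and $B$ is an assumption.

```python
# Sketch: log-likelihood of one discrete observation sequence under an HMM,
# computed with the scaled forward recursion to avoid numerical underflow.
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """obs: sequence of integer symbol indices (length T)
       pi:  (N,)   initial state probabilities
       A:   (N, N) transition matrix, A[i, j] = P(q_{t+1}=s_j | q_t=s_i)
       B:   (N, M) emission matrix,  B[i, k] = P(o_t=v_k | q_t=s_i)"""
    alpha = pi * B[:, obs[0]]              # forward variable at t = 1
    log_lik = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()                # accumulate the scaling factors
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_lik + np.log(alpha.sum())   # log P(o | theta)
```

Summing this quantity over the training sequences of a class gives the objective of Eq. (1).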
Considering only the in-class information in the ML estimation criterion of HMMs leads to poor discriminative ability of these models. Several approaches have been developed to overcome this shortcoming; among them are the conditional random field (CRF) and the hidden conditional random field (HCRF). As will be explained shortly, HCRFs have an additional property.
They model the state sequence as being conditionally Markov given the observation sequence, while HMMs model the state sequence as being Markov, and each observation being independent of all others given the corresponding state [20].
3. Hidden Conditional Random Fields
HCRFs are exponential models of the conditional probability of a class $\omega$ given the observation sequence $o = (o_1, o_2, \ldots, o_T)$, defined through the parametric exponential distribution:

$$p(\omega \mid o; \theta) = \frac{1}{z(o; \theta)} \sum_{q} \exp\langle \theta, \phi(\omega, q, o) \rangle \qquad (2)$$

Here $q = (q_1, q_2, \ldots, q_T)$ is the hidden state sequence, $\theta$ is the model parameter vector, $\phi(\omega, q, o)$ is the vector of sufficient statistics (the feature vector), and $\langle \cdot, \cdot \rangle$ denotes the inner product. The partition function $z(o; \theta)$ normalizes the probability; i.e.

$$z(o; \theta) = \sum_{\omega, q} \exp\langle \theta, \phi(\omega, q, o) \rangle \qquad (3)$$
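To make the definition in Eqs. (2)-(3) explicit, the following toy sketch (an illustration, not from the paper) computes the class posterior by brute-force enumeration of all state sequences; this is only practical for tiny $N$ and $T$, and in practice the sums are computed with dynamic programming. The function `phi` is assumed to return the sufficient-statistic vector of Eq. (12).

```python
# Sketch: HCRF class posterior p(w | o; theta) by exhaustive enumeration.
import itertools
import numpy as np

def hcrf_posterior(o, num_classes, num_states, theta, phi):
    scores = np.zeros(num_classes)
    for w in range(num_classes):
        for q in itertools.product(range(num_states), repeat=len(o)):
            scores[w] += np.exp(theta @ phi(w, q, o))   # sum over state paths
    return scores / scores.sum()     # normalization plays the role of z(o; theta)
```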
The training problem of the HCRF is to estimate the parameters $\theta$, given a set of training samples $\{(o^{(1)}, \omega^{(1)}), \ldots, (o^{(N_s)}, \omega^{(N_s)})\}$, by maximizing the conditional log-likelihood:

$$L(\theta) = \sum_{n=1}^{N_s} \log p(\omega^{(n)} \mid o^{(n)}; \theta) \qquad (4)$$
The gradient ascent algorithm is an iterative procedure that finds the optimum parameter vector $\theta$, starting from an initial estimate $\theta^{(0)}$ and updating this estimate with the recursive formula:

$$\theta^{(k+1)} = \theta^{(k)} + \eta\, \nabla L(\theta^{(k)}) \qquad (5)$$

The choice of sufficient statistics determines the dependencies modeled by the HCRF. To obtain the same dependencies as in HMMs, the sufficient statistics and the HCRF parameters are chosen based on the sufficient statistics and parameters of HMMs.
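A minimal sketch of one update of Eq. (5), assuming the standard form of the HCRF gradient (the difference between the feature expectations with the class clamped to its true value and the unclamped expectations). Here `expected_phi` is a hypothetical helper that would compute these expectations, e.g. with a forward-backward pass over the state lattice.

```python
# Sketch: one gradient-ascent step on the conditional log-likelihood L(theta).
def gradient_ascent_step(theta, train_set, eta, expected_phi):
    grad = 0.0
    for o, w in train_set:
        # E[phi | w, o; theta] - E[phi | o; theta], summed over training pairs
        grad = grad + (expected_phi(theta, o, clamp_class=w)
                       - expected_phi(theta, o, clamp_class=None))
    return theta + eta * grad        # theta^(k+1) = theta^(k) + eta * grad L
```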
4. Discrete Data HCRFs
In this paper, we propose to base the HCRF sufficient statistics on those of discrete HMMs rather than those of continuous HMMs. The reason is to simplify the training algorithm by dealing with discrete scalar observation sequences rather than continuous vector observation sequences. This requires a vector quantization module to convert vectors into scalar labels. The size of the codebook of the vector quantization module controls the amount of information loss resulting from this conversion. A discrete-data HCRF thus gives a controllable compromise between time complexity and classification accuracy.
In order to define our discrete-data HCRF, we assume the following settings. Let $\theta_c$ be the HMM parameters of class $\omega_c$, $c = 1, 2, \ldots, C$. For simplicity, we drop the class suffix from the elements of the initial state vector, the transition matrix and the emission matrix. From the properties of HMMs, the joint distribution of an observation sequence $o$ and a state sequence $q$ is given by:
$$P(o, q \mid \theta) = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \qquad (6)$$

The log of this distribution is hence given by:

$$\begin{aligned} \log P(o, q \mid \theta) &= \log \pi_{q_1} + \sum_{t=1}^{T} \log b_{q_t}(o_t) + \sum_{t=2}^{T} \log a_{q_{t-1} q_t} \\ &= \log \pi_{q_1} + \sum_{t=1}^{T} \sum_{i=1}^{N} \log b_i(o_t)\, \delta(q_t = s_i) + \sum_{t=2}^{T} \sum_{i=1}^{N} \sum_{j=1}^{N} \log a_{ij}\, \delta(q_{t-1} = s_i)\, \delta(q_t = s_j) \end{aligned} \qquad (7)$$

The function $\delta(\cdot)$ returns the value 1 if its argument is true and 0 otherwise. In the case of discrete HMMs:
$$b_i(o_t) = \sum_{j=1}^{M} b_{ij}\, \delta(o_t = v_j) \qquad (8)$$

where the $b_{ij}$ are the elements of the emission matrix $B$ and $v_j$ is the $j$th symbol in the codebook. Hence, for discrete HMMs:
$$\log P(o, q \mid \theta) = \log \pi_{q_1} + \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{j=1}^{M} \log b_{ij}\, \delta(q_t = s_i)\, \delta(o_t = v_j) + \sum_{t=2}^{T} \sum_{i=1}^{N} \sum_{j=1}^{N} \log a_{ij}\, \delta(q_{t-1} = s_i)\, \delta(q_t = s_j) \qquad (9)$$

If we constrain the state sequence to always start from a single state (e.g. $s_1$), the first term on the right-hand side vanishes and hence:
$$\log P(o, q \mid \theta) = \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{j=1}^{M} \log b_{ij}\, \delta(q_t = s_i)\, \delta(o_t = v_j) + \sum_{t=2}^{T} \sum_{i=1}^{N} \sum_{j=1}^{N} \log a_{ij}\, \delta(q_{t-1} = s_i)\, \delta(q_t = s_j) \qquad (10)$$

In the HCRF model we have a single parameter vector $\theta$ for all classes, so we consider the distribution $P(o, q, \omega \mid \theta) = P(\omega)\, P(o, q \mid \theta, \omega)$; hence,
$$\begin{aligned} \log P(o, q, \omega \mid \theta) &= \log P(\omega) + \log P(o, q \mid \theta, \omega) \\ &= \sum_{c=1}^{C} \delta(\omega = \omega_c)\, \log u_c + \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{j=1}^{M} \log b_{ij}\, \delta(q_t = s_i)\, \delta(o_t = v_j) + \sum_{t=2}^{T} \sum_{i=1}^{N} \sum_{j=1}^{N} \log a_{ij}\, \delta(q_{t-1} = s_i)\, \delta(q_t = s_j) \end{aligned} \qquad (11)$$

Here we define $u_c = P(\omega_c)$. The latter formula can be compactly written as an inner product of two vectors, one containing the HCRF parameters and the other containing sufficient statistics of the observation and state sequences; i.e.:
$$\log P(o, q, \omega \mid \theta) = \langle \theta, \phi(\omega, q, o) \rangle.$$

The sufficient-statistic vector is partitioned into the following three groups, $\phi^{(1)}, \phi^{(2)}, \phi^{(3)}$, where:

$$\begin{aligned} \phi^{(1)}_{i}(\omega, q, o) &= \delta(\omega = \omega_i), & i &= 1 \ldots C \\ \phi^{(2)}_{ij}(\omega, q, o) &= \sum_{t=1}^{T} \delta(q_t = s_i)\, \delta(o_t = v_j), & i &= 1 \ldots N,\; j = 1 \ldots M \\ \phi^{(3)}_{ij}(\omega, q, o) &= \sum_{t=2}^{T} \delta(q_{t-1} = s_i)\, \delta(q_t = s_j), & i, j &= 1 \ldots N \end{aligned} \qquad (12)$$

Likewise, the HCRF parameters are partitioned into the corresponding three groups $\theta^{(1)}, \theta^{(2)}, \theta^{(3)}$:

$$\theta^{(1)}_{i} = \log u_i,\; i = 1 \ldots C; \qquad \theta^{(2)}_{ij} = \log b_{ij},\; i = 1 \ldots N,\; j = 1 \ldots M; \qquad \theta^{(3)}_{ij} = \log a_{ij},\; i, j = 1 \ldots N \qquad (13)$$

This correspondence between HMMs and HCRFs suggests applying the HMM forward-backward procedure to each sequence and computing the statistics in (12) accordingly.
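The following is a minimal sketch (an assumed layout, not the authors' code) of the sufficient-statistic vector of Eq. (12) for a fully observed (class, state sequence, observation sequence) triple. In training, the hard state sequence would be replaced by the posterior state occupancies obtained from the HMM-style forward-backward procedure mentioned above.

```python
# Sketch: sufficient statistics phi(w, q, o) for a discrete-data HCRF.
import numpy as np

def sufficient_statistics(w, q, o, C, N, M):
    """w: class index, q: state indices (length T), o: symbol indices (length T)."""
    phi1 = np.zeros(C)           # class indicator features, Eq. (12), first group
    phi2 = np.zeros((N, M))      # state-symbol co-occurrence counts, second group
    phi3 = np.zeros((N, N))      # state-transition counts, third group
    phi1[w] = 1.0
    for t in range(len(o)):
        phi2[q[t], o[t]] += 1.0
        if t > 0:
            phi3[q[t - 1], q[t]] += 1.0
    return np.concatenate([phi1, phi2.ravel(), phi3.ravel()])
```

With this layout, $\langle \theta, \phi(\omega, q, o) \rangle$ is simply the dot product of the concatenated parameter vector of Eq. (13) with the returned array.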
5. The Proposed System
Figure 3 shows the proposed system. The system consists of feature extraction, clustering, vector quantization, HCRF training and HCRF recognition modules.
In the feature extraction module, time-frequency analysis is used to extract features from the ECG signal. This is better than time-domain analysis because it clearly shows how the frequency content changes over time. Many time-frequency analysis methods can be used, such as the short-time Fourier transform, the short-time Hilbert transform, the wavelet transform, etc. In our system, we use the discrete wavelet transform (DWT) to classify types of ECG beats. The main advantage of the DWT is that it has a varying window size, being broad at low frequencies and narrow at high frequencies, thus leading to an optimal time-frequency resolution in all frequency ranges.
In order to apply a short-time DWT to the ECG signal, the signal has to be segmented into short time segments. This can be done by sliding a finite-length window over the signal. A proper size for this window is 256 samples, since a single ECG beat occupies on the order of 256 samples [5]. The window is then divided into overlapping frames of 64 samples. For each frame, the DWT is applied up to 3 scales. The following compact statistical features are extracted from the wavelet coefficients in every subband, as in [5]: maximum, mean, minimum and standard deviation.
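A sketch of this per-frame feature extraction using PyWavelets is given below. The choice of mother wavelet ('db4') is an assumption, since the paper does not name the wavelet family it uses.

```python
# Sketch: subband statistics (max, mean, min, std) of a 3-level DWT of one frame.
import numpy as np
import pywt

def frame_features(frame, wavelet="db4", levels=3):
    """frame: 64-sample ECG segment -> feature vector of subband statistics."""
    coeffs = pywt.wavedec(frame, wavelet, level=levels)   # [cA3, cD3, cD2, cD1]
    feats = []
    for band in coeffs:                                    # approximation + details
        feats += [band.max(), band.mean(), band.min(), band.std()]
    return np.array(feats)
```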
Clustering is performed using the k-means algorithm [22]. The centers of the clusters are stored in a codebook. Vector quantization is used to convert vector sequences to scalar label sequences ready to be used in both the HCRF training and recognition modules.
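The clustering and vector-quantization stages might be realized as in the sketch below: k-means builds the codebook from the training feature vectors, and each new vector is mapped to the index of its nearest code word. The use of scikit-learn's KMeans is an assumption; any k-means implementation would do.

```python
# Sketch: codebook construction by k-means and nearest-codeword quantization.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(feature_vectors, codebook_size=256, seed=0):
    km = KMeans(n_clusters=codebook_size, random_state=seed).fit(feature_vectors)
    return km.cluster_centers_                  # the stored codebook

def quantize(sequence, codebook):
    """Map a sequence of feature vectors (T, D) to a sequence of symbol indices."""
    d = ((sequence[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)                     # index of the nearest code word
```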
6. Experimental Results
The training and testing samples are extracted from the PhysioNet database [23]. This database includes 135 ECG signals collected from 47 subjects: 25 men aged 32 to 89 years and 22 women aged 23 to 89 years. For each class of beats (normal, congestive heart failure, ventricular tachyarrhythmia, atrial fibrillation), two models are built, namely an HMM and an HCRF. For each class, 50 samples are used in training both models and the same number is used in testing. Vector quantization is first used to construct a codebook of size 256. Every vector observation sequence is transformed into a label sequence using the codebook before being used in training or recognition.
Two sets of experiments (one using HMMs and the other using HCRFs) were performed, with different numbers of states, N, in each case. The overall classification accuracies obtained from these experiments are presented in Table 1.
Table 1: The total classification accuracies obtained from HMM and HCRF

Number of HMM states    HMM accuracy (%)    HCRF accuracy (%)
3                       85                  90
4                       88                  95
5                       90                  94
6                       89                  93
These results show that the HCRF gives an improvement over the HMM on the order of 5%. It is also worth noting that the best number of states differs between the HMM and the HCRF; the reason, as reported in [20], is that HMMs can be considered a restricted case of HCRFs. The confusion matrices for the best number of states in both models are shown in Tables 2 and 3.

Figure 3: The modules of the proposed system (feature extraction, clustering/codebook construction, vector quantization, HCRF training, and HCRF recognition).
Table 2: Confusion matrix for the case of HMM with 5 states
(columns: CHF = congestive heart failure, VT = ventricular tachyarrhythmia, AF = atrial fibrillation)

Beat class                     Normal   CHF   VT   AF
Normal                             47     2    1    0
Congestive heart failure            3    44    2    1
Ventricular tachyarrhythmia         2     1   46    1
Atrial fibrillation                 5     2    0   43

Table 3: Confusion matrix for the case of HCRF with 4 states

Beat class                     Normal   CHF   VT   AF
Normal                             48     1    1    0
Congestive heart failure            1    45    2    2
Ventricular tachyarrhythmia         1     0   49    0
Atrial fibrillation                 0     1    1   48

These accuracies are better than the accuracies reported in the related literature, which are on the order of 90%.
7. Conclusion
In this paper, the HCRF has been applied to ECG signals in order to classify heart beats into normal and abnormal beats. The HCRF, as a discriminative classifier, showed a considerable improvement over the HMM; the overall classification accuracy of the HCRF approach reached 95%.
8. References
[1] G. Bortolan, C. Brohet, and S. Fusaro, Possibilities of using neural networks for ECG classification, J. Electrocardiol. 29 (1996) 10–16.
[2] C. Papaloukas, D. I. Fotiadis, A. Likas, and L. K. Michalis, An ischemia detection method based on artificial neural networks, Artif. Intell. Med. 24 (2) (2002) 167–178.
[3] S. N. Yu and Y.-H. Chen, Electrocardiogram beat classification based on wavelet transformation and probabilistic neural network, Pattern Recognition Letters 28 (2007) 1142–1150.
[4] W. T. Cheng and K. L. Chan, Classification of electrocardiogram using hidden Markov models, Proc. 20th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 1, pp. 143–146, 1998.
[5] R. Andreão and J. Boudy, Combining wavelet transform and hidden Markov models for ECG segmentation, EURASIP Journal on Advances in Signal Processing, vol. 2007, issue 1, January 2007.
[6] E. D. Übeyli, ECG beats classification using multiclass support vector machines with error correcting output codes, Digital Signal Processing 17 (2007) 675–684.
[7] R. Degani, Computerized electrocardiogram diagnosis: fuzzy approach, Methods of Inform. Med. 31 (1992) 225–233.
[8] R. Degani, Computerized electrocardiogram diagnosis: fuzzy approach, Methods of Inform. Med. 31 (1992) 225–233.
[9] M. G. Tsipouras, D. I. Fotiadis, and D. Sideris, An arrhythmia classification system based on the RR-interval signal, Artif. Intell. Med. 33 (3) (2005) 237–250.
[10] M. W. Zimmerman and R. J. Povinelli, On improving the classification of myocardial ischemia using Holter ECG data, Comput. Cardiol. (2004), Chicago, IL, USA.
[11] R. Silipo and G. Bartolan, Neural and traditional techniques in diagnostic ECG classification, Proc. ICASSP, 1997.
[12] A. G. Ramakrishnan and S. Saha, ECG coding by wavelet-based linear prediction, IEEE Trans. Biomed. Eng. 44 (12) (1997) 1253–1261.
[13] C. Li, C. Zheng, and C. Tai, Detection of ECG characteristic points using wavelet transform, IEEE Trans. Biomed. Eng. 42 (1) (1995) 21–28.
[14] M. L. Hilton, Wavelet and wavelet packet compression of electrocardiograms, IEEE Trans. Biomed. Eng. 44 (5) (1997) 394–402.
[15] S.-N. Yu and K.-T. Chou, Integration of independent component analysis and neural networks for ECG beat classification, Expert Systems with Applications 34 (2008) 2841–2846.
[16] M. Engin, M. Fedakar, E. Z. Engin, and M. Korurek, Feature measurements of ECG beats based on statistical classifiers, Measurement 40 (2007) 904–912.
[17] P. S. Addison, et al., Evaluating arrhythmias in ECG signals using wavelet transforms, IEEE Eng. Med. Biol. Mag. 19 (5) (2000) 104–109.
[18] J. D. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, Proc. Intl. Conf. Machine Learning, vol. 18, pp. 282–289, 2001.
[19] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, Hidden conditional random fields for phone classification, Proc. INTERSPEECH, 2005.
[20] M. Mahajan, A. Gunawardana, and A. Acero, Training algorithms for hidden conditional random fields, Proc. IEEE ICASSP 2006, vol. 1, 14–19 May 2006.
[21] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell, Hidden conditional random fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, October 2007.
[22] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989).
[23] A. L. Goldberger, et al., PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (23) (2000) e215–e220.

Reda A. El-Khoribi, Associate Prof., Faculty of Computers and Information, Cairo University, Egypt. PhD in Electronics and Communications Engineering, Cairo University, 1998. MSC in Electronics and Communication Engineering, Faculty of Engineering, Cairo University, 1993. BSC in Electronics and Communication Engineering, Faculty of Engineering, Cairo University, 1988. He is currently in the Information Technology department of the Faculty of Computers and Information, Cairo University. His research interests include: image processing and computer vision, pattern classification, signal processing, computer graphics and artificial intelligence.