² Total Energy Expenditure
³ Proof of Concept
Figure 3.1: Example of biometric data and PPG measurement sections from the application, with bottom navigation bar
The dataset used to create the metabolic syndrome classifier is the SABPA dataset collected by researchers from the North-West University [46]. It contains over 1200 fields and measurements for 409 participants. Many of these fields are the same conditions measured with different techniques or different equipment. Since these different measurement techniques fall outside the scope of this study, only a single relevant measurement for each condition was included. Measurements that could not feasibly be collected on a smartphone and do not apply to any of the standard definitions of MetS (discussed in Section 4.2.2) were also excluded. If any of the applicable fields for a participant was blank, the participant was excluded. This left 24 applicable fields for 402 participants. A description of every field and the method used to measure it is provided in Appendix A.
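The exclusion rule for incomplete participants can be sketched as a simple complete-case filter. This is only an illustrative sketch: the field names and records below are hypothetical and do not come from the dataset, and treating empty strings as blank is an assumption.

```python
# Hypothetical field names and records, for illustration only.
APPLICABLE_FIELDS = ["age", "waist_cm", "sbp_mmHg"]

participants = [
    {"id": 1, "age": 45, "waist_cm": 92.0, "sbp_mmHg": 130},
    {"id": 2, "age": 38, "waist_cm": None, "sbp_mmHg": 121},   # blank waist
    {"id": 3, "age": 52, "waist_cm": 101.5, "sbp_mmHg": ""},   # blank SBP
    {"id": 4, "age": 29, "waist_cm": 78.0, "sbp_mmHg": 115},
]


def is_blank(value):
    """Treat None and empty or whitespace-only strings as blank (assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def complete_cases(records, fields):
    """Keep only the records in which every applicable field is filled in."""
    return [r for r in records if not any(is_blank(r.get(f)) for f in fields)]


kept = complete_cases(participants, APPLICABLE_FIELDS)
kept_ids = [r["id"] for r in kept]
```

Here participants 2 and 3 are dropped because one applicable field is blank in each, which mirrors how a participant with any blank applicable field was excluded.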
Figure 3.2: Correlation matrix for data from the SABPA dataset
The correlation matrix of all the applicable variables in the SABPA dataset can be seen in Figure 3.2. The correlation coefficient gives a good initial indication of the linear relationships that may exist between variables. A lack of correlation does not necessarily rule out a relationship, though; it may only indicate that any relationship that does exist is non-linear. Because MetS may have such a non-linear relationship with many of the available inputs, an ANN will be used to determine MetS, since neural networks are particularly well suited to non-linear classification.
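The point that a low correlation does not rule out a relationship can be illustrated with a small synthetic example. The sketch below uses made-up data, not values from the dataset, and assumes that the coefficient in question is the Pearson correlation coefficient: a perfectly linear relationship gives a coefficient of about 1, while a perfectly deterministic quadratic relationship over a symmetric range gives a coefficient of about 0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic inputs, symmetric around zero.
xs = [x / 10 for x in range(-20, 21)]
linear = [2 * x + 1 for x in xs]      # y depends linearly on x
quadratic = [x * x for x in xs]       # y depends on x, but not linearly

r_linear = pearson(xs, linear)        # close to 1
r_quadratic = pearson(xs, quadratic)  # close to 0 despite full dependence
```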
As can be seen from Figure 3.2, the variables that correlate the most with MetS are (in order of highest absolute correlation to lowest):
1. Waist circumference (0.51)
2. Body mass (0.47)
3. Serum triglyceride levels (0.45)
4. Serum HDL cholesterol (0.43)
5. Diastolic blood pressure (0.42)
6. BMI (0.39)
7. Systolic blood pressure (0.36)
8. Blood glucose (0.30)
As would be expected, the values that correlate most closely with MetS are those that are ordinarily used to make a diagnosis. All of these have an absolute correlation between 0.3 and 0.6, which is significant, but no single one is strong enough to support a valid diagnosis on its own. The high correlation of body mass and waist circumference is promising, since both of these values are provided by the user in the application.
Figure 3.3 shows the distribution of some of the features for patients with and without MetS. Comparing the two distributions for each feature may provide some insight into how the feature relates to MetS. From Figure 3.3, it can be seen that below the age of 40-45 the likelihood of having MetS is lower, but over the age of 45 the two distributions are very similar. BMI, HDL cholesterol and SBP all have close to normal distributions, with just a shift in mean between positive and negative MetS diagnoses. Specifically, the mean BMI is about 5 higher for a positive diagnosis, the average HDL is 0.6 mmol/L higher for a negative diagnosis, and SBP differs by about 10 mmHg. These differences are significant, but not to such an extent that a single feature could be used to accurately screen for MetS.
Figure 3.3: Distribution of metabolic syndrome risk factors
3.4 Preprocessing and feature selection
3.4.1 Main features
The initial features chosen to determine MetS were largely based on [15] and on the features that were available in the SABPA dataset. However, since the training data used in this dissertation is different from that used by [15], the results may also differ somewhat. The following features were chosen for the first model:
1. Age
2. Gender
3. BMI
4. Waist-to-height ratio
5. SBP
6. DBP
7. Heart rate
These features were chosen because they are all non-invasive measurements that can potentially be made by a smartphone or because they would already be known by a user.
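Two of these features, BMI and waist-to-height ratio, are derived from raw measurements rather than measured directly. A minimal sketch of how the seven-element input vector might be assembled is given below. The function and parameter names are hypothetical, and the encoding of gender as 1 for male and 0 for female is an assumption; the text only specifies that gender is a binary feature.

```python
def bmi(mass_kg, height_m):
    """Body mass index: mass in kg divided by height in metres squared."""
    return mass_kg / height_m ** 2


def waist_to_height_ratio(waist_cm, height_cm):
    """Waist circumference divided by height, both in the same unit."""
    return waist_cm / height_cm


def build_feature_vector(age, is_male, mass_kg, height_cm, waist_cm,
                         sbp, dbp, heart_rate):
    """Assemble the seven main features in the order listed above.

    Gender encoding (1 = male, 0 = female) is an assumption.
    """
    return [
        age,
        1 if is_male else 0,
        bmi(mass_kg, height_cm / 100),
        waist_to_height_ratio(waist_cm, height_cm),
        sbp,
        dbp,
        heart_rate,
    ]


# Made-up example values, not taken from the dataset.
example = build_feature_vector(age=40, is_male=True, mass_kg=81.0,
                               height_cm=180.0, waist_cm=90.0,
                               sbp=120, dbp=80, heart_rate=70)
```

These values are unscaled; scaling to the range 0 to 1 is a separate step (Section 3.4.4).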
3.4.2 Lifestyle factors
A second model was built that included all the other information that was available in the SABPA dataset that may be useful for determining MetS. The following features were selected and included with the previous features:
1. Medical history
2. Alcohol use
3. Smoking
4. Activity level
Neither medical history nor activity level is directly available in the dataset, so both required some preprocessing to determine. Activity level is discussed in Section 3.4.3. Medical history combines the conditions recorded in the SABPA dataset, such as diabetes or stroke, by means of an OR operation. The conditions that were included in the OR operation are:
Cardiovascular disease history
Stroke history
Myocardial infarction/cardiac events history
Kidney disease history
Atrial fibrillation
Use of anti-hypertensive drugs
Use of anti-diabetic drugs
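The OR operation over the conditions listed above can be sketched as follows. The flag names are hypothetical stand-ins for the corresponding dataset fields, and treating a missing flag as false is an assumption made only to keep the example short.

```python
# Hypothetical flag names standing in for the dataset fields listed above.
CONDITION_FLAGS = [
    "cardiovascular_disease",
    "stroke",
    "cardiac_event",
    "kidney_disease",
    "atrial_fibrillation",
    "anti_hypertensive_drugs",
    "anti_diabetic_drugs",
]


def medical_history(record):
    """Return 1 if any listed condition is present, otherwise 0 (logical OR)."""
    return int(any(bool(record.get(flag, False)) for flag in CONDITION_FLAGS))


healthy = {flag: False for flag in CONDITION_FLAGS}
on_medication = dict(healthy, anti_hypertensive_drugs=True)

history_healthy = medical_history(healthy)              # 0
history_on_medication = medical_history(on_medication)  # 1
```

The result is a single binary feature, so a participant with one condition and a participant with several are represented identically.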
Note that while some of these features (like medical history) are not necessarily lifestyle related, for the sake of simplicity the two models will from here on be distinguished by whether or not they include lifestyle factors.
3.4.3 Activity level
Several studies have shown the impact of exercise on MetS [7][3][17]. The SABPA dataset does not include any direct indicator of activity level; however, it does include TEE (Total Energy Expenditure). This is a measure of the total energy expended by an individual in a day, which was measured using an activity tracker. The BMR⁴, which is the number of calories required per day without any physical activity, can be estimated with the revised Harris-Benedict equations [66]:
Table 3.1: Harris-Benedict equations as revised by Mifflin and St Jeor
Men: BMR = (10 × weight in kg) + (6.25 × height in cm) − (5 × age in years) + 5
Women: BMR = (10 × weight in kg) + (6.25 × height in cm) − (5 × age in years) − 161
The activity level is then determined by looking at the ratio of TEE to BMR. The following classification is defined by the original Harris-Benedict equations [66]:
Little/no exercise: TEE/BMR = 1.2
Light exercise: TEE/BMR = 1.375
Moderate exercise (3-5 days/wk): TEE/BMR = 1.55
Very active (6-7 days/wk): TEE/BMR = 1.725
Extra active (very active & physical job): TEE/BMR = 1.9
With this classification it will be easy to link an activity level to users of the application.
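The calculation of the TEE/BMR ratio can be sketched as below. The sketch uses the standard form of the Mifflin-St Jeor equations, in which the age term is subtracted. The function names and the example person are hypothetical and serve only to make the arithmetic concrete.

```python
def mifflin_st_jeor_bmr(mass_kg, height_cm, age_years, is_male):
    """Estimated BMR in kcal/day (standard Mifflin-St Jeor form)."""
    base = 10 * mass_kg + 6.25 * height_cm - 5 * age_years
    return base + 5 if is_male else base - 161


def activity_ratio(tee_kcal, bmr_kcal):
    """Ratio of measured total energy expenditure to estimated BMR."""
    return tee_kcal / bmr_kcal


# Made-up example: a 40-year-old man, 80 kg, 180 cm.
# BMR = 800 + 1125 - 200 + 5 = 1730 kcal/day
bmr_man = mifflin_st_jeor_bmr(80, 180, 40, is_male=True)

# Made-up example: a 35-year-old woman, 65 kg, 165 cm.
# BMR = 650 + 1031.25 - 175 - 161 = 1345.25 kcal/day
bmr_woman = mifflin_st_jeor_bmr(65, 165, 35, is_male=False)

# A measured TEE of 2681.5 kcal/day for the man gives a ratio of 1.55,
# which corresponds to the "moderate exercise" level.
ratio_man = activity_ratio(2681.5, bmr_man)
```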
In the model the TEE/BMR ratio will be normalised, so that each option a user could fill in would represent the following in the model:
Little/no exercise: 0
Light exercise: 0.25
Moderate exercise (3-5 days/wk): 0.5
Very active (6-7 days/wk): 0.75
Extra active (very active & physical job): 1
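The five values above are consistent with a linear rescaling of the TEE/BMR ratio from the interval 1.2 to 1.9 onto the interval 0 to 1. A minimal sketch of that mapping follows. Clipping ratios that fall outside this interval is an assumption, since the handling of such values is not specified here.

```python
RATIO_MIN = 1.2  # little/no exercise
RATIO_MAX = 1.9  # extra active


def normalise_activity(ratio):
    """Linearly map a TEE/BMR ratio from [1.2, 1.9] onto [0, 1].

    Ratios outside the interval are clipped (assumption).
    """
    scaled = (ratio - RATIO_MIN) / (RATIO_MAX - RATIO_MIN)
    return min(1.0, max(0.0, scaled))


activity_levels = {
    "Little/no exercise": 1.2,
    "Light exercise": 1.375,
    "Moderate exercise": 1.55,
    "Very active": 1.725,
    "Extra active": 1.9,
}

# Rounded to two decimals to hide floating-point noise.
normalised = {name: round(normalise_activity(ratio), 2)
              for name, ratio in activity_levels.items()}
```

Because the five ratios are evenly spaced, the normalised values are evenly spaced as well, in steps of 0.25.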
3.4.4 Scaling
All of the features were scaled to lie between 0 and 1. Binary features take the value 0 or 1. Activity level has five possible values ranging from 0 to 1. Continuous (analog) values were scaled to lie between 0 and 1 using min-max scaling, with the minimum and maximum of each feature taken from the training set.
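Per-feature min-max scaling fitted on the training set can be sketched as follows. The function names and the data are hypothetical. Mapping a constant feature to 0 is an assumption added to avoid division by zero.

```python
def fit_min_max(train_rows):
    """Return a (min, max) pair for every feature column of the training set."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def transform(rows, bounds):
    """Scale every feature with the bounds learned from the training set."""
    scaled_rows = []
    for row in rows:
        scaled_rows.append([
            (value - low) / (high - low) if high > low else 0.0  # assumption
            for value, (low, high) in zip(row, bounds)
        ])
    return scaled_rows


# Made-up training data: columns are age (years) and SBP (mmHg).
train = [[30, 110], [50, 130], [70, 150]]
bounds = fit_min_max(train)

train_scaled = transform(train, bounds)
unseen_scaled = transform([[80, 100]], bounds)
```

Since the bounds come from the training set only, data outside the training range can fall outside the interval 0 to 1 after scaling, as `unseen_scaled` shows (1.25 and -0.25).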
A description of each feature's value range is given in Table 3.2.
⁴ Basal Metabolic Rate
Table 3.2: Value ranges of metabolic syndrome features
Feature Value range
Age 0.0-1.0
Gender 0 or 1
BMI 0.0-1.0
Waist-to-height ratio 0.0-1.0
SBP 0.0-1.0
DBP 0.0-1.0
Heart rate 0.0-1.0
Medical history 0 or 1
Alcohol use 0 or 1
Smoking 0 or 1
Activity level 0.0-1.0