A Survey on Human Activity Recognition using Wearable Sensors
Oscar D. Lara and Miguel A. Labrador
Abstract—Providing accurate and opportune information on people’s activities and behaviors is one of the most important tasks in pervasive computing. Innumerable applications can be visualized, for instance, in medical, security, entertainment, and tactical scenarios. Despite human activity recognition (HAR) having been an active field for more than a decade, there are still key aspects that, if addressed, would constitute a significant turn in the way people interact with mobile devices. This paper surveys the state of the art in HAR based on wearable sensors. A general architecture is first presented along with a description of the main components of any HAR system. We also propose a two-level taxonomy in accordance with the learning approach (either supervised or semi-supervised) and the response time (either offline or online). Then, the principal issues and challenges are discussed, as well as the main solutions to each of them.
Twenty-eight systems are qualitatively evaluated in terms of recognition performance, energy consumption, obtrusiveness, and flexibility, among others. Finally, we present some open problems and ideas that, due to their high relevance, should be addressed in future research.
Index Terms—Human-centric sensing; machine learning; mobile applications; context awareness.
I. INTRODUCTION
During the past decade, there has been an exceptional development of microelectronics and computer systems, enabling sensors and mobile devices with unprecedented characteristics. Their high computational power, small size, and low cost allow people to interact with the devices as part of their daily living. That was the genesis of Ubiquitous Sensing, an active research area with the main purpose of extracting knowledge from the data acquired by pervasive sensors [1]. Particularly, the recognition of human activities has become a task of high interest within the field, especially for medical, military, and security applications. For instance, patients with diabetes, obesity, or heart disease are often required to follow a well-defined exercise routine as part of their treatments [2].
Therefore, recognizing activities such as walking, running, or cycling becomes quite useful to provide feedback to the caregiver about the patient’s behavior. Likewise, patients with dementia and other mental pathologies could be monitored to detect abnormal activities and thereby prevent undesirable consequences [3]. In tactical scenarios, precise information on the soldiers’ activities along with their locations and health conditions, is highly beneficial for their performance and
Manuscript received 5 December 2011; revised 13 April 2012 and 22 October 2012.
The authors are with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620 (e-mail:
[email protected], [email protected]).
Digital Object Identifier 10.1109/SURV.2012.110112.00192
safety. Such information is also helpful to support decision making in both combat and training scenarios.
The first works on human activity recognition (HAR) date back to the late ’90s [4], [5]. However, there are still many issues that motivate the development of new techniques to improve the accuracy under more realistic conditions. Some of these challenges are (1) the selection of the attributes to be measured, (2) the construction of a portable, unobtrusive, and inexpensive data acquisition system, (3) the design of feature extraction and inference methods, (4) the collection of data under realistic conditions, (5) the flexibility to support new users without the need of re-training the system, and (6) the implementation in mobile devices meeting energy and processing requirements [6].
The recognition of human activities has been approached in two different ways, namely using external and wearable sensors. In the former, the devices are fixed in predetermined points of interest, so the inference of activities entirely depends on the voluntary interaction of the users with the sensors. In the latter, the devices are attached to the user.
Intelligent homes [7]–[11] are a typical example of external sensing. These systems are able to recognize fairly complex activities (e.g., eating, taking a shower, washing dishes, etc.), because they rely on data from a number of sensors placed in target objects which people are supposed to interact with (e.g., stove, faucet, washing machine, etc.). Nonetheless, nothing can be done if the user is out of the reach of the sensors or they perform activities that do not require interaction with them. Additionally, the installation and maintenance of the sensors usually entail high costs.
Cameras have also been employed as external sensors for HAR. In fact, the recognition of activities and gestures from video sequences has been the focus of extensive research [12]–
[15]. This is especially suitable for security (e.g., intrusion detection) and interactive applications. A remarkable and commercially available example is the Kinect [16] developed by Microsoft, which allows the user to interact with the game by means of gestures, without any controller device. Nevertheless, video sequences certainly have some issues in HAR. The first one is privacy, as not everyone is willing to be permanently monitored and recorded by cameras.
The second one is pervasiveness because video recording devices are difficult to attach to target individuals in order to obtain images of their entire body during daily living activities.
The monitored individuals should then stay within a perimeter defined by the position and the capabilities of the camera(s).
The last issue would be complexity, since video processing techniques are computationally expensive. This fact hinders the scalability of real-time HAR systems.
1553-877X/13/$31.00 © 2013 IEEE
TABLE I
TYPES OF ACTIVITIES RECOGNIZED BY STATE-OF-THE-ART HAR SYSTEMS.
Group Activities
Ambulation Walking, running, sitting, standing still, lying, climbing stairs, descending stairs, riding escalator, and riding elevator.
Transportation Riding a bus, cycling, and driving.
Phone usage Text messaging, making a call.
Daily activities Eating, drinking, working at the PC, watching TV, reading, brushing teeth, stretching, scrubbing, and vacuuming.
Exercise/fitness Rowing, lifting weights, spinning, Nordic walking, and doing push ups.
Military Crawling, kneeling, situation assessment, and opening a door.
Upper body Chewing, speaking, swallowing, sighing, and moving the head.
The aforementioned limitations motivate the use of wearable sensors in HAR. Most of the measured attributes are related to the user’s movement (e.g., using accelerometers or GPS), environmental variables (e.g., temperature and humidity), or physiological signals (e.g., heart rate or electrocardiogram). These data are naturally indexed over the time dimension, allowing us to define the human activity recognition problem as follows:
Definition 1 (HAR problem (HARP)): Given a set S = {S_0, ..., S_{k-1}} of k time series, each one from a particular measured attribute, and all defined within time interval I = [t_α, t_ω], the goal is to find a temporal partition I_0, ..., I_{r-1} of I, based on the data in S, and a set of labels representing the activity performed during each interval I_j (e.g., sitting, walking, etc.). This implies that the time intervals I_j are consecutive, non-empty, and non-overlapping, and that $\bigcup_{j=0}^{r-1} I_j = I$.
This definition is valid assuming that activities are not simultaneous, i.e., a person does not walk and run at the same time. A more general case with overlapping activities will be considered in Section VI. Note that the HARP is not feasible to solve deterministically: the number of combinations of attribute values and activities can be very large (or even infinite), and finding transition points becomes hard because the duration of each activity is generally unknown. Therefore, machine learning tools are widely used to recognize activities.
A relaxed version of the problem is then introduced by dividing the time series into fixed-length time windows.
Definition 2 (Relaxed HAR problem): Given (1) a set W = {W_0, ..., W_{m-1}} of m equally sized time windows, totally or partially labeled, such that each W_i contains a set of time series S_i = {S_{i,0}, ..., S_{i,k-1}} from each of the k measured attributes, and (2) a set A = {a_0, ..., a_{n-1}} of activity labels, the goal is to find a mapping function f : S_i → A that can be evaluated for all possible values of S_i, such that f(S_i) is as similar as possible to the actual activity performed during W_i.
Notice that this relaxation introduces some error to the model during transition windows, since a person might perform more than one activity within a single time window. However, the number of transitions is expected to be much smaller than the total number of time windows, which makes the relaxation error insignificant for most applications.
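To make the relaxation concrete, the splitting of a multivariate time series into the windows W_0, ..., W_{m-1} of Definition 2 can be sketched as follows. This fragment is illustrative only; the window size, the sample layout, and the decision to drop a trailing partial window are assumptions of the sketch, not choices prescribed by the surveyed systems.

```python
def make_windows(samples, window_size):
    """Split a list of k-attribute samples into equally sized,
    non-overlapping windows W_0, ..., W_{m-1} (Definition 2).
    A trailing partial window is dropped (an assumption)."""
    m = len(samples) // window_size
    return [samples[i * window_size:(i + 1) * window_size]
            for i in range(m)]

# Toy example: 9 samples of two attributes (ax, ay) split into
# windows of 4 samples each; the ninth sample is discarded.
samples = [(t, -t) for t in range(9)]
windows = make_windows(samples, window_size=4)
assert len(windows) == 2 and len(windows[0]) == 4
```

Each resulting window is then the unit that gets labeled with an activity, rather than the individual samples.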
(Figure 1 diagram: data collection of acceleration, environmental, physiological, and location signals; statistical and structural feature extraction; and learning and inference. The training path produces a recognition model from the extracted training feature set, while the testing path evaluates the feature set extracted from data collected during a time window to output the recognized activity.)
Fig. 1. General data flow for training and testing HAR systems based on wearable sensors.
The design of any HAR system depends on the activities to be recognized. In fact, changing the activity set A immediately turns a given HARP into a completely different problem. From the literature, seven groups of activities can be distinguished.
These groups and the individual activities that belong to each group are summarized in Table I.
This paper surveys the state of the art in HAR making use of wearable sensors. Section II presents the general components of any HAR system. Section III introduces the main design issues in recognizing activities and the most important solutions to each of them. Section IV describes the principal techniques applied in HAR, covering feature extraction and learning methods. Section V shows a qualitative evaluation of state-of-the-art HAR systems. Finally, Section VI introduces some of the most relevant open problems in the field, providing directions for future research.
II. GENERAL STRUCTURE OF HAR SYSTEMS
Similar to other machine learning applications, activity recognition requires two stages, i.e., training and testing (the latter also called evaluation). Figure 1 illustrates the common phases involved in these two processes. The training stage initially requires a time series dataset of measured attributes from individuals performing each activity. The time series are split into time windows on which feature extraction is applied, thereby filtering relevant information in the raw signals. Later, learning methods are used to generate an activity recognition model from the dataset of extracted features. Likewise, for testing, data are collected during a time window and used to extract features. Such a feature set is evaluated in the previously trained learning model, generating a predicted activity label.
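The training and testing flow of Figure 1 can be sketched in a few lines. The feature extractor and the nearest-centroid classifier below are deliberately minimal placeholders of our own choosing, not methods used by any particular surveyed system.

```python
import math

def extract_features(window):
    """Minimal feature vector (mean and a spread measure)
    for one window of scalar samples."""
    n = len(window)
    mean = sum(window) / n
    spread = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    return (mean, spread)

def train(labeled_windows):
    """Training: average the feature vectors per activity label,
    producing one centroid per activity (the 'model')."""
    sums, counts = {}, {}
    for window, label in labeled_windows:
        f = extract_features(window)
        s = sums.setdefault(label, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(v / counts[lab] for v in s) for lab, s in sums.items()}

def predict(model, window):
    """Testing: assign the label of the nearest centroid."""
    f = extract_features(window)
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(model[lab], f)))

model = train([([0.1, 0.0, 0.1, 0.0], "sitting"),
               ([2.0, -2.0, 2.1, -1.9], "running")])
assert predict(model, [1.8, -2.2, 2.0, -2.0]) == "running"
```

Real systems replace the centroid rule with the learning methods discussed in Section IV, but the train/extract/predict separation is the same.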
We have also identified a generic data acquisition architecture for HAR systems, as shown in Figure 2. In the first place, wearable sensors are attached to the person’s body to measure attributes of interest such as motion [17],
(Figure 2 diagram: wearable sensors (accelerometer, GPS, heart rate monitor, ECG, light sensor, thermometer, etc.) connect to an integration device (cellphone, PDA, or laptop); communication is carried out via UDP/IP or TCP/IP over Wi-Fi or the cellular network; storage and inference take place either locally on the integration device or remotely on the server.)
Fig. 2. Generic data acquisition architecture for Human Activity Recognition.
location [18], temperature [19], ECG [20], among others.
These sensors should communicate with an integration device (ID), which can be a cellphone [21], [22], a PDA [19], a laptop [20], [23], or a customized embedded system [24], [25]. The main purpose of the ID is to preprocess the data received from the sensors and, in some cases, send them to an application server for real time monitoring, visualization, and/or analysis [20], [26]. The communication protocol might be UDP/IP or TCP/IP, according to the desired level of reliability.
Notice that not all of these components are necessarily implemented in every HAR system. In [27]–[29], the data are collected offline, so there is neither communication nor server processing. Other systems incorporate sensors within the ID [30]–[32], or carry out the inference process directly on it [31], [33]. The presented architecture is rather general, and the systems surveyed in this paper are particular instances of it.
III. DESIGN ISSUES
We have distinguished seven main issues pertaining to human activity recognition, namely, (1) selection of attributes and sensors, (2) obtrusiveness, (3) data collection protocol, (4) recognition performance, (5) energy consumption, (6) processing, and (7) flexibility. The main aspects and solutions related to each of them are analyzed next. In Section V, the systems surveyed in this paper will be evaluated in accordance with these issues.
A. Selection of attributes and sensors
Four groups of attributes are measured using wearable sensors in a HAR context: environmental attributes, acceleration, location, and physiological signals.
1) Environmental attributes: These attributes, such as temperature, humidity, audio level, etc., are intended to provide context information describing the individual’s surroundings.
If the audio level and light intensity are fairly low, for instance, the subject may be sleeping. Various existing systems have utilized microphones, light sensors, humidity sensors, and thermometers, among others [19], [25]. Those sensors alone, though, might not provide sufficient information, as individuals can perform each activity under diverse contextual conditions in terms of weather, audio loudness, or illumination.
Therefore, environmental sensors are generally accompanied by accelerometers and other sensors [3].
2) Acceleration: Triaxial accelerometers are perhaps the most broadly used sensors to recognize ambulation activities (e.g., walking, running, lying, etc.) [27]–[29], [34]–[36]. Accelerometers are inexpensive, require relatively low power [37], and are embedded in most of today’s cellular phones. Several papers have reported high recognition accuracy: 92.25% [29], 95% [38], 97% [35], and up to 98% [39], under different evaluation methodologies. However, other daily activities such as eating, working at a computer, or brushing teeth are confusing from the acceleration point of view. For instance, eating might be confused with brushing teeth due to arm motion [25]. The impact of the sensor specifications has also been analyzed. In fact, Maurer et al. [25] studied the behavior of the recognition accuracy as a function of the accelerometer sampling rate (which lies between 10 Hz [3] and 100 Hz [24] in the surveyed systems). Interestingly, they found that no significant gain in accuracy is achieved above 20 Hz for ambulation activities. In addition, the measurement range of the accelerometers varies from ±2g [25] up to ±6g [39], yet ±2g was shown to be sufficient to recognize ambulation activities [25]. The placement of the accelerometer is another important point of discussion: He et al. [29] found that the best place to wear the accelerometer is inside the trousers pocket. Instead, other studies suggest that the accelerometer should be placed in a bag carried by the user [25], on the belt [40], or on the dominant wrist [23]. In the end, the optimal position for the accelerometer depends on the application and the type of activities to be recognized. For instance, an accelerometer on the wrist may not be appropriate to recognize ambulation activities, since accidental arm movements could generate incorrect predictions. On the other hand, in order to recognize an activity such as working at the computer, an accelerometer on the chest would not provide sufficient information.
3) Location: The Global Positioning System (GPS) enables all sorts of location-based services. Current cellular phones are equipped with GPS devices, making this sensor very convenient for context-aware applications, including the recognition of the user’s transportation mode [37]. The place where the user is can also be helpful to infer their activity using ontological reasoning [31]. As an example, if a person is at a park, they are probably not brushing their teeth but might be running or walking. Information about places can be easily obtained by means of the Google Places Web Service [41], among other tools. However, GPS devices do not work well indoors, and they are relatively expensive in terms of energy consumption, especially in real-time tracking applications [37]. For those reasons, this sensor is usually employed along with accelerometers [31]. Finally, location data raise privacy issues because users are not always willing to be tracked. Encryption, obfuscation, and anonymization are some of the techniques available to ensure privacy in location data [42]–[44].
4) Physiological signals: Vital signs data (e.g., heart rate, respiration rate, skin temperature, skin conductivity, ECG, etc.) have also been considered in a few works [3]. Tapia et al. [23] proposed an activity recognition system that combines data from five triaxial accelerometers and a heart rate monitor.
However, they concluded that the heart rate is not useful in a HAR context because, after performing physically demanding activities (e.g., running), the heart rate remains at a high level for a while, even if the individual is lying or sitting. In a previous study [26], we showed that, by means of structural feature extraction, vital signs can be exploited to improve recognition accuracy. Now, in order to measure physiological signals, additional sensors would be required, thereby increasing the system cost and introducing obtrusiveness [19]. Also, these sensors generally use wireless communication, which entails higher energy expenditures.
B. Obtrusiveness
To be successful in practice, HAR systems should not require the user to wear many sensors or interact too often with the application. On the other hand, the more sources of data available, the richer the information that can be extracted from the measured attributes. There are systems which require the user to wear four or more accelerometers [23], [27], [45], or carry a heavy rucksack with recording devices [19]. These configurations may be uncomfortable, invasive, expensive, and hence not suitable for activity recognition in practice. Other systems are able to work with rather unobtrusive hardware. For instance, a sensing platform that can be worn as a sport watch is presented in [25]; Centinela [26] only requires a strap placed on the chest and a cellular phone. Finally, the systems introduced in [33], [37] recognize activities with a cellular phone only.
Minimizing the number of sensors required to recognize activities is beneficial not only for comfort, but also to reduce complexity and energy consumption, as a smaller amount of data has to be processed. Maurer et al. [46] performed an interesting study with accelerometers and light sensors. They explored different subsets of features and sensors, as well as different sensor placements. They concluded that all sensors should be used together in order to achieve the maximum accuracy level. Bao et al. [27] carried out another study on the same subject, placing accelerometers on the individual’s hip, wrist, arm, ankle, thigh, and combinations thereof. Their conclusions suggest that only two accelerometers (i.e., either wrist and thigh or wrist and hip) are sufficient to recognize ambulation and other daily activities.
C. Data collection protocol
The procedure followed by the individuals while collecting data is critical in any HAR study. In 1999, Foerster et al. [5] demonstrated 95.6% accuracy for ambulation activities in a controlled data collection experiment, but in a natural environment (i.e., outside of the laboratory), the accuracy dropped to 66%! The number of individuals and their physical characteristics are also crucial factors in any HAR study.
A comprehensive study should consider a large number of individuals with diverse characteristics in terms of gender, age, height, weight, and health conditions. This is with the purpose of ensuring flexibility to support new users without the need of collecting additional training data.
D. Recognition performance
The performance of a HAR system depends on several aspects, such as (1) the activity set, (2) the quality of the training data, (3) the feature extraction method, and (4) the learning algorithm. In the first place, each set of activities brings a totally different pattern recognition problem. For example, discriminating among walking, running, and standing still [47] turns out to be much easier than incorporating more complex activities such as watching TV, eating, ascending, and descending [27]. Secondly, there should be a sufficient amount of training data, which should also be similar to the expected testing data. Finally, a comparative evaluation of several learning methods is desirable, as each dataset exhibits distinct characteristics that can be either beneficial or detrimental for a particular method. Such an interrelationship between datasets and learning methods can be very hard to analyze theoretically, which accentuates the need for an experimental study. In order to quantitatively understand the recognition performance, some standard metrics are used, e.g., accuracy, recall, precision, F-measure, the Kappa statistic, and ROC curves.
These metrics will be discussed in Section IV.
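As a small illustration of these metrics, per-class precision, recall, and F-measure can be computed directly from counts of true positives, false positives, and false negatives; the counts used below are invented for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F-measure from counts of
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# E.g., for the class "walking": 8 windows correctly labeled walking,
# 2 other windows mislabeled as walking, 2 walking windows missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
assert p == 0.8 and r == 0.8 and abs(f - 0.8) < 1e-9
```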
E. Energy consumption
Context-aware applications rely on mobile devices, such as sensors and cellular phones, which are generally energy constrained. In most scenarios, extending the battery life is a desirable feature, especially for medical and military applications that are compelled to deliver critical information.
Surprisingly, most HAR schemes do not formally analyze energy expenditures, which are mainly due to processing, communication, and visualization tasks. Communication is often the most expensive operation, so the designer should minimize the amount of transmitted data. In most cases, short-range wireless networks (e.g., Bluetooth or Wi-Fi) should be preferred over long-range networks (e.g., the cellular network or WiMAX), as the former require lower power. Some typical energy-saving mechanisms are data aggregation and compression, yet they involve additional computations that may affect the application performance. Another approach is to carry out feature extraction and classification in the integration device, so that raw signals do not have to be continuously sent to the server [31], [48]. This will be discussed in Section III-F. Finally, since not all sensors may be necessary simultaneously, turning off some of them or reducing their sampling/transmission rate is very convenient to save energy. For example, if the user’s activity is sitting or standing still, the GPS sensor may be turned off [37].
F. Processing
Another important point of discussion is where the recognition task should be performed: in the server or in the integration device. On one hand, a server is expected to have vast processing, storage, and energy capabilities, allowing it to incorporate more complex methods and models. On the other hand, a HAR system running on a mobile device substantially reduces energy expenditures, as raw data do not have to be continuously sent to a server for processing.
The system would also become more robust and responsive because it would not depend on unreliable wireless communication links, which may be unavailable or error prone; this is particularly important for medical or military applications that require real-time decision making. Finally, a mobile HAR system would be more scalable, since the server load would be alleviated by the locally performed feature extraction and classification computations. However, implementing activity recognition in mobile devices is challenging because they are still constrained in terms of processing, storage, and energy. Hence, feature extraction and learning methods should be carefully chosen to guarantee a reasonable response time and battery life. For instance, classification algorithms such as Instance Based Learning [32] and Bagging [49] are very expensive in their evaluation phase, which makes them inconvenient for mobile HAR.
G. Flexibility
There is an open debate on the design of any activity recognition model. Some authors claim that, as people perform activities in a different manner (due to age, gender, weight, and so on), a specific recognition model should be built for each individual [30]. This implies that the system should be re-trained for each new user. Other studies [26] rather emphasize the need for a monolithic recognition model, flexible enough to work with different users. Consequently, two types of analyses have been proposed to evaluate activity recognition systems:
subject-dependent and subject-independent evaluations [23].
In the first one, a classifier is trained and tested for each individual with his/her own data, and the average accuracy over all subjects is computed. In the second one, a single classifier is built for all individuals using cross validation or leave-one-individual-out analysis. It is worth highlighting that, in some cases, it would not be convenient to train the system for each new user, especially when (1) there are too many activities; (2) some activities are not desirable for the subject to carry out (e.g., falling downstairs); or (3) the subject would not cooperate with the data collection process (e.g., patients with dementia and other mental pathologies).
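A subject-independent (leave-one-individual-out) evaluation can be sketched as follows: each person's data is held out in turn, so the classifier is always tested on someone it has never seen. The subject identifiers are hypothetical, and the classifier training itself is omitted from the sketch.

```python
def leave_one_subject_out(subjects):
    """Yield (train_ids, test_id) splits for subject-independent
    evaluation: the model trained on train_ids never sees data
    from the held-out test_id."""
    for test_id in subjects:
        train_ids = [s for s in subjects if s != test_id]
        yield train_ids, test_id

splits = list(leave_one_subject_out(["s1", "s2", "s3"]))
assert splits[0] == (["s2", "s3"], "s1")
assert len(splits) == 3
```

The average accuracy over the held-out subjects then estimates how well the model generalizes to new users without re-training.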
On the other hand, an elderly lady would surely walk quite differently than a ten-year-old boy, thereby challenging a single model to recognize activities regardless of the subject’s characteristics. The dichotomy between monolithic and individual recognition models can be addressed by creating groups of users with similar characteristics. Additional design
considerations related to this matter will be discussed in Section VI.
IV. ACTIVITY RECOGNITION METHODS
In Section II, we have seen that, to enable the recognition of human activities, raw data first have to pass through the process of feature extraction. Then, the recognition model is built from the set of feature instances by means of machine learning techniques. Once the model is trained, unseen instances (i.e., time windows) can be evaluated in the recognition model, yielding a prediction of the performed activity. Next, the most noticeable approaches in feature extraction and learning will be covered.
A. Feature extraction
Human activities are performed during relatively long periods of time (on the order of seconds or minutes) compared to the sensors’ sampling interval (sampling rates can be up to 250 Hz).
Besides, a single sample at a specific time instant (e.g., the Y-axis acceleration is 2.5g or the heart rate is 130 bpm) does not provide sufficient information to describe the performed activity. Thus, activities need to be recognized on a time-window basis rather than on a sample basis. Now, the question is: how do we compare two given time windows? It would be nearly impossible for the signals to be exactly identical, even if they come from the same subject performing the same activity.
This is the main motivation for applying feature extraction (FE) methodologies to each time window: filtering relevant information and obtaining quantitative measures that allow signals to be compared.
In general, two approaches have been proposed to extract features from time series data: statistical and structural [50].
Statistical methods, such as the Fourier transform and the Wavelet transform, use quantitative characteristics of the data to extract features, whereas structural approaches take into account the interrelationship among data. The criterion to choose either of these methods is certainly subject to the nature of the given signal.
Figure 3 displays the process to transform the raw time series dataset (which can be from acceleration, environmental variables, or vital signs) into a set of feature vectors. Each instance in the processed dataset corresponds to the feature vector extracted from all the signals within a time window.
Most of the approaches surveyed in this paper adhere to this mapping.
Next, we will cover the most common FE techniques for each of the measured attributes, i.e., acceleration, environmental signals, and vital signs. GPS data are not considered in this section since they are mostly used to compute the speed [19], [37] or to include some knowledge about the place where the activity is being performed [31].
1) Acceleration: Acceleration signals (see Figure 4) are highly fluctuating and oscillatory, which makes it difficult to recognize the underlying patterns using their raw values.
Existing HAR systems based on accelerometer data employ statistical feature extraction and, in most cases, either time- or frequency-domain features. Discrete Cosine Transform (DCT) and Principal Component Analysis (PCA) have
(Figure 3 diagram: example tables of raw acceleration, physiological, and environmental samples within a time window, mapped to a processed training dataset of feature vectors labeled with activities such as running and upstairs.)
Fig. 3. An example of the mapping from the raw dataset to the feature dataset. w is the window index; s_i is the sampling rate for the group of sensors i, where sensors in the same group have the same sampling rate; f_i denotes each of the extracted features. Each instance in the processed dataset corresponds to the set of features computed from an entire window in the raw dataset.
also been applied with promising results [35], as well as autoregressive model coefficients [29]. All these techniques are conceived to handle the high variability inherent to acceleration signals. Table II summarizes the feature extraction methods for acceleration signals. The definitions of some of the most widely used features [36] are listed below for a given signal Y = {y_1, ..., y_n}.
• Central tendency measures, such as the arithmetic mean ȳ and the root mean square (RMS) (Equations 1 and 2).
• Dispersion metrics, such as the standard deviation σ_y, the variance σ_y², and the mean absolute deviation (MAD) (Equations 3, 4, and 5).
• Domain-transform measures, such as the energy, where F_i is the i-th component of the Fourier Transform of Y (Equation 6).

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \quad (1)

\mathrm{RMS}(Y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2} \quad (2)

\sigma_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (3)

\sigma_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad (4)
TABLE II
SUMMARY OF FEATURE EXTRACTION METHODS FOR ACCELERATION SIGNALS.

Group | Methods
Time domain | Mean, standard deviation, variance, interquartile range (IQR), mean absolute deviation (MAD), correlation between axes, entropy, and kurtosis [19], [23]–[25], [27], [36], [45].
Frequency domain | Fourier Transform (FT) [27], [36] and Discrete Cosine Transform (DCT) [51].
Others | Principal Component Analysis (PCA) [35], [51], Linear Discriminant Analysis (LDA) [36], Autoregressive Model (AR) [29], and HAAR filters [28].
TABLE III
SUMMARY OF FEATURE EXTRACTION METHODS FOR ENVIRONMENTAL VARIABLES.

Attribute | Features
Altitude | Time-domain [19]
Audio | Speech recognizer [53]
Barometric pressure | Time-domain and frequency-domain [52]
Humidity | Time-domain [19]
Light | Time-domain [46] and frequency-domain [19]
Temperature | Time-domain [19]
\mathrm{MAD}(Y) = \frac{1}{n-1} \sum_{i=1}^{n} |y_i - \bar{y}| \qquad (5)

\mathrm{Energy}(Y) = \frac{\sum_{i=1}^{n} F_i^2}{n} \qquad (6)
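As a concrete illustration, Equations 1 through 6 translate directly into a few lines of NumPy. This is a sketch rather than any surveyed system's actual pipeline; the function name and the returned dictionary layout are our own.

```python
import numpy as np

def statistical_features(y):
    """Compute the statistical features of Equations 1-6 for one window.

    Follows the paper's definitions: (n-1) normalization for the
    dispersion metrics and FFT-component energy for the domain transform.
    """
    n = len(y)
    mean = np.sum(y) / n                              # Eq. 1, arithmetic mean
    rms = np.sqrt(np.sum(y ** 2) / n)                 # Eq. 2, root mean square
    std = np.sqrt(np.sum((y - mean) ** 2) / (n - 1))  # Eq. 3, standard deviation
    var = np.sum((y - mean) ** 2) / (n - 1)           # Eq. 4, variance
    mad = np.sum(np.abs(y - mean)) / (n - 1)          # Eq. 5, mean absolute deviation
    F = np.fft.fft(y)                                 # Fourier components F_i
    energy = np.sum(np.abs(F) ** 2) / n               # Eq. 6, spectral energy
    return {"mean": mean, "rms": rms, "std": std,
            "var": var, "mad": mad, "energy": energy}
```

In a full HAR pipeline this function would be applied per axis and per time window, and the resulting values concatenated into one feature vector per instance.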
2) Environmental variables: Environmental attributes, along with acceleration signals, have been used to enrich context awareness. For instance, the values from air pressure and light intensity are helpful to determine whether the individual is outdoors or indoors [52]. Also, audio signals are useful to conclude that the user is having a conversation rather than listening to music [19]. Table III summarizes the feature extraction methods for environmental attributes.
3) Vital signs: The very first works that explored vital sign data with the aim of recognizing human activities applied statistical feature extraction. In [23], the authors computed the number of heart beats above the resting heart rate value as the only feature. Instead, Parkka et al. [19] calculated time domain features for heart rate, respiration effort, SaO2, ECG, and skin temperature. Nevertheless, the signal's shape is not described by these features. Consider the situation shown in Figure 5. A heart rate signal S(t) for an individual who was walking is shown with a bold line, and the same signal in reverse temporal order, S(−t), is displayed with a thin line.
Notice that most time domain and frequency domain features (e.g., mean, variance, and energy) are identical for both signals while they may represent different activities. This is the main motivation for applying structural feature extraction.
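This effect is easy to verify numerically. In the sketch below, a hypothetical ramp-like sequence stands in for the heart rate trace of Figure 5; its time-reversed copy yields exactly the same mean, variance, and spectral energy even though the two signals differ sample by sample:

```python
import numpy as np

# A stand-in for the walking heart-rate trace of Fig. 5 and its time reversal.
s = np.array([70.0, 75.0, 82.0, 90.0, 97.0, 103.0, 108.0, 112.0])
s_rev = s[::-1]

# Mean, variance, and FFT energy cannot tell the two signals apart...
assert np.isclose(s.mean(), s_rev.mean())
assert np.isclose(s.var(), s_rev.var())
assert np.isclose(np.sum(np.abs(np.fft.fft(s)) ** 2),
                  np.sum(np.abs(np.fft.fft(s_rev)) ** 2))

# ...even though the signals themselves clearly differ.
assert not np.allclose(s, s_rev)
```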
In a time series context, structure detectors are intended to describe the morphological interrelationship among data.
Given a time series Y(t), a structure detector implements a function f(Y(t)) = Ŷ(t) such that Ŷ(t) represents the structure of Y(t) as an approximation. In order to measure the goodness of fit of Ŷ(t) to Y(t), the sum of squared errors (SSE) is calculated as follows:
Fig. 4. Acceleration signals for different activities.
Fig. 5. Heart rate signal for walking (bold) and the flipped signal (thin).
\mathrm{SSE} = \sum_{t} \left( Y(t) - \hat{Y}(t) \right)^2 \qquad (7)
The extracted features are the parameters of Ŷ(t) which, of course, depend on the nature of the function. Table IV contains some typical functions implemented by structure detectors.
The authors found that polynomials are the functions that best fit physiological signals such as heart rate, respiration rate, breath amplitude, and skin temperature [26]. The degree of the polynomial should be chosen based upon the number of samples, to avoid overfitting due to Runge's phenomenon [54].
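A polynomial structure detector can be sketched in a few lines, with NumPy's least-squares polynomial fit standing in for the LS procedure; the function name and parameter defaults here are illustrative, not the implementation of [26].

```python
import numpy as np

def polynomial_features(t, y, degree=3):
    """Fit Y(t) with a low-degree polynomial via least squares and return
    the coefficients as structural features, together with the SSE (Eq. 7).

    A low degree is deliberate: high-degree fits on few samples oscillate
    at the window edges (Runge's phenomenon).
    """
    coeffs = np.polyfit(t, y, degree)   # least-squares coefficient estimate
    y_hat = np.polyval(coeffs, t)       # the fitted structure, Y^(t)
    sse = np.sum((y - y_hat) ** 2)      # goodness of fit, Eq. 7
    return coeffs, sse
```

For a heart rate window, the returned coefficients {a0, ..., an-1} would become the structural features of that instance, and the SSE indicates how well the chosen function family matches the signal.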
4) Selection of the window length: In accordance with Definition 2, dividing the measured time series into time windows is a convenient way to relax the HAR problem. A key design factor is, therefore, the selection of the window length, because the computational complexity of any feature extraction (FE) method depends on the number of samples. Rather short windows may speed up FE, but entail higher overhead because the recognition algorithm is triggered more frequently. Besides, short time windows may not provide sufficient information to fully describe the performed activity.
Conversely, if the windows are too long, there might be more than one activity within a single time window [37]. Different window lengths have been used in the literature: 0.08 s [30], 1 s [37], 1.5 s [55], 3 s [56], 5 s [51], 7 s [57], 12 s [26], and up to 30 s [23]. Of course, this decision is conditioned on the activities to be recognized and the measured attributes. The heart rate signal, for instance, required 30 s time windows in [23], whereas for activities such as swallowing, 1.5 s time windows were employed.
TABLE IV
COMMON FUNCTIONS IMPLEMENTED BY STRUCTURE DETECTORS.

Linear: F(t) = mt + b; parameters {m, b}
Polynomial: F(t) = a_0 + a_1 t + ... + a_{n-1} t^{n-1}; parameters {a_0, ..., a_{n-1}}
Exponential: F(t) = a|b|^t + c; parameters {a, b, c}
Sinusoidal: F(t) = a sin(t + b) + c; parameters {a, b, c}
Time windows can also be either overlapping [23], [26], [27], [58] or disjoint [37], [51], [55], [56]. Overlapping time windows are intended to handle transitions more accurately, although with small non-overlapping time windows, misclassifications due to transitions are negligible.
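Both windowing schemes can be expressed with one segmentation routine; the sketch below is a minimal illustration (the function and parameter names are ours), where overlap = 0.0 yields disjoint windows and, e.g., overlap = 0.5 makes each window start halfway through the previous one.

```python
import numpy as np

def sliding_windows(signal, sample_rate, win_seconds, overlap=0.0):
    """Split a 1-D signal into fixed-length windows.

    overlap = 0.0 gives disjoint windows; a positive overlap shifts each
    window start back, which helps capture transitions between activities.
    """
    win = int(win_seconds * sample_rate)       # samples per window
    step = max(1, int(win * (1.0 - overlap)))  # hop between window starts
    return [signal[i:i + win]
            for i in range(0, len(signal) - win + 1, step)]
```

Each returned window would then be passed to the feature extraction stage described above, producing one instance per window.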
5) Feature selection: Some features in the processed dataset might contain redundant or irrelevant information that can negatively affect the recognition accuracy. Implementing techniques to select the most appropriate features is therefore a recommended practice to reduce computations and simplify the learning models. The Bayesian Information Criterion (BIC) and the Minimum Description Length (MDL) [49] have been widely used for general machine learning problems. In HAR, a common method is Minimum Redundancy and Maximum Relevance (MRMR) [59], utilized in [20]. In that work, the minimum mutual information between features is used as the criterion for minimum redundancy, and the maximal mutual information between the classes and features is used as the criterion for maximum relevance. In contrast, Maurer et al. [46] applied a Correlation-based Feature Selection (CFS) approach [60], taking advantage of the fact that this method is built into WEKA [61]. CFS works under the assumption that features should be highly correlated with the given class but uncorrelated with each other. Iterative approaches have also been evaluated to select features. Since the number of feature subsets is O(2^n) for n features, evaluating all possible subsets is not computationally feasible. Hence, metaheuristic methods such as multiobjective evolutionary algorithms have been employed to explore the space of possible feature subsets [62].
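CFS's actual merit function is more elaborate than this, but its guiding idea, favoring features correlated with the class while penalizing correlation with already-selected features, can be sketched with a simple greedy loop. All names and the scoring heuristic below are illustrative, not WEKA's implementation.

```python
import numpy as np

def greedy_correlation_selection(X, y, k=5, redundancy_weight=1.0):
    """Greedily pick k features. Each candidate is scored by its absolute
    correlation with the class minus its mean absolute correlation with
    the features already selected (a CFS-like relevance/redundancy tradeoff)."""
    n_features = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_features)])
    selected = []
    while len(selected) < min(k, n_features):
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            if selected:  # average correlation with the chosen features
                redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                      for s in selected])
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy_weight * redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

A duplicated feature, perfectly relevant on its own, scores poorly once its twin has been selected, which is exactly the redundancy effect that motivates CFS and MRMR.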
B. Learning
In recent years, the prominent development of sensing devices (e.g., accelerometers, cameras, GPS, etc.) has facilitated the process of collecting attributes related to the individuals and their surroundings. However, most applications require much more than simply gathering measurements from variables of interest. In fact, additional challenges for enabling context awareness involve knowledge discovery, since the raw data (e.g., acceleration signals or electrocardiogram) provided
TABLE V
CLASSIFICATION ALGORITHMS USED BY STATE-OF-THE-ART HUMAN ACTIVITY RECOGNITION SYSTEMS.

Decision tree: C4.5 and ID3 [20], [25], [27], [45]
Bayesian: Naïve Bayes and Bayesian Networks [20], [23], [27], [48]
Instance based: k-nearest neighbors [20], [25]
Neural networks: Multilayer Perceptron [66]
Domain transform: Support Vector Machines [29], [34], [35]
Fuzzy logic: Fuzzy Basis Function and Fuzzy Inference System [24], [36], [67]
Regression methods: MLR, ALR [26], [31]
Markov models: Hidden Markov Models and Conditional Random Fields [68], [69]
Classifier ensembles: Boosting and Bagging [26], [56]
by the sensors are often useless. For this purpose, HAR systems make use of machine learning tools, which are helpful to build patterns to describe, analyze, and predict data.
In a machine learning context, patterns are to be discovered from a set of given examples or observations denominated instances. Such an input set is called the training set. In our specific case, each instance is a feature vector extracted from the signals within a time window. The examples in the training set may or may not be labeled, i.e., associated with a known class (e.g., walking, running, etc.). In some cases, labeling data is not feasible because it may require an expert to manually examine the examples and assign a label based upon their experience.
This process is usually tedious, expensive, and time-consuming in many data mining applications [63]–[65].
There exist two learning approaches, namely supervised and unsupervised learning, which deal with labeled and unlabeled data, respectively. Since a human activity recognition system should return a label such as walking, sitting, or running, most HAR systems work in a supervised fashion. Indeed, it might be very hard to discriminate activities in a completely unsupervised context. Some other systems work in a semi-supervised fashion, allowing part of the data to be unlabeled.
1) Supervised learning: Labeling sensed data from individuals performing different activities is a relatively easy task.
Some systems [23], [46] store sensor data in a non-volatile medium while a person from the research team supervises the collection process and manually registers activity labels and time stamps. Other systems feature a mobile application that allows the user to select the activity to be performed from a list [26]. In this way, each sample is matched to an activity label, and then stored in the server.
Supervised learning (referred to as classification for discrete-class problems) has been a very productive field, bringing about a great number of algorithms. Table V summarizes the most important classifiers in human activity recognition; their descriptions are included below.
• Decision trees build a hierarchical model in which attributes are mapped to nodes and edges represent the possible attribute values. Each branch from the root to a leaf node is a classification rule. C4.5 is perhaps the most widely used decision tree classifier and is based on the concept of information gain to select the attributes that should be placed in the top nodes [70]. Decision trees can be evaluated in O(log n) for n attributes, and usually generate models that are easy for humans to understand.
• Bayesian methods calculate posterior probabilities for each class using conditional probabilities estimated from the training set. The Bayesian Network (BN) [71] classifier and Naïve Bayes (NB) [72] (a specific case of BN) are the principal exponents of this family of classifiers. A key issue in Bayesian Networks is the topology construction, as it is necessary to make assumptions on the independence among features. For instance, the NB classifier assumes that all features are conditionally independent given a class value, yet such an assumption does not hold in many cases. As a matter of fact, acceleration signals are highly correlated, as are physiological signals such as heart rate, respiration rate, and ECG amplitude.
• Instance based learning (IBL) [49] methods classify an instance based upon the most similar instance(s) in the training set. For that purpose, they define a distance function to measure the similarity between each pair of instances. This makes IBL classifiers quite expensive in their evaluation phase, as each new instance to be classified needs to be compared against the entire training set. Such a high cost in terms of computation and storage makes IBL models inconvenient to implement on a mobile device.
• Support Vector Machines (SVM) [73] and Artificial Neural Networks (ANN) [74] have also been broadly used in HAR, although they do not provide a set of rules understandable by humans. Instead, knowledge is hidden within the model, which may hinder the analysis and incorporation of additional reasoning. SVMs rely on kernel functions that project all instances to a higher dimensional space, with the aim of finding a linear decision boundary (i.e., a hyperplane) that partitions the data. Neural networks replicate the behavior of biological neurons in the human brain, propagating activation signals and encoding knowledge in the network links. Besides, ANNs have been shown to be universal function approximators. The high computational cost and the need for large amounts of training data are two common drawbacks of neural networks.
• Ensembles of classifiers combine the output of several classifiers to improve classification accuracy. Some examples are bagging, boosting, and stacking. Classifier ensembles are clearly more expensive, computationally speaking, as they require several models to be trained and evaluated.
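The evaluation cost noted above for instance based learning is easy to see in code. The sketch below is a minimal k-nearest neighbors classifier with hypothetical two-dimensional feature vectors, not the implementation of any surveyed system; note that every single prediction scans the whole training set.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify one instance by majority vote among its k nearest
    training instances (Euclidean distance).

    The cost is O(|training set|) distance computations per prediction,
    which is why IBL models are awkward to run on a mobile device.
    """
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every instance
    nearest = np.argsort(dists)[:k]                 # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                # majority vote
```

With, say, class 0 instances clustered near the origin and class 1 instances near (5, 5), a query close to either cluster is assigned the corresponding label.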
2) Semi-supervised learning: Relatively few approaches have implemented activity recognition in a semi-supervised fashion, i.e., having part of the data without labels [63], [64], [75]–[77]. In practice, annotating data might be difficult in some scenarios, particularly when the granularity of the activities is very high or the user is not willing to cooperate with the collection process. Since semi-supervised learning remains a minority approach in HAR, there are no standard algorithms or methods; instead, each system implements its own approach. Section V-C provides more details on the state-of-the-art semi-supervised activity recognition approaches.
3) Evaluation metrics: In general, the selection of the classification algorithm for HAR has been supported merely by empirical evidence. The vast majority of the studies use cross validation with statistical tests to compare classifiers' performance on a particular dataset. The classification results for a particular method can be organized in a confusion matrix M_{n×n} for a classification problem with n classes. This is a matrix such that the element M_{ij} is the number of instances from class i that were classified as class j. The following values can be obtained from the confusion matrix in a binary classification problem:
• True Positives (TP): The number of positive instances that were classified as positive.
• True Negatives (TN): The number of negative instances that were classified as negative.
• False Positives (FP): The number of negative instances that were classified as positive.
• False Negatives (FN): The number of positive instances that were classified as negative.
Accuracy is the most common metric to summarize the overall classification performance across all classes; it is defined as follows:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)

The precision, often referred to as positive predictive value, is the ratio of correctly classified positive instances to the total number of instances classified as positive:

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (9)

The recall, also called true positive rate, is the ratio of correctly classified positive instances to the total number of positive instances:

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (10)

The F-measure combines precision and recall in a single value:

F\text{-}measure = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (11)

Although defined for binary classification, these metrics can be generalized to a problem with n classes. In such a case, an instance is deemed positive or negative with respect to a particular class, e.g., positives might be all instances of running while negatives would be all instances other than running.
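Equations 8 through 11 translate directly into code. The sketch below follows the one-vs-rest reading described above, treating one chosen class as positive and everything else as negative; the function and variable names are ours.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive):
    """Compute Equations 8-11 with `positive` as the class of interest
    and all other labels treated as negative (one-vs-rest)."""
    y_true = np.asarray(y_true) == positive
    y_pred = np.asarray(y_pred) == positive
    tp = np.sum(y_true & y_pred)    # positives classified as positive
    tn = np.sum(~y_true & ~y_pred)  # negatives classified as negative
    fp = np.sum(~y_true & y_pred)   # negatives classified as positive
    fn = np.sum(y_true & ~y_pred)   # positives classified as negative
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. 8
    precision = tp / (tp + fp) if tp + fp else 0.0      # Eq. 9
    recall = tp / (tp + fn) if tp + fn else 0.0         # Eq. 10
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)        # Eq. 11
    return accuracy, precision, recall, f_measure
```

Averaging these per-class values over all n classes (macro-averaging) then gives a single multiclass summary, which is one common way the generalization above is realized in practice.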
4) Machine learning tools: The Waikato Environment for Knowledge Analysis (WEKA) [61] is certainly the best known tool in the machine learning research community. It contains implementations of a number of learning algorithms and allows one to easily evaluate them on a particular dataset using cross validation and random splits, among other methodologies. WEKA also offers a Java API that facilitates the incorporation of new learning algorithms and evaluation methodologies on top of the pre-existing framework. One limitation of current machine learning APIs such as WEKA [61] and the Java Data Mining (JDM) platform [78] is that they are not fully functional on current mobile platforms. In that direction, the authors proposed MECLA [48], a library for the evaluation of classification algorithms on the Android platform.

Fig. 6. Taxonomy of Human Activity Recognition Systems.
V. EVALUATION OF HAR SYSTEMS
In order to evaluate the state-of-the-art HAR systems, it is necessary to first define a taxonomy that allows comparing and analyzing them within groups that share common characteristics. To the best of our knowledge, no comprehensive taxonomy has been proposed in the literature to encompass all sorts of activity recognition systems. In that direction, this paper presents a new global taxonomy for HAR and a new two-level specific taxonomy for HAR with wearable sensors (see Figure 6). As mentioned in Section I, the nature of the sensors (i.e., either wearable or external) marks the first criterion to classify HAR systems. Hybrid approaches also appear as an emerging class of systems that intend to exploit the best of both worlds [79] by combining external and wearable sensors as sources of data for recognizing activities.
In this survey, we have categorized HAR systems that rely on wearable sensors in two levels. The first one has to do with the learning approach, which can be either supervised or semi-supervised. In the second level, according to the response time, supervised approaches can work either online or offline.
The former provide immediate feedback on the performed activities. The latter either need more time to recognize activities due to high computational demands, or are intended for applications that do not require real-time feedback. This taxonomy has been adopted as the systems within each class have very different purposes and associated challenges and should be evaluated separately. For instance, a very accurate fully supervised approach might not work well in a semi-supervised scenario, whereas an effective offline system may not be able to run online due to processing constraints.
Furthermore, we found a significant number of systems that fall in each group, which also favors the comparison and analysis.
Other criteria such as the type of sensors, the feature extraction approach, and the learning algorithm are not useful to build a taxonomy of HAR. Instead, they are design issues that should be addressed by the researcher. Finally, although the set of recognized activities clearly generates different types of HAR systems, incorporating it in the taxonomy would lead to an excessive granularity as most systems define a particular set of activities.
The qualitative evaluation encompasses the following aspects:
• Recognized activities (Table I)
• Type of sensors and the measured attributes (Section III-A)
• Integration device
• Level of obtrusiveness, which could be low, medium, or high
• Type of data collection protocol, which could be either a controlled or a naturalistic experiment
• Level of energy consumption, which could be low, medium, or high
• Classifier flexibility level, which could be either user-specific or monolithic
• Feature extraction method(s)
• Learning algorithm(s)
• Overall accuracy for all activities
To the best of our knowledge, current semi-supervised systems have been implemented and evaluated offline. Thus, only three categories of HAR systems are considered, namely, online (supervised), supervised offline, and semi-supervised (offline) HAR systems. Although a large number of systems are present in the literature, we have selected twenty-eight of them based on their relevance in terms of the aforementioned aspects.
A. Online HAR systems
Applications of online activity recognition systems can be easily visualized. In healthcare, continuously monitoring patients with physical or mental pathologies becomes crucial for their protection, safety, and recovery. Likewise, interactive games or simulators may enhance the user's experience by considering activities and gestures. Table VII summarizes the online state-of-the-art activity recognition approaches. The abbreviations and acronyms are defined in Table VI. The most important works on online HAR are described next.
1) eWatch: Maurer et al. [25] introduced eWatch, an online activity recognition system that embeds sensors and a microcontroller within a device that can be worn as a sport watch. Four sensors are included, namely an accelerometer, a light sensor, a thermometer, and a microphone. These are passive sensors and, as they are embedded in the device, no wireless communication is needed; thus, eWatch is very energy efficient. Using a C4.5 decision tree and time-domain feature extraction, the overall accuracy was up to 92.5% for six ambulation activities, although less than 70% was achieved for activities such as descending and ascending. The execution time for feature extraction and classification is less than 0.3 ms, which makes the system very responsive. However, in eWatch, data were collected under controlled conditions, i.e., a lead experimenter supervised and gave specific guidelines to the subjects on how to perform the activities [25]. In Section III-C, we reviewed the disadvantages of this approach.
2) Vigilante: The authors proposed Vigilante [48], a mobile application for real-time human activity recognition under the Android platform. The Zephyr BioHarness BT [80] chest sensor strap was used to measure acceleration and physiological signals such as heart rate, respiration rate, breath
TABLE VI
LIST OF ABBREVIATIONS AND ACRONYMS.

3DD: 3D Deviation of the acceleration signals
ACC: Accelerometers
ALR: Additive Logistic Regression classifier
AMB: Ambulation activities (see Table I)
ANN: Artificial Neural Network
AR: Auto-Regressive model coefficients
AV: Angular velocity
BN: Bayesian Network classifier
CART: Classification And Regression Tree
DA: Daily activities (see Table I)
DCT: Discrete Cosine Transform
DT: Decision Tree-based classifier
DTW: Dynamic Time Warping
ENV: Environmental sensors
FBF: Fuzzy Basis Function
FD: Frequency-domain features
GYR: Gyroscope
HMM: Hidden Markov Models
HRB: Heart Rate Beats above the resting heart rate
HRM: Heart Rate Monitor
HW: Housework activities (see Table I)
KNN: k-Nearest Neighbors classifier
LAB: Laboratory controlled experiment
LDA: Linear Discriminant Analysis
LS: Least Squares algorithm
MIL: Military activities
MNL: Monolithic classifier (subject independent)
NAT: Naturalistic experiment
NB: Naïve Bayes classifier
NDDF: Normal Density Discriminant Function
N/S: Not Specified
PCA: Principal Component Analysis
PHO: Activities related to phone usage (see Table I)
PR: Polynomial Regression
RFIS: Recurrent Fuzzy Inference System
SD: Subject Dependent evaluation
SFFS: Sequential Forward Feature Selection
SI: Subject Independent evaluation
SMA: Signal Magnitude Area
SMCRF: Semi-Markovian Conditional Random Field
SPC: User-specific classifier (subject dependent)
SPI: Spiroergometry
TA: Tilt Angle
TD: Time-domain features
TF: Transient Features [26]
TR: Transitions between activities
UB: Upper body activities (see Table I)
VS: Vital sign sensors
waveform amplitude, and skin temperature, among others. In that work, we proposed MECLA, a library for the mobile evaluation of classification algorithms, which can also be utilized in further pattern recognition applications. Statistical time- and frequency-domain features were extracted from the acceleration signals, while polynomial regression was applied to the physiological signals via least squares. The C4.5 decision tree classifier recognized three ambulation activities with an overall accuracy of 92.6%. The application can run for up to 12.5 continuous hours with a response time of no more than 8% of the window length. Different users with diverse characteristics participated in the training and testing phases, ensuring flexibility to support new users without the need to re-train the system. Unlike other approaches, Vigilante was evaluated completely online to provide more realistic results.
Vigilante is moderately energy efficient because it requires permanent Bluetooth communication between the sensor strap and the phone.