In this section, we first evaluate the component-wise performance of DeepVehicleSense, consisting of the proposed triggers and the sound classifier. Then, we conduct extensive experiments to understand the system-level performance of DeepVehicleSense in comparison with three recent recognition algorithms:
Hemminki's accelerometer-based algorithm [23], Wang's CNN-based algorithm [26], and our earlier SVM-based work, VehicleSense [16], via trace-driven emulations. We implement these systems on COTS Android smartphones for accurate evaluation of decision latency and power consumption.
We further demonstrate and evaluate an automated transportation data collection platform built on DeepVehicleSense, which evidences the efficacy of DeepVehicleSense in practical use cases.
Data Collection
To evaluate DeepVehicleSense, we first validate the designed modules with respect to the trigger thresholds, filter type, and sampling duration, and train DeepVehicleSense with the best configuration.
Then, we evaluate the performance in real-life situations with the trained model. To validate and train the sound classifier, we have collected microphone and accelerometer data in real-world environments, without any restriction on environmental factors such as location, time of day, and nearby people density, using seven different smartphone models carried by 10 different volunteers (hereafter, testers).
The dataset spans about 263 hours collected over 40 days spread across several months. Table 2 summarizes the collected data length in minutes for the candidate activities. For the microphone, we use a 16 kHz sampling rate and 16-bit PCM (pulse code modulation) encoding, and for the accelerometer, we use a 50 Hz sampling rate. While collecting the data, we have also acquired the activity labels provided by the testers whenever they changed their transportation mode. Each activity label in the dataset is one of the following seven motions: (1) being stationary at a fixed place, (2) walking, (3) in a bus, (4) in a taxi, (5) in a subway, (6) in a train, and (7) in an airplane. Since the dataset has not been separated explicitly, there may exist temporal overlaps (e.g., a sound of a city bus that includes a slight sound of a taxi riding around the city bus is labeled as a bus). Note that we minimized the testers' labeling effort by asking them to sustain each transportation mode for at least 20 minutes. We have used 80% of the unfiltered dataset for training and the remaining 20% for our validation test, split by the collection date. This means that we have evaluated the sound characteristics in the presence of various environmental noises. As our data has been gathered by roaming with the purpose of gathering data, only about 1.5% of the split data turn out to overlap in terms of location.
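To make the date-based split concrete, the sketch below shows one way to partition recordings 80/20 by collection date so that clips recorded on the same day never appear in both sets. The `recordings` structure (one entry per clip with a `date` field) is a hypothetical stand-in for however the dataset is actually indexed, and the dates in the example are dummy values.

```python
# Sketch of an 80/20 train/validation split by collection date (not by sample),
# so that clips recorded on the same day never land in both sets.
# Splitting by per-date data volume instead of date count is a simple variant.
from collections import defaultdict

def split_by_date(recordings, train_ratio=0.8):
    by_date = defaultdict(list)
    for rec in recordings:
        by_date[rec['date']].append(rec)
    dates = sorted(by_date)                       # chronological order
    n_train_dates = int(len(dates) * train_ratio)
    train = [r for d in dates[:n_train_dates] for r in by_date[d]]
    val = [r for d in dates[n_train_dates:] for r in by_date[d]]
    return train, val

# Usage with dummy entries:
recs = [{'date': f'day{i % 5}', 'label': 'bus', 'path': '...'} for i in range(10)]
train_set, val_set = split_by_date(recs)
```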
In addition, to validate and evaluate the triggers, we have collected a separate dataset in the wild from 110 randomly selected volunteers of various ages, occupations, and cities, recruited from an Internet community. The volunteers lived their daily lives while running the data collection application for 8 days.
The application recorded only the accelerometer readings; the microphone data was intentionally excluded due to privacy concerns. At the same time, the volunteers manually recorded their actual transportation mode whenever it changed.
Table 2: Total measurement duration (in minutes) and the number of samples with 0.5-second sampling duration for each transportation mode.

Motion         stationary   walk      bus       taxi      subway    train     airplane
Duration       1,000        2,430     2,700     2,320     2,530     2,160     2,670
# of samples   120,000      291,600   324,000   278,400   303,600   259,200   320,400
Figure 10: CDFs of the MagMAa values from the accelerometers under stationary (either inside or outside a vehicle) and walking motions. The two thresholds, Tw and Ts, are also depicted.
In order to induce active participation in the experiment, compensation was paid according to each volunteer's data collection duration, and an average of 171.5 hours of data was collected per volunteer. This dataset contains bus, taxi (car), subway, walking, running, and bicycle.
Utilizing DeepVehicleSense trained with these two datasets, we perform an experiment in real-life situations. The experiment continues for seven days to demonstrate the generality of our system in contrast to the two datasets collected in fragments. Details are described in Sec. 2.5.
Performance of Triggers
Accuracy of Triggers The goal of the triggers is to minimize the number of sound classifier activations by recognizing whether a user is in a vehicle, while guaranteeing that the sound classifier receives sound samples whenever the user is in a vehicle.
To validate our implementation, we first evaluate the accuracy of the motion trigger. Fig. 10 shows the CDFs of the magnitude values extracted under two motions: being stationary (either inside or outside a vehicle) and walking. We set the stationary and walking thresholds to 25.7 (m/s²)² and 5.1 (m/s²)², with which 99.65% of walking and 99.95% of stationary motions are correctly detected.
These stationary and walking thresholds produce false positive rates of 25.5% and 26.8%, but this drastic choice for near-perfect detection of true positives is intentional: we seek to minimize the number of false negatives, which could miss transportation-related transitional events such as getting on and off a vehicle, at the cost of an increased number of false positives. With this design, the detection failure rate (i.e., missing rate) for riding events goes below 0.68%.¹
We then evaluate the performance of the transportation trigger in distinguishing being stationary at a fixed place from being stationary inside a transportation vehicle, based on the accelerometer dataset. The detailed classification accuracy is measured by three popular metrics [43]: precision, recall, and F1 score (the harmonic mean of precision and recall), which are defined as

$$\text{Precision} = \frac{n(\text{True Positive})}{n(\text{True Positive}) + n(\text{False Positive})}, \qquad
\text{Recall} = \frac{n(\text{True Positive})}{n(\text{True Positive}) + n(\text{False Negative})},$$

$$F_1\ \text{score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where n(·) is the number of the corresponding events.
The values we obtain for precision, recall, and F1 score are 89.03%, 99.37%, and 93.91%, respectively.
Having extremely high recall is our intended design. As stated earlier, the most important performance metric of a trigger is the missing rate, which can be regarded as the FNR (false negative rate), i.e., 1 − recall. The FNR of the trigger is as low as 0.6%. Note that false positives are not critical, as they only lead to slightly more power consumption from activating the sound classifier more often. Since our sound classifier also recognizes non-vehicle situations with more than 99% accuracy, the final transportation mode recognition accuracy is not degraded much by a certain level of false positives.
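For reference, a minimal sketch of how the three metrics above can be computed from the trigger's binary decisions (positive = stationary inside a vehicle); the variable names are illustrative.

```python
# Precision, recall, and F1 for binary trigger decisions (1 = in a vehicle).
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Sanity check against the reported numbers:
p, r = 0.8903, 0.9937
print(2 * p * r / (p + r))   # ~0.9391, matching the reported F1 score
```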
Energy Efficiency of Triggers We evaluate how much the proposed triggers help in reducing the power consumption of DeepVehicleSense in comparison with continuous sensing methods. In order to understand the expected power consumption E[P] of DeepVehicleSense for all-day monitoring, we characterize it by estimating the probabilities of specific events during a day. When we denote the instantaneous power consumption of the motion trigger (MT), the transportation trigger (TT), and the sound classifier (SC) by P_MT, P_TT, and P_SC, the following equation holds:

$$E[P] = P_{MT} + p(W_{cont})\,P_{SC}\,t_{SC} + p(W \to S)\left[\,P_{TT}\,t_{TT} + p(V|S)\,P_{SC}\,t_{SC}\,\right],$$

where p(W→S), p(V|S), and p(W_cont) denote the probability of observing a transition from walking to being stationary, the probability of being on a vehicle given that a person is stationary, and the
¹ The missing rate of the motion trigger is calculated by adding two probabilities: missing the transition from walking to stationary motion, and missing the continuous walking case. The former is the sum of the false negatives of both walking and stationary motions; the latter is p(W_cont) × [1 − p(W_tcont)], where p(W_cont) is the probability of observing continuous walking for t_cont = 5 seconds and p(W_tcont) is the probability that all predictions of walking during t_cont are true positives.
Figure 11: p(W→S), p(V|S), and p(W_cont) evaluated individually from the 110 user traces measured in the wild.
Table 3: Power consumption and activation duration of the modules.

                           MT           TT      SC
Power consumption (mW)     5.80         61.41   181.98
Activated duration (s)     Persistent   1.02    2.03
probability of observing continuous walking for t_cont = 5 seconds, respectively. Note that all these probabilities are calculated from the discretized events predicted by the triggers, assessed every second throughout a day, which implies that the positive rates of the triggers include both true and false positives.
Also note that t_TT and t_SC are the activated time durations, in seconds, of the transportation trigger and the sound classifier for each call. If we have no trigger system, the power consumption of DeepVehicleSense simply becomes E[P_no-trigger] = P_SC. If we only exploit the transportation trigger, E[P_1-trigger] = P_TT + p(V) P_SC t_SC holds. Given the baseline statistics of MT, TT, and SC in Table 3 and the average event probabilities from our 110-user dataset, p(W→S) = 0.064, p(V|S) = 0.569, and p(W_cont) = 0.032, the power consumption of DeepVehicleSense becomes E[P] = 35.08 mW, which outperforms E[P_no-trigger] = 181.98 mW and E[P_1-trigger] = 74.93 mW. Fig. 11 confirms that p(W→S), p(V|S), and p(W_cont) obtained from the 110 users individually differ but show certain coherence. We can estimate the worst-case power consumption from these statistics, which becomes E[P] = 87.47 mW. This is still far less than the no-trigger and 1-trigger systems, as people have essential activities such as sleeping and having meals, which are filtered out by the triggers.
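The expected-power model above can be reproduced directly from Table 3 and the average event probabilities; the short sketch below does exactly that (the variable names are ours).

```python
# Expected all-day power of DeepVehicleSense, reproducing E[P] from the text.
# The probabilities are per-second event rates, so each power (mW) x duration (s)
# term acts as an average energy per second, i.e., mW.
P_MT, P_TT, P_SC = 5.80, 61.41, 181.98   # mW (Table 3)
t_TT, t_SC = 1.02, 2.03                  # seconds per activation (Table 3)
p_w_to_s, p_v_given_s, p_w_cont = 0.064, 0.569, 0.032  # 110-user averages

E_P = P_MT + p_w_cont * P_SC * t_SC \
      + p_w_to_s * (P_TT * t_TT + p_v_given_s * P_SC * t_SC)
E_P_no_trigger = P_SC                    # no triggers: the classifier runs continuously

print(round(E_P, 2), E_P_no_trigger)     # 35.08 mW vs. 181.98 mW
```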
Performance of Sound Classifier
Now, we evaluate the performance of the sound classifier with respect to the filters, sampling durations, and the staged inference. For the pre-processing of the sound classifier, we fine-tuned the parameters of the STFT and the filter. Specifically, we use an STFT with a Hamming window of size 512 and 50% overlap, and choose a filter consisting of 32 triangular filter banks with the "steep" scale formula. To implement the staged inference used in the evaluation, we empirically set the threshold values that perform best in terms of both recognition accuracy and latency.
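As an illustration of this pre-processing pipeline, the sketch below computes an STFT with a 512-sample Hamming window and 50% overlap and applies a 32-band triangular filter bank. The "steep" scale formula is defined in Sec. 2.4 and is not reproduced here; the Mel warp below is only a stand-in that could be swapped for any more convex log-like curve.

```python
# Sketch of the sound pre-processing: STFT (Hamming window of 512, half overlap)
# followed by a 32-band triangular filter bank. The Mel warp is a placeholder
# for the paper's "steep" scale.
import numpy as np
from scipy.signal import stft

FS = 16000          # 16 kHz sampling rate (as in the dataset)
N_FFT = 512         # Hamming window size
HOP = 256           # half overlap
N_BANDS = 32        # number of triangular filter banks

def warp(f_hz):
    """Frequency warping curve; Mel here, 'steep' in the paper."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def unwarp(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filter_bank(n_bands, n_fft, fs):
    """Build n_bands triangular filters over the positive-frequency STFT bins."""
    edges_hz = unwarp(np.linspace(warp(0.0), warp(fs / 2.0), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(n_bands):
        lo, ctr, hi = bins[b], bins[b + 1], bins[b + 2]
        fb[b, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fb[b, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    return fb

def preprocess(samples):
    """STFT power spectrogram -> 32-band log filter-bank energies."""
    _, _, Z = stft(samples, fs=FS, window='hamming',
                   nperseg=N_FFT, noverlap=N_FFT - HOP)
    power = np.abs(Z) ** 2                        # shape (257, n_frames)
    fb = triangular_filter_bank(N_BANDS, N_FFT, FS)
    return np.log(fb @ power + 1e-10)             # shape (32, n_frames)

# Example: a 2-second clip (the default sampling duration) of random noise.
features = preprocess(np.random.randn(2 * FS))
print(features.shape)
```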
Figure 12: Precision, recall, and F1 score from the different filters (Inverse, Linear, Mel, Doubled-Mel, Steep, Sheer, Extreme, InvSigmoid1, InvSigmoid2, Sigmoid).
Recognition Accuracy by Filters In Sec. 2.4, we have proposed non-linear filters that pre-process the sound data to maximize the difference between the sound characteristics of vehicles. Fig. 12 shows the recognition accuracy of the non-linear filters in terms of three performance metrics: precision, recall, and F1 score. We can immediately notice that Inverse, the only exponential-shape filter, shows the worst performance. Considering that the important frequency features are concentrated in the range under 1 kHz (as investigated in Sec. 2.3), this result is not surprising.
A more interesting observation is the pattern of the accuracy change over the logarithmic-shape filters, which become more convex in the order of Linear, Mel, Doubled-Mel, Steep, Sheer, and Extreme.
Fig. 12 shows that the accuracy gradually increases, peaks at Steep with 98.3% on average, and then gradually decreases. The well-known pre-processing technique MFCC shows relatively poor performance, about 5.2% lower than that of Steep. The sigmoid-function variants, Sigmoid, InverseSig1, and InverseSig2, do not surpass the performance of Steep either.
Note that we set the sampling duration of the sound data to 2 seconds in obtaining these results, but we confirmed that the trend remains the same for different sampling durations.
Recognition Accuracy by Sampling Duration The recognition latency is one of the most important performance measures in transportation mode recognition systems. In order to let DeepVehicleSense conclude as quickly as possible, we investigate the trade-off between the sampling duration and the recognition accuracy, shown in Fig. 13. The tested durations range from 0.1 seconds to 5 seconds. Fig. 13 shows that a longer sampling duration leads to higher accuracy, but the improvement becomes marginal beyond 2 seconds.
Figure 13: Precision, recall, and F1 score from different sampling durations (100 ms to 5,000 ms).
We find that 2 seconds is enough to provide high accuracy. By taking both the accuracy and the sampling duration into account, we opt to use 2 seconds as the default setting for the remaining experiments. We discuss privacy concerns due to recorded sound samples in Sec. 2.6.
Performance of Staged Inference As mentioned in Sec. 2.4, DeepVehicleSense adopts the staged inference to reduce the intensive computational burden of our deep learning model. In order to evaluate the performance of our novel metric logGap, we compare it against the sound classifier without the staged inference and with the staged inference driven by entropy, which is originally used for judging the confidence of early inference in BranchyNet. Table 4 compares the performance of the sound classifier using logGap, using entropy, and without the staged inference. As shown, the staged-inference-based sound classifier improves radically in terms of computing time, power consumption, and recognition accuracy. This is because the staged inference can filter out mispredicted events by calculating the confidence of an inference and avoids passing samples through an excessive number of convolution layers.
Moreover, the logGap-based sound classifier improves the recognition accuracy by 1.49% in terms of F1 score over the entropy-based one. The latter sometimes misjudges ambiguous inference results as confident ones, especially when there exists a confusion that can be hidden by very low class probabilities.
The former, in contrast, can pass confused results to further layers for a prudent decision, which would otherwise be finalized at an early layer of the network by the entropy criterion. On the other hand, we observe that in the logGap-based inference, computing time and energy consumption increase slightly, by 0.79 ms and 0.98 mJ on average, compared to the entropy-based one. The main reason is that, by determining the confidence correctly, the portion of inferences terminating at the first stage decreases from 95.2% to 92.0%.
Figure 14: Energy consumption and F1 score according to the inference stages of the sound classifier: DeepVehicleSense (1st stage only) and DeepVehicleSense (2nd stage always) mean that all inferences complete at the 1st stage and the 2nd stage, respectively.
Table 4: Performance comparison among the logGap-based sound classifier, the entropy-based sound classifier, and the non-staged-inference (non-ST) sound classifier, which uses the original neural network. Computing time and energy consumption are derived for one recognition event.

                               logGap   entropy   non-ST
Avg. F1 score (%)              98.30    96.81     95.47
Avg. computing time (ms)       25.58    24.79     56.82
Avg. energy consumption (mJ)   20.24    19.26     43.68
Further, we justify the choice of the staged inference by comparing the performance achieved by DeepVehicleSense with different numbers of stages. Figure 14 shows the energy consumption and F1 score of the sound classifier for each stage. As shown, when using only the 2nd stage of the neural network, which is shallower than the original neural network, we achieve 94.03% accuracy, lower than that of DeepVehicleSense. However, surprisingly, the energy consumption of DeepVehicleSense using the staged inference is 20.24 mJ,² which is less than the 29.64 mJ of the 2nd-stage-only neural network, because DeepVehicleSense can finish with a smaller neural network in many cases. Therefore, it is reasonable to use the staged inference rather than the shallow network in terms of both recognition accuracy and the energy consumption overhead of the deep learning model.
² The energy consumption of the staged inference is the expected value computed over the probability distribution of terminating stages and the measured energy consumption of each stage.
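To make the staged-inference control flow concrete, here is a minimal early-exit sketch. The exact logGap formula is the one defined in Sec. 2.4; the version below, based on the gap between the two largest log-probabilities, is only an assumed reading, and `stage1_net`/`stage2_net` are placeholders for the two exit branches of the classifier.

```python
# Early-exit (staged) inference sketch: run the shallow first stage, accept its
# prediction if the confidence metric clears a threshold, otherwise fall
# through to the deeper second stage.
import numpy as np

def log_gap(probs):
    # Assumed reading of logGap: gap between the two largest log-probabilities.
    top2 = np.sort(probs)[-2:]
    return np.log(top2[1] + 1e-12) - np.log(top2[0] + 1e-12)

def staged_inference(features, stage1_net, stage2_net, threshold):
    probs1 = stage1_net(features)          # softmax output of the 1st exit
    if log_gap(probs1) >= threshold:       # confident -> terminate early
        return int(np.argmax(probs1)), 1
    probs2 = stage2_net(features)          # otherwise run the deeper stage
    return int(np.argmax(probs2)), 2
```

Because most calls terminate at the first stage (92.0% with logGap, as reported above), the average cost stays close to that of the shallow branch while ambiguous samples still receive the deeper network's decision.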
Figure 15: Comparison of precision, recall, and F1 score among DeepVehicleSense, VehicleSense, Wang's system, and Hemminki's system (with 10 peaks).
System-level Performance
With the dataset in Sec. 2.5, we evaluate the system performance of DeepVehicleSense by comparing with (i) Hemminki's accelerometer-based recognition technique [23], in which peaks of accelerometer readings from acceleration and deceleration are carefully processed and characterized to recognize the type of a vehicle, (ii) Wang's sound-based recognition technique [26], which utilizes a CNN with the STFT of a sound sample as input, and (iii) our preliminary work VehicleSense [16], relying on a simple SVM classifier.
For measuring the system-level performance including the operation of all the triggers, we calculate the overall recognition accuracy by multiplying the precision and recall of the sound classifier evaluated in Sec. 2.5 by (1 − missing rate), and the overall F1 score is calculated from those re-scaled precision and recall values. Note that the missing rate is the rate of failing to invoke the sound classifier even though a user is indeed on a vehicle, as described in Sec. 2.5, so (1 − missing rate) indicates the accuracy of the triggers. Thus, by multiplying the accuracy of the sound classifier by (1 − missing rate), we capture the cases where the triggers and the sound classifier work correctly at the same time.
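In equation form (our notation, with m denoting the missing rate and P_SC, R_SC the sound classifier's precision and recall from Sec. 2.5):

$$P_{\text{sys}} = (1-m)\,P_{\text{SC}}, \qquad R_{\text{sys}} = (1-m)\,R_{\text{SC}}, \qquad F_{1,\text{sys}} = \frac{2\,P_{\text{sys}}\,R_{\text{sys}}}{P_{\text{sys}} + R_{\text{sys}}}.$$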
In order to evaluate the performance on COTS smartphones, we implement these systems as mobile applications on an LG V20 smartphone with the following libraries: the implementations of DeepVehicleSense and Wang's system are based on Tensorflow [44], where the inference network is trained offline and frozen on our Linux server and then exported to the Android phone using Tensorflow Lite.
For the remaining two algorithms, we build an Android application in Java and exploit libSVM [45] to build the SVM-based sound classifier. To accurately measure the latency and power consumption of these
systems, we turned off all other applications. Further, we enabled airplane mode to disable networking functions and turned off the screen. Note that the performance of VehicleSense differs from the original version in [16] due to some changes.³
Recognition Accuracy Fig. 15 compares the recognition accuracy of DeepVehicleSense, VehicleSense, Wang's system, and Hemminki's system with 10 accelerometer peaks. For all tested cases, DeepVehicleSense outperforms the other three. Specifically, the accuracy of DeepVehicleSense reaches 97.44%, while VehicleSense reaches up to 94.36%, Wang's up to 90.41%, and Hemminki's up to 74.71%. The accuracy gap between DeepVehicleSense and VehicleSense comes from the fundamental limitation of the SVM-based classifier in differentiating train and airplane with short sound samples (e.g., 2 seconds), as their frequency characteristics reveal very similar patterns, as shown in Fig. 5. By adopting a deep neural network, DeepVehicleSense overcomes this limitation and improves the recognition accuracy.
Furthermore, our system is designed to utilize the accelerometer-based triggers in the first place so that it activates the sound classifier only at the moment the user is getting on a vehicle, for energy efficiency. In this regard, even a few false predictions in the sound classifier would significantly degrade the accuracy, because a misjudgement would last until the trigger system calls the sound classifier again at the next trigger. Therefore, it is important to push the recognition accuracy as close as possible to 100%, which is considered very challenging in general. DeepVehicleSense makes such a non-trivial improvement, raising the accuracy from 94.36% to 97.44%. Even though Wang's system also applies a CNN to sound data, its accuracy is inferior to DeepVehicleSense because Wang's neural network is too shallow to extract meaningful features from raw STFT inputs. In the case of Hemminki's system, many cases show similar acceleration and deceleration patterns across different transportation modes, as in the obscure classification between subway and train. As a result, the accuracy based on characterizing acceleration is much lower than that based on sound samples.
Power Consumption To compare the system-level power consumption of DeepVehicleSense and the others, we measure the power consumption of the triggers and the main classifiers using a Monsoon power monitor and apply the measurements to the 110 user traces to estimate each system's daily power consumption, by quantifying the number of events activating the triggers and the classifier, which includes both true and false positives. Fig. 16 compares the individual daily-average power consumption of the 110 users for the four systems. We find that the average power consumption gap between DeepVehicleSense and VehicleSense is minor, at 35.08 mW and 32.53 mW, respectively. This is because the energy consumption of the first-stage DNN, which makes the final decision in most cases in our experiments, is 18.72 mJ, while that of the SVM-based sound classifier is 14 mJ. Compared to DeepVehicleSense, which turns on the microphone and computes a deep neural network only when needed, Wang's system keeps the microphone on and computes a deep neural network every 5 seconds, so it consumes a much higher power of 192.48 mW. Hemminki's system consumes 86.51 mW since
³ Here, VehicleSense is newly trained to distinguish five vehicles in total, and we implement it on a different phone.