Although our proposed lightweight DNN and the offloading framework with the essential information extractor each achieve low latency while maintaining DNN performance, neither alone can guarantee latency-bounded DNN inference. To enable DNN inference on resource-constrained devices despite the latency-accuracy trade-off, we study how to reduce latency while maintaining accuracy on such devices.
Lightweight DNN for on-device computing
For this purpose, we design a preprocessing stage that extracts and delivers useful information from the input source so that the DNN can attain high accuracy. The second stage is the input processing, which distills, from the information obtained at the input source, the essential information needed for DNN inference.
Extra computing power through offloading with compression
Latency-guaranteed DNN inferences by combining lightweight DNN and offloading
Introduction
In this work, we propose DeepVehicleSense, a deep learning-based transportation mode recognition system that utilizes a smartphone's microphone and accelerometer. We design DeepVehicleSense (Sec. 2.4) by adding accelerometer-based triggers and a sound classifier using staged inference, which reduces the average energy consumption by 81.3% compared to a conventional deep neural network without them.
Related Work and Motivation
[32] proposed a mode recognition system called Mago that uses an advanced magnetic sensor and an accelerometer, based on the observation that the mechanical motion of vehicles produces significantly distorted magnetic fields. A CNN-based deep neural network with a small number of layers was introduced to make the model practical to run on a smartphone.
Transportation Sound
As shown, we find that the distribution of vocal sounds deviates from the sound patterns of the transportation modes. Thanks to this dataset, we can train and test our proposed method while accounting for robustness against vocal sounds.
System Architecture
MFCC (Mel-frequency cepstral coefficients), a popular signal-processing filter widely used in the speech recognition field [40], is a good example of such preprocessing. Figure 9 shows the structure of the staged inference we designed based on AlexNet [42] by adding two early inference points, motivated by BranchyNet [10].
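As a minimal sketch of how such staged inference can be executed (the stage sizes, exit placement, and confidence threshold below are illustrative, not the exact DeepVehicleSense configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StagedClassifier(nn.Module):
    """AlexNet-style backbone with two early-exit branches (BranchyNet-style)."""
    def __init__(self, num_classes: int, exit_threshold: float = 0.9):
        super().__init__()
        self.exit_threshold = exit_threshold
        # Three stages of an AlexNet-like feature extractor (illustrative sizes).
        self.stage1 = nn.Sequential(nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3), nn.ReLU(), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        # Two early inference points plus the final classifier.
        self.exit1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))
        self.exit2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
        self.final = nn.Sequential(nn.Flatten(), nn.Linear(128, num_classes))

    @torch.no_grad()
    def forward(self, x):
        # Inference path for a single sample: stop at the first exit whose softmax
        # confidence clears the threshold, so easy samples skip the deeper stages.
        h = self.stage1(x)
        p = F.softmax(self.exit1(h), dim=1)
        if p.max().item() >= self.exit_threshold:
            return p
        h = self.stage2(h)
        p = F.softmax(self.exit2(h), dim=1)
        if p.max().item() >= self.exit_threshold:
            return p
        return F.softmax(self.final(self.stage3(h)), dim=1)
```

Because most samples can be classified confidently at an early exit, the deeper stages run only for ambiguous inputs, which is what yields the energy savings reported above.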
Evaluation
Now we evaluate the performance of the sound classifier with respect to the filters, the sampling durations, and the staged inference. For the implementation of the sound classifier's preprocessing, we refined the parameters of the STFT and the filter.
Accuracy of Recognition by Sampling Duration
Recognition latency is one of the most important performance measures in transportation mode recognition systems.
By multiplying (1 − missing percentage) by the accuracy of the sound classifier, we can isolate the cases where the triggers and the sound classifier work correctly at the same time. Using this data, we test the recognition accuracy of the sound classifier whose model was trained without the additional data. The reason is that the sound characteristics of transportation modes are not seriously affected by the location of the phone.
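Written out, and with purely illustrative numbers rather than our measured ones:

\[
\mathrm{Acc}_{\mathrm{overall}} = (1 - p_{\mathrm{miss}}) \times \mathrm{Acc}_{\mathrm{sound}},
\qquad \text{e.g. } (1 - 0.02) \times 0.97 \approx 0.951 .
\]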
Discussion
After training DeepVehicleSense with the original dataset including KTX and other vehicles, we evaluate the performance of this trained sound classifier on a dataset in which KTX is replaced with SRT. One feasible way to reduce the training effort is so-called transfer learning [48, 49], a method that reuses the knowledge of pre-trained models so that the model can be adjusted with only a few snapshots of new data. To validate the transfer learning performance, we use 10% of the SRT data for transfer learning and the remaining 90% for testing.
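A minimal sketch of this adaptation, assuming a PyTorch sound classifier whose convolutional layers are frozen and whose final classification head (here assumed to be named `classifier`) is fine-tuned on the small SRT split; the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def finetune_for_new_vehicle(pretrained: nn.Module,
                             srt_loader: torch.utils.data.DataLoader,
                             epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Adapt a sound classifier trained on KTX to SRT with a few labeled samples."""
    # Freeze the feature extractor; only the classification head is updated.
    for p in pretrained.parameters():
        p.requires_grad = False
    for p in pretrained.classifier.parameters():   # assumes a `classifier` head
        p.requires_grad = True

    optimizer = torch.optim.Adam(pretrained.classifier.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    pretrained.train()
    for _ in range(epochs):
        for features, labels in srt_loader:        # the 10% SRT training split
            optimizer.zero_grad()
            loss = criterion(pretrained(features), labels)
            loss.backward()
            optimizer.step()
    return pretrained
```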
The parameter adjustment through transfer learning allowed us to improve the accuracy from 95.72% to 98.52%, which is comparable to that of the sound classifier trained and tested on the same transportation modes (i.e., 98.3% in Sec. 2.5), demonstrating the possibility of both rapid and robust adaptation with limited new samples. In practice, users frequently use their smartphones during transport, which can affect motion sensors such as the accelerometer. To cope with this, we use the vibrations caused by a vehicle, which are known to be clearly distinguishable in the frequency domain from the vibration patterns caused by human behavior [27, 30].
Conclusion
Introduction
The N-epitomizer encoder is unique in that it is specifically trained to maximize inference accuracy while simultaneously meeting a requirement on the volume of data to transmit, which stands in stark contrast to most autoencoders, which aim to fully restore the original data. By exploiting this encoder design, we ensure both the inference accuracy of the N-epitomizer and its adaptability to network changes. We find that the N-epitomizer with such optimizations runs about 2× faster than its naive implementation with less than 1% difference in inference accuracy for the same transmitted volume.
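One hedged way to write such an objective (the exact N-epitomizer formulation is not reproduced here; the rate term and weight λ are illustrative) is:

\[
\mathcal{L}_{\text{enc}} = \mathcal{L}_{\text{task}}\!\left(f\!\left(D(E(x))\right),\, y\right)
+ \lambda \cdot \max\!\left(0,\, R(E(x)) - R_{\text{target}}\right),
\]

where \(E\) and \(D\) are the encoder and decoder, \(f\) is the served DNN, \(R(\cdot)\) is the encoded data volume, and \(R_{\text{target}}\) is the volume budget imposed by the network; the task loss, rather than a pixel-reconstruction loss, is what drives the encoder.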
We further demonstrate that the N-epitomizer can be extended to support multiple DNNs by extracting essential information that maintains the inference accuracy of all of them. We experimentally confirm that the N-epitomizer can extract and deliver the aggregated essential information required by each DNN, thereby significantly reducing the total amount of data transmitted when running multiple DNNs. Thanks to this capability, we find that the N-epitomizer can provide low end-to-end latency (i.e., from encoding to receiving inference results) even in a highly variable mobile network.
Background
Image Compression
Limitation of Existing DNN-based Compression Methods
The distortion d is generally defined as PSNR (peak signal-to-noise ratio) or SSIM (structural similarity) [73].
[Figure: (c) segmentation of raw PNG images; (d) segmentation of GRACE-compressed images.]
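For reference, the PSNR of an 8-bit reconstruction \(\hat{x}\) of an image \(x\) with \(N\) pixels is computed from the mean squared error:

\[
\mathrm{PSNR}(x,\hat{x}) = 10\,\log_{10}\!\frac{255^{2}}{\mathrm{MSE}(x,\hat{x})},\qquad
\mathrm{MSE}(x,\hat{x}) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\hat{x}_i\right)^{2}.
\]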
Semantic Offloading for DNN Inference
However, creating the information extractor by hand for every DNN is not feasible, since what should be kept as core information depends entirely on which DNN (e.g., classification, semantic segmentation, object detection) will be executed. In particular, we train our autoencoder together with the DNNs to minimize the DNNs' loss functions, which differs from using an autoencoder for conventional image reconstruction. To overcome the sub-optimality of separate and sequential training [77], our joint DNN-centric training trains the autoencoder and the DNN application together with the DNN loss function at a certain fixed latency, and extends to the ensemble minimization of multiple DNN losses.
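A minimal sketch of one such joint training step, assuming PyTorch modules `encoder`, `decoder`, and a downstream `dnn` with its own task loss (the module names and optimizer setup are illustrative):

```python
import torch

def joint_train_step(encoder, decoder, dnn, task_loss_fn,
                     images, targets, optimizer):
    """One joint DNN-centric training step: the autoencoder is optimized with the
    DNN's own loss, not a pixel-reconstruction loss, so it keeps only
    inference-relevant information."""
    optimizer.zero_grad()
    z = encoder(images)                           # compact latent to transmit
    restored = decoder(z)                         # server-side restoration
    loss = task_loss_fn(dnn(restored), targets)   # the DNN loss drives the encoder
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: optimize encoder and decoder (and optionally the DNN) together.
# optimizer = torch.optim.Adam(
#     list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
```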
Here, L_ML denotes the loss function for multi-task learning over the shared latent representation of size c, as shown in Figure 24. As shown in Figure 25, the information extractor, in the form of a shared autoencoder, is trained with the multi-task loss L_ML, while each DNN is trained with its own loss function. To train the essential information extractor for two DNNs, the segmentation DNN is trained to predict classes with higher confidence via the cross-entropy loss L_CE.
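As a hedged sketch only (the exact weighting used in our training is not reproduced here), such a multi-task loss can be written as a weighted sum of the per-DNN losses, for instance for a segmentation DNN and a second DNN:

\[
\mathcal{L}_{\mathrm{ML}} = \sum_{k=1}^{K} w_k\,\mathcal{L}_k
\;=\; w_{\mathrm{seg}}\,\mathcal{L}_{\mathrm{CE}} + w_{2}\,\mathcal{L}_{2},
\]

where the weights \(w_k\) balance the tasks and may be fixed or learned, e.g., via uncertainty weighting.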
System Design
In knowledge distillation, learning knowledge means that the student model mimics the teacher model's probability distribution over the output classes by minimizing a loss defined as the cross-entropy between their output distributions. This knowledge distillation, as shown in Figure 28, reduces the neural network structure of the information extractor from 33 convolution layers to 5 convolution layers while maintaining almost the same compression performance. As a result, the overall information extraction becomes approximately 2× faster than the original extractor, with less than 1% difference in inference accuracy for the same size of extracted information.
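A minimal sketch of such a distillation loss, assuming softmax outputs softened by a temperature and an optional ground-truth term weighted by alpha (both hyperparameters, and the inclusion of the ground-truth term, are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Match the student's output distribution to the teacher's softened
    distribution, optionally mixed with the ordinary task loss on labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # KL divergence equals the cross-entropy up to a constant w.r.t. the student.
    distill = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)
    task = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * task
```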
Thanks to the compression performance of the optimized information extractor and its low coding complexity, the affordable number of simultaneous offloading sessions can be increased. To address this challenge, we design a bisampling method that replicates the latent representation to fit the input size of the DNN to be served, using pooling layers and a convolution layer matched to the number of filters and channels of the DNN's input. In this way, the compressed information is not dispersed, so the extracted information elements remain reasonably close to each other after sampling.
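As a rough, hedged illustration only (not the exact bisampling design), resizing a latent tensor to a target DNN input could combine an adaptive pooling step, which replicates or averages neighboring elements, with a 1×1 convolution that matches the channel count:

```python
import torch
import torch.nn as nn

class LatentResampler(nn.Module):
    """Resize a compact latent tensor to the input resolution and channel count
    a particular DNN expects, keeping neighboring latent elements adjacent."""
    def __init__(self, latent_channels, target_channels, target_height, target_width):
        super().__init__()
        # Adaptive pooling to a larger size effectively replicates elements.
        self.pool = nn.AdaptiveAvgPool2d((target_height, target_width))
        self.proj = nn.Conv2d(latent_channels, target_channels, kernel_size=1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(self.pool(z))

# e.g. map a (B, 16, 32, 32) latent to a (B, 3, 224, 224)-shaped DNN input:
# resampler = LatentResampler(16, 3, 224, 224)
```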
Evaluation
To validate the impact of thinning the decoder on accuracy, we evaluate the performance of the thin decoder compared to the original decoder. Interestingly, our comparison study shows that the traditional JPEG compression technique performs similarly to the state-of-the-art schemes (CAEM and BPG) in terms of DNN performance. On the other hand, CAEM and BPG have much better compression performance in terms of reconstruction error, reaching 0.27 BPP and 0.29 BPP, respectively, at a PSNR of 38.5 dB, whereas JPEG requires a noticeably higher bit rate at the same PSNR.
This is because color information is an important piece of information for depth estimation. This confirms that the essential information for DNN inference depends on the specific objective of the DNN, and that the information to be retained affects the compression performance. We also measure the computation time of our N-epitomizer implementation on the three datasets described in §3.7 for four DNN applications.
Related Work
Discussion
Conclusion
IV Towards Enabling Latency-Guaranteed Deep Neural Network Inference in Next-Generation Cellular Networks
Introduction
Furthermore, we investigate which layer (i.e., user space, kernel, or device driver) is appropriate for implementing each LG-DI module, based on the features of each layer and of each module of LG-DI.
Characteristics of Latency for DNN inferences
Framework for Latency-guaranteed Deep Neural Network Inferences
If the DNN performance degradation is lower than that of the lightweight DNN, or absent, the encoder compresses the input data to STrans and sends the compressed data. The server then recovers the received data, runs the DNN model, and sends the results of the DNN computation back to the device. Unlike the multi-module push-down, the lightweight DNN path consists of two modules: 1) the lightweight DNN model and 2) the DNN computation time estimator.
If compressive offloading cannot maintain the DNN performance, the lightweight DNN path first calculates, through the DNN computation time estimator, the maximum computational complexity that can still guarantee the target delay. After obtaining the maximum computational complexity, the achievable DNN performance is evaluated using the pre-measured trade-off of the lightweight DNN model between computational complexity and DNN performance. This lightweight DNN process seems simple, but a remaining concern is how to control the computational complexity of the DNN model.
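A simplified sketch of this decision flow, with hypothetical helper names (`estimate_bandwidth`, `estimate_compute_time`, `accuracy_at`) standing in for the actual LG-DI modules:

```python
def choose_inference_path(target_latency_s,
                          compressed_size_bits,
                          offload_accuracy,
                          min_acceptable_accuracy,
                          estimate_bandwidth,      # hypothetical: -> bits per second
                          estimate_compute_time,   # hypothetical: complexity -> seconds
                          accuracy_at,             # hypothetical: complexity -> accuracy
                          complexity_levels):      # candidate lightweight model sizes
    """Pick compressive offloading when it meets the deadline and accuracy target;
    otherwise fall back to the largest on-device lightweight DNN that still fits
    within the target latency."""
    # 1) Compressive offloading: transmission time from the estimated bandwidth.
    tx_time = compressed_size_bits / estimate_bandwidth()
    if tx_time <= target_latency_s and offload_accuracy >= min_acceptable_accuracy:
        return ("offload", offload_accuracy)

    # 2) Lightweight on-device DNN: the largest complexity meeting the deadline,
    #    judged by the pre-measured complexity/accuracy trade-off.
    feasible = [c for c in complexity_levels
                if estimate_compute_time(c) <= target_latency_s]
    if not feasible:
        return ("none", 0.0)
    best = max(feasible, key=accuracy_at)
    return ("on_device", accuracy_at(best))
```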
Implementation of LG-DI
Thus, the bandwidth estimator must be implemented in the kernel to reduce the latency overhead of obtaining information about network conditions.
Conclusion
Krishnamachari, "Fast and accurate streaming CNN inference via communication compression on the edge," in IEEE/ACM International Conference on Internet-of-Things Design and Implementation (IoTDI), 2020.
"An end-to-end learned approach," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
Cipolla, "Multi-task learning using uncertainty to weight losses for scene geometry and semantics," in IEEE CVPR, 2018.
Wang et al., "Photo-realistic single-image super-resolution using a generative adversarial network," in IEEE CVPR, 2017.
Metaxas, "StackGAN: Text-to-photo-realistic image synthesis with stacked generative adversarial networks," in IEEE ICCV, 2017.
Lee, "MobileNet-SSDv2: An improved object detection model for embedded systems," in IEEE International Conference on Systems Science and Engineering (ICSSE), 2020.