Offloading neural network inference tasks to a powerful server over a network has become the key enabler for resource-constrained mobile and embedded devices to support compute-intensive DNN (deep neural network) services such as augmented and mixed reality, video analytics, remote surveillance, and network-assisted self-driving. While such offloading makes heavy inference possible for those devices, it brings new challenges in completing DNN inferences on time under dynamic network conditions.
Recent approaches toward making the offloading process efficient over a network fall broadly into three categories. The first approach [12, 50, 51] improves the compression efficiency of the data to offload (i.e., reduces the data volume to offload) by redesigning conventional codecs with the power of neural networks. The second approach [52–54] exploits the data's contents to incrementally or partially offload the data (e.g., offloading only regions of interest), which has been actively adopted for image and video analytics. The last approach [6, 55, 56], often called split computing, offloads the intermediate output of a neural network, which is smaller than the original data to deliver over the network.
Although these approaches are known to reduce the data volume to offload, they are far from answering the most fundamental question: how to minimize the offloaded data without losing the information essential for making inferences (i.e., without compromising inference accuracy). In other words, this is the question of an ideal offloading in which only the data essential for making inferences is offloaded over the network. We call such ideal offloading semantic offloading. Semantic offloading can be understood as a representative instantiation of semantic communications [13–15], which aim at sending only the meaningful bytes sufficient to achieve the goal of communication instead of sending all the bytes. The challenge of semantic offloading is tied to a more practical engineering question: how to extract the most informative data within limits on the data size and the time for making inferences, which plays a critical role in making the most accurate and timely inferences over a network with fluctuating bandwidth.
In this work, we realize semantic offloading by designing N-epitomizer (Neural epitomizer), which extracts and offloads the essential information for making inferences far more effectively than state-of-the-art offloading techniques [1, 53, 55]. Figure 21 illustrates how N-epitomizer works with an example of vision data to offload. Unlike conventional data encoding techniques, which are not fully aware of how the encoded data will be used for inference, N-epitomizer makes the offloaded data minimal by aggressively discarding information unimportant to making inferences. Thus, it makes offloading much more reliable and timely, even under highly varying network conditions.
Figure 21: Concept of a semantic offloading system. The system extracts and sends minimal yet essential information for highly accurate DNN inference over a network.

To realize such highly efficient information extraction, we design the data encoder of N-epitomizer based on an autoencoder trained to scale its output size as needed to meet the inference time requirement over a network. The encoder of N-epitomizer is unique in that it is trained specifically to maximize the inference accuracy while jointly meeting the demand on the data volume to deliver, in stark contrast to most autoencoders, which aim at complete restoration of the original data. By exploiting this encoder design, we ensure the inference accuracy of N-epitomizer and its adaptability to network changes. As a result, N-epitomizer enables timely delivery of data for inference, including its encoding and decoding time, by judiciously deciding the data transmission volume based on the current network bandwidth and by leveraging a set of optimization and customization techniques applied to the encoder (e.g., knowledge distillation [7, 57]) and to the decoder (e.g., our own decoder slimming).
We find that N-epitomizer with such optimizations runs about 2× faster than its naïve implementation with less than 1% difference in inference accuracy for the same transmission volume.
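To make the joint objective concrete, the sketch below shows, in PyTorch-style Python, one way such a task-aware autoencoder could be trained: the encoder and decoder are optimized against the downstream task loss plus a penalty on the transmitted latent's size, while the pretrained inference DNN stays frozen. The module definitions, the L1 size proxy, and the lambda_rate weight are illustrative assumptions, not N-epitomizer's actual implementation.

```python
# Minimal sketch (PyTorch) of a task-aware autoencoder objective:
# train the encoder/decoder against the downstream task loss plus a
# penalty on the encoded size, instead of a pixel-reconstruction loss.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, latent_ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_ch, 3, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class TinyDecoder(nn.Module):
    def __init__(self, latent_ch=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.net(z)

encoder, decoder = TinyEncoder(), TinyDecoder()
task_model = nn.Sequential(nn.Conv2d(3, 19, 1))  # stand-in for a frozen, pretrained DNN
for p in task_model.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
task_loss_fn = nn.CrossEntropyLoss()
lambda_rate = 0.01  # weight of the size/rate penalty (illustrative)

def train_step(image, label):
    z = encoder(image)        # compact latent to be transmitted
    recon = decoder(z)        # server-side reconstruction
    pred = task_model(recon)  # inference on the reconstruction
    # Task-aware objective: accuracy term + proxy for the transmitted volume
    # (an L1 sparsity penalty on the latent here; a real system would use an
    # entropy/bitrate estimate).
    loss = task_loss_fn(pred, label) + lambda_rate * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```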
We further demonstrate that N-epitomizer can be extended to support multiple DNNs by extracting the essential information that preserves the inference accuracy of all of those DNNs. We experimentally confirm that N-epitomizer can extract and offload the union of the essential information needed by each DNN, significantly reducing the total data transmission volume when running multiple DNNs. This distinguishes N-epitomizer from offloading techniques strongly bound to a single task, as is often the case with split computing of a DNN inference.
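The same idea extends to multiple DNNs by sharing one encoder/decoder across several frozen task heads and summing their losses, so the transmitted latent retains the union of the information each task needs. The sketch below reuses the components from the previous example; the task_heads dictionary and uniform loss weighting are illustrative assumptions.

```python
# Minimal sketch extending the single-task objective to multiple DNNs:
# one shared encoder/decoder trained against the sum of the task losses.
def multi_task_step(image, labels, encoder, decoder, task_heads, loss_fns, opt,
                    lambda_rate=0.01):
    z = encoder(image)
    recon = decoder(z)
    loss = lambda_rate * z.abs().mean()    # shared size/rate proxy
    for name, head in task_heads.items():  # e.g., {"seg": ..., "depth": ...}
        loss = loss + loss_fns[name](head(recon), labels[name])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```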
We validate the performance of N-epitomizer on four popular vision tasks with three large datasets: Cityscapes [58], ImageNet [59], and Flickr30k [60]. N-epitomizer achieves a significant gain in compression performance: 21×, 77×, 192×, and 8× higher than JPEG and 20×, 55×, 86×, and 7× higher than the state-of-the-art GRACE [1] for semantic segmentation, depth estimation, classification, and image captioning, respectively, while keeping the inference accuracy the same. Thanks to this ability, we find that N-epitomizer can guarantee the end-to-end latency (i.e., from encoding to receiving inference results) even under a highly varying cellular network.
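To illustrate how a latency requirement translates into a transmission-volume budget, the sketch below computes the largest payload that still meets an end-to-end deadline once encoding, decoding, inference, and round-trip times are subtracted from it. All timing constants and the bandwidth figure in the example are assumed values for illustration, not measurements from our system.

```python
# Back-of-the-envelope sketch: turn a latency deadline into a payload budget.
def max_payload_bytes(deadline_s, bandwidth_bps, t_encode_s, t_decode_s, t_infer_s,
                      rtt_s=0.0):
    """Largest payload (in bytes) that still meets the end-to-end deadline."""
    transmit_budget_s = deadline_s - (t_encode_s + t_decode_s + t_infer_s + rtt_s)
    if transmit_budget_s <= 0:
        return 0
    return int(transmit_budget_s * bandwidth_bps / 8)

# Example: a 100 ms deadline over a 20 Mbps link with 10 ms encode, 5 ms decode,
# 20 ms inference, and a 30 ms round trip leaves ~87.5 KB for the encoded data.
print(max_payload_bytes(0.100, 20e6, 0.010, 0.005, 0.020, rtt_s=0.030))
```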
In summary, this work makes the following contributions:
• We present the first realization of semantic offloading for DNN inferences using our essential information extraction method, which is generally applicable to offloading a single DNN or multiple DNNs (§3.5).
• We optimize N-epitomizer to expedite its encoding, transmission, and decoding time, enabling timely inferences even under a dynamic or low-bandwidth network (§3.6).
• We experimentally validate that N-epitomizer outperforms conventional and state-of-the-art inference offloading techniques by a large margin, showing its competency as a neural inference enabler for resource-constrained devices (§3.7).