
NeuroValve: Energy-efficient Deep Learning Inference on Mobile Devices Exploiting Neural





Full text

Figure 4: Scaled EDP in MobileNet V3-Small and EfficientNet-B0 w.r.t. CPU frequency level (left) and memory bandwidth level (right) on Pixel 3.

To address these challenges, we conduct an extensive measurement study of energy efficiency when running multiple CNN models on a mobile device while adjusting the CPU frequency and memory bandwidth together. We find that each compute block's MAR (memory access rate) critically affects the energy-delay trade-off when scaling the CPU frequency and memory bandwidth.

Figure 1: NeuroValve architecture: Analyzer predicts MAR of each block of a given CNN model and Governor performs per-block frequency assignment.

Mobile AP Architecture

CNN Design Leveraging Blocks

The expansion size is the number of filters produced by the 1×1 input convolutional layer of a block, as in Figure 2. Because it determines the depth of the following tensors and the number of weights in all subsequent layers, the FLOPs increase in proportion to it. The number of memory accesses, on the other hand, does not increase linearly until the tensors no longer fit in the cache.

Like the expansion size, the feature map size increases both the FLOPs and the number of memory accesses, since larger feature maps require more memory to store the input and intermediate tensors of all layers in a block. In contrast, increasing the output filter size results in a larger increase in FLOPs than in memory accesses, and the same holds for the kernel size.
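To make the scaling behavior above concrete, the following is a minimal sketch of a FLOPs estimate for an inverted bottleneck (MBConv) block. The layer structure (1×1 expansion, K×K depthwise, 1×1 projection) follows Figure 2, while the function name, the omission of squeeze-and-excitation layers, and the example numbers are illustrative assumptions.

```python
# Rough FLOPs estimate for an MBConv block (1x1 expand -> KxK depthwise -> 1x1 project).
# Illustrative only: squeeze-and-excitation layers, biases, and activations are ignored.
def mbconv_flops(in_ch, expansion, out_ch, feat, kernel, stride=1):
    out_feat = feat // stride                             # spatial size after the strided depthwise conv
    expand = feat * feat * in_ch * expansion              # 1x1 conv: in_ch -> expansion filters
    depthwise = out_feat * out_feat * expansion * kernel * kernel
    project = out_feat * out_feat * expansion * out_ch    # 1x1 conv: expansion filters -> out_ch
    return 2 * (expand + depthwise + project)             # ~2 FLOPs per multiply-accumulate

# FLOPs grow roughly in proportion to the expansion size ...
print(mbconv_flops(24, 96, 40, 28, 3))
print(mbconv_flops(24, 192, 40, 28, 3))   # about 2x the FLOPs
# ... while memory accesses stay nearly flat until the tensors stop fitting in the cache.
```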

In the fully connected layer, by contrast, the increase in memory accesses exceeds the increase in FLOPs because the layer requires a large number of weights. We measure the number of last-level cache (LLC) misses with the PMU (performance monitoring unit) physically located inside the AP.
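As a sketch of how such counters could be sampled on an off-the-shelf Android device, the snippet below shells out to simpleperf. This is an assumption made for illustration: the paper reads the PMU registers directly, and the exact event name for LLC misses is SoC-dependent.

```python
import subprocess

# Hypothetical helper: sample cache-miss counts for a running inference process.
# Assumes simpleperf is available on the device; "cache-misses" stands in for
# the LLC-miss event, whose exact name depends on the SoC.
def sample_cache_misses(pid, seconds=1):
    cmd = ["simpleperf", "stat",
           "-e", "cache-misses",
           "-p", str(pid),
           "--duration", str(seconds)]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return out  # raw counter report; parse the "cache-misses" line as needed
```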

Energy-Delay Trade-off in CNN’s Blocks

These two devices are chosen because they have different memory specifications, which may lead to distinct CPU frequency and memory bandwidth control behavior during CNN inference. As shown in Figure 3a, measured on Pixel 3 for Block 11 of MobileNet V3-Small, the latency gradually decreases as the CPU frequency level increases. Specifically, the figures on the left show the scaled EDP at different CPU frequency levels with the minimum memory bandwidth.

In contrast, the figures on the right show the scaled EDP with varying memory bandwidth at the maximum CPU frequency. In Block 7 of MobileNet V3-Small, the EDP decreases as the CPU frequency increases, while in Block 11, the EDP increases again beyond CPU frequency level 8, an intermediate level rather than the highest one. The reason is that the additional power consumption of Block 11 outweighs the benefit of further increasing the CPU frequency.

In summary, Block 7 of MobileNet V3-Small and Block 6 of EfficientNet-B0 are more affected by CPU frequency than by memory bandwidth in terms of EDP, while Block 11 and Block 2 are more affected by memory bandwidth in the respective models. We see that even within one CNN model, each block has a different energy-delay balance depending on the CPU frequency and memory bandwidth.
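The block-dependent trade-off summarized above can be reproduced with a toy model: if a block's latency has a memory-bound floor while CPU power grows superlinearly with frequency, the EDP bottoms out at an intermediate frequency level. The numbers below are made up for illustration and are not measurements from the paper.

```python
# Toy EDP curve: energy = power * delay, EDP = energy * delay.
freqs = [0.8, 1.2, 1.6, 2.0, 2.4]                   # GHz, hypothetical CPU levels
delay = [30.0 / f + 5.0 for f in freqs]             # ms: compute part ~1/f plus a memory-bound floor
power = [0.2 + 0.3 * f * f for f in freqs]          # W: superlinear in frequency (made-up model)
edp = [p * d * d for p, d in zip(power, delay)]

best = min(range(len(freqs)), key=lambda i: edp[i])
print(freqs[best])   # -> 1.6, an intermediate level, mirroring the Block 11 behavior
```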

Table 1: Mobile device specifications

Memory Access Rate Analysis

MAR for Models

We analyze the MAR values of the blocks in MobileNet V3-Small and EfficientNet-B0 (whose architectures are shown in Table 2). The higher the MAR value, the more memory accesses are required per FLOP. With a high memory access rate, the relatively slow memory access time can become the bottleneck, so increasing the CPU frequency only increases power consumption unnecessarily.

Conversely, with a low memory access rate, excessively increasing the memory bandwidth only wastes energy. Interestingly, both CNN models show quite similar patterns: relatively low MAR in the middle of the model and high MAR in the front and back of the model. Another observation is that the MAR values on the Pixel 3a are always higher than on the Pixel 3 for all blocks of the two models.

Although the FLOPs estimated from the block parameters of the CNN model alone would be the same on both devices, the actual MAR on the Pixel 3a is always higher than on the Pixel 3. This finding implies that the specifications of the mobile AP must be taken into account to accurately predict the memory access behavior of a block.
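As a minimal sketch of the quantity analyzed here, MAR can be read as the number of memory accesses (approximated by LLC misses) per FLOP of a block. The formula and the numbers below are illustrative assumptions, not the paper's exact definition.

```python
# Memory access rate of a block: memory accesses (~ LLC misses) per FLOP.
def mar(llc_misses, flops):
    return llc_misses / flops

# Hypothetical counters for the same block measured on two devices:
flops = 12e6
print(mar(2.0e5, flops))   # device whose caches hold the tensors longer -> fewer misses, lower MAR
print(mar(3.5e5, flops))   # device with weaker memory specifications -> more misses, higher MAR
```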

MAR for Block Parameters

We perform extensive measurements with the Pixel 3 and Pixel 3a for all parameters in Table 3, which are typically used by blocks in DNNs, including the state-of-the-art CNN models. In particular, Figures 6a and 6b show that the MAR values measured on the two phones generally increase with the expansion size and the feature map size, respectively. Moreover, once the parameter exceeds a certain threshold (for the expansion size, for example, 128 on the Pixel 3a and 196 on the Pixel 3), the MAR increases rapidly.

We further study the MAR values with different kernel, output filter, and stride sizes, as shown in Figures 6c, 6d, and 6e. From these, we find that the MAR values show a decreasing pattern as the parameters grow, because the FLOPs increase much more than the number of memory accesses, as discussed in §2.2. This result further confirms that, for these parameters, the increase in FLOPs is much greater than the increase in LLC misses.

We observe that each parameter is strongly correlated with the MAR value of the devices. We also notice that the MAR value of the same block varies depending on the differences in the memory specification of the devices.

Figure 6 compares the MAR measurements between the two phones with respect to various block parameters.

Improving Energy Efficiency with MAR

Motivated by our observations, we develop NeuroValve, a joint CPU clock and memory bandwidth control framework that optimizes the energy efficiency of on-device CNN inference. The main idea of NeuroValve is to eliminate excessive CPU/memory usage by predicting the energy and latency consumed in each block of a given CNN model, based on extensive measurements taken across different CNN block configurations. First, it is not trivial to build a model that predicts the CPU clock and memory bandwidth requirements of each block, because they can vary greatly across block configurations and hardware specifications, as discussed in § III.

In the model loading phase, a CNN model is loaded and the Analyzer module reads and analyzes all layers of the model. In the model inference phase, the Governor module predicts each block's power consumption and delay; these predictions are made continuously using the MAR table created in the model loading phase.

A moving average of the actual MAR is also obtained from the system to take the impact of other tasks into account. Based on these predictions, the Governor module adjusts the CPU frequency and memory bandwidth of the system.
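A compact skeleton of this two-phase flow is sketched below. The class and field names, the smoothing factor, and the way the predicted MAR is combined with the measured moving average are assumptions for illustration; the structure (an Analyzer that fills a per-block MAR table at load time and a Governor that consults it together with the measured MAR at inference time) follows the description above.

```python
# Hypothetical skeleton of NeuroValve's two phases (names and constants are illustrative).
class Analyzer:
    def build_mar_table(self, model_blocks):
        # Model loading phase: predict a MAR value for every block from its parameters.
        return {b.index: self.predict_mar(b) for b in model_blocks}

    def predict_mar(self, block):
        ...  # e.g., the decision-tree-based predictor mentioned in the overhead discussion


class Governor:
    def __init__(self, mar_table, alpha=0.3):
        self.mar_table = mar_table
        self.alpha = alpha          # assumed weight of the measured MAR

    def block_mar(self, index, measured_mar):
        # Model inference phase: combine the predicted MAR with the moving average of
        # the actual MAR reported by the system, to account for co-running tasks.
        return (1 - self.alpha) * self.mar_table[index] + self.alpha * measured_mar
```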

Analyzer

Second, even if the resource requirements have been predicted, the predictions may become inaccurate when the inference runs alongside other tasks that share these resources. We propose a two-phase approach to address these design challenges, consisting of a model loading phase and a model inference phase.

Governor

We feed the Analyzer module with input obtained from the inference framework's interpreter so that it can extract block parameters by parsing tensor information (e.g., the shapes of the input and output tensors, the tensor format, and the type of operation). With these inputs, the Governor calculates, at the start of each block, the CPU clock frequency and memory bandwidth that are optimal in terms of WEDP, based on its predictions of power consumption and latency. The Governor then uses this output to adjust the CPU clock frequency and memory bandwidth.
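The per-block decision described above can be sketched as a search over the available CPU frequency and memory bandwidth levels using the predicted power and delay. The weighted-EDP form used below (energy^w · delay^(1-w)) and the predictor interfaces are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical per-block frequency/bandwidth selection (illustrative only).
def select_levels(block_params, mar, cpu_levels, bw_levels, predict_power, predict_delay, w=0.5):
    best, best_score = None, float("inf")
    for f in cpu_levels:
        for bw in bw_levels:
            p = predict_power(block_params, mar, f, bw)   # predicted power (W)
            d = predict_delay(block_params, mar, f, bw)   # predicted latency (s)
            energy = p * d
            score = (energy ** w) * (d ** (1 - w))        # assumed weighted EDP; w trades energy vs. delay
            if score < best_score:
                best, best_score = (f, bw), score
    return best   # (CPU frequency level, memory bandwidth level) applied at the start of the block
```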

For performance comparison, we use the schedutil governor [8] for the CPU and the bw_hwmon governor for memory bandwidth, which are the default governors in recent Android kernels. The bw_hwmon governor samples the bandwidth over each predefined time interval and makes a decision based on the sampled bandwidth. SysScale provides two operating modes (i.e., low performance and high performance) with predefined memory bandwidths based on mobile benchmark workloads.

It periodically samples the PMU counter values, averaged over each predefined time interval. We obtain two thresholds (LLC stalls, LLC misses) for each memory bandwidth level when running CNN inferences for a fair comparison.

Figure 9: NeuroValve implementation on the Android platform.

Energy Efficiency for CNN Inference

To avoid excessive resource usage, it defines the thresholds as the mean plus the standard deviation of the PMU counter values, and it switches between the high performance and low performance modes when the averaged PMU value crosses these thresholds. We set the time interval for sampling PMU values to 30 ms, which is the default for SysScale.
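As a sketch of the thresholding just described (the switching rule is paraphrased from the text and may differ from SysScale's actual implementation), the threshold for each PMU counter and bandwidth level can be derived from the sampled counter values:

```python
import statistics

# Illustrative threshold computation for the SysScale-style baseline: the threshold for a
# PMU counter (e.g., LLC stalls or LLC misses) is the mean of the sampled values plus
# their standard deviation, with samples taken every 30 ms as noted above.
def make_threshold(samples):
    return statistics.mean(samples) + statistics.stdev(samples)

llc_miss_samples = [1.1e5, 1.4e5, 1.2e5, 1.3e5]   # hypothetical per-interval PMU readings
print(make_threshold(llc_miss_samples))            # the mode switches when the running average crosses this
```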

Figure 12 shows the difference between NV (w=0.5) and SysScale in how they allocate CPU frequency and memory bandwidth across the four CNN models. SysScale starts with the maximum bandwidth and keeps adapting to a lower bandwidth at each interval, which introduces inefficiency in the adaptation. NV (w=0.5), on the other hand, can identify which blocks may suffer from a memory bottleneck and thus avoids allocating excessive memory bandwidth.

As a result, NV (w=0.5) achieves a greater improvement in EDP through its precise control, which reflects the workload characteristics of CNN inference. Similar observations can be made for the Pixel 3a, as in Figure 13, which shows the performance comparison with two CNN models over the entire range.

Energy Efficiency for Deep Learning Application

System Overhead

As shown in Table 4, thanks to the lightweight computation of our decision-tree-based design, the total time cost of NeuroValve is about 1.3 ms, which is only 2.24% of the total CNN inference time. For example, a self-driving car must constantly detect obstacles while deciding the direction or speed of the car.

Harmonized DVFS. The use of DVFS to optimize memory access for energy efficiency was investigated in [13], where the authors studied dynamic frequency and voltage scaling of the memory controller, memory channels, and DRAM devices.

We found that adaptive CPU frequency scaling and memory bandwidth control per CNN block can significantly improve energy efficiency. We evaluated the effectiveness of NeuroValve with several state-of-the-art CNN models and a deep learning application through an implementation on off-the-shelf smartphones.

Table 4: System overhead of NeuroValve in time.

Figures

Figure 1: NeuroValve architecture: Analyzer predicts MAR of each block of a given CNN model and Governor performs per-block frequency assignment.
Figure 2: Typical CNN architecture and inverted bottleneck block (MBConv) with its configurable parameters.
Figure 3: Energy consumption, delay, and EDP of Block 11 in MobileNet V3-Small on Pixel 3: the lower the better.
Table 1: Mobile device specifications