
Towards Efficient RRAM-based Quantized Neural Networks Hardware: State-of-the-art and Open Issues


Item Type: Conference Paper
Authors: Krestinskaya, O.; Zhang, L.; Salama, Khaled N.
Citation: Krestinskaya, O., Zhang, L., & Salama, K. N. (2022). Towards Efficient RRAM-based Quantized Neural Networks Hardware: State-of-the-art and Open Issues. 2022 IEEE 22nd International Conference on Nanotechnology (NANO). https://doi.org/10.1109/nano54668.2022.9928590
Eprint version: Post-print
DOI: 10.1109/NANO54668.2022.9928590
Publisher: IEEE
Rights: This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE.
Download date: 2023-12-09 23:12:41
Link to Item: http://hdl.handle.net/10754/686303


Towards Efficient RRAM-based Quantized Neural Networks Hardware: State-of-the-art and Open Issues.

O. Krestinskaya, L. Zhang and K.N. Salama

King Abdullah University of Science and Technology, Saudi Arabia

Abstract— The increasing amount of data processed on the edge and the demand for reducing the energy consumption of large neural network architectures have initiated the transition from traditional von Neumann architectures towards in-memory computing paradigms. Quantization is one of the methods to reduce the power and computation requirements of neural networks by limiting bit precision. Resistive Random Access Memory (RRAM) devices are great candidates for Quantized Neural Network (QNN) implementations. As the number of possible conductive states in RRAMs is limited, a certain level of quantization is always considered when designing RRAM-based neural networks. In this work, we provide a comprehensive analysis of state-of-the-art RRAM-based QNN implementations, showing where RRAMs stand in terms of satisfying the criteria of efficient QNN hardware. We cover hardware and device challenges related to QNNs and show the main unsolved issues and possible future research directions.

Index Terms—RRAM, QNN, Hardware, Quantization

I. INTRODUCTION

The exponentially growing amount of processed data, IoT applications, and on-edge processing have imposed the requirement to develop energy-efficient, low-power neuromorphic hardware. RRAM devices are one of the most promising technologies for neuromorphic circuits, neuro-inspired architectures, and neural network accelerators. RRAMs allow the efficient crossbar-based implementation of neural network weights and multiply-and-accumulate (MAC) operations, leading to higher energy efficiency, smaller on-chip area, and higher processing speeds. This enables the transition from traditional von Neumann architectures towards in-memory computing paradigms. The off-chip memory bottleneck in traditional architectures has motivated the closer integration of memory and processing units to avoid frequent data transfer between the off-chip memory and the processor [1].

Quantization in hardware saves memory space and reduces data movement and latency for arithmetic operations [2]. Supporting full-precision computations and complex arithmetic for large neural networks in low-power hardware is challenging. Therefore, QNNs are an efficient method for low-power AI applications on the edge. Currently available RRAM devices still have limited precision. Thus, quantizing the computation around RRAM crossbars and implementing RRAM-based QNNs is a reasonable approach. Compared to traditional in-memory computing, RRAM provides possibilities for multi-level storage, scalability, and high computational density. For example, QNNs implemented with SRAM devices have a cell size of 124F² (F is the technology feature size), compared to RRAM with a size of <10F² (e.g., 4F² for RRAM cells and 12F² for 1T1R cells [1]).
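To make the role of limited RRAM precision concrete, the following minimal sketch (not from the paper; the function name, conductance range, and differential-pair mapping are illustrative assumptions) quantizes full-precision weights to the 2^n conductance levels of an n-bit cell:

```python
# Hypothetical sketch: mapping full-precision weights onto the limited number of
# conductance states of an RRAM cell. Values and mapping scheme are assumptions.
import numpy as np

def quantize_to_conductance(weights, n_bits=3, g_min=1e-6, g_max=1e-4):
    """Uniformly quantize |weights| to 2**n_bits levels and map them to conductances.

    Negative weights are handled with a differential pair (G_pos, G_neg),
    one common crossbar mapping; other schemes (offset column, sign bit) exist.
    """
    levels = 2 ** n_bits
    w_max = np.max(np.abs(weights))
    q = np.round(np.abs(weights) / w_max * (levels - 1))   # integer level 0..levels-1
    g = g_min + q / (levels - 1) * (g_max - g_min)         # conductance per cell
    g_pos = np.where(weights >= 0, g, g_min)               # "positive" column
    g_neg = np.where(weights < 0, g, g_min)                # "negative" column
    return g_pos, g_neg

# Example: a 4x4 weight tile quantized to 3-bit cells
w = np.random.randn(4, 4)
g_pos, g_neg = quantize_to_conductance(w, n_bits=3)
```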

In this work, we analyze state-of-the-art RRAM-based QNN implementations, including performance, energy efficiency, bit-width of network weights and computations, and training methods. We cover hardware and device challenges of QNNs, such as the lack of variable bit-precision hardware and bit-width reconfigurability, training and learning limitations, the demand for large-scale integration with traditional architectures, and the limitations of existing RRAM devices. We also show the main unsolved issues and summarize possible future research directions.

Fig. 1: Bits per RRAM cell and weight quantization in state-of-the-art RRAM-based QNN architectures.

II. BACKGROUND

1) QNN basics: There are two approaches to quantize a neural network: post-training quantization (PTQ) of a full-precision model and quantization-aware training (QAT). QAT usually achieves higher accuracy than PTQ. Applying the traditional stochastic gradient descent (SGD) training algorithm to QNNs is challenging due to the stair-like nature of the quantization function. Therefore, the straight-through estimator (STE) technique is used for training, which still requires full precision for the update computation.
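As an illustration of QAT with the STE, the sketch below performs one training step for a toy linear layer in NumPy; the quantizer, bit-width, and learning rate are illustrative assumptions, not the setup of any cited work. The forward pass uses quantized weights, while the gradient is passed through the quantizer unchanged and applied to full-precision shadow weights:

```python
import numpy as np

def quantize(w, n_bits=2):
    """Uniform symmetric quantizer (stair-like, zero gradient almost everywhere)."""
    scale = np.max(np.abs(w)) / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

# One QAT step for a single linear layer y = x @ w_q, loss = 0.5 * ||y - t||^2
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))          # batch of inputs
t = rng.standard_normal((8, 3))          # targets
w = rng.standard_normal((4, 3)) * 0.1    # full-precision "shadow" weights

w_q = quantize(w, n_bits=2)              # forward pass uses quantized weights
y = x @ w_q
grad_y = y - t                           # dL/dy
grad_wq = x.T @ grad_y                   # dL/dw_q
grad_w = grad_wq                         # STE: treat the quantizer as identity backward
w -= 0.01 * grad_w                       # SGD update on the full-precision weights
```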

2) RRAM basics and related challenges: RRAMs allow achieving high architectural density and energy-efficient computation, but are prone to non-idealities, such as device-to-device (D2D) and cycle-to-cycle (C2C) variations, low endurance, a limited number of conductive states, conductance drift over time, and device faults. Low-bit RRAM devices have fewer variations [2], but lead to lower crossbar density and area efficiency. Currently available devices have 2⁶ conductive states and a Ron/Roff ratio of 10–10³ with Ron = 10³–10⁴. The set/reset time and write energy reach as low as 50 ns and 2 nJ [1], the endurance reaches up to 10⁶ cycles, and the retention time up to 10⁸ s. RRAM-based 1T1R or 1T1S crossbars for MAC operations can have linear non-idealities, such as wire resistance and source/sink resistances contributing to IR drop, and non-linear non-idealities, such as the exponential dependence of current on voltage, the non-linear behavior of selector devices/transistors and RRAMs, and non-linear asymmetric switching dynamics leading to errors [1]. The main issues of RRAM-based crossbars are high write latency and write cost.
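A rough way to reason about how D2D variation corrupts a crossbar MAC is to perturb the programmed conductance matrix with a random spread. The sketch below does this with an assumed lognormal spread and ignores IR drop and the other non-idealities listed above; it is a simplification for intuition, not a device model from the paper:

```python
import numpy as np

def crossbar_mac(v_in, g, sigma_d2d=0.05, rng=None):
    """Ideal analog MAC I = V @ G with an assumed lognormal device-to-device spread on G."""
    rng = rng or np.random.default_rng()
    g_real = g * rng.lognormal(mean=0.0, sigma=sigma_d2d, size=g.shape)
    return v_in @ g_real                          # column currents (one MAC per column)

rng = np.random.default_rng(1)
g = rng.uniform(1e-6, 1e-4, size=(64, 16))        # 64x16 tile of programmed conductances
v = rng.uniform(0.0, 0.2, size=(1, 64))           # read voltages on the rows
i_ideal = v @ g
i_noisy = crossbar_mac(v, g, rng=rng)
relative_error = np.abs(i_noisy - i_ideal) / np.abs(i_ideal)
```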


TABLE I: RRAM QNN architectures

Work | CMOS, RRAM | Tiles | RRAM bits | Bits W, I, O (*1) | Network | Training, bits

Binary Neural Networks
[3] | Ti/HfO2 | - | 1 | 1b, 1b, 1b | 4 FC | QAT, 32b
[4] | 45nm | 128x128 | 1 | 1b, 1b, 1b | 6 Conv, 3 FC | PTQ, 32b
[5] | 65nm | - | 1 | 1b, -, - | 4 Conv, 3 FC | QAT, 6b

Higher precision networks (mostly fixed-point arithmetic)
[6] | 65nm | 128x128 | 2 | up to 10b, 8b (s-1b), 5b | LeNet | -
[7] | 45nm | 10-256 (*2) | 1-8b (*2) | 1-8b, 1-8b, 1-8b | LeNet, ResNet, VGG-16 | QAT, 32b
[8] | - | 32x32 | 3b | 3b, -, - | 2 FC | PTQ, 32b
[9] | 45nm | 256x256 | 1b | up to 8b, - (s-2b), 6b | LeNet, ResNet, VGG-16 | PTQ, 32b
[10] | 22nm | 512x512 | 1b | 2-4b, - (s-1-2b), 6-11b | ResNet-20 | -

Accelerators (fixed-point arithmetic)
[11] | 32nm, TiO/HfO | 128x128 | 2b | 16b, 16b (s-1b), 16b | VGG | PTQ
[12] | TiO2-x | 128x128 | 4b | 16b, 16b (s-1b), 16b | AlexNet, VGG | on-chip
[13] | 32nm | 128x128 | 2b | 16b, 16b (s-1b), 16b | VGG, LSTM | PTQ, 32b
[14] | 65nm | 256x64 | 4b | 8b, 8b (s), 8b | VGG-16, MobileNet | PTQ, 32b

Fabricated RRAM crossbar-based architectures
[15] | 130nm, TaOx/HfOx | 126x16 | 3b | 3b, 8b (s-1b), 8b | 2 Conv, 1 FC | hybrid (*4)
[16] | HfO2 | 128x64 | - | -, -, - | 2 FC | on-chip
[17] | TaOx | 128x64 | 4b | 4b, 8b, 16b | 2 Conv, 1 FC | on-chip
[18] | 55nm | 156x512 | 1b | 3b, - (s-1-2b), 4b | - | PTQ
[19] | 90nm, HfO2 | 128x64 | 1b | 1b, 1b (*3), 3b | 5 FC | QAT
[20] | 2um, HfO2 | 128x64 | 6b | 2b, 4b, 6b | 1 FC | -
[21] | 130nm | 784x100 | 1b | 1b (T), - (s-1b), 8b | 3 FC | -

Notes: s - serial input encoding with a 1b or 2b DAC; *1 W, I, O - weights, inputs, outputs; T - ternary; FC - fully connected layers; Conv - convolution layers; *2 several experiments; *3 except input bits; *4 pretrained with PTQ, 32b.

III. RRAM-BASED QNN HARDWARE

In the most common RRAM-based QNN implementations, only the MAC operation is performed in the analog domain, while the other computations and control are digital. Such architectures include input encoding (DACs), an analog 1T1R-based RRAM crossbar, multiplexers (MUX) and trans-impedance amplifiers (TIA) for current-to-voltage conversion, sample-and-hold (S&H) circuits, ADCs for converting MAC outputs to the digital domain, and shift-and-add (S&A) operations [1]. In RRAM architectures, we focus on the quantization of network weights represented by n-bit RRAMs. Fig. 1 summarizes existing RRAM-based QNNs in terms of weight quantization relative to device precision. It is also important to consider input quantization (DAC precision) and output quantization (ADC precision).

The calculation of partial sums in QNNs leads to a large ADC overhead. Partial sums are used for large weight matrices, bit slicing, and input slicing. QNN weight matrices larger than the number of rows in crossbar tiles are mapped to several RRAM cells to calculate partial sums, followed by the addition of the outputs from several cells. In bit slicing, high-precision QNN weights are stored in several RRAM cells (usually in different tiles), and then the partial sums from different tiles are shifted and added together. In input slicing, high-precision inputs are streamed to a crossbar sequentially using low-bit DACs over several cycles (time encoding), and then summed together [2].
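The sketch below illustrates bit slicing and input slicing in software: assumed 8-bit weights are split into 2-bit slices (as if stored in 2-bit cells across tiles), 4-bit inputs are streamed one bit per cycle, and the per-tile partial sums are recombined by shift-and-add. The bit-widths and array sizes are illustrative assumptions:

```python
import numpy as np

def slice_bits(values, n_bits, bits_per_slice):
    """Split unsigned integers into slices of bits_per_slice bits, LSB slice first."""
    return [(values >> s) & ((1 << bits_per_slice) - 1)
            for s in range(0, n_bits, bits_per_slice)]

# 8-bit unsigned weights stored as 2-bit "cells", 4-bit inputs streamed 1 bit per cycle
rng = np.random.default_rng(0)
W = rng.integers(0, 256, size=(16, 8))               # 8-bit weights, 16 rows x 8 columns
x = rng.integers(0, 16, size=16)                     # 4-bit inputs, one per crossbar row

w_slices = slice_bits(W, n_bits=8, bits_per_slice=2) # 4 weight slices (4 "tiles")
x_slices = slice_bits(x, n_bits=4, bits_per_slice=1) # 4 serial input bits (1-bit DAC)

acc = np.zeros(W.shape[1], dtype=np.int64)
for i, xs in enumerate(x_slices):                    # input-slicing cycles
    for j, ws in enumerate(w_slices):                # bit-sliced weight tiles
        partial = xs @ ws                            # what one tile + ADC read produces
        acc += partial << (i * 1 + j * 2)            # shift-and-add recombination

assert np.array_equal(acc, x @ W)                    # matches the full-precision MAC
```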

The ADC resolution required to represent all possible MAC outputs depends on the DAC resolution b_a, the number of crossbar rows i, and the number of weight bits b_w, and can be expressed as ceil(log2((2^b_a − 1) × (2^b_w − 1) × i)) [1].
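This expression translates directly into a small helper (a sketch of the formula above, with hypothetical argument names):

```python
import math

def adc_bits(b_a, b_w, rows):
    """Minimum ADC resolution to represent every possible MAC output of one column [1]."""
    return math.ceil(math.log2((2 ** b_a - 1) * (2 ** b_w - 1) * rows))

# e.g. 1-bit inputs, 2-bit weight slices, 128 rows -> adc_bits(1, 2, 128) == 9 bits
```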

ADCs consume up to 80–88% of the energy and 70–90% of the area of the crossbar with peripheral circuits [13], [15]. To reduce the ADC overhead, approximate computing with low-precision ADCs can be adopted, or the number of ADCs can be reduced at the cost of increased network latency [22]. An alternative approach to reducing the ADC overhead is performing post-MAC computations (e.g., activations) in the analog domain [23]. However, the main issue in analog-domain processing is noise, which leads to error accumulation over the entire network.

Fig. 2: Power and area efficiency of state-of-the-art RRAM-based QNNs.

Table I shows the summary of the existing RRAM-based QNN architectures. Fig. 2 illustrates the power and area efficiency of these architectures.

1) Binary / Ternary RRAM Computation: In binary neural networks (BNNs), the weights are represented by 1-bit devices corresponding to the (+1, -1) software equivalent, and the activations are also binarized. In BNNs, the MAC operation is implemented via an XNOR operation [4], and the ADC overhead is reduced using a 1-bit sense amplifier (SA) [5], as the activations are binarized. In [3], an XNOR-based BNN is implemented on a 1T1R array with an oxide-semiconductor FET integrated into a 3D stack using a low-temperature BEOL process. In [19], a fabricated BNN prototype (crossbar and peripheral readouts) is implemented with a flash ADC, which is not practically scalable to higher bit-precision networks. In ternary networks, the weights are set to (-1, 0, +1) and represented by two RRAM cells [24]. In BNN and ternary network training, the quantized weights and activations are used in the forward propagation, and quantized gradient values are propagated via the STE in the backward pass. However, the full-precision computation required for the weight update and the training complexity are the limitations for RRAM-based on-chip learning in BNNs.
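A software analogue of the XNOR-based binary MAC is XNOR followed by popcount; the sketch below (illustrative, using the usual {0,1} encoding of {-1,+1}) shows its equivalence to a signed dot product:

```python
import numpy as np

def bnn_mac(x_bits, w_bits):
    """Binary MAC via XNOR + popcount.

    x_bits and w_bits are 0/1 arrays encoding {-1,+1} as {0,1}; the result equals
    the dot product of the corresponding {-1,+1} vectors.
    """
    xnor = ~(x_bits ^ w_bits) & 1          # 1 where the two signs agree
    popcount = xnor.sum()
    return 2 * popcount - x_bits.size      # agreements minus disagreements

rng = np.random.default_rng(0)
xb = rng.integers(0, 2, size=128)
wb = rng.integers(0, 2, size=128)
assert bnn_mac(xb, wb) == np.dot(2 * xb - 1, 2 * wb - 1)
```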

2) High Bits Fixed-point RRAM Computation: Compared to BNNs, higher-bit QNNs require ADCs and fixed-point computation in the peripheral circuits [7], [8], [9]. In [7], the partial sums from large weight matrices split into crossbar tiles are quantized, then merged via an element-wise adder and quantized again, aiming to reduce the ADC overhead. These quantizations are also included in software-based training using floating-point weights and the STE.

3) Accelerator Level Design Considerations: In QNN accelerators based on fixed-point computation, the datapath, routing, and arrangement of network nodes are considered in addition to control and computation peripherals [11], [12], [13], [25]. The inputs to most QNN accelerators are fed to the crossbar via low-precision DACs and serial encoding. In the ISAAC accelerator [11], the partitioning of weights and their storage in several RRAM cells is considered. Also, weights are stored in either original or flipped form to increase the number of zero sums, reducing the required ADC precision. In the PipeLayer architecture [12], both training and inference are considered, exploiting intra-layer parallelism. The PUMA architecture consists of nodes arranged into an on-chip network, each including several multicores [13]. A multicore consists of several cores and routing peripherals, while a core includes an RRAM crossbar with peripherals, an execution pipeline, and a digital CMOS-based processing unit.


TABLE II: QNNs: Open Challenges and Requirements

Open Challenge: Efficient Inference
  Device level requirements: reducing C2C and D2D variations; reducing conductance drift with time; reducing device faults.
  Architecture and algorithmic support requirements: reducing ADC and partial-sum overhead; improving data movement between the layers.

Open Challenge: Training/Learning On-Chip
  Device level requirements: improving device endurance and faults; improving linearity and symmetry of conductance updates; reducing the high write cost.
  Architecture and algorithmic support requirements: developing efficient additional (higher) precision support for training; developing new training algorithms to reduce quantization precision; reducing CMOS and peripherals overhead in training; solving large shared-memory requirement issues; overcoming training latency via parallel crossbar writes.

Open Challenge: Reconfigurability, Towards General-Purpose Architectures
  Device level requirements: fulfilling RRAM device requirements with higher bit precision and low variations (as one solution for variable precision support).
  Architecture and algorithmic support requirements: developing support for different workloads; developing variable precision support and reducing the corresponding hardware overhead; improving hardware utilization (balancing execution time and energy consumption); developing reconfigurability schemes for multi-bit precision without hardware overhead; developing flexible control and data movement.

Open Challenge: RRAM-based Solutions Integration and Practical Issues
  Device level requirements: developing large-scale implementation; improving 3D integration.
  Architecture and algorithmic support requirements: testing on large databases together with software-hardware co-design; integrating RRAM-based solutions into traditional processing architectures.

The Panther architecture is based on PUMA but includes efficient training based on crossbar partitioning using the bit-slicing technique [25]. Panther aims to update the crossbar with parallel writes, encoding row signals in the time domain and column signals in the amplitude domain via pulse-amplitude modulation.

4) Variable Precision and Dynamic Quantization: Variable (mixed) precision implies different quantization precision in different layers of a QNN [2], as these layers have different quantization sensitivity [9]. Multi-precision QNNs are prone to hardware under-utilization and reconfigurability issues. In [9], a multi-precision CNN accelerator is shown that uses a different number of computing units of the same design for variable precision. In the dynamic quantization shown in [6], the number of analog-to-digital (AD) conversions and shift-and-add operations is reduced by skipping the computation of some partial products. Dynamic quantization is achieved by enabling a different number of crossbar columns for AD conversion for different input bit positions.

5) Fabricated RRAM Crossbar-based Designs: By default, all fabricated RRAM architectures are quantized, as RRAM cells have limited precision. Recently, several QNN architectures with integrated RRAMs have been demonstrated [20], as well as tiled QNN architectures for inference [14], [18]. The architecture in [18] supports configurable input precision with bit-serial input encoding and shows a multi-bit processing trade-off between area and speed. In [21], the readout ADC resolution can be adjusted by changing the sampling clock frequency. In [16] and [17], on-chip (in-situ) learning is considered, which requires additional hardware overhead for data routing and additional memory to store intermediate outputs. In-situ learning can compensate for RRAM non-idealities and unresponsive devices. In [16], the RRAM crossbar is integrated into a customized PCB, while the computation of activations and backpropagation is performed in software. In [17], in-situ training of a CNN and a ConvLSTM is shown, focusing on shared-weight-based architectures. In [15], hybrid training to accommodate device variations is proposed. The network is trained off-chip, followed by on-chip fine-tuning where a single fully-connected layer is updated to avoid complex backpropagation, which compensates for errors caused by variations in duplicated convolution kernels.

IV. OPEN CHALLENGES AND REQUIREMENTS

While most of the implemented QNN architectures focus on inference optimization, problems such as on-chip training, reconfigurability supporting different precisions and workloads aiming for general-purpose chips [22], and large-scale implementation remain open. Table II summarizes the open research directions and the corresponding challenges and requirements related to device- and architecture-level limitations.

1) Efficient inference: Despite the number of works on RRAM-based QNN inference, the inference efficiency can still be improved by reducing the number of ADCs, lowering ADC precision by relying on approximate computing, or designing low-power application-specific converters. Larger crossbar tiles can reduce the ADC overhead [15]; however, longer wires contribute to increased IR drop [9]. This can be addressed by developing 3D integration techniques and improving lithography techniques for 3D stacking [1], which are also affected by selector device switching and non-linearity. Partitioning large weight matrices or high bit-precision computations into partial sums also contributes to storage and datapath overhead [1]. D2D variations may lead to errors accumulated and propagated over the network. The device conductance variability depends on the selection of the RRAM material stack and can be addressed by on-chip fine-tuning.

2) On-chip training: From the device perspective, efficient in-situ on-chip training is an open challenge due to limited device endurance and expensive crossbar writes. For example, training ResNet-50 for ImageNet classification requires 18M crossbar writes [1]. For 50 ns writes, it would take 0.8 s to update the whole network sequentially. Thus, parallel crossbar writes are essential for a large network, but they contribute to hardware overhead. The endurance of currently available devices reaches up to 10⁶ cycles, while at least 10⁸–10⁹ cycles are required for on-chip training. Non-linear asymmetric conductance updates can also affect on-chip learning schemes and require additional overhead to overcome, e.g., via a read-verify-write mechanism [1].
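The control flow of such a read-verify-write (program-and-verify) scheme can be sketched as the loop below; the callbacks, tolerance, and pulse budget are hypothetical and stand in for the actual device interface:

```python
def program_with_verify(write_pulse, read_conductance, g_target,
                        tol=0.02, max_pulses=20):
    """Iteratively pulse a cell and read back until its conductance is within tolerance.

    write_pulse(error) and read_conductance() are assumed device-interface callbacks;
    this only illustrates the read-verify-write control flow, not a real driver.
    """
    for _ in range(max_pulses):
        g = read_conductance()
        error = g_target - g
        if abs(error) <= tol * g_target:      # verified: close enough to the target
            return True, g
        write_pulse(error)                    # apply a SET/RESET pulse toward the target
    return False, read_conductance()          # give up after the pulse budget is spent
```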

From the architecture perspective, the hardware overhead related to on-chip learning includes external memory for storing intermediate outputs, modules for additional computation requiring high precision, and flexible crossbar peripherals. Even for hybrid training (fine-tuning), the peripherals overhead is unavoidable due to batch training and the requirement to store intermediate results [15]. Moreover, most of the QNNs in Table I still require 32-bit weights and computation precision to train the networks, while low-bit precision is sufficient for inference. This leads to the development of flexible precision support by either adjusting the number of RRAM cells representing a single weight or increasing the number of stable conductance levels in RRAM devices. Both contribute to increasing the bit-precision requirements for ADCs. The alternative is to develop QNN training algorithms based on low-bit weights and low-bit arithmetic, which would lead to a significant reduction in the hardware overhead for training.

3) Reconfigurability and general-purpose architectures: Most RRAM-based QNN architectures are application-specific and designed for certain QNN types. Reconfigurability is a major step towards general-purpose architectures supporting different workloads, multi-bit precision, flexible control and data movement, and efficient hardware utilization. Depending on the type of the executed QNN layers or the type of network, the workloads vary. This leads to unbalanced execution time across layers [22] and the requirement to design flexible interconnects. For example, CNN workloads may be very different from sparse fully-connected layers or temporal processing in an LSTM. A generalized reconfigurable accelerator should be able to support processing in both the spatial and temporal domains. Therefore, to achieve continuous data flow, balancing the execution time of different layers is required. The other challenge arising from different workloads is resource utilization balance. Mapping a network to a general-purpose RRAM accelerator can lead to resource under-utilization [26]. The reconfigurability of the architecture is also related to flexible bit-precision support in different QNN layers and minimizing its overhead.

Also, hardware-aware algorithms that automate the selection of the required computation precision based on the accuracy and energy trade-offs in different layers can be further developed.

4) Efficient integration of RRAM-based solutions into traditional processing hierarchies: Considering the RRAM limitations, the complexity of QNN training, and the demand for moving toward general-purpose architectures, the efficient integration of RRAM-based solutions with traditional memory and processing hierarchies is essential. For example, for on-chip training, efficient integration with traditional memories is important to overcome external memory requirements and latency issues. An example of such a system is shown in [27], where RRAM-based processing is integrated with conventional SRAM-based memory for AI applications. Overall, large-scale integration is the next step to consider in RRAM-based AI hardware development.

V. CONCLUSION

This paper reviews state-of-the-art RRAM-based QNN architectures and the corresponding device- and architecture-level metrics. We emphasize possible future research directions and open challenges related to improving QNN inference efficiency, on-chip training, architecture reconfigurability, and the demand for the efficient integration of RRAM-based solutions into traditional processing hierarchies.

REFERENCES

[1] I. Chakraborty, M. Ali, A. Ankit, S. Jain, S. Roy, S. Sridharan, A. Agrawal, A. Raghunathan, and K. Roy, "Resistive crossbars as approximate hardware building blocks for machine learning: Opportunities and challenges," Proceedings of the IEEE, vol. 108, no. 12, pp. 2276–2310, 2020.

[2] S. Huang, A. Ankit, P. Silveira, R. Antunes, S. R. Chalamalasetti, I. El Hajj, D. E. Kim, G. Aguiar, P. Bruel, S. Serebryakov et al., "Mixed precision quantization for ReRAM-based DNN inference accelerators," in 2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2021, pp. 372–377.

[3] J. Wu, F. Mo, T. Saraya, T. Hiramoto, and M. Kobayashi, "A monolithic 3D integration of RRAM array with oxide semiconductor FET for in-memory computing in quantized neural network AI applications," in 2020 IEEE Symposium on VLSI Technology, 2020, pp. 1–2.

[4] X. Sun, S. Yin, X. Peng, R. Liu, J.-s. Seo, and S. Yu, "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 1423–1428.

[5] X. Sun, X. Peng, P.-Y. Chen, R. Liu, J.-s. Seo, and S. Yu, "Fully parallel RRAM synaptic array for implementing binary neural network with (+1, -1) weights and (+1, 0) neurons," in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2018, pp. 574–579.

[6] X. Chen, J. Jiang, J. Zhu, and C.-Y. Tsui, "A high-throughput and energy-efficient RRAM-based convolutional neural network using data encoding and dynamic quantization," in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2018, pp. 123–128.

[7] Y. Cai, T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, "Low bit-width convolutional neural network on RRAM," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 7, pp. 1414–1427, 2019.

[8] T.-H. Kim, J. Lee, S. Kim, J. Park, B.-G. Park, and H. Kim, "3-bit multilevel operation with accurate programming scheme in TiOx/Al2O3 memristor crossbar array for quantized neuromorphic system," Nanotechnology, vol. 32, no. 29, p. 295201, 2021.

[9] Z. Zhu, H. Sun, Y. Lin, G. Dai, L. Xia, S. Han, Y. Wang, and H. Yang, "A configurable multi-precision CNN computing framework based on single bit RRAM," in 2019 56th ACM/IEEE Design Automation Conference (DAC). IEEE, 2019, pp. 1–6.

[10] C.-X. Xue, T.-Y. Huang, J.-S. Liu, T.-W. Chang, H.-Y. Kao, J.-H. Wang, T.-W. Liu, S.-Y. Wei, S.-P. Huang, W.-C. Wei et al., "15.4 A 22nm 2Mb ReRAM compute-in-memory macro with 121-28TOPS/W for multibit MAC computing for tiny AI edge devices," in 2020 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2020, pp. 244–246.

[11] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016.

[12] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 541–552.

[13] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy et al., "PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 715–731.

[14] Q. Wang, X. Wang, S. H. Lee, F.-H. Meng, and W. D. Lu, "A deep neural network accelerator based on tiled RRAM architecture," in 2019 IEEE International Electron Devices Meeting (IEDM). IEEE, 2019, pp. 14-4.

[15] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian, "Fully hardware-implemented memristor convolutional neural network," Nature, vol. 577, no. 7792, pp. 641–646, 2020.

[16] C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery, P. Lin, Z. Wang et al., "Efficient and self-adaptive in-situ learning in multilayer memristor neural networks," Nature Communications, vol. 9, no. 1, pp. 1–8, 2018.

[17] Z. Wang, C. Li, P. Lin, M. Rao, Y. Nie, W. Song, Q. Qiu, Y. Li, P. Yan, J. P. Strachan et al., "In situ training of feed-forward and recurrent convolutional memristor networks," Nature Machine Intelligence, vol. 1, no. 9, pp. 434–442, 2019.

[18] C.-X. Xue, W.-H. Chen, J.-S. Liu, J.-F. Li, W.-Y. Lin, W.-E. Lin, J.-H. Wang, W.-C. Wei, T.-Y. Huang, T.-W. Chang et al., "Embedded 1-Mb ReRAM-based computing-in-memory macro with multibit input and weight for CNN-based AI edge processors," IEEE Journal of Solid-State Circuits, vol. 55, no. 1, pp. 203–215, 2019.

[19] S. Yin, X. Sun, S. Yu, and J.-s. Seo, "High-throughput in-memory computing for binary deep neural networks with monolithically integrated RRAM and 90-nm CMOS," IEEE Transactions on Electron Devices, vol. 67, no. 10, pp. 4185–4192, 2020.

[20] M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R. S. Williams, J. J. Yang et al., "Memristor-based analog computation and neural network classification with a dot product engine," Advanced Materials, vol. 30, no. 9, p. 1705914, 2018.

[21] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C.-X. Xue, W.-H. Chen et al., "33.2 A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing," in 2020 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2020, pp. 500–502.

[22] W. Zhang, B. Gao, J. Tang, P. Yao, S. Yu, M.-F. Chang, H.-J. Yoo, H. Qian, and H. Wu, "Neuro-inspired computing chips," Nature Electronics, vol. 3, no. 7, pp. 371–382, 2020.

[23] O. Krestinskaya and A. P. James, "Analogue neuro-memristive convolutional dropout nets," Proceedings of the Royal Society A, vol. 476, no. 2242, p. 20200210, 2020.

[24] A. Laborieux, M. Bocquet, T. Hirtzlin, J.-O. Klein, E. Nowak, E. Vianello, J.-M. Portal, and D. Querlioz, "Implementation of ternary weights with resistive RAM using a single sense operation per synapse," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 1, pp. 138–147, 2020.

[25] A. Ankit, I. El Hajj, S. R. Chalamalasetti, S. Agarwal, M. Marinella, M. Foltin, J. P. Strachan, D. Milojicic, W.-M. Hwu, and K. Roy, "PANTHER: A programmable architecture for neural network training harnessing energy-efficient ReRAM," IEEE Transactions on Computers, vol. 69, no. 8, pp. 1128–1142, 2020.

[26] S. Qu, B. Li, Y. Wang, D. Xu, X. Zhao, and L. Zhang, "RaQu: An automatic high-utilization CNN quantization and mapping framework for general-purpose RRAM accelerator," in 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 2020, pp. 1–6.

[27] M. Chang, S. D. Spetalnick, B. Crafton, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, and A. Raychowdhury, "A 40nm 60.64TOPS/W ECC-capable compute-in-memory/digital 2.25MB/768KB RRAM/SRAM system with embedded Cortex M3 microprocessor for edge recommendation systems," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65. IEEE, 2022, pp. 1–3.
