Resistive memories, also known as memristors (1), including resistive switching memory (RRAM) and phase-change memory (PCM), are emerging as a novel technology for high-density storage (2, 3), neuromorphic hardware (4, 5), and stochastic security primitives, such as random number generators (6, 7). Thanks to their ability to store analog values and to their excellent programming speed, resistive memories have also been demonstrated for in-memory computing (8–17), which eliminates data transfer between the memory and the processing unit to improve the time and energy efficiency of computation. With a cross-point architecture, resistive memories can naturally perform matrix-vector multiplication (MVM) by exploiting fundamental physical laws, namely Ohm’s law and Kirchhoff’s law of electric circuits (8). Cross-point MVM has been shown to accelerate various data-intensive tasks, such as training and inference of deep neural networks (11–14), signal and image processing (15), and the iterative solution of systems of linear equations (16) or differential equations (17). With a feedback circuit configuration, the cross-point array has been shown to solve systems of linear equations and calculate matrix eigenvectors in one step (18). Such a low computational complexity is attributed to the massive parallelism within the cross-point array and to the analog storage and computation with physical MVM. Here, we show that a cross-point resistive memory circuit with a feedback configuration can accelerate fundamental learning functions, such as predicting the next point of a sequence by linear regression or assigning a new input to one of two classes of objects by logistic regression. These operations are completed in just one step in the circuit, in contrast to the iterative algorithms running on conventional digital computers, which approach the solution with polynomial time complexity.
Linear regression in one step
Linear regression is a fundamental machine learning (ML) model for regressive and predictive analysis in various disciplines, such as biology, social science, economics, and management (19–21). Logistic regression, instead, is a typical tool for classification tasks (22), e.g., acting as the last classification layer in a deep neural network (23, 24). Because of their simplicity, interpretability, and well-known properties, linear and logistic regressions stand out as the most popular ML algorithms across many fields (25). A linear regression model is described by an overdetermined linear system given by
Xw = y, (1)

where X is an N × M matrix (N > M), y is a known vector of size N × 1, and w is the unknown weight vector (M × 1) to be solved. As the problem is overdetermined, there is typically no exact solution w to Eq. 1. The best solution of Eq. 1 can be obtained by the least squares error (LSE) approach, which minimizes the norm of the error ε = Xw − y, namely, ‖ε‖ = ‖Xw − y‖2, where ‖∙‖2 is the Euclidean norm. The vector w minimizing ‖ε‖ is obtained by the pseudoinverse (21, 23, 24) [or Moore-Penrose inverse (26)] X+, given by

w = X+y = (XTX)−1XTy, (2)

where XT is the transpose of matrix X.
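The one-step solution of Eq. 2 can be checked numerically; the following sketch (with made-up values for X and y, not the experimental data) verifies that the explicit pseudoinverse formula agrees with a standard least squares solver:

```python
import numpy as np

# Illustrative overdetermined system (N = 6, M = 2); values are made up.
X = np.array([[1.0, 0.2], [1.0, 0.4], [1.0, 0.5],
              [1.0, 0.7], [1.0, 0.8], [1.0, 1.0]])
y = np.array([0.3, 0.4, 0.4, 0.5, 0.5, 0.6])

# Eq. 2: w = (X^T X)^-1 X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# The same LSE solution via the Moore-Penrose pseudoinverse and via lstsq
w_pinv = np.linalg.pinv(X) @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_pinv) and np.allclose(w, w_lstsq)
```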
To obtain the solution in Eq. 2, we propose a cross-point resistive memory circuit in Fig. 1A, where the matrix X is mapped by the conductance matrix GX in a pair of cross-point arrays of analog resistive memories, the vector y corresponds to the opposite of the input current vector i = [I1; I2; …; IN], and w is represented by the output voltage vector v = [V1; V2; …; VM] (M = 2 in Fig. 1A).
For a practical demonstration of this concept, we adopted arrays of RRAM devices composed of a HfO2 dielectric layer sandwiched between a Ti top electrode (TE) and a C bottom electrode (BE) (27). This type of RRAM device can be programmed to an arbitrary analog conductance within a certain range, thus allowing the matrix elements Xij of the matrix X to be represented with sufficient accuracy (18). Representative analog conductance levels were programmed by controlling the compliance current during the set transition, as shown in fig. S1. The cross-point arrays are connected within a nested feedback loop (28) by N operational amplifiers (OAs) from the left array to the right array and M OAs from the right array to the left array. Briefly, the first set of OAs gives a negative transfer of the current, while the second one gives a positive transfer, resulting in an overall negative feedback and hence stable operation of the circuit with virtual-ground inputs of all OAs. A detailed analysis of the circuit stability is reported in text S1.
According to Ohm’s law and Kirchhoff’s law in Fig. 1A, the input currents at the OAs from the left cross-point array are GXv + i; thus, the output voltages applied to the right cross-point array are

vr = −(GXv + i)/GTI,

where GTI is the feedback conductance of the transimpedance amplifiers (TIAs). The right cross-point array operates another MVM between the voltage vector vr and the transpose conductance matrix GXT, resulting in a current vector

ir = GXTvr = −GXT(GXv + i)/GTI,

which is forced to zero at the input nodes of the second set of OAs, namely

GXT(GXv + i) = 0. (3)

The steady-state voltages v at the left array are thus given by

v = −(GXTGX)−1GXTi, (4)

which is the pseudoinverse solution of Eq. 2 with y = −i.
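In software, the steady-state relation can be verified by noting that forcing GXT(GXv + i) to zero yields exactly the pseudoinverse solution with y = −i; a minimal numerical check, with illustrative conductance values:

```python
import numpy as np

# Illustrative conductance matrix GX and input currents (normalized units, made up)
GX = np.array([[1.0, 0.20], [1.0, 0.35], [1.0, 0.50],
               [1.0, 0.65], [1.0, 0.80], [1.0, 0.95]])
i = np.array([0.3, 0.4, 0.4, 0.5, 0.5, 0.6])

# Steady state of the nested feedback loop: GX^T (GX v + i) = 0
v = -np.linalg.inv(GX.T @ GX) @ GX.T @ i

# The output voltages coincide with the pseudoinverse solution for y = -i
assert np.allclose(v, np.linalg.pinv(GX) @ (-i))
```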
The circuit of Fig. 1A was implemented in hardware using RRAM devices arranged within a cross-point architecture on a printed circuit board (see Materials and Methods and fig. S2). As a basic model, we considered the simple linear regression of points (xi, yi), where i = 1, 2, …, N, to be fitted by a linear model w0 + w1xi = yi, where w0 and w1 are the intercept with the y axis and the slope, respectively, of the best fitting line. To solve this problem in hardware, we encoded the matrix

X = [1 x1; 1 x2; …; 1 xN] (5)

in the cross-point arrays. A column of discrete resistors with G = 100 μS was used to represent the first column of X in Eq. 5, which is identically equal to 1. The second columns of both arrays were implemented with reconfigurable RRAM devices. A total number of N = 6 data points were considered, with each xi implemented as an RRAM conductance in units of 100 μS. The unit of conductance was chosen according to the range of linear conduction of the device (18), thus ensuring a good computation accuracy. Other aspects, such as the current limit of the OAs and the power consumption, should also be considered to select the best memory devices for the circuit. Although the conductance values in the two cross-point arrays should be identical, some mismatch can be tolerated in practical implementations (text S2). A program/verify technique was used to minimize the relative error (less than 5%) between the values of X in the two cross-point arrays (fig. S3). The data ordinates −y were instead applied as input currents. The input currents should be kept relatively small so that the resulting output signal is low enough to prevent disturbance of the device states in the stored matrix.
Given the matrix X stored in the cross-point arrays and an input current vector representing y, the corresponding linear system was then solved by the circuit in one step. Figure 1B shows the resulting dataset for an input current vector i = [0.3; 0.4; 0.4; 0.5; 0.5; 0.6]I0, with I0 = 100 μA to align with the conductance unit. Figure 1B also shows the regression line, obtained from the circuit output voltages representing weights w0 and w1. The comparison with the analytical regression line shows relative errors of −4.86% and 0.82% for w0 and w1, respectively. The simulated transient behavior of the circuit is shown in fig. S4, evidencing that the linear regression weights are computed within about 1 μs. By changing the input vector, a different linear system was formed and solved by the circuit, as shown in Fig. 1C for i = [0.3; 0.3; 0.5; 0.4; 0.5; 0.7]I0. The result evidences that a more scattered dataset can also be correctly fitted by the circuit.
The cross-point circuit also naturally yields the prediction of the value y* in response to a new point at position x*. This is obtained by adding an extra row in the left cross-point array, where an additional RRAM element implements the new coordinate x* (fig. S5). The results are shown in Fig. 1C, indicating a circuit prediction that is only 1% smaller than the analytical prediction. Figure S6 reports more linear regression and prediction results for various datasets. Linear regression with two independent variables was also demonstrated with a cross-point array of three columns, with results shown in fig. S7. These results support the cross-point circuit for the solution of linear regression models in various dimensions. The linear regression concept can also be extended to nonlinear regression models, e.g., polynomial regression (29), to better fit a dataset and thus make better predictions. By loading the polynomial terms in the cross-point arrays, the circuit can also realize polynomial regression in one step (text S3 with fig. S8).
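As a sketch of the polynomial extension, the powers of xi would simply be loaded as additional columns of X, so the same one-step pseudoinverse applies; the example below uses synthetic data and a hypothetical degree-2 model:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
# Synthetic quadratic data with small noise (hypothetical coefficients)
y = 0.5 + 1.2 * x - 0.8 * x**2 + 0.02 * rng.standard_normal(x.size)

# Columns of X hold the polynomial terms [1, x, x^2], exactly what would be
# loaded into the cross-point arrays for a degree-2 fit.
X = np.vander(x, N=3, increasing=True)
w = np.linalg.pinv(X) @ y   # one-step pseudoinverse solution (Eq. 2)
assert np.std(X @ w - y) < 0.1
```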
Logistic regression in one step
Logistic regression is a binary model that is extensively used for object classification and pattern recognition. Different from linear regression, which is a fully linear model, logistic regression also includes a nonlinear sigmoid function to generate the binary output. A logistic regression model can be viewed as the single-layer feed-forward neural network in Fig. 2A. Here, the weighted summation vector s of the input signals to the nonlinear neuron is given by

s = Xw, (6)

where X is the matrix containing the independent variable values of all samples and w is the vector of the synaptic weights. The neuron outputs are thus simply given by the vector y = f(s), where f is the nonlinear function of the neuron. To compute the weights of a logistic regression model with a sample matrix X and a label vector y, the logit transformation can be executed first (30). By applying the inverse of the sigmoid function, the label vector y is converted to a summation vector s, namely, s = f−1(y). As a result, the logistic regression is reduced to a linear regression problem, whose weights can be obtained in one step by the pseudoinverse concept

w = X+s = (XTX)−1XTs. (7)
For simplicity, we assumed that the nonlinear neuron function is instead a step function and that the summation vector s in Fig. 2A is binarized according to

si = a for yi = 1, si = −a for yi = 0, (8)

where a is a positive constant for regulating the output voltage in the circuit. After this transformation, the weights can be computed directly with the pseudoinverse circuit of Fig. 1A.
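A software sketch of this procedure (with made-up points, not the measured dataset of Fig. 2B) binarizes the labels per Eq. 8 and recovers a separating boundary in one pseudoinverse step:

```python
import numpy as np

# Six illustrative 2D points and binary labels (made up, not the measured data)
pts = np.array([[0.2, 0.7], [0.3, 0.8], [0.25, 0.9],
                [0.7, 0.2], [0.8, 0.3], [0.9, 0.25]])
y = np.array([0, 0, 0, 1, 1, 1])

a = 0.2                                  # scaling constant of Eq. 8
s = np.where(y == 1, a, -a)              # binarized summation vector
X = np.hstack([np.ones((6, 1)), pts])    # first column of 1s, then (x1, x2)

w = np.linalg.pinv(X) @ s                # one-step pseudoinverse weights
# Decision boundary: w0 + w1*x1 + w2*x2 = 0; check that it separates the classes
pred = (X @ w >= 0).astype(int)
assert np.array_equal(pred, y)
```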
Figure 2B shows a set of six data points with coordinates (x1, x2) divided into two classes, namely, y = 0 (open) and y = 1 (full). Figure 2C shows the matrix X, where the first column is equal to 1, while the other columns represent the coordinates x1 and x2 of the dataset. The sample matrix X was mapped in the two cross-point arrays of Fig. 1A, and an input current was applied to each row to represent s with a = 0.2, according to Eq. 8. The circuit schematic is reported in fig. S9, together with experimental results and relative errors of the logistic regression. The simulated transient behavior of the circuit is shown in fig. S10, with a computing time of around 0.6 μs. The output voltage yields the weights w = [w0; w1; w2], with s = w0 + w1x1 + w2x2 = 0 representing the decision boundary for classification, where s ≥ 0 indicates the domain of class “1” and s < 0 the domain of class “0”. This boundary is shown as a line in Fig. 2B, displaying close agreement with the analytical solution. The cross-point circuit thus enables a one-step solution of logistic regression, with datasets of various dimensionalities and sizes accommodated by the cross-point arrays. Similar to linear regression, the circuit can also provide one-step classification of any new (unlabeled) point, which is stored in a grounded additional row of the left cross-point array. The current flowing in that row yields the class of the new data point. Although here we consider two cases containing only positive independent variable values for linear/logistic regression, datasets containing negative values can also be addressed by simply translating the entire dataset to be positive, as explained in fig. S11.
Linear regression of Boston housing dataset
While the circuit capability has been demonstrated experimentally for small models, the matrix size is an obvious concern that needs to be addressed for real-world applications. To study the circuit scalability, we considered a large dataset, namely, the Boston housing price list, for linear regression (31, 32). The dataset collects 13 attributes and the prices of 506 houses, 333 of which are used for training the linear regression model, while the rest are used for testing it. The attributes are summarized in text S4. We performed linear regression on the training set to compute the weights with the cross-point circuit and applied the regression model to predict the house prices of the test set.
Figure 3A shows the matrix X for the training set, including a first column of 1s and 13 further columns recording the attributes, together with the input vector y representing the corresponding prices. The matrix X was rescaled to make the conductance values in the cross-point arrays uniform, and the vector y was also scaled down to prevent an excessive output voltage w (see fig. S12). We simulated the linear regression circuit with SPICE (Simulation Program with Integrated Circuit Emphasis; see Materials and Methods), where the RRAM devices were assumed to accurately map the matrix values within 8-bit precision. Figure 3B shows the calculated w obtained from the output voltage in the simulated circuit, with the relative errors remaining within ±1%, thus demonstrating the good accuracy and scalability of the circuit.
Figure 3C shows the obtained regression results compared with the real house prices of the training set. A standard deviation (SD) σP of $4733 is obtained from the SPICE simulations, which is in line with the analytical solution σP′ = $4732. Figure 3D shows the predicted prices of the test set compared with the real prices. The SD from the circuit simulation is σP = $4779, in good agreement with the analytical result σP′ = $4769. The resulting SD is only slightly larger than that of the training set, which supports the generalization capability of the model. One-step price prediction for test samples is possible by storing the unlabeled attributes in additional rows of the left cross-point array and measuring the corresponding currents, as indicated in fig. S13.
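The evaluation pipeline above can be sketched in software; since the actual Boston housing data are not reproduced here, the example below uses a synthetic stand-in with the same train/test split sizes and reports the SD of the fitting and prediction errors:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in: 333 training and 173 test samples, 13 attributes,
# plus a leading column of 1s (the real housing data are not bundled here).
n_train, n_test, m = 333, 173, 13
A = rng.uniform(size=(n_train + n_test, m))
X_all = np.hstack([np.ones((n_train + n_test, 1)), A])
true_w = rng.normal(size=m + 1)
y_all = X_all @ true_w + 0.5 * rng.standard_normal(n_train + n_test)

X_tr, y_tr = X_all[:n_train], y_all[:n_train]
X_te, y_te = X_all[n_train:], y_all[n_train:]

w = np.linalg.pinv(X_tr) @ y_tr         # one-step training via Eq. 2
sigma_train = np.std(X_tr @ w - y_tr)   # SD of training residuals
sigma_test = np.std(X_te @ w - y_te)    # SD of prediction error (generalization)
assert sigma_train < 1.0 and sigma_test < 1.0
```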
Two-layer neural network training
Logistic regression is widely used as the last fully connected classification layer in deep neural networks. The cross-point circuit thus provides a hardware acceleration of the computation of the last-layer weights for the training of a neural network. To test the cross-point circuit as an accelerator for training neural networks, we considered the two-layer perceptron in Fig. 4A, where the first-layer weights are set randomly (33, 34), while the second-layer weights can be obtained by the pseudoinverse method in the cross-point circuit. Note that the standard technique to train a two-layer neural network is the backpropagation algorithm, which iteratively reduces the squared error (35). In contrast to iterative backpropagation, the pseudoinverse approach can reach the LSE solution in just one step, thus providing a fast and energy-efficient acceleration of network training.
As a case study for neural network training, we adopted the Modified National Institute of Standards and Technology (MNIST) dataset (36). To reduce the circuit size in the simulations, we used only 3000 of the 50,000 samples to train the neural network. Also, to provide an efficient fan-out (for instance, four) for the first layer (34), the images were down-sampled to 14 × 14 pixels, resulting in a network of 196 input neurons, 784 hidden neurons, and 10 output neurons for the classification of the digits from 0 to 9. The training matrix T has a size of 3000 × 196, and the first-layer weights W(1) were randomly generated with a uniform distribution between −0.5 and 0.5 (fig. S14). The matrix X can thus be obtained by

X = f(TW(1)), (9)

where f is the sigmoid function of the hidden neurons, while the weights of the second layer W(2) can be obtained by the pseudoinverse model of Eq. 2, with Y containing all the known labels of the training samples transformed according to Eq. 8 with a = 0.05. For each training sample, the neuron corresponding to the digit is labeled 1, while the other nine neurons are labeled 0. Note that the matrix X results from the output of the sigmoid function of the hidden neurons and is thus restricted to the range between 0 and 1 (fig. S15).
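The training scheme of Eqs. 2, 8, and 9 amounts to a random-first-layer network with a pseudoinverse-trained output layer; a reduced-size software sketch (toy dimensions and random data, not MNIST) is:

```python
import numpy as np

def sigmoid(z):
    # Hidden-neuron nonlinearity; outputs lie strictly in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
# Toy dimensions standing in for the paper's 3000 x 196 training matrix,
# 784 hidden neurons, and 10 classes.
n, d, h, c, a = 300, 16, 64, 3, 0.05
T = rng.uniform(size=(n, d))                 # training matrix (random stand-in)
labels = rng.integers(0, c, size=n)          # random class labels

W1 = rng.uniform(-0.5, 0.5, size=(d, h))     # random first-layer weights
X = sigmoid(T @ W1)                          # hidden-layer outputs (Eq. 9)

Y = np.where(np.eye(c)[labels] == 1, a, -a)  # labels binarized as in Eq. 8
W2 = np.linalg.pinv(X) @ Y                   # one-step second-layer training (Eq. 2)
assert W2.shape == (h, c)
```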
Figure 4B shows the second-layer weights W(2) obtained by the simulation of the cross-point circuit, where X was stored in the RRAM devices and each column of matrix Y was applied as input current. The weights were obtained in 10 steps, one for each classification output (from digit 0 to digit 9). With the computed weights W(2), the network can recognize 500 handwritten digits with an accuracy of 94.2%, which is identical to the analytical pseudoinverse solution. For the whole test set (10,000 digits), the recognition accuracy is 92.15% using the simulated W(2), compared to 92.14% using the analytical solution. The cross-point array can thus be used to accelerate the training of typical neural networks with ideal accuracy. The computed weights can then be stored in one or more open-loop cross-point arrays for accelerating the neural network in the inference mode by exploiting in-memory MVM (fig. S16) (8, 11, 12).
Figure 4B also shows the LSEs obtained from both the circuit simulation and analytical study. Note that the LSEs are different among the 10 digits due to the dependence of LSE on weight values (text S5). Figure 4C shows the simulated weights as a function of the analytical values for each output neuron, showing a good consistency except for the bias weight w0. The bias acts as a regulator to the summation of an output neuron; thus, the deviated bias weight guarantees that the simulated LSE is close to the analytical one in Fig. 4B. It should be noted that, although a random W(1) was assumed in this study, W(1) can be further optimized by gradient descent methods (37) to improve the accuracy. The same approach might be applied to pretrained deep networks by the concept of transfer learning (38), thus enabling the one-step training capability for a generalized range of learning tasks.
Although the cross-point circuit is inherently accurate and scalable, the imperfections of RRAM devices, such as conductance discretization and stochastic variation (18), might affect the solution. To study the impact of these issues on the solution accuracy, we assumed a RRAM model with 32 discrete conductance levels, including 31 uniformly spaced levels and one deep high-resistance state, which is achievable in many resistive memory devices (39–41). The ratio between the maximum conductance Gmax and the minimum conductance Gmin was assumed in line with previous reports (42, 43). To describe conductance variations, we assumed an SD σ = ∆G/6, ∆G/4, or ∆G/2, where ∆G is the nominal difference between two adjacent conductance levels. The simulation results for the Boston housing benchmark (fig. S17) show that the resulting regression and prediction remain accurate in all cases. In the worst case (σ = ∆G/2), the SD σP of the training set is $4756, compared with the ideal result of $4732. The σP of the test set for prediction is even closer to the ideal one, namely, $4765 compared with $4769. These results highlight the suitability of the cross-point resistive memory circuit for ML tasks, where device variations can be tolerated for regression, prediction, and classification.
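The device model can be mimicked in software by quantizing the stored matrix to discrete levels and adding Gaussian programming noise; a simplified sketch (illustrative matrix, with the deep high-resistance state omitted since all entries here are nonzero):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(0.1, 1.0, size=(20, 4))     # illustrative matrix to be stored
y = rng.uniform(size=20)

levels = np.linspace(X.min(), X.max(), 31)  # 31 uniformly spaced conductance levels
dG = levels[1] - levels[0]                  # spacing between adjacent levels

def program(M, sigma, rng):
    # Quantize every element to the nearest level, then add Gaussian
    # programming noise with standard deviation sigma.
    q = levels[np.argmin(np.abs(M[..., None] - levels), axis=-1)]
    return q + sigma * rng.standard_normal(M.shape)

w_ideal = np.linalg.pinv(X) @ y
w_worst = np.linalg.pinv(program(X, dG / 2, rng)) @ y  # worst case sigma = dG/2
err = np.linalg.norm(w_worst - w_ideal) / np.linalg.norm(w_ideal)
assert err < 1.0  # the solution stays in the right range despite device errors
```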
Another concern for large-scale circuits is the parasitic wire resistance. To study its impact on the accuracy of linear regression for the Boston housing dataset, we adopted interconnect parameters for a 65-nm technology node obtained from the International Technology Roadmap for Semiconductors tables (44), together with the RRAM model. The results in fig. S18 show an increased σP for both regression and prediction, with the effect on prediction being less pronounced, which is consistent with the impact of device variation. Specifically, the σP of prediction becomes merely $4809, compared with the ideal $4769, thus supporting the robustness of the linear regression circuit for predictive analysis.
The circuit stability analysis in text S1 reveals that the poles of the system all lie in the left half plane; thus, the circuit is stable, and the computing time is limited by the bandwidth corresponding to the first pole, i.e., by the minimal absolute value λmin of the eigenvalues (or of their real parts) of a quadratic eigenvalue problem (45). As λmin becomes larger, the computation of the circuit gets faster, with no direct dependence on the size of the dataset. To support this scaling property of the circuit speed, we simulated the transient dynamics of the linear regression of the Boston housing dataset and its subsets for increasing numbers of training samples (fig. S19). The results show that the computing time may even decrease as the number of samples increases, which can be explained by the different λmin of the datasets (fig. S20). These results evidence that the time complexity of the cross-point circuit for linear regression substantially differs from that of classical digital algorithms, with the potential of approaching a size-independent time complexity that would hugely speed up large-scale ML problems. Note that as the circuit size increases, a larger current is also required to sustain the circuit operation, which might be limited by the capability of the OAs. To control the maximum current consumption in the circuit, the memory element should be carefully optimized by materials and device engineering (46) or by advanced device concepts, such as electrochemical transistors (47, 48), to provide a low-conductance implementation. The impact of device variations and the energy efficiency of the circuit are studied for the two-layer neural network trained on the MNIST dataset (text S6 with fig. S21).
The results support the robustness of the circuit against device variations in classification applications and indicate an energy efficiency of 45.3 tera-operations per second per watt (TOPS/W), which is 19.7 and 6.5 times better than Google’s tensor processing unit (49) and a highly optimized application-specific integrated circuit system (50), respectively.
In conclusion, the cross-point circuit has been shown to provide a one-step solution to linear regression and logistic regression, as demonstrated in experiments with RRAM devices. The one-step learning capability relies on the high parallelism of analog computing via physical Ohm’s law and Kirchhoff’s law within the circuit and on the physical iteration within the nested feedback architecture. The scalability of cross-point computing is demonstrated with large problems, such as the Boston housing dataset and the MNIST dataset. The results evidence that in-memory computing is remarkably promising for accelerating ML tasks with excellent latency and energy performance in a wide range of data-intensive applications.
Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/6/5/eaay2378/DC1
Text S1. Analysis of circuit stability
Text S2. Analysis of twin matrices mismatch
Text S3. Polynomial regression
Text S4. Introduction to Boston housing dataset
Text S5. Analysis of least squares
Text S6. Computing performance benchmarking
Fig. S1. Current-voltage characteristics of the Ti/HfO2/C RRAM device.
Fig. S2. Cross-point resistive memory circuit on a printed circuit board.
Fig. S3. Device programming.
Fig. S4. Convergence analysis of the linear regression experiment.
Fig. S5. Extended circuit for one-step prediction.
Fig. S6. More linear regression results.
Fig. S7. Linear regression with two independent variables.
Fig. S8. Polynomial regression result.
Fig. S9. Logistic regression results.
Fig. S10. Convergence analysis of the logistic regression experiment.
Fig. S11. Solution of linear/logistic regression with negative independent variable values.
Fig. S12. Rescaling the attribute matrix X and price vector y.
Fig. S13. One-step prediction circuit schematic for Boston housing dataset.
Fig. S14. Random first-layer weight matrix W(1).
Fig. S15. The hidden-layer output matrix X.
Fig. S16. Training and inference of the two-layer neural network.
Fig. S17. Linear regression of Boston housing dataset with a RRAM model.
Fig. S18. Impact of wire resistance.
Fig. S19. Linear regression of Boston housing dataset and its representative subsets.
Fig. S20. Scaling behavior of computing time of linear regression.
Fig. S21. Analysis of device variation impact and computing time.
Acknowledgments: We thank J. Li for help with the SPICE simulation. Funding: This article has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 648635). This work was partially performed at PoliFAB, the micro- and nanofabrication facility of Politecnico di Milano. Author contributions: Z.S. conceived the idea and designed the circuit. G.P. designed the printed circuit board. G.P. and Z.S. conducted the experiments. A.B. fabricated the devices. All the authors discussed the experimental and simulation results. Z.S. and D.I. wrote the manuscript with input from all authors. D.I. supervised the research. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Additional data related to this paper may be requested from the authors.