Guest Editors' Introduction: Frontiers of Hardware and Algorithms for On-chip Learning
Optical networks-on-chip (NoCs) based on silicon photonics have been proposed as emerging on-chip communication architectures for chip multiprocessors with large core counts. However, due to the thermal sensitivity of the optical devices used in optical NoCs, on-chip temperature variations cause significant thermal-induced optical power loss, which would counteract the power advantages of optical NoCs. To tackle this problem, we propose a learning-based thermal-sensitive power optimization approach for mesh- or torus-based optical NoCs in the presence of temperature variations. The key techniques proposed include an initial device setting and thermal tuning mechanism, which is a device-level optimization technique, and a learning-based thermal-sensitive adaptive routing algorithm, which is a network-level optimization technique. Simulation results for an 8x8 mesh-based optical NoC show that the proposed initial device setting and thermal tuning mechanism confines the worst-case thermal-induced energy consumption to the order of tens of pJ/bit by avoiding significant thermal-induced optical power loss caused by temperature-dependent wavelength shifts. The results also show that the learning-based thermal-sensitive adaptive routing algorithm is able to find an optimal path with the minimum estimated thermal-induced power consumption for each communication pair. The proposed routing has more room for optimization, especially for applications with more long-distance traffic.
Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. With light interference and resonance phenomena, the nanophotonic device works as a voltage-controlled optical pass-gate, like a pass-transistor. This paper first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on it. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.
In this paper, we propose an In-Memory Flexible Computing platform (IMFlexCom) using a novel Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array architecture, which can work in two modes: a memory mode, in which it works as non-volatile memory, and a computing mode, in which it implements reconfigurable logic (AND/OR/XOR) within the memory array. Such intrinsic in-memory logic can be used to process data within memory, greatly reducing the power-hungry, long-distance, massive data communication of conventional von Neumann computing systems. We further employ bulk bitwise vector operations and a data encryption engine based on the Advanced Encryption Standard (AES) as case studies to investigate the performance of our proposed in-memory computing architecture. Our design shows < 35× energy saving and < 18× speedup for bulk bitwise in-memory vector AND/OR operation compared to DRAM-based in-memory logic. In addition, our proposed design achieves 77.27% and 85.4% lower energy consumption compared to CMOS-ASIC and CMOL-based AES implementations, respectively. It offers almost the same energy consumption as a recent DW-AES implementation with 66.7% less area overhead.
Many fabrication-less design houses are outsourcing their designs to third-party foundries for fabrication to lower cost. This IC development process, however, raises serious security concerns about Hardware Trojans (HTs). In this paper, for the first time, we propose a two-phase technique that uses the order of the path delays in path pairs to detect HTs. In the design phase, a full-cover path set that covers all the nets of the design is generated; within the set, the relative order of the paths in each path pair is determined according to their delay. The order of the paths in path pairs serves as the fingerprint of the design. In the test phase, the actual delay of the paths in the full-cover set is extracted from the fabricated circuits, and the order of paths in path pairs is compared with the fingerprint generated in the design phase. A mismatch between them indicates the existence of Trojan circuits. Both process variations and measurement noise are taken into consideration. The efficiency and accuracy of the proposed technique are confirmed by a series of experiments, including the examination of both violated path pairs incurred by HTs and their false alarm rate.
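The order-based fingerprint described above can be illustrated with a minimal sketch. The function names, the delay values, and the `margin` parameter (a hypothetical guard band standing in for process variation and measurement noise) are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations

def order_fingerprint(design_delays, margin=0.0):
    """Design phase: record, for each path pair, which path is slower.

    Pairs whose nominal delay gap is within `margin` are skipped, on the
    assumption that variation or noise could flip their order harmlessly.
    """
    fingerprint = {}
    for a, b in combinations(sorted(design_delays), 2):
        gap = design_delays[a] - design_delays[b]
        if abs(gap) > margin:
            fingerprint[(a, b)] = gap > 0  # True means path a is slower
    return fingerprint

def violated_pairs(fingerprint, measured_delays):
    """Test phase: return the path pairs whose measured order disagrees."""
    return [
        (a, b)
        for (a, b), a_slower in fingerprint.items()
        if (measured_delays[a] > measured_delays[b]) != a_slower
    ]

# Hypothetical delays in nanoseconds for four paths of a full-cover set.
design = {"p1": 1.00, "p2": 1.30, "p3": 1.32, "p4": 2.00}
fingerprint = order_fingerprint(design, margin=0.1)

clean_chip = {"p1": 1.05, "p2": 1.33, "p3": 1.30, "p4": 2.08}
suspect_chip = {"p1": 1.45, "p2": 1.33, "p3": 1.30, "p4": 2.08}

print(violated_pairs(fingerprint, clean_chip))
print(violated_pairs(fingerprint, suspect_chip))
```

In this toy data the extra delay on `p1` in the suspect chip reverses its order against `p2` and `p3`, while the near-tie between `p2` and `p3` is excluded from the fingerprint and so cannot raise a false alarm.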
Stochastic circuits (SCs) offer tremendous area- and power-consumption benefits at the expense of computational inaccuracies. Unlike conventional logic synthesis, managing accuracy is a central problem in SC design. It is usually tackled in ad hoc fashion by multiple trial-and-error simulations that vary relevant parameters like the stochastic number length n. We present, for the first time, a systematic design approach to controlling the accuracy of SCs and balancing it against other design parameters. We express the (in)accuracy of a circuit processing n-bit stochastic numbers by the numerical deviation of the computed value from the expected result, in conjunction with a confidence level. Using the theory of Monte Carlo simulation, we derive expressions for the stochastic number length required for a desired level of accuracy, or vice versa. We discuss the integration of the theory into a design framework that is applicable to both combinational and sequential SCs. We show that for combinational SCs, accuracy is independent of the circuit's size or complexity, a surprising result. We also show how the analysis can identify subtle errors in both combinational and sequential designs. Finally, we apply the proposed methods to a case study on filtering noisy EKG signals.
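The relation between stochastic number length and accuracy can be sketched with the standard Monte Carlo sample-size bound. This is an assumption about the general form of such expressions, not the paper's own derivation: a stochastic number of length n is treated as n independent Bernoulli samples with probability p, and the normal approximation gives n ≥ z² p(1 − p) / ε² for a deviation ε at a confidence level whose two-sided quantile is z. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def required_length(eps, confidence, p=0.5):
    """Stochastic number length needed so that the estimate deviates from p
    by at most eps with the given confidence (normal approximation).
    p = 0.5 is the worst case because it maximizes the variance p(1 - p)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * p * (1.0 - p) / (eps * eps))

def achievable_error(n, confidence, p=0.5):
    """The converse: the deviation bound reached by a length-n number."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * math.sqrt(p * (1.0 - p) / n)

print(required_length(0.05, 0.95))   # 385
print(achievable_error(385, 0.95))
```

Under this model the bound depends only on p, ε, and the confidence level, which is consistent with the abstract's observation that the accuracy of a combinational SC does not depend on circuit size.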
Low operating voltage, high storage density, non-volatile storage capabilities, and relatively low access latencies have popularized memristive devices as storage devices. Memristors are well suited to in-memory computing in the form of hybrid CMOS nano-crossbar arrays. In-memory serial adders have been demonstrated theoretically and experimentally for crossbar arrays. To harness the parallelism of memristive arrays, parallel-prefix adders can be effective. In this work, a novel mapping scheme for an in-memory Kogge-Stone adder is presented. The number of cycles increases logarithmically with the bit width N of the operands, i.e., O(log_2 N), and the device count is 5N. We verify the correctness of the proposed scheme by means of memristive simulations based on a TaOx device model. We compare the proposed scheme with other proposed schemes in terms of the number of cycles and the number of devices.
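The logarithmic cycle count comes from the parallel-prefix structure of the Kogge-Stone adder, which the following sketch models at the logic level only. It does not represent the memristive mapping or the 5N device count; the function name and the use of Python integers as bit vectors are illustrative assumptions.

```python
def kogge_stone_add(a, b, width):
    """Add two width-bit integers with a Kogge-Stone prefix network.

    Returns (sum, carry_out, levels), where levels is the number of
    prefix stages and equals ceil(log2(width)).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    generate = a & b       # bit i produces a carry on its own
    propagate = a ^ b      # bit i passes an incoming carry along
    half_sum = propagate
    levels = 0
    distance = 1
    while distance < width:
        # Combine each position with the one `distance` bits below it.
        # Both signals are updated from the values of the previous stage.
        generate, propagate = (
            generate | (propagate & (generate << distance) & mask),
            propagate & (propagate << distance) & mask,
        )
        distance *= 2
        levels += 1
    carries = (generate << 1) & mask          # carry into each bit
    carry_out = (generate >> (width - 1)) & 1
    return half_sum ^ carries, carry_out, levels

print(kogge_stone_add(0b1011, 0b0110, 4))  # (1, 1, 2): 11 + 6 = 17
```

Each pass of the loop doubles the span covered by every generate/propagate pair, so an N-bit addition needs only about log_2 N prefix stages.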
On-device learning has gained significant attention recently, as it offers local data processing, which ensures user privacy and low power consumption, especially on mobile devices and energy-constrained platforms. This paper proposes on-device training circuitry for threshold-current memristors integrated in a crossbar structure. Furthermore, this paper investigates alternative approaches to mapping the synaptic weights to the memristive crossbar, thereby realizing a simplified neuromemristive system. The proposed design is studied within the context of the extreme learning machine (ELM), using the delta rule learning algorithm to train the output layer. The network is implemented using the IBM 65 nm technology node and verified in the Cadence Spectre environment. The hardware model is verified for classification with binary and multi-class datasets. The total power for a single 4x4 layer network is estimated to be ~29.62 µW, while the area is estimated to be 26.48 µm × 22.35 µm.
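A software-only sketch of the learning scheme named above, an ELM whose output layer is trained with the delta rule, is given below. It says nothing about the memristor circuitry or the weight mapping; the class name, network size, learning rate, and XOR data set are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyELM:
    """ELM sketch: fixed random hidden layer, trainable linear output."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Hidden weights (plus bias) are drawn once and never trained.
        self.w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)]
                     for _ in range(n_hidden)]
        self.w_out = [0.0] * (n_hidden + 1)  # output weights plus bias

    def hidden(self, x):
        xb = list(x) + [1.0]
        acts = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                for row in self.w_in]
        return acts + [1.0]

    def predict(self, x):
        return sum(w * h for w, h in zip(self.w_out, self.hidden(x)))

    def train(self, samples, lr=0.05, epochs=200):
        for _ in range(epochs):
            for x, target in samples:
                h = self.hidden(x)
                error = target - sum(w * hi for w, hi in zip(self.w_out, h))
                # Delta rule: move each output weight along error * input.
                self.w_out = [w + lr * error * hi
                              for w, hi in zip(self.w_out, h)]

def mse(model, samples):
    return sum((t - model.predict(x)) ** 2 for x, t in samples) / len(samples)

xor = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 0.0)]
model = TinyELM(n_in=2, n_hidden=8)
before = mse(model, xor)
model.train(xor)
print(before, mse(model, xor))
```

Only `w_out` changes during training, which is what makes the ELM attractive for hardware: the update for each output weight needs just the local hidden activation and the output error.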
Editorial: Silicon Photonics for Computing Systems
A high-performance and energy-efficient Network-on-Chip (NoC) architecture is one of the crucial components of manycore processing platforms. A very promising NoC architecture recently proposed in the literature is the three-dimensional small-world NoC (3D SWNoC). Due to the short vertical links in 3D integration and the robustness of small-world networks, the 3D SWNoC architecture outperforms its other 3D counterparts. However, the performance of 3D SWNoC is highly dependent on the placement of the links and associated routers. In this paper, we propose a sensitivity-based link placement algorithm (SEN) to optimize the performance of 3D SWNoC. We compare the performance of the SEN algorithm with simulated annealing (SA) and a recently proposed machine learning-based (ML) optimization algorithm. The optimized 3D SWNoC obtained by the proposed SEN algorithm achieves, on average, 11.5% and 13.6% lower latency and 18.4% and 21.7% lower energy-delay product than those optimized by the SA and ML algorithms, respectively. In addition, the SEN algorithm is 26 to 33 times faster than the SA algorithm for the optimization of 64-, 128-, and 256-core 3D SWNoC designs. However, we find that the ML-based methodology has a faster convergence time than SEN and SA for bigger systems.
Adapting deep neural networks and deep learning algorithms for neuromorphic hardware has been well established for discriminative and generative models. We study the applicability of neural networks and neuromorphic hardware for solving general optimization problems without the use of adaptive training or learning algorithms. We leverage the dynamics of Hopfield networks and spin glass systems to construct a fully connected spiking neural system to generate synchronous spike responses indicative of the underlying community structure in an undirected, unweighted graph. Mapping this fully connected system to current neuromorphic hardware is done by embedding sparse tree graphs to generate only the leading order spiking dynamics. We demonstrate that for a chosen set of benchmark graphs, non-overlapping communities can be identified, even with the loss of higher order spiking behavior.
In the area of biomedical engineering, digital-microfluidic biochips (DMFBs) have received considerable attention because of their capability of providing an efficient and reliable platform for conducting point-of-care clinical diagnostics. System reliability, in turn, mandates error-recoverability while implementing biochemical assays on-chip for medical applications. Unfortunately, the technology of DMFBs is not yet fully equipped to handle error-recovery from various microfluidic operations involving droplet motion and reaction. Recently, a number of cyber-physical systems have been proposed to provide real-time checking and error-recovery in assays based on the feedback received from a few on-chip checkpoints. However, in order to synthesize robust feedback systems for different types of DMFBs, certain practical issues need to be considered such as co-optimization of checkpoint placement, error-recoverability, and layout of droplet-routing pathways. For application-specific DMFBs, we propose here an algorithm that minimizes the number of checkpoints and determines their locations to cover every path in a given droplet-routing solution. Next, for general-purpose DMFBs, where the checkpoints are pre-deployed in specific locations, we present a checkpoint-aware routing algorithm such that every droplet-routing path passes through at least one checkpoint to enable error-recovery and to ensure physical routability of all droplets.
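For the application-specific case, covering every droplet-routing path with as few checkpoints as possible is an instance of the minimum hitting-set problem. The sketch below uses a simple greedy heuristic as a stand-in; it is not the algorithm proposed in the paper and does not guarantee a minimum. The function name and the grid coordinates are hypothetical, and every path is assumed to contain at least one cell.

```python
def place_checkpoints(paths):
    """Greedy hitting set: repeatedly pick the cell that lies on the
    largest number of still-uncovered droplet-routing paths."""
    uncovered = set(range(len(paths)))
    chosen = []
    while uncovered:
        counts = {}
        for i in uncovered:
            for cell in set(paths[i]):
                counts[cell] = counts.get(cell, 0) + 1
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(counts), key=lambda cell: counts[cell])
        chosen.append(best)
        uncovered = {i for i in uncovered if best not in paths[i]}
    return chosen

# Three hypothetical droplet routes on a 3x3 electrode grid (row, column).
routes = [
    [(0, 0), (0, 1), (0, 2), (1, 2)],
    [(2, 2), (1, 2), (0, 2)],
    [(2, 0), (2, 1), (2, 2)],
]
checkpoints = place_checkpoints(routes)
print(checkpoints)  # [(0, 2), (2, 0)]
```

The first checkpoint lands on a cell shared by two routes, and a second one is added for the remaining route, so every path passes through at least one checkpoint.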
We study the ultimate limits of hardware solutions for self-protection strategies against permanent faults in networks on chips (NoCs). NoC reliability is improved by replacing each base router with an augmented router that includes extra protection circuitry. We compare the protection achieved by the self-test and self-protect (STAP) architectures to that of triple modular redundancy with voting (TMR). In practice, none of the considered architectures (STAP or TMR) can tolerate all permanent faults, especially faults in the extra circuitry for protection or voting. Consequently, there will always be some unidentified defective augmented routers that transmit errors in an unpredictable manner. Specifically, we study and determine the average percentage of unidentified defective routers (UDRs) and their impact on the overall reliability of the NoC in light of self-protection strategies. Our study shows that TMR is the most efficient solution for limiting the average percentage of UDRs when fewer than 0.1% of base routers are defective. Above 1% of defective base routers, the STAP approaches are more efficient, although the protection efficiency decreases inexorably in very defective technologies (e.g., when 10% or more of base routers are defective).
Nanophotonic networks have been challenged for their reliability due to several device-level limitations. One of the main issues is that fabrication errors can cause devices to malfunction, rendering communication unreliable. For example, the microring resonator, a preferred optical modulator device, may not resonate at the designated wavelength under process variations (PV), leading to communication errors and bandwidth loss. This paper proposes a series of solutions to the wavelength drifting problem of microrings due to PV. The objective is to maximize network bandwidth through proper arrangement among microrings and wavelengths with minimum power requirement. Our arrangement, called "MinTrim", solves this problem using simple integer linear programming, adding supplementary microrings and allowing flexible assignment of wavelengths to network nodes as long as the resulting network presents maximal bandwidth. Each step is shown to improve bandwidth provisioning with a lower power requirement. Evaluations on a sample network show that a baseline network could lose more than 40% of its bandwidth due to PV. Such loss can be recovered by MinTrim to produce a network with 98.4% working bandwidth. In addition, the power required in arranging microrings is 39% lower than the baseline. Therefore, MinTrim provides an efficient PV-tolerant solution for improving the reliability of on-chip photonics.
Current deep learning approaches that have been very successful use convolutional neural networks (CNNs) trained on large graphics processing unit (GPU)-based computers. Three limitations of this approach are: 1) the networks are based on a simple layered topology, i.e., highly connected layers without intra-layer connections; 2) the networks are manually configured to achieve optimal results; and 3) the implementation of the neuron model is expensive in both cost and power. In this paper, we evaluate deep learning models using three different computing architectures to address these problems: quantum computing to train complex topologies, high performance computing (HPC) to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation. We use the MNIST dataset for our experiment, due to the input size limitations of current quantum computers. Our results show the feasibility of using the three architectures in tandem to address the above deep learning limitations. We show that a quantum computer can find high-quality values of intra-layer connection weights in a tractable time as the complexity of the network increases; a high performance computer can find optimal layer-based topologies; and a neuromorphic computer can represent the complex topology and weights derived from the other architectures in low-power memristive hardware.
ONoC is a promising communication medium for large-scale MPSoC. Indeed, ONoC can outperform classical electrical NoC in terms of energy efficiency and bandwidth density, in particular because this medium can support multiple transactions at the same time on different wavelengths by using WDM. However, multiple signals simultaneously sharing the same part of a waveguide can lead to inter-channel crosstalk noise. This problem impacts the Signal to Noise Ratio (SNR) of the optical signals, which leads to an increase in the Bit Error Rate (BER) at the receiver side. If a specific BER is targeted, an increase in laser power is necessary to satisfy the SNR. In this context, an important issue is to evaluate the laser power needed to satisfy the various desired communication bandwidths based on the BER performance requirements. In this paper, we propose an off-line approach that concurrently optimizes the laser power scaling and the execution time of a global application. A set of different power levels is introduced for each laser, so that each optical signal can be emitted with just enough power to meet the targeted BER. As a result, the most promising solutions are highlighted for mapping a defined application onto a 16-core ring-based WDM ONoC.
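The idea of emitting with just enough power for a targeted BER can be sketched as a search over discrete laser levels. The model below is an assumption made for illustration and is not the paper's: it uses the common on-off-keying approximation BER = 0.5 · erfc(Q / √2), takes Q as the ratio of received signal power to a fixed crosstalk-plus-noise power, and uses hypothetical function names and parameter values.

```python
import math

def ber_from_q(q):
    """Approximate bit error rate for a given Q factor (assumed OOK model)."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))

def lowest_sufficient_level(levels_mw, path_loss_db, noise_mw, target_ber):
    """Return the smallest laser power level (in mW) that meets target_ber,
    or None if no available level is sufficient."""
    attenuation = 10.0 ** (-path_loss_db / 10.0)
    for level in sorted(levels_mw):
        q = level * attenuation / noise_mw
        if ber_from_q(q) <= target_ber:
            return level
    return None

# Hypothetical laser with four power levels and a 10 dB lossy path.
level = lowest_sufficient_level(
    levels_mw=[0.5, 1.0, 2.0, 4.0],
    path_loss_db=10.0,
    noise_mw=0.025,
    target_ber=1e-9,
)
print(level)  # 2.0
```

More crosstalk raises `noise_mw` and pushes the selection to a higher level, which is the trade-off between laser power and concurrent wavelength use that an off-line optimizer would explore per communication.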
As the relentless quest for higher throughput and lower energy cost continues in heterogeneous multicores, there is a strong demand for energy-efficient and high-performance Network-on-Chip (NoC) architectures. Photonic interconnects are a disruptive technology solution that has the potential to increase bandwidth, reduce latency, and improve energy efficiency over traditional metallic interconnects. In this paper, we propose a CPU-GPU heterogeneous architecture called SHARP (Shared Heterogeneous Architecture with Reconfigurable Photonic Network-on-Chip) that clusters CPU and GPU cores around the same router and dynamically allocates bandwidth between the CPU and GPU cores based on application demands. The SHARP architecture is designed as a Single-Writer Multiple-Reader (SWMR) crossbar with reservation-assist to connect CPU/GPU cores, and it dynamically reallocates bandwidth using buffer utilization information at runtime. As network traffic exhibits temporal and spatial fluctuations due to application behavior, SHARP can dynamically reallocate bandwidth and thereby adapt to application demands. SHARP demonstrates a 34% performance (throughput) improvement over a baseline electrical CMESH while consuming 25% less energy per bit. Simulation results also show a 6.9% to 14.9% performance improvement over other flavors of the proposed SHARP architecture without dynamic bandwidth allocation.