Guest Editors' Introduction: Frontiers of Hardware and Algorithms for On-chip Learning
Optical networks-on-chip (NoCs) based on silicon photonics have been proposed as emerging on-chip communication architectures for chip multiprocessors with large core counts. However, due to the thermal sensitivity of the optical devices used in optical NoCs, on-chip temperature variations cause significant thermally induced optical power loss, which would counteract the power advantages of optical NoCs. To tackle this problem, we propose a learning-based thermal-sensitive power optimization approach for mesh- or torus-based optical NoCs in the presence of temperature variations. The proposed approach combines two key techniques: an initial device setting and thermal tuning mechanism, a device-level optimization, and a learning-based thermal-sensitive adaptive routing algorithm, a network-level optimization. Simulation results for an 8x8 mesh-based optical NoC show that the initial device setting and thermal tuning mechanism confines the worst-case thermally induced energy consumption to the order of tens of pJ/bit by avoiding the significant optical power loss caused by temperature-dependent wavelength shifts. The results also show that the learning-based thermal-sensitive adaptive routing algorithm finds, for each communication pair, an optimal path with the minimum estimated thermally induced power consumption. The proposed routing offers greater room for optimization, especially for applications with more long-distance traffic.
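The network-level idea, choosing the route whose estimated thermally induced power is smallest, can be sketched as a shortest-path search in which each hop is weighted by a temperature-dependent loss term. The cost model below (loss proportional to the deviation from a reference temperature, with hypothetical constants `t_ref` and `loss_per_kelvin`) is an illustrative assumption, not the paper's estimator:

```python
import heapq

def min_thermal_path(temps, src, dst, t_ref=300.0, loss_per_kelvin=0.1):
    """Dijkstra over a mesh where entering a node costs a base hop loss plus
    a penalty growing with the node's deviation from t_ref (illustrative
    thermal-loss model; temps maps (x, y) -> temperature in kelvin)."""
    def cost(node):
        return 1e-3 + loss_per_kelvin * abs(temps[node] - t_ref)

    dist, prev, seen = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        x, y = u
        # 4-connected mesh neighbors
        for v in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if v in temps and v not in seen:
                nd = d + cost(v)
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]
```

On a small mesh with one hot node, the returned route detours around it whenever the detour's accumulated loss is lower.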
Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. Exploiting light interference and resonance phenomena, a nanophotonic device works as a voltage-controlled optical pass-gate, much like a pass-transistor. This paper first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on it. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.
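As an illustration of the pass-gate abstraction, the carry chain of an adder can be modeled behaviorally with a pass-gate primitive acting as a 2:1 selector; this Manchester-style sketch is our own abstraction, not the paper's circuit:

```python
def pass_gate(select, pass_in, drive):
    """Voltage-controlled optical pass-gate modeled as a 2:1 selector: when
    select is 1 the light on pass_in passes through, otherwise the gate
    drives the local value (a behavioral abstraction, not a device model)."""
    return pass_in if select else drive

def pass_gate_adder(a_bits, b_bits, c=0):
    """Ripple adder built from pass-gates (bits given LSB first): when the
    propagate signal is 1 the carry passes straight through the chain,
    otherwise the stage drives its generate signal."""
    out = []
    for a, b in zip(a_bits, b_bits):
        p, g = a ^ b, a & b      # propagate and generate
        out.append(p ^ c)        # sum bit
        c = pass_gate(p, c, g)   # carry either passes or is regenerated
    return out, c
```

In the optical version, a long run of propagate stages lets the carry traverse the chain as light, which is the source of the claimed speed advantage.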
In this paper, we propose an In-Memory Flexible Computing platform (IMFlexCom) using a novel Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array architecture that can operate in two modes: a memory mode, working as non-volatile memory, and a computing mode, implementing reconfigurable logic (AND/OR/XOR) within the memory array. Such intrinsic in-memory logic can process data within the memory itself, greatly reducing the power-hungry, long-distance movement of massive data in conventional von Neumann computing systems. We further employ bulk bitwise vector operations and a data encryption engine based on the Advanced Encryption Standard (AES) as case studies to investigate the performance of the proposed in-memory computing architecture. Our design shows up to 35× energy savings and up to 18× speedup for bulk bitwise in-memory vector AND/OR operations compared to DRAM-based in-memory logic. Moreover, it achieves 77.27% and 85.4% lower energy consumption than CMOS-ASIC and CMOL-based AES implementations, respectively, and offers energy consumption comparable to a recent DW-AES implementation with 66.7% less area overhead.
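A behavioral sketch of the dual-mode array may help: rows are plain Python ints used as wide bit-vectors, and the `compute` method stands in for the modified sensing scheme (the class and its interface are illustrative, not the SOT-MRAM device model):

```python
class SOTMRAMArraySim:
    """Behavioral sketch of a dual-mode memory array (not a device model).
    Each row is a Python int acting as a wide bit-vector."""

    def __init__(self, rows, width):
        self.width = width
        self.mem = [0] * rows

    def write(self, row, value):  # memory mode
        self.mem[row] = value & ((1 << self.width) - 1)

    def read(self, row):          # memory mode
        return self.mem[row]

    def compute(self, row_a, row_b, op):  # computing mode
        # In hardware, both word-lines are asserted at once and the sense
        # amplifier reference is shifted to realize AND vs. OR; XOR takes an
        # extra sensing step. Here we just model the logical result.
        a, b = self.mem[row_a], self.mem[row_b]
        return {"and": a & b, "or": a | b, "xor": a ^ b}[op]
```

The point of the bulk bitwise case study is that one computing-mode access produces an entire row-wide result without moving either operand out of the array.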
Protection of intellectual property (IP) is increasingly critical for IP vendors in the semiconductor industry. However, advanced reverse engineering techniques can physically disassemble a chip and derive its IPs at a cost much lower than the value of the IP designs the chip carries. Such invasive hardware attacks, which obtain information directly from IC chips, violate the IP rights of vendors. The intent of this paper is to present a chip-level reverse-engineering-resilient design technique. In the proposed technique, transformable interconnects enable an IC chip to function normally in regular use and to transform its physical structure into another pattern when exposed to invasive attacks. The newly created pattern significantly increases the difficulty of reverse engineering. Furthermore, to improve the effectiveness of the proposed technique, a systematic design method is developed targeting integrated circuits with multiple design constraints. Simulations demonstrate the capability of the proposed technique, which imposes extremely large complexity on reverse engineering with manageable overhead.
Stochastic circuits (SCs) offer tremendous area- and power-consumption benefits at the expense of computational inaccuracies. Unlike conventional logic synthesis, managing accuracy is a central problem in SC design. It is usually tackled in an ad hoc fashion by multiple trial-and-error simulations that vary relevant parameters like the stochastic number length n. We present, for the first time, a systematic design approach to controlling the accuracy of SCs and balancing it against other design parameters. We express the (in)accuracy of a circuit processing n-bit stochastic numbers as the numerical deviation of the computed value from the expected result, in conjunction with a confidence level. Using the theory of Monte Carlo simulation, we derive expressions for the stochastic number length required for a desired level of accuracy, and vice versa. We discuss the integration of the theory into a design framework applicable to both combinational and sequential SCs. We show that for combinational SCs, accuracy is independent of the circuit's size or complexity, a surprising result. We also show how the analysis can identify subtle errors in both combinational and sequential designs. Finally, we apply the proposed methods to a case study on filtering noisy EKG signals.
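Under a normal (Monte Carlo) approximation, the sizing rule follows from the binomial variance p(1-p)/n of an n-bit stream with mean p: demanding that the observed value deviate from p by at most epsilon with confidence conf gives n ≥ (z/epsilon)² p(1-p), worst-case at p = 0.5. A sketch of this rule (the function names are ours, and this is the textbook bound rather than the paper's exact expressions):

```python
from math import ceil, erf, sqrt

def z_from_confidence(conf):
    """Invert the standard normal CDF for a two-sided confidence level by
    bisection on Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    target = 0.5 + conf / 2.0
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (1 + erf(mid / sqrt(2))) / 2 < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def required_length(epsilon, conf, p=0.5):
    """Stochastic number length n such that the value observed on an n-bit
    stream with mean p stays within epsilon of p with probability conf.
    p = 0.5 is the worst case (largest variance)."""
    z = z_from_confidence(conf)
    return ceil((z / epsilon) ** 2 * p * (1 - p))
```

For example, a deviation bound of 0.05 at 95% confidence requires a few hundred bits, while tightening the bound to 0.01 pushes the length into the thousands, which is exactly the accuracy-versus-cost trade-off the framework makes explicit.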
Low operating voltage, high storage density, non-volatile storage capabilities, and relatively low access latencies have popularized memristive devices as storage devices. Memristors are ideally suited to in-memory computing in the form of hybrid CMOS nano-crossbar arrays. In-memory serial adders have been demonstrated theoretically and experimentally for crossbar arrays. To harness the parallelism of memristive arrays, parallel-prefix adders can be effective. In this work, a novel mapping scheme for an in-memory Kogge-Stone adder is presented. The number of cycles increases logarithmically with the bit width N of the operands, i.e., O(log2 N), and the device count is 5N. We verify the correctness of the proposed scheme by memristive simulations based on a TaOx device model, and compare it with other proposed schemes in terms of the number of cycles and the number of devices.
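The O(log2 N) cycle count mirrors the prefix stages of the Kogge-Stone recurrence, which can be checked in software with the classic shift-based formulation (a functional model only, not the crossbar mapping itself):

```python
def kogge_stone_add(a, b, width=8):
    """Parallel-prefix (Kogge-Stone) addition on width-bit operands; the
    number of prefix stages is ceil(log2(width)), mirroring the O(log2 N)
    cycle count of the in-memory mapping."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b  # generate signals
    p = a ^ b  # propagate signals
    d = 1
    while d < width:
        # combine each prefix with the prefix d positions below it
        g = (g | (p & (g << d))) & mask
        p = (p & (p << d)) & mask
        d <<= 1
    carries = (g << 1) & mask  # carry into bit i = group-generate of lower bits
    return ((a ^ b) ^ carries) & mask
```

All bit positions update in parallel at each stage, which is the parallelism the crossbar rows are meant to exploit.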
Editorial: Silicon Photonics for Computing Systems
As a step towards implementing usable machine learning applications, especially perceptual tasks, on a mobile form factor, we explore sound source separation to isolate human voice from background noise on a mobile phone. The challenges involved are real-time streaming execution and power constraints. As a solution, we present a novel hardware-based sound source separation architecture capable of real-time streaming performance while consuming low power. The implementation uses a Markov Random Field (MRF) formulation of Blind Source Separation (BSS) with two microphones. It uses Expectation-Maximization (EM) to learn hidden MRF parameters on the fly and performs Maximum A Posteriori (MAP) inference using Gibbs sampling to find the best separation of sources. We demonstrate a real-time streaming FPGA implementation running at 150 MHz with 207 KB of RAM. It achieves a speed-up of 22X over a conventional software reference, attains an SDR of up to 7.021 dB with 1.601 ms latency, and exhibits excellent perceived audio quality. A virtual ASIC design study shows that the architecture is small, at fewer than 10M gates, and consumes only 40.034 mW at 150 MHz, roughly 10% of the power of an ARM Cortex-A9.
High-performance and energy-efficient Network-on-Chip (NoC) architecture is one of the crucial components of manycore processing platforms. A very promising NoC architecture recently proposed in the literature is the three-dimensional small-world NoC (3D SWNoC). Due to the short vertical links in 3D integration and the robustness of small-world networks, the 3D SWNoC architecture outperforms its other 3D counterparts. However, the performance of 3D SWNoC is highly dependent on the placement of the links and associated routers. In this paper, we propose a sensitivity-based link placement algorithm (SEN) to optimize the performance of 3D SWNoC. We compare the performance of the SEN algorithm with simulated annealing (SA)-based and recently proposed machine learning-based (ML) optimization algorithms. The optimized 3D SWNoC obtained by the proposed SEN algorithm achieves, on average, 11.5% and 13.6% lower latency and 18.4% and 21.7% lower energy-delay product than those optimized by the SA and ML algorithms, respectively. In addition, the SEN algorithm is 26 to 33 times faster than the SA algorithm for the optimization of 64-, 128-, and 256-core 3D SWNoC designs. However, we find that the ML-based methodology converges faster than SEN and SA for bigger systems.
Adapting deep neural networks and deep learning algorithms for neuromorphic hardware has been well established for discriminative and generative models. We study the applicability of neural networks and neuromorphic hardware for solving general optimization problems without the use of adaptive training or learning algorithms. We leverage the dynamics of Hopfield networks and spin glass systems to construct a fully connected spiking neural system to generate synchronous spike responses indicative of the underlying community structure in an undirected, unweighted graph. Mapping this fully connected system to current neuromorphic hardware is done by embedding sparse tree graphs to generate only the leading order spiking dynamics. We demonstrate that for a chosen set of benchmark graphs, non-overlapping communities can be identified, even with the loss of higher order spiking behavior.
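A toy software analogue of these dynamics: couple nodes with +1 across edges and a small negative weight across non-edges, then run asynchronous sign updates until a fixed point; nodes settling to the same sign indicate a community. The coupling scheme and the resolution parameter `gamma` are illustrative assumptions, not the paper's spin-glass embedding:

```python
def hopfield_communities(adj, state, gamma=0.5, sweeps=10):
    """Asynchronous Hopfield-style dynamics on an undirected, unweighted
    graph: couplings are +1 across edges and -gamma across non-edges (a toy
    encoding; gamma is a hypothetical resolution parameter). Returns the
    spin state after convergence or after `sweeps` full passes."""
    n = len(adj)
    s = list(state)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            # local field: attracted to neighbors, repelled by non-neighbors
            field = sum((1.0 if adj[i][j] else -gamma) * s[j]
                        for j in range(n) if j != i)
            new = 1 if field >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s
```

On two triangles joined by a single bridge edge, a slightly perturbed two-community state relaxes back to one spin sign per triangle, the discrete analogue of the synchronous spike groups described above.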
Nanophotonic networks face reliability challenges due to several device-level limitations. One of the main issues is that fabrication errors can cause devices to malfunction, rendering communication unreliable. For example, a microring resonator, a preferred optical modulator device, may not resonate at the designated wavelength under process variations (PV), leading to communication errors and bandwidth loss. This paper proposes a series of solutions to the wavelength-drifting problem of microrings under PV. The objective is to maximize network bandwidth through a proper arrangement of microrings and wavelengths with a minimum power requirement. Our arrangement, called "MinTrim," solves this problem using simple integer linear programming, adding supplementary microrings and allowing flexible assignment of wavelengths to network nodes as long as the resulting network achieves maximal bandwidth. Each step is shown to improve bandwidth provisioning with a lower power requirement. Evaluations on a sample network show that a baseline network can lose more than 40% of its bandwidth to PV. MinTrim recovers this loss, producing a network with 98.4% working bandwidth while requiring 39% less microring-arrangement power than the baseline. MinTrim thus provides an efficient PV-tolerant solution for improving the reliability of on-chip photonics.
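MinTrim itself uses integer linear programming; the arrangement idea can still be illustrated with a simple greedy sketch that pairs each channel wavelength with the drifted microring needing the least trimming power (assuming, for illustration only, tuning power proportional to the drift distance):

```python
def assign_rings(ring_resonances, channels, power_per_nm=1.0):
    """Greedily pair each channel wavelength with the remaining drifted
    microring that needs the least trimming power. The linear power-vs-drift
    model is a simplifying assumption. Returns (assignment, total_power),
    where assignment maps channel index -> ring index."""
    free = dict(enumerate(ring_resonances))  # ring index -> drifted resonance
    assignment = {}
    total = 0.0
    for ch, lam in sorted(enumerate(channels), key=lambda c: c[1]):
        ring = min(free, key=lambda r: abs(free[r] - lam))
        assignment[ch] = ring
        total += power_per_nm * abs(free.pop(ring) - lam)
    return assignment, total
```

The ILP formulation improves on such a greedy pass by optimizing all pairings jointly (and by allowing spare rings), but the objective, maximum usable bandwidth at minimum trimming power, is the same.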
Current Deep Learning approaches that have been very successful use convolutional neural networks (CNN) trained on large graphical processing unit (GPU)-based computers. Three limitations of this approach are: 1) they are based on a simple layered network topology, i.e., highly connected layers without intra-layer connections; 2) the networks are manually configured to achieve optimal results; and 3) the implementation of the neuron model is expensive in both cost and power. In this paper, we evaluate deep learning models using three different computing architectures to address these problems: quantum computing to train complex topologies, high-performance computing (HPC) to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation. We use the MNIST dataset for our experiments due to the input-size limitations of current quantum computers. Our results show the feasibility of using the three architectures in tandem to address the above deep learning limitations. We show that a quantum computer can find high-quality values of intra-layer connection weights in tractable time as the complexity of the network increases; a high-performance computer can find optimal layer-based topologies; and a neuromorphic computer can represent the complex topology and weights derived from the other architectures in low-power memristive hardware.
To face the complex communication problems that arise as the number of on-chip components grows, photonic networks-on-chip have recently been proposed to replace electronic interconnects. However, photonic networks-on-chip lack efficient laser sources, possibly resulting in an inefficient or inoperable architecture. In this technical note, we introduce a methodology for the design-space exploration of optical NoC mapping solutions, which automatically assigns application tasks to network tiles such that total laser power consumption is minimized. The experimental evaluation shows average power reductions of 34.7% and 27.35% compared to application-oblivious and randomly mapped photonic NoCs, respectively, allowing improved energy efficiency.
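The exploration objective can be illustrated with a toy cost model in which laser power scales with traffic volume times hop distance (a per-hop-loss assumption of ours, with exhaustive search standing in for the paper's methodology, so it is only feasible for tiny instances):

```python
from itertools import permutations

def laser_cost(order, traffic, width):
    """Estimated laser power of a mapping: sum of traffic volume times
    Manhattan hop distance (per-hop optical loss assumption). `order` lists
    the task placed on each tile of a width-wide mesh, row-major."""
    pos = {t: (i % width, i // width) for i, t in enumerate(order)}
    cost = 0
    for (a, b), vol in traffic.items():
        (xa, ya), (xb, yb) = pos[a], pos[b]
        cost += vol * (abs(xa - xb) + abs(ya - yb))
    return cost

def best_mapping(tasks, traffic, width):
    """Exhaustive design-space exploration over all task-to-tile mappings
    (illustrative only; real flows use heuristics to scale)."""
    return min(permutations(tasks), key=lambda o: laser_cost(o, traffic, width))
```

Heavily communicating task pairs end up on adjacent tiles, which is exactly how an application-aware mapping beats an application-oblivious one.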
An Optical Network-on-Chip (ONoC) is a promising communication medium for large-scale MPSoCs. Indeed, ONoCs can outperform classical electrical NoCs in terms of energy efficiency and bandwidth density, in particular because the medium can support multiple simultaneous transactions on different wavelengths using wavelength-division multiplexing (WDM). However, multiple signals simultaneously sharing the same section of a waveguide can lead to inter-channel crosstalk noise. This problem degrades the Signal-to-Noise Ratio (SNR) of the optical signals, which increases the Bit Error Rate (BER) at the receiver side. If a specific BER is targeted, an increase in laser power may be necessary to satisfy the SNR. In this context, an important issue is to evaluate the laser power needed to satisfy the various desired communication bandwidths given the BER requirements. In this paper, we propose an off-line approach that concurrently optimizes laser power scaling and the execution time of a global application. A set of discrete power levels is introduced for each laser, so that each optical signal is emitted with just enough power to meet the targeted BER. As a result, the most promising solutions are highlighted for mapping a given application onto a 16-core ring-based WDM ONoC.
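The per-laser level selection can be sketched as follows, using the standard OOK approximation BER = 0.5·erfc(Q/√2) with Q = √SNR and a simplified linear crosstalk model (both assumptions of ours, not the paper's SNR analysis):

```python
from math import erfc, sqrt

def pick_laser_level(levels_mw, crosstalk_mw, target_ber):
    """Choose the lowest discrete laser power level whose resulting BER meets
    the target. SNR is modeled as signal power over crosstalk power (a
    simplification); BER uses the standard on-off-keying approximation
    BER = 0.5 * erfc(Q / sqrt(2)) with Q = sqrt(SNR)."""
    for p in sorted(levels_mw):
        snr = p / crosstalk_mw
        ber = 0.5 * erfc(sqrt(snr) / sqrt(2))
        if ber <= target_ber:
            return p  # just-enough power for the targeted BER
    return None       # no available level satisfies the target
```

Running this per laser with traffic-dependent crosstalk estimates is the "just-enough power" idea: lightly interfered channels get the low levels, heavily interfered ones the high levels.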
As the relentless quest for higher throughput and lower energy cost continues in heterogeneous multicores, there is a strong demand for energy-efficient, high-performance Network-on-Chip (NoC) architectures. Photonic interconnects are a disruptive technology solution with the potential to increase bandwidth, reduce latency, and improve energy efficiency over traditional metallic interconnects. In this paper, we propose a CPU-GPU heterogeneous architecture called SHARP (Shared Heterogeneous Architecture with Reconfigurable Photonic Network-on-Chip) that clusters CPU and GPU cores around the same router and dynamically allocates bandwidth between them based on application demands. The SHARP architecture is designed as a Single-Writer Multiple-Reader (SWMR) crossbar with reservation-assist that reallocates bandwidth at runtime using buffer utilization information. As network traffic exhibits temporal and spatial fluctuations due to application behavior, SHARP can dynamically reallocate bandwidth and thereby adapt to application demands. SHARP demonstrates a 34% performance (throughput) improvement over a baseline electrical CMESH while consuming 25% less energy per bit. Simulation results also show a 6.9% to 14.9% performance improvement over flavors of the SHARP architecture without dynamic bandwidth allocation.
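A minimal sketch of utilization-driven reallocation, assuming bandwidth is divided into time-slots split in proportion to recent buffer occupancy (a hypothetical policy of ours, not SHARP's exact arbitration logic):

```python
def reallocate_bandwidth(cpu_util, gpu_util, total_slots=16, min_slots=2):
    """Split shared-channel time-slots between the CPU and GPU clusters in
    proportion to recent buffer utilization (values in [0, 1]). min_slots
    guarantees neither side is starved. Returns (cpu_slots, gpu_slots)."""
    demand = cpu_util + gpu_util
    if demand == 0:
        cpu = total_slots // 2          # idle network: split evenly
    else:
        cpu = round(total_slots * cpu_util / demand)
    cpu = max(min_slots, min(total_slots - min_slots, cpu))
    return cpu, total_slots - cpu
```

Re-running this at each monitoring epoch lets the allocation track the temporal and spatial fluctuations in traffic that the paper describes.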