FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than their GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, whereas the performance of GPU acceleration varies widely with batch size. Experimental results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
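To make the bitwise arithmetic concrete, the following minimal Python sketch shows the XNOR/popcount dot product that binary CNN accelerators of this kind typically exploit; the packing scheme and function name are illustrative assumptions, not the paper's RTL design.

```python
# Dot product of {-1, +1} vectors encoded as bits (1 -> +1, 0 -> -1):
# a binary multiply becomes XNOR, and the accumulation becomes a popcount.
# The packing order and names here are illustrative assumptions.

def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as n-bit ints."""
    matches = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    agreements = bin(matches).count("1")           # popcount of matching positions
    return 2 * agreements - n                      # agreements minus disagreements

# a = [+1, -1, +1, +1] -> 0b1011, w = [+1, +1, -1, +1] -> 0b1101 (MSB first)
assert binary_dot(0b1011, 0b1101, 4) == 0
```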
Energy Efficient Neural Computing with Approximate Multipliers
Optical networks-on-chip (NoCs) based on silicon photonics have been proposed as emerging on-chip communication architectures for chip multiprocessors with large core counts. However, due to the thermal sensitivity of the optical devices used in optical NoCs, on-chip temperature variations cause significant thermal-induced optical power loss, which can counteract the power advantages of optical NoCs. To tackle this problem, we propose a learning-based thermal-sensitive power optimization approach for mesh- or torus-based optical NoCs in the presence of temperature variations. The key techniques proposed include an initial device setting and thermal tuning mechanism, a device-level optimization technique, and a learning-based thermal-sensitive adaptive routing algorithm, a network-level optimization technique. Simulation results for an 8x8 mesh-based optical NoC show that the proposed initial device setting and thermal tuning mechanism confines the worst-case thermal-induced energy consumption to the order of tens of pJ/bit by avoiding the significant thermal-induced optical power loss caused by temperature-dependent wavelength shifts. The results also show that the learning-based thermal-sensitive adaptive routing algorithm is able to find an optimal path with the minimum estimated thermal-induced power consumption for each communication pair. The proposed routing offers greater room for optimization, especially for applications with more long-distance traffic.
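As a rough illustration of the network-level step, the sketch below selects a minimum-cost path per communication pair using Dijkstra's algorithm over per-link thermal-induced power estimates; the graph encoding and cost model are assumptions, and the learned per-link estimates the paper produces are taken here as given edge weights.

```python
# Sketch of the path-selection step (the graph encoding and cost model are
# assumptions; the paper's learned per-link thermal-induced power estimates
# are taken here as given edge weights).
import heapq

def min_thermal_path(links, src, dst):
    """links: {node: [(neighbor, est_thermal_power_cost), ...]}.
    Returns (path, cost); assumes dst is reachable from src."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                        # stale queue entry
        for v, cost in links.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[dst]
```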
Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. Through light interference and resonance phenomena, a nanophotonic device works as a voltage-controlled optical pass-gate, analogous to a pass-transistor. This paper first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on it. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.
Artificial neural network computation relies on intensive vector-matrix multiplications. Recently, emerging nonvolatile memory (NVM) crossbar arrays have been shown to implement such operations with high energy efficiency, and many works have explored efficiently utilizing NVM crossbar arrays as analog vector-matrix multipliers. However, their nonlinear I-V characteristics constrain critical design parameters, such as the read voltage and weight range, resulting in substantial accuracy loss. In this paper, instead of optimizing hardware parameters for a given neural network, we propose a methodology for reconstructing the neural network itself so that it is optimized for resistive memory crossbar arrays. To validate the proposed method, we simulated various neural networks on the MNIST and CIFAR-10 datasets using two different Resistive Random Access Memory (RRAM) models. Simulation results show that our proposed neural networks produce significantly higher inference accuracies than conventional neural networks when the synapse devices have nonlinear I-V characteristics.
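For intuition, a minimal NumPy sketch (the sinh-style device model and constants below are assumptions, not the paper's RRAM models) contrasting an ideal crossbar vector-matrix multiply with one whose cells follow a nonlinear I-V curve; the distortion it prints is the kind of accuracy loss the read voltage and weight range must be traded against.

```python
# Toy contrast between an ideal crossbar multiply and one with nonlinear
# cells (the sinh-style model and constants are assumptions, not the
# paper's RRAM models): I_cell = G * Vr * sinh(V / Vr) grows super-linearly
# with the read voltage V, distorting the analog dot products.
import numpy as np

def crossbar_vmm(v_in, G, v_ref=None):
    """v_in: per-row read voltages; G: conductance matrix (rows x cols)."""
    if v_ref is None:                        # ideal linear cell: I = G * V
        return v_in @ G
    cell_currents = G * (v_ref * np.sinh(v_in / v_ref))[:, None]
    return cell_currents.sum(axis=0)         # Kirchhoff sum per column

rng = np.random.default_rng(0)
v = rng.uniform(0.0, 0.3, 8)                 # read voltages (volts)
G = rng.uniform(1e-6, 1e-4, (8, 4))          # conductances (siemens)
print(np.abs(crossbar_vmm(v, G, v_ref=0.25) - crossbar_vmm(v, G)))
```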
Protection of intellectual property (IP) is increasingly critical for IP vendors in the semiconductor industry. However, advanced reverse engineering techniques can physically disassemble a chip and derive its IPs at a cost much lower than the value of the IP designs the chip carries. This invasive hardware attack, which obtains information directly from IC chips, violates the IP rights of vendors. The intent of this paper is to present a chip-level reverse-engineering-resilient design technique. In the proposed technique, transformable interconnects enable an IC chip to function normally in ordinary use and to transform its physical structure into another pattern when exposed to invasive attacks. The newly created pattern significantly increases the difficulty of reverse engineering. Furthermore, to improve the effectiveness of the proposed technique, a systematic design method is developed targeting integrated circuits with multiple design constraints. Simulations have been conducted to demonstrate the capability of the proposed technique, which imposes extremely large complexity on reverse engineering with manageable overhead.
Stochastic circuits (SCs) offer tremendous area- and power-consumption benefits at the expense of computational inaccuracies. Unlike conventional logic synthesis, managing accuracy is a central problem in SC design. It is usually tackled in an ad hoc fashion by multiple trial-and-error simulations that vary relevant parameters such as the stochastic number length n. We present, for the first time, a systematic design approach to controlling the accuracy of SCs and balancing it against other design parameters. We express the (in)accuracy of a circuit processing n-bit stochastic numbers as the numerical deviation of the computed value from the expected result, in conjunction with a confidence level. Using the theory of Monte Carlo simulation, we derive expressions for the stochastic number length required for a desired level of accuracy, or vice versa. We discuss the integration of the theory into a design framework that is applicable to both combinational and sequential SCs. We show that for combinational SCs, accuracy is independent of the circuit's size or complexity, a surprising result. We also show how the analysis can identify subtle errors in both combinational and sequential designs. Finally, we apply the proposed methods to a case study on filtering noisy EKG signals.
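The core relationship can be sketched directly from standard Monte Carlo theory (the formula below is a textbook normal-approximation bound, not quoted from the paper): an n-bit stochastic number is a Bernoulli sample, so its estimate of a value p has standard error sqrt(p(1-p)/n), which gives a closed form for the length required at a target accuracy and confidence.

```python
# Standard Monte Carlo bound (a textbook normal approximation, not the
# paper's exact derivation): an n-bit stochastic number estimating a value
# p has standard error sqrt(p*(1-p)/n), so a target deviation eps at a
# given two-sided confidence level fixes the required length n.
from math import ceil
from statistics import NormalDist

def required_length(eps: float, confidence: float, p: float = 0.5) -> int:
    """Bits needed so |estimate - p| <= eps with the given confidence.
    p = 0.5 is the worst case (maximum Bernoulli variance)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided z-score
    return ceil((z / eps) ** 2 * p * (1.0 - p))

print(required_length(eps=0.01, confidence=0.95))  # ~9604 bits at worst-case p
```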
Low operating voltage, high storage density, non-volatile storage, and relatively low access latencies have popularized memristive devices as storage devices. Memristors are also ideally suited for in-memory computing in the form of hybrid CMOS nano-crossbar arrays. In-memory serial adders have been demonstrated theoretically and experimentally for crossbar arrays. To harness the parallelism of memristive arrays, parallel-prefix adders can be effective. In this work, a novel mapping scheme for an in-memory Kogge-Stone adder is presented. The number of cycles increases logarithmically with the bit width N of the operands, i.e., O(log2 N), and the device count is 5N. We verify the correctness of the proposed scheme by means of memristive simulations based on a TaOx device model. We compare the proposed scheme with other published schemes in terms of cycle count and device count.
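To show the logic the mapping implements, here is a bit-level Python model of Kogge-Stone prefix addition; the stage loop runs O(log2 N) times, mirroring the cycle scaling quoted above, though this sketch models only the Boolean recurrence, not the crossbar mapping itself.

```python
# Bit-level model of Kogge-Stone parallel-prefix addition (logic only; the
# in-memory crossbar mapping itself is not modeled here). The while loop
# runs log2(N) times, mirroring the cycle scaling quoted above.

def kogge_stone_add(a: int, b: int, n: int) -> int:
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate bits
    d = 1
    while d < n:                                       # log2(n) prefix stages
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
    carry = [0] + g                                    # carry into each bit
    return sum((((a >> i) ^ (b >> i) ^ carry[i]) & 1) << i for i in range(n))

assert all(kogge_stone_add(x, y, 8) == (x + y) & 0xFF
           for x in range(0, 256, 7) for y in range(0, 256, 11))
```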
As a step toward implementing usable machine learning applications, especially perceptual tasks, on a mobile form factor, we explore sound source separation to isolate human voice from background noise on a mobile phone. The challenges involved are real-time streaming execution and power constraints. As a solution, we present a novel hardware-based sound source separation architecture capable of real-time streaming performance while consuming low power. The implementation uses a Markov Random Field (MRF) formulation of Blind Source Separation (BSS) with two microphones. It uses Expectation-Maximization (EM) to learn hidden MRF parameters on the fly and performs Maximum A Posteriori (MAP) inference using Gibbs sampling to find the best separation of sources. We demonstrate a real-time streaming FPGA implementation running at 150 MHz with 207 KB of RAM. It achieves a speed-up of 22x over a conventional software reference, delivers an SDR of up to 7.021 dB with 1.601 ms latency, and exhibits excellent perceived audio quality. A virtual ASIC design study shows that the architecture is small, with fewer than 10M gates, and consumes only 40.034 mW at 150 MHz, roughly 10% of the power of an ARM Cortex-A9.
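A highly simplified sketch of the inference step (the Ising-style MRF below is an assumed stand-in, not the paper's exact formulation): each time-frequency bin gets a binary voice/background label, a smoothness prior couples neighboring bins, and Gibbs sampling draws labels from the resulting posterior.

```python
# Simplified Gibbs sampler for an Ising-style MRF over time-frequency bins
# (an assumed stand-in for the paper's formulation): likelihood[t][f] is the
# log-likelihood ratio of "voice" vs "background" for each bin, and a
# smoothness prior with strength beta couples 4-connected neighbors.
import math
import random

def gibbs_separate(likelihood, beta=1.0, sweeps=50, seed=0):
    rng = random.Random(seed)
    T, F = len(likelihood), len(likelihood[0])
    label = [[rng.randint(0, 1) for _ in range(F)] for _ in range(T)]
    for _ in range(sweeps):
        for t in range(T):
            for f in range(F):
                nbrs = [(t - 1, f), (t + 1, f), (t, f - 1), (t, f + 1)]
                nbrs = [(i, j) for i, j in nbrs if 0 <= i < T and 0 <= j < F]
                agree = sum(label[i][j] for i, j in nbrs)      # neighbors at 1
                logit = likelihood[t][f] + beta * (2 * agree - len(nbrs))
                if logit >= 0:                                 # stable sigmoid
                    p_voice = 1.0 / (1.0 + math.exp(-logit))
                else:
                    e = math.exp(logit)
                    p_voice = e / (1.0 + e)
                label[t][f] = 1 if rng.random() < p_voice else 0
    return label  # 1 marks bins assigned to the voice source
```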
High-performance and energy-efficient Network-on-Chip (NoC) architectures are crucial components of manycore processing platforms. A very promising NoC architecture recently proposed in the literature is the three-dimensional small-world NoC (3D SWNoC). Owing to the short vertical links in 3D integration and the robustness of small-world networks, the 3D SWNoC architecture outperforms its 3D counterparts. However, the performance of the 3D SWNoC is highly dependent on the placement of the links and associated routers. In this paper, we propose a sensitivity-based link placement algorithm (SEN) to optimize the performance of the 3D SWNoC. We compare the performance of the SEN algorithm with simulated annealing (SA) and a recently proposed machine learning-based (ML) optimization algorithm. The optimized 3D SWNoC obtained by the proposed SEN algorithm achieves, on average, 11.5% and 13.6% lower latency and 18.4% and 21.7% lower energy-delay product than those optimized by the SA and ML algorithms, respectively. In addition, the SEN algorithm is 26 to 33 times faster than the SA algorithm for the optimization of 64-, 128-, and 256-core 3D SWNoC designs. However, we find that the ML-based methodology converges faster than SEN and SA for larger systems.
An on-chip optical transceiver for 100-GBd+ transmission systems is proposed based on optical time division multiplexing (OTDM) technology. Co-designed with a double-rail driver, an on-chip Mach-Zehnder interferometer (MZI) switch repeatedly generates extremely narrow sampling pulses of only 12 ps full width at half maximum (FWHM). Four cascaded high-speed switch stages driven synchronously at 25 GHz divide the 40 ps clock cycle into four recurrent 9.5 ps time slots, one per sub-channel, plus one 2 ps time slot for clock recovery. A 100-GBd optical transmission channel is thus realized from four 25-Gbps bit-streams at the electrical interface. The crosstalk extinction ratio at the worst sub-channel is 1.9 dB with a 10-dB-depth modulator, and the insertion loss caused by the OTDM mechanism is about 10 dB. Further, a 5-bit OTDM system based on dark modulation is proposed to generate 125-GBd transmission from five 25-Gbps bit-streams at the electrical interface. Its extinction ratio performance is better even though the symbol rate is higher; however, insertion loss and electronic complexity are sacrificed.
Current deep learning approaches that have been very successful use convolutional neural networks (CNNs) trained on large graphics processing unit (GPU)-based computers. Three limitations of this approach are: 1) it is based on a simple layered network topology, i.e., highly connected layers without intra-layer connections; 2) the networks are manually configured to achieve optimal results; and 3) the implementation of the neuron model is expensive in both cost and power. In this paper, we evaluate deep learning models using three different computing architectures to address these problems: quantum computing to train complex topologies, high-performance computing (HPC) to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation. We use the MNIST dataset for our experiments due to the input size limitations of current quantum computers. Our results show the feasibility of using the three architectures in tandem to address the above deep learning limitations. We show that a quantum computer can find high-quality values of intra-layer connection weights in a tractable time as the complexity of the network increases; a high-performance computer can find optimal layer-based topologies; and a neuromorphic computer can represent the complex topology and weights derived from the other architectures in low-power memristive hardware.
The increased capacity of multi-level cells (MLC) and triple-level cells (TLC) in emerging non-volatile memory (NVM) technologies comes at the cost of higher cell write energies and lower cell endurance. In this paper, we describe MFNW, a Flip-N-Write encoding that effectively reduces the write energy and improves the endurance of MLC NVMs. Two MFNW modes are analyzed: a cell Hamming distance (CHD) mode and an energy Hamming distance (EHD) mode. We derive an approximate model that accurately predicts the average number of cell writes, which is proportional to the energy consumption, enabling word-length optimization that maximizes energy reduction subject to memory overhead constraints. In comparison to state-of-the-art MLC NVM encodings, our simulation results indicate that MFNW achieves up to 7%-39% energy saving for 1.56%-50% NVM overhead. Extra energy saving (up to 19%-47%) can be achieved for the same NVM overhead using our proposed variations of MFNW, i.e., MFNW2 and MFNW3. For TLC NVMs, we propose TFNW, which achieves up to 53% energy saving in comparison to state-of-the-art TLC NVM encodings. Endurance simulations indicate that MFNW (TFNW) is capable of extending MLC (TLC) NVM life by up to 100% (87%).
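The underlying Flip-N-Write decision can be sketched in a few lines (the cell alphabet, complement rule, and names are illustrative, loosely corresponding to the cell-Hamming-distance mode): before a write, count how many cells would change; if the complemented word changes fewer, store it instead and set a one-cell flip flag.

```python
# Flip-N-Write decision sketch, loosely corresponding to the cell Hamming
# distance mode (cell alphabet, complement rule, and names are illustrative).

def flip_n_write(old_cells, new_cells, levels=4):
    """old/new_cells: lists of MLC cell levels in range(levels)."""
    changed = sum(o != n for o, n in zip(old_cells, new_cells))
    flipped = [levels - 1 - c for c in new_cells]   # complement every cell
    changed_flipped = sum(o != f for o, f in zip(old_cells, flipped))
    if changed_flipped < changed:
        return flipped, 1     # write the complement and set the flip flag
    return new_cells, 0       # write as-is and clear the flip flag

# Worst-case word: every cell differs, but the complement matches exactly.
print(flip_n_write([0, 3, 1, 2], [3, 0, 2, 1]))  # ([0, 3, 1, 2], 1)
```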
To address the complex communication problems that arise as the number of on-chip components grows, photonic networks-on-chip have recently been proposed to replace electronic interconnects. However, photonic networks-on-chip lack efficient laser sources, possibly resulting in an inefficient or inoperable architecture. In this technical note, we introduce a methodology for the design space exploration of optical NoC mapping solutions, which automatically assigns application tasks to the network tiles such that the total laser power consumption is minimized. The experimental evaluation shows average reductions of 34.7% and 27.35% in power consumption compared to application-oblivious and randomly mapped photonic NoCs, respectively, allowing improved energy efficiency.
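For small instances, the mapping exploration can be illustrated by exhaustive search (the cost model below, traffic volume weighted by a per-pair loss term, is an assumed stand-in for the paper's laser power model):

```python
# Brute-force mapping for tiny instances (the cost model, traffic volume
# weighted by a per-pair loss term, is an assumed stand-in for the paper's
# laser power model; a real DSE flow would search heuristically).
from itertools import permutations

def best_mapping(traffic, loss, tiles):
    """traffic[(a, b)] = volume from task a to task b;
    loss[(i, j)] = optical loss cost between tiles i and j."""
    tasks = sorted({t for pair in traffic for t in pair})
    best, best_cost = None, float("inf")
    for perm in permutations(tiles, len(tasks)):    # tractable for few tasks
        place = dict(zip(tasks, perm))
        cost = sum(v * loss[(place[a], place[b])] for (a, b), v in traffic.items())
        if cost < best_cost:
            best, best_cost = place, cost
    return best, best_cost
```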
Optical networks-on-chip (ONoCs) are a promising communication medium for large-scale MPSoCs. Indeed, an ONoC can outperform a classical electrical NoC in terms of energy efficiency and bandwidth density, in particular because the medium can support multiple transactions at the same time on different wavelengths using wavelength-division multiplexing (WDM). However, multiple signals simultaneously sharing the same portion of a waveguide can lead to inter-channel crosstalk noise. This problem degrades the signal-to-noise ratio (SNR) of the optical signals, which increases the bit error rate (BER) at the receiver side. If a specific BER is targeted, an increase in laser power may be necessary to satisfy the SNR requirement. In this context, an important issue is to evaluate the laser power needed to satisfy the various desired communication bandwidths given the BER performance requirements. In this paper, we propose an off-line approach that concurrently optimizes laser power scaling and the execution time of a global application. A set of discrete power levels is introduced for each laser to ensure that each optical signal is emitted with just enough power to meet the targeted BER. As a result, the most promising solutions are highlighted for mapping a given application onto a 16-core ring-based WDM ONoC.
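The just-enough-power rule can be sketched as picking the lowest discrete level whose resulting link budget meets the SNR implied by the targeted BER; the dB-domain budget below is an assumed toy model, not the paper's analysis.

```python
# "Just-enough power" selection sketch (the dB-domain link budget is an
# assumed toy model): pick the lowest discrete laser power level whose SNR
# meets the threshold implied by the targeted BER.

def pick_power_level(levels_dbm, loss_db, noise_dbm, snr_required_db):
    for p in sorted(levels_dbm):
        snr_db = (p - loss_db) - noise_dbm   # received power minus noise floor
        if snr_db >= snr_required_db:
            return p
    return None  # no configured level can meet the targeted BER

print(pick_power_level([0, 2, 4, 6, 8], loss_db=7.5,
                       noise_dbm=-20.0, snr_required_db=18.0))  # -> 6
```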
As the relentless quest for higher throughput and lower energy cost continues in heterogeneous multicores, there is a strong demand for energy-efficient and high-performance Network-on-Chip (NoC) architectures. Photonic interconnects are a disruptive technology with the potential to increase bandwidth, reduce latency, and improve energy efficiency over traditional metallic interconnects. In this paper, we propose a CPU-GPU heterogeneous architecture called SHARP (Shared Heterogeneous Architecture with Reconfigurable Photonic Network-on-Chip) that clusters CPU and GPU cores around the same router and dynamically allocates bandwidth between them based on application demands. The SHARP architecture is designed as a Single-Writer Multiple-Reader (SWMR) crossbar with reservation-assist that reallocates bandwidth at runtime using buffer utilization information. As network traffic exhibits temporal and spatial fluctuations due to application behavior, SHARP can dynamically reallocate bandwidth and thereby adapt to application demands. SHARP demonstrates a 34% performance (throughput) improvement over a baseline electrical CMESH while consuming 25% less energy per bit. Simulation results also show a 6.9% to 14.9% performance improvement over variants of the SHARP architecture without dynamic bandwidth allocation.
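As a toy illustration of the reallocation idea (the proportional split rule and minimum-share clamp are assumptions, not SHARP's published mechanism), a cluster's bandwidth can be reapportioned each epoch from observed buffer occupancies:

```python
# Toy bandwidth-reallocation policy (the proportional split and the
# minimum-share clamp are assumptions, not SHARP's published mechanism).

def reallocate(total_bw, cpu_buf_util, gpu_buf_util, min_share=0.1):
    """Split a cluster's bandwidth by recent CPU/GPU buffer occupancy."""
    demand = cpu_buf_util + gpu_buf_util
    cpu_frac = 0.5 if demand == 0 else cpu_buf_util / demand
    cpu_frac = min(max(cpu_frac, min_share), 1.0 - min_share)  # clamp shares
    return total_bw * cpu_frac, total_bw * (1.0 - cpu_frac)

print(reallocate(64.0, cpu_buf_util=0.2, gpu_buf_util=0.8))  # GPU-heavy phase
```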