FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than their GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies widely with batch size. Experimental results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
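When weights and activations are binarized, each multiply-accumulate in the convolution reduces to an XNOR followed by a popcount, which is what makes a bitwise FPGA datapath so compact. A minimal Python sketch of this binary dot product is shown below; the packing convention and function name are illustrative assumptions, not the paper's implementation.

```python
def binary_dot(w_bits, a_bits, n_bits):
    """Bitwise dot product of two {-1,+1} vectors packed into integers.

    Each vector is packed one bit per element (bit 1 encodes +1, bit 0
    encodes -1). XNOR marks positions where the two vectors agree, and the
    dot product over n_bits elements is 2*popcount(xnor) - n_bits.
    """
    mask = (1 << n_bits) - 1
    xnor = ~(w_bits ^ a_bits) & mask      # 1 wherever the packed bits agree
    matches = bin(xnor).count("1")        # popcount
    return 2 * matches - n_bits

# Example: w = [+1, -1, +1, +1], a = [+1, +1, -1, +1] (MSB first) -> dot product 0
print(binary_dot(0b1011, 0b1101, 4))      # 0
```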
Energy Efficient Neural Computing with Approximate Multipliers
Optical networks-on-chip (NoCs) based on silicon photonics have been proposed as emerging on-chip communication architectures for chip multiprocessors with large core counts. However, due to the thermal sensitivity of the optical devices used in optical NoCs, on-chip temperature variations cause significant thermal-induced optical power loss, which would counteract the power advantages of optical NoCs. To tackle this problem, in this work we propose a learning-based thermal-sensitive power optimization approach for mesh- or torus-based optical NoCs in the presence of temperature variations. The key techniques proposed include an initial device setting and thermal tuning mechanism, which is a device-level optimization technique, and a learning-based thermal-sensitive adaptive routing algorithm, which is a network-level optimization technique. Simulation results for an 8x8 mesh-based optical NoC show that the proposed initial device setting and thermal tuning mechanism confines the worst-case thermal-induced energy consumption to the order of tens of pJ/bit by avoiding the significant thermal-induced optical power loss caused by temperature-dependent wavelength shifts. In addition, the results show that the learning-based thermal-sensitive adaptive routing algorithm is able to find an optimal path with the minimum estimated thermal-induced power consumption for each communication pair. The proposed routing offers greater room for optimization, especially for applications with more long-distance traffic.
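At the network level, finding the path with the minimum estimated thermal-induced power for a communication pair can be framed as a shortest-path search over the NoC graph with thermal cost as the edge weight. The Python sketch below illustrates that framing with a Dijkstra-style search; the graph representation and cost model are assumptions, not the paper's learning-based algorithm.

```python
import heapq

def min_thermal_power_path(graph, src, dst):
    """Dijkstra-style search over a mesh NoC graph whose edge weights are
    estimated thermal-induced power costs (e.g., derived from per-router
    temperature readings). Returns (total_cost, path).

    `graph` maps node -> {neighbor: estimated_cost}; all names are
    illustrative, not the paper's actual interface.
    """
    frontier = [(0.0, src, [src])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, w in graph.get(node, {}).items():
            if nbr not in visited:
                heapq.heappush(frontier, (cost + w, nbr, path + [nbr]))
    return float("inf"), []
```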
Integrated optical circuits with nanophotonic devices have attracted significant attention due to their low power dissipation and light-speed operation. Exploiting light interference and resonance phenomena, a nanophotonic device works as a voltage-controlled optical pass-gate, analogous to a pass-transistor. This paper first introduces the concept of optical pass-gate logic and then proposes a parallel adder circuit based on it. Experimental results obtained with an optoelectronic circuit simulator show the advantages of our optical parallel adder circuit over a traditional CMOS-based parallel adder circuit.
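The pass-gate view of addition can be captured with the textbook propagate/generate formulation, in which the carry either passes through a stage or is generated locally. The short Python sketch below models that behavior bit by bit; it is a generic carry-chain illustration under that assumption, not the paper's optical circuit.

```python
def pass_gate_carry_chain(a, b, cin=0):
    """Ripple-carry addition modeled with pass-gate semantics: at each bit,
    the propagate signal acts like a pass-gate that lets the incoming carry
    through, while the generate signal injects a new carry.
    """
    n = max(a.bit_length(), b.bit_length()) + 1
    carry, result = cin, 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p, g = ai ^ bi, ai & bi            # propagate / generate
        result |= (p ^ carry) << i         # sum bit
        carry = g | (p & carry)            # carry flows through when p is set
    return result

assert pass_gate_carry_chain(13, 9) == 22
```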
Artificial neural network computation relies on intensive vector-matrix multiplications. Recently, emerging nonvolatile memory (NVM) crossbar arrays have shown the feasibility of implementing such operations with high energy efficiency, and many works have focused on efficiently utilizing emerging NVM crossbar arrays as analog vector-matrix multipliers. However, their nonlinear I-V characteristics constrain critical design parameters, such as the read voltage and weight range, resulting in substantial accuracy loss. In this paper, instead of optimizing hardware parameters for a given neural network, we propose a methodology for reconstructing the neural network itself so that it is optimized for resistive memory crossbar arrays. To validate the proposed method, we simulated various neural networks on the MNIST and CIFAR-10 datasets using two different Resistive Random Access Memory (RRAM) models. Simulation results show that our proposed neural network achieves significantly higher inference accuracy than a conventional neural network when the synapse devices have nonlinear I-V characteristics.
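The accuracy loss originates in how the analog multiply is physically realized: column currents sum per-cell currents whose dependence on the read voltage is nonlinear, so the array output deviates from the ideal linear product. The Python sketch below illustrates this with a common sinh-type device nonlinearity; the conductance range, read voltage, and nonlinearity constant are illustrative assumptions rather than the paper's specific RRAM models.

```python
import numpy as np

def crossbar_vmm(weights, x, v_read=0.2, alpha=3.0):
    """Analog vector-matrix multiplication on a resistive crossbar with a
    nonlinear device model. Column currents sum per-cell currents
    I = G * sinh(alpha*V) / alpha, a sinh-type nonlinearity used here purely
    as a stand-in for the paper's RRAM models; all constants are assumptions.
    """
    g_min, g_max = 1e-6, 1e-4                         # assumed conductance range (S)
    w = np.clip(weights, 0.0, 1.0)
    G = g_min + w * (g_max - g_min)                   # weight -> conductance map
    V = np.clip(x, 0.0, 1.0) * v_read                 # input -> read voltage map
    I_cell = G * np.sinh(alpha * V[:, None]) / alpha  # nonlinear per-cell current
    return I_cell.sum(axis=0)                         # column currents ~ W^T x

rng = np.random.default_rng(0)
print(crossbar_vmm(rng.random((4, 3)), rng.random(4)))
```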
STT-RAM is a promising emerging memory technology for the future memory hierarchy. However, its unique reliability challenges, i.e., asymmetric bit failure mechanisms for different bit flips, have raised significant concerns for its applications. Recent studies even show that common memory error repair remedies cannot efficiently address these errors. In this paper, we systematically study the potential of strong LDPC codes for combating such unique asymmetric errors in both SLC and MLC STT-RAM designs. A generic STT-RAM channel model suitable for SLC/MLC designs is developed to analytically calibrate all the accumulated asymmetric factors of write/read operations. The key initial information for LDPC decoding, namely the asymmetric log-likelihood ratio (A-LLR), is redesigned and extracted from the proposed channel model to unleash LDPC's asymmetric error-correcting capability. The LDPC codec is also carefully designed to lower the hardware cost by leveraging the systematic-structured parity-check matrix. Then, two customized short-length LDPC codes, (585,512) and (683,512), constructed from the semi-random parity-check matrix and paired with the A-LLR-based asymmetric decoding, are proposed for the SLC and MLC designs. Experiments show that our proposed LDPC designs can improve STT-RAM reliability by at least 10^2/10^4 compared to existing error-correction codes for SLC/MLC designs, demonstrating the feasibility of LDPC solutions for STT-RAM.
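The role of the A-LLR is to encode the asymmetry of the raw error rates into the decoder's initial beliefs. For a simple binary asymmetric channel this takes the familiar form sketched below; the sketch illustrates the principle only, since the paper derives its A-LLR from a full STT-RAM write/read channel model.

```python
from math import log

def asymmetric_llr(y, p01, p10):
    """Channel LLR log(P(y|0)/P(y|1)) for a binary asymmetric channel, the
    kind of quantity an A-LLR initialization feeds to the LDPC decoder.
    p01 is the probability a stored 0 is read as 1, p10 the reverse;
    the values below are illustrative, not measured STT-RAM rates.
    """
    if y == 0:
        return log((1.0 - p01) / p10)
    return log(p01 / (1.0 - p10))

# With strongly asymmetric raw error rates, a read 1 is far less trustworthy:
print(asymmetric_llr(0, p01=1e-3, p10=1e-6))   # large positive -> confident 0
print(asymmetric_llr(1, p01=1e-3, p10=1e-6))   # moderately negative -> likely 1
```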
Protection of intellectual property (IP) is increasingly critical for IP vendors in the semiconductor industry. However, advanced reverse engineering techniques can physically disassemble a chip and derive its IPs at a much lower cost than the value of the IP designs the chip carries. Such invasive hardware attacks, which obtain information directly from IC chips, violate the IP rights of vendors. The intent of this paper is to present a chip-level reverse-engineering-resilient design technique. In the proposed technique, transformable interconnects enable an IC chip to function normally in regular use and to transform its physical structure into another pattern when exposed to invasive attacks. The newly created pattern significantly increases the difficulty of reverse engineering. Furthermore, to improve the effectiveness of the proposed technique, a systematic design method is developed targeting integrated circuits with multiple design constraints. Simulations have been conducted to demonstrate the capability of the proposed technique, which imposes extremely large complexity on reverse engineering with manageable overhead.
An on-chip optical transceiver for 100GBd+ transmission systems is proposed based on optical time division multiplexing (OTDM) technology. Co-designed with a double-rail driver, an on-chip Mach-Zehnder interferometer (MZI) switch repeatedly generates extremely narrow sampling pulses of only 12ps full width at half maximum (FWHM). Four stages of cascaded high-speed switches driven synchronously at 25GHz are employed to divide the 40ps clock cycle into four recurrent 9.5ps time slots, one per sub-channel, plus one 2ps time slot for clock recovery. Thus, a 100GBd optical transmission channel is realized from four 25Gbps bit-streams at the electrical interface. The crosstalk extinction ratio at the worst sub-channel is 1.9dB with a 10dB-depth modulator, and the insertion loss introduced by the OTDM mechanism is about 10dB. Further, a 5-bit OTDM system based on dark modulation is proposed to realize 125GBd transmission from five 25Gbps bit-streams at the electrical interface. Its extinction ratio performance is better even though the symbol rate is higher, at the cost of increased insertion loss and electronic complexity.
Current deep learning approaches that have been very successful use convolutional neural networks (CNNs) trained on large graphics processing unit (GPU)-based computers. Three limitations of this approach are: 1) it is based on a simple layered network topology, i.e., highly connected layers without intra-layer connections; 2) the networks are manually configured to achieve optimal results; and 3) the implementation of the neuron model is expensive in both cost and power. In this paper, we evaluate deep learning models using three different computing architectures to address these problems: quantum computing to train complex topologies, high performance computing (HPC) to automatically determine network topology, and neuromorphic computing for a low-power hardware implementation. We use the MNIST dataset for our experiments due to the input size limitations of current quantum computers. Our results show the feasibility of using the three architectures in tandem to address the above deep learning limitations. We show that a quantum computer can find high-quality values of intra-layer connection weights in a tractable time as the complexity of the network increases, that a high performance computer can find optimal layer-based topologies, and that a neuromorphic computer can represent the complex topology and weights derived from the other architectures in low-power memristive hardware.
We propose a novel arbitrated all-optical path-setup scheme for tiled CMPs that is able to configure multiple photonic switches simultaneously. The proposed solution reduces the overhead of each transmission and, most importantly, allows optical circuit-switched networks to serve cache coherence traffic. We first propose a Single-Arbiter scheme in which the whole topology is managed by a central module (arbiter) that takes care of the path-setup procedures. We then propose a logically clustered architecture (Multi-Arbiter) in which an arbiter is allocated to each core cluster and an ad-hoc distributed reservation protocol coordinates the arbiters to manage inter-cluster path reservations. We show that the Single-Arbiter architecture outperforms an optical network with sequential path-setup (Optical Baseline) for the 8- and 16-core setups. However, due to serialization issues, the Single-Arbiter solution cannot scale to larger setups. Conversely, our hierarchical Multi-Arbiter solution improves performance by up to almost 20% and 40% for the 32- and 64-core setups as well. Energy-wise, the analyzed solutions enable significant savings compared to both the Optical Baseline and the electronic counterpart. Results show more than 25% improvement for the Single-Arbiter in the 8- and 16-core cases, and more than 40% and 15% savings for the Multi-Arbiter in the 32- and 64-core cases.
The increased capacity of multi-level cells (MLC) and triple-level cells (TLC) in emerging non-volatile memory (NVM) technologies comes at the cost of higher cell write energies and lower cell endurance. In this paper, we describe MFNW, a Flip-N-Write encoding that effectively reduces the write energy and improves the endurance of MLC NVMs. Two MFNW modes are analyzed: cell Hamming distance (CHD) mode and energy Hamming distance (EHD) mode. We derive an approximate model that accurately predicts the average number of cell writes, which is proportional to the energy consumption, enabling word-length optimization that maximizes energy reduction subject to memory overhead constraints. In comparison to state-of-the-art MLC NVM encodings, our simulation results indicate that MFNW achieves up to 7%-39% energy saving for 1.56%-50% NVM overhead. Extra energy saving (up to 19%-47%) can be achieved for the same NVM overhead using our proposed variations of MFNW, i.e., MFNW2 and MFNW3. For TLC NVMs, we propose TFNW, which achieves up to 53% energy saving in comparison to state-of-the-art TLC NVM encodings. Endurance simulations indicate that MFNW (TFNW) is capable of extending MLC (TLC) NVM lifetime by up to 100% (87%).
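The core Flip-N-Write idea in CHD mode is to write either the new word or its complement, whichever changes fewer cells relative to the word already stored, at the cost of one flag cell per encoded word. The Python sketch below illustrates that decision for MLC data; the per-cell level complement used here is an assumption for illustration, not necessarily MFNW's exact flip operation.

```python
def mfnw_chd_write(old_cells, new_cells, levels=4):
    """Flip-N-Write in cell-Hamming-distance (CHD) mode, sketched for MLC
    cells with `levels` levels. The encoder stores either the new word or its
    per-cell level complement (level -> levels-1-level, an assumed flip),
    whichever differs from the currently stored word in fewer cells, plus one
    flag cell recording the choice.
    """
    flipped = [levels - 1 - c for c in new_cells]
    chd_direct = sum(o != n for o, n in zip(old_cells, new_cells))
    chd_flipped = sum(o != f for o, f in zip(old_cells, flipped))
    if chd_flipped < chd_direct:
        return flipped, 1              # store complement, set flip flag
    return list(new_cells), 0          # store data as-is, clear flag

old = [0, 3, 3, 3]
new = [3, 0, 0, 0]
print(mfnw_chd_write(old, new))        # ([0, 3, 3, 3], 1): zero cells rewritten
```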
To face the complex communication problems that arise as the number of on-chip components grows, photonic networks-on-chip have recently been proposed to replace electronic interconnects. However, photonic networks-on-chip lack efficient laser sources, possibly resulting in an inefficient or inoperable architecture. In this technical note, we introduce a methodology for the design space exploration of optical NoC mapping solutions, which automatically assigns application tasks to the network tiles such that the total laser power consumption is minimized. The experimental evaluation shows average reductions of 34.7% and 27.35% in power consumption compared to application-oblivious and randomly mapped photonic NoCs, respectively, thus improving energy efficiency.
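The exploration objective can be stated compactly: for a candidate task-to-tile mapping, the laser power needed by each flow scales with its traffic volume and the optical loss of the path between the mapped tiles, and the mapping with the smallest total is preferred. The Python sketch below states this objective and searches it with a trivial random sampler; the cost model, names, and search strategy are assumptions standing in for the paper's exploration methodology.

```python
import random

def mapping_laser_power(mapping, traffic, loss):
    """Estimated total laser power of a task->tile mapping: each flow needs
    power proportional to its volume times the optical loss of the path
    between its mapped tiles. `traffic[(t1, t2)]` is the volume between
    tasks, `loss[(p1, p2)]` the loss between tiles (ordered pairs); units
    and proportionality are illustrative."""
    return sum(vol * loss[(mapping[t1], mapping[t2])]
               for (t1, t2), vol in traffic.items())

def random_search_mapping(tasks, tiles, traffic, loss, iters=1000, seed=0):
    """Tiny random-search stand-in for design space exploration: sample
    task-to-tile assignments and keep the one with minimum laser power."""
    rng = random.Random(seed)
    best, best_power = None, float("inf")
    for _ in range(iters):
        cand = dict(zip(tasks, rng.sample(tiles, len(tasks))))
        power = mapping_laser_power(cand, traffic, loss)
        if power < best_power:
            best, best_power = cand, power
    return best, best_power
```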
Optical networks-on-chip (ONoCs) are a promising communication medium for large-scale MPSoCs. Indeed, ONoCs can outperform classical electrical NoCs in terms of energy efficiency and bandwidth density, in particular because this medium can support multiple simultaneous transactions on different wavelengths by using WDM. However, multiple signals simultaneously sharing the same part of a waveguide can lead to inter-channel crosstalk noise. This problem degrades the signal-to-noise ratio (SNR) of the optical signals, which increases the bit error rate (BER) at the receiver side. If a specific BER is targeted, an increase in laser power is necessary to satisfy the SNR. In this context, an important issue is to evaluate the laser power needed to satisfy the various desired communication bandwidths based on the BER performance requirements. In this paper, we propose an off-line approach that concurrently optimizes laser power scaling and the execution time of a global application. A set of power levels is introduced for each laser to ensure that the optical signal can be emitted with just enough power to reach the targeted BER. As a result, the most promising solutions are highlighted for mapping a defined application onto a 16-core ring-based WDM ONoC.
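Selecting a laser power level then reduces to a link-budget check: pick the lowest available level whose received signal still clears the SNR (and hence BER) target after path loss, given the crosstalk noise floor. The Python sketch below illustrates this selection; the dB-domain budget, parameter names, and example numbers are simplifying assumptions rather than the paper's full SNR/BER model.

```python
def min_power_level(levels_dbm, loss_db, noise_dbm, snr_target_db):
    """Pick the lowest laser power level whose received signal still meets
    the SNR target, given the path loss and an estimated crosstalk noise
    floor. A simplified link-budget sketch with all quantities in dB/dBm.
    """
    for p in sorted(levels_dbm):
        received = p - loss_db                    # received signal power (dBm)
        if received - noise_dbm >= snr_target_db: # SNR in dB
            return p
    return None                                   # no level satisfies the target

# Example: levels 0..6 dBm, 3 dB path loss, -20 dBm crosstalk floor, 20 dB target
print(min_power_level([0, 2, 4, 6], loss_db=3, noise_dbm=-20, snr_target_db=20))  # 4
```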
Quantum computing performance simulators are needed to provide practical metrics for the effectiveness of executing theoretical quantum information processing protocols on physical hardware. In this work, we present a tool to simulate the execution of fault-tolerant quantum computation by automating the tracking of common fault paths for error propagation through an encoded circuit block and quantifying the failure probability of each encoded qubit throughout the computation. Our simulator runs a fault path counter on encoded circuit blocks to determine the probability that two or more errors remain on the encoded qubits after each block is executed, and combines errors from all the encoded blocks to estimate performance metrics such as the logical qubit failure probability, the overall circuit failure probability, the number of qubits used, and the time required to run the overall circuit. Our technique efficiently estimates an upper bound on the error probability and provides a useful measure of the error threshold at low error probabilities, where conventional Monte Carlo methods are ineffective. We describe a way of simplifying the fault-tolerant measurement process in the Steane code to reduce the number of error correction steps necessary. We present simulation results comparing the execution of quantum adders, which constitute a major part of Shor's algorithm.
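The counting argument behind the per-block estimate is that a distance-3 code such as the Steane code corrects any single fault, so a block fails only when two or more of its fault locations fail; summing the corresponding combinations gives an upper bound that can be combined across blocks with a union bound. The Python sketch below illustrates this reasoning; the location counts and physical error rate are illustrative assumptions, and the paper's counter works at the level of explicit fault paths rather than this closed-form bound.

```python
from math import comb

def block_failure_upper_bound(num_locations, p_phys):
    """Upper bound on the probability that two or more faults occur within an
    encoded circuit block (the event a distance-3 code cannot correct),
    using sum_{k>=2} C(n, k) * p^k >= P(>= 2 faults). A simplified stand-in
    for per-block fault path counting.
    """
    n = num_locations
    return sum(comb(n, k) * p_phys**k for k in range(2, n + 1))

def circuit_failure_upper_bound(block_bounds):
    """Combine per-block bounds into an overall circuit failure bound via the
    union bound: P(any block fails) <= sum of the per-block bounds."""
    return min(1.0, sum(block_bounds))

p_block = block_failure_upper_bound(num_locations=75, p_phys=1e-4)
print(p_block)                                   # ~2.8e-5
print(circuit_failure_upper_bound([p_block] * 100))
```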