The Liquid State Machine (LSM) is a promising model of recurrent spiking neural networks that provides an appealing brain-inspired computing paradigm for machine learning applications such as pattern recognition. Moreover, processing information directly on spiking events makes the LSM well suited for cost- and energy-efficient hardware implementation. In this paper, we systematically present three techniques for optimizing energy efficiency while maintaining good performance of the proposed LSM neural processors, from both algorithmic and hardware-implementation points of view. First, to realize adaptive LSM neural processors and thus boost learning performance, we propose a hardware-friendly Spike-Timing-Dependent Plasticity (STDP) mechanism for on-chip tuning. Then, the LSM processor incorporates a novel runtime correlation-based neuron gating scheme to minimize the power dissipated by reservoir neurons. Furthermore, a fine-grained activity-dependent clock gating approach is presented to address the energy inefficiency caused by the memory-intensive nature of the proposed neural processors. Using two real-world benchmark tasks, speech and image recognition, we demonstrate that the proposed architecture boosts the average learning performance by up to 2.0% while reducing energy dissipation by up to 29% compared to a baseline LSM design, with little extra hardware overhead on a Xilinx Virtex-6 FPGA.
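To illustrate the kind of on-chip tuning rule the abstract refers to, the sketch below shows a generic hardware-friendly, fixed-step STDP update with a nearest-spike timing window; the paper's actual rule, window length, step sizes, and weight resolution are not given in the abstract and are assumed here for illustration only.

```python
# A minimal sketch of a fixed-step, window-based STDP update on integer weights
# (a common hardware-friendly simplification); all constants are hypothetical.
def stdp_update(w, t_pre, t_post, window=16, step=1, w_min=0, w_max=255):
    """Update one integer synaptic weight from the latest pre/post spike times."""
    dt = t_post - t_pre
    if 0 < dt <= window:        # pre fired shortly before post: potentiate
        w = min(w + step, w_max)
    elif -window <= dt < 0:     # pre fired shortly after post: depress
        w = max(w - step, w_min)
    return w

# Example: a pre-spike at cycle 100 followed by a post-spike at cycle 105
w = stdp_update(w=128, t_pre=100, t_post=105)   # -> 129
```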
FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve higher throughput than their GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies significantly with batch size. Experimental results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
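The bitwise convolution mentioned above reduces, at its core, to XNOR-and-popcount dot products over {-1, +1} values packed as bits. The sketch below shows that reduction only; the accelerator's actual word width, tiling, and normalization pipeline are not described in the abstract and are not modeled here.

```python
# A minimal sketch of the bitwise dot product underlying binary convolution,
# with +1/-1 values packed as 1/0 bits (listed MSB-first in the example).
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed into integers."""
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)   # 1 where the bits agree
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n                       # agreements minus disagreements

# Example: a = [+1,-1,+1,+1] -> 0b1011, w = [+1,+1,-1,+1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))             # -> 0 (two matches, two mismatches)
```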
Energy Efficient Neural Computing with Approximate Multipliers
Artificial neural network computation relies on intensive vector-matrix multiplications. Recently, emerging nonvolatile memory (NVM) crossbar arrays have shown the feasibility of implementing such operations with high energy efficiency, and many works have explored their use as analog vector-matrix multipliers. However, their nonlinear I-V characteristics constrain critical design parameters, such as the read voltage and weight range, resulting in substantial accuracy loss. In this paper, instead of optimizing hardware parameters for a given neural network, we propose a methodology for reconstructing the neural network itself so that it is optimized for resistive memory crossbar arrays. To validate the proposed method, we simulated various neural networks on the MNIST and CIFAR-10 datasets using two different Resistive Random Access Memory (RRAM) device models. Simulation results show that the proposed neural networks achieve significantly higher inference accuracies than conventional neural networks when the synapse devices have nonlinear I-V characteristics.
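To make the accuracy-loss mechanism concrete, the sketch below contrasts an ideal crossbar dot product with one computed through a nonlinear device model; the sinh-style I-V curve and its constants are illustrative assumptions, not the specific RRAM models used in the paper.

```python
# A minimal sketch: ideal vs. nonlinear-device column currents in a crossbar.
import numpy as np

def ideal_column_current(G, V):
    """Ideal analog dot product: I_j = sum_i G[i, j] * V[i]."""
    return V @ G

def nonlinear_column_current(G, V, v0=0.3):
    """Same column currents when each cell follows I = G * v0 * sinh(V / v0)."""
    I_cell = G * (v0 * np.sinh(V[:, None] / v0))   # per-cell currents
    return I_cell.sum(axis=0)

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))           # cell conductances (S)
V = np.array([0.1, 0.2, 0.05, 0.15])               # read voltages (V)
print(ideal_column_current(G, V))
print(nonlinear_column_current(G, V))              # deviates as V approaches v0
```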
Security is becoming a de facto requirement of Systems-on-Chip (SoCs), accounting for a significant share of circuit design cost. In this paper, we propose an advanced SBUS protocol (ASBUS) to improve the data-feeding efficiency of Advanced Encryption Standard (AES) encrypted circuits. As a case study, a direct memory access (DMA) engine combined with an AES engine and a memory controller is implemented as our design-under-test (DUT) on field-programmable gate arrays (FPGAs). The results show that our ASBUS structure outperforms the AXI-based design in cipher tests. For example, the 32-bit ASBUS design consumes fewer hardware resources and achieves higher throughput ($1.30 \times$) than the 32-bit AXI implementation, and the dynamic energy consumed by the ASBUS cipher test is reduced to 71.27\% of that of the AXI test.
STT-RAM is a promising emerging memory technology for future memory hierarchies. However, its unique reliability challenges, i.e., asymmetric bit-failure mechanisms for different bit-flip directions, raise significant concerns for its adoption. Recent studies even show that common memory error-repair remedies cannot address these failures efficiently. In this paper, we systematically study the potential of strong LDPC codes for combating such asymmetric errors in both SLC and MLC STT-RAM designs. A generic STT-RAM channel model suitable for SLC/MLC designs is developed to analytically calibrate all the accumulated asymmetric factors of write/read operations. The key initial information for LDPC decoding, namely the asymmetric log-likelihood ratio (A-LLR), is redesigned and extracted from the proposed channel model to unleash LDPC's asymmetric error-correcting capability. The LDPC codec is also carefully designed to lower the hardware cost by leveraging the systematic-structured parity-check matrix. Two customized short-length LDPC codes, (585,512) and (683,512), built from the semi-random parity-check matrix and the A-LLR-based asymmetric decoding, are then proposed for the SLC and MLC designs, respectively. Experiments show that our proposed LDPC designs improve STT-RAM reliability by at least $10^2$/$10^4\times$ compared to existing error-correction codes for the SLC/MLC designs, demonstrating the feasibility of LDPC solutions for STT-RAM.
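The sketch below shows the basic idea behind asymmetric LLR initialization for a binary channel with unequal flip probabilities, which is the intuition behind the A-LLR; the paper's channel model additionally accumulates write/read asymmetries across SLC/MLC levels, which this simplification ignores, and the probabilities used are purely illustrative.

```python
# A minimal sketch of asymmetric LLR initialization for a binary channel with
# unequal crossover probabilities p01 (0 read as 1) and p10 (1 read as 0).
from math import log

def asymmetric_llr(y: int, p01: float, p10: float) -> float:
    """LLR = log P(y | bit=0) / P(y | bit=1) for read value y in {0, 1}."""
    if y == 0:
        return log((1.0 - p01) / p10)    # 0 read correctly vs. 1 flipped to 0
    return log(p01 / (1.0 - p10))        # 0 flipped to 1 vs. 1 read correctly

# Illustrative asymmetry: assume one flip direction is far more likely
print(asymmetric_llr(0, p01=1e-3, p10=1e-6))   # strongly favors bit = 0
print(asymmetric_llr(1, p01=1e-3, p10=1e-6))   # favors bit = 1, smaller magnitude
```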
An on-chip optical transceiver for 100GBd+ transmission systems is proposed based on optical time division multiplexing (OTDM) technology. Co-designed with a double-rail driver, an on-chip Mach-Zehnder interferometer (MZI) switch repeatedly generates extremely narrow sampling pulses of only 12ps full width at half maximum (FWHM). The 4-stage cascade of high-speed switches, driven synchronously at 25GHz, divides the 40ps clock cycle into four recurrent 9.5ps time slots, one for each sub-channel, plus one 2ps time slot for clock recovery. Thus, a 100GBd optical transmission channel is realized from four 25Gbps bit streams at the electrical interface. The crosstalk extinction ratio at the worst sub-channel is 1.9dB with a 10dB-depth modulator, and the insertion loss caused by the OTDM mechanism is about 10dB. Furthermore, a 5-bit OTDM system based on dark modulation is proposed to realize 125GBd transmission from five 25Gbps bit streams at the electrical interface. Its extinction-ratio performance is better even though the symbol rate is higher; however, this comes at the cost of higher insertion loss and electronic complexity.
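The time-slot arithmetic described above can be sketched as follows: four 9.5ps sub-channel slots plus a 2ps clock-recovery slot inside a 40ps frame, with the four electrical bit streams interleaved onto one optical symbol sequence. The optical switch control and pulse shaping are not modeled; this is only a frame-timing illustration.

```python
# A minimal sketch of the OTDM frame timing and bit-stream interleaving.
FRAME_PS = 40.0          # 25 GHz electrical clock -> 40 ps frame
SLOT_PS = 9.5            # one slot per sub-channel
CLOCK_SLOT_PS = 2.0      # reserved for clock recovery (4 * 9.5 + 2 = 40 ps)

def slot_start(frame_index: int, sub_channel: int) -> float:
    """Start time (ps) of a sub-channel's slot within a given frame."""
    assert 0 <= sub_channel < 4
    return frame_index * FRAME_PS + sub_channel * SLOT_PS

def interleave(streams):
    """Interleave four 25 Gbps bit streams into one 100 GBd symbol sequence."""
    return [bit for frame in zip(*streams) for bit in frame]

# Example: 4 sub-channels, 2 bits each -> 8 symbols on the optical channel
print(interleave([[1, 0], [0, 0], [1, 1], [0, 1]]))   # [1, 0, 1, 0, 0, 0, 1, 1]
print(slot_start(frame_index=1, sub_channel=2))       # 59.0 ps
```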
We propose a novel arbitrated all-optical path-setup scheme for tiled CMPs that is able to configure multiple photonic switches simultaneously. The proposed solution reduces the overhead of each transmission and, most importantly, allows optical circuit-switched networks to serve cache-coherence traffic. We first propose a Single-Arbiter scheme in which the whole topology is managed by a central module (arbiter) that handles the path-setup procedures. We then propose a logically clustered architecture (Multi-Arbiter) in which an arbiter is allocated to each core cluster and an ad-hoc distributed reservation protocol coordinates the arbiters to manage inter-cluster path reservations. We show that the Single-Arbiter architecture outperforms an optical network with sequential path setup (Optical Baseline) for 8- and 16-core setups. However, due to serialization issues, the Single-Arbiter solution does not scale to larger setups. Conversely, our hierarchical Multi-Arbiter solution improves performance by up to almost 20% and 40% even for the 32- and 64-core setups. Energy-wise, the analyzed solutions enable significant savings compared to both the Optical Baseline and the electronic counterpart. Results show more than 25% improvement for the Single-Arbiter in the 8- and 16-core cases, and more than 40% and 15% savings for the Multi-Arbiter in the 32- and 64-core cases, respectively.
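The sketch below illustrates the centralized reservation idea of a Single-Arbiter-style scheme: the arbiter reserves every switch on a route atomically or rejects the request. Routing, release timing, and the distributed Multi-Arbiter protocol are omitted, and the data structures are illustrative only, not the paper's implementation.

```python
# A minimal sketch of centralized path setup: reserve all switches on a route
# in one step, or reject the request if any switch is already in use.
class SingleArbiter:
    def __init__(self, num_switches: int):
        self.busy = [False] * num_switches          # per-switch reservation state

    def request_path(self, route):
        """Try to reserve all photonic switches on 'route' simultaneously."""
        if any(self.busy[s] for s in route):
            return False                            # conflict: caller must retry
        for s in route:
            self.busy[s] = True                     # configure all switches at once
        return True

    def release_path(self, route):
        for s in route:
            self.busy[s] = False

arb = SingleArbiter(num_switches=8)
print(arb.request_path([0, 3, 5]))   # True: path granted, switches configured
print(arb.request_path([5, 6]))      # False: switch 5 is already reserved
arb.release_path([0, 3, 5])
```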
The increased capacity of multi-level cells (MLC) and triple-level cells (TLC) in emerging non-volatile memory (NVM) technologies comes at the cost of higher cell write energies and lower cell endurance. In this paper, we describe MFNW, a Flip-N-Write encoding that effectively reduces the write energy and improves the endurance of MLC NVMs. Two MFNW modes are analyzed: a cell Hamming distance (CHD) mode and an energy Hamming distance (EHD) mode. We derive an approximate model that accurately predicts the average number of cell writes, which is proportional to the energy consumption, enabling word-length optimization to maximize the energy reduction subject to memory-overhead constraints. In comparison to state-of-the-art MLC NVM encodings, our simulation results indicate that MFNW achieves 7%-39% energy savings for 1.56%-50% NVM overhead. Extra energy savings (19%-47%) can be achieved for the same NVM overhead using our proposed variations of MFNW, i.e., MFNW2 and MFNW3. For TLC NVMs, we propose TFNW, which achieves up to 53% energy savings in comparison to state-of-the-art TLC NVM encodings. Endurance simulations indicate that MFNW (TFNW) is capable of extending MLC (TLC) NVM lifetime by up to 100% (87%).
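For readers unfamiliar with the baseline technique, the sketch below shows the classic single-level Flip-N-Write idea that MFNW builds on: write either the word or its complement, whichever changes fewer cells, plus a one-bit flag. The MLC/TLC generalizations, the CHD/EHD cost metrics, and the word-length optimization from the paper are not shown.

```python
# A minimal sketch of classic (single-level-cell) Flip-N-Write.
def flip_n_write(old_word: int, new_word: int, width: int):
    """Return (stored_word, flip_flag) minimizing the number of changed cells."""
    mask = (1 << width) - 1
    direct = bin((old_word ^ new_word) & mask).count("1")              # cells changed as-is
    flipped = bin((old_word ^ (~new_word & mask)) & mask).count("1")   # cells changed if inverted
    if flipped < direct:
        return (~new_word & mask, 1)     # store the complement, set the flag
    return (new_word & mask, 0)

# Example: old = 0b1111, new = 0b0001 -> storing ~new = 0b1110 changes only 1 cell
print(flip_n_write(0b1111, 0b0001, 4))   # -> (14, 1), i.e. (0b1110, flag=1)
```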
Quantum computing performance simulators are needed to provide practical metrics for the effectiveness of executing theoretical quantum information processing protocols on physical hardware. In this work, we present a tool to simulate the execution of fault tolerant quantum computation by automating the tracking of common fault paths for error propagation through an encoded circuit block and quantifying the failure probability of each encoded qubit throughout the computation. Our simulator runs a fault path counter on encoded circuit blocks to determine the probability that two or more errors remain on the encoded qubits after each block is executed, and combines errors from all the encoded blocks to estimate performance metrics such as the logical qubit failure probability, the overall circuit failure probability, the number of qubits used, and the time required to run the overall circuit. Our technique efficiently estimates the upper bound of the error probability and provides a useful measure of the error threshold at low error probabilities where conventional Monte Carlo methods are ineffective. We describe a way of simplifying the fault tolerant measurement process in the Steane code to reduce the number of error correction steps necessary. We present simulation results comparing the execution of quantum adders, which constitute a major part of Shor's algorithm.
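The counting idea in the abstract above can be sketched as follows, under the simplifying assumption of independent faults with equal probability p at every fault location and a union bound across encoded blocks; the actual simulator counts specific fault paths through each encoded block, which this simplification ignores, and the block sizes in the example are hypothetical.

```python
# A minimal sketch: probability of two or more faults per encoded block,
# combined across blocks with a union bound on the circuit failure probability.
def prob_two_or_more_faults(n_locations: int, p: float) -> float:
    """P(>= 2 faults) in a block with n independent fault locations."""
    none = (1.0 - p) ** n_locations
    one = n_locations * p * (1.0 - p) ** (n_locations - 1)
    return 1.0 - none - one

def circuit_failure_upper_bound(blocks, p: float) -> float:
    """Union bound over encoded blocks, each given by its fault-location count."""
    return sum(prob_two_or_more_faults(n, p) for n in blocks)

# Example: three encoded blocks with 50, 75, and 120 fault locations at p = 1e-4
print(circuit_failure_upper_bound([50, 75, 120], 1e-4))
```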