Structured Pruning of Deep Convolutional Neural Networks

Real-time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is...

Energy-Efficient and Improved Image Recognition with Conditional Deep Learning

Deep-learning neural networks have proven to be very successful for a wide range of recognition tasks across modern computing platforms. However, the...

Stochastic CBRAM-Based Neuromorphic Time Series Prediction System

In this research, we present a Conductive-Bridge RAM (CBRAM)-based neuromorphic system which efficiently addresses time series prediction. We propose...


The Journal of Emerging Technologies in Computing Systems is happy to welcome Prof. Yuan Xie (University of California at Santa Barbara) as the incoming Editor in Chief! We are also grateful to Prof. Krish Chakrabarty for serving as Editor in Chief for the last six years, and would like to wish both all the best in the future!

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. Major economic and technical challenges are expected to impede the continued scaling of semiconductor devices. This has resulted in the search for alternate mechanical, biological/biochemical, nanoscale electronic, asynchronous and quantum computing and sensor technologies. 

Editorial for JETC Special Issue on Alternative Computing Systems

Mobile Unified Memory-Storage Structure based on Hybrid Non-Volatile Memories

In mobile computing systems, the limited amount of main memory space leads to page swap overhead and data duplication in both main memory and secondary storage. Furthermore, SQLite write operations in mobile devices such as smartphones and tablet PCs tend to frequently overwrite data to storage, significantly degrading performance. Thus, this paper presents a unified memory-storage structure that is optimized for mobile devices and blurs the boundary between the existing main memory layer and secondary storage layer. The unified memory-storage structure consists of a Dynamic RAM (DRAM)-based dual buffering module; a hybrid unified memory-storage array consisting of DRAM, an SLC/MLC hybrid 3D cross point array, and NAND Flash memory; and an associated unified storage translation layer, a system software module devised for memory address and file translation. This hybrid array of non-volatile memories is formed as a single memory-disk integrated storage space that can be logically divided into static and dynamic spaces. Experimental results show that the overall performance of the hybrid unified memory-storage system with the buffering structure increases by around 59% and power consumption improves by 30%, compared to the previous integrated memory-disk system.

Impact of Electrostatic Coupling and Wafer-Bonding Defects on Delay Testing of Monolithic 3D Integrated Circuits

Monolithic three-dimensional (M3D) integration is gaining momentum as it has the potential to achieve significantly higher device density compared to 3D integration based on through-silicon vias. In this paper, we first analyze electrostatic coupling in M3D ICs, which arises due to the aggressive scaling of the inter-layer dielectric thickness. We then analyze defects that arise due to voids created during wafer bonding. We quantify the impact of these defects on the threshold voltage of a top-layer transistor in an M3D integrated circuit. We also show that wafer-bonding defects can lead to a change in the resistance of inter-layer vias (ILVs), and in some cases, lead to an open in an ILV or a short between two ILVs. We then analyze the impact of these defects on path delays using HSpice simulations. We study their impact on the effectiveness of delay-test patterns for multiple instances of IWLS05 benchmarks in which these defects were randomly injected. Our results show that the timing characteristics of an M3D IC can be significantly altered due to coupling and wafer-bonding defects if the thickness of its ILD is less than 100 nm.

Distributed In-Memory Computing on Binary RRAM Crossbar

The recently emerging resistive random-access memory (RRAM) can provide not only non-volatile memory storage but also intrinsic computing for matrix-vector multiplication, which is ideal for low-power, high-throughput data analytics accelerators performed in memory. However, existing RRAM-crossbar based computing is mainly assumed to be multi-level analog computing, whose result is sensitive to process nonuniformity and which incurs additional overhead from AD-conversion and I/O. In this paper, we explore a matrix-vector multiplication accelerator on a binary RRAM-crossbar with adaptive 1-bit-comparator based parallel conversion. Moreover, a distributed in-memory computing architecture is developed with a corresponding control protocol. Both the memory array and the logic accelerator are implemented on the binary RRAM-crossbar, where logic-memory pairs can be distributed using a control-bus protocol. Experimental results show that, compared to the analog RRAM-crossbar, the proposed binary RRAM-crossbar can achieve significant area savings with better calculation accuracy. Moreover, significant speedup can be achieved for matrix-vector multiplication in neural-network based machine learning, such that both the overall training time and the overall testing time are reduced. In addition, large energy savings can be achieved when compared to the traditional CMOS-based out-of-memory computing architecture.

Power-Utility-Driven Write Management for MLC PCM

Phase change memory (PCM) is a promising alternative to DRAM as main memory due to its merits of high density and low leakage power. Multi-level cell (MLC) PCM is more attractive than single-level cell (SLC) PCM because it can store multiple bits per cell to achieve higher density. With the iterative write technique, MLC writes demand higher power than DRAM writes, but the power supply of an MLC system is similar to that of DRAM. The incompatibility of high write power and a limited power budget degrades write throughput and performance. In this work, we investigate both write scheduling policy and power management to improve the MLC power utility and alleviate the negative impacts. We identify power-utility-driven write scheduling as an online bin-packing problem and then derive a power-utility-driven scheduling (PUDS) policy from the First-Fit algorithm to improve the write power usage. Based on the SET ramp-down pulse characteristic, we propose the SET Power Amortization (SPA) policy, which proactively reclaims the power tokens at intra-SET level to promote power utilization. Our results demonstrate that the system with PUDS+SPA achieves a 60% increase in performance and a 36% improvement in power utility over the state-of-the-art power management technique.
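The First-Fit algorithm named in the abstract above can be illustrated with a minimal sketch. This is the textbook online bin-packing heuristic, not the PUDS policy itself: the function name `first_fit`, the integer "power token" demands, and the idea of treating each bin as one scheduling slot with a fixed budget are assumptions made for illustration only.

```python
def first_fit(demands, budget):
    """Textbook online First-Fit bin packing (illustrative, not PUDS itself).

    demands: write power demands in arrival order (hypothetical token counts).
    budget:  power tokens available per scheduling slot (hypothetical).
    Returns a list of slots, each a list of the demands packed into it.
    """
    slots = []
    for demand in demands:
        if demand > budget:
            raise ValueError("a single demand exceeds the per-slot budget")
        for slot in slots:
            # Place the demand in the first slot that still has room.
            if sum(slot) + demand <= budget:
                slot.append(demand)
                break
        else:
            # No existing slot fits, so open a new one.
            slots.append([demand])
    return slots


if __name__ == "__main__":
    print(first_fit([4, 8, 1, 4, 2, 1], budget=10))
```

Each demand is placed as soon as it arrives, without knowledge of later requests, which is what makes the problem "online".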

Design of Approximate Compressors for Multiplication

Approximate computing has recently emerged as a promising technique for energy-efficient VLSI system design and is best suited for error-resilient applications, such as signal processing and multimedia. Approximate computing reduces accuracy, but it still provides meaningful and faster results, usually with lower power consumption. This is especially attractive for arithmetic circuits. In this paper, various novel design approaches for approximate 4-2 and 5-2 compressors are proposed for reducing the partial product stages during multiplication. Three approximate 8x8 Dadda multiplier designs using three novel approximate 4-2 compressors, and two approximate 8x8 Dadda multiplier designs using novel approximate 5-2 compressors, are proposed. Extensive simulation results show that the proposed designs achieve significant accuracy improvement together with power and delay reductions compared to previous approximate designs.
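For context, the exact 4-2 compressor that approximate designs simplify can be modeled in a few lines. The sketch below uses the standard construction from two cascaded full adders; it does not reproduce any of the approximate designs proposed in the paper, and the function names are chosen here for illustration.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor built from two cascaded full adders.

    Returns (sum, carry, cout) such that
    x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def is_exact():
    """Exhaustively check the defining identity over all 32 input patterns."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print(is_exact())
```

An approximate compressor would replace this logic with a cheaper function that violates the identity for some input patterns; an exhaustive loop like the one in `is_exact` is one simple way to count such error cases.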

Real Time SoC Security against Passive Threats using Crypsis Behaviour of Geckos

The rapid evolution of the embedded era has witnessed globalization of the design of SoC architectures in the semiconductor design industry. Although such a methodology resolves issues of cost and complexity, it undermines the root of hardware trust. Malicious circuitry, a.k.a. Hardware Trojan Horses (HTH), is inserted by adversaries in the untrusted phases of design. HTH remains dormant during testing but is triggered at runtime to cause sudden active and passive attacks. In this work, we focus on the runtime passive threats posed by HTH. Nature-inspired algorithms offer an alternative to conventional techniques for solving complex problems in the domain of computer science. However, most are optimization techniques and none is dedicated to security. We draw on the crypsis behavior exhibited by geckos to generate a runtime security technique for SoC architectures, which can bypass runtime passive threats of HTH. An intelligent security IP that works on the proposed security principles is designed based on the structure of the ART1 neural architecture. The security mechanism is demonstrated with the aid of Finite State Automata. The low area and power overhead of our proposed security IP on standard benchmarks and practical crypto SoC architectures, as obtained in experimental results, supports its applicability for practical implementations.

High-Performance Computing with Quantum Processing Units

The prospects of quantum computing have driven efforts to realize fully functional quantum processing units (QPUs). Recent success in developing proof-of-principle QPUs has prompted the question of how to integrate these emerging processors into modern high-performance computing (HPC) systems. We examine how QPUs can be integrated into current and future HPC system architectures by accounting for functional and physical design requirements. We identify two integration pathways that are differentiated by infrastructure constraints on the QPU and the use cases expected for the HPC system. This includes a tight integration that assumes infrastructure bottlenecks can be overcome as well as a loose integration that assumes they cannot. We find that the performance of both approaches is likely to depend on the quantum interconnect that serves to entangle multiple QPUs. We also identify several challenges in assessing QPU performance for HPC, and we consider new metrics that capture the interplay between system architecture and the quantum parallelism underlying computational performance.

Survey of STT-MRAM Cell Design Strategies: Taxonomy and Sense Amplifier Tradeoffs for Resiliency

Spin-Transfer Torque Random Access Memory (STT-MRAM) has been explored as a post-CMOS technology for embedded and data storage applications seeking non-volatility, near-zero standby energy, and high density. Towards attaining these objectives for practical implementations, various techniques to mitigate the specific reliability challenges associated with STT-MRAM elements are surveyed, classified, and assessed herein. Cost and suitability metrics assessed include the area of nanomagnetic and CMOS components per bit, access time and complexity, sense margin, and energy or power consumption costs versus resiliency benefits. Solutions to the reliability issues identified are addressed within a taxonomy created to categorize the current and future approaches to reliable STT-MRAM designs. A variety of destructive and non-destructive sensing schemes are assessed for process variation tolerance, read disturbance reduction, sense margin, and write polarization asymmetry compensation. The highest resiliency strategies deliver a sensing margin above 300 mV while incurring low power and energy consumption on the order of picojoules and microwatts, respectively, while attaining read sense latency of a few nanoseconds down to hundreds of picoseconds for non-destructive and destructive sensing schemes, respectively.

Additional Key Words and Phrases: Spin-Transfer Torque storage elements, STT-MRAM, Magnetic Tunnel Junction (MTJ), Self-referencing schemes, Reliability, Process Variation, Read/Write Reliability

Memory-Centric Reconfigurable Accelerator for Classification and Machine Learning Applications

Big Data refers to the growing challenge of turning massive, often unstructured datasets into meaningful, actionable data. As datasets grow from petabytes to exabytes and beyond, it becomes increasingly difficult to run advanced analytics, especially machine learning, in a reasonable time and on a practical power budget. Previous work has focused on accelerating analytics implemented as SQL queries on data-parallel platforms with off-the-shelf CPUs and GPGPUs. However, these systems are general-purpose, and still require a vast amount of data transfer between storage and computing elements, limiting system efficiency. Instead, we present a reconfigurable, memory-centric accelerator which operates at the last level of memory, dramatically reducing the energy required for data transfer and processing of machine learning applications. We functionally validate the framework using a hardware emulation platform and three representative applications: Naive Bayesian Classification, Convolutional Neural Networks, and k-Means Clustering. Results are compared with implementations on a modern CPU and GPU. Finally, the use of in-memory dataset decompression to further reduce data transfer volume is investigated. The system achieves an average energy efficiency improvement of 74x and 212x over GPU and single-threaded CPU, respectively, while dataset compression is shown to improve overall efficiency by an additional 1.8x on average.

SPARCNet: A Hardware Accelerator for Efficient Deployment of Sparse Convolutional Networks

Deep neural networks have been shown to outperform prior state-of-the-art solutions that often relied heavily on hand-engineered feature extraction techniques coupled with simple classification algorithms. In particular, deep convolutional neural networks have been shown to dominate on several popular public benchmarks such as the ImageNet database. Unfortunately, the benefits of deep networks have yet to be fully exploited in embedded, resource-bound settings that have strict power and area budgets. In order to reduce power and area while still achieving required throughput, classification-efficient network architectures are required in addition to optimal deployment on efficient hardware. In this work, we target both of these objectives. For the first, we analyze simple, biologically-inspired reduction strategies that are applied both before and after training. The central theme of the techniques is the introduction of sparsification to help dissolve away the dense connectivity that is often found at different levels in convolutional networks. For the second, we propose SPARCNet: a hardware accelerator for efficient deployment of SPARse Convolutional NETworks. The accelerator aims to enable deploying networks in such resource-bound settings by exploiting efficient forms of parallelism and the proposed sparsification techniques.

Sketching Computation with Stochastic Processing Engines

In conventional embedded computing, a sudden shortage of computing resources, such as premature termination or power outage, often results in a complete computing failure and produces totally unusable results. To circumvent this challenge, we present a novel technique that allows reconfigurable computing to achieve quality scalability by leveraging probabilistic principles. Our objective is to maximize the quality and usability of final results even under a sudden change of computing resources. This paper explores how to leverage stochastic principles to gracefully salvage partially finished results of embedded computing. Our work is inspired by the concept of incremental sketching frequently found in artistic rendering, where the drawing procedure consists of a series of steps, each gradually improving the quality of results. The essence of our approach is to encode the input signal as a probability density function, perform stochastic computing operations on the signal in the probabilistic domain, and decode the output signal by estimating the probability density function of the resulting random samples. To validate our proposed architecture design, we have implemented a proof-of-concept probabilistic convolver with a Virtex 6 FPGA device. Finally, we use three convolution-based image processing applications (image correspondence, image sharpening, and edge detection) to demonstrate that important embedded computing applications can indeed be sketched in a graceful manner.
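The encode, compute, decode pipeline described above can be sketched in software under one assumption: that convolution is realized through the standard probabilistic fact that the sum of two independent random variables is distributed as the convolution of their densities. The abstract does not spell out the hardware design, so the function below (`sketch_convolution`, a name invented here) is an illustration of the principle rather than the paper's convolver.

```python
import random


def sketch_convolution(f, g, n_samples, rng):
    """Estimate the convolution of two non-negative discrete signals by sampling.

    f, g are treated as (unnormalized) probability mass functions over their
    indices. Drawing i ~ f and j ~ g independently, the index i + j follows
    the convolution of the normalized signals, so a histogram of i + j
    approximates it. Stopping early simply yields a coarser estimate.
    """
    fs = rng.choices(range(len(f)), weights=f, k=n_samples)
    gs = rng.choices(range(len(g)), weights=g, k=n_samples)
    hist = [0] * (len(f) + len(g) - 1)
    for i, j in zip(fs, gs):
        hist[i + j] += 1
    return [count / n_samples for count in hist]


if __name__ == "__main__":
    rng = random.Random(0)
    # Fewer samples give a rough sketch; more samples refine it.
    for n in (100, 10_000):
        print(n, sketch_convolution([1, 1], [1, 1], n, rng))
```

The exact normalized result for the two-tap example is [0.25, 0.5, 0.25]; the estimate approaches it as the sample count grows, which mirrors the "incremental sketching" idea of a result that is usable at any point and improves over time.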

Computing Polynomials using Unipolar Stochastic Logic

This paper addresses subtraction and polynomial computations using unipolar stochastic logic. Stochastic computing requires simple logic gates and stochastic logic based circuits are inherently fault-tolerant. While it is easy to realize multiplication and scaled addition, implementation of subtraction is non-trivial using unipolar stochastic logic. Additionally, an accurate computation of subtraction is critical for the implementation of polynomials with negative coefficients in stochastic unipolar representation. This paper, for the first time, demonstrates that instead of using well-known Bernstein polynomials, stochastic computation of polynomials can be implemented by using a stochastic subtractor and factorization. Three major contributions are made in this paper. First, two approaches are proposed to compute subtraction in stochastic unipolar representation. In the first approach, the subtraction operation is approximated by cascading multiple levels of OR and AND gates. In the second approach, the stochastic subtraction is implemented using a multiplexer and a stochastic divider. Second, computation of polynomials in stochastic unipolar format is presented using scaled addition and the proposed stochastic subtraction. Third, we propose stochastic computation of polynomials using factorization. Experimental results show that the proposed stochastic logic circuits require less hardware complexity than the previous stochastic polynomial implementation using Bernstein polynomials.
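The two operations the abstract calls easy, multiplication and scaled addition, can be simulated with a short sketch. It assumes the usual unipolar encoding, in which a value p in [0, 1] is represented by a bitstream whose bits are 1 with probability p; the helper names are invented here, and the paper's subtractor designs are not reproduced.

```python
import random


def encode(p, n_bits, rng):
    """Unipolar encoding: each bit is 1 with probability p (0 <= p <= 1)."""
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]


def decode(bits):
    """Estimate the encoded value as the fraction of 1s in the stream."""
    return sum(bits) / len(bits)


def multiply(a, b):
    """AND gate: for independent streams, P(a AND b) = P(a) * P(b)."""
    return [x & y for x, y in zip(a, b)]


def scaled_add(a, b, select):
    """Multiplexer: with P(select = 1) = 0.5, the output encodes (a + b) / 2."""
    return [x if s else y for x, y, s in zip(a, b, select)]


if __name__ == "__main__":
    rng = random.Random(0)
    n = 50_000
    a, b = encode(0.6, n, rng), encode(0.5, n, rng)
    c, sel = encode(0.2, n, rng), encode(0.5, n, rng)
    print("0.6 * 0.5 ~", decode(multiply(a, b)))
    print("(0.6 + 0.2) / 2 ~", decode(scaled_add(a, c, sel)))
```

Because every stream value lies in [0, 1], there is no direct way to represent a negative intermediate result, which is why subtraction and negative coefficients need the dedicated treatment the paper describes.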

Redesign the Memory Allocator for Non-Volatile Main Memory

Non-volatile memory (NVM) has the merits of byte-addressability, fast speed, persistency and low power consumption, which make it attractive for use as main memory. Commonly, a user process dynamically acquires memory through memory allocators. However, traditional memory allocators designed with in-place data writes are not appropriate for non-volatile main memory (NVRAM) due to its limited endurance. In this paper, first, we quantitatively analyze the wear-obliviousness of a DRAM-oriented allocator (glibc malloc) and the inefficiency of a wear-conscious allocator (NVMalloc). Then, we propose WAlloc, an efficient wear-aware manual memory allocator designed for NVRAM, which (1) decouples metadata and data management; (2) distinguishes metadata by volatility; (3) redirects data writes to achieve wear-leveling; and (4) redesigns an efficient and effective NVM copy mechanism, bypassing the CPU cache and prefetching data explicitly. Finally, experimental results show that the wear-leveling of WAlloc outperforms that of NVMalloc by about 30% and 60% under random workloads and well-distributed workloads, respectively. Besides, WAlloc reduces data memory writes in 64-byte blocks by an average of 1.5X compared with glibc malloc. While still ensuring data persistency, the cache-bypassing NVM copy outperforms the clflush-based NVM copy by circa 14%.

VLSI Architecture for the Restricted Boltzmann Machine

Neural network (NN) systems are widely used in many important applications ranging from computer vision to speech recognition. To date, most NN systems are processed by general-purpose processing units like CPUs or GPUs. However, as dataset and network sizes rapidly increase, the original software implementations suffer from long training time. To overcome this problem, specialized hardware accelerators are needed to design high-speed NN systems. This paper presents an efficient hardware architecture for the restricted Boltzmann machine (RBM), an important category of NN systems. Various optimization approaches at the hardware level are applied to improve the training speed. As-soon-as-possible and overlapped-scheduling approaches are used to reduce the latency. It is shown that, compared with the flat design, the proposed RBM architecture can achieve a 50% reduction in training time. In addition, an on-the-fly computation scheme is used to reduce the storage requirement of binary and stochastic states by several hundred times. Then, based on the proposed approach, a 784-2252 RBM design example is developed for the MNIST handwritten digit recognition dataset. Analysis shows that the VLSI design of the RBM achieves a 170 times speedup in training compared to a CPU-based solution, with small performance loss.


ACM Journal on Emerging Technologies in Computing Systems (JETC)

Volume 13 Issue 3, February 2017
Volume 13 Issue 2, February 2017

Volume 13 Issue 1, December 2016 Special Issue on Secure and Trustworthy Computing
Volume 12 Issue 4, July 2016

Volume 12 Issue 3, September 2015 Special Issue on Cross-Layer System Design and Regular Papers
Volume 12 Issue 2, August 2015 Special Issue on Advances in Design of Ultra-Low Power Circuits and Systems in Emerging Technologies
Volume 12 Issue 1, July 2015
Volume 11 Issue 4, April 2015 Special Issues on Neuromorphic Computing and Emerging Many-Core Systems for Exascale Computing

Volume 11 Issue 3, December 2014 Special Issue on Computational Synthetic Biology and Regular Papers
Volume 11 Issue 2, November 2014 Special Issue on Reversible Computation and Regular Papers
Volume 11 Issue 1, September 2014
Volume 10 Issue 4, May 2014
Volume 10 Issue 3, April 2014
Volume 10 Issue 2, February 2014
Volume 10 Issue 1, January 2014 Special Issue on Reliability and Device Degradation in Emerging Technologies and Special Issue on WoSAR 2011

Volume 9 Issue 4, November 2013 Special Issue on Bioinformatics
Volume 9 Issue 3, September 2013
Volume 9 Issue 2, May 2013 Special issue on memory technologies
Volume 9 Issue 1, February 2013

Volume 8 Issue 4, October 2012
Volume 8 Issue 3, August 2012
Volume 8 Issue 2, June 2012 Special Issue on Implantable Electronics
Volume 8 Issue 1, February 2012

Volume 7 Issue 4, December 2011
Volume 7 Issue 3, August 2011
Volume 7 Issue 2, June 2011
Volume 7 Issue 1, January 2011

Volume 6 Issue 4, December 2010
Volume 6 Issue 3, August 2010
Volume 6 Issue 2, June 2010
Volume 6 Issue 1, March 2010

Volume 5 Issue 4, November 2009
Volume 5 Issue 3, August 2009
Volume 5 Issue 2, July 2009
Volume 5 Issue 1, January 2009

Volume 4 Issue 4, October 2008
Volume 4 Issue 3, August 2008
Volume 4 Issue 2, April 2008
Volume 4 Issue 1, March 2008
Volume 3 Issue 4, January 2008

Volume 3 Issue 3, November 2007
Volume 3 Issue 2, July 2007
Volume 3 Issue 1, April 2007

Volume 2 Issue 4, October 2006
Volume 2 Issue 3, July 2006
Volume 2 Issue 2, April 2006
Volume 2 Issue 1, January 2006

Volume 1 Issue 3, October 2005
Volume 1 Issue 2, July 2005
Volume 1 Issue 1, April 2005
