ACM Journal on Emerging Technologies in Computing (JETC)

Latest Articles

Guest Editors’ Introduction: Emerging Networks-on-Chip Designs, Technologies, and Applications

Design and Multi-Abstraction-Level Evaluation of a NoC Router for Mixed-Criticality Real-Time Systems

A Mixed Criticality System (MCS) combines real-time software tasks with different criticality levels. In an MCS, the criticality level specifies the... (more)

Time-Randomized Wormhole NoCs for Critical Applications

Wormhole-based NoCs (wNoCs) are widely accepted in high-performance domains as the most appropriate solution to interconnect an increasing number of... (more)

Limit of Hardware Solutions for Self-Protecting Fault-Tolerant NoCs

We study the ultimate limits of hardware solutions for the self-protection strategies against permanent faults in networks on chips (NoCs). NoCs... (more)

Fault-Tolerant Network-on-Chip Design with Flexible Spare Core Placement

Network-on-Chip (NoC) has been proposed as a promising solution to overcome the communication challenges of System-on-Chip (SoC) design in nanoscale... (more)

Thermal-aware Test Scheduling Strategy for Network-on-Chip based Systems

Rapid progress in technology scaling has introduced massive parallel computing systems with multiple cores on the integrated circuit (IC), in which a... (more)

PANE: Pluggable Asynchronous Network-on-Chip Simulator

Communication between different IP cores in MPSoCs and HMPs often involves clock domain crossing. Asynchronous networks-on-chip (NoCs) support communication in such heterogeneous set-ups. While there are a large number of tools to model NoCs for synchronous systems, there is very limited tool support to model communication for multi-clock domain... (more)

BigBus: A Scalable Optical Interconnect

This article presents BigBus, a novel design of an on-chip photonic network for a 1,024-node system. For such a large on-chip network, performance and power reduction are two mutually conflicting goals. This article uses a combination of strategies to reduce static power consumption while simultaneously improving performance and the energy-delay²... (more)

GARDENIA: A Graph Processing Benchmark Suite for Next-Generation Accelerators

This article presents the Graph Algorithm Repository for Designing Next-generation Accelerators (GARDENIA), a benchmark suite for studying irregular graph algorithms on massively parallel accelerators. Applications with limited control and data irregularity are the main focus of existing generic benchmarks for accelerators, while available graph... (more)

Self-learnable Cluster-based Prefetching Method for DRAM-Flash Hybrid Main Memory Architecture

This article presents a novel prefetching mechanism for memory-intensive workloads used in... (more)


About JETC

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. 

Call for Nominations
ACM Journal on Emerging Technologies in Computing (JETC)

The term of the current Editor-in-Chief (EiC) of the ACM Journal on Emerging Technologies in Computing (JETC) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.


JETC Special Issues

For a list of JETC special issue Calls-for-Papers past and present, click here.

Energy-efficient FPGA Spiking Neural Accelerators with Supervised and Unsupervised Spike-Timing-Dependent-Plasticity

The Liquid State Machine (LSM) is a powerful model of recurrent spiking neural networks (SNNs) that provides an appealing brain-inspired computing paradigm for machine-learning applications. Moreover, because it processes information directly on spiking events, the LSM is highly amenable to hardware implementation. However, the spike computing paradigm of SNNs constrains synaptic weight updates to be local, which poses a great challenge for the design of learning algorithms, as most conventional optimization approaches do not satisfy this constraint. In this paper, we present a bio-plausible supervised spike-timing-dependent-plasticity (STDP) rule to train the output layer of the LSM for good classification performance, and a hardware-friendly unsupervised STDP reservoir training rule as a supplement. Both algorithms are implemented on an FPGA LSM neural accelerator, with efficiency achieved through algorithm-level optimization and by leveraging the self-organizing behavior naturally introduced by STDP. The recurrent spiking neural accelerator is built with an on-board ARM microprocessor host on a Xilinx Zynq ZC-706 platform and trained for speech recognition with the TI46 speech corpus as the benchmark. By combining unsupervised and supervised STDP, a performance boost of up to 3.47% is achieved compared to a competitive non-STDP baseline training algorithm.
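The abstract does not give the exact supervised rule, but the pair-based STDP mechanic it builds on can be sketched in a few lines. The learning rates and time constant below (`a_plus`, `a_minus`, `tau`) are illustrative placeholders, not the paper's trained values:

```python
import math

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0,
                w_min=0.0, w_max=1.0):
    """Pair-based STDP: potentiate when the presynaptic spike precedes
    the postsynaptic spike, depress otherwise. All constants are
    illustrative, not the paper's values."""
    dt = t_post - t_pre
    if dt >= 0:                       # pre fired before post -> potentiation
        w += a_plus * math.exp(-dt / tau)
    else:                             # post fired before pre -> depression
        w -= a_minus * math.exp(dt / tau)
    return min(max(w, w_min), w_max)  # clip to the allowed weight range

w = stdp_update(0.5, t_pre=10.0, t_post=15.0)  # causal pair -> weight grows
```

Locality is what makes such a rule hardware-friendly: each synapse only needs the spike times of the two neurons it connects.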

Identification of Synthesis Approaches for IP/IC Piracy of Reversible Circuits

Reversible circuits employ a computational paradigm that is beneficial for several applications, including the design of encoding and decoding devices, low-power design, and emerging applications such as quantum computation. However, as with conventional logic, reversible circuits may be subject to Intellectual Property/Integrated Circuit piracy. To counteract such attacks, a detailed understanding of how to identify the target function of a reversible circuit is crucial. In contrast to conventional logic, the target function of a reversible circuit is (implicitly or explicitly) embedded into the circuit, and numerous synthesis solutions have been proposed for this purpose. To obtain the target function of a reversible circuit, it is crucial to know which synthesis approach was used to generate the circuit in the first place. In this work, we propose a machine-learning-based scheme to determine the synthesis approach used, based on its telltale signs. Furthermore, we study the impact of optimizing the synthesis approaches on these telltale signs. Our analysis shows that the most established synthesis approaches can be identified in the vast majority of cases, even when optimized versions are applied. This motivates a thorough investigation of how to obfuscate corresponding designs.

Composable Probabilistic Inference Networks Using MRAM-based Stochastic Neurons

Magnetoresistive random access memory (MRAM) technologies with thermally unstable nanomagnets are leveraged to develop an intrinsic stochastic neuron as a building block for restricted Boltzmann machines (RBMs) to form deep belief networks (DBNs). The embedded MRAM-based neuron is modeled using precise physics equations. The simulation results exhibit the desired sigmoidal relation between the input voltages and probability of the output state. A probabilistic inference network simulator (PIN-Sim) is developed to realize a circuit-level model of an RBM utilizing resistive crossbar arrays along with differential amplifiers to implement the positive and negative weight values. The PIN-Sim is composed of five main blocks to train a DBN, evaluate its accuracy, and measure its power consumption. The MNIST dataset is leveraged to investigate the energy and accuracy tradeoffs of seven distinct network topologies. The software and hardware level simulations indicate that a 784x200x10 topology can achieve less than 5% error rates with ~400 pJ energy consumption. The error rates can be reduced to 2.5% by using a 784x500x500x500x10 DBN at the cost of ~10x higher energy consumption and significant area overhead. Finally, the effects of specific hardware-level parameters on power dissipation and accuracy tradeoffs are identified via the developed PIN-Sim framework.
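The sigmoidal input-to-probability relation the abstract describes is the classic RBM sampling primitive; a minimal behavioral sketch (the `slope` parameter standing in for the device's physics-derived transfer curve is an assumption) is:

```python
import math, random

def stochastic_neuron(v_in, slope=1.0):
    """Sample the binary state of a stochastic neuron: the probability of
    the '1' state follows a sigmoid of the input voltage. `slope` is a
    stand-in for the device transfer characteristic, not a modeled value."""
    p_one = 1.0 / (1.0 + math.exp(-slope * v_in))
    return 1 if random.random() < p_one else 0

# Empirical firing rate approaches the sigmoid over many samples:
samples = [stochastic_neuron(0.0) for _ in range(10_000)]
rate = sum(samples) / len(samples)   # ~0.5 for zero input
```

In an RBM, this sampling step replaces the deterministic activation: each hidden unit fires with the sigmoid probability of its weighted input.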

A Mixed-Signal Architecture for Convolutional Neural Networks

Deep neural network (DNN) accelerators with improved energy and delay are desirable for meeting the requirements of hardware targeted for IoT and edge computing systems. Convolutional neural networks (CoNNs) belong to one of the most popular types of DNN architectures. This paper presents the design and evaluation of an accelerator for CoNNs. The system-level architecture is based on mixed-signal, cellular neural networks (CeNNs). Specifically, we present (i) the implementation of different layers, including convolution, ReLU, and pooling, in a CoNN using CeNN, (ii) modified CoNN structures with CeNN-friendly layers to reduce computational overheads typically associated with a CoNN, (iii) a mixed-signal CeNN architecture that performs CoNN computations in the analog and mixed signal domain, and (iv) design space exploration that identifies what CeNN-based algorithm and architectural features fare best compared to existing algorithms and architectures when evaluated over common datasets -- MNIST and CIFAR-10. Notably, the proposed approach can lead to 8.7X improvements in energy-delay product (EDP) per digit classification for the MNIST dataset at iso-accuracy when compared with the state-of-the-art DNN engine, while our approach could offer 4.3X improvements in EDP when compared to other network implementations for the CIFAR-10 dataset.

Hardware-Software Co-design to Accelerate Neural Network Applications

Approximate computing is a viable method to save energy and increase performance by trading accuracy for energy. In this paper, we propose a novel approximate floating-point multiplier, called CMUL, which significantly reduces energy and improves the performance of multiplication at the expense of accuracy. Our design approximates multiplication by replacing the most costly step of the operation with a lower-energy alternative. To tune the level of approximation, CMUL dynamically identifies the inputs that would produce the largest approximation error and processes them in precise mode. To use CMUL for deep neural network (DNN) acceleration, we propose a framework that modifies the trained DNN model to make it suitable for approximate hardware. Our framework adjusts the DNN weights to a set of "potential weights" that are suitable for approximate hardware, and then compensates for the possible quality loss by iteratively retraining the network under the existing constraints. Our evaluations on four DNN applications show that CMUL can achieve a 60.3% energy-efficiency improvement and a 3.2X energy-delay product (EDP) improvement compared to the baseline GPU, while ensuring less than 0.2% quality loss.
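The abstract does not spell out CMUL's internals, but the general idea of trading multiplier precision for energy can be illustrated with a generic mantissa-truncation approximate multiply. The `bits` knob below is a hypothetical illustration, not CMUL's actual mechanism:

```python
import math

def truncate_mantissa(x, bits=8):
    """Keep only the top `bits` bits of the mantissa (illustrative knob,
    not CMUL's actual approximation step)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)             # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(math.trunc(m * scale) / scale, e)

def approx_mul(a, b, bits=8):
    """Approximate multiply: operate on truncated mantissas only, so a
    hardware multiplier would need far fewer partial products."""
    return truncate_mantissa(a, bits) * truncate_mantissa(b, bits)

exact = 3.14159 * 2.71828
approx = approx_mul(3.14159, 2.71828)
rel_err = abs(approx - exact) / exact   # bounded by the truncation width
```

The relative error is bounded by roughly one unit in the last kept mantissa bit per operand, which is the kind of bound an "identify the worst inputs and fall back to precise mode" policy exploits.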

Spiking Neural Networks Hardware Implementations and Challenges: a Survey

Neuromorphic computing is now a major research field for both academic and industrial actors. As opposed to von Neumann machines, brain-inspired processors aim to bring memory and computation closer together. They are designed to efficiently evaluate machine-learning algorithms. Recently, Spiking Neural Networks, a generation of cognitive algorithms employing computational primitives that mimic the operating principles of neurons and synapses, have become an important part of deep learning. They are expected to improve the computational performance and efficiency of neural networks, but they are best suited to hardware able to support their temporal dynamics. In this survey, we present the state of the art of hardware implementations of spiking neural networks and the current trends in algorithm elaboration, from model selection to training mechanisms. The scope of existing solutions is extensive; we thus present the general framework and study the relevant particularities on a case-by-case basis. We describe the strategies employed to leverage the characteristics of these event-driven algorithms at the hardware level and discuss their related advantages and challenges.

Placement & Routing for Tile-based Field-coupled Nanocomputing Circuits is NP-complete

Field-coupled Nanocomputing (FCN) technologies provide an alternative to conventional CMOS-based computation technologies and are characterized by intriguingly low energy dissipation. Accordingly, their design has received significant attention in the recent past. FCN circuit implementations like Quantum-dot Cellular Automata (QCA) or Nanomagnet Logic (NML) have already been built in labs, and basic operations such as inverters, Majority, AND, OR, etc. are already available. The design problem basically boils down to the question of how to place basic operations and route their connections so that the desired function results while, at the same time, further constraints (related to timing, clocking, path lengths, etc.) are satisfied. While several solutions to this problem have been proposed, interestingly, no clear understanding of the complexity of the underlying task has existed thus far. In this research note, we consider this problem and prove that placement and routing for tile-based FCN circuits is NP-complete. This provides a theoretical foundation for the further development of corresponding design methods.

In-situ Stochastic Training of MTJ Crossbars with Machine Learning Algorithms

Owing to their high device density, scalability, and non-volatility, Magnetic Tunnel Junction (MTJ)-based crossbars have garnered significant interest for implementing the weights of neural networks (NNs). The existence of only two stable states in MTJs implies a high overhead for obtaining optimal binary weights in software. This article illustrates that the inherent parallelism of the crossbar structure makes it highly suitable for in-situ training, wherein the network is taught directly on the hardware. This leads to significantly smaller training overhead, since the training time is independent of the size of the network, while also circumventing the effects of alternate current paths in the crossbar and accounting for manufacturing variations in the devices. We show how the stochastic switching characteristics of MTJs can be leveraged to perform probabilistic weight updates using the gradient descent algorithm. We describe how the update operations can be performed on crossbars implementing NNs and Restricted Boltzmann Machines, and we perform simulations to demonstrate the effectiveness of our techniques. The results reveal that stochastically trained MTJ-crossbar feed-forward and Deep Belief nets achieve classification accuracy nearly the same as that of real-valued-weight networks trained in software and exhibit immunity to device variations.
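The probabilistic update the abstract describes can be sketched behaviorally: a binary weight switches toward the direction gradient descent favors, with a switching probability that scales with the gradient magnitude. The linear probability model and `lr` scaling below are assumptions for illustration, not the paper's device model:

```python
import random

def stochastic_binary_update(weights, grads, lr=0.1):
    """Probabilistic update of binary (+1/-1) weights: each device switches
    toward the sign opposing its gradient with probability ~ lr * |grad|,
    a stand-in for stochastic MTJ switching under a scaled write pulse."""
    new_w = []
    for w, g in zip(weights, grads):
        target = -1 if g > 0 else 1        # direction gradient descent favors
        p_switch = min(1.0, lr * abs(g))   # illustrative probability model
        new_w.append(target if random.random() < p_switch else w)
    return new_w
```

Averaged over many updates, the expected weight change tracks the real-valued gradient step, which is why such stochastically trained binary nets can approach real-valued-weight accuracy.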

Advanced Simulation of Droplet Microfluidics

The complexity of droplet microfluidics is growing, which poses a challenge to the engineers designing such devices. In today's design processes, engineers rely on calculations, assumptions, simplifications, and their experience. To validate the resulting specification, a prototype is usually fabricated and experiments are conducted. If the design does not implement the desired functionality, this prototyping iteration is repeated, resulting in an expensive process. Simulation methods could help validate the specification before any prototype is fabricated. However, state-of-the-art simulators come with severe limitations that prevent their use for practically relevant applications: they are often inappropriate for droplet microfluidics, cannot handle the required physical phenomena, are not publicly available, and cannot be extended. In this work, we present an advanced simulation framework for droplet microfluidics that works directly on the specification of the design, supports essential physical phenomena, and is publicly available and extendable. Evaluations and case studies demonstrate the benefits: while current state-of-the-art tools were not applicable, the proposed solution reduces the design time and cost of a drug-screening device from one person-month and USD 1200, respectively, to just a fraction of that.

Neuromemristive Architecture of HTM with On-Device Learning and Neurogenesis

Hierarchical temporal memory (HTM) is a biomimetic sequence memory algorithm that holds promise for invariant representations of spatial and spatiotemporal inputs. This paper presents a comprehensive neuromemristive crossbar architecture for the spatial pooler (SP) and the sparse distributed representation classifier, which are fundamental to the algorithm. There are several unique features in the proposed architecture that tightly link with the HTM algorithm. A memristor that is suitable for emulating the HTM synapses is identified and a new z-window function is proposed. The architecture exploits the concept of synthetic synapses to enable potential synapses in the HTM. The crossbar for the SP avoids dark spots caused by unutilized crossbar regions and supports rapid on-chip training within 2 clock cycles. This research also leverages plasticity mechanisms such as neurogenesis and homeostatic intrinsic plasticity to strengthen the robustness and performance of the SP. The proposed design is benchmarked for image recognition tasks using MNIST and Yale faces datasets, and is evaluated using different metrics including entropy, sparseness, and noise robustness. Detailed power analysis at different stages of the SP operations is performed to demonstrate the suitability for mobile platforms.

MiC: Multi-level Characterization and Optimization of GPGPU Kernels

GPGPU computing has recently enjoyed increasing popularity in new computing paradigms, such as IoT. GPUs hold great potential for providing effective solutions for big-data analytics as the demand for processing large quantities of data in real time increases. However, the pervasive presence of GPUs on mobile devices presents great challenges for GPGPU, mainly because a GPGPU integrates large processor arrays and concurrently executing threads (up to hundreds of thousands). In particular, the root causes of performance loss in a GPGPU program cannot be easily revealed by current approaches. In this article, we propose MiC (Multi-level Characterization), a framework that comprehensively characterizes GPGPU kernels at the instruction, basic-block (BBL), and thread levels. Using mechanisms devised at each level, we provide a characterization of GPGPU kernels that covers all of them. We identify characteristics that distinguish GPGPU kernels from CPU workloads at the instruction and basic-block levels, and our approach characterizes branch divergence visually, enabling finer analysis of thread behavior in GPGPU kernels. Additionally, we show an optimization case for a GPGPU kernel whose bottleneck was identified through its characterization results.

Hypercolumn Sparsification for Low-Power Convolutional Neural Networks

We provide here a novel method, called hypercolumn sparsification, to achieve high recognition performance for convolutional neural networks (CNNs) despite low-precision weights and activities during both training and test phases. It operates on the stack of feature maps in each of the cascading feature matching and pooling layers through the processing hierarchy of the CNN by an explicit competitive process (k-WTA: winner take all) that generates a sparse feature vector at each spatial location. This principle is inspired by local brain circuits, where neurons tuned to respond to different patterns in the incoming signals from an upstream region inhibit each other using interneurons, such that only the ones that are maximally activated survive the quenching threshold. We show this process of sparsification is critical for probabilistic learning of low-precision weights and bias terms, thereby making pattern recognition amenable for energy-efficient hardware implementations. Further, we show that hypercolumn sparsification could lead to more data-efficient learning as well as having an emergent property of significantly pruning down the number of connections in the network. A theoretical account and empirical analysis are provided to understand these effects better.
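The k-WTA competition described above has a compact reference form: at each spatial location, keep the k largest activations and quench the rest to zero. A minimal sketch (tie-breaking by position is an assumption the abstract does not specify):

```python
def kwta(activations, k):
    """k-winner-take-all: keep the k largest activations at a spatial
    location and set the rest to zero, yielding a sparse feature vector."""
    if k >= len(activations):
        return list(activations)
    threshold = sorted(activations, reverse=True)[k - 1]
    kept = 0
    out = []
    for a in activations:
        if a >= threshold and kept < k:   # ties broken by position
            out.append(a)
            kept += 1
        else:
            out.append(0.0)
    return out

sparse = kwta([0.1, 0.9, 0.3, 0.7, 0.2], k=2)  # -> [0.0, 0.9, 0.0, 0.7, 0.0]
```

In a CNN this would be applied across the channel dimension at each pixel of a feature-map stack, so only the k best-matching filters survive at each location.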

SSS: Self-aware System on Chip using a Static-dynamic Hybrid Method

Network-on-chip (NoC) has become the de facto communication standard for multi-core and many-core systems-on-chip (SoCs), owing to its scalability and flexibility. However, temperature is an important factor in NoC design and affects the overall performance of the SoC: it decreases circuit frequency, increases energy consumption, and can even shorten chip lifetime. In this paper, we propose SSS, a self-aware SoC using a static-dynamic hybrid method that combines dynamic mapping and static mapping to reduce hot-spot temperatures in NoC-based SoCs. First, we propose monitoring and thermal modeling for self-state sensing. Then, in the static mapping stage, we calculate optimal mapping solutions under different temperature modes using a discrete firefly algorithm to aid self-decision making. Finally, in the dynamic mapping stage, we achieve dynamic mapping by configuring the NoC and an SoC sentient unit for self-optimization. Experimental results show that SSS can reduce the peak temperature by up to 37.52%, and an FPGA prototype demonstrates the effectiveness of SSS in reducing hot-spot temperatures.

Low-Cost Stochastic Hybrid Multiplier for Quantized Neural Networks

With increased interest in neural networks, hardware implementations of neural networks have been investigated. Researchers pursue low hardware cost using different technologies, such as stochastic computing and quantization. More specifically, quantization reduces the total number of trained weights and thus results in low hardware cost, while stochastic computing lowers hardware cost substantially by using simple gates instead of complex arithmetic operations. However, the combined advantages of quantization and stochastic computing in neural networks have not been well investigated. In this paper, we propose a new stochastic multiplier built from simple CMOS transistors, called the stochastic hybrid multiplier (SH-multiplier), for quantized neural networks. The new design uses the characteristics of quantized weights and greatly reduces the hardware cost of neural networks. Experimental results indicate that our stochastic design achieves about 7.7x energy reduction compared to its binary counterpart, while incurring only slightly higher recognition error rates. Compared to previous stochastic neural network implementations, our work achieves at least 4x, 9x, and 10x reductions in area, power, and energy, respectively.
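The "simple gates instead of complex arithmetic" idea refers to classic stochastic computing, where a single AND gate multiplies two probabilities encoded as random bitstreams. The sketch below shows the standard unipolar scheme, not the SH-multiplier's specific hybrid design; the stream length `n` trades accuracy for latency:

```python
import random

def to_stream(p, n, rng):
    """Encode probability p in [0, 1] as an n-bit stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(p_a, p_b, n=4096, seed=0):
    """Unipolar stochastic multiply: one AND gate per bit pair multiplies
    the probabilities encoded by two independent bitstreams."""
    rng = random.Random(seed)
    a = to_stream(p_a, n, rng)
    b = to_stream(p_b, n, rng)
    and_out = [x & y for x, y in zip(a, b)]
    return sum(and_out) / n           # estimate of p_a * p_b

est = sc_multiply(0.5, 0.6)           # close to 0.5 * 0.6 = 0.30
```

The estimate's standard error shrinks as 1/sqrt(n), which is why stochastic designs accept longer bitstreams (latency) in exchange for trivially cheap logic.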

Hardware Optimizations of Dense Binary Hyperdimensional Computing: Rematerialization of Hypervectors, Binarized Bundling, and Combinational Associative Memory

Brain-inspired hyperdimensional (HD) computing models neural activity patterns at the very scale of the brain's circuits with points of a hyperdimensional space, that is, with hypervectors (i.e., ultrawide holographic words: D = 10,000 bits). At its core, HD computing manipulates a set of seed hypervectors to build composite hypervectors representing objects of interest. It demands memory optimizations with simple operations for an efficient hardware realization. We propose hardware techniques for optimizing HD computing, in a synthesizable VHDL library, to enable the co-located implementation of both learning and classification tasks on only a small portion of Xilinx FPGAs. Our Pareto-optimal design is mapped onto only 18,340 CLBs of an FPGA, achieving simultaneously 2.39x lower area and 986x higher throughput compared to the baseline. This is accomplished by: (1) rematerializing hypervectors on the fly, substituting cheap logical operations for expensive memory accesses to seed hypervectors; (2) online and incremental learning from different gesture examples while staying in the binary space; and (3) combinational associative memories that steadily reduce the latency of classification.
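The cheap logical operations that make rematerialization pay off are the standard dense binary HD primitives: XOR binding, majority-vote bundling, and Hamming-distance lookup. A minimal sketch of these primitives (the paper's VHDL library is not reproduced here):

```python
import random

D = 10_000  # hypervector dimensionality

def random_hv(rng):
    """Random dense binary hypervector (a 'seed' hypervector)."""
    return [rng.randint(0, 1) for _ in range(D)]

def bind(a, b):
    """Bind two hypervectors with elementwise XOR."""
    return [x ^ y for x, y in zip(a, b)]

def bundle(hvs):
    """Bundle hypervectors by bitwise majority vote."""
    half = len(hvs) / 2
    return [1 if sum(bits) > half else 0 for bits in zip(*hvs)]

def hamming(a, b):
    """Normalized Hamming distance used for associative-memory lookup."""
    return sum(x != y for x, y in zip(a, b)) / D

rng = random.Random(42)
x, y = random_hv(rng), random_hv(rng)
# XOR binding is its own inverse: unbinding recovers the operand exactly.
assert bind(bind(x, y), y) == x
```

Because random hypervectors sit near Hamming distance 0.5 from each other, a classifier only needs to find the stored class hypervector closest to a query, which is what a combinational associative memory computes in one pass.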

Trained Biased Number Representation for ReRAM-Based Neural Network Accelerators

