ACM Journal on

Emerging Technologies in Computing (JETC)


Latest Issue

Volume 15, Issue 2, April 2019 is now available 


About JETC

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. 

Call for Nominations
ACM Journal on Emerging Technologies in Computing (JETC)

The term of the current Editor-in-Chief (EiC) of the ACM Journal on Emerging Technologies in Computing (JETC) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.


JETC Special Issues

For a list of JETC special issue Calls-for-Papers past and present, click here.

Energy-efficient FPGA Spiking Neural Accelerators with Supervised and Unsupervised Spike-Timing-Dependent-Plasticity

The Liquid State Machine (LSM) is a powerful model of recurrent spiking neural networks (SNNs) that provides an appealing brain-inspired computing paradigm for machine learning applications. Moreover, because it processes information directly on spiking events, the LSM is highly amenable to hardware implementation. However, the spike computing paradigm in SNNs constrains synaptic weight updates to be local, which poses a great challenge for the design of learning algorithms because most conventional optimization approaches do not satisfy this constraint. In this paper, we present a bio-plausible supervised spike-timing-dependent-plasticity (STDP) rule to train the output layer of the LSM for good classification performance, and a hardware-friendly unsupervised STDP reservoir training rule as a supplement. Both algorithms are implemented on the FPGA LSM neural accelerator, with efficiency optimized through algorithm-level optimization and by leveraging the self-organizing behaviors naturally introduced by the STDP algorithm. The recurrent spiking neural accelerator is built with an on-board ARM microprocessor host on a Xilinx Zynq ZC-706 platform and is trained for speech recognition with the TI46 speech corpus as the benchmark. By implementing unsupervised and supervised STDP together, a performance boost of up to 3.47% can be achieved compared to a competitive non-STDP baseline training algorithm.
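To illustrate what a local, spike-timing-based weight update looks like, the sketch below implements the textbook pair-based STDP window. It is not the supervised or unsupervised rule proposed in the paper. The function names `stdp_delta_w` and `apply_stdp` and every constant (`a_plus`, `a_minus`, `tau_plus`, `tau_minus`, the weight bounds) are illustrative assumptions.

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Textbook pair-based STDP weight change for one pre/post spike pair.

    Times are in milliseconds. All constants are illustrative assumptions.
    """
    dt = t_post - t_pre
    if dt > 0:
        # Pre fires before post: potentiate, decaying with the time gap.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # Post fires before pre: depress, decaying with the time gap.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


def apply_stdp(weight, t_pre, t_post, w_min=0.0, w_max=1.0):
    """Update one synaptic weight using only its own pre/post spike times."""
    return min(w_max, max(w_min, weight + stdp_delta_w(t_pre, t_post)))
```

The update depends only on the spike times seen at one synapse, which is the locality property the abstract refers to.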

Application and Thermal Reliability-Aware Reinforcement Learning Based Multi-Core Power Management

Thread Batching for High-performance Energy-efficient GPU Memory Design

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because peak memory bandwidth has improved only slowly, memory has become a bottleneck for GPU performance and energy efficiency. In this work, we propose an integrated architectural scheme to optimize memory accesses and thereby boost the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and binds each stream multiprocessor (SM) to dedicated memory banks. TEMP then dispatches the thread batch to an SM to ensure highly parallel memory-access streaming. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve memory access locality and reduce contention on memory controllers and interconnection networks. Experimental results show that the integrated TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference induced by CPU applications in a GPU+CPU heterogeneous system with our proposed schemes. Our results show that the proposed solution can ensure the execution efficiency of GPU applications with negligible performance degradation of CPU applications.
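The grouping step described above can be sketched in a few lines. This is a simplified software model, not the hardware mechanism from the paper: the function names `build_thread_batches` and `assign_batches_to_sms` are hypothetical, the round-robin dispatch is an assumption, and the binding of SMs to memory banks is not modeled.

```python
from collections import defaultdict

def build_thread_batches(block_pages):
    """Group thread blocks that touch the same set of pages into one batch.

    block_pages maps a thread-block id to the iterable of page ids it accesses.
    Returns a list of batches (sorted lists of block ids).
    """
    batches = defaultdict(list)
    for block_id, pages in block_pages.items():
        batches[frozenset(pages)].append(block_id)
    return [sorted(blocks) for blocks in batches.values()]


def assign_batches_to_sms(batches, num_sms):
    """Dispatch each whole batch to a single SM (round-robin, an assumption)."""
    assignment = defaultdict(list)
    for i, batch in enumerate(batches):
        assignment[i % num_sms].append(batch)
    return dict(assignment)
```

Keeping a batch on one SM means that blocks sharing pages issue their requests from the same place, which is the property the partitioning relies on.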

A High-Performance Homogeneous Droplet Routing Technique for MEDA Based Biochips

Recent advances in microelectrode-dot-array (MEDA) based architectures for digital microfluidic biochips have enabled a major enhancement in microfluidic operations for traditional lab-on-chip devices. One critical issue for MEDA based biochips is the transportation of droplets. MEDA allows dynamic routing for droplets of different sizes. In this paper, we propose a high-performance droplet routing technique for MEDA based digital microfluidic biochips. First, we propose the basic concept of a droplet movement strategy in MEDA based design, together with a definition of strictly shielded zones within the layout in the MEDA architecture. Next, we propose transportation schemes for droplets in the MEDA architecture under different blockage or crossover conditions and estimate route distances for each net offline. Finally, a priority based routing strategy combining the transportation schemes stated earlier is proposed. Concurrent movement of each droplet is scheduled in a time-multiplexed manner. This poses critical challenges for parallel routing of individual droplets with optimal sharing of cells, which makes for a routing problem of higher complexity. The final compaction solution satisfies the timing constraint and improves fault tolerance. Simulations are carried out on standard benchmark circuits, namely Benchmark suite I and Benchmark suite III. Experimental results show satisfactory improvements and demonstrate a high degree of robustness for our proposed algorithm.
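One simple way to estimate a route distance offline is a breadth-first search over the cell grid. The sketch below is a deliberately reduced model and not the paper's transportation scheme: it treats a droplet as a single cell, ignores shielded zones and droplet size, and the function name `route_distance` and the `'#'` blockage marker are assumptions made for illustration.

```python
from collections import deque

def route_distance(grid, source, target):
    """Shortest 4-neighbour path length between two cells, or None if blocked.

    grid is a list of equal-length strings; '#' marks a blocked cell and any
    other character is free. source and target are (row, col) tuples.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == target:
            return dist
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None
```

Distances computed this way could then feed a priority order among nets, for example routing longer nets first, though the abstract does not state which priority the authors use.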

Hardware-Software Co-design to Accelerate Neural Network Applications

Approximate computation is a viable method to save energy and increase performance by trading accuracy for energy. In this paper, we propose a novel approximate floating point multiplier, called CMUL, which significantly reduces energy and improves performance of multiplication at the expense of accuracy. Our design approximately models multiplication by replacing the most costly step of the operation with a lower energy alternative. In order to tune the level of approximation, CMUL dynamically identifies the inputs which will produce the largest approximation error and processes them in precise CMUL mode. In order to use CMUL for deep neural network (DNN) acceleration, we propose a framework which modifies the trained DNN model to make it suitable for approximate hardware. Our framework adjusts the DNN weights to a set of "potential weights" that are suitable for approximate hardware. Then, it compensates for the possible quality loss by iteratively retraining the network based on the existing constraints. Our evaluations on four DNN applications show that CMUL can achieve 60.3% energy efficiency improvement and 3.2X energy-delay product (EDP) improvement as compared to the baseline GPU, while ensuring less than 0.2% quality loss.
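The abstract does not say which step CMUL replaces or how it detects high-error inputs, so the following is only a generic sketch of the idea, not the CMUL design. It assumes a Mitchell-style approximation in which the mantissa product is replaced by an addition, and it falls back to the exact product when both mantissa fractions are large. The function name `approx_mul` and the `threshold` parameter are hypothetical, and infinities and NaNs are not handled.

```python
import math

def approx_mul(x, y, threshold=0.25):
    """Approximate x * y by replacing the mantissa product with an addition.

    Writes |x| = (1 + a) * 2**ex and |y| = (1 + b) * 2**ey with 0 <= a, b < 1,
    then uses (1 + a)(1 + b) ~= 1 + a + b, dropping the a*b term.
    When both fractions exceed `threshold` the dropped term is large,
    so the function falls back to the exact product ("precise mode").
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = -1.0 if (x < 0) != (y < 0) else 1.0
    mx, ex = math.frexp(abs(x))   # abs(x) = mx * 2**ex, 0.5 <= mx < 1
    my, ey = math.frexp(abs(y))
    a, b = 2.0 * mx - 1.0, 2.0 * my - 1.0   # fractional parts in [0, 1)
    if min(a, b) > threshold:
        return x * y              # precise mode for high-error inputs
    # Exponents add as in an exact multiply; only the mantissa is approximated.
    return sign * math.ldexp(1.0 + a + b, (ex - 1) + (ey - 1))
```

With the default threshold, the relative error of the approximate path stays below 10%, because the dropped term is at most (0.25/1.25) × 0.5 of the exact product. Raising the threshold sends fewer inputs to precise mode at the cost of larger errors.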

LiwePMS: A Lightweight Persistent Memory with Wear-aware Memory Management

Next-generation Storage Class Memory (SCM) offers low latency, high density, byte-addressable access and persistency. The potent combination of these attractive characteristics makes it possible for SCM to unify main memory and storage and so reduce the storage hierarchy. With this aim, several persistent memory systems have been designed. However, heavy metadata and transaction costs degrade system performance. In this paper, we present a lightweight, wear-aware persistent memory system, LiwePMS, which allows fast access to persistent data stored in SCM. LiwePMS improves performance by simplifying metadata management and the consistency method. LiwePMS abstracts SCM as heap space with container-based dynamic address mapping. LiwePMS also implements an efficient wear-aware dynamic memory allocator and a lightweight transaction mechanism for data consistency in a user-space library. The experiments showed that LiwePMS persists key-value records 1.5X faster than the Redis RDB mechanism and 16X faster than the Redis AOF mechanism. The latency of LiwePMS is only 50% and 60% of that of HEAPO when creating and attaching persistent regions. Also, the wear-leveling policy of the memory allocator outperforms that of NVMalloc by 3% to 30%, and the transaction method raises transaction performance to 1.2X that of other state-of-the-art persistent memory systems with a transaction mechanism.
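The abstract does not describe the allocator's policy, so the sketch below shows only the general idea behind wear-aware allocation: prefer the free block that has absorbed the fewest writes. The class `WearAwareAllocator`, its fixed-size blocks, and the per-block write counter are all assumptions for illustration and are not taken from LiwePMS.

```python
import heapq

class WearAwareAllocator:
    """Toy fixed-size block allocator that prefers the least-worn free block."""

    def __init__(self, num_blocks):
        self.wear = [0] * num_blocks                     # writes seen per block
        self.free = [(0, b) for b in range(num_blocks)]  # heap of (wear, block)
        heapq.heapify(self.free)

    def allocate(self):
        """Return the free block with the lowest wear count."""
        if not self.free:
            raise MemoryError("no free blocks")
        _, block = heapq.heappop(self.free)
        return block

    def write(self, block):
        """Record one write to an allocated block."""
        self.wear[block] += 1

    def release(self, block):
        """Return a block to the free heap with its current wear count."""
        heapq.heappush(self.free, (self.wear[block], block))
```

A heavily written block that is released sinks to the back of the heap, so later allocations spread writes across less-worn blocks first. The sketch does not guard against releasing the same block twice.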

Guest Editors' Introduction to the Special Section on Hardware and Algorithms for Energy-Constrained On-chip Machine Learning

A Comparative Cross-Layer Study on Racetrack Memories: Domain Wall vs Skyrmion

Racetrack memory (RM), a new storage scheme in which information flows along a nanotrack, has been considered a potential candidate to replace the hard disk drive (HDD) as a future high-density storage device. The first RM technology, proposed in 2008 by IBM, relies on a train of opposite magnetic domains separated by domain walls (DWs), and is named DW-RM. After ten years of intensive research, a variety of fundamental advances have been achieved; unfortunately, no product is available yet. On the other hand, new concepts might also be on the horizon. Recently, an alternative information carrier, the magnetic skyrmion, experimentally discovered in 2009, has been regarded as a promising replacement for the DW in RM, named skyrmion-based RM (SK-RM). Remarkable advances have been made in observing, writing, manipulating and deleting individual skyrmions. So, what is the relationship between DWs and skyrmions? What are the key differences between DWs and skyrmions, or between DW-RM and SK-RM? What benefits could SK-RM bring, and what challenges need to be addressed before application? In this paper, we intend to answer these questions through a comparative cross-layer study of DW-RM and SK-RM. This work will provide guidelines, especially for circuit and architecture researchers working on RM.

Hardware Optimizations of Dense Binary Hyperdimensional Computing: Rematerialization of Hypervectors, Binarized Bundling, and Combinational Associative Memory

Brain-inspired hyperdimensional (HD) computing models neural activity patterns of the very size of the brain's circuits with points of a hyperdimensional space, that is, with hypervectors (i.e., ultrawide holographic words: D = 10,000 bits). At its very core, HD computing manipulates a set of seed hypervectors to build composite hypervectors representing objects of interest. It demands memory optimizations with simple operations for an efficient hardware realization. We propose hardware techniques for optimizations of HD computing, in a synthesizable VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx FPGAs. Our Pareto optimal design is mapped on only 18340 CLBs of an FPGA, achieving simultaneous 2.39× lower area and 986× higher throughput compared to the baseline. This is accomplished by: (1) rematerializing hypervectors on the fly by substituting the cheap logical operations for the expensive memory accesses to seed hypervectors; (2) online and incremental learning from different gesture examples while staying in the binary space; (3) combinational associative memories to steadily reduce the latency of classification.
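The basic operations named above can be modeled in software with Python integers as bit vectors. This is a minimal sketch of dense binary HD computing in general, not the paper's VHDL design: a seeded pseudo-random generator stands in for whatever logic the hardware uses to rematerialize seed hypervectors, the bundling here is a plain batch majority vote rather than the incremental binarized scheme, and all function names are hypothetical.

```python
import random

D = 10_000  # hypervector width in bits, as in the abstract

def seed_hypervector(item_id):
    """Rematerialize the seed hypervector for item_id instead of storing it.

    A seeded PRNG stands in for the hardware generator (an assumption).
    """
    return random.Random(item_id).getrandbits(D)

def bind(a, b):
    """Bind two hypervectors with bitwise XOR."""
    return a ^ b

def bundle(vectors):
    """Bundle hypervectors with a bitwise majority vote (ties resolve to 0)."""
    result = 0
    for bit in range(D):
        ones = sum((v >> bit) & 1 for v in vectors)
        if 2 * ones > len(vectors):
            result |= 1 << bit
    return result

def hamming(a, b):
    """Number of differing bits between two hypervectors."""
    return bin(a ^ b).count("1")

def classify(query, prototypes):
    """Associative-memory lookup: return the label of the nearest prototype."""
    return min(prototypes, key=lambda label: hamming(query, prototypes[label]))
```

Because `seed_hypervector` is deterministic, a seed can be regenerated whenever it is needed rather than read from memory, which is the trade the abstract describes in item (1). A bundled prototype stays measurably closer in Hamming distance to each of its members than to unrelated hypervectors, which is what makes the nearest-prototype lookup work.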
