
ACM Journal on Emerging Technologies in Computing (JETC)

Latest Articles

Identification of Synthesis Approaches for IP/IC Piracy of Reversible Circuits

Reversible circuits employ a computational paradigm that is beneficial for several applications, including the design of encoding and decoding... (more)

Neuromemristive Architecture of HTM with On-Device Learning and Neurogenesis

Hierarchical temporal memory (HTM) is a biomimetic sequence memory algorithm that holds promise for invariant representations of spatial and... (more)

MiC: Multi-level Characterization and Optimization of GPGPU Kernels

Graphics processing units (GPUs) have enjoyed increasing popularity in recent years, benefiting from, for example, general-purpose GPU (GPGPU) computing for parallel programs and from new computing paradigms such as the Internet of Things (IoT). GPUs hold great potential in providing effective solutions for big data analytics, while the demands for... (more)

Advanced Simulation of Droplet Microfluidics

The complexity of droplet microfluidics grows with the implementation of parallel processes and multiple functionalities on a single device. This poses a severe challenge to the engineer designing the corresponding microfluidic networks. In today’s design processes, the engineer relies on calculations, assumptions, simplifications, as well... (more)

Energy-efficient FPGA Spiking Neural Accelerators with Supervised and Unsupervised Spike-timing-dependent-Plasticity

The liquid state machine (LSM) is a model of recurrent spiking neural networks (SNNs) and provides... (more)

SSS: Self-aware System-on-chip Using a Static-dynamic Hybrid Method

Network-on-Chip (NoC) has become the de facto communication standard for multi-core or many-core System-on-Chip (SoC) due to its scalability and flexibility. However, an important factor in NoC design is temperature, which affects the overall performance of SoC—decreasing circuit frequency, increasing energy consumption, and even shortening... (more)

Placement and Routing for Tile-based Field-coupled Nanocomputing Circuits Is NP-complete (Research Note)

Field-coupled Nanocomputing (FCN) technologies provide an alternative to conventional... (more)

LiwePMS: A Lightweight Persistent Memory with Wear-aware Memory Management

Next-generation Storage Class Memory (SCM) offers low-latency, high-density, byte-addressable access and persistency. The potent combination of these attractive characteristics makes it possible for SCM to unify the main memory and storage to reduce the storage hierarchy. Aiming for this, several persistent memory systems were designed. However,... (more)

NEWS

Current Issue

View the latest JETC issue in the ACM DL.

About JETC

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. 

 
New Editor-in-Chief
ACM Journal on Emerging Technologies in Computing (JETC)
 

JETC is happy to welcome Professor Ramesh Karri as the new Editor-in-Chief; he will assume his duties on August 1st. We welcome Prof. Karri and wish him all the best in his term. JETC would also like to thank Prof. Yuan Xie, whose term as EiC ends on July 31st, for his efforts to keep JETC a premium ACM journal.

 

JETC Special Issues

For a list of JETC special issue Calls-for-Papers, past and present, click here.

Language Support for Navigating Architecture Design in Closed Form

As computer architecture continues to expand beyond software-agnostic microarchitecture to specialized and heterogeneous logic, or even radically different emerging computing models (e.g., quantum cores, DNA storage units), detailed cycle-level simulation is no longer presupposed. Exploring designs under such complex interacting relationships (e.g., performance, energy, thermal, and frequency) calls for a more integrative but higher-level approach. We propose Charm, a modeling language supporting Closed-form High-level ARchitecture Modeling. Charm enables mathematical representations of mutually dependent architectural relationships to be specified, composed, checked, evaluated, reused, and shared. The language is interpreted through a combination of automatic symbolic evaluation, scalable graph transformation, and efficient compiler techniques, generating executable DAGs and optimized analysis procedures. Charm also exploits advances in satisfiability modulo theories (SMT) solvers to automatically search the design space when architects explore multiple design knobs simultaneously (e.g., CNN tiling configurations specified by four parameters). Through two case studies, we demonstrate that Charm allows one to define high-level architecture models in a clean and concise format, maximize reusability and shareability, capture unreasonable assumptions, and significantly ease design space exploration at a high level.
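
As a rough illustration (not taken from the paper) of how a closed-form model plus SMT-driven search over design knobs might look, the following Python sketch uses the z3 solver; the variable names, formulas, and budgets are all hypothetical, and z3 may report "unknown" on hard nonlinear instances.

    # Hypothetical closed-form model of four CNN tiling knobs, searched with z3.
    # Illustrative sketch only, not Charm's actual language or pipeline.
    from z3 import Ints, Optimize, sat

    tm, tn, tr, tc = Ints("tm tn tr tc")        # assumed tiling parameters
    opt = Optimize()

    # Assumed closed-form relationships between knobs and cost metrics.
    buffer_bytes = 2 * (tm * tr * tc + tn * tr * tc + tm * tn)   # on-chip buffer use
    dsp_use = tm * tn                                            # multiplier array size

    opt.add(tm >= 1, tn >= 1, tr >= 1, tc >= 1)
    opt.add(tm <= 64, tn <= 64, tr <= 64, tc <= 64)
    opt.add(buffer_bytes <= 512 * 1024)          # assumed 512 KiB scratchpad budget
    opt.add(dsp_use <= 1024)                     # assumed multiplier budget

    opt.maximize(tm * tn * tr * tc)              # simple throughput proxy
    if opt.check() == sat:
        model = opt.model()
        print({str(v): model[v] for v in (tm, tn, tr, tc)})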

Introduction to the Special Issue on HALO for Energy Constrained On-Chip Machine Learning

Application and Thermal Reliability-Aware Reinforcement Learning Based Multi-Core Power Management

On-the-Fly Page Migration and Address Reconciliation for Heterogeneous Memory Systems

For efficient placement of data in flat-address heterogeneous memory systems consisting of fast (e.g., 3D-DRAM) and slow memories (e.g., NVM), we present a hardware-based page migration technique. Unlike epoch-based approaches that migrate heavily accessed ("hot") pages from slow to fast memories at each epoch interval, we migrate a page immediately when it becomes hot ("on the fly"), using hardware in a user-transparent manner and with minimal OS intervention. Managing physical addresses after page relocation is cumbersome and normally requires costly OS intervention. We use a small hardware remap table to keep track of the new physical addresses of migrated pages, which limits address reconciliation to periodic evictions of old remap entries. We also propose a hardware-orchestrated, lightweight address reconciliation process. For our studied heterogeneous memory system, on-the-fly page migration with hardware-assisted address reconciliation provides 74% and 24% IPC improvements on average for a set of SPEC CPU2006 workloads, compared to a baseline without any page migration and to a system with on-the-fly page migration using OS-based address reconciliation, respectively. Furthermore, we present an analytical model for classifying applications as page-migration friendly (showing performance gains from page migration) or unfriendly based on their memory access behavior.
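
A minimal software sketch of the migration policy described above, assuming a simple access-count hotness test; the threshold, table size, and helper names are invented for illustration and do not reflect the paper's hardware design.

    # On-the-fly migration with a small remap table; reconciliation happens only
    # when an old remap entry is evicted. Purely illustrative parameters.
    from collections import OrderedDict

    HOT_THRESHOLD = 32            # accesses before a slow-memory page is deemed hot
    REMAP_CAPACITY = 64           # entries the assumed remap table can hold

    access_counts = {}            # slow-memory page -> access count
    remap_table = OrderedDict()   # old physical page -> new fast-memory page

    def reconcile(old_page, new_page):
        """Stand-in for the costly OS page-table update, done only on eviction."""
        print(f"reconcile {old_page:#x} -> {new_page:#x}")

    def access(page, allocate_fast_page):
        if page in remap_table:                    # already migrated: use new address
            return remap_table[page]
        access_counts[page] = access_counts.get(page, 0) + 1
        if access_counts[page] >= HOT_THRESHOLD:   # migrate immediately, not per epoch
            new_page = allocate_fast_page()
            if len(remap_table) >= REMAP_CAPACITY:
                old, new = remap_table.popitem(last=False)   # evict oldest entry
                reconcile(old, new)
            remap_table[page] = new_page
            return new_page
        return page

    fast_pages = iter(range(0x1000, 0x2000))
    for _ in range(40):                            # the 32nd access triggers migration
        access(0x7F00, lambda: next(fast_pages))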

Higher Dimension Quantum Entanglement Generators

Quantum information processing and communication techniques rely heavily upon entangled quantum states, motivating the development of methods and systems to generate entanglement. Much research has been dedicated to entangling radix-2 qubits, resulting in the Bell state generator and its generalized forms where the number of entangled qubits is greater than two, but up until this point, higher-radix quantum entanglement has been largely overlooked. In this work, techniques for quantum state entanglement in high-dimensional systems are described. These higher-dimensional quantum informatic systems comprise n quantum digits, or qudits, that are mathematically characterized as elements of an r-dimensional Hilbert vector space where r > 2. Consequently, the wavefunction is a time-dependent state vector of dimension r^n. Theoretical analyses and specific higher-radix entanglement generators are discussed.
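
The radix-r generalization of the Bell state mentioned above can be written as |Φ⟩ = (1/√r) Σ_i |i⟩|i⟩, a state vector of dimension r^n with n = 2 qudits. The short numpy sketch below constructs it explicitly for qutrits (r = 3); it is a numerical illustration, not the paper's generator circuit.

    # Maximally entangled state of two radix-r qudits: (1/sqrt(r)) * sum_i |i>|i>.
    import numpy as np

    def generalized_bell_state(r):
        state = np.zeros(r * r, dtype=complex)
        for i in range(r):
            basis = np.zeros(r, dtype=complex)
            basis[i] = 1.0
            state += np.kron(basis, basis)      # |i> tensor |i>
        return state / np.sqrt(r)

    psi = generalized_bell_state(3)             # qutrit (radix-3) example
    print(psi.reshape(3, 3).real)               # amplitude 1/sqrt(3) on |00>, |11>, |22>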

Thread Batching for High-performance Energy-efficient GPU Memory Design

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because peak memory bandwidth improves slowly, memory becomes a bottleneck for performance and energy efficiency in GPUs. In this work, we propose an integrated architectural scheme to optimize memory accesses and thereby boost the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and binds each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches each thread batch to an SM to ensure highly parallel memory-access streaming. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve memory access locality and reduce contention on memory controllers and interconnection networks. Experimental results show that the integrated TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference induced by CPU applications in a GPU+CPU heterogeneous system with our proposed schemes. Our results show that the proposed solution ensures the execution efficiency of GPU applications with negligible performance degradation of CPU applications.
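
The grouping step behind TEMP can be pictured with the toy sketch below: thread blocks that touch the same page set form one batch, which would then be dispatched to a single SM. The data layout and names are assumptions for illustration, not the paper's hardware mechanism.

    # Group thread blocks by the frozenset of pages they access.
    from collections import defaultdict

    def form_thread_batches(block_pages):
        """block_pages: thread-block id -> frozenset of page numbers it touches."""
        batches = defaultdict(list)
        for block_id, pages in block_pages.items():
            batches[pages].append(block_id)     # same page set -> same batch
        return list(batches.values())

    blocks = {0: frozenset({10, 11}), 1: frozenset({10, 11}), 2: frozenset({42})}
    print(form_thread_batches(blocks))          # [[0, 1], [2]]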

QuTiBench: Benchmarking Neural Networks on Heterogeneous Hardware

Neural networks have become one of the most successful machine learning algorithms, playing a key role in enabling machine vision and speech recognition. However, their deployment, particularly within energy-constrained embedded environments, remains limited due to their computational complexity and equally steep memory requirements. To address this, customized and heterogeneous hardware architectures have emerged with co-designed algorithms. The spectrum of design options is vast. For system-level designers, there is currently no good way to compare the variety of hardware, algorithm, and optimization choices. While numerous benchmarking efforts are emerging in this field, none of them support essential algorithmic optimizations such as quantization, or specialized heterogeneous hardware architectures. We propose a new benchmark suite, QuTiBench, to address this need. QuTiBench is a novel multi-tiered benchmarking methodology that helps system developers understand the benefits and limitations of these novel compute architectures with regard to specific neural networks, and it will help drive future innovation. We invite the community to contribute to QuTiBench so that it can support the full spectrum of choices in implementing machine learning systems, from cloud to IoT.

MV-Net: Towards Real-time Deep Learning on Mobile GPGPU Systems

The development of deep learning has recently propelled the sheer growth of applications on lightweight embedded and mobile systems. However, limited computation resources and power delivery capability in embedded platforms remain a significant bottleneck that prevents these systems from providing real-time deep learning ability, since inference of deep convolutional neural networks involves large numbers of parameters and operations. In particular, providing QoS-guaranteed neural network inference in the multi-task execution environment of multi-core SoCs is even more complicated. In this article, we present a novel deep neural network architecture, MV-Net, which provides performance elasticity and contention-aware self-scheduling ability for QoS enhancement in mobile systems. When the system's QoS constraints, output accuracy requirements, or resource contention status change, MV-Net dynamically reconfigures the corresponding neural network propagation paths and thus achieves an effective trade-off between network computational complexity and prediction accuracy via approximate computing. The experimental results show that (i) MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multi-task environment, and (ii) it satisfies the QoR constraints, significantly outperforms the baseline implementation, and improves system energy efficiency at the same time.

Comprehensive Analytic Performance Assessment and K-means based Multicast Routing Algorithms and Architecture for 3D-NoC of Spiking Neurons

Spiking neural networks (SNNs) are artificial neural network models that more closely mimic biological neural networks. In addition to neuronal and synaptic state, SNNs incorporate the concept of time into their computational model. Since each neuron in these networks is connected to thousands of others, high bandwidth is required. Moreover, since spike times are used to encode information in SNNs, very low communication latency is also required. The 2D-NoC has been used as a solution to provide a scalable interconnection fabric in large-scale parallel SNN systems. 3D-ICs have also attracted a lot of attention as a potential solution to resolve the interconnect bottleneck. The combination of these two emerging technologies provides a new horizon for IC design to satisfy the high requirements of low power and small footprint in emerging AI applications. In this work, we first present an efficient mathematical model to analyze the performance of different neural network topologies. Second, we present an architecture and two low-latency spike routing algorithms, named K-means based multicast routing (KMCR) and shortest-path K-means based multicast (SP-KMCR), for three-dimensional NoCs of spiking neurons (3DNoC-SNN). The proposed system was validated with an RTL-level implementation, and area/power analysis was performed using 45-nm CMOS technology.
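
As a rough sketch of the clustering step only (the KMCR/SP-KMCR tree construction itself is not reproduced), destination routers in the 3D mesh can be grouped with k-means so that one multicast copy is sent per cluster; the coordinates and k below are invented for illustration.

    # Plain k-means over 3D router coordinates; each cluster becomes one multicast group.
    import numpy as np

    def kmeans(points, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), k, replace=False)].astype(float)
        for _ in range(iters):
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)               # nearest centroid per destination
            for c in range(k):
                if np.any(labels == c):
                    centroids[c] = points[labels == c].mean(axis=0)
        return labels, centroids

    destinations = np.array([[0, 0, 0], [1, 0, 0], [6, 7, 3], [7, 7, 3], [7, 6, 2]])
    labels, centers = kmeans(destinations, k=2)
    print(labels)                                       # two groups, one per mesh region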

A High-Performance Homogeneous Droplet Routing Technique for MEDA Based Biochips

Recent advancement of microelectrode-dot-array (MEDA) based architectures for digital microfluidic biochips has enabled a major enhancement in microfluidic operations for traditional lab-on-chip devices. One critical issue for MEDA-based biochips is the transportation of droplets. MEDA allows dynamic routing for droplets of different sizes. In this paper, we propose a high-performance droplet routing technique for MEDA-based digital microfluidic biochips. First, we present the basic droplet-movement strategy in MEDA-based designs, together with a definition of strictly shielded zones within the MEDA layout. Next, we propose transportation schemes for droplets under different blockage or crossover conditions and estimate route distances for each net offline. Finally, we propose a priority-based routing strategy that combines the various transportation schemes stated earlier. Concurrent movement of the droplets is scheduled in a time-multiplexed manner, which poses critical challenges for parallel routing of individual droplets with optimal sharing of cells and formulates a routing problem of higher complexity. The final compacted solution satisfies the timing constraint and improves fault tolerance. Simulations are carried out on standard benchmark circuits, namely Benchmark suite I and Benchmark suite III. Experimental results show satisfactory improvements and demonstrate a high degree of robustness for our proposed algorithm.
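
A toy sketch of the offline distance estimation and priority ordering mentioned above; the Manhattan estimate and the longest-first rule are plausible stand-ins rather than the paper's exact MEDA-aware scheme, and the net coordinates are invented.

    # Estimate each net's route length offline and order nets by priority.
    def manhattan(src, dst):
        return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

    nets = {"n1": ((0, 0), (7, 3)), "n2": ((2, 5), (2, 6)), "n3": ((4, 4), (0, 9))}
    estimates = {name: manhattan(s, d) for name, (s, d) in nets.items()}
    order = sorted(estimates, key=lambda n: (-estimates[n], n))   # longer nets first
    print(order, estimates)                                       # ['n1', 'n3', 'n2']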

Predict, Share, and Recycle your Way to Low Power Nanophotonic Networks

High static power consumption is widely regarded as one of the largest bottlenecks in creating scalable optical NoCs. The standard techniques to reduce static power are based on sharing optical channels and modulating the laser. We show in this paper that state-of-the-art techniques in these areas are suboptimal and that there is significant room for further improvement. We propose two novel techniques, a neural network based method for laser modulation that predicts optical traffic and a distributed, altruistic algorithm for channel sharing, that come significantly closer to a theoretically ideal scheme. Even so, a lot of laser power still gets wasted. We propose to reuse this energy to heat micro-ring resonators (achieving thermal tuning) by efficiently recirculating it. These three methods help us significantly reduce the energy requirements. Our design consumes 4.7X lower laser power as compared to other state-of-the-art proposals. In addition, it results in a 31% improvement in performance and a 39% reduction in ED^2 for a suite of Splash2 and Parsec benchmarks.
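
To make the laser-modulation idea concrete, the sketch below predicts the next window's optical traffic and scales the laser level accordingly; the paper uses a neural-network predictor, whereas an exponentially weighted moving average stands in here, with made-up numbers.

    # Predict next-window channel utilization and gate the laser level to match.
    def next_laser_level(history, alpha=0.3, margin=1.2, max_level=1.0):
        estimate = history[0]
        for demand in history[1:]:
            estimate = alpha * demand + (1 - alpha) * estimate   # exponential smoothing
        return min(max_level, margin * estimate)                 # keep a safety margin

    recent_utilization = [0.10, 0.12, 0.30, 0.28, 0.05]          # fraction of channel used
    print(next_laser_level(recent_utilization))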

A Resource-Efficient Embedded Iris Recognition System Using Fully Convolutional Networks

Recently, iris segmentation using fully convolutional networks (FCNs) has shown promising advances. For embedded systems, a significant challenge is the computationally demanding nature of the proposed FCN architectures. Moreover, no existing study demonstrates the design, implementation, and evaluation of a complete iris recognition pipeline with FCN-based segmentation. Targeting an embedded platform, we propose a resource-efficient end-to-end iris recognition flow, which consists of FCN-based segmentation and contour fitting, followed by Daugman normalization and encoding. To obtain accurate and efficient FCN architectures, we introduce a SW/HW co-design methodology in which we propose multiple novel FCNs and construct a Pareto plot based on their segmentation performance and computational overheads. We then select the most efficient set of models and incorporate each into the end-to-end flow to evaluate its true recognition performance. Compared to previous works, our FCN architectures require 50× fewer FLOPs per inference while setting a new state-of-the-art segmentation accuracy. The recognition rates of our end-to-end pipelines also outperform the previous state-of-the-art on the two datasets evaluated. We demonstrate a SW/HW co-design realization of our flow on an embedded FPGA platform. Compared to an embedded CPU, our hardware acceleration achieves up to 6.6× speedup for the overall pipeline while using less than 20% of the available FPGA resources.
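
The Pareto-based model selection described above can be illustrated with the small sketch below: among candidate FCNs characterized by (FLOPs, segmentation error), only the non-dominated ones are kept. The candidate names and numbers are made up.

    # Keep models for which no other model is at least as good on both metrics.
    def pareto_front(models):
        """models: list of (name, flops, error); lower is better on both axes."""
        front = []
        for name, flops, err in models:
            dominated = any(f <= flops and e <= err and (f, e) != (flops, err)
                            for _, f, e in models)
            if not dominated:
                front.append((name, flops, err))
        return front

    candidates = [("fcn_a", 1.0e9, 0.042), ("fcn_b", 2.5e8, 0.051),
                  ("fcn_c", 9.0e8, 0.060), ("fcn_d", 3.0e8, 0.048)]
    print(pareto_front(candidates))   # fcn_c is dominated by fcn_d and drops out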

Heterogeneous Scheduling of Deep Neural Networks for Low Power Real-Time Designs

Deep neural networks (DNNs) have become the go-to answer for a wide range of applications, but they often require computationally and memory-intensive solutions. This makes it challenging to deploy DNNs in embedded, real-time, low-power applications. We describe a domain-specific metric model for optimizing task deployment on heterogeneous systems. Next, we propose the DNN hardware accelerator SCALENet: a SCalable Low-power AccELerator for real-time deep neural Networks. Finally, we propose a heterogeneity-aware scheduler that uses these contributions to allocate a task to a resource by solving a numerical cost over a series of domain objectives. To demonstrate our contribution, we deploy nine modern deep networks in two different applications: image processing and biomedical seizure detection. Our proposed solution meets the computational requirements, adapts to multiple architectures, and lowers power by optimizing task-to-resource allocation. It decreases power consumption by 10% of the total system power while meeting the real-time deadlines. We exceed the energy efficiency of NVIDIA GPUs when evaluated against the Jetson TK1, with a 4x power savings in a power envelope of 2.0 Watts. When compared to existing FPGA-based accelerators, SCALENet's accelerator and heterogeneity-aware scheduler achieve a 4.8x improvement in energy efficiency.
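
A hypothetical sketch of the cost-based assignment rule described above: each (task, resource) pair receives a weighted numerical cost over the domain objectives, and the task goes to the cheapest resource. The objectives, weights, and numbers are invented.

    # Pick the resource with the lowest weighted cost across the objectives.
    def assign(task_metrics, weights):
        """task_metrics: resource -> {objective: value}, lower values are better."""
        def cost(metrics):
            return sum(weights[obj] * value for obj, value in metrics.items())
        return min(task_metrics, key=lambda r: cost(task_metrics[r]))

    metrics = {
        "cpu":  {"latency_ms": 12.0, "energy_mj": 40.0},
        "fpga": {"latency_ms": 3.5,  "energy_mj": 9.0},
    }
    print(assign(metrics, weights={"latency_ms": 1.0, "energy_mj": 0.5}))   # -> fpga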

Energy-efficient design of MTJ-based Neural Networks with Stochastic Computing

Hardware implementations of Artificial Neural Networks (ANNs) using conventional binary arithmetic units are computationally expensive, energy-intensive, and have large area overheads. Stochastic Computing (SC) is an emerging paradigm that replaces these conventional units with simple logic circuits and is particularly suitable for fault-tolerant applications. We propose an energy-efficient use of Magnetic Tunnel Junctions (MTJs), spintronic devices that exhibit probabilistic switching behavior, as Stochastic Number Generators (SNGs), which form the basis of our NN implementation in the SC domain. Further, the error resilience of target NN applications allows us to approximate the synaptic weights in our MTJ-based NN implementation, in ways enabled by the properties of the MTJ-SNG, to achieve energy efficiency. We design an algorithm that, given an error tolerance, performs such approximations in a single-layer NN in an optimal way owing to the convexity of the problem formulation. We then use this algorithm to develop a heuristic approach for approximating multi-layer NNs. Classification problems were evaluated on the optimized NNs, and the results showed substantial savings in energy for little loss in accuracy.
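
The stochastic-computing arithmetic that the MTJ-based SNGs would feed can be previewed with the sketch below: a software Bernoulli generator stands in for the probabilistic MTJ switching, and in unipolar SC, multiplying two values amounts to ANDing their bitstreams. The stream length and operand values are arbitrary.

    # Unipolar stochastic multiplication: AND two Bernoulli bitstreams.
    import numpy as np

    def sng(p, length, rng):
        """Stand-in stochastic number generator: each bit is 1 with probability p."""
        return rng.random(length) < p

    rng = np.random.default_rng(0)
    N = 4096                                  # longer streams trade energy for accuracy
    a, b = 0.6, 0.5                           # e.g., a synaptic weight and an activation
    product = sng(a, N, rng) & sng(b, N, rng)
    print(product.mean())                     # close to a * b = 0.30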

A Comparative Cross-Layer Study on Racetrack Memories: Domain Wall vs Skyrmion

Racetrack memory (RM), a storage scheme in which information flows along a nanotrack, has been considered a potential candidate for future high-density storage devices in place of the hard disk drive (HDD). The first RM technology, proposed in 2008 by IBM, relies on a train of opposite magnetic domains separated by domain walls (DWs) and is named DW-RM. After ten years of intensive research, a variety of fundamental advancements have been achieved; unfortunately, no product is available to date. On the other hand, new concepts might also be on the horizon. Recently, an alternative information carrier, the magnetic skyrmion, experimentally discovered in 2009, has been regarded as a promising replacement for the DW in RM, named skyrmion-based RM (SK-RM). Impressive advances have been made in observing, writing, manipulating, and deleting individual skyrmions. So what is the relationship between the DW and the skyrmion? What are the key differences between DW and skyrmion, or between DW-RM and SK-RM? What benefits could SK-RM bring, and what challenges need to be addressed before application? In this paper, we intend to answer these questions through a comparative cross-layer study of DW-RM and SK-RM. This work will provide guidelines, especially for circuit and architecture researchers working on RM.

Leveraging Side-channel Information for Disassembly and Security

With the rise of the Internet of Things (IoT), devices such as smartphones, embedded medical devices, and smart home appliances, as well as traditional computing platforms such as personal computers and servers, have been increasingly targeted by a variety of cyber attacks. Due to the limited hardware resources of embedded devices and the difficulty of wide-coverage and on-time software updates, software-only cyber defense techniques, such as traditional anti-virus and malware detectors, do not offer a silver-bullet solution. Hardware-based security monitoring and protection techniques have therefore gained significant attention. Monitoring devices using side-channel leakage information, e.g., power supply variation and electromagnetic (EM) radiation, is a promising avenue that supports multiple directions in security and trust applications. In this paper, we provide a taxonomy of hardware-based monitoring techniques against different cyber and hardware attacks, highlight the potential and unique challenges, and show how power-based side-channel instruction-level monitoring can offer suitable solutions to prevailing embedded device security issues. Further, we delineate approaches for future research directions.

Evaluating the Potential Applications of Quaternary Technologies for Approximate Computing

There exist extensive ongoing research efforts on emerging atomic scale technologies that have the potential to become an alternative to today's Complementary Metal-Oxide-Semiconductor (CMOS) technologies. A common feature among the investigated technologies is that of multi-value devices, in particular, the possibility of implementing quaternary logic and memory. However, for such multi-value devices to be used reliably, an increase in energy dissipation and operation time is required. Building on the principle of approximate computing, we present a set of circuits and memory based on multi-value devices where we can trade reliability against energy efficiency. Keeping the energy and time constraints constant, important data are encoded in a more robust binary format while error tolerant data are encoded in a quaternary format. We evaluate the potential benefit of the circuits and memory by embedding them in a conventional computer system on which we execute jpeg and sobel approximately. We achieve dynamic energy reductions of 10% and 13% for jpeg and sobel, respectively, and improve execution time by 38% for sobel, while maintaining an adequate output quality.
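
The mixed binary/quaternary encoding idea can be seen in the toy sketch below: an error-tolerant value packed as base-4 digits needs half as many digits as its binary form, while critical data would stay binary. Device-level reliability and energy are not modeled, and the example value is arbitrary.

    # Same value in binary (8 digits) and quaternary (4 digits).
    def to_base(value, base, ndigits):
        digits = []
        for _ in range(ndigits):
            digits.append(value % base)
            value //= base
        return digits[::-1]

    pixel = 173                      # an error-tolerant image sample
    print(to_base(pixel, 2, 8))      # [1, 0, 1, 0, 1, 1, 0, 1]
    print(to_base(pixel, 4, 4))      # [2, 2, 3, 1]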
