ACM Journal on

Emerging Technologies in Computing (JETC)

Latest Articles

Limit of Hardware Solutions for Self-Protecting Fault-Tolerant NoCs

We study the ultimate limits of hardware solutions for the self-protection strategies against permanent faults in networks on chips (NoCs). NoCs...

GARDENIA: A Graph Processing Benchmark Suite for Next-Generation Accelerators

This article presents the Graph Algorithm Repository for Designing Next-generation Accelerators (GARDENIA), a benchmark suite for studying irregular graph algorithms on massively parallel accelerators. Applications with limited control and data irregularity are the main focus of existing generic benchmarks for accelerators, while available graph...

Self-learnable Cluster-based Prefetching Method for DRAM-Flash Hybrid Main Memory Architecture

This article presents a novel prefetching mechanism for memory-intensive workloads used in...

Neural Network Classifiers Using a Hardware-Based Approximate Activation Function with a Hybrid Stochastic Multiplier

Neural networks are becoming prevalent in many areas, such as pattern recognition and medical...

Long Short-Term Memory Network Design for Analog Computing

We present an analog-integrated circuit implementation of long short-term memory network, which is compatible with digital CMOS technology. We have...


About JETC

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. Major economic and technical challenges are expected to impede the continued scaling of semiconductor devices. This has resulted in the search for alternate mechanical, biological/biochemical, nanoscale electronic, asynchronous and quantum computing and sensor technologies. 

Identification of Synthesis Approaches for IP/IC Piracy of Reversible Circuits

Reversible circuits employ a computational paradigm that is beneficial for several applications, including the design of encoding and decoding devices, low-power design, and emerging applications such as quantum computation. However, as with conventional logic, reversible circuits might be subject to Intellectual Property/Integrated Circuit piracy. To counteract such attacks, a detailed understanding of how to identify the target function of a reversible circuit is crucial. In contrast to conventional logic, the target function of a reversible circuit is (implicitly or explicitly) embedded into the circuit. Numerous synthesis solutions have been proposed for this purpose. To obtain the target function of a reversible circuit, it is crucial to know which synthesis approach was used to generate the circuit in the first place. In this work, we propose a machine learning-based scheme that determines the synthesis approach used from its telltale signs. Furthermore, we study the impact of optimizing the synthesis approaches on their telltale signs. Our analysis shows that the most-established synthesis approaches can be determined in the vast majority of cases even if optimized versions of them are applied. This motivates a thorough investigation of how to obfuscate corresponding designs.

Split Manufacturing Based Register Transfer Level Obfuscation

Fabrication-less integrated circuit (IC) design houses outsource fabrication to third-party foundries to reduce the cost of manufacturing. The outsourcing of IC fabrication has, beyond our expectation, raised concerns regarding intellectual property (IP) piracy and theft by rogue elements in the third-party foundries. Obfuscation techniques have been proposed to increase resistance to reverse engineering, IP recovery, IP theft, and piracy. However, prior work on obfuscation for IP protection has primarily been applied at the gate level or the layout level. As a result, such obfuscation can significantly impact the performance of the original design in addition to requiring redesign of standard cells. In this paper, we propose a high level synthesis and analysis (HLSA) based obfuscation approach for IP protection. The proposed method is based on split manufacturing. Additional dummy units and MUXes can be added to further obfuscate the design. The proposed technique aligns with standard-cell based design methodologies and does not significantly impact the performance of the original design. Our experimental results confirm that the proposed approach can provide high levels of IC obfuscation with moderate area cost.

Composable Probabilistic Inference Networks Using MRAM-based Stochastic Neurons

Magnetoresistive random access memory (MRAM) technologies with thermally unstable nanomagnets are leveraged to develop an intrinsic stochastic neuron as a building block for restricted Boltzmann machines (RBMs) to form deep belief networks (DBNs). The embedded MRAM-based neuron is modeled using precise physics equations. The simulation results exhibit the desired sigmoidal relation between the input voltages and probability of the output state. A probabilistic inference network simulator (PIN-Sim) is developed to realize a circuit-level model of an RBM utilizing resistive crossbar arrays along with differential amplifiers to implement the positive and negative weight values. The PIN-Sim is composed of five main blocks to train a DBN, evaluate its accuracy, and measure its power consumption. The MNIST dataset is leveraged to investigate the energy and accuracy tradeoffs of seven distinct network topologies. The software and hardware level simulations indicate that a 784x200x10 topology can achieve less than 5% error rates with ~400 pJ energy consumption. The error rates can be reduced to 2.5% by using a 784x500x500x500x10 DBN at the cost of ~10x higher energy consumption and significant area overhead. Finally, the effects of specific hardware-level parameters on power dissipation and accuracy tradeoffs are identified via the developed PIN-Sim framework.
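The sigmoidal relation between input voltage and output-state probability described above can be illustrated with a small software model. This is only a minimal sketch of a generic stochastic binary neuron, not the physics-based MRAM model or PIN-Sim itself; the function names and the `v_half` and `slope` parameters are hypothetical and chosen for illustration.

```python
import math
import random


def firing_probability(v_in, v_half=0.0, slope=1.0):
    """Sigmoidal probability that the neuron's output state is 1.

    v_half and slope are hypothetical fitting parameters, not device values.
    """
    x = (v_in - v_half) / slope
    # Numerically stable logistic function for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_neuron(v_in, rng, v_half=0.0, slope=1.0):
    """Draw one binary output state (0 or 1) for a given input voltage."""
    return 1 if rng.random() < firing_probability(v_in, v_half, slope) else 0


def estimate_probability(v_in, trials=10000, seed=0):
    """Estimate the output probability by repeated sampling."""
    rng = random.Random(seed)
    return sum(sample_neuron(v_in, rng) for _ in range(trials)) / trials
```

Averaging many samples at a fixed input recovers a point on the sigmoid, which is the property an RBM relies on when such neurons serve as its stochastic units.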

A mixed signal architecture for convolutional neural networks

Deep neural network (DNN) accelerators with improved energy and delay are desirable for meeting the requirements of hardware targeted for IoT and edge computing systems. Convolutional neural networks (CoNNs) are one of the most popular types of DNN architectures. This paper presents the design and evaluation of an accelerator for CoNNs. The system-level architecture is based on mixed-signal, cellular neural networks (CeNNs). Specifically, we present (i) the implementation of different layers, including convolution, ReLU, and pooling, in a CoNN using CeNN, (ii) modified CoNN structures with CeNN-friendly layers to reduce computational overheads typically associated with a CoNN, (iii) a mixed-signal CeNN architecture that performs CoNN computations in the analog and mixed signal domain, and (iv) design space exploration that identifies which CeNN-based algorithmic and architectural features fare best compared to existing algorithms and architectures when evaluated over common datasets -- MNIST and CIFAR-10. Notably, the proposed approach can lead to 8.7X improvements in energy-delay product (EDP) per digit classification for the MNIST dataset at iso-accuracy when compared with the state-of-the-art DNN engine, while our approach could offer 4.3X improvements in EDP when compared to other network implementations for the CIFAR-10 dataset.

Design and Multi-Abstraction Level Evaluation of a NoC Router for Mixed-Criticality Real-Time Systems

A Mixed Criticality System (MCS) combines real-time software tasks with different criticality levels. In an MCS, the criticality level specifies the level of assurance against system failure. For high-critical flows of messages, it is imperative to meet deadlines; otherwise the whole system might fail, leading to catastrophic results such as loss of life or serious damage to the environment. In contrast, low-critical flows may tolerate some delays. Furthermore, in an MCS, flow performance metrics such as the Worst Case Communication Time (WCCT) may vary depending on the criticality level of the applications. Execution platforms must therefore provide different operating modes for applications with different levels of criticality. Finally, in a Network-on-Chip (NoC), sharing resources between communication flows can lead to unpredictable latencies, which makes implementing an MCS on many-core architectures challenging. In this article, we propose and evaluate a new NoC router to support MCS based on an accurate WCCT analysis for high-critical flows. The proposed router, called DAS (Double Arbiter and Switching router), jointly uses Wormhole and Store And Forward communication techniques for low- and high-critical flows, respectively. It ensures that high-critical flows meet their deadlines while maximizing the bandwidth remaining for the low-critical flows.

Hardware-Software Co-design to Accelerate Neural Network Applications

Approximate computation is a viable method to save energy and increase performance by trading accuracy for energy. In this paper, we propose a novel approximate floating point multiplier, called CMUL, which significantly reduces energy and improves performance of multiplication at the expense of accuracy. Our design approximately models multiplication by replacing the most costly step of the operation with a lower energy alternative. In order to tune the level of approximation, CMUL dynamically identifies the inputs which will produce the largest approximation error and processes them in precise CMUL mode. In order to use CMUL for deep neural network (DNN) acceleration, we propose a framework which modifies the trained DNN model to make it suitable for approximate hardware. Our framework adjusts the DNN weights to a set of "potential weights" that are suitable for approximate hardware. Then, it compensates for the possible quality loss by iteratively retraining the network based on the existing constraints. Our evaluations on four DNN applications show that CMUL can achieve 60.3% energy efficiency improvement and 3.2X energy-delay product (EDP) improvement as compared to the baseline GPU, while ensuring less than 0.2% quality loss.

Spiking Neural Networks Hardware Implementations and Challenges: a Survey

Neuromorphic computing is now a major research field for both academic and industrial actors. As opposed to Von Neumann machines, brain-inspired processors aim to bring the memory and the computational elements closer together. They are designed to efficiently evaluate machine-learning algorithms. Recently, Spiking Neural Networks, a generation of cognitive algorithms employing computational primitives mimicking neuron and synapse operational principles, have become an important part of deep learning. They are expected to improve the computational performance and efficiency of neural networks, but are best suited for hardware able to support their temporal dynamics. In this survey, we present the state of the art of hardware implementations of spiking neural networks and the current trends in algorithm elaboration from model selection to training mechanisms. The scope of existing solutions is extensive; we thus present the general framework and study on a case-by-case basis the relevant particularities. We describe the strategies employed to leverage the characteristics of these event-driven algorithms at the hardware level and discuss their related advantages and challenges.

BigBus: A Scalable Optical Interconnect

This paper presents BigBus, a novel design of an on-chip photonic network for a 1024 node system. For such a large on-chip network, performance and power reduction are two mutually conflicting goals. This paper uses a combination of strategies to reduce static power consumption while simultaneously improving both performance and the energy-delay-squared (ED^2) product. The crux of the paper is to segment the entire system into smaller clusters of nodes, and adopt a hybrid strategy for each segment that includes conventional laser modulation, as well as a novel technique for sharing power across nodes dynamically. We represent energy internally as tokens, where one token will allow a node to send a message to any other node in its cluster. We allow optical stations to arbitrate for tokens at a global level, and then we predict the number of token equivalents of power that the off-chip laser needs to generate. Using these techniques, BigBus outperforms other competing proposals. We demonstrate a speedup of 14-34% over state-of-the-art proposals and a 20-61% reduction in ED^2.
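For readers unfamiliar with the ED^2 metric quoted above, the following minimal sketch shows how it is computed and how a relative reduction is derived. The function names and the numbers in the usage note are illustrative assumptions only and are not taken from the BigBus evaluation.

```python
def ed2_product(energy, delay):
    """Energy-delay-squared product: energy multiplied by delay squared."""
    return energy * delay ** 2


def relative_reduction(baseline, improved):
    """Fractional reduction of a metric relative to its baseline value."""
    return (baseline - improved) / baseline


# Hypothetical example: same energy, delay reduced from 2.0 to 1.6 time units.
baseline_ed2 = ed2_product(energy=10.0, delay=2.0)
improved_ed2 = ed2_product(energy=10.0, delay=1.6)
reduction = relative_reduction(baseline_ed2, improved_ed2)
```

Because delay enters quadratically, the metric rewards speedups more strongly than equal-sized energy savings, which is why it is often preferred for performance-oriented interconnects.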

Neuromemristive Architecture of HTM with On-Device Learning and Neurogenesis

Hierarchical temporal memory (HTM) is a biomimetic sequence memory algorithm that holds promise for invariant representations of spatial and spatiotemporal inputs. This paper presents a comprehensive neuromemristive crossbar architecture for the spatial pooler (SP) and the sparse distributed representation classifier, which are fundamental to the algorithm. There are several unique features in the proposed architecture that tightly link with the HTM algorithm. A memristor that is suitable for emulating the HTM synapses is identified and a new z-window function is proposed. The architecture exploits the concept of synthetic synapses to enable potential synapses in the HTM. The crossbar for the SP avoids dark spots caused by unutilized crossbar regions and supports rapid on-chip training within 2 clock cycles. This research also leverages plasticity mechanisms such as neurogenesis and homeostatic intrinsic plasticity to strengthen the robustness and performance of the SP. The proposed design is benchmarked for image recognition tasks using MNIST and Yale faces datasets, and is evaluated using different metrics including entropy, sparseness, and noise robustness. Detailed power analysis at different stages of the SP operations is performed to demonstrate the suitability for mobile platforms.

Guest Editors' Introduction: Emerging Networks-on-Chip Designs, Technologies, and Applications

MiC: Multi-level Characterization and Optimization of GPGPU Kernels

GPGPU computing has recently enjoyed increasing popularity in new computing paradigms, such as IoT. GPUs hold great potential for providing effective solutions for big data analytics as the demand for processing large quantities of data in real time increases. However, the pervasive presence of GPUs on mobile devices presents great challenges for GPGPU, mainly because GPGPU integrates a large number of processor arrays and concurrently executing threads (up to hundreds of thousands). In particular, the root causes of performance loss in a GPGPU program cannot be easily revealed by current approaches. In this article, we propose MiC (Multi-level Characterization), a framework that comprehensively characterizes GPGPU kernels at the instruction, Basic Block (BBL), and thread levels. Using mechanisms devised for each level, we characterize GPGPU kernels across all of these levels. We identify characteristics that distinguish GPGPU kernels from CPU workloads at the instruction and basic block levels. Using our proposed approaches, we can also characterize branch divergence visually, enabling finer analysis of thread behaviors in GPGPU kernels. Additionally, we show an optimization case for a GPGPU kernel based on the bottleneck identified through its characterization result.

Hypercolumn Sparsification for Low-Power Convolutional Neural Networks

We provide here a novel method, called hypercolumn sparsification, to achieve high recognition performance for convolutional neural networks (CNNs) despite low-precision weights and activities during both training and test phases. It operates on the stack of feature maps in each of the cascading feature matching and pooling layers through the processing hierarchy of the CNN by an explicit competitive process (k-WTA: winner take all) that generates a sparse feature vector at each spatial location. This principle is inspired by local brain circuits, where neurons tuned to respond to different patterns in the incoming signals from an upstream region inhibit each other using interneurons, such that only the ones that are maximally activated survive the quenching threshold. We show this process of sparsification is critical for probabilistic learning of low-precision weights and bias terms, thereby making pattern recognition amenable for energy-efficient hardware implementations. Further, we show that hypercolumn sparsification could lead to more data-efficient learning as well as having an emergent property of significantly pruning down the number of connections in the network. A theoretical account and empirical analysis are provided to understand these effects better.
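The k-WTA competition across the stack of feature maps can be sketched as follows. This is a minimal pure-Python illustration of a generic k-winner-take-all step, assuming feature maps are given as nested lists indexed `[channel][row][column]`; the function name is hypothetical, and the sketch does not model the quenching threshold, the low-precision weights, or the probabilistic learning described in the abstract.

```python
def kwta_hypercolumns(feature_maps, k):
    """Keep the k largest activations across channels at each spatial location.

    feature_maps: nested lists indexed as [channel][row][column].
    Returns a new structure of the same shape with all non-winning
    activations set to 0.0.
    """
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    sparse = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            # The "hypercolumn": one activation per feature map at (y, x).
            column = [feature_maps[c][y][x] for c in range(channels)]
            winners = sorted(range(channels), key=lambda c: column[c],
                             reverse=True)[:k]
            for c in winners:
                sparse[c][y][x] = column[c]
    return sparse


# Three 1x2 feature maps; with k=1 only the strongest channel survives
# at each of the two spatial locations.
maps = [[[0.2, 0.9]], [[0.7, 0.1]], [[0.5, 0.4]]]
sparse_k1 = kwta_hypercolumns(maps, k=1)
```

With `k=1` the output has exactly one nonzero entry per spatial location, which is the sparse feature vector the abstract refers to.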

Thermal-aware Test Scheduling Strategy for Network-on-Chip based Systems

Rapid progress in technology scaling makes transistors smaller and faster over successive generations, and consequently the core count in a system increases. A flexible and scalable packet-switched architecture, the Network-on-Chip (NoC), is commonly used for communication among the cores. To test such a system, the NoC is reused as a test delivery mechanism. This work proposes a preemptive test scheduling technique for NoC-based systems to reduce the test time by minimizing network resource conflicts. The preemptive test scheduling problem is formulated using Integer Linear Programming (ILP). Thermal safety during testing is a particularly challenging problem, especially for three-dimensional NoCs (3D NoCs). In this paper, we also present a thermal-aware scheduling technique to test cores in 2D as well as 3D stacked NoCs, using a Particle Swarm Optimization (PSO) based approach. To reduce test time further, several augmentations, such as Inversion Mutation, efficient random number generation, and multiple PSO operations, have been incorporated into the basic PSO. Experimental results highlight the effectiveness of the proposed method in reducing test time under power constraints and in achieving a tradeoff between test time and peak temperature.
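Inversion mutation, one of the augmentations named above, is a standard permutation operator that reverses a randomly chosen segment of a candidate ordering. The sketch below assumes a candidate schedule is encoded as a list of core identifiers in test order; that encoding and the function names are illustrative assumptions, and how the operator is integrated into the authors' PSO is not specified in the abstract.

```python
import random


def invert_segment(order, i, j):
    """Return a copy of order with the slice from index i to j (inclusive) reversed."""
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]


def inversion_mutation(order, rng):
    """Reverse a randomly chosen segment of a permutation.

    order: list of core identifiers in test order (hypothetical encoding).
    rng: a random.Random instance, passed in for reproducibility.
    """
    if len(order) < 2:
        return list(order)
    # Pick two distinct positions and reverse everything between them.
    i, j = sorted(rng.sample(range(len(order)), 2))
    return invert_segment(order, i, j)


schedule = [0, 1, 2, 3, 4, 5]
mutated = inversion_mutation(schedule, random.Random(42))
```

The result is always a valid permutation of the input, so every core is still tested exactly once after mutation.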

Time-randomized Wormhole NoCs for Critical Applications

Wormhole-based NoCs (wNoCs) are widely accepted in high-performance domains as the most appropriate solution to interconnect an increasing number of cores in the chip. However, the suitability of wNoCs in the context of critical real-time applications has not been demonstrated yet. In this paper, in the context of probabilistic timing analysis (PTA), we propose a PTA-compatible wNoC design that provides tight time-composable contention bounds. The proposed wNoC design builds on PTA's ability to reason in probabilistic terms about hardware events impacting execution time (e.g., wNoC contention), discarding those sequences of events occurring with a negligibly low probability. This allows our wNoC design to deliver improved guaranteed performance w.r.t. conventional time-deterministic setups. Our results show that performance guarantees of applications running on top of probabilistic wNoC designs improve by 40% and 93% on average for 4x4 and 6x6 wNoC setups, respectively.

Trained Biased Number Representation for ReRAM-Based Neural Network Accelerators
