
ACM Journal on Emerging Technologies in Computing Systems (JETC)

Latest Articles

Deep Neural Network Optimized to Resistive Memory with Nonlinear Current-Voltage Characteristics

Artificial Neural Network computation relies on intensive vector-matrix multiplications. Recently,... (more)

Energy-Efficient Neural Computing with Approximate Multipliers

Neural networks, with their remarkable ability to derive meaning from a large volume of complicated or imprecise data, can be used to extract patterns... (more)

Real-Time and Low-Power Streaming Source Separation Using Markov Random Field

Machine learning (ML) has revolutionized a wide range of recognition tasks, ranging from text analysis to speech to vision, most notably in cloud... (more)

A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks

FPGA-based hardware accelerators for convolutional neural networks (CNNs) have received attention due to their higher energy efficiency than GPUs.... (more)

A Study of Complex Deep Learning Networks on High-Performance, Neuromorphic, and Quantum Computers

Current deep learning approaches have been very successful using convolutional neural networks trained on large graphical-processing-unit-based... (more)

Silicon Photonics for Computing Systems

A Learning-Based Thermal-Sensitive Power Optimization Approach for Optical NoCs

Optical networks-on-chip (NoCs) based on silicon photonics have been proposed as emerging on-chip communication architectures for chip multiprocessors... (more)

A Process-Variation-Tolerant Method for Nanophotonic On-Chip Network

Nanophotonic networks, a potential candidate for future networks on-chip, have been challenged for their reliability due to several device-level... (more)

Reducing Power Consumption of Lasers in Photonic NoCs through Application-Specific Mapping

To face the complex communication problems that arise as the number of on-chip components grows, photonic networks-on-chip (NoCs) have been... (more)


About JETC

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. Major economic and technical challenges are expected to impede the continued scaling of semiconductor devices. This has resulted in the search for alternate mechanical, biological/biochemical, nanoscale electronic, asynchronous and quantum computing and sensor technologies. 

IMFlexCom: Energy Efficient In-memory Flexible Computing using Dual-mode SOT-MRAM

In this paper, we propose an In-Memory Flexible Computing platform (IMFlexCom) using a novel Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array architecture that works in dual mode: a memory mode, acting as non-volatile memory, and a computing mode, implementing reconfigurable logic (AND/OR/XOR) within the memory array. Such intrinsic in-memory logic can process data within memory, greatly reducing the power-hungry, long-distance massive data movement of conventional von Neumann computing systems. We further employ bulk bitwise vector operations and a data encryption engine based on the Advanced Encryption Standard (AES) as case studies to investigate the performance of the proposed in-memory computing architecture. Our design shows ~35× energy savings and ~18× speedup for bulk bitwise in-memory vector AND/OR operations compared to DRAM-based in-memory logic. Moreover, the proposed design achieves 77.27% and 85.4% lower energy consumption than CMOS-ASIC and CMOL-based AES implementations, respectively, and offers nearly the same energy consumption as a recent DW-AES implementation with 66.7% less area overhead.
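The dual-row sensing idea behind such in-memory bitwise logic can be sketched in a few lines. This is a behavioral illustration only, not the paper's circuit: two rows are activated together, and the summed read "current" is compared against sense-amplifier references; the unit-current encoding and thresholds below are assumptions for illustration.

```python
import numpy as np

def in_memory_bitwise(row_a: np.ndarray, row_b: np.ndarray):
    """Behavioral model: activate two rows at once and threshold the summed current."""
    total = row_a.astype(int) + row_b.astype(int)  # 0, 1, or 2 units of read current
    out_and = total == 2   # both cells high
    out_or = total >= 1    # at least one cell high
    out_xor = total == 1   # exactly one cell high
    return out_and, out_or, out_xor

a = np.array([1, 0, 1, 0])
b = np.array([1, 1, 0, 0])
print(in_memory_bitwise(a, b))  # AND, OR, XOR of the two stored rows
```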

An FPGA Implementation of a Time Delay Reservoir Using Stochastic Logic

This paper presents and demonstrates a stochastic logic time delay reservoir design in FPGA hardware. The reservoir network is analyzed using a number of metrics, such as kernel quality, generalization rank, and performance on simple benchmarks, and is also compared to a deterministic design. A novel re-seeding method is introduced to reduce the adverse effects of stochastic noise; it may also be applied in other stochastic logic reservoir computing designs, such as echo state networks. Benchmark results indicate that the proposed design performs well on noise-tolerant classification problems, but more work is needed to improve the stochastic logic time delay reservoir's robustness for regression problems.
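For readers unfamiliar with stochastic logic, the primitive the design builds on can be shown in a short sketch (a general stochastic-computing illustration under simplified assumptions, not the paper's FPGA design): values in [0, 1] are encoded as Bernoulli bitstreams, so a single AND gate multiplies two values, and fixing the generator seed (cf. re-seeding) makes the stochastic noise repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)  # re-seeding the generator makes the noise repeatable

def to_stream(p: float, n: int = 4096) -> np.ndarray:
    """Encode a value p in [0, 1] as a Bernoulli bitstream of length n."""
    return rng.random(n) < p

a, b = 0.6, 0.5
product = (to_stream(a) & to_stream(b)).mean()  # AND of independent streams ~ a * b
print(product)  # close to 0.30, up to stochastic noise
```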

A Multi-Level Optimization Framework for FPGA-Based Cellular Neural Network Implementation

The Cellular Neural Network (CeNN) is considered a powerful paradigm for embedded devices. Its analog and mixed-signal hardware implementations have proved applicable to high-speed image processing, video analysis, and medical signal processing, though their efficiency and popularity are limited by small implementation sizes and low precision. Recently, digital implementations of CeNNs on FPGAs have attracted researchers from both academia and industry due to their high flexibility and short time-to-market. However, most existing implementations are not well optimized to fully utilize the advantages of the FPGA platform, retaining unnecessary design and computational redundancy that prevents speedup. We propose a multi-level optimization framework for energy-efficient CeNN implementations on FPGAs. In particular, the framework features optimizations at three levels: system, module, and design space, focusing on reducing computational redundancy and maximizing attainable performance. Experimental results show that with various configurations our framework can achieve an energy-efficiency improvement of 3.54× and up to 3.88× speedup compared with existing implementations of similar accuracy.

Hardware Trojan Detection Using the Order of Path Delay

Many fabrication-less design houses outsource their designs to third-party foundries for fabrication to lower costs. This IC development process, however, raises serious security concerns about Hardware Trojans (HTs). In this paper, for the first time, we propose a two-phase technique that uses the order of path delays within path pairs to detect HTs. In the design phase, a full-cover path set that covers all the nets of the design is generated; within this set, the relative order of the paths in each pair is determined according to their delays. The order of the paths in path pairs serves as the fingerprint of the design. In the test phase, the actual delays of the paths in the full-cover set are extracted from the fabricated circuits, and the order of paths in path pairs is compared with the fingerprint generated in the design phase. A mismatch between them indicates the existence of Trojan circuits. Both process variations and measurement noise are taken into consideration. The efficiency and accuracy of the proposed technique are confirmed by a series of experiments examining both the path-pair violations incurred by HTs and the false-alarm rate.
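The two-phase comparison can be illustrated with a minimal sketch. This is a behavioral illustration of the idea rather than the paper's implementation; the `margin` parameter is an assumed guard band standing in for the paper's treatment of process variation and measurement noise.

```python
# Phase 1: record which path of each pair is faster in the golden design.
def fingerprint(design_delay: dict, pairs: list) -> dict:
    return {(a, b): design_delay[a] < design_delay[b] for a, b in pairs}

# Phase 2: an order that flips beyond the guard band flags a possible Trojan.
def violated_pairs(measured_delay: dict, fp: dict, margin: float = 0.05) -> list:
    flags = []
    for (a, b), a_faster in fp.items():
        diff = measured_delay[b] - measured_delay[a]
        if a_faster and diff < -margin:        # a was faster, now clearly slower
            flags.append((a, b))
        elif not a_faster and diff > margin:   # b was faster, now clearly slower
            flags.append((a, b))
    return flags

fp = fingerprint({"p1": 1.0, "p2": 1.4}, [("p1", "p2")])
print(violated_pairs({"p1": 1.5, "p2": 1.3}, fp))  # [('p1', 'p2')]: order flipped
```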

Memristor-CMOS Analog Co-Processor for Acceleration of High Performance Computing Applications

Vector-matrix multiplication underlies major applications in machine vision, deep learning, and scientific simulation. These applications require high computational speed and run on platforms that are size-, weight-, and power-constrained. With transistor scaling coming to an end, existing digital hardware architectures will not be able to meet this increasing demand. Analog computation, with its rich set of primitives and inherently parallel architecture, can be faster, more efficient, and more compact for some of these applications. One such primitive is vector-matrix multiplication on a memristor-CMOS crossbar array. In this paper, we develop a memristor-CMOS analog co-processor architecture that can handle floating-point computation. To demonstrate the working of the analog co-processor at the system level, we use a new electronic design automation tool called PSpice Systems Option, which performs integrated co-simulation of MATLAB/Simulink and PSpice. The analog co-processor shows superior performance compared to other processors, with a speedup of up to 12x over projected GPU performance. Using the PSpice Systems Option tool, application simulations for image processing and the solution of partial differential equations are performed on the analog co-processor model.
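The crossbar primitive itself is compact enough to model in a few lines. The sketch below assumes an idealized array (no wire resistance, sneak paths, or device nonlinearity) and omits the extra mapping needed for signed and floating-point values: by Ohm's law and Kirchhoff's current law, each output column current is the dot product of the input voltages with that column's conductances.

```python
import numpy as np

def crossbar_vmm(G: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Idealized crossbar: G[k, j] is the conductance (siemens) at row k, column j;
    v[k] is the voltage (volts) driven on row k; returns column currents (amps)."""
    return G.T @ v  # i[j] = sum_k G[k, j] * v[k]

G = np.array([[1e-4, 2e-4],
              [3e-4, 4e-4]])  # programmed memristor conductances
print(crossbar_vmm(G, np.array([0.5, 0.2])))  # [1.1e-4, 1.8e-4] A
```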

Design and Multi-Abstraction Level Evaluation of a NoC Router for Mixed-Criticality Real-Time Systems

A Mixed Criticality System (MCS) combines real-time software tasks with different criticality levels. In an MCS, the criticality level specifies the level of assurance against system failure. For high-critical message flows, it is imperative to meet deadlines; otherwise the whole system might fail, leading to catastrophic results such as loss of life or serious damage to the environment. In contrast, low-critical flows may tolerate some delays. Furthermore, in an MCS, flow performance metrics such as the Worst Case Communication Time (WCCT) may vary depending on the criticality level of the applications, so execution platforms must provide different operating modes for applications with different levels of criticality. Finally, in a Network-on-Chip (NoC), sharing resources between communication flows can lead to unpredictable latencies, which makes the implementation of an MCS on many-core architectures challenging. In this article, we propose and evaluate a new NoC router to support MCS based on an accurate WCCT analysis for high-critical flows. The proposed router, called DAS (Double Arbiter and Switching router), jointly uses Wormhole and Store-and-Forward communication techniques for low- and high-critical flows, respectively. It ensures that high-critical flows meet their deadlines while maximizing the bandwidth remaining for low-critical flows.

T-count and Qubit Optimized Quantum Circuit Design of the Non-Restoring Square Root Algorithm

Quantum circuits for basic mathematical functions such as the square root are required to implement scientific computing algorithms on quantum computers. Quantum circuits based on Clifford+T gates can be made fault tolerant, but the T gate is very costly to implement. As a result, reducing the T-count has become an important optimization goal. Further, quantum circuits with many qubits are difficult to realize, making designs that save qubits and produce no garbage outputs desirable. In this work, we present a T-count optimized quantum square root circuit with only 2·n+1 qubits and no garbage output. For a fair comparison against existing work, Bennett's garbage removal scheme is used to remove garbage outputs from existing designs. We determined that the proposed design achieves average T-count savings of 40.91%, 98.88%, 39.25%, and 26.11%, as well as qubit savings of 85.46%, 95.16%, 90.59%, and 86.77%, compared to existing work.
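For reference, the classical non-restoring square root algorithm that such a circuit realizes reversibly can be sketched in plain Python (this shows the arithmetic only, not the quantum circuit): the root is built one bit at a time, and a negative remainder is carried forward and corrected on the next iteration instead of being restored.

```python
def nonrestoring_isqrt(d: int, n: int) -> int:
    """Classical non-restoring square root: floor(sqrt(d)) for a 2n-bit radicand d."""
    q, r = 0, 0                                # root and (possibly negative) remainder
    for i in range(n - 1, -1, -1):
        r = (r << 2) | ((d >> (2 * i)) & 3)    # bring down the next two radicand bits
        if r >= 0:
            r -= (q << 2) | 1                  # trial subtract 4q + 1
        else:
            r += (q << 2) | 3                  # non-restoring correction: add 4q + 3
        q = (q << 1) | (r >= 0)                # root bit is 1 iff remainder stayed >= 0
    return q

print(nonrestoring_isqrt(16, 3), nonrestoring_isqrt(10, 2))  # 4 3
```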

STDP-based Unsupervised Feature Learning using Convolution-over-time in Spiking Neural Networks for Energy-Efficient Neuromorphic Computing

Brain-inspired learning models attempt to mimic the computations performed by the neurons and synapses constituting the human brain to achieve its efficiency in cognitive tasks. In this work, we propose spike-timing-dependent plasticity (STDP) based unsupervised feature learning in a Convolutional Spiking Neural Network (SNN). We use shared weight kernels that are trained to encode representative features underlying the input patterns, thereby improving the sparsity as well as the robustness of the learning model. We show that the proposed Convolutional SNN self-learns several visual categories for object recognition with fewer training patterns than the traditional fully-connected SNN while yielding competitive accuracy. Further, we present an energy-efficient implementation of the Convolutional SNN using a crossbar array of spintronic synapses. Our system-level simulation indicates that the Convolutional SNN offers up to 9.3× reduction in the energy consumption per training pattern compared to the fully-connected SNN.
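The STDP rule driving such unsupervised learning can be sketched in its standard pair-based form (a textbook illustration with assumed constants, not this paper's exact kernel-update rule): a synapse is strengthened when the pre-synaptic spike precedes the post-synaptic spike and weakened otherwise, with exponential dependence on the timing difference.

```python
import numpy as np

def stdp_update(w: float, t_pre: float, t_post: float,
                a_plus: float = 0.01, a_minus: float = 0.012,
                tau: float = 20.0) -> float:
    """Pair-based STDP; times in ms, weight clipped to [0, 1]."""
    dt = t_post - t_pre
    if dt >= 0:
        w += a_plus * np.exp(-dt / tau)   # causal pairing: potentiate
    else:
        w -= a_minus * np.exp(dt / tau)   # anti-causal pairing: depress
    return float(np.clip(w, 0.0, 1.0))

print(stdp_update(0.5, t_pre=10.0, t_post=15.0))  # slightly above 0.5
```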

Semi-Trained Memristive Crossbar Computing Engine with In-Situ Learning Accelerator

On-device learning has gained significant attention recently as it offers local data processing, which ensures user privacy and low power consumption, especially on mobile devices and energy-constrained platforms. This paper proposes on-device training circuitry for threshold-current memristors integrated in a crossbar structure. Furthermore, it investigates alternate approaches for mapping synaptic weights onto the memristive crossbar, thereby realizing a simplified neuromemristive system. The proposed design is studied within the context of the extreme learning machine (ELM), using the delta-rule learning algorithm to train the output layer. The network is implemented using the IBM 65 nm technology node and verified in the Cadence Spectre environment. The hardware model is verified for classification with binary and multi-class datasets. The total power for a single 4×4 layer network is estimated to be ~29.62 µW, while the area is estimated to be 26.48 µm × 22.35 µm.
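In an ELM, only the output layer is trained; the delta rule used here reduces to a simple gradient step on the output error. The sketch below is a software illustration of that training loop under assumed hyperparameters, not the paper's circuit-level implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_output_layer(H: np.ndarray, T: np.ndarray,
                       lr: float = 0.1, epochs: int = 200) -> np.ndarray:
    """Delta-rule training of the ELM output layer.
    H: hidden-layer activations (fixed random input weights), T: target outputs."""
    W = rng.uniform(-0.1, 0.1, (H.shape[1], T.shape[1]))
    for _ in range(epochs):
        error = T - H @ W               # output error
        W += lr * H.T @ error / len(H)  # delta rule: dW proportional to input * error
    return W

H = rng.random((32, 4))                 # toy hidden activations (4 hidden nodes)
T = (H.sum(axis=1, keepdims=True) > 2).astype(float)
W = train_output_layer(H, T)
print(((H @ W > 0.5) == T).mean())      # training accuracy on the toy data
```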

DFR: An Energy-efficient Analog Delay Feedback Reservoir Computing System for Brain-inspired Computing

Confronting the explosive escalation of data density, von Neumann computing systems, which compute and store data in separate locations, have reached their computational bottleneck. As an emerging computing paradigm, reservoir computing is inspired by the working mechanism of mammalian brains and has proven beneficial in multifaceted applications. In this work, we designed and fabricated an energy-efficient analog delayed feedback reservoir (DFR) computing system, which embeds a temporal encoding scheme, a nonlinear transfer function, and a dynamic delayed feedback loop. Measurement results demonstrate its high energy efficiency with rich dynamic behaviors, and its working mechanism closely mimics the behavior of biological neurons. The system's performance and robustness are studied and analyzed through Monte Carlo simulation. The proposed DFR computing system conceptually evolves the training mechanism and computing architecture; it nonlinearly projects input patterns onto higher-dimensional spaces for subsequent classification while operating at the edge of the chaos region, with merely 526 µW of power consumption. To the best of our knowledge, this work represents the first analog integrated circuit (IC) implementation of a DFR computing system.
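The DFR structure, one nonlinear node plus a delay line, time-multiplexed into N virtual nodes, can be sketched in software. This is a generic single-node DFR model with assumed parameters (tanh nonlinearity, mask values, gains), not the fabricated analog circuit.

```python
import numpy as np

def run_dfr(inputs, N: int = 50, eta: float = 0.8, gamma: float = 0.1):
    """Delayed-feedback reservoir: x[k] = tanh(eta * x[k - N] + gamma * mask[i] * u),
    where the N virtual nodes share one nonlinear node via time multiplexing."""
    mask = np.random.default_rng(1).choice([-1.0, 1.0], size=N)  # temporal input mask
    x = np.zeros(N + len(inputs) * N)   # first N entries: initial delay-line state
    for t, u in enumerate(inputs):
        for i in range(N):
            k = N + t * N + i
            x[k] = np.tanh(eta * x[k - N] + gamma * mask[i] * u)
    return x[N:].reshape(len(inputs), N)  # one N-dim reservoir state per input sample

states = run_dfr(np.sin(np.linspace(0, 6, 100)))
print(states.shape)  # (100, 50): high-dimensional projection for a linear readout
```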

Efficient Hardware Implementation of Cellular Neural Networks with Incremental Quantization and Early Exit

Cellular neural networks (CeNNs) have been widely adopted in image processing tasks. Recently, various hardware implementations of CeNNs have emerged in the literature, with FPGAs being one of the most popular choices due to their high flexibility and low time-to-market. However, CeNNs typically involve extensive computations in a recursive manner. As an example, simply processing an image of 1920x1080 pixels requires 4-8 giga floating-point multiplications, which must be done in a timely manner for real-time applications. To address this issue, in this paper we propose a compressed CeNN framework for efficient FPGA implementations. It involves techniques such as incremental quantization and early exit, which significantly reduce computation demands while maintaining acceptable performance. While similar concepts have been explored in hardware implementations of Convolutional Neural Networks (CNNs), CeNNs have completely different computation patterns, which require different quantization and implementation strategies. Experimental results on FPGAs show that incremental quantization and early exit can achieve speedups of up to 7.8x and 8.3x, respectively, compared with state-of-the-art implementations, with almost no performance loss on four widely adopted applications. We also discover that, unlike CNNs, the optimal quantization strategies of CeNNs depend heavily on the application.
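The quantization step at the heart of such schemes can be illustrated briefly. This sketch shows generic power-of-two quantization and a simple magnitude-based partition; the exponent range and the half-at-a-time schedule are assumptions for illustration, not the paper's exact procedure. Snapping weights to ±2^k turns multiplications into cheap shifts on the FPGA.

```python
import numpy as np

def quantize_pow2(w: np.ndarray, min_exp: int = -6, max_exp: int = 0) -> np.ndarray:
    """Snap each weight to the nearest power of two within [2^min_exp, 2^max_exp]."""
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp)
    return sign * 2.0 ** exp

def largest_half_mask(w: np.ndarray) -> np.ndarray:
    """Incremental schedule: quantize the largest-magnitude half first,
    retrain the remainder, then repeat (only the partitioning is shown)."""
    return np.abs(w) >= np.median(np.abs(w))

w = np.array([0.3, -0.07, 0.55])
print(quantize_pow2(w))  # [ 0.25 -0.0625  0.5 ]
```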

Efficient Memristor based Architecture for Intrusion Detection and High Speed Packet Classification

Deep packet inspection (DPI) is a critical component of intrusion detection, requiring a detailed analysis of each network packet header and body. Although this is often done on dedicated high-power servers in most networked systems, mobile systems could be vulnerable to attack when used on an unprotected network. In this case, having DPI hardware on the mobile system would be highly beneficial. Unfortunately, DPI hardware is generally area- and power-consuming, making it difficult to implement in mobile systems. We developed a memristor crossbar based approach, inspired by memristor crossbar neuromorphic circuits, for a low-power, low-area, and high-throughput DPI system that examines both the header and body of a packet. Two key types of circuits are presented: static pattern matching and regular expression circuits. This system reduces execution time and power consumption thanks to its high-density grid and massive parallelism. Independent searches are performed using low-power memristor crossbar arrays, giving rise to a throughput of 390 Gbps for minimum-size packets (40 B) with no loss in classification accuracy. The memristor crossbar consumes no static power and 0.00336 mW of dynamic power per Snort header rule.
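A common way to do static pattern matching on a crossbar, sketched below as a behavioral model with an assumed bit/complement encoding (not necessarily this paper's circuit), is to drive each input bit and its complement onto row lines so that a stored pattern row "fires" only when its summed column current equals the pattern length, i.e., every bit matches.

```python
import numpy as np

def match_patterns(packet_bits: np.ndarray, patterns: np.ndarray) -> np.ndarray:
    """Rows of `patterns` are stored bit patterns; returns True for exact-match rows."""
    x = np.concatenate([packet_bits, 1 - packet_bits])    # bit and complement lines
    G = np.concatenate([patterns, 1 - patterns], axis=1)  # programmed conductances
    scores = G @ x                                        # per-row output current
    return scores == patterns.shape[1]                    # full match: all bits agree

pkt = np.array([1, 0, 1, 1])
pats = np.array([[1, 0, 1, 1],
                 [1, 1, 1, 1]])
print(match_patterns(pkt, pats))  # [ True False ]
```

All stored rules are evaluated in parallel in one array read, which is where the throughput of this approach comes from.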

Fault-Tolerant Network-on-Chip Design with Flexible Spare Core Placement

Network-on-Chip (NoC) has been proposed as a promising solution to the communication challenges of System-on-Chip (SoC) design in nanoscale technologies. With the increased integration density of Intellectual Property (IP) cores in a single chip, heat dissipation increases, which makes the system unreliable. Therefore, efficient fault-tolerant methods are necessary at different levels to improve system performance and keep it operating normally. This paper presents a flexible spare-core placement technique for mesh-based NoCs. An Integer Linear Programming (ILP) based solution is proposed for the spare-core placement problem, along with a Particle Swarm Optimization (PSO) based meta-heuristic for the same problem. Experiments have been performed on several application benchmarks reported in the literature. Comparisons between our approach and approaches from the literature have been carried out (i) by varying the network size with a fixed fault percentage in the network, and (ii) by fixing the network size while varying the percentage of faults in the network. We have also compared overall communication cost and CPU runtime between the ILP and PSO approaches. The results show significant reductions in overall communication cost and in dynamic simulation results across all cases using our approach compared to those reported in the literature.

Design Space Exploration of 3D Network-on-Chip: A Sensitivity-based Optimization Approach

High-performance and energy-efficient Network-on-Chip (NoC) architectures are among the crucial components of manycore processing platforms. A very promising NoC architecture recently proposed in the literature is the three-dimensional small-world NoC (3D SWNoC). Due to the short vertical links in 3D integration and the robustness of small-world networks, the 3D SWNoC architecture outperforms its other 3D counterparts. However, the performance of the 3D SWNoC is highly dependent on the placement of the links and associated routers. In this paper, we propose a sensitivity-based link placement algorithm (SEN) to optimize the performance of the 3D SWNoC. We compare the performance of the SEN algorithm with simulated annealing (SA) and a recently proposed machine-learning-based (ML) optimization algorithm. The optimized 3D SWNoC obtained by the proposed SEN algorithm achieves, on average, 11.5% and 13.6% lower latency and 18.4% and 21.7% lower energy-delay product than those optimized by the SA and ML algorithms, respectively. In addition, the SEN algorithm is 26 to 33 times faster than the SA algorithm for the optimization of 64-, 128-, and 256-core 3D SWNoC designs. However, we find that the ML-based methodology converges faster than SEN and SA for bigger systems.

Power, Performance, and Area Benefit of Monolithic 3D ICs for On-Chip Deep Neural Networks Targeting Speech Recognition

In recent years, deep learning has become widespread for various real-world recognition tasks. In addition to recognition accuracy, energy efficiency and performance are another grand challenge for enabling local intelligence in edge devices. In this paper, we investigate the adoption of monolithic 3D IC (M3D) technology for deep learning hardware design, using speech recognition as a test vehicle. M3D has recently proven to be one of the leading contenders for addressing the power, performance, and area (PPA) scaling challenges in advanced technology nodes. Our study encompasses the influence of key parameters of DNN hardware implementations on their performance and energy efficiency, including DNN architectural choices, underlying workloads, and tier partitioning choices in M3D. Our post-layout M3D designs, together with hardware-efficient sparse algorithms, produce power savings and performance improvements beyond what can be achieved using conventional 2D ICs. Experimental results show that M3D offers 22.3% iso-performance power savings and a 6.2% performance improvement, convincingly demonstrating its potential as a solution for DNN ASICs. We further present architectural guidelines for M3D DNNs to maximize the benefits.

Sparse hardware embedding of spiking neuron systems for community detection

Adapting deep neural networks and deep learning algorithms for neuromorphic hardware has been well established for discriminative and generative models. We study the applicability of neural networks and neuromorphic hardware for solving general optimization problems without the use of adaptive training or learning algorithms. We leverage the dynamics of Hopfield networks and spin glass systems to construct a fully connected spiking neural system to generate synchronous spike responses indicative of the underlying community structure in an undirected, unweighted graph. Mapping this fully connected system to current neuromorphic hardware is done by embedding sparse tree graphs to generate only the leading order spiking dynamics. We demonstrate that for a chosen set of benchmark graphs, non-overlapping communities can be identified, even with the loss of higher order spiking behavior.

Thermal-aware Test Scheduling Strategy for Network-on-Chip based Systems

Rapid progress in technology scaling makes transistors smaller and faster over successive generations, and consequently the core count in a system increases; a flexible and scalable packet-switched architecture, the Network-on-Chip (NoC), is commonly used for communication among the cores. To test such a system, the NoC is reused as a test delivery mechanism. This work proposes a preemptive test scheduling technique for NoC-based systems that reduces test time by minimizing network resource conflicts. The preemptive test scheduling problem is formulated using Integer Linear Programming (ILP). Thermal safety during testing is a particularly challenging problem for three-dimensional NoCs (3D NoCs). In this paper, the authors also present a thermal-aware scheduling technique to test cores in 2D as well as 3D stacked NoCs, using a Particle Swarm Optimization (PSO) based approach. To reduce test time further, several innovative augmentations, such as inversion mutation, efficient random number generation, and multiple PSO operations, have been incorporated into the basic PSO. Experimental results highlight the effectiveness of the proposed method in reducing test time under power constraints and achieving a tradeoff between test time and peak temperature.

PANE: Pluggable Asynchronous Network-on-Chip Simulator

Communication between different IP cores in MPSoCs and HMPs often involves clock domain crossing. An asynchronous network-on-chip (NoC) can support communication in such heterogeneous set-ups. While there are many tools to model NoCs for synchronous systems, there is very limited tool support for modeling and analyzing communication in multi-clock-domain NoCs. In this paper, we propose PANE, a Pluggable Asynchronous NEtwork-on-chip simulator that allows system-level simulation of asynchronous NoCs. PANE allows exploration of the synchronous, asynchronous, and mixed synchronous-asynchronous (heterogeneous) design space for system-level NoC parameters such as packet latency, throughput, and network saturation point. It also supports a large range of NoC configurations for both synthetic and real traffic patterns. We also demonstrate the application of PANE using synchronous routers, asynchronous routers, and a mix of the two. One of the key advantages of PANE is that it allows a seamless transition from synchronous to asynchronous NoC simulation while keeping pace with developments in synchronous NoC tools, as they can be integrated with PANE.

Reliability Hardening Mechanisms in Cyber-Physical Digital-Microfluidic Biochips

In the area of biomedical engineering, digital-microfluidic biochips (DMFBs) have received considerable attention because of their capability to provide an efficient and reliable platform for point-of-care clinical diagnostics. System reliability, in turn, mandates error recoverability when implementing biochemical assays on-chip for medical applications. Unfortunately, DMFB technology is not yet fully equipped to handle error recovery for the various microfluidic operations involving droplet motion and reaction. Recently, a number of cyber-physical systems have been proposed to provide real-time checking and error recovery in assays based on feedback received from a few on-chip checkpoints. However, in order to synthesize robust feedback systems for different types of DMFBs, certain practical issues need to be considered, such as co-optimization of checkpoint placement, error recoverability, and the layout of droplet-routing pathways. For application-specific DMFBs, we propose an algorithm that minimizes the number of checkpoints and determines their locations to cover every path in a given droplet-routing solution. Next, for general-purpose DMFBs, where the checkpoints are pre-deployed at specific locations, we present a checkpoint-aware routing algorithm such that every droplet-routing path passes through at least one checkpoint, enabling error recovery while ensuring physical routability of all droplets.
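The placement problem, cover every droplet route with as few checkpoints as possible, is a set-cover-style problem. The greedy sketch below is a standard illustration of that formulation, not the paper's algorithm: repeatedly pick the cell crossed by the most still-uncovered routes until every route passes through a checkpoint.

```python
def place_checkpoints(paths: list) -> set:
    """Greedy cover: paths are lists of grid cells a droplet traverses."""
    uncovered = [set(p) for p in paths]
    checkpoints = set()
    while uncovered:
        counts = {}
        for path in uncovered:
            for cell in path:
                counts[cell] = counts.get(cell, 0) + 1
        best = max(counts, key=counts.get)   # cell crossed by most uncovered routes
        checkpoints.add(best)
        uncovered = [p for p in uncovered if best not in p]
    return checkpoints

routes = [[(0, 0), (0, 1), (1, 1)],
          [(2, 2), (1, 1), (1, 0)],
          [(3, 0), (3, 1)]]
print(place_checkpoints(routes))  # (1, 1) covers two routes; one more cell covers the third
```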

Time-randomized Wormhole NoCs for Critical Applications

Wormhole-based NoCs (wNoCs) are widely accepted in high-performance domains as the most appropriate solution to interconnect an increasing number of cores on the chip. However, the suitability of wNoCs in the context of critical real-time applications has not yet been demonstrated. In this paper, in the context of probabilistic timing analysis (PTA), we propose a PTA-compatible wNoC design that provides tight time-composable contention bounds. The proposed wNoC design builds on PTA's ability to reason in probabilistic terms about hardware events impacting execution time (e.g., wNoC contention), discarding those sequences of events that occur with negligibly low probability. This allows our wNoC design to deliver improved guaranteed performance with respect to conventional time-deterministic setups. Our results show that performance guarantees of applications running on top of probabilistic wNoC designs improve by 40% and 93% on average for 4x4 and 6x6 wNoC setups, respectively.

Limit of Hardware Solutions for Self-Protecting Fault-Tolerant NoCs

We study the ultimate limits of hardware solutions for self-protection strategies against permanent faults in networks-on-chip (NoCs). NoC reliability is improved by replacing each base router with an augmented router that includes extra protection circuitry. We compare the protection achieved by self-test and self-protect (STAP) architectures to that of triple modular redundancy with voting (TMR). In practice, none of the considered architectures (STAP or TMR) can tolerate all permanent faults, especially faults in the extra circuitry for protection or voting; consequently, there will always be some unidentified defective augmented routers that transmit errors in an unpredictable manner. Specifically, we study and determine the average percentage of unidentified defective routers (UDRs) and their impact on the overall reliability of the NoC in light of self-protection strategies. Our study shows that TMR is the most efficient solution for limiting the average percentage of UDRs when fewer than 0.1% of base routers are defective. Above 1% defective base routers, the STAP approaches are more efficient, although protection efficiency decreases inexorably in very defective technologies (e.g., when 10% or more of base routers are defective).

System Level Analysis of 3D ICs with Thermal TSVs

3D stacking of integrated circuits (ICs) provides significant advantages in saving device footprint, improving power management, and continuing performance enhancement, particularly for many-core systems. However, the stacked structure makes heat dissipation a challenging issue. While the Thermal Through-Silicon Via (TTSV) is a promising way of lowering the thermal resistance of dies, past research has either overestimated or underestimated the effects of TTSVs due to the lack of detailed 3D IC models and system-level simulations. To accurately simulate TTSV effects on 3D ICs, we run benchmarks from Splash-2 in the full-system mode of the gem5 simulator. gem5 generates all the system component activities, and McPAT generates the corresponding power consumption. The power trace of each benchmark is then fed to HotSpot for thermal simulation. The temperatures of 2D and 3D Nehalem-like x86 processors are compared. When TTSVs are placed close to hot-spot regions of the 3D IC to facilitate vertical heat transfer to the heat-sink structures, the peak temperature of the 3D Nehalem is reduced by 25-5% with a small area overhead of 6%. By using a detailed 3D thermal model, full-system simulation, and a validated thermal simulator, our results show the accurate effects of TTSVs in 3D ICs.

