ACM Journal on Emerging Technologies in Computing (JETC)

Latest Articles

Design Space Exploration of 3D Network-on-Chip: A Sensitivity-based Optimization Approach

High-performance and energy-efficient Network-on-Chip (NoC) architecture is one of the crucial components of manycore processing platforms. A very promising NoC architecture recently proposed in the literature is the three-dimensional small-world NoC (3D SWNoC). Due to short vertical links in 3D integration and the robustness of small-world...

Hardware Trojan Detection Using the Order of Path Delay

Many fabrication-less design houses are outsourcing their designs to third-party foundries for fabrication to lower cost. This IC development process,...

Reliability Hardening Mechanisms in Cyber-Physical Digital-Microfluidic Biochips

In the area of biomedical engineering, digital-microfluidic biochips (DMFBs) have received considerable attention because of their capability of...

IMFlexCom: Energy Efficient In-Memory Flexible Computing Using Dual-Mode SOT-MRAM

In this article, we propose an In-Memory Flexible Computing platform (IMFlexCom) using a novel Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) array architecture, which can work in dual mode: memory mode and computing mode. Such intrinsic in-memory logic...

T-count and Qubit Optimized Quantum Circuit Design of the Non-Restoring Square Root Algorithm

Quantum circuits for basic mathematical functions such as the square root are required to implement scientific computing algorithms on quantum...
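
The non-restoring square root algorithm named in the title is a classical bitwise method; the sketch below shows it in plain Python integers. The paper's quantum circuit realizes the same recurrence with reversible adders/subtractors, which this sketch does not model.

```python
def nonrestoring_isqrt(n: int, bits: int = 16) -> int:
    # Classical non-restoring integer square root of a bits-wide n.
    # Each step shifts in two radicand bits, then subtracts (4q + 1)
    # if the remainder is non-negative, or adds back (4q + 3) if it
    # is negative -- no immediate restore, hence "non-restoring".
    q, r = 0, 0                            # partial root, signed remainder
    for i in range(bits // 2 - 1, -1, -1):
        r = (r << 2) | ((n >> (2 * i)) & 0b11)
        if r >= 0:
            r -= (q << 2) | 1
        else:
            r += (q << 2) | 3
        q = (q << 1) | (1 if r >= 0 else 0)
    return q
```

For example, `nonrestoring_isqrt(255)` yields 15, the floor of the square root.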

System-Level Analysis of 3D ICs with Thermal TSVs

3D stacking of integrated circuits (ICs) provides significant advantages in saving device footprints, improving power management, and continuing...

Memristor-CMOS Analog Coprocessor for Acceleration of High-Performance Computing Applications

Vector matrix multiplication computation underlies major applications in machine vision, deep learning, and scientific simulation. These applications...


About JETC

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. Major economic and technical challenges are expected to impede the continued scaling of semiconductor devices. This has resulted in the search for alternative mechanical, biological/biochemical, nanoscale electronic, asynchronous, and quantum computing and sensor technologies.

Self-learnable Cluster-based Prefetching Method for DRAM-Flash Hybrid Main Memory Architecture

This paper presents a novel prefetching mechanism for memory-intensive workloads used in large-scale data centers. In particular, we design a NAND-flash/DRAM hybrid memory architecture as a cost-effective alternative that resolves the scalability and power-consumption problems of a DRAM-only model. To maximize DRAM performance, we design a smart prefetching mechanism based on a cluster-management scheme that copes with the dynamically varying and complex access patterns of any given application. We propose a new unit of page management, called a cluster, which drives prefetching in our hybrid memory architecture. Cluster management relies on a self-learning scheme that adapts to changing access patterns by tracking correlations between missed pages. Experimental results show that overall performance improves significantly in terms of hit rate, execution time, and energy consumption: our model enhances the hit rate by 15% and reduces execution time by a factor of 1.75. In addition, it saves around 48% of energy consumption by cutting the number of flushed pages to about an eighth of that in a conventional system.
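
As an illustration of the cluster idea, here is a minimal, purely hypothetical sketch (names and policies are ours, not the paper's design): pages that miss consecutively are merged into the same cluster, and a later miss on any member suggests prefetching its cluster-mates from flash into DRAM.

```python
from collections import defaultdict

class ClusterPrefetcher:
    # Toy sketch only: correlate consecutive misses into clusters and
    # return the cluster-mates of the missed page as prefetch candidates.
    def __init__(self):
        self.cluster_of = {}             # page -> cluster id
        self.members = defaultdict(set)  # cluster id -> member pages
        self.last_miss = None
        self.next_id = 0

    def on_miss(self, page):
        # Put this miss and the previous one in the same cluster,
        # merging clusters when both pages are already assigned.
        if self.last_miss is not None:
            a, b = self.last_miss, page
            ca, cb = self.cluster_of.get(a), self.cluster_of.get(b)
            if ca is None and cb is None:
                cid = self.next_id; self.next_id += 1
                for p in (a, b):
                    self.cluster_of[p] = cid
                    self.members[cid].add(p)
            elif ca is None:
                self.cluster_of[a] = cb; self.members[cb].add(a)
            elif cb is None:
                self.cluster_of[b] = ca; self.members[ca].add(b)
            elif ca != cb:               # merge cluster cb into ca
                for p in self.members.pop(cb):
                    self.cluster_of[p] = ca
                    self.members[ca].add(p)
        self.last_miss = page
        cid = self.cluster_of.get(page)
        return sorted(self.members[cid] - {page}) if cid is not None else []
```

A real design would bound cluster sizes and age out stale correlations; this sketch keeps only the core self-learning correlation step.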

Identification of Synthesis Approaches for IP/IC Piracy of Reversible Circuits

Reversible circuits employ a computational paradigm that is beneficial for several applications, including the design of encoding and decoding devices, low-power design, and emerging applications such as quantum computation. However, as with conventional logic, reversible circuits may be subject to Intellectual Property (IP)/Integrated Circuit (IC) piracy. To counteract such attacks, a detailed understanding of how to identify the target function of a reversible circuit is crucial. In contrast to conventional logic, the target function of a reversible circuit is (implicitly or explicitly) embedded into the circuit, and numerous synthesis solutions have been proposed for this purpose. To recover the target function of a reversible circuit, it is crucial to know which synthesis approach was used to generate the circuit in the first place. In this work, we propose a machine-learning-based scheme that determines the synthesis approach used from its telltale signs. Furthermore, we study the impact of optimizing the synthesis approaches on those telltale signs. Our analysis shows that the most established synthesis approaches can be identified in the vast majority of cases, even when optimized versions of them are applied. This motivates a thorough investigation of how to obfuscate corresponding designs.
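
A toy stand-in for such a scheme, with invented feature vectors (counts of NOT, CNOT, and Toffoli gates plus the number of circuit lines) and a simple nearest-centroid rule in place of the paper's actual ML model. All numbers and family labels below are illustrative.

```python
import math

# Invented telltale features per circuit: (NOT, CNOT, Toffoli, lines).
# Labels name common synthesis families; the numbers are made up.
TRAINING = [
    ((2, 10, 40, 5), "transformation-based"),
    ((3, 12, 38, 5), "transformation-based"),
    ((20, 30, 4, 12), "BDD-based"),
    ((18, 28, 6, 12), "BDD-based"),
]

def classify(features):
    # Nearest-centroid rule: pick the synthesis family whose mean
    # feature vector is closest to the circuit at hand.
    groups = {}
    for vec, label in TRAINING:
        groups.setdefault(label, []).append(vec)
    best_label, best_dist = None, math.inf
    for label, vecs in groups.items():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        d = math.dist(features, centroid)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

The point of the sketch is only that gate-count statistics separate synthesis families; the paper's classifier and feature set are richer.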

Long Short-Term Memory Network Design for Analog Computing

We present an analog integrated-circuit implementation of a long short-term memory (LSTM) network that is compatible with digital CMOS technology. We use multiple-input floating-gate MOSFETs both as the front end that obtains converted analog signals and as the differential pairs in the proposed analog multipliers. An analog crossbar built from the analog multipliers processes matrix and bitwise multiplications. We show that using current signals as internal transmission signals largely reduces computation delay compared to a digital implementation. We also introduce analog blocks that serve as the activation functions of the algorithm. In the back end of our design, current comparators make the output readable to external digital systems. We designed the LSTM network with a matrix size of 16×16 in TSMC 180nm CMOS technology. Post-layout simulations show that the latency of one computing cycle is 1.19ns without memory, and the power dissipation of a single analog LSTM computing core with 2 kilobytes of SRAM at 200MHz is 460.3mW. The overhead of power dissipation due to SRAM access is 8.3%, with the computation of each LSTM layer requiring one clock cycle. The energy efficiency is 0.43GOPS/W.

An FPGA Implementation of a Time Delay Reservoir Using Stochastic Logic

This paper presents and demonstrates a stochastic logic time delay reservoir design in FPGA hardware. The reservoir network is analyzed using a number of metrics, such as kernel quality, generalization rank, and performance on simple benchmarks, and is also compared to a deterministic design. A novel re-seeding method is introduced to reduce the adverse effects of stochastic noise; it may also be applied in other stochastic logic reservoir computing designs, such as echo state networks. Benchmark results indicate that the proposed design performs well on noise-tolerant classification problems, but more work is needed to improve the stochastic logic time delay reservoir's robustness for regression problems.
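
For readers unfamiliar with stochastic logic, the core encoding can be sketched in software: a value is represented by the probability of a 1 in a bitstream, and an AND gate multiplies two independent unipolar streams. Using distinct seeds per operand is a loose software analogue of why re-seeding limits correlation-induced noise; all parameters here are ours, not the paper's.

```python
import random

def to_stream(p, n, rng):
    # Unipolar stochastic encoding: bit i is 1 with probability p.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(pa, pb, n=4096, seed=0):
    # AND two *independent* streams: mean(a AND b) ~= pa * pb.
    # Distinct seeds keep the streams uncorrelated, the role played
    # by re-seeding in hardware.
    ra, rb = random.Random(seed), random.Random(seed + 1)
    bits = zip(to_stream(pa, n, ra), to_stream(pb, n, rb))
    return sum(x & y for x, y in bits) / n
```

With correlated (identically seeded) streams the AND would instead approximate min(pa, pb), which is exactly the distortion re-seeding avoids.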

A Multi-Level Optimization Framework for FPGA-Based Cellular Neural Network Implementation

Cellular Neural Networks (CeNNs) are considered a powerful paradigm for embedded devices. Their analog and mixed-signal hardware implementations have proven applicable to high-speed image processing, video analysis, and medical signal processing, but their efficiency and adoption are limited by small implementation sizes and low precision. Recently, digital implementations of CeNNs on FPGAs have attracted researchers from both academia and industry due to their high flexibility and short time-to-market. However, most existing implementations are not well optimized to fully exploit the FPGA platform, carrying unnecessary design and computational redundancy that prevents speedup. We propose a multi-level optimization framework for energy-efficient CeNN implementations on FPGAs. In particular, the framework features optimizations at three levels: system, module, and design space, targeting computational redundancy and attainable performance. Experimental results show that, across various configurations, our framework achieves an energy-efficiency improvement of 3.54× and up to a 3.88× speedup compared with existing implementations of similar accuracy.

GARDENIA: A Graph Processing Benchmark Suite for Next-generation Accelerators

This paper presents the Graph Analytics Repository for Designing Next-generation Accelerators (GARDENIA), a benchmark suite for studying irregular algorithms on massively parallel accelerators. Existing generic benchmarks for accelerators focus mainly on applications with limited control and data irregularity, while available graph-analytics benchmarks do not apply state-of-the-art algorithms and/or optimization techniques. GARDENIA includes emerging irregular applications in the big-data and machine-learning domains that mimic massively multithreaded commercial programs running in modern large-scale datacenters. Our characterization shows that GARDENIA exhibits irregular microarchitectural behavior quite different from that of structured workloads and naively implemented graph benchmarks.

Split Manufacturing Based Register Transfer Level Obfuscation

Fabrication-less integrated circuit (IC) design houses outsource fabrication to third-party foundries to reduce manufacturing cost. The outsourcing of IC fabrication, however, raises concerns regarding intellectual property (IP) piracy and theft by rogue elements in the third-party foundries. Obfuscation techniques have been proposed to increase resistance to reverse engineering, IP recovery, IP theft, and piracy. However, prior work on obfuscation for IP protection has applied primarily at the gate level or the layout level; as a result, it can significantly impact the performance of the original design, in addition to requiring redesign of standard cells. In this paper, we propose a high-level synthesis and analysis (HLSA) based obfuscation approach for IP protection built on split manufacturing. Additional dummy units and MUXes can be added to further obfuscate the design. The proposed technique aligns with standard-cell-based design methodologies and does not significantly impact the performance of the original design. Our experimental results confirm that the proposed approach provides a high level of IC obfuscation at moderate area cost.

Design and Multi-Abstraction Level Evaluation of a NoC Router for Mixed-Criticality Real-Time Systems

A Mixed Criticality System (MCS) combines real-time software tasks with different criticality levels. In an MCS, the criticality level specifies the level of assurance against system failure. For high-critical message flows, deadlines must be met; otherwise the whole system might fail, leading to catastrophic results such as loss of life or serious damage to the environment. In contrast, low-critical flows may tolerate some delays. Furthermore, in an MCS, flow performance metrics such as the Worst Case Communication Time (WCCT) may vary depending on the criticality level of the applications, so execution platforms must provide different operating modes for applications with different levels of criticality. Finally, in a Network-on-Chip (NoC), sharing resources between communication flows can lead to unpredictable latencies, which makes the implementation of an MCS on many-core architectures challenging. In this article, we propose and evaluate a new NoC router to support MCSs, based on an accurate WCCT analysis for high-critical flows. The proposed router, called DAS (Double Arbiter and Switching router), jointly uses Wormhole and Store-And-Forward communication techniques for low- and high-critical flows, respectively. It ensures that high-critical flows meet their deadlines while maximizing the bandwidth remaining for low-critical flows.

BigBus: A Scalable Optical Interconnect

This paper presents BigBus, a novel design of an on-chip photonic network for a 1024-node system. For such a large on-chip network, performance and power reduction are two mutually conflicting goals. This paper uses a combination of strategies to reduce static power consumption while simultaneously improving both performance and the energy-delay² (ED²) product. The crux of the paper is to segment the entire system into smaller clusters of nodes and adopt a hybrid strategy for each segment that includes conventional laser modulation as well as a novel technique for sharing power across nodes dynamically. We represent energy internally as tokens, where one token allows a node to send a message to any other node in its cluster. Optical stations arbitrate for tokens at a global level, and we then predict the number of token equivalents of power that the off-chip laser needs to generate. Using these techniques, BigBus outperforms competing proposals: we demonstrate a speedup of 14-34% over state-of-the-art proposals and a 20-61% reduction in ED².

STDP-based Unsupervised Feature Learning using Convolution-over-time in Spiking Neural Networks for Energy-Efficient Neuromorphic Computing

Brain-inspired learning models attempt to mimic the computations performed in the neurons and synapses constituting the human brain to achieve its efficiency in cognitive tasks. In this work, we propose spike-timing-dependent plasticity (STDP) based unsupervised feature learning in a Convolutional Spiking Neural Network (SNN). We use shared weight kernels that are trained to encode representative features underlying the input patterns, thereby improving the sparsity as well as the robustness of the learning model. We show that the proposed Convolutional SNN self-learns several visual categories for object recognition with fewer training patterns than the traditional fully-connected SNN while yielding competitive accuracy. Further, we present an energy-efficient implementation of the Convolutional SNN using a crossbar array of spintronic synapses. Our system-level simulation indicates that the Convolutional SNN offers up to a 9.3× reduction in energy consumption per training pattern compared to the fully-connected SNN.
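
The pair-based STDP rule underlying such unsupervised learning can be sketched as follows; the constants are illustrative textbook values, not those used in the paper:

```python
import math

A_PLUS, A_MINUS, TAU = 0.05, 0.04, 20.0   # illustrative constants (ms)

def stdp_dw(t_pre: float, t_post: float) -> float:
    # Pair-based STDP: potentiate the synapse when the presynaptic
    # spike precedes the postsynaptic one (dt >= 0), depress otherwise,
    # with exponentially decaying magnitude in the spike-time gap.
    dt = t_post - t_pre
    if dt >= 0:
        return A_PLUS * math.exp(-dt / TAU)    # potentiation
    return -A_MINUS * math.exp(dt / TAU)       # depression
```

In the convolutional setting, one such update would be accumulated over all positions sharing a kernel weight.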

Semi-Trained Memristive Crossbar Computing Engine with In-Situ Learning Accelerator

On-device learning has gained significant attention recently, as it offers local data processing that ensures user privacy and low power consumption, especially on mobile devices and energy-constrained platforms. This paper proposes on-device training circuitry for threshold-current memristors integrated in a crossbar structure. Furthermore, it investigates alternative approaches to mapping the synaptic weights onto the memristive crossbar, thereby realizing a simplified neuromemristive system. The proposed design is studied within the context of the extreme learning machine (ELM), using the delta-rule learning algorithm to train the output layer. The network is implemented in the IBM 65nm technology node and verified in the Cadence Spectre environment. The hardware model is verified for classification with binary and multi-class datasets. The total power for a single 4×4-layer network is estimated to be about 29.62 µW, and the area to be 26.48 µm × 22.35 µm.

DFR: An Energy-efficient Analog Delay Feedback Reservoir Computing System for Brain-inspired Computing

Confronted with the explosive escalation of data density, von Neumann computing systems, which compute and store data in separate locations, have reached their computational bottleneck. The reservoir computing system, an emerging computing paradigm inspired by the working mechanism of mammalian brains, has proven its benefit in multifaceted applications. In this work, we designed and fabricated an energy-efficient analog delayed feedback reservoir (DFR) computing system that embeds a temporal encoding scheme, a nonlinear transfer function, and a dynamic delayed feedback loop. Measurement results demonstrate its high energy efficiency and rich dynamic behaviors, with a working mechanism that closely mimics the behavior of biological neurons. The system's performance and robustness are studied and analyzed through Monte Carlo simulation. The proposed DFR computing system conceptually evolves its training mechanism and computing architecture; it is capable of nonlinearly projecting input patterns onto higher-dimensional spaces for subsequent classification while operating at the edge of the chaotic region with merely 526 µW of power consumption. To the best of our knowledge, our work represents the first analog integrated circuit (IC) implementation of a DFR computing system.
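
A discrete-time software caricature of a delayed-feedback reservoir: a single nonlinear node whose input mixes the current sample with its own state from `delay` steps ago. The tanh nonlinearity and the gains are our illustrative choices, not parameters of the fabricated circuit.

```python
import math

def dfr_states(inputs, delay=10, eta=0.5, gamma=0.1):
    # One nonlinear node fed by its own output from `delay` steps ago
    # (the feedback loop) mixed with the scaled current input sample.
    buf = [0.0] * delay                    # the delayed feedback loop
    states = []
    for u in inputs:
        x = math.tanh(eta * buf[0] + gamma * u)
        buf = buf[1:] + [x]                # shift the delay line
        states.append(x)
    return states
```

The recorded states play the role of the high-dimensional projection that a linear readout is then trained on.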

Efficient Hardware Implementation of Cellular Neural Networks with Incremental Quantization and Early Exit

Cellular neural networks (CeNNs) have been widely adopted in image processing tasks. Recently, various hardware implementations of CeNNs have emerged in the literature, with FPGA being one of the most popular choices due to its high flexibility and low time-to-market. However, CeNNs typically involve extensive computations in a recursive manner. As an example, simply processing an image of 1920×1080 pixels requires 4-8 giga floating-point multiplications, which need to be done in a timely manner for real-time applications. To address this issue, in this paper we propose a compressed CeNN framework for efficient FPGA implementations. It involves techniques such as incremental quantization and early exit, which significantly reduce computation demands while maintaining acceptable performance. While similar concepts have been explored in hardware implementations of Convolutional Neural Networks (CNNs), CeNNs have completely different computation patterns, which require different quantization and implementation strategies. Experimental results on FPGAs show that incremental quantization and early exit can achieve speedups of up to 7.8× and 8.3×, respectively, compared with state-of-the-art implementations, with almost no performance loss on four widely adopted applications. We also discover that, differently from CNNs, the optimal quantization strategies of CeNNs depend heavily on the application.

Efficient Memristor based Architecture for Intrusion Detection and High Speed Packet Classification

Deep packet inspection (DPI) is a critical component of intrusion detection, requiring a detailed analysis of each network packet's header and body. Although DPI is often performed on dedicated high-power servers in most networked systems, mobile systems could be vulnerable to attack when used on an unprotected network; in that case, DPI hardware on the mobile system would be highly beneficial. Unfortunately, DPI hardware is generally area- and power-consuming, making it difficult to implement in mobile systems. We developed a memristor-crossbar-based approach, inspired by memristor crossbar neuromorphic circuits, for a low-power, low-area, high-throughput DPI system that examines both the header and the body of a packet. Two key types of circuits are presented: static pattern-matching circuits and regular-expression circuits. The system reduces execution time and power consumption thanks to its high-density grid and massive parallelism. Independent searches performed on low-power memristor crossbar arrays give rise to a throughput of 390Gbps for minimum-size packets (40B long) with no loss in classification accuracy. The memristor crossbar consumes no static power and a dynamic power of 0.00336mW per Snort header rule.
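
Conceptually, a crossbar row performs static pattern matching as a dot product: the stored pattern acts as a weight vector, and an input matches when the dot product reaches its maximum possible value. A bit-level software sketch (currents, thresholds, and all circuit details elided; names are ours):

```python
def to_bits(data: bytes):
    # Little-endian bit expansion of each byte.
    return [(byte >> i) & 1 for byte in data for i in range(8)]

def crossbar_row(pattern: bytes):
    # "Program" the pattern as +1 (bit = 1) / -1 (bit = 0) weights,
    # standing in for the high/low conductances of a crossbar row.
    return [1 if b else -1 for b in to_bits(pattern)]

def matches(row, packet: bytes) -> bool:
    # Map input bits to +/-1 as well; only an exact match reaches the
    # maximum dot product, len(row).
    score = sum(w * (1 if b else -1) for w, b in zip(row, to_bits(packet)))
    return score == len(row)
```

Lowering the acceptance threshold below `len(row)` would give approximate matching, and many rows evaluated in parallel is what yields the crossbar's throughput.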

Fault-Tolerant Network-on-Chip Design with Flexible Spare Core Placement

Network-on-Chip (NoC) has been proposed as a promising solution to overcome the communication challenges of System-on-Chip (SoC) design in nanoscale technologies. With the increased integration density of Intellectual Property (IP) cores in a single chip, heat dissipation increases, which makes the system unreliable. Efficient fault-tolerant methods are therefore necessary at different levels to improve system performance and keep the system operating normally. This paper presents a flexible spare-core placement technique for mesh-based NoCs. An Integer Linear Programming (ILP) based solution is proposed for the spare-core placement problem, along with a Particle Swarm Optimization (PSO) based meta-heuristic for the same problem. Experiments were performed on several application benchmarks reported in the literature. Our approach is compared with approaches from the literature (i) by varying the network size with a fixed fault percentage, and (ii) by fixing the network size while varying the percentage of faults in the network. We also compare overall communication cost and CPU runtime between the ILP and PSO approaches. The results show significant reductions in overall communication cost and in dynamic simulation results across all cases using our approach over those reported in the literature.
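
A minimal global-best PSO, the meta-heuristic named above, shown on a toy quadratic cost as a stand-in for the placement cost function; the inertia and acceleration coefficients are typical textbook values, not the authors' tuned settings.

```python
import random

def pso(cost, dim, n_particles=20, iters=100, seed=1):
    # Global-best particle swarm optimization over a continuous space.
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5              # inertia, cognitive, social
    xs = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [list(x) for x in xs]
    pcost = [cost(x) for x in xs]
    g = min(range(n_particles), key=pcost.__getitem__)
    gbest, gcost = list(pbest[g]), pcost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            c = cost(xs[i])
            if c < pcost[i]:                 # update personal best
                pbest[i], pcost[i] = list(xs[i]), c
                if c < gcost:                # and the global best
                    gbest, gcost = list(xs[i]), c
    return gbest, gcost

# Stand-in "communication cost": squared distance from the origin.
best, val = pso(lambda x: sum(v * v for v in x), dim=3)
```

The paper's discrete placement version would encode a spare-core assignment per particle and evaluate the actual communication cost instead.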

Power, Performance, and Area Benefit of Monolithic 3D ICs for On-Chip Deep Neural Networks Targeting Speech Recognition

In recent years, deep learning has become widespread for various real-world recognition tasks. Beyond recognition accuracy, energy efficiency and performance are another grand challenge for enabling local intelligence in edge devices. In this paper, we investigate the adoption of monolithic 3D IC (M3D) technology for deep-learning hardware design, using speech recognition as a test vehicle. M3D has recently proven to be one of the leading contenders for addressing the power, performance, and area (PPA) scaling challenges of advanced technology nodes. Our study encompasses the influence of key parameters of DNN hardware implementations on performance and energy efficiency, including DNN architectural choices, underlying workloads, and tier-partitioning choices in M3D. Our post-layout M3D designs, together with hardware-efficient sparse algorithms, produce power savings and performance improvements beyond what can be achieved using conventional 2D ICs. Experimental results show that M3D offers a 22.3% iso-performance power saving and a 6.2% performance improvement, convincingly demonstrating its suitability for DNN ASICs. We further present architectural guidelines for M3D DNNs to maximize these benefits.

Sparse hardware embedding of spiking neuron systems for community detection

Adapting deep neural networks and deep-learning algorithms to neuromorphic hardware is well established for discriminative and generative models. We study the applicability of neural networks and neuromorphic hardware to solving general optimization problems without the use of adaptive training or learning algorithms. We leverage the dynamics of Hopfield networks and spin-glass systems to construct a fully connected spiking neural system that generates synchronous spike responses indicative of the underlying community structure in an undirected, unweighted graph. We map this fully connected system to current neuromorphic hardware by embedding sparse tree graphs that generate only the leading-order spiking dynamics. We demonstrate that, for a chosen set of benchmark graphs, non-overlapping communities can be identified even with the loss of higher-order spiking behavior.

Thermal-aware Test Scheduling Strategy for Network-on-Chip based Systems

Rapid progress in technology scaling makes transistors smaller and faster over successive generations, and the core count in a system increases accordingly; a flexible and scalable packet-switched architecture, the Network-on-Chip (NoC), is commonly used for communication among the cores. To test such a system, the NoC is reused as the test-delivery mechanism. This work proposes a preemptive test-scheduling technique for NoC-based systems that reduces test time by minimizing network resource conflicts. The preemptive test-scheduling problem is formulated using Integer Linear Programming (ILP). Thermal safety during testing is a particularly challenging problem for three-dimensional NoCs (3D NoCs). We therefore also present a thermal-aware scheduling technique to test cores in 2D as well as 3D stacked NoCs, using a Particle Swarm Optimization (PSO) based approach. To reduce test time further, several augmentations, such as inversion mutation, efficient random-number generation, and multiple PSO operations, are incorporated into the basic PSO. Experimental results highlight the effectiveness of the proposed method in reducing test time under power constraints and in achieving a tradeoff between test time and peak temperature.

PANE: Pluggable Asynchronous Network-on-Chip Simulator

Communication between different IP cores in MPSoCs and HMPs often involves clock-domain crossing. An asynchronous network-on-chip (NoC) can support communication in such heterogeneous set-ups. While there are many tools to model NoCs for synchronous systems, tool support for modeling and analyzing communication in multi-clock-domain NoCs is very limited. In this paper, we propose PANE, a Pluggable Asynchronous NEtwork-on-chip simulator that enables system-level simulation of asynchronous NoCs. PANE allows exploration of the synchronous, asynchronous, and mixed synchronous-asynchronous (heterogeneous) design space for system-level NoC parameters such as packet latency, throughput, and network saturation point. It also supports a large range of NoC configurations for both synthetic and real traffic patterns. We also demonstrate the application of PANE using synchronous routers, asynchronous routers, and a mix of the two. A key advantage of PANE is that it allows a seamless transition from synchronous to asynchronous NoC simulation while keeping pace with developments in synchronous NoC tools, which can be integrated with PANE.

Time-randomized Wormhole NoCs for Critical Applications

Wormhole-based NoCs (wNoCs) are widely accepted in high-performance domains as the most appropriate solution for interconnecting an increasing number of cores on chip. However, the suitability of wNoCs for critical real-time applications has not yet been demonstrated. In this paper, in the context of probabilistic timing analysis (PTA), we propose a PTA-compatible wNoC design that provides tight time-composable contention bounds. The proposed design builds on PTA's ability to reason in probabilistic terms about hardware events that impact execution time (e.g., wNoC contention), discarding sequences of events that occur with negligibly low probability. This allows our wNoC design to deliver improved guaranteed performance with respect to conventional time-deterministic setups. Our results show that the performance guarantees of applications running on top of probabilistic wNoC designs improve by 40% and 93% on average for 4×4 and 6×6 wNoC setups, respectively.

Neural Network Classifiers using a Hardware-based Approximate Activation Function with a Hybrid Stochastic Multiplier

Neural networks are becoming prevalent in many areas, such as pattern recognition and medical diagnosis. Stochastic computing is one potential solution for neural networks implemented in low-power back-end devices such as solar-powered and Internet-of-Things devices. In this paper, we investigate a new architecture for stochastic neural networks with a hardware-oriented approximate activation function. The proposed approximate activation function can be hidden within the architecture, reducing overall hardware cost. Additionally, to further reduce the hardware cost of the stochastic implementation, a new hybrid stochastic multiplier is proposed; it combines OR gates with a binary parallel counter, reducing the number of inputs to the parallel counter. The experimental results indicate that the proposed approximate architecture without hybrid stochastic multipliers achieves more than 25%, 60%, and 3× reductions in area, power, and energy, respectively, compared to previous stochastic neural networks, and more than 30×, 30×, and 52% reductions compared to conventional binary neural networks, while maintaining error rates similar to those of conventional neural networks. Furthermore, the stochastic implementation with hybrid stochastic multipliers further reduces area by about 18% to 80%, power by 15% to 113.1%, and energy by about 15% to 131%, respectively.

Limit of Hardware Solutions for Self-Protecting Fault-Tolerant NoCs

We study the ultimate limits of hardware solutions for self-protection against permanent faults in networks-on-chip (NoCs). NoC reliability is improved by replacing each base router with an augmented router that includes extra protection circuitry. We compare the protection achieved by self-test and self-protect (STAP) architectures to that of triple modular redundancy with voting (TMR). In practice, none of the considered architectures (STAP or TMR) can tolerate all permanent faults, especially faults in the extra protection or voting circuitry; consequently, there will always be some unidentified defective augmented routers that transmit errors in an unpredictable manner. Specifically, we study and determine the average percentage of unidentified defective routers (UDRs) and their impact on the overall reliability of the NoC under self-protection strategies. Our study shows that TMR is the most efficient solution for limiting the average percentage of UDRs when fewer than about 0.1% of base routers are defective. Above 1% of defective base routers, the STAP approaches are more efficient, although protection efficiency decreases inexorably in highly defective technologies (e.g., with 10% or more defective base routers).

Guest Editor Introduction: Neuromorphic Computing
