ACM DL

ACM Journal on

Emerging Technologies in Computing (JETC)

Menu
Latest Articles

Sparse Hardware Embedding of Spiking Neuron Systems for Community Detection

We study the applicability of spiking neural networks and neuromorphic hardware for solving general opti- mization problems without the use of... (more)

Efficient Memristor-Based Architecture for Intrusion Detection and High-Speed Packet Classification

Deep packet inspection (DPI) is a critical component to prevent intrusion detection. This requires a... (more)

Power, Performance, and Area Benefit of Monolithic 3D ICs for On-Chip Deep Neural Networks Targeting Speech Recognition

In recent years, deep learning has become widespread for various real-world recognition tasks. In... (more)

Semi-Trained Memristive Crossbar Computing Engine with In Situ Learning Accelerator

On-device intelligence is gaining significant attention recently as it offers local data processing and low power consumption. In this research, an... (more)

STDP-based Unsupervised Feature Learning using Convolution-over-time in Spiking Neural Networks for Energy-Efficient Neuromorphic Computing

Brain-inspired learning models attempt to mimic the computations performed in the neurons and... (more)

DFR: An Energy-efficient Analog Delay Feedback Reservoir Computing System for Brain-inspired Computing

Neuromorphic computing, which is built on a brain-inspired silicon chip, is uniquely applied to keep pace with the explosive escalation of algorithms and data density on machine learning. Reservoir computing, an emerging computing paradigm based on the recurrent neural network with proven benefits across multifaceted applications, offers an... (more)

An FPGA Implementation of a Time Delay Reservoir Using Stochastic Logic

This article presents and demonstrates a stochastic logic time delay reservoir design in FPGA hardware. The reservoir network approach is analyzed... (more)

A Multi-Level-Optimization Framework for FPGA-Based Cellular Neural Network Implementation

Cellular Neural Network (CeNN) is considered as a powerful paradigm for embedded devices. Its analog and mix-signal hardware implementations are... (more)

Efficient Hardware Implementation of Cellular Neural Networks with Incremental Quantization and Early Exit

Cellular neural networks (CeNNs) have been widely adopted in image processing tasks. Recently,... (more)

NEWS

About JETC

The Journal of Emerging Technologies in Computing Systems invites submissions of original technical papers describing research and development in emerging technologies in computing systems. Major economic and technical challenges are expected to impede the continued scaling of semiconductor devices. This has resulted in the search for alternate mechanical, biological/biochemical, nanoscale electronic, asynchronous and quantum computing and sensor technologies. 

read more
Self-learnable Cluster-based Prefetching Method for DRAM-Flash Hybrid Main Memory Architecture

This paper presents a novel prefetching mechanism for memory-intensive workloads used in large-scale data centers. In particular, we design a NAND-flash/DRAM hybrid memory architecture as a cost-effective memory architecture to resolve the scalability and power consumption problems of a DRAM-based model. A smart prefetching mechanism based on a cluster-management scheme to cope with dynamically varying and complex access patterns of any given application is designed for maximizing the performance of the DRAM. In this paper, we propose a new concept for page management, called a cluster, which prefetches data in our hybrid memory architecture. The cluster management is based on a self-learning scheme on dynamically changeable access patterns by considering any correlation between missed pages. Experimental results show that the overall performance is significantly improved in relation to hit rate, execution time, and energy consumption. Namely, our proposed model can enhance the hit rate by 15% and reduce the execution time by 1.75 times. In addition, we can save energy consumption by around 48% by cutting the number of flushed pages to about an eighth of that in a conventional system.

Identification of Synthesis Approaches for IP/IC Piracy of Reversible Circuits

Reversible circuits employ a computational paradigm that is beneficial for several applications  including the design of encoding and decoding devices, low power design, and emerging applications such as in quantum computation. However, as for conventional logic, reversible circuits might be subject to Intellectual Property/Integrated Circuit piracy. In order to counteract such attacks, a detailed understanding of how to identify the target function of a reversible circuit is crucial. In contrast to conventional logic, the target function of the reversible circuit is (implicitly or explicitly) embedded into the circuit. Numerous synthesis solutions have been proposed for this purpose. In order to obtain the target function of a reversible circuit, it is crucial to know what synthesis approach has been used to generate the circuit in the first place. In this work, we propose a machine learning-based scheme to determine the respectively used reversible synthesis approach based on their telltale signs. Furthermore, we study the impact of optimizing the synthesis approaches on their telltale signs. Our analysis shows that the most-established synthesis approaches can be determined in the vast majority of cases even if optimized versions of them are applied. This motivates a thorough investigation on how to obfuscate corresponding designs.

Long Short-Term Memory Network Design for Analog Computing

We present an analog integrated circuit implementation of long short-term memory network, which is compatible with digital CMOS technology. We have used multiple input floating gate MOSFETs as both the front-end to obtain converted analog signals and the differential pairs in proposed analog multipliers. Analog crossbar is built by the analog multiplier processing matrix and bitwise multiplications. We have shown that using current signals as internal transmission signals can largely reduce computation delay compared to the digital implementation. We also have introduced analog blocks to work as activation functions for the algorithm. In the back-end of our design, we have used current comparators to achieve the output to be readable to external digital systems. We have designed the LSTM network with the matrix size of 16×16 in TSMC 180nm CMOS technology. The post-layout simulations show that the latency of one computing cycle is 1.19ns without memory, and power dissipation of the single analog LSTM computing core with 2 kilobytes SRAM at 200MHz is 460.3mW. The overhead of power dissipation due to SRAM access is 8.3%, in which the computing of each LSTM layer requires one clock cycle. The energy efficiency is 0.43GOPS/W.

GARDENIA: A Graph Processing Benchmark Suite for Next-generation Accelerators

This paper presents the Graph Analytics Repository for Designing Next-generation Accelerators (GARDENIA), a benchmark suite for studying irregular algorithms on massively parallel accelerators. Applications with limited control and data irregularity are the main focus of existing generic benchmarks for accelerators, while available graph analytics benchmarks do not apply state-of-the-art algorithms and/or optimization techniques. GARDENIA includes emerging irregular applications in big-data and machine learning domains which mimic massively multithreaded commercial programs running on modern large-scale datacenters. Our characterization shows that GARDENIA exhibits irregular microarchitectural behavior which is quite different from structured workloads and straightforward-implemented graph benchmarks.

Split Manufacturing Based Register Transfer Level Obfuscation

Fabrication-less integrated circuit (IC) design houses outsource fabrication to third party foundries to reduce cost of manufacturing. The outsourcing of IC fabrication, beyond our expectation, raises concerns regarding intellectual property (IP) piracy and theft by rogue elements in the third party foundries. Obfuscation techniques have been proposed to increase resistance to reverse engineering, IP recovery, IP theft and piracy. However, prior work on obfuscation for IP protection has primarily applied to the gate level or the layout level. As a result, it can significantly impact the performance of the original design in addition to requiring redesign of standard cells. In this paper, we propose a high level synthesis and analysis (HLSA) based obfuscation approach for IP protection. The proposed method is based on split manufacturing. Additional dummy units and MUXes can be added to further obfuscate the design. The proposed technique aligns with the standard-cell based design methodologies and does not significantly impact the performance of the original design. Our experimental results confirm that the proposed approach can provide high levels of IC obfuscation with moderate area cost.

Design and Multi-Abstraction Level Evaluation of a NoC Router for Mixed-Criticality Real-Time Systems

A Mixed Criticality System (MCS) combines real-time software tasks with different criticality levels. In a MCS, the criticality level specifies the level of assurance against system failure. For high-critical flows of messages, it is imperative to meet deadlines, otherwise the whole system might fail, leading to catastrophic results, like, loss of life or serious damage to the environment. In contrast, low-critical flows may tolerate some delays. Furthermore, in MCS, flow performances such as the Worst Case Communication Time (WCCT) may vary depending on the criticality level of the applications. Then, execution platforms must provide different operating modes for applications with different levels of criticality. To conclude, in Network-On-Chip (NoC), sharing resources between communication flows can lead to unpredictable latencies and subsequently turns the implementation of MCS in many-core architectures challenging. In this article, we propose and evaluate a new NoC router to support MCS based on an accurate WCCT analysis for high-critical flows. The proposed router, called \textbf{DAS} (\textbf{D}ouble \textbf{A}rbiter and \textbf{S}witching router), jointly uses {\it Wormhole} and {\it Store And Forward} communication techniques for low and high-critical flows respectively. It ensures that high-critical flows meet their deadlines while maximizing the bandwidth remaining for the low-critical flows.

BigBus: A Scalable Optical Interconnect

This paper presents BigBus, a novel design of an on-chip photonic network for a 1024 node system. For such a large on-chip network, performance and power reduction are two mutually conflicting goals. This paper uses a combination of strategies to reduce static power consumption while simultaneously improving both performance as well as the energy-delay 2 product. The crux of the paper is to segment the entire system into smaller clusters of nodes, and adopt a hybrid strategy for each segment that includes conventional laser modulation, as well as a novel technique for sharing power across nodes dynamically. We represent energy internally as tokens, where one token will allow a node to send a message to any other node in its cluster. We allow optical stations to arbitrate for tokens at a global level, and then we predict the number of token equivalents of power that the off-chip laser needs to generate. Using these techniques BigBus outperforms other competing proposals. We demonstrate a speedup of 14-34% over state of the art proposals and a 20-61% reduction in ED^2.

Guest Editors' Introduction: Emerging Networks-on-Chip Designs, Technologies, and Applications

Fault-Tolerant Network-on-Chip Design with Flexible Spare Core Placement

Network-on-Chip (NoC) has been proposed as a promising solution to overcome the communication challenges of System-on-Chip (SoC) design in nanoscale technologies. With the increased integration density of Intellectual Property (IP) cores in a single chip, heat dissipation increases which make the system unreliable. Therefore, efficient fault-tolerant methods are necessary at different levels to improve system performance and make it to operate normally. This paper presents a flexible spare core placement technique for mesh based NoC. An Integer Linear Programming (ILP) based solution has been proposed for the spare core placement problem. Also, Particle Swarm Optimization (PSO) based meta-heuristic has been proposed for the same. Experiments have been performed by taking several application benchmarks reported in the literature. Comparisons have been carried out using our approach and approach followed in the literature (i) by varying the network size with fixed fault percentage in the network, (ii) by fixing the network size while varying the percentage of faults in the network. We have also compared overall communication cost and CPU runtime between ILP and PSO approaches. The results show significant reductions in overall communication cost, dynamic simulation results across all the cases using our approach over the approaches reported in the literature.

Thermal-aware Test Scheduling Strategy for Network-on-Chip based Systems

Rapid progress in technology scaling makes transistors smaller and faster over successive generations, and consequently core count in a system gets increased, in which a flexible and scalable packet-switched architecture---Network-on-Chip (NoC)---is commonly used for communication among the cores. To test such system, NoC is reused as a test delivery mechanism. This work proposes a preemptive test scheduling technique for NoC based system to reduce the testtime by minimizing the network resource conflicts. The preemptive test scheduling problem has been formulated using Integer Linear Programming (ILP). Thermal safety during testing is an utmost challenging problem, particularly for three-dimensional NoC (3D NoC). In this paper, authors have also presented a thermal-aware scheduling technique to test cores in 2D as well as 3D stacked NoC, using a Particle Swarm Optimization (PSO) based approach. To reduce testtime further, several innovative augmentations, such as Inversion Mutation, efficient random number generation and multiple PSO operations, have been incorporated in the basic PSO. Experimental results highlight the effectiveness of the proposed method in reducing testtime under power constraints and achieve a tradeoff between testtime and peak temperature.

PANE : Pluggable Asynchronous Network-on-Chip Simulator

Communication between different IP cores in MPSoCs and HMPs often results in clock domain crossing. Asynchronous network on chip (NoC) can supports communication in such heterogeneous set-ups. While there are a large number of tools to model NoCs for synchronous systems, there is very limited tool support to model communication for multi-clock domain NoCs and analyze them. In this paper, we propose \textbf{PANE} :Pluggable Asynchronous NEtwork on Chip simulator, that allows system level simulation of asynchronous network on chip (NoC). PANE allows exploration of synchronous, asynchronous and mixed synchronous-asynchronous(heterogeneous) design space for system level NoC parameters such as packet latencies, throughput, network saturation point. It also supports a large range of NoC configurations for both synthetic and real traffic patterns. In this paper, we also demonstrate the application of PANE by using synchronous routers, asynchronous routers and a mix of asynchronous and synchronous routers. One of the key advantages of PANE is that it allows a seamless transition from synchronous to asynchronous NoC simulators while keeping pace with the developments in synchronous NoC tools as they can be integrated with PANE.

Time-randomized Wormhole NoCs for Critical Applications

Wormhole-based NoCs (wNoCs) are widely accepted in high-performance domains as the most appropriate solution to interconnect an increasing number of cores in the chip. However, wNoCs suitability in the context of critical real-time applications has not been demonstrated yet. In this paper, in the context of probabilistic timing analysis (PTA), we propose a PTA-compatible wNoC design that provides tight time-composable contention bounds. The proposed wNoC design builds on PTA ability to reason in probabilistic terms about hardware events impacting execution time (e.g. wNoC contention), discarding those sequences of events occurring with a negligible low probability. This allows our wNoC design to deliver improved guaranteed performance w.r.t. conventional time-deterministic setups. Our results show that performance guarantees of applications running on top of probabilistic wNoC designs improve by 40\% and 93\% on average for 4x4 and 6x6 wNoC setups, respectively.

Neural Network Classifiers using a Hardware-based Approximate Activation Function with a Hybrid Stochastic Multiplier

Neural networks are becoming prevalent in many areas, such as pattern recognition and medical diagnosis. Stochastic computing is one potential solution for neural networks implemented in low-power back-end devices such as solar-powered devices and Internet-of-things devices. In this paper, we investigate a new architecture of stochastic neural networks with a hardware-oriented approximate activation function. The new proposed approximate activation function can be hidden in the proposed architecture and thus reduce the whole hardware cost. Additionally, to further reduce the hardware cost of the stochastic implementation, a new hybrid stochastic multiplier is proposed. It contains OR gates and binary parallel counter, which aims to reduce the number of inputs of binary parallel counter. The experimental results indicate the new proposed approximate architecture without hybrid stochastic multipliers achieves more than 25%, 60% and 3x reduction compared to previous stochastic neural networks, and more than 30x, 30x and 52% reduction compared to conventional binary neural networks, in terms of area, power and energy, respectively, while maintaining the similar error rates compared to the conventional neural networks. Furthermore, the stochastic implementation with hybrid stochastic multipliers further reduces area about 18% to 80%, power from 15% - 113.1% and energy about 15% - 131%, respectively.

Limit of Hardware Solutions for Self-Protecting Fault-Tolerant NoCs

We study the ultimate limits of hardware solutions for the self-protection strategies against permanent faults in networks on chips (NoCs). NoCs reliability is improved by replacing each base router by an augmented router which includes extra protection circuitry. We compare the protection achieved by the self-test and self-protect (STAP) architectures to that of triple modular redundancy with voting (TMR). In practice, none of the considered architectures (STAP or TMR) can tolerate all the permanent faults, especially faults in the extra-circuitry for protection or voting, and consequently, there will always be some unidentified defective augmented routers which are going to transmit errors in an unpredictable manner. Specifically, we study and determine the average percentage of unidentified defective routers (UDRs) and their impact on the overall reliability of the NoC in light of self-protection strategies. Our study shows that TMR is the most efficient solution to limit the average percentage of UDRs when there are typically less than a 0.1 percent of defective base routers. Above 1% of defective base routers, the STAP approaches are more efficient although the protection efficiency decreases inexorably in the very defective technologies (e.g. when there is 10% or more of defective base routers).

All ACM Journals | See Full Journal Index

Search JETC
enter search term and/or author name