As computer architecture continues to expand beyond software-agnostic microarchitecture to specialized and heterogeneous logic, and even to radically different emerging computing models (e.g., quantum cores, DNA storage units), detailed cycle-level simulation can no longer be presupposed. Exploring designs under such complex, mutually interacting relationships (e.g., performance, energy, thermal, frequency) calls for a more integrative, higher-level approach. We propose Charm, a modeling language supporting Closed-form High-level ARchitecture Modeling. Charm enables mathematical representations of mutually dependent architectural relationships to be specified, composed, checked, evaluated, reused, and shared. The language is interpreted through a combination of automatic symbolic evaluation, scalable graph transformation, and efficient compiler techniques, generating executable DAGs and optimized analysis procedures. Charm also exploits advances in satisfiability modulo theories (SMT) solvers to automatically search the design space when architects explore multiple design knobs simultaneously (e.g., CNN tiling configurations specified by four parameters). Through two case studies, we demonstrate that Charm allows one to define high-level architecture models in a clean, concise format, maximize reusability and shareability, catch unreasonable assumptions, and significantly ease design space exploration at a high level.
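The core idea of closed-form high-level modeling can be illustrated with a minimal sketch (this is not Charm itself; the variable names and equations are illustrative assumptions): named closed-form relations are composed into a dependency DAG and evaluated in topological order.

```python
# Each model variable maps to (dependencies, closed-form function).
# All names and constants below are assumed for illustration only.
model = {
    "cycles": ((), lambda v: 1.0e9),                  # assumed workload size
    "freq":   ((), lambda v: 2.0e9),                  # clock in Hz (assumed)
    "power":  ((), lambda v: 15.0),                   # watts (assumed)
    "time":   (("cycles", "freq"), lambda v: v["cycles"] / v["freq"]),
    "energy": (("power", "time"), lambda v: v["power"] * v["time"]),
}

def evaluate(model):
    """Evaluate the DAG of closed-form relations in dependency order."""
    values, visiting = {}, set()
    def visit(name):
        if name in values:
            return values[name]
        if name in visiting:                 # a model must stay acyclic
            raise ValueError(f"cyclic dependency at {name}")
        visiting.add(name)
        deps, fn = model[name]
        for d in deps:
            visit(d)
        values[name] = fn(values)
        visiting.discard(name)
        return values[name]
    for name in model:
        visit(name)
    return values

vals = evaluate(model)        # time = 0.5 s, energy = 7.5 J
```

Because every relation is a pure closed-form expression, the whole model can be re-evaluated cheaply for each point of a design-space sweep.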
Introduction to the Special Issue on HALO for Energy Constrained On-Chip Machine Learning
Application and Thermal Reliability-Aware Reinforcement Learning Based Multi-Core Power Management
For efficient placement of data in flat-address heterogeneous memory systems consisting of fast (e.g., 3D-DRAM) and slow (e.g., NVM) memories, we present a hardware-based page migration technique. Unlike epoch-based approaches that migrate heavily accessed ("hot") pages from slow to fast memory at each epoch interval, we migrate a page immediately when it becomes hot ("on the fly"), using hardware in a user-transparent manner and with minimal OS intervention. Managing physical addresses after page relocation is cumbersome and normally requires costly OS intervention; instead, we use a small hardware remap table to track the new physical addresses of migrated pages. This limits address reconciliation to the periodic eviction of old remap entries, for which we propose a hardware-orchestrated, lightweight address reconciliation process. For our studied heterogeneous memory system, on-the-fly page migration with hardware-assisted address reconciliation provides 74% and 24% IPC improvements, on average, for a set of SPEC CPU2006 workloads when compared to a baseline without any page migration and to a system with on-the-fly page migration using OS-based address reconciliation, respectively. Furthermore, we present an analytical model that classifies applications as page-migration friendly (showing performance gains from page migration) or unfriendly based on their memory access behavior.
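The remap-table mechanism described above can be sketched in a few lines. This is a hedged software model, not the paper's hardware design: the hotness threshold, table size, and FIFO eviction policy are all illustrative assumptions.

```python
from collections import OrderedDict

HOT_THRESHOLD = 4           # accesses before a page is considered hot (assumed)
REMAP_ENTRIES = 2           # tiny remap table, for illustration only

remap = OrderedDict()                      # slow page -> fast-memory frame
free_frames = list(range(REMAP_ENTRIES))   # unused fast-memory frames
access_count = {}
reconciled = []             # pages whose new address was written back to the OS

def translate(page):
    """Return (memory, frame) for `page`, migrating it once it turns hot."""
    if page in remap:                        # already migrated: remap-table hit
        return ("fast", remap[page])
    access_count[page] = access_count.get(page, 0) + 1
    if access_count[page] >= HOT_THRESHOLD:
        if not free_frames:                  # evict the oldest remap entry and
            old, frame = remap.popitem(last=False)
            reconciled.append(old)           # defer its address reconciliation
            free_frames.append(frame)
        remap[page] = free_frames.pop()      # migrate on the fly
        return ("fast", remap[page])
    return ("slow", page)
```

Note how reconciliation happens only on eviction: as long as a page sits in the remap table, the OS page table is left untouched.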
Quantum information processing and communication techniques rely heavily upon entangled quantum states, motivating the development of methods and systems to generate entanglement. Much research has been dedicated to entangling radix-2 qubits, resulting in the Bell state generator and its generalized forms where the number of entangled qubits is greater than two; until now, however, higher-radix quantum entanglement has been largely overlooked. In this work, techniques for quantum state entanglement in high-dimensional systems are described. These higher-dimensional quantum informatic systems comprise n quantum digits, or qudits, mathematically characterized as elements of an r-dimensional Hilbert vector space with r > 2. Consequently, the wavefunction is a time-dependent state vector of dimension r^n. Theoretical analyses and specific higher-radix entanglement generators are discussed.
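The dimension claim and a radix-r analogue of the Bell state can be made concrete with a short sketch. This is an assumed illustration, not the paper's generator circuits: the two-qudit state (1/sqrt(r)) * sum_k |k>|k> is the natural generalization of the Bell state used here for demonstration.

```python
import math

def state_dimension(r, n):
    """Hilbert-space dimension of n radix-r qudits: r**n."""
    return r ** n

def bell_like_state(r):
    """Amplitude vector of (1/sqrt(r)) * sum_k |k>|k> for two qudits."""
    dim = state_dimension(r, 2)
    amp = [0.0] * dim
    for k in range(r):
        amp[k * r + k] = 1.0 / math.sqrt(r)   # basis index of |kk> is k*r + k
    return amp

# Two qutrits (r = 3) live in a 9-dimensional space; the entangled state
# puts equal weight on |00>, |11>, and |22>.
psi = bell_like_state(3)
```

Measuring either qudit of this state determines the other, exactly as in the radix-2 Bell case.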
Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because peak memory bandwidth improves only slowly, memory has become a bottleneck for both performance and energy efficiency in GPUs. In this work, we propose an integrated architectural scheme that optimizes memory accesses and thereby boosts the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and binds each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches each thread batch to an SM to ensure highly parallel memory-access streaming. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve memory access locality and reduce contention on memory controllers and interconnection networks. Experimental results show that the integrated TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference induced by CPU applications in a heterogeneous CPU+GPU system running our proposed schemes. Our results show that the proposed solution ensures the execution efficiency of GPU applications with negligible performance degradation for CPU applications.
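The grouping step of TEMP can be sketched abstractly (a rough assumption of the idea, not the paper's hardware implementation): thread blocks touching the same page set are collected into one batch, which can then be dispatched to a single SM bound to the banks holding those pages.

```python
def form_batches(block_pages):
    """Map each distinct page set to the thread blocks sharing it.

    block_pages: dict of thread-block id -> set of page numbers it accesses
    (assumed to be known; in practice this comes from the memory allocator).
    """
    batches = {}
    for block, pages in block_pages.items():
        batches.setdefault(frozenset(pages), []).append(block)
    return batches

# Blocks 0 and 2 share pages {4, 5} and form one batch; block 1 touches
# page 9 alone and forms its own batch.
batches = form_batches({0: {4, 5}, 1: {9}, 2: {5, 4}})
```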
Neural networks have become one of the most successful machine learning algorithms, playing a key role in enabling machine vision and speech recognition. However, their deployment, particularly within energy-constrained, embedded environments, remains limited due to their computational complexity and equally steep memory requirements. To address this, customized and heterogeneous hardware architectures have emerged together with co-designed algorithms. The spectrum of design options is vast, and system-level designers currently have no good way to compare the variety of hardware, algorithm, and optimization choices. While numerous benchmarking efforts are emerging in this field, none of them support essential algorithmic optimizations such as quantization, or specialized heterogeneous hardware architectures. We propose a new benchmark suite, QuTiBench, to address this need. QuTiBench is a novel multi-tiered benchmarking methodology that helps system developers understand the benefits and limitations of these novel compute architectures with regard to specific neural networks, and it will help drive future innovation. We invite the community to contribute to QuTiBench so that it can support the full spectrum of choices in implementing machine learning systems, from cloud to IoT.
Recently, the development of deep learning has propelled the sheer growth of applications on lightweight embedded and mobile systems. However, the limited computation resources and power delivery capability of embedded platforms are a significant bottleneck that prevents these systems from providing real-time deep learning ability, since the inference of deep convolutional neural networks involves large quantities of parameters and operations. In particular, providing QoS-guaranteed neural network inference in the multi-task execution environment of multi-core SoCs is even more complicated. In this article, we present a novel deep neural network architecture, MV-Net, which provides performance elasticity and contention-aware self-scheduling ability for QoS enhancement in mobile systems. When the system's QoS constraints, output accuracy requirements, or resource contention status change, MV-Net dynamically reconfigures the corresponding neural network propagation paths and achieves an effective trade-off between network computational complexity and prediction accuracy via approximate computing. The experimental results show that (i) MV-Net significantly improves the performance flexibility of current CNN models and makes it possible to provide always-guaranteed QoS in a multi-task environment, and (ii) it satisfies QoR constraints while significantly outperforming the baseline implementation and improving system energy efficiency at the same time.
Spiking neural networks (SNNs) are artificial neural network models that more closely mimic biological neural networks. In addition to neuronal and synaptic state, SNNs incorporate time into their computational model. Since each neuron in these networks is connected to thousands of others, high bandwidth is required. Moreover, since spike times are used to encode information in SNNs, very low communication latency is also required. The 2D NoC has been used to provide a scalable interconnection fabric in large-scale parallel SNN systems, and 3D ICs have also attracted considerable attention as a potential solution to the interconnect bottleneck. The combination of these two emerging technologies offers a new horizon for IC design, meeting the stringent low-power and small-footprint requirements of emerging AI applications. In this work, we first present an efficient mathematical model to analyze the performance of different neural network topologies. Second, we present an architecture and two low-latency spike routing algorithms, named K-means based multicast routing (KMCR) and shortest-path K-means based multicast routing (SP-KMCR), for a three-dimensional NoC of spiking neurons (3DNoC-SNN). The proposed system was validated with an RTL-level implementation, and area/power analysis was performed using 45-nm CMOS technology.
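The clustering idea behind K-means based multicast routing can be sketched as follows. This is a hedged illustration of the principle, with made-up coordinates and a plain K-means; the paper's actual routing rules and tie-breaking are not reproduced: spike destinations in the 3D NoC are clustered, and one multicast copy is steered toward each cluster instead of unicasting to every destination.

```python
def kmeans(points, k, iters=10):
    """Plain K-means on 3D NoC router coordinates (deterministic seeding)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each destination to its nearest centroid (squared dist).
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Spike destinations in (x, y, z) router coordinates; two obvious groups,
# so one multicast branch serves each cluster.
dests = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (6, 6, 2), (7, 6, 2), (6, 7, 2)]
centroids, clusters = kmeans(dests, k=2)
```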
Recent advancement of microelectrode-dot-array (MEDA) based architectures for digital microfluidic biochips has enabled a major enhancement of microfluidic operations for traditional lab-on-chip devices. One critical issue for MEDA-based biochips is droplet transportation, as MEDA allows dynamic routing of droplets of different sizes. In this paper, we propose a high-performance droplet routing technique for MEDA-based digital microfluidic biochips. First, we introduce the basic droplet movement strategy for MEDA-based designs, together with a definition of strictly shielded zones within the MEDA layout. Next, we propose droplet transportation schemes for the MEDA architecture under different blockage and crossover conditions, and estimate route distances for each net offline. Finally, we propose a priority-based routing strategy that combines the transportation schemes stated earlier. Concurrent movement of the droplets is scheduled in a time-multiplexed manner, which poses critical challenges for the parallel routing of individual droplets with optimal sharing of cells, formulating a routing problem of higher complexity. The final compaction solution satisfies the timing constraint and improves fault tolerance. Simulations are carried out on standard benchmark circuits, namely Benchmark Suite I and Benchmark Suite III. Experimental results show satisfactory improvements and demonstrate a high degree of robustness for our proposed algorithm.
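The offline distance-estimation and prioritization step can be sketched simply. The details here (Manhattan distance as the estimate, longest-first ordering, and the net data) are illustrative assumptions, not the paper's exact priority function.

```python
def manhattan(src, dst):
    """Estimated route length between two cells on the MEDA grid."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def prioritize(nets):
    """Order nets by estimated route distance, longest first, so hard,
    long droplet paths are committed before short ones."""
    return sorted(nets, key=lambda n: manhattan(n["src"], n["dst"]),
                  reverse=True)

nets = [
    {"id": "n1", "src": (0, 0), "dst": (2, 1)},   # estimated distance 3
    {"id": "n2", "src": (5, 5), "dst": (0, 0)},   # estimated distance 10
    {"id": "n3", "src": (1, 1), "dst": (1, 5)},   # estimated distance 4
]
order = [n["id"] for n in prioritize(nets)]
```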
High static power consumption is widely regarded as one of the largest bottlenecks in creating scalable optical NoCs. The standard techniques to reduce static power are based on sharing optical channels and modulating the laser. We show in this paper that state-of-the-art techniques in these areas are suboptimal and leave significant room for further improvement. We propose two novel techniques, a neural network based method that modulates the laser by predicting optical traffic, and a distributed, altruistic algorithm for channel sharing, that come significantly closer to a theoretically ideal scheme. Even so, a lot of laser power is still wasted; we propose to reuse this energy by efficiently recirculating it to heat micro-ring resonators (achieving thermal tuning). Together, these three methods significantly reduce energy requirements. Our design consumes 4.7× less laser power than other state-of-the-art proposals. In addition, it yields a 31% improvement in performance and a 39% reduction in ED^2 for a suite of Splash2 and Parsec benchmarks.
Recently, iris segmentation using fully convolutional networks (FCNs) has shown promising advances. For embedded systems, a significant challenge is the computationally demanding nature of the proposed FCN architectures. Moreover, no existing study demonstrates the design, implementation, and evaluation of a complete iris recognition pipeline with FCN-based segmentation. Targeting an embedded platform, we propose a resource-efficient end-to-end iris recognition flow consisting of FCN-based segmentation and contour fitting, followed by Daugman normalization and encoding. To obtain accurate and efficient FCN architectures, we introduce a SW/HW co-design methodology in which we propose multiple novel FCNs and construct a Pareto plot of their segmentation performance and computational overheads. We then select the most efficient set of models and incorporate each into the end-to-end flow to evaluate its true recognition performance. Compared to previous works, our FCN architectures require 50× fewer FLOPs per inference while setting a new state of the art in segmentation accuracy. The recognition rates of our end-to-end pipelines also outperform the previous state of the art on the two datasets evaluated. We demonstrate a SW/HW co-design realization of our flow on an embedded FPGA platform. Compared to an embedded CPU, our hardware acceleration achieves up to 6.6× speedup for the overall pipeline while using less than 20% of the available FPGA resources.
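The Pareto-based model selection described above boils down to keeping the non-dominated candidates. A minimal sketch, with made-up model names and (FLOPs, error) numbers for illustration:

```python
def pareto_front(models):
    """Keep models no other model dominates (lower-or-equal FLOPs AND
    lower-or-equal error, and strictly better in at least one axis)."""
    front = []
    for name, flops, err in models:
        dominated = any(f <= flops and e <= err and (f, e) != (flops, err)
                        for _, f, e in models)
        if not dominated:
            front.append(name)
    return front

# (name, GFLOPs per inference, segmentation error) -- illustrative values.
candidates = [
    ("fcn-a", 120, 0.9),   # dominated by fcn-b
    ("fcn-b", 100, 0.8),
    ("fcn-c", 40, 1.5),
    ("fcn-d", 60, 1.6),    # dominated by fcn-c
]
```

Only the models on the front are worth carrying into the end-to-end recognition evaluation; the others are strictly worse on both axes.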
Deep neural networks (DNNs) have become the go-to answer for a range of applications, but they often demand computation- and memory-intensive solutions, which makes it challenging to deploy DNNs in embedded, real-time, low-power applications. We first describe a domain-specific metric model for optimizing task deployment on heterogeneous systems. Next, we propose the DNN hardware accelerator SCALENet: a SCalable Low power AccELerator for real-time deep neural Networks. Finally, we propose a heterogeneous-aware scheduler that uses these contributions to allocate each task to a resource by solving a numerical cost over a series of domain objectives. To demonstrate our contribution, we deploy nine modern deep networks across two applications: image processing and biomedical seizure detection. Our proposed solution meets the computational requirements, adapts to multiple architectures, and lowers power by optimizing task-to-resource allocation. It decreases power consumption by 10% of the total system power while meeting the real-time deadlines. We exceed the energy efficiency of NVIDIA GPUs when evaluated against the Jetson TK1, with 4× power savings within a power envelope of 2.0 W. Compared to existing FPGA-based accelerators, SCALENet's accelerator and heterogeneous-aware scheduler achieve a 4.8× improvement in energy efficiency.
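The scheduling idea, a scalar cost from a weighted sum of domain objectives per candidate resource, can be sketched briefly. Weights, resources, and metric values below are illustrative assumptions, not SCALENet's actual numbers.

```python
def allocate(task_costs, weights):
    """Pick the resource minimizing the weighted sum of domain objectives."""
    def cost(metrics):
        return sum(weights[k] * metrics[k] for k in weights)
    return min(task_costs, key=lambda r: cost(task_costs[r]))

# Per-resource estimates for one DNN task: latency (ms) and power (W).
task_costs = {
    "cpu":  {"latency": 40.0, "power": 3.0},
    "gpu":  {"latency": 8.0,  "power": 9.0},
    "fpga": {"latency": 12.0, "power": 2.0},
}

# A power-constrained deployment weighs power heavily and lands on the
# FPGA; a latency-heavy weighting would pick the GPU instead.
weights = {"latency": 1.0, "power": 2.0}
best = allocate(task_costs, weights)
```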
Hardware implementations of artificial neural networks (ANNs) using conventional binary arithmetic units are computationally expensive, energy-intensive, and have large area overheads. Stochastic computing (SC) is an emerging paradigm that replaces these conventional units with simple logic circuits and is particularly suitable for fault-tolerant applications. We propose an energy-efficient use of magnetic tunnel junctions (MTJs), spintronic devices that exhibit probabilistic switching behavior, as stochastic number generators (SNGs), which form the basis of our NN implementation in the SC domain. Further, the error resilience of target NN applications allows us to approximate the synaptic weights in our MTJ-based NN implementation, exploiting properties of the MTJ-SNG, to achieve energy efficiency. We design an algorithm that, given an error tolerance, performs such approximations optimally in a single-layer NN, owing to the convexity of the problem formulation. We then use this algorithm to develop a heuristic approach for approximating multi-layer NNs. Classification problems were evaluated on the optimized NNs, and the results show substantial energy savings with little loss in accuracy.
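The SC principle the paper builds on is easy to demonstrate: a value p in [0, 1] is encoded as a random bitstream whose fraction of 1s is p, and multiplication reduces to a bitwise AND of two independent streams. In this sketch a seeded software RNG stands in for the MTJ randomness source; that substitution, and the stream length, are assumptions for demonstration only.

```python
import random

def to_stream(p, length, rng):
    """Encode probability p as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_stream(bits):
    """Decode a stream back to a probability estimate."""
    return sum(bits) / len(bits)

def sc_multiply(a, b, length=10000, seed=0):
    """Approximate a*b with a single AND gate per bit pair."""
    rng = random.Random(seed)
    sa = to_stream(a, length, rng)
    sb = to_stream(b, length, rng)   # independent draws from the same RNG
    return from_stream([x & y for x, y in zip(sa, sb)])

est = sc_multiply(0.5, 0.5)   # close to 0.25, up to stochastic noise
```

A multiplier collapses to one AND gate, which is why SC hardware is so small, at the cost of long streams and stochastic error, which is exactly the error resilience the weight-approximation algorithm exploits.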
Racetrack memory (RM), a storage scheme in which information flows along a nanotrack, has been considered a potential candidate to replace the hard disk drive (HDD) as a future high-density storage device. The first RM technology, proposed by IBM in 2008, relies on a train of opposite magnetic domains separated by domain walls (DWs), and is named DW-RM. After ten years of intensive research, a variety of fundamental advances have been achieved; unfortunately, no product is available to date. Meanwhile, new concepts are on the horizon. Recently, an alternative information carrier, the magnetic skyrmion, experimentally discovered in 2009, has been regarded as a promising replacement for the DW in RM, yielding skyrmion-based RM (SK-RM). Impressive advances have been made in observing, writing, manipulating, and deleting individual skyrmions. So what is the relationship between the DW and the skyrmion? What are the key differences between DW and skyrmion, or between DW-RM and SK-RM? What benefits could SK-RM bring, and what challenges need to be addressed before application? In this paper, we intend to answer these questions through a comparative cross-layer study of DW-RM and SK-RM. This work provides guidelines for RM research, especially for circuit and architecture researchers.
With the rise of the Internet of Things (IoT), devices such as smartphones, embedded medical devices, and smart home appliances, as well as traditional computing platforms such as personal computers and servers, have been increasingly targeted by a variety of cyber attacks. Due to the limited hardware resources of embedded devices and the difficulty of wide-coverage, on-time software updates, software-only cyber defense techniques, such as traditional anti-virus and malware detectors, do not offer a silver-bullet solution. Hardware-based security monitoring and protection techniques have therefore gained significant attention. Monitoring devices using side-channel leakage information, e.g., power supply variation and electromagnetic (EM) radiation, is a promising avenue that opens multiple directions in security and trust applications. In this paper, we provide a taxonomy of hardware-based monitoring techniques against different cyber and hardware attacks, highlight their potential and unique challenges, and show how power-based side-channel instruction-level monitoring can offer suitable solutions to prevailing embedded device security issues. Further, we delineate approaches for future research directions.
There are extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today's complementary metal-oxide-semiconductor (CMOS) technology. A common feature among the investigated technologies is multi-value devices, in particular the possibility of implementing quaternary logic and memory. However, for such multi-value devices to operate reliably, increased energy dissipation and operation time are required. Building on the principle of approximate computing, we present a set of circuits and memory based on multi-value devices in which reliability can be traded against energy efficiency. Keeping the energy and time constraints constant, important data are encoded in a more robust binary format while error-tolerant data are encoded in a quaternary format. We evaluate the potential benefit of the circuits and memory by embedding them in a conventional computer system on which we execute jpeg and sobel approximately. We achieve dynamic energy reductions of 10% and 13% for jpeg and sobel, respectively, and improve execution time by 38% for sobel, while maintaining adequate output quality.
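The binary/quaternary trade-off described above can be sketched in software. All numbers here are illustrative assumptions: a quaternary cell stores two bits, halving the device count for error-tolerant data, while important data stays in one robust binary cell per bit.

```python
def to_quaternary(value, ncells):
    """Pack an integer into base-4 digits, least significant digit first."""
    digits = []
    for _ in range(ncells):
        digits.append(value & 0b11)    # each quaternary digit holds 2 bits
        value >>= 2
    return digits

def from_quaternary(digits):
    """Reassemble the integer from its base-4 digits."""
    return sum(d << (2 * i) for i, d in enumerate(digits))

def cells_needed(nbits, quaternary):
    """Device count: 1 cell per bit in binary, 1 cell per 2 bits in base 4."""
    return (nbits + 1) // 2 if quaternary else nbits

# An 8-bit error-tolerant pixel fits in 4 quaternary cells instead of 8
# binary ones; a single upset in one cell perturbs only 2 of its bits.
assert from_quaternary(to_quaternary(0xB6, 4)) == 0xB6
```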