Introduction to the Special Issue on HALO for Energy Constrained On-Chip Machine Learning
Determining the optimal microarchitecture configuration of a processor at the early stages of design is a challenge. Application-specific Design Space Exploration (DSE) is even more difficult, since the properties of the application must be considered throughout the DSE process. Improving the speed and precision of DSE therefore remains a particular challenge in microprocessor design. In this paper, we propose a novel processor DSE methodology based on criticality and sensitivity analysis, named CSMO-DSE (Criticality and Sensitivity-based Multi-Objective DSE). In our methodology, the criticality of the processor's performance events is obtained through critical path analysis of the dependence graph, which is derived from the profile generated by running a program on an instrumented cycle-accurate microprocessor simulator. The sensitivity of microarchitecture parameters to various performance events is analyzed with a factorial experiment design method, namely the Plackett and Burman (P&B) design. The criticality and sensitivity information is then used to form an optimization matrix, which drives the performance, power/area, and energy efficiency optimization algorithms of the CSMO-DSE methodology. Through experiments with SPEC 2006 benchmark programs, we find that our CSMO-DSE methodology is 4.73x faster than the baseline methodology.
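As a rough illustration of the Plackett and Burman screening step mentioned above, the following sketch builds the standard 12-run, 11-factor P&B design by cyclically shifting its generator row and estimates main effects from a set of responses. The function names and the synthetic response are assumptions made for illustration; in a DSE setting the factors would be microarchitecture parameters at low/high levels and the responses would come from simulation, and none of this is taken from the CSMO-DSE implementation itself.

```python
# Standard generator row for the 12-run Plackett-Burman design (11 two-level factors).
GENERATOR_12 = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]


def plackett_burman_12():
    """Return the 12 x 11 design: 11 cyclic shifts of the generator plus an all-low run."""
    n = len(GENERATOR_12)
    rows = [GENERATOR_12[-i:] + GENERATOR_12[:-i] for i in range(n)]  # rotate right by i
    rows.append([-1] * n)  # final run sets every factor to its low level
    return rows


def main_effects(design, responses):
    """Estimate each factor's main effect: mean response at high minus mean at low."""
    half = len(design) / 2  # each factor is high in exactly half of the runs
    factors = len(design[0])
    return [
        sum(row[j] * y for row, y in zip(design, responses)) / half
        for j in range(factors)
    ]


if __name__ == "__main__":
    design = plackett_burman_12()
    # Hypothetical response: only factors 0 and 4 matter (stand-in for simulator output).
    responses = [10 + 3 * run[0] - 2 * run[4] for run in design]
    for j, effect in enumerate(main_effects(design, responses)):
        print(f"factor {j}: effect {effect:+.1f}")
```

Because the columns of the design are balanced and mutually orthogonal, 12 runs are enough to rank 11 parameters by sensitivity, which is what makes this kind of design attractive when every run is an expensive simulation.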
For efficient placement of data in flat-address heterogeneous memory systems consisting of fast (e.g., 3D-DRAM) and slow memories (e.g., NVM), we present a hardware-based page migration technique. Unlike epoch-based approaches that migrate heavily accessed ("hot") pages from slow to fast memories at each epoch interval, we migrate a page immediately when it becomes hot ("on-the-fly"), using hardware in a user-transparent manner and with minimal OS intervention. Managing the physical addresses that change due to page relocation is cumbersome and requires costly OS intervention. We use a small hardware remap table to keep track of the new physical addresses of the migrated pages. This limits address reconciliation to the periodic evictions of old remap entries. We also propose a hardware-orchestrated, lightweight address reconciliation process. For our studied heterogeneous memory system, on-the-fly page migration with hardware-assisted address reconciliation provides average IPC improvements of 74% and 24% for a set of SPEC CPU2006 workloads when compared to a baseline without any page migration and to a system with on-the-fly page migration using OS-based address reconciliation, respectively. Furthermore, we present an analytical model for classifying applications as page migration friendly (applications that show performance gains from page migration) or unfriendly, based on memory access behavior.
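The interplay between on-the-fly migration and a small remap table can be pictured with a toy software model. Everything in the sketch below is an assumption made for illustration: the class name, the access-count threshold used to decide that a page is hot, the oldest-first eviction policy, and the way reconciliation is represented as moving an entry into a dictionary. It is not the hardware design described in the abstract.

```python
from collections import OrderedDict


class OnTheFlyMigrator:
    """Toy model: migrate a page as soon as it turns hot and track it in a remap table."""

    def __init__(self, hot_threshold=4, remap_capacity=2):
        self.hot_threshold = hot_threshold    # accesses before a page counts as hot (assumed)
        self.remap_capacity = remap_capacity  # entries in the small remap table (assumed)
        self.access_counts = {}               # per-page counters for pages still in slow memory
        self.remap = OrderedDict()            # slow page -> fast frame, oldest entry first
        self.reconciled = {}                  # pages whose new address has been reconciled
        self.next_fast_frame = 0

    def access(self, page):
        """Return which memory serves this access, migrating the page if it becomes hot."""
        if page in self.remap or page in self.reconciled:
            return "fast"
        count = self.access_counts.get(page, 0) + 1
        self.access_counts[page] = count
        if count >= self.hot_threshold:
            self._migrate(page)  # takes effect for the next access
        return "slow"

    def _migrate(self, page):
        # Address reconciliation happens only when an old remap entry must be evicted.
        if len(self.remap) >= self.remap_capacity:
            old_page, frame = self.remap.popitem(last=False)
            self.reconciled[old_page] = frame
        self.remap[page] = self.next_fast_frame
        self.next_fast_frame += 1
        del self.access_counts[page]


if __name__ == "__main__":
    migrator = OnTheFlyMigrator(hot_threshold=2, remap_capacity=1)
    for page in [7, 7, 7, 9, 9, 7]:
        print(page, migrator.access(page))
```

The point of the model is only that reconciliation work is deferred to remap evictions rather than paid on every migration.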
Technological and architectural improvements have been constantly required to sustain the demand for faster and cheaper computers. However, CMOS down-scaling is suffering from three technology walls: the leakage wall, the reliability wall, and the cost wall. On top of that, the performance increase due to architectural improvements is also gradually saturating due to three well-known architecture walls: the memory wall, the power wall, and the instruction-level parallelism (ILP) wall. Hence, a lot of research is focusing on proposing and developing new technologies and architectures. In this paper, we present a comprehensive classification of memory-centric computing architectures based on three metrics: computation location, level of parallelism, and memory technology used. The classification not only provides an overview of existing architectures with their pros and cons, but also unifies the terminology that uniquely identifies these architectures and highlights potential future architectures that can be further explored, thereby setting a direction for future research in the field.
Neural networks have become one of the most successful machine learning algorithms, playing a key role in enabling machine vision and speech recognition. However, their deployment, in particular within energy-constrained embedded environments, remains limited due to their computational complexity and equally steep memory requirements. To address this, customized and heterogeneous hardware architectures have emerged with co-designed algorithms. The spectrum of design options is vast. For system-level designers, there is currently no good way to compare the variety of hardware, algorithm, and optimization choices. While there are numerous benchmarking efforts emerging in this field, none of them support essential algorithmic optimizations such as quantization, or specialized heterogeneous hardware architectures. We propose a new benchmark suite, QuTiBench, to address this need. QuTiBench is a novel multi-tiered benchmarking methodology that helps system developers understand the benefits and limitations of these novel compute architectures with regard to specific neural networks and will help drive future innovation. We invite the community to contribute to QuTiBench in order to support the full spectrum of choices in implementing machine learning systems, from Cloud to IoT.
Energy harvesting is an attractive way to power future IoT devices since it can eliminate the need for batteries or power cables. However, harvested energy is intrinsically unstable. While FPGAs have been widely adopted in various embedded systems, it is hard for them to survive unstable power since all the memory components in an FPGA are based on volatile SRAM. Emerging non-volatile memory based FPGAs offer promising potential to keep configuration data on the chip during power outages. However, few works have considered implementing efficient runtime checkpointing of intermediate data on non-volatile FPGAs. To realize accumulative computation under intermittent power on FPGAs, this paper proposes a low-cost design framework, Data-Flow-Tracking FPGA (DFT-FPGA), which utilizes binary counters to track intermediate data flow. Instead of keeping all on-chip intermediate data, DFT-FPGA targets only the necessary data that is labeled by off-line analysis and identified by an online tracking system. The evaluation shows that, compared with the state-of-the-art technique, DFT-FPGA can realize accumulative computing with less off-line workload and significantly reduce online roll-back time and resource utilization.
Permutation-based obfuscation has been proposed to protect hardware against cloning, overproduction, and reverse engineering with a secret key. To prevent key extraction from memory, this key is usually stored in volatile memory. Since the key is erased after the system loses power, this scheme is often considered the best way to prevent a key from being stolen, since many attacks would require power. However, in this paper, we propose a new attack where the key is determined by exploring path aging within the permutation network used for obfuscation. Both theoretical analysis and experimental results are provided. A practical procedure to achieve the proposed attack is also discussed in the context of an attacker's capabilities and knowledge. The proposed attack is executed in both simulation and hardware. The experimental results show that the accuracy of identifying the key is over 80%, which is more than enough to reduce the number of brute-force combinations required by an attacker. The attack accuracy reaches 100% when the permutation network has experienced sufficient degradation. Besides the attack, we also propose a countermeasure that sweeps the permutation network configurations. With this low-cost countermeasure incorporated, the proposed attack becomes no better than brute-force guessing.
In this work, we propose a multiplication-less binarized depthwise-separable convolution neural network, called BD-Net. BD-Net is designed to use a binarized depthwise-separable convolution block as a drop-in replacement for the conventional spatial convolution in a deep convolution neural network (DNN). In BD-Net, the computation-expensive convolution operations (i.e., multiplication and accumulation) are converted into energy-efficient addition/subtraction operations. To further compress the model size while keeping addition/subtraction as the dominant computation, we propose a new sparse binarization method with a hardware-oriented structured sparsity pattern. To train such a sparse BD-Net, we propose and leverage two techniques: 1) a modified group-lasso regularization whose group size is identical to the capacity of the basic computing core in the accelerator, and 2) a weight penalty clipping technique to solve the disharmony issue between weight binarization and lasso regularization. The experimental results show that the proposed sparse BD-Net can achieve comparable or even better inference accuracy in comparison to the full-precision CNN baseline. Beyond that, a BD-Net customized processing-in-memory accelerator is designed using SOT-MRAM, which offers high channel expansion flexibility and computation parallelism. Through detailed analysis from both the software and hardware perspectives, we provide intuitive design guidance for software/hardware co-design of DNN acceleration on mobile embedded systems.
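The conversion of multiply-accumulate into addition/subtraction can be seen on a single dot product. The sketch below is a minimal illustration under the assumption that binarized weights take values in {-1, +1} and that sparse binarization adds 0 for pruned weights; the function names are hypothetical, and BD-Net's actual layer structure, training, and scaling factors are not modeled.

```python
def mac_dot(activations, weights):
    """Reference dot product using multiply-accumulate."""
    return sum(x * w for x, w in zip(activations, weights))


def addsub_dot(activations, weights):
    """Same dot product for weights in {-1, 0, +1}, using only additions and subtractions."""
    acc = 0
    for x, w in zip(activations, weights):
        if w > 0:
            acc += x  # weight +1: add the activation
        elif w < 0:
            acc -= x  # weight -1: subtract the activation
        # weight 0 (pruned by sparse binarization): skip the operation entirely
    return acc


if __name__ == "__main__":
    activations = [3, -1, 4, 2]
    weights = [1, -1, 0, -1]
    print(mac_dot(activations, weights), addsub_dot(activations, weights))
```

Zero weights cost nothing in the second form, which is why a structured sparsity pattern that matches the accelerator's computing core can reduce both model size and work.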
Image sensors are widely used in various applications. With the increasing requirements for high resolution and frame rate, power consumption has become a critical issue, which limits the use of image sensors in mobile devices and IoT applications. Compressive sensing (CS) techniques can achieve sub-Nyquist sampling rates to reduce the power consumption of hardware circuits. Currently, most compressive measurements are implemented in digital CMOS circuits, leading to high hardware complexity and power consumption as well as limited sampling speed. Furthermore, CS applications with large image sizes are usually based on block-wise methods, which require real-time rate control during practical operation. In this paper, we propose a memristor-based CS encoder that can be integrated with conventional image sensors to achieve high performance with low power consumption and hardware overhead. A self-adaptive compression rate control mechanism is also devised to maximize the performance of the proposed technique. Simulation results of wireless video streaming demonstrate the advantages of the proposed technique.
Deep neural networks (DNNs) have become the readiest answer to a range of applications. However, they are often computationally and memory intensive, which makes it challenging to deploy DNNs in embedded, real-time, low-power applications. We describe a domain-specific metric model for optimizing task deployment on heterogeneous systems. Next, we propose the DNN hardware accelerator SCALENet: a SCalable Low power AccELerator for real-time deep neural Networks. Finally, we propose a heterogeneous-aware scheduler that uses these contributions to allocate a task to a resource by solving a numerical cost for a series of domain objectives. To demonstrate our contribution, we deploy nine modern deep networks in two different applications: image processing and biomedical seizure detection. Our proposed solution meets the computational requirements, adapts to multiple architectures, and lowers power by optimizing task-to-resource allocation. Our solution decreases power consumption by 10% of the total system power and meets the real-time deadlines. We exceed the energy efficiency of NVIDIA GPUs when evaluated against the Jetson TK1, with a 4x power savings in a power envelope of 2.0 Watts. When compared to existing FPGA-based accelerators, SCALENet's accelerator and heterogeneous-aware scheduler achieve a 4.8x improvement in energy efficiency.
Pattern matching algorithms, which may be realized via associative memories, require further improvements in both accuracy and power consumption to achieve more widespread use in real-world applications, particularly in embedded systems. In this paper, we extend our previous work to prove that our Hamming Distance (HD) circuit is scalable, generalizable, and tolerant to device-to-device variation. We show that the operation of our circuit changes only slightly under non-ideal fabrication conditions, decreasing the correct classification rate for the MNIST handwritten digits dataset by < 1%. Our circuit's operation is independent of the memristor model used. It is also n× faster than other HD circuits, where n is the number of HDs to be computed, because it leverages in-memory parallel computing. In addition, it consumes ≈ 100× less power than a similar memristive HD circuit. In a full HD ACAM, our HD circuit consumed only 3.6% of the full power, while Zhu's circuit (an HD circuit) consumed 80.8%. Improved associative memories that rely on the HD can improve image and object recognition applications. Our scheme will help in accelerating many machine learning algorithms while lowering their power consumption.
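For readers unfamiliar with HD-based associative lookup, the following software sketch shows the computation that such a circuit performs: the Hamming distance between a query and each of the n stored patterns, followed by selection of the closest one. The function names and bit patterns are assumptions for illustration. The loop over stored patterns is sequential here, whereas the abstract attributes the n× speedup to computing all n distances in parallel in memory.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two equal-width bit patterns differ."""
    return bin(a ^ b).count("1")


def best_match(query, stored):
    """Return (index of the closest stored pattern, list of all Hamming distances)."""
    distances = [hamming_distance(query, pattern) for pattern in stored]
    index = min(range(len(stored)), key=distances.__getitem__)
    return index, distances


if __name__ == "__main__":
    stored = [0b10110010, 0b01001101, 0b11110000]  # hypothetical stored patterns
    query = 0b10110110                             # stored[0] with one bit flipped
    print(best_match(query, stored))
```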
Energy dissipation has become a crucial aspect for the further development of computing technologies. Despite the good progress that has been achieved in this regard, there is a fundamental thermodynamic limit (known as Landauer's limit) that will never be broken by conventional technologies as long as the computations are performed in the conventional, non-reversible way. But even if reversible computations were performed, the basic energy needed for operating the circuits is still far too high. In contrast, novel nanotechnologies like Quantum-dot Cellular Automata (QCA) allow for computations with very low energy dissipation and, hence, are promising candidates for breaking this limit. Accordingly, the design of reversible QCA circuits is an active field of research. But whether QCA circuits are indeed able to break Landauer's limit is unknown thus far. In this work, we address this gap by utilizing an established theoretical model that has been implemented in a physics simulator, enabling a precise consideration of how energy is dissipated in QCA designs. Our results provide strong evidence that QCA is indeed a suitable technology for breaking Landauer's limit. Further, the first physically reversible design of an adder circuit is presented, which serves as a proof of concept for future fully reversible circuit realizations.
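To give a sense of scale, Landauer's limit is commonly stated as a minimum dissipation of k_B T ln 2 per irreversibly erased bit. The short sketch below evaluates that expression; the 300 K operating temperature and the function name are assumptions chosen for illustration and are not taken from the work summarized above.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23    # Boltzmann constant (exact in the 2019 SI)
ELECTRON_VOLT_J = 1.602176634e-19   # one electronvolt in joules (exact in the 2019 SI)


def landauer_limit_joules(temperature_kelvin):
    """Minimum energy dissipated when one bit is irreversibly erased: k_B * T * ln 2."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    energy = landauer_limit_joules(300.0)  # assumed room temperature
    print(f"{energy:.3e} J per bit, {energy / ELECTRON_VOLT_J * 1000:.2f} meV per bit")
```

At 300 K this works out to roughly 2.9 × 10^-21 J (about 18 meV) per erased bit.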
Traditional computing hardware often encounters an on-chip memory bottleneck in large-scale Convolutional Neural Network (CNN) applications. With its unique in-memory computing feature, resistive crossbar-based computing has attracted researchers' attention as a promising solution to the memory bottleneck in von Neumann architectures. However, the parasitic resistances in a crossbar cause its behavior to deviate from the ideal weighted summation operation. In large-scale implementations, the impact of parasitic resistances must be carefully considered and mitigated to ensure the circuits' functionality. In this work, we implemented and simulated CNNs on resistive crossbar circuits with consideration of parasitic resistances. Moreover, we devised a new mapping scheme for high utilization of crossbar arrays in convolution, and a mitigation algorithm to counter parasitic resistances in CNN applications. The mitigation algorithm considers parasitic resistances as well as the data/kernel patterns of each layer to minimize the computing error in crossbar-based convolutions of CNNs. We demonstrated the proposed methods with implementations of a 4-layer CNN on MNIST and of ResNet (20, 32, and 56) on CIFAR10. Simulation results show that the proposed methods effectively mitigate the parasitic resistances in crossbars. With our methods, modern CNNs on crossbars can preserve ideal (software-level) classification accuracy with 6-bit ADC and DAC implementations.
With the rise of the Internet of Things (IoT), devices such as smartphones, embedded medical devices, and smart home appliances, as well as traditional computing platforms such as personal computers and servers, have been increasingly targeted with a variety of cyber attacks. Due to the limited hardware resources of embedded devices and the difficulty of wide-coverage and on-time software updates, software-only cyber defense techniques, such as traditional anti-virus and malware detectors, do not offer a silver-bullet solution. Hardware-based security monitoring and protection techniques have therefore gained significant attention. Monitoring devices using side-channel leakage information, e.g., power supply variation and electromagnetic (EM) radiation, is a promising avenue that opens multiple directions in security and trust applications. In this paper, we provide a taxonomy of hardware-based monitoring techniques against different cyber and hardware attacks, highlight their potential and unique challenges, and show how power-based side-channel instruction-level monitoring can offer suitable solutions to prevailing embedded device security issues. Further, we delineate approaches for future research directions.