-
ABSTRACT: Dynamic optimization has been proposed to overcome many limitations of static optimization and is widely applied in
dynamic binary translation (DBT) to enhance system performance. However, almost all existing dynamic optimization
techniques employed in DBT systems for a single-threaded execution environment either considerably increase hardware
complexity or incur striking runtime overhead. We propose a multithreaded DBT framework that requires no additional hardware, called
MTCrossBit, in which a helper thread for building hot traces is employed to significantly reduce the overhead. In addition,
the main thread and the helper thread are each assigned to different cores to use the multi-core resources efficiently and attain
better performance. Two novel methods yet to be implemented in MTCrossBit are also presented: the dual-special-parallel translation
caches and a new lock-free thread communication mechanism, assembly language communication (ASLC). We then apply quantitative
analysis to show that MTCrossBit can speed up the original CrossBit. We also present results from an implementation
of MTCrossBit on uniprocessor machines with multiple cores, using the SPECint 2000 benchmark, and illustrate that
we achieved some success with the above concurrent architecture.
Keywords: dynamic binary translator, hot trace, multithreaded framework, parallelism, multicore
Science China. Information Sciences 05/2012; 54(10):2064-2078. · 0.39 Impact Factor
-
Computers & Mathematics with Applications. 01/2012; 63:469-480.
-
ABSTRACT: One of the most important goals of data center management today is to maximize profit by minimizing power consumption and service-level agreement (SLA) violations of hosted applications. System dynamics make it difficult to optimize both aspects on shared infrastructures. Most existing works either focus on one aspect or apply models that are trained offline for application-specific workloads. In addition, virtualization is widely used in large-scale data centers to attain basic benefits such as fault and performance isolation, and to improve system manageability. A key challenge that comes with virtualization is to dynamically provision resources for virtual machines and optimize their capacity to meet service-level objectives at the lowest possible cost. In this paper, we present a hierarchical management framework that assures application-level performance while minimizing power consumption for virtualized data centers. A novelty of the management framework is that it combines control theory with linear programming techniques. Empirical results show that the proposed framework brings substantial energy savings while ensuring application performance. In particular, the integration of the performance controller and the energy optimizer results in an energy saving of 43% on our hardware testbed.
Green Computing and Communications (GreenCom), 2011 IEEE/ACM International Conference on; 09/2011
-
ABSTRACT: System virtualization, which provides good isolation, is now widely used in server consolidation. Meanwhile, one of the hot topics in this field is extending virtualization to embedded systems. However, current popular virtualization platforms do not support real-time operating systems such as embedded Linux well, because the platforms are not real-time aware, which results in low-performance I/O and high scheduling latency. The goal of this paper is to optimize the Xen virtualization platform to be friendly to real-time operating systems. We improve two aspects of the Xen virtualization platform. First, we improve the Xen scheduler to manage the scheduling latency and response time of the real-time operating system. Second, we introduce a balancing method for multiple real-time operating systems. Our experiments demonstrate that our enhancements to the Xen virtualization platform support real-time operating systems well and that the improvement in real-time performance is about 20%.
Embedded and Ubiquitous Computing (EUC), 2010 IEEE/IFIP 8th International Conference on; 01/2011
-
ABSTRACT: Many advanced hardware accelerations for virtualization, such as Pause Loop Exit (PLE), Extended Page Table (EPT), and Single Root I/O Virtualization (SR-IOV), have been introduced recently to improve virtualization performance and scalability. In this paper, we share our experience with the performance and scalability issues of virtualization, especially those brought by modern multi-core and/or overcommitted systems. We then describe our work on the implementation and optimization of advanced hardware acceleration support in the latest version of Xen. Finally, we present performance evaluations and characterizations of these hardware accelerations, using both micro-benchmarks and a server consolidation benchmark (vConsolidate). The experimental results demonstrate an improvement of up to 77% with these hardware accelerations, 49% of which is due to EPT and another 28% to SR-IOV.
Workload Characterization (IISWC), 2010 IEEE International Symposium on; 01/2011
-
ABSTRACT: In recent years, the embedded world has been undergoing a shift from traditional single-core processors to processors with multiple cores. However, this shift poses the challenge of adapting legacy uniprocessor-oriented real-time operating systems (RTOSes) to exploit the capability of multi-core processors. In addition, some embedded systems are inevitably moving towards integrating real-time systems with off-the-shelf time-sharing systems, as the combination of the two has the potential to provide not only timely and deterministic response but also a large application base. Virtualization technology, which ensures strong isolation between virtual machines, is therefore a promising solution to the above-mentioned issues. However, there remains a concern regarding the responsiveness of an RTOS running on top of a virtual machine. In this paper we propose an embedded real-time virtualization architecture based on Kernel-Based Virtual Machine (KVM), in which VxWorks and Linux are combined. We then analyze and evaluate how KVM influences the interrupt-response times of VxWorks as a guest operating system. By applying several real-time performance tuning methods on the host Linux, we show that sub-millisecond interrupt response latency can be achieved on the guest VxWorks.
Computer Sciences and Convergence Information Technology (ICCIT), 2010 5th International Conference on; 01/2011
-
Third International Conference on Communications and Mobile Computing, CMC 2011, Qingdao, China, 18-20 April 2011; 01/2011
-
Proceedings of the 2011 ACM Symposium on Applied Computing (SAC), TaiChung, Taiwan, March 21 - 24, 2011; 01/2011
-
JSW. 01/2011; 6:2331-2340.
-
Proceedings of the 2011 ACM Symposium on Applied Computing (SAC), TaiChung, Taiwan, March 21 - 24, 2011; 01/2011
-
Jun Nakajima, Qian Lin, Sheng Yang, Min Zhu, Shang Gao, Mingyuan Xia, Peijie Yu, Yaozu Dong, Zhengwei Qi, Kai Chen, Haibing Guan
Proceedings of the 2011 ACM Symposium on Applied Computing (SAC), TaiChung, Taiwan, March 21 - 24, 2011; 01/2011
-
Proceedings of the 73rd IEEE Vehicular Technology Conference, VTC Spring 2011, 15-18 May 2011, Budapest, Hungary; 01/2011
-
JSW. 01/2011; 6:2341-2349.
-
ABSTRACT: As a key part of reverse engineering, decompilation plays a very important role in software security and maintenance. Unfortunately, most existing decompilation tools suffer from low accuracy in identifying variables, functions and composite structures, which results in poor readability. To address these limitations, we present a practical decompiler for Windows C programs, called C-Decompiler, that (1) uses a shadow stack to perform refined data flow analysis, and (2) adopts inter-basic-block register propagation to reduce redundant variables. Our experimental results illustrate that, among the three tools compared, C-Decompiler has on average the highest total percentage reduction (55.91%) and the lowest variable expansion rate (55.79%), and yields the same Cyclomatic Complexity as the original source code for each test application. Furthermore, in our experiment, C-Decompiler recognizes functions with lower false positive and false negative rates. These studies show that C-Decompiler is a practical tool for producing highly readable C code.
Reverse Engineering (WCRE), 2010 17th Working Conference on; 11/2010
-
ABSTRACT: Profile data is valuable for identifying program hotspots and guiding optimizations. Traditional software profiling techniques incur significant overhead and are not suitable for DBT (Dynamic Binary Translation) systems. Hardware can support profile collection through either counters or timer interrupts that permit collection of statistical samples via software. Most hardware-supported profiling systems can achieve either high profile accuracy or low overhead, but not both. In this paper, we propose a novel hardware-supported profiling approach for DBT that collects profile information rapidly and accurately with minimal runtime overhead. This approach makes use of instrumentation code and a set of profiling hardware that supports counter-update operations. We believe that such a software-hardware collaborative approach will provide a strong foundation for optimizing DBT systems.
Information Science and Engineering (ICISE), 2009 1st International Conference on; 01/2010
-
IEEE 16th International Conference on Parallel and Distributed Systems, ICPADS 2010, 8-10 Dec. 2010, Shanghai, China; 01/2010
-
Journal of Systems Architecture - Embedded Systems Design. 01/2010; 56:500-508.
-
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, Chicago, Illinois, USA, June 21-25, 2010; 01/2010
-
16th International Conference on High-Performance Computer Architecture (HPCA-16 2010), 9-14 January 2010, Bangalore, India; 01/2010
-
ABSTRACT: This paper presents a performance comparison of register allocation algorithms in DBT (Dynamic Binary Translation). A group of register allocation algorithms, including SRA (Simple Register Allocation), GRA (Global Register Allocation), NRA (Next-use Register Allocation) and SGRA (Simplified Graph-coloring Register Allocation), are implemented in a DBT system and evaluated. SGRA is a simplified version of the graph-coloring register allocation method, proposed by us. NRA and GRA are also optimized versions based on the target platform. In our experiments, SGRA has the best performance on all six programs chosen from SPEC CINT2000. The improvement of SGRA over SRA is 7.3% on average. Based on a comparison of the performance of the DBT system with the super block technique switched on or off, we find that SGRA performs better in large allocation scopes, such as super blocks and traces.
Knowledge and Systems Engineering, International Conference on. 10/2009;