
Zhiyi Huang
- PhD
- Professor at University of Otago
About
137 Publications
13,803 Reads
1,218 Citations
Additional affiliations
August 1998 - present
May 1996 - July 1998
Publications (137)
Radio links in Wireless Body Area Networks (WBANs) are highly subject to short and long-term attenuation due to the unstable network topology and frequent body blockage. This instability makes it challenging to achieve reliable and energy-efficient communication, but on the other hand, provides a great potential for the sending nodes to dynamically...
Sleep staging traditionally requires substantial time and expertise from clinicians. Various automated sleep staging methods have been developed to streamline this task; however, they commonly need extensive labeled clinical sleep data. To address this challenge, ActiveSleepLearner is proposed: a transfer learning framework that leverages active learnin...
With the development of Location-Based Social Networks, successive Point Of Interest (POI) recommendation systems have become a hot spot in the field of recommendation systems. Successive POI recommendation systems suggest to users new and interesting places to visit. However, in real-life POI recommendation, there are often a small number of users...
Multi-layer perceptron (MLP) is a class of Artificial Neural Networks widely used in regression, classification, and prediction. To accelerate the training of MLP, more cores can be used for parallel computing on many-core systems. However, with the increasing number of cores integrated into the chip, the communication bottleneck in the training of...
Communication efficiency plays an important role in accelerating the distributed training of Deep Neural Networks (DNN). All-reduce is the key communication primitive to reduce model parameters in distributed DNN training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communi...
Multi-layer Perceptron (MLP) is a class of Artificial Neural Networks widely used in regression, classification, and prediction. To accelerate the training of MLP, more cores can be used for parallel computing on many-core systems. With the increasing number of cores, interconnection of cores has a pivotal role in accelerating MLP training. Current...
Fully Connected Neural Network (FCNN) is a class of Artificial Neural Networks widely used in computer science and engineering, whereas the training process can take a long time with large datasets in existing many-core systems. Optical Network-on-Chip (ONoC), an emerging chip-scale optical interconnection technology, has great potential to acceler...
Task allocation in Data Stream Processing Systems (DSPSs) has a significant impact on performance metrics such as data processing latency and system throughput. An application processed by DSPSs can be represented as a Directed Acyclic Graph (DAG), where each vertex represents a task and the edges show the dataflow between the tasks. Task allocatio...
Radio links in Wireless Body Area Networks (WBANs) suffer from both short-term and long-term variations due to the dynamic network topology and frequent blockage caused by body movements, making it challenging to achieve reliable, energy-efficient and real-time data communication. Through experiments with TelosB motes, we observe a strong positive...
Many modern forms of asymmetric multiprocessing (AMP) architecture use hypervisors to increase software security by isolating the system software in virtual machines. However, efficient virtualisation depends on hardware support that is not available across all products. Within modern ARM architectures, the aforementioned software isolation can als...
Radio links in wireless body area networks (WBANs) commonly experience highly time-varying attenuation due to the dynamic network topology and frequent occlusions caused by body movements, making it challenging to design a reliable, energy-efficient, and real-time communication protocol for WBANs. In this article, we present Chimp, a learning-based...
To efficiently handle a large volume of data, scheduling algorithms in stream processing systems need to minimise the data movement between communicating tasks to improve system throughput. However, finding an optimal scheduling algorithm for these systems is NP-hard. In this paper, we propose a heuristic scheduling algorithm – T3-Scheduler – for a...
Sparse bundle adjustment (SBA) is a key but time- and memory-consuming step in three-dimensional (3D) reconstruction. In this paper, we propose a 3D point-based distributed SBA algorithm (DSBA) to improve the speed and scalability of SBA. The algorithm uses an asynchronously distributed sparse bundle adjustment (A-DSBA) to overlap data communicatio...
Nowadays data stream processing systems need to efficiently handle large volumes of data in near real-time. To achieve this, the schedulers within such systems minimise the data movement between highly communicating tasks, improving system throughput. However, finding an optimal schedule for these systems is NP-hard. In this research, we propose a...
In this paper, we propose a new power modelling method, called Manila, that can largely reduce the effort of PMC-based power modelling using multi-dimensional k-nearest neighbour searching, without the use of model tuning and domain-specific knowledge. This method helps improve the accuracy of PMC-based power modelling and widen its scope of use. S...
To efficiently handle a large volume of data, scheduling algorithms in stream processing systems need to minimise the data movement between communicating tasks to improve system throughput. However, finding an optimal scheduling algorithm for these systems is NP-hard. In this paper, we propose a heuristic scheduling algorithm for a heterogeneous cl...
Unstable channel links in Vehicular Ad hoc Networks (VANETs) make the design of reliable broadcast schemes challenging. Existing solutions fail to balance the requirements in Packet Delivery Ratio (PDR), latency and communication overhead, so vehicular broadcast suffers from severe packet loss, long delays or excessive duplication. In...
Indoor localisation systems have slowly become more and more accurate. Each localisation system needs tuning to achieve reasonable performance. In this paper we propose CRAFT, a crowdsourced approach to constructing a WiFi fingerprint database. The method uses a temporary deployment of a small number of anchor nodes to roughly locate the position...
Approximate $k$ Nearest Neighbours (A$k$NN) search is widely used in domains such as computer vision and machine learning. However, A$k$NN search in high-dimensional datasets does not scale well on multicore platforms, due to its large memory footprint. Parallel A$k$NN search using space subdivision for filtering helps reduce the memory f...
Manycore processors have become the mainstream platform for cloud computing applications. However, the design of a high-performance and sustainable inter-core communication network is still a challenging problem. Optical Network on Chip (ONoC) is a promising chip-scale optical communication technology with high bandwidth capacity and energy efficiency. In...
When media is streamed over networks that only provide best-effort delivery, playback interruptions caused by variations of network throughput can be largely eliminated by using techniques such as client-side playback buffering. A larger buffer generally provides stronger protection against playback interruptions, since it accommodates a wider rang...
The rapidly growing demand for accident-free driving in intelligent transportation makes reliable broadcast a critical factor for vehicular ad hoc networks. Existing solutions always try to improve the broadcast reliability by retransmitting lost packets. However, the excessive retransmissions can easily cause unpredictable time delay and even broa...
Multicast communication, which widely exists in multicore systems, can occupy a large quantity of network resources and lead to severe traffic congestions. Optical Network on Chip (ONoC) is considered as a promising interconnection technology for future multicore processors, due to its remarkable advantages of high bandwidth capacity and transmissi...
Performance monitoring counters (PMCs) are of great value to monitor the status of processors and their further analysis and modeling. In this paper, we explore a novel problem called PMC integration, i.e., how to combine a group of PMCs which are collected asynchronously together. It is well known that, due to hardware constraints, the number of P...
With ever-accelerating data creation rates in Big Data applications, there is a need for efficient stream processing engines. Apache Storm has been of interest in both academia and industry because of its real-time, distributed, scalable and reliable framework for stream processing. In this paper, we propose an adaptive hierarchical scheduler for t...
Power and performance are two potentially opposing objectives in the design of a supercomputer, where increases in performance often come at the cost of increased power consumption and vice versa. The task of simultaneously maximising both objectives is becoming an increasingly prominent challenge in the development of future exascale supercomputer...
Train localisation is important to railway safety. Using Wireless Sensor Networks (WSNs) for train localisation is a robust and cost-effective approach. A WSN-based train localisation system contains anchor nodes that are deployed along railway tracks and have known geographic coordinates. However, anchor nodes along the railway tracks are prone to hardw...
Optical Network on Chip (ONoC) is a promising technology for the next-generation many-core chip multiprocessors owing to its tremendous advantages in low power consumption, low communication delay, and high bandwidth. In this paper we present WRH-ONoC, a novel wavelength-reused hierarchical architecture that is capable of interconnecting thousands...
This chapter presents the motivation, fundamental concepts, and implementation methodologies for power modeling. It also presents various performance monitoring techniques that are commonly used for collecting application and component performance values. The chapter discusses the relationship between power and performance that needs to be quantifi...
Robotics and Wireless Sensor Network (WSN) collaboration is an emerging research field in which both technologies can benefit from integrated implementations. A robot operating in WSN assisted environments can dynamically push instructions using over the air (OTA) update protocols to alter sensors to suit the requirements. In this paper, an Intrusi...
Robotics and Wireless Sensor Network (WSN) collaboration is an emerging research field in which both technologies can benefit from integrated implementations. A robot operating in a WSN-assisted environment can dynamically push instructions using over the air (OTA) update protocols to alter sensing requirements. In this paper, an Intrusion Detect...
In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigen-solver. LOBPCG is chosen as a benchmark for th...
Research efforts into indoor localisation have focused on improving the accuracy of location estimates. In this paper, we propose a novel approach called SIB that uses RSSI values from low-power transmissions to exclude the noisy measurements from usual high-power RSSI measurements. SIB can effectively reduce the effect of noise in fingerprint-...
K Nearest Neighbors (k-NN) search is a widely used category of algorithms with applications in domains such as computer vision and machine learning. With the rapidly increasing amount of data available, and their high dimensionality, k-NN algorithms scale poorly on multicore systems because they hit a memory wall. In this paper, we propose a novel...
Many techniques have previously been proposed for using low-level CPU Performance Monitoring Counters in power estimation models. In this paper, we present some apparent myths regarding these techniques, and their potential impact. The underlying misconceptions include: (1) sampling rate and execution time can be left unspecified; (2) thermal effec...
Real-time train localization using wireless sensor networks (WSNs) offers huge benefits in terms of cost reduction and safety enhancement in railway environments. A challenging problem in WSN-based train localization is how to guarantee timely communication between the anchor sensors deployed along the track and the gateway deployed on the train wi...
Modern multi-core architectures offer Dynamic Voltage and Frequency Scaling (DVFS) that can dynamically adjust the operating frequency of each core for energy saving. However, current parallel programming environments and schedulers for task-based programs do not utilize DVFS and thus suffer from energy inefficiency in multi-core processors. To red...
This paper investigates the problem of scheduling delay-constrained traffic in a single-hop wireless industrial network in which different source devices have different data rates. We aim to maximize the packet delivery reliability while meeting the deadline for each packet. The transmission scheduling problem is decomposed into two sub-problems: s...
This paper introduces the JStar parallel programming language, which is a Java-based declarative language aimed at discouraging sequential programming, encouraging massively parallel programming, and giving the compiler and runtime maximum freedom to ...
k Nearest Neighbors (k-NN) search is a widely used category of algorithms with applications in domains such as computer vision and machine learning. Despite the desire to process increasing amounts of high-dimensional data within these domains, k-NN algorithms scale poorly on multicore systems because they hit a memory wall. In this paper, we propo...
Wireless sensor networks (WSNs) offer promising solutions for real-time object monitoring and tracking. An interesting application is train localization, in which anchor sensors are deployed along the railway track to detect the train and timely report to a gateway installed on the train. To save energy, anchor sensors operate based on an asynchron...
This paper extensively evaluates the performance of View-Oriented Transactional Memory (VOTM) based on two implementations that adopt different Transactional Memory (TM) algorithms. The Restricted Admission Control (RAC) mechanism in VOTM plays a key role in the performance gains of VOTM. In this paper, we use six applications to evaluate the perfo...
Modern multicore computers often adopt a multisocket multicore architecture with shared caches in each socket. However, traditional work-stealing schedulers tend to pollute the shared cache and incur more cache misses due to their random stealing. To relieve this problem, this paper proposes an Adaptive Cache-Aware Bi-tier work-stealing (A-CAB) sch...
Real-time train localization is essential to ensure the safety of modern railway transportation. This paper investigates the feasibility of achieving real-time and accurate train localization using wireless sensor networks. We carry out on-site experiments in a railway environment and demonstrate that Received Signal Strength Indicator (RSSI) is a go...
In memory-intensive algorithms, the problem size is often so large that it cannot fit into the cache of a CPU, and this may result in an excessive number of cache misses, a bottleneck that can easily make seemingly embarrassingly-parallel algorithms such as feature-matching unscalable in many core systems. To solve this bottleneck, this paper propo...
Feature matching is a fundamental problem in many computer vision tasks. As datasets become larger, and individual image resolution increases, this is becoming more and more computationally demanding work. While prior knowledge about the scene geometry can, in some cases, reduce the number of image pairs that need to be considered, the sheer volume...
Many computer vision applications are entering the 'big data' era: it is straightforward to acquire very large datasets that need to be processed. Our current research targets a large-scale structure-from-motion application, in which 3D models are formed from large collections of digital photographs. There have also been many recent technological d...
Parallel programming is the mainstream for today's HPC applications. Programmers need to parallelize their programs to achieve better performance on multicore systems. However, due to a lack of good understanding of parallelism in algorithms, scheduling policy in runtime systems, and multicore architectures, programmers usually find it very hard to...
Many techniques have previously been proposed for using low-level CPU Performance Monitoring Counters in power estimation models. In this paper, we present some common myths of these techniques, and their potential impact. Such myths include: (1) sampling rate can be ignored; (2) thermal effects are neutral; and (3) memory events correlate well wit...
In recent years, the rapid evolution of Multiprocessor System-on-Chip (MPSoC) has posed new challenges for programming such architectures in an efficient manner. In order to explore potential hardware concurrency, software developers are still expected to handle many of the low-level details of programming including utilizing DMA, ensuring cache coherency...
This paper proposes a Restricted Admission Control (RAC) scheme for View-Oriented Transactional Memory. The scheme can control the number of threads concurrently accessing a view in order to reduce the number of aborts of transactions. The RAC scheme has the merits of both the locking mechanism and the transactional memory. A theoretical model is p...
This paper extends the Restricted Admission Control (RAC) theoretical model to cover the multiple-view cases in View-Oriented Transactional Memory (VOTM) to analyze potential performance gain in VOTM when shared data is partitioned into multiple views. Experimental results show that partitioning shared data into separate views, each of which is ind...
Multi-socket Multi-core architectures with shared caches in each socket have become mainstream when a single multi-core chip cannot provide enough computing capacity for high performance computing. However, traditional task-stealing schedulers tend to pollute the shared cache and incur severe cache misses due to their randomness in stealing. To add...
Asymmetric Multi-Core (AMC) architectures have shown high performance as well as power efficiency. However, current parallel programming environments do not perform well on AMC due to their assumption that all cores are symmetric and provide equal performance. Their random task scheduling policies, such as task-stealing, can result in unbalanced wo...
Next Generation Wireless Networks (NGWNs) are expected to provide high data rate and optimized quality of service to multimedia and real-time applications over the Internet Protocol (IP) networks. To achieve these goals, handover plays a very critical role in maintaining the seamless connectivity when mobile terminals move across different cells or...
This chapter discusses energy-aware scheduling techniques for parallel applications on multicore computers. Key techniques for developing an energy-aware scheduler, such as estimation of power usage and performance features per application, are analyzed and evaluated. The authors first discuss the runtime profiling techniques for collecting detaile...
This paper proposes the View-Oriented Transactional Memory (VOTM) model to seamlessly integrate locking mechanism and transactional memory. The VOTM model allows programmers to partition the shared memory into "views", which are non-overlapping sets of shared data objects. The Restricted Admission Control (RAC) scheme can then control the number of...
Worldwide Interoperability for Microwave Access (WiMAX) deployment is growing at a rapid pace. Since Mobile WiMAX has the key advantage of serving large coverage areas per base station, it has become a popular emerging technology for handling mobile clients. However, serving a large number of Mobile Stations (MS) in practice requires an efficient h...
Next Generation Wireless Networks (NGWNs) focus on convergence of different Radio Access Technologies (RATs) providing good Quality of Service (QoS) for applications such as Voice over IP traffic (VoIP) and video streaming. The voice applications over IP networks are growing rapidly due to their increasing popularity and cost. To meet the demand of...
Modern multi-core computers often adopt a multi-socket multi-core architecture with shared caches in each socket. However, traditional task-stealing schedulers tend to pollute the shared cache and incur more cache misses due to their random stealing. To relieve this problem, this paper proposes a Cache Aware Bi-tier (CAB) task-stealing scheduler,...
In this paper, we have further explored the novel metrics and policies of Speedup per Watt (SPW), Power per Speedup (PPS), Energy per Target (EPT), Sharing Policy, the Hare and the Tortoise Policies, which were introduced in our previous work. Each policy leverages application parallelism and Dynamic Voltage and Frequency Scaling (DVFS) to reduce e...
This paper proposes a scheme for automatic detection of view access in the View-Oriented Parallel Programming (VOPP) model. VOPP is a shared-memory-based, data-centric model that uses “views” to bundle mutual exclusion with data access. Based on the automatic detection scheme, a view is automatically acquired when first accessed, and automatically...
In this paper, we have proposed three new metrics, Speedup per Watt (SPW), Power per Speedup (PPS) and Energy per Target (EPT), to guide task schedulers to select the best task schedules for energy saving in multicore computers. Based on these metrics, we have proposed the novel Sharing Policies, the Hare and the Tortoise Policies, which have taken...
Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a "cactus stack," wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in parallel. Unfortunately, such existing concurrency platforms fail to satisfy at least one of the...
Data races hamper parallel programming and threaten the reliability of future software. This paper proposes the data race prevention scheme View-Oriented Data race Prevention (VODAP), which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses...
This paper proposes a data race prevention scheme, which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mech...
In the traditional analysis method of noise signals, it is very difficult to relate the overall noise from compressors to the angular position. The experimental method of separating the overall noise of different angular ranges is carried out at the real conditions. The starting position of the rotary piston is labelled with vane displacement and t...
View-oriented parallel programming (VOPP) is a novel parallel programming model which uses views for communication between multiple processes. With the introduction of views, mutual exclusion and shared data access are bundled together, which offers both convenience and high performance to parallel programming. This paper presents the implementatio...
Zhiyi Huang, Jingbo Niu, G. Li, [...], Y. Liu
This article presents the experiment process and results of abrasive water jet perforation. This experiment was conducted in Kalamayi, China, Xinjiang Oilfield in October 2004. Referring to explosive perforation experiment, we made two cement cylinder samples with a diameter of 2.4 m, 1.2 m high, putting a 139.7 mm (5-1/2″) and a 177.8 mm (7″) casi...
The best-effort service model of the Internet is unsuitable for streaming applications which require a smooth and flexible packet transmission rate. TCP is unable to provide such a sending rate due to its strict adherence to congestion control. We study the effect of the transport protocol's send buffer size on the performance of streaming medi...
This paper describes the use of remote memory for virtual memory swapping in a cluster computer. Our design uses a lightweight kernel-to-kernel communications channel for fast, efficient data transfer. Performance tests are made to compare our system to normal hard disk swapping. The tests show significantly improved performance when data access is...
Operating systems only provide general-purpose I/O optimisation since they have to service various types of applications. However, application-level I/O optimisation can achieve better performance, since an application has better knowledge of how to optimise disk I/O for the application. In this paper we provide a solution for appl...
Parallel computing has been in the spotlight with the advent of multi-core computers. The popular multithreading model does not scale very well when there are hundreds or thousands of cores, since it can only help exploit coarse-grained parallelism. There exists a lot of fine-grained parallelism to be exploited in I/O tasks and memory accesses durin...
In the last few years, GPUs (Graphics Processing Units) have developed rapidly. Their ever-increasing computing power and decreasing cost have attracted attention from both industry and academia. In addition to graphics applications, researchers are interested in using them for general purpose computing. Recently, NVIDIA released a new computin...
Large scale e-Research environments face classical distributed challenges: performance, heterogeneous equipment and variable contexts. The users of such infrastructures want to benefit from full interactive environments based on multimedia streams (voice, video, virtual reality) which are difficult to design and support on a large scale basis. In t...
View-Oriented Parallel Programming (VOPP) is a novel programming style based on Distributed Shared Memory, which is friendly and easy for programmers to use. In this paper we compare VOPP with two other systems for parallel programming on clusters: LAM/MPI, a message passing system, and TreadMarks, a software distributed shared memory system. We pre...
Traditional parallel programming styles have many problems which hinder the development of parallel applications. The message passing style can be too complex for many programmers. While shared memory based parallel programming is relatively easy, it requires programmers to guarantee there is no data race in programs by using mutually exclusive loc...
This paper proposes a load balancing algorithm for distributed use of a cluster computer. It uses load information including CPU queue length, CPU utilisation, memory utilisation and network traffic to decide the load of each node. This algorithm is compared to an algorithm using only the CPU queue length. The performance evaluation results show th...
This paper presents a high-performance distributed shared memory system called VODCA, which supports a novel view-oriented parallel programming on cluster computers. One advantage of view-oriented parallel programming is that it allows the programmer to participate in performance optimization through wise partitioning of the shared data into views....
In contrast to purely AND-parallel and purely OR-parallel execution models/systems, the side-effect problem in AND/OR parallel execution of Prolog programs is intricate and needs to be carefully investigated. To decrease the non-trivial recomputation incurred by previous approaches, this paper presents a Selective Recomputation (SR) approach for handling sid...
Traditional parallel programming styles have many problems which hinder the development of parallel applications. The message passing style can be too complex for many programmers. While shared memory based parallel programming is relatively easy, it requires programmers to guarantee there is no data race in programs. Data race conditions are gener...
This paper proposes a view-oriented update protocol with integrated diff for efficient implementation of a view-based consistency model which supports a novel view-oriented parallel programming style based on distributed shared memory. View-oriented parallel programming requires the programmer to divide the shared data into views according to the n...
This paper evaluates the performance of a novel view-oriented parallel programming style for parallel programming on cluster computers. View-oriented parallel programming is based on distributed shared memory which is friendly and easy for programmers to use. It requires the programmer to divide shared data into views according to the memory access...
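Several of the abstracts above describe "views" as non-overlapping sets of shared data that bundle mutual exclusion with data access, and a Restricted Admission Control scheme that limits how many threads may access a view at once. The following is only a minimal illustrative sketch of that general idea in Python, not the authors' VOPP or VOTM implementation. The names `View`, `acquire`, `max_threads` and `run_demo` are hypothetical, and a counting semaphore is assumed here as a stand-in for the admission mechanism.

```python
import threading
from contextlib import contextmanager


class View:
    """Hypothetical 'view': shared data bundled with its own access control."""

    def __init__(self, data, max_threads=1):
        self._data = data
        # Assumption: a counting semaphore models restricted admission.
        # max_threads=1 behaves like a plain mutual-exclusion lock.
        self._admission = threading.BoundedSemaphore(max_threads)

    @contextmanager
    def acquire(self):
        # The data is only handed out while the view is held.
        self._admission.acquire()
        try:
            yield self._data
        finally:
            self._admission.release()


def run_demo(num_threads=4, increments=1000):
    """Increment a shared counter from several threads through one view."""
    counter_view = View({"count": 0}, max_threads=1)

    def worker():
        for _ in range(increments):
            with counter_view.acquire() as data:
                data["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with counter_view.acquire() as data:
        return data["count"]


if __name__ == "__main__":
    print(run_demo())
```

With `max_threads=1` every update is serialised, so no increments are lost. Admitting more than one thread would additionally require conflict handling (for example transactions), which this sketch does not attempt.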