You-Gan Wang

Queensland University of Technology | QUT · School of Mathematical Sciences

D.Phil, Oxford

About

243
Publications
62,359
Reads
4,686
Citations
Additional affiliations
April 2010 - September 2015
The University of Queensland
Position
  • Chair Professor of Applied Statistics
Education
September 1988 - February 1991
University of Oxford
Field of study
  • Statistics

Publications (243)
Article
Full-text available
The presence of heterogeneous variances is the norm in practice, which makes machine learning predictions less reliable when noise variance is implicitly assumed to be equal. To this end, we extend support vector regression by allowing a range of variances in the model training. Specifically, we model the variance as a function of the mean and othe...
Article
Full-text available
In load forecasting fields, electricity demand with hierarchical structure is very popular where there are some differences among investigated load series because of geography or customers' habits. Common methods usually ignore their differences and introduce some complex models to improve forecasting performance. Therefore, appropriately dealing w...
Article
Full-text available
In engineering applications, many real-world optimization problems are nonlinear with multiple local optima. Traditional algorithms that require gradients are not suitable for these problems. Meta-heuristic algorithms are widely employed to deal with these problems because they can escape local optima and do not need any gradi...
Article
A better understanding of the phosphorus-transfer process and its influencing factors at the Sediment-Water Interface (SWI) is essential to develop effective and efficient river management strategies. In this study, overlying water, pore water and riverbed sediment samples were collected in the Huaihe River (HR), a highly polluted river in Eastern China, in May...
Article
Full-text available
An Underwater Data Center (UDC) is an underwater vessel full of computing servers and designed with a cooling system using cold water in the ocean. A UDC vessel is composed of cabinets for computing servers, and the cabinets are finally packed into racks that facilitate the installation of the computing servers. We formulate the problem of packing...
Article
Full-text available
Wind energy is a core sustainable source of electric power, and accurate wind-speed forecasting is pivotal to enhancing the power stability, efficiency, and utilization. The existing forecasting methods are still limited by the influence of outliers and the modelling difficulties caused by complex features in wind speed series. This paper proposes...
Article
Energy efficiency is a critical issue in data centre management, which is the foundation for cloud computing. The VM placement has a considerable impact on a data centre's energy efficiency and resource utilisation. The assignment of VMs to PMs is an NP-hard problem without an easy way to find an optimal solution, particularly in large-scale data c...
Article
Full-text available
In-situ monitoring of water quality (suspended sediment concentration, SSC) and concurrent hydrodynamics was conducted in the subaqueous Yellow River Delta in China. Empirical mode decomposition and spectral analysis on the SSC time series reveal the different periodicities of each physical mechanism that contributes to the SSC variations. Based...
Article
Full-text available
Background: Polyploids are common in flowering plants and they tend to have more expanded ranges of distribution than their diploid progenitors. Possible mechanisms underlying polyploid success have been intensively investigated. Previous studies show that polyploidy generates novel changes and that subgenomes in allopolyploid species often differ i...
Article
Full-text available
With a rapid decline in the cost of battery energy storage, a battery system plays an increasingly important role in managing an imbalance between ordering and consumption in the electricity wholesale market. To determine the optimal battery capacity that minimizes costs, we develop a new cost-oriented load forecasting framework accounting for batter...
Article
Full-text available
Selecting the minimal best subset out of a huge number of factors influencing the response is a fundamental and very challenging NP-hard problem, because the presence of many redundant genes easily results in over-fitting while missing an important gene can have a more detrimental impact on predictions, and computation is prohibitive for exhaustive search...
Article
Full-text available
In the extreme learning machine (ELM) framework, the hidden layer setting determines its generalization ability; and in the presence of outliers in the training set, weights between the hidden layer and the output layer based on least squares would be overly estimated. To address these two problems in ELM implementation, we extend robust penalized statistical...
Preprint
Full-text available
In environmental monitoring, multiple measurements are often collected at many locations and these measurements depend on each other in complex ways, such as nonlinear dependence. In this research, a novel copula-based geostatistical modelling approach was developed to model multivariate continuous spatial random fields using mixture copulas that c...
Article
Full-text available
This article proposes a new robust smooth-threshold estimating equation to select important variables and automatically estimate parameters for high dimensional longitudinal data. A novel working correlation matrix is proposed to capture correlations within the same subject. The proposed procedure works well when the number of covariates p_n increa...
Article
Full-text available
In medical studies, the collected covariates contain underlying outliers. For clustered/longitudinal data with censored observations, the traditional Gehan-type estimator is robust to outliers in response but sensitive to outliers in the covariate domain, and it also ignores the within-cluster correlations. To take account of within-cluster correla...
Article
Full-text available
Variable selection is fundamental to high dimensional statistical modeling, and many approaches have been proposed. However, existing variable selection methods do not perform well in the presence of outliers in the response variable and/or covariates. In order to ensure a high probability of correct selection and efficient parameter estimation, we investi...
Conference Paper
Full-text available
https://www.statsoc.org.au/resources/Documents/ECSSC/ECSSC2021_Conference_Proceedings.pdf
Poster
https://www.statsoc.org.au/resources/Documents/ECSSC/ECSSC2021_Conference_Proceedings.pdf
Conference Paper
Full-text available
Copulas are a flexible tool to model multivariate dependence. Recently, they have become more popular in spatial statistics. They cover the range from perfect positive dependence to independence and also include asymmetric dependence. The vine copula construction method facilitates the decomposition of a multivariate copula into a set of bivariate c...
Preprint
Full-text available
In the stock market, accurate prediction of stock price movement direction can effectively increase the profits for investors. However, the stock price is an extremely complex dynamic system with strong fluctuation, and proper selection of technical indicators can potentially improve the accuracy of the direction prediction. We propose a novel sparse l...
Article
Full-text available
In this paper, a ridgelized Hotelling’s T² test is developed for a hypothesis on a large-dimensional mean vector under certain moment conditions. It generalizes the main result of Chen et al. [A regularized Hotelling’s T² test for pathway analysis in proteomic studies, J. Am. Stat. Assoc. 106(496) (2011) 1345–1360....
Article
Full-text available
In environmental studies, regression analysis is frequently performed. The classical approach is the ordinary least squares method which consists in minimizing the sum of the squares of the residuals. However, this method relies on strong assumptions that are not always satisfied. In environmental data, the response variable often contains outliers...
Article
Full-text available
New technologies have produced increasingly complex and massive datasets, such as next generation sequencing and microarray data in biology, dynamic treatment regimes in clinical trials and long-term wide-scale studies in the social sciences. Each study exhibits its unique data structure within individuals, clusters and possibly across time and spa...
Article
Full-text available
Generalized estimating equations (GEE) is a widely used method for analysing longitudinal data, and the sandwich method is often used to estimate the variance–covariance matrix of the regression coefficient estimators. However, the sandwich method relies on the residual products as an estimator for the true covariance of the responses, and the esti...
Article
Full-text available
In robust regression, it is usually assumed that the distribution of the error term is symmetric or the data are symmetrically contaminated by outliers. However, this assumption is usually not satisfied in practical problems, and thus if the traditional robust methods, such as Tukey’s biweight and Huber’s method, are used to estimate the regression...
Article
This paper considers predictive regressions, where y_t is predicted by all p lags of x_t, here with x_t being autoregressive of order q, PR(p,q). The literature considers model properties in the cases where p=q. We demonstrate that the current augmented regression method can still reduce the bias in predictive coefficients, but its efficiency depends...
Article
Full-text available
In situ observations of suspended sediment concentration (SSC) and hydrodynamics were conducted in the subaqueous Yellow River Delta, China. With the dataset, a new least absolute shrinkage and selection operator (LASSO) regression model with temporal autocorrelation incorporated (temporal LASSO) is proposed for SSC prediction and mechanism investi...
Article
Parameter estimation for the skew-normal distribution is challenging, since the profile likelihood function of the shape parameter has a stationary point at zero, which hampers the use of traditional methods, such as the maximum likelihood method. We present a modified empirical characteristic function method to perform parameter estimation for the skew-no...
Preprint
Full-text available
Climate change is commonly associated with an overall increase in mean temperature in a defined past time period. Many studies consider temperature trends at the global scale, but the literature is lacking in in-depth analysis of the temperature trends across Australia in recent decades. In addition to heterogeneity in mean and median values, daily...
Article
Subgenome asymmetry (SA) has routinely been attributed to different responses between the subgenomes of a polyploid to various stimuli during evolution. Here, we compared subgenome differences in gene ratio and relative diversity between artificial and natural genotypes of several allopolyploid species. Surprisingly, consistent differences were det...
Article
Full-text available
In energy demand forecasting, the objective function is often symmetric, implying that over-prediction errors and under-prediction errors have the same consequences. In practice, these two types of errors generally incur very different costs. To accommodate this, we propose a machine learning algorithm with a cost-oriented asymmetric loss function...
Article
Inertial motion sensors located on the animal have been used to study the behaviour of ruminant livestock. The time window size of segmented signal data can significantly affect the classification accuracy of animal behaviours. To date, there have been no studies evaluating the impact of a mixture of time window size features on the accuracy of ani...
Article
Full-text available
We establish profit models to predict the performance of airlines in the short term using the quarterly profit data collected on the three largest airlines in China together with additional recent historical data on external influencing factors. In particular, we propose the application of the LASSO estimation method to this problem and we compare...
Preprint
This paper proposes a new robust smooth-threshold estimating equation to select important variables and automatically estimate parameters for high dimensional longitudinal data. Our proposed procedure works well when the number of covariates p increases as the number of subjects n increases and even when p exceeds n. A novel working correlation mat...
Article
This paper contributes to the filling of two gaps in accommodation demand forecasting: (a) the limited number of studies on the use of modern machine learning techniques to identify the dynamics of accommodation demand; and (b) the lack of understanding of comparative forecasting performance of different modelling techniques at multiple forecast ho...
Article
Full-text available
Key message: We identified 1.844 million barley pan-genome sequence anchors from 12,306 genotypes using genetic mapping and machine learning. There is increasing evidence that genes from a given crop genotype are far from covering all genes in that species; thus, building more comprehensive pan-genomes is of great importance in genetic research and bre...
Preprint
Full-text available
The coronavirus disease 2019 (COVID-19) had caused more than 8 million infections as of mid-June 2020. Recently, Brazil has become a new epicentre of COVID-19, while India and the African region are potential epicentres. This study aims to predict the inflection point and outbreak size of these new/potential epicentres at the early phase of the epid...
Article
Full-text available
A robust approach is often desirable in the presence of outliers for more efficient parameter estimation. However, the choice of the regularization parameter value impacts the efficiency of the parameter estimators. To maximize the estimation efficiency, we construct a likelihood function for simultaneously estimating the regression parameters and the tu...
Article
Empirical studies are popular in estimating fish natural mortality rate (M). However, these empirical methods derive M from other life-history parameters and are often perceived as being less reliable than direct methods. To improve the predictive performance and reliability of empirical methods, we develop ensemble learning models, including baggi...
Article
Virtual machine placement (VMP) and power management are essential topics in the development of cloud computing and data centers. The assignment of a virtual machine to a physical machine impacts the energy consumption, the makespan, and the idle time of physical machines. In this paper, we formulate the problem as a three-dimension bin-packing optim...
Article
Full-text available
Homoploid hybrid speciation (HHS) has been historically considered as rare until the wide exploitation of genome sequences over the last few years. It is now widely believed that pervasive HHS was involved in the evolution of a wide range of plant and animal species (Mallet, 2007; Abbott et al., 2013; Sousa & Hey, 2013; Payseur & Rieseberg, 2016; T...
Preprint
Full-text available
In medical studies, the collected covariates usually contain underlying outliers. For clustered/longitudinal data with censored observations, the traditional Gehan-type estimator is robust to outliers existing in response but sensitive to outliers in the covariate domain, and it also ignores the within-cluster correlations. To take account of with...
Preprint
The insensitive parameter in support vector regression determines the set of support vectors that greatly impacts the prediction. A data-driven approach is proposed to determine an approximate value for this insensitive parameter by minimizing a generalized loss function originating from the likelihood principle. This data-driven support vector reg...
Article
The rainfall patterns play an important role in determining the response of Suspended Sediment (SS) and Total Phosphorus (TP) to influence factors, which is complex and needs to be better understood. The Bayesian Network (BN), with each variable depending on only its immediate parent variables, can help to describe the complex processes involved. I...
Article
A network is usually embedded in a larger network and interacts with other networks simultaneously, while the networks in network science literature are generally examined independently. The trade values can generally reflect periodical (annual or monthly) flows of cargo types and values between two economies, while the geographical and transportat...
Article
Motivation: Under two biologically different conditions, we are often interested in identifying differentially expressed genes. It is usually the case that the assumption of equal variances on the two groups is violated for many genes where a large number of them are required to be filtered or ranked. In these cases, exact tests are unavailable an...
Article
Full-text available
Fisheries management must take account of environmental sustainability, economic profitability, and social benefits generated by the public resources. The traditional approach of maximum economic yield (MEY), however, is yet to consider social objectives in deriving quantitative quotas. The current MEY evaluation framework would be appropriate if the e...
Article
To better manage the water environment in highly polluted rivers, the factors influencing water quality need to be investigated. With the effects of oxygen-demanding contaminants, it is difficult to resolve the complex interdependencies of the various factors using conventional methods. The Bayesian Networks (BNs), in which each variable only depends...
Article
Degradation data are usually collected for assessing the reliability of the product. We propose a new two-stage method to analyze degradation data. The degradation path is fitted by the nonlinear mixed effects model in the first stage, and the parameters in lifetime distribution are estimated by maximizing the asymptotic marginal distribution of ps...
Preprint
Full-text available
Homoploid hybrid speciation has been reported in a wide range of species since the exploitation of genome sequences in evolutionary studies. However, the interference of ancestral subdivision has not been adequately considered in many such investigations. Using the D lineage in wheat as an example, we showed clearly that ancestral subdivision has l...
Article
A landmark study published in 2002 estimated a very small Ne/N ratio (around 10⁻⁵) in a population of pink snapper (Chrysophrys auratus, Forster, 1801) in the Hauraki Gulf in New Zealand. It epitomized the tiny Ne/N ratios (<10⁻³) reported in marine species due to the hypothesized operation of sweepstakes reproductive success (SRS). Here we re-eval...
Article
Many problems in biomedical and other sciences are subject to biased estimates (maximum likelihood or of similar types). In two seminal papers Cox and Snell (1968 Cox, D. R., and E. J. Snell. 1968. A general definition of residuals (with discussion). Journal of the Royal Statistical Society. Series B 30:248–75. doi: 10.1111/j.2517-6161.1968.tb00724...
Article
Energy efficiency is a critical issue in the management of data centers, which form the backbone of cloud computing. Virtual resource management has a significant impact on improving the energy efficiency of data centers. Despite the progress in this area, virtual resource management has been considered mainly at two separate levels: application as...
Article
Full-text available
We introduce robust procedures for analyzing water quality data collected over time. One challenging task in analyzing such data is how to achieve robustness in the presence of outliers while maintaining high estimation efficiency so that we can draw valid conclusions and provide useful advice in water management. The robust approach requires specific...
Article
To assess the quality of a water environment, an in-depth analysis of temporal patterns of contaminant concentrations in a water body should be carried out based on unbiased water quality datasets. In this study, we developed a modified log-linear model to account for non-stationary seasonal variations of contaminant concentrations over multiple peri...