Wei Luo
Deakin University · School of Information Technology
Doctor of Philosophy
About
111
Publications
40,088
Reads
2,979
Citations
Introduction
My research focuses on improving the reliability of machine learning models to benefit the broader community. Recognising that data are only noisy, partial, and biased representations of reality, we aim to create machine learning models faithful to the application domain and robust in the prediction. To this end, we tackle the research challenges in model uncertainty and generalisation via geometric and topological methods, particularly through the lens of dynamical systems.
Additional affiliations
January 2009 - August 2012
Education
September 2001 - April 2008
September 1999 - June 2001
September 1995 - June 1999
Publications (111)
Deep neural networks (DNNs) are susceptible to backdoor attacks, where adversaries poison datasets with adversary-specified triggers to implant hidden backdoors, enabling malicious manipulation of model predictions. Dataset purification serves as a proactive defense by removing malicious training samples to prevent backdoor injection at its source....
Deep neural networks (DNNs) are vulnerable to backdoor attacks, where adversaries can maliciously trigger model misclassifications by implanting a hidden backdoor during model training. This paper proposes a simple yet effective input-level backdoor detection (dubbed IBD-PSC) as a 'firewall' to filter out malicious testing images. Our method is mot...
Neural ordinary differential equations (NODEs) have achieved remarkable performance in many data mining applications that involve multivariate time series data. Their adoption in the data-driven discovery of dynamic systems, however, was hindered by the lack of interpretability due to the black-box nature of neural networks. In this study, we propose...
Federated learning (FL) involves collaboration between clients with limited data to produce a single optimal global model through consensus. One of the difficulties with FL is the differences in data statistics between local clients. Clients with statistically heterogeneous data deviate from the global target, resulting in a slower convergence rate...
As deep learning gains popularity in modelling dynamical systems, we expose an underappreciated misunderstanding relevant to modelling dynamics on networks. Strongly influenced by graph neural networks, latent vertex embeddings are naturally adopted in many neural dynamical network models. However, we show that embeddings tend to induce a model tha...
Federated learning (FL) is a collaborative machine learning paradigm in which clients with limited data collaborate to train a single “best” global model based on consensus. One major challenge facing FL is the statistical heterogeneity among the data for each of the local clients. Clients trained with non-IID or imbalanced data whose models are ag...
In recent years, distributed graph convolutional network (GCN) training frameworks have achieved great success in learning representations of large graph-structured data. However, existing distributed GCN training frameworks require enormous communication costs since a multitude of dependent graph data need to be transmitted from ot...
In federated learning, client models are often trained on local training sets that vary in size and distribution. Such statistical heterogeneity in training data leads to performance variations across local models. Even within a model, some parameter estimates can be more reliable than others. Most existing FL approaches (such as FedAvg), however,...
Evolving Android malware poses a severe security threat to mobile users, and machine-learning (ML)-based defense techniques attract active research. Due to the lack of knowledge, many zero-day families’ malware may remain undetected until the classifier gains specialized knowledge. Most existing ML-based methods will take a long time to learn n...
Reconstructing gene regulatory networks (GRNs) from expression data is vital for understanding gene transcription. Although increasingly advanced algorithms, particularly deep-learning models, are proposed to mine potential gene-regulatory interactions, insufficient effort has been invested in improving feature reliability in the presence of b...
Deep neural networks tend to underestimate uncertainty and produce overly confident predictions. Recently proposed solutions, such as MC Dropout and SDENet, require complex training and/or auxiliary out-of-distribution data. We propose a simple solution by extending the time-tested iterative reweighted least square (IRLS) in generalised linear regr...
In recent years, Graph Convolutional Networks (GCNs) have achieved great success in learning from graph-structured data. As graphs grow in nodes and edges, GCN training on a single processor cannot meet the demand for time and memory, which has led to a boom in research on distributed GCN training frameworks. However, existing distributed...
Contrast sets are used in many knowledge-based systems to capture data patterns relevant to a target variable. While they have many advantages such as being highly interpretable, they do not come with a similarity measure or feature vectors for downstream tasks such as regression or classification. To address these disadvantages, we propose Con2Vec...
Aim: To use available electronic administrative records to identify data reliability, predict discharge destination, and identify risk factors associated with specific outcomes following hospital admission with stroke, compared to stroke specific clinical factors, using machine learning techniques.
Method: The study included 2,531 patients having a...
Nationally and internationally, there is a strong push for primary school teachers to use data to support their instructional decision making in classrooms. As new technology enables the rapid generation of many different forms of data, significant challenges arise for teachers to make sense of this large volume of data in order to inform their ins...
Background
This study aims to derive country-specific EQ-5D-5L health status utility (HSU) from the MacNew Heart Disease Health-related Quality of Life questionnaire (MacNew) using both traditional regression analyses, as well as a machine learning technique.
Methods
Data were drawn from the Multi-Instrument Comparison (MIC) survey. The EQ-5D-5L w...
Source camera identification is central to multimedia forensics, with much research effort devoted to this problem. The assumption made by most existing solutions is that the images are taken from a finite set of camera models available at the classifier training stage. In most cloud-based image services, however, new images are uploaded daily fr...
Deep Learning (DL) is a disruptive technology that has changed the landscape of cyber security research. Deep learning models have many advantages over traditional Machine Learning (ML) models, particularly when there is a large amount of data available. Android malware detection or classification qualifies as a big data problem because of the fast...
In this paper, we present a novel machine-learning approach that analyzes student assessment scores across a teaching period to predict their final exam performance. One challenge for many universities around the world is identifying the students who are at risk of failing a subject early enough to provide proactive interventions that...
Link prediction plays an important role in network analysis and applications. Recently, approaches for link prediction have evolved from traditional similarity-based algorithms into embedding-based algorithms. However, most existing approaches fail to exploit the fact that real-world networks are different from random networks. In particular, real-...
Introduction
Limited evidence exists on the cost-effectiveness of interventions to prevent obesity and promote healthy body image in adolescents. The SHINE (Supporting Healthy Image, Nutrition and Exercise) study is a cluster randomised control trial (cRCT) aiming to deliver universal education about healthy nutrition and physical activity to adole...
Many applications of intelligent systems involve understanding a group with a contrastively different outcome (e.g., all survivors of a deadly cancer, a top-performing team in a large corporation). The intelligent system needs to identify attributes (features) which best describe or explain the group versus its alternatives. In data mining, this proble...
Human emotions can be recognized from facial expressions captured in videos. It is a growing research area in which many have attempted to improve video emotion detection in both lab‐controlled and unconstrained environments. While existing methods show a decent recognition accuracy on lab‐controlled datasets, they deliver much lower accuracy in a...
The aim of this study was to assess if tactical and technical performance indicators (PIs) could be used in combination to model match outcomes in Australian Football (AF). A database of 101 technical PIs and 14 tactical PIs from every match in the 2009–2016 Australian Football League (AFL) seasons was merged. Two outcome measures Win-loss and Scor...
Emotions are expressed by humans to demonstrate their feelings in daily life. Video emotion recognition can be employed to detect various human emotions captured in videos. Recently, many researchers have been attracted to this research area and attempted to improve video emotion detection in both lab controlled and unconstrained environments. Whil...
Financial risk management avoids losses and maximizes profits, and hence is vital to most businesses. As the task relies heavily on information-driven decision making, machine learning is a promising source for new methods and technologies. In recent years, we have seen increasing adoption of machine learning methods for various risk management tas...
Machine learning (ML) has great potential in automated code vulnerability discovery. However, automated discovery applications driven by off-the-shelf machine learning tools often perform poorly due to the shortage of high-quality training data. The scarcity of vulnerability data is almost always a problem for any developing software project duri...
Android malware poses serious security and privacy threats to mobile users. Traditional malware detection and family classification technologies are becoming less effective due to the rapid evolution of the malware landscape, with the emergence of so-called zero-day malware families. To address this issue, our paper presents a novel resea...
Android has dominated the smartphone market and become the most popular mobile operating system. This rapidly increasing market share of Android has contributed to the boom of Android malware in numbers and in varieties. There exist many techniques which are proposed to accurately detect malware, e.g., software engineering-based techniques and mach...
Variable annuities are very profitable financial products that pose unique challenges in risk prediction. Metamodeling techniques are popular due to the significant saving in computation time. However, the current metamodeling techniques still have a low valuation accuracy. One key difficulty is the selection of a small number of contracts that opt...
Variable annuities are important financial products that generated 100 billion in sales in 2018. These products contain complex guarantees that are computationally expensive to value, and insurance companies are turning to machine learning for the valuation of large portfolios of variable annuity policies. Although earlier studies, exemplified by the...
Detecting anomalies in surveillance videos has long been an important but unsolved problem. In particular, many existing solutions are overly sensitive to (often ephemeral) visual artifacts in the raw video data, resulting in false positives and fragmented detection regions. To overcome such sensitivity and to capture true anomalies with semantic s...
Social network analysis (SNA) has been applied widely in soccer and basketball to assess how a team shares possession of the ball. Social network analysis can be used to determine whether the characteristics of teamwork are related to match outcome (Duch et al., 2010; Grund, 2012). To date, this approach had not been applied to assess team wor...
Mathematical models that explain match outcome, based on the value of technical performance indicators (PIs), can be used to identify the most important aspects of technical performance in team field-sports. The purpose of this study was to evaluate several methodological opportunities to enhance the accuracy of this type of modelling. Specificall...
Recently, malicious Android samples have threatened the security and privacy of billions of mobile end users. Researchers in the community have designed many methods to automatically and accurately identify Android malware samples. However, the rapid increase of Android malicious samples outpaces the capabilities of traditional Android malware detectors and cla...
Social network analysis (SNA) has been applied in soccer and basketball to assess how a team shares possession of the ball, which could be considered as an aspect of teamwork. The analysis of teamwork could provide the opportunity to identify tactical characteristics of team performance that are associated with winning. Ball possession data from ea...
When learning sequence representations, traditional pattern-based methods often suffer from the data sparsity and high-dimensionality problems while recent neural embedding methods often fail on sequential datasets with a small vocabulary. To address these disadvantages, we propose an unsupervised method (named Sqn2Vec) which first leverages sequen...
Android malware can pose a serious security threat to mobile users. With the rapid growth in malware programs, categorical isolation of malware is no longer satisfactory for security risk management. It is more pragmatic to focus the limited resources on identifying the small fraction of malware programs of high security impact. In this paper, we...
Objectives:
To identify novel insights about performance in Australian Football (AF), by modelling the relationships between player actions and match outcomes. This study extends and improves on previous studies by utilising a wider range of performance indicators (PIs) and a longer time frame for the development of predictive models.
Design:
Ob...
Cancer is a worldwide problem and one of the leading causes of death. Increasing prevalence of cancer, particularly in developing countries, demands better understandings of the effectiveness and adverse consequences of different cancer treatment regimes in real patient populations. Current understandings of cancer treatment toxicities are often de...
Learning meaningful and effective representations for transaction data is a crucial prerequisite for transaction classification and clustering tasks. Traditional methods which use frequent itemsets (FIs) as features often suffer from the data sparsity and high-dimensionality problems. Several supervised methods based on discriminative FIs have been...
We propose a novel approach to learn distributed representation for graph data. Our idea is to combine a recently introduced neural document embedding model with a traditional pattern mining technique, by treating a graph as a document and frequent subgraphs as atomic units for the embedding process. Compared to the latest graph embedding methods,...
Evidence-based medicine often involves the identification of patients with similar conditions, which are often captured in ICD code sequences. With no satisfying prior solutions for matching ICD-10 code sequences, this paper presents a method which effectively captures the clinical similarity among routine patients who have multiple comorbidities a...
Machine learning is now widely used to detect security vulnerabilities in software, even before the software is released. But its potential is often severely compromised at the early stage of a software project, when we face a shortage of high-quality training data and have to rely on overly generic hand-crafted features. This paper addresses this...
Introduction & Aims: Sport scientists are increasingly using data mining methods to analyse sporting performance, but important methodological considerations in this context are underexplored. The aim of this study is to demonstrate and critically evaluate the application of common data mining techniques, with game style 'era identification' in Aus...
In cybersecurity, vulnerability discovery in source code is a fundamental problem. To automate vulnerability discovery, machine learning (ML)-based techniques have attracted tremendous attention. However, existing ML-based techniques focus on component- or file-level detection, and thus considerable human effort is still required to pinpoint the...
Objectives:
Comparison of outcomes for cancer patients discussed and not discussed at a multidisciplinary meeting (MDM).
Study design:
Retrospective analysis of the association of MDM discussion with survival.
Methods:
All newly diagnosed cancer patients from 2009 to 2012, presenting to a large regional cancer service in South West Victoria, A...
Background:
As more and more researchers are turning to big data for new opportunities of biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Because of the flexibility...
In this paper, we consider the patient similarity matching problem over a cancer cohort of more than 220,000 patients. Our approach first leverages the Word2Vec framework to embed ICD codes into vector-valued representations. We then propose a sequential algorithm for case-control matching on this representation space of diagnosis codes. The novel pr...
Data scientists, with access to fast growing data and computing power, constantly look for algorithms with greater detection power to discover “novel” knowledge. But more often than not, their algorithms give them too many outputs that are either highly speculative or simply confirming what the domain experts already know. To escape this dilemma, w...
Preterm births occur at an alarming rate of 10-15%. Preemies have a higher risk of infant mortality, developmental retardation and long-term disabilities. Predicting preterm birth is difficult, even for the most experienced clinicians. The most well-designed clinical study thus far reaches a modest sensitivity of 18.2-24.2% at specificity of 28.6-3...
Objective:
Our study investigates different models to forecast the total number of next-day discharges from an open ward having no real-time clinical data.
Methods:
We compared 5 popular regression algorithms to model total next-day discharges: (1) autoregressive integrated moving average (ARIMA), (2) the autoregressive moving average with exoge...
Background:
Although physical illnesses, routinely documented in electronic medical records (EMR), have been found to be a contributing factor to suicides, no automated systems use this information to predict suicide risk.
Objective:
The aim of this study is to quantify the impact of physical illnesses on suicide risk, and develop a predictive m...
Background:
Preterm birth is a clinically significant event that is difficult to predict. Biomarkers such as fetal fibronectin and cervical length are effective, but they are often used only for women with clinically suspected preterm risk. It is unknown whether routinely collected data can be used in early pregnancy to stratify preterm birth risk by ide...
Treatments of cancer cause severe side effects called toxicities. Reduction of such effects is crucial in cancer care. To impact care, we need to predict toxicities at fortnightly intervals. This toxicity data differs from traditional time series data as toxicities can be caused by one treatment on a given day alone, and thus it is necessary to con...
Objectives: The Health of the Nation Outcome Scales (HoNOS) are mandated outcome measures in many mental-health jurisdictions. When HoNOS are used in different care settings, it is important to assess whether setting-specific bias exists. This article examines the consistency of HoNOS in a sample of psychiatric patients transitioned from acute inpatient...
The era of big data brings new challenges to the network traffic technique that is an essential tool for network management and security. To deal with the problems of dynamic ports and encrypted payload in traditional port-based and payload-based methods, the state-of-the-art method employs fl