Gillian Dobbie

University of Auckland · Department of Computer Science

About

265 Publications · 35,711 Reads
2,304 Citations


Publications (265)
Conference Paper
Recently issued data privacy regulations like GDPR (General Data Protection Regulation) grant individuals the right to be forgotten. In the context of machine learning, this requires a model to forget about a training data sample if requested by the data owner (i.e., machine unlearning). As an essential step prior to machine unlearning, it is still...
Preprint
Recently issued data privacy regulations like GDPR (General Data Protection Regulation) grant individuals the right to be forgotten. In the context of machine learning, this requires a model to forget about a training data sample if requested by the data owner (i.e., machine unlearning). As an essential step prior to machine unlearning, it is still...
Article
Automated model repair techniques enable machines to synthesise patches that ensure models meet given requirements. B-repair, which is an existing model repair approach, assists users in repairing erroneous models in the B formal method, but repairing large models is inefficient due to successive applications of repair. In this work, we improve the...
Article
Machine learning (ML) models have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis. However, recent studies have shown that ML models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model...
Article
Full-text available
In order to adapt random forests to the dynamic nature of data streams, the state-of-the-art technique discards trained trees and grows new trees when concept drifts are detected. This is particularly wasteful when recurrent patterns exist. In this work, we introduce a novel framework called PEARL, which uses both an exact technique and a probabili...
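The entry above notes that discarding trained trees is wasteful when concepts recur. A minimal sketch of that reuse idea follows; this is an illustration only, not the PEARL implementation (the model-pool class and toy callables are hypothetical), showing how retired models can be kept and re-selected by accuracy on a recent window when a concept recurs.

```python
# Hedged sketch of model reuse under recurrent concept drift (not the
# authors' PEARL algorithm): instead of discarding models when drift is
# detected, retire them to a pool and reuse whichever best matches the
# recurring concept, judged by accuracy on the most recent window.

class ModelPool:
    def __init__(self):
        self.pool = []  # previously trained models

    def retire(self, model):
        """Store a model instead of discarding it at drift time."""
        self.pool.append(model)

    def best_match(self, window_X, window_y):
        """Return the stored model most accurate on the recent window,
        or None if the pool is empty."""
        best, best_acc = None, -1.0
        for m in self.pool:
            acc = sum(m(x) == y for x, y in zip(window_X, window_y)) / len(window_y)
            if acc > best_acc:
                best, best_acc = m, acc
        return best

# Toy usage: "models" are plain callables standing in for trained trees.
even = lambda x: x % 2 == 0
big = lambda x: x > 10
pool = ModelPool()
pool.retire(even)
pool.retire(big)
# A recent window generated by the recurring "even" concept:
assert pool.best_match([1, 2, 3, 4], [False, True, False, True]) is even
```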
Article
In software engineering, formal methods are often used to specify and verify design models of software products. Whether design models are consistent with required properties can significantly impact the quality of final software products. In this work, we study B model quality measurements based on the ISO/IEC 25010 standard. These measurements ar...
Preprint
Full-text available
Federated learning (FL) has emerged as a promising privacy-aware paradigm that allows multiple clients to jointly train a model without sharing their private data. Recently, many studies have shown that FL is vulnerable to membership inference attacks (MIAs) that can distinguish the training members of the given model from the non-members. However,...
Preprint
Full-text available
Machine learning (ML) models have been widely applied to various applications, including image classification, text generation, audio recognition, and graph data analysis. However, recent studies have shown that ML models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model...
Article
Full-text available
Recommender systems are important applications in big data analytics because accurate recommendation items or high‐valued suggestions can bring high profit to both commercial companies and customers. To make precise recommendations, a recommender system often needs large and fine‐grained data for training. In the current big data era, data often ex...
Chapter
In order to adapt random forests to the dynamic nature of data streams, the state-of-the-art technique discards trained trees and grows new trees when concept drifts are detected. This is particularly wasteful when recurrent patterns exist. In this work, we introduce a novel framework called PEARL, which uses both an exact technique and a probabili...
Preprint
The B method has facilitated the development of software by specifying the design of software as abstract machines and formally verifying the correctness of the abstract machines. The quality of B abstract machines can significantly impact the quality of final software products. In this paper, we propose a set of criteria for measuring the quality...
Article
Full-text available
By developing awareness of the activities a user is performing on their smartphone, such as scrolling feeds, typing and watching videos, we can build application features that benefit users, such as personalization. It is currently not possible to access real-time smartphone activities directly, due to standard smartph...
Chapter
In New Zealand, road accident casualties have been increasing. Factor analyses and time series analyses show what types of accidents result in casualties, but the results from the analysis can become outdated. We propose a stream classification framework with drift detection to signal and adapt when the factors associated with crash casualties chan...
Chapter
An increasing number of people are using social media services and with it comes a more attractive outlet for phishing attacks. Phishers curate tweets that lead users to websites that download malware. This is a major issue as phishers can gain access to the user’s digital identity and perform malicious acts. Phishing attacks also have a potential...
Chapter
This work addresses model repair in formal methods. Formal verification can establish the correctness of a model using rigorous mathematical methods, but the repair of incorrect models is usually done by humans. To automate model repair, we combine the B method, formal verification, probabilistic methods, satis...
Conference Paper
Recommender systems play a vital role in web-based information systems, especially in the domain of e-commerce. Most of these systems provide their recommendation based on user’s preferences. However, based on different situations of the user, their preferences can differ. Providing recommendations based only on the user’s preferences and ignoring...
Article
Full-text available
The B-method, which provides automated verification for the design of software systems, still requires users to manually repair faulty models. This paper proposes B-repair, an approach that supports automated repair of faulty models written in the B formal specification language. After discovering a fault in a model using the B-method, B-repair is...
Conference Paper
Network embedding learns the vector representations of nodes. Most real world networks are heterogeneous and evolve over time. There are, however, no network embedding approaches designed for dynamic heterogeneous networks so far. Addressing this research gap is beneficial for analyzing and mining real world networks. We develop a novel representat...
Article
When concept drift is detected during classification in a data stream, a common remedy is to retrain a framework’s classifier. However, this loses useful information if the classifier has learnt the current concept well, and this concept will recur again in the future. Some frameworks retain and reuse classifiers, but it can be time-consuming to se...
Preprint
When concept drift is detected during classification in a data stream, a common remedy is to retrain a framework's classifier. However, this loses useful information if the classifier has learnt the current concept well, and this concept will recur again in the future. Some frameworks retain and reuse classifiers, but it can be time-consuming to se...
Article
Top-k nodes are the important actors for a subjectively determined topic in a social network. To some extent, a topic is taken as a ranking criterion for identifying top-k nodes. Within a viral marketing network, subjectively selected topics can include the following: Who can promote a new product to the largest number of people, and who are the hig...
Chapter
Boolean data is a core data type in machine learning. It is used to represent categorical and transactional data. Unlike real-valued data, it is notoriously difficult to efficiently design Boolean datasets that satisfy particular constraints. Inverse Frequent Itemset Mining (IFM) is the problem of constructing a Boolean dataset satisfying given su...
Chapter
Learning in evolving environments involves learning from data where the statistical characteristics can change over time. Current change detection algorithms that are used online for data streams detect whether a change has occurred in the data but there is always a detection delay. None of the existing online techniques can accurately pin-point th...
Chapter
Many real world applications need to capture a mix of temporal and non-temporal entities, relationships and attributes. These concepts add complexity when designing database schemas and it is difficult to capture the temporal semantics precisely. We propose a new framework for designing SQL databases that distinguishes between temporal and non-temp...
Chapter
Querying temporal relational databases is a challenge for non-expert database users, since it requires users to understand the semantics of the database and apply temporal joins as well as temporal conditions correctly in SQL statements. Traditional keyword search approaches are not directly applicable to temporal relational databases since they tr...
Conference Paper
Full-text available
More and more researchers are using remote sensing technology to measure real-world, on-road automobile emissions of nitric oxide (NO), one of the most important and frequently studied pollutants. Partnered with the National Institute of Water and Atmospheric Research (NIWA) in New Zealand, we aim to establish a robust NO emission factor prediction...
Conference Paper
Recently, there has been a strong demand for talented ICT (Information and Communication Technology) graduates in the software industry in New Zealand. To meet this demand, in 2015, the government of New Zealand provided funding for three new ICT Graduate Schools. The challenge for the schools was twofold: to provide a qualification for students tr...
Conference Paper
A data stream’s concept may evolve over time, which is known as concept drift. Concept drifts affect the prediction accuracy of the learning model and must be handled to maintain model quality. In most cases, there is a trade-off between maintaining prediction quality and learning efficiency. We present a novel framework known as...
Conference Paper
Tackling missing data is one of the fundamental data pre-processing steps. Data analysis and pattern extraction are affected due to the underlying differences between instances with and without missing data. This is a particular problem with ordinal data, where for example a sample of a population may have all failed to answer a specific question i...
Conference Paper
We propose the Concept Profiling Framework (CPF), a meta-learner that uses a concept drift detector and a collection of classification models to perform effective classification on data streams with recurrent concept drifts, through relating models by similarity of their classifying behaviour. We introduce a memory-efficient version of our framewor...
Article
Context: Recent years have witnessed growing interest in the semantic web and its related technologies. While various frameworks have been proposed for designing semantic web services (SWS), few of them aim at testing. Objective: This paper investigates the technologies for automatically deriving test cases from semantic web service descriptions...
Article
With the growth in size and complexity of modern computer systems, improving quality at all stages of software development has become a critical issue. Current software production depends largely on manual code development. Despite the slow development process, the errors introduced by the programmers contribute to...
Article
Instead of constructing complex declarative queries, many users prefer to write their programs using procedural code embedded with simple queries. Since many users are not expert programmers or the programs are written in a rush, these programs usually exhibit poor performance in practice and it is a challenge to automatically and efficiently optim...
Article
Semi-stream join algorithms join a fast data stream with a disk-based relation. This is important, for example, in real-time data warehousing where a stream of transactions is joined with master data before loading it into a data warehouse. In many important scenarios, the stream input has a skewed distribution, which makes certain performance opti...
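The entry above describes joining a fast stream with a disk-based relation under skew. A minimal sketch of that setting follows; it is not the paper's algorithm, and the function names and LRU cache are hypothetical, illustrating the common optimisation of caching hot master-data rows so that skewed keys avoid repeated disk lookups.

```python
# Hedged sketch of a semi-stream join with a small cache for skewed
# streams (an illustration, not the paper's optimisation): stream
# tuples are joined against disk-based master data, and recently hit
# master rows are kept in an in-memory LRU cache.

from collections import OrderedDict

def semi_stream_join(stream, lookup_master, cache_size=2):
    """Join (key, payload) stream tuples with master-data rows.
    lookup_master(key) models a disk access returning the master row."""
    cache = OrderedDict()  # LRU cache of hot master rows
    for key, payload in stream:
        if key in cache:
            cache.move_to_end(key)      # refresh recency on a hit
            row = cache[key]
        else:
            row = lookup_master(key)    # simulated disk access
            cache[key] = row
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
        yield (payload, row)

# Toy master relation and a skewed stream (key 1 dominates):
master = {1: "gold", 2: "silver", 3: "bronze"}
disk_reads = []
def lookup(key):
    disk_reads.append(key)
    return master[key]

stream = [(1, "a"), (1, "b"), (2, "c"), (1, "d")]
out = list(semi_stream_join(stream, lookup))
assert out == [("a", "gold"), ("b", "gold"), ("c", "silver"), ("d", "gold")]
assert disk_reads == [1, 2]  # the skewed key is served from cache after one read
```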
Conference Paper
Discords are the most unusual subsequences of a time series. Sequential discovery of discords is time-consuming. As dataset sizes continue to grow, data must be kept on hard disk, which degrades the utilization of computing resources. Furthermore, the results discovered from segmentations of a time series are non-combinable, whi...
Conference Paper
With the prevalence of cutting-edge technology, social media networks are gaining popularity and becoming a worldwide phenomenon. Twitter is one of the most widely used social media sites, with over 500 million users around the world. Along with its rapidly growing user base, it has also attracted unwanted users such as scammers, spa...
Poster
Full-text available
Often, we communicate science in a linear fashion, which suggests the right conclusion was reached in a single swoop, rather than communicating science as it is: a dynamic, exploratory process of evolution. Such discourse obscures how a researcher arrived at a given workflow design, and does not communicate the lessons le...
Conference Paper
Health monitoring involves sensing, reporting, and sometimes adjusting the states of objects or nodes remotely. This paper describes the design and implementation of a real-time distributed hardware health monitoring framework, assuming a homogeneous set of hardware nodes. The framework consists of sensor components operating at the nodes, and visu...
Book
This two-volume set, LNAI 9651 and 9652, constitutes the thoroughly refereed proceedings of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2016, held in Auckland, New Zealand, in April 2016. The 91 full papers were carefully reviewed and selected from 307 submissions. They are organized in topical section...
Chapter
Knowledge Discovery and Data Mining (KDD) helps uncover hidden knowledge in huge amounts of data. However, different researchers have recently questioned the capability of traditional KDD techniques to tackle the information extraction problem efficiently while achieving accurate results when the amount of data grows. One of the ways to ov...
Conference Paper
Full-text available
Extreme weather events such as ice storms cause significant damage to life and property. Accurately forecasting ice storms sufficiently in advance to offset their impacts is very challenging because they are driven by atmospheric processes that are complex and not completely defined. Furthermore, such forecasting has to consider the influence of a...
Conference Paper
Full-text available
When we upload or create data in the cloud or on the web, we immediately lose control of our data. Most of the time, we will not know where the data will be stored, or how many copies of our files exist. Worse, we cannot detect or stop malicious insiders from accessing the possibly sensitive data. Despite being transferred across and withi...
Conference Paper
Referential integrity is one of the three inherent integrity rules and can be enforced in databases using foreign keys. However, in many real world applications referential integrity is not enforced since foreign keys remain disabled to ease data acquisition. Important applications such as anomaly detection, data integration, data modeling, indexin...
Article
Data clustering is one of the most widely used data mining techniques, grouping similar data items together on the basis of their similarity. Various issues arise in assigning data to the most suitable grouping. Efficiency of the clustering techniques and accuracy of the resulting group...
Conference Paper
In recent years, the number of scientific publications has increased substantially. One way to measure the impact of a publication is to count the number of citations to the paper. Thus, citations are used as a proxy for a researcher’s contribution and influence in a field. Citation classification can provide context to the citations. To pe...
Conference Paper
Categories are the fundamental components of scientific knowledge and are used in every phase of the scientific process. However, they are often in a state of flux, with new observations, discoveries and changes in our conceptual understanding leading to the birth and death of categories, drift in their identities, as well as merging or splitting....
Conference Paper
Full-text available
Current methods in data streams that detect concept drifts in the underlying distribution of data look at the distribution difference using statistical measures based on mean and variance. Existing methods are unable to proactively approximate the probability of a concept drift occurring and predict future drift points. We extend the current drift...
Article
Full-text available
Recommender systems are highly vulnerable to shilling attacks, both by individuals and groups. Attackers who introduce biased ratings in order to affect recommendations, have been shown to negatively affect collaborative filtering (CF) algorithms. Previous research focuses only on the differences between genuine profiles and attack profiles, ignori...
Chapter
There has been some research in the area of rare pattern mining, where researchers try to capture patterns involving events that are unusual in a dataset. These patterns are considered more useful than frequent patterns in some domains, including detection of computer attacks or fraudulent credit transactions. Until now, most of the research in...
Chapter
We propose algorithms for the detection of disjoint and overlapping communities in networks. The algorithms exploit both the degree and clustering coefficient of vertices as these metrics characterize dense connections, which we hypothesize as being indicative of communities. Each vertex independently seeks the community to which it belongs, by vis...
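The entry above says the algorithms exploit vertex degree and clustering coefficient as signals of dense connections. The two metrics themselves are standard, and a minimal illustration follows; the community-seeking procedure of the paper is not reproduced, and the adjacency representation is an assumption for the example.

```python
# Hedged illustration of the two vertex metrics the entry above relies
# on: degree, and the local clustering coefficient (the fraction of a
# vertex's neighbour pairs that are themselves linked).

def degree(adj, v):
    return len(adj[v])

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves linked;
    defined as 0.0 when v has fewer than two neighbours."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# A triangle {0, 1, 2} plus a pendant vertex 3 attached to 1:
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
assert degree(adj, 1) == 3
assert clustering_coefficient(adj, 0) == 1.0      # both neighbours linked
assert clustering_coefficient(adj, 1) == 1 / 3    # one of three pairs linked
```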
Article
Current drift detection techniques detect a change in distribution within a stream. However, there are no current techniques that analyze the change in the rate of these detected changes. We coin the term stream volatility, to describe the rate of changes in a stream. A stream has a high volatility if changes are detected frequently and has a low v...
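The entry above coins stream volatility as the rate of detected changes. A minimal sketch of one way to quantify that follows; the monitor class and window parameter are assumptions for illustration, not the paper's method: short mean intervals between detections indicate high volatility, long intervals low volatility.

```python
# Hedged sketch of quantifying stream volatility (an illustration, not
# the paper's technique): track the positions at which a drift detector
# fired and summarise volatility as the mean interval between the most
# recent detections.

from collections import deque

class VolatilityMonitor:
    def __init__(self, window=3):
        self.intervals = deque(maxlen=window)  # sliding window of gaps
        self.last = None

    def drift_at(self, position):
        """Record a drift detection at the given stream position."""
        if self.last is not None:
            self.intervals.append(position - self.last)
        self.last = position

    def mean_interval(self):
        """Mean gap between recent drifts; None until two drifts seen.
        A small value indicates high volatility."""
        if not self.intervals:
            return None
        return sum(self.intervals) / len(self.intervals)

m = VolatilityMonitor(window=3)
for pos in [100, 150, 200, 250]:
    m.drift_at(pos)
assert m.mean_interval() == 50.0  # gaps 50, 50, 50
```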
Poster
Full-text available
A Research Object (RO) is an aggregation of various digital assets – data, methods, software and workflows – and thus relies on the underlying digital ecosystem of science, which is fragmented among a multitude of disconnected tools and systems – data repositories, software tools, workflow systems, digital journals, wikis, social networks, etc. This fr...
Article
Understanding the impacts of copyright is a challenge for the sharing and reuse of our research data. There is growing recognition of the problem, but the legal knowledge required to navigate through the minefield of restrictions and risks is often too difficult to uncover and understand. As of yet there are no appropriate tools to aid researchers,...
Article
Fraud is an ongoing concern for online auction websites. Current methods to detect or prevent fraud have been limited in several ways, making them difficult to apply in real world settings. Firstly, existing methods cannot adapt to changes in the behaviour of fraudulent users over time: new models must be continuously constructed as they gradually...
Conference Paper
In this age of digital science, our scientific knowledge is fragmented into data, methods, schemas, ontology, code and workflows – each of them largely disconnected from the others and manipulated within their own specialized tools. These tools have disaggregated our scientific knowledge and analytic processes. As a result, it is becoming much hard...
Article
Optimization based techniques have emerged as important methods to tackle the problems of efficiency and accuracy in data mining. One of the current application areas is outlier detection that has not been fully explored yet but has enormous potential. Web bots are an example of outliers, which can be found in the web usage analysis process. Web bo...
Poster
Full-text available
The poster was presented at the ESWC Summer School 2014 and won first prize.
Conference Paper
In Data Stream Management Systems (DSMS), semi-stream processing has become a popular area of research due to the high demand from applications for up-to-date information (e.g. in real-time data warehousing). A common operation in stream processing is joining an incoming stream with disk-based master data, also known as a semi-stream join. This join typ...
Conference Paper
Current approaches to drift detection assume that stable memory consumption with slight variations with each stream is suitable for all programs. This is not always the case and there are situations where small variations in memory are undesirable such as drift detectors on medical vital sign monitoring systems. Under these circumstances, it is not...
Article
Existing works on keyword search over relational databases typically do not consider users' search intention for a query and return many answers which often overwhelm users. We observe that a database is in fact a repository of real world objects that interact with each other via relationships. In this work, we identify four types of semantic paths...