Benjamin C. M. Fung
McGill University · School of Information Studies (SIS)
Ph.D., M.Sc., B.Sc., P.Eng.
About
213 Publications
65,524 Reads
11,747 Citations
Introduction
Benjamin Fung is a Canada Research Chair in Data Mining for Cybersecurity, a Full Professor of Information Studies (SIS), and an Associate Member of Computer Science (SOCS) at McGill University. He received a Ph.D. degree in computing science from Simon Fraser University in 2007. Benjamin has over 130 refereed publications that span the research forums of data mining, privacy protection, and cyber forensics.
Additional affiliations
September 2013 - July 2020
April 2015 - present
June 2011 - August 2013
Education
January 2003 - April 2007
September 2000 - September 2002
September 1994 - April 1999
Publications (213)
With the increasing prevalence of location-aware devices, trajectory data has been generated and collected in various application domains. Trajectory data carries rich information that is useful for many data analysis tasks. Yet, improper publishing and use of trajectory data could jeopardize individual privacy. However, it has been shown that exis...
The collection of digital information by governments, corporations, and individuals has created tremendous opportunities for knowledge- and information-based decision making. Driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for the exchange and publication of data among various parties. Data...
Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among the existing privacy models, ε-differential privacy provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. Most of the existing solutions that ensure ε-differenti...
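The ε-differential privacy guarantee mentioned above is commonly instantiated with the Laplace mechanism. As a minimal sketch (not the method of the paper; the trajectory records and the query below are invented for illustration), a counting query has sensitivity 1, so adding Laplace(1/ε) noise to the true count satisfies ε-differential privacy:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """ε-differentially private count: a counting query has
    sensitivity 1, so Laplace(1/ε) noise suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical example: noisy count of trajectories longer than 3 points
trajectories = [[(0, 0), (1, 1)], [(0, 0), (1, 0), (2, 0), (3, 0)]]
noisy = dp_count(trajectories, lambda t: len(t) > 3, epsilon=0.5)
```

Smaller ε means stronger privacy but noisier answers; the guarantee holds regardless of what background knowledge an adversary has, which is the property the snippet above highlights.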
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of exponentially exploding textual data. It extracts an author's identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguis...
Training large language models is a computationally intensive process that often requires substantial resources to achieve state-of-the-art results. Incremental layer-wise training has been proposed as a potential strategy to optimize the training process by progressively introducing layers, with the expectation that this approach would lead to fas...
Software reverse engineering is an essential but time-consuming undertaking in identifying malware, software vulnerabilities, and plagiarism, especially when access to the source code is limited. The extraction of abstract characteristics that represent malware and work as classifier inputs in traditional machine learning approaches requires featur...
Addressing the challenge of toxic language in online discussions is crucial for the development of effective toxicity detection models. This pioneering work focuses on addressing imbalanced datasets in toxicity detection by introducing a novel approach to augment toxic language data. We create a balanced dataset by instruction fine-tuning of Large...
Explainable Artificial Intelligence (XAI) aims to alleviate the black-box AI conundrum in the field of Digital Forensics (DF) (and others) by providing layman-interpretable explanations to predictions made by AI models. It also handles the increasing volumes of forensic images that are impossible to investigate via manual methods; or even automated...
In the past decade, the number of malware variants has increased rapidly. Many researchers have proposed to detect malware using intelligent techniques, such as Machine Learning (ML) and Deep Learning (DL), which have high accuracy and precision. These methods, however, suffer from being opaque in the decision-making process. Therefore, we need Art...
Social media platforms present a perplexing duality, acting at once as sites to build community and a sense of belonging, while also giving rise to misinformation, facilitating and intensifying disinformation campaigns and perpetuating existing patterns of discrimination from the physical world. The first-step platforms take in mitigating the harmf...
Representation learning has been applied to Electronic Health Records (EHR) for medical concept embedding and the downstream predictive analytics tasks with promising results. Medical ontologies can also be integrated to guide the learning so that the embedding space can better align with existing medical knowledge. Yet, properly carrying out the i...
Object detection techniques have been widely studied, utilized in various works, and have exhibited robust performance on images with sufficient luminance. However, these approaches typically struggle to extract valuable features from low-luminance images, which often exhibit blurriness and dim appearance, leading to detection failures. To overcome...
Data-driven energy prediction models have drawn extensive attention in the building domain in recent years. Improving the predictive accuracy of energy prediction models has been the main concern of existing research. However, an accurate model cannot ensure perfect performance under all situations, and the performance variation may cause fairness p...
With the growing global awareness of the environmental impact of clothing consumption, there has been a notable surge in the publication of journal articles dedicated to “fashion sustainability” in the past decade, specifically from 2010 to 2020. However, despite this wealth of research, many studies remain disconnected and fragmented due to varyin...
The practice of code reuse is crucial in software development for a faster and more efficient development lifecycle. In reality, however, code reuse practices lack proper control, resulting in issues such as vulnerability propagation and intellectual property infringements. Assembly clone search, a critical shift-right defence mechanism, has been e...
Software vulnerabilities have been posing tremendous reliability threats to the general public as well as critical infrastructures, and there have been many studies aiming to detect and mitigate software defects at the binary level. Most of the standard practices leverage both static and dynamic analysis, which have several drawbacks like heavy man...
With the increasing adoption of digital health platforms through mobile apps and online services, people have greater flexibility connecting with medical practitioners, pharmacists, and laboratories and accessing resources to manage their own health-related concerns. Many healthcare institutions are connecting with each other to facilitate the exch...
The proliferation of ransomware has become a significant threat to cybersecurity in recent years, causing significant financial, reputational, and operational damage to individuals and organizations. This paper aims to provide a comprehensive overview of the evolution of ransomware, its taxonomy, and its state-of-the-art research contributions. We...
Social media use has transformed communication and made social interaction more accessible. Public microblogs allow people to share and access news through existing and social-media-created social connections and access to public news sources. These benefits also create opportunities for the spread of false information. False information online can...
COVID-19 is an opportunity to study public acceptance of a ‘‘new’’ healthcare intervention, universal masking, which unlike vaccination, is mostly alien to the Anglosphere public despite being practiced in ages past. Using a collection of over two million tweets, we studied the ways in which proponents and opponents of masking vied for influence as...
In recent years, the massive data collection in buildings has paved the way for the development of accurate data-driven building models (DDBMs) for various applications. However, a model with a high overall accuracy would not ensure a good predictive performance on all conditions. The biased predictive performance for some conditions may cause fair...
In recent years, the massive data collected from buildings has made the development and application of data-driven building models a hot research topic. Due to the variation of data volume in different conditions, existing data-driven building models (DDBMs) can present distinct accuracy for different users or periods. This may create further fairness pro...
Recent research indicates that machine learning models are vulnerable to adversarial samples that are slightly perturbed versions of natural samples. Adversarial samples can be crafted in white-box or black-box scenario. In the black-box scenario adversaries possess no knowledge of the detailed architecture and parameters of the model they attack,...
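The snippet above concerns the black-box scenario, but the underlying idea is easiest to see in the white-box case. The toy sketch below (the logistic-regression weights and the sample are invented, and this is the classic fast-gradient-sign construction, named here as an illustration rather than anything specific to the paper) perturbs each feature slightly in the direction that increases the model's loss:

```python
import math

# Toy logistic regression model; w and b are made-up values.
w = [2.0, -3.0, 1.5]
b = 0.5

def predict(x):
    """Probability that x belongs to class 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, eps):
    """One fast-gradient-sign step: for logistic loss, the gradient
    with respect to the input is (p - y) * w, so each feature moves
    by eps in the gradient's sign direction to increase the loss."""
    p = predict(x)
    return [xi + eps * math.copysign(1.0, (p - y) * wi)
            for xi, wi in zip(x, w)]

x = [1.0, 0.2, -0.5]        # a natural sample with true label 1
x_adv = fgsm(x, y=1.0, eps=0.3)
```

In the black-box scenario the gradient is unavailable, so attackers typically estimate it from model queries or craft the perturbation on a substitute model and transfer it.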
Data-driven models have drawn extensive attention in the building domain in recent years, and their predictive accuracy depends on features or data distribution. Accuracy variation among users or periods creates a certain unfairness to some users. This paper addresses a new research problem called fairness-aware prediction of data-driven building a...
Malware has been an increasing threat to computer users. Different pieces of malware have different damage potential depending on their objectives and functionalities. In the literature, there are many studies that focus on automatically identifying malware with their families. However, there is a lack of focus on automatically identifying the seve...
In recent years, the declining birthrate and aging population have gradually brought countries into an ageing society. Regarding accidents that occur amongst the elderly, falls are an essential problem that quickly causes indirect physical loss. In this paper, we propose a pose estimation-based fall detection algorithm to detect fall risks. We use...
Deep learning models have achieved state-of-the-art performance in many classification tasks. However, most of them cannot provide an interpretation for their classification results. Machine learning models that are interpretable are usually linear or piecewise linear and yield inferior performance. Non-linear models achieve much better classificat...
A reliable occupancy prediction model plays a critical role in improving the performance of energy simulation and occupant-centric building operations. In general, occupancy and occupant activities differ by season, and it is important to account for the dynamic nature of occupancy in simulations and to propose energy-efficient strategies. The pres...
Indiscriminate elimination of harmful fake news risks destroying satirical news, which can be benign or even beneficial, because both types of news share highly similar textual cues. In this work we applied a recent development in neural network architecture, transformers, to the task of separating satirical news from fake news. Transformers have h...
Complementary metal-oxide-semiconductor (CMOS) image sensors can cause noise in images collected or transmitted in unfavorable environments, especially low-illumination scenarios. Numerous approaches have been developed to solve the problem of image noise removal. However, producing natural and high-quality denoised images remains a crucial challen...
The widespread popularity of social networking is leading to the adoption of Twitter as an information dissemination tool. Existing research has shown that information dissemination over Twitter has a much broader reach than traditional media and can be used for effective post-incident measures. People use informal language on Twitter, including ac...
Limited empirical research has examined the importance of product cues and information sources in relation to demographic variables and consumer innovativeness, particularly from a cross-national perspective. In order to understand consumer choice from a cross-national perspective, data were collected from Canada, China, India, and Taiwan. Data wer...
Malware detection and classification are becoming more and more challenging, given the complexity of malware design and the recent advancement of communication and computing infrastructure. The existing malware classification approaches enable reverse engineers to better understand their patterns and categorizations, and to cope with their evolutio...
Malware currently presents a number of serious threats to computer users. Signature-based malware detection methods are limited in detecting new malware samples that are significantly different from known ones. Therefore, machine learning-based methods have been proposed, but there are two challenges these methods face. The first is to model the fu...
Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether unseen topics are used in the testing phase. However, neither scenario allows us to expla...
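As a minimal illustration of the attribution task itself (a character n-gram baseline, not the approach of this work; the candidate names and texts are invented), one can profile each candidate's writing and attribute the anonymous text to the closest profile:

```python
import math
from collections import Counter

def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(anonymous, candidates):
    """Return the candidate whose training text is most similar
    to the anonymous text."""
    target = ngram_profile(anonymous)
    return max(candidates,
               key=lambda a: cosine(target, ngram_profile(candidates[a])))

candidates = {
    "alice": "the quick brown fox jumps over the lazy dog again and again",
    "bob":   "colorless green ideas sleep furiously in the night sky",
}
```

Character n-grams capture habits such as spelling, punctuation, and function-word use, which is why they survive topic changes better than plain word features, the very tension the same-topic and cross-topic scenarios above are designed to probe.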
The purpose of this study is to investigate the salient effects of product evaluative cues from a cross-national perspective. A web-based survey consisting of eight measuring items of environmental commitment and behaviour, 20 items of product cues, and demographic and behavioural questions was employed. A total of 321 and 309 usable surveys were c...
Haze removal techniques employed to increase the visibility level of an image play an important role in many vision-based systems. Several traditional dark channel prior-based methods have been proposed to remove haze formation and thereby enhance the robustness of these systems. However, when the captured images contain disproportionate haze distr...
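For context, the dark channel prior named above rests on a simple statistic: in haze-free outdoor images, most local patches contain some pixel that is dark in at least one colour channel, and haze raises these values. A minimal sketch of computing the dark channel (pure Python over a nested-list RGB image with a tiny invented example; real implementations operate on arrays with larger patches):

```python
def dark_channel(img, patch=3):
    """Per-pixel minimum over the RGB channels, followed by a minimum
    filter over a local patch. Dark-channel-prior dehazing methods use
    this map to estimate the haze transmission."""
    h, w = len(img), len(img[0])
    channel_min = [[min(px) for px in row] for row in img]
    r = patch // 2
    return [[min(channel_min[y][x]
                 for y in range(max(0, i - r), min(h, i + r + 1))
                 for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]

# Tiny hypothetical 2x2 RGB image with intensities in [0, 1]
img = [[(0.9, 0.8, 0.7), (0.5, 0.6, 0.4)],
       [(0.2, 0.3, 0.1), (1.0, 1.0, 1.0)]]
dc = dark_channel(img)
```

When haze is unevenly distributed across the scene, a single global estimate derived from this map misfires, which is the failure mode the snippet above alludes to.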
We are thrilled and delighted to present this special issue, which emphasises the novel area of Enabling Technologies for Energy Cloud. This guest editorial provides an overview of all articles accepted for publication in this special issue.
Users from all over the world increasingly adopt social media for newsgathering, especially during breaking news. Breaking news is an unexpected event that is currently developing. Early stages of breaking news are usually associated with lots of unverified information, i.e., rumors. Efficiently detecting and acting upon rumors in a timely fashion...
Artificial intelligence (AI) is a well-established branch of computer science concerned with making machines smart enough to perform computationally large or complex tasks that normally require human intelligence; furthermore, it comprises a combination of technologies that can obtain insights and patterns from a massive amount of data which is a c...
A problem of authorship characterization is to determine the sociolinguistic characteristics of the potential author of a given anonymous text message. Unlike the problems of authorship attribution, where the potential suspects and their training samples are accessible for investigation, no candidate list of suspects is available in authorship char...
In the previous chapters, methods to address two authorship problems, i.e., authorship identification and authorship characterization, were proposed. This chapter discusses the third authorship problem, called authorship verification. The proposed approach is applicable to different types of online messages, but in the current study, the focus is o...
Society’s increasing reliance on technology, fueled by a growing desire for increased connectivity (given the increased productivity, efficiency, and availability to name a few motivations) has helped give rise to the compounded growth of electronic data. The increasing adoption of various technologies has driven the need to protect said technologi...
This chapter presents the central theme and a big picture of the methods and technologies covered in this book (see Fig. 2.2). For the readers to comprehend presented security and forensics issues, and associated solutions, the content is organized as components of a forensics analysis framework. The framework is employed to analyze online messages...
This chapter provides a brief description of the methods employed for collecting initial information about a given suspicious online communication message, including header and network information; and how to forensically analyze the dataset to attain the information that would be necessary to trace back to the source of the crime. The header conte...
This chapter presents an overview of authorship analysis from multiple standpoints. It includes historical perspective, description of stylometric features, and authorship analysis techniques and their limitations.
This chapter presents a novel approach to frequent-pattern based Writeprint creation, and addresses two authorship problems: authorship attribution in the usual way (disregarding stylistic variation), and authorship attribution by focusing on stylistic variations. Stylistic variation is the occasional change in the writing features of an individual...
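A minimal sketch of the frequent-pattern idea (the feature names and threshold are invented; the book's Writeprint construction is far richer): represent each message as a set of discretized style features, then keep the feature combinations that recur across an author's messages:

```python
from itertools import combinations

def frequent_patterns(messages, min_support):
    """Mine combinations of style features that appear in at least a
    min_support fraction of an author's messages (brute force, which
    is fine for the small feature sets of this sketch)."""
    counts = {}
    for feats in messages:
        for size in range(1, len(feats) + 1):
            for combo in combinations(sorted(feats), size):
                counts[combo] = counts.get(combo, 0) + 1
    n = len(messages)
    return {c for c, k in counts.items() if k / n >= min_support}

# Hypothetical discretized style features, one set per message
author_msgs = [
    {"short_sentences", "frequent_commas", "lowercase_i"},
    {"short_sentences", "frequent_commas", "emoticons"},
    {"short_sentences", "frequent_commas", "lowercase_i"},
]
patterns = frequent_patterns(author_msgs, min_support=1.0)
```

A Writeprint would then retain only those patterns frequent for one author and absent from the others, so each surviving pattern acts as discriminative evidence of that author's habits.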
In the previous chapters, the different aspects of the authorship analysis problem were discussed. This chapter will propose a framework for extracting criminal information from the textual content of suspicious online messages. Archives of online messages, including chat logs, e-mails, web forums, and blogs, often contain an enormous amount of for...
This chapter discusses authorship attribution through a training sample. The focus on authorship attribution discussed in this chapter differs in two ways from the traditional authorship identification problem discussed in the earlier chapters of this book. Firstly, the traditional authorship attribution studies [63, 65] only work in the presence o...
In this chapter, Associative Classification (AC) [139] is employed, based on association rule discovery techniques, for authorship identification. The developed classification model consists of patterns that represent the respective author’s most prominent combinations of writing style features.
Non-negative tensor factorization has been shown a practical solution to automatically discover phenotypes from the electronic health records (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e....
Many models have been proposed to preserve data privacy for different data publishing scenarios. Among these models, ε-differential privacy is receiving increasing attention because it does not make assumptions about adversaries’ prior knowledge and can provide a rigorous privacy guarantee. Although there are numerous proposed approaches using ε-di...