Jacky Keung

City University of Hong Kong | CityU · Department of Computer Science

Ph.D (UNSW)

About

753
Publications
203,166
Reads
31,970
Citations
Additional affiliations
January 2013 - present
City University of Hong Kong
Position
  • Associate Professor
January 2011 - December 2012
The Hong Kong Polytechnic University
Position
  • Assistant Professor
January 2008 - December 2010
National ICT Australia - NICTA
Position
  • Researcher

Publications (753)
Preprint
Full-text available
Log anomaly detection has become a common practice for software engineers to analyze software system behavior. Despite significant research efforts in log anomaly detection over the past decade, it remains unclear what practitioners expect from log anomaly detection and whether current research meets their needs. To fill this gap, we condu...
Preprint
Full-text available
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with...
Article
Full-text available
In agile requirements engineering, Generating Acceptance Criteria (GAC) to elaborate user stories plays a pivotal role in the sprint planning phase, which provides a reference for delivering functional solutions. GAC requires extensive collaboration and human involvement. However, the lack of labeled datasets tailored for User Story attached with A...
Article
DLLAD methods may underperform in severely imbalanced datasets. Although data resampling has proven effective in other software engineering tasks, it has not been explored in LAD. This study aims to fill this gap by providing an in-depth analysis of the impact of diverse data resampling methods on existing DLLAD approaches from two distinct perspec...
Article
Full-text available
Software crashes occur when a software program executes incorrectly or is forcibly interrupted, which negatively impacts user experience. Since stack traces offer exception-related information about software crashes, researchers have used features collected from the stack trace to automatically identify whether the fault residence where the...
Article
With the rapid development of the Industrial Internet of Things (IIoT), log-based anomaly detection has become vital for smart industrial construction that has prompted many researchers to contribute. To detect anomalies based on log data, semi-supervised approaches stand out from supervised and unsupervised approaches because they only require a p...
Preprint
Stack Overflow is one of the most popular programming communities where developers can seek help for their encountered problems. Nevertheless, if inexperienced developers fail to describe their problems clearly, it is hard for them to attract sufficient attention and get the anticipated answers. We propose M$_3$NSCT5, a novel approach to automatica...
Article
Being able to automatically detect the performance issues in apps can significantly improve apps’ quality as well as having a positive influence on user satisfaction. Application Performance Management (APM) libraries are used to locate the apps’ performance bottleneck, monitor their behaviors at runtime, and identify potential security ri...
Article
Context: Stack Overflow is very helpful for software developers who are seeking answers to programming problems. Previous studies have shown that a growing number of questions are of low quality and thus obtain less attention from potential answerers. Gao et al. proposed an LSTM-based model (i.e., BiLSTM-CC) to automatically generate question titles...
Article
Context: Inappropriate public disclosure of security bug reports (SBRs) is likely to attract malicious attackers to invade software systems; hence being able to detect SBRs has become increasingly important for software maintenance. Due to the class imbalance problem that the number of non-security bug reports (NSBRs) exceeds the number of SBRs, ins...
Article
Context: Defect Number Prediction (DNP) models can offer more benefits than classification-based defect prediction. Recently, many researchers proposed to employ regression algorithms for DNP, and found that the algorithms achieve low Average Absolute Error (AAE) and high Pred(0.3) values. However, since the defect datasets generally contain many no...
Article
Context: In software defect prediction, SMOTE-based techniques are widely adopted to alleviate the class imbalance problem. SMOTE-based techniques select instances close in the distance to synthesize minority class instances, ensuring few noise instances are generated. Objective: However, recent studies show that selecting instances far away effecti...
Article
Context: In practice, software datasets tend to have more non-defective instances than defective ones, which is referred to as the class imbalance problem in software defect prediction (SDP). Synthetic Minority Oversampling TEchnique (SMOTE) and its variants alleviate the class imbalance problem by generating synthetic defective instances. SMOTE-bas...
Preprint
Bellwether effect refers to the existence of exemplary projects (called the Bellwether) within a historical dataset to be used for improved prediction performance. Recent studies have shown an implicit assumption of using recently completed projects (referred to as moving window) for improved prediction accuracy. In this paper, we investigate the B...
Preprint
Context: In addressing how best to estimate how much effort is required to develop software, a recent study found that using exemplary and recently completed projects [forming Bellwether moving windows (BMW)] in software effort prediction (SEP) models leads to relatively improved accuracy. More studies need to be conducted to determine whether the...
Preprint
Full-text available
Being able to automatically detect the performance issues in apps can significantly improve apps' quality as well as having a positive influence on user satisfaction. Application Performance Management (APM) libraries are used to locate the apps' performance bottleneck, monitor their behaviors at runtime, and identify potential security risks. Alth...
Article
Context: Automatically produced crash reports help analyze the root of the fault causing the crash (crashing fault for short), which is a critical activity for software quality assurance. Objective: Correctly predicting the existence of crashing fault residence in stack traces of crash reports can speed up the program debugging process and optimiz...
Article
Context: Generally, there are more non-defective instances than defective instances in the datasets used for software defect prediction (SDP), which is referred to as the class imbalance problem. Oversampling techniques are frequently adopted to alleviate the problem by generating new synthetic defective instances. Existing techniques generate eithe...
Article
Context: Scheduling in cloud is complicated as a result of multi-tenancy. Diverse tenants have different requirements, including service functions, response time, QoS and throughput. Diverse tenants require different scheduling capabilities, resource consumption and competition. Multi-tenancy scheduling approaches have been developed for different s...
Article
Scheduling and resource allocation in clouds is used to harness the power of the underlying resource pool. Service providers can meet quality of service (QoS) requirements of tenants specified in Service Level Agreements. Improving resource allocation ensures that all tenants will receive fairer access to system resources, which improves overall ut...
Article
Full-text available
Angelman syndrome is a complex neurodevelopmental disorder characterized by delayed development, intellectual disability, speech impairment, and ataxia. It results from the loss of UBE3A protein, an E3 ubiquitin ligase, in neurons of the brain. Despite the dynamic spatiotemporal expression of UBE3A observed in rodents and the potential clinical imp...
Article
Epigenetic states inherently define a wide range of complex biological phenotypes and processes in development and disease. Accurate cellular modeling would ideally capture the epigenetic complexity of these processes as well as the accompanying molecular changes in chromatin biochemistry including in DNA and histone modifications. Here we highligh...
Article
Full-text available
Several software design patterns have been cataloged, either in canonical form or as variants, to solve a recurring design problem. However, novice designers mostly adopt patterns without considering their ground reality and relevance to design problems, which increases development and maintenance effort. The existing automated systems to selec...
Preprint
Technological leaps are often driven by key innovations that transform the underlying architectures of systems. Current DNA storage systems largely rely on polymerase chain reaction, which broadly informs how information is encoded, databases are organized, and files are accessed. Here we show that a hybrid 'toehold' DNA structure can unlock a fund...
Article
Full-text available
The employment of design patterns is considered as a benchmark of software quality in terms of reducing the number of software faults. However, the quantification of the information about the hinder design issues such as the number of roles, type of design pattern, and their association with anti-pattern classes is still required. The authors propo...
Article
Software Defect Prediction (SDP) aims to detect defective modules to enable the reasonable allocation of testing resources, which is an economically critical activity in software quality assurance. Learning effective feature representation and addressing class imbalance are two main challenges in SDP. Ideally, the more discriminative the features l...
Research Proposal
Full-text available
In the innovative era of Data Science along with its applications in an assortment of domains the software technology community is confronting challenges of evolving new theories, technologies and advanced algorithms for incorporating software engineering practices with knowledge discovery. It is apparent from the literature that such new theories,...
Article
Context: Ranking-oriented defect prediction (RODP) ranks software modules to allocate limited testing resources to each module according to the predicted number of defects. Most RODP methods overlook that ranking a module with more defects incorrectly makes it difficult to successfully find all of the defects in the module due to fewer testing reso...
Conference Paper
Scheduling on clouds is required so that service providers can meet Quality of Service (QoS) requirements of tenants. Deadline is a major criterion in judging QoS. This work presents a real-time, preemptive, constrained scheduler using queuing theory – PDSonQueue – which enables better meeting of QoS requirements. PDSonQueue also shortens a job’s...
Conference Paper
Scheduling tasks in the vicinity of stored data can significantly diminish network traffic. Scheduling optimisation can improve data locality by attempting to locate a task and its related data on the same node. Existing schedulers tend to ignore overhead and tradeoff between data transfer and task placement, and bandwidth consumption, by only emph...
Article
The extreme density of DNA presents a compelling advantage over current storage media; however, in order to reach practical capacities, new systems for organizing and accessing information are needed. Here we use chemical handles to selectively extract unique files from a complex database of DNA mimicking 5 TB of data and design and implement a nes...
Article
Full-text available
Interventional radiology employs image-guided techniques to perform minimally invasive procedures for diagnosis and treatment. Interventional radiology is often used to place central venous catheters and subcutaneous ports, with some evidence of benefit over surgical placement. Arterial embolization procedures are used to manage many types of hemor...
Article
Full-text available
Software defect data sets are typically characterized by an unbalanced class distribution where the defective modules are fewer than the non-defective modules. Prediction performances of defect prediction models are detrimentally affected by the skewed distribution of the faulty minority modules in the data set since most algorithms assume both cla...
Preprint
Full-text available
The extreme density of DNA presents a compelling advantage over current storage media; however, in order to reach practical capacities, new approaches for organizing and accessing information are needed. Here we use chemical handles to selectively extract unique files from a complex database of DNA mimicking 5 TB of data and design and implement a...
Conference Paper
Effort-Aware Defect Prediction (EADP) ranks software modules based on the possibility of these modules being defective, their predicted number of defects, or defect density by using learning to rank algorithms. Prior empirical studies compared a few learning to rank algorithms considering a small number of datasets, evaluating with inappropriate or o...
Article
Cross Version Defect Prediction (CVDP) is a practical scenario by training the classification model on the historical data of the prior version and then predicting the defect labels of modules in the current version. Unfortunately, the differences of data distribution across versions may hinder the effectiveness of the trained CVDP model. Thus, it...
Article
Although Unified Modeling Language (UML), Ontology, and Text categorization approaches have been used to automate the classification and selection of design pattern(s), certain issues, such as the time and effort required for formal specification of new patterns, system context-awareness, and lack of knowledge, still need to be addressed. We pr...
Article
Context: Automatic localization of buggy files can speed up the process of bug fixing to improve the efficiency and productivity of software quality assurance teams. Useful semantic information is available in bug reports and source code, but it is usually underutilized by existing bug localization approaches. Objective: To improve the performance...
Conference Paper
Full-text available
Several software design patterns have been familiarized, either in canonical form or as variant solutions, in order to solve a problem. Novice designers mostly adopt patterns without considering their ground reality and relevance to design problems, which may increase development and maintenance effort. In order to realize the ground reali...
Article
Context: Code readability classification (which refers to classification of a piece of source code as either readable or unreadable) has attracted increasing concern in academia and industry. To construct accurate classification models, previous studies depended mainly upon handcrafted features. However, the manual feature engineering process is usu...
Conference Paper
Background: Correctly localizing buggy files for bug reports together with their semantic and structural information is a crucial task, which would essentially improve the accuracy of bug localization techniques. Aims: To empirically evaluate and demonstrate the effects of both semantic and structural information in bug reports and source files on...
Conference Paper
The process of classifying a piece of source code into a Readable or Unreadable class is referred to as Code Readability Classification. To build accurate classification models, existing studies focus on handcrafting features from different aspects that intuitively seem to correlate with code readability, and then exploring various machine learning...
Article
Context: In addressing how best to estimate how much effort is required to develop software, a recent study found that using exemplary and recently completed projects [forming Bellwether moving windows (BMW)] in software effort prediction (SEP) models leads to relatively improved accuracy. More studies need to be conducted to determine whether the...
Conference Paper
This study presents MAHAKIL, a novel and efficient synthetic over-sampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the di...
Article
Full-text available
A search for the exotic meson X(5568) decaying into the $B_s^0\pi^\pm$ final state is performed using data corresponding to 9.6 fb$^{-1}$ from $p\bar{p}$ collisions at $\sqrt{s} = 1960$ GeV recorded by the Collider Detector at Fermilab. No evidence for this state is found and an upper limit of 6.7% at the 95% confidence level is set on the fraction of $B_s^0$ produced through the X(5...
Article
Full-text available
Context: Cross-project defect prediction (CPDP) which uses dataset from other projects to build predictors has been recently recommended as an effective approach for building prediction models that lack historical or sufficient local datasets. Class imbalance and distribution mismatch between the source and target datasets associated with real-world...
Article
Context: The challenge of locating bugs in mostly large-scale software systems has led to the development of bug localization techniques. However, the lexical mismatch between bug reports and source codes degrades the performances of existing information retrieval or machine learning-based approaches. Objective: To bridge the lexical gap and improve...
Article
Context: We observed a special type of bug reopen that has no direct impact on the user experience or the normal operation of the system being developed. We refer to these as non-negative bug reopens. Objective: Non-negative bug reopens are novel and somewhat contradictory to popular conceptions. Therefore, we thoroughly explored these phenomena in...
Article
Full-text available
The CDF and D0 experiments at the Fermilab Tevatron have measured the asymmetry between yields of forward- and backward-produced top and antitop quarks based on their rapidity difference and the asymmetry between their decay leptons. These measurements use the full data sets collected in proton-antiproton collisions at a center-of-mass energy of s=...
Article
Full-text available
A search for the exotic meson X(5568) decaying into the $B_s^0\pi^\pm$ final state is performed using data corresponding to 9.6 fb$^{-1}$ from $p\bar{p}$ collisions at $\sqrt{s} = 1960$ GeV recorded by the Collider Detector at Fermilab. No evidence for this state is found and an upper limit of 6.7% at the 95% conf...
Conference Paper
The structural complexity of design components (e.g. Classes) is proportional to design quality at the system level and is quantified via the object-oriented metrics. The frequent use of design patterns causes too much abstraction and can increase the structural complexity of design components. Though, in our previous work, we have empirically i...
Article
Full-text available
A measurement of the inclusive production cross section of isolated prompt photons in proton-antiproton collisions at center-of-mass energy $\sqrt{s} = 1.96$ TeV is presented. The results are obtained using the full Run II data sample collected with the Collider Detector at the Fermilab Tevatron, which corresponds to an integrated luminosity of 9.5 fb$^{-1}$. The...
Article
Bug localization is a software development and maintenance activity that aims to find relevant source code entities to be modified so that a specific bug can be fixed on the basis of the given bug report. Information retrieval (IR) techniques have been widely used to locate bugs in recent decades. These techniques mainly use the IR similarity betwe...
Conference Paper
Context: Recent studies have shown that performance of defect prediction models can be affected when data sampling approaches are applied to imbalanced training data for building defect prediction models. However, the magnitude (degree and power) of the effect of these sampling methods on the classification and prioritization performances of defect...
Article
Context: Software effort estimation (SEE) plays a key role in predicting the effort needed to complete a software development task. However, the conclusion instability across learners has affected the implementation of SEE models. This instability can be attributed to the lack of an effort classification benchmark that software researchers and practit...
Article
Programmers tend to leave incomplete, temporary workarounds and buggy code that require rework in software development; this pitfall is referred to as Self-admitted Technical Debt (SATD). Previous studies have shown that SATD negatively affects software projects and incurs high maintenance overheads. In this study, we introduce a prioritization...
Article
The CDF and D0 experiments at the Fermilab Tevatron have measured the asymmetry between yields of forward- and backward-produced top and antitop quarks based on their rapidity difference and the asymmetry between their decay leptons. These measurements use the full data sets collected in proton-antiproton collisions at a center-of-mass energy of √s...
Conference Paper
Full-text available
Systematic Literature Review (SLR) is becoming a vital part of present-day research in software process improvement (SPI). Nevertheless, there is no available study that provides a detailed review of the published software process improvement SLRs. Objective: The aim of this article is to classify the SLRs of SPI in order to identify the main research...