
Herna Lydia Viktor
- PhD (Computer Science)
- Professor at University of Ottawa
195 Publications
59,664 Reads
2,752 Citations
Introduction
I am a full professor of Computer Science at the School of Electrical Engineering and Computer Science at the University of Ottawa.
My areas of expertise are Applied AI and databases. Specifically, my team and I are working on machine learning algorithms, advanced techniques for data-driven discovery, and Big Data solutions for decision support.
Publications (195)
Lifelong Machine Learning (LML) denotes a scenario involving multiple sequential tasks, each accompanied by its respective dataset, in order to solve specific learning problems. In this context, the focus of LML techniques is on utilizing already acquired knowledge to adapt to new tasks efficiently. Essentially, LML is concerned with facing new tasks...
Recent breakthroughs in deep learning have revolutionized protein sequence and structure prediction. These advancements are built on decades of protein design efforts, and are overcoming traditional time and cost limitations. Diffusion models, at the forefront of these innovations, significantly enhance design efficiency by automating knowledge acq...
Machine Learning owes its widespread application to its ability to develop accurate and scalable models. In cyber-security, where labelled data is scarce, Semi-Supervised Learning (SSL) emerges as a potential solution. SSL excels at tasks that challenge traditional supervised and unsupervised algorithms by leveraging limited labelled data alongside abund...
Price prediction remains a crucial aspect of financial market research as it forms the basis for various trading strategies and portfolio management techniques. However, traditional models such as ARIMA are not effective for multi-horizon forecasting, and current deep learning approaches do not take into account the conditional heteroscedasticity o...
Protein generation has numerous applications in designing therapeutic antibodies and creating new drugs. Still, it is a demanding task due to the inherent complexities of protein structures and the limitations of current generative models. Proteins possess intricate geometry, and sampling their conformational space is challenging due to its high di...
The accuracy of price forecasts is important for financial market trading strategies and portfolio management. Compared to traditional models such as ARIMA and other state-of-the-art deep learning techniques, temporal Transformers with similarity embedding perform better for multi-horizon forecasts in financial time series, as they account for the...
Protein structural properties are often determined by experimental techniques such as X-ray crystallography and nuclear magnetic resonance. However, both approaches are time-consuming and expensive. Conversely, protein amino acid sequences may be readily obtained from inexpensive high-throughput techniques, although such sequences lack structural i...
Lifelong machine learning concerns the development of systems that continuously learn from diverse tasks, incorporating new knowledge without forgetting the knowledge they have previously acquired. Multi-label classification is a supervised learning process in which each instance is assigned multiple non-exclusive labels, with each label denoted as...
This paper presents a new approach for protein generation based on one-shot learning and hybrid quantum neural networks. Given a single protein complex, the system learns how to predict the remaining unknown properties, without resorting to autoregression, from the physicochemical properties of the receptor and a prior on the physicochemical proper...
The design of binder proteins for specific target proteins using deep learning is a challenging task that has a wide range of applications in both designing therapeutic antibodies and creating new drugs. Machine learning-based solutions, as opposed to laboratory design, streamline the design process and enable the design of new proteins that may be...
Background: Conducting clinical trials for traumatic spinal cord injury (tSCI) presents challenges due to patient heterogeneity. Identifying clinically similar subgroups using patient demographics and baseline injury characteristics could lead to better patient-centered care and integrated care delivery. Purpose: We sought to (1) apply an unsupervis...
Log sequences generated by heterogeneous systems are critical for understanding computer system behaviour and ensuring operational and security integrity. However, the diverse formats, structures, and content of logs pose challenges for traditional log anomaly detection approaches that rely on log parsing, which can be imperfect and incomplete in i...
Background: Traumatic spinal cord injuries (TSCI) greatly affect the lives of patients and their families. Prognostication may improve treatment strategies, health care resource allocation, and counseling. Multivariable clinical prediction models (CPMs) for prognosis are tools that can estimate an absolute risk or probability that an outcome will oc...
Patient-reported outcome measures (PROMs) are an important metric to assess total knee arthroplasty (TKA) patients. The purpose of this study was to use a machine learning (ML) algorithm to identify patient features that impact PROMs after TKA.
A common approach to quantifying model interpretability is to calculate faithfulness metrics based on iteratively masking input tokens and measuring how much the predicted label changes as a result. However, we show that such metrics are generally not suitable for comparing the interpretability of different neural text classifiers as the response t...
Online supervised learning from fast-evolving data streams, particularly in domains such as health, the environment, and manufacturing, is a crucial research area. However, these domains often experience class imbalance, which can skew class distributions. It is essential for online learning algorithms to analyze large datasets in real-time while a...
Studies of protein-protein interactions facilitate the development of new drugs and can aid understanding of the mechanisms behind disease pathogenesis. Finding the sites of interaction on the molecular surface is key to understanding protein-protein interactions and the role of molecular pathways. However, this is still an open area of research. T...
In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous including cyber threat mitigation and security infrastructure enhancement through pattern recogniti...
Research into Intrusion and Anomaly Detectors at the Host level typically pays much attention to extracting attributes from system call traces. These include window-based, Hidden Markov Models, and sequence-model-based attributes. Recently, several works have been focusing on sequence-model-based feature extractors, specifically Word2Vec and GloVe,...
Artificial Intelligence and Machine Learning have witnessed rapid, significant improvements in Natural Language Processing (NLP) tasks. Utilizing Deep Learning, researchers have taken advantage of repository comments in Software Engineering to produce accurate methods for detecting Self-Admitted Technical Debt (SATD) from 20 open-source Java projec...
Protein-protein interactions play an important role in the development of new therapeutic treatments and prophylactic vaccines. For instance, the efficacy of a vaccine depends strongly on the extent to which an antibody may form a stable bond with an antigen. In-laboratory experiments are both time-consuming and expensive, which limits their scope to only...
Proteins mainly perform their functions by interacting with other proteins. Protein–protein interactions underpin various biological activities such as metabolic cycles, signal transduction, and immune response. However, due to the sheer number of proteins, experimental methods for finding interacting and non-interacting protein pairs are time-cons...
We apply a large multilingual language model (BLOOM-176B) in open-ended generation of Chinese song lyrics, and evaluate the resulting lyrics for coherence and creativity using human reviewers. We find that current computational metrics for evaluating large language model outputs (MAUVE) have limitations in evaluation of creative writing. We note th...
Machine-generated text is increasingly difficult to distinguish from text authored by humans. Powerful open-source models are freely available, and user-friendly tools that democratize access to generative models are proliferating. ChatGPT, which was released shortly after the first edition of this survey, epitomizes these trends. The great potenti...
Flattening shapes without distortion is a problem that has been intriguing scientists for centuries. It is a fundamental problem of high importance in computer vision as many approaches may greatly benefit from its implementation. This paper introduces a new approach that allows flattening without distortion, by transforming the shape from Riemanni...
Recently, there has been growing interest in fairness considerations in Artificial Intelligence (AI) and AI-based systems, as the decisions made by AI applications may negatively impact individuals and communities with ethical or legal consequences. Indeed, it is crucial to ensure that decisions based on AI-based systems do not reflect discriminato...
Advances in natural language generation (NLG) have resulted in machine generated text that is increasingly difficult to distinguish from human authored text. Powerful open-source models are freely available, and user-friendly tools democratizing access to generative models are proliferating. The great potential of state-of-the-art NLG systems is te...
Price prediction is essential in financial market research, as it is often used as a primary component for trading strategy or portfolio management specialisations. As these strategies rely on more than one future prediction point, the accuracy of a multi-horizon forecast is very important. Classical models, such as autoregressive integrated moving...
Most proteins perform their biological function by interacting with themselves or other molecules. Thus, one may obtain biological insights into protein functions, disease prevalence, and therapy development by identifying protein–protein interactions (PPI). However, finding the interacting and non-interacting protein pairs through experimental app...
The widespread usage of machine learning in different mainstream contexts has made deep learning the technique of choice in various domains, including finance. This systematic survey explores various scenarios employing deep learning in financial markets, especially the stock market. A key requirement for our methodology is its focus on research pa...
Log parsing is the process of extracting logical units from system, device or application generated logs. It holds utmost importance in the field of log analytics and forensics. Many security analytic tools rely on logs to detect, prevent and mitigate attacks. It is critical for these tools to extract information from large volumes of logs from mul...
Due to the rapid technological advances that have been made over the years, more people are changing their way of living from traditional ways of doing business to those featuring greater use of electronic resources. This transition has attracted (and continues to attract) the attention of cybercriminals, referred to in this article as “attackers”,...
The detection of computer-generated text is an area of rapidly increasing significance as nascent generative models allow for efficient creation of compelling human-like text, which may be abused for the purposes of spam, disinformation, phishing, or online influence campaigns. Past work has studied detection of current state-of-the-art models, but...
Online semi-supervised learning (SSL) from data streams is an emerging area of research with many applications due to the fact that it is often expensive, time-consuming, and sometimes even unfeasible to collect labelled data from streaming domains. State-of-the-art online SSL algorithms use clustering techniques to maintain micro-clusters, or, alt...
The classification of deformable protein shapes, based solely on their macromolecular surfaces, is a challenging problem in protein-protein interaction prediction and protein design. Shape classification is made difficult by the fact that proteins are dynamic, flexible entities with high geometrical complexity. In this paper, we introduce a novel de...
This work introduces novel approaches, based on geometrical deep learning, for predicting protein–protein interactions. A dataset containing both interacting and non-interacting proteins is selected from the Negatome Database. Interactions are predicted from a graph representing the proteins’ three-dimensional macromolecular surfaces. The nodes are...
Online influence operations (OIOs) present a serious threat to the integrity of online social spaces and to real-world democratic elections. While many OIO detection approaches have focused on classification algorithms for individual social media posts (often with artificially balanced datasets), we present a novel system centering around a human a...
In recommendation systems, the grey-sheep problem refers to users with unique preferences and tastes that make it difficult to develop accurate profiles. That is, the similarity search approach typically followed during the recommendation process fails to yield good results. Most research does not focus on such users and thus fails to cater to more...
In e-business, recommender systems have been instrumental in guiding users through their online experiences. However, these systems are often limited by the lack of labelled data and by data sparsity. Increasingly, data-mining techniques are utilized to address this issue. In most research, recommendations to be made are achieved via supervised learning...
Mining data streams has become an important topic due to the increased availability of vast amounts of online data. In such incremental learning scenarios, observations arrive in a sequence over time and are subject to changes in data distributions, also known as concept drifts. Interleaved test-then-train evaluations are often used during supervis...
Recommendation systems, which are employed to mitigate the information overload e-commerce users face, have succeeded in aiding customers during their online shopping experience. However, to be able to make accurate recommendations, these systems require information about the items for sale and about users’ individual preferences. Making recommenda...
The detection of clandestine efforts to influence users in online communities is a challenging problem with significant active development. We demonstrate that features derived from the text of user comments are useful for identifying suspect activity, but lead to increased erroneous identifications when keywords over-represented in past influence...
The MapReduce programming paradigm is a prominent model for expressing parallel computations, especially in the context of data processing of vast data sets. However, modern data processing runtimes, implementing the MapReduce programming paradigm, do not generally support the use of arbitrary programming languages. Access to programming-language i...
In machine learning, the one-class classification problem occurs when training instances are only available from one class. It has been observed that making use of this class's structure, or its different contexts, may improve one-class classifier performance. Although this observation has been demonstrated for static data, a rigorous application o...
Objectives: This study aims to assess the relationships of psychosocial risk factors and resettlement stress to cardiovascular health among adult immigrants (Figure 1) who landed in Canada after 1985, and to develop Machine Learning (ML) prediction models based on pre- and post-immigration data to predict the risk of CVD for new arrivals of adul...
Clustering naturally addresses many of the challenges of data streams and many data stream clustering algorithms (DSCAs) have been proposed. The literature does not, however, provide quantitative descriptions of how these algorithms behave in different circumstances. In this paper we study how the clusterings produced by different DSCAs change, rel...
Recommendation systems, which are employed to mitigate the information overload faced by e-commerce users, have succeeded in aiding customers during their online shopping experience. However, to be able to make accurate recommendations, these systems require information about the items for sale and information about users’ individual preferences. M...
The last decade has seen a surge of interest in adaptive learning algorithms for data stream classification, with applications ranging from predicting ozone level peaks and learning stock market indicators to detecting computer security violations. In addition, a number of methods have been developed to detect concept drifts in these streams. Conside...
Ab initio molecular dynamics is an irreplaceable technique for the realistic simulation of complex molecular systems and processes from first principles. This paper proposes a comprehensive and self-contained review of ab initio molecular dynamics from a computational perspective and from first principles. Quantum mechanics is presented from a mole...
The success of data stream mining techniques has allowed decision makers to analyze their data in multiple domains, ranging from monitoring network intrusion to financial markets analysis and online sales transactions exploration. Specifically, online ensembles that construct accurate models against drifting data streams have been developed. Recent...
Increasingly, Internet of Things (IoT) domains, such as sensor networks, smart cities, and social networks, generate vast amounts of data. Such data are not only unbounded and rapidly evolving; their content also changes dynamically over time, often in unforeseen ways. These variations are due to so-called concept drifts, caused by changes...
The identification of changes in data distributions associated with data streams is critical in understanding the mechanics of data generating processes and ensuring that data models remain representative through time. To this end, concept drift detection methods often utilize statistical techniques that take numerical data as input. However, many...
Data mining has been successfully applied in many businesses, helping managers to make informed decisions that are based on facts, rather than having to rely on guesswork and incorrect extrapolations. Data mining algorithms equip institutions to predict the movements of financial indicators, enable companies to move towards more energy-efficien...
Adaptive online learning algorithms have been successfully applied to fast-evolving data streams. Such streams are susceptible to concept drift, which implies that the most suitable type of classifier often changes over time. In this setting, a system that is able to seamlessly select the type of learner that presents the current “best” model holds...
Decision makers increasingly require near-instant models to make sense of fast-evolving data streams. Learning from such evolving environments is, however, a challenging task. This challenge is partially due to the fact that the distribution of data often changes over time, thus potentially leading to degradation in the overall performance. In part...
Recently, there has been a growing trend to utilize data mining algorithms to explore datasets modeled using graphs. In most cases, these graphs evolve over time, thus exhibiting more complex patterns and relationships among nodes. In particular, social networks are believed to manifest the preferential attachment property which assumes that new g...
Online ensemble methods have been very successful in creating accurate models against data streams that are susceptible to concept drift. The success of data stream mining has allowed diverse users to analyse their data in multiple domains, ranging from monitoring stock markets to analysing network traffic and exploring ATM transactions. Increasingly...
Selecting the optimal subset of views for materialization provides an effective way to reduce the query evaluation time for real-time Online Analytic Processing (OLAP) queries posed against a data warehouse. However, materializing a large number of views may be counterproductive and may exceed storage thresholds, especially when considering very la...
Twitter feeds provide data scientists with a large repository for entity-based sentiment analysis. Specifically, the tweets of individual users may be used in order to track the ebb and flow of their sentiments and opinions. However, this domain poses a challenge for traditional classifiers, since the vast majority of tweets are unlabeled. Further,...
Macromolecular structures, such as neuraminidases, hemagglutinins, and monoclonal antibodies, are not rigid entities. Rather, they are characterised by their flexibility, which is the result of the interaction and collective motion of their constituent atoms. This conformational diversity has a significant impact on their physicochemical and biolog...
Class imbalance is a crucial problem in machine learning and occurs in many domains. Specifically, the two-class problem has received interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class imbalance in datasets that conta...
Acquisition systems based on laser triangulation or structured light are becoming commonplace in anthropometry. Such systems allow one to capture very detailed data to be used when addressing the sizing problem. This chapter introduces state-of-the-art approaches to describe, to segment and to cluster the data acquired by such systems. We describe...
Imbalanced data, where the number of instances of one class is much higher than the others, are frequent in many domains such as fraud detection, telecommunications management, oil spill detection, and text classification. Traditional classifiers do not perform well when considering data that are susceptible to both within-class and between-class i...
In data warehousing, selecting a subset of views for materialization has been widely employed as a way to reduce the query evaluation time for real-time OLAP queries. However, materialization of a large number of views may be counterproductive and may exceed storage thresholds, especially when considering very large data warehouses. Thus, an import...
Finding correspondences between deformable objects has wide application in many domains. In information retrieval, researchers may be interested in finding similar objects, while computer animation experts may be considering ways to morph shapes. The correspondence problem is especially challenging when the objects under consideration are suspect t...
Non-rigid shapes are generally known as objects whose three-dimensional geometry may be deformed by internal and/or external forces. Deformable shapes are all around us, ranging from macromolecules, to natural objects such as the trees in the forest or the fruits in our gardens, and even human bodies. The development of measurements to accurately de...
Meta-model merging is the process of incorporating data models into an integrated, consistent model, against which accurate queries may be processed. The efficiency of such a process is very much reliant on effective semantic representation of chosen data models, as well as the mapping relationships between the schema and data instance elements of...
The protein docking problem refers to the task of predicting the appropriate matching of one protein molecule (the receptor) to another (the ligand), when attempting to bind them to form a stable complex. Research shows that matching the three-dimensional geometric structures of proteins plays a key role in determining a so-called docking pair. How...
Recommender Systems have been applied in a large number of domains. However, current approaches rarely consider multiple criteria or the level of mobility and location of a user. In this paper, we introduce a novel algorithm to construct personalized multi-criteria Recommender Systems. Our algorithm incorporates the user's current context, and tech...
Recently, a number of researchers have turned their attention to the creation of isometrically invariant shape descriptors based on the heat equation. The reason for this surge in interest is that the Laplace-Beltrami operator, associated with the heat equation, is highly dependent on the topology of the underlying manifold, which may lead to the c...
Research has shown that the functionalities of proteins are largely influenced by their three-dimensional (3D) shapes. This observation is especially relevant in drug design, where the knowledge of the 3D structure of a protein enables pharmacologists to select the best binding proteins when aiming to moderate functions. However, a relatively small...
Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database often spans numerous departments and/or subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory manageme...
Meta-model merging is the process of incorporating data models into an integrated, consistent model against which accurate queries may be processed. Within the data warehousing domain, the integration of data marts is often time-consuming. In this paper, we introduce an approach for the integration of relational star schemas, which are instances of...