Knowledge and Information Systems

Published by Springer Nature
Online ISSN: 0219-3116
Print ISSN: 0219-1377
Recent publications
  • Huiping Guo
  • Hongru Li
Decomposition hybrid structure learning algorithms (DHSLAs), which combine the idea of divide and conquer with hybrid algorithms to reduce computational complexity, are used to learn Bayesian network (BN) structure. Nevertheless, it is hard to learn highly accurate BN structures using DHSLAs from data alone in some cases. First, accurate divisions of the whole domain are difficult to obtain because of the effect of network density, and substructures tend to be poorly learned because of the large search space. In addition, using data alone, it is difficult to distinguish Markov equivalence classes. In such cases, utilizing expert knowledge is an effective remedy. However, no existing algorithm integrates expert knowledge into DHSLAs. Therefore, in this paper, we propose the first structure learning algorithm to use expert knowledge in DHSLAs, called the Decomposition Hybrid Structure Learning Algorithm with Expert Knowledge (DHSLA-EK). In DHSLA-EK, we incorporate domain knowledge and structural knowledge with confidence into the DHSLA by constructing prior subdomains in the decomposition stage and by forming a novel scoring function in the subdomain structure learning stage. Extensive experiments on four benchmark networks indicate that the proposed algorithm effectively improves the learning performance of the DHSLA.
Data points situated near a cluster boundary are called boundary points, and they can carry useful information about the process generating the data. Existing boundary point detection methods cannot differentiate boundary points from outliers, as they are affected by the presence of outliers as well as by the size and density of clusters in the dataset. They also require tuning of one or more parameters and prior knowledge of the number of outliers in the dataset for that tuning. In this research, a boundary point detection method called BPF is proposed which can effectively differentiate boundary points from outliers and core points. BPF combines the well-known outlier detection method Local Outlier Factor (LOF) with a Gravity value to calculate the BPF score. Our proposed algorithm StaticBPF can detect the top-m boundary points in a given dataset. Importantly, StaticBPF requires tuning of only one parameter, the number of nearest neighbors (k), and can employ the same k used by LOF for outlier detection. This paper also extends BPF to streaming data and proposes StreamBPF. StreamBPF employs a grid structure to speed up k-nearest neighbor computation and an incremental method of calculating BPF scores of a subset of data points in a sliding window over data streams. In evaluation, the accuracy of StaticBPF and the runtime efficiency of StreamBPF were evaluated on synthetic and real data, where they generally outperformed their competitors.
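The "Gravity" idea can be sketched as the resultant of displacement vectors from a point toward its nearest neighbors: interior points are surrounded, so the vectors cancel, while a boundary point's neighbors lie mostly to one side. This is only an illustration of the general idea, not BPF's exact definition, and `gravity` is a hypothetical helper name:

```python
import math

def gravity(point, neighbors):
    """Norm of the mean displacement vector from `point` to its
    k nearest `neighbors`. Near zero for interior points (vectors
    cancel); large for boundary points (neighbors on one side)."""
    dims = len(point)
    mean = [sum(n[d] - point[d] for n in neighbors) / len(neighbors)
            for d in range(dims)]
    return math.sqrt(sum(c * c for c in mean))
```

A BPF-style score would then combine such a value with the point's LOF score, so that outliers (high LOF) and core points (low gravity) both score low.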
Local differential privacy (LDP) is an emerging privacy-preserving data collection model that requires no trusted third party. Most privacy-preserving decentralized graph publishing studies adopt LDP technique to ensure individual privacy. However, existing LDP-based synthetic graph generation approaches focus on static graph publishing and can only republish synthetic graphs in a brute-force manner when dealing with dynamic graph problems, resulting in low synthetic graph accuracy. The main difficulties come from the two steps of dynamic graph publishing: excessive noise injection in initial graph generation and over-segmentation of the privacy budget in graph update. We address these two issues by presenting PPDU, the first dynamic graph publication approach under LDP. PPDU uses a privacy-preference-specifying mechanism to untie the noise injection and the graph size, significantly reducing noise injection. We then divide the privacy-preserving graph update problem into three subproblems: node insertion, edge insertion, and edge deletion, and propose update threshold-based dynamic graph releasing methods to avoid excessive segmentation of the privacy budget, thereby significantly improving the accuracy of synthetic graphs. Theoretical analysis and experimental results prove that our solution can continually yield high-quality dynamic graphs while satisfying edge LDP.
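As a concrete illustration of the kind of perturbation edge LDP requires (not PPDU's actual mechanism), the classic randomized response scheme reports each bit of a user's adjacency vector truthfully with probability e^ε/(1+e^ε) and flipped otherwise:

```python
import math
import random

def randomized_response_bits(bits, epsilon, rng=None):
    """Perturb one user's adjacency bit vector under epsilon-edge-LDP.

    Each bit is kept with probability e^eps / (1 + e^eps) and flipped
    otherwise, so the output likelihood ratio between two neighboring
    vectors (differing in one edge) is bounded by e^eps.
    """
    rng = rng or random.Random()
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [b if rng.random() < p_keep else 1 - b for b in bits]
```

With small ε nearly half the bits are flipped, which is exactly the excessive-noise problem the abstract describes: an aggregator must de-bias estimates, and repeated republication splits ε further.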
In recent years, one of the most popular techniques in the computer vision community has been deep learning. As a data-driven technique, a deep model requires enormous amounts of accurately labelled training data, which is often inaccessible in many real-world applications. A data-space solution is Data Augmentation (DA), which can artificially generate new images from the original samples. Image augmentation strategies can vary by dataset, as different data types might require different augmentations to facilitate model training. However, the design of DA policies has largely been left to human experts with domain knowledge, which is considered highly subjective and error-prone. To mitigate this problem, a novel direction is to automatically learn image augmentation policies from the given dataset using Automated Data Augmentation (AutoDA) techniques. The goal of AutoDA models is to find the optimal DA policies that maximize model performance gains. This survey discusses the underlying reasons for the emergence of AutoDA technology from the perspective of image classification. We identify three key components of a standard AutoDA model: a search space, a search algorithm and an evaluation function. Based on their architecture, we provide a systematic taxonomy of existing image AutoDA approaches. This paper presents the major works in the AutoDA field, discussing their pros and cons, and proposes several potential directions for future improvements.
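The three components can be sketched with the simplest search algorithm, random search over a policy space; `evaluate` is a user-supplied stand-in for validation accuracy, and all names here are illustrative rather than any particular AutoDA system's API:

```python
import random

def random_search_autoda(ops, magnitudes, evaluate, n_trials, seed=0):
    """Toy AutoDA loop: search space = (operation, magnitude) pairs,
    search algorithm = random search, evaluation function = `evaluate`,
    which maps a policy to a validation score."""
    rng = random.Random(seed)
    space = [(op, m) for op in ops for m in magnitudes]
    best_policy, best_score = None, float("-inf")
    for _ in range(n_trials):
        policy = rng.choice(space)
        score = evaluate(policy)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score
```

Real AutoDA methods replace random search with reinforcement learning, evolution, or gradient-based search, and replace `evaluate` with (often proxy) model training.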
Transactions are the bread and butter of the database management system (DBMS) industry. When you check your bank balance, pay a bill, or move money from a savings to a chequing account, transactions are involved. That transactions are self-similar (whether you pay a utility company or a credit card, it is still a 'pay bill' transaction) has been noted before. Somewhat surprisingly, that property remains largely unexploited, barring some notable exceptions. The research reported in this paper begins to build 'intelligence' into database systems by offering built-in transaction classification and clustering. The utility of this approach is demonstrated by showing how it simplifies DBMS monitoring and troubleshooting. The well-known DBSCAN algorithm clusters online transaction processing (OLTP) transactions; this paper's contribution lies in demonstrating a robust server-side feature extraction approach, rather than the previously suggested and error-prone log-mining approach. It is shown how DBSCAN combined with an angular cosine distance function finds better clusters than previously tried combinations and simplifies DBSCAN parameter tuning, a known nontrivial task. DBMS troubleshooting efficacy is demonstrated by identifying the root causes of several real-life performance problems: problematic transaction rollbacks, performance drifts, system-wide issues, CPU and memory bottlenecks, and so on. It is also shown that the cluster count remains unchanged irrespective of system load, a desirable but often overlooked property. The transaction clustering solution has been implemented inside the popular MySQL DBMS, although most modern relational database systems can benefit from the ideas described herein.
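The angular cosine distance has a standard form: the arccosine of cosine similarity, scaled to [0, 1]. Unlike raw (1 − cosine similarity), it is a proper metric, which makes DBSCAN's eps radius easier to reason about. A minimal sketch (the paper's transaction feature vectors are not reproduced here):

```python
import math

def angular_cosine_distance(u, v):
    """Angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.acos(cos) / math.pi
```

Identical directions score 0, orthogonal vectors 0.5, and opposite directions 1, so an eps threshold has a direct angular interpretation.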
For each of two PFSA G1 and G2 over a binary alphabet, we generate 100 sequences of length 100 in (a), and 100 sequences of length 1000 in (b). Then we generate random pairs of PFSA {T1, T2} with a binary alphabet and with the size of the state set ranging from 2 to 5. SLD between the sequences remains sufficiently positive for the examples, corroborating that a finite number of random PFSA discriminates between sample paths of distinct processes with high probability (Theorem 2). Note that the separations in (b) are clearer, corroborating that longer sequences are easier to distinguish
Comparison with state-of-the-art classification tools, using a simple classifier based on SLD. The binary classification performance measured on data generated from the model pairs (a) is more or less comparable (b), although SLD outperforms both the HIVECOTE2 and Rocket classifiers. Importantly, the time and space complexity, measured as runtimes in training and testing and total runtime memory usage, is about an order of magnitude better for SLD on the datasets tested. Performance was computed using 1000 samples from each class (i, ii) of each model pair (0 through 3). The model pairs are chosen to be very similar, with identical structure, differing only in some transition probabilities to make classification non-trivial
Noise response of SLD-based classification. Noise fraction refers to the fraction of bits flipped in binary data generated by the model pairs 0–3 (see Fig. 2a). The change in AUC is shown with 95% confidence bounds
a–d Four basis PFSA used for the algorithm comparison in Sect. 6 and the applications in Sect. 7. An edge connecting state q to q′ is labeled σ(π̃(q, σ)) if δ(q, σ) = q′ (see Definition 2). e Performance and run time comparisons of SLD, data smashing [5], and DTW on a synthetic symbolic dataset. We denote data smashing by S and the DTW variants by their window size. The average run time of SLD is 0.042 s. f Run time vs. sequence length comparison between DTW30 and SLD. g 2D embeddings produced by Algorithm 1 and DTW5 on the FordA dataset, with decision boundaries of an SVM and a neural network. h Distance matrices produced by DTW with window sizes 0, 2, 5, 10 and SLD with inferred basis PFSA on a dataset designed to be especially challenging for DTW
a–h Multi-channel EEG recordings and distance matrices for two participants from the Motor Movement Imagery Datasets (see Sect. 7.1). For each participant, the dataset contains two EEG recordings of two tasks, alternating rest and movement (TM) and alternating rest and imaginary movement (TI). Rest sections are colored gray while (imaginary) movement sections are black. From the arrangement of the sequences (see Table 2), we can see that SLD clearly distinguishes rest from (imaginary) movement for participant S004, while it distinguishes the two recordings for participant S001. i, j Distance matrices of accelerometer measurements in the x, y, and z directions from 5 users in the User Identification from Walking Dataset (see Sect. 7.2). For each direction, we collect 10 sequences of 500 time steps from 5 users and show the distance matrices resulting from two quantizations (see Sect. 5). From the heatmap, we see that measurements from different directions can be combined for user recognition. For example, although SLD does not separate users 1 and 2 well in the x-direction, it picks up enough separation in the z-direction
Comparing and contrasting subtle historical patterns is central to time series analysis. Here we introduce a new approach to quantify deviations in the underlying hidden stochastic generators of sequential discrete-valued data streams. The proposed measure is universal in the sense that we can compare data streams without any feature engineering step and without the need for any hyper-parameters. Our core idea is the generalization of the Kullback–Leibler divergence, often used to compare probability distributions, to a notion of divergence between finite-valued ergodic stationary stochastic processes. Using this notion of process divergence, we craft a measure of deviation on finite sample paths, which we call the sequence likelihood divergence (SLD), that approximates a metric on the space of the underlying generators within a well-defined class of discrete-valued stochastic processes. We compare the performance of SLD against state-of-the-art approaches, e.g., dynamic time warping (Petitjean et al. in Pattern Recognit 44(3):678–693, 2011), on synthetic data, on real-world applications with electroencephalogram data and in gait recognition, and on diverse time-series classification problems from the University of California, Riverside time series classification archive (Thanawin Rakthanmanon and Westover). We demonstrate that the new tool is at par or better in classification accuracy, while being significantly faster in comparable implementations. Released into the public domain, SLD will, we hope, enhance the standard toolbox used in classification, clustering and inference problems in time series analysis.
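To make the process-divergence idea concrete: a much-simplified stand-in compares the empirical k-gram distributions of two symbol streams with smoothed KL divergence. This is not the paper's SLD (which scores sequence likelihoods under random basis PFSA), only an illustration of measuring divergence between the processes behind two sample paths:

```python
import math
from collections import Counter

def kgram_kl(seq_p, seq_q, k=2, alpha=1e-3):
    """KL divergence between the empirical k-gram distributions of two
    symbol sequences, with additive smoothing so every k-gram observed
    in either sequence has nonzero probability under both."""
    grams_p = Counter(tuple(seq_p[i:i + k]) for i in range(len(seq_p) - k + 1))
    grams_q = Counter(tuple(seq_q[i:i + k]) for i in range(len(seq_q) - k + 1))
    support = set(grams_p) | set(grams_q)
    n_p, n_q = sum(grams_p.values()), sum(grams_q.values())
    kl = 0.0
    for g in support:
        p = (grams_p[g] + alpha) / (n_p + alpha * len(support))
        q = (grams_q[g] + alpha) / (n_q + alpha * len(support))
        kl += p * math.log(p / q)
    return kl
```

Like SLD, this compares streams without feature engineering; unlike SLD, fixed-order k-grams cannot capture long-range structure, which is what the PFSA-based construction addresses.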
Document stores have gained popularity among NoSQL systems mainly due to their semi-structured data storage model and enhanced query capabilities. Database design in document stores expands beyond the first normal form by encouraging de-normalization through nesting. This complicates the design process, as the number of alternatives grows exponentially with multiple choices in nesting (including different levels) and referencing (including the direction of the reference). Due to this complexity, document store data design is mostly carried out by trial and error or with ad-hoc rule-based approaches. However, the choices affect multiple, often conflicting, aspects such as query performance, storage space, and complexity of the documents. To overcome these issues, in this paper, we apply multicriteria optimization. Our approach is driven by a query workload and a set of optimization objectives. First, we formalize a canonical model to represent alternative designs and introduce an algebra of transformations that can systematically modify a design. Then, using these transformations, we implement a local search algorithm driven by a loss function that can propose near-optimal designs with high probability. Finally, we compare our prototype against an existing document store data design solution purely driven by query cost, where our proposed designs have better performance and are more compact with less redundancy.
Complex systems in the real world are commonly associated with multiple types of objects and relations, and heterogeneous graphs are ubiquitous data structures that can inherently represent such multi-modal interactions between objects. Generating high-quality heterogeneous graphs allows us to understand the implicit distribution of heterogeneous graphs and provides benchmarks for downstream heterogeneous representation learning tasks. Existing works are limited to either merely generating the graph topology while neglecting local semantic information, or generating the graph without preserving the higher-order structural information and the global heterogeneous distribution in the generated graphs. To this end, we formulate a general, end-to-end framework, HGEN, for generating novel heterogeneous graphs with a newly proposed heterogeneous walk generator. On top of HGEN, we further develop a network motif generator to better characterize the higher-order structural distribution. A novel heterogeneous graph assembler is further developed to adaptively assemble novel heterogeneous graphs from the generated heterogeneous walks and motifs in a stratified manner. The extended model is proven, with theoretical guarantees, to preserve the local semantics and the global heterogeneous distribution of the observed graphs. Lastly, comprehensive experiments on both synthetic and real-world datasets demonstrate the power and efficiency of the proposed method.
Sentiment analysis is a natural language processing method used to assess the positivity, negativity, and neutrality of data. Several techniques have been suggested for solving the sentiment analysis task. This study presents a novel multi-criteria decision-making (MCDM) and game theory-based mathematical framework for determining the sentiment orientation of reviews. We propose two frameworks: a sentiment orientation tagger model (SOTM) and an aspect-based ranking model (ABRM). The SOTM combines the simple additive weighting (SAW) technique with the principle of Nash equilibrium from game theory to deduce the tag for the review dataset. We identify a review's sentiment as positive, negative, or neutral. In ABRM, we rank the aspects of a review using the preference selection index (PSI). We propose an unsupervised sentiment classification model that combines context, rating, and emotion scores with a mathematical optimization model. The effectiveness of our proposed model is comparable to state-of-the-art models, as demonstrated by experimental results on three benchmark review datasets. We also establish the significance of the results through statistical analysis. The proposed model ensures rationality and consistency. The novel combination of the MCDM and game theory model with the reviews' context, rating, and emotion scores creates a new paradigm in sentiment analysis. The proposed model is also generalizable and can analyze sentiment in many fields.
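The SAW step mentioned above is standard MCDM machinery: normalize each criterion column, then rank alternatives by their weighted sum. A minimal sketch, independent of the paper's specific criteria and assuming strictly positive criterion values:

```python
def saw_scores(matrix, weights, benefit):
    """Simple additive weighting. `matrix[i][j]` is alternative i's raw
    value on criterion j; `benefit[j]` is True when larger is better
    (cost criteria are inverted via min/x normalization)."""
    n_alt, n_crit = len(matrix), len(matrix[0])
    scores = [0.0] * n_alt
    for j in range(n_crit):
        col = [row[j] for row in matrix]
        if benefit[j]:
            norm = [x / max(col) for x in col]
        else:
            norm = [min(col) / x for x in col]
        for i in range(n_alt):
            scores[i] += weights[j] * norm[i]
    return scores
```

The alternative with the highest score wins; in the SOTM these scores feed into the game-theoretic tagging step.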
Football's popularity among fans who analyze the game has grown immensely with the advent of the internet, and building a dream team has become fashionable among football lovers. This paper focuses on helping achieve such a football dream team prediction. The aim of this research is to assess the dynamics of a complex topological structure when prompted with random entities whose attributes are known to us. Using graph theory and vectorial distances, the dream team is evaluated on the basis of individual abilities and inter-player synergy. Instead of focusing on discrete events in a match, this framework proposes an approach in which a dream team is quantified on the basis of the players' positional attributes. Each player is rated according to the position he plays, which eventually helps in finding the overall team rating. The second part of this research uses graph theory to evaluate structural and topological properties of the interpersonal interactions of teammates. Teammates are treated as nodes of a graph, where each edge exemplifies the strength of their interpersonal interaction. The strength of the bond depends on on-field interactions via ball passing, ball receiving and communication, which in turn depend on experience of playing together, nationality and club. The methodology adopted in this paper can be a formidable basis for similarly situated larger setups involving much greater intricacies. Using this framework, we can observe the behavior of a hypothetical topological structure whose node attributes are known to us, thus projecting its performance as a team and as individual entities.
Spatial data are ubiquitous, massively collected, and widely used to support critical decision-making in many societal domains, including public health (e.g., COVID-19 pandemic control), agricultural crop monitoring, transportation, etc. While recent advances in machine learning and deep learning offer promising new ways to mine such rich datasets (e.g., satellite imagery, COVID statistics), spatial heterogeneity, an intrinsic characteristic embedded in spatial data, poses a major challenge, as data distributions or generative processes often vary across space at different scales, with their spatial extents unknown. Recent studies (e.g., SVANN, spatial ensemble) targeting this difficult problem either require a known space-partitioning as input, or can only support a very limited number of partitions or classes (e.g., two) due to the decrease in training data size and the complexity of analysis. To address these limitations, we propose a model-agnostic framework to automatically transform a deep learning model into a spatial-heterogeneity-aware architecture, where the learning of arbitrary space partitionings is guided by a learning-engaged generalization of the multivariate scan statistic and parameters are shared based on spatial relationships. Moreover, we propose a spatial moderator to generalize learned space partitionings to new test regions. Finally, we extend the framework by integrating meta-learning-based training strategies into both spatial transformation and moderation to enhance knowledge sharing and adaptation among different processes. Experimental results on real-world datasets show that the framework can effectively capture flexibly shaped heterogeneous footprints and substantially improve prediction performance.
Session-based recommendation (SBR) predicts the next item given an anonymous interaction sequence. Recently, many advanced SBR models have shown strong recommendation performance, but few studies note that they suffer seriously from popularity bias: the model tends to recommend popular items and fails to recommend long-tail items. The few existing debiasing works do relieve popularity bias. However, they ignore individuals' conformity toward popular items and thus decrease recommendation performance on popular items. Besides, conformity is always entangled with an individual's real interest, which hinders extracting one's comprehensive preference. To tackle this problem, we propose an SBR framework with Disentangling InteRest and Conformity for eliminating popularity bias in SBR. In this framework, two groups of item encoders and session modeling modules are devised to extract interest and conformity, respectively, and a fusion module is designed to combine these two types of preference. A discrepancy loss is also utilized to disentangle the representations of interest and conformity. Moreover, our framework can integrate seamlessly with several SBR models. We conduct extensive experiments on three real-world datasets with four advanced SBR models. The results show that our framework consistently outperforms other state-of-the-art debiasing methods.
The binned fair calibration for the deprived community, graphing the predicted probabilities against the true probability in deciles. The color of each bar represents a different method, with the red bar representing the true probability observed by KM
The time complexity comparison of all the methods
Fairness in machine learning (ML) has gained attention within the ML community and the broader society, with many fairness definitions and algorithms being proposed. Surprisingly, there is little work quantifying and guaranteeing fairness in the presence of uncertainty, which is prevalent in many socially sensitive applications, ranging from marketing analytics to actuarial analysis and recidivism prediction instruments. To this end, we revisit fairness and reveal idiosyncrasies of the existing fairness literature, which assumes certainty on the class label and thereby limits its real-world utility. Our primary contributions are formulating fairness under uncertainty and group constraints, along with a suite of corresponding new fairness definitions and algorithms. We argue that this formulation has broader applicability to practical scenarios concerning fairness. We also show how the newly devised fairness notions involving censored information and the general framework for fair predictions in the presence of censorship allow us to measure and mitigate discrimination under uncertainty, bridging the gap with real-world applications. Empirical evaluations on real-world datasets with censorship and sensitive attributes demonstrate the practicality of our approach.
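Censorship here is the survival-analysis kind, where the true class label is not observed for every instance. The standard tool for estimating outcome probabilities from such right-censored data, referenced as KM in the figures, is the Kaplan–Meier estimator; a minimal sketch:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate for right-censored data.

    `events[i]` is 1 if the event was observed at times[i], 0 if the
    observation was censored then. Returns (t, S(t)) pairs at event times,
    where S drops by the factor (1 - deaths / at_risk) at each event time.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk, surv, curve = len(times), 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = n_at_t = 0
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t  # censored subjects leave the risk set here
    return curve
```

Censored subjects contribute to the risk set until they drop out, instead of being treated as labeled negatives; that distinction is what certainty-assuming fairness metrics miss.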
Prediction and classification of diseases are essential in medical science, as they help contain the spread of disease and discover infected regions at an early stage. Machine learning (ML) approaches are commonly used for predicting and classifying diseases and serve as an efficient tool for doctors and specialists. This paper proposes an ML-based prediction framework to predict Hepatitis C Virus infection among healthcare workers in Egypt. We utilized real-world data from the National Liver Institute at Menoufiya University (Menoufiya, Egypt). The collected dataset consists of 859 patients with 12 different features. To ensure the robustness and reliability of the proposed framework, we evaluated two scenarios: the first without feature selection and the second with features selected by sequential forward selection (SFS). Furthermore, the feature subset selected from the features generated by SFS is evaluated. Naïve Bayes, random forest (RF), K-nearest neighbor, and logistic regression are utilized as induction algorithms and classifiers for model evaluation. Then, the effect of parameter tuning on the learning techniques is measured. The experimental results indicated that the proposed framework achieved higher accuracies after SFS selection than without feature selection. Moreover, the RF classifier achieved 94.06% accuracy with a minimum learning elapsed time of 0.54 s. Finally, after tuning the hyperparameter values of the RF classifier, the classification accuracy improved to 94.88% using only four features.
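Sequential forward selection itself is a simple greedy wrapper; a sketch, with `score_fn` standing in for the cross-validated classifier accuracy the paper would use:

```python
def sequential_forward_selection(features, score_fn, k):
    """Greedy SFS: starting from the empty set, repeatedly add the
    feature whose inclusion maximizes score_fn(current subset),
    until k features are selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each round costs one model evaluation per remaining feature, so selecting k of d features needs O(k·d) evaluations, far fewer than the 2^d exhaustive subsets.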
Ontology meta-matching system
Aggregation process of similarity values
The variation of de_div with t
Comparison of different optimization models
As an effective method of addressing the ontology heterogeneity problem, ontology matching is becoming increasingly important for knowledge sharing and inter-system communication. Ontology meta-matching, which aims at finding optimal ways of integrating different similarity measures, is an effective method of determining high-quality ontology alignments. However, existing ontology meta-matching techniques suffer from the following defects: first, most of them depend on a reference alignment that ought to be given by experts in advance, which is not available in practical scenarios; second, they tend to get stuck in local optima, which makes the alignment unsatisfactory. To solve these problems, in this work, an optimization model for the ontology meta-matching problem is constructed on the basis of a newly proposed evaluation metric for alignment quality. After that, a multi-strategy adaptive co-firefly algorithm (MACFA), which is able to trade off the algorithm's exploitation and exploration, is proposed to overcome premature convergence. The test cases of the Ontology Alignment Evaluation Initiative (OAEI) are utilized to verify the effectiveness of our approach. Experimental results show that the optimization model together with MACFA improves ontology alignment quality, and compared with OAEI's participants, the proposed matching system achieves competitive results.
In most recommendation scenarios, user information is difficult to obtain due to user privacy and data protection issues. Some graph-based methods can learn users' feature information through the structural relationships in both user graphs and item graphs. However, a user's latent associations with other users, especially those hidden in the user's sequential behavior, are not well identified in sequential recommendation. In this work, we propose a user-view dynamic graph-driven sequential recommender to uncover different latent user associations without additional user information. Our model can not only find the global associations of users, but also discover dynamic user associations through information dissemination during training. In particular, the dynamic associations are highlighted via contrastive learning to refine the global associations from the user view and achieve more efficient sequential recommendation. Furthermore, our approach can serve as a container for commonly used sequential recommenders to achieve better performance. Experimental results show that the user-view information has a positive guiding effect on sequential recommendation and that our approach outperforms state-of-the-art models.
Given a node-attributed graph, how can we efficiently represent it with a few numerical features that expressively reflect its topology and attribute information? We propose A-DOGE, an attributed DOS-based graph embedding built on the density of states (DOS, a.k.a. spectral density), to tackle this problem. A-DOGE is designed to fulfill a long list of desirable characteristics. Most notably, it capitalizes on efficient approximation algorithms for DOS, which we extend to blend in node labels and attributes for the first time, making it fast and scalable for large attributed graphs and graph databases. Being based on the entire eigenspectrum of a graph, A-DOGE can capture structural and attribute properties at multiple ("glocal") scales. Moreover, it is unsupervised (i.e., agnostic to any specific objective) and lends itself to various interpretations, which makes it suitable for exploratory graph mining tasks. Finally, it processes each graph independently of others, making it amenable to streaming settings as well as parallelization. Through extensive experiments, we show the efficacy and efficiency of A-DOGE on exploratory graph analysis and graph classification tasks, where it significantly outperforms unsupervised baselines and achieves competitive performance with modern supervised GNNs, while achieving the best trade-off between accuracy and runtime.
This paper contributes multivariate versions of seven commonly used elastic similarity and distance measures for time series data analytics. Elastic similarity and distance measures can compensate for misalignments in the time axis of time series data. We adapt two existing strategies used in a multivariate version of the well-known Dynamic Time Warping (DTW), namely, Independent and Dependent DTW, to these seven measures. While these measures can be applied to various time series analysis tasks, we demonstrate their utility on multivariate time series classification using the nearest neighbor classifier. On 23 well-known datasets, we demonstrate that each of the measures but one achieves the highest accuracy relative to others on at least one dataset, supporting the value of developing a suite of multivariate similarity and distance measures. We also demonstrate that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. In addition, we also construct a nearest neighbor-based ensemble of the measures and show that it is competitive to other state-of-the-art single-strategy multivariate time series classifiers.
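The independent/dependent distinction, which the paper adapts from multivariate DTW to the other six measures, can be sketched directly; channels are the rows of X and Y, and DTW serves as the representative elastic measure:

```python
import math

def dtw(a, b, dist):
    """Classic dynamic-programming DTW with local cost `dist`."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def dtw_independent(X, Y):
    """Independent strategy: one univariate DTW per channel, summed;
    each channel is free to warp on its own."""
    return sum(dtw(x, y, lambda p, q: abs(p - q)) for x, y in zip(X, Y))

def dtw_dependent(X, Y):
    """Dependent strategy: a single warping path over the multivariate
    points, with Euclidean distance between time-step vectors."""
    a, b = list(zip(*X)), list(zip(*Y))  # channels -> vector sequences
    euclid = lambda p, q: math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))
    return dtw(a, b, euclid)
```

The same template applies to the other elastic measures: swap the per-step cost and recurrence, and keep the per-channel (independent) versus shared-path (dependent) split.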
Group interactions arise in our daily lives (email communications, on-demand ride sharing, and comment interactions in online communities, to name a few), and together they form hypergraphs that evolve over time. Given such temporal hypergraphs, how can we describe their underlying design principles? If their sizes and time spans are considerably different, how can we compare their structural and temporal characteristics? In this work, we define 96 temporal hypergraph motifs (TH-motifs) and propose the relative occurrences of their instances as an answer to the above questions. TH-motifs categorize the relational and temporal dynamics among three connected hyperedges that appear within a short time. For scalable analysis, we develop THyMe+, a fast and exact algorithm for counting the instances of TH-motifs in massive hypergraphs, and we show that THyMe+ is up to 2,163× faster while requiring less space than baseline approaches. In addition to exact counting algorithms, we design three versions of sampling algorithms for approximate counting. We theoretically analyze the accuracy of the proposed methods, and we empirically show that the most advanced of the sampling algorithms is up to 11.1× more accurate than baseline approaches. Using these algorithms, we investigate 11 real-world temporal hypergraphs from various domains. We demonstrate that TH-motifs provide important information useful for downstream tasks and reveal interesting patterns, including the striking similarity between temporal hypergraphs from the same domain.
The high-speed, continuous, and endless nature of data streams makes it challenging to quickly mine high utility itemsets in limited memory space. The sliding window model, which focuses only on the most recent data, has received extensive research attention as it can effectively adapt to the data stream environment. However, the presence of many batches shared between adjacent sliding windows causes an algorithm to repeatedly generate a large number of identical itemsets, which degrades its spatiotemporal performance. To solve these problems and provide users with a concise and lossless result set, we propose a new closed high utility pattern mining algorithm over data streams, named FCHM-Stream. A new utility-list structure based on batch division and a result-set maintenance strategy based on a skip-list structure are designed to avoid repeatedly generating identical itemsets and thus reduce the running time of the algorithm. Extensive experimental results show that the proposed algorithm substantially improves runtime compared to state-of-the-art algorithms.
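For background, the quantity that high utility itemset miners threshold on can be sketched in a few lines (illustrative only; the transaction format and function name are assumptions, not FCHM-Stream's data structures):

```python
def itemset_utility(transactions, itemset):
    """Utility of an itemset: its items' utilities summed over every
    transaction that contains all of its items.  Each transaction is a
    dict mapping item -> utility in that transaction."""
    s = set(itemset)
    return sum(sum(t[i] for i in s) for t in transactions if s <= t.keys())
```

A high utility itemset miner reports every itemset whose utility meets a user-given minimum; the closed variant additionally discards itemsets that have a superset with the same support, giving the concise, lossless result set described above.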
Example of application of caSPiTa. It shows a dataset $\mathcal{D}$ containing the following transactions: 10 CBAD, 90 CBAC, and 200 AC ($|\mathcal{D}| = 300$) (left), and the 1st-order generative null model constructed from $\mathcal{D}$ (right)
Example of a network and of a De Bruijn graph. It shows the network $N = (G, \omega)$ (left) and the 2nd-order De Bruijn graph $G^2 = (V^2, E^2)$ of G (right)
Example of generative null models. It shows $N^1(\mathcal{D})$ (left) and $N^2(\mathcal{D})$ (right), respectively the 1st- and 2nd-order generative null models of $\mathcal{D} = \{\tau_1 = BAD, \tau_2 = CBAC, \tau_3 = CBAD, \tau_4 = AD\}$
Example of paths-oriented generation. It shows the 2nd-order generative null model $N^2(\mathcal{D})$ and the starting vertex BA (left), and the probabilities of all paths of length 3 under the POG strategy (right)
The mining of time series data has applications in several domains, and in many cases the data are generated by networks, with time series representing paths on such networks. In this work, we consider the scenario in which the dataset, i.e., a collection of time series, is generated by an unknown underlying network, and we study the problem of mining statistically significant paths: paths whose number of observed occurrences in the dataset is unexpected given the distribution defined by some features of the underlying network. A major challenge is that the underlying network is unknown, so such paths cannot be identified directly. We therefore propose caSPiTa, an algorithm to mine statistically significant paths in time series data generated by an unknown underlying network; it considers a generative null model based on meaningful characteristics of the observed dataset while providing guarantees in terms of false discoveries. Our extensive evaluation on pseudo-artificial and real data shows that caSPiTa is able to efficiently mine large sets of significant paths, while providing guarantees on the false positives.
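A first-order generative null model of the kind used as a baseline here is just the empirical transition probabilities estimated from consecutive pairs in the observed transactions. A minimal sketch on the running example dataset from the figure above (function names are illustrative, not caSPiTa's API):

```python
from collections import Counter, defaultdict

def first_order_null_model(transactions):
    """First-order generative null model: empirical transition
    probabilities estimated from consecutive symbol pairs."""
    counts = defaultdict(Counter)
    for t in transactions:
        for a, b in zip(t, t[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nbrs.values()) for b, c in nbrs.items()}
            for a, nbrs in counts.items()}

def path_probability(model, path):
    """Probability of observing `path` as a sequence of first-order steps."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= model.get(a, {}).get(b, 0.0)
    return p

# dataset from the running example: 10x CBAD, 90x CBAC, 200x AC
D = ["CBAD"] * 10 + ["CBAC"] * 90 + ["AC"] * 200
model = first_order_null_model(D)
```

Under this model the path CBAD has probability 10/300, so observing it 10 times in 300 transactions is unsurprising; a significant path is one whose observed count deviates sharply from such an expectation.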
Generative adversarial network (GAN) models have been successfully utilized in a wide range of machine learning applications, and the tabular data generation domain is no exception. Notably, several state-of-the-art tabular data generation models are based on GANs. Even though these models achieve superior performance in generating artificial data when trained on a range of datasets, there is a lot of room (and desire) for improvement, and existing methods have weaknesses beyond raw performance. First, current methods focus only on the performance of the model, with limited emphasis on its interpretation. Second, current models operate on raw features only, and hence fail to exploit any prior knowledge of explicit feature interactions that could be utilized during the data generation process. To alleviate these two limitations, we propose a novel tabular data generation model: a Generative Adversarial Network modelling approach inspired by Naive Bayes and Logistic Regression's relationship (GANBLR), which not only addresses the interpretation limitation of existing tabular GAN-based models but also provides the capability to handle explicit feature interactions. Through extensive evaluations on a wide range of datasets, we demonstrate GANBLR's superior performance as well as better interpretability (explanation of feature importance in the synthetic generation process) compared to existing state-of-the-art tabular data generation models.
Owing to the remarkable development of deep learning technology, there have been a series of efforts to build deep learning-based climate models. Whereas most of them utilize recurrent neural networks and/or graph neural networks, we design a novel climate model based on two concepts: the neural ordinary differential equation (NODE) and the advection–diffusion equation. The advection–diffusion equation is widely used for climate modeling because it describes many physical processes involving Brownian and bulk motions in climate systems. NODEs, in turn, learn a latent governing ODE from data. We combine the two into a single framework and propose the neural advection–diffusion equation (NADE). Our NADE, equipped with the advection–diffusion equation and one additional neural network to model inherent uncertainty, can learn an appropriate latent governing equation that best describes a given climate dataset. In our experiments with three real-world and two synthetic datasets and fourteen baselines, our method consistently outperforms existing baselines by non-trivial margins.
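The advection–diffusion equation underlying NADE is $\partial u/\partial t = -v\,\partial u/\partial x + D\,\partial^2 u/\partial x^2$. A minimal numerical sketch of one explicit-Euler step on a periodic 1-D grid (illustrative only: NADE learns the governing terms with neural networks, whereas here $v$ and $D$ are fixed constants):

```python
import numpy as np

def advection_diffusion_step(u, dx, dt, velocity, diffusivity):
    """One explicit-Euler step of du/dt = -v*du/dx + D*d2u/dx2
    on a periodic 1-D grid, using central finite differences."""
    dudx = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    d2udx2 = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx ** 2
    return u + dt * (-velocity * dudx + diffusivity * d2udx2)
```

With periodic boundaries both the advection and diffusion terms sum to zero across the grid, so total mass is conserved, one of the physical properties that makes this equation attractive for climate modeling.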
Guiding the decisions of stock market investors is among the most challenging financial research problems. Markowitz's approach to portfolio selection models stock profitability and risk level through a mean–variance model, which involves estimating a very large number of parameters. In addition to requiring considerable computational effort, this raises serious concerns about the reliability of the model in real-world scenarios. This paper presents a hybrid approach that combines itemset extraction with portfolio selection. We propose to adapt Markowitz's model logic to deal with sets of candidate portfolios rather than with single stocks. We overcome some of the known issues of the Markowitz model as follows: (i) Complexity: we reduce the model complexity, in terms of parameter estimation, by studying the interactions among stocks within a shortlist of candidate stock portfolios previously selected by an itemset mining algorithm. (ii) Portfolio-level constraints: we not only perform stock-level selection, but also support the enforcement of arbitrary constraints at the portfolio level, including diversification properties and fundamental indicators. (iii) Usability: we simplify the decision-maker's work by proposing a decision support system that enables flexible use of domain knowledge and human-in-the-loop feedback. The experimental results, achieved on the US stock market, confirm the proposed approach's flexibility, effectiveness, and scalability.
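The parameter-estimation burden mentioned above comes from the covariance matrix: for n stocks the mean–variance model needs O(n²) estimates. A minimal sketch of the portfolio statistics involved (illustrative; function name and data layout are assumptions, not the paper's system):

```python
import numpy as np

def portfolio_risk_return(returns, weights):
    """Mean-variance statistics of a portfolio.

    returns: (T, n) matrix of per-period returns for n stocks.
    weights: (n,) portfolio weights summing to 1.
    The full n x n covariance matrix requires O(n^2) parameter
    estimates; shortlisting candidate portfolios keeps n small.
    """
    mu = returns.mean(axis=0)
    sigma = np.cov(returns, rowvar=False)
    exp_return = float(weights @ mu)
    variance = float(weights @ sigma @ weights)
    return exp_return, variance
```

Two perfectly anti-correlated assets held 50/50 give zero portfolio variance, the textbook illustration of why the covariance terms matter.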
Relative error and computational time as a function of $\epsilon$ for the CaHep, CaAstro, and Amazon datasets, with $h = 3$ (top row) and $h = 4$ (bottom row)
Computational time as a function of number of edges applied to synthetic data
Sizes and computational times for the benchmark datasets
Core decomposition is a classic technique for discovering densely connected regions in a graph, with a wide range of applications. Formally, a k-core is a maximal subgraph where each vertex has at least k neighbors. A natural extension of a k-core is a (k, h)-core, where each node must have at least k nodes that can be reached with a path of length at most h. The downside of using (k, h)-core decomposition is the significant increase in computational complexity: whereas the standard core decomposition can be done in $\mathcal{O}(m)$ time, the generalization can require $\mathcal{O}(n^2 m)$ time, where n and m are the number of nodes and edges in the given graph. In this paper, we propose a randomized algorithm that produces an $\epsilon$-approximation of the (k, h)-core decomposition with probability $1 - \delta$ in $\mathcal{O}(\epsilon^{-2} h m (\log^2 n - \log \delta))$ time. The approximation is based on sampling the neighborhoods of nodes, and we use the Chernoff bound to prove the approximation guarantee. We also study distance-generalized dense subgraphs, show that the problem is NP-hard, provide an algorithm for discovering such graphs with approximate core decompositions, and provide theoretical guarantees for the quality of the discovered subgraphs. We demonstrate empirically that approximating the decomposition complements the exact computation: computing the approximation is significantly faster than computing the exact solution for the networks where computing the exact solution is slow.
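The (k, h)-core definition lends itself to a naive peeling procedure: repeatedly remove any node whose distance-h neighborhood (restricted to surviving nodes) has fewer than k members. A small exact sketch of that baseline (illustrative only; the paper's contribution is a much faster sampling-based approximation, not this brute-force loop):

```python
from collections import deque

def h_neighborhood(adj, v, h):
    """Nodes reachable from v within h hops (excluding v itself), by BFS."""
    seen, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == h:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return seen - {v}

def kh_core(adj, k, h):
    """(k,h)-core: peel nodes whose distance-h neighborhood, restricted
    to surviving nodes, has fewer than k members, until a fixpoint."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            sub = {u: [w for w in adj[u] if w in alive] for u in alive}
            if v in alive and len(h_neighborhood(sub, v, h)) < k:
                alive.discard(v)
                changed = True
    return alive
```

Rebuilding the restricted subgraph per check is what drives the cost toward the $\mathcal{O}(n^2 m)$ worst case noted above, and is exactly what the sampling-based approximation avoids.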
Original JSON data that is transformed into the SQL records of Example 3
Example of JSON data originating from different sources
Combinations of Data Dimensions, a Formalization of their Supporting Data Structures, and the Enabling Technical Concepts
Edgar Codd introduced the principle of entity integrity in the context of his relational model of data. The principle says that every targeted real-world entity should be uniquely represented in the database. In actual database systems, entity integrity is typically enforced by primary keys. We introduce a framework toward generalizing entity integrity to different dimensions of data, including volume, variety, and veracity. We establish axiomatic and algorithmic foundations for the implication problem of the combined class of uniqueness constraints, functional dependencies, and multivalued dependencies in all combinations of the dimensions we consider. These are based on specific approaches to the semantics of these integrity constraints and to the dimensions of data. We also highlight how our concepts lead to new opportunities for diverse and important areas of applications, such as query optimization, database design and security, and data quality. Overall, this sets out an agenda for future research that extends our approaches or applies different approaches in this area, as driven by application requirements.
Due to the sustained popularization of Online Social Networks (OSNs), it has become of interest for a variety of domains of applications to correctly characterize how the behavior of an individual user can be influenced by the actions of other users in a network. Additionally, the richness of available features in modern OSNs highlights the growing importance of user-generated data in establishing user relations. In this paper, we follow a data-driven methodology and propose a diffusion algorithm designed around user-to-content relationships and an action–reaction paradigm. Crucially, we design our approach by integrating different cross-disciplinary theories of how users influence each other. Thus, we enrich the influence maximization task with a psychological dimension and define a model that ties influence diffusion to recurrent users’ behavior from OSN logs, considering relationships between users mediated by user-generated content. We evaluate our approach over the Yahoo Flickr Creative Commons 100 Million real-world dataset. We measure efficiency and effectiveness by analyzing scalability and spread efficacy and show how our model outperforms existing state-of-the-art methods.
Graph neural networks (GNNs), which are powerful and widely applied models, are based on the assumption that graph topology plays a key role in graph representation learning. However, existing GNN methods are based on Euclidean embedding spaces, which struggle to represent a variety of graph geometric properties well. Recently, Riemannian geometries have been introduced into GNNs, such as hyperbolic graph neural networks proposed for hierarchy-preserving graph representation learning. In Riemannian geometry, different graph topological structures can be reflected by correspondingly curved embedding spaces: a hyperbolic space can be understood as a continuous tree-like structure, and a spherical space as a continuous clique. However, most existing non-Euclidean GNNs rely on heuristic, manual statistical, or estimation methods, which makes it difficult to automatically select the appropriate embedding space for graphs with different topological properties. To deal with this problem, we propose the Adaptive Curvature Exploration Geometric Graph Neural Network, which learns high-quality graph representations and explores the embedding space with optimal curvature at the same time. We cast the joint problem of graph representation learning and curvature exploration as a multi-objective optimization solved with multi-agent reinforcement learning, using the Nash Q-learning algorithm to collaboratively train the two agents to reach a Nash equilibrium. We conduct extensive experiments on synthetic and real-world graph datasets, and the results demonstrate significant and consistent performance improvement and generalization of our method.
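To make the "continuous tree-like structure" intuition concrete, here is the standard geodesic distance in the Poincaré ball model of hyperbolic space, the kind of negatively curved embedding space hyperbolic GNNs use (a generic textbook formula, not this paper's method):

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit ball in the
    Poincare ball model of hyperbolic space:
        d(u, v) = arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))
    Distances blow up near the boundary, which is what gives the space
    exponentially growing 'room' for embedding trees."""
    du = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = sum(a * a for a in u)
    nv = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * du / ((1.0 - nu) * (1.0 - nv)))
```

For example, the distance from the origin to (0.5, 0) is arccosh(5/3) = ln 3, already larger than the Euclidean 0.5, and it diverges as the point approaches the boundary.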
In this paper, we propose a single-agent Monte Carlo-based reinforced feature selection method, along with two efficiency improvement strategies: an early stopping strategy and a reward-level interactive strategy. Feature selection is one of the most important technologies in data preprocessing, aiming to find the optimal feature subset for a given downstream machine learning task. A great deal of research has been done to improve its effectiveness and efficiency. Recently, multi-agent reinforced feature selection (MARFS) has achieved great success in improving the performance of feature selection. However, MARFS suffers from a heavy computational cost, which greatly limits its application in real-world scenarios. We propose an efficient reinforced feature selection method that uses one agent to traverse the whole feature set and decide whether to select each feature one by one. Specifically, we first develop a behavior policy and use it to traverse the feature set and generate training data. We then evaluate the target policy based on the training data and improve the target policy via the Bellman equation. Besides, we conduct importance sampling in an incremental way and propose an early stopping strategy that improves training efficiency by removing skewed data. In the early stopping strategy, the behavior policy stops traversing with a probability inversely proportional to the importance sampling weight. In addition, we propose a reward-level and training-level interactive strategy to improve training efficiency via external advice. Moreover, we propose an incremental descriptive statistics method to represent the state with low computational cost. Finally, we design extensive experiments on real-world data to demonstrate the superiority of the proposed method.
The performance of data mining and machine learning algorithms is degraded by high-dimensional data, an issue known as the “curse of dimensionality”. Feature selection is a hot research topic in which a subset of features is selected to reduce dimensionality, eliminating redundant and irrelevant features and thereby increasing the accuracy of learning algorithms. Cervical cancer is one of the most commonly occurring diseases and a driving reason for untimely mortality among women worldwide, especially in emerging low-income nations such as India. However, the literature shows that early identification and accurate diagnosis of cervical cancer can increase survival chances. The disease shows no signs of its presence in the early stages of its growth. Automated classification and diagnosis of cervical cancer using machine learning and deep learning techniques is in high demand, as it allows timely, accurate, and regular study of a patient's health progress. Meta-heuristic algorithms provide global, problem-independent optimal solutions and have been applied to the feature selection problem for decades. Despite a substantial body of literature, there is no survey on feature selection techniques applied to cervical cancer datasets. This paper presents a brief survey of meta-heuristic, variant, hybrid meta-heuristic, and hyper-heuristic techniques. It summarizes the feature selection techniques applied to cervical cancer data to identify the research gap, thereby guiding researchers toward future research directions. A categorization of the techniques, such as nature-inspired versus non-nature-inspired and trajectory-based versus population-based, is also highlighted. The survey provides a comparative literature review involving classical feature selection techniques and feature selection using meta-heuristic algorithms for the cervical cancer classification application.
This paper discusses group decision-making (GDM) with interval multiplicative preference relations (IMPRs) based on geometric consistency. We propose a logarithmically geometric compatibility degree between two IMPRs and then define a geometrically logarithmic consistency index of IMPRs. The new consistency index of IMPRs is invariant under permutation of alternatives and transposition of IMPRs. Using statistical theory, thresholds for the geometrically logarithmic consistency index are provided. For an unacceptably consistent IMPR, an interactive iterative algorithm is designed to improve its consistency level. Using the relationship between an interval weight vector (IWV) and an IMPR, a fuzzy programming model is established to derive an IWV. This model is converted into a linear programming model for resolution. Subsequently, a new individual decision-making (IDM) method with an IMPR is put forward. By minimizing the logarithmically geometric compatibility degree between each individual IMPR and the collective one, a convex programming model is built to determine experts' weights. Consequently, a novel GDM method with IMPRs is presented. Numerical examples and simulation experiments are conducted to reveal the superiority of the proposed IDM and GDM methods.
Accuracy w.r.t. number of parameters from different models and datasets. The three blue circles correspond to rank-1, 2, and 3 Falcon (from left to right order), respectively. Falcon provides the best accuracy for a given number of parameters
Relation between standard convolution and Falcon expressed with GEP. The common axes correspond to the output channel axis of standard convolution. $TT_{(1,2,4,3)}$ indicates a tensor transpose operation that permutes the third and fourth dimensions of a tensor
Comparison of architectures. BN denotes batch normalization. ReLU and ReLU6 are activation functions. a Standard convolution with branch (StConv-branch). b Falcon-branch, which combines Falcon with StConv-branch
How can we efficiently compress convolutional neural networks (CNNs) using depthwise separable convolution, while retaining their accuracy on classification tasks? Depthwise separable convolution, which replaces a standard convolution with a depthwise convolution and a pointwise convolution, has been used for building lightweight architectures. However, previous works based on depthwise separable convolution are limited when compressing a trained CNN model since (1) they are mostly heuristic approaches without a precise understanding of their relation to standard convolution, and (2) their accuracies do not match that of standard convolution. In this paper, we propose Falcon, an accurate and lightweight method to compress CNNs based on depthwise separable convolution. Falcon uses the generalized elementwise product (GEP), our proposed mathematical formulation for approximating the standard convolution kernel, to interpret existing convolution methods based on depthwise separable convolution. By exploiting the knowledge of a trained standard model and carefully determining the order of depthwise separable convolution via GEP, Falcon achieves accuracy close to that of the trained standard model. Furthermore, this interpretation leads to a generalized version, rank-k Falcon, which performs k independent Falcon operations and sums up the results. Experiments show that Falcon (1) provides higher accuracy than existing methods based on depthwise separable convolution and tensor decomposition, and (2) reduces the number of parameters and FLOPs of standard convolution by up to a factor of 8 while ensuring similar accuracy. We also demonstrate that rank-k Falcon further improves accuracy while sacrificing a small amount of the compression and computation reduction rates.
Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional sources like open-ended surveys and newspaper articles to one of the dozens of online social media platforms. Most topic models are equipped to generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem that many models face is the varying noise levels inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and varying quality. Our topic-noise model, Topic Noise Discriminator (TND), approximates topic and noise distributions side by side with the help of word embedding spaces. While topic-noise models are important for the types of short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. TND itself generates a noise distribution that, when ensembled with other generative topic models, can produce more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), and demonstrate the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics using multiple sources and finding that they need a way to identify a core set based on text from different sources. We propose cross-source topic blending (CSTB), an approach that maps topic sets to an s-partite graph and identifies core topics that blend topics from across s sources by identifying subgraphs with certain linkage properties.
We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources.
With the increase in the use of compute-intensive applications, the demand to continuously boost the efficiency of data processing grows. Offloading compute-intensive application tasks to edge servers can effectively solve problems for resource-constrained mobile devices. However, computation offloading may increase network load and transmission delay, which influences the user experience. On the other hand, the continuously changing distance between the local device and the edge server can also affect service quality due to user mobility. This paper proposes offloading and service migration methods for compute-intensive applications to deal with these issues. First, a fine-grained computation offloading algorithm based on a non-cooperative game is proposed, and the overhead on both the local side and the edge side is analyzed. Moreover, service migration path selection based on a Markov decision process is proposed, considering user mobility, energy cost, migration cost, available storage, and bandwidth; selecting the optimal service migration path improves service quality. Experiment results show that, compared with other baseline algorithms, our proposed offloading strategy reduces energy consumption by more than 10% and latency by more than 6.2%, saving mobile device energy and reducing task response time. The proposed service migration scheme can reduce migration times and maintain a success rate of more than 90% while guaranteeing service continuity in a multi-user scenario.
Nearest neighbour similarity measures are widely used in many time series data analysis applications. They compute a measure of similarity between two time series. Most applications require tuning of these measures' meta-parameters in order to achieve good performance. However, most measures have at least $O(L^2)$ complexity, making them computationally expensive and the process of learning their meta-parameters burdensome, requiring days even for datasets containing only a few thousand series. In this paper, we propose UltraFastMPSearch, a family of algorithms to learn the meta-parameters for different types of time series distance measures. These algorithms are significantly faster than the prior state of the art. They build upon the state of the art, exploiting the properties of a new efficient exact algorithm which supports early abandoning and pruning for most time series distance measures. We show on 128 datasets from the UCR archive that our new family of algorithms is up to an order of magnitude faster than the previous state of the art.
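Early abandoning, one of the properties exploited above, can be illustrated with plain DTW: in nearest neighbour search, once the minimum cumulative cost in a row already exceeds the best candidate found so far, no warping path through that row can do better, so the computation can stop. A generic sketch (not the paper's algorithm, which is considerably more sophisticated):

```python
def dtw_early_abandon(x, y, best_so_far):
    """DTW (squared pointwise cost) that abandons as soon as every cell
    in the current row exceeds best_so_far, so a candidate that cannot
    win is rejected without a full O(L^2) pass."""
    n, m = len(x), len(y)
    prev = [float("inf")] * (m + 1)
    prev[0] = 0.0                            # paths must start at (0, 0)
    for i in range(1, n + 1):
        curr = [float("inf")] * (m + 1)
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        if min(curr) > best_so_far:          # no path can recover: abandon
            return float("inf")
        prev = curr
    return prev[m]
```

Cumulative costs are non-decreasing along any warping path, which is what makes the row-minimum test safe.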
Best output achieved by a HM, b FHM, c Fodina, d PM, and e the proposed method for the $BPIC15_{4f}$ event log
Median problem-solving time for different $|A_L|$ and noise levels
Heuristic mining techniques are among the most popular methods in the process discovery area. This category of methods comprises two main steps: (1) discovering a dependency graph and (2) determining the split/join patterns of the dependency graph. Current dependency graph discovery techniques in heuristic-based methods select the initial set of graph arcs according to dependency measures and then modify the set according to some criteria. This can lead to selecting a non-optimal set of arcs. The modifications can also result in modeling rare behaviors and, consequently, low-precision and non-simple process models. The motivation of this paper is to improve heuristic mining methods by addressing these issues. The contribution of this paper is a new integer linear programming model that determines the optimal set of graph arcs with respect to dependency measures. At the same time, the proposed method eliminates some other issues that existing methods cannot handle completely; i.e., even in the presence of loops, it guarantees that all tasks are on a path from the initial to the final tasks. The approach also allows utilizing domain knowledge by introducing appropriate constraints, which can be a practical advantage in real-world problems. To assess the results, we modified two existing methods of evaluating process models to make them capable of measuring the quality of dependency graphs. According to these assessments, the proposed method's outputs are superior to those of the most prominent dependency graph discovery methods in terms of fitness, precision, and especially simplicity.
An example of South Korea's egocentric network in each snapshot from the temporal knowledge graph
Example illustration of the prediction process of DAuCNet
Architecture of the proposed DAuCNet model
Effect of the number of historical time steps on MRR and Hit@10
The research on temporal knowledge graphs (TKGs) has received increasing attention. Since knowledge graphs are always incomplete, knowledge reasoning problems are crucial, and knowledge reasoning is challenging due to the temporal evolution of TKGs. Most existing approaches focus on knowledge graph inference within past timestamps and cannot predict upcoming facts. There is evidence that, as temporal facts in TKGs evolve and interact, most of them exhibit repeating patterns along the historical timeline. This observation indicates that forecasting models may predict upcoming facts based on history. To this end, this paper proposes a novel temporal representation learning model for predicting future facts, named DAuCNet, which applies a deep autoregressive structure as its main framework and combines it with a time-aware copy-based mechanism network. Specifically, our model proposes a Temporal Fact Encoder to encode historical facts and a Duplicate Fact Collector to collect historically relevant events and identify repetitive events. It employs a multi-relation Neighborhood Aggregator based on graph-attention networks to model the connection of facts within the concurrent window. Finally, DAuCNet integrates these three modules to forecast future facts. Experimental results on five public knowledge graph datasets show that DAuCNet performs significantly better than other baselines at temporal link prediction and inference for future timestamps.
An obvious defect of the extreme learning machine (ELM) is that its prediction performance is sensitive to the random initialization of input-layer weights and hidden-layer biases. To make ELM insensitive to random initialization, GPRELM adopts the simple and effective strategy of integrating Gaussian process regression into ELM. However, kernel-based GPRELM (kGPRELM) suffers from a serious overfitting problem. In this paper, we investigate the theoretical reasons for the overfitting of kGPRELM and propose a correlation-based GPRELM (cGPRELM), which uses a correlation coefficient to measure the similarity between two different hidden-layer output vectors. cGPRELM reduces the likelihood that the covariance matrix becomes an identity matrix as the number of hidden-layer nodes increases, effectively controlling overfitting. Furthermore, cGPRELM works well for improper initialization intervals where ELM and kGPRELM fail to provide good predictions. Experimental results on real classification and regression data sets demonstrate the feasibility and superiority of cGPRELM, as it not only achieves better generalization performance but also has lower computational complexity.
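The core idea of measuring similarity between hidden-layer output vectors with a correlation coefficient can be sketched minimally. This is an illustrative simplification, not the authors' implementation; the function names and the toy network are ours, and the full GPR machinery is omitted:

```python
import math
import random

def hidden_output(x, weights, biases):
    """ELM hidden layer: sigmoid(w . x + b) for each hidden node."""
    return [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for w, b in zip(weights, biases)]

def correlation(u, v):
    """Pearson correlation between two hidden-layer output vectors,
    used as the similarity measure in place of a conventional kernel."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Randomly initialized toy ELM with 5 hidden nodes and 2 inputs
random.seed(0)
weights = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(5)]
biases = [random.uniform(-1, 1) for _ in range(5)]
h1 = hidden_output([0.20, 0.40], weights, biases)
h2 = hidden_output([0.21, 0.41], weights, biases)  # a nearby input
```

Because the correlation of a vector with itself is always 1 regardless of scale, the diagonal of the resulting covariance-like matrix stays fixed while off-diagonal entries remain informative, which is the property the abstract credits with controlling overfitting.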
Rough set theory is a well-established framework for dealing with ambiguous, incomplete, and imprecise situations. Soft set theory and neutrosophic set theory are other advanced mathematical techniques for dealing with ambiguous, partial, and inconsistent data. The aim of this paper is to broaden the scope of rough set theory, soft set theory, and neutrosophic set theory. The notion of neutrosophic soft rough sets has been re-introduced, and several definitions, properties, and examples of neutrosophic soft rough sets have been established. We also develop the concept of neutrosophic soft rough topology, based on this novel neutrosophic soft rough set approach, and define open sets, closed sets, interior, and closure as characteristics of neutrosophic soft rough topology.
In the last decade, a large number of knowledge graph (KG) completion approaches were proposed. Albeit effective, these efforts are disjoint, and their collective strengths and weaknesses in effective KG completion have not been studied in the literature. We extend Plumber, a framework that brings together the research community's disjoint efforts on KG completion. We add further components to the architecture of Plumber, which now comprises 40 reusable components for various KG completion subtasks, such as coreference resolution, entity linking, and relation extraction. Using these components, Plumber dynamically generates suitable knowledge extraction pipelines and offers 432 distinct pipelines overall. We study the optimization problem of choosing optimal pipelines based on input sentences. To do so, we train a transformer-based classification model that extracts contextual embeddings from the input and finds an appropriate pipeline. We study the efficacy of Plumber for extracting KG triples using standard datasets over three KGs: DBpedia, Wikidata, and the Open Research Knowledge Graph. Our results demonstrate the effectiveness of Plumber in dynamically generating KG completion pipelines, outperforming all baselines regardless of the underlying KG. Furthermore, we provide an analysis of collective failure cases, study the similarities and synergies among integrated components, and discuss their limitations.
The telecommunication segment has grown tremendously over the past few decades. Smartphones in particular have become essential and have displaced many gadgets such as computers and cameras. Smartphones are now indispensable to all kinds of consumers, including students, teachers, and businesspeople, who also expect an extensive number of enhanced, high-quality features to be embedded in them. Alongside this growth, the number of mobile application providers has grown rapidly. Apart from calling, many consumers use smartphones to browse the internet, and many Android developers offer browser applications with various advancements. This leaves consumers confused about which browser best meets their smartphone requirements, so they need a proven methodology for selecting one. To this end, this paper proposes a hybrid multi-criteria decision-making model that integrates grey relational analysis (GRA) and the fuzzy analytical hierarchy process (FAHP). The findings are also compared and validated through a machine learning approach.
Partial multi-label learning (PML) models the scenario where each training sample is annotated with a candidate label set, among which only a subset corresponds to the ground-truth labels. Existing PML approaches generally assume that there are sufficient partial multi-label samples for training the predictor. Nevertheless, when dealing with new tasks, it is more common that we have only a few PML samples associated with those tasks at hand. Furthermore, existing few-shot learning solutions typically assume the labels of support (training) samples are noise-free; as a result, noisy labels concealed within the candidate labels may seriously misinform the meta-learner and lead to compromised performance. We formalize this problem as a new learning paradigm called few-shot partial multi-label learning (FsPML), which aims to induce a noise-robust multi-label classifier from limited PML samples related to the target task. To address this problem, we propose a method named FsPML via prototype rectification (FsPML-PR). Specifically, FsPML-PR first conducts adaptive distance metric learning with an embedding network on previously encountered tasks. Next, it performs positive/negative prototype rectification and label disambiguation using sample features and label correlations in the embedding space. A new sample can then be classified based on its distances to the positive and negative prototypes. Extensive experiments on benchmark datasets confirm that the proposed FsPML-PR achieves superior performance across various settings.
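The prototype-based decision rule at the end of the pipeline can be sketched minimally: compute a positive and a negative prototype per label as the mean of support embeddings, then classify a new sample by which prototype it is closer to. This is a bare-bones illustration under our own naming; the metric learning and prototype rectification steps the paper describes are omitted:

```python
import math

def prototype(vectors):
    """Mean embedding of a set of support-sample vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dist(u, v):
    """Euclidean distance in the embedding space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_label(x, pos_support, neg_support):
    """A label is predicted as relevant if x lies closer to the positive
    prototype than to the negative prototype."""
    return dist(x, prototype(pos_support)) < dist(x, prototype(neg_support))

# Toy 2-D embeddings of positive and negative support samples for one label
pos = [[1.0, 1.0], [1.2, 0.8]]
neg = [[-1.0, -1.0], [-0.8, -1.2]]
```

The rectification step in the paper adjusts these prototypes using label correlations before the distance comparison is made.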
Many high-dimensional practical data sets have hierarchical structures induced by graphs or time series. Such data sets are hard to process in Euclidean spaces, and one often seeks low-dimensional embeddings in other space forms to perform the required learning tasks. For hierarchical data, the space of choice is a hyperbolic space because it guarantees low-distortion embeddings for tree-like structures. The geometry of hyperbolic spaces has properties not encountered in Euclidean spaces that pose challenges when trying to rigorously analyze algorithmic solutions. We propose a unified framework for learning scalable and simple hyperbolic linear classifiers with provable performance guarantees. The gist of our approach is to focus on Poincaré ball models and formulate the classification problems using tangent space formalisms. Our results include a new hyperbolic perceptron algorithm as well as an efficient and highly accurate convex optimization setup for hyperbolic support vector machine classifiers. Furthermore, we adapt our approach to accommodate second-order perceptrons, where data are preprocessed based on second-order information (correlation) to accelerate convergence, and strategic perceptrons, where potentially manipulated data arrive in an online manner and decisions are made sequentially. The excellent performance of the Poincaré second-order and strategic perceptrons shows that the proposed framework can be extended to general machine learning problems in hyperbolic spaces. Our experimental results, pertaining to synthetic, single-cell RNA-seq expression measurements, CIFAR10, Fashion-MNIST and mini-ImageNet, establish that all algorithms provably converge and have complexity comparable to those of their Euclidean counterparts.
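The geodesic distance in the Poincaré ball model, which underlies these classifiers, can be computed directly from the standard formula (variable names are ours):

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between points x, y inside the unit Poincare ball:
    d(x, y) = arccosh(1 + 2*||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2)))."""
    sq = lambda v: sum(c * c for c in v)
    diff = sq([a - b for a, b in zip(x, y)])
    return math.acosh(1 + 2 * diff / ((1 - sq(x)) * (1 - sq(y))))

# Distance from the origin grows without bound as a point approaches the
# boundary of the ball, which is what makes hyperbolic space suitable for
# low-distortion embeddings of tree-like data.
d_half = poincare_distance([0.0, 0.0], [0.5, 0.0])
d_edge = poincare_distance([0.0, 0.0], [0.9, 0.0])
```

For a point at Euclidean radius r from the origin this reduces to 2·artanh(r), so d_half equals ln 3.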
Motion-detection-based smart home security system
Overall performance of the algorithm-based smart home security system
Thresholding method in the overall system
This research is organized around a simulation model designed in MATLAB-compatible software to build a smart home surveillance system as part of a smart security system. The experimental results compare the current status with the normal test accuracy on genuine videos from the intelligent surveillance system. Using a deep learning technique, the architecture of a convolutional neural network (CNN) is first presented and analyzed in the context of the selected and designed surveillance-system architecture. The results are carefully examined, and the most effective architecture is chosen for use in the proposed system model to achieve higher quality of service (QoS). The major aim of the study is to automate the IVS system through IPS as much as possible and to achieve a high percentage of accuracy. This thesis focuses on counting and tracking people and objects in crowds, which involves technical tasks such as human detection and overcoming the problem of occlusion at an acceptable processing speed.
Question answering is a crucial natural language processing task. This field of research has recently attracted a surge of interest, due mainly to the integration of deep learning models into question answering systems, which in turn has powered many advancements and improvements. This survey aims to explore and shed light on the most recent and most powerful deep learning-based question answering systems and to classify them based on the deep learning model used, detailing the word representations, datasets, and evaluation metrics employed. It highlights and discusses the currently used models and gives insights to direct future research in this steadily growing field.
Steps performed by the k-NN classification algorithm with a number of neighbors k = 5
Accuracies of the k-NN classifier using different k values and RDDM as concept drift detector, with the drifts configured at positions 2000, 4000, 6000, and 8000, in the artificial datasets LED (a) and Mixed (b)
Accuracies of the methods using RDDM as concept drift detector, with the drifts configured at positions 2000, 4000, 6000, and 8000, in the artificial datasets: a LED and b Mixed
Statistical comparison of the accuracy of the methods using the Friedman test and the Nemenyi post hoc test for comparisons in all datasets tested with the auxiliary detector
Over the years, several classification algorithms have been proposed in the machine learning area to address challenges related to the continuous arrival of data over time, formally known as data streams. The implementations of these approaches are of vital importance for the different applications where they are used, and they have also received modifications, specifically to address the problem of concept drift, a phenomenon present in classification problems with data streams. The k-nearest neighbors (k-NN) classification algorithm is one of the family of lazy approaches used to address this problem in online learning, but it still presents challenges, such as the efficient choice of the number of neighbors k used in the learning process. This article proposes paired k-NN learners with a dynamically adjusted number of neighbors (PL-kNN), a method that dynamically and incrementally adjusts the number of neighbors used by its pair of k-NN learners during online learning over data streams with concept drifts. To validate it, experiments were carried out with both artificial and real-world datasets, and the results were evaluated using accuracy, run time, memory usage, and the Friedman statistical test with the Nemenyi post hoc test. The experimental results show that PL-kNN improves on the accuracy of k-NN with fixed values of k in most tested scenarios.
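The underlying lazy learner can be sketched as a k-NN classifier over a fixed-size sliding window. This is a generic sketch of the base component, not the PL-kNN pairing mechanism itself; the dynamic adjustment of k is omitted, and the class and parameter names are ours:

```python
from collections import Counter, deque

class WindowKNN:
    """k-NN classifier over a fixed-size sliding window of labeled instances,
    so old concepts are forgotten as the window slides."""

    def __init__(self, k=3, window=100):
        self.k = k
        self.window = deque(maxlen=window)  # oldest instance drops out first

    def learn(self, x, y):
        self.window.append((x, y))

    def predict(self, x):
        # Sort window instances by squared Euclidean distance to x
        neighbors = sorted(self.window,
                           key=lambda item: sum((a - b) ** 2
                                                for a, b in zip(item[0], x)))
        # Majority vote among the k nearest
        votes = Counter(y for _, y in neighbors[:self.k])
        return votes.most_common(1)[0][0]

clf = WindowKNN(k=3, window=50)
for x, y in [([0.10], "a"), ([0.20], "a"), ([0.15], "a"),
             ([0.90], "b"), ([0.95], "b"), ([0.85], "b")]:
    clf.learn(x, y)
```

PL-kNN runs two such learners in parallel with different k values and adjusts k based on their relative performance; here we show only one fixed-k learner.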
Network embedding in heterogeneous networks has recently attracted much attention due to its effectiveness in capturing the structure and inherent properties of networks. Most existing models focus on node proximity. Nevertheless, a heterogeneous network contains different types (domains) of nodes and edges. Nodes of the same type exhibit global patterns widely known as communities, where a community is intuitively identified as a group of nodes with more connections among its internal nodes than to external ones. Similarly, we assume that there is also an intermediate structure across different types of nodes, which we call an organization, and nodes in an organization interact more frequently with each other than with external ones. Thus, nodes within the same community and organization should have similar node embeddings. Inspired by this, we take the structural characteristics of heterogeneous networks into consideration and propose a novel structure-aware Attributed Heterogeneous Network Embedding model (SAHNE). Specifically, we first introduce a random walk strategy based on node degree to sample node sequences, which can better explore the community and organization information in a heterogeneous network. Next, we design a structure-aware attributed heterogeneous network embedding model to simultaneously detect the community and organization distribution of each node and learn embeddings of nodes, communities, and organizations. Extensive experiments on three real-world heterogeneous networks demonstrate that SAHNE outperforms state-of-the-art methods on various data-mining tasks.
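A degree-based random walk for sampling node sequences can be sketched as follows. This is an illustrative simplification under our own assumptions (the next node is drawn among the current node's neighbors with probability proportional to each neighbor's degree), not SAHNE's exact strategy:

```python
import random

def degree_biased_walk(adj, start, length, rng=None):
    """Random walk on an adjacency-list graph where the next node is chosen
    among the current node's neighbors with probability proportional to
    each neighbor's degree, so high-degree (hub) nodes are visited more."""
    rng = rng or random.Random(42)
    walk = [start]
    for _ in range(length - 1):
        nbrs = adj[walk[-1]]
        weights = [len(adj[n]) for n in nbrs]  # neighbor degrees
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Small toy graph: hub node "h" is connected to every other node
adj = {"h": ["a", "b", "c"],
       "a": ["h", "b"],
       "b": ["h", "a"],
       "c": ["h"]}
walk = degree_biased_walk(adj, "a", 10)
```

Biasing the transition by degree makes walks gravitate toward dense regions, which is one way community and organization structure can surface in the sampled sequences.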
Stages of the proposed methodology
This study proposes a three-stage integration of the Modified Kemeny Median Indicator Rank Accordance (KEMIRA-M) method and the Decision-Making Trial and Evaluation Laboratory (DEMATEL) method for risk assessment (RA). In the first stage, risk criteria rankings are obtained for each expert separately by implementing DEMATEL. In the second stage, the criteria weights obtained from DEMATEL are used to determine the Median Priority Components, an aggregated criteria ranking over all experts, as in traditional KEMIRA-M. In this stage, an initial decision matrix containing the danger sources' performance values on the risk criteria is formed, and rankings of danger sources are obtained via the KEMIRA-M selection procedure using the criteria weights from DEMATEL. In the third stage, the direct-relation matrix of DEMATEL is used again to determine the effect level of measures on danger sources. Then, using the danger source weights from the second stage, the measures are prioritized according to their weighted means. This study is the first to advance KEMIRA-M's weighting procedure by incorporating DEMATEL; in this way, KEMIRA-M gains a systematic weighting procedure, and a rule-based weight assignment can be performed for the risk criteria. In addition, this is the first study to propose a three-stage KEMIRA-M and DEMATEL approach as a holistic RA for prioritizing measures; no previous study has considered risk criteria, danger sources, and measures simultaneously to prioritize measures. The study provides a new comprehensive approach for experts and executives to plan measure applications considering risk criteria and danger sources.
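The DEMATEL step that turns a direct-relation matrix into a total-relation matrix, T = N + N² + N³ + … = N(I − N)⁻¹, can be sketched with a convergent series summation. The example matrix is ours, and this shows only the standard DEMATEL computation, not the study's KEMIRA-M integration:

```python
def normalize(D):
    """Normalize the direct-relation matrix by its largest row sum."""
    s = max(sum(row) for row in D)
    return [[x / s for x in row] for row in D]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def total_relation(D, terms=200):
    """Total-relation matrix T = N + N^2 + ... ~= N (I - N)^{-1},
    summed until the higher powers are negligible."""
    N = normalize(D)
    T = [row[:] for row in N]
    P = [row[:] for row in N]  # current power N^k
    for _ in range(terms):
        P = matmul(P, N)
        T = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(T, P)]
    return T

# Toy 2x2 direct-relation matrix of pairwise influence scores
D = [[0.0, 3.0],
     [2.0, 0.0]]
T = total_relation(D)
```

For this D, normalization gives N = [[0, 1], [2/3, 0]], and the closed form N(I − N)⁻¹ evaluates to [[2, 3], [2, 2]], which the truncated series reproduces.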
Incorporating knowledge graphs in recommendation systems is promising, as knowledge graphs can serve as side information that alleviates the sparsity and cold start problems. However, existing works essentially assume that this side information (i.e., the knowledge graph) is complete, which may lead to sub-optimal performance. Meanwhile, semantic hierarchies are prevalent in applications, and many existing approaches fail to model this semantic characteristic; modeling the semantic structure between items in recommendation systems is a crucial challenge. Therefore, it is crucial both to address the incompleteness of knowledge graphs when integrating them into recommendation systems and to represent the hierarchical structure contained in items. In this paper, we propose Paguridae, a framework that utilizes the item recommendation task to assist the link prediction task. The core idea of Paguridae is that the two tasks automatically share latent features between items and entities. We adopt two main structures to model the hierarchy between items and entities: to model the hierarchy in items, we adopt graph convolutional networks as the representation learning method; to model the hierarchy in entities, we use the Hirec model, which maps entities into a polar coordinate system. Under this framework, users get better recommendations and knowledge graphs can be completed, as the two tasks have a mutual effect. Experiments on two real-world datasets show that Paguridae can be trained effectively, improving F1-score by 62.51% and precision by 49.31% compared to state-of-the-art methods.
Improving news understanding by matching a correlated web table: a concrete example of news/table matching (adapted from the original web page)
An overview of the News-Table Matching Model. Our model learns a joint representation for a ⟨news, table⟩ tuple by applying three network blocks: Bi-Context, Attention, and Transformer. The embedding vector is the FastText representation from a pre-trained corpus; the bi-context vector holds the contextual vectors learned from the input data; the attention matrix captures the matching degree between news and table aspects; and the attention vectors are the final matching signals for each pair of inputs. An MLP architecture then captures relevant matches and produces the similarity score at its top layer
Nowadays, digital-news readers are often overwhelmed by the deluge of online information. One approach to addressing this problem is to outline the news story by highlighting its most relevant facts; for example, recent studies summarize news articles by generating representative headlines. In this paper, we go further and argue that news understanding can also be enhanced by surfacing contextual data relevant to the article, such as structured web tables. Specifically, our goal is to match news articles and web tables for news augmentation, and we introduce a novel BERT-based attention model to compute this matching degree. Through an extensive experimental evaluation over Wikipedia tables, we compare the performance of our model with standard IR techniques, document/sentence encoders, and neural IR models for this task. The overall results show that our model outperforms all baselines at different levels of accuracy and in the mean reciprocal rank measure.
Top-cited authors
Diane J. Cook
  • Washington State University
Samaneh Aminikhanghahi
  • Washington State University
Philippe Fournier Viger
  • Shenzhen University
Carlos A. Coello Coello
  • Center for Research and Advanced Studies of the National Polytechnic Institute
Abdullah Gani
  • University of Malaya