Article

Semantic similarity controllers: On the trade-off between accuracy and interpretability


Abstract

In recent times, we have seen an explosion in the number of new solutions to the problem of semantic similarity. In this context, solutions of a neural nature seem to obtain the best results. However, they suffer from low interpretability and require a large amount of resources for training. In this work, we focus on a data-driven approach to the design of semantic similarity controllers. The goal is to offer the human operator a set of solutions in the form of a Pareto front, allowing the configuration that best suits a specific use case to be chosen. To do that, we explore multi-objective evolutionary algorithms that can help find break-even points for the problem of accuracy versus interpretability.
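To make the Pareto-front idea concrete, below is a minimal Python sketch (not the paper's implementation; the candidate controllers and objective values are invented) of how the non-dominated set over two competing objectives is extracted:

```python
# Minimal sketch: extracting the Pareto-optimal set of candidate
# controllers under two objectives. 'candidates' pairs each controller
# with (error, complexity), both to be minimized; the names and numbers
# are illustrative, not taken from the paper.

def dominates(a, b):
    """True if objective vector a is at least as good as b everywhere
    and strictly better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep only candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o["objectives"], c["objectives"])
                       for o in candidates if o is not c)]

candidates = [
    {"name": "ctrl-A", "objectives": (0.12, 45)},  # (error, #rules)
    {"name": "ctrl-B", "objectives": (0.15, 12)},
    {"name": "ctrl-C", "objectives": (0.15, 30)},  # dominated by ctrl-B
]
print([c["name"] for c in pareto_front(candidates)])  # ['ctrl-A', 'ctrl-B']
```

The human operator then picks a point from this front according to how much accuracy a given use case is willing to trade for interpretability.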


... The great advantage is that a wide range of membership functions can be defined using only a limited number of points [33]. ...
... One of the traditional advantages of Mamdani's models over other approaches, e.g., Takagi-Sugeno's [36], is that they facilitate interpretability [33]. This is because Mamdani inference is well suited to human input, while Takagi-Sugeno inference is assumed to be well suited to analysis [37]. ...
... Semantic similarity is an essential concept in NLP that has been extensively studied in the literature (Han et al., 2013; Lastra-Díaz & García-Serrano, 2015; Harispe et al., 2015; Zhu & Iglesias, 2017; Lastra-Díaz et al., 2019; Martinez-Gil & Chaves-Gonzalez, 2021). Traditional approaches to measuring semantic similarity are typically based on the study of inherent characteristics of the words (lexical methods) or their distribution in sufficiently meaningful text corpora (distributional semantics) (Martinez-Gil & Chaves-Gonzalez, 2022). ...
Preprint
Full-text available
The issue of word sense ambiguity poses a significant challenge in natural language processing due to the scarcity of annotated data with which to train machine learning models. Therefore, unsupervised word sense disambiguation methods have been developed to overcome that challenge without relying on annotated data. This research proposes a new context-aware approach to unsupervised word sense disambiguation which provides a flexible mechanism for incorporating contextual information into the similarity measurement process. We experiment with a popular benchmark dataset to evaluate the proposed strategy and compare its performance with state-of-the-art unsupervised word sense disambiguation techniques. The experimental results indicate that our approach substantially enhances disambiguation accuracy and surpasses the performance of several existing techniques. Our findings underscore the significance of integrating contextual information in semantic similarity measurements to manage word sense ambiguity in unsupervised scenarios effectively.
... In situations like this, different solutions can simultaneously fulfill all objectives. As a result, all optimal solutions ought to be regarded as of equivalent merit in the absence of external evaluation by a human operator (Martinez-Gil & Chaves-Gonzalez, 2021). More formally, we can model a MOO problem as expressed in Equation 5. ...
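The snippet above points to Equation 5 of the citing preprint, which is not reproduced here; for orientation, the standard form of a multi-objective optimization problem that such a definition typically follows is (a sketch of the usual notation, not necessarily the preprint's exact equation):

```latex
\min_{x \in \Omega} \; F(x) = \bigl(f_1(x),\, f_2(x),\, \dots,\, f_k(x)\bigr)
```

where typically no single solution minimizes all objectives simultaneously; a solution is Pareto-optimal if no other feasible solution is at least as good on every objective and strictly better on at least one.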
Preprint
Full-text available
This research presents ORUGA, a method that tries to automatically optimize the readability of any text in English. The core idea behind the method is that certain factors affect the readability of a text, some of which are quantifiable (number of words, syllables, presence or absence of adverbs, and so on). The nature of these factors allows us to implement a genetic learning strategy to replace some existing words with their most suitable synonyms to facilitate optimization. In addition, this research seeks to preserve both the original text's content and form through multi-objective optimization techniques. In this way, neither the text's syntactic structure nor the semantic content of the original message is significantly distorted. An exhaustive study on a substantial number and diversity of texts confirms that our method was able to optimize the degree of readability in all cases without significantly altering their form or meaning. The source code of this approach is available at https://github.com/jorge-martinez-gil/oruga.
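The abstract mentions quantifiable readability factors (words, syllables, and so on). One widely used score built from exactly such counts, shown here only as a plausible example of the kind of objective a genetic optimizer could maximize (the paper may use a different formula), is the Flesch Reading Ease:

```latex
\mathrm{FRE} \;=\; 206.835 \;-\; 1.015\,\frac{\text{words}}{\text{sentences}} \;-\; 84.6\,\frac{\text{syllables}}{\text{words}}
```

Under such a score, replacing a word with a shorter synonym raises readability, while the multi-objective constraints keep the form and meaning of the text close to the original.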
... The most significant issue is that an increasing number of fundamental methods can estimate semantic similarity [20,21]. There is a profusion of appropriate approaches available, each founded on quite distinct ideas and presumptions [22], and the knowledge engineer is at a loss as to which one to apply [23]. ...
Article
The challenge of assessing semantic similarity between pieces of text through computers has attracted considerable attention from industry and academia. New advances in neural computation have developed very sophisticated concepts, establishing a new state of the art in this respect. In this paper, we go one step further by proposing new techniques built on the existing methods. To do so, we bring to the table the stacking concept that has given such good results and propose a new architecture for ensemble learning based on genetic programming. As there are several possible variants, we compare them all and try to establish which one is the most appropriate to achieve successful results in this context. Analysis of the experiments indicates that Cartesian Genetic Programming seems to give better average results.
... The main difficulty comes from the nature of human language and the fact that text suffers from ambiguity in most cases. In addition, sentences and questions about the same topic and case can be formulated differently [18]. Language is very dynamic, and people can ask a question in almost infinite different ways. ...
... In this direction, Martinez-Gil and Chaves-Gonzalez comprehensively analyze many multi-objective evolutionary learning strategies, focusing on modeling an appropriate balance between accuracy and interpretability. The goal is to assess which one is the most suitable for developing semantic similarity controllers (Martinez-Gil & Chaves-Gonzalez, 2021). Fig. 1 shows us what training and test results should typically look like. ...
Article
This article presents a comprehensive review of stacking methods commonly used to address the challenge of automatic semantic similarity measurement in the literature. More than two decades of research have left a wide variety of semantic similarity measures, and scientists and practitioners often find it difficult to choose the best method to put into production. For this reason, a novel generation of strategies has been proposed that use basic semantic similarity measures as base estimators to achieve better performance than could be gained from any individual measure. In this work, we analyze different stacking techniques, ranging from classical algebraic methods to the most powerful ones based on hybridization, including blending, neural, fuzzy, and genetic-based stacking. Each technique excels in aspects such as simplicity, robustness, accuracy, interpretability, transferability, or a favorable combination of several of those aspects. The goal is to give the reader an overview of the state of the art in this field.
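As a toy illustration of the stacking idea reviewed here, the sketch below combines two invented base measures with a simple linear meta-estimator trained on gold-standard scores (assuming scikit-learn is available); real stacks use many stronger base measures and, in the hybrid variants, neural, fuzzy, or genetic combiners:

```python
# Sketch of stacking for semantic similarity: base measures feed a
# meta-estimator trained against human gold-standard scores.
# The base measures here are stand-ins, not the survey's actual ones.
import numpy as np
from sklearn.linear_model import Ridge  # a simple algebraic-style combiner

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def length_ratio(a: str, b: str) -> float:
    return min(len(a), len(b)) / max(len(a), len(b))

BASE_MEASURES = [jaccard, length_ratio]  # real systems stack many more

def features(pairs):
    return np.array([[m(a, b) for m in BASE_MEASURES] for a, b in pairs])

# Tiny illustrative training set: (sentence pair, human similarity in [0, 1])
train_pairs = [("a cat sat", "a cat sat"), ("a cat sat", "stock markets fell")]
train_gold = np.array([1.0, 0.05])

stacker = Ridge(alpha=1.0).fit(features(train_pairs), train_gold)
print(stacker.predict(features([("a cat sat", "the cat sat")])))
```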
Chapter
This research work presents a new augmentation model for knowledge graphs (KGs) that increases the accuracy of knowledge graph question answering (KGQA) systems. Currently, large KGs can represent millions of facts. However, due to the many nuances of human language, the answer to a given question cannot always be found, or correct results are not always returned. Frequently, this problem occurs because the way the question is formulated does not match the information represented in the KG. Therefore, KGQA systems need to be improved to address this problem. We present a suite of augmentation techniques so that a wide variety of KGs can be automatically augmented, thus increasing the chances of finding the correct answer to a question. The first results from an extensive empirical study seem promising.
Chapter
Full-text available
Billboard advertisement is one of the dominant modes of traditional outdoor advertising. A billboard operator manages the ad slots of a set of billboards. Normally, a user traversal is exposed to multiple billboards. Given a set of billboards, there is an opportunity to improve the revenue of the billboard operator by satisfying the advertising demands of an increased number of clients and ensuring that a user gets exposed to different ads on the billboards during the traversal. In this paper, we propose a framework to improve the revenue of the billboard operator by employing transactional modeling in conjunction with pattern mining. Our main contributions are three-fold. First, we introduce the problem of billboard advertisement allocation for improving the billboard operator's revenue. Second, we propose an efficient user-trajectory-based transactional framework using coverage pattern mining for improving the revenue of the billboard operator. Third, we conduct a performance study with a real dataset to demonstrate the effectiveness of our proposed framework. Keywords: billboard advertisement, data mining, pattern mining, transactional modeling, user trajectory, ad revenue.
Article
Recent efforts adopt interaction-based models that construct the interactions of words between sentences, aiming to predict whether two sentences are semantically equivalent in the semantic textual similarity (STS) task. However, these methods lack global semantic awareness, which makes it difficult to distinguish syntactic differences, and they also suffer from high inference time cost, primarily due to the calculation of the pairwise interactions of words. A novel model called Locality-Sensitive Hashing Relational Graph Matching Network (LSHRGMN) is therefore proposed, which tackles these problems with a syntactic dependency graph and locality-sensitive hashing (LSH). Specifically, the syntactic dependency graph captures global semantic information by rooting a tree in each word and merging all the trees into one graph. The LSH mechanism is introduced into the pairwise word interactions to address the inference efficiency problem. Extensive experiments are conducted on three real-world datasets, and the results show that the proposed approach achieves higher accuracy and competitive inference speed.
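A minimal sketch of the LSH ingredient, under the common random-hyperplane scheme for cosine similarity (the paper's exact hashing scheme is not reproduced here): only word pairs falling into the same bucket need exact pairwise interaction.

```python
# Sketch of locality-sensitive hashing (random hyperplanes) to prune
# word-pair interactions: only pairs whose vectors share a hash bucket
# are compared exactly. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PLANES = 50, 8
planes = rng.standard_normal((N_PLANES, DIM))

def lsh_bucket(v: np.ndarray) -> int:
    """Sign pattern against random hyperplanes -> integer bucket id."""
    bits = (planes @ v) > 0
    return int(np.packbits(bits, bitorder="little")[0])

words = {w: rng.standard_normal(DIM) for w in ["bank", "river", "money", "cash"]}
buckets = {}
for w, v in words.items():
    buckets.setdefault(lsh_bucket(v), []).append(w)

# Exact similarity is then computed only within each bucket.
print(buckets)
```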
Article
Full-text available
Human similarity and relatedness judgements between concepts underlie most cognitive capabilities, such as categorisation, memory, decision-making and reasoning. For this reason, the proposal of methods for estimating the degree of similarity and relatedness between words and concepts has been a very active line of research in the fields of artificial intelligence, information retrieval and natural language processing, among others. The main approaches proposed in the literature can be categorised into two large families: (1) Ontology-based semantic similarity Measures (OM), and (2) distributional measures, whose most recent and successful methods are based on Word Embedding (WE) models. However, the lack of a deep analysis of both families of methods slows down the advance of this line of research and its applications. This work introduces the largest, most detailed and reproducible experimental survey of OM measures and WE models reported in the literature, based on the evaluation of both families of methods on the same software platform, with the aim of elucidating the state of the problem. We show that WE models which combine distributional and ontology-based information obtain the best results, and in addition, we show for the first time that a simple average of the two best-performing WE models with other ontology-based measures or WE models is able to improve the state of the art by a large margin. In addition, we provide a very detailed reproducibility protocol together with a collection of software tools and datasets as complementary material to allow the exact replication of our results.
Article
Full-text available
In almost no other field of computer science has the idea of using bio-inspired search paradigms been as useful as in solving multiobjective optimization problems. The idea of using a population of search agents that collectively approximate the Pareto front resonates well with processes in natural evolution, immune systems, and swarm intelligence. Methods such as NSGA-II, SPEA2, SMS-EMOA, MOPSO, and MOEA/D have become standard solvers for multiobjective optimization problems. This tutorial reviews some of the most important fundamentals of multiobjective optimization and then introduces representative algorithms, illustrates their working principles, and discusses their application scope. In addition, the tutorial discusses statistical performance assessment. Finally, it highlights recent important trends and closely related research fields. The tutorial is intended for readers who want to acquire basic knowledge of the mathematical foundations of multiobjective optimization and state-of-the-art methods in evolutionary multiobjective optimization. The aim is to provide a starting point for research in this active area, and it should also help the advanced reader to identify open research topics.
Article
Full-text available
This work is a detailed companion reproducibility paper of the methods and experiments proposed by Lastra-Díaz and García-Serrano in [56, 57, 58], which introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML), based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks in the current semantic measures libraries, especially in performance and scalability, as well as in the evaluation of new methods and the replication of previous ones. The reproducible experiments introduced herein are encouraged by the lack of a set of large, self-contained and easily reproducible experiments with the aim of replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and difficulties in reproducing previously reported methods and experiments. PosetHERep proposes a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research into the area by providing a simpler and more efficient software architecture than the current software libraries. Finally, we show that HESML outperforms the state-of-the-art libraries, and that their performance and scalability can be significantly improved without caching by using PosetHERep.
Article
Full-text available
Semantic similarity measurement aims to determine the likeness between two text expressions that use different lexicographies for representing the same real object or idea. There are a lot of semantic similarity measures for addressing this problem. However, the best results have been achieved when aggregating a number of simple similarity measures. This means that after the various similarity values have been calculated, the overall similarity for a pair of text expressions is computed using an aggregation function of these individual semantic similarity values. This aggregation is often computed by means of statistical functions. In this work, we present CoTO (Consensus or Trade-Off) a solution based on fuzzy logic that is able to outperform these traditional approaches.
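As a loose illustration of a consensus-or-trade-off style aggregation (the membership function and rules below are invented for the sketch and are not CoTO's actual definitions):

```python
# Illustrative fuzzy aggregation of base similarity scores: when the
# measures agree, reinforce the shared value; when they disagree, fall
# back to a compromise. Membership function and rules are invented.

def agreement(scores):
    """Fuzzy degree of consensus: 1 when all scores coincide,
    0 when they span the whole [0, 1] range."""
    return 1.0 - (max(scores) - min(scores))

def aggregate(scores):
    mu = agreement(scores)
    consensus = sum(scores) / len(scores)       # value if measures agree
    trade_off = sorted(scores)[len(scores) // 2]  # median as a compromise
    return mu * consensus + (1.0 - mu) * trade_off

print(aggregate([0.80, 0.82, 0.78]))  # near-consensus -> close to 0.80
print(aggregate([0.10, 0.90, 0.50]))  # disagreement -> leans on the median
```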
Chapter
Full-text available
Fuzzy systems are universally acknowledged as valuable tools to model complex phenomena while preserving a readable form of knowledge representation. The resort to natural language for expressing the terms involved in fuzzy rules, in fact, is a key factor in conjugating mathematical formalism and logical inference with human-centered interpretability. That makes fuzzy systems specifically suitable in every real-world context where people are in charge of crucial decisions. This is because the self-explanatory nature of fuzzy rules profitably supports expert assessments. Additionally, as far as interpretability is investigated, it appears that (a) the simple adoption of fuzzy sets in modeling is not enough to ensure interpretability, and (b) fuzzy knowledge representation must confront the problem of preserving the overall system accuracy, yielding a trade-off which is frequently debated. Such issues have attracted growing interest in the research community and have come to assume a central role in the current literature panorama of computational intelligence. This chapter gives an overview of the topics related to fuzzy system interpretability, facing the ambitious goal of proposing some answers to a number of open challenging questions: What is interpretability? Why is interpretability worth considering? How can interpretability be ensured and assessed (quantified)? Finally, how can interpretable fuzzy models be designed?
Article
Full-text available
Fuzzy Logic Controllers are a specific model of Fuzzy Rule Based Systems suitable for engineering applications for which classic control strategies do not achieve good results or for which it is too difficult to obtain a mathematical model. Recently, the International Electrotechnical Commission has published a standard for fuzzy control programming in part 7 of the IEC 61131 norm in order to offer a well-defined common understanding of the basic means with which to integrate fuzzy control applications into control systems. In this paper, we introduce an open source Java library called jFuzzyLogic which offers a fully functional and complete implementation of a fuzzy inference system according to this standard, providing a programming interface and an Eclipse plugin to easily write and test code for fuzzy control applications. A case study is given to illustrate the use of jFuzzyLogic.
Article
Full-text available
This paper introduces the active learning of Pareto fronts (ALP) algorithm, a novel approach to recover the Pareto front of a multiobjective optimization problem. ALP casts the identification of the Pareto front into a supervised machine learning task. This approach enables an analytical model of the Pareto front to be built. The computational effort in generating the supervised information is reduced by an active learning strategy. In particular, the model is learned from a set of informative training objective vectors. The training objective vectors are approximated Pareto-optimal vectors obtained by solving different scalarized problem instances. The experimental results show that ALP achieves an accurate Pareto front approximation with a lower computational effort than state-of-the-art estimation of distribution algorithms and widely known genetic techniques.
Conference Paper
Full-text available
This paper proposes and evaluates an evolutionary multiobjective optimization algorithm (EMOA) that eliminates dominance ranking in selection and performs indicator-based selection with the R2 indicator. Although it is known that the R2 indicator possesses desirable properties for quantifying the goodness of a solution or a solution set, few attempts have been made until recently to investigate indicator-based EMOAs with the R2 indicator. The proposed EMOA, called R2-IBEA, is designed to obtain a diverse set of Pareto-approximated solutions by correcting an inherent bias in the R2 indicator. (The R2 indicator has a stronger bias to the center of the Pareto front than to its edges.) Experimental results demonstrate that R2-IBEA outperforms existing indicator-based, decomposition-based and dominance-ranking-based EMOAs in the optimality and diversity of solutions. R2-IBEA successfully produces diverse individuals that are distributed well in the objective space. It is also empirically verified that R2-IBEA scales well from two-dimensional to five-dimensional problems.
Article
Full-text available
We describe a new selection technique for evolutionary multiobjective optimization algorithms in which the unit of selection is a hyperbox in objective space. In this technique, instead of assigning a selective fitness to an individual, selective fitness is assigned to the hyperboxes in objective space which are currently occupied by at least one individual in the current approximation to the Pareto frontier. A hyperbox is thereby selected, and the resulting selected individual is...
Conference Paper
Full-text available
In this work we present a new hybrid cellular genetic algorithm. We take MOCell, a multi-objective cellular genetic algorithm, as a starting point and replace the typical genetic crossover and mutation operators with the reproductive operators used in differential evolution. An external archive is used to store the nondominated solutions found during the search process, and the SPEA2 density estimator is applied when the archive becomes full. We evaluate the resulting hybrid algorithm using a benchmark composed of three-objective test problems, and we compare the results with several state-of-the-art multi-objective metaheuristics. The obtained results show that our proposal outperforms the other algorithms according to the two considered quality indicators.
Conference Paper
Full-text available
In this work, we present a new multi-objective particle swarm optimization (PSO) algorithm characterized by the use of a strategy to limit the velocity of the particles. The proposed approach, called Speed-constrained Multi-objective PSO (SMPSO), allows new effective particle positions to be produced in those cases in which the velocity becomes too high. Other features of SMPSO include the use of polynomial mutation as a turbulence factor and an external archive to store the non-dominated solutions found during the search. Our proposed approach is compared with five multi-objective metaheuristics representative of the state of the art in the area. For the comparison, two different criteria are adopted: the quality of the resulting approximation sets and the convergence speed to the Pareto front. The experiments carried out indicate that SMPSO obtains remarkable results in terms of both accuracy and speed.
Article
Full-text available
Linguistic fuzzy modeling in high-dimensional regression problems poses the challenge of exponential rule explosion when the number of variables and/or instances becomes high. One way to address this problem is by determining the variables used, the linguistic partitioning and the rule set together, in order to evolve models that are very simple but still accurate. However, evolving these components together is a difficult task involving a complex search space. In this study, we propose an effective multiobjective evolutionary algorithm that, based on embedded genetic database (DB) learning (involved variables, granularities, and slight fuzzy-partition displacements), allows the fast learning of simple and quite accurate linguistic models. Some efficient mechanisms have been designed to ensure a very fast, but not premature, convergence in problems with a high number of variables. Further, since additional problems could arise for datasets with a large number of instances, we also propose a general mechanism for estimating the model error when using evolutionary algorithms by considering only a reduced subset of the examples. By doing so, we can also apply a fast postprocessing stage for further refining the learned solutions. We tested our approach on 17 real-world datasets with different numbers of variables and instances. Three well-known methods based on embedded genetic DB learning were executed as references. We compared the different approaches by applying nonparametric statistical tests for multiple comparisons. The results confirm the effectiveness of the proposed method not only in terms of scalability but also in terms of the simplicity and generalizability of the obtained models.
Article
Full-text available
In this paper, we propose an index that helps preserve the semantic interpretability of linguistic fuzzy models while a tuning of the membership functions (MFs) is performed. The proposed index is the aggregation of three metrics that preserve the original meanings of the MFs as much as possible while their definition parameters are tuned. Additionally, rule-selection mechanisms can be used to reduce the model complexity, which involves another important interpretability aspect. To this end, we propose a postprocessing multiobjective evolutionary algorithm that performs rule selection and tuning of fuzzy-rule-based systems with three objectives: accuracy, semantic interpretability maximization, and complexity minimization. We tested our approach on nine real-world regression datasets. In order to analyze the interaction between the rule-selection approach and the tuning approach, they are also individually tested in a multiobjective framework and compared with their respective single-objective counterparts. We compared the different approaches by applying nonparametric statistical tests for pairwise and multiple comparisons, taking into consideration three representative points from the obtained Pareto fronts in the case of the multiobjective approaches. The results confirm the effectiveness of our approach, and a wide range of solutions is obtained, which are not only more interpretable but also more accurate.
Article
Full-text available
The hypervolume measure (or S metric) is a frequently applied quality measure for comparing the results of evolutionary multiobjective optimisation algorithms (EMOA). The new idea is to aim explicitly for the maximisation of the dominated hypervolume within the optimisation process. A steady-state EMOA is proposed that features a selection operator based on the hypervolume measure combined with the concept of non-dominated sorting. The algorithm's population evolves to a well-distributed set of solutions, thereby focussing on interesting regions of the Pareto front. The performance of the devised S-metric selection EMOA (SMS-EMOA) is compared to state-of-the-art methods on two- and three-objective benchmark suites as well as on aeronautical real-world applications.
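For the two-objective case, the dominated hypervolume that SMS-EMOA maximizes can be computed with a simple sweep; a minimal sketch for minimization problems:

```python
# Sketch of the hypervolume (S metric) for a 2-D minimization front:
# the area dominated by the front and bounded by a reference point.
def hypervolume_2d(front, ref):
    """front: non-dominated (f1, f2) points; ref is dominated by all of them."""
    pts = sorted(front)              # ascending in f1 -> f2 descends
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # add one rectangular slab
        prev_f2 = f2
    return hv

front = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(5.0, 5.0)))  # 4 + 6 + 2 = 12.0
```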
Chapter
Interpretability has always been present in Machine Learning and Artificial Intelligence. However, it is difficult to measure (even to define), and quite commonly it collides with other properties, such as accuracy, that have a clear meaning and well-defined metrics. This situation has reduced its influence in the area. But due to different external reasons, interpretability is now gaining importance in Artificial Intelligence, and particularly in Machine Learning. This new situation has two effects on the field of fuzzy systems. First, considering the capability of the fuzzy formalism to describe complex phenomena in terms that are quite close to human language, fuzzy systems have gained significant presence as an interpretable modeling tool. Second, the attention paid to the interpretability of fuzzy systems, which grew during the first decade of this century and then experienced a certain decay, is growing again. The present paper considers four questions regarding interpretability: what it is, why it is important, how to measure it, and how to achieve it. These questions are first introduced in the general framework of Artificial Intelligence and then focused from the point of view of fuzzy systems.
Article
The problem of automatically measuring the degree of semantic similarity between textual expressions is a challenge that consists of calculating the degree of likeness between two text fragments that have none or few features in common, according to human judgment. In recent times, several machine learning methods have established a new state of the art regarding accuracy, but little or no attention has been paid to their interpretability, i.e. the extent to which an end-user can understand the cause of the output of these approaches. Although solutions based on symbolic regression already exist in the field of clustering (Lensen et al., 2019), we propose here a new approach that reaches high levels of interpretability without sacrificing accuracy in the context of semantic textual similarity. A complete empirical evaluation using several benchmark datasets shows that our approach yields promising results in a wide range of scenarios.
Article
Recent advances in machine learning have improved the state of the art in semantic similarity measurement techniques. In fact, we have all seen how classical techniques have given way to promising neural techniques. Nonetheless, these new techniques have a weak point: they are hardly interpretable. For this reason, we have oriented our research towards the design of strategies that are accurate enough without sacrificing interpretability. As a result, we have obtained a strategy for the automatic design of semantic similarity controllers based on fuzzy logic, which are automatically identified using genetic algorithms (GAs). After an exhaustive evaluation using a number of well-known benchmark datasets, we can conclude that our strategy fulfills both expectations: it is capable of achieving reasonably good results and, at the same time, offers high degrees of interpretability.
Article
A new evolutionary-learning algorithm is proposed to learn a decision maker (DM)'s best solution on a conflicting multiobjective space. Given exemplary pairwise comparisons of solutions by a DM, we learn an ideal point (for the DM) that is used to evolve toward a better set of solutions. The process is repeated to obtain the DM's best solution. Comparing solutions in pairs facilitates the process of eliciting training information for the proposed learning model. An experimental study on standard multiobjective data sets shows that the proposed method accurately identifies a DM's preferred zone in relatively few generations and with a small number of preferences. Besides, it is found to be robust to inconsistencies in the preference statements. The results obtained are validated through a variant of the established NSGA-II algorithm.
Conference Paper
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
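For reference, the negative sampling alternative mentioned here trains each observed (input word, output word) pair against k sampled noise words; the objective has the following form (standard notation from the original paper, reproduced from memory and best treated as a sketch):

```latex
\log \sigma\!\left({v'_{w_O}}^{\!\top} v_{w_I}\right)
\;+\; \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\!\top} v_{w_I}\right)\right]
```

where sigma is the logistic function and P_n(w) is the noise distribution from which negative samples are drawn.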
Article
Multi-objective metaheuristics have become popular techniques for dealing with complex optimization problems composed of a number of conflicting functions. Nowadays, we are in the Big Data era, so metaheuristics must be able to solve dynamic problems that may vary over time due to the processing and analysis of several streaming data sources. As this is a new field, there is a need for software platforms to solve dynamic multi-objective Big Data optimization problems. In this paper, we present jMetalSP, which combines the multi-objective optimization features of the jMetal framework with the streaming facilities of the Apache Spark cluster computing system. Thus, existing state-of-the-art multi-objective metaheuristics can be easily adapted to deal with dynamic optimization problems that are fed by multiple streaming data sources. Moreover, these algorithms can take advantage of the parallel computing features of Spark. We describe the architecture of jMetalSP and show how it can be used to solve a dynamic bi-objective instance of the Traveling Salesman Problem (TSP) based on New York City's real-time traffic data. We have also carried out an experimental study to assess the performance of the resultant jMetalSP application in a Hadoop cluster composed of 100 nodes.
Article
The measurement of the semantic relatedness between words has gained increasing interest in several research fields, including cognitive science, artificial intelligence, biology, and linguistics. The development of efficient measures is based on knowledge resources, such as Wikipedia, a huge and living encyclopedia supplied by net surfers. In this paper, we propose a novel approach based on a multi-Layered Wikipedia representation for Computing word Relatedness (LWCR), exploiting a weighting scheme based on the Wikipedia Category Graph (WCG): Term Frequency-Inverse Category Frequency (tf×icf). Our proposal provides, for each category pertaining to the WCG, a Category Description Vector (CDV) including the weights of stems extracted from articles assigned to the category. The semantic relatedness degree is computed using the cosine measure between the CDVs assigned to the target word pair. The basic idea is followed by enhancement modules exploiting other Wikipedia features, such as article titles, the redirection mechanism, and neighborhood category enrichment, to exploit semantic features and better quantify the semantic relatedness between words. To the best of our knowledge, this is the first attempt to incorporate a WCG-based term-weighting scheme (tf×icf) into a computational model of semantic relatedness. It is also the first work that exploits 17 datasets in the assessment process, divided into two sets. The first set includes those designed for semantic similarity purposes: RG65, MC30, AG203, WP300, SimLexNoun666 and GeReSiD50Sim; the second includes datasets for semantic relatedness evaluation: WordSim353, GM30, Zeigler25, Zeigler30, MTurk287, MTurk771, MEN3000, Rel122, ReWord26, GeReSiD50 and SCWS1229. The results are compared to WordNet-based measures and to the distributional measures cosine and PMI computed on Wikipedia articles. Experiments show that our approach provides consistent improvements over state-of-the-art results on multiple benchmarks.
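The abstract names the tf×icf scheme without giving its formula; by direct analogy with tf-idf, a weighting of this name would typically take the form below. This is an assumption for orientation only, since the paper's exact normalization may differ:

```latex
\mathrm{tf \times icf}(t, c) \;=\; \mathrm{tf}(t, c)\times\log\frac{|C|}{|\{\,c' \in C : t \in c'\,\}|}
```

where C is the set of categories in the Wikipedia Category Graph and tf(t, c) counts occurrences of stem t in articles assigned to category c.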
Conference Paper
Convergence and diversity are two main goals in multiobjective optimization. In the literature, most existing multiobjective evolutionary algorithms (MOEAs) adopt a convergence-first-and-diversity-second environmental selection that prefers nondominated solutions to dominated ones, as is the case with the popular nondominated-sorting-based selection method. While convergence-first sorting has continuously shown effectiveness for handling a variety of problems, it faces challenges in maintaining population diversity well due to its overemphasis on convergence. In this paper, we propose a general diversity-first sorting method for multiobjective optimization. Based on the method, a new MOEA, called DBEA, is then introduced. DBEA is compared with the recently developed nondominated sorting genetic algorithm III (NSGA-III) on different problems. Experimental studies show that the diversity-first method has great potential for diversity maintenance and is very competitive for many-objective optimization.
Article
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
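A minimal sketch of the bag-of-character-n-grams idea (the hashing trick, table sizes, and the omission of a whole-word vector are simplifications for illustration, not the reference implementation):

```python
# Sketch: a word vector is the average of the vectors of its character
# n-grams (with boundary markers), so rare and unseen words still get
# representations. Real systems use millions of hash buckets.
import numpy as np

DIM, BUCKETS = 50, 2**15
table = np.random.default_rng(0).standard_normal((BUCKETS, DIM)) * 0.01

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                       # boundary markers
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    grams = char_ngrams(word)
    return sum(table[hash(g) % BUCKETS] for g in grams) / len(grams)

v = word_vector("unfathomable")  # works even for words unseen in training
```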
Article
Semantic similarity measurement of biomedical nomenclature aims to determine the likeness between two biomedical expressions that use different lexicographies to represent the same real biomedical concept. There are many semantic similarity measures that try to address this issue, and many of them have represented an incremental improvement over the previous ones. In this work, we present yet another incremental solution that is able to outperform existing approaches by using a sophisticated aggregation method based on fuzzy logic. Results show that our strategy consistently beats existing approaches on well-known biomedical benchmark data sets.
Article
The distributional hypothesis of Harris (1954), according to which the meaning of words is evidenced by the contexts they occur in, has motivated several effective techniques for obtaining vector space semantic representations of words using unannotated text corpora. This paper argues that lexico-semantic content should additionally be invariant across languages and proposes a simple technique based on canonical correlation analysis (CCA) for incorporating multilingual evidence into vectors generated monolingually. We evaluate the resulting word representations on standard lexical semantic evaluation tasks and show that our method produces substantially better semantic representations than monolingual techniques.
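A small sketch of the CCA step on placeholder data (assuming scikit-learn is available), where the rows of X and Y are the monolingual vectors of translation-pair words from two embedding spaces:

```python
# Sketch: project monolingual embeddings into a shared space with CCA,
# following the abstract's idea. Random data stands in for real
# embeddings of translation pairs (e.g., English/German).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))  # vectors of 500 source-language words
Y = rng.standard_normal((500, 100))  # vectors of their translations

cca = CCA(n_components=40, max_iter=1000)
cca.fit(X, Y)
X_shared, Y_shared = cca.transform(X, Y)  # maximally correlated projections
```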
Article
Having developed multiobjective optimization algorithms using evolutionary optimization methods and demonstrated their niche on various practical problems involving mostly two and three objectives, there is now a growing need for developing evolutionary multiobjective optimization (EMO) algorithms for handling many-objective (having four or more objectives) optimization problems. In this paper, we recognize a few recent efforts and discuss a number of viable directions for developing a potential EMO algorithm for solving many-objective optimization problems. Thereafter, we suggest a reference-point-based many-objective evolutionary algorithm following NSGA-II framework (we call it NSGA-III) that emphasizes population members that are nondominated, yet close to a set of supplied reference points. The proposed NSGA-III is applied to a number of many-objective test problems with three to 15 objectives and compared with two versions of a recently suggested EMO algorithm (MOEA/D). While each of the two MOEA/D methods works well on different classes of problems, the proposed NSGA-III is found to produce satisfactory results on all problems considered in this paper. This paper presents results on unconstrained problems, and the sequel paper considers constrained and other specialties in handling many-objective optimization problems.
Conference Paper
Unsupervised word representations are very useful in NLP tasks both as inputs to learning algorithms and as extra word features in NLP systems. However, most of these models are built with only local context and one representation per word. This is problematic because words are often polysemous and global context can also provide useful information for learning word meanings. We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word. We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
Article
Over the past few decades, fuzzy systems have been widely used in several application fields, thanks to their ability to model complex systems. The design of fuzzy systems has been successfully performed by applying evolutionary and, in particular, genetic algorithms, and recently, this approach has been extended by using multiobjective evolutionary algorithms, which can consider multiple conflicting objectives, instead of a single one. The hybridization between multiobjective evolutionary algorithms and fuzzy systems is currently known as multiobjective evolutionary fuzzy systems. This paper presents an overview of multiobjective evolutionary fuzzy systems, describing the main contributions on this field and providing a two-level taxonomy of the existing proposals, in order to outline a well-established framework that could help researchers who work on significant further developments. Finally, some considerations of recent trends and potential research directions are presented.
Article
Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear explanation of how it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that it outperforms traditional PMI in the application of automatic thesaurus generation and in two word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions. PMImax achieves a correlation coefficient comparable to the best knowledge-based approaches on the Miller-Charles similarity rating data set.
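For reference, PMI between two words is defined from corpus counts as:

```latex
\mathrm{PMI}(w_1, w_2) \;=\; \log\frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}
```

with the probabilities estimated from occurrence and co-occurrence counts; the PMImax variant described above additionally adjusts this value with information about the words' numbers of senses.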
Article
The relationship between semantic and contextual similarity is investigated for pairs of nouns that vary from high to low semantic similarity. Semantic similarity is estimated by subjective ratings; contextual similarity is estimated by the method of sorting sentential contexts. The results show an inverse linear relationship between similarity of meaning and the discriminability of contexts. This relation is obtained for two separate corpora of sentence contexts. It is concluded that, on average, for words in the same language drawn from the same syntactic and semantic categories, the more often two words can be substituted into the same contexts, the more similar in meaning they are judged to be.
Article
Different studies have proposed methods for mining fuzzy association rules from quantitative data, where the membership functions were assumed to be known in advance. However, it is not an easy task to know a priori the most appropriate fuzzy sets covering the domains of quantitative attributes for mining fuzzy association rules. This paper thus presents a new fuzzy data-mining algorithm for extracting both fuzzy association rules and membership functions by means of a genetic learning of the membership functions and a basic method for mining fuzzy association rules. It is based on the 2-tuples linguistic representation model, allowing us to adjust the context associated with the linguistic term membership functions. Experimental results show the effectiveness of the framework.
Article
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
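A compact numpy sketch of the technique on a toy term-document matrix: truncate the SVD to k factors, represent documents by factor weights, and fold queries into the same space (one common fold-in variant; term weighting and other details are omitted):

```python
# Sketch of latent semantic indexing: rank-k SVD of the term-document
# matrix, then cosine comparison of a query against documents in the
# reduced space. The matrix here is a toy placeholder.
import numpy as np

A = np.array([[2., 0., 1.],   # rows: terms, columns: documents
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

docs = (np.diag(sk) @ Vtk).T      # document vectors of factor weights
q = np.array([1., 1., 0., 0.])    # query as a raw term vector
q_hat = q @ Uk                    # fold the query into the latent space

cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(cos)  # documents with supra-threshold cosine would be returned
```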
Chapter
This chapter explains evolutionary multiobjective design of fuzzy rule-based systems in comparison with single-objective design. Evolutionary algorithms have been used in many studies on fuzzy system design for rule generation, rule selection, input selection, fuzzy partition, and membership function tuning. Those studies are referred to as genetic fuzzy systems because genetic algorithms have mainly been used as the evolutionary algorithms. In many studies on genetic fuzzy systems, the accuracy of fuzzy rule-based systems is maximized. However, accuracy maximization often leads to a deterioration in the interpretability of fuzzy rule-based systems due to the increase in their complexity. Thus, multiobjective genetic algorithms have been used in some studies to maximize not only the accuracy of fuzzy rule-based systems but also their interpretability. Those studies, which can be viewed as a subset of genetic fuzzy system studies, are referred to as multiobjective genetic fuzzy systems (MoGFS). A number of fuzzy rule-based systems with different complexities are obtained along the interpretability-accuracy tradeoff curve. One extreme of the tradeoff curve is a simple, highly interpretable fuzzy rule-based system with low accuracy, while the other extreme is a complicated, highly accurate one with low interpretability. In MoGFS, multiple accuracy measures, such as a true positive rate and a true negative rate, can be used simultaneously as separate objectives. Multiple interpretability measures can also be used simultaneously in MoGFS.
Article
A mathematical tool to build a fuzzy model of a system where fuzzy implications and reasoning are used is presented. The premise of an implication is the description of a fuzzy subspace of inputs, and its consequence is a linear input-output relation. The method of identifying a system using its input-output data is then shown. Two applications of the method to industrial processes are also discussed: a water cleaning process and a converter in a steel-making process.
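This is the Takagi-Sugeno model referred to as [36] in the snippets above. In its standard form, each rule has a fuzzy premise and a linear consequent, and the model output is the firing-strength-weighted average of the rule consequents:

```latex
R_i:\ \text{IF } x_1 \text{ is } A_{i1} \text{ and } \dots \text{ and } x_n \text{ is } A_{in}
\ \text{THEN } y_i = a_{i0} + a_{i1} x_1 + \dots + a_{in} x_n

\hat{y} \;=\; \frac{\sum_i w_i\, y_i}{\sum_i w_i},
\qquad w_i = \prod_{j} \mu_{A_{ij}}(x_j)
```

The linear consequents make the model convenient for analysis, which is precisely the contrast with Mamdani models drawn in the citing text.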
Article
The need to trade off interpretability and accuracy is intrinsic to the use of fuzzy systems. Obtaining accurate but also human-comprehensible fuzzy systems played a key role in Zadeh and Mamdani's seminal ideas and system identification methodologies. Nevertheless, before the advent of soft computing, accuracy progressively became the main concern of fuzzy model builders, making the resulting fuzzy systems get closer to black-box models such as neural networks. Fortunately, the fuzzy modeling scientific community has come back to its origins by considering design techniques dealing with the interpretability-accuracy tradeoff. In particular, the use of genetic fuzzy systems has been widely extended thanks to their inherent flexibility and their capability to jointly consider different optimization criteria. The current contribution constitutes a review of the most representative genetic fuzzy systems relying on Mamdani-type fuzzy rule-based systems to obtain interpretable linguistic fuzzy models with good accuracy.
Article
This paper examines the interpretability-accuracy tradeoff in fuzzy rule-based classifiers using a multiobjective fuzzy genetics-based machine learning (GBML) algorithm. Our GBML algorithm is a hybrid version of Michigan and Pittsburgh approaches, which is implemented in the framework of evolutionary multiobjective optimization (EMO). Each fuzzy rule is represented by its antecedent fuzzy sets as an integer string of fixed length. Each fuzzy rule-based classifier, which is a set of fuzzy rules, is represented as a concatenated integer string of variable length. Our GBML algorithm simultaneously maximizes the accuracy of rule sets and minimizes their complexity. The accuracy is measured by the number of correctly classified training patterns while the complexity is measured by the number of fuzzy rules and/or the total number of antecedent conditions of fuzzy rules. We examine the interpretability-accuracy tradeoff for training patterns through computational experiments on some benchmark data sets. A clear tradeoff structure is visualized for each data set. We also examine the interpretability-accuracy tradeoff for test patterns. Due to the overfitting to training patterns, a clear tradeoff structure is not always obtained in computational experiments for test patterns.
Article
This article presents a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content. Experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge-counting approach. The article presents algorithms that take advantage of taxonomic similarity in resolving syntactic and semantic ambiguity, along with experimental results demonstrating their effectiveness.
Article
In this paper, we propose a multi-objective evolutionary algorithm (MOEA) to generate Mamdani fuzzy rule-based systems with different trade-offs between accuracy and complexity by learning concurrently granularities of the input and output partitions, membership function (MF) parameters and rules. To this aim, we introduce the concept of virtual and concrete partitions: the former is defined by uniformly partitioning each linguistic variable with a fixed maximum number of fuzzy sets; the latter takes into account, for each variable, the number of fuzzy sets determined by the evolutionary process. Rule bases and MF parameters are defined on the virtual partitions and, whenever a fitness evaluation is required, mapped to the concrete partitions by employing appropriate mapping strategies. The implementation of the MOEA relies on a chromosome composed of three parts, which codify the partition granularities, the virtual rule base and the membership function parameters, respectively, and on purposely-defined genetic operators. The MOEA has been tested on three real-world regression problems achieving very promising results. In particular, we highlight how starting from randomly generated solutions, the MOEA is able to determine different granularities for different variables achieving good trade-offs between complexity and accuracy.
Article
A methodology for encoding the chromosome of a genetic algorithm (GA) is described in this paper. The encoding procedure is applied to the problem of automatically generating fuzzy rule-based models from data. Models generated by this approach have much of the flexibility of black-box methods, such as neural networks. In addition, they implicitly express information about the process being modelled through the linguistic terms associated with the rules. They can be applied to problems that are too complex to model in a first-principles sense and can reduce the computational overhead when compared to established first-principles models. The encoding mechanism allows the rule-base structure and the parameters of the fuzzy model to be estimated simultaneously from data. The principal advantage is the preservation of the linguistic concept without the need to consider the entire rule base. The GA searches for the optimum solution given a comparatively small number of rules compared to all possible rules. This minimises the computational demand of model generation and allows problems with realistic dimensions to be considered. The implementation of the algorithm is described, and the approach is applied to the modelling of components of heating, ventilating and air-conditioning systems.
Article
Linguistic fuzzy modelling, developed by linguistic fuzzy rule-based systems, allows us to deal with the modelling of systems by building a linguistic model which can become interpretable by human beings. Linguistic fuzzy modelling comes with two contradictory requirements: interpretability and accuracy. In recent years the interest of researchers in obtaining more interpretable linguistic fuzzy models has grown. Whereas measures of accuracy are straightforward and well known, interpretability measures are difficult to define, since interpretability depends on several factors: mainly the model structure, the number of rules, the number of features, the number of linguistic terms, the shape of the fuzzy sets, etc. Moreover, due to the subjectivity of the concept, the choice of appropriate interpretability measures is still an open problem. In this paper, we present an overview of the proposed interpretability measures and techniques for obtaining more interpretable linguistic fuzzy rule-based systems. To this end, we propose a taxonomy based on a double axis: "complexity versus semantic interpretability", considering the two main kinds of measures; and "rule base versus fuzzy partitions", considering the different components of the knowledge base to which both kinds of measures can be applied. The main aim is to provide a well-established framework to facilitate a better understanding of the topic and well-founded future work.