Abstract
In this paper we present a novel approach to minimally supervised synonym extraction. The approach is based on word embeddings and aims to provide a method for synonym extraction that is extensible to various languages.
We report experiments with word vectors trained using both the continuous bag-of-words (CBoW) and skip-gram (SG) models, investigating the effects of different settings with respect to contextual window size, number of dimensions, and type of word vectors. We analyze the word categories that are (cosine) similar in the vector space, showing that cosine similarity on its own is a poor indicator of whether two words are synonymous. In this context, we propose a new measure, relative cosine similarity, which calculates similarity relative to other cosine-similar words in the corpus. We show that calculating similarity relative to other words boosts the precision of the extraction. We also experiment with combining similarity scores from differently-trained vectors and explore the advantages of using a part-of-speech tagger as a way of introducing some light supervision, thus aiding extraction.
We perform both intrinsic and extrinsic evaluation on our final system: intrinsic evaluation is carried out manually by two human evaluators and we use the output of our system in a machine translation task for extrinsic evaluation, showing that the extracted synonyms improve the evaluation metric.
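To make the proposed relative cosine similarity measure concrete, here is a minimal sketch assuming a gensim word2vec setup and a plain-text corpus file; the corpus path and hyperparameters are illustrative rather than the paper's exact configuration.

```python
# Minimal sketch: relative cosine similarity on top of gensim word2vec.
# "corpus.txt" and the hyperparameters below are placeholder assumptions.
from gensim.models import Word2Vec

sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]

# Train both architectures discussed above (sg=0 -> CBoW, sg=1 -> skip-gram).
cbow = Word2Vec(sentences, vector_size=300, window=5, min_count=10, sg=0)
sg = Word2Vec(sentences, vector_size=300, window=5, min_count=10, sg=1)

def relative_cosine_similarity(wv, w1, w2, topn=10):
    """cos(w1, w2) divided by the summed cosines of w1's topn nearest neighbours."""
    denom = sum(cos for _, cos in wv.most_similar(w1, topn=topn))
    return wv.similarity(w1, w2) / denom

# Scores above roughly 1/topn (0.10 for topn=10) flag likely synonym candidates.
print(relative_cosine_similarity(cbow.wv, "big", "large"))
```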
... Recently, important publications in the field of ASE used the statistical approach and gained significant precision [6], [7], and [8], but they did not consider efficiency, and the time required by their systems tends to be long. For example, Leeuwenberg et al. in [7] used a bag-of-words model with relative cosine similarity to extract term synonyms. In their work, the construction of the term-term weighted matrix is expensive in terms of space and time, and the repetitive computations add further delay. ...
... • The statistical techniques over monolingual corpora, such as Pointwise Mutual Information [14] and the Vector Space Model with cosine similarity or relative cosine similarity [7], [33], [34], [35], and [36]. • The translation techniques among different languages over bilingual or multilingual dictionaries (words that share the same interpretations are synonyms) [37], [38], and [39]. ...
The traditional statistical approach to synonym extraction is time-consuming, so a new method is needed to improve both efficiency and accuracy. This research presents a new synonym extraction method called Noun Based Distinctive Verbs (NBDV) that replaces the traditional tf-idf weighting scheme with a new weighting scheme called the Orbit Weighting Scheme (OWS). The OWS links nouns to their semantic space by examining the singular verbs in each context. The new method was compared with important models in the field such as the Skip-Gram, the Continuous Bag of Words, and the GloVe model. The NBDV model was applied to the Arabic and English languages and achieved 47% recall and 51% precision in the dictionary-based evaluation and 57.5% precision in the human experts' evaluation. Compared with synonym extraction based on tf-idf, the NBDV obtained 11% higher recall and 10% higher precision. Regarding efficiency, we found that, on average, the synonym extraction of a single noun requires processing 186 verbs, and in 63% of the runs the number of singular verbs was less than 200. It is concluded that the developed method is efficient and processes a single run in linear time.
... Specifically, our method incorporates a BERT model for identifying the various aspects of products and another BERT model to determine subfeature relations between aspects. In addition, we train word2vec [24] to extract word vectors which are used to group the aspects into synsets using the method proposed in [19]. Our method requires a very limited amount of hand-annotation, relying on a very small, manually created ontology and using distantly supervised learning where a large amount of text is automatically annotated. ...
... Once we have obtained the word embeddings for each of the aspects, we can use the relative cosine similarity of the vectors to group them into synsets. Note that we use relative cosine similarity as this is a more accurate measure for synonymy than cosine similarity [19]. The cosine similarity of word embeddings $w_i$ and $w_j$, relative to the top $n$ most similar words of $w_i$, is calculated as:

$$\mathrm{rcs}_n(w_i, w_j) = \frac{\cos(w_i, w_j)}{\sum_{w_c \in \mathrm{TOP}_n(w_i)} \cos(w_i, w_c)},$$

where $\mathrm{TOP}_n(w_i)$ is the set of the $n$ most similar words to $w_i$. If $\mathrm{rcs}_{10}(w_i, w_j) > 0.10$, $w_j$ is more similar to $w_i$ than an arbitrary similar word from $\mathrm{TOP}_{10}(w_i)$: this was shown to be a good indicator of synonymy [19]. We use the relative cosine similarity measure to construct a weighted graph of aspects to be used in the Equidistant Nodes Clustering (ENC) method [4], which was shown to obtain state-of-the-art performance in the synset induction task. ...
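A minimal sketch of how such a weighted aspect graph might be assembled, assuming gensim word vectors and networkx, with the rcs_10 > 0.10 indicator above as the edge criterion; the ENC clustering step itself (from [4]) is not reproduced here, and the helper names are illustrative.

```python
# Sketch: weighted graph of aspects with relative cosine similarity as edge weight.
# `wv` is assumed to be a trained gensim KeyedVectors object; `aspects` a list of strings.
import networkx as nx

def rcs(wv, a, b, topn=10):
    # relative cosine similarity: cos(a, b) / sum of cosines of a's topn neighbours
    return wv.similarity(a, b) / sum(c for _, c in wv.most_similar(a, topn=topn))

def build_aspect_graph(wv, aspects, topn=10, threshold=0.10):
    g = nx.Graph()
    g.add_nodes_from(aspects)
    for i, a in enumerate(aspects):
        for b in aspects[i + 1:]:
            if a in wv and b in wv:
                w = rcs(wv, a, b, topn)
                if w > threshold:          # the rcs_10 > 0.10 synonymy indicator from [19]
                    g.add_edge(a, b, weight=w)
    return g

# The resulting graph would then be handed to a synset-induction method such as ENC.
```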
Ontologies have proven beneficial in different settings that make use of textual reviews. However, manually constructing ontologies is a laborious and time-consuming process in need of automation. We propose a novel methodology for automatically extracting ontologies, in the form of meronomies, from product reviews, using a very limited amount of hand-annotated training data. We show that the ontologies generated by our method outperform hand-crafted ontologies (WordNet) and ontologies extracted by existing methods (Text2Onto and COMET) in several diverse settings. Specifically, our generated ontologies outperform the others when evaluated by human annotators as well as on an existing Q&A dataset from Amazon. Moreover, our method is better able to generalise, capturing knowledge about unseen products. Finally, we consider a real-world setting, showing that our method is better able to determine recommended products based on their reviews, as an alternative to using Amazon's standard score aggregations.
... The dimensions of distributed word vectors are word features that represent different aspects of word meaning [14]. Word embeddings have been used widely to extract and detect synonyms in English [15][16][17][18][19]. The author in [19] uses cosine similarity, "a measure of similarity between two nonzero vectors of an inner product space that measures the cosine of the angle between them" [20]. However, the list of most similar words retrieved using cosine similarity contains words that share some relation with the seed word, including not only synonymy but also other relations such as inflections and antonyms [19]. ...
... Word embeddings have been used widely to extract and detect synonyms in English [15][16][17][18][19]. The author in [19] uses cosine similarity, "a measure of similarity between two nonzero vectors of an inner product space that measures the cosine of the angle between them" [20]. However, the list of most similar words retrieved using cosine similarity contains words that share some relation with the seed word, including not only synonymy but also other relations such as inflections and antonyms [19]. Thus, cosine similarity alone is not an effective measure for synonym extraction. ...
Automatic synonym extraction plays an important role in many natural language processing systems, such as those involving information retrieval and question answering. Recently, research has focused on extracting semantic relations from word embeddings since they capture relatedness and similarity between words. However, using word embeddings alone poses problems for synonym extraction because it cannot determine whether the relation between words is synonymy or some other semantic relation. In this paper, we present a novel solution for this problem by proposing the SynoExtractor pipeline, which can be used to filter similar word embeddings to retain synonyms based on specified linguistic rules. Our experiments were conducted using KSUCCA and Gigaword embeddings and trained with CBOW and SG models. We evaluated automatically extracted synonyms by comparing them with Alma’any Arabic synonym thesauri. We also arranged for a manual evaluation by two Arabic linguists. The results of experiments we conducted show that using the SynoExtractor pipeline enhances the precision of synonym extraction compared to using the cosine similarity measure alone. SynoExtractor obtained a 0.605 mean average precision (MAP) for the King Saud University Corpus of Classical Arabic with 21% improvement over the baseline and a 0.748 MAP for the Gigaword corpus with 25% improvement. SynoExtractor outperformed the Sketch Engine thesaurus for synonym extraction by 32% in terms of MAP. Our work shows promising results for synonym extraction suggesting that our method can also be used with other languages.
... However, in the sentence "He's like a big brother to me", "big" cannot be substituted by "large". Leeuwenberg et al. [7] consider two words to be synonyms if they denote the same concept and are interchangeable in many contexts, with regard to one of their senses. The latter definition is quite convenient for computational applications. ...
... Perhaps the most similar work to the research conducted in this paper in the recent literature for synonymy discovery is the work of Leeuwenberg et al. [7]. The authors presented a minimally supervised approach to extract synonyms using distributional word vectors for English and German. ...
... For extrinsic evaluation, the authors used their system in a machine translation evaluation task and observed an improvement in the evaluation metric of the machine translation. It is worth mentioning that Leeuwenberg et al. [7] favoured a minimally supervised system so that their approach can be extended to other low-resource languages that may not be supported by rich lexical databases or sophisticated NLP tools. The only source of annotation used in that research is a part-of-speech tagged corpus. ...
Extracting synonyms from textual corpora using computational techniques is an interesting research problem in the Natural Language Processing (NLP) domain. Neural techniques (such as Word2Vec) have been recently utilized to produce distributional word representations (also known as word embeddings) that capture semantic similarity/relatedness between words based on linear context. Nevertheless, using these techniques for synonym extraction poses many challenges due to the fact that similarity between vector word representations does not indicate only synonymy between words, but also other sense relations as well as word association or relatedness. In this paper, we tackle this problem using a novel 2-step approach. We first build distributional word embeddings using Word2Vec and then use the induced word embeddings as input to train a feed-forward neural network on an annotated dataset to distinguish between synonyms and other semantically related words.
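A minimal sketch of the second step under stated assumptions: a feed-forward classifier (here scikit-learn's MLPClassifier) trained on features derived from the word2vec vectors of each annotated pair; the feature construction (concatenation plus absolute difference) and the `labelled_pairs` format are illustrative choices, not necessarily the authors' exact setup.

```python
# Sketch: classify word pairs as synonym vs. other relation from their embeddings.
# `kv` is assumed to be a trained gensim KeyedVectors; the labelled pairs are hypothetical.
import numpy as np
from sklearn.neural_network import MLPClassifier

def pair_features(kv, w1, w2):
    v1, v2 = kv[w1], kv[w2]
    return np.concatenate([v1, v2, np.abs(v1 - v2)])

def train_synonym_classifier(kv, labelled_pairs):
    # labelled_pairs: [("big", "large", 1), ("big", "small", 0), ...]
    X = np.stack([pair_features(kv, a, b) for a, b, _ in labelled_pairs])
    y = np.array([label for _, _, label in labelled_pairs])
    clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
    return clf.fit(X, y)
```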
... The word embeddings are recently gaining more attention and may help to address a broad range of NLP applications such as multi-task learning (part-of-speech tagging, chunking, named entity tags, semantic role labeling, language model, semantically related words) Ref. 6,7,8,13,41, adjectival scales Ref. 15, text classification Ref. 18, sentiment analysis Ref. 21,38,40, dependency parsing Ref. 1,5, analogies Ref. 11,22,26,30, paraphrase detection Ref. 37, recommendation system Ref. 2, and machine translation Ref. 19. ...
... Accordingly, we decided to set the K value to 300. It was shown in Ref. 19 that smaller contextual windows generally give better precision. We experimentally keep the window size at 5. ...
... Such studies argue that SG performs better for semantic relations. Ref. 19 observed that CBoW vectors give higher precision than SG for both German and English. The study suggests that the reason could be that CBoW vectors tend to be slightly more syntactical compared to SG vectors. ...
Recently, Neural Network Language Models have been effectively applied to many types of Natural Language Processing (NLP) tasks. One popular type of task is the discovery of semantic and syntactic regularities that support researchers in building a lexicon. Word embedding representations are notably good at discovering such linguistic regularities. We argue that two supervised learning approaches based on word embeddings can be successfully applied to the hypernym problem, namely, utilizing embedding offsets between word pairs and learning a semantic projection to link the words. The offset-based model classifies offsets as hypernym or not. The semantic projection approach trains a semantic transformation matrix that ideally maps a hyponym to its hypernym. A semantic projection model can learn a projection matrix provided that there is a sufficient number of training word pairs. However, we argue that such models tend to learn an is-a-particular-hypernym relation rather than to generalize the is-a relation. The embeddings are trained by applying both the Continuous Bag-of-Words and the Skip-Gram training models on a huge corpus of Turkish text. The main contribution of the study is the development of a novel and efficient architecture that is well-suited to applying word embedding approaches to the Turkish language domain. We report that both the projection and the offset classification models give promising and novel results for the Turkish language.
... The work of Leeuwenberg et al. [18] is the study that most closely resembles the research done in this paper in the current literature for synonymy discovery. For both English and German, the authors demonstrated a minimally supervised method for extracting synonyms using distributional word vectors. ...
... In this paper, we follow a methodology similar to that of Leeuwenberg et al. [18]. Additionally, we demonstrate how synonym suggestion can be employed as a supervised machine learning task, with the features being the word embeddings created using approaches like Word2Vec. ...
Tamil is one of the most ancient and complex languages in the world. Although it is an official language in several Asian countries, even native speakers tend to find difficulties in writing Tamil due to its morphologically rich nature. While there are various studies focusing on automatically identifying and correcting specific typing errors, very limited effort has been made to develop a comprehensive solution to assist the native and non-native writers of Tamil. In this paper, we propose a typing assistant tool, Tamil Grammarly, using Natural Language Processing (NLP) techniques. Specifically, the tool aims to aid the user to fix grammatical and spelling errors and to recommend the next words and synonyms of the current word in real-time while typing. The NLP-based typing assistant functions of Tamil Grammarly were developed using a transformer-based model, an LSTM model, and a Word2Vec model. Extensive evaluation shows that our tool can assist users in real-time with an accuracy of 73% - 93% within 0.4 to 5.3 seconds.
... We used the default parameter for the window size (5 words), while setting the minimum word frequency at 10. In Word2Vec, the metric used to calculate the distance between two vectors is the standard cosine similarity (Leeuwenberg et al., 2016). The closer the cosine similarity between two word vectors is to one, the more similar the words are according to the model. ...
... But determining whether the similarity score obtained from word embeddings is indicative of term synonymy is still an open question. Leeuwenberg et al. (2016) showed that cosine similarity alone is a bad indicator of whether two words are synonymous. They proposed a new measure, relative cosine similarity, which calculates similarity relative to other cosine-similar words in the corpus. ...
... Another interesting extension would be a rigorous treatment for synonymy. The latter may be achieved by incorporating in data pre-processing a similarity measure between word vectors stemming from a "Word2vec" framework [50], [51], followed by standardization of words with the similarity measures exceeding a pre-defined threshold. Alternatively, assuming the presence of a fixed, domain-specific database of synonyms, one could standardize the synonyms based on their appearance in the noted database, as part of data pre-processing. ...
... Alternatively, assuming the presence of a fixed, domain-specific database of synonyms, one could standardize the synonyms based on their appearance in the noted database, as part of data pre-processing. A summary of further approaches for synonym identification can be found in [50]. ...
How would an inventor, entrepreneur, investor, or patent examiner quantify the extent to which the inventive claims listed in a patent document align with patent specification? Since a specification that is poorly aligned with the inventive claims can render an invention unpatentable and can invalidate an already issued patent, an effective measure of alignment is necessary. We define a novel measure of drafting alignment using Latent Dirichlet Allocation (LDA). The measure is defined for each patent document by first identifying the latent topics underlying the claims and the specification, and then using the Hellinger distance to find the proximity between the topical coverages. We demonstrate the use of the novel measure for data processing patent documents related to cybersecurity. The properties of the proposed measure are further investigated using exploratory data analysis, and it is shown that generally alignment is positively associated with the prior patenting efforts as well as the tendency to include figures in a document.
... Synonym discovery, and the related task of paraphrase identification, [4][5][6] have been explored using a variety of methods. [7][8][9][10] This task builds on work in measuring semantic similarity between words and phrases. 11,12 Synonym discovery is especially important within the clinical medical domain. ...
... While benefiting from interpretability, this does not allow for the integration of the character or contextual models that our work provides. Additional work has studied approaches to synonym expansion in nonmedical domains, 10,33 and the related tasks of abbreviation and acronym resolution 34,35 in the clinical space. There has been a wide variety of research into the related task of medical concept linking; well-known systems include cTAKES 36 and MetaMap. ...
Objectives
An important component of processing medical texts is the identification of synonymous words or phrases. Synonyms can inform learned representations of patients or improve linking mentioned concepts to medical ontologies. However, medical synonyms can be lexically similar (“dilated RA” and “dilated RV”) or dissimilar (“cerebrovascular accident” and “stroke”); contextual information can determine if 2 strings are synonymous. Medical professionals utilize extensive variation of medical terminology, often not evidenced in structured medical resources. Therefore, the ability to discover synonyms, especially without reliance on training data, is an important component in processing clinical notes. The ability to discover synonyms from models trained on large amounts of unannotated data removes the need to rely on annotated pairs of similar words. Models relying solely on non-annotated data can be trained on a wider variety of texts without the cost of annotation, and thus may capture a broader variety of language.
Materials and Methods
Recent contextualized deep learning representation models, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), have shown strong improvements over previous approaches in a broad variety of tasks. We leverage these contextualized deep learning models to build representations of synonyms, which integrate the context of the surrounding sentence and use character-level models to alleviate out-of-vocabulary issues. Using these models, we perform unsupervised discovery of likely synonym matches, which reduces the reliance on expensive training data.
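As a rough illustration of this idea (not the authors' exact models or data), the sketch below pools BERT sub-token states for a mention inside its sentence and scores a candidate pair by cosine similarity; the model name, pooling strategy, and naive sub-token matching are assumptions.

```python
# Sketch: contextualized embeddings for synonym candidate scoring.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # illustrative model choice
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(phrase, context):
    """Mean-pool the contextual states of `phrase` inside its sentence `context`."""
    ids = tok(context, return_tensors="pt")
    with torch.no_grad():
        states = enc(**ids).last_hidden_state[0]           # (seq_len, hidden)
    span = tok(phrase, add_special_tokens=False)["input_ids"]
    tokens = ids["input_ids"][0].tolist()
    for i in range(len(tokens) - len(span) + 1):
        if tokens[i:i + len(span)] == span:                # naive sub-token match
            return states[i:i + len(span)].mean(dim=0)
    return states.mean(dim=0)                              # fall back to sentence mean

def score(p1, c1, p2, c2):
    return torch.cosine_similarity(embed(p1, c1), embed(p2, c2), dim=0).item()

print(score("stroke", "The patient suffered a stroke last year.",
            "cerebrovascular accident", "History of cerebrovascular accident noted."))
```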
Results
We use the ShARe/CLEF eHealth Evaluation Lab 2013 Task 1b data to evaluate our synonym discovery method. Comparing our proposed contextualized deep learning representations to previous non-neural representations, we find that the contextualized representations show consistent improvement over non-contextualized models in all metrics.
Conclusions
Our results show that contextualized models produce effective representations for synonym discovery. We expect that the use of these representations in other tasks would produce similar gains in performance.
... Since some of these lexical bases are manually constructed, they are time-consuming and expensive to build. For this reason, not all desired words will be present, and the knowledge base quality varies from language to language (Leeuwenberg et al., 2016). ...
... Evidence identified in the literature indicates that the expansion of terms using lexical resources such as WordNet has several problems, since some of these lexical bases are manually constructed, time-consuming, and expensive to build. For this reason, not all desired words will be available and their quality varies from language to language (Leeuwenberg et al., 2016). Distributional approaches to word similarity have proved to be more competitive than the thesaurus-based approach and have been successfully used to cover out-of-vocabulary items in lexical resources (Agirre et al., 2009; Barcelos and Rigo, 2018; Oliveira, 2018). ...
The similarity between words constitutes significant support for tasks in natural language processing. Several works use lexical resources such as WordNet for semantic similarity and synonym identification. Nevertheless, out-of-vocabulary words and missing links between senses are perceived problems of this approach. Distributional proposals like word embeddings have been used successfully to address such problems, but the lack of contextual information can prevent the achievement of even better results. Distributional models that include contextual information can bring advantages to this area, but these models are still scarcely explored. Therefore, this work studies the advantages of incorporating syntactic information in distributional models, aiming for better results in semantic similarity approaches. For that purpose, the current work explores existing lexical and distributional techniques for measuring word similarity in Brazilian Portuguese. Experiments were carried out with the lexical database WordNet, using different techniques over a standard dataset. The results indicate that word embeddings can cover out-of-vocabulary words and achieve better results in comparison with lexical approaches. The main contribution of this article is a new approach to applying syntactic context in the training process of word embeddings for a Brazilian Portuguese corpus. The comparison of this model with the outcome of the previous experiments shows sound results and presents relevant complementary aspects.
... Recently, word and entity embedding methods [16,17,23,24], which learn distributed vector representation of words from a large corpus, have been prevalent in data mining communities. For English, a few word or character embedding based synonym prediction methods have been proposed [11,15,32]. ...
... For neural-based methods, word embedding techniques have been widely adopted for synonym prediction [11,15,32]. Recently, there has been growing interest in enhancing word embeddings by incorporating domain semantic knowledge. ...
Automatic synonym recognition is of great importance for entity-centric text mining and interpretation. Due to the high variability of language use in real life, manual construction of semantic resources to cover all synonyms is prohibitively expensive and may also result in limited coverage. Although there are public knowledge bases, they have only limited coverage for languages other than English. In this paper, we focus on the medical domain and propose an automatic way to accelerate the development of medical synonymy resources for Chinese, covering both formal entities from healthcare professionals and noisy descriptions from end-users. Motivated by the success of distributed word representations, we design a multi-task model with a hierarchical task relationship to learn more representative entity/term embeddings and apply them to synonym prediction. In our model, we extend the classical skip-gram word embedding model by introducing an auxiliary task, "neighboring word semantic type prediction", and hierarchically organize them based on task complexity. Meanwhile, we incorporate existing medical term-term synonymous knowledge into our word embedding learning framework. We demonstrate that the embeddings trained with our proposed multi-task model yield significant improvement for entity semantic relatedness evaluation, neighboring word semantic type prediction, and synonym prediction compared with baselines. Furthermore, we create a large medical text corpus in Chinese that includes annotations for entities, descriptions, and synonymous pairs for future research in this direction.
... These obstacles led to the advent of word embeddings, also called word vectors, a more efficient solution, which takes into account the context of nearby words like a co-occurrence matrix but with a much lower dimensionality of the representation vectors. Technically, this means that word embedding vectors collect more information into fewer dimensions, and in fact, the word embedding approach is the primary reason for NLP's breakout [16], with considerable achievements in entity recognition [17], sentiment classification [18], machine translation [19], and synonym extraction [20]. ...
... Table 2 displays the number of features fed into our surveyed classifiers. We observed that when the length of biological words was shorter than 3, the number of features approximately follows the formula $\text{number of features} = 20^{\text{length}}$, where 20 is the number of common amino acids. However, when the word length is greater than 3, the number of features no longer complies with this rule. ...
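For concreteness, a quick check of the quoted rule, taking the 20 common amino acids as the alphabet:

```python
# Feature counts predicted by the 20^length rule for short biological words.
for length in (1, 2, 3):
    print(length, 20 ** length)   # 20, 400, 8000
```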
... Another essential feature of OT is that it provides parts of speech based on the search item. That is, the OT guides users to select the part of speech for which they intend to obtain results, a tagged search that adds some supervision to aid word extractions (i.e., retrieving the intended synonym or antonym) (Leeuwenberg et al., 2016). ...
Many applications, software, websites, and online dictionaries have been introduced thanks to the advent of computer- and mobile-assisted technologies. Recent technological advancements have resulted in more targeted apps (e.g., BoldVoice for pronunciation) or reference tools (Etymonline.com for an etymology dictionary) for language learning and practice. One such tool is Thesaurus.com, an online thesaurus freely accessible through mobile devices and computers. Given the significance of synonyms and antonyms for vocabulary learning and the problems English as a foreign language learners have with these lexical subfields, this media review evaluated the aforementioned online thesaurus. The results indicated potential pedagogical opportunities and several areas of improvement. The review concludes with suggestions for further research to corroborate the findings.
... Past research shows that words with similar meaning tend to have similar embeddings, allowing measures of proximity (vs. distance) in vector space to be interpreted as semantic similarity (Leeuwenberg et al., 2016). Word embeddings have been shown to successfully replicate the direction and magnitude of stereotypes about social groups observed in experimental psychology research (Caliskan et al., 2016; Charlesworth et al., 2021), and to reflect their historical fluctuation (Garg et al., 2018). ...
Advances in the protection of human rights have placed the notion of autonomous consent in the spotlight of ethical and legal thought. Nowadays, the moral demand for consent governs a host of everyday interactions, from the provision of medical care, third-party use of personal data, to sexual relations. Past theoretical and empirical work has made abundant progress delineating the circumstances under which a person’s consent should be deemed valid. In the present work, we ask a related question about people’s tacit understanding of consent: What does an individual’s expression of consent convey? By probing participants’ linguistic acceptability judgments and social inferences in response to contextualized expressions of consent (total N = 1233), and leveraging the tools of computational linguistics, we documented two attributes of people’s conceptual representation of consent. First, expressions of consent were connected to the speaker’s patient role in dyadic interactions. Second, consent was indicative of a person’s instrumental desire toward the target action. These findings help to construe the act of consent as conveying a speaker’s instrumental acceptance of an agent’s act for some ulterior end (e.g., as a way of restoring health via a medical intervention). A final study revealed that, in the sexual domain, people selectively reject instrumental desire as a normative standard echoing concerns among feminist scholars that emphasis on women’s consent may overlook the gender disparity of consented, yet non-desired, sexual encounters.
... What multiclass text classification has in common with binary text classification is that in both models each text is represented by only one label. Sentiment analysis [7], topic modelling [8] and synonym extraction [9] are examples of multi-class text classification tasks. Although both binary and multi-class text classification lead to successful results, they do not produce satisfactory outputs in situations where individuals' opinions are needed, such as with respect to products and events, as the language we use and the expressions we produce contain more complex meanings in these contexts. ...
The multi-label customer reviews classification task aims to identify the different thoughts of customers about the product they are purchasing. Due to the impact of the COVID-19 pandemic, customers have become more prone to shopping online. As a consequence, the amount of text data on e-commerce is continuously increasing, which enables new studies to be carried out and important findings to be obtained with more detailed analysis. Nowadays, e-commerce customer reviews are analyzed by both researchers and sector experts, and are subject to many sentiment analysis studies. Herein, an analysis of customer reviews is carried out in order to obtain more in-depth thoughts about the product, rather than engaging in emotion-based analysis. Initially, we form a new customer reviews dataset made up of reviews by Turkish consumers in order to perform the proposed analysis. The created dataset contains more than 50,000 reviews in three different categories, and each review has multiple labels according to the comments made by the customers. Later, we applied machine learning methods employed for multi-label classification to the dataset. Finally, we compared and analyzed the results we obtained using a diverse set of statistical metrics. As a result of our experimental studies, the most successful approach achieved a micro precision of 0.9157, a micro recall of 0.8837, a micro F1-score of 0.8925, and a Hamming loss of 0.0278.
... The word associated with the highest-ranked node is considered the most similar word to the query word, and so on. While the above studies attempt to find similarity between words by exploiting the topological structure of the network using random walks, studies such as Qu et al. [2017], Leeuwenberg et al. [2016], and Fei et al. [2019] exploit neural-based embedding models to capture similarity between words. ...
WordNets built for low-resource languages, such as Assamese, often use the expansion methodology. This may result in missing lexical entries and missing synonymy relations. As the Assamese WordNet is also built using the expansion method, using the Hindi WordNet, it also has missing synonymy relations. As WordNets can be visualized as a network of unique words connected by synonymy relations, link prediction in complex network analysis is an effective way of predicting missing relations in a network. Hence, to predict the missing synonyms in the Assamese WordNet, link prediction methods were used in the current work that proved effective. It is also observed that for discovering missing relations in the Assamese WordNet, simple local proximity-based methods might be more effective as compared to global and complex supervised models using network embedding. Further, it is noticed that though a set of retrieved words are not synonyms per se, they are semantically related to the target word and may be categorized as semantic cohorts.
... Words with common contexts (thus considered as similar) have close vectors in the produced vector space. In word2vec, the metric used to calculate the distance between two vectors is the standard cosine similarity (Leeuwenberg et al., 2016). The closer the cosine similarity between two word vectors is to one, the more similar the words are according to the model. ...
Epidemic intelligence aims to detect, investigate and monitor potential health threats while relying on formal (e.g. official health authorities) and informal (e.g. media) information sources. Monitoring of unofficial sources, or so-called event-based surveillance (EBS), requires the development of systems designed to retrieve and process unstructured textual data published online. This manuscript focuses on the extraction and combination of epidemiological information from informal sources (i.e. online news), in the context of the international surveillance of animal infectious diseases. The first objective of this thesis is to propose and compare approaches to enhance the identification and extraction of relevant epidemiological information from the content of online news. The second objective is to study the use of epidemiological entities extracted from the news articles (i.e. diseases, hosts, locations and dates) in the context of event extraction and retrieval of related online news. This manuscript proposes new textual representation approaches by selecting, expanding, and combining relevant epidemiological features. We show that adapting and extending text mining and classification methods improves the added value of online news sources for event-based surveillance. We stress the role of domain expert knowledge regarding the relevance and the interpretability of the methods proposed in this thesis. While our research is conducted in the context of animal disease surveillance, we discuss the generic aspects of our approaches with regard to unknown threats and One Health surveillance.
... The procedure can generally include several preprocessing steps, e.g., text cleaning, white space removal, case folding, spelling error correction, abbreviation expansion, stemming, stop word removal, or negation handling (Dařena, 2019). For word embedding training, some preprocessing can be applied too (Li et al., 2017; Leeuwenberg et al., 2016), which can have an impact on the context of the words, the number of unique words, and global word frequencies. Subsequently, one-hot encoded vectors (a vector where only one of its units is 1 and all others are 0) that act as the inputs and outputs of the neural models are derived (Rong, 2014). ...
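A small sketch of these two stages under simple assumptions (a toy stop-word list and crude regex cleaning), showing how preprocessed tokens map to the one-hot vectors that feed such neural models:

```python
# Sketch: basic preprocessing followed by one-hot encoding of the resulting vocabulary.
import re
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "over"}      # toy list for illustration

def preprocess(text):
    text = text.lower()                            # case folding
    text = re.sub(r"[^a-z\s]", " ", text)          # crude cleaning
    return [t for t in text.split() if t not in STOP_WORDS]

def one_hot(tokens):
    vocab = sorted(set(tokens))
    eye = np.eye(len(vocab))                       # row i is the one-hot vector of word i
    return {w: eye[i] for i, w in enumerate(vocab)}

tokens = preprocess("The quick brown fox jumps over the lazy dog.")
print(one_hot(tokens)["fox"])
```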
... Considering the promising performance produced by neural word embeddings (Word2vec [23], FastText [24], and Glove [25]) in a variety of NLP tasks, including hierarchical text categorization [26], multi-class text document classification [27], [28], investigation of gender roles [29], non-relevant post detection [30], topic modelling [31], automated sarcasm detection [32], synonym extraction [33], automated enrichment of lexicons for misogyny detection [34], sentiment analysis [35], automated text summarization [36], text clustering [37], measuring emotional polarity from debates [38], and recommendation systems [39], the paper at hand for the very first time provides Word2vec [23], FastText [24], and Glove [25] embeddings for Roman Urdu. These pre-trained embeddings can be used to enhance the performance of diverse deep learning based Roman Urdu processing tasks. ...
In order to accelerate the performance of various Natural Language Processing tasks for Roman Urdu, this article for the very first time provides three neural word embeddings prepared using the most widely used approaches, namely Word2vec, FastText, and Glove. The integrity of the generated neural word embeddings is evaluated using intrinsic and extrinsic evaluation approaches. Considering the lack of publicly available benchmark datasets, it provides a first-ever public Roman Urdu dataset which consists of 3241 sentiments annotated against positive, negative, and neutral classes. To provide benchmark baseline performance over the presented dataset for Roman Urdu sentiment analysis, we adapt diverse machine learning (Support Vector Machine, Logistic Regression, Naive Bayes), deep learning (convolutional neural network, recurrent neural network), and hybrid deep learning approaches. The performance impact of the generated neural word embedding based representation is compared with other widely used bag-of-words based feature representation approaches using diverse machine and deep learning classifiers. In order to improve the performance of Roman Urdu sentiment analysis, it proposes a novel precisely extreme multi-channel hybrid methodology which makes use of convolutional and recurrent neural networks along with pre-trained neural word embeddings. The proposed hybrid approach outperforms the adapted machine learning approaches by a significant margin of 9% and the deep learning approaches by 4% in terms of F1-score.
... Synonym identification is an active subject of AI research. One approach is to analyze patterns in the text surrounding a word, known as "distributional word vectors" or "word embeddings," on grounds that synonyms can be used in the same way and therefore tend to be surrounded by similar text (Leeuwenberg et al. 2016;Mohammed 2020). However, this work uses large linguistic datasets (i.e., large collections of text). ...
This paper considers the question: In what ways can artificial intelligence assist with interdisciplinary research for addressing complex societal problems and advancing the social good? Problems such as environmental protection, public health, and emerging technology governance do not fit neatly within traditional academic disciplines and therefore require an interdisciplinary approach. However, interdisciplinary research poses large cognitive challenges for human researchers that go beyond the substantial challenges of narrow disciplinary research. The challenges include epistemic divides between disciplines, the massive bodies of relevant literature, the peer review of work that integrates an eclectic mix of topics, and the transfer of interdisciplinary research insights from one problem to another. Artificial interdisciplinarity already helps with these challenges via search engines, recommendation engines, and automated content analysis. Future "strong artificial interdisciplinarity" based on human-level artificial general intelligence could excel at interdisciplinary research, but it may take a long time to develop and could pose major safety and ethical issues. Therefore, there is an important role for intermediate-term artificial interdisciplinarity systems that could make major contributions to addressing societal problems without the concerns associated with artificial general intelligence.
... Word embedding is a vector representation of words based on the context of the sentences or semantic relationships between words. The vectors contain real numbers [19]. Mikolov et al. proposed two models that focus on learning word vectors, namely continuous bag-of-words (CBoW) and continuous skip-gram (SG), as shown in Figure 3. ...
... Considering the promising performance produced by neural word embeddings (Word2vec [18], FastText [19], and Glove [20]) in a variety of NLP tasks, including hierarchical text categorization [21], multi-class text document classification [22] [23], investigation of gender roles [24], non-relevant post detection [25], topic modelling [26], automated sarcasm detection [27], synonym extraction [28], automated enrichment of lexicons for misogyny detection [29], sentiment analysis [30], automated text summarization [31], text clustering [32], measuring emotional polarity from debates [33], and recommendation systems [34], the paper at hand for the very first time provides Word2vec [18], FastText [19], and Glove [20] embeddings for Roman Urdu. These pre-trained embeddings can be used to enhance the performance of diverse deep learning based Roman Urdu processing tasks. ...
In order to accelerate the performance of various Natural Language Processing tasks for Roman Urdu, this paper for the very first time provides three neural word embeddings prepared using the most widely used approaches, namely Word2vec, FastText, and Glove. The integrity of the generated neural word embeddings is evaluated using intrinsic and extrinsic evaluation approaches. Considering the lack of publicly available benchmark datasets, it provides a first-ever Roman Urdu dataset which consists of 3241 sentiments annotated against positive, negative, and neutral classes. To provide benchmark baseline performance over the presented dataset, we adapt diverse machine learning (Support Vector Machine, Logistic Regression, Naive Bayes), deep learning (convolutional neural network, recurrent neural network), and hybrid approaches. The effectiveness of the generated neural word embeddings is evaluated by comparing the performance of machine and deep learning based methodologies using 7 and 5 distinct feature representation approaches, respectively. Finally, it proposes a novel precisely extreme multi-channel hybrid methodology which outperforms the state-of-the-art adapted machine and deep learning approaches by 9% and 4%, respectively, in terms of F1-score. Keywords: Roman Urdu sentiment analysis, pre-trained word embeddings for Roman Urdu, Word2Vec, Glove, FastText.
... Accordingly, we decided to set the K value to 300. For window size selection, some studies [28] showed that smaller contextual windows generally give better precision. After some experiments, we set the window size to 10. ...
In recent years, author gender identification has gained considerable attention in the fields of information retrieval and computational linguistics. In this paper, we employ and evaluate different learning approaches based on machine learning (ML) and neural network language models to address the problem of author gender identification. First, several ML classifiers are applied to the features obtained by bag-of-words. Secondly, datasets are represented by a low-dimensional real-valued vector using Word2vec, GloVe, and Doc2vec, which are on par with ML classifiers in terms of accuracy. Lastly, neural network architectures, the convolutional neural network and the recurrent neural network, are trained and their associated performances are assessed. A variety of experiments are successfully conducted. Different issues, such as the effects of the number of dimensions, training architecture type, and corpus size, are considered. The main contribution of the study is to identify author gender by applying word embeddings and deep learning architectures to the Turkish language.
... Word embeddings have been widely used for extracting similar words [21]. Previous studies have shown that word embeddings yield significant improvements over WordNet-based measures [22]. ...
Following approaches for understanding lexical meaning developed by Yāska, Patanjali and Bhartrihari from Indian linguistic traditions and extending approaches developed by Leibniz and Brentano in modern times, a framework of formal ontology of language was developed. This framework proposes that the meanings of words are informed by intrinsic and extrinsic ontological structures. The paper aims to capture such intrinsic and extrinsic meanings of words for two major Indian languages, namely, Hindi and Telugu. Parts of speech have been rendered into sense-types and sense-classes. Using them we have developed a gold-standard annotated lexical resource to support semantic understanding of a language. The resource has a collection of Hindi and Telugu lexicons, which has been manually annotated by native speakers of the languages following our annotation guidelines. Further, the resource was utilised to derive the adverbial sense-class distribution of verbs and the kāraka-verb sense-type distribution. Different corpora (news, novels) were compared using verb sense-type distributions. Word embedding was used as an aid for the enrichment of the resource. This is a work in progress that aims at extensive lexical coverage of the language.
... Today, an increasing number of works are dedicated to the statistical analysis of semantically related word frequencies. For example, the approach called "relative cosine similarity" is proposed in [1]. Distributed word vectors were analyzed, and the cosine distance was used to estimate synonymic similarity for different parts of speech separately. ...
In this study, the similarity in frequency dynamics of semantically related words was analyzed using word statistics extracted from more than 4.5 million books written over a period of 205 years. The approach is based on the correlation analysis of 1-gram frequency dynamics. We analyzed the frequency correlations of synonym pairs, their corresponding antonymous groups, and random word pairs. We also compared several metrics to find the most effective one for assessing the degree of similarity in the usage dynamics of different words. Comparing the differences between logarithmic rank variations in pairs of synonyms and random word pairs, significant differences are found, though they are smaller than might be expected.
Automatic evaluation of hashtag recommendation models is a fundamental task in Twitter. In the traditional evaluation methods, the recommended hashtags from an algorithm are first compared with the ground truth hashtags for exact correspondences. The number of exact matches is then used to calculate the hit rate, hit ratio, precision, recall, or F1-score. This way of evaluating hashtag similarities is inadequate as it ignores the semantic correlation between the recommended and ground truth hashtags. To tackle this problem, we propose a novel semantic evaluation framework for hashtag recommendation, called #REval. This framework includes an internal module referred to as BERTag, which automatically learns the hashtag embeddings. We investigate how the #REval framework performs under different word embedding methods and different numbers of synonyms and hashtags in the recommendation using our proposed #REval-hit-ratio measure. Our experiments with the proposed framework on three large datasets show that #REval gives more meaningful hashtag synonyms for hashtag recommendation evaluation. Our analysis also highlights the sensitivity of the framework to the word embedding technique, with #REval based on BERTag superior to #REval based on Word2Vec, FastText, and GloVe.
With the tremendous evolution of the internet world, the internet has become a household thing. Internet users use search engines or personal assistants to request information from the internet. Search results are greatly dependent on the entered keywords. Casual users may enter a vague query due to a lack of knowledge of the domain-specific words. We propose a query reformulation system that determines the context of the query, decides on keywords to be replaced, and outputs a better-modified query. We propose strategies for keyword replacement and metrics for query betterment checks. We have found that if we project keywords into a vector space using word embedding techniques, and if the keyword replacement is correct, the cluster of the new set of keywords becomes more cohesive. This assumption forms the basis of our proposed work. To prove the effectiveness of the proposed system, we applied it to ad-hoc retrieval tasks over two benchmark corpora, viz. the TREC-CDS 2014 and OHSUMED corpora. We indexed these corpora with the Whoosh search engine and evaluated the system on the queries provided along with each corpus. Experimental results show that the proposed techniques achieved a 9 to 11% improvement in precision and recall scores. Using Google’s popularity index, we also show that the reformulated queries are not only more accurate but also more popular. The proposed system also applies to conversational AI chatbots like ChatGPT, where users must rephrase their queries to obtain better results.
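A minimal sketch of the cohesion idea described above, assuming a trained gensim KeyedVectors model (`kv`); the acceptance rule and the source of candidate replacements are illustrative assumptions rather than the authors' full system.

```python
# Sketch: accept a keyword replacement only if the keyword cluster becomes more cohesive.
from itertools import combinations
import numpy as np

def cohesion(kv, keywords):
    """Mean pairwise cosine similarity of the in-vocabulary keywords."""
    pairs = list(combinations([w for w in keywords if w in kv], 2))
    return float(np.mean([kv.similarity(a, b) for a, b in pairs])) if pairs else 0.0

def maybe_replace(kv, keywords, target, candidate):
    reformulated = [candidate if w == target else w for w in keywords]
    # keep the reformulated query only when the new keyword set is more cohesive
    return reformulated if cohesion(kv, reformulated) > cohesion(kv, keywords) else keywords
```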
Automatic evaluation of hashtag recommendation models is a fundamental task in many online social network systems. In the traditional evaluation method, the recommended hashtags from an algorithm are first compared with the ground truth hashtags for exact correspondences. The number of exact matches is then used to calculate the hit rate, hit ratio, precision, recall, or F1-score. This way of evaluating hashtag similarities is inadequate as it ignores the semantic correlation between the recommended and ground truth hashtags. To tackle this problem, we propose a novel semantic evaluation framework for hashtag recommendation, called #REval. This framework includes an internal module referred to as BERTag, which automatically learns the hashtag embeddings. We investigate how the #REval framework performs under different word embedding methods and different numbers of synonyms and hashtags in the recommendation using our proposed #REval-hit-ratio measure. Our experiments with the proposed framework on three large datasets show that #REval gives more meaningful hashtag synonyms for hashtag recommendation evaluation. Our analysis also highlights the sensitivity of the framework to the word embedding technique, with #REval based on BERTag superior to #REval based on FastText and Word2Vec.
Following approaches for understanding lexical meaning developed by Yāska, Patanjali and Bhartrihari from Indian linguistic traditions and extending approaches developed by Leibniz and Brentano in modern times, a framework of formal ontology of language was developed. This framework proposes that the meanings of words are informed by intrinsic and extrinsic ontological structures. The paper aims to capture such intrinsic and extrinsic meanings of words for two major Indian languages, namely, Hindi and Telugu. Parts of speech have been rendered into sense-types and sense-classes. Using them we have developed a gold-standard annotated lexical resource to support semantic understanding of a language. The resource has a collection of Hindi and Telugu lexicons, which has been manually annotated by native speakers of the languages following our annotation guidelines. Further, the resource was utilised to derive the adverbial sense-class distribution of verbs and the kāraka-verb sense-type distribution. Different corpora (news, novels) were compared using verb sense-type distributions. Word embedding was used as an aid for the enrichment of the resource. This is a work in progress that aims at extensive lexical coverage of the language.
This chapter explores the application of meta-learning in natural language processing. It investigates 20 advanced pieces of research, all of which can be divided into seven taxonomies according to the tasks they solve: semantic parsing, sentiment analysis, machine translation, dialogue systems, knowledge graphs (including named entity recognition and question answering), relation extraction, and emerging topics (e.g., domain-specific word embedding, multilabel classification, compositional generalization, query suggestion). Numerous attempts are discussed across various research areas, including domain adaptation, few-shot/zero-shot learning, and lifelong learning settings. Each section outlines the definition of relevant terminology, early attempts, current status, multiple research lines, milestones of model architecture, distribution and general purposes, remarkable techniques, the objective function and pseudocode of critical methods, regular datasets, and benchmarks. Finally, a summary compares deep learning with meta-learning and presents further discussion of meta-approaches in natural language processing.
Several models of the bilingual mind have been suggested (e.g., DFM, Multilink), which are inspired by connectionist and artificial neural network models. Those models aim at explaining and predicting translation latencies of human translators based on cross-linguistic similarities of word properties. To foster research on these models, bilingualism studies have developed translation norms which enumerate factors (such as concreteness or ambiguity) assumed to be responsible for delays in word recognition and word translation production. While neural machine translation (NMT) systems are currently revolutionizing the translation industry, the processes and representations assumed in bilingualism studies have not (often) been compared with measures and representations that can be traced in NMT systems. In this chapter, we map translation norms onto word embeddings (i.e., vector representations of words used in NMT systems) and compare model predictions, as gathered from similarity measures of continuous word embeddings, with reported human behavioral latencies. We also investigate to what extent the findings from single-word translation experiments can be carried over to translations in context. We re-assess predictions of DFM and Multilink in the light of the findings.
Researchers who are non-native speakers of English often face problems when composing scientific articles in this language. Most of the time, this is due to a lack of vocabulary or knowledge of alternative ways of expression. In this paper, we suggest using word embeddings to look for substitute words used in academic writing in a specific domain. Word embeddings may contain not only semantically similar words but also other words with similar word vectors that could be better expressions. A word embedding model trained on a collection of academic articles in a specific domain might suggest similar expressions that comply with that writing style and are suited to that domain. Our experimental results show that a word embedding model trained on the NLP domain is able to propose possible substitutes that could be used to replace target words in a certain context.
Entity synonyms play an important role in natural language processing applications such as query expansion and question answering. There are three main distributional characteristics in web texts: (1) appearing in parallel structures; (2) occurring with specific patterns in sentences; and (3) being distributed in similar contexts. The first and second characteristics rely on reliable prior knowledge and are susceptible to data sparseness, bringing high accuracy but low recall to synonym extraction. The third may lead to high recall but low accuracy, since it identifies a somewhat loose semantic similarity. Existing methods, such as context-based and pattern-based methods, only consider one characteristic for synonym extraction and rarely take their complementarity into account. To increase recall, this article proposes a novel extraction framework that combines the three characteristics for extracting synonyms from the web, where an Entity Synonym Network (ESN) is built to incorporate synonymous knowledge. To improve accuracy, the article treats synonym detection as a ranking problem and uses the Spreading Activation model as a ranking means to detect hard noise in the ESN. Experimental results show the proposed method achieves better accuracy and recall than state-of-the-art methods.
Purpose
The purpose of this paper is to provide an overall review of grammar checking and relation extraction (RE) literature, their techniques and the open challenges associated with them; and, finally, suggest future directions.
Design/methodology/approach
The review of grammar checking and RE was carried out using the following protocol: we prepared research questions, planned a search strategy, defined paper selection criteria to distinguish relevant works, extracted data from these works, and finally analyzed and synthesized the data.
Findings
The output of error detection models could be used for creating a profile of a certain writer. Such profiles can be used for author identification, native language identification or even the level of education, to name a few. The automatic extraction of relations could be used to build or complete electronic lexical thesauri and knowledge bases.
Originality/value
Grammar checking is the process of detecting and sometimes correcting erroneous words in the text, while RE is the process of detecting and categorizing predefined relationships between entities or words that were identified in the text. The authors found that the most obvious challenge is the lack of data sets, especially for low-resource languages. Also, the lack of unified evaluation methods hinders the ability to compare results.
Entity synonyms play an important role in natural language processing applications such as query expansion and question answering. There are three main distributional characteristics in texts on the web: (1) appearing in parallel structures; (2) occurring with specific patterns in sentences; and (3) being distributed in similar contexts. These characteristics are largely complementary. Existing methods, such as pattern-based and context-based methods, only consider one characteristic for synonym extraction and ignore the complementarity among them. To increase accuracy and recall, we propose a novel method that integrates the three characteristics for extracting synonyms from the web, where an Entity Synonym Network (ESN) is built to incorporate synonymous knowledge. To further improve accuracy, we treat synonym detection as a ranking problem and use the Spreading Activation model as a ranking means to detect hard noise in the ESN. Experimental results show our method achieves better accuracy and recall than state-of-the-art methods.
We demonstrate the potential of using aligned bilingual word embeddings to create an unsupervised method for evaluating machine translations without the need for a parallel translation corpus or a reference corpus. We explain why movie subtitles differ from other text and share our experimental results on them for four target languages (French, German, Portuguese and Spanish) with English source subtitles. We propose a novel automated evaluation method that calculates edits (insertions, deletions, substitutions and shifts) to indicate translation quality and the human-aided post-editing required to perfect the machine translation.
The paper is devoted to the analysis of methods that can be used for the automatic generation of specialized thesauri. The authors developed a test bench that allows estimating the most popular methods for relation extraction, which constitutes the main part of such generation. On the basis of experiments conducted on the test bench, the idea of hybrid thesaurus generation methods, combining the algorithms that showed the best performance, was proposed. Its efficiency was illustrated by creating a thesaurus for the medical domain and subsequently estimating it on the test bench.
In this paper, a simple pipeline is proposed for identifying labeled topics in temporally ordered, topic-specific news corpora using word embeddings and keyword identification. The steps in the pipeline rely on NLP techniques for keyword identification, corpus periodization, and word embeddings. The proposed method is evaluated by applying it to a dataset consisting of TV and radio transcripts about “Donald Trump.” The results demonstrate that the topics captured using this pipeline are more coherent and informative than those generated using Latent Dirichlet Allocation. Findings from this preliminary experiment suggest that word embedding models and common NLP keyword identification techniques can be used to identify coherent and labeled topics in a temporally ordered news corpus.
This paper presents a novel approach to machine translation evaluation that combines register features, characterised by particular distributions of lexico-grammatical features, with text classification techniques. The goal of this method is to compare machine translation output with comparable originals in the same language, as well as with human reference translations. The degree of similarity, in terms of register features, between machine translations and originals, and between machine translations and reference translations, is measured by applying two text classifiers trained on (1) originals and (2) reference translations, and tested on machine translations. The results of the experiments support our assumption that machine translations share register features with human translations rather than with non-translated texts produced by humans. This confirms that register is one of the most important factors that should be integrated into register-based machine translation evaluation.
This paper describes Meteor Universal, released for the 2014 ACL Workshop on Statistical Machine Translation. Meteor Universal brings language specific evaluation to previously unsupported target languages by (1) automatically extracting linguistic resources (paraphrase tables and function word lists) from the bitext used to train MT systems and (2) using a universal parameter set learned from pooling human judgments of translation quality from several language directions. Meteor Universal is shown to significantly outperform baseline BLEU on two new languages, Russian (WMT13) and Hindi (WMT14).
In this paper we present VERTa, a linguistically-motivated metric that combines linguistic features at different levels. We provide the linguistic motivation on which the metric is based, as well as describe the different modules in VERTa and how they are combined. Finally, we describe the two versions of VERTa, VERTa-EQ and VERTa-W, sent to WMT14 and report results obtained in the experiments conducted with the WMT12 and WMT13 data into English.
We have proposed a simple, effective and fast method named retrofitting to improve word vectors using word relation knowledge found in semantic lexicons constructed either automatically or by humans. Retrofitting is used as a post-processing step to improve vector quality and is simpler to use than other approaches that use semantic information during training. It can be used to improve vectors obtained from any word vector training model and performs better than current state-of-the-art approaches to semantic enrichment of word vectors. We validated the applicability of our method across tasks, semantic lexicons, and languages.
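The post-processing step described above can be sketched in a few lines of numpy. The sketch below uses the commonly cited closed-form update (each vector is pulled toward its lexicon neighbours while staying close to its original value) with uniform neighbour weights and alpha = 1; it is an approximation for illustration, not the authors' released implementation, and the toy vectors and lexicon are placeholders.

```python
# Minimal retrofitting sketch: iteratively nudge each vector toward its
# semantic-lexicon neighbours while staying close to the original vector.
import numpy as np

def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            beta = 1.0 / len(nbrs)                      # uniform edge weights
            neighbour_sum = beta * sum(new_vecs[n] for n in nbrs)
            # Closed-form update: weighted average of neighbours and original vector.
            new_vecs[word] = (neighbour_sum + alpha * vectors[word]) / (beta * len(nbrs) + alpha)
    return new_vecs

# Toy example: two words linked in a hypothetical synonym lexicon.
vectors = {"car": np.array([1.0, 0.0]), "automobile": np.array([0.0, 1.0])}
lexicon = {"car": ["automobile"], "automobile": ["car"]}
print(retrofit(vectors, lexicon)["car"])
```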
The recently introduced continuous skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling frequent words we obtain a significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
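A rough gensim sketch of the two ingredients mentioned above, phrase detection and skip-gram training with negative sampling and frequent-word subsampling, is shown below. It is illustrative only, with placeholder data and parameter values, and assumes gensim 4.x; it is not the original word2vec implementation.

```python
# Minimal sketch: detect common phrases, then train skip-gram with negative sampling.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

sentences = [
    ["air", "canada", "flies", "to", "new", "york"],
    ["air", "canada", "is", "an", "airline"],
    # ... more tokenised sentences
]

# Merge frequently co-occurring token pairs into single tokens such as "air_canada".
bigrams = Phrases(sentences, min_count=1, threshold=1)
phrased = [bigrams[s] for s in sentences]

# sg=1 selects skip-gram; negative=5 enables negative sampling with 5 noise words;
# sample sub-samples very frequent words.
model = Word2Vec(phrased, vector_size=100, window=5, min_count=1,
                 sg=1, negative=5, sample=1e-3)
print(model.wv.index_to_key[:10])
```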
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high-quality word vectors from a 1.6 billion word data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
We examine a new, intuitive measure for evaluating machine-translation output that avoids the knowledge intensiveness of more meaning-based approaches and the labor intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We show that the single-reference variant of TER correlates as well with human judgments of MT quality as the four-reference variant of BLEU. We also define a human-targeted TER (or HTER) and show that it yields higher correlations with human judgments than BLEU, even when BLEU is given human-targeted references. Our results indicate that HTER correlates with human judgments better than HMETEOR and that the four-reference variants of TER and HTER correlate with human judgments as well as, or better than, a second human judgment does.
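Ignoring the shift operation for brevity, the core of TER is a word-level edit distance normalised by reference length. The sketch below is a simplified illustration under that assumption, not the official tercom implementation.

```python
# Minimal TER-style sketch: word-level Levenshtein edits (insertions, deletions,
# substitutions) divided by reference length. Shifts are omitted for simplicity.
def ter_no_shifts(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(hyp)][len(ref)] / len(ref)

print(ter_no_shifts("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words
```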
Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.
There have been many proposals to extract semantically related words using measures of distributional similarity, but these typically are not able to distinguish between synonyms and other types of semantically related words such as antonyms, (co)hyponyms and hypernyms. We present a method based on automatic word alignment of parallel corpora consisting of documents translated into multiple languages and compare our method with a monolingual syntax-based method. The approach that uses aligned multilingual data to extract synonyms shows much higher precision and recall scores for the task of synonym extraction than the monolingual syntax-based approach.
Finding semantically related words is a first step in the direction of automatic ontology building. Guided by the view that similar words occur in similar contexts, we looked at the syntactic context of words to measure their semantic similarity. Words that occur in a direct object relation with the verb drink, for instance, have something in common (liquidity, ...). Co-occurrence data for common nouns and proper names, for several syntactic relations, was collected from an automatically parsed corpus of 78 million words of newspaper text. We used several vector-based methods to compute the distributional similarity between words. Using Dutch EuroWordNet as the evaluation standard, we investigated which vector-based method and which combination of syntactic relations is the strongest predictor of semantic similarity.
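The vector-based comparison described above boils down to cosine similarity between context-count vectors. A toy sketch with invented (relation, head) counts follows; it illustrates the idea only, not the exact weighting used in the work above.

```python
# Toy sketch: represent nouns by counts of the syntactic contexts they occur in,
# then compare them with cosine similarity. The counts below are invented.
import math

contexts = {
    "beer":  {"obj_of:drink": 12, "obj_of:brew": 5, "mod_by:cold": 7},
    "wine":  {"obj_of:drink": 10, "obj_of:pour": 4, "mod_by:red": 6},
    "table": {"obj_of:set": 8, "mod_by:wooden": 9},
}

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine(contexts["beer"], contexts["wine"]))   # relatively high
print(cosine(contexts["beer"], contexts["table"]))  # no shared contexts -> 0.0
```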
Genes and proteins are often associated with multiple names, and more names are added as new functional or structural information is discovered. Because authors often alternate between these synonyms, information retrieval and extraction benefit from identifying these synonymous names. We have developed a method to automatically extract synonymous gene and protein names from MEDLINE and journal articles. We first identified patterns authors use to list synonymous gene and protein names. We developed SGPE (synonym extraction of gene and protein names), a software program that recognizes the patterns and extracts candidate synonymous terms from MEDLINE abstracts and full-text journal articles. SGPE then applies a sequence of filters that automatically screen out terms that are not gene and protein names. We evaluated our method to have an overall precision of 71% on both MEDLINE and journal articles, and 90% precision on the more suitable full-text articles alone.
We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. Using these ideas together, the resulting tagger gives a 97.24% accuracy on the Penn Treebank WSJ, an error reduction of 4.4% on the best previous single automatically learned tagging result.
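The tagger described above is the Stanford cyclic-dependency-network tagger. As a quick stand-in for readers who want light POS supervision in Python, the snippet below uses NLTK's off-the-shelf perceptron tagger, which is an assumption on our part and not the same model; the resource name passed to nltk.download may differ between NLTK versions.

```python
# Minimal sketch: obtain POS tags with NLTK's default English tagger.
# This is not the Stanford tagger described above, just an easy substitute.
import nltk

# In newer NLTK releases the resource may be named "averaged_perceptron_tagger_eng".
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = "The quick brown fox jumps over the lazy dog".split()
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```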
Human evaluations of machine translation are extensive but expensive. They can take months to finish and involve human labor that cannot be reused.
This paper describes USAAR’s submission to the metrics shared task of the Workshop on Statistical Machine Translation (WMT) in 2015. The goal of our submission is to take advantage of the semantic overlap between hypothesis and reference translation for predicting MT output adequacy using language-independent document embeddings. The approach presented here learns a Bayesian Ridge Regressor over document skip-gram embeddings in order to automatically evaluate machine translation (MT) output by predicting semantic adequacy scores. The evaluation of our submission, measured by the correlation with human judgements, shows promising results on system-level scores.
While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts, and show that they produce markedly different embeddings. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.
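To make "dependency-based contexts" concrete, the sketch below extracts (word, context) pairs where the context is the syntactic head plus the relation label rather than a linear window of neighbours. It assumes spaCy and its small English model (installed via `python -m spacy download en_core_web_sm`) as a convenient parser; the original work used a different parser and a more elaborate context scheme.

```python
# Toy sketch: extract (word, dependency-context) pairs of the kind used for
# dependency-based embeddings.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Australian scientist discovers star with telescope")

pairs = []
for token in doc:
    if token.dep_ == "ROOT":
        continue
    # Context = head word plus relation label, and the inverse context for the head.
    pairs.append((token.text, f"{token.head.text}/{token.dep_}"))
    pairs.append((token.head.text, f"{token.text}/{token.dep_}-1"))

print(pairs)
```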
This paper describes the UPC submissions to the WMT14 Metrics Shared Task: UPC-IPA and UPC-STOUT. These metrics use a collection of evaluation measures integrated in ASIYA, a toolkit for machine translation evaluation. In addition to some standard metrics, the two submissions take advantage of novel metrics that consider linguistic structures, lexical relationships, and semantics to compare both the source and the reference translation against the candidate translation. The new metrics are available for several target languages other than English. In the official WMT14 evaluation, UPC-IPA and UPC-STOUT scored above the average in 7 out of 9 language pairs at the system level and 8 out of 9 at the segment level.
Semantic hierarchy construction aims to build structures of concepts linked by hypernym-hyponym ("is-a") relations. A major challenge for this task is the automatic discovery of such relations. This paper proposes a novel and effective method for the construction of semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words. We identify whether a candidate word pair has a hypernym-hyponym relation by using word-embedding-based semantic projections between words and their hypernyms. Our result, an F-score of 73.74%, outperforms the state-of-the-art methods on a manually labeled test dataset. Moreover, combining our method with a previous manually-built hierarchy extension method can further improve the F-score to 80.29%.
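The projection idea can be sketched as learning a matrix that maps a hyponym's embedding approximately onto its hypernym's embedding via least squares, then scoring candidate pairs by how close the projected hyponym lands. The single global projection and the random toy vectors below are simplifying assumptions; the paper uses clustered, piecewise projections over real embeddings.

```python
# Toy sketch: learn one linear projection from hyponym vectors to hypernym vectors
# with least squares, then use the residual distance as an "is-a" score.
import numpy as np

rng = np.random.default_rng(0)
dim = 10
X = rng.normal(size=(50, dim))                        # hypothetical hyponym embeddings
true_phi = rng.normal(size=(dim, dim))
Y = X @ true_phi + 0.01 * rng.normal(size=(50, dim))  # matching hypernym embeddings

phi, *_ = np.linalg.lstsq(X, Y, rcond=None)           # learn the projection matrix

def is_a_score(hypo_vec, hyper_vec):
    """Smaller distance -> more likely a hypernym-hyponym pair."""
    return np.linalg.norm(hypo_vec @ phi - hyper_vec)

print(is_a_score(X[0], Y[0]))  # small: a known pair
print(is_a_score(X[0], Y[1]))  # larger: a mismatched pair
```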
WordNet is one of the most valuable lexical resources in the Natural Language Processing community. Unfortunately, the benefits of building a WordNet for the Macedonian language have never been recognized. Due to the time- and labor-intensive process of manually building such a lexical resource, we were inspired to develop a method for its automated construction. In this paper, we present a new method for the construction of non-English WordNets that uses the Princeton implementation of WordNet as a backbone, along with Google's translation tool and search engine. We applied the new method to the construction of the Macedonian WordNet and developed a WordNet containing 17,553 words grouped into 33,276 synsets. The method, however, is general and can also be applied to other languages. Finally, we report the results of an experiment using the Macedonian WordNet as a means to improve the performance of text classification algorithms.
Unsupervised word representations are very useful in NLP tasks both as inputs to learning algorithms and as extra word features in NLP systems. However, most of these models are built with only local context and one representation per word. This is problematic because words are often polysemous and global context can also provide useful information for learning word meanings. We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word. We introduce a new dataset with human judgments on pairs of words in sentential context, and evaluate our model on it, showing that our model outperforms competitive baselines and other neural language models.
We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. We evaluate METEOR by measuring the correlation between the metric scores and human judgments of translation quality. We compute the Pearson R correlation value between its scores and human quality assessments of the LDC TIDES 2003 Arabic-to-English and Chinese-to-English datasets. We perform segment-by-segment correlation, and show that METEOR gets an R correlation value of 0.347 on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement over using simply unigram precision, unigram recall and their harmonic F1 combination. We also perform experiments to show the relative contributions of the various mapping modules.
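A heavily simplified sketch of the scoring idea, exact unigram matching only, with a recall-weighted harmonic mean and a fragmentation penalty, is shown below. The constants and the chunk definition follow the original METEOR description as we understand it; stemming, synonym matching and alignment search are omitted, so this is an illustration rather than the metric itself.

```python
# Simplified METEOR-flavoured sketch: exact unigram matches, recall-weighted
# F-mean (9:1), and a fragmentation penalty 0.5 * (chunks / matches)^3.
def meteor_lite(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    ref_remaining = list(ref)
    matched = []
    for w in hyp:
        if w in ref_remaining:
            ref_remaining.remove(w)
            matched.append(True)
        else:
            matched.append(False)
    m = sum(matched)
    if m == 0:
        return 0.0
    precision, recall = m / len(hyp), m / len(ref)
    fmean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks approximated as maximal runs of consecutively matched hypothesis words.
    chunks = sum(1 for i, flag in enumerate(matched)
                 if flag and (i == 0 or not matched[i - 1]))
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)

print(meteor_lite("the cat sat on the mat", "the cat is on the mat"))
```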
WordNet, the online English thesaurus and lexical database developed at Princeton University by George Miller and his colleagues (Fellbaum 1998), has proved to be an extremely important resource used in much research in computational linguistics where lexical knowledge of English is required. The goal of the EuroWordNet project is to create similar wordnets for other languages of Europe. The initial four languages are Dutch (at the University of Amsterdam), Italian (CNR, Pisa), Spanish (Fundacion Universidad Empresa), and English (University of Sheffield, adapting the original WordNet); later Czech, Estonian, German, and French will be added. The results of the project will be publicly available. Like the original Princeton WordNet, the new wordnets (now a generic term) are hierarchies in which each node is a synset: a word sense, with which one or more synonymous words or phrases is associated. The synsets are connected by relations such as hyponymy, meronymy, and antonymy. However, some improvements have been made to the original design of WordNet. New relationships, including relationships across parts of speech, have been introduced. For example, the verb adorn and the noun adornment are related by XPOS_NEAR_SYNONYMY; hyponymy and hyperonymy across parts of speech are also permitted. Semantic roles of verbs are marked; for example, the noun student is related to the verb teach by ROLE_PATIENT; the inverse relationship is called INVOLVED_PATIENT. Another new relationship, both within and across parts of speech, is causality, which may further be marked as intentional or non-factive; for example, to redden CAUSES red; to search CAUSES (non-factive, intentional) to find. Meronymy is much more fine-grained than in Princeton WordNet, with a number of new kinds of part-whole relationships. The most important new development, however, is multilinguality: the use of a common framework to build the individual wordnets and integrate them in a single database in which an inter-lingual index (ILI) connects the synsets that are "equivalent" in the different languages. EuroWordNet thus becomes a multilingual lexicon and thesaurus that could be used in applications such as multilingual text retrieval and (rather basic) lexical transfer in machine translation.
Evaluation is recognized as an extremely helpful forcing function in Human Language Technology R&D. Unfortunately, evaluation has not been a very powerful tool in machine translation (MT) research because it requires human judgments and is thus expensive and time-consuming and not easily factored into the MT research agenda. However, at the July 2001 TIDES PI meeting in Philadelphia, IBM described an automatic MT evaluation technique that can provide immediate feedback and guidance in MT research. Their idea, which they call an "evaluation understudy", compares MT output with expert reference translations in terms of the statistics of short sequences of words (word N-grams). The more of these N-grams that a translation shares with the reference translations, the better the translation is judged to be. The idea is elegant in its simplicity. But far more important, IBM showed a strong correlation between these automatically generated scores and human judgments of translation quality. As a result, DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work. This utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
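The core of the IBM idea can be sketched as clipped (modified) n-gram precision combined across orders by a geometric mean and multiplied by a brevity penalty. The snippet below is a simplified single-reference illustration without smoothing, not the NIST tool or a full multi-reference BLEU implementation.

```python
# Simplified BLEU-style sketch: clipped n-gram precisions (n = 1..4), geometric
# mean, and a brevity penalty. Single reference, no smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if overlap == 0:
            return 0.0                       # no smoothing in this toy version
        log_precisions.append(math.log(overlap / total))
    brevity = math.exp(min(0.0, 1.0 - len(ref) / len(hyp)))
    return brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the red mat"))
```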
For the purposes of the present discussion, the term structure will be used in the following non-rigorous sense: A set of phonemes or a set of data is structured in respect to some feature, to the extent that we can form in terms of that feature some organized system of statements which describes the members of the set and their interrelations (at least up to some limit of complexity). In this sense, language can be structured in respect to various independent features. And whether it is structured (to more than a trivial extent) in respect to, say, regular historical change, social intercourse, meaning, or distribution — or to what extent it is structured in any of these respects — is a matter decidable by investigation. Here we will discuss how each language can be described in terms of a distributional structure, i.e. in terms of the occurrence of parts (ultimately sounds) relative to other parts, and how this description is complete without intrusion of other features such as history or meaning. It goes without saying that other studies of language — historical, psychological, etc.—are also possible, both in relation to distributional structure and independently of it.
This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provides the best results in its class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.
There have been many proposals to compute similarities between words based on their distributions in contexts. However, these approaches do not distinguish between synonyms and antonyms. We present two methods for identifying synonyms among distributionally similar words.
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
Gupta, Dishan, Jaime Carbonell, Anatole Gershman, Steve Klein, and David Miller. Unsupervised Phrasal Near-Synonym Generation from Text Corpora. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Liu, Pengfei, Xipeng Qiu, and Xuanjing Huang. Learning Context-Sensitive Word Embeddings with Neural Tensor Skip-Gram Model. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Liu, Yang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical Word Embeddings. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Pak, Alexander Alexandrovich, Sergazy Sakenovich Narynov, Arman Serikuly Zharmagambetov, Sholpan Nazarovna Sagyndykova, Zhanat Elubaevna Kenzhebayeva, and Irbulat Turemuratovich. The Method of Synonyms Extraction from Unannotated Corpus. In Digital Information, Networking, and Wireless Communications (DINWC), 2015 Third International Conference on, pages 1-5. IEEE, 2015.
Fu, Ruiji, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. Learning Semantic Hierarchies via Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Long Papers, volume 1, 2014.
Gonzàlez, Meritxell, Alberto Barrón-Cedeño, and Lluís Màrquez. IPA and STOUT: Leveraging Linguistic and Source-based Features for Machine Translation Evaluation. In Proceedings of the 9th Workshop on Statistical Machine Translation (WMT), pages 394-401, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
van der Plas, Lonneke, Jörg Tiedemann, and Jean-Luc Manguin. Automatic Acquisition of Synonyms for French Using Parallel Corpora. In Distributed Agent-based Retrieval Tools, page 99, 2010.
Denkowski, Michael, and Alon Lavie. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Workshop on Statistical Machine Translation, 2014.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
van der Plas, Lonneke, and Jörg Tiedemann. Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 866-873. Association for Computational Linguistics, 2006.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics, 2002.
Toutanova, Kristina, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1. Association for Computational Linguistics, 2003.