Article

Semantic search extension based on Polish WordNet relations in business document exploration

Abstract

This paper addresses the problem of building a specialized semantic search engine for documents collected in small or medium-sized enterprises. It presents the results of a project that brought computer scientists and entrepreneurs together to develop a common perspective on implementing, in company practice, a search engine based on the semantic relations of the Polish version of WordNet. The core functionality of the search engine module is described, along with a discussion of how to arrange semantic similarity structures so as to ensure the efficient generation of relevant search results. Patterns and similarity coefficients for the hyperonymy, hyponymy, holonymy and meronymy relations are presented and analyzed for the purpose of producing relationship structures. Finally, the architecture of a system that can be implemented in a company is outlined.
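As a rough sketch of this kind of relation-based expansion, the snippet below collects expansion terms for a query word over weighted WordNet relations. NLTK's Princeton WordNet stands in for plWordNet here, and the per-relation coefficients in RELATION_WEIGHTS are illustrative assumptions, not the coefficients analyzed in the paper:

```python
from nltk.corpus import wordnet as wn

# Hypothetical similarity coefficients per relation type; the paper
# derives such coefficients from its analysis, these are placeholders.
RELATION_WEIGHTS = {"hypernym": 0.8, "hyponym": 0.7,
                    "holonym": 0.5, "meronym": 0.5}

def expand_term(term, min_weight=0.5):
    """Collect weighted expansion lemmas reachable over one relation hop."""
    expansions = {}
    for synset in wn.synsets(term):
        related = {
            "hypernym": synset.hypernyms(),
            "hyponym": synset.hyponyms(),
            "holonym": synset.member_holonyms() + synset.part_holonyms(),
            "meronym": synset.member_meronyms() + synset.part_meronyms(),
        }
        for relation, targets in related.items():
            weight = RELATION_WEIGHTS[relation]
            if weight < min_weight:
                continue
            for s in targets:
                for lemma in s.lemma_names():
                    expansions[lemma] = max(expansions.get(lemma, 0.0), weight)
    return expansions

print(sorted(expand_term("car").items())[:5])
```

In a real deployment the weights would come from the similarity analysis the paper describes, and expansion would run per disambiguated synset rather than over every sense of the surface form.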

Conference Paper
A narrative review of the literature on the importance of affective factors in the information retrieval (IR) behaviours of entrepreneurs is presented in this paper. Through the lens of Media Richness Theory, we examine the importance of the richness of information and of the medium to the success of the IR process. The results show that an IR system serves entrepreneurs not only with respect to their information needs but also their emotional needs. The richness of the IR system, in terms of its accessibility, is an essential factor in entrepreneurs' preference for and use of the IR medium. This paper contributes to the literature by showing how affective attributes are associated with other factors, in addition to their effect on IR behaviours. The findings reveal that affective characteristics are both an individual need and a determining component in the process.
Chapter
Full-text available
This work analyzes methods for investigating the similarity of text documents in the area of Polish law, based on well-known similarity measures for texts, among others the Jaccard, Cosine and Sørensen-Dice measures and the Vector Space Model (VSM), with the additional use of semantic relations from the Polish WordNet. A method for investigating the semantic similarity of legal texts that takes their specificity into account is proposed. A comparison of similarity results for the designated corpus of legal documents is presented, both without the use of the WordNet semantic network's lexical relations and with their participation.
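For concreteness, a minimal version of the three set-based measures named above, computed over raw token sets; a real legal-text pipeline would add lemmatization and the WordNet relations discussed in the chapter:

```python
import math
from collections import Counter

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def sorensen_dice(a, b):
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def cosine(a, b):
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

doc1 = "the court dismissed the appeal".split()
doc2 = "the appeal was dismissed by the court".split()
print(jaccard(doc1, doc2), sorensen_dice(doc1, doc2), cosine(doc1, doc2))
```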
Conference Paper
Full-text available
Noun phrases in queries are identified and classified into four types: proper names, dictionary phrases, simple phrases and complex phrases. A document has a phrase if all content words in the phrase are within a window of a certain size. The window sizes for different types of phrases are different and are determined using a decision tree. Phrases are more important than individual terms. Consequently, documents in response to a query are ranked with matching phrases given a higher priority. We utilize WordNet to disambiguate word senses of query terms. Whenever the sense of a query term is determined, its synonyms, hyponyms, words from its definition and its compound words are considered for possible additions to the query. Experimental results show that our approach yields between 23% and 31% improvements over the best-known results on the TREC 9, 10 and 12 collections for short (title only) queries, without using Web data.
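The windowed phrase test can be sketched in a few lines; the per-type window sizes below are invented placeholders, whereas the paper determines them with a decision tree:

```python
import itertools

# Illustrative window sizes per phrase type; the paper learns these
# with a decision tree rather than fixing them by hand.
WINDOW_BY_TYPE = {"proper_name": 2, "dictionary": 4, "simple": 8, "complex": 16}

def has_phrase(doc_tokens, phrase_words, window):
    """True if every content word of the phrase fits inside one window."""
    positions = [[i for i, tok in enumerate(doc_tokens) if tok == w]
                 for w in phrase_words]
    if any(not p for p in positions):
        return False  # some phrase word is absent from the document
    return any(max(combo) - min(combo) < window
               for combo in itertools.product(*positions))

doc = "query expansion with wordnet improves retrieval".split()
print(has_phrase(doc, ["query", "expansion"], WINDOW_BY_TYPE["simple"]))  # True
```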
Conference Paper
Full-text available
Applications such as office automation, news filtering, help facilities in complex systems, and the like require the ability to retrieve documents from full-text databases where vocabulary problems can be particularly severe. Experiments performed on small collections with single-domain thesauri suggest that expanding query vectors with words that are lexically related to the original query words can ameliorate some of the problems of mismatched vocabularies. This paper examines the utility of lexical query expansion in the large, diverse TREC collection. Concepts are represented by WordNet synonym sets and are expanded by following the typed links included in WordNet. Experimental results show that this query expansion technique makes little difference in retrieval effectiveness if the original queries are relatively complete descriptions of the information being sought, even when the concepts to be expanded are selected by hand. Less well developed queries can be significantly improved by expansion of hand-chosen concepts. However, an automatic procedure that can approximate the set of hand-picked synonym sets has yet to be devised, and expanding by the automatically generated synonym sets can degrade retrieval performance.
Conference Paper
Full-text available
Information Content (IC) is an important dimension of word knowledge when assessing the similarity of two terms or word senses. The conventional way of measuring the IC of word senses is to combine knowledge of their hierarchical structure from an ontology like WordNet with statistics on their actual usage in text as derived from a large corpus. In this paper we present a wholly intrinsic measure of IC that relies on hierarchical structure alone. We report that this measure is consequently easier to calculate, yet when used as the basis of a similarity mechanism it yields judgments that correlate more closely with human assessments than other, extrinsic measures of IC that additionally employ corpus analysis.
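A sketch of such an intrinsic IC measure, using the commonly cited formulation IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) counts the synsets subsumed by c and N is the total number of noun synsets; treat the exact formula as an assumption here:

```python
import math
from nltk.corpus import wordnet as wn

# Total number of noun synsets, used as the normalizing constant N.
TOTAL_NOUNS = len(list(wn.all_synsets("n")))

def intrinsic_ic(synset):
    # Count the synsets subsumed by this one in the hyponym hierarchy.
    hypo_count = len(list(synset.closure(lambda s: s.hyponyms())))
    return 1.0 - math.log(hypo_count + 1) / math.log(TOTAL_NOUNS)

print(intrinsic_ic(wn.synset("entity.n.01")))  # near 0: maximally general
print(intrinsic_ic(wn.synset("beagle.n.01")))  # 1.0: a leaf, maximally specific
```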
Conference Paper
Full-text available
In the area of information retrieval, the dimension of document vectors plays an important role. Firstly, with higher dimensions index structures suffer the "curse of dimensionality" and their efficiency rapidly decreases. Secondly, we may not use exact words when looking for a document, thus we miss some relevant documents. LSI (Latent Semantic Indexing) is a numerical method which discovers latent semantics in documents by creating concepts from existing terms. However, it is hard to compute LSI. In this article, we offer a replacement of LSI with a projection matrix created from the WordNet hierarchy and compare it with LSI.
Conference Paper
Full-text available
This paper proposes benchmarks for systems of automatic sense identification. A textual corpus in which open-class words had been tagged both syntactically and semantically was used to explore three statistical strategies for sense identification: a guessing heuristic, a most-frequent heuristic, and a co-occurrence heuristic. When no information about sense-frequencies was available, the guessing heuristic using the numbers of alternative senses in WordNet was correct 45% of the time. When statistics for sense-frequencies were derived from a semantic concordance, the assumption that each word is used in its most frequently occurring sense was correct 69% of the time; when that figure was calculated for polysemous words alone, it dropped to 58%. And when a co-occurrence heuristic took advantage of prior occurrences of words together in the same sentences, little improvement was observed. The semantic concordance is still too small to estimate the potential limits of a co-occurrence heuristic.
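The most-frequent-sense heuristic is easy to approximate with NLTK, which orders a word's synsets by their frequency in sense-tagged data, so taking the first synset serves as a stand-in for the heuristic described above:

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos="n"):
    """First synset approximates the most frequently occurring sense."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

sense = most_frequent_sense("bank")
print(sense.name(), "-", sense.definition())
```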
Article
Full-text available
Questions the metric and dimensional assumptions that underlie the geometric representation of similarity on both theoretical and empirical grounds. A new set-theoretical approach to similarity is developed in which objects are represented as collections of features and similarity is described as a feature-matching process. Specifically, a set of qualitative assumptions is shown to imply the contrast model, which expresses the similarity between objects as a linear combination of the measures of their common and distinctive features. Several predictions of the contrast model are tested in studies of similarity with both semantic and perceptual stimuli. The model is used to uncover, analyze, and explain a variety of empirical phenomena such as the role of common and distinctive features, the relations between judgments of similarity and difference, the presence of asymmetric similarities, and the effects of context on judgments of similarity. The contrast model generalizes standard representations of similarity data in terms of clusters and trees. It is also used to analyze the relations of prototypicality and family resemblance.
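The contrast model itself is compact enough to state directly in code. With the feature function f taken as set cardinality and illustrative weights, sim(A, B) = theta*f(A and B) - alpha*f(A - B) - beta*f(B - A):

```python
def tversky_contrast(a, b, theta=1.0, alpha=0.5, beta=0.5):
    """Contrast model with f = set cardinality; weights are illustrative."""
    a, b = set(a), set(b)
    return theta * len(a & b) - alpha * len(a - b) - beta * len(b - a)

bird = {"wings", "feathers", "flies", "lays_eggs"}
bat = {"wings", "fur", "flies", "nocturnal"}
print(tversky_contrast(bird, bat))            # symmetric when alpha == beta
print(tversky_contrast(bird, bat, alpha=0.8,  # asymmetric otherwise, matching
                       beta=0.2))             # the asymmetries the model explains
```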
Article
Full-text available
WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
Conference Paper
Full-text available
It is well known that ontologies will become a key piece, as they allow making the semantics of semantic Web content explicit. In spite of the big advantages that the semantic Web promises, there are still several problems to solve. Those concerning ontologies include their availability, development, and evolution. In the area of information retrieval, the dimension of document vectors plays an important role. Firstly, with higher index dimensions the indexing structures suffer from the "curse of dimensionality" and their efficiency rapidly decreases. Secondly, we may not use exact words when looking for a document, thus we miss some relevant documents. LSI is a numerical method which discovers latent semantics in documents by creating concepts from existing terms. In this paper we present a basic method of mapping LSI concepts onto a given ontology (WordNet), used both for retrieval recall improvement and dimension reduction. We offer experimental results for this method on a subset of the TREC collection, consisting of Los Angeles Times articles.
Article
Full-text available
Motivated by the properties of spreading activation and conceptual distance, the authors propose a metric, called distance, on the power set of nodes in a semantic net. Distance is the average minimum path length over all pairwise combinations of nodes between two subsets of nodes. Distance can be successfully used to assess the conceptual distance between sets of concepts when used on a semantic net of hierarchical relations. When other kinds of relationships, like 'cause', are used, distance must be amended, but then it can again be effective. The judgements of distance significantly correlate with the distance judgements that people make and help to determine whether one semantic net is better or worse than another. The authors focus on the mathematical characteristics of distance, which present novel cases and interpretations. Experiments in which distance is applied to pairs of concepts and to sets of concepts in a hierarchical knowledge base show the power of hierarchical relations in representing information about the conceptual distance between concepts.
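A minimal sketch of the set-to-set distance metric, the average of minimum path lengths over all pairs drawn from the two concept sets, computed here with plain breadth-first search on a toy is-a graph:

```python
from collections import deque

def shortest_path(graph, src, dst):
    """Minimum path length between two nodes via breadth-first search."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

def set_distance(graph, set_a, set_b):
    """Average minimum path length over all pairs from the two sets."""
    pairs = [(a, b) for a in set_a for b in set_b]
    return sum(shortest_path(graph, a, b) for a, b in pairs) / len(pairs)

# Tiny undirected is-a hierarchy, for illustration only.
g = {"vehicle": ["car", "bicycle"], "car": ["vehicle", "sedan"],
     "bicycle": ["vehicle"], "sedan": ["car"]}
print(set_distance(g, {"sedan"}, {"bicycle", "vehicle"}))  # 2.5
```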
Article
Full-text available
The classical, vector space model for text retrieval is shown to give better results (up to 29% better in our experiments) if WordNet synsets are chosen as the indexing space, instead of word forms. This result is obtained for a manually disambiguated test collection (of queries and documents) derived from the SemCor semantic concordance. The sensitivity of retrieval performance to (automatic) disambiguation errors when indexing documents is also measured. Finally, it is observed that if queries are not disambiguated, indexing by synsets performs (at best) only as well as standard word indexing.
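A toy illustration of synset indexing: each word is replaced by an identifier of a chosen sense, so synonyms collapse to one index term. Picking the first synset is a crude stand-in for the manual disambiguation used in the paper:

```python
from nltk.corpus import wordnet as wn

def synset_index_terms(tokens):
    """Map each token to its first synset's name, falling back to the token."""
    return [wn.synsets(t)[0].name() if wn.synsets(t) else t for t in tokens]

print(synset_index_terms(["car"]))         # ['car.n.01']
print(synset_index_terms(["automobile"]))  # ['car.n.01'] - synonyms collapse
```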
Article
Semantic search and information retrieval form an integral part of various search engines in use. Famous search engines such as Yahoo, Google and Lycos use the concept of semantic search, where the only comparator for the objects under study is the semantic similarity between the objects. The general method involves document-to-document similarity search, a sequential search of documents one after the other that is subject to numerous noise effects. An efficient way of improving this technique is Latent Semantic Indexing (LSI). LSI maps the words under study onto a conceptual space, which depends on the queries and the document collection. It uses a mathematical function called Singular Value Decomposition to determine the similarity between the words, utilizing the words under study and the ones being compared, and produces appropriate results. The results obtained are free of semantic problems such as synonymy and polysemy. Integrating WordNet, a large lexical database of the English language, is an efficient way to improve the search results: the word under consideration is linked to the application, its semantic similarities are found, and documents matching these similarities are then indexed and listed. The proposed model is tested with a standard set of Forum for Information Retrieval Evaluation (FIRE) documents, and a comparison with term-based search has been performed.
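A compact LSI sketch with plain NumPy shows the mechanics described above: factor a term-document matrix with Singular Value Decomposition and compare documents in the reduced concept space. The corpus and the rank k are toy assumptions:

```python
import numpy as np

# Toy corpus; the two topic pairs share vocabulary through "car" and "fruit".
docs = ["car engine repair", "car motor service",
        "fresh fruit market", "fruit vegetable stall"]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], float)

# Factor the term-document matrix and keep k latent concepts.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(doc_vecs[0], doc_vecs[1]))  # expected high: same latent topic
print(cos(doc_vecs[0], doc_vecs[2]))  # expected low: different topic
```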
Book
Summary: Solr in Action is a comprehensive guide to implementing scalable search using Apache Solr. This clearly written book walks you through well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries. It will give you a deep understanding of how to implement core Solr capabilities.
About the book: Whether you're handling big (or small) data, managing documents, or building a website, it is important to be able to quickly search through your content and discover meaning in it. Apache Solr is your tool: a ready-to-deploy, Lucene-based, open source, full-text search engine. Solr can scale across many servers to enable real-time queries and data analytics across billions of documents. Solr in Action teaches you to implement scalable search using Apache Solr. This easy-to-read guide balances conceptual discussions with practical examples to show you how to implement all of Solr's core capabilities. You'll master topics like text analysis, faceted search, hit highlighting, result grouping, query suggestions, multilingual search, advanced geospatial and data operations, and relevancy tuning. This book assumes basic knowledge of Java and standard database technology. No prior knowledge of Solr or Lucene is required.
What's inside:
• How to scale Solr for big data
• Rich real-world examples
• Solr as a NoSQL data store
• Advanced multilingual, data, and relevancy tricks
• Coverage of versions through Solr 4.7
About the authors: Trey Grainger is a director of engineering at CareerBuilder. Timothy Potter is a senior member of the engineering team at LucidWorks. The authors work on the scalability and reliability of Solr, as well as on recommendation engine and big data analytics technologies.
Article
A large number of different tags, limited corpora and the free word order are the main causes of the low accuracy of tagging in Polish (automatic disambiguation of morphological descriptions) by applying commonly used techniques based on stochastic modelling. In the paper, the rule-based architecture of the TaKIPI Polish tagger, combining handwritten and automatically extracted rules, is presented. The possibilities of optimising its parameters and components are discussed, including the possibility of using methods of rule extraction other than the C4.5 decision trees applied initially. The main goal of this paper is to explore a range of promising rule-based classifiers and investigate their impact on the accuracy of tagging. Simple techniques of combining classifiers are also tested. The performed experiments have shown that even a simple combination of different classifiers can increase the tagger's accuracy by almost one percent.
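The simple classifier-combination idea can be illustrated with a majority vote over taggers; the three "taggers" below are hypothetical stand-ins, not components of TaKIPI:

```python
from collections import Counter

def combine_taggers(token, taggers):
    """Majority vote over the tags proposed by individual taggers."""
    votes = Counter(tagger(token) for tagger in taggers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-in taggers; TaKIPI's actual components differ.
tagger_a = lambda tok: "subst:sg:nom"
tagger_b = lambda tok: "subst:sg:nom"
tagger_c = lambda tok: "adj:sg:nom"
print(combine_taggers("dom", [tagger_a, tagger_b, tagger_c]))  # subst:sg:nom
```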
Article
This study proposes a new way of using WordNet for Query Expansion (QE). We choose candidate expansion terms, as usual, from a set of pseudo relevant documents; however, the usefulness of these terms is measured based on their definitions provided in a hand-crafted lexical resource like WordNet. Experiments with a number of standard TREC collections show that this method outperforms existing WordNet based methods. It also compares favorably with established QE methods such as KLD and RM3. Leveraging earlier work in which a combination of QE methods was found to outperform each individual method (as well as other well-known QE methods), we next propose a combination-based QE method that takes into account three different aspects of a candidate expansion term's usefulness: (i) its distribution in the pseudo relevant documents and in the target corpus, (ii) its statistical association with query terms, and (iii) its semantic relation with the query, as determined by the overlap between the WordNet definitions of the term and query terms. This combination of diverse sources of information appears to work well on a number of test collections, viz., TREC123, TREC5, TREC678, TREC robust new and TREC910 collections, and yields significant improvements over competing methods on most of these collections.
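The definition-overlap idea can be sketched as follows; the tokenization and the Jaccard-style score are simplifying assumptions, not the paper's exact formulation:

```python
from nltk.corpus import wordnet as wn

def gloss_words(term):
    """All words appearing in the WordNet definitions of a term's senses."""
    words = set()
    for synset in wn.synsets(term):
        words.update(synset.definition().lower().split())
    return words

def semantic_score(candidate, query_terms):
    """Jaccard overlap between candidate's glosses and the query's glosses."""
    cand = gloss_words(candidate)
    query = set().union(*(gloss_words(q) for q in query_terms))
    return len(cand & query) / len(cand | query) if cand | query else 0.0

print(semantic_score("automobile", ["car", "vehicle"]))
```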
Conference Paper
This paper presents a new model for measuring semantic similarity in the taxonomy of WordNet. The model takes the path length between two concepts and the IC value of each concept as its metrics; furthermore, the weights of the two metrics can be adjusted manually. To evaluate our model, traditional and widely used datasets are employed. Firstly, coefficients of correlation between human ratings of similarity and six computational models are calculated; the results show that our new model outperforms its homologues. Then, the distribution graphs of the similarity values of 65 word pairs are discussed, showing that our model has no fault zone and is more centralized than the other five methods. Thus, our model can make up for the insufficiency of methods that use only one metric (path length or IC value).
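A sketch of a score that blends the two metrics named above with a manually tunable weight; the linear combination form and the use of NLTK's Lin measure for the IC part are illustrative choices, not the paper's exact model:

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")  # corpus-derived IC values

def blended_similarity(s1, s2, weight=0.5):
    path_part = s1.path_similarity(s2) or 0.0  # path-length metric, (0, 1]
    ic_part = s1.lin_similarity(s2, brown_ic)  # IC-based metric, [0, 1]
    return weight * path_part + (1 - weight) * ic_part

a, b = wn.synset("car.n.01"), wn.synset("bicycle.n.01")
print(blended_similarity(a, b, weight=0.3))
```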
Conference Paper
Query expansion is an effective technique to improve the performance of information retrieval systems. Although hand-crafted lexical resources, such as WordNet, could provide more reliable related terms, previous studies showed that query expansion using only WordNet leads to very limited performance improvement. One of the main challenges is how to assign appropriate weights to expanded terms. In this paper, we re-examine this problem using recently proposed axiomatic approaches and find that, with an appropriate term weighting strategy, we are able to exploit the information from lexical resources to significantly improve the retrieval performance. Our empirical results on six TREC collections show that query expansion using only hand-crafted lexical resources leads to significant performance improvement. The performance can be further improved if the proposed method is combined with query expansion using co-occurrence-based resources.
Article
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
Article
Semantic similarity between words is becoming a generic problem for many applications of computational linguistics and artificial intelligence. This paper explores the determination of semantic similarity by a number of information sources, which consist of structural semantic information from a lexical taxonomy and information content from a corpus. To investigate how information sources could be used effectively, a variety of strategies for using various possible information sources are implemented. A new measure is then proposed which combines information sources nonlinearly. Experimental evaluation against a benchmark set of human similarity ratings demonstrates that the proposed measure significantly outperforms traditional similarity measures.
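One widely quoted nonlinear form in this line of work combines path length and subsumer depth exponentially; the sketch below uses that form with the constants alpha = 0.2 and beta = 0.6, treated here as assumptions rather than the paper's exact parameters:

```python
import math
from nltk.corpus import wordnet as wn

def nonlinear_similarity(s1, s2, alpha=0.2, beta=0.6):
    """exp(-alpha*path) scaled by tanh(beta*depth) of the common subsumer."""
    path_len = s1.shortest_path_distance(s2)
    if path_len is None:
        return 0.0
    subsumers = s1.lowest_common_hypernyms(s2)
    depth = max(s.max_depth() for s in subsumers) if subsumers else 0
    return math.exp(-alpha * path_len) * math.tanh(beta * depth)

print(nonlinear_similarity(wn.synset("car.n.01"), wn.synset("truck.n.01")))
```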
Article
Similarity is an important and widely used concept. Previous definitions of similarity are tied to a particular application or a form of knowledge representation. We present an information-theoretic definition of similarity that is applicable as long as there is a probabilistic model. We demonstrate how our definition can be used to measure the similarity in a number of different domains.
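Instantiated on WordNet, this definition becomes sim(a, b) = 2 * IC(lcs(a, b)) / (IC(a) + IC(b)), which NLTK ships directly:

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")
dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.lin_similarity(cat, brown_ic))  # 2*IC(lcs) / (IC(dog) + IC(cat))
```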
Article
This paper presents a new measure of semantic similarity in an is-a taxonomy, based on the notion of information content. Experimental evaluation suggests that the measure performs encouragingly well (a correlation of r = 0.79 with a benchmark set of human similarity judgments, against an upper bound of r = 0.90 for human subjects performing the same task), and significantly better than the traditional edge-counting approach (r = 0.66).
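Resnik's measure, the information content of the lowest common subsumer, is likewise available off the shelf in NLTK:

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")
car, bike = wn.synset("car.n.01"), wn.synset("bicycle.n.01")
print(car.res_similarity(bike, brown_ic))  # IC of the lowest common subsumer
```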
Angelos Hliaoutakis, Giannis Varelas, Epimenidis Voutsakis, Euripides G. M. Petrakis, and Evangelos Milios. 2006. Information retrieval by semantic similarity. International Journal on Semantic Web and Information Systems (IJSWIS) 2, 3 (2006), 55-73.
Jiuling Zhang, Beixing Deng, and Xing Li. 2009. Concept based query expansion using WordNet. In Proceedings of the 2009 International e-Conference on Advanced Science and Technology. IEEE Computer Society, 52-55.