Article

Abstract

The interaction of suffixing algorithms and ranking techniques in retrieval performance, particularly in an online environment, was investigated. Three general purpose suffixing algorithms were used for retrieval on the Cranfield 1400, Medlars, and CACM test collections, with no significant improvement in performance shown for any of the algorithms. A failure analysis suggested three modifications to ranking techniques: variable weighting of term variants, selective stemming depending on query length, and selective stemming depending on term importance. None of these modifications improved performance. Recommendations are made regarding the uses of suffixing in an online environment.
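As a concrete illustration of one of the modifications named above, the following is a minimal sketch of selective stemming depending on query length: only short queries are expanded with term variants sharing a stem. The threshold, function names, and the use of NLTK's Porter stemmer are illustrative assumptions, not the paper's implementation; note that the paper found this modification did not improve performance.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def expand_query(query_terms, vocabulary, max_len_for_stemming=3):
    # Hypothetical parameter: only queries with few terms are expanded,
    # since short queries offer fewer chances to match relevant documents.
    if len(query_terms) > max_len_for_stemming:
        return list(query_terms)          # long query: keep terms as given
    stems = {stemmer.stem(t) for t in query_terms}
    return [w for w in vocabulary if stemmer.stem(w) in stems]

vocab = ["retrieve", "retrieval", "retrieved", "ranking", "ranked"]
print(expand_query(["retrieval"], vocab))  # ['retrieve', 'retrieval', 'retrieved']
```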




... Morphology is the branch of linguistics concerned with the structure of words. In general terms, it studies the different combinations of morphemes, that is, the smallest meaningful units that make up words [Porter 1980, Harman 1991, Gaussier & Stéfanini 2003]. Morphological analysis consists of recognizing the different variant forms of words. ...
... Stemming, by contrast, consists of transforming the inflected forms of a word into their root by removing affixes (prefixes and suffixes). Unlike lemmas, which are real words of the language, roots may not be real words. Several research works have shown that stemming and lemmatization significantly improve IR performance [Porter 1980, Harman 1991, Abu El-Khair 2007]. ...
... With this collection available, it became possible to conduct the first experiments evaluating the results returned by an IR system. This evaluation paradigm has inspired researchers in the IR field [Harman 1992, Peters & Braschler 2001, Gövert & Kazai, Harman & Voorhees 2006], mainly through the reproducibility of results and experiments it allows, and through the availability of evaluation campaigns such as TREC, INEX, CLEF, FIRE and NTCIR. In general, the evaluation collections built by these initiatives consist of three main elements: ...
Thesis
Full-text available
Given the amount of Arabic textual information available on the web, developing effective Information Retrieval Systems (IRS) has become essential to retrieve relevant information. Most current Arabic IRSs are based on the bag-of-words representation, where documents are indexed using surface words, roots or stems. Two main drawbacks of the latter representation are the ambiguity of Single Word Terms (SWTs) and term mismatch. The aim of this work is to deal with SWT ambiguity and term mismatch. Accordingly, we propose four contributions to improve Arabic content representation, indexing, and retrieval. The first contribution consists of representing Arabic documents using Multi-Word Terms (MWTs). The latter is motivated by the fact that MWTs are more precise representational units and less ambiguous than isolated SWTs. Hence, we propose a hybrid method to extract Arabic MWTs, which combines linguistic and statistical filtering of MWT candidates. The linguistic filter uses POS tagging to identify MWT candidates that fit a set of syntactic patterns and handles the problem of MWT variation. Then, the statistical filter ranks MWT candidates using our proposed association measure, which combines contextual information with both termhood and unithood measures. In the second contribution, we explore and evaluate several IR models for ranking documents using both SWTs and MWTs. Additionally, we investigate a wide range of proximity-based IR models for Arabic IR. Then, we introduce a formal condition that IR models should satisfy to deal adequately with term dependencies. The third contribution consists of a method based on distributed representations of words, namely Word Embedding (WE), for Arabic IR. It relies on incorporating WE semantic similarities into existing probabilistic IR models in order to deal with term mismatch. The aim is to allow distinct, but semantically similar, terms to contribute to document scores. The last contribution is a method to incorporate WE similarity into Pseudo-Relevance Feedback (PRF) for Arabic information retrieval. The main idea is to select expansion terms using their distribution in the set of top pseudo-relevant documents along with their similarity to the original query terms. The experimental validation of all the proposed contributions is performed using the standard Arabic TREC 2002/2001 collection.
... First, many stemmers produce terms that are not recognizable English words and may be difficult to map back to a valid original word, such as "stai" as the Porter stem of "stay". Second, although stemming aids document retrieval for many languages, English is a notorious exception (Harman, 1991). In English, the complexity of compound affixes with meaning can lead to overstemming, such as "recondition," a word sharing a stem but not a root meaning with "recondite." ...
... While stemmers are used in topic modeling, we know of no analysis focused on their effect. We draw inspiration from prior studies of the effects of stemming for other tasks and models (Harman, 1991; Han et al., 2012; Jivani, 2011; Rani et al., 2015) to apply rule-based stemmers to a variety of corpora to test their effect on topic models. We evaluate the quantitative fit of the models generated and the qualitative differences between differently-stemmed corpora to investigate the effects each stemmer has on a corpus. ...
... The S-removal stemmer or "S" stemmer (SS) removes S-based endings using only three rules. Harman (1991) introduces the "S" stemming algorithm as a weaker and simpler counterpoint to more standard rule-based stemmers. As the rules are simple and good representatives of the types of rules employed by the other stemmers in this section, we include them in Table 1. ...
Article
Rule-based stemmers such as the Porter stemmer are frequently used to preprocess English corpora for topic modeling. In this work, we train and evaluate topic models on a variety of corpora using several different stemming algorithms. We examine several different quantitative measures of the resulting models, including likelihood, coherence, model stability, and entropy. Despite their frequent use in topic modeling, we find that stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability.
... Similarly, while the role of stemming in IR is well explored for English, results are controversial [77,248,452]. This is a recurring issue in several studies [196,213,255] with many sources of variation, such as linguistic vs non-linguistic stemmers, language, query and document length, and evaluation measures [248]. ...
... linguistic vs non-linguistic stemmers, language, query and document length and evaluation measures [248]. Hull [213] found that the average absolute improvement ranged from 1-3% and Harman [196] found no statistically significant difference between stemming and non-stemming. ...
Preprint
Several tasks in information retrieval (IR) rely on assumptions regarding the distribution of some property (such as term frequency) in the data being processed. This thesis argues that such distributional assumptions can lead to incorrect conclusions and proposes a statistically principled method for determining the "true" distribution. This thesis further applies this method to derive a new family of ranking models that adapt their computations to the statistics of the data being processed. Experimental evaluation shows results on par or better than multiple strong baselines on several TREC collections. Overall, this thesis concludes that distributional assumptions can be replaced with an effective, efficient and principled method for determining the "true" distribution and that using the "true" distribution can lead to improved retrieval performance.
... The power of an IR system to conflate words enhances the recall (Kraaij and Pohlmann, 1996). Although the effectiveness of stemmers has always been argued (Harman 1987, 1991), stemming is still kept as an active method in the IR toolbox. ...
... He compared his algorithm with that of Porter (1980) and reported that his technique is a heavy stemmer, i.e. it has a tendency to over-stem. Harman (1991) studied the effectiveness of stemming and suggested a weak stemmer, called the 'S' stemmer. This is a very simple algorithm that conflates the plural and singular forms of English nouns. ...
Article
Full-text available
A language-independent stemmer has always been looked for. The single N-gram tokenization technique works well; however, it often generates stems that start with intermediate characters, rather than initial ones. We present a novel technique that takes the concept of N-gram stemming one step further and compare our method with an established algorithm in the field, Porter's stemmer, for the English, Spanish, and Portuguese languages. Results indicate that our N-gram stemmer is comparable with Porter's linguistic stemmer.
... After tokenization, we use stemming to further normalize the morphological variants of the base word. In this paper, we use the S stemmer [46], which removes only a few common word endings and is less aggressive than other stemmers. ...
Chapter
Spreading of automatically generated clickbaits, fake news, and fake reviews undermines the veracity of the internet as a credible source of information. We investigate the problem of recognizing automatically generated short texts by exploring different Deep Learning models. To improve the classification results, we use text augmentation techniques and classifier hyperparameter optimization. For word embedding and vectorization we use GloVe and RoBERTa. We compare the performance of a dense neural network, a convolutional neural network, a gated recurrent network, and a hierarchical attention network. The experiments on the TweepFake dataset achieved an 89.7% accuracy.
... As Avram aims at very high precision, very conservative stemming was applied. The S-Stemmer, presented in [Har91], is a very minimal stemmer that only converts plural forms to singular ones. The author reports that this stemmer provides smaller advantages than the Porter stemmer, but the number of queries that are negatively affected is also significantly lower. ...
Thesis
Full-text available
Recently, we observe increasing demand for software systems that allow large-scale information retrieval. These systems are starting to become critical parts of many business applications. Numerous services of this kind are built upon the Apache Lucene library and related open-source solutions. This thesis investigates the design process of a search system based on the aforementioned technologies, with a focus on incorporating the latest research results. These techniques allow building more efficient systems and providing the users with semantic search tools. The methods presented in this thesis are implemented in an example academic search service. Covered topics include all the phases of search system design: harvesting and preprocessing of the documents, indexation, querying, relevance ranking and retrieval performance evaluation. Furthermore, techniques for link analysis, topic extraction and query expansion with use of lightweight ontologies are presented. All the methods are also discussed in the context of distributed processing. This thesis shows that creation of a highly efficient and specialized vertical search service is possible with help of open-source software. It also argues that a deep domain analysis and development of many extensions are necessary to achieve this goal.
... This helps when handling words that share the same core meaning, thus playing an important role in IR. According to (Harman 1991), grouping words with the same root (or stem) increases the success with which documents can be matched against a query. For Arabic, unlike the English language, the stripping of suffixes alone would not be sufficient for the purpose of IR. ...
Article
Full-text available
Question answering is a subfield of information retrieval. It is the task of answering a question posed in a natural language. A question answering system (QAS) may be considered a good alternative to search engines that return a set of related documents. A QAS is composed of three main modules: question analysis, passage retrieval, and answer extraction. Over the years, numerous QASs have been presented for use in different languages. However, the development of Arabic QASs has been slowed by linguistic challenges and the lack of resources and tools available to researchers. In this survey, we start with the challenges posed by the language and how these challenges make the development of new Arabic QASs more difficult. Next, we give a detailed review of several Arabic QASs. This is followed by an in-depth analysis of the techniques and approaches in the three modules of a QAS. We present an overview of important and recent tools that were developed to help researchers in this field. We also cover the available Arabic and multilingual datasets, and look at the different measures used to assess QASs. Finally, the survey delves into the future direction of Arabic QASs based on the current state-of-the-art techniques developed for question answering in other languages.
... After stop word removal, we normalize the content further and apply stemming [40] based on the Porter stemmer [33] when appropriate. Previous work [15,16,19,36] suggests that stemming does not work equally well in all languages, hence we only apply stemming to English content. Lemmatization (determining the canonical form of a set of words depending on the context) is not considered since previous work showed that it can even degrade performance when applied to English content. ...
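The passage above describes stemming applied only to English content. Below is a minimal sketch of that language-conditional normalization, assuming NLTK's Porter stemmer and a caller-supplied language code; the cited system's actual pipeline and its language detection are not specified here.

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def normalize_tokens(tokens, language):
    # Stem only English content; other languages pass through unchanged.
    if language == "en":
        return [_stemmer.stem(t) for t in tokens]
    return tokens

print(normalize_tokens(["connected", "connections"], "en"))  # ['connect', 'connect']
print(normalize_tokens(["verbindungen"], "de"))              # ['verbindungen']
```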
Preprint
Full-text available
In this paper, we analyze the topology and the content found on the "darknet", the set of websites accessible via Tor. We created a darknet spider and crawled the darknet starting from a bootstrap list by recursively following links. We explored the whole connected component of more than 34,000 hidden services, of which we found 10,000 to be online. Contrary to folklore belief, the visible part of the darknet is surprisingly well-connected through hub websites such as wikis and forums. We performed a comprehensive categorization of the content using supervised machine learning. We observe that about half of the visible dark web content is related to apparently licit activities based on our classifier. A significant amount of content pertains to software repositories, blogs, and activism-related websites. Among unlawful hidden services, most pertain to fraudulent websites, services selling counterfeit goods, and drug markets.
... Harman [34] compares the performance of data stemmed with three suffix-stripping algorithms for the English language against unstemmed data in information retrieval queries. The researcher comes to the conclusion that stemming does not consistently improve performance. ...
Book
Full-text available
Because of the disturbing advertisements in many browsers, a new browser was designed to filter all these advertisements. This book explains the way of filtering what is called SPAM and the problems that arise from it. Web spam consists of Web pages that are undesired and noisy. Most spammers exploit events to flex their muscles, displaying posters and ads to promote their products and trivial pictures.
... related to a variation in number, gender, or grammatical case). For the English language, the S-stemmer (Harman, 1991) applies three ordered rules to replace the plural form of a word with the corresponding singular form (e.g. the last rule is to remove the ending '-s' unless the word ends in '-ss' or '-us'). ...
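Since the S-stemmer's three ordered rules are short, a sketch of them fits in a few lines. The rule set below follows Harman's published description; the minimum-length guard and the example words are our own illustrative assumptions.

```python
def s_stem(word: str) -> str:
    # Ordered rules; only the first matching rule fires, so each word
    # is modified at most once.
    if len(word) < 3:
        return word                          # guard: an assumption, not in the paper
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"               # queries -> query
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                     # tribes -> tribe
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]                     # stems -> stem
    return word

print([s_stem(w) for w in ["queries", "tribes", "stems", "glass", "focus"]])
# ['query', 'tribe', 'stem', 'glass', 'focus']
```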
Article
Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. This problem is closely related to authorship attribution questions and its solution can be used in the author clustering task. However, no training information is provided and the solution must be unsupervised. To achieve this, various text representation strategies can be applied, such as characters, punctuation symbols, or letter n-grams as well as words, lemmas, Part-Of-Speech (POS) tags, and sequences of them. To estimate the stylistic distance (or similarity) between two text excerpts, different measures have been suggested based on the L1 norm (e.g., Manhattan, Tanimoto), the L2 norm (e.g., Matusita), the inner product (e.g., Cosine), or the entropy paradigm (e.g., Jeffrey divergence). From those possible implementations, it is not clear which text representation and distance functions produce the best performance and this study provides an answer to this question. Three corpora, extracted from French and English literature, have been evaluated using standard methodology. Moreover, we suggest an additional performance measure called high precision capable of judging the quality of a ranked list of links to provide only correct answers. No systematic difference can be found between token- or lemma-based text representations. Simple POS tags do not provide an effective solution but short sequences of them form a good text representation. Letter n-grams (with n = 4 to 6) give high precision rates. As distance measures, this study found that the Tanimoto, Matusita, and Clark distance measures perform better than the often-used Cosine function. Finally, applying a pruning procedure (e.g., culling terms appearing once or twice or limiting the vocabulary to the 500 most frequent words) reduces the representation complexity and might even improve the effectiveness of the attribution scheme.
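To make the distance families named in the abstract concrete, here is a minimal sketch of three of them over relative-frequency vectors of stylistic markers. Manhattan and Cosine are standard; for Matusita we use the common definition (Euclidean distance between square roots of frequencies), though the exact variants used in the study may differ.

```python
import math

def manhattan(p, q):
    # L1 distance between two relative-frequency vectors
    return sum(abs(a - b) for a, b in zip(p, q))

def matusita(p, q):
    # Euclidean distance between the square roots of the frequencies
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norms

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]   # toy word-frequency profiles
print(manhattan(p, q), matusita(p, q), cosine_distance(p, q))
```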
... However, variation and inconsistency are also observed in the results reported by some studies. In some cases, stemming did not show any consistent average performance improvement (Harman, 1991) or showed very little improvement (Hull, 1996). On the other hand, Krovetz (1993) reported an increase in performance by a factor ranging between 15% and 35%. ...
Conference Paper
Full-text available
Due to the affix structure of words in Arabic, a given word may take different forms in different contexts. These variants may not be recognized as semantically equivalent in IR without some processing; stemming is one of the most commonly used techniques for doing so. The research reported in this paper evaluates the retrieval effectiveness of four different stemming algorithms for Arabic information retrieval systems, namely those reported by Khoja, Taghva, Mustafa, and Aljlayl, and compares their performance with no stemming. The first two are considered heavy stemmers, while the others are classified as light stemmers. The evaluation was based on a set of 477 documents on medical herbs comprising more than 95856 Arabic words. The index terms were prepared by an expert in the field. Three performance metrics were used in this study: average precision at recall 10, P@5, and R-precision. The results indicated that all stemmers significantly outperformed zero stemming. However, light stemming algorithms showed better performance than heavy stemming algorithms in all experiments using the three evaluation metrics.
... For example, Chinese verbs are not conjugated and nouns in Chinese are usually not pluralized by adding an ending. A host of studies have shown stemming to be an effective form of preprocessing in English; however, the benefits are both application- and language-specific (Salton 1989; Harman 1991; Krovetz 1995; Hull 1996; Hollink et al. 2004; Manning, Raghavan, and Schutze 2008). Stemming is an approximation to a more general goal called lemmatization: identifying the base form of a word and grouping these words together. ...
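The stemming-versus-lemmatization contrast drawn above can be shown in a few lines; this sketch assumes NLTK's Porter stemmer and WordNet lemmatizer (the latter needs the 'wordnet' data package) purely as illustrative stand-ins.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()   # requires: nltk.download('wordnet')

# Stemming clips suffixes and may leave non-words; lemmatization returns a
# real base form, and the answer depends on the part of speech supplied.
print(stemmer.stem("studies"))                    # studi
print(lemmatizer.lemmatize("studies", pos="v"))   # study
print(stemmer.stem("meeting"))                    # meet
print(lemmatizer.lemmatize("meeting", pos="n"))   # meeting
```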
Article
The intensively propagated moral construct of ma’naviyat penetrates all fields of public life in present-day Uzbekistan. The originally religious term is used as the moral foundation of the state’s official ‘ideology of national independence’. Portrayed as a return of the Uzbek people to their pre-Soviet past and their innate values and traditions, the ideological concept is preached as the only effective remedy for overcoming the negative Soviet legacies. Yet the analysis of the phenomenon shows that both the conceptualization of ma’naviyat at large and the concept’s underlying rhetoric, ideas and structures reveal many analogies to Soviet times. By offering a detailed linguistic analysis of a range of official writings, I argue that the discourse about ma’naviyat works – similarly to Soviet ideological patterns – as a strong legitimizing factor of today’s authoritarian regime.
... This is due to the fact that prefixes can change a word's meaning entirely. Among the stemmers for the English language are the "S" stemmer [26], the stemmer proposed by Lovins [27], and Porter's stemmer [28]; Snowball is a framework that extends Porter's English stemmer to other languages [29]. ...
Conference Paper
Full-text available
A morphologically rich language such as Arabic requires deep analysis; this is due to its invaluable characteristics, which are beneficial for the task of root extraction. This paper investigates employing new techniques to enumerate and rank possible roots for a given word, using linguistic rules as scoring mechanisms. The proposed technique extends the use of a roots dictionary to extract new features in order to develop a more accurate root extractor. The proposed root extractor showed an accuracy of 83.9%, with at least 11.8% accuracy difference over other root extractors, using a direct evaluation dataset.
... ICF shows how much a stemmer compresses the input, how much it reduces the storage requirement, and how it increases the efficiency of an information retrieval system, which then has to deal with a smaller dictionary. Experiments have shown that vocabulary compression depends on the strength of the stemmer, as can be seen in Table 2. Lennon et al. [15], Frakes and Fox [26], Paice [16] and Harman [28] have demonstrated experimentally that strong stemmers produce better index compression, whereas weak stemmers generate poor index compression. Firstly, if we generalize the algorithms, we find that the Lovins, Porter, Dawson and Paice/Husk stemming algorithms have two similarities: they remove suffixes and they perform transformations on the stripped words. ...
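The index compression factor mentioned above is usually defined as (n - s) / n, where n is the number of distinct words in the vocabulary and s the number of distinct stems they conflate to; a stronger stemmer yields a higher ICF. A minimal sketch under that definition (the toy word list and plural-only stemmer are our own illustrations):

```python
def index_compression_factor(words, stem):
    # ICF = (n - s) / n over the distinct vocabulary
    vocab = set(words)
    stems = {stem(w) for w in vocab}
    return (len(vocab) - len(stems)) / len(vocab)

# A weak, plural-only stemmer: 4 distinct words -> 2 stems -> ICF = 0.5
strip_s = lambda w: w[:-1] if w.endswith("s") and not w.endswith(("ss", "us")) else w
print(index_compression_factor(["cat", "cats", "stem", "stems"], strip_s))
```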
Article
Full-text available
Stemming is one of the processes that can improve the information retrieval process. Stemming means reducing an inflected or derived word to its root form. Stemming is also used in text mining and natural language processing. Each language in the world has unique characteristics and grammatical rules; hence, it is very hard to apply the same stemming algorithm to all languages. This paper discusses the development of an Uzbek stemming algorithm. We review the related work and present a base model of the Uzbek stemmer. Besides that, this paper also discusses the proposed model of the Uzbek stemmer and its implementation.
Conference Paper
Full-text available
Stemming is one of the pipeline features of Information Retrieval and is commonly used in natural language processing and text mining. The main purpose of a stemming process is to reduce an inflected or derived word to its root form. The difficulty in developing a stemming algorithm is identifying and removing affixes, since each language in the world has unique characteristics and grammatical rules. This paper compares related studies on existing stemmers to be used for the Uzbek language. We discuss the types of stemming algorithms, give an overview of available popular English stemmers and a comparison between the discussed stemmers, as well as their evaluation and an analysis of available stemmers in an Uzbek language experiment. Based on the comparative study and experiment, we propose our model of the Uzbek stemmer, which enhances some of the features of the Lovins stemmer to suit the requirements of the Uzbek language.
... Lovins and Porter developed non-linguistic algorithms for suffix stripping based on a list of frequent suffixes to reduce words to their stems (Lovins, 1968; Porter, 1980). It is a common belief that stemmers improve recall without losing too much precision; however, a comparison of the Lovins stemmer, the S stemmer, and the Porter stemmer with a baseline of no stemming at all concluded, after detailed evaluation, that none of the three stemming algorithms consistently improves retrieval for English documents (Harman, 1991). It was argued that the evaluation measures were not appropriate, and new measures were proposed for evaluating the performance of different stemming algorithms (Hull, 1996). ...
... Because chair want(s) article be reviewed by reviewer [20] and Porter's normalization algorithm [21]. With this technique and algorithm it was possible to identify and create the verb stems, which can be classified according to their Certainty Factor (CF) [22], based on the history of occurrences of the actions in previous elicitations; their values begin to appear after a few runs of the tool. ...
Conference Paper
Full-text available
The peculiar language and the plurality of distinct views require knowledge and experience from the requirements engineer for the requirements extraction activity in specific domains to succeed. In this work, the i*Get and TEKBS tools are presented, which use artificial intelligence techniques to assist the requirements engineer in extracting concrete and flexible goals in the context of iStar. i*Get uses the "concrete actions" defined in the LAL, while TEKBS uses the "flexible actions" combined with the synonyms of the terms that represent these actions, obtained from Wordnet. The results are transformed into a fact base for the CLIPS tool and then processed using a base of analysis rules fired by the CLIPS inference engine. Preliminary results show that a greater number of requirements are extracted by the engineer, improving knowledge about the domain.
... Based on our experiments, it is not always clear whether a light stemmer (removing only inflectional suffixes or part of them) or an aggressive stemmer removing both inflectional and derivational suffixes offers the best solution. For the English language, the conservative S-stemmer (Harman, 1991) removes only the plural suffix while Porter's stemmer (Porter, 1980) is a more aggressive approach. Such algorithmic or rule-based stemmers ignore word meanings and tend to make errors, usually due to over-stemming (e.g., "organization" is reduced to "organ") or to under-stemming (e.g., "European" and "Europe" do not conflate to the same root). ...
Chapter
Full-text available
This chapter describes the lessons learnt from the ad hoc track at CLEF in the years 2000 to 2009. This contribution focuses on Information Retrieval (IR) for languages other than English (monolingual IR), as well as bilingual IR (also termed "cross-lingual"; the request is written in one language and the searched collection in another), and multilingual IR (the information items are written in many different languages). During these years the ad hoc track has used mainly newspaper test collections, covering more than 15 languages. The authors themselves have designed, implemented and evaluated IR tools for all these languages during those CLEF campaigns. Based on our own experience and the lessons reported by other participants in these years, we are able to describe the most important challenges when designing an IR system for a new language. When dealing with bilingual IR, our experiments indicate that the critical point is the translation process. However, current online translation systems tend to offer rather effective translation from one language to another, especially when one of these languages is English. In order to solve the multilingual IR question, different IR architectures are possible. For the simplest approach, based on query translation of individual language pairs, the crucial component is the merging of the intermediate bilingual results. When considering both document and query translation, the complexity of the whole system clearly represents a main issue.
... • Proposed by Harman, its aim is to reduce words in the plural form to the singular form (Harman, 1991). The main purpose of this framework, in turn, is to let developers create their own algorithms with the character sets of their own languages. ...
Thesis
Full-text available
The topic of data mining is a very popular subject, especially nowadays. Data mining is a process that accesses information in large-scale data and mines knowledge from it. Its most widespread definition in the literature is the automatic or semi-automatic processing of large amounts of data to find meaningful patterns. With the spread of Internet usage, digital media is taking the place of traditional media, and the size of the domain is increasing day by day. The number of textual forms in digital media is also quite large. For this reason, text mining techniques should be used for text analysis. Text mining is a computer-based way of making text meaningful by automatically extracting, from unstructured text, information that would otherwise be regarded as insignificant. Text mining is a new and interdisciplinary field consisting of a combination of areas such as data mining, machine learning and statistics. The commercial potential of this area seems to be quite high, as most information is stored as text. At the same time, the largest source of information currently available for text mining is unstructured text on the Internet. In the fields of data and text mining, which have become very popular recently, studies are also being carried out in our country. In this thesis, economic research has been carried out on the importance given to these fields in our country and in the world, market sizes, the success of training given in this field prior to university education, and the added value of the studies made in this field. In this study, the information and communication technologies sector in the world, the European Union 2020 innovation indicators, the information and communication technologies sector in Turkey, an India - Turkey comparison in the field of information and communication technologies, education in information and communication technologies, and qualified labor issues in communication technologies are examined. The introductory part of this work covers the traditional recommendation systems and recommendation engine approaches applied to the growing amount of digital data on the Internet, and mentions the concepts of machine learning and artificial intelligence, which are current topics in information technology. The second chapter discusses the economic effects of data analysis and, in this context, examines the importance of information and communication technologies in the world and in Turkey, the importance of qualified labor and education in this area, and compares India and Turkey. The third chapter covers the concept of data analysis and the concepts of data mining, text mining, recommendation engines and ethics in data mining. In the fourth chapter, a literature review is presented and the application steps in text analysis and recommendation engines are extensively discussed. The fifth chapter describes the aim, structure and flow of the personalized recommendation engine developed within the context of the application. The sixth chapter explains the application, including the data collection, preprocessing and recommendation engine implementation realized in this context. In the concluding part, the thesis application is interpreted, evaluated and finalized.
The microeconomic effect shown in the conclusion of the application is derived from the extra advertising revenue generated by increased intra-site interaction, thanks to the Turkish Text Mining Support Engine built on the advertisement areas of "Gazetemsi", a Web content site. As a result of this project, the amount of time spent by users on the site and the number of related content readings increased, resulting in an additional gain of approximately TL 22,230 per month; the project therefore achieved its development purpose. As a result, a local recommendation engine supported by Turkish text mining has been developed as a national achievement. The recommendation engine project carried out within the scope of this study was supported by TÜBİTAK under the "TUBITAK-TEYDEB 1507-SME R&D Start-Up Support Program" with the project title "Personalized Recommendation Engine Supported by Text and Data Mining Approaches for Content Based Websites".
... In stemming, morphologically similar words are clustered together under the hypothesis that they are semantically similar [2]-[7]. It is useful for Text Mining, Natural Language Processing (NLP) functions, text clustering, text categorization, text summarization, and other applications of Text Mining (TM) [1], [8,9]. ...
Article
Full-text available
With the large quantity of information offered online, it is equally essential to retrieve correct information for a user query. A large amount of data is available in digital form in multiple languages. Various approaches aim to increase the effectiveness of online information retrieval, but the standard approach to retrieving information for a user query is to search the documents in the corpus word by word for the given query. This approach is very time-intensive and may miss many related documents that are equally important. To avoid these issues, stemming has been used extensively in numerous Information Retrieval Systems (IRS) to increase retrieval accuracy in all languages. This paper addresses the problem of stemming with Web Page Categorization in the Gujarati language, deriving stem words using the GUJSTER algorithm [1]. The GUJSTER algorithm is based on morphological rules, which are used to derive the root or stem word from inflected words of the same class. In particular, we consider the influence of an extracted stem or root word to check the integrity of web page classification using supervised machine learning algorithms. This research work focuses on the analysis of Web Page Categorization (WPC) for the Gujarati language and concentrates on verifying the influence of a stemming algorithm in a WPC application for the Gujarati language, with accuracy improved from 63% to 98% using supervised machine learning models and a standard split of 80% for training and 20% for testing.
... As a final remark, each document must be represented in an effective way. To achieve this, the input text was lowercased and a light stemmer was applied (removing the final "-s") (Harman, 1991). Punctuation symbols or sequences of them have been kept as they are. ...
Article
Full-text available
To determine the gender of a text's author, various feature types have been suggested (e.g., function words, n-grams of letters, etc.), leading to a huge number of stylistic markers. To determine the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k nearest-neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to know whether or not the same model always offers the best effectiveness when considering similar corpora under the same conditions. Thus, based on 7 CLEF-PAN collections, this study analyzes the effectiveness of 10 different classifiers. Our second aim is to propose a 2-stage feature selection to reduce the feature size to a few hundred terms without any significant change in the performance level compared to approaches using all the attributes (increase of around 5% after applying the proposed feature selection). Based on our experiments, neural network or random forest tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that reducing the feature set size to around 300 without penalizing the effectiveness is possible. Finally, based on such reduced feature sizes, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
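The two-stage feature selection described above can be sketched with scikit-learn; the concrete stages below (document-frequency pruning, then chi-squared ranking) and the toy data are our own stand-ins, not necessarily the stages used in the study.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["she wrote the letter", "he wrote the code",
         "she read the novel", "he read the manual"]
labels = [1, 0, 1, 0]            # toy gender labels

# Stage 1: drop very rare terms via a document-frequency threshold.
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(texts)

# Stage 2: keep only the k highest-scoring terms (around 300 in the study; 3 here).
selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, labels)
kept = vectorizer.get_feature_names_out()[selector.get_support(indices=True)]
print(kept)
```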
... Therefore, such techniques are not a good choice to be employed in a multilingual IR environment. Lovins' algorithm [12], Dawson's algorithms [3], Porter's algorithm [27], his Snowball project [28], Paice/Husk algorithm [19], Harman's 'S' stemmer [7] are some highlighted research articles in the realm of affix removal conflation techniques. Some other recent works of this domain are: English [4], Al-Shammar [1], Araujo et al. [2], Pande and Dhami [21], Sirsat et al. [30], Thangarasu and Manavalan [32], Shende and Kute [29]. ...
Article
Full-text available
This work is an attempt to devise a stemmer that can remove both a prefix and a suffix together from a given word in the English language. For a given input word, our method considers all possible internal n-grams for the detection of potential stems. We frame a hypothesis where the stem length is closest to half of the length of the input word. A standard English dictionary has been employed to identify morphologically correct n-grams in the process. We apply our techniques over a random sample of 100 English words, each possessing both a prefix and a suffix. We also compare our proposed stemmer with three standard algorithms from the literature. Empirical results exhibit that our technique performs better than the rest of the stemmers.
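A minimal sketch of the internal n-gram idea described above: every substring of the input is a candidate stem, a dictionary lookup keeps only morphologically valid candidates, and candidates whose length is closest to half the input length are ranked first. The mini-dictionary and tie-breaking rule are our own assumptions.

```python
def candidate_stems(word, dictionary):
    n = len(word)
    # All internal n-grams of length >= 2 are candidate stems.
    candidates = {word[i:j] for i in range(n) for j in range(i + 2, n + 1)}
    valid = [c for c in candidates if c in dictionary]
    # Prefer stems whose length is closest to half the word length.
    return sorted(valid, key=lambda c: (abs(len(c) - n / 2), c))

# Hypothetical mini-dictionary standing in for a full English wordlist.
dictionary = {"play", "able", "playable"}
print(candidate_stems("unplayable", dictionary))  # ['able', 'play', 'playable']
```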
... This algorithm was proposed by Donna Harman. The algorithm has rules to remove suffixes in plurals so as to convert them to the singular forms [9]. ...
Article
Full-text available
Data mining is used for finding useful information in large amounts of data. Data mining techniques are used to implement and solve different types of research problems. This research relates to text mining, which is also called knowledge discovery in text (KDT) or intelligent text analysis. Text mining is a technique that extracts information from both structured and unstructured data and also finds patterns. Text mining techniques are used in various research domains such as natural language processing, information retrieval, text classification and text clustering.
... ike (bālak + -ṭi + -ke) "to the boy," jaleo ... concatenation between a noun and an inflection (Harman, 1991). The first stage is the conversion of a raw Bengali text into a POS-annotated text. ...
Chapter
In this chapter, we describe a process of lemmatization of inflected nouns in Bengali as a part of lexical processing. Inflected nouns are used at a very high frequency in Bengali texts. We first collect a large number of inflected nouns from a Bengali corpus and compile a noun database. Then we apply a process of lemmatization to separate inflections from nominal bases. There are several intermediate stages in lemmatization which are applied following grammatical mapping rules (GMRs). These rules isolate inflections from nominal bases. The GMRs are first designed manually after analyzing a large set of inflected nouns to collect necessary data and information. At subsequent stages, these GMRs are developed in a machine-readable format so that the lemmatizer can separate the inflections from inflected nouns with the least human intervention. This strategy is proved to be largely successful in the sense that most of the inflected Bengali nouns, which are stored in a noun database, are rightly lemmatized. This multilayered process also generates an exhaustive list of nominal inflections and a large list of lemmatized nouns. At the subsequent stage, nouns are semantically classified for their use in translation, dictionary compilation, lexical decomposition, and language teaching. We have also applied this method to lemmatize inflected pronouns and adjectives which follow a similar pattern of inflection and affixation in Bengali.
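The staged suffix-stripping idea behind the grammatical mapping rules can be illustrated in a few lines. The suffix inventory below is hypothetical and heavily simplified (the chapter's GMRs are far richer and language-specific), and romanized forms without diacritics are used for readability.

```python
# Hypothetical, simplified stages: case/plural markers first, then classifiers.
SUFFIX_STAGES = [
    ("ke", "ra", "te"),
    ("ti", "ta", "khana"),
]

def lemmatize(word):
    for stage in SUFFIX_STAGES:
        for suffix in stage:
            # Guard against stripping a suffix out of a too-short base.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break          # at most one suffix stripped per stage
    return word

print(lemmatize("balaktike"))  # balaktike -> balakti -> balak
```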
... Eq. (15) computes the MCR score. Frakes and Fox (2003) used the Moby Common Dictionary wordlist to evaluate four stemmers ["S" stemmer (Harman 1991), Lovins (1968), Porter (1980), Paice/Husk (1990)] and claimed that the Paice and Lovins stemming algorithms are the most similar, while the Paice and "S" stemmers are the most dissimilar. ...
Article
Full-text available
Text stemming is one of the basic preprocessing steps for Natural Language Processing applications; it is used to transform different word forms into a standard root form. For Arabic-script-based languages, adequate analysis of text by stemmers is a challenging task due to the large number of ambiguous structures in the language. In the literature, multiple performance evaluation metrics exist for stemmers, each describing the performance from a particular aspect. In this work, we review and analyze text stemming evaluation methods in order to devise criteria for better measurement of stemmer performance. Different aspects of stemmer performance measurement, such as main features, merits and shortcomings, are discussed using a resource-scarce language, i.e., Urdu. Through our experiments we conclude that the current evaluation metrics can only measure an average conflation of words regardless of the correctness of the stem. Moreover, some evaluation metrics favor some types of languages only. None of the existing evaluation metrics can perfectly measure stemmer performance for all kinds of languages. This study will help researchers to evaluate their stemmers using the right methods.
... This provides more convenience when handling words that share the same core meaning, thus playing an important role in the field of information retrieval (IR). In IR, grouping words with the same root (or stem) increases the success with which documents can be matched against a query [27]. For the English language, a simple stemming that involves the stripping of suffixes is sufficient for the purpose of IR. ...
Article
Full-text available
Document classification is a classical problem in information retrieval, and plays an important role in a variety of applications. Automatic document classification can be defined as content-based assignment of one or more predefined categories to documents. Many algorithms have been proposed and implemented to solve this problem in general, however, classifying Arabic documents is lagging behind similar works in other languages. In this paper, we present seven deep learning-based algorithms to classify the Arabic documents. These are: Convolutional Neural Network (CNN), CNN-LSTM (LSTM = Long Short-Term Memory), CNN-GRU (GRU = Gated Recurrent Units), BiLSTM (Bidirectional LSTM), BiGRU, Att-LSTM (Attention-based LSTM), and Att-GRU. And for word representation, we applied the word embedding technique (Word2Vec). We tested our approach on two large datasets–with six and eight categories–using ten-fold cross-validation. Our objective was to study how the classification is affected by the stemming strategies and word embedding. First, we looked into the effects of different stemming algorithms on the document classification with different deep learning models. We experimented with eleven different stemming algorithms, broadly falling into: root-based and stem-based, and no stemming. We performed an ANOVA test on the classification results using the different stemmers, which helps assure that the results are significant. The results of our study indicate that stem-based algorithms perform slightly better compared to root-based algorithms. Among the deep learning models, the Attention mechanism and the Bidirectional learning gave outstanding performance with Arabic text categorization. Our best performance is F-score=97.96%, achieved using the Att-GRU model with a stem-based algorithm. Next, we looked into different controlling parameters for word embedding. For Word2Vec, both skip-gram and continuous bag-of-words (CBOW) perform well with either stemming strategy. However, when using a stem-based algorithm, skip-gram achieves good results with a vector of smaller dimension, while CBOW requires a larger dimension vector to achieve a similar performance.
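The skip-gram versus CBOW trade-off noted at the end of the abstract maps directly onto gensim's Word2Vec parameters; the corpus, window size, and vector sizes below are illustrative assumptions only.

```python
from gensim.models import Word2Vec

sentences = [["arabic", "document", "classification"],
             ["deep", "learning", "for", "arabic", "text"]]

# sg=1 selects skip-gram; sg=0 selects CBOW. Per the abstract, skip-gram can
# work with smaller vectors on stem-based input, while CBOW may need more
# dimensions for similar performance.
skipgram = Word2Vec(sentences, vector_size=100, sg=1, window=5, min_count=1)
cbow = Word2Vec(sentences, vector_size=300, sg=0, window=5, min_count=1)
print(skipgram.wv["text"].shape, cbow.wv["text"].shape)   # (100,) (300,)
```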
... Relaxed plural match to MeSH. This method uses the same mapping as the relaxed match, but also processes each token using a conservative plural stemmer [28] ...
Article
Full-text available
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.
... Stemming is a computational procedure that reduces all words with the same root (or the same stem, in case prefixes are left untouched) to a common form, usually by stripping each word of its derivational and inflectional suffixes, for example, the words "retrieval", "retrieved", "retrieves" are reduced to the stem "retrieve". In IR, grouping words with the same root (or stem) increases the success with which documents can be matched against a query [48]. ...
Article
Full-text available
Traditional information retrieval systems return a ranked list of results to a user’s query. This list is often long, and the user cannot explore all the results retrieved. It is also ineffective for a highly ambiguous language such as Arabic. The modern writing style of Arabic excludes the diacritical marking, without which Arabic words become ambiguous. For a search query, the user has to skim over the document to infer if the word has the same meaning they are after, which is a time-consuming task. It is hoped that clustering the retrieved documents will collate documents into clear and meaningful groups. In this paper, we use an enhanced k-means clustering algorithm, which yields a faster clustering time than the regular k-means. The algorithm uses the distance calculated from previous iterations to minimize the number of distance calculations. We propose a system to cluster Arabic search results using the enhanced k-means algorithm, labeling each cluster with the most frequent word in the cluster. This system will help Arabic web users identify each cluster’s topic and go directly to the required cluster. Experimentally, the enhanced k-means algorithm reduced the execution time by 60% for the stemmed dataset and 47% for the non-stemmed dataset when compared to the regular k-means, while slightly improving the purity.
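The abstract above describes reusing distances from previous iterations to avoid recomputation. The sketch below implements one well-known realization of that idea, bound-based (Elkan-style) pruning: a point provably keeps its center whenever its distance to that center is at most half the distance from that center to its nearest other center. This is offered as a plausible reading, not the paper's exact algorithm.

```python
import numpy as np

def pruned_kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Half-distance from each center to its nearest other center.
        cc = np.linalg.norm(centers[:, None] - centers, axis=2)
        np.fill_diagonal(cc, np.inf)
        s = cc.min(axis=1) / 2.0
        # Exact distance of each point to its currently assigned center.
        d_own = np.linalg.norm(X - centers[assign], axis=1)
        # Points with d_own <= s[assign] provably keep their center;
        # only the remaining "active" points are compared to all centers.
        active = d_own > s[assign]
        if active.any():
            d_all = np.linalg.norm(X[active, None] - centers, axis=2)
            assign[active] = d_all.argmin(axis=1)
        centers = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return assign, centers

X = np.random.default_rng(1).normal(size=(200, 2))
labels, cents = pruned_kmeans(X, k=3)
print(np.bincount(labels))
```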
Article
Full-text available
Information retrieval is the process of retrieving documents to satisfy a user's need for information. The user's information need is represented by a query; the retrieval decision is made by comparing the terms of the query with the terms in the document itself, or by estimating the degree of relevance that the document has to the query. Words in a document may have many morphological variants. These morphological variants of words have similar semantic interpretations and can be considered equivalent for the purpose of IR applications. For this reason, a number of so-called stemming algorithms, which reduce a word to its stem or root form, have been developed. Thus, the key terms of a query or document are represented by stems rather than by the original words. Stemming reduces the size of the index files and also improves retrieval effectiveness. A stemming algorithm is a computational procedure that reduces all words with the same root (or, if prefixes are left untouched, the same stem) to a common form, usually by stripping each word of its derivational and inflectional suffixes. This study evaluates the performance of the three basic suffix-removal stemming algorithms for the English language, namely Lovins, Porter and Paice/Husk, in terms of the accuracy and strength of the algorithms.
Article
Full-text available
This study was based on a major assumption that the lexical structure of Arabic textual words involves semantic content that could be used to determine the class of a given word and its functional features within a given text. Hence, the purpose of the study was to explore the extent to which we can rely on word structure to determine word class without the need for using language glossaries and word lists or using the textual context. The results indicate that the morphological structure of the Arabic textual word was helpful in achieving a rate of success approaching 79% of the total number of words in the sample used in the study. In certain cases, the approach adopted in the investigation was not adequate for class tagging due to two major reasons, the first of which was the absence of prefixes and suffixes and the second the inability to distinguish affixes from original letters. It was concluded that the approach adopted in this study should be supplemented by other techniques adopted in other studies, particularly the textual context.
Chapter
As a first approach, it is assumed that stylistic markers can be detected by considering words, or more precisely, the most frequent ones. This chapter explores several other ways to define useful stylistic traces left by the author. Instead of considering only isolated words, one can explore the usefulness of short sequences of words (called word n-grams). After applying a part-of-speech (POS) tagger, the resulting tags or sequences of them can be pertinent to discriminate between distinct styles. On the other hand, the letters and n-grams of them could also reflect the distinction between authors. In addition, various feature selection functions have been proposed to select the best subset of stylistic markers to describe a given writer or category (e.g., men vs. women). All those solutions present advantages and drawbacks and this chapter exposes and illustrates them. Finally, two methods for extracting the overused terms and expressions corresponding to a given author or category are discussed and examples are presented to illustrate the required computation.
Conference Paper
The excessive consumption of network bandwidth for transmitting unwanted emails has always been a major problem on the web, since existing classification approaches still lack a complete solution. This paper presents an enhanced vocabulary-based dictionary algorithm for protecting web users from receiving unwanted spam mails. The proposed algorithm identifies and classifies legitimate incoming mail against unsolicited email attacks. We use the Porter stemmer algorithm as part of the normalization process for removing the common morphological and inflexional endings from English words. A comparative study and evaluation of these classification approaches are carried out using machine-learning techniques. The performance of the proposed algorithm is visualized using a confusion matrix. The experimental results show that our method produces fewer false negatives when compared with existing techniques.
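A minimal sketch of the normalization-plus-classification flow described above, with Porter stemming before weighting and a confusion matrix for evaluation. The classifier choice (multinomial naïve Bayes), the TF-IDF weighting, and the four toy mails are our own assumptions; the paper's vocabulary-based dictionary algorithm is not reproduced here.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

stemmer = PorterStemmer()

def stem_text(text):
    # Normalization: strip common morphological endings before weighting.
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

mails = ["winning offers await claim your prizes now",
         "meeting rescheduled to monday morning",
         "cheap pills and amazing offers buy now",
         "please review the attached project report"]
labels = [1, 0, 1, 0]            # 1 = spam, 0 = legitimate (toy data)

X = TfidfVectorizer().fit_transform(stem_text(m) for m in mails)
clf = MultinomialNB().fit(X, labels)
print(confusion_matrix(labels, clf.predict(X)))   # evaluated on training data
```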
Article
Stemming is a process in which variant word forms are mapped to their base form. It is among the basic text pre-processing approaches used in Language Modeling, Natural Language Processing, and Information Retrieval applications. In this article, we present a comprehensive survey of text stemming techniques, evaluation mechanisms, and application domains. The main objective of this survey is to distill the main insights and present a detailed assessment of the current state of the art. The performance of some well-known rule-based and statistical stemming algorithms in different scenarios has been analyzed. In the end, we highlight some open issues and challenges related to unsupervised statistical text stemming. This research work will help researchers to select the most suitable text stemming technique in a specific application and will also serve as a guide to identify the areas that need attention from the research community.
Article
Full-text available
Information retrieval is the process of retrieving documents to satisfy a user's need for information. The user's information need is represented by a query; the retrieval decision is made by comparing the terms of the query with the terms in the document itself, or by estimating the degree of relevance that the document has to the query. Words in a document may have many morphological variants. These morphological variants of words have similar semantic interpretations and can be considered equivalent for the purpose of IR applications. For this reason, a number of so-called stemming algorithms, which reduce a word to its stem or root form, have been developed. Thus, the key terms of a query or document are represented by stems rather than by the original words. Stemming reduces the size of the index files and also improves retrieval effectiveness. A stemming algorithm is a computational procedure that reduces all words with the same root (or, if prefixes are left untouched, the same stem) to a common form, usually by stripping each word of its derivational and inflectional suffixes. This study evaluates the performance of three basic prefix- and suffix-removal stemming algorithms for the Tamil language, namely the Light Stemmer, the Improved Light Stemmer and the Rule-Based Iterative Affix-Stripping Stemmer, in terms of the accuracy and strength of the algorithms.
Article
The first part is dedicated to introducing the authors of the book, namely Carol Peters, Martin Braschler and Paul Clough. Then the book itself is introduced. This resource has a good place among English-language teaching resources in the field of information science, and its use has been widespread around the world. Hence, the translation is a good option for teaching courses related to information retrieval in information systems at universities and higher education institutions. The book has the structure of conference articles and provides practical examples of its benefits. The second part is dedicated to the Persian translation of the book. This translation is relatively good, but there are some shortcomings: throughout the different chapters of the Persian translation there are spelling, grammatical, structural and conceptual (language) errors.
Chapter
As presented in the previous chapters, stylometric models and applications are located at the crossroads of several domains such as applied linguistics, statistics, and computer science. This position is not unique; in a broader view, it corresponds to digital humanities, a field largely open to many relevant research directions and useful applications. These considerations lead to one of the main intents of this book: an introduction to this joint, open discipline, bringing together varied skills and requiring multi-disciplinary knowledge. Nowadays, we are just at the beginning of exploring the full potential of computer-based tools to represent, explore, understand, and identify patterns in literary textual datasets as well as in other corpus formats.
Chapter
In this second chapter presenting stylometric applications, social networks, and more precisely Twitter, are the source of our dataset. To explore new forms of communication, this chapter examines the distinct linguistic characteristics of Twitter compared to traditional oral or written forms. For example, the frequency of mentions (e.g., @POTUS44), hyperlinks (e.g., www.nytimes.com), retweets, or emojis can be exploited to profile the author of a set of tweets. The dataset, freely available, is provided by the CLEF PAN evaluation campaign in 2019. With this corpus, the first classification task is to discriminate between tweets generated by bots and those written by humans. In a second application, the computer must identify tweets written by men or by women. As a useful additional result, one can discover the linguistic features strongly related to bots, and those associated with men or women.
Chapter
Some well-known models were explained in the previous chapter, but various advanced approaches have also been suggested. Rooted in the humanities, the Zeta test focuses on terms used recurrently by one author and largely ignored by the others. Selecting stylistic markers based on this criterion, the model builds a graph showing the similarities between text excerpts. Compression algorithms can also be applied to identify the true author of a text based on similar word frequencies. Closer to the natural language processing domain, latent Dirichlet allocation (LDA) can be applied to determine the most probable author of a given document. To solve the verification problem, several dedicated approaches have been suggested, and an overview of them is included in this chapter. Although we usually assume that a novel is written by a single person, collaborative authorship is possible. To detect passages written by each possible author, the rolling Delta and other ad hoc approaches are described. As neural models constitute an important research field, three sections are dedicated to them: one on the basic neural approach, one focusing on word embeddings, and a third on the long short-term memory (LSTM), a well-known deep learning model. The last section is dedicated to adversarial stylometry and obfuscation, or how one can program a computer to hide the stylistic markers left by the original author.
Chapter
In this chapter, several authorship attribution methods are applied to identify the true author behind the Elena Ferrante pen name. To achieve this objective, a corpus must first be carefully generated, as shown in the first section. Next, four authorship models with known high effectiveness levels are selected and applied to resolve this authorship problem. After explaining how the stylistic features were chosen and how we applied the PCA, Delta, Labbé, and Zeta models, we reach the same conclusion: D. Starnone is the true writer of the My Brilliant Friend saga. To complement this deduction, a more qualitative analysis is performed to obtain a better understanding of the strong stylistic relationship between Elena Ferrante and Domenico Starnone.
Chapter
With US political speeches, this chapter illustrates some notions and concepts presented in the first two parts of this book. The studied corpus is composed of 233 annual State of the Union (SOTU) speeches written under 43 presidencies from Washington (Jan. 8th, 1790) to Trump (Feb. 4th, 2020). Moreover, 58 inaugural allocutions uttered by 40 presidents have been added to complement this dataset. To analyze the evolution over time of the US presidential style, simple statistics such as the relative term frequency can show the decrease of some forms (e.g., the determiner the) and the near-constant increase of others (e.g., the pronoun we). With time, one can detect a trend toward more direct and simpler formulations, with a diminution of the percentage of big words (composed of six letters or more) and a decrease of the mean sentence length. In addition, one can distinguish the writing style adopted by the different presidents (or periods) by depicting a Principal Component Analysis (PCA) graph based on the part-of-speech (POS) distributions or on the 300 most frequent lemmas. A method is then applied to detect the most characteristic words or expressions per presidency; based on those terms, the distinction between presidents can be illustrated using both stylistic markers and topical expressions. Finally, some distinctive sentences based on specific wordlists illustrate the style and rhetoric of some presidencies.
Chapter
As many stylometric applications must first learn and represent the distinctive style of different categories or authors, several machine learning algorithms have been suggested to solve authorship attribution or profiling issues. This sixth chapter presents four important models. Built on a vector space representation, the k-nearest neighbors (k-NN) model relies on a distance (or similarity) measure computed between the doubtful text and either the different categories (profile-based) or all texts (instance-based). The closest instance, or the k closest instances, are then used to define the proposed decision. With the naïve Bayes model, probability theory is used to estimate the occurrence of each selected stylistic marker according to the different categories. Given the query text, the model computes the probability of each class to determine the most probable one. A more complex approach, the support vector machine (SVM), defines a linear border splitting the training set into two distinct regions, one for each category. Based on this representation, the doubtful text is projected into this space and its position defines its attribution. Finally, logistic regression is described as an approach to estimate the probability that a query text belongs to a given class. As the practical aspect is important for a clear understanding of all these methods, examples written in R are provided, usually using the Federalist Papers as a testbed corpus.
Chapter
This chapter explores the indexing process of information retrieval. After some introductory discussion, the two broad approaches to indexing, manual and automated, are described. For manual indexing, approaches applied to bibliographic, full-text, and Web-based content are presented. This is followed by a description of automated approaches to indexing, with discussion limited to those used in operational retrieval systems. The problems associated with each type of indexing are explored. The final section describes computer data structures used to maintain indexing information for efficient retrieval.
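As a concrete illustration of the index data structures described in the final section, the sketch below builds a minimal inverted file mapping each term to the documents containing it; real systems add term positions, weights, and compression:

from collections import defaultdict

def build_index(docs):
    # Inverted file: term -> sorted list of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = ["Indexing supports retrieval",
        "Manual indexing versus automated indexing"]
print(build_index(docs)["indexing"])  # -> [0, 1]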
Article
Full-text available
Gujarati is an indigenous language of the Indian state of Gujarat, known for its rich morphological structure. Gujarati text mining has become an active area of research, and many algorithms have been designed and introduced to address Gujarati morphology and stemming. However, each researcher has proposed his or her own standards, testing methodology, and accuracy measures, so an exact comparison between these algorithms cannot be made.
Chapter
This chapter will give an overview of how human languages differ from each other and how those differences are relevant to the development of human language understanding technology for the purposes of information access. It formulates what requirements information access technology poses (and might pose) to language technology. We also discuss a number of relevant approaches and current challenges to meet those requirements.
Article
There have been several studies of the use of stemming algorithms for conflating morphological variants in free‐text retrieval systems. Comparison of stemmed and nonconflated searches suggests that there are no significant increases in the effectiveness of retrieval when stemming is applied to English‐language documents and queries. This article reports the use of stemming on Slovene‐language documents and queries, and demonstrates that the use of an appropriate stemming algorithm results in a large, and statistically significant, increase in retrieval effectiveness when compared with nonconflated processing; similar comments apply to the use of manual, right‐hand truncation. A comparison is made with stemming of English versions of the same documents and queries and it is concluded that the effectiveness of a stemming algorithm is determined by the morphological complexity of the language that it is designed to process. © 1992 John Wiley & Sons, Inc.
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
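A toy fragment in the spirit of the algorithm: removing a simple suffix is conditioned on the "measure" of the remaining stem (the count of vowel-consonant sequences, m in Porter's notation). The sketch covers only three rules and always treats y as a consonant, so it is a simplification, not the full algorithm:

import re

def measure(stem):
    # Porter's m: number of vowel-consonant sequences in the stem.
    forms = "".join("V" if ch in "aeiou" else "C" for ch in stem)
    return re.sub(r"(.)\1+", r"\1", forms).count("VC")

RULES = [("ational", "ate"), ("izer", "ize"), ("ness", "")]

def strip_suffix(word):
    for suffix, repl in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:  # removal depends on the remaining stem
                return stem + repl
    return word

print(strip_suffix("relational"), strip_suffix("rational"), strip_suffix("goodness"))
# -> relate rational good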
Article
The characteristics of conflation algorithms are discussed and examples given of some algorithms which have been used for information retrieval systems. Comparative experiments with a range of keyword dictionaries and with the Cranfield document test collection suggest that there is relatively little difference in the performance of the algorithms despite the widely disparate means by which they have been developed and by which they operate.
Article
The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing in particular that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
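One common statistical realization of this proposal weights each term by the logarithm of the inverse of its collection frequency, so matches on rare terms count for more. The snippet below illustrates that idea and is not the paper's exact formulation:

import math

def idf_weights(doc_freq, n_docs):
    # Rare terms (low document frequency) receive high weights.
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

def score(query_terms, doc_terms, idf):
    # Sum the weights of the query terms the document matches.
    return sum(idf.get(t, 0.0) for t in query_terms if t in doc_terms)

idf = idf_weights({"retrieval": 20, "suffix": 3, "the": 1000}, n_docs=1000)
print(round(score({"suffix", "retrieval"}, {"suffix", "stemming"}, idf), 2))
# -> 5.81 (the match on the rare term 'suffix' dominates)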
Article
Word truncation is a familiar technique employed by online searchers in order to increase recall in free text retrieval. The use of truncation, however, can be a mixed blessing since many words starting with the same root are not semantically or logically related. Consequently, online searchers often select words to be OR-ed together from an alphabetic display of neighbouring terms in the inverted file in order to assure precision in the search. Automatic stemming algorithms typically function in a manner analogous to word truncation, with the added risk of the word roots being incorrectly identified by the algorithm. This paper describes a two-phase stemming algorithm that consists of the identification of the word root and the automatic selection of ‘well-formed’ morphological word variants from the actual inverted file entries that start with the same word root. The algorithm has been successfully used in an end-user interface to NLM's Catline book catalog file.
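A loose sketch of the two-phase idea: phase one reduces a word to a root, and phase two admits only those inverted-file entries sharing that root whose endings look "well-formed". The root finder and the allowed-endings list here are stand-in assumptions; the actual Catline rules are richer:

import bisect

ALLOWED_ENDINGS = ("", "s", "es", "ed", "ing", "ation", "ations")

def find_root(word):
    # Phase 1: crude root identification by stripping one suffix.
    for suf in ("ations", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def well_formed_variants(word, index_terms):
    # Phase 2: scan the sorted inverted-file terms sharing the root and
    # keep only variants whose ending is on the well-formed list.
    root = find_root(word)
    i = bisect.bisect_left(index_terms, root)
    variants = []
    while i < len(index_terms) and index_terms[i].startswith(root):
        if index_terms[i][len(root):] in ALLOWED_ENDINGS:
            variants.append(index_terms[i])
        i += 1
    return variants

terms = sorted(["catalog", "cataloging", "catalogs", "catalogue", "catalonia"])
print(well_formed_variants("cataloging", terms))
# -> ['catalog', 'cataloging', 'catalogs'] ('catalogue' and 'catalonia' are rejected)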
Article
Minicomputer-operated information retrieval (IR) systems are capable of employing relatively advanced methods, some of which are comparable with those employed in main-frame systems. Many of these systems operate on dedicated machines and can therefore provide very rapid access to information, while remaining under the direct control of an information department. One system has now given over three years of satisfactory operation: MORPHS, the Minicomputer Operated Retrieval (Partially Heuristic) System. This system incorporates a number of linguistic features, including the ability to find roots of words through affix stripping. Synonyms and compound words can also be handled, and several search strategies (including SDI) are available. The latter have been developed considerably since the inception of the system. Consideration is given to the automation of the indexing process, which is currently restricted to material for SDI.
Conference Paper
The ability to effectively rank retrieved documents in order of their probable relevance to a query is a critical factor in statistically-based keyword retrieval systems. This paper summarizes a set of experiments with different methods of term weighting for documents, using measures of term importance within an entire document collection, term importance within a given document, and document length. It is shown that significant improvements over no term weighting can be made using a combination of weighting measures and normalizing for document length.
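Such combinations are commonly realized as a dampened within-document frequency multiplied by a collection-level weight and divided by a length normalization. The variant below is one standard form, not necessarily the paper's exact scheme:

import math

def weight(tf, df, n_docs, doc_len, avg_len):
    tf_part = 1.0 + math.log(tf)      # dampened within-document frequency
    idf_part = math.log(n_docs / df)  # term importance in the collection
    norm = doc_len / avg_len          # penalize unusually long documents
    return tf_part * idf_part / norm

print(round(weight(tf=3, df=5, n_docs=1000, doc_len=120, avg_len=100), 3))
# -> 9.266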
Conference Paper
This paper describes two experiments concerned with term conflation for information retrieval, and the CATALOG retrieval system designed using the results of the experiments. The experiments have as their aims 1) finding a theoretical basis and method for maximizing the effect of conflation, and 2) determining whether conflation can be automated with no loss of system performance. Experimental results indicate that:
1. Experienced searchers generally truncate terms at root morpheme boundaries; when searchers do not truncate at root boundaries, the deviations are small.
2. Small deviations from root boundaries do not significantly affect retrieval performance.
3. There is no significant performance difference between automatic conflation and manual conflation carried out by experienced searchers.
4. Based on 3, term conflation can be automated in a retrieval system with no average loss of performance, thus allowing easier user access to the system.
A retrieval system incorporating the result in 4 is described and shown to be feasible.
Article
IRX is a text retrieval system designed to be a testbed for conducting information retrieval research on statistically-based retrieval strategies in either batch or interactive modes. The modular structure of IRX has permitted major changes in components of the system (e.g., ranking algorithms, parsers, interfaces) without redesign. As an interactive system, IRX is in use at the Johns Hopkins University and the Lister Hill Center, providing access to databases in human and molecular genetics.
Article
This paper describes a method for automatically segmenting words into their stems and affixes. The process uses certain statistical properties of a corpus (successor and predecessor letter variety counts) to indicate where words should be divided. Consequently, this process is less reliant on human intervention than are other methods for automated stemming. The segmentation system is used to construct stem dictionaries for document classification. Information retrieval experiments are then performed using documents and queries so classified. Results show not only that this method is capable of high quality word segmentation, but also that its use in information retrieval produces results that are at least as good as those obtained using the more traditional stemming processes.
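A minimal sketch of the successor-variety idea: for every prefix of a word, count how many distinct letters follow that prefix across a corpus, and treat a peak in the counts as a likely morpheme boundary. The toy corpus and the crude peak pick are illustrative assumptions, simpler than the paper's segmentation strategies:

from collections import defaultdict

def successor_varieties(word, corpus):
    # For each prefix of `word`, count distinct following letters in corpus.
    followers = defaultdict(set)
    for w in corpus:
        for i in range(1, len(w)):
            followers[w[:i]].add(w[i])
    return [len(followers[word[:i]]) for i in range(1, len(word))]

corpus = ["readable", "reading", "reads", "red", "rope", "ripe"]
word = "readable"
sv = successor_varieties(word, corpus)
print(sv)  # -> [3, 2, 1, 3, 1, 1, 1]

cut = 1 + max(range(1, len(sv)), key=lambda i: sv[i])  # crude peak pick
print(word[:cut], "|", word[cut:])  # -> read | able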
Article
This paper describes an automated procedure for the identification and subsequent transformation of nominal forms into the corresponding adjectival forms and of participles into nominal forms, together with a method for the adjectivization of certain Greek and Latin phrases in medical language. The paper is an extension of the authors' previous report, which dealt with transforms of adjectives into nouns and of plural nouns into singular nouns in medical English. It is part of the information retrieval system for processing pathology data developed at the Division of Computer Research and Technology, National Institutes of Health.
Developing tools for online medical reference works
  • R Huntzinger
Huntzinger R. (1986). Developing tools for online medical reference works. In Proceedings of Medinfo 86 (pp. 558-560). Amsterdam: Elsevier Science Publishers B.V.
Word segmentation by letter successor varieties
  • M Hafer
  • S Weiss
Hafer M., & Weiss S. (1974). Word segmentation by letter successor varieties. Information Storage and Retrieval, 10, 371-385.