Article

An Algorithm for Suffix Stripping

Authors:
M. F. Porter

Abstract

The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
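
To make the description above concrete, here is a minimal, hedged Python sketch (not the full algorithm, and not Porter's original BCPL program): it computes the measure m of a stem, i.e. the number of vowel-consonant sequences the paper uses as its proxy for syllable length, and applies one Step-1a-style plural rule. Complete implementations exist, for example nltk.stem.PorterStemmer.

# Minimal sketch of two ideas from the paper: the measure m of a stem, and a
# single suffix-stripping rule conditioned on the remaining stem.
# Illustrative only; this is NOT the full Porter algorithm.

def measure(stem: str) -> int:
    """Porter's m: the number of vowel-consonant (VC) sequences in the stem."""
    def is_vowel(word, i):
        if word[i] in "aeiou":
            return True
        # 'y' counts as a vowel when preceded by a consonant
        return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

    pattern = "".join("v" if is_vowel(stem, i) else "c" for i in range(len(stem)))
    collapsed = "".join(ch for i, ch in enumerate(pattern) if i == 0 or ch != pattern[i - 1])
    return collapsed.count("vc")

def step_1a(word: str) -> str:
    """One illustrative rule set: plural endings (cf. Step 1a of the paper)."""
    if word.endswith("sses"):
        return word[:-2]              # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]              # ponies   -> poni
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]              # cats     -> cat
    return word

print(measure("tr"), measure("trouble"), measure("troubles"))   # 0 1 2
print(step_1a("caresses"), step_1a("ponies"), step_1a("cats"))  # caress poni cat

NLTK's PorterStemmer implements the complete sequence of steps described in the paper.
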

... This stemming algorithm was designed by Martin Porter [Porter, 1980]. It is a process for removing the common morphological and inflexional endings from words in English. ...
... Morphological analysis is the process of recognising the root form of a morphological variant. A morphology rule often strips a suffix or prefix from a word, and sometimes adds back replacement characters, to produce a possible root form [Porter, 1980]. Rules must be applied recursively as multiple derivations are common. ...
... The suffixes used are amalgamated from work by Quirk [Quirk et al., 1985], the Porter Stemmer [Porter, 1980], and linguistic information. This is a class of suffixes that are strong discriminators of word-class. ...
Thesis
Information is the most powerful resource available to an organisation. Problems arise when the amount of management needed to effectively organise the mass of information available in data repositories starts to increase and reaches a level that is impossible to maintain. Instead of providing value to an organisation, the information can serve to confuse and hamper. This research presents the topic of automatic key theme extraction as a method for information management, specifically the extraction of pertinent information from natural language texts. The motivation for this research was to achieve improved accuracy in automatic key theme extraction in natural language texts. The performance was evaluated against an industrial context, which was provided by Active Navigation Ltd, a content management system. The author has produced an architecture for theme extraction using a pipeline of individual processing components that adhere to a lossless information strategy. This lossless architecture has shown that it is capable of providing a higher accuracy of extracting key themes from natural language texts than that of Active Navigation. The accurate extraction of key themes is essential as it provides a solid base for other Active Navigation information navigation tasks. These include advanced search, categorisation, building summaries, finding related documents, and dynamic linking. Improving these navigation techniques increases the effectiveness of the content management system.
... These statistics are computed on the test sets. The ratio of missing keywords is computed by comparing the stemmed forms (Porter, 1980). TermiTH-Eval (Bougouin et al., 2016): within the ANR TermITH project, and with the help of the professional indexers of Inist, 400 scientific articles in French were annotated with keywords in an uncontrolled manner. There are on average 11.8 keywords per document. ...
... This matching, although simple to set up, does not make it possible to take certain variants into account. Handling these variants can be complex to implement, which is why stemming alone (Porter, 1980) is commonly used: it handles some of the inflectional and morphological variants. This treatment is well suited to English but less so to French, where allomorphy is a common phenomenon. ...
Thesis
The number of scientific documents in digital libraries keeps increasing. The keywords that enrich the indexing of these documents cannot be annotated manually given the volume of documents to process, so the automatic production of keywords is an important challenge. The evaluation framework most commonly used for this task suffers from numerous weaknesses that make the evaluation of new neural methods unreliable. Our goal is to identify these weaknesses precisely and to address them along three axes. First, we introduce KPTimes, a dataset from the news domain, which allows us to analyse the generalisation capacity of neural methods. Surprisingly, our experiments show that the least effective model is the one that generalises best. Second, we carry out a systematic comparison of state-of-the-art methods using a strict experimental framework. This comparison indicates that baseline methods such as TF×IDF are still competitive and that the quality of the reference keywords has a strong impact on the reliability of the evaluation. Finally, we present a new extrinsic evaluation protocol based on information retrieval, which allows us to evaluate the usefulness of keywords, a question that has received little attention so far. This evaluation enables us to better identify the keywords that matter for the automatic keyword production task and to guide future work.
... However, since the numbers of hidden layers and neurons are randomly generated, the network structure generated each time is different, making the model training process difficult and the calculations very complicated. References [9][10][11] proposed a widely used Bayes classifier, which is a single classifier. It has a good classification effect on text data, and the calculations are fast and easy to implement. ...
Article
Full-text available
A single model is often used to classify text data, but the generalization effect of a single model on text data sets is poor. To improve the model classification accuracy, a method is proposed that is based on a deep neural network (DNN), recurrent neural network (RNN), and convolutional neural network (CNN) and integrates multiple models trained by a deep learning network architecture to obtain a strong text classifier. Additionally, to increase the flexibility and accuracy of the model, various optimizer algorithms are used to train data sets. Moreover, to reduce the interference in the classification results caused by stop words in the text data, data preprocessing and text feature vector representation are used before training the model to improve its classification accuracy. The final experimental results show that the proposed model fusion method can achieve not only improved classification accuracy but also good classification effects on a variety of data sets.
... Second, bug reports are divided into sentences based on punctuation marks, and each sentence is tokenized following the software-specific regular expression [7]. Last, stop words are removed [15] and Porter stemming [16,17] is performed. ...
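
A hedged sketch of that preprocessing chain (sentence splitting, tokenisation, stop-word removal, Porter stemming) using NLTK is shown below; the token regular expression is a generic placeholder rather than the software-specific pattern cited as [7].

# Sentence split, tokenise, drop stop words, then Porter-stem each token.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))              # needs nltk.download("stopwords")
STEMMER = PorterStemmer()
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")   # placeholder, not the regex from [7]

def preprocess(report: str):
    sentences = re.split(r"[.!?]+", report)         # split into sentences on punctuation
    processed = []
    for sent in sentences:
        tokens = TOKEN_RE.findall(sent.lower())
        tokens = [t for t in tokens if t not in STOP]
        processed.append([STEMMER.stem(t) for t in tokens])
    return processed

print(preprocess("The parser crashes when parsing empty files. Stack trace attached!"))
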
Article
Full-text available
During the maintenance phase of software development, bug reports provide important information for software developers. Developers share information, discuss bugs, and fix associated bugs through bug reports; however, bug reports often include complex and long discussions, and developers have difficulty obtaining the desired information. To address this issue, researchers proposed methods for summarizing bug reports; however, to select relevant sentences, existing methods rely solely on word frequencies or other factors that are dependent on the characteristics of a bug report, failing to produce high-quality summaries or resulting in limited applicability. In this paper, we propose a deep-learning-based bug report summarization method using sentence significance factors. When conducting experiments over a public dataset using believability, sentence-to-sentence cohesion, and topic association as sentence significance factors, the results show that our method outperforms the state-of-the-art method BugSum with respect to precision, recall, and F-score and that the application scope of the proposed method is wider than that of BugSum.
... Similarly, Nonnen et al. use heuristics to determine where notions associated with identifiers in the source code [21] are introduced or described. In both methods, each expression is obtained from code elements, and this material is not usually used to classify software technologies. ...
Conference Paper
The lack of a standard taxonomy for informal language and software technologies makes it difficult to analyze technology trends on forums and other online sites. Researchers have done an in-depth study of seven top technology-classification tools and proposed an improved automated method to classify software technologies. For example, Witt takes a phrase that describes a software technology or concept and returns a wide-ranging class that defines it (i.e., IDE), and further defines its attributes (commercial, PHP). By extension, this method can dynamically compile the list of all technologies of a given type. The researchers also studied, in the same way, how WordNet, WebIsADb, WiBiTaxonomy, and a few other tools work. Eventually, they compared these classification methodologies and established that Witt, when applied to software jargon, showed better results than all the other solutions assessed, without a corresponding reduction in false alarm rates.
... The segmentation step is therefore handled by us: the description of each exhibit is used as a single textual unit to be classified, even though it is itself often composed of several sentences. The usual pre-processing steps (Manning et al. 1999; Porter 1980) are applied to this pseudo-document, which then allows the algorithm to weight the textual segments more effectively. Without continuing the summary-generation process, the results are retrieved and the score associated with each textual unit is assigned to the corresponding exhibit as the prestige of that work. ...
Thesis
This thesis addresses the recommendation of cultural visits through an interdisciplinary approach. The work combines techniques from Operations Research and from the automatic processing of written natural language, while drawing on concepts from the sociology of audiences and from geography. We propose new methods for evaluating cultural points of interest as well as for the automatic creation of tourist itineraries that take into account the wishes expressed by a visitor. These principles are applied at two different scales and in two different contexts: museum visits and cultural routes in a city.

In the first part, we focus on visits to art museums as a function of the preferences expressed by the visitor and of the prestige of the artworks. This twofold approach allows the artworks to be ranked both according to the visitor's cultural affinities and according to their importance within the museum. The latter is computed by applying automatic text summarisation algorithms to the museum's official labels describing the artworks, yielding a visit profile that reflects the discovery of a museum through its masterpieces. This profile can then be modified according to the visitor's tastes to obtain a visit that suits them while preserving the museum's point of view. We then treat the construction of an itinerary as a routing problem, seeking a route through the various rooms and artworks that maximises visitor satisfaction while respecting time constraints. Two methods are proposed: an integer linear programming model and a heuristic that can be used to propose itineraries in real time, for example when visitors arrive at the museum.

In the second part, we turn to tourist recommendation in the city by establishing metrics for constructing an itinerary. Based on an interdisciplinary study, we highlight the importance of personalising itineraries and identify an essential factor in their construction beyond cultural tastes: the pace of the visit. A new method for measuring the quality of experience of an itinerary, combining these two criteria, is used. It unites methods from the literature for evaluating cultural interest and uses actograms as a geographic representation of an itinerary, thereby defining a measure of the pace of the visit. We then develop a tourist-visit recommendation system in the form of an integer linear programming model based on an extensible formalism that can take a wide variety of constraints into account and that integrates three criteria for evaluating the itinerary: on the one hand, cultural interest and visit pace, which depend on the tourist's preferences, are measured at different scales, introducing coherence into the construction of the itinerary; on the other hand, we propose to integrate into the objective function the peak-end effect, a well-known psychological heuristic that has already been applied in many other fields. Based on concrete case studies, we show that the joint use of techniques from various disciplines yields good results, both for estimating the attractiveness of points of interest and for constructing tourist itineraries.
... Also, since cat is found in every class and document, the lowest score is assigned. In the preprocessing stage, stop-word removal [6] and stemming [28] were utilized. Also, the TF-IDF [10] feature weighting method was utilized during the feature weighting step. ...
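
For illustration, a minimal TF-IDF weighting step in scikit-learn might look as follows (assuming documents have already been cleaned and stemmed as described above); a term such as "cat" that occurs in every document receives the lowest idf.

# Toy TF-IDF weighting sketch with scikit-learn; "cat" appears in every
# document and therefore gets the smallest idf value.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cat sits on the mat",
    "cat chases the mouse",
    "cat sleeps all day",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)                  # rows: documents, columns: terms
for term, col in sorted(vec.vocabulary_.items()):
    print(term, round(vec.idf_[col], 3))     # idf("cat") == 1.0, the minimum here
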
Article
Full-text available
In the field of text classification, some of the datasets are unbalanced datasets. In these datasets, feature selection stage is important to increase performance. There are many studies in this area. However, existing methods have been developed based on the document frequency of only intra‐class. In this study, a new method is proposed considering the situation of the feature in class and corpus. A new feature selection method, namely class‐index corpus‐index measure (CiCi) was presented for unbalanced text classification. The CiCi is a probabilistic method which is calculated using feature distribution in both class and corpus. It has shown a higher performance compared to successful methods in the literature. Multinomial Naïve Bayes and support vector machines were used as classifiers in the experiments. Three different unbalanced datasets are used in the experiments. These benchmark datasets are reuters‐21578, ohsumed, and enron1. Experimental results show that the proposed method has more performance in terms of three different success measures.
... We included only previously published data, even if created by semi-automatic or automatic segmentation methods; we did not attempt to create any new datasets ourselves, e.g. via application of automatic stemming algorithms such as the Porter stemmer (Porter, 1980) or segmenters such as Morfessor (Smit et al., 2014). Other limiting factors included: nonexistent or insufficient digitization of printed resources (e.g. in the case of Sokolová et al. (2005)), licenses disallowing redistribution, or actual inaccessibility of data. ...
Conference Paper
Full-text available
Our work aims at developing a multilingual data resource for morphological segmentation. We present a survey of 17 existing data resources relevant for segmentation in 32 languages, and analyze diversity of how individual linguistic phenomena are captured across them. Inspired by the success of Universal Dependencies, we propose a harmonized scheme for segmentation representation, and convert the data from the studied resources into this common scheme. Harmonized versions of resources available under free licenses are published as a collection called UniSegments 1.0.
... At first, they extracted individual and paired tokens from the text using a systematic feature selection process. Hence, they identified traffic-accident-related tweets using commonly used keywords and preprocessed them by removing stopwords, and then stemmed them using Porter stemmer [56]. In the end, they used appropriate stemmed tokens as features. ...
Article
Full-text available
Social media platforms have many users who share their thoughts and use these platforms to organize various events collectively. However, different upsetting incidents have occurred in recent years by taking advantage of social media, raising significant concerns. Therefore, considerable research has been carried out to detect any disturbing event and take appropriate measures. This review paper presents a thorough survey to acquire in-depth knowledge about the current research in this field and provide a guideline for future research. We systematically review 67 articles on event detection by sensing social media data from the last decade. We summarize their event detection techniques, tools, technologies, datasets, performance metrics, etc. The reviewed papers mainly address the detection of events, such as natural disasters, traffic, sports, real-time events, and some others. As these detected events can quickly provide an overview of the overall condition of the society, they can significantly help in scrutinizing events disrupting social security. We found that compatibility with different languages, spelling, and dialects is one of the vital challenges the event detection algorithms face. On the other hand, the event detection algorithms need to be robust to process different media, such as texts, images, videos, and locations. We outline that the event detection techniques compatible with heterogeneous data, language, and the platform are still missing. Moreover, the event and its location with a 24 × 7 real-time detection system will bolster the overall event detection performance.
... In information retrieval, stemming is the process of reducing all derived or declined words to their base or root form. The Porter stemmer was used in the present work [38]. ...
Article
Full-text available
Negotiation constitutes a fundamental skill that applies to several daily life contexts; however, providing a reliable assessment and definition of it is still an open challenge. The aim of this research is to present an in-depth analysis of the negotiations occurring in a role-play simulation between users and virtual agents using Natural Language Processing. Users were asked to interact with virtual characters in a serious game that helps practice negotiation skills and to complete a psychological test that assesses conflict management skills on five dimensions. The dialogues of 425 participants with virtual agents were recorded, and a dataset comprising 4250 sentences was built. An analysis of the personal pronouns, word context, sentence length and text similarity revealed an overall consistency between the negotiation profiles and the user verbal choices. Integrating and Compromising users displayed a greater tendency to involve the other party in the negotiation using relational pronouns; on the other hand, Dominating individuals tended to use mostly single person pronouns, while Obliging and Avoiding individuals were shown to generally use fewer pronouns. Users with high Integrating and Compromising scores adopted longer sentences and chose words aimed at increasing the other party’s involvement, while more self-concerned profiles showed the opposite pattern.
... 7 We follow standard NLP procedures to clean the keyword-based corpus. We exclude all English stop words and use the Snowball Stemmer (Porter 2006) to only consider the word stems. 8 As we ...
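
As a rough sketch of this cleaning step, assuming NLTK with its "stopwords" data installed (NLTK's SnowballStemmer implements the Snowball/"Porter2" English stemmer):

# Drop English stop words and keep only Snowball word stems.
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stop = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

tokens = "the clustering of software companies in east london".split()
stems = [stemmer.stem(t) for t in tokens if t not in stop]
print(stems)   # e.g. ['cluster', 'softwar', 'compani', 'east', 'london']
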
Article
Full-text available
This paper proposes a new methodological framework to identify economic clusters over space and time. We employ a unique open source dataset of geolocated and archived business webpages and interrogate them using Natural Language Processing to build bottom-up classifications of economic activities. We validate our method on an iconic UK tech cluster – Shoreditch, East London. We benchmark our results against existing case studies and administrative data, replicating the main features of the cluster and providing fresh insights. As well as overcoming limitations in conventional industrial classification, our method addresses some of the spatial and temporal limitations of the clustering literature.
... Observe that the words to be subsumed need to share their first part. One of the currently most frequently employed methods for word stemming was developed by M. F. Porter [15] in 1980. Previous approaches include that proposed by J. B. Lovins [16] as early as 1968. ...
Preprint
Full-text available
... Language provides one of the main ways in which humans communicate and store information and knowledge, which has motivated continuing interest and developments related to the respective automated analysis, especially by employing data structures and artificial intelligence concepts and methods. In the present work, we study the potential effect of text preprocessing, with emphasis on stop word removal, lemmatization and stemming, on the representation of texts by respective similarity networks. Two main similarity comparison approaches are taken into account: cosine similarity, and coincidence similarity. A multiset representation of the paragraphs is also adopted in order to take into account the repetition of words in paragraphs. Each paragraph is mapped into a respective node, while interconnections between paragraphs are assigned weights corresponding to the similarity values. The obtained results are discussed in terms of the obtained level of interconnection details and modularity. The coincidence network obtained while removing stop words and implementing lemmatization is found to be particularly detailed and modular.
... The textual data of the users in the training dataset are pre-processed by reducing all words to their root form, via a Porter stemmer [28]. The stemmed text is then used to extract a vocabulary, which consists of those words and hashtag words (those prefixed by '#') that are present in at least 3 user profiles and at most 80% of all user profiles. ...
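
A hedged sketch of that vocabulary step, with the 3-profile and 80% thresholds taken from the description above and everything else (toy data, token pattern, helper names) purely illustrative:

# Porter-stem each user's concatenated tweets, then keep word and hashtag
# tokens present in at least 3 profiles and at most 80% of all profiles.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def stem_profile(text: str) -> str:           # hypothetical helper
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

profiles = [                                  # toy data: one string per user
    "loving my new gaming rig #gaming #pc",
    "streamed all night #gaming such fun",
    "ranked up again tonight #gaming #win",
    "coffee and code this morning",
    "weekend hike with the dogs",
]
vectorizer = CountVectorizer(
    preprocessor=stem_profile,
    token_pattern=r"#?\w+",                   # keep '#'-prefixed tokens as features
    min_df=3,                                 # in at least 3 user profiles
    max_df=0.8,                               # in at most 80% of profiles
)
X = vectorizer.fit_transform(profiles)
print(vectorizer.get_feature_names_out())
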
Article
Full-text available
We demonstrate a system for predicting gaming related properties from Twitter accounts. Our system predicts various traits of users based on the tweets publicly available in their profiles. Such inferred traits include degrees of tech-savviness and knowledge on computer games, actual gaming performance, preferred platform, degree of originality, humor and influence on others. Our system is based on machine learning models trained on crowd-sourced data. It allows people to select Twitter accounts of their fellow gamers, examine the trait predictions made by our system, and the main drivers of these predictions. We present empirical results on the performance of our system based on its accuracy on our crowd-sourced dataset.
... Stemming: We removed word suffixes and conflated the resulting morphemes with the Porter stemmer [18] which leads to a crude affix chopping. For example, "automates" and "automation" all reduce to "automat" using the Porter stemmer. ...
Article
Full-text available
There remains a limited understanding of the HIV prevention and treatment needs among female sex workers in many parts of the world. Systematic reviews of existing literature can help fill this gap; however, well-done systematic reviews are time-demanding and labor-intensive. Here, we propose an automatic document classification approach to a systematic review to significantly reduce the effort in reviewing documents and optimizing empiric decision making. We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a cluster analysis-based method, and a random forest approach that utilizes a large set of feature tokens. This approach is used to identify documents studying female sex workers that contain content relevant to either HIV or experienced violence. We compare the performance of the three classifiers by cross-validation in terms of area under the curve of the receiver operating characteristic and precision and recall plot, and found random forest approach reduces the amount of manual reading for our example by 80%; in sensitivity analysis, we found that even trained with only 10% of data, the classifier can still avoid reading 75% of future documents (68% of total) while retaining 80% of relevant documents. In sum, the automated procedure of document classification presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, where reviews are updated regularly. We expect to obtain a reasonable classifier by taking 20% of retrieved documents as training samples. The proposed classifier could also be used for more meaningfully assembling literature in other research areas and for rapid documents screening with a tight schedule, such as COVID-related work during the crisis.
... Another case is when the keyphrase consists of a mix of generic and specific words, such as "Milky Way". "Way" is generally a stopword [32], so the keyphrase extractor might only be able to detect "Milky" and throw away "Way" without realizing that the term "Way" is not a stopword in this specific context. ...
Preprint
Full-text available
The scientific publication output grows exponentially. Therefore, it is increasingly challenging to keep track of trends and changes. Understanding scientific documents is an important step in downstream tasks such as knowledge graph building, text mining, and discipline classification. In this workshop, we provide a better understanding of keyword and keyphrase extraction from the abstract of scientific publications.
... A popular algorithm has been introduced by Lovins in 1968 ('Lovins Stemmer') [148]. In 1980, Porter published an algorithm for stemming of the English language, which is still being used nowadays ('Porter Stemmer') [149]. The core of the method is that each word is considered as a sequence of vowels and consonants, where, English suffixes (e. g., '-ed' or '-ation') are either removed or replaced, based on predefined lists in a defined order and based on other conditions, such as the length of vowel-consonant sequences. ...
Thesis
Full-text available
Computer audition is omnipresent in everyday life, in applications ranging from personalised virtual agents to health care. From a technical point of view, the goal is to robustly classify the content of an audio signal in terms of a defined set of labels, such as, e.g., the acoustic scene, a medical diagnosis, or, in the case of speech, what is said or how it is said. Typical approaches employ machine learning (ML), which means that task-specific models are trained by means of examples. Despite recent successes in neural network-based end-to-end learning, taking the raw audio signal as input, models relying on hand-crafted acoustic features are still superior in some domains, especially for tasks where data is scarce. One major issue is nevertheless that a sequence of acoustic low-level descriptors (LLDs) cannot be fed directly into many ML algorithms as they require a static and fixed-length input. Moreover, also for dynamic classifiers, compressing the information of the LLDs over a temporal block by summarising them can be beneficial. However, the type of instance-level representation has a fundamental impact on the performance of the model. In this thesis, the so-called bag-of-audio-words (BoAW) representation is investigated as an alternative to the standard approach of statistical functionals. BoAW is an unsupervised method of representation learning, inspired from the bag-of-words method in natural language processing, forming a histogram of the terms present in a document. The toolkit openXBOW is introduced, enabling systematic learning and optimisation of these feature representations, unified across arbitrary modalities of numeric or symbolic descriptors. A number of experiments on BoAW are presented and discussed, focussing on a large number of potential applications and corresponding databases, ranging from emotion recognition in speech to medical diagnosis. The evaluations include a comparison of different acoustic LLD sets and configurations of the BoAW generation process. The key findings are that BoAW features are a meaningful alternative to statistical functionals, offering certain benefits, while being able to preserve the advantages of functionals, such as data-independence. Furthermore, it is shown that both representations are complementary and their fusion improves the performance of a machine listening system.
... The second module is called "porter stem". This module only maps unigrams if they match after being stemmed with the Porter stemmer [114] (e.g., in this case computers maps to computer). The last module maps synonymous unigrams and we call it "WN synonymy". ...
Thesis
Humans are faced with a constant flow of visual stimuli, e.g., from the environment or when looking at social media. In contrast, visually-impaired people are often incapable to perceive and process this advantageous and beneficial information that could help maneuver them through everyday situations and activities. However, audible feedback such as natural language can give them the ability to better be aware of their surroundings, thus enabling them to autonomously master everyday's challenges. One possibility to create audible feedback is to produce natural language descriptions for visual data such as still images and then read this text to the person. Moreover, textual descriptions for images can be further utilized for text analysis (e.g., sentiment analysis) and information aggregation. In this work, we investigate different approaches and techniques for the automatic generation of natural language of visual data such as still images and video clips. In particular, we look at language models that generate textual descriptions with recurrent neural networks: First, we present a model that allows to generate image captions for scenes that depict interactions between humans and branded products. Thereby, we focus on the correct identification of the brand name in a multi-task training setting and present two new metrics that allow us to evaluate this requirement. Second, we explore the automatic answering of questions posed for an image. In fact, we propose a model that generates answers from scratch instead of predicting an answer from a limited set of possible answers. In comparison to related works, we are therefore able to generate rare answers, which are not contained in the pool of frequent answers. Third, we review the automatic generation of doctors' reports for chest X-ray images. That is, we introduce a model that can cope with a dataset bias of medical datasets (i.e., abnormal cases are very rare) and generates reports with a hierarchical recurrent model. We also investigate the correlation between the distinctiveness of the report and the score in traditional metrics and find a discrepancy between good scores and accurate reports. Then, we examine self-attentive language models that improve computational efficiency and performance over the recurrent models. Specifically, we utilize the Transformer architecture. First, we expand the automatic description generation to the domain of videos where we present a video-to-text (VTT) model that can easily synchronize audio-visual features. With an extensive experimental exploration, we verify the effectiveness of our video-to-text translation pipeline. Finally, we revisit our recurrent models with this self-attentive approach.
... Finally, the resulting expressions were stemmed. For this, Mathematica 13, employing Porter's algorithm for stemming [67], was used as the software. While lacking the features of more advanced natural language processing (NLP), Mathematica produced a quite acceptable outcome and was selected here because it is supposedly easy enough to adopt in comparison to more advanced NLP methods. ...
Article
Full-text available
Complex networks are often used to analyze written text and reports by rendering texts in the form of a semantic network, forming a lexicon of words or key terms. Many existing methods to construct lexicons are based on counting word co-occurrences, having the advantage of simplicity and ease of applicability. Here, we use a quantum semantics approach to generalize such methods, allowing us to model the entanglement of terms and words. We show how quantum semantics can be applied to reveal disciplinary differences in the use of key terms by analyzing 12 scholarly texts that represent the different positions of various disciplinary schools (of conceptual change research) on the same topic (conceptual change). In addition, attention is paid to how closely the lexicons corresponding to different positions can be brought into agreement by suitable tuning of the entanglement factors. In comparing the lexicons, we invoke complex network-based analysis based on exponential matrix transformation and use information theoretic relative entropy (Jensen–Shannon divergence) as the operationalization of differences between lexicons. The results suggest that quantum semantics is a viable way to model the disciplinary differences of lexicons and how they can be tuned for a better agreement.
... We extract text features by considering all commit logs as a bag of words, excluding stop words (e.g., "as", "is", "would", etc.) which appear very frequently in any English document and will not hold any discriminative power. We then reduce each word to its root form using Porter's stemming algorithm (Porter 1980). Finally, given the large number of rooted words, and to limit the curse of dimensionality, we focus on the top 10 most recurring words in commit logs of security patches for the feature engineering step. ...
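
A hedged, illustrative sketch of that feature step (bag of words over commit logs, stop-word removal, Porter stemming, top-10 most frequent stems), not the SSPCatcher implementation itself:

# Count Porter stems across commit logs and keep the k most frequent ones.
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def top_stems(commit_logs, k=10):
    counts = Counter()
    for log in commit_logs:
        for tok in re.findall(r"[a-z]+", log.lower()):
            if tok not in stop:
                counts[stemmer.stem(tok)] += 1
    return [stem for stem, _ in counts.most_common(k)]

logs = [
    "Fix buffer overflow in parser",
    "Sanitize user input to prevent injection",
    "Fix null pointer dereference in input handling",
]
print(top_stems(logs))
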
Article
Full-text available
Timely patching (i.e., the act of applying code changes to a program source code) is paramount to safeguard users and maintainers against dire consequences of malicious attacks. In practice, patching is prioritized following the nature of the code change that is committed in the code repository. When such a change is labeled as being security-relevant, i.e., as fixing a vulnerability, maintainers rapidly spread the change, and users are notified about the need to update to a new version of the library or of the application. Unfortunately, oftentimes, some security-relevant changes go unnoticed as they represent silent fixes of vulnerabilities. In this paper, we propose SSPCatcher, a Co-Training-based approach to catch security patches (i.e., patches that address vulnerable code) as part of an automatic monitoring service of code repositories. Leveraging different classes of features, we empirically show that such automation is feasible and can yield a precision of over 80% in identifying security patches, with an unprecedented recall of over 80%. Beyond such a benchmarking with ground truth data which demonstrates an improvement over the state-of-the-art, we confirmed that SSPCatcher can help catch security patches that were not reported as such.
... Each scraped domain yields a single string of concatenated words pre-processed to remove capitalization, punctuation, numbers, and stop words (e.g., a, the, is, are, etc.), and then stemmed using the Porter Stemmer algorithm [31] and lemmatized using the WordNet Lemmatizer [26] (both of which are implemented in the Python Natural Language Toolkit). In order to reduce feature-vector dimensionality and focus on discriminatory features, any under-represented words (present in fewer than 10% of our domains) or over-represented words (present in more than 90% of our domains) are eliminated from consideration. ...
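
Sketching that pipeline under the stated thresholds (10%/90% document frequency), with NLTK for stemming/lemmatisation and scikit-learn for the frequency filter; the toy domain strings are placeholders:

# Lowercase, strip punctuation/numbers/stop words, Porter-stem and
# WordNet-lemmatize each token, then drop terms present in fewer than 10%
# or more than 90% of domains. Needs NLTK's "stopwords" and "wordnet" data.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

stop = set(stopwords.words("english"))
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def clean(domain_text: str) -> str:
    tokens = re.findall(r"[a-z]+", domain_text.lower())
    tokens = [t for t in tokens if t not in stop]
    return " ".join(lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens)

domains = [   # toy stand-ins for the concatenated text of each scraped domain
    "breaking news about the election results and voter fraud claims",
    "shop the best deals on shoes and handbags today",
    "independent journalism covering politics and elections worldwide",
]
vectorizer = CountVectorizer(preprocessor=clean, min_df=0.1, max_df=0.9)
X = vectorizer.fit_transform(domains)
print(vectorizer.get_feature_names_out())
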
Preprint
Full-text available
How, in 20 short years, did we go from the promise of the internet to democratize access to knowledge and make the world more understanding and enlightened, to the litany of daily horrors that is today's internet? We are awash in disinformation consisting of lies, conspiracies, and general nonsense, all with real-world implications ranging from horrific humans rights violations to threats to our democracy and global public health. Although the internet is vast, the peddlers of disinformation appear to be more localized. To this end, we describe a domain-level analysis for predicting if a domain is complicit in distributing or amplifying disinformation. This process analyzes the underlying domain content and the hyperlinking connectivity between domains to predict if a domain is peddling in disinformation. These basic insights extend to an analysis of disinformation on Telegram and Twitter. From these insights, we propose that search engines and social-media recommendation algorithms can systematically discover and demote the worst disinformation offenders, returning some trust and sanity to our online communities.
... Each data instance contains raw tweet text and a label (i.e., "Ham" or "Spam"). Each tweet text and its label is pre-processed to be cleaned and converted into features through a number of steps: (1) Tokenize the tweet and remove extra space and special characters, (2) Stem each tokenized word using the "Porter Stemmer" [65]. This will reduce the tokenized word to its root, stem, or base. ...
Article
Full-text available
Recently, spam on online social networks has attracted attention in the research and business world. Twitter has become the preferred medium to spread spam content. Many research efforts have attempted to counter social network spam. Twitter brought extra challenges represented by the feature space size and imbalanced data distributions. Usually, the related research works focus on part of these main challenges or produce black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyperparameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting classifier and reduces the feature space of the tweets dataset to generate a spam prediction model. The model is validated using a 50 times repeated 10-fold stratified cross-validation, and analyzed using nonparametric statistical tests. The resulting prediction model attains on average 82.32% and 92.67% in terms of geometric mean and accuracy respectively, utilizing less than 10% of the total feature space. The empirical results show that the modified genetic algorithm outperforms Chi² and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared to related works.
... Furthermore, k-means clustering was done on keywords to assess concepts in village chicken production as a significant food security tool by utilizing the conceptual structure function of the bibliometrix R-package. This function employs Porter's stemming algorithm [45] to reduce inflected words to their word stem. ...
Article
Full-text available
Background The present study aimed to reveal outputs of research works on village chicken production as a tool to combat food insecurity, taking into account the recurring challenge posed by food shortage and high rise in hunger among vulnerable people of several countries. Results On aggregate, 104 publications were obtained in a BibTeX design for analysis using bibliometric package in R studio. The obtained data comprised, but not limited to authors, citations, institutions, key words and journals. Published articles on village chicken production with relation to food security retrieved from web of science (WOS) and Scopus data banks were utilized with a rise in research publications of a yearly growth of 12.93% during the study period. With regard to country, USA was ranked first with an aggregate sum of publications (n = 16), and a huge global academic influence with most top article citations (n = 509). The frequently used authors’ keywords in this studied research area were food security (n = 23), poultry (n = 9), chickens (n = 7), backyard poultry (n = 5), gender (n = 4), which all together created a hint on related studies on village chicken production and food security. Conclusions The present study provides a worldwide situation that traverse the intellectual quandary on village chicken production and food security research, and a direction for further researches in this field. It is very vital to emphasize that the current study only dealt with principal areas of village chicken production as related to food security research, hence, it is projected that new empirical research and prospective research findings would afford new knowledge and insight on village chicken production as a means to address food security challenges as new studies evolves.
... For this purpose, we analysed the collected dataset by looking for frequency of key words as well as evaluating them qualitatively through open coding. To make sure that related words were not treated separately, the word frequency was based on word stems by using the Porter stemmer (Porter 1997). ...
... Where the algorithm provided multiple suggestions, the word with the highest frequency across responses was used. To reduce data sparsity, in the STM analysis, we further stemmed words using the Porter (1980) algorithm, dropped responses if they contained fewer than five words, and dropped words if they appeared in fewer than five responses (Banks et al., 2018). Data cleaning was carried out in R version 3.6.3 ...
Article
Full-text available
Background The COVID-19 pandemic has had substantial impacts on lives across the globe. Job losses have been widespread, and individuals have experienced significant restrictions on their usual activities, including extended isolation from family and friends. While studies suggest population mental health worsened from before the pandemic, not all individuals appear to have experienced poorer mental health. This raises the question of how people managed to cope during the pandemic. Methods To understand the coping strategies individuals employed during the COVID-19 pandemic, we used structural topic modelling, a text mining technique, to extract themes from free-text data on coping from over 11,000 UK adults, collected between 14 October and 26 November 2020. Results We identified 16 topics. The most discussed coping strategy was ‘thinking positively’ and involved themes of gratefulness and positivity. Other strategies included engaging in activities and hobbies (such as doing DIY, exercising, walking and spending time in nature), keeping routines, and focusing on one day at a time. Some participants reported more avoidant coping strategies, such as drinking alcohol and binge eating. Coping strategies varied by respondent characteristics including age, personality traits and sociodemographic characteristics and some coping strategies, such as engaging in creative activities, were associated with more positive lockdown experiences. Conclusion A variety of coping strategies were employed by individuals during the COVID-19 pandemic. The coping strategy an individual adopted was related to their overall lockdown experiences. This may be useful for helping individuals prepare for future lockdowns or other events resulting in self-isolation.
... Verbs are transformed to the infinitive form and suffixes are stripped from words to get the roots, which are easier to recall. A good example of a stemming algorithm is that of Porter, M. F. (1980) [10]. After the page is pre-processed, better classification precision is obtained, with improved results when using distance functions such as cosine similarity. ...
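
For example, a pre-processed page can be compared against class prototypes with cosine similarity over TF-IDF vectors (a generic sketch, not the authors' exact classifier; the prototype strings are invented):

# Represent a page and class prototypes as TF-IDF vectors, then compare
# the page to each prototype with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class_prototypes = [
    "buy shoes shirts discount shipping cart",     # shopping
    "match goal league player score tournament",   # sports
]
page = "new player scores winning goal in league final"

vec = TfidfVectorizer()
X = vec.fit_transform(class_prototypes + [page])
sims = cosine_similarity(X[-1], X[:-1])
print(sims)   # the page scores higher against the sports prototype
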
Article
Full-text available
The Internet is a virtual society: opinion expression, information distribution, content sharing, buying, etc., loaded with an astonishing amount of information. Human knowledge has the ability to gather and process this data; doing it automatically is a challenging research field. On the other hand, human knowledge is limited when dealing with the large volume of data available, so it is necessary to find a machine-supported system that helps humans find and filter data. In this article we review Web Mining techniques and describe a Bootstrap Statistics methodology applied to pattern model classifier building, comparison and verification for Supervised Learning. It is virtually impossible to thoroughly test these models with purely empirical data, but using the computer-based Bootstrap paradigm it is possible to design a test environment where they are checked with less human intervention and with better confidence in their behaviour. The Bootstrap technique is a powerful tool for Internet-related work: it allows the creation of test environments that simulate real conditions with less human effort. We go further by varying the characteristics of the sample and applying the bootstrap in each case to analyze model behavior.
... We further performed Porter's stemming process [46] on the lemma, which removes word endings to produce a stem that may not itself be a word present in the dictionary. ...
Preprint
In software development teams, developer turnover is among the primary reasons for project failures as it leads to a great void of knowledge and strain for the newcomers. Unfortunately, no established methods exist to measure how knowledge is distributed among development teams. Knowing how this knowledge evolves and is owned by key developers in a project helps managers reduce risks caused by turnover. To this end, this paper introduces a novel, realistic representation of domain knowledge distribution: the ConceptRealm. To construct the ConceptRealm, we employ a latent Dirichlet allocation model to represent textual features obtained from 300k issues and 1.3M comments from 518 open-source projects. We analyze whether the newly emerged issues and developers share similar concepts or how aligned the developers' concepts are with the team over time. We also investigate the impact of leaving members on the frequency of concepts. Finally, we evaluate the soundness of our approach to closed-source software, thus allowing the validation of the results from a practical standpoint. We find out that the ConceptRealm can represent the high-level domain knowledge within a team and can be utilized to predict the alignment of developers with issues. We also observe that projects exhibit many keepers independent of project maturity and that abruptly leaving keepers harm the team's concept familiarity.
... Subsequently, the stemming step is done using the Porter Stemmer algorithm [243] in the nltk library. Stemming is needed to avoid treating two or more words with the same meaning but in different forms (e.g., "allow" vs. "allows") as distinct. ...
Preprint
Full-text available
The thesis advances the field of software security by providing knowledge and automation support for software vulnerability assessment using data-driven approaches. Software vulnerability assessment provides important and multifaceted information to prevent and mitigate dangerous cyber-attacks in the wild. The key contributions include a systematisation of knowledge, along with a suite of novel data-driven techniques and practical recommendations for researchers and practitioners in the area. The thesis results help improve the understanding and inform the practice of assessing ever-increasing vulnerabilities in real-world software systems. This in turn enables more thorough and timely fixing prioritisation and planning of these critical security issues.
... In many situations, techniques to reduce complexity, such as stemming and lemmatization, may also be applied. Stemming is a heuristic method that reduces each term to its root form (stem) so that words with the same stem may be analyzed together (Porter, 1980). For example, words such as "run" and "running" may be analyzed together after a stemming procedure. ...
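
The difference between the two techniques can be seen with a small NLTK snippet (assuming the WordNet data is available); the stemmer clips suffixes heuristically, while the lemmatizer maps words to dictionary forms:

# Contrast stemming (suffix stripping) with lemmatization (dictionary lookup).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["run", "running", "studies", "better"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
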
Chapter
Electronic word-of-mouth (e-WOM) is a very important way for firms to measure the pulse of its online reputation. Today, consumers use e-WOM as a way to interact with companies and share not only their satisfaction with the experience, but also their discontent. E-WOM is even a good way for companies to co-create better experiences that meet consumer needs. However, not many companies are using such unstructured information as a valuable resource to help in decision making: first, because e-WOM is mainly textual information that needs special data treatment and second, because it is spread in many different platforms and occurs in near-real-time, which makes it hard to handle. The current chapter revises the main methodologies used successfully to unravel hidden patterns in e-WOM in order to help decision makers to use such information to better align their companies with the consumer's needs.
Article
Full-text available
Text mining has become an important research area. Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. Feature selection in clustering is used to extract the important information from a huge collection of data by analysing different patterns of similar data. The main issues in clustering concern the accuracy and efficiency of the data. A database can contain several dimensions or attributes. Many clustering techniques are designed for clustering low-dimensional data. In high-dimensional space, discovering clusters of data objects is challenging due to the curse of dimensionality. When the dimensionality increases, data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered. Text mining is a technique for analysing text documents to extract useful information and knowledge. Most text mining approaches, such as classification, clustering, and summarisation, need features such as terms (words), patterns (frequent term sets), or phrases (n-grams) to represent text documents. To improve the performance of text mining techniques, text feature selection is a strategy to select a set of text features relevant to the mining task and use these features to represent the documents of interest. However, ensuring the high quality of features selected from text is a challenge owing to the massive amount of irrelevant information in text documents. For instance, text features often include features that are redundant or irrelevant; these are considered noisy features in this study. Some term-based or pattern-based approaches are intended to find possibly relevant features for a given topic; however, these approaches have not provided an adequate way to see the relations between features, especially between patterns and n-grams, so they are unlikely to find the right set of features. In this study, we introduce two ways to consider the relations between features in text. The first is to use a co-occurrence matrix to describe the relations between patterns. We also present an extended random set model to capture the relations between n-grams or patterns based on their components. We then propose algorithms to select features using this extended model and the related methods. To evaluate the proposed algorithms and methods, we use the selected features in an information filtering system. The experiments are conducted using two standard datasets: Reuters Corpus Volume 1 (RCV1) and Reuters 21578. Substantial experiments on both datasets are compared with state-of-the-art techniques, and the results of the proposed approaches show a significant improvement in performance for text feature selection.
Article
Most cross-national human rights datasets rely on human coding to produce yearly, country-level indicators of state human rights practices. Hand-coding the documents that contain the information on which these scores are based is tedious and time-consuming, but has been viewed as necessary given the complexity and detail of the information contained in the text. However, advances in automated text analysis have the potential to streamline this process without sacrificing accuracy. In this research note, we take the first step in creating this streamlined process by employing a supervised machine learning automated coding method that extracts specific allegations of physical integrity rights violations from the original text of country reports on human rights. This method produces a dataset including 163,512 unique abuse allegations in 196 countries between 1999 and 2016. This dataset and method will assist researchers of physical integrity rights abuse because it will allow them to produce allegation-level human rights measures that have previously not existed and provide a jumping-off point for future projects aimed at using supervised machine learning to create global human rights metrics.
Article
During a search, phrase-terms expressed in queries are presented to an information retrieval system (IRS) to find documents relevant to a topic. The IRS makes relevance judgements by attempting to match vocabulary in queries to documents. If there is a mismatch, the problem of vocabulary mismatch occurs. The aim is to examine ways of searching for documents more effectively, in order to minimise mismatches. A further aim is to understand the mechanisms of, and the differences between, human and machine-assisted retrieval. The objective of this study was to determine whether IRS-H (an IRS using the hybrid indexing method) and human participants agree or disagree on relevancy judgments, and whether the problem of mismatching vocabulary can be solved. A collection of eighty research documents and sixty-five phrase-terms was presented to (i) IRS-H and four participants in Test 1, and (ii) IRS-H and one participant (aided by search software) in Test 2. Statistical analysis was performed using the Kappa coefficient. The judgements of IRS-H and the four participants disagreed, whereas the judgments of IRS-H and the participant aided by search software agreed. IRS-H solves the problem of mismatching vocabulary between a query and a document.
Article
Electronic nicotine delivery systems (ENDS) (also known as ‘e-cigarettes’) can support smoking cessation, although the long-term health impacts are not yet known. In 2019, a cluster of lung injury cases in the USA emerged that were ostensibly associated with ENDS use. Subsequent investigations revealed a link with vitamin E acetate, an additive used in some ENDS liquid products containing tetrahydrocannabinol (THC). This became known as the EVALI (E-cigarette or Vaping product use Associated Lung Injury) outbreak. While few cases were reported in the UK, the EVALI outbreak intensified attention on ENDS in general worldwide. We aimed to describe and explore public commentary and discussion on Twitter immediately before, during and following the peak of the EVALI outbreak using text mining techniques. Specifically, topic modelling, operationalised using Latent Dirichlet Allocation (LDA) models, was used to discern discussion topics in 189,658 tweets about ENDS (collected April - December 2019). Individual tweets and Twitter users were assigned to their dominant topics and countries respectively to enable international comparisons. A 10-topic LDA model fit the data best. We organised the ten topics into three broad themes for the purposes of reporting: informal vaping discussion; vaping policy discussion and EVALI news; and vaping commerce. Following EVALI, there were signs that informal vaping discussion topics decreased while discussion topics about vaping policy and the relative health risks and benefits of ENDS increased, not limited to THC products. Though subsequently attributed to THC products, the EVALI outbreak disrupted online public discourses about ENDS generally, amplifying health and policy commentary. There was a relatively stronger presence of commercially oriented tweets among UK Twitter users compared to USA users.
Article
During the past 15 years, automatic text scaling has become one of the key tools of the Text as Data community in political science. Prominent text-scaling algorithms, however, rely on the assumption that latent positions can be captured just by leveraging the information about word frequencies in documents under study. We challenge this traditional view and present a new, semantically aware text-scaling algorithm, SemScale, which combines recent developments in the area of computational linguistics with unsupervised graph-based clustering. We conduct an extensive quantitative analysis over a collection of speeches from the European Parliament in five different languages and from two different legislative terms, and we show that a scaling approach relying on semantic document representations is often better at capturing known underlying political dimensions than the established frequency-based (i.e., symbolic) scaling method. We further validate our findings through a series of experiments focused on text preprocessing and feature selection, document representation, scaling of party manifestos, and a supervised extension of our algorithm. To catalyze further research on this new branch of text-scaling methods, we release a Python implementation of SemScale with all included datasets and evaluation procedures.
Article
Text preprocessing is not only an essential step to prepare the corpus for modeling but also a key area that directly affects the natural language processing (NLP) application results. For instance, precise tokenization increases the accuracy of part-of-speech (POS) tagging, and retaining multiword expressions improves reasoning and machine translation. The text corpus needs to be appropriately preprocessed before it is ready to serve as the input to computer models. The preprocessing requirements depend on both the nature of the corpus and the NLP application itself, that is, what researchers would like to achieve from analyzing the data. Conventional text preprocessing practices generally suffice, but there exist situations where the text preprocessing needs to be customized for better analysis results. Hence, we discuss the pros and cons of several common text preprocessing methods: removing formatting, tokenization, text normalization, handling punctuation, removing stopwords, stemming and lemmatization, n-gramming, and identifying multiword expressions. Then, we provide examples of text datasets which require special preprocessing and how previous researchers handled the challenge. We expect this article to be a starting guideline on how to select and fine-tune text preprocessing methods.
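For concreteness, here is a short sketch of one conventional pipeline (tokenization, lowercasing, stopword removal, Porter stemming) using NLTK; which of these steps is appropriate depends, as the article stresses, on the corpus and the downstream task.

```python
# A conventional preprocessing sketch: tokenize, lowercase, drop punctuation
# and stopwords, then stem. Whether each step helps depends on the corpus and
# the NLP application; this is a generic illustration, not a recommendation.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "The stemmers and lemmatizers were compared on several corpora."
stop = set(stopwords.words("english"))

tokens = nltk.word_tokenize(text.lower())
tokens = [t for t in tokens if t.isalpha() and t not in stop]
stems = [PorterStemmer().stem(t) for t in tokens]
print(stems)
```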
Conference Paper
Full-text available
We describe our work on sentiment analysis for Hausa, where we investigated monolingual and cross-lingual approaches to classify student comments in course evaluations. Furthermore, we propose a novel stemming algorithm to improve accuracy. For studies in this area, we collected a corpus of more than 40,000 comments: the Hausa-English Sentiment Analysis Corpus For Educational Environments (HESAC). Our results demonstrate that the monolingual approaches for Hausa sentiment analysis slightly outperform the cross-lingual systems. Using our stemming algorithm in the pre-processing even improved the best model, resulting in 97.4% accuracy on HESAC.
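The paper's Hausa stemming algorithm is not reproduced here; the sketch below only illustrates the integration point, plugging a stemmer (NLTK's English Porter stemmer as a stand-in) into a simple sentiment classifier with made-up comments and labels.

```python
# Illustration of where a stemmer slots into a sentiment pipeline. NLTK's
# English Porter stemmer stands in for the paper's Hausa stemmer, and the
# comments and labels are made up.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def stemming_analyzer(text):
    # Tokenize crudely on whitespace, then stem each token before vectorizing.
    return [stemmer.stem(tok) for tok in text.lower().split()]

comments = ["the lecturer explained everything clearly",
            "the course was confusing and rushed"]
labels = [1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(analyzer=stemming_analyzer), MultinomialNB())
model.fit(comments, labels)
print(model.predict(["a clearly explained course"]))
```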
Chapter
Purchase decisions are better when opinions and reviews about products are considered. Similarly, reviewing customer feedback helps improve sales and ultimately benefits the business. Web 2.0 provides various platforms, such as Twitter and Facebook, where one can comment, review, or post to express happiness, anger, disbelief, or sadness toward products, people, etc. Computationally analyzing the sentiments in text requires a solid understanding of the technologies used in sentiment analysis, and this chapter gives a comprehensive overview of those techniques. Machine learning approaches are most commonly used, while, depending on the text and the required results, lexicon-based approaches are also applied. The chapter includes a discussion of the evaluation parameters for sentiment analysis, highlights the ontology approach, and surveys outstanding contributions made in this field. Keywords: Sentiment Analysis, Product reviews, Supervised learning, Unsupervised learning, Social networking websites, Ontology
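As a brief illustration of the lexicon-based route the chapter discusses, the following uses NLTK's VADER lexicon on a made-up product review; the specific tool and example are assumptions, not the chapter's own implementation.

```python
# Lexicon-based sentiment scoring with NLTK's VADER; the review is made up.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

review = "The battery life is great, but the camera is disappointing."
scores = SentimentIntensityAnalyzer().polarity_scores(review)
print(scores)  # negative, neutral, positive, and compound scores
```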
Chapter
Due to the advent of Web 2.0, the size of social media content (SMC) is growing rapidly and is likely to increase even faster in the near future. Social media applications such as Instagram, Twitter, and Facebook have become an integral part of our lives, as they prompt people to give their opinions and share information around the world. Identifying emotions in SMC is important for many aspects of sentiment analysis (SA) and is high on the agenda of many firms today. SA on social media (SASM) extends an organization's ability to capture and study public sentiment toward social events and activities in real time. This chapter studies recent advances in machine learning (ML) used for SMC analysis and its applications. The framework of SASM consists of several phases, such as data collection, pre-processing, feature representation, model building, and evaluation. This survey presents the basic elements of SASM and its utility, and reports that ML makes a significant contribution to SMC mining. Finally, the chapter highlights certain issues related to ML used for SMC.
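A condensed, purely illustrative sketch of the later SASM phases (pre-processing and feature representation via TF-IDF, model building, and evaluation on a held-out split) is shown below; the posts and labels are placeholders.

```python
# Toy walk-through of pre-processing/feature representation, model building,
# and evaluation; the posts and labels are placeholders, not real SMC data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

posts = ["love this event", "terrible service today", "great community meetup",
         "worst update ever", "really happy with the launch", "so disappointed again"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.33, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```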
Article
It is widely recognized that students’ learning can be enhanced and facilitated when students have the opportunity to work together in teams. As a consequence, the pursuit of a methodology to form optimal student teams continues to challenge academics. Based on a review of related literature, we propose a model that includes new approaches to two team criteria. The first is a discrete optimization approach to commonality of schedule. To facilitate team meetings, we offer an exact formulation to ensure students on a given team share a minimum number of common time slots during which they are available. The second team criterion is sufficient soft skills. Using a unique text analysis approach, we ensure that each team includes students with adequate soft skills, such as leadership and interpersonal skills. Our analytic approach enhances the students’ learning experience and class performance and simplifies the faculty task of forming teams.
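The paper's exact integer-programming formulation is not reproduced here; the toy check below only illustrates the schedule criterion, namely that a candidate team is feasible when its members share at least a minimum number of common time slots.

```python
# Toy feasibility check for the schedule criterion described above; the slot
# identifiers and the threshold are illustrative, not the paper's formulation.
def team_is_feasible(availabilities, min_common=2):
    """availabilities: one set of available time-slot identifiers per student."""
    common = set.intersection(*availabilities)
    return len(common) >= min_common

alice = {"Mon-9", "Tue-14", "Thu-10"}
bob = {"Tue-14", "Thu-10", "Fri-9"}
carol = {"Thu-10", "Tue-14"}
print(team_is_feasible([alice, bob, carol]))  # True: they share Tue-14 and Thu-10
```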
Article
With the advent of artificial intelligence, most of the main techniques have found their way into intelligent education. Knowledge tracing is one of the essential tasks in educational research, which aims to model and quantify students' procedural knowledge acquisition using machine learning or deep learning techniques. While numerous studies have focused on improving models and algorithms of knowledge tracing, few have thoroughly examined the dynamic and complex aspects of this research field. This study conducts a bibliometric analysis of 383 key articles published between 1992 and 2021 to review the evolutionary nuances of knowledge tracing research. In addition, we employ document clustering to uncover the most common topics of knowledge tracing and systematically review each topic's characteristics. Major findings include broad trends in knowledge tracing, such as the most productive authors, the most referenced articles, and the occurrence of author keywords. Existing knowledge tracing models are further divided into three clusters: Markov process-based knowledge tracing, logistic knowledge tracing, and deep learning-based knowledge tracing. The attributes of each cluster are then discussed, along with recent developments and applications. Finally, we highlight existing constraints and identify promising future research topics in knowledge tracing.
Article
In Information Retrieval, numerous retrieval models or document ranking functions have been developed in the quest for better retrieval effectiveness. Apart from some formal retrieval models formulated on a theoretical basis, various recent works have applied heuristic constraints to guide the derivation of document ranking functions. While many recent methods are shown to improve over established and successful models, comparison among these new methods under a common environment is often missing. To address this issue, we perform an extensive and up-to-date comparison of leading term-independence retrieval models implemented in our own retrieval system. Our study focuses on the following questions: (RQ1) Is there a retrieval model that consistently outperforms all other models across multiple collections; (RQ2) What are the important features of an effective document ranking function? Our retrieval experiments performed on several TREC test collections of a wide range of sizes (up to the terabyte-sized Clueweb09 Category B) enable us to answer these research questions. This work also serves as a reproducibility study for leading retrieval models. While our experiments show that no single retrieval model outperforms all others across all tested collections, some recent retrieval models, such as MATF and MVD, consistently perform better than the common baselines.
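For context, here is a compact sketch of BM25, one of the common term-independence baselines such comparisons include; the k1 and b values are typical defaults and the toy corpus is illustrative.

```python
# A compact BM25 scoring sketch; k1 and b are typical defaults and the toy
# corpus (pre-tokenized documents) is purely illustrative.
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = doc.count(term)                              # term frequency in doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["suffix", "stripping", "algorithm"],
          ["retrieval", "model", "ranking"],
          ["ranking", "function", "retrieval", "effectiveness"]]
print(bm25_score(["retrieval", "ranking"], corpus[2], corpus))
```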
Article
This study investigates the role of media owners in the political bias of newspapers in Sweden, using an original dataset on outlets, consumer preferences, and ownership between January 2014 and April 2019. We construct an index of slant based on similarities in the language between newspapers and speeches given by members of parliament. Our results indicate that newspapers held by the same owner tend to offer the same mix of slant, rather than aligning their bias with consumer preferences in their area of circulation. Owners are even less inclined to differentiate the slant across outlets before elections, when the political returns to persuasion are high. We find no evidence that owners impose a one-size-fits-all slant because product differentiation is too costly. In addition, we find suggestive evidence of owner-independent bias induced by the writers of opinion articles. The Swedish context illustrates that supply-driven slant cannot be ruled out in market-based media systems if the ties between media and politics are strong.
Article
Full-text available
Technology question-and-answer websites are a great source of technical knowledge. Users of these websites raise various types of technical questions and answer them. These questions cover a wide range of domains in Computer Science, such as Networks, Data Mining, Multimedia, Multithreading, Web Development, and Mobile App Development. Analyzing the actual textual content of these websites can help the computer science and software engineering community better understand the needs of developers and learn about current trends in technology. In this project, textual data from the well-known question-and-answer website StackOverflow is analyzed using the Latent Dirichlet Allocation (LDA) topic modeling algorithm. The results show that these techniques help discover dominant topics in developer discussions. These topics are analyzed to derive a number of interesting observations, such as the most popular technologies and languages, the impact of a technology, technology trends over time, the relationships of a technology or language with other technologies, and comparisons of technologies addressing the same area of computer science or software engineering.