Conference Paper

Abstract

The goal of the current work is to evaluate semantic feature aggregation techniques in a task of gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and use them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of the algorithms is applied for feature aggregation in a gender classification task based on a smaller Facebook sample. The classification performance of the best model compares favorably with the lemma baseline and with state-of-the-art results reported for a different genre or language. The resulting successful features are exemplified, and the differences between the three techniques in terms of classification performance and feature content are discussed, with the best technique clearly outperforming the others.
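To make the evaluated setup concrete, here is a minimal sketch of the comparison the abstract describes, using scikit-learn: a TF-IDF-over-lemmas baseline versus the same classifier over aggregated features, where each lemma is replaced by the identifier of its semantic cluster or topic. The toy documents, labels, and the cluster_of lookup below are invented for illustration; the paper's actual features, dataset, and classifier are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in: one lemmatized document per user with a binary gender label.
posts = ["кот дом окно", "футбол матч гол", "кино театр кот", "гол матч команда"]
gender = [0, 1, 0, 1]

# Hypothetical lemma -> cluster-id lookup: replacing lemmas by cluster ids
# gives the classifier aggregated semantic features instead of raw lemmas.
cluster_of = {"кот": "c_animals", "дом": "c_home", "окно": "c_home",
              "футбол": "c_sport", "матч": "c_sport", "гол": "c_sport",
              "кино": "c_leisure", "театр": "c_leisure", "команда": "c_sport"}
agg_posts = [" ".join(cluster_of.get(w, w) for w in p.split()) for p in posts]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, posts, gender, cv=2).mean())      # lemma baseline
print(cross_val_score(clf, agg_posts, gender, cv=2).mean())  # aggregated features
```

On real data the interesting question, as in the paper, is whether the aggregated representation closes or widens the gap against the lemma baseline.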


... And where the technical capabilities exist (and also, at times, owing to various restrictions introduced by the management of some social networks), these traces can be studied. This approach to data collection, and to designing and conducting psychological and interdisciplinary research in general, is gaining popularity (Ledovaya et al., 2017a, b; Bogolyubova et al., 2017; Guntuku et al., 2017; Kosinski et al., 2015; Moskvichev et al., 2018; Panicheva et al., 2016, 2018). Researchers from the social sciences and linguistics, as well as mathematicians, programmers, and data specialists, have begun studying human behaviour in social networks, trying to find and analyse the externally observable, recorded traces of users' histories. ...
... Finally, fifth, data of this scale make it possible to build regression models and test predictive models of people's personality traits and behaviour. For example, a user's personality traits can be predicted from the texts of their public posts or from the topics of the communities they follow (so-called "cheap" data, usually collectable with a crawler program), provided that enough "expensive" data has previously been collected from a large number of other users, i.e. data containing those users' answers to psychological questionnaires, which can be matched, for model building, with data on their behaviour in the social network (texts, subscriptions to public pages, etc.) collected by a crawler or through an application (Moskvichev et al., 2018; Panicheva et al., 2018; Schwartz, Ungar, 2015). ...
Article
Full-text available
The article reviews the opportunities for studying the online behaviour of people from the most diverse groups and samples that social networks have created over the past 15 years. Statistics on engagement with social networks are given. The advantages and potential drawbacks of studies analysing so-called "digital footprints" of the personality are described. The ethical aspects of collecting such data are considered using the case of Cambridge Analytica as an example. The current news discourse and the results of the investigation of this case, reported inaccurately to the consumers of the major mass media, are examined from a critical standpoint. Examples of "digital footprint" research conducted in Russia are also given, and some results of the St. Petersburg State University project "Stress, health and psychological well-being in social networks: a cross-cultural study" are described, concerning the linguistic correlates of the Dark Triad traits: narcissism, Machiavellianism, and non-clinical psychopathy.
... Depending on the source of the data, this information can be directly obtained from the user profiles. However, in some cases this information can be partially hidden or completely unavailable, so researchers may also apply machine learning methods to identify gender or age group based on texts (Litvinova, Sboev, & Panicheva, 2018; Panicheva, Mirzagitova, & Ledovaya, 2017; Sboev, Litvinova, Gudovskikh, Rybka, & Moloshnikov, 2016). Applications of sentiment analysis, in most cases, do not occur in isolation. ...
Article
Recently, transfer learning from pre-trained language models has proven to be effective in a variety of natural language processing tasks, including sentiment analysis. This paper aims at identifying deep transfer learning baselines for sentiment analysis in Russian. Firstly, we identified the most used publicly available sentiment analysis datasets in Russian and recent language models which officially support the Russian language. Secondly, we fine-tuned Multilingual Bidirectional Encoder Representations from Transformers (BERT), RuBERT, and two versions of the Multilingual Universal Sentence Encoder and obtained strong, or even new, state-of-the-art results on seven sentiment datasets in Russian: SentiRuEval-2016, SentiRuEval-2015, RuTweetCorp, RuSentiment, LINIS Crowd, the Kaggle Russian News Dataset, and RuReviews. Lastly, we made the fine-tuned models publicly available for the research community.
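As an illustration of the transfer-learning recipe this abstract reports, the following sketch fine-tunes a RuBERT checkpoint for binary sentiment classification with the Hugging Face transformers and datasets libraries. The checkpoint name, toy texts, and training settings are assumptions for illustration, not the paper's exact configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical two-example corpus; a real run needs a proper labeled dataset.
texts = ["Отличный сервис!", "Ужасное обслуживание."]
labels = [1, 0]

name = "DeepPavlov/rubert-base-cased"   # assumed RuBERT checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128),
            batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=ds, tokenizer=tok).train()
```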
... A traditional mass survey involves associating opinions with socio-demographic groups, while in data from social media this reliable demographic information is commonly unavailable. To make the obtained results comparable with traditional opinion polls, researchers may utilise geolocation information, user profile information, and gender and age prediction systems [207]-[212]. 8) Monitoring of the sentiment index of social media content in Russian. ...
Article
Full-text available
Sentiment analysis has become a powerful tool in processing and analysing expressed opinions on a large scale. While the application of sentiment analysis to English-language content has been widely examined, applications to the Russian language remain less well studied. In this survey, we comprehensively reviewed the applications of sentiment analysis to Russian-language content and identified current challenges and future research directions. In contrast with previous surveys, we targeted the applications of sentiment analysis rather than existing sentiment analysis approaches and their classification quality. We synthesised and systematically characterised existing applied sentiment analysis studies by their source of analysed data, purpose, employed sentiment analysis approach, and primary outcomes and limitations. We presented a research agenda to improve the quality of applied sentiment analysis studies and to expand the existing research base in new directions. Additionally, to help scholars select an appropriate training dataset, we performed an additional literature review and identified publicly available sentiment datasets of Russian-language texts.
... Work on age author profiling has been performed in a variety of languages. However, to our knowledge there has been very little work in this area for the Russian language, despite a number of successful approaches to profiling of gender (Litvinova et al., 2017; Panicheva et al., 2018; Sboev et al., 2016, 2018) as well as of different psychological characteristics. This paper is aimed at predicting the age of the authors of Russian-language blogs using different combinations of linguistic features and two different approaches. ...
Chapter
The task of predicting demographics of social media users, bloggers and authors of other types of online texts is crucial for marketing, security, etc. However, most of the papers in authorship profiling deal with author gender prediction. In addition, most of the studies are performed on English-language corpora, with very little work in this area on the Russian language. Filling this gap will elaborate on the multilingual insights into age-specific linguistic features and will provide a crucial step towards online security management in social networks. We present the first age-annotated dataset in Russian. The dataset contains blogs of 1260 authors from LiveJournal and is balanced against both age group and gender of the author. We perform age classification experiments (for age groups 20-30, 30-40, 40-50) with the presented data using basic linguistic features (lemmas, part-of-speech unigrams and bigrams, etc.) and obtain a solid baseline in age classification for Russian. We also consider age as a continuous variable and build regression models to predict age. Finally, we analyze significant features and provide interpretation where possible.
Chapter
This paper proposes a linguistically rich approach to hidden community detection, tested in experiments on a Russian corpus of VKontakte posts. Modern algorithms for hidden community detection are based on graph theory, and these procedures leave the linguistic features of the analyzed texts out of account. The authors have developed a new hybrid approach to the detection of hidden communities, combining author-topic modeling and automatic topic labeling. Specific linguistic parameters of Russian posts were identified for correct language processing. The results justify the use of the algorithm, which can be further integrated with already developed graph methods.
Article
Full-text available
The main stages and some of the main results are described in detail for a project in which psychological, demographic, and textual data of Facebook users from Russia and the USA were collected by means of a dedicated online application. The respondents' psychological characteristics were assessed with online questionnaires featuring built-in automatic feedback (including the Short Dark Triad questionnaire, a moral disengagement questionnaire, the WHO-5 subjective well-being index, Diener's Satisfaction with Life Scale (SWLS), and the short PC-PTSD screen for post-traumatic stress symptoms). In addition, with the participants' consent, the application downloaded demographic data from their accounts and the texts of their public posts, which were subsequently analysed with linguistic methods. To our knowledge, this is the first detailed account published in Russia of the stages of such a study, including data on the costs of advertising the application to Facebook users in the two countries; part of the substantive results of the project, which is being completed at St. Petersburg State University, is also presented. Important practicalities of organising this approach to data collection are noted for future researchers (including time and financial costs, software development, and the wording of the feedback given to respondents), along with possible technical, organisational, and methodological difficulties and, drawing on our own experience, ways to overcome them.
Article
Full-text available
The paper describes a new interdisciplinary approach to collecting individual psychological, behavioral and language data from online social networks. Within this approach, personal data ("digital footprints") are collected by means of special programs and web applications that are embedded in social network interfaces or otherwise connected with them. Usually, users provide additional information by answering the questions of online surveys embedded in such applications. Psychological variables can then be associated with online behavioral data and other available information. The data of thousands of users can not only be analyzed with traditional statistical methods, but can also be used to build predictive models with machine learning algorithms. Thus, psychological characteristics (personality traits, well-being, etc.) and demographic data can be predicted based solely on public user information (wall posts, page likes, etc.), which is a completely new approach to data collection. Such research projects usually involve multidisciplinary teams of psychologists, web developers, computational linguists and data scientists. Advantages and limitations of this methodology are discussed, as well as the methods of data collection and processing and of building predictive models. Key findings of the pioneers of this research direction are presented: the British project "Mypersonality.org" and the US-based "World Well-Being Project", both of which employ the described methodology at a very large scale.
Article
Full-text available
The paper gives an overview of the Russian Semantic Similarity Evaluation (RUSSE) shared task held in conjunction with the Dialogue 2015 conference. There exist a lot of comparative studies on semantic similarity, yet no analysis of such measures was ever performed for the Russian language. Exploring this problem for the Russian language is even more interesting because this language has features, such as rich morphology and free word order, which make it significantly different from English, German, and other well-studied languages. We attempt to bridge this gap by proposing a shared task on the semantic similarity of Russian nouns. Our key contribution is an evaluation methodology based on four novel benchmark datasets for the Russian language. Our analysis of the 105 submissions from 19 teams reveals that successful approaches for English, such as distributional and skip-gram models, are directly applicable to Russian as well. On the one hand, the best results in the contest were obtained by sophisticated supervised models that combine evidence from different sources. On the other hand, completely unsupervised approaches, such as a skip-gram model estimated on a large-scale corpus, were able to score among the top 5 systems.
Conference Paper
Full-text available
The main goal of this paper was to improve topic modelling algorithms by introducing automatic topic labelling, a procedure which chooses a label for a cluster of words in a topic. Topic modelling is a widely used statistical technique which reveals the internal conceptual organization of text corpora. We have chosen an unsupervised graph-based method and adapted it to Russian. The proposed algorithm consists of two stages: candidate generation by means of PageRank and morphological filters, and candidate ranking. Our topic labelling experiments on a corpus of encyclopaedic texts on linguistics have shown the advantages of labelled topic models for NLP applications.
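A minimal sketch of the graph-based candidate-generation stage described above: build a word co-occurrence graph from documents related to the topic and rank words with personalized PageRank seeded by the topic's top words (here via networkx). The morphological filters and the candidate-ranking stage are omitted, and the toy documents are invented.

```python
import itertools
import networkx as nx

def candidate_labels(topic_words, related_docs, top_n=5):
    """Rank label candidates by personalized PageRank over a word
    co-occurrence graph built from documents related to the topic."""
    g = nx.Graph()
    for doc in related_docs:
        for a, b in itertools.combinations(sorted(set(doc)), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    seeds = {w: 1.0 for w in topic_words if w in g}   # bias towards topic words
    pr = nx.pagerank(g, personalization=seeds or None, weight="weight")
    return sorted(pr, key=pr.get, reverse=True)[:top_n]

docs = [["лексика", "семантика", "корпус"], ["корпус", "синтаксис", "семантика"]]
print(candidate_labels(["семантика", "корпус"], docs))
```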
Article
Full-text available
Over the past century, personality theory and research have successfully identified core sets of characteristics that consistently describe and explain fundamental differences in the way people think, feel and behave. Such characteristics were derived through theory, dictionary analyses, and survey research using explicit self-reports. The availability of social media data spanning millions of users now makes it possible to automatically derive characteristics from language use at large scale. Taking advantage of linguistic information available through Facebook, we study the process of inferring a new set of potential human traits based on unprompted language use. We subject these new traits to a comprehensive set of evaluations and compare them with a popular five factor model of personality. We find that our language-based trait construct is often more generalizable, in that it often predicts non-questionnaire-based outcomes better than questionnaire-based traits (e.g. entities someone likes, income and intelligence quotient), while the factors remain nearly as stable as traditional factors. Our approach suggests a value in new constructs of personality derived from everyday human language use.
Conference Paper
Full-text available
The presented project makes use of the growing amounts of textual data in social networks in the Russian language in order to find linguistic correlates of the Dark Triad personality traits, comprising non-clinical Narcissism, Machiavellianism and Psychopathy. The background for the investigation includes, on the one hand, psychological research on these phenomena and their measurement instruments, and on the other hand, recent advances in computational stylometry and text-based author profiling. The measures for these psychological phenomena are provided by recognized self-report psychological surveys adapted to Russian. Morphological and semantic analyses are applied to investigate the relationship between the Dark traits and their linguistic manifestation in social network texts. Significant morphological and semantic correlates of Narcissism, Machiavellianism and Psychopathy are identified and compared to respective advances in English author profiling. In order to deepen our understanding of the relation between these psychological characteristics and natural language use, the identified linguistic features are interpreted in terms of the fine-grained factor structure of the Dark traits. Identifying correlated features is a step towards automatic Dark trait prediction and early detection of potentially harmful mental states.
Article
Full-text available
Facebook is rapidly gaining recognition as a powerful research tool for the social sciences. It constitutes a large and diverse pool of participants, who can be selectively recruited for both online and offline studies. Additionally, it facilitates data collection by storing detailed records of its users' demographic profiles, social interactions, and behaviors. With participants' consent, these data can be recorded retrospectively in a convenient, accurate, and inexpensive way. Based on our experience in designing, implementing, and maintaining multiple Facebook-based psychological studies that attracted over 10 million participants, we demonstrate how to recruit participants using Facebook, incentivize them effectively, and maximize their engagement. We also outline the most important opportunities and challenges associated with using Facebook for research, provide several practical guidelines on how to successfully implement studies on Facebook, and finally, discuss ethical considerations.
Conference Paper
Full-text available
This article offers an empirical exploration of the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks can achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
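For concreteness, here is a small character-level ConvNet in PyTorch in the spirit of the architecture the abstract describes (an embedding over character ids, stacked convolutions with max-pooling, global pooling, and a linear classifier). The layer sizes are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Miniature character-level ConvNet for text classification."""
    def __init__(self, n_chars=70, emb_dim=16, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(n_chars + 1, emb_dim, padding_idx=0)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 256, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(256, 256, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.fc = nn.Linear(256, n_classes)

    def forward(self, x):                  # x: (batch, seq_len) of char ids
        h = self.emb(x).transpose(1, 2)    # -> (batch, emb_dim, seq_len)
        h = self.conv(h).squeeze(-1)       # -> (batch, 256)
        return self.fc(h)

x = torch.randint(1, 71, (4, 128))         # 4 dummy "texts" of 128 char ids
print(CharCNN()(x).shape)                  # -> torch.Size([4, 2])
```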
Article
Full-text available
Distributed vector representations for natural language vocabulary have received a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of the Russian Semantic Similarity Evaluation track, where our models took from the 2nd to the 5th position, depending on the task. We introduce the tools and corpora used, comment on the nature of the shared task, and describe the achieved results. We found that the Continuous Skip-gram and Continuous Bag-of-Words models, previously successfully applied to English material, can be used for semantic modelling of Russian as well. Moreover, we show that texts in the Russian National Corpus (RNC) provide excellent training material for such models, outperforming other, much larger corpora. This is especially true for semantic relatedness tasks (although stacking models trained on larger corpora on top of RNC models improves performance even more). High-quality semantic vectors learned in such a way can be used in a variety of linguistic tasks and promise an exciting field for further study.
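A minimal gensim sketch of training the Continuous Skip-gram model the abstract refers to; the toy lemmatized sentences and hyperparameters are placeholders (the gensim >= 4 API is assumed).

```python
from gensim.models import Word2Vec

# sentences: an iterable of tokenized (here also lemmatized) sentences
sentences = [["кот", "сидеть", "на", "окно"], ["кот", "ловить", "мышь"],
             ["собака", "лаять", "на", "кот"]]
model = Word2Vec(sentences, vector_size=100, window=5, sg=1,  # sg=1: Skip-gram
                 min_count=1, negative=5, epochs=50)
print(model.wv.most_similar("кот", topn=3))   # nearest neighbours in vector space
```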
Conference Paper
Full-text available
pymorphy2 is a morphological analyzer and generator for the Russian and Ukrainian languages. It uses large, efficiently encoded lexicons built from OpenCorpora and LanguageTool data. A set of linguistically motivated rules has been developed to enable morphological analysis and generation of out-of-vocabulary words observed in real-world documents. For Russian, pymorphy2 provides state-of-the-art morphological analysis quality. The analyzer is implemented in the Python programming language with optional C++ extensions. Emphasis is put on ease of use, documentation and extensibility. The package is distributed under a permissive open-source license, encouraging its use in both academic and commercial settings.
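Basic pymorphy2 usage looks like this; note that a form like "стали" is ambiguous between the verb "стать" and the noun "сталь", and the analyzer returns ranked hypotheses, while out-of-vocabulary words are handled by the rule component.

```python
import pymorphy2

morph = pymorphy2.MorphAnalyzer()            # Russian by default
parses = morph.parse("стали")                # hypotheses ranked by probability
for p in parses[:2]:
    print(p.normal_form, p.tag, p.score)     # lemma, grammatical tag, estimate
print(morph.parse("бутявка")[0].normal_form) # OOV word still gets an analysis
```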
Article
Full-text available
This overview presents the framework and the results for the Author Profiling task at PAN 2014. The objective this year is to analyse how well detection approaches adapt when given different genres. For this purpose a corpus with four different parts (subcorpora) has been compiled: social media, Twitter, blogs, and hotel reviews. The Twitter subcorpus was constructed in cooperation with RepLab in order to also investigate a reputational perspective. Altogether, the approaches of 10 participants are evaluated.
Conference Paper
Full-text available
We introduce Chinese Whispers, a randomized graph-clustering algorithm which is time-linear in the number of edges. After a detailed definition of the algorithm and a discussion of its strengths and weaknesses, the performance of Chinese Whispers is measured on Natural Language Processing (NLP) problems as diverse as language separation, acquisition of syntactic word classes, and word sense disambiguation. Here we exploit the fact that the small-world property holds for many graphs in NLP.
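A compact reimplementation of the algorithm as described (each node starts in its own class and repeatedly adopts the class with the highest total edge weight among its neighbours, visiting nodes in random order); this is a sketch for intuition, not the authors' reference code.

```python
import random
import networkx as nx

def chinese_whispers(g, iterations=20, seed=0):
    """Randomized label propagation, time-linear in the number of edges."""
    rng = random.Random(seed)
    labels = {node: i for i, node in enumerate(g)}   # one class per node
    for _ in range(iterations):
        nodes = list(g)
        rng.shuffle(nodes)
        for node in nodes:
            if g[node]:
                # adopt the class with the highest total edge weight
                scores = {}
                for nb in g[node]:
                    w = g[node][nb].get("weight", 1.0)
                    scores[labels[nb]] = scores.get(labels[nb], 0.0) + w
                labels[node] = max(scores, key=scores.get)
    return labels

g = nx.karate_club_graph()
print(set(chinese_whispers(g).values()))   # emergent cluster ids
```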
Conference Paper
Full-text available
We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end users who require language identification without wanting to invest in the preparation of in-domain training data.
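Typical off-the-shelf usage of the tool looks like this:

```python
import langid

langid.set_languages(["ru", "uk", "en"])  # optionally restrict the candidate set
print(langid.classify("Это сообщение написано по-русски."))
# -> ('ru', score); the score is an unnormalized log-probability by default
```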
Conference Paper
Full-text available
Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets, in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
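The hashtag pooling scheme itself is simple preprocessing; a sketch of the idea follows (details such as the handling of multi-hashtag and hashtag-free tweets vary across implementations and are an assumption here). The pooled pseudo-documents then feed a standard LDA trainer.

```python
import re
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Aggregate tweets sharing a hashtag into one pseudo-document each."""
    pools = defaultdict(list)
    for t in tweets:
        tags = re.findall(r"#(\w+)", t.lower())
        for tag in tags or ["_no_hashtag"]:   # fallback bucket for untagged tweets
            pools[tag].append(t)
    return {tag: " ".join(ts) for tag, ts in pools.items()}

tweets = ["great game #football", "#football match tonight", "coffee time"]
print(pool_by_hashtag(tweets))
```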
Article
Full-text available
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
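The phrase-finding idea can be tried with gensim's Phrases, which implements this kind of count-based collocation scoring; the corpus and thresholds below are toy values chosen so the example fires.

```python
from gensim.models.phrases import Phrases

sentences = [["air", "canada", "flight"]] * 10 + [["cheap", "flight"]] * 5
bigrams = Phrases(sentences, min_count=1, threshold=0.1)  # permissive toy settings
print(bigrams[["air", "canada", "flight"]])  # -> ['air_canada', 'flight']
```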
Article
Full-text available
We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase 'sick of' and the word 'depressed'), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive 'my' when mentioning their 'wife' or 'girlfriend' more often than females use 'my' with 'husband' or 'boyfriend'). To date, this represents the largest study, by an order of magnitude, of language and personality.
Conference Paper
Full-text available
Automated topic labelling brings benefits for users aiming to analyse and understand document collections, as well as for search engines targeting the linkage between groups of words and their inherent topics. Current approaches to achieve this suffer in quality, but we argue their performance might be improved by focusing on the structure in the data. Building upon research on concept disambiguation and linking to DBpedia, we take a novel approach to topic labelling by making use of structured data exposed by DBpedia. We start from the hypothesis that words co-occurring in text likely refer to concepts that belong closely together in the DBpedia graph. Using graph centrality measures, we show that we are able to identify the concepts that best represent the topics. We comparatively evaluate our graph-based approach and the standard text-based approach on topics extracted from three corpora, based on results gathered in a crowd-sourcing experiment. Our research shows that graph-based analysis of DBpedia can achieve better results for topic labelling in terms of both precision and topic coverage.
Article
Full-text available
We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.
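gensim ships an implementation of this model; a toy run (with invented data) looks like the following.

```python
from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel

docs = [["topic", "model", "words"], ["neural", "network", "layers"],
        ["topic", "words", "labels"]]
author2doc = {"alice": [0, 2], "bob": [1]}   # author -> indices of their documents
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
model = AuthorTopicModel(corpus=corpus, num_topics=2, id2word=dictionary,
                         author2doc=author2doc, random_state=0)
print(model.get_author_topics("alice"))   # alice's distribution over topics
```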
Article
Full-text available
The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses – the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
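The sequential Bonferroni-type procedure described here (the Benjamini-Hochberg step-up procedure) is short enough to state in code; a sketch with numpy:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected while controlling FDR at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # largest i with p_(i) <= q * i / m
        reject[order[:k + 1]] = True     # reject all hypotheses up to that rank
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```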
Conference Paper
Full-text available
Multinomial distributions over words are frequently used to model topics in text collections. A common, major challenge in applying all such topic models to any text mining problem is to label a multinomial topic model accurately so that a user can interpret the discovered topic. So far, such labels have been generated manually in a subjective way. In this paper, we propose probabilistic approaches to automatically labeling multinomial topic models in an objective way. We cast this labeling problem as an optimization problem involving minimizing Kullback-Leibler divergence between word distributions and maximizing mutual information between a label and a topic model. Experiments with user study have been done on two text data sets with different genres. The results show that the proposed labeling methods are quite effective to generate labels that are meaningful and useful for interpreting the discovered topic models. Our methods are general and can be applied to labeling topics learned through all kinds of topic models such as PLSA, LDA, and their variations.
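One ingredient of such probabilistic labelling, the KL term between a candidate label's word distribution and the topic's word distribution, can be sketched as follows (the mutual-information term and candidate generation are omitted; smoothing and normalization are assumptions of this sketch):

```python
import numpy as np
from scipy.stats import entropy

def label_score(label_dist, topic_dist, eps=1e-12):
    """Smaller KL(label || topic) means the label's word distribution
    better matches the topic's word distribution."""
    p = np.asarray(label_dist, dtype=float) + eps   # smooth away zeros
    q = np.asarray(topic_dist, dtype=float) + eps
    return entropy(p / p.sum(), q / q.sum())        # KL divergence in nats

print(label_score([0.5, 0.4, 0.1], [0.45, 0.45, 0.1]))
```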
Conference Paper
Full-text available
This paper presents the novel task of best topic word selection, that is the selection of the topic word that is the best label for a given topic, as a means of enhancing the interpretation and visualisation of topic models. We propose a number of features intended to capture the best topic word, and show that, in combination as inputs to a reranking model, we are able to consistently achieve results above the baseline of simply selecting the highest-ranked topic word. This is the case both when training in-domain over other labelled topics for that topic model, and cross-domain, using only labellings from independent topic models learned over document collections from different domains and genres.
Conference Paper
Full-text available
We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
Conference Paper
Full-text available
An algorithm for the automatic labeling of topics according to a hierarchy is presented. Its main ingredients are a set of similarity measures and a set of topic labeling rules. The labeling rules are specifically designed to find the most agreed labels between the given topic and the hierarchy. The hierarchy is obtained from the Google Directory service, extracted via an ad hoc software procedure and expanded through the use of the OpenOffice English Thesaurus. The performance of the proposed algorithm is investigated using a document corpus consisting of 33,801 documents and a dictionary consisting of 111,795 words. The results are encouraging, and particularly interesting and significant labeling cases emerged. Index Terms: Automatic Topic Labeling, Topics Tree, Latent Dirichlet Allocation
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
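The uniform estimator API the abstract emphasizes reduces a typical experiment to a few lines; a self-contained example on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))   # mean accuracy on the held-out split
```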
Article
Mental illnesses adversely affect a significant proportion of the population worldwide. However, the methods traditionally used for estimating and characterizing the prevalence of mental health conditions are time-consuming and expensive. Consequently, best-available estimates concerning the prevalence of mental health conditions are often years out of date. Automated approaches that supplement these survey methods with broad, aggregated information derived from social media content provide a potential means for near real-time estimates at scale. These may, in turn, provide grist for supporting, evaluating and iteratively improving upon public health programs and interventions. We propose a novel model for automated mental health status quantification that incorporates user embeddings. This builds upon recent work exploring representation learning methods that induce embeddings by leveraging social media post histories. Such embeddings capture latent characteristics of individuals (e.g., political leanings) and encode a soft notion of homophily. In this paper, we investigate whether user embeddings learned from Twitter post histories encode information that correlates with mental health statuses. To this end, we estimated user embeddings for a set of users known to be affected by depression and post-traumatic stress disorder (PTSD), and for a set of demographically matched 'control' users. We then evaluated these embeddings with respect to: (i) their ability to capture homophilic relations with respect to mental health status; and (ii) the performance of downstream mental health prediction models based on these features. Our experimental results demonstrate that the user embeddings capture similarities between users with respect to mental conditions, and are predictive of mental health.
Article
In economics and psychology, delay discounting is often used to characterize how individuals choose between a smaller immediate reward and a larger delayed reward. People with higher delay discounting rate (DDR) often choose smaller but more immediate rewards (a "today person"). In contrast, people with a lower discounting rate often choose a larger future rewards (a "tomorrow person"). Since the ability to modulate the desire of immediate gratification for long term rewards plays an important role in our decision-making, the lower discounting rate often predicts better social, academic and health outcomes. In contrast, the higher discounting rate is often associated with problematic behaviors such as alcohol/drug abuse, pathological gambling and credit card default. Thus, research on understanding and moderating delay discounting has the potential to produce substantial societal benefits.
Article
Exposure to violence has been shown to negatively affect mental health and well-being. The goal of this Facebook-based study was to describe the rates of exposure to violence in a sample of Russian adults and to assess the impact of these experiences on subjective well-being and victimization-related psychological distress. Three types of victimization were assessed: physical assault by a stranger, physical assault by someone known to victim, and nonconsensual sexual experiences. The 5-item World Health Organization Well-Being Index (WHO-5) was used to assess subjective well-being, and Primary Care PTSD Screen (PC-PTSD) was employed as an indicator of victimization-related psychological distress. Data were obtained from 6,724 Russian-speaking Facebook users. Significant levels of lifetime victimization were reported by the study participants. Lifetime physical assault by a stranger, physical assault by someone known to victim, and sexual assault were reported by 56.9%, 64.2%, and 54.1% of respondents, respectively. Respondents exposed to violence were more likely to report posttraumatic stress symptoms and lower levels of subjective well-being. Participants who were exposed to at least one type of violence were more likely to experience symptoms of traumatic stress (U = 1,794,250.50, p < .001, d = 0.35). Exposure to multiple forms of violence was associated with more severe traumatic stress symptoms (rs = .257, p < .001). Well-being scores were significantly lower among participants exposed to violence (t = 8.37, p < .001, d = 0.31). The study demonstrated that violence exposure is associated with reduced well-being among Russian adults. Our findings highlight the negative impact of violence exposure on subjective well-being and underscore the necessity to develop programs addressing violence exposure in Russian populations.
Conference Paper
The native representation of LDA-style topics is a multinomial distribution over words, which can be time-consuming to interpret directly. As an alternative representation, automatic labelling has been shown to help readers interpret the topics more efficiently. We propose a novel framework for topic labelling using word vectors and letter trigram vectors. We generate labels automatically and propose automatic and human evaluations of our method. First, we use a chunk parser to generate candidate labels; then we map topics and candidate labels to word vectors and letter trigram vectors in order to find which candidate label is more semantically related to a given topic. A label can be found by calculating the similarity between a topic and its candidate label vectors. Experiments on three common datasets show that not only the labelling method but also our approach to automatic evaluation is effective.
Conference Paper
The Author Profiling (AP) task aims to determine specific demographic characteristics, such as gender and age, by analyzing the language usage of groups of authors. Notwithstanding the recent advances in AP, this is still an unsolved problem, especially in the case of social media domains. According to the literature, most of the work has been devoted to the analysis of useful textual features, the most prominent being those related to content and style. In spite of the success of jointly using both kinds of features, most authors agree that content features are much more relevant than style, which suggests that some profiling aspects, like age or gender, could be determined only by observing thematic interests, concerns, moods, or other words related to events of daily life. Additionally, most of the research uses only traditional representations such as the BoW, rather than more sophisticated representations that harness the content features. In this regard, this paper aims at evaluating the usefulness of some topic-based representations for the AP task. We mainly consider a representation based on Latent Semantic Analysis (LSA), which automatically discovers the topics from a given document collection, and a simplified version of the Linguistic Inquiry and Word Count (LIWC), which consists of 41 features representing manually predefined thematic categories. We report promising results on several corpora, showing the effectiveness of the evaluated topic-based representations for AP in social media.
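A minimal sketch of an LSA-style topic representation of the kind evaluated in the paper, using scikit-learn's TruncatedSVD over TF-IDF (the documents and dimensionality are toy values):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["my phone battery died again", "new phone camera is great",
        "the match ended in a draw", "our team won the match"]
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0))
print(lsa.fit_transform(docs))  # each row: a document in 2-dim "topic" space
```

The resulting dense vectors can then replace (or be concatenated with) the BoW features of any downstream profiling classifier.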
Conference Paper
This paper describes our system used in the Aspect Based Sentiment Analysis Task 4 at SemEval-2014. Our system consists of two components addressing two of the subtasks respectively: a Conditional Random Field (CRF) based classifier for Aspect Term Extraction (ATE) and a linear classifier for Aspect Term Polarity Classification (ATP). For the ATE subtask, we implement a variety of lexicon, syntactic and semantic features, as well as cluster features induced from unlabeled data. Our system achieves state-of-the-art performance in ATE, ranking 1st (among 28 submissions) and 2nd (among 27 submissions) for the restaurant and laptop domains respectively.
Article
This modern treatment of computer vision focuses on learning and inference in probabilistic models as a unifying theme. It shows how to use training data to learn the relationships between the observed image data and the aspects of the world that we wish to estimate, such as the 3D structure or the object class, and how to exploit these relationships to make new inferences about the world from new image data. With minimal prerequisites, the book starts from the basics of probability and model fitting and works up to real examples that the reader can implement and modify to build useful vision systems. Primarily meant for advanced undergraduate and graduate students, the detailed methodological presentation will also be useful for practitioners of computer vision. • Covers cutting-edge techniques, including graph cuts, machine learning and multiple view geometry • A unified approach shows the common basis for solutions of important computer vision problems, such as camera calibration, face recognition and object tracking • More than 70 algorithms are described in sufficient detail to implement • More than 350 full-color illustrations amplify the text • The treatment is self-contained, including all of the background mathematics • Additional resources at www.computervisionmodels.com
Conference Paper
This paper introduces an unsupervised graph-based method that selects textual labels for automatically generated topics. Our approach uses the topic keywords to query a search engine and generate a graph from the words contained in the results. PageRank is then used to weigh the words in the graph and score the candidate labels. The state-of-the-art method for this task is supervised (Lau et al., 2011). Evaluation on a standard data set shows that the performance of our approach is consistently superior to previously reported methods.
Article
The purpose of this study was to assess the prevalence of childhood victimization experiences in a sample of young adults in St. Petersburg, Russia. The study sample included 743 students aged 19 to 25 from 15 universities in St. Petersburg, Russia. All of the study participants completed a reliable questionnaire assessing the following types of childhood victimization: conventional crime, child maltreatment, peer victimization, sexual victimization, and witnessing violence. Participation in the study was anonymous. High rates of victimization and exposure to violence were reported by the study participants. The majority of the sample experienced at least one type of victimization during childhood or adolescence, and poly-victimization was reported frequently. The most common type of victimization reported was peer or sibling assault (66.94%), followed by witnessing an assault without a weapon (63.91%), personal theft (56.19%), vandalism (56.06%), and emotional bullying (49.99%). Sexual assault by a known adult was reported by 1.45% of males and 5.16% of females. This study provides new information on the scope of childhood victimization experiences in Russia. Further research is warranted, including epidemiological research with representative data across the country and studies of the impact of trauma and victimization on the mental health and well-being of Russian adults and children.
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
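A toy LDA run with gensim, for readers who want to see the model's inputs and outputs (the corpus is invented; gensim's trainer uses online variational inference, in line with the variational methods described here):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["cat", "dog", "pet"], ["python", "code", "bug"],
        ["dog", "bark", "pet"], ["code", "test", "bug"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words per document
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)   # each topic: a weighted mixture over the vocabulary
```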
Article
SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. See http://www.scipy.org/ .
Gliozzo, A., Biemann, C., Riedl, M., Coppola, B., Glass, M.R., Hatem, M.: JoBimText visualizer: a graph-based approach to contextualizing distributional similarity. In: Graph-Based Methods for Natural Language Processing, p. 6 (2013)
Litvinova, T., Seredin, P., Litvinova, O., Zagorovskaya, O., Sboev, A., Gudovskih, D., Moloshnikov, I., Rybka, R.: Gender prediction for authors of Russian texts using regression and classification techniques. In: CDUD 2016 - The 3rd International Workshop on Concept Discovery in Unstructured Data, p. 44 (2016). https://cla2016.hse.ru/data/2016/07/24/1119022942/CDUD2016.pdf#page=51
Panicheva, P., Ledovaya, Y., Bogoliubova, O.: Revealing interpretable content correlates of the Dark Triad personality traits. In: Russian Summer School in Information Retrieval (2016)
Rehurek, R., Sojka, P.: Gensim - Python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno (2011)
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media Inc. (2009)
Korobov, M.: Morphological analyzer and generator for Russian and Ukrainian languages. In: Khachay, M.Y., Konstantinova, N., Panchenko, A., Ignatov, D.I., Labunets, V.G. (eds.) AIST 2015. CCIS, vol. 542, pp. 320-332. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26123-2_31
Kou, W., Li, F., Baldwin, T.: Automatic labelling of topic models using word vectors and letter trigram vectors. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds.) AIRS 2015. LNCS, vol. 9460, pp. 253-264. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-28940-3_20
Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic Inquiry and Word Count: LIWC 2001. Lawrence Erlbaum Associates, Mahwah (2001)
Rangel, F., Rosso, P., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W., et al.: Overview of the 2nd author profiling task at PAN 2014. In: CEUR Workshop Proceedings, vol. 1180, pp. 898-927 (2014). https://riunet.upv.es/handle/10251/61150