Conference Paper

All about Microtext - A Working Definition and a Survey of Current Microtext Research within Artificial Intelligence and Natural Language Processing.


Abstract

This paper defines a new term, 'Microtext', and surveys the most recent and promising research that falls under this new definition. Microtext has three distinct attributes that differentiate it from the traditional free-text or unstructured text considered within the AI and NLP communities. Microtext is text that is generally very short in length, semi-structured, and characterized by amorphous or informal grammar and language. Examples of microtext include chatrooms (such as IM, XMPP, and IRC), SMS, voice transcriptions, and micro-blogging such as Twitter(tm). This paper expands on this definition and provides some characterizations of typical microtext data. Microtext is becoming more prevalent. It is the thesis of this paper that the three distinct attributes of microtext yield different results and require different techniques than traditional AI and NLP techniques on long-form free text. By creating a working definition for microtext and providing a survey of the current state of research in the area, this paper aims to create an understanding of microtext within the AI and NLP communities.


... Researchers in various fields of science and practice study these linguistic units, their spread, rules of use, signs of effectiveness, and so on. B. Solis, in his work "The Hashtag Economy" [5], noted that hashtags have influenced the culture of society and are now embedded in the digital lifestyle. The economist held that they are used effectively only when successful cultural connections and a thematic focus are established, setting aside brands of all kinds. ...
... As J. Ellen notes, whereas ordinary text develops a global topic, microtext is situational and conveys something specific [5]. ...
... However, when it comes to microblog text, standard language processing tools become inapplicable [29], [30]. Microblogs typically contain short sentences and casual language [31]. Unknown words, such as named entities and neologisms, often cause problems with these term-based models. ...
... 30 A tragedy that happened in 1970 during the Cultural Revolution. 31 Chairman Mao Memorial Hall is in application for a World Heritage site. 32 ...
Article
We present measurements and analysis of censorship on Weibo, a popular microblogging site in China. Since we were limited in the rate at which we could download posts, we identified users likely to participate in sensitive topics and recursively followed their social contacts. We also leveraged new natural language processing techniques to pick out trending topics despite the use of neologisms, named entities, and informal language usage in Chinese social media. We found that Weibo dynamically adapts to the changing interests of its users through multiple layers of filtering. The filtering includes both retroactively searching posts by keyword or repost links to delete them, and rejecting posts as they are posted. The trend of sensitive topics is short-lived, suggesting that the censorship is effective in stopping the "viral" spread of sensitive issues. We also give evidence that sensitive topics in Weibo only scarcely propagate beyond a core of sensitive posters.
... L. Kornexl and U. Lenker study short textual inscriptions which can be found on all kinds of materials, e.g. wood, stone, metal, textile, and on sundry objects such as coins, pieces of lead sheet or garments [11]. J. Ellen gives a list of available public and commercial technologies, services, and standards that can be considered microtext, including: SMS (aka Text Messages); Instant Messaging (point to point messages such as XMPP/Google Talk/Jabber); Multi-User Chatrooms (including chatrooms, and communication within online communities such as Second Life or World of Warcraft); Voicemail Transcriptions (Enterprise or government level, as well as consumer level technologies such as Google Voice or Jott); Microblogs (Twitter, Facebook etc.) [10]. ...
Conference Paper
The complexity of text research, its corpus, and meaning also lies in the fact that their analysis is conducted from different perspectives and in the context of various linguistic directions. One of the current issues in modern textology is the study of small-form texts.
... For example, discussions of COVID-19 on Twitter between February and May 2020 involve the emergence, evolution, and extinction of multiple topics over time. Moreover, tweets are short bursts composed in micro-text (Ellen, 2011), which traditional LDA models struggle to model effectively. ...
... The tasks discovered in this branch overlap considerably with those in more traditional NLP [11], including topic detection, summarization, sentiment analysis and classification, question-answering, and information extraction [22]. Microtext in social media is a form of contemporary brachygraphy. ...
[Fig. 1: The ligature "et" in the shorthand system developed by Tiro to annotate M. T. Cicero's speeches (Rome, first century BCE). Image source: [54]]
Article
Human civilizations have performed the art of writing across continents and over different time periods. In order to speed up the writing process, the art of shorthand (brachygraphy) came into existence. Today, the performance of writing does not make an exception in social media platforms. Brachygraphy started to re-emerge in the early 2000s in the form of microtext in order to facilitate faster typing without compromising semantic clarity. This paper focuses on microtext approaches predominantly found in social media and explains the relevance of microtext normalization for natural language processing tasks in English. The review introduces brachygraphy and how it has evolved into microtext in today’s social media–dominant society. The study provides a comprehensive classification of microtext normalization based on different approaches. We propose to classify microtext based on different normalization techniques, i.e. syntax-based (syntactic), probability-based (probabilistic) and phonetic-based approaches and review application areas, strategies and challenges of microtext normalization. The review shows that there is a compelling similarity between brachygraphy and microtext even though they started centuries apart. This paper represents the first attempt to connect brachygraphy to current texting language and to show its impact in social media. This paper classifies microtext normalization according to different approaches and discusses how, in the future, microtext will likely comprise both words and images together. This will expand the horizon of human creative power. We conclude the review with some considerations on future directions.
... Besides linguists, microtext is also studied by J. Ellen [7], an IT researcher who investigates text and microtext. ...
... Microtext can be used in as large a range of applications as regular text, including, for instance, information extraction, automated summarization, and question answering (Ellen 2011). However, because it is constantly being generated by users, microtext can also be used as a live stream of information in new types of social trend-monitoring applications. ...
... Papers that characterize Twitter as a news medium offer solutions to recommendation tasks like news and contents [10,3,11] or users [1]. In [12] the authors make a state-of-the-art survey on research on Twitter and try to define possible topics and open problems regarding the matter. ...
Article
Social networks generate large amounts of data produced by users, who are not limited with respect to the content of the information they exchange. The data generated can be a good indicator of trends and topic preferences among users. In our paper we focus on analyzing and representing hashtags by the corpus in which they appear. We cluster a large set of hashtags using K-means on MapReduce in order to process data in a distributed manner. Our intention is to retrieve connections that might exist between different hashtags and their textual representation, and to grasp their semantics through the main topics they occur with.
... For the summarization task, summarizing the text within forums, blogs, or wikis is different from fully structured document summarization. The text in these tools is usually unstructured or semi-structured, and falls under the definition of microtext [1]. Therefore, the summarization process may require different techniques and approaches like the ones in [2], [3], [4], and [5]. ...
Article
Massive Open Online Course (MOOC) platforms provide a rich environment for knowledge creation through their massiveness and inherited collaborative tools. However, they also restrict spontaneous knowledge sharing because of the existing LMS barriers between the main multimedia content and the collaborative tools. Nonetheless, the collaboration is still massive due to the number of participants. The separation of the multimedia content and the discussion tools is the first focus point of this paper. Moreover, this article presents a new added value to the MOOC architecture that links the learners' discussions and their summary with the multimedia content. The added-value component involves a summarization algorithm that summarizes the shared collaborative textual discussion collected from the various learners viewing relevant MOOC multimedia/video content. The effectiveness of the summarization component was tested using the popular ROUGE software package from the University of Southern California. The new MOOC architecture represents an enhanced learning environment that enables learners to share the multimedia information along with its annotated collaborative information, with the power of summarizing the final outcome of the presented annotations relevant to a specific shared multimedia content.
... The analysis of Twitter and various micro-blogging messages is a research area with high and rapidly growing interest within the academic community [1]. Because of the relative freshness of this research area, some research problems have been poorly defined and new problems are being defined every day. ...
Article
In the emerging field of micro-blogging and social communication services, users post millions of short messages every day. Keeping track of all the messages posted by your friends and the conversation as a whole can become tedious or even impossible. In this paper, we present a study on automatically clustering and classifying Twitter messages, also known as "tweets", into different categories, inspired by the approaches taken by news aggregating services like Google News. Our results suggest that the clusters produced by traditional unsupervised methods can often be incoherent from a topical perspective, but that a supervised methodology that uses hash-tags as indicators of topics produces surprisingly good results. We also offer a discussion on temporal effects of our methodology and training set size considerations. Lastly, we describe a simple method of finding the most representative tweet in a cluster, and provide an analysis of the results.
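As a rough illustration of the hashtag-as-topic-label idea described in this abstract, the sketch below trains a tiny multinomial naive Bayes classifier in which the first hashtag of each tweet serves as its label. The toy tweets, the function names (`split_hashtag_label`, `train`, `classify`), and the choice of naive Bayes are all assumptions made for illustration; the abstract does not say which classifier the authors used.

```python
import math
import re
from collections import Counter, defaultdict

def split_hashtag_label(tweet):
    """Return (label, tokens): the first hashtag is the label, the rest is text."""
    lowered = tweet.lower()
    tags = re.findall(r"#(\w+)", lowered)
    tokens = re.findall(r"[a-z']+", re.sub(r"#\w+", " ", lowered))
    return (tags[0] if tags else None), tokens

def train(tweets):
    """Count words per hashtag label; tweets without a hashtag are skipped."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for tweet in tweets:
        label, tokens = split_hashtag_label(tweet)
        if label is None:
            continue
        label_counts[label] += 1
        word_counts[label].update(tokens)
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest add-one-smoothed log probability."""
    _, tokens = split_hashtag_label(text)
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: the hashtag acts as a (noisy) topic label.
TOY_TWEETS = [
    "Great goal in the final minute #sports",
    "The team won the match tonight #sports",
    "New phone released with a faster chip #tech",
    "This laptop has a great screen #tech",
]
word_counts, label_counts = train(TOY_TWEETS)
print(classify("the team scored a late goal", word_counts, label_counts))
```

With these four toy tweets, the unlabeled message about a team and a goal is assigned to the `sports` label. Real tweets often carry several hashtags or none, so taking only the first hashtag is a simplification.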
... from news articles. According to Ellen et al. (2011), microblog text is a typical microtext. Compared to regular long text such as news articles, microblog text exhibits the following characteristics. ...
Article
As a classic natural language processing technology, topic detection has recently attracted more research interest, largely owing to the rapid development of microblogs. The most challenging issue in microblog topic detection is the sparse data problem. In this paper, the temporal-author-topic (TAT) model is designed to accomplish microblog topic detection in two phases. In the first phase, the TAT model is applied to clean the thread, namely, to filter noisy microblog texts out of each thread. In the second phase, microblog texts within each thread are merged to form the thread text so that the TAT model can be applied to find global topics. The new approach differs from the Hierarchical Agglomerative Clustering (HAC) algorithm by making use of microblog threads to overcome the sparse data problem. Experimental results justify our claims.
... However, when it comes to microblog text, standard language processing tools become inapplicable [18,40]. Microblogs typically contain short sentences and casual language [7]. Unknown words, such as named entities and neologisms, often cause problems with these term-based models. ...
Article
Weibo and other popular Chinese microblogging sites are well known for exercising internal censorship, to comply with Chinese government requirements. This research seeks to quantify the mechanisms of this censorship: how fast and how comprehensively posts are deleted. Our analysis considered 2.38 million posts gathered over roughly two months in 2012, with our attention focused on repeatedly visiting "sensitive" users. This gives us a view of censorship events within minutes of their occurrence, albeit at a cost of our data no longer representing a random sample of the general Weibo population. We also have a larger 470 million post sampling from Weibo's public timeline, taken over a longer time period, that is more representative of a random sample. We found that deletions happen most heavily in the first hour after a post has been submitted. Focusing on original posts, not reposts/retweets, we observed that nearly 30% of the total deletion events occur within 5-30 minutes. Nearly 90% of the deletions happen within the first 24 hours. Leveraging our data, we also considered a variety of hypotheses about the mechanisms used by Weibo for censorship, such as the extent to which Weibo's censors use retrospective keyword-based censorship, and how repost/retweet popularity interacts with censorship. We also used natural language processing techniques to analyze which topics were more likely to be censored.
... We recently completed a survey of existing research on microtext [15]. Although there were some interesting and notable findings, such as O'Connor clustering "statistically unlikely phrases that co-occur" with Tweet Motif [18], or Go and Bhayani performing sentiment analysis using emoticons as noisy labels [19], there was no other research on classification of microtext, and many of the reports could not be generalized more broadly because of assumptions or limitations. ...
Article
The goal is classification of microtext: classifying lines of military chat, or posts, which contain items of interest. This paper evaluates non-linear statistical data modeling techniques, and compares with our previous results using several text categorization and feature selection methodologies. The chat posts are examples of 'microtext', or text that is generally very short in length, semi-structured, and characterized by unstructured or informal grammar and language. These three distinct attributes cause different results than traditional long-form free text. In this paper, we further characterize microtext. Highly accurate classification of microtext entries is crucial to facilitate more complex information extraction. Although this study focused specifically on tactical updates via chat, we believe the findings are applicable to content of a similar linguistic structure regardless of domain. This includes other microtext sources such as IM/XMPP, SMS, voice transcriptions, and micro-blogging such as Twitter(tm).
... This chat-speak-style text is especially prevalent in Short Message Service (SMS), chat rooms and micro-blogs. Such chat-speak-style text is referred to as Microtext by Ellen (2011). In this work, Tweets and SMS messages are explored as typical examples of microtext. ...
Conference Paper
The use of computer mediated communication has resulted in a new form of written text - Microtext - which is very different from well-written text. Tweets and SMS messages, which have limited length and may contain misspellings, slang, or abbreviations, are two typical examples of microtext. Micro-text poses new challenges to standard natural language processing tools which are usually designed for well-written text. The objective of this work is to normalize microtext, in order to produce text that could be suitable for further treatment. We propose a normalization approach based on the source channel model, which incorporates four factors, namely an orthographic factor, a phonetic factor, a contextual factor and acronym expansion. Experiments show that our approach can normalize Twitter messages reasonably well, and it outperforms existing algorithms on a public SMS data set. Copyright © 2011, Association for the Advancement of Artificial Intelligence. All rights reserved.
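The source channel (noisy channel) formulation mentioned in this abstract chooses, for each noisy token s, the clean word t that maximizes P(t) * P(s | t). The sketch below is a heavily simplified illustration of that idea, not the authors' system: it uses only a toy acronym table and a character-similarity score from Python's `difflib` as a stand-in for the orthographic factor, and it omits the phonetic and contextual factors entirely. The word counts, the acronym table, and the 0.5 similarity threshold are assumed values.

```python
import difflib
import math

# Toy unigram counts standing in for a language model P(t) (assumed values).
WORD_COUNTS = {"tomorrow": 50, "tonight": 40, "see": 80, "you": 120,
               "great": 60, "later": 45, "late": 30}
# Toy acronym table (assumed); a real system would need a far larger resource.
ACRONYMS = {"lol": "laughing out loud", "brb": "be right back"}

def channel_score(noisy, candidate):
    """Orthographic stand-in for P(s | t): character similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, noisy, candidate).ratio()

def normalize_token(noisy):
    """Return the argmax over t of log P(t) + log P(s | t), else the token itself."""
    noisy = noisy.lower()
    if noisy in ACRONYMS:          # acronym expansion
        return ACRONYMS[noisy]
    if noisy in WORD_COUNTS:       # already a known word
        return noisy
    total = sum(WORD_COUNTS.values())
    best, best_score = noisy, -math.inf
    for cand, count in WORD_COUNTS.items():
        sim = channel_score(noisy, cand)
        if sim < 0.5:              # discard implausible candidates (assumed threshold)
            continue
        score = math.log(count / total) + math.log(sim)
        if score > best_score:
            best, best_score = cand, score
    return best

def normalize(message):
    """Normalize a whitespace-tokenized message token by token."""
    return " ".join(normalize_token(tok) for tok in message.split())

print(normalize("see you tmrw"))
```

With the toy tables above, `tmrw` maps to `tomorrow` and `brb` expands to `be right back`; tokens with no sufficiently similar candidate are left unchanged.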
Preprint
This thesis focuses on data that has complex spatio-temporal structure and on probabilistic graphical models that learn the structure in an interpretable and scalable manner. We target two research areas of interest: Gaussian graphical models for tensor-variate data and summarization of complex time-varying texts using topic models. This work advances the state-of-the-art in several directions. First, it introduces a new class of tensor-variate Gaussian graphical models via the Sylvester tensor equation. Second, it develops an optimization technique based on a fast-converging proximal alternating linearized minimization method, which scales tensor-variate Gaussian graphical model estimations to modern big-data settings. Third, it connects Kronecker-structured (inverse) covariance models with spatio-temporal partial differential equations (PDEs) and introduces a new framework for ensemble Kalman filtering that is capable of tracking chaotic physical systems. Fourth, it proposes a modular and interpretable framework for unsupervised and weakly-supervised probabilistic topic modeling of time-varying data that combines generative statistical models with computational geometric methods. Throughout, practical applications of the methodology are considered using real datasets. This includes brain-connectivity analysis using EEG data, space weather forecasting using solar imaging data, longitudinal analysis of public opinions using Twitter data, and mining of mental health related issues using TalkLife data. We show in each case that the graphical modeling framework introduced here leads to improved interpretability, accuracy, and scalability.
Preprint
Microblogs such as Twitter represent a powerful source of information. Part of this information can be aggregated beyond the level of individual posts. Some of this aggregated information refers to events that could or should be acted upon in the interest of e-governance, public safety, or other levels of public interest. Moreover, a significant amount of this information, if aggregated, could complement existing information networks in a non-trivial way. This dissertation proposes a semi-automatic method for extracting actionable information that serves this purpose. First, we show that predicting time to event is possible for both in-domain and cross-domain scenarios. Second, we suggest a method which facilitates the definition of relevance for an analyst's context and the use of this definition to analyze new data. Finally, we propose a method to integrate the machine learning based relevant information classification method with a rule-based information classification technique to classify microtexts. Fully automating microtext analysis has been our goal since the first day of this research project. Our efforts in this direction informed us about the extent to which this automation can be realized. In most cases, we first developed an automated approach, then extended and improved it by integrating human intervention at various steps. Our experience confirms previous work that states that a well-designed human intervention or contribution in the design, realization, or evaluation of an information system either improves its performance or enables its realization. As our studies and results directed us toward its necessity and value, we were inspired by previous studies in designing human involvement and customized our approaches to benefit from human input.
Article
Riabova K. O., "English hashtag as an object of linguistic study". The popularity of the hashtag is rising every day. We use hashtags to mark our messages or to find information on the Internet. The hashtag has attracted the attention not only of Internet users but also of scientists. The article focuses on the theoretical background and the approaches to the research of the English-language hashtag as a kind of microtext. The author provides theoretical information about text and microtext, their main characteristics and classifications, and outlines the main linguistic features of the hashtag in the social networks Twitter, Facebook and Instagram. We considered the etymology of "hashtag" and analyzed the previous works. The article presents the characteristic differences of microtext on the phonetic-graphemic, lexical, and syntactic levels, highlighting variation of pronunciation, graphic and phonetic substitutions, and spelling mistakes. The paper investigates various aspects of using microtext in social networks. Key words: hashtag, text, microtext, social networks, communication, lexicology.
Conference Paper
We introduce Relevancer that processes a tweet set and enables generating an automatic classifier from it. Relevancer satisfies information needs of experts during significant events. Enabling experts to combine automatic procedures with expertise is the main contribution of our approach and the added value of the tool. Even a small amount of feedback enables the tool to distinguish between relevant and irrelevant information effectively. Thus, Relevancer facilitates the quick understanding of and proper reaction to events presented on Twitter.
Article
Sentiment analysis is one of the fastest growing research areas in computer science, making it challenging to keep track of all the activities in the area. We present a computer-assisted literature review, where we utilize both text mining and qualitative coding, and analyze 6,996 papers from Scopus. We find that the roots of sentiment analysis are in the studies on public opinion analysis at the beginning of the 20th century and in the text subjectivity analysis performed by the computational linguistics community in the 1990s. However, the outbreak of computer-based sentiment analysis only occurred with the availability of subjective texts on the Web. Consequently, 99% of the papers have been published after 2004. Sentiment analysis papers are scattered across multiple publication venues, and the combined number of papers in the top-15 venues represents only ca. 30% of the papers in total. We present the top-20 cited papers from Google Scholar and Scopus and a taxonomy of research topics. In recent years, sentiment analysis has shifted from analyzing online product reviews to social media texts from Twitter and Facebook. Many topics beyond product reviews, like stock markets, elections, disasters, medicine, software development and cyberbullying, extend the utilization of sentiment analysis.
Article
Summer thunderstorms in Gauteng are often dramatic, noisy, wet events. They can appear suddenly on exceptionally hot sunny days travelling fast across the province. With such dramatic arrivals, people often flock to social media sites such as Twitter to comment on the rain, wind, hail, lightning and thunder. This paper investigates the possibility of mapping the track of Gauteng thunderstorms by using crowdsourced data from Twitter. This paper describes a model (entitled the ThunderChatter Model) and instantiation of that model which extracts data from Twitter, analyses the textual information for thunderstorm information and plots the appropriate data on a map. For evaluation purposes, these generated maps are then compared against lightning-stroke maps provided by the South African Weather Service. The maps are visually compared by independent people using Content Analysis techniques ensuring unbiased and reproducible results. The results of this research are mixed. For thunderstorms which traverse the strip of land between Soweto and Pretoria more or less correlated to the N1 highway (and representing the most heavily populated area of Gauteng and the area with the highest percentage of home Internet facilities), the results are excellent. However, in outlying areas of Gauteng such as Carletonville, Heidelberg, Hammanskraal and Bronkhorstspruit, the thunderstorms are only trackable using crowdsourced Twitter data in the case of extreme storms which damage property. The results imply that data obtained from social media could be used in some cases to supplement geographical data obtained from traditional sources.
Conference Paper
The aim of this research is to make it easy to monitor the reputation of a company in the Twittersphere. We propose a strategy that organizes a stream of tweets into different clusters based on the tweets' topics. Furthermore, the obtained clusters are assigned to different priority levels. A cluster with high priority represents a topic which may affect the reputation of a company, and that consequently deserves immediate attention. The evaluation results show that our method is competitive even though the method does not make use of any external knowledge resource.
Article
Knowledge exchange and opinion sharing over the Internet have reached levels never experienced before. People from different regions and socio-cultural backgrounds now have the possibility to create as much web content as they wish. This data represents a massive source of information useful for understanding many aspects of society. Through sentiment analysis it is possible to leverage this highly topical data to identify people's perception of a certain topic. Different approaches have been implemented in order to detect sentiment in microtexts, using mainly lexical ontologies and classification models. In this work, a tool designed for sentiment detection in microtexts named Sure is presented. This tool leverages inductive learning in order to find differentiating patterns in opinions about a given topic. Using the identified patterns, Sure creates decision trees to classify microtexts as supportive or unsupportive towards the analyzed topic.
Article
Microtext can be defined as very short messages which are typically unstructured and informal in nature. Microtext can be seen in SMS (Short Message Service), MIMs (Mobile Instant Messaging), and Twitter. This paper presents empirical evidence on changes which are occurring in microtext. Statistics from four corpora of messages from 2010 through 2013 are presented. These values show a trend towards alignment with more traditional English. This paper attributes this alignment to the growth in market share of smart phones.
Conference Paper
Twitter is a microblogging facility that allows people to post 140-character status updates about various topics. In times of special events (such as extreme weather, emergencies, sporting goals, etc.), status updates on Twitter often give people a better view of the event than traditional news operations or weather services. This paper describes a project that monitors Twitter for weather status updates for a specific city and automatically determines the current weather by analysing those tweets.
Conference Paper
The huge amount of social data currently available and produced every day calls for discovery techniques to mine prominent information topics. In this paper, we present topic-clouds as a solution for thematic, conceptual exploration of social data, with specific application to microblogging posts on Twitter. Topic-clouds are the result of a discovery approach based on classification and abstraction techniques to mine the most prominent topics that emerge from a possibly large set of social data.
Article
We survey research on the analysis of multiparticipant chat. Multiple research and applied communities (e.g., AI, educational, law enforcement, military) have interest in this topic. After introducing some context, we describe relevant problems and how these have been addressed using AI techniques. We also identify recent research trends and unresolved issues that could benefit from more attention.
Article
This article presents an investigation of corpus-based methods for the automation of help-desk e-mail responses. Specifically, we investigate this problem along two operational dimensions: (1) information-gathering technique, and (2) granularity of the information. We consider two information-gathering techniques (retrieval and prediction) applied to information represented at two levels of granularity (document-level and sentence-level). Document-level methods correspond to the reuse of an existing response e-mail to address new requests. Sentence-level methods correspond to applying extractive multi-document summarization techniques to collate units of information from more than one e-mail. Evaluation of the performance of the different methods shows that in combination they are able to successfully automate the generation of responses for a substantial portion of e-mail requests in our corpus. We also investigate a meta-selection process that learns to choose one method to address a new inquiry e-mail, thus providing a unified response automation solution.