Article

Learning to classify documents according to genre: Special Topic Section on Computational Analysis of Style


Abstract

Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer—genre classifiers should be reusable across multiple topics—which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature-sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled. © 2006 Wiley Periodicals, Inc.
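The abstract's notion of domain transfer lends itself to a compact experiment: train a genre classifier on documents from one topic domain and test it on another. Below is a minimal sketch of that protocol; the toy corpora, the TF-IDF features, and the logistic-regression learner are illustrative assumptions, not the paper's actual setup.

```python
# Minimal domain-transfer sketch: fit a genre classifier (subjective vs.
# objective) on one topic domain, evaluate on another. All data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy corpora: (text, genre) pairs from two topic domains.
football_docs = [
    ("the referee's decision was a disgrace and I am furious", "subjective"),
    ("the match ended 2-1 after a late penalty was converted", "objective"),
]
politics_docs = [
    ("this policy is a reckless betrayal of voters", "subjective"),
    ("the bill passed the senate by a 54-46 margin", "objective"),
]

X_train, y_train = zip(*football_docs)
X_test, y_test = zip(*politics_docs)

# Fit features on the training domain only, as a deployed classifier would.
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_train), y_train)

# The gap between in-domain and cross-domain accuracy is the transfer penalty.
print("cross-domain accuracy:",
      accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```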


... This approach enables the effortless enrichment of thousands of texts with genre information. The majority of previous works [1,2,7–12] focused on developing models from an information retrieval standpoint. ...
... Before the emergence of neural networks, the most frequently used machine learning method for automatic genre identification was support vector machines (SVMs) [27,35–37], which continue to be valuable for analyzing which textual features are the most informative for this task [38,39]. Other non-neural methods, including discriminant analysis [40,41], decision tree classifiers [8,42], and the Naive Bayes algorithm [10,40], were also used for genre classification. Multiple studies searched for the most informative features for this task. ...
... Multiple studies searched for the most informative features for this task. They experimented with lexical features (words, word or character n-grams), grammatical features (part-of-speech tags) [31,38], text statistics [8], visual features of HTML web pages such as HTML tags and images [43–45], and URLs of web documents [10,46,47]. However, the results for the discriminative features varied across studies and datasets. ...
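The feature families these excerpts survey (lexical n-grams, grammatical tags, text statistics) are straightforward to prototype. The sketch below contrasts character n-grams with crude text statistics on an invented two-genre corpus; the documents, labels, and the choice of a linear SVM are assumptions for illustration only.

```python
# Two of the feature families named above, each feeding a linear SVM.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["Buy now! Limited offer!!",
        "The court held that the statute applies.",
        "Click here for amazing deals",
        "Pursuant to section 12, the parties agree."]
labels = ["promotional", "legal", "promotional", "legal"]

# Feature set 1: character trigrams (less tied to topic words).
char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
svm_chars = LinearSVC().fit(char_vec.fit_transform(docs), labels)

# Feature set 2: crude text statistics (length and punctuation counts).
def stats(text):
    return [len(text.split()), text.count("!"), text.count(",")]

svm_stats = LinearSVC().fit(np.array([stats(d) for d in docs]), labels)

print(svm_chars.predict(char_vec.transform(["Order today and save!"])))
print(svm_stats.predict(np.array([stats("Order today and save!")])))
```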
Article
Full-text available
Massive text collections are the backbone of large language models, the main ingredient of the current significant progress in artificial intelligence. However, as these collections are mostly collected using automatic methods, researchers have few insights into what types of texts they consist of. Automatic genre identification is a text classification task that enriches texts with genre labels, such as promotional and legal, providing meaningful insights into the composition of these large text collections. In this paper, we evaluate machine learning approaches for the genre identification task based on their generalizability across different datasets to assess which model is the most suitable for the downstream task of enriching large web corpora with genre information. We train and test multiple fine-tuned BERT-like Transformer-based models and show that merging different genre-annotated datasets yields superior results. Moreover, we explore the zero-shot capabilities of large GPT Transformer models in this task and discuss the advantages and disadvantages of the zero-shot approach. We also publish the best-performing fine-tuned model that enables automatic genre annotation in multiple languages. In addition, to promote further research in this area, we plan to share, upon request, a new benchmark for automatic genre annotation, ensuring the non-exposure of the latest large language models.
... In general, most definitions describe genre classes based on the socially recognized form of a document and its intended communicative purpose (Kwasnik & Crowston, 2005). In addition to this, some definitions also consider the target audience (Eissen & Stein, 2004), expectations of the reader (Santini, 2006), style (Finn & Kushmerick, 2006; Argamon et al., 1998), or the content of the texts (Rosso, 2008). To identify genres, researchers observe their intrinsic features, i.e., the linguistic and other "look'n'feel" features in the text (text-internal perspective), their extrinsic features, that is, the function of the texts (text-external or functional perspective), or both (see Sharoff (2010, 2021) for a detailed discussion on both perspectives). ...
... Thus, they consist of genre categories that are deemed to be useful for search engine users (see Eissen and Stein (2004); Vidulin et al. (2007); Roussinov et al. (2001); Dewe et al. (1998); Lim et al. (2005); Santini (2007)). 3. Schemata developed with other aims: limited sets of categories used in studies which analyse one or a few genre classes, e.g., a smaller set of genres of interest (Boese (2005); Lee and Myaeng (2004); Asheghi et al. (2016)), academic genres (Rehm, 2002), e-shop genres (Levering et al., 2008), poetry and prose genres (Shavrina, 2019), news articles and reviews (Finn & Kushmerick, 2006), and home pages (Kennedy & Shepherd, 2005). ...
... In the past, support vector machines (SVMs) were most frequently used (Rezapour Asheghi, 2015; Laippala et al., 2017, 2021; Sharoff et al., 2010; Pritsos & Stamatatos, 2018; Petrenz & Webber, 2011), as it was shown that they are very suitable for text categorization (Joachims, 1998). Other methods previously used for genre classification are discriminant analysis (Feldman et al., 2009), which is the earliest method applied to this task (Karlgren & Cutting, 1994), decision tree classifiers, more specifically the C4.5 algorithm (Finn & Kushmerick, 2006; Dewdney et al., 2001), the Naive Bayes algorithm (Feldman et al., 2009; Priyatam et al., 2013), and graph-based models using hyperlink connections between web pages (Asheghi et al., 2014; Zhu et al., 2011). In recent studies, SVMs were used to obtain insight into which feature sets are most relevant for genre identification (Sharoff et al., 2010; Pritsos & Stamatatos, 2018; Asheghi et al., 2014). ...
Article
Full-text available
Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text categories defined by the author’s purpose, common function of the text, and the text’s conventional form. Obtaining genre information has been shown to be beneficial for a wide range of disciplines, including linguistics, corpus linguistics, computational linguistics, natural language processing, information retrieval and information security. Consequently, in the past 20 years, numerous researchers have collected genre datasets with the aim to develop an efficient genre classifier. However, their approaches to the definition of genre schemata, data collection and manual annotation vary substantially, resulting in significantly different datasets. As most AGI experiments are dataset-dependent, a sufficient understanding of the differences between the available genre datasets is of great importance for the researchers venturing into this area. In this paper, we present a detailed overview of different approaches to each of the steps of the AGI task, from the definition of the genre concept and the genre schema, to the dataset collection and annotation methods, and, finally, to machine learning strategies. Special focus is dedicated to the description of the most relevant genre schemata and datasets, and details on the availability of all of the datasets are provided. In addition, the paper presents the recent advances in machine learning approaches to automatic genre identification, and concludes with proposing the directions towards developing a stable multilingual genre classifier.
... In consideration of this study employing document analysis, the purpose of this chapter is to acquaint the reader with the three document data sources: the chemistry component of the Sciences textbook (Grayson et al., 2011) is an example of the textbook genre, and the Grade 10 chemistry exemplar examination question paper (DBE, 2012) is an example of the test genre. Finn and Kushmerick (2006, p. 1506) used genre to "loosely mean the style of text in the document". Although relevant literature has been drawn upon for introducing the three documents in Chapter One and exploring them in more depth in terms of the broader notion of curriculum in Chapter Two, it is fitting to discuss some literature on curriculum genre at this point in the thesis, as the discussion moves to a more fine-grained level of engagement with the documents that provided the data for this study. ...
... Although the term 'genre' appears regularly in popular culture, "genres are often vague concepts with no clear boundaries" (Finn & Kushmerick, 2006, p. 1507). Finn and Kushmerick (2006) acknowledged the lack of agreement regarding a universal definition of genre, but also a consensus about genre relating to style. With regard to documents, a particular genre reflects a particular text style, rather than content (Finn & Kushmerick, 2006): documents covering the same topic can belong to different genres (as is the case of the three curriculum documents explored in the current thesis, which all cover chemistry topics but do not display the same text style), and documents of the same genre can cover different topics (as would be the case for syllabi, textbooks or exemplar examination question papers of different school subjects, which are comparable in terms of text style). ...
Thesis
Full-text available
South Africa experiences crippling challenges in the recruitment and retention of Science, Technology, Engineering and Mathematics (STEM) students in Higher Education, with major implications for such things as socioeconomic development. In the country’s school curriculum, it is Grade 10 which marks the beginning of a learner’s potential STEM career trajectory. A deeper understanding of South Africa’s Grade 10 curriculum literacy challenges and associated curriculum alignment in the key STEM field of chemistry is needed for enabling forms of epistemological access (such as semiotic access) that are critical for the empowerment of future scientists. Chemistry as an academic discipline, is sustained by many individuals with shared ways of knowing facilitated by a system of semiotic resources such as visuals and text, referred to as discourse. Despite chemistry playing an important role in our lives and in school curriculum, the abstract nature of chemistry discourse poses challenges to students. The visuals and text of chemistry discourse contribute to chemistry curriculum demands imposed on students. While there is clear justification for promoting literacy practices in classrooms, the reading involved in school science has received less attention, and recommendations from literature point to the need for defining discipline-specific curriculum literacies and identifying implicit literacy practices. Such recommendations are further supported by the broader call made by sociologists of education for overcoming knowledge blindness in education. This case study of South African Grade 10 Chemistry curriculum utilised document analysis for exploring the alignment of school chemistry curriculum literacy demands between the syllabus, textbook and exemplar examination in terms of abstraction. The Legitimation Code Theory conceptualisation of degree of abstraction in knowledge practices as Semantic Gravity (SG), provided a theoretical perspective for characterising visual and textual curriculum literacy demands of school chemistry curriculum documents. One translation device was developed specifically for exploring SG of visual items and a second translation device was devised specifically for exploring SG of textual items. The SG of visuals in the exemplar examination paper and textbook were tabulated and graphed in order to identify areas of stronger and weaker alignment between the visual literacy demands of these two documents of the pedagogic recontextualising field. Similarly, the SG of textual items in the syllabus, exemplar examination paper and textbook were compared to identify areas of stronger and weaker alignment between the textual literacy demands of the pedagogic and official recontextualising fields. The methodological contribution of this study lies in it demonstrating the utility of SG as a mode of analysing curriculum alignment of subjects with hierarchical knowledge structures. The empirical findings reveal an overall high level of alignment for visual chemistry curriculum literacy demands, and for textual chemistry curriculum literacy demands at the lower levels of abstraction. Visual literacy demands were found to be higher than textual literacy demands, due to emphasis on visuals at the highest level of abstraction while the curriculum documents displayed a more even distribution of focal lexical items across levels of textual abstraction. 
This thesis argues that while exploring the alignment of visual and textual chemistry curriculum literacy demands between different curriculum documents is useful, it is equally important to consider how evenly the visual and textual items are distributed across the SG continuum as this has cognitive and affective implications for academic achievement and life chances of chemistry learners.
... Here $\tilde{p}(x)$ is the empirical distribution of the evidence in the training dataset, usually set equal to $1/N$. By constraining the expected value of each feature to be equal to its empirical value, and from equations (9) and (10), the next step is to solve the optimization problem using Lagrange multipliers, focusing on the unconstrained dual problem and estimating the free variables $\{\lambda_1, \ldots, \lambda_n\}$ with the Maximum Likelihood Estimation method. After that, the probability that a piece of evidence is classified as a hypothesis is given by equation (14). ...
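Since the extraction lost the formulas themselves, here is a conventional reconstruction of the maximum-entropy model this excerpt appears to describe, assuming the standard formulation (the equation numbers (9), (10), and (14) belong to the cited paper, not to this reconstruction):

```latex
% Empirical distribution over the N training examples
\tilde{p}(x) = \frac{1}{N}

% Constraint: model expectation of each feature f_i equals its empirical expectation
\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y) \;=\; \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y)

% Resulting log-linear form, with weights \lambda_i found by maximum likelihood
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x,y) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_i \lambda_i f_i(x,y') \Big)
```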
... Recall is the percentage of correctly predicted positive data, as given in equation (18). ... Genre is a term that is often mentioned in modern society. Genre is a vague concept without clear boundaries and with no agreed-upon definition [9]. Also, according to The American Heritage Dictionary of the English Language, genre is perceived as a category of artistic composition, as in music or literature, marked by a distinctive style, form, or content. ...
Article
Full-text available
In the last two decades, translated novels have become one of the popular products of the literature community. People favor certain genres depending on their age, but a reader must finish a novel before they can determine its genre, and in some cases the genre given in the description differs from the novel's actual content, which leaves readers upset with an unpleasant reading experience. This research classifies a novel's genre automatically. Naïve Bayes is the method chosen for classification, and its results are compared with another algorithm, Maximum Entropy. Each method applies its algorithm to label the data based on existing classes. The data were taken from 12 translated novels comprising 3746 lines, partitioned into three genre classes: about 1293 lines for "Action-Fantasy", 1203 lines for "Modern-Slice-of-Life", and 1250 for "Other". The two models, Naïve Bayes and MaxEnt, were evaluated using a confusion matrix; the highest precision, recall, and F-score were 77.52%, 75.59%, and 77.55% for the Naïve Bayes method, and 78.11%, 83.82%, and 75.81% for the MaxEnt method. Accuracy was 72.72% for Naïve Bayes and 71.86% for MaxEnt. Both methods showed that "Action-Fantasy" was the correct genre for almost every one of the 12 novels.
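For readers who want to reproduce this style of evaluation, the sketch below computes per-class precision, recall, F-score, and overall accuracy from a confusion matrix over the three genre classes. The cell counts are invented placeholders (only the row totals are chosen to match the class sizes reported above); the paper's own confusion matrices are not reproduced here.

```python
# Per-class precision/recall/F1 and overall accuracy from a confusion matrix.
import numpy as np

# rows = true class, columns = predicted class; counts are invented, but each
# row total matches the reported class size (1293 / 1203 / 1250 lines).
cm = np.array([[900, 200, 193],
               [150, 910, 143],
               [180, 160, 910]])

for i, name in enumerate(["Action-Fantasy", "Modern-Slice-of-Life", "Other"]):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()  # correct predictions of this class / all predictions of it
    recall = tp / cm[i, :].sum()     # correct predictions of this class / all true members
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: P={precision:.2%} R={recall:.2%} F1={f1:.2%}")

print(f"accuracy={np.trace(cm) / cm.sum():.2%}")
```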
... Shifting from a pure focus on features, Finn and Kushmerick (2006) built genre classifiers to test their ability to transfer across multiple-topic domains. In addition, combinations of multiple feature sets were adopted to improve classifier performance. ...
... distinguished between a style and a genre. A style is a consistent and distinguishable tendency to make certain linguistic choices. A genre is a grouping of documents that are stylistically consistent and intuitive to accomplished readers of the communication channel in question. Finn and Kushmerick (2006), however, argued that "By genre, we loosely mean the style of text in the document. A genre class is a class of documents that are of a similar type. This classification is based not on the topic of the documents but rather on the kind of text used." In other words, the genre of a document reflects a certain kind of text rather than being ...
... It does, however, require a definition of register sufficiently precise that human annotators can label texts accordingly with high inter-rater reliability, which is not always easy to achieve. Register classification can comprise a stand-alone task (Stamatatos, Fakotakis, & Kokkinakis 2000, Biber & Conrad 2001, Argamon, Koppel, Fine, & Shimoni 2003, Finn & Kushmerick 2006, Santini 2006, Herring & Paolillo 2006, Abbasi & Chen 2007, Dong, Watters, Duffy, & Shepherd 2008, Sharoff, Wu, & Markert 2010) or may be used to derive insights into larger questions related to linguistic variation (e.g., Atkinson 1992, Argamon, Dodick, & Chase 2008, Eisenstein, Smith, & Xing 2011, Teich, Degaetano-Ortlieb, Kermes, & Lapshinova-Koltunski 2013, Clarke & Grieve 2017). Register labels, either manually or automatically assigned, can also be used to control for register in research on other text analysis methods (e.g., Carroll et al. 1999, Giesbrecht & Evert 2009, Sharoff et al. 2010); differences in register between training and testing data often affect outcomes for NLP tasks such as part-of-speech tagging, parsing, or information extraction. ...
... Little work, if any, has examined this problem specifically from the standpoint of register, as opposed to other stylistic questions such as genre and authorship: researchers have sought "stylistic features" of texts that are correlated with different styles, contrasted with "topical features" that correlate with different topics or domains of discourse, which are typically used for information retrieval and related tasks. A wide variety of such features have been proposed (Stamatatos et al. 2000, Finn & Kushmerick 2006, Argamon & Koppel 2010), including relative frequencies of function words, part-of-speech n-grams, character n-grams, syntactic constructs, and systemic-functional categories, as well as type/token ratios, word and sentence length, and other textual statistics. These are discussed in more detail below. ...
Preprint
Full-text available
The study of register in computational language research has historically been divided into register analysis, seeking to determine the registerial character of a text or corpus, and register synthesis, seeking to generate a text in a desired register. This article surveys the different approaches to these disparate tasks. Register synthesis has tended to use more theoretically articulated notions of register and genre than analysis work, which often seeks to categorize on the basis of intuitive and somewhat incoherent notions of prelabeled "text types". I argue that an integration of computational register analysis and synthesis will benefit register studies as a whole, by enabling a new large-scale research program in register studies. It will enable comprehensive global mapping of functional language varieties in multiple languages, including the relationships between them. Furthermore, computational methods together with high-coverage, systematically collected and analyzed data will enable rigorous empirical validation and refinement of different theories of register, which will also have implications for our understanding of linguistic variation in general.
... Kessler et al. [12] first used the term "genre" to represent any widely recognized class of texts defined by some common communicative purposes or other functional traits. Finn et al. [8] proposed that genre classification is orthogonal to topic classification. That is to say, documents on the same topic can have different genres, and documents in the same genre can cover different topics. ...
... Because these features might also be predictive of knowledgeable and unknowledgeable documents, we put them into our feature set. These features include the number of words, the length of the document, the number of sentences, and the average sentence length [8,18]. We also extend these features with the number of paragraphs, the average number of sentences per paragraph, the number of distinct words in the title, and so on. ...
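A minimal sketch of such document-statistics features follows; the sentence and paragraph splitting heuristics are simplistic assumptions for illustration, not the cited implementations.

```python
# Document-statistics features of the kind listed above: word count, document
# length, sentence count, average sentence length, and paragraph statistics.
def text_statistics(doc: str) -> dict:
    # Crude splitting heuristics: blank lines separate paragraphs,
    # sentence-final punctuation separates sentences.
    paragraphs = [p for p in doc.split("\n\n") if p.strip()]
    sentences = [s for s in doc.replace("?", ".").replace("!", ".").split(".")
                 if s.strip()]
    words = doc.split()
    return {
        "n_chars": len(doc),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "n_paragraphs": len(paragraphs),
        "avg_sentences_per_paragraph": len(sentences) / max(len(paragraphs), 1),
    }

print(text_statistics("First sentence. Second one!\n\nA new paragraph here."))
```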
Preprint
Full-text available
In this study, we focus on extracting knowledgeable snippets and annotating knowledgeable documents from a Web corpus consisting of documents from social media and We-media. Informally, knowledgeable snippets refer to text describing concepts, properties of entities, or relations among entities, while knowledgeable documents are the ones with enough knowledgeable snippets. These knowledgeable snippets and documents could be helpful in multiple applications, such as knowledge base construction and knowledge-oriented services. Previous studies extracted knowledgeable snippets using pattern-based methods. Here, we propose a semantic-based method for this task. Specifically, a CNN-based model is developed to extract knowledgeable snippets and annotate knowledgeable documents simultaneously. Additionally, a "low-level sharing, high-level splitting" CNN structure is designed to handle documents from different content domains. Compared with building multiple domain-specific CNNs, this joint model not only substantially reduces training time but also visibly improves prediction accuracy. The superiority of the proposed method is demonstrated on a real dataset from the WeChat public platform.
... Our deep learning approach, which relies on abstract vector representations, does not permit direct exploration of the linguistic features that influence model predictions, unlike more traditional methods (e.g. Stamatatos, Fakotakis, and Kokkinakis 2000;Finn and Kushmerick 2006). However, this methodology opens avenues for future linguistic research; our models could be used to label large web-text corpora like OSCAR (as demonstrated by Laippala et al. 2022), opening up opportunities for extensive corpuslinguistic studies. ...
Preprint
Full-text available
This article explores deep learning models for the automatic identification of registers - text varieties such as news reports and discussion forums - in web-based datasets across 16 languages. Web register (or genre) identification would provide a robust solution for understanding the content of web-scale datasets, which have become crucial in computational linguistics. Despite recent advances, the potential of register classifiers on the noisy web remains largely unexplored, particularly in multilingual settings and when targeting the entire unrestricted web. We experiment with a range of deep learning models using the new Multilingual CORE corpora, which includes 16 languages annotated using a detailed, hierarchical taxonomy of 25 registers designed to cover the entire unrestricted web. Our models achieve state-of-the-art results, showing that a detailed taxonomy in a hierarchical multi-label setting can yield competitive classification performance. However, all models hit a glass ceiling at approximately 80% F1 score, which we attribute to the non-discrete nature of web registers and the inherent uncertainty in labeling some documents. By pruning ambiguous examples, we improve model performance to over 90%. Finally, multilingual models outperform monolingual ones, particularly benefiting languages with fewer training examples and smaller registers. Although a zero-shot setting decreases performance by an average of 7%, these drops are not linked to specific registers or languages. Instead, registers show surprising similarity across languages.
... The results showed that SVM outperformed K-NN and LR in accuracy and processing speed. Finn et al. [11] showed ways of learning to classify documents according to genre. Two sorts of genre classification task are performed: determining whether an article is subjective or objective, and whether a review is positive or negative. ...
Article
Full-text available
Digital books and internet retailers are growing in popularity daily. Different individuals prefer different genres of literature, and categorizing books by genre facilitates the discovery of books that match a reader's tastes. In this paper, we categorize books by genre using a variety of traditional machine learning and deep learning models based on book titles and snippets. Such work exists for books in other languages but has not yet been done for Bengali novels. We have developed two datasets for this research: one includes the titles of Bengali novels across nine genres, while the other includes book snippets from three genres. For classification, we have employed logistic regression, Support Vector Machines (SVM), random forest classifiers, decision trees, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Bidirectional Encoder Representations from Transformers (BERT). Among all the models, BERT performed best on both datasets, with 90% accuracy on the snippets dataset and 77% accuracy on the titles dataset. With the exception of BERT, traditional machine learning models performed better on the snippets dataset, whereas deep learning models performed better on the titles dataset. The performance varied due to the size of the datasets and the number of words present in each.
... Grammatical features have also been used in automatic register identification, though the performance of models trained only on grammatical information tends to be poorer (see Petrenz and Webber 2011). Grammatical tags, however, do not reflect the topic of the text but rather the genre or register the text represents (Finn and Kushmerick 2006), and are considered to have register-specific functional associations (Biber 1988; Biber and Conrad 2019). Grammatical features are widely recognized as essential components of registers and form the basis of register studies. ...
Article
Full-text available
Registers are situationally defined text varieties, such as letters, essays, or news articles, that are considered to be one of the most important predictors of linguistic variation. Often historical databases of language lack register information, which could greatly enhance their usability (e.g. Early English Books Online). This article examines register variation in Late Modern English and automatic register identification in historical corpora. We model register variation in the corpus of Founding Era American English (COFEA) and develop machine-learning methods for automatic register identification in COFEA. We also extract and analyze the most significant grammatical characteristics estimated by the classifier for the best-predicted registers and found that letters and journals in the 1700s were characterized by informational density. The chosen method enables us to learn more about registers in the Founding Era. We show that some registers can be reliably identified from COFEA, the best overall performance achieved by the deep learning model Bidirectional Encoder Representations from Transformers with an F1-score of 97 per cent. This suggests that deep learning models could be utilized in other studies concerned with historical language and its automatic classification.
... At the same time, from the perspective of text classification, the function words are omitted or only their counts as part of POS statistics are used, e.g., in automatic genre classification (Finn and Kushmerick, 2006; Karlgren & Cutting, 1994), topic modeling/classification (e.g., Jockers & Mimno, 2013; Moumivand et al., 2021), or text categorization/document classification (Joachims, 1998; Baharudin et al., 2010), as they are considered irrelevant for such tasks. ...
Preprint
Full-text available
Our work aims to evaluate the strength of the association between function words and several text types: novels, poems, academic articles, reviews and blog posts, and the accuracy of their classification into these categories. The principal conclusion is that the types of texts are distinguishable based only on the function words, either by vocabulary or by vocabulary diversity. Such findings may impact techniques of authorship attribution based on function words, as well as text clustering techniques, since some function words add information about the text types/genres, in addition to content words.
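The claim that text types are separable on function words alone can be prototyped directly: restrict the vectorizer's vocabulary to function words so content words never enter the model. In this sketch the function-word list, the toy texts, and the logistic-regression learner are all illustrative assumptions, not the study's materials.

```python
# Classify text types from function-word counts only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative sample of English function words, not a full inventory.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it", "is",
                  "was", "for", "on", "with", "as", "at", "by", "this",
                  "but", "not", "or"]

docs = ["Once upon a time it was said that the king rode to the sea.",
        "We show that the method is robust and that it generalizes.",
        "Roses are red and violets, by the wall, wait for the sun."]
labels = ["novel", "academic", "poem"]

# Fixing the vocabulary discards all content words from the representation.
vec = CountVectorizer(vocabulary=FUNCTION_WORDS)
clf = LogisticRegression(max_iter=1000).fit(vec.transform(docs), labels)
print(clf.predict(vec.transform(
    ["It is the end of the story, and the knight was gone."])))
```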
... For example, Posadas-Durán et al. (2017) presented an approach that uses word n-grams and Doc2vec distributed document representations, scoring over 98% accuracy in a binary authorship attribution task. Others implemented various approaches and methods to assign authorship to literary texts, ranging from simple lexical, syntactical and/or grammatical features (Argamon & Levitan, 2005; Finn & Kushmerick, 2006; Grieve, 2007; Stańczyk & Cyran, 2007; van Halteren et al., 2005; Wu, Zhang, & Wu, 2021; Zhao & Zobel, 2005a; Zheng & Jin, 2022) to focusing only on function or content words (Boumparis, submitted; Kestemont, 2014). On the other end of the spectrum, deep learning methods were implemented (Shrestha et al., 2017; Zhang, Zhao, & LeCun, 2015). ...
Preprint
Full-text available
This paper aims to explore whether cross-linguistic authorship attribution and author's gender identification are feasible using Machine Translation (MT) as a method to bridge the language gap. We designed a series of computational stylistics experiments to explore whether the stylometric signal survives through the MT process. We compiled an extensive blog corpus in Greek containing 100 authors, balanced in gender with 50 texts from each author. Then, we used Google's Neural Machine Translation to automatically translate each text into English. We ran several classification experiments using the Random Forest algorithm in authorship attribution and gender profiling tasks employing different feature groups in both the source language (Greek) and the machine-translated (English) corpora. Moreover, we trained models in the source language and used the texts in the target language as the unseen test set, to simulate a cross-linguistic prediction. The results showed that cross-linguistic gender identification could be achieved reliably using n-gram features. Authorship attribution, however, was not feasible in our cross-linguistic experimental setting, although features of lexical diversity exhibited promising results. Finally, when performed using training and testing set from the same language, both authorship attribution and gender identification exhibit similar accuracy across the full range of the feature groups used.
... Experimental results have shown standout performance on the group activity datasets. In the field of genre recognition, there are many successful attempts to classify music by genre [15–17]. ...
Article
Full-text available
Despite the idiom not to prejudge something by its outward appearance, we apply deep learning to learn whether we can judge a book by its cover or, more precisely, by its text and design. The classification was accomplished using three strategies, i.e., text only, image only, and both text and image. State-of-the-art CNN (convolutional neural network) models were used to classify books through cover images. The Gram and SE (squeeze-and-excitation) layers were used as attention units in them to learn optimal features and identify characteristics from the cover image. The Gram layer enabled more accurate multi-genre classification than the SE layer. The text-based classification was done using word-based, character-based, and feature-engineering-based models. We designed the EXplicit interActive Network (EXAN), composed of context-relevant layers and multi-level attention layers, to learn features from book titles. We designed an improved multimodal fusion architecture for multimodal classification that uses an attention mechanism between modalities. The disparity in the modalities' convergence speeds is addressed by pre-training each sub-network independently prior to end-to-end training of the model. Two book cover datasets were used in this study. Results demonstrated that text-based classifiers are superior to image-based classifiers. The proposed multimodal network outperformed all models for this task, with the highest accuracies of 69.09% and 38.12% on the Latin and Arabic book cover datasets. Similarly, the proposed EXAN surpassed the extant text classification models by scoring the highest prediction rates of 65.20% and 33.8% on the Latin and Arabic book cover datasets.
... In previous stylometric studies, most frequent words (MFWs) are the most popular and most frequently used features (Rybicki and Heydel, 2013; Evans, 2018; Lee, 2018). However, MFWs often include content words that bear topical information, thus POS tags, which are immune to shifts of topic (Finn and Kushmerick, 2006), have been employed in genre classification tasks. Some studies included POS tags with other linguistic features such as discourse markers and temporal words to form a combined feature set (Xu et al., 2017), while other studies treated POS tags as an independent feature set (Fang and Cao, 2015; Tang and Cao, 2015). ...
Article
The visibility of the translator's style is a much-discussed topic in translation studies with the application of corpus tools. So far, however, no agreement has been reached. The present study aims to explore this issue through a comparison of three Chinese translations of the English literary work Alice's Adventures in Wonderland, using two stylometric techniques, bootstrap consensus tree analyses and bootstrap consensus network analyses. The results show that all three Chinese translations preserved the style of the original text and that individual translators' styles could not be identified based on the entire set of part-of-speech (POS) tags. Furthermore, a feature selection method (the chi-square metric) was used to obtain the top fifteen distinctive POS unigrams and bigrams, and these distinctive features successfully identified translatorial fingerprints across the three translations examined. The findings suggest that translators have their own stylistic choices when translating the same text, but their stylistic differences can only be detected by distinctive features. Our attempt to combine feature selection methods and stylometric techniques may offer new insights into the investigation of the translator's stylistic visibility in translation studies.
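The article's feature-selection step (scoring POS unigrams and bigrams with the chi-square metric and keeping the top fifteen) can be sketched as below. The POS tag streams and the two-translator labels are invented stand-ins; only the chi-square selection mechanics follow the description.

```python
# Chi-square selection of the most distinctive POS unigrams and bigrams.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Each "document" is a whitespace-joined POS tag stream (tagging done upstream;
# these streams and the translator labels are invented for illustration).
pos_streams = ["PN VV DT NN CC PN VV DT NN",
               "DT NN VV RB CC DT NN VV RB",
               "PN RB VV DT JJ NN",
               "DT JJ NN VV CC VV RB"]
translator = ["T1", "T2", "T1", "T2"]

# token_pattern=r"\S+" treats every tag as a token; default lowercasing
# means the selected features print in lowercase.
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+")
X = vec.fit_transform(pos_streams)

selector = SelectKBest(chi2, k=min(15, X.shape[1])).fit(X, translator)
top = [f for f, keep in zip(vec.get_feature_names_out(),
                            selector.get_support()) if keep]
print("most distinctive POS n-grams:", top)
```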
... Furthermore, numerous genre annotation studies have been conducted with the aim of improving Information Retrieval (IR) tools (Stubbe and Ringlstetter, 2007; Zu Eissen and Stein, 2004; Vidulin et al., 2007; Roussinov et al., 2001; Finn and Kushmerick, 2006; Boese, 2005). (Footnote 1: https://macocu.eu/) Our contributions in this work are as follows. ...
Preprint
Full-text available
This paper presents a new training dataset for automatic genre identification GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words. Each document was manually annotated for genre with a new annotation schema that builds upon existing schemata, having primarily clarity of labels and inter-annotator agreement in mind. The dataset consists of various challenges related to web-based data, such as machine translated content, encoding errors, multiple contents presented in one document etc., enabling evaluation of classifiers in realistic conditions. The initial machine learning experiments on the dataset show that (1) pre-Transformer models are drastically less able to model the phenomena, with macro F1 metrics ranging around 0.22, while Transformer-based models achieve scores of around 0.58, and (2) multilingual Transformer models work as well on the task as the monolingual models that were previously proven to be superior to multilingual models on standard NLP tasks.
... For the different corpora, accuracy results lay between 0.42 and 0.94. Finn and Kushmerick (2006) use the concept of genre (and describe the state of the art from both the theoretical and the computational approach) to classify different aspects of the text, such as its objectivity and the polarity of reviews. These are nowadays clearly related to sentiment analysis rather than to genre classification. ...
... For instance, Finn and Kushmerick (2006) made use of an automatic genre analysis tool to categorize data according to styles and features. Similarly, Mu (2015) utilized K-nearest neighbor, naive Bayes and support vector machine classifiers to categorize the genre of poems with 95% accuracy. ...
Chapter
Full-text available
The English curriculum implemented in Indonesian secondary schools has undergone some development in the past few decades. This development is meant to ensure that the curriculum remains up to date with the development of English language teaching theories and practices in the world. This paper discusses the English curricula which were developed in 1984, 1994, 2004, and 2006 and critically analyses the communicative approach that has been implemented in Indonesian secondary schools. The newest 2013 curriculum is also discussed, but only superficially, as it was only officially implemented at national level at the end of 2019. The writers finally offer some recommendations for future curriculum developers and government officials in order to improve English language teaching and learning in Indonesia.
... More recently, the development of robust and fairly accurate NLP pipelines, together with the increased computational power to process large volumes of data, has made it possible to automate the process of feature extraction from large-scale corpora, enhancing the potential contribution of linguistic profiling for studying language variation. By modeling the 'form' of a text through large sets of features spanning distinct levels of language description, it has been possible not only to improve automatic classification of genres (Stamatatos, Fakotakis, and Kokkinakis 2001), but also to get a better understanding of the impact of those features in classifying genres and text varieties (Cimino et al. 2017; Finn and Kushmerick 2006). This paper adopts this framework but, quite differently from much previous research, it presents a new application of linguistic profiling, in which the unit of analysis is not the document as a whole entity, but the internal parts into which it is articulated. ...
Article
Full-text available
Moving from the assumption that formal, rather than content features, can be used to detect differences and similarities among textual genres and registers, this paper presents a new approach to linguistic profiling – a well-established methodological framework to study language variation – which is applied to detect significant variations within the internal structure of a text. We test this approach on the Italian language using a wide spectrum of linguistic features automatically extracted from parsed corpora representative of four main genres and two levels of complexity for each, and we show that it is possible to model the degree of stylistic variance within texts according to genre and language complexity.
... Biber, 1993b) and his multidimensional analysis. However, especially because of the inherent complexity of the task, supervised machine learning methods also have a long tradition in this area (Finn & Kushmerick, 2006; Karlgren, 2004; Kessler et al., 1997). Teich and Frankhauser (2009), for example, analysed different scientific registers using data mining (see also Teich et al., 2015). ...
... To develop the experiments, it is necessary to define the authorship attribution approaches. Here, two approaches are the most common in the literature: authorship verification and authorship identification [3], [10], [14], [15], [16], [17], [18], [22], [23], [24]. These approaches are carried out by observing linguistic attributes, such as stylistic ones, exhibited by the author of the text. ...
Article
Full-text available
In this paper we present a computational approach to authorship attribution in a multilingual environment based on Romance languages. Initially, we defined databases of literary texts written by canonical authors of Portuguese, Spanish and French literature. Subsequently, we established a set formed by groups of stylometric characteristics, namely: morphological, inflectional, syntactic and auxiliary. The main objective is to extract, from the grammatical structures of the sentences, the stylometric pattern of each author. We perform experiments with an author-dependent approach, using verification and identification strategies. In the classification process we use Support Vector Machines (SVM) with a linear kernel.
... Two main factors that can be used for categorization of Ghazal text, as presented in [13], [14], are content and style. In Urdu poetry, each poet can have a varying degree of styles and themes. ...
Article
Full-text available
Urdu literature has a rich tradition of poetry with many forms, one of which is the Ghazal. Urdu poetry structures are mainly of Arabic origin, with complex sentence structures that differ from everyday language, which makes them hard to classify. Our research is focused on identifying the poet when given ghazals as input; to our knowledge, no previous work has addressed this task. Two main factors which help categorize and classify a given text are its content and writing style. Urdu poets like Mirza Ghalib, Mir Taqi Mir, Iqbal and many others have different writing styles and topics of interest. Our model caters to these two factors, classifying ghazals using different classification models such as SVM (Support Vector Machines), Decision Tree, Random Forest, Naïve Bayes and KNN (K-Nearest Neighbors). Furthermore, we have also applied feature selection techniques like the chi-square model and L1-based feature selection. For experimentation, we have prepared a dataset of about 4000 ghazals. We have also compared the accuracy of the different classifiers and report the best results for the collected dataset of ghazals.
... For example, recent attempts have been made to identify artistic styles and the quality of paintings and photographs [10], [11] with neural network models. In addition, there have been trials to classify music by genre [12], [13], book covers by genre [1], movie posters by genre [14], paintings by genre [15], and text by genre [16], [17]. Also, in a general sense, document classification can be considered genre classification, and deep CNNs are the state-of-the-art in the document classification domain [18]–[20]. ...
... In these approaches, support vector machines (SVMs) are widely used for authorship attribution [3–8,11,12,14,15]. Different models of machine learning and statistical analysis have been attempted, such as neural networks [11], decision trees [11,12,17], discriminant analysis [2,13], and multivariate analysis [14]. ...
Preprint
Full-text available
In this paper, we aim to understand the design principles in book cover images which are carefully crafted by experts. Book covers are designed in a unique way, specific to genres which convey important information to their readers. By using Convolutional Neural Networks (CNN) to predict book genres from cover images, visual cues which distinguish genres can be highlighted and analyzed. In order to understand these visual clues contributing towards the decision of a genre, we present the application of Layer-wise Relevance Propagation (LRP) on the book cover image classification results. We use LRP to explain the pixel-wise contributions of book cover design and highlight the design elements contributing towards particular genres. In addition, with the use of state-of-the-art object and text detection methods, insights about genre-specific book cover designs are discovered.
... In the real world, there are several significant applications. As a matter of fact, identifying the genre of a text document is becoming more and more important in web technology systems [5]. For instance, news articles are generally organized according to their subject categories or geographical areas; papers are typically classified by their domains. ...
Conference Paper
Full-text available
Natural Language Processing (NLP) is a prominent subject which includes various subcategories such as text classification, error correction, machine translation, etc. Unlike for other languages, there is a limited number of Turkish NLP studies in the literature. In this study, we apply text classification to Turkish documents using n-gram features. Our algorithm applies different preprocessing techniques, namely n-gram choice (character level or word level, bigram or trigram models), stemming, and use of punctuation, and then determines the Turkish document's author and genre, and the gender of the author. For this purpose, Naive Bayes, Support Vector Machines and Random Forest are used as classification techniques. Finally, we discuss the effects of the above-mentioned preprocessing techniques on the performance of Turkish text classification.
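A compact illustration of the n-gram choices this study compares (character-level versus word-level models feeding a Naive Bayes classifier) follows; the two-document Turkish corpus and the genre labels are placeholders, not the study's dataset.

```python
# Character trigrams vs. word bigrams, each feeding Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["bu roman çok etkileyici ve sürükleyici bir hikaye anlatıyor",
        "ekonomi haberlerine göre borsa bugün yükselişle kapandı"]
genres = ["novel", "news"]

for name, vec in [
    ("char trigrams", CountVectorizer(analyzer="char", ngram_range=(3, 3))),
    ("word bigrams", CountVectorizer(ngram_range=(2, 2))),
]:
    X = vec.fit_transform(docs)
    nb = MultinomialNB().fit(X, genres)
    pred = nb.predict(vec.transform(["piyasalarda bugün sert düşüş yaşandı"]))
    print(name, "->", pred[0])
```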
... Finn and Kushmerick [12] weigh the impact of three sets of features for genre classification of texts, having to distinguish between opinionated and factual articles across three domains. Using a corpus of approximately 800 texts, and extracting features like part-of-speech information, a bag-of-words approach, textual statistics, and a combination of all the aforementioned features, they record average accuracy scores between 82.4% and 90.5%. ...
... Information gain for original and translated Dutch. Another possible reason for the low performance of the Afrikaans system on the Dutch test data is the influence of domain transfer. The amount by which the system's performance decreases when it is evaluated on a test set from a different domain is investigated by Finn and Kushmerick [12] for a topic classification system. They observe a decrease of approximately 0.100 in precision (averaged over all the experiments across different domains) under domain transfer. ...
... These URLs were obtained by crawling repositories of known malicious URL examples (PhishTank) and offset against known benign URL examples (Open Directory). From this corpus, features were extracted as a normalised bag of words (Finn & Kushmerick, 2006) based on URL snippets. The features were constructed as follows: a hyperlink in an email can be set to display any text while still pointing towards a URL that is hidden from the user. ...
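The normalised bag-of-words construction over URL snippets can be sketched as follows; splitting on non-alphanumeric characters and using TF-IDF for the normalisation are assumptions about the general approach, and the URLs are invented examples.

```python
# Turn URL snippets into a normalised bag-of-words representation.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

urls = ["http://secure-login.example-bank.com.phish.ru/update/account",
        "https://www.wikipedia.org/wiki/Machine_learning"]

def url_tokens(url: str) -> str:
    # Split the URL on punctuation so host, path, and query parts become words.
    return " ".join(t for t in re.split(r"[^A-Za-z0-9]+", url) if t)

# TfidfVectorizer's L2 scaling provides the normalisation of the bag of words.
vec = TfidfVectorizer()
X = vec.fit_transform([url_tokens(u) for u in urls])
print(dict(zip(vec.get_feature_names_out(), X.toarray()[0].round(2))))
```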
... Second, the use of TC is economical both in terms of time and cost (Duriau et al., 2007). Third, many of the techniques that have been developed in TC, such as sentiment analysis (Pang & Lee, 2008), genre classification (Finn & Kushmerick, 2006), and sentence classification (Khoo, Marom, & Albrecht, 2006) seem particularly well suited to address contemporary organizational research questions. Fourth, the acceptance and broader use of TC within the organizational research community can stimulate the development of novel TC techniques. ...
Article
Full-text available
Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this paper is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. In order to help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the paper by discussing how researchers can validate a text classification model and the associated output.
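The paper's tutorial presents each step in R; as a language-neutral illustration, the sketch below mirrors the same sequence (training data preparation, preprocessing and transformation, classification, validation) in Python, with invented texts loosely echoing the paper's job-vacancy mining example.

```python
# The text classification process as roughly sequential steps.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Step 1: training data preparation (label a sample of documents).
texts = ["senior data analyst needed, SQL required",
         "we sell discount furniture",
         "hiring nurses for night shifts",
         "big sale on garden tools this week"]
labels = ["vacancy", "ad", "vacancy", "ad"]

# Steps 2-4: preprocessing + transformation (TF-IDF) and a classifier, chained.
model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                      LogisticRegression())

# Step 5: validation, here 2-fold cross-validated accuracy on the toy data.
print(cross_val_score(model, texts, labels, cv=2).mean())
```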
... It is mentioned, for example, that different genres elicit different social interactions between characters [6]. Also, topics often used for text categorization [7] are considered orthogonal to genres because documents addressing the same topic can be of a different genre [8], [9], [4], [5] and, in contrast to topics, a genre "describes something about what kind of document it is rather than what the document is about" [10]. Devitt [11] criticizes that genres are not about formal features or text classification and proposes a notion based on how humans experience the written text. ...
Article
Full-text available
The concept of genre is as old as literary theory itself, but centuries of debate haven't produced much consensus on the topic. Part of the reason is that genre looks like a different thing at different points in the life of a text. Scholars of rhetoric tend to focus on the patterns of communicative action that produce memoranda or tragedies. Sociologists are sometimes more interested in institutions that organize reception. Literary scholars, for their part, have traditionally been preoccupied with the patterning of the texts themselves. Of course, all of these aspects of genre are connected. But it's not easy to describe the connections.
... For instance, Chen et al. [36] examined the performance of several feature sets, such as genre-related information, text-based features and image-based features, for genre identification on document pages, classifying document genres with an ensemble learner. Similarly, Finn and Kushmerick [37] examined the performance of different feature sets and machine learning methods for automatic genre classification. In this scheme, bag of words, POS statistics and text statistics were considered as the feature sets. ...
Article
Full-text available
Text genre classification is the process of identifying functional characteristics of text documents. The immense quantity of text documents available on the web can be properly filtered, organised and retrieved with the use of text genre classification, which may have potential use on several other tasks of natural language processing and information retrieval. Genre may refer to several aspects of text documents, such as function and purpose. The language function analysis (LFA) concentrates on single aspect of genres and it aims to classify text documents into three abstract classes, such as expressive, appellative and informative. Text genre classification is typically performed by supervised machine learning algorithms. The extraction of an efficient feature set to represent text documents is an essential task for building a robust classification scheme with high predictive performance. In addition, ensemble learning, which combines the outputs of individual classifiers to obtain a robust classification scheme, is a promising research field in machine learning research. In this regard, this article presents an extensive comparative analysis of different feature engineering schemes (such as features used in authorship attribution, linguistic features, character n-grams, part of speech n-grams and the frequency of the most discriminative words) and five different base learners (Naïve Bayes, support vector machines, logistic regression, k-nearest neighbour and Random Forest) in conjunction with ensemble learning methods (such as Boosting, Bagging and Random Subspace). Based on the empirical analysis, an ensemble classification scheme is presented, which integrates Random Subspace ensemble of Random Forest with four types of features (features used in authorship attribution, character n-grams, part of speech n-grams and the frequency of the most discriminative words). For LFA corpus, the highest average predictive performance obtained by the proposed scheme is 94.43%.
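The winning configuration, a Random Subspace ensemble of Random Forests, can be approximated in a few lines: bagging with feature subsampling and no sample bootstrapping reduces to the Random Subspace method. The feature matrix and labels below are random placeholders standing in for the LFA corpus features, not the article's data.

```python
# Random Subspace ensemble of Random Forests, approximated with bagging.
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 40))    # 60 documents, 40 stylistic features (placeholder)
y = rng.integers(0, 3, 60)  # 3 language-function classes (placeholder)

# bootstrap=False with max_features < 1.0 makes bagging behave as Random
# Subspace: each base forest sees a random half of the feature set.
ensemble = BaggingClassifier(RandomForestClassifier(n_estimators=50),
                             n_estimators=10, max_features=0.5,
                             bootstrap=False).fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```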
... In the field of genre classification, there have been attempts to classify music by genre [14]-[16]. Genre classification has also been applied to paintings [10], [17] and text [18], [19]. Most of these methods use hand-designed features or features specific to the task. ...
Article
Full-text available
Book covers communicate information to potential readers, but can that same information be learned by computers? We propose using a deep Convolutional Neural Network (CNN) to predict the genre of a book based on the visual clues provided by its cover. The purpose of this research is to investigate whether relationships between books and their covers can be learned. However, determining the genre of a book is a difficult task because covers can be ambiguous and genres can be overarching. Despite this, we show that a CNN can extract features and learn underlying design rules set by the designer to define a genre. Using machine learning, we can bring the large amount of resources available to the book cover design process. In addition, we present a new challenging dataset that can be used for many pattern recognition tasks.
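A minimal sketch of the idea above: a small convolutional network mapping cover images to genre logits. This is not the paper's architecture; the 224x224 input size and the 30-genre output are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CoverCNN(nn.Module):
    """Toy CNN: two conv/pool stages, then a linear genre classifier."""
    def __init__(self, num_genres: int = 30):  # 30 genres is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 56 * 56, num_genres)  # 224 -> 56 after two pools

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = CoverCNN()(torch.randn(1, 3, 224, 224))  # one fake cover -> shape (1, 30)
```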
Article
Full-text available
Sentiment Analysis (SA) is an ongoing field of research in the data mining field. SA is the computational treatment of opinions, sentiments and the subjectivity of text. This review paper presents a comprehensive overview of the latest updates in this field. Many recently proposed algorithm enhancements and various SA applications are investigated and presented briefly in this study. These articles are categorized according to their contributions to the various SA techniques. The fields related to SA (transfer learning, emotion detection, and building resources) that have recently attracted researchers are discussed. The main aim of this review is to give a nearly full picture of SA techniques and the related fields with brief details. This study covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared with those that are already present in more traditional fact-based analysis. We include material on the summarization of evaluative text and on broader issues regarding privacy, manipulation, and the economic impact that the development of opinion-oriented access to user feedback data gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets,
Article
Full-text available
Sentiment Analysis (SA) is an ongoing field of research in the text mining field. SA is the computational treatment of opinions, sentiments and the subjectivity of text. This paper presents a comprehensive overview of the latest updates in this field. Many recently proposed algorithm enhancements and various SA applications are investigated and presented briefly in this study. These articles are categorized according to their contributions to the various SA techniques. The fields related to SA (transfer learning, emotion detection, and building resources) that have recently attracted researchers are discussed. The main aim of this paper is to give a nearly full picture of SA techniques and the related fields with brief details. This study covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared with those that are already present in more traditional fact-based analysis. We include material on the summarization of evaluative text and on broader issues regarding privacy, manipulation, and the economic impact that the development of opinion-oriented access to Twitter feedback data gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided. Keywords: sentiment analysis, methodologies, Twitter data, framework.

I. INTRODUCTION. With the increasing popularity of social networking, blogging and micro-blogging websites, a huge amount of informal subjective text is made available on the web every day. The information captured from these texts can be used for scientific surveys from a social or political point of view [9]. Companies and product owners who aim to improve their products or services may benefit greatly from this rich feedback [6], [15]. Customers, in turn, can learn how positively or negatively different features of products or services are judged in users' opinions, in order to make an informed purchase. Moreover, applications such as rating movies based on online movie reviews [12] could not exist without making use of this data. A sentiment summarization system takes as input a set of documents that contain opinions about some entity of interest. It processes all the given documents and produces a summary of them. This summary should represent the average opinion of all the documents and the important aspects of the target of SA addressed in those documents. This gives both customers and companies easy access to public opinion regarding particular items or products. There are two main approaches to generating textual summaries. The first, known as extractive summarization, extracts supposedly important parts of the exact texts in a corpus of documents and presents them as a summary of that corpus. The second, known as abstractive summarization, generates a textual summary in which the words used are not necessarily the ones used in the corpus. There has been a great deal of previous research on strategies for using web technology to increase the benefits of customers as well as companies in the marketplace [2].

Accordingly, this study pursues a similar objective by performing consumer-product SA and summarization in the domain of mobile phones on Twitter data. For this purpose, we manually annotated Twitter data for our experiments. Sentiment analysis is a task in which the dataset consists of emotions, attitudes or evaluations that reflect the way a human thinks [1]. Within a sentence, identifying the positive and the negative aspects is a very difficult task. The features used to classify the sentences should carry very strong adjectives in order to summarize the review. These texts are also written in varied styles that are not easily inferred by users or companies, making them hard to classify. Sentiment analysis helps users decide whether the information about a product is satisfactory before they acquire it. Marketers and firms use this analysis to understand their products or services so that these can be offered according to customers' needs. There are two kinds of machine learning techniques commonly used for sentiment analysis: unsupervised and supervised [2]. Unsupervised learning does not rely on classes and does not provide correct targets at all, and therefore performs clustering. Supervised learning relies on labeled datasets, so the labels are given to the model during the process. These labeled datasets are trained to produce reasonable outputs when encountered during
Article
Given the size of digital library collections and the inconsistencies in their genre‐related bibliographic metadata, as digital libraries grow and their contents are opened for computational analysis, finding materials of interest becomes a major challenge. This challenge increases for sub‐genres and other categories of text data that are less distinct from the whole. This project pilots machine learning methods and word feature analysis for identifying Black Fantastic genre texts within the HathiTrust Digital Library. These texts are sometimes referred to as “Afrofuturism” but more commonly today described as “Black Fantastic,” in which African Diaspora artists and creators engage with the intersections of race and technology in their works with a primary focus on world‐building. Black Fantastic texts pose a challenge to genre classification, as they incorporate aspects of science fiction and fantasy with typical characteristics of African Diaspora‐produced literature. This paper presents and reports on results from a pilot predictive modeling process to computationally identify Black Fantastic texts using curated word feature sets for each class of data: general English‐language fiction, Black‐authored fiction, and Black Fantastic fiction.
Article
Full-text available
Information about genres could improve information selection and search processes, especially in large document collections. Genre is assumed to reveal the intended use of a document within a given community of users. In this research we aim to find out which web genres are most used among users, at least in the context of one user group. To this end, we defined a context, the Facultad de Ciencias de la Documentación of the Universidad Complutense de Madrid, and a specific group of users at that school, who are not normally taken into account in other research on the subject. The web users interviewed were selected from among the people who study and work at the school. Participants were interviewed in sessions of approximately 15 minutes. They commented on the web pages or web resources they use regularly, explaining why they find them useful and how they use them. In a complementary step, the two researchers classified these web pages into web genres. We obtained a limited set of web genres, described in enough detail to reveal their intended uses and mutual relations. The web genres represent, to a certain extent, some of the information needs of the user groups considered. Likewise, the genres give us an idea of how information found on the web is used. This set of web genres can be extrapolated to other user communities to compare results. It can also serve as a basis for implementing algorithms that recognize only predetermined sets of web genres.
Chapter
The expansion of institutional repositories involves new challenges for autonomous agents that control the quality of semantic annotations in large amounts of scholarly knowledge. While evaluating metadata integrity in documents has already been widely tackled in the literature, the majority of frameworks are intractable when confronted with a big data environment. In this paper, we propose an optimal strategy based on feature engineering to identify spurious objects in large academic repositories. Through an application case dealing with a Brazilian institutional repository containing objects such as PhD theses and MSc dissertations, we use maximum likelihood estimation and bag-of-words techniques to fit a minimalist Bayesian classifier that can quickly detect inconsistencies in class assertions, achieving approximately 94% accuracy.
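A minimal sketch of the kind of bag-of-words Bayesian filter the chapter describes; the repository records and labels below are invented examples, not the authors' data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented metadata records; 1 marks a spurious class assertion.
records = ["phd thesis on tropical agriculture",
           "msc dissertation in applied physics",
           "untitled scanned object 0001"]
is_spurious = [0, 0, 1]

detector = make_pipeline(CountVectorizer(), MultinomialNB())
detector.fit(records, is_spurious)
print(detector.predict(["scanned object with no advisor field"]))
```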
Article
Music genre classification has its own popularity index in the present times, and machine learning can play an important role in music streaming tasks. This research article proposes a machine learning based model for the classification of music genre. The evaluation of the proposed model is carried out for different music genres: blues, metal, pop, country, classical, disco, jazz and hip-hop. The audio features utilized in this study include MFCC (Mel Frequency Cepstral Coefficients), Delta, Delta-Delta and temporal aspects for processing the data. The proposed model has been implemented in Python. The results reveal an SVM accuracy of 95%, and the proposed algorithm performs better than existing algorithms in terms of accuracy.
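A sketch of the MFCC/Delta feature pipeline the article describes, assuming librosa is available; the file paths and labels are hypothetical placeholders.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)            # first-order differences
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order differences
    # Collapse the temporal axis with means as a simple temporal summary.
    return np.concatenate([m.mean(axis=1) for m in (mfcc, delta, delta2)])

paths, genres = ["blues_01.wav", "metal_01.wav"], ["blues", "metal"]  # placeholders
X = np.vstack([mfcc_features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, genres)
```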
Article
Shlomo Argamon is Professor of Computer Science and Director of the Master of Data Science Program at the Illinois Institute of Technology (USA). In this article, he reflects on the current and potential relationship between register and the field of computational linguistics. He applies his expertise in computational linguistics and machine learning to a variety of problems in natural language processing. These include stylistic variation, forensic linguistics, authorship attribution, and biomedical informatics. He is particularly interested in the linguistic structures used by speakers and writers, including linguistic choices that are influenced by social variables such as age, gender, and register, as well as linguistic choices that are unique or distinctive to the style of individual authors. Argamon has been a pioneer in computational linguistics and NLP research in his efforts to account for and explore register variation. His computational linguistic research on register draws inspiration from Systemic Functional Linguistics, Biber’s multi-dimensional approach to register variation, as well as his own extensive experience accounting for variation within and across text types and authors. Argamon has applied computational methods to text classification and description across registers – including blogs, academic disciplines, and news writing – as well as the interaction between register and other social variables, such as age and gender. His cutting-edge research in these areas is certain to have a lasting impact on the future of computational linguistics and NLP.
Article
The world wide web acts as a cross-spherical public forum. Positive and negative opinions about organizations, as well as opinions about what an organization stands for, quickly diffuse across the different spheres of the web. Organizations are therefore no longer capable of deciding on their own which values and attributes their brand is thought to incorporate and what reputation they are associated with. In fact, the plurality of views and opinions that web users have about a firm forms a brand's perception. The social web, the news sphere and the static web (primarily the corporate website) are the sources that influence brand perception. Brand management as well as reputation management can therefore only be efficient and constructive when all meanings associated with the brand are broadly recorded and evaluated. The web provides ideal conditions for this purpose which, however, are constantly changing. The targeted use of the web is now a matter of course in a variety of sectors and enables quick and versatile queries. The resulting insight not only allows a comparison of external views with the self-image but can also create competitive advantages for an organization. These are derived from the possibilities of selective positioning, reduction of risks of reputation damage, and improved communication and control of the desired brand attributes. The overall objective of this dissertation is the development and evaluation of an instrument for marketing in higher education focused on Central European universities. This instrument can be used as a foundation for managing brand awareness, brand positioning and reputation. Higher education institutions, with their complex stakeholder architectures, multiple subcultures and egalitarian organizational structures, especially face the challenge of consistent and controlled brand management. A web-monitoring system, adapted to the specific requirements of the tertiary sector, was used for data collection. Seven Central European universities were selected as research objects: ETH Zurich (ETH), University of Zurich (UZH), University of Basel (uniba), University of St. Gallen (HSG), Ludwig Maximilian University of Munich (LMU), Karlsruhe Institute of Technology (KIT) and the Vienna University of Economics and Business (WU). They were analyzed with regard to brand awareness, brand positioning and reputation based on textual web information, drawing on the social web, online news and the static web. The analysis of brand awareness shows the following results for the seven universities: mainly online news adds to the visibility of a university brand, which means that cooperation between universities' PR teams and media organizations remains essential. A comparative examination shows that ETH and UZH clearly enjoy the strongest media presence, stemming not only from a strong presence in Swiss media but also from a notable frequency of references in German and Austrian news sites. The positioning analysis reveals a brand profile that seems characteristic of higher education institutions: it is dominated by the dimensions "research" and "studies", whereas the dimensions "further education", "networking" and "public services" have considerably lower values, meaning that the predominant share of web-based information associated with universities concerns research topics and conditions of study. The university profiles are thus close together and show low mutual differentiation; full-scale universities are particularly affected by this problem. The reputation analysis shows, based on a survey at the UZH, that students consider "media presence", "quality of academic teaching", "research performance", "awards", "scandals" and "quality of the university's administration" to be critical influences on reputation, albeit in varying degrees. The calculation of reputation indices for the seven universities revealed that ETH, followed by UZH and HSG, have the greatest potential to strengthen their reputation on the basis of the web information associated with them. The reputation analysis further indicates that a strong media presence seems to have a stabilizing influence on positive reputation, because the vast majority of media reports relating to a university discuss topics such as research performance (in particular study results), awards, distinctions and honours. At the same time, media organizations can severely damage a university's reputation when they report intensively and repeatedly on scandalous events. Until now it has been largely unclear how higher education institutions in the German-speaking area of Central Europe are represented on the web by the totality of web information associated with them. The results of this dissertation create transparency with regard to the web-based representation of brand knowledge, encompassing brand awareness, brand positioning and reputation, for the universities studied. Beyond this, relevant factors influencing these central marketing constructs are identified, and the possibilities and limits of web monitoring, specifically for use in higher education marketing, are evaluated. For the seven universities examined, various suggestions for practice were derived, including concrete recommendations for action, an information basis for benchmarking purposes and a methodology tailored to higher education marketing.
Article
Full-text available
We critically assess mainstream accounting and finance research applying methods from computational linguistics (CL) to study financial discourse. We also review common themes and innovations in the literature and assess the incremental contributions of work applying CL methods over manual content analysis. Key conclusions emerging from our analysis are: (a) accounting and finance research is behind the curve in terms of CL methods generally and word sense disambiguation in particular; (b) implementation issues mean the proposed benefits of CL are often less pronounced than proponents suggest; (c) structural issues limit practical relevance; and (d) CL methods and high quality manual analysis represent complementary approaches to analyzing financial discourse. We describe four CL tools that have yet to gain traction in mainstream AF research but which we believe offer promising ways to enhance the study of meaning in financial discourse. The four tools are named entity recognition (NER), summarization, semantics and corpus linguistics.
Article
An additional dimension that facilitates a swift and relevant response from a web search engine is to introduce a genre class for each web page. Web genre classification distinguishes between pages by means of features such as functionality, style, presentation layout, form and meta-content rather than content. In this work, nineteen web metrics are identified according to the lexical, structural and functionality attributes of the web page rather than its topic. The study is carried out to determine which of these attributes (lexical, structural and functionality), or which of their combinations, are significant for the development of a web genre classification model. We also investigate the best web genre prediction model using parametric (Logistic Regression), non-parametric (Decision Tree) and ensemble (Bagging, Boosting) machine learning algorithms. We built forty-two genre classification models to classify web pages into Movie, TV or Music genres using data extracted from Pixel Awards nominated and award-winning websites. Our results, obtained from the area-under-the-curve analysis of these forty-two models, show that the ensemble algorithms provide better performance. The rest of the models have acceptable performance only in cases where the lexical and structural attributes were fed in combination. Functionality metrics were found to considerably degrade the performance measure, irrespective of the algorithm used. The overall results of the study indicate the predictive capability of machine learning models for web genre classification, provided an appropriate choice is made in the selection of the input metrics.
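A rough sketch of the model comparison described above, using cross-validated AUC; the random matrix stands in for the nineteen page metrics, so the numbers it prints mean nothing.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 19))    # placeholder for the 19 lexical/structural/functionality metrics
y = rng.integers(0, 3, 60)  # placeholder for Movie / TV / Music labels

for model in (LogisticRegression(max_iter=1000),  # parametric
              DecisionTreeClassifier(),           # non-parametric
              BaggingClassifier(),                # ensemble
              AdaBoostClassifier()):              # ensemble
    auc = cross_val_score(model, X, y, scoring="roc_auc_ovr", cv=5).mean()
    print(type(model).__name__, round(auc, 3))
```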
Article
Sentiment analysis from text consists of extracting information about opinions, sentiments, and even emotions conveyed by writers towards topics of interest. It is often equated with opinion mining, but it should also encompass emotion mining. Opinion mining involves the use of natural language processing and machine learning to determine the attitude of a writer towards a subject. Emotion mining uses similar technologies but is concerned with detecting and classifying writers' emotions toward events or topics. Textual emotion-mining methods have various applications, including gaining information about customer satisfaction, helping to select teaching materials in e-learning, recommending products based on users' emotions, and even predicting mental-health disorders. In surveys on sentiment analysis, which are often old or incomplete, the strong link between opinion mining and emotion mining is understated. This motivates the need for a different and new perspective on the literature on sentiment analysis, with a focus on emotion mining. We present the state-of-the-art methods and propose the following contributions: (1) a taxonomy of sentiment analysis; (2) a survey on polarity classification methods and resources, especially those related to emotion mining; (3) a complete survey on emotion theories and emotion-mining research; and (4) some useful resources, including lexicons and datasets.
Article
Newspaper text can be broadly divided into the classes 'opinion' (editorials, commentary, letters to the editor) and 'neutral' (reports). We describe a classification system for performing this separation, which uses a set of linguistically motivated features. Working with various English newspaper corpora, we demonstrate that it significantly outperforms bag-of-lemma and PoS-tag models. We conclude that the linguistic features constitute the best method for achieving robustness against a change of newspaper or domain.
Chapter
The amount of literature and research produced about reputational risk in banking has grown rapidly (some of the contributions are: Fiordelisi, Soana and Schwizer, 2012; Gillet, Hubner and Plunus, 2010; Sturm, 2013) due to the obvious responsibilities of the banking and financial industry in the economic crises that have emerged since 2007. In banking studies attention has been paid to reputational damage stemming from operational risk events and losses: as often when debating risks in banking, more effort has been dedicated to measuring effects than to understanding the real determinants of risks and losses, and to offering suggestions about how to manage risks and their causes. Having noticed a lack of, or insufficient, information on corporate reputation (CR) and reputational risk (RR) in the banking industry in the mainstream literature, we try to go back to basics and justify, both theoretically and practically, the need for new approaches and practices. We think that it can be useful to pick up information on how stakeholders observe and exchange opinions about reputational facts and events connected with decision-making processes and actions inside banks.
Article
Full-text available
A simple method for categorizing texts into pre-determined text genre categories using the statistical standard technique of discriminant analysis is demonstrated with application to the Brown corpus. Discriminant analysis makes it possible to use a large number of parameters that may be specific to a certain corpus or information stream, and to combine them into a small number of functions, with the parameters weighted on the basis of how useful they are for discriminating text genres. An application to information retrieval is discussed.

Text Types. There are different types of text. Texts "about" the same thing may be in differing genres, of different types, and of varying quality. Texts vary along several parameters, all relevant for the general information retrieval problem of matching reader needs and texts. Given this variation, in a text retrieval context the problems are (i) identifying genres, and (ii) choosing criteria to cluster texts of the same genre, with predictable precision an...
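The core step, many stylistic parameters combined into a few weighted discriminant functions, can be sketched with scikit-learn's LDA; the feature matrix and genre labels here are synthetic placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.random((100, 20))    # placeholder: 20 stylistic parameters per text
y = rng.integers(0, 4, 100)  # placeholder: four Brown-style genre labels

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
discriminants = lda.transform(X)  # many parameters -> two weighted functions
print(discriminants.shape)        # (100, 2)
```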
Article
Full-text available
Most research on automated text categorization has focused on determining the topic of a given text. While topic is generally the main characteristic of an information need, there are other characteristics that are useful for information retrieval. In this paper we consider the problem of text categorization according to style. For example, in searching the web, we may wish to automatically determine whether a given page is promotional or informative, whether it was written by a native English speaker or not, and so on. Learning to determine the style of a document is dual to determining its topic, in that those document features which capture the style of a document are precisely those which are independent of its topic. We here define the features of a document to be the frequencies of each of a set of function words and part-of-speech triples. We then use machine learning techniques to classify documents. We test our methods on four collections of downloaded newspaper and magazine articl...
Article
Full-text available
In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written language. Using as testing ground a part of the Wall Street Journal corpus, we show that the most frequent words of the British National Corpus, representing the most frequent words of the written English language, are more reliable discriminators of text genre in comparison to the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role in terms of accurate text categorization as well as when dealing with training data of limited size.
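In the same spirit, a text can be represented by the relative frequencies of a fixed list of very common words and punctuation marks; the ten-item list below is a tiny stand-in for the most frequent words of the BNC.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in for the most frequent words of the written language.
common = ["the", "of", "and", "to", "a", "in", "it", "is", ",", "."]

vec = CountVectorizer(vocabulary=common,
                      token_pattern=r"[A-Za-z]+|[,.]")  # keep punctuation tokens
texts = ["The market, it is said, fell.", "A story of the sea and the sky."]
counts = vec.fit_transform(texts).toarray()
freqs = counts / counts.sum(axis=1, keepdims=True)  # style markers per text
```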
Article
Full-text available
We present a unique approach to identifying news stories that influence the behavior of financial markets. We describe the design and implementation of AEnalyst, a system for predicting trends in stock prices based on the content of news stories that precede the trends. We identify trends in time series using piecewise linear fitting and then assign labels to the trends according to an automated binning procedure. We use language models to represent patterns of language that are highly associated with particular labeled trends. AEnalyst can then identify news stories that are highly indicative of future trends. We evaluate the system in terms of its ability to predict forthcoming trends in stock prices. We perform a market simulation and demonstrate that AEnalyst is capable of producing profits that are significantly higher than random.
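The trend-labeling step can be sketched as follows: fit a line to each window of the price series and bin the slope. The segment length and thresholds are invented, and the simple binning stands in for the paper's automated procedure.

```python
import numpy as np

def label_trends(prices: np.ndarray, seg_len: int = 20) -> list:
    labels = []
    for start in range(0, len(prices) - seg_len + 1, seg_len):
        t = np.arange(seg_len)
        slope = np.polyfit(t, prices[start:start + seg_len], 1)[0]  # linear fit
        labels.append("surge" if slope > 0.5 else
                      "plunge" if slope < -0.5 else "flat")
    return labels

prices = np.cumsum(np.random.default_rng(2).normal(size=100)) + 100  # fake series
print(label_trends(prices))
```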
Article
Full-text available
This paper describes how collection-specific, empirically defined, stylistics-based genre prediction can be brought together with rapid topical clustering to build an interactive information retrieval interface with multi-dimensional presentation of search results. The prototype presented addresses two specific problems of information retrieval: how to enrich the information-seeking dialog by encouraging and supporting iterative refinement of queries, and how to enrich the document representation past the shallow semantics allowed by term frequencies.

Searching For More Than Words. Today's tools for searching information in a document database are based on term occurrence in texts. The searcher enters a number of terms, and a number of documents where those terms or closely related terms appear comparatively frequently are retrieved and presented by the system in list form. This method works well up to a point. It is intuitively understandable, and for competent users and well e...
Article
Full-text available
As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.
Article
A discussion on various experiments to utilize stylistic variation among texts for information retrieval purposes.
Article
The self-organizing map, an architecture suggested for artificial neural networks, is explained by presenting simulation experiments and practical applications. The self-organizing map has the property of effectively creating spatially organized internal representations of various features of input signals and their abstractions. One result of this is that the self-organization process can discover semantic relationships in sentences. Brain maps, semantic maps, and early work on competitive learning are reviewed. The self-organizing map algorithm (an algorithm which orders responses spatially) is reviewed, focusing on best-matching-cell selection and adaptation of the weight vectors. Suggestions for applying the self-organizing map algorithm, demonstrations of the ordering process, and an example of hierarchical clustering of data are presented. Fine-tuning the map by learning vector quantization is addressed. The use of self-organizing maps in practical speech recognition and a simulation experiment on semantic mapping are discussed.
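The two steps reviewed above, best-matching-cell selection and weight adaptation, can be condensed into a toy self-organizing map; the grid size, learning rate and neighborhood radius are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
grid_w, grid_h, dim = 10, 10, 3
weights = rng.random((grid_w, grid_h, dim))  # one weight vector per map cell
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                              indexing="ij"), axis=-1)

def train_step(x, lr=0.1, radius=2.0):
    # 1) best matching cell: the unit whose weight vector is closest to x
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(dists.argmin(), dists.shape)
    # 2) adaptation: pull the winner and its grid neighbours towards x
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))[..., None]
    weights[:] += lr * h * (x - weights)

for x in rng.random((500, dim)):  # toy input signals
    train_step(x)
```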
Book
The Self-Organising Map (SOM) algorithm was introduced by the author in 1981. Its theory and many applications form one of the major approaches to the contemporary artificial neural networks field, and new technologies have already been based on it. The most important practical applications are in exploratory data analysis, pattern recognition, speech analysis, robotics, industrial and medical diagnostics, instrumentation and control, and literally hundreds of other tasks. In this monograph the mathematical preliminaries, background, basic ideas, and implications are expounded in a manner which is accessible without prior expert knowledge.
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
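The suffix-stripping algorithm described here has been reimplemented many times; one convenient way to try it is NLTK's implementation.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["connection", "connected", "connecting"]])
# -> ['connect', 'connect', 'connect']
```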
Article
Categorization of text in IR has traditionally focused on topic. As use of the Internet and e-mail increases, categorization has become a key area of research as users demand methods of prioritizing documents. This work investigates text classification by format style, i.e. "genre", and demonstrates that, by complementing topic classification, it can significantly improve retrieval of information. The paper compares the use of presentation features to word features, and the combination thereof, using Naïve Bayes, C4.5 and SVM classifiers. Results show that the use of combined feature sets with SVM yields 92% classification accuracy in sorting seven genres.
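A sketch of combining presentation features with word features, as above; the HTML-derived counts are invented stand-ins for the paper's presentation features, and a linear SVM replaces its SVM setup.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

pages = ["<h1>Sale!</h1> buy now", "<p>The method is described below.</p>"]
genres = ["promotional", "article"]  # invented labels

words = TfidfVectorizer().fit_transform(pages)        # word features
presentation = csr_matrix([[p.count("<h1>"), p.count("<p>"), p.count("!")]
                           for p in pages])           # toy presentation features
clf = LinearSVC().fit(hstack([words, presentation]), genres)
```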
Article
We report on our ongoing study of using the genre of Web pages to facilitate information exploration. By genre, we mean socially recognized regularities of form and purpose in documents (e.g., a letter, a memo, a research paper). Our study had three phases. First, through a user study, we identified the genres which most and least frequently meet searchers' information needs. We found that certain genres are better suited to certain types of needs. We then identified five major groups of document genres that might be used in an interactive search tool that would allow genre-based navigation, trying to balance two objectives: 1) each group should be recognizable by a computer algorithm as easily as possible; and 2) each group should have a good chance of satisfying particular types of information needs. Finally, we developed a novel user interface for web searching that allows genre-based navigation through three major functionalities: 1) limiting search to specified genres; 2) visualizing the hierarchy of genres discovered in the search results; and 3) accepting user feedback on the relevancy of the specified genres.
Conference Paper
With the number and types of documents in digital library systems increasing, tools for automatically organizing and presenting the content have to be found. While many approaches focus on topic-based organization and structuring, hardly any system incorporates automatic structural analysis and representation. Yet genre information (unconsciously) forms one of the most distinguishing features in conventional libraries and in information searches. In this paper we present an approach to automatically analyze the structure of documents and to integrate this information into an automatically created content-based organization. In the resulting visualization, documents on similar topics, yet representing different genres, are depicted as books in differing colors. This representation supports users intuitively in locating relevant information presented in a relevant form.
Article
Purpose The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. This work was originally published in Program in 1980 and is republished as part of a series of articles commemorating the 40th anniversary of the journal. Design/methodology/approach An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Findings Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length. Originality/value The piece provides a useful historical document on information retrieval.
Article
On permanent loan to the Publications Unit. Incl. a biographical note on Noah Webster, 16 October 1758 - 28 May 1843
Article
Subjectivity tagging is distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. There are numerous applications for which subjectivity tagging is relevant, including information extraction and information retrieval. This paper identifies strong clues of subjectivity using the results of a method for clustering words according to distributional similarity (Lin 1998), seeded by a small amount of detailed manual annotation. These features are then further refined with the addition of lexical semantic features of adjectives, specifically polarity and gradability (Hatzivassiloglou & McKeown 1997), which can be automatically learned from corpora. In 10-fold cross validation experiments, features based on both similarity clusters and the lexical semantic features are shown to have higher precision than features based on each alone.
Article
Most recent research in trainable part-of-speech taggers has explored stochastic tagging. While these taggers obtain high accuracy, linguistic information is captured indirectly, typically in tens of thousands of lexical and contextual probabilities. In (Brill 1992), a trainable rule-based tagger was described that obtained performance comparable to that of stochastic taggers, but captured relevant linguistic information in a small number of simple non-stochastic rules. In this paper, we describe a number of extensions to this rule-based tagger. First, we describe a method for expressing lexical relations in tagging that stochastic taggers are currently unable to express. Next, we show a rule-based approach to tagging unknown words. Finally, we show how the tagger can be extended into a k-best tagger, where multiple tags can be assigned to words in some cases of uncertainty.
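NLTK ships a trainer for exactly this family of taggers; here is a toy sketch (the two-sentence corpus is invented, and real use would start from a larger tagged corpus).

```python
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]  # toy tagged corpus

initial = UnigramTagger(train, backoff=DefaultTagger("NN"))  # starting tagger
tagger = BrillTaggerTrainer(initial, fntbl37()).train(train, max_rules=10)
print(tagger.tag(["the", "dog", "sleeps"]))
```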
West, S. (Director) (2001). Tomb raider [Motion picture]. United Kingdom: Paramount Pictures.
Dewdney, N., VanEss-Dykema, C., & McMillan, R. (2001). The form is the substance: Classification of genres in text. Paper presented at the Workshop on Human Language Technology and Knowledge Management, Conference of the Association for Computational Linguistics.
Argamon, S., Koppel, M., & Avneri, G. (1998). Routing documents according to style. First International Workshop on Innovative Information Systems, Boston, MA.
Brill, E. (1994). Some advances in transformation-based part of speech tagging. Proceedings of the 12th National Conference on Artificial Intelligence (pp. 722–727).