
Text Classification for Monolingual Political Manifestos with Words Out of Vocabulary

Arsenii Rasov¹, Ilya Obabkov¹, Eckehard Olbrich² and Ivan P. Yamshchikov²
¹ Ural Federal University, Mira Street, 19, Yekaterinburg, Russia
² Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22, Leipzig, Germany
Keywords: Electoral Programs, Text Corpus, Classification of Political Texts.
Abstract: In this position paper, we implement an automatic coding algorithm for electoral programs from the Manifesto Project Database. We propose a new approach that handles words that are out of the training vocabulary by replacing them with the words from the training vocabulary that are their closest neighbors in the space of word embeddings. A set of simulations demonstrates that the proposed algorithm achieves classification accuracy comparable to the state-of-the-art benchmarks for monolingual multi-label classification. The agreement level of the algorithm is comparable with that of manual labeling. The results for a broad set of model hyperparameters are compared to each other.
Computational social science is a field that leverages the capacity to collect and analyze data at scale. One hopes that automated analysis of such data may reveal patterns of individual and group behavior (Lazer et al. (2009)). Analysis of political discourse is one of the prominent fields where data analysis overlaps with sociology, history, and political science. Scientists study electoral processes and the interactions of political actors with one another and with the public. In these works, researchers use different types of data that could describe such processes. However, the demand for well-annotated, high-quality datasets is continuously higher than the supply of new data.
Political scientists are in general need of annotated datasets to do their work. The annotation can be document-wise, matching whole documents with specific categorical labels that capture the document's basic idea, or sentence-wise, matching each sentence with a particular label.
There are many widely used sources that provide different types of political data. Some researchers use data from social networks, such as Twitter¹. For those who are interested in parliamentary debates, there are such projects as the EuroParl corpus (Koehn (2005)), Linked EP², and the ConVote dataset (Gentzkow et al. (2018)).
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 732942.
¹ UK MPs:
One of the most popular corpora in political science is the Manifesto Project (Lehmann et al. (2018)). It is a large, human-annotated, open-access, cross-national text corpus that consists of electoral programs. Here, experts carried out human annotation (the so-called "coding") based on content analysis of the electoral programs. The sentences are divided into statements (quasi-sentences). Each quasi-sentence is coded with one of 57 categories (which, in turn, form 7 broad topics). Currently, the corpus includes more than 2300 machine-readable documents, more than 1150 of which are already coded. There are about 1 000 000 coded quasi-sentences in the corpus (Volkens et al. (2015)).
The annotation process in Manifesto is very challenging. It is carried out by groups of experts specially trained to perform such labeling. The process is time-consuming, labor-intensive, and expensive. Moreover, it is not a trivial task to label each quasi-sentence with only one of 57 categories; indeed, the level of agreement among the experts is only about 50% (Mikhaylov et al. (2012)).
One way to overcome those challenges is to use supervised machine learning algorithms for quasi-sentence classification. For a long time, text classification was perceived as a monolingual task. However, more and more recent results treat it as a multilingual one. Monolingual methods are trained on data in one language only. Such methods can be used when there is enough monolingual data for training. Naturally, since some languages are scarcely represented, one would like to transfer some information learned from the data-rich languages to the more challenging ones. Methods that do so are called multilingual. The majority of these methods are based on the idea that one can construct a specific semantic space and embed texts in different languages into this one shared space. Training language-specific embedding algorithms, in this case, could have nothing to do with the classification task per se. However, the joint multilingual embedding space, equipped with a particular measure of semantic similarity, can be used for classification purposes. Such methods are harder to train but can be useful if we are interested in languages that are underrepresented in the training data. In the next section, we briefly review the latest multi- and monolingual results relevant to our project.

Rasov, A., Obabkov, I., Olbrich, E. and Yamshchikov, I.
Text Classification for Monolingual Political Manifestos with Words Out of Vocabulary.
DOI: 10.5220/0009792101490154
In Proceedings of the 5th International Conference on Complexity, Future Information Systems and Risk (COMPLEXIS 2020), pages 149-154
ISBN: 978-989-758-427-5
Copyright © 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

There are several baselines for the classification of Manifesto texts that vary across different formulations of the classification task. One could split the works in this area into two large groups: the researchers that build algorithms for seven coarse high-level topics and the researchers that classify individual labels.
For the classification of seven high-level topics, one of the baselines is (Glavaš et al. (2017)). Here, the authors implement multilingual text classification using convolutional neural networks to match the sentences of a given manifesto with seven coarse-grained classes. They outperform the state of the art for Italian, French, and English. In a monolingual setting, (Zirn et al. (2016)) present a method that combines a topic-classification method and a topic-shift method using a Markov logic network for seven coarse-grained categories, reaching a macro-averaged F1-score of 74.9%.
The classification of individual labels is more challenging due to the lack of training data and the sheer fact that it is typically harder to build classification algorithms with more categories. (Subramanian et al. (2017)) use a joint sentence-document model for both sentence-level and document-level classification. They propose a multilingual neural-network-based approach for fine-grained sentence classification and demonstrate state-of-the-art quality for different languages. In (Subramanian et al. (2018)), the authors improve their performance using a hierarchical bidirectional LSTM approach.
The current state-of-the-art benchmark for Manifesto quasi-sentence classification on 57 fine-grained labels is presented in (Merz et al. (2016)). The authors describe an approach to monolingual text classification using the SVM algorithm. They report 42% accuracy for German manifestos.
In this research, we propose a method that outperforms the (Merz et al. (2016)) benchmark for 57 labels. We also modify the experimental conditions to make them more similar to real conditions and address the out-of-vocabulary words problem, which we describe in detail below.
Since some labels are under-represented in the training sample, it is hard to balance the training, and it is futile to expect that a multi-parameter model such as a deep neural network could be trained on such scarce data in a monolingual setting. In this work, we suggest focusing on basic machine learning methods that are robust to variation in the sizes of the training categories. Further, we experiment with support vector machines (Vapnik, V. (1998)) and gradient boosting (Chen, T. and Guestrin, C. (2016)).
3.1 Training
We perform the following preprocessing of the manifesto data. We remove punctuation, split all sentences into lists of separate words, and remove stop-words.
We train a tf-icf matrix to vectorize each word in a quasi-sentence. Tf-icf is a supervised version of tf-idf, which includes supervised term-weighting, see (Lan et al. (2009)). In the tf-icf scheme, we build a term-category matrix instead of a term-document one. To do that, we join all quasi-sentences of each class into separate new documents and train the tf-icf matrix on them.
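The term-category construction described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the weighting tf × log(C / cf), with C the number of categories and cf the number of categories that contain the term, is one common form of icf and is assumed here; the function name `build_tf_icf`, the labels, and the toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_tf_icf(sentences, labels):
    """Return a {term: {category: weight}} matrix from tokenized sentences."""
    # Join all sentences of one category into a single pseudo-document,
    # represented here by its term counts.
    counts = defaultdict(Counter)
    for tokens, label in zip(sentences, labels):
        counts[label].update(tokens)

    n_categories = len(counts)
    # Category frequency: the number of categories in which a term occurs.
    cf = Counter(term for term_counts in counts.values() for term in term_counts)

    matrix = defaultdict(dict)
    for label, term_counts in counts.items():
        for term, tf in term_counts.items():
            # Assumed weighting: tf * log(C / cf).
            matrix[term][label] = tf * math.log(n_categories / cf[term])
    return dict(matrix)


# Hypothetical toy data: three tokenized quasi-sentences with two labels.
sentences = [["lower", "taxes"], ["raise", "taxes"], ["protect", "forests"]]
labels = ["economy", "economy", "environment"]
tf_icf = build_tf_icf(sentences, labels)
```

Under this assumed weighting, a term that occurs in every category receives weight zero, so only category-specific terms contribute to the classification.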
Then, for each sentence, we create weighted one-hot vectors using the scheme proposed in (Merz et al. (2016)): we take a sum of the weighted one-hot vectors of the target sentence (weighted by 1/2) and the vectors of the 4 nearest sentences (weighted by 1/3 and 1/6 with respect to their distance to the target sentence). In this work, we also experiment with different sizes of this kernel: three, five, and seven sentences. After multiplying such vectors by the tf-icf matrix, we obtain 57-dimensional vectors. We also experiment with different sizes of n-grams (uni- and bigrams). This way, one could hope to retrieve more information about the context by treating more than one word as a unit of meaningful information.

Table 1: Various results for English, German and Spanish. A longer window of seven sentences seems to yield better results. Unigram-based method outperforms bigrams in

range  n-gram  Metric       Eng    Ger    Span
7      1       accuracy     0.485  0.438  0.461
               correlation  0.876  0.891  0.617
7      2       accuracy     0.484  0.437  0.453
               correlation  0.875  0.890  0.601
5      1       accuracy     0.471  0.425  0.468
               correlation  0.863  0.890  0.690
5      2       accuracy     0.481  0.431  0.450
               correlation  0.878  0.892  0.605
3      1       accuracy     0.468  0.409  0.446
               correlation  0.878  0.892  0.636
3      2       accuracy     0.468  0.408  0.438
               correlation  0.878  0.891  0.616

Finally, we train a machine learning algorithm using the obtained matrix as input and the labels as a target.
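The context weighting of Section 3.1 can be illustrated with a short sketch for the window of five sentences, the only window for which the text states the weights (1/2, 1/3, 1/6); the weights for windows of three and seven sentences would have to be assumed. The function names, the dictionary-based matrix, and the toy data are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Weights by distance to the target sentence for a window of five sentences:
# 1/2 for the target, 1/3 for direct neighbours, 1/6 for the next ones.
WEIGHTS = {0: 1 / 2, 1: 1 / 3, 2: 1 / 6}


def context_vector(sentences, index, weights=WEIGHTS):
    """Weighted bag of words for sentence `index` and its neighbours."""
    vector = Counter()
    for distance, weight in weights.items():
        # The set collapses to a single position when distance is zero.
        for j in {index - distance, index + distance}:
            if 0 <= j < len(sentences):
                for token in sentences[j]:
                    vector[token] += weight
    return vector


def category_scores(vector, tf_icf):
    """Multiply the weighted term vector by a {term: {category: w}} matrix."""
    scores = Counter()
    for term, value in vector.items():
        # Terms missing from the matrix contribute nothing (zero row).
        for category, w in tf_icf.get(term, {}).items():
            scores[category] += value * w
    return scores


# Hypothetical toy document and term-category weights.
document = [["taxes"], ["forests"], ["taxes", "jobs"]]
toy_tf_icf = {
    "taxes": {"economy": 1.0},
    "jobs": {"economy": 2.0},
    "forests": {"environment": 1.0},
}
vector = context_vector(document, 1)
scores = category_scores(vector, toy_tf_icf)
```

In the setting of the paper the resulting score vector would have 57 entries, one per category, and would serve as the input to the classifier.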
3.2 Reproducing Experiments
(Merz et al. (2016)) also use the supervised tf-icf vectorization. In their experiments, the authors train the final tf-icf-based matrix on the whole dataset, including both the train and the test parts. Then they train the ML algorithm on the train part of the dataset and benchmark it on the test set. Here, we first reproduce that experiment with various parameters.
Since the data in the Manifesto dataset is historical, it makes sense to train the algorithm on older documents and test the resulting quality on newer ones. Here, we use the documents of the most recent year in the dataset as a test set. These are the year 2017 for German and 2016 for English and Spanish.
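The temporal split described above can be sketched in a few lines. The record layout (a dictionary with a `year` field) and the function name are assumptions made for illustration only.

```python
def split_by_latest_year(documents):
    """Hold out all documents of the most recent year as the test set."""
    latest = max(doc["year"] for doc in documents)
    train = [doc for doc in documents if doc["year"] < latest]
    test = [doc for doc in documents if doc["year"] == latest]
    return train, test


# Hypothetical document records; real records would also carry the text.
docs = [
    {"id": "a", "year": 2009},
    {"id": "b", "year": 2013},
    {"id": "c", "year": 2017},
    {"id": "d", "year": 2017},
]
train_docs, test_docs = split_by_latest_year(docs)
```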
We use accuracy as a quality metric for our experiments. It is analogous to the agreement level for human coders and makes it possible to compare the classification quality to human annotation. As another quality metric, we use the document-wise Pearson correlation between human-annotated categories and algorithm-annotated ones, proposed in (Merz et al. (2016)). This metric helps to estimate the similarity of code assignment at the aggregate level. The results of the experiments are shown in Table 1.
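Both metrics can be sketched as follows. The sketch assumes that the document-wise correlation is computed between the per-category frequencies of the human and the automatic coding within one document; this reading, the function names, and the toy labels are assumptions rather than the exact procedure of (Merz et al. (2016)).

```python
import math
from collections import Counter


def accuracy(gold, predicted):
    """Share of quasi-sentences whose predicted code matches the human code."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def frequency_correlation(gold, predicted, categories):
    """Correlate category frequencies of human and automatic coding."""
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    n = len(gold)
    return pearson(
        [gold_counts[c] / n for c in categories],
        [pred_counts[c] / n for c in categories],
    )


# Hypothetical codes for one document with four quasi-sentences.
categories = ["economy", "environment", "welfare"]
gold = ["economy", "economy", "environment", "welfare"]
predicted = ["economy", "environment", "environment", "welfare"]
acc = accuracy(gold, predicted)
corr = frequency_correlation(gold, predicted, categories)
```

The two metrics answer different questions: accuracy compares codes sentence by sentence, while the correlation only compares how often each category is assigned overall.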
Figure 1 shows scatter plots of the frequencies of all manually assigned categories versus the automatically assigned ones. The plots are drawn for the best performing models in English, German, and Spanish, respectively.

Table 2: Various results for English, German, and Spanish without out-of-vocabulary words. Bigrams with longer window kernel demonstrate higher accuracy across all languages.

range  n-gram  Metric       Eng    Ger    Span
7      1       accuracy     0.430  0.368  0.434
               correlation  0.866  0.878  0.604
7      2       accuracy     0.430  0.368  0.435
               correlation  0.866  0.877  0.606
5      1       accuracy     0.427  0.364  0.430
               correlation  0.866  0.880  0.611
5      2       accuracy     0.427  0.364  0.430
               correlation  0.867  0.880  0.611
3      1       accuracy     0.416  0.345  0.418
               correlation  0.867  0.878  0.638
3      2       accuracy     0.416  0.354  0.418
               correlation  0.867  0.878  0.643

For German and English, the highest agreement with human annotators is achieved when including bigrams in the tf-icf vocabulary. The accuracy and the correlation score for German texts outperform the state-of-the-art ones (0.42 and 0.88, (Merz et al. (2016))). The accuracy for English and Spanish is comparable to the state-of-the-art models.
3.3 Out-of-Vocabulary Words
Due to the supervised nature of the tf-icf algorithm, it is fair to say that in real-life conditions one does not have annotations for new historical data. One has to classify these new data as they arrive. That means that the method described above can only be partially reproduced: one cannot build a complete tf-icf matrix that would include every word in the new data, since some of the words may not occur in the training dataset. These out-of-vocabulary words constitute a significant portion of the vocabulary that cannot be ignored. If we use the latest datasets for the test, there are 3485, 3266, and 8018 out-of-vocabulary (O-o-V) words for the German, English, and Spanish datasets, respectively. Table 2 shows that initializing O-o-V words with zeros drastically reduces the quality of the classification.
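Identifying the out-of-vocabulary words amounts to a set difference between the test and the training vocabularies. The following sketch uses hypothetical tokenized sentences and a hypothetical function name.

```python
def out_of_vocabulary(train_sentences, test_sentences):
    """Return the words that occur in the test data but not in the training data."""
    train_vocab = {token for sentence in train_sentences for token in sentence}
    test_vocab = {token for sentence in test_sentences for token in sentence}
    return test_vocab - train_vocab


# Hypothetical tokenized quasi-sentences.
train = [["lower", "taxes"], ["protect", "forests"]]
test = [["lower", "levies"], ["protect", "woodland"]]
oov_words = out_of_vocabulary(train, test)
```

Without further treatment, each of these words has no row in the tf-icf matrix and therefore contributes a zero vector.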
One should notice here that without any information on the out-of-vocabulary words, the best accuracy is achieved on a bigger kernel with bigrams. This stands to reason: due to the absence of information on new words that were not observed in the training set, the model needs to rely on a broader context to achieve higher accuracy. Table 3 compares the accuracy of the model with a full tf-icf matrix (with bigrams and kernel size 7) and the same model without information on the O-o-V words. There is a drastic decline in accuracy for all three languages.

Figure 1: Comparison of code frequencies of 57 categories, trained on the whole dataset, in six electoral programs by human and semi-automatic coding for Spanish, English and German texts respectively.

Table 3: Overview of the change in accuracy for the algorithm that does not use words out-of-vocabulary.

Language  Accuracy with full tf-icf matrix  Drop in accuracy without words O-o-V
English   0.485                             11.3%
German    0.436                             15.6%
Spanish   0.461                             5.6%
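
The drops in Table 3 appear to be relative to the accuracy with the full tf-icf matrix. The short check below assumes this reading and takes the accuracies without O-o-V information from the kernel-7 bigram rows of Table 2; under these assumptions it reproduces the reported percentages.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Accuracy with the full tf-icf matrix (Table 3) and without
# information on O-o-V words (Table 2, kernel size 7, bigrams).
full = {"English": 0.485, "German": 0.436, "Spanish": 0.461}
no_oov = {"English": 0.430, "German": 0.368, "Spanish": 0.435}

drops = {
    language: round(-relative_change(full[language], no_oov[language]), 1)
    for language in full
}
print(drops)  # {'English': 11.3, 'German': 15.6, 'Spanish': 5.6}
```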
To overcome this problem, we propose a specific method of word replacement based on FastText word embeddings (Bojanowski et al. (2016)). One can use pre-trained Wikipedia FastText vectors (Verberne et al. (2018)) for all of the words in our dataset and replace out-of-vocabulary words with the closest ones from the training set, using the cosine distance between FastText vectors as a distance metric. This manipulation helps to keep part of the information that comes with the out-of-vocabulary words intact during the vectorization process. Table 4 shows the results for various parameters of the algorithm across all three languages.
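The replacement step can be sketched as a nearest-neighbor search under cosine similarity. In the sketch below, the `embeddings` dictionary is a hypothetical stand-in for pre-trained FastText vectors (loading real vectors is out of scope), the two-dimensional vectors are made up, and the function names are illustrative; it is not the code used in the experiments.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def replace_oov(tokens, train_vocab, embeddings):
    """Replace each O-o-V token with its closest training-vocabulary word."""
    candidates = [w for w in sorted(train_vocab) if w in embeddings]
    replaced = []
    for token in tokens:
        if token in train_vocab or token not in embeddings:
            # Known words, and words without any vector, are kept as they are.
            replaced.append(token)
            continue
        nearest = max(
            candidates,
            key=lambda w: cosine_similarity(embeddings[token], embeddings[w]),
            default=token,
        )
        replaced.append(nearest)
    return replaced


# Made-up two-dimensional vectors standing in for FastText embeddings.
embeddings = {
    "taxes": [1.0, 0.0],
    "forests": [0.0, 1.0],
    "levies": [0.9, 0.1],
    "woodland": [0.2, 0.8],
}
train_vocab = {"taxes", "forests"}
result = replace_oov(["levies", "and", "woodland"], train_vocab, embeddings)
```

Maximizing cosine similarity is equivalent to minimizing cosine distance. The linear scan over the training vocabulary is kept for clarity; with vocabularies of tens of thousands of words, a vectorized or approximate nearest-neighbor search would be the more practical choice.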
However, again the best accuracy is achieved using bigrams and the kernel of size 7 for all languages. Figure 2 shows scatter plots of the frequencies of all manually assigned categories versus the automatically assigned ones. The plots are drawn for the best performing models in English, German, and Spanish, respectively.

Table 4: Various results for English, German and Spanish with out-of-vocabulary words replacements. Bigrams with longer window kernel demonstrate higher accuracy across all languages.

range  n-gram  Metric       Eng    Ger    Span
7      1       accuracy     0.426  0.371  0.448
               correlation  0.860  0.879  0.646
7      2       accuracy     0.426  0.371  0.449
               correlation  0.858  0.879  0.648
5      1       accuracy     0.423  0.368  0.444
               correlation  0.858  0.878  0.649
5      2       accuracy     0.424  0.368  0.444
               correlation  0.858  0.879  0.651
3      1       accuracy     0.412  0.351  0.431
               correlation  0.859  0.880  0.685
3      2       accuracy     0.413  0.352  0.432
               correlation  0.859  0.879  0.685

Figure 2: Comparison of code frequencies of 57 categories, trained only on the training part with the word embeddings for the out-of-vocabulary words, in six electoral programs by human and semi-automatic coding for Spanish, English, German texts respectively.

Table 5 shows the relative accuracy improvement when O-o-V words are substituted with their nearest neighbors in the FastText embeddings.

Table 5: Overview of the results for the algorithms that use FastText nearest neighbours instead of the words out-of-vocabulary. Performance varies across the languages.

Language  Accuracy without O-o-V  Change with O-o-V repl.  # of O-o-V in test  Vocab. size
English   0.426                   -0.9%                    3 485               52 949
German    0.372                   +1.1%                    3 266               24 227
Spanish   0.449                   +3.2%                    8 018               49 969

Looking at Table 5, one can see that replacing the out-of-vocabulary words with their nearest FastText neighbors that are included in the training dataset can partially address the problem. Moreover, the more out-of-vocabulary words there are in the test dataset, the better such replacement performs. Indeed, Table 5 shows that the accuracy improves noticeably (+3.2%) for Spanish, which has more than twice as many out-of-vocabulary words in the test set. In contrast, for German and English, the accuracy changes by only about one percent (and is even lower for English than for the model that omits O-o-V words altogether).
However, the experiments clearly show that there is a need to analyze a more extended context for better label classification. With the current amount of monolingual data, there is little one can do to broaden the context used by the models. We believe that further accuracy improvements could be achieved with multilingual models with attention that could leverage the varying importance of words within different topics.
The achieved results are promising, considering the complexity of the category scheme. Indeed, the human coders' agreement level is only about 50% compared to a master copy (Lacewell and Werner (2013)). However, this level of accuracy does not allow the real-world task to be fully automated.
It is also important to note here that some quasi-sentences may contain more than one category. For coarse-grained labels, this is not a common problem, because the labels already include a variety of topics, but for the fine-grained labels, it is a real challenge. In this case, the current labeling scheme is difficult to reproduce by machine learning algorithms.
One possible way to modify the annotation process is to assign more than one label to a sample if needed. This should decrease ambiguity in the human coding process and, therefore, increase the machine-classification quality. Another idea is to change the structure of the labels themselves to decrease overlap.
This paper implements a classification algorithm for electoral programs from the Manifesto Project Database. A new approach is proposed to overcome the problem of words that are out of the training vocabulary. The algorithm demonstrates accuracy comparable to the state-of-the-art benchmarks for multi-label classification. The algorithm is tested on different languages, showing its applicability, and on different sizes of the kernel (the weighting scheme of (Merz et al. (2016))). The experiments show that a longer textual context is useful for the classification.
Authors are grateful to Oleg Gluhih, Maxim Gnativ
and Alexei Postnikov for their help and constructive
Chen T. and Guestrin C. (2016) XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Vapnik V. (1998) The support vector method of function estimation. Nonlinear Modeling, 55–85, Springer.
Lazer D., Pentland A., Adamic L., Aral S., Barabási A.L., Brewer D., Christakis N., Contractor N., Fowler J., Gutmann M., Jebara T., King G., Macy M., Roy D., Van Alstyne M. (2009) Computational Social Science. Science, vol. 323, issue 5915, 721–723.
Koehn P. (2005) Europarl: A Parallel Corpus for Statistical Machine Translation. MT Summit.
Gentzkow M., Shapiro J.M., Taddy M. (2018) Parsed Speeches and Phrase Counts. Congressional Record for the 43rd-114th Congresses. Palo Alto, CA: Stanford Libraries.
Lehmann P., Lewandowski J., Matthieß T., Merz N., Regel S., Werner A. (2018) Manifesto Corpus. Version: 2018-1. Berlin: WZB Berlin Social Science Center.
Volkens A., Krause W., Lehmann P., Matthieß T., Merz N., Regel S., Weßels B. (2019) The Manifesto Data Collection. Manifesto Project (MRG/CMP/MARPOR). Version 2019a. WZB.
Mikhaylov S., Laver M., Benoit K.R. (2012) Coder reliability and misclassification in the human coding of party manifestos. Political Analysis 20(1), 78–91.
Glavaš G., Nanni F., Ponzetto S.P. (2017) Cross-lingual classification of topics in political texts.
Zirn C., Glavaš G., Nanni F., Eichorst J., Stuckenschmidt H. (2016) Classifying topics and detecting topic shifts in political manifestos. In PolText.
Subramanian S., Cohn T., Baldwin T. (2018) Hierarchical Structured Model for Fine-to-coarse Manifesto Text Analysis. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1.
Subramanian S., Cohn T., Baldwin T., Brooke J. (2017) Joint Sentence–Document Model for Manifesto Text Analysis. Proceedings of the Australasian Language Technology Association Workshop, 25–33.
Merz N., Regel S., Lewandowski J. (2016) The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis. Research and Politics. DOI: 10.1177/2053168016643346
Lan M., Tan C.L., Su J. (2009) Supervised and Traditional Term Weighting Methods for Automatic Text Categorization. Journal of IEEE PAMI, vol. 10, No. 10.
Lacewell O.P. and Werner A. (2013) Coder training: Key to enhancing coding reliability and estimate validity. In: Volkens A., Bara J., Budge I., et al. (eds) Mapping Policy Preferences from Texts. Statistical Solutions for Manifesto Analysts. Oxford: Oxford University Press.
Bojanowski P., Grave E., Joulin A., Mikolov T. (2016) Enriching Word Vectors with Subword Information.
Grave E., Bojanowski P., Gupta P., Joulin A., Mikolov T. (2018) Learning Word Vectors for 157 Languages. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
Verberne S., D'hondt E., van den Bosch A., Marx M. (2014) Automatic thematic classification of election manifestos. Information Processing & Management, 50(4), 554–567.