Proceedings of the 25th International Conference on Computational Linguistics, pages 121–123,
Dublin, Ireland, August 23-29 2014.
Keyphrase Extraction using Textual and Visual Features
This work is licensed under a Creative Commons Attribution 4.0 International Licence.
Yaakov HaCohen-Kerner1, Stefanos Vrochidis2, Dimitris Liparas2, Anastasia Moumtzidou2,
Ioannis Kompatsiaris2
1 Dept. of Computer Science, Jerusalem College of Technology Lev Academic Center,
21 Havaad Haleumi St., P.O.B. 16031, 9116001 Jerusalem, Israel,
2 Information Technologies Institute, Centre for Research and Technology Hellas, Thermi-
Thessaloniki, Greece, {stefanos, dliparas, moumtzid, ikom}
Many current documents include multimedia consisting of text, images and embedded videos. This
paper presents a general method that uses Random Forests to automatically extract keyphrases that can
be used as very short summaries and to help in retrieval, classification and clustering processes.
1 Introduction
A keyphrase is an important concept, presented either as a single word (unigram), e.g. 'extraction' or
'keyphrase', or as a collocation, i.e. a meaningful group of two or more words, e.g. 'keyphrase
extraction'. Keyphrases can be regarded as very short summaries and can be used for representing
documents in retrieval, classification and clustering problems.
Nowadays, many documents (e.g. web pages, articles) include multimedia consisting of text, images
and embedded videos. In this case, the keyphrase extraction process should not be limited to the
textual data but should also consider the audiovisual data.
In this context, this paper proposes a novel framework for automatic keyphrase extraction from
documents containing text and images based on supervised learning and textual and visual features.
2 Baseline Methods for Keyphrase Extraction
In this section, we introduce the baseline methods we use for keyphrase extraction using textual and
visual information.
2.1 Textual Keyphrase Extraction
In all methods, words and terms that have only a grammatical role in the language are excluded from
the keyword list according to Fox's stop list (Fox, 1990). This stop list contains 421 high-frequency
stop words (e.g. we, this, and, when, in, usually, also, near).
(1) Term Frequency (TF): This method rates a term according to the number of its occurrences in
the text. Only the N terms with the highest TF in the document are selected.
(2) Term Length (TL): This method rates a term according to the number of words it contains.
(3) First N Terms (FN): Only the first N terms in the document are selected. The assumption is that
the most important keyphrases are found at the beginning of the document because people tend to
place important information at the beginning. This method is based on the baseline summarization
method that chooses the first N sentences, a simple method that provides a relatively strong
baseline for the performance of any text-summarization method.
(4) Last N Terms (LN): Only the last N terms in the document are selected. The assumption is that
the most important keyphrases are found at the end of the document because people tend to place
their important keyphrases in their conclusions, which are usually placed near the end.
(5) At the Beginning of its Paragraph (PB): This method rates a term according to its relative
position in its paragraph. The assumption is that the most important keyphrases are likely to be
found close to the beginning of their paragraphs.
(6) At the End of its Paragraph (PE): This method rates a term according to its relative position in
its paragraph. The assumption is that the most important keyphrases are likely to be found close
to the end of their paragraphs.
(7) Resemblance to Title (RT): This method rates a term according to the resemblance of its
sentence to the title of the article. Sentences that resemble the title are granted a higher score.
(8) Maximal Section Headline Importance (MSHI): This method rates a term according to its most
important presence in a section or headline of the article. It is known that some parts of papers
are more likely than others to contain keyphrases. Such parts include headlines and sections
such as the abstract, introduction and conclusions.
(9) Accumulative Section Headline Importance (ASHI): This method is very similar to the previous
one. However, it rates a term according to all its presences in important sections or headlines
of the article.
(10) Negative Brackets (NBR): Phrases found in brackets are not likely to be keyphrases. Therefore,
they are defined as negative phrases and are given negative scores.
These methods were applied to extract and learn keyphrases from scientific articles (HaCohen-
Kerner et al., 2005).
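As an illustration only, a few of the simpler baselines above (TF, TL, FN and LN) can be sketched as follows. The function name baseline_scores, the input format (a list of candidate terms in document order) and the encoding of FN and LN as binary indicators are assumptions made for this sketch; they are not taken from the original implementation.

```python
from collections import Counter


def baseline_scores(terms, n=3):
    """Compute TF, TL, FN and LN for candidate terms.

    terms: candidate terms in document order (hypothetical input format).
    n: the N used by the First N Terms and Last N Terms baselines.
    """
    tf = Counter(terms)
    # Unique terms ordered by first occurrence from the start of the document.
    from_start = list(dict.fromkeys(terms))
    # Unique terms ordered by first occurrence when reading from the end.
    from_end = list(dict.fromkeys(reversed(terms)))
    first_n = set(from_start[:n])
    last_n = set(from_end[:n])
    return {
        term: {
            "TF": tf[term],             # number of occurrences
            "TL": len(term.split()),    # number of words in the term
            "FN": int(term in first_n),  # assumption: binary indicator
            "LN": int(term in last_n),   # assumption: binary indicator
        }
        for term in tf
    }


terms = [
    "keyphrase extraction",
    "random forests",
    "keyphrase extraction",
    "visual features",
    "late fusion",
]
scores = baseline_scores(terms, n=2)
```

Each term then carries a small feature vector, which is the kind of input that a supervised learner could combine; the position-based and title-based baselines would add further entries in the same way.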
2.2 Visual Keyphrase Extraction
In contrast to textual extraction, visual keyphrase extraction is performed for a predefined set of
keyphrases (e.g. demonstration, moving car), which are selected to be relevant to the domain of
interest. First, low-level visual features (SIFT, SURF) are extracted (Markatopoulou et al., 2013).
We then apply supervised machine learning using Random Forests (RF) (Breiman, 2001) to detect the
presence of each concept in an image. RF have been successfully applied to several image
classification problems (e.g. Bosch et al., 2007; Xu et al., 2012). Moreover, an important
motivation for using RF was the possibility of applying late fusion based on the operational
capabilities of RF, which is discussed below.
In the training phase, the feature vectors of each low-level descriptor are used as input for the
construction of a single RF. The training set can be constructed either manually or automatically. In
the automatic case, we submit a text query to a general purpose web search engine (e.g. Google, Bing)
to retrieve relevant images, while irrelevant images can be selected randomly from the web. From the
RFs that are constructed (one for each descriptor), we compute the weights for each modality as
follows. From the out-of-bag (OOB) error estimate of each modality's RF, the corresponding OOB
accuracy values are computed. These values are computed for each concept separately. The values are
then normalized and serve as weights for the different modalities. Finally, each image is represented
by a vector that contains the scores for each predefined visual keyphrase.
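The weighting scheme described above can be sketched as follows for a single concept. This is a minimal illustration under stated assumptions: the OOB accuracy is taken as one minus the OOB error, normalization is done by dividing by the sum of the accuracies, and the fused score is a weighted sum of the per-modality scores. The function names and all numeric values are hypothetical.

```python
def oob_weights(oob_errors):
    """Turn per-modality OOB error estimates into normalized weights.

    oob_errors: mapping from modality name to the OOB error of its RF.
    Assumption: accuracy = 1 - error, normalized to sum to one.
    """
    accuracies = {m: 1.0 - err for m, err in oob_errors.items()}
    total = sum(accuracies.values())
    return {m: acc / total for m, acc in accuracies.items()}


def late_fusion(scores, weights):
    """Fuse per-modality scores for one concept as a weighted sum (assumption)."""
    return sum(weights[m] * scores[m] for m in scores)


# Made-up OOB errors for the RFs trained on two descriptors, for one concept.
oob_errors = {"SIFT": 0.20, "SURF": 0.40}
weights = oob_weights(oob_errors)

# Made-up RF output scores for the same concept on one image.
fused = late_fusion({"SIFT": 0.9, "SURF": 0.5}, weights)
```

Repeating this for every predefined visual keyphrase would give one fused score per concept, i.e. the image vector mentioned above.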
It should be noted that the visual concept/keyphrase detectors perform reasonably well for some
visual concepts (e.g. news studio: 0.5 MEIAP (Mean Extended Inferred Average Precision)), while for
others (e.g. bridge: 0.02 MEIAP) the performance is very low (Markatopoulou et al., 2013). Therefore,
the representation is based only on visual concepts for which the trained models perform reasonably
well.
3 The Proposed Supervised Extraction Model
Our model, in general, is composed of the following steps:
For each document:
(1) Extract all possible n-grams (n=1, 2, 3) that do not contain stop-list words.
(2) Transform these n-grams into lower case.
(3) Apply all baseline textual extraction methods on these n-grams.
(4) Apply variable selection using Random Forests on all textual features (the results of the textual
baseline methods) in order to find the best combination of the textual features (Genuer et al.,
2010).
(5) Extract visual keyphrases for each image and calculate the average score for each visual
keyphrase to represent the document.
(6) Apply variable selection using Random Forests on all visual features in order to find the best
performing visual features (Genuer et al., 2010).
(7) After the feature selection, two fusion techniques are investigated:
a. Early fusion: The textual and visual vectors are concatenated into a single vector. In
unsupervised tasks (e.g. retrieval, clustering), the L1 distances between these vectors are
used to compute similarity measures. In supervised tasks (e.g. classification), we train an
RF on the concatenated vectors, using manually annotated documents as the training set.
b. Weighted late fusion: In unsupervised tasks, similarity scores are computed independently
for each modality and the results are fused. The weights are calculated by a regression
model based on Support Vector Machines. In supervised tasks, we train two RFs (i.e. one
for each modality) using a manually constructed training set and then apply weighted late
fusion based on the OOB error estimate, using the approach described in Section 2.2.
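Steps (1), (2), (5) and the unsupervised variant of early fusion in step (7a) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: tokenization is plain whitespace splitting with no punctuation handling, the stop list is only the handful of example words quoted in Section 2.1, and all function names and feature values are hypothetical.

```python
def candidate_ngrams(text, stop_words, max_n=3):
    """Extract lower-cased n-grams (n = 1..max_n) that contain no stop words."""
    tokens = text.split()  # simplification: whitespace tokenization only
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if not any(word.lower() in stop_words for word in window):
                grams.append(" ".join(window).lower())
    return grams


def average_visual_scores(image_vectors):
    """Average the per-image concept scores to represent the whole document."""
    count = len(image_vectors)
    return [sum(column) / count for column in zip(*image_vectors)]


def l1_distance(u, v):
    """L1 (Manhattan) distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Example stop words quoted from the text; the full Fox list has 421 entries.
stop_words = {"we", "this", "and", "when", "in", "usually", "also", "near"}
grams = candidate_ngrams("Keyphrase extraction in documents", stop_words)

# Made-up textual feature vectors and per-image visual concept scores.
textual_a, textual_b = [2.0, 1.0], [1.0, 1.0]
visual_a = average_visual_scores([[0.5, 0.0], [0.25, 0.5]])
visual_b = average_visual_scores([[0.5, 0.25]])

# Early fusion: concatenate the modalities, then compare documents with L1.
doc_a = textual_a + visual_a
doc_b = textual_b + visual_b
distance = l1_distance(doc_a, doc_b)
```

A smaller distance indicates more similar documents. In practice the two modalities would probably need to be brought to a comparable scale before concatenation, which this sketch does not attempt.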
4 Conclusions and Future Work
The proposed approach is work in progress, so specific results are not yet available. However, initial
results using weighted late fusion (based on the OOB estimate) of textual features and low-level
visual features used directly for representation (i.e. histograms rather than concepts) show an
improvement over the results obtained using only textual features. The next steps of this work
include application of the proposed method to retrieval, clustering and classification problems of
web pages and news articles, which include multimodal information such as text and images.
Future directions for research are: (1) Developing additional baseline methods for keyphrase
extraction, (2) Applying other ML methods in order to find the most effective combination of
these baseline methods, (3) Conducting more experiments using additional documents from additional
domains, (4) Developing a methodology for predefined visual concept selection, and (5) Applying
ML to extract keyphrases using both textual and low-level visual features.
Concerning research on additional domains, there are many potential research directions. For
instance, the following research questions can be addressed: (1) Which baseline extraction methods
are good for which domains? (2) What are the specific reasons for methods to perform better or worse
on different domains? (3) What are the guidelines for choosing the correct methods for a certain
domain? (4) Can the appropriateness of a method for a domain be estimated automatically?
Acknowledgment: This work is supported by the MULTISENSOR project (FP7-610411), partially funded
by the European Commission. The authors would like to acknowledge networking support by the
COST Action IC1307: The European Network on Integrating Vision and Language (iV&L Net) and
the COST Action IC1002: MUMIA.
References
1. Anna Bosch, Andrew Zisserman, and Xavier Munoz. 2007. Image classification using random forests and
ferns. In ICCV, pp. 1-8.
2. Leo Breiman. 2001. Random Forests. Machine Learning, 45(1): 5-32.
3. Christopher Fox. 1990. A Stop List for General Text. ACM-SIGIR Forum, 24, pp. 19-35.
4. Yaakov HaCohen-Kerner, Zuriel Gross, and Asaf Masa. 2005. Automatic extraction and learning of
keyphrases from scientific articles. In Computational Linguistics and Intelligent Text Processing, pp. 657-
669, Springer Berlin Heidelberg.
5. Fotini Markatopoulou, Anastasia Moumtzidou, Christos Tzelepis, Kostas Avgerinakis, Nikolaos Gkalelis,
Stefanos Vrochidis, Vasileios Mezaris, and Ioannis Kompatsiaris. 2013. ITI-CERTH participation to
TRECVID 2013. In TRECVID 2013 Workshop, Gaithersburg, MD, USA.
6. Baoxun Xu, Yunming Ye, and Lei Nie. 2012. An improved random forest classifier for image classification.
In International Conference on Information and Automation (ICIA), pp. 795-800, IEEE.
7. Robin Genuer, Jean-Michel Poggi, and Christine Tuleau-Malot. 2010. Variable Selection using Random
Forests. Pattern Recognition Letters, 31(14): 2225-2236.