Bioingenium at ImageClefmed 2010: A Latent Semantic Approach
Jose G. Moreno, Juan C. Caicedo, and Fabio A. González
National University of Colombia,
Bioingenium Research Group
Abstract. This paper describes the participation of the Bioingenium
Research Group in the ad hoc Medical Image Retrieval task for the 2010
ImageCLEF forum. The work aimed to explore semantic relationships
in textual information and transfer them to visual information by building
a unified image search index. The proposed strategy is based on the
use of Non-Negative Matrix Factorization to decompose data matrices and
build latent semantic spaces.
Keywords: Non-Negative Matrix Factorization, Content-Based Image Retrieval

1 Introduction
The Medical Image Retrieval task of ImageCLEF aims to evaluate computational
methods to access and retrieve visual contents for medical applications. For the
2010 forum, the challenge consists of proposing computational alternatives for
retrieving images from a database containing 77,000 images extracted from
medical papers and 16 query topics. This paper describes the approach followed
by the Bioingenium Research Group at the National University of Colombia,
which focused on combining visual features extracted from images together with
text captions taken from the manuscript, to build a unified index for searching
medical images. Latent Semantic Indexing (LSI) strategies are used at the core
of our approach to model implicit relationships between visual and textual data.
Since our image indexing method integrates visual and textual information,
we are able to map different types of queries to the latent semantic space so as
to search for images. Be it a text query, an image example or a mixed query, the
system uses the same index to retrieve relevant images. To achieve this, we use
Non-Negative Matrix Factorization algorithms to build latent semantic spaces
using multimodal information. We paid special attention to the case in which
a visual example is provided as query to search the multimodal index. When
only visual features are used to match contents in the collection, the multimodal
index still has information taken from textual data, which can influence the
search results.
These algorithms can be computationally expensive, and it might therefore
take too long for them to process large image collections. For this reason, they
were adapted for the experiments on the ImageCLEFmed 2010 collection by
using a structured initialization strategy based on Singular Value Decomposition,
which allowed the algorithms to converge faster to the desired factorization.
We ran experiments that involved different configurations of our model to
evaluate their performance.

The contents of this paper are organized as follows: Section 2 summarizes the
methods used to process visual and textual data separately. Section 3 presents
the main methods of our retrieval framework. Section 4 presents the experimen-
tal setup, the results and discussions. Finally, the paper ends with concluding
remarks in Section 5.
2 Data Processing
In our approach, textual data and visual data are first processed in an indepen-
dent manner. The purpose of this preprocessing step is to build two matrices
that represent the content of each modality for all images in the collection by
using a certain set of features.
We used the paper title and image caption as semantic context for each image.
The Natural Language Toolkit (NLTK) was used to build a vector space
representation of textual information. Common text processing techniques, such
as stop words removal and word stemming, were applied to the corpus and
a TF-IDF weighting scheme was used as the final text representation. Some terms
were removed from the vector space to generate a more compact representation
by pruning terms with too low or too high frequency within the corpus.
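The preprocessing just described can be sketched in a few lines. The snippet below is a minimal, pure-Python illustration of stop-word removal, frequency pruning and TF-IDF weighting (stemming is omitted for brevity); the stop-word list, thresholds and function name are placeholders of ours, not the NLTK pipeline actually used:

```python
import math
import re
from collections import Counter

# Tiny placeholder stop-word list; the real pipeline used NLTK's list.
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "with", "for"}

def tfidf_matrix(captions, min_df=1, max_df_ratio=0.9):
    """Build a TF-IDF document-term matrix from image captions.

    Terms appearing in fewer than `min_df` documents, or in more than
    `max_df_ratio` of all documents, are pruned, mimicking the low/high
    frequency pruning described above.
    """
    docs = [[w for w in re.findall(r"[a-z]+", c.lower()) if w not in STOP_WORDS]
            for c in captions]
    n_docs = len(docs)
    df = Counter(w for d in docs for w in set(d))       # document frequencies
    vocab = sorted(w for w, c in df.items()
                   if c >= min_df and c / n_docs <= max_df_ratio)
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in range(n_docs)]
    for i, d in enumerate(docs):
        for w, c in Counter(d).items():                 # term frequencies
            if w in index:
                matrix[i][index[w]] = c * math.log(n_docs / df[w])
    return matrix, vocab
```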
We used a Bag of Features strategy in which each image is represented as a
histogram of frequencies of predefined visual patterns. These visual patterns
are organized in a codebook of DCT features that is quantized using the
k-means algorithm. Blocks of 8×8 pixels are first taken from a regular grid in
each image in the collection and processed using the Discrete Cosine Transform
in the three RGB color channels. The coefficients of this transform are used as
features to construct the codebook and to match visual patterns when building
the histogram representation. The size of the codebook was set to 2,000 and
5,000 features in our experiments.
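A rough sketch of this representation is given below, for a single channel and non-overlapping blocks; the real system processed all three RGB channels and learned the codebook with k-means, whereas here the codebook is simply passed in and the helper names are our own:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix for n-point transforms."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def block_dct_features(image, block=8):
    """Take non-overlapping 8x8 blocks from a regular grid over one
    channel and return the 2-D DCT coefficients of each block."""
    c = dct_matrix(block)
    h, w = image.shape
    feats = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = image[y:y + block, x:x + block]
            feats.append((c @ patch @ c.T).ravel())    # 2-D DCT, flattened
    return np.array(feats)

def bof_histogram(features, codebook):
    """Assign each block to its nearest codeword (Euclidean distance)
    and return the normalized histogram of codeword frequencies."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(axis=1), minlength=len(codebook))
    return hist / hist.sum()
```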
3 Retrieval Framework
Latent semantic strategies have been shown to be a powerful set of methods to
find latent relationships between features in document collections. Using a term-
document representation, it is possible to find latent patterns such as highly
correlated terms through matrix decompositions. A Singular Value Decomposition
(SVD) of the term-document matrix is used in information retrieval to
construct latent semantic spaces so that documents that refer to certain top-
ics can be highly scored even when there is not an explicit occurrence of the
query terms. This is possible thanks to the implicit relationships existing be-
tween terms that are found during the indexing process. This technique usually
shows improved retrieval performance compared to a document retrieval engine
that only uses the term frequencies to score documents.
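The fold-in mechanism behind this can be illustrated with a small sketch (a hypothetical helper of ours, not code from the paper): documents and the query are projected onto the rank-k SVD subspace and compared there by cosine similarity.

```python
import numpy as np

def lsi_scores(term_doc, query_vec, k=2):
    """Score documents against a query in a k-dimensional latent space.

    `term_doc` is a terms x documents matrix and `query_vec` a term
    vector. Both are folded into the rank-k SVD subspace and compared by
    cosine similarity, so a document can score highly even without an
    explicit occurrence of the query terms.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    uk, sk, vk = u[:, :k], s[:k], vt[:k, :]
    doc_latent = (sk[:, None] * vk).T        # documents in latent space
    q_latent = query_vec @ uk                # fold the query in
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return np.array([cos(q_latent, d) for d in doc_latent])
```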
We use Non-negative Matrix Factorization as a flexible algorithm to model
a latent semantic space that correlates visual features and text terms in a
unified multimodal representation.
3.1 Input data
The term-document matrix Xt of size m × n is used to represent text data in
the collection, where m is the number of images and n is the number of terms.
The values in each cell are the frequency of the corresponding term for an image.
Similarly, the feature-document matrix Xv is a matrix of size m × p, with m being
the number of images and p the number of visual features.
3.2 Non-negative Matrix Factorization (NMF)
In document clustering as well as in document retrieval and document classi-
fication, an important task is to recognize the semantic relationships existing
between terms in a collection. NMF is a novel technique proposed to address
problems of this nature and has been actively used for the analysis of text
documents as well as image collections. This method decomposes a
non-negative matrix into two lower rank non-negative matrices; the first one
encoding the basis of the latent space and the second one encoding the coef-
ficients of the document representation in that latent space. NMF is modeled
as an optimization problem with non-negative restrictions on both matrix fac-
tors for which different objective functions can be used. We used the divergence
objective function to find the factorization:

D(X || WH) = Σ_ij ( X_ij log( X_ij / (WH)_ij ) − X_ij + (WH)_ij )
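A minimal sketch of the standard multiplicative updates that minimize this divergence objective is shown below (function name, iteration count and random initialization are our own choices, not the paper's configuration):

```python
import numpy as np

def nmf_divergence(x, k, iters=300, seed=0):
    """Factor a non-negative matrix X (features x documents) as W @ H
    using multiplicative updates for the divergence objective. W encodes
    the latent basis, H the coefficients in that latent space."""
    rng = np.random.default_rng(seed)
    m, n = x.shape
    w = rng.random((m, k)) + 1e-3
    h = rng.random((k, n)) + 1e-3
    eps = 1e-12
    for _ in range(iters):
        wh = w @ h + eps
        # H-update: H <- H * (W^T (X / WH)) / (column sums of W)
        h *= (w.T @ (x / wh)) / (w.sum(axis=0)[:, None] + eps)
        wh = w @ h + eps
        # W-update: W <- W * ((X / WH) H^T) / (row sums of H)
        w *= ((x / wh) @ h.T) / (h.sum(axis=1)[None, :] + eps)
    return w, h
```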
3.3 Singular Value Decomposition (SVD) Initialization
NMF is a powerful matrix decomposition strategy, but it has an important
problem to consider: its convergence is slow. In order to make it usable
for a large-scale image retrieval application, we applied an initialization
algorithm based on SVD operations.
SVD is a very common matrix decomposition strategy, in which three matrices
are obtained: two orthonormal matrices and a diagonal matrix. The orthonormal
matrices have the particularity of being linearly independent, but they can
still have negative values. Based on a result about how the rank of a rank-one
matrix increases when its negative values are set to zero, a method that yields
a non-negative SVD decomposition is used to build the initial values of W and
H. This strategy is known as Non-negative Double Singular Value Decomposition
(NNDSVD), and its use speeds up the indexing process by a factor of about 3.
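A sketch of an NNDSVD-style initialization following these ideas is given below; this is our reading of the scheme, not the exact implementation used in the system:

```python
import numpy as np

def nndsvd_init(x, k):
    """Non-negative initialization of W and H from the truncated SVD of X.

    Each rank-one SVD term u_j s_j v_j^T is split into positive and
    negative parts; the pair with the larger norm product is kept and its
    negative entries are zeroed, as in the NNDSVD scheme."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    m, n = x.shape
    w = np.zeros((m, k))
    h = np.zeros((k, n))
    # Leading singular vectors of a non-negative matrix can be taken
    # non-negative (up to sign), hence the absolute value.
    w[:, 0] = np.sqrt(s[0]) * np.abs(u[:, 0])
    h[0, :] = np.sqrt(s[0]) * np.abs(vt[0, :])
    for j in range(1, k):
        uj, vj = u[:, j], vt[j, :]
        up, un = np.clip(uj, 0, None), np.clip(-uj, 0, None)
        vp, vn = np.clip(vj, 0, None), np.clip(-vj, 0, None)
        if np.linalg.norm(up) * np.linalg.norm(vp) >= \
           np.linalg.norm(un) * np.linalg.norm(vn):
            xu, xv = up, vp            # keep the positive section
        else:
            xu, xv = un, vn            # keep the negative section, negated
        sigma = s[j] * np.linalg.norm(xu) * np.linalg.norm(xv)
        w[:, j] = np.sqrt(sigma) * xu / (np.linalg.norm(xu) + 1e-12)
        h[j, :] = np.sqrt(sigma) * xv / (np.linalg.norm(xv) + 1e-12)
    return w, h
```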
3.4 Latent Representation
We model the factorization problem using an asymmetric algorithm. In this
method, the text matrix Xt is first exploited to build the basis of the latent
semantic space, and then the matrix Xv is used to adapt the visual
representation to that latent semantic space. The algorithm follows
two basic steps:
1. Learning the semantic basis: NMF is applied to solve Xt = WtHt, where Xt
is the text matrix.
2. Adapting a visual basis: a modified version of NMF is applied to find Xv =
WvHt in order to learn only a Wv that spans the same latent space obtained
from the text analysis, but using visual features.
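Step 2 can be sketched as follows; this assumes a columns-as-documents convention (features × documents, so the coefficient matrix Ht is shared across modalities), and the function name is our own:

```python
import numpy as np

def fit_visual_basis(xv, ht, iters=300, seed=0):
    """With the text coefficients Ht held fixed, learn a visual basis Wv
    such that Xv ≈ Wv @ Ht, using only the W-side multiplicative update
    of the divergence NMF (the H-side update is skipped)."""
    rng = np.random.default_rng(seed)
    p, n = xv.shape
    k = ht.shape[0]
    wv = rng.random((p, k)) + 1e-3
    eps = 1e-12
    for _ in range(iters):
        wh = wv @ ht + eps
        wv *= ((xv / wh) @ ht.T) / (ht.sum(axis=1)[None, :] + eps)
    return wv
```

Because Ht is fixed, each iteration only has to solve for Wv, which makes this adaptation step cheaper than a full joint factorization.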
Since the basis of the latent space is formed using textual data, it is expected
that this representation helps reduce the semantic gap in the CBIR system.
This approach reuses the semantic representation found in the text
decomposition to calculate the corresponding basis that satisfies the NMF
decomposition of the visual data. With the new Wv basis, images lacking text
can be mapped into the semantic space in the same way as text documents are,
but note that only visual features are required.
Indexing texts or projecting text queries to the semantic space is straightforward
because a Wt basis has also been learned. When queries have both modalities
available, an automatic strategy to combine them is used: the Wt and Wv bases
are simply concatenated to allow a mixed projection of multimodal data.
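Projecting a query into the latent space can be sketched as below: with the basis W held fixed, non-negative coefficients h are inferred so that q ≈ W h. For a mixed query, the vertically concatenated Wt and Wv (and the stacked text/visual query vectors) would be passed instead. The helper name is hypothetical:

```python
import numpy as np

def project_query(q, w, iters=200, seed=0):
    """Fold a query vector into the latent space: with the basis W
    fixed, infer non-negative coefficients h with q ≈ W @ h using the
    H-side multiplicative update of the divergence NMF only."""
    rng = np.random.default_rng(seed)
    k = w.shape[1]
    h = rng.random(k) + 1e-3
    eps = 1e-12
    for _ in range(iters):
        wh = w @ h + eps
        h *= (w.T @ (q / wh)) / (w.sum(axis=0) + eps)
    return h
```

Text queries, image-example queries and mixed queries all end up as a vector h in the same latent space, so a single index can serve all three.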
4 Experiments and Results
We participated with three different approaches that are classified according to
the information used in the query: only text terms, only visual features and mixed
text-visual information. To tune the model parameters, we used the queries from
the 2009 ImageCLEFmed challenge and tried solving them on the 2010 collec-
tion. Our goal was to maximize the Mean Average Precision (MAP) according
to the ground truth for the 2009 queries and then use the same configuration to
solve the 2010 topics.
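MAP, the measure being optimized here, is the mean over all queries of the average precision of each ranked result list. A small sketch using the standard TREC-style definition (helper names are our own):

```python
def average_precision(ranked_ids, relevant):
    """Average precision for one query: the mean of the precision values
    at the ranks where relevant images are retrieved, divided by the
    total number of relevant images."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```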
The main parameter in our model is the size of the semantic space, which
determines the number of latent concepts. We calculated various semantic spaces
using different sizes to evaluate the performance with different configurations.
In order to explore the search space before submitting our runs, we used a
logarithmic scale to set the latent space size parameter.
Below is a description of our 6 submitted runs:
– Text − k = 2^11: only text information was used in the queries as well as
in the collection. In this experiment we used 2,048 as the size of the latent
semantic space.
– AsymmetricMixed − k = 2^11: both visual and text information is used to
process the queries. The size of the semantic space was the same as in the
previous experiment.
The following runs only used visual information to query the system:
– AsymmetricDCT2000 − k = 2^5: a codebook of 2,000 visual patterns was
used to represent image features. The size of the latent semantic space was
set to 2^5 = 32.
– AsymmetricDCT5000 − k = 2^5: a codebook with 5,000 elements and 2^5 = 32
latent semantic dimensions.
– AsymmetricDCT5000 − k = 2^7: a codebook with 5,000 elements and 2^7 =
128 latent semantic dimensions.
– AsymmetricDCT5000 − k = 2^7.5: a codebook with 5,000 elements and
2^7.5 ≈ 181 latent semantic dimensions.
We provide a comparison of the results obtained with the 2009 and 2010
queries, even though the ground truths were obtained by analyzing different
versions of the same collection. In particular, the collection at the time the
2009 ground truth was built was about 10% smaller than the 2010 version.
However, this may still be considered a good estimate of the expected
performance.
We observed that when the size of the latent space is increased, better
performance is obtained for text queries. Table 1 shows the MAP and precision
at 1,000 results (P1000) obtained using the 2009 and 2010 queries, respectively,
with the corresponding configurations for each run. Using only visual
information on the 2009 queries, the configuration described for
AsymmetricDCT2000 − k = 2^5 (2,000 visual features and 2^5 = 32 latent
semantic dimensions) obtained the best performance. This score is even
competitive with the results obtained by participants in the 2009 challenge
when using only visual information. However, in the 2010 challenge the results
were not as encouraging.
The performance results shown in Table 1 show consistency between the
results obtained for the 2009 and 2010 queries, both for MAP and Precision
at 1,000 results (P1000). The search strategy based only on text information
outperformed both the visual and mixed strategies. However, it is well known
that visual information alone lacks the semantic content needed to search for
images, and this gap remains an open problem in image retrieval. Our method is