Giving order to image queries
Jonathon S. Harea, Patrick A. S. Sinclairb, Paul H. Lewisa and Kirk Martineza
aSchool of Electronics and Computer Science, University of Southampton, Southampton,
SO17 1BJ, UK;
bRoom 718, BBC Henry Wood House, 3-6 Langham Place, London, W1A 1AA, UK
ABSTRACT
Users of image retrieval systems often find it frustrating that the image they are looking for is not ranked near
the top of the results they are presented. This paper presents a computational approach for ranking keyworded
images in order of relevance to a given keyword. Our approach uses machine learning to attempt to learn what
visual features within an image are most related to the keywords, and then provide ranking based on similarity
to a visual aggregate. To evaluate the technique, a Web 2.0 application has been developed to obtain a corpus
of user-generated ranking information for a given image collection that can be used to evaluate the performance
of the ranking algorithm.
Keywords: Image retrieval, ranking, visual features, web 2.0
1. INTRODUCTION
The current hot topic in the image retrieval community is the retrieval of images that are not annotated. The subject of retrieving images that already have subject metadata in the form of keywords has gathered very little attention. The reason for this is an assumption by the research community that once an image has subject metadata, retrieval should largely be a trivial matter of finding images annotated with the terms from the query. In practice, however, this approach is not particularly satisfying for image-searchers, due to the peculiarities of the keywording schemes often used for annotating imagery.
When analyzing keywording schemes used for image annotation, three specific problem areas are often encountered. Firstly, many image collections are annotated with free-form keywords, without the use of a strictly
enforced controlled vocabulary. Without a controlled vocabulary, it becomes harder to find relevant imagery
in response to a query. As an example, consider the problem of finding images of a “garbage can”, when an
inconsistent vocabulary is used (for example “trash can”, “dustbin”, etc.). Many collections that do not strictly
enforce a vocabulary also suffer from inconsistent or bad spelling of keyword terms, which again confounds the
issue of finding relevant images.
The second problem area is related to the quantity and depth of keywords associated with imagery. Many
image collections have a tendency to apply large numbers of keywords to their images without much thought
as to whether the keywords will help or hinder the image searcher.1 An example of this would be an image of
“George Bush Sr.” that was also keyworded “man”. The problem with this is that many professional image
searchers would not find images of “George Bush Sr.” helpful in a search for “man”. This issue is one that can
perhaps be solved by using structured thesauri and ontologies, and then only annotating images with the most
specific concepts; this would make automatic query expansion possible, if it were required by the searcher.
The third and final issue is the one which this paper aims to investigate. The problem concerns the relevance
of keywords to the image, that is, how well do the keywords represent the subject and content of the image?
Generally speaking, a keyword that describes an object in the background of the image is less relevant than one
that describes the subject or primary object of interest within the image. Unfortunately, the standard model
of annotating images with keywords does not allow this information to be modeled. This has big implications
for image search as searchers would like to have the most relevant images to their query presented first, and
less relevant images presented later, in order to minimize the time it takes to locate a suitable target image.
In short, keyword-based search would be better if the images could be ranked in order of decreasing relevance to the query term. Ranking items with respect to their relevance is not a new idea; for example, the indexes of books tend to list the relevant pages for a term in decreasing order of relevance to that term.
The first part of this paper explores how a machine learning approach can be used to learn the relevance
of keywords to an image from the images’ visual content. The second part describes a system for evaluating
how searchers would like to see images ranked, and presents results of such an evaluation. The second part also compares the system's rankings to the user-provided rankings of images under a number of different conditions.
2. ADDING RANKING TO KEYWORDED IMAGES
Most keyword-based image retrieval strategies for annotated images will return an unordered set of images that
were labeled with the corresponding query term. As far as a searcher is concerned, the images will be displayed
in an essentially random order with no link between image relevance to the query and image position. In this
section we propose an algebraic technique that combines visual features and keyword labels in order to provide
a ranked order to returned images.
Our central hypothesis is that, at least for single-term queries, the more of the particular object representing the query that the image displays, the more relevant the image is to the query. The problem, therefore, is to determine
how well each image represents the keyword given a visual signature for the image.
The central idea behind the approach we propose is that of a mathematical factorization which is able to determine, from a collection of images and their associated visual signatures and keywords, a set of factors that describe how the image signatures and keywords are related to each other. In our approach, visual signatures are
articulated as a set of ‘visual terms’.2–5 The idea of using a factorization, and indeed the approach described below, comes from our own previous work on retrieving unannotated imagery using keywords,2,3 which was in turn inspired by the text retrieval technique Latent Semantic Indexing.
Latent Semantic Indexing (LSI)6 is a technique for indexing documents in a dimensionally-reduced semantic vector space. Landauer and Littman7 demonstrate a system based on LSI for performing text searching on a set
of French and English documents where the queries could be in either French or English (or conceivably both),
and the system would return documents in both languages which corresponded to the query. Landauer’s system
negated the need for explicit translations of all the English documents into French; instead, the system was trained
on a set of English documents and versions of the documents translated into French, and through a process called
‘folding-in’, the remaining English documents were indexed without the need for explicit translations. This idea
has become known as Cross-Language Latent Semantic Indexing (CL-LSI).
In general, any document (be it text, image, or even video) can be described by a series of observations, or
measurements, made about its content. We refer to each of these observations as terms. Terms describing a
document can be arranged in a vector of term occurrences, i.e. a vector whose i-th element contains a count of
the number of times the i-th term occurs in the document. There is nothing stopping a term vector having terms
from a number of different modalities. For example a term vector could contain term-occurrence information for
both ‘visual’ terms and textual annotation terms.
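As a minimal sketch of such a mixed-modality term vector (the vocabulary and counts below are invented purely for illustration):

```python
import numpy as np

# Hypothetical combined vocabulary: visual terms followed by keyword terms.
vocabulary = ["vis:red", "vis:yellow", "vis:blue", "kw:sun", "kw:sky", "kw:sea"]

def term_vector(counts):
    """Build a term-occurrence vector over the combined vocabulary."""
    return np.array([counts.get(t, 0) for t in vocabulary], dtype=float)

# A document with 120 occurrences of the 'red' visual term, 80 of 'yellow',
# and the keyword annotations 'sun' and 'sky'.
doc = term_vector({"vis:red": 120, "vis:yellow": 80, "kw:sun": 1, "kw:sky": 1})
print(doc)
```

Stacking one such vector per document as the columns of a matrix yields the observation matrix described below.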
Given a corpus of n documents, it is possible to form a matrix of m observations or measurements (i.e. a
term-document matrix). This m × n observation matrix, O, essentially represents a combination of terms and
documents, and can be factored into a separate term matrix, T, and document matrix, D:
O = TD.
These two matrices can be seen to represent the structure of a semantic-space co-inhabited by both terms
and documents. Similar documents and/or terms in this space share similar locations. The advantage of this
approach is that it requires no a-priori knowledge and makes no assumptions about the relationships between either terms or documents. The primary tool in this factorisation is the Singular Value Decomposition (SVD). This
factorisation approach to decomposing a measurement matrix has been used before in computer vision; for
example, in factoring 3D-shape and motion from measurements of tracked 2D points using a technique known
as Tomasi-Kanade Factorisation.8
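As a concrete sketch of the factorisation, using an invented 4-term, 3-document observation matrix and numpy's SVD routine:

```python
import numpy as np

# Toy m x n observation matrix: rows are terms, columns are documents.
O = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 1.],
              [1., 0., 2.]])

# Thin SVD: O = U @ diag(s) @ Vt, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(O, full_matrices=False)

# One way to split the result into a term matrix T and a document matrix D
# such that O = TD, as in the equation above:
T = U                   # m x r term matrix
D = np.diag(s) @ Vt     # r x n document matrix
print(np.allclose(O, T @ D))  # True
```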
The technique presented here consists of creating an observation matrix (containing both visual- and keyword-
terms) for the image collection and then decomposing it into separate term and document matrices. These term
and document matrices can then be used to generate rankings for subsets of images retrieved using standard
query-keyword matching approaches.
2.1 Decomposing the Observation Matrix
Following the reasoning of Tomasi and Kanade,8 although modified to fit measurements of terms in documents,
we first show how the observation matrix can be decomposed into separate term and document matrices.
Lemma 2.1 (The rank principle for a noise-free term-document matrix). Without noise, the
observation matrix, O, has a rank at most equal to the number of independent terms or documents observed.
The rank principle expresses the simple fact that if all of the observed terms are independent, then the rank
of the observation matrix would be equal to the number of terms, m. In practice, however, terms are often
highly dependent on each other, and the rank is much less than m. Even terms from different modalities may
be interdependent; for example a term representing the colour red, and the word “Red”. This fact is what we
intend to exploit.
In reality, the observed term-document matrix is not at all noise free. The observation matrix O can be decomposed using SVD into an m × r matrix U, an r × r diagonal matrix Σ and an r × n matrix VT, O = UΣVT, such that UTU = VTV = I, where I is the identity matrix. Now partition the U, Σ and VT matrices as U = [Uk | UN], Σ = diag(Σk, ΣN) and VT = [VkT; VNT], where Uk is m × k, Σk is k × k, VkT is k × n, and the subscript N denotes the remaining r − k components. We then have

UΣVT = UkΣkVkT + UNΣNVNT. (1)

Assume O∗ is the ideal, noise-free observation matrix, with k independent terms. The rank principle implies that O∗ has at most k non-zero singular values. Since the singular values of Σ are in monotonically decreasing order, Σk must contain all of the singular values of O∗. The consequence of this is that UNΣNVNT is entirely due to noise, and UkΣkVkT is the best possible approximation to O∗.

Lemma 2.2 (The rank principle for a noisy term-document matrix). All of the information about the terms and documents in O is encoded in its k largest singular values together with the corresponding left and right singular vectors.

We now define the estimated noise-free term matrix, T̂, and document matrix, D̂, to be T̂ = Uk and D̂ = ΣkVkT, respectively. From Equation 1, we can write

Ô = T̂D̂,

where Ô represents the estimated noise-free observation matrix.
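The truncation step can be sketched in numpy as follows (the toy matrix and the choice k = 2 are invented for illustration):

```python
import numpy as np

# Invented example: a rank-2 'ideal' matrix O* plus a small amount of noise.
rng = np.random.default_rng(0)
O_star = rng.random((6, 2)) @ rng.random((2, 4))   # noise-free, k = 2
O = O_star + 0.01 * rng.random((6, 4))             # observed, noisy matrix

k = 2
U, s, Vt = np.linalg.svd(O, full_matrices=False)
T_hat = U[:, :k]                    # estimated term matrix (m x k)
D_hat = np.diag(s[:k]) @ Vt[:k, :]  # estimated document matrix (k x n)
O_hat = T_hat @ D_hat               # best rank-k approximation to O
print(np.linalg.norm(O - O_hat))    # the small residual discarded as noise
```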
2.2 Using the decomposition to rank images
The two vector bases created in the decomposition form an aligned vector-space of terms and documents. The
rows of the term matrix create a basis representing a position in the space of each of the observed terms. The
columns of the document matrix represent positions of the observed documents in the space. Similar documents
and terms share similar locations in the space.
In order to query the document set for documents relevant to a term, we just need to rank all of the documents
based on their position in the space with respect to the position of the query term in the space. The cosine
similarity is a suitable measure for this task. As we are only interested in ranking the images with keywords
lexically matching the query we only need to calculate the cosines over a reduced subset of documents.
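This ranking step can be sketched as follows; the query-term position and document matrix below are invented toy values, standing in for a row of the term matrix and the columns of the document matrix respectively:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(term_pos, doc_matrix, candidates):
    """Rank candidate document indices by cosine similarity to a query term.

    term_pos:   position of the query term in the semantic space (k-vector)
    doc_matrix: k x n document matrix (columns are document positions)
    candidates: indices of documents whose keywords lexically match the query
    """
    return sorted(candidates,
                  key=lambda i: cosine(term_pos, doc_matrix[:, i]),
                  reverse=True)

# Toy 2-dimensional semantic space with four documents.
D_hat = np.array([[0.9, 0.1, 0.7, 0.2],
                  [0.1, 0.9, 0.6, 0.8]])
query_term = np.array([1.0, 0.1])     # position of the query keyword
print(rank_documents(query_term, D_hat, [0, 2, 3]))  # [0, 2, 3]
```

Only the candidate indices are scored, reflecting that cosines are computed over the reduced subset of lexically matching documents.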
Thus far, we have ignored the value of k. The rank principle states that k is such that all of the semantic structure of the observation matrix, minus the noise, is encoded in the k largest singular values and the corresponding singular vectors. k is also the number of independent, uncorrelated terms in the observation matrix. In practice, k will vary across data-sets, and so we have to estimate its optimal value empirically.
2.3 Example: Images of ‘sun’
As an example of how the technique can be applied, consider the task of looking for images that represent the
keyword “sun” in the well known Corel dataset.9 For this example, we will use a very simple 64-bin (4 × 4 × 4)
RGB color histogram to represent the visual features of the image. Each bin from the histogram will represent
a single visual term, and the number of pixels corresponding to that bin will be the number of occurrences of
the term in the observation vector. Figure 1 illustrates how the observation matrix is created.
Figure 1. Creation of an observation matrix from the Corel dataset using color histogram visual terms.
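The 64-bin histogram construction can be sketched as follows, assuming the image is held as an 8-bit H × W × 3 RGB numpy array (the tiny 'image' below is invented):

```python
import numpy as np

def rgb_histogram_terms(image):
    """Map an 8-bit RGB image to 64 visual-term counts (4 x 4 x 4 bins)."""
    q = image // 64                                    # quantise each channel to 0..3
    bins = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]  # bin index per pixel
    return np.bincount(bins.ravel(), minlength=64)

# Toy 2 x 2 'image': three reddish pixels and one dark pixel.
image = np.array([[[200, 30, 20], [210, 40, 10]],
                  [[220, 35, 25], [10, 10, 10]]], dtype=np.uint8)
counts = rgb_histogram_terms(image)
print(counts.sum())   # 4: one visual-term occurrence per pixel
```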
In the 5000 image Corel dataset, there are 111 images labeled with the keyword “sun”. Of these, there are at
least a couple of images where the keyword is not directly relevant to the image. By applying our factorization
technique to the image collection we are able to essentially learn which visual features are most closely associated
with each keyword by looking at the spatial proximity of the features and keywords in the resultant vector space.
Figure 2 attempts to illustrate the topology of such a vector space.
In the case of the Corel images and color histogram features, the keyword ‘sun’ occurs very near to the set
of red and yellow colors that often occur in images depicting the sun. The relevant images are also similarly
co-located in the space; however, they are arranged in such a way that the images with the closest visual similarity to the prototypical colors lie nearer the keyword than those images containing less of those colors.
Figure 2. Example illustration showing the topology of the vector-space created through the factorization process. The
colored spheres represent color-based image features.
Table 1. Example images of ‘sun’ from the Corel dataset after ranking by the factorization technique.
Table 1 shows some of the images retrieved together with their associated rank position for a query of ‘sun’.
The authors opine that there is a definite correlation between the relevance of the images to the query term and
the respective rank of the images. Section 3 explores this in more detail through a process of user-evaluation.
3. USER EVALUATION OF IMAGE-KEYWORD RELEVANCE
The challenge in evaluating image ranking mechanisms for keyword-based search is in obtaining the ground-
truth; an image ranking must be determined against which search systems can be evaluated. This is a problem
requiring human intelligence, that is, asking a group of users to determine which images they find most relevant
to satisfy a given query. Separate user trials on each algorithm being evaluated were considered, for example
asking users to comment on the relevance of the ranking of images returned by each system, but it was felt that
this approach would not have been scalable. For each new algorithm being tested, a new user trial would have
to be conducted. A base system, to which each algorithm could be compared, would also be required and it is
not clear which traditional ranking mechanism would be suitable to employ.
Instead, our approach involves a user-based experiment to determine suitable image rankings that can be
used to evaluate the image ranking mechanism described in Section 2. Users are asked to rank various subsets of
images. The subsets are the result of a keyword search, and the ranking is order of relevance for the respective
keyword. The aim is to create a corpus of rankings, which can also potentially be used to evaluate the ranking
performance of other image retrieval systems.