A framework for understanding Latent Semantic Indexing (LSI) performance
ABSTRACT In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) search and retrieval applications. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by LSI in the term-by-dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation for LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
A Framework for Understanding LSI Performance
April Kontostathis and William M. Pottenger
Lehigh University
19 Memorial Drive West
Bethlehem, PA 18015
apk5@lehigh.edu
billp@cse.lehigh.edu
Abstract. In this paper we present a theoretical model for understanding the performance of LSI search and retrieval
applications. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by
LSI in the term-by-dimension vectors. The framework presented here is based on term co-occurrence data. We show a
strong correlation between second order term co-occurrence and the values produced by the SVD algorithm that forms
the foundation for LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence
information.
1 Introduction
Latent Semantic Indexing (LSI) (Deerwester, et al., 1990) is a well-known text-mining algorithm. LSI has
been applied to a wide variety of learning tasks, such as search and retrieval (Deerwester, et al., 1990), classification
(Zelikovitz and Hirsh, 2001) and filtering (Dumais, 1994, 1995). LSI is a vector space approach for modeling
documents, and many have claimed that the technique brings out the ‘latent’ semantics in a collection of
documents (Deerwester, et al., 1990; Dumais, 1993).
LSI is based on a well-known mathematical technique called Singular Value Decomposition (SVD).
The algebraic foundation for LSI was first described in (Deerwester, et al., 1990)
and has been further discussed in (Berry, Dumais and O’Brien, 1995; Berry, Drmac, and Jessup, 1999).
These papers describe the SVD process and interpret the resulting matrices in a geometric context. The SVD,
truncated to k dimensions, gives the best rank-k approximation to the original matrix. In (Wiemer-Hastings,
1999), Wiemer-Hastings shows that the power of LSI comes primarily from the SVD algorithm.
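The best rank-k property mentioned above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small random matrix; the matrix, the seed and the helper name `truncated_svd` are illustrative choices and not part of the original study. It relies on the standard result that the Frobenius error of the rank-k truncation equals the root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return (T_k, s_k, D_k) such that A is approximated by T_k @ diag(s_k) @ D_k.T."""
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T

# Hypothetical 6-term by 5-document matrix, used only for illustration.
rng = np.random.default_rng(0)
A = rng.random((6, 5))
k = 2

T_k, s_k, D_k = truncated_svd(A, k)
A_k = T_k @ np.diag(s_k) @ D_k.T

# Frobenius error of the truncation versus the discarded singular values.
s_all = np.linalg.svd(A, compute_uv=False)
frobenius_error = np.linalg.norm(A - A_k)
expected_error = np.sqrt(np.sum(s_all[k:] ** 2))
```

The two error values agree up to floating-point rounding, which is the sense in which the truncated SVD is the best rank-k approximation.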
Other researchers have proposed theoretical approaches to understanding LSI. (Zha, 1998)
describes LSI in terms of a subspace model and proposes a statistical test for choosing the optimal number of
dimensions for a given collection. (Story, 1996) discusses LSI’s relationship to statistical regression and
Bayesian methods. (Ding, 1999) constructs a statistical model for LSI using the cosine similarity measure.
Although other researchers have explored the SVD algorithm to provide an understanding of SVD-based
information retrieval systems, to our knowledge, only Schütze has studied the values produced by SVD
(Schütze, 1992). We expand upon this work, showing here that SVD exploits higher order term
co-occurrence in a collection. Our work provides insight into the origin of the values in the term-term matrix.
This work provides a model for understanding LSI. Our framework is based on the concept of
term co-occurrences. Term co-occurrence data is implicitly or explicitly used for almost every advanced
application in textual data mining.
This work is the first to study the values produced in the SVD term by dimension matrix, and we
have discovered a correlation between the performance of LSI and the values in this matrix. Thus we have
discovered the basis for the claim that is frequently made for LSI: LSI emphasizes underlying semantic
distinctions (latent semantics) while reducing noise in the data. This is an important component in the
theoretical foundation for LSI.
In section 2 we present a simple example of higher order term co-occurrence in SVD. In section 3
we present our analysis of the values produced by SVD. Section 4 presents a mathematical proof of term
transitivity within SVD, previously reported in (Kontostathis and Pottenger, 2002b).
2 Co-occurrence in LSI – An Example
The data for the following example is taken from (Deerwester, et al., 1990). In that paper, the authors
describe an example with 12 terms and 9 documents. The term-document matrix is shown in table 1 and the
corresponding term-term matrix is shown in table 2.
The SVD process used by LSI decomposes the matrix into three matrices: T, a term by dimension
matrix, S, a singular value matrix, and D, a document by dimension matrix. The number of dimensions is the
rank of the term by document matrix. The original matrix can be obtained through the matrix multiplication
$TSD^T$. The reader is referred to (Deerwester, et al., 1990) for the T, S, and D matrices. In the LSI system, the
T, S and D matrices are truncated to k dimensions. The purpose of dimensionality reduction is to reduce
“noise” in the term-term matrix, resulting in a richer word relationship structure that reveals latent semantics
present in the collection. After dimensionality reduction the term-term matrix can be recomputed using the
formula $T_k S_k (T_k S_k)^T$. The term-term matrix, after reduction to 2 dimensions, is shown in table 3.
Table 1. Deerwester Term by Document Matrix
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
Survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
Table 2. Deerwester Term by Term Matrix
               human interface  computer      user    system  response      time       EPS    Survey     trees     graph    minors
human              x         1         1         0         2         0         0         1         0         0         0         0
interface          1         x         1         1         1         0         0         1         0         0         0         0
computer           1         1         x         1         1         1         1         0         1         0         0         0
user               0         1         1         x         2         2         2         1         1         0         0         0
system             2         1         1         2         x         1         1         3         1         0         0         0
response           0         0         1         2         1         x         2         0         1         0         0         0
time               0         0         1         2         1         2         x         0         1         0         0         0
EPS                1         1         0         1         3         0         0         x         0         0         0         0
Survey             0         0         1         1         1         1         1         0         x         0         1         1
trees              0         0         0         0         0         0         0         0         0         x         2         1
graph              0         0         0         0         0         0         0         0         1         2         x         2
minors             0         0         0         0         0         0         0         0         1         1         2         x
We will assume that the value in position (i,j) of the matrix represents the similarity between term i
and term j in the collection. As can be seen in table 3, user and human now have a value of .94, representing
a strong similarity, where before the value was zero. In fact, user and human is an example of second order
co-occurrence. The relationship between user and human comes from the transitive relation: user co-occurs
with interface and interface co-occurs with human.
A closer look reveals a value of 0.15 in the relationship between trees and computer. Looking at
the co-occurrence path gives us an explanation as to why these terms received a positive (although weak)
similarity value. From table 2, we see that trees co-occurs with graph, graph co-occurs with survey, and
survey co-occurs with computer. Hence the trees/computer relationship is an example of third order
co-occurrence. In the next section we present correlation data that confirms the relationship between the
term-term matrix values and the performance of LSI.
To completely understand the dynamics of the SVD process, a closer look at table 1 is warranted.
We note that the nine documents in the collection can be split into two subsets, {C1-C5} and {M1-M4}. If the
term survey did not appear in the {M1-M4} subset, the subsets would be disjoint. The data in table 4 was
developed by changing the survey/m4 entry to 0 in table 1, computing the decomposition of this new matrix,
truncating to two dimensions and deriving the associated term-term matrix.
Table 3. Deerwester Term by Term Matrix, Truncated to two dimensions
               human interface  computer      user    system  response      time       EPS    Survey     trees     graph    minors
human           0.62      0.54      0.56      0.94      1.69      0.58      0.58      0.84      0.32     -0.32     -0.34     -0.25
interface       0.54      0.48      0.52      0.87      1.50      0.55      0.55      0.73      0.35     -0.20     -0.19     -0.14
computer        0.56      0.52      0.65      1.09      1.67      0.75      0.75      0.77      0.63      0.15      0.27      0.20
user            0.94      0.87      1.09      1.81      2.79      1.25      1.25      1.28      1.04      0.23      0.42      0.31
system          1.69      1.50      1.67      2.79      4.76      1.81      1.81      2.30      1.20     -0.47     -0.39     -0.28
response        0.58      0.55      0.75      1.25      1.81      0.89      0.89      0.80      0.82      0.38      0.56      0.41
time            0.58      0.55      0.75      1.25      1.81      0.89      0.89      0.80      0.82      0.38      0.56      0.41
EPS             0.84      0.73      0.77      1.28      2.30      0.80      0.80      1.13      0.46     -0.41     -0.43     -0.31
Survey          0.32      0.35      0.63      1.04      1.20      0.82      0.82      0.46      0.96      0.88      1.17      0.85
trees          -0.32     -0.20      0.15      0.23     -0.47      0.38      0.38     -0.41      0.88      1.55      1.96      1.43
graph          -0.34     -0.19      0.27      0.42     -0.39      0.56      0.56     -0.43      1.17      1.96      2.50      1.81
minors         -0.25     -0.14      0.20      0.31     -0.28      0.41      0.41     -0.31      0.85      1.43      1.81      1.32
Table 4. Modified Deerwester Term by Term Matrix, Truncated to two dimensions
               human interface  computer      user    system  response      time       EPS    Survey     trees     graph    minors
human           0.56      0.50      0.60      1.01      1.62      0.66      0.66      0.76      0.45      0.00      0.00      0.00
interface       0.50      0.45      0.53      0.90      1.45      0.59      0.59      0.68      0.40      0.00      0.00      0.00
computer        0.60      0.53      0.64      1.08      1.74      0.71      0.71      0.81      0.48      0.00      0.00      0.00
user            1.01      0.90      1.08      1.82      2.92      1.19      1.19      1.37      0.81      0.00      0.00      0.00
system          1.62      1.45      1.74      2.92      4.70      1.91      1.91      2.20      1.30      0.00      0.00      0.00
response        0.66      0.59      0.71      1.19      1.91      0.78      0.78      0.90      0.53      0.00      0.00      0.00
time            0.66      0.59      0.71      1.19      1.91      0.78      0.78      0.90      0.53      0.00      0.00      0.00
EPS             0.76      0.68      0.81      1.37      2.20      0.90      0.90      1.03      0.61      0.00      0.00      0.00
Survey          0.45      0.40      0.48      0.81      1.30      0.53      0.53      0.61      0.36      0.00      0.00      0.00
trees           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      2.05      2.37      1.65
graph           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      2.37      2.74      1.91
minors          0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.65      1.91      1.33
Notice the segregation between the terms in table 4; all values between the {trees, graph, minors} subset and the rest of
the terms have been reduced to zero. In section 4 we prove a theorem that explains this phenomenon,
showing, in all cases, that if there is no connectivity path between two terms, the resultant value in the
term-term matrix must be zero.
3 Analysis of the LSI Values
In this section we expand upon the work in (Kontostathis and Pottenger, 2002b). The results of our analysis
show a strong correlation between the values produced by the SVD process and higher order term
co-occurrences. In the conclusion we discuss the practical applications of this analytical study.
We chose six collections for our study of the values produced by SVD. These collections are
described in table 5. These collections were chosen because they have query and relevance judgment sets
that are readily available.
Table 5. Characteristics of collections used in study
Identifier  Description                                No. of Docs  No. of Terms  No. Queries  Optimal k  Values of k used in study
MED         Medical Abstracts                                 1033          5831           30         40  10,25,40,75,100,125,150,200
CISI        Information Science Abstracts                     1460          5143           76         40  10,25,40,75,100,125,150,200
CACM        Communications of the ACM Abstracts               3204          4863           52         70  10,25,50,70,100,125,150,200
CRAN        Cranfield Collection                              1398          3931          225         50  10,25,50,75,100,125,150,200
LISA        Library and Information Science Abstracts         6004         18429           35        165  10,50,100,150,165,200,300,500
NPL         Larger collection of very short documents        11429          6988           93        200  10,50,100,150,200,300,400,500
Figure 2. LSI Performance for LISA and NPL (F-measure, beta = 1, plotted against k from 10 to 500)
The Parallel General Text Parser (PGTP) (Martin and Berry, 2001) was used to preprocess the text
data, including creation and decomposition of the term-document matrix. For our experiments, we applied the
log-entropy weighting option and normalized the document vectors.
We were interested in the distribution of values for both optimal and sub-optimal parameters for
each collection. In order to identify the most effective k (dimension truncation parameter) for LSI, we used
the F-measure, a combination of precision and recall (van Rijsbergen, 1979), as a determining factor. In our
experiments we used beta=1 for the F-measure parameter. We explored possible values from k=10,
incrementing by 5, up to k=200 for the smaller collections; values up to k=500 were used for the LISA and
NPL collections. For each value of k, precision and recall averages were identified for each rank from 10 to
100 (incrementing by 10), and the resulting F-measure was calculated. The results of these runs for selected
values of k are summarized in figures 2 and 3. Clearly MED has the best performance overall. To choose the
optimal k, we selected the smallest value that provided substantially similar performance to larger k values.
For example, in the LISA collection k=165 was chosen as the optimum because the k values higher than 165
provide only slightly better performance. A smaller k is preferable to reduce the computational overhead of
both the decomposition and the search and retrieval processing. The optimal k was identified for each
collection and is shown in table 5.
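The selection procedure above can be sketched as follows. This is an illustration only: the F-measure uses the standard formula with beta weighting, while the function name `smallest_adequate_k` and its numeric `tolerance` are assumptions standing in for the judgment of "substantially similar performance" described in the text. The scores are the NPL F-measure values reported in table 6.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 weights them equally)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def smallest_adequate_k(scores_by_k, tolerance):
    """Smallest k whose score is within `tolerance` of the best score.

    `tolerance` is a hypothetical stand-in for "substantially similar
    performance"; the original study does not state a numeric threshold.
    """
    best = max(scores_by_k.values())
    for k in sorted(scores_by_k):
        # Small epsilon guards against floating-point rounding at the threshold.
        if scores_by_k[k] >= best - tolerance - 1e-12:
            return k

# NPL F-measure values by k, as reported in table 6.
npl_scores = {10: 0.08, 50: 0.13, 100: 0.15, 150: 0.15,
              200: 0.16, 300: 0.17, 400: 0.17, 500: 0.18}
chosen_k = smallest_adequate_k(npl_scores, tolerance=0.02)
```

With a tolerance of 0.02 the sketch selects k=200 for NPL, which matches the optimal k listed in table 5; a different tolerance would select a different k.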
Figure 3. LSI Performance for Smaller Collections (F-measure, beta = 1, plotted against k from 10 to 200 for MED, CRAN, CISI and CACM)
3.1 Methodology
The algorithm used to collect the co-occurrence data appears in figure 4. After we compute the SVD using
the original term by document matrix, we calculate term-term similarities. LSI provides two natural methods
for describing term-term similarity. First, the term-term matrix can be created using $T_k S_k (T_k S_k)^T$. This
approach results in values such as those shown in table 3. Second, the term by dimension ($T_k S_k$) matrix can
be used to compare terms using a vector distance measure, such as cosine similarity. In this case, the cosine
similarity is computed for each pair of rows in the $T_k S_k$ matrix. The computation results in a value in the
range [-1, 1] for each pair of terms (i,j).
Figure 4. Algorithm for data collection
Create the term by document matrix
Compute the SVD for the matrix
For each pair of terms (i,j) in the collection
    Compute the term-term matrix value for the (i,j) element after truncation to k dimensions
    Compute the cosine similarity value for the (i,j) element after truncation to k dimensions
    Determine the ‘order of co-occurrence’
        If term i and term j appear in same document (co-occur), ‘order of co-occurrence’ = 1
        If term i and term j do not appear in same document, but i co-occurs with m,
        and m co-occurs with j, then ‘order of co-occurrence’ = 2
        (Higher orders of co-occurrence are computed in a similar fashion by induction on the number of
        intermediate terms)
Summarize the data by range of values and order of co-occurrence
After the term similarities are created, we need to determine the order of co-occurrence for each
pair of terms. The order of co-occurrence is computed by tracing the co-occurrence paths. In figure 5 we
present an example of this process. In this small collection, terms A and B appear in document D1, terms B
and C appear in document D2, and terms C and D occur in document D3. If each term is considered a node in
a graph, arcs can be drawn between the terms that appear in the same document. Now we can assign order of
co-occurrence as follows: nodes that are connected are considered first order pairs, nodes that can be reached
with one intermediate hop are second order co-occurrences, nodes that can be reached with two intermediate
hops are third order pairs, etc. In general the order of co-occurrence is n + 1, where n is the number of
intermediate hops needed to connect the nodes in the graph. Note that some term pairs may not have a
connecting path; the LSI term by term matrix for this situation is shown in table 4. The term by term entry
will always be zero for terms that do not have a connectivity path.
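The two per-pair computations in the procedure above can be sketched in Python. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, the order of co-occurrence is found with a breadth-first search over the term graph (graph distance in arcs equals the order), and the cosine helper assumes NumPy and a term by dimension matrix supplied by the caller.

```python
from collections import deque

import numpy as np

def cooccurrence_orders(documents):
    """Map each connected term pair to its order of co-occurrence.

    `documents` is an iterable of term sets. Pairs with no connecting
    path are simply absent from the result.
    """
    neighbours = {}
    for doc in documents:
        for term in doc:
            neighbours.setdefault(term, set()).update(t for t in doc if t != term)

    orders = {}
    for start in neighbours:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            cur = queue.popleft()
            for nxt in neighbours[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for term, d in dist.items():
            if term != start:
                orders[frozenset((start, term))] = d
    return orders

def cosine_similarities(TS):
    """Cosine similarity between every pair of rows of a term by dimension matrix."""
    norms = np.linalg.norm(TS, axis=1, keepdims=True)
    safe = np.where(norms == 0, 1.0, norms)   # leave all-zero rows as zero vectors
    unit = TS / safe
    return unit @ unit.T

# The small collection from the text: D1 = {A, B}, D2 = {B, C}, D3 = {C, D}.
docs = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
orders = cooccurrence_orders(docs)

# Hypothetical 3-term, 2-dimension matrix, used only to exercise the helper.
cos = cosine_similarities(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

For the example collection the sketch assigns order 1 to {A,B}, order 2 to {A,C} and order 3 to {A,D}, in line with the description above.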
Figure 5. Tracing order of co-occurrence
Documents: D1 = {A, B}, D2 = {B, C}, D3 = {C, D}; term graph: A - B - C - D
1st order term co-occurrence: {A,B}, {B,C}, {C,D}
2nd order term co-occurrence: {A,C}, {B,D}
3rd order term co-occurrence: {A,D}
3.2 Results
The order of co-occurrence summary for the NPL collection is shown in table 6. The values are expressed as
a percentage of the total number of pairs of first, second and third order co-occurrences for each collection.
The values in table 6 represent the distribution using the cosine similarity. LSI performance is also shown.
Table 7a shows the correlation coefficient for all collections. There is a strong correlation between
the percentage of second order negative values and LSI performance for all collections, with the correlations
for MED appearing slightly weaker than the other collections. There also appears to be a strong inverse
correlation between the positive second and third order values and the performance of LSI. In general the
values for each order of co-occurrence/value pair appear to be consistent across all collections, with the
exception of the third order negative values for CACM.
Table 6. Distribution summary by sign and order of co-occurrence for NPL
Cosine range       k=10    k=50   k=100   k=150   k=200   k=300   k=400   k=500
1st Order
[-1.0,-.01]        1.4%    2.5%    3.0%    3.1%    3.1%    2.8%    2.2%    1.6%
(-.01,.01)         0.5%    1.7%    2.9%    4.1%    5.0%    6.7%    8.1%    9.4%
[.01,1.0]         98.1%   95.7%   94.1%   92.8%   91.8%   90.5%   89.7%   89.0%
2nd Order
[-1.0,-.01]       14.0%   24.6%   28.2%   30.1%   31.2%   32.1%   32.2%   31.8%
(-.01,.01)         2.4%    6.7%    9.9%   12.4%   14.7%   18.9%   22.7%   26.4%
[.01,1.0]         83.6%   68.6%   61.9%   57.5%   54.2%   49.0%   45.1%   41.7%
3rd Order
[-1.0,-.01]       44.6%   62.7%   66.9%   69.3%   70.0%   69.7%   68.3%   66.6%
(-.01,.01)         3.9%    8.8%   11.6%   13.5%   15.1%   18.1%   21.0%   23.6%
[.01,1.0]         51.5%   28.5%   21.6%   17.2%   14.9%   12.2%   10.7%    9.8%
LSI Performance (Beta=1)
F-measure          0.08    0.13    0.15    0.15    0.16    0.17    0.17    0.18
Table 7a and 7b. Within Collection Correlation data (coefficients shown as magnitudes)
Table 7a. Correlation Data for Cosine Similarity (Correlation Coefficients)
Collection  Range           1st    2nd    3rd
NPL         [-1.0,-.01]    0.38   0.99   0.92
            (-.01,.01)     0.88   0.89   0.93
            [.01,1.0]      0.95   0.98   1.00
LISA        [-1.0,-.01]    0.55   0.99   0.99
            (-.01,.01)     0.79   0.77   0.84
            [.01,1.0]      0.93   0.92   0.97
CISI        [-1.0,-.01]    0.91   0.94   0.97
            (-.01,.01)     0.77   0.76   0.78
            [.01,1.0]      0.88   0.88   0.96
CRAN        [-1.0,-.01]    0.79   0.93   0.99
            (-.01,.01)     0.82   0.78   0.77
            [.01,1.0]      0.92   0.89   0.96
MED         [-1.0,-.01]    0.71   0.76   0.80
            (-.01,.01)     0.52   0.48   0.52
            [.01,1.0]      0.68   0.66   0.74
CACM        [-1.0,-.01]    0.99   0.98   0.21
            (-.01,.01)     0.92   0.93   0.93
            [.01,1.0]      0.96   0.96   0.94
Table 7b. Correlation Data for Term-Term Similarity (Correlation Coefficients)
Collection  Range            1st    2nd    3rd
NPL         (-9999,-.001]   0.90   0.99   0.99
            (-.001,.001)    0.87   0.99   0.99
            [.001,9999)     0.90   0.99   0.99
LISA        (-9999,-.001]   0.90   0.95   0.99
            (-.001,.001)    0.89   0.95   1.00
            [.001,9999)     0.89   0.95   0.97
CISI        (-9999,-.001]   0.90   0.75   0.59
            (-.001,.001)    0.86   0.73   0.59
            [.001,9999)     0.81   0.87   0.46
CRAN        (-9999,-.001]   0.89   0.76   0.72
            (-.001,.001)    0.87   0.77   0.72
            [.001,9999)     0.84   0.82   0.68
MED         (-9999,-.001]   0.86   0.88   0.82
            (-.001,.001)    0.96   0.88   0.82
            [.001,9999)     0.96   0.86   0.87
CACM        (-9999,-.001]   0.98   0.91   0.93
            (-.001,.001)    0.95   0.88   0.92
            [.001,9999)     0.94   0.62   0.90
The corresponding data using the term-term similarity as opposed to the cosine similarity is shown
in table 7b. In this data we observe consistent correlations for negative and zero values across all collections,
but there are major variances in the correlations for the positive values.
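As an illustration of how a within-collection correlation coefficient of this kind can be computed, the sketch below applies the Pearson correlation to the NPL second order percentages and F-measure values listed in table 6. It assumes NumPy and assumes that the reported coefficients are Pearson correlations; because the inputs are the rounded values printed in the table, the result only approximates the published figures.

```python
import numpy as np

# NPL values from table 6, one entry per k in (10, 50, 100, 150, 200, 300, 400, 500).
second_order_negative = [14.0, 24.6, 28.2, 30.1, 31.2, 32.1, 32.2, 31.8]
second_order_positive = [83.6, 68.6, 61.9, 57.5, 54.2, 49.0, 45.1, 41.7]
f_scores = [0.08, 0.13, 0.15, 0.15, 0.16, 0.17, 0.17, 0.18]

# Pearson correlation between each distribution row and LSI performance.
r_negative = np.corrcoef(second_order_negative, f_scores)[0, 1]
r_positive = np.corrcoef(second_order_positive, f_scores)[0, 1]
```

The coefficient for the negative range is strongly positive and the coefficient for the positive range is strongly negative, consistent with the direct and inverse correlations described in the results.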
Table 8a and 8b. Correlation data by value distribution only (coefficients shown as magnitudes)
Table 8a. Correlation Coefficients for Cosine Similarity
Similarity       NPL   LISA   CISI   CRAN    MED   CACM
(-.3,-.2]       0.74   0.54   0.09   0.28   0.39   0.76
(-.2,-.1]       0.97   0.96   0.86   0.89   0.92   0.82
(-.1,-.01]      0.78   0.77   0.78   0.76   0.81   0.75
(-.01,.01)      0.98   0.98   0.88   0.90   0.93   0.83
[.01,.1]        0.36   0.14   0.25   0.29   0.21   0.91
(.1,.2]         0.85   0.81   0.59   0.64   0.77   0.36
(.2,.3]         0.98   0.99   0.85   0.90   0.92   0.17
(.3,.4]         0.99   0.99   0.97   0.99   0.98   0.48
(.4,.5]         0.98   0.97   1.00   1.00   1.00   0.69
(.5,.6]         0.96   0.96   0.98   0.97   0.99   0.87
Table 8b. Correlation Coefficients for Term-Term Similarity
Similarity       NPL   LISA   CISI   CRAN    MED   CACM
(-.02,-.01]     0.87   0.71   0.73   0.75   0.47   0.92
(-.01,-.001]    1.00   0.96   0.73   0.76   0.40   0.91
(-.001,.001)    0.99   0.95   0.73   0.74   0.41   0.89
[.001,.01]      0.99   0.95   0.66   0.93   0.41   0.22
(.01,.02]       0.35   0.93   0.72   0.82   0.50   0.92
(.02,.03]       0.52   0.79   0.69   0.80   0.44   0.93
(.03,.04]       0.95   0.72   0.73   0.78   0.44   0.93
(.04,.05]       0.87   0.69   0.71   0.74   0.45   0.94
(.05,.06]       0.86   0.69   0.72   0.75   0.46   0.94
(.06,.07]       0.84   0.68   0.73   0.78   0.49   0.95
(.07,.08]       0.82   0.68   0.66   0.75   0.51   0.96
(.08,.09]       0.83   0.70   0.69   0.80   0.52   0.95
(.09,.1)        0.84   0.71   0.69   0.77   0.55   0.96
[.1,9999]       0.87   0.81   0.73   0.84   0.64   0.97
Table 8a shows the values when the correlation coefficient is computed for selected ranges of the
cosine similarity, without taking order of co-occurrence into account. Again we note strong correlations for
all collections for value ranges (-.2,-.1], (-.1,-.01] and (-.01,.01).
Table 8b shows the values when the correlation coefficient is computed for selected ranges of the
term-term similarity, without taking order of co-occurrence into account. These results are more difficult to
interpret. We see some similarity in the (-.2,-.1], (-.1,-.01] and (-.01,.01) ranges for all collections except
MED. The positive values do not lend weight to any conclusion. NPL and CACM show strong correlations
for some ranges, while the other collections report weaker correlations.
Our next step was to determine if these correlations existed when the distributions and LSI
performance were compared across collections. Two studies were done, one holding k constant at k=100 and
the second using the optimal k (identified in table 5) for each collection. Once again we looked at both the
cosine and the term-term similarities. Table 9 shows the value distribution for the cosine similarity for k=100.
The correlation coefficients for the cross collection studies are shown in table 10. Note the correlation
between the second order negative and zero values and the LSI performance, when k=100 is used. These
correlations are not as strong as the correlations obtained when comparing different values of k within a single
collection, but finding any similarity across these widely disparate collections is noteworthy. The cross
collection correlation coefficients for optimal k values (as defined in table 5) are also shown in table 10.
There is little evidence that the distribution of values has an impact on determining the optimal value of k, but
there is a correlation between the distribution of cosine similarity values and the retrieval performance at
k=100.
Table 9. Cross collection distribution by sign and order of co-occurrence, cosine similarity, k=100
Cosine range       CACM     MED    CISI    CRAN    LISA     NPL
1st Order
[-1.0,-.01]        1.8%    1.9%    2.6%    2.5%    2.3%    3.0%
(-.01,.01)         1.9%    1.9%    2.3%    2.6%    2.1%    2.9%
[.01,1.0]         96.3%   96.2%   95.0%   95.0%   95.6%   94.1%
2nd Order
[-1.0,-.01]       21.3%   35.0%   31.7%   31.2%   28.7%   28.2%
(-.01,.01)         7.8%   11.4%    9.2%   10.6%    8.5%    9.9%
[.01,1.0]         71.0%   53.6%   59.1%   58.2%   62.8%   61.9%
3rd Order
[-1.0,-.01]       55.6%   75.0%   77.3%   72.8%   69.9%   66.9%
(-.01,.01)        17.3%    9.9%    8.7%   12.1%   10.3%   11.6%
[.01,1.0]         27.1%   15.1%   14.0%   15.2%   19.9%   21.6%
LSI Performance (Beta=1)
F-measure          0.13    0.56    0.23    0.14    0.20    0.16
Table 10a and 10b. Cross collection correlation coefficients (coefficients shown as magnitudes)
Cosine Similarity
Study              Range           1st    2nd    3rd
Cosine, K=100      [-1.0,-.01]    0.40   0.68   0.49
                   (-.01,.01)     0.47   0.63   0.45
                   [.01,1.0]      0.44   0.69   0.48
Cosine, K=Optimal  [-1.0,-.01]    0.36   0.32   0.23
                   (-.01,.01)     0.35   0.17   0.34
                   [.01,1.0]      0.36   0.12   0.02
Term-Term Similarity
Study                 Range            1st    2nd    3rd
TermTerm, K=100       (-9999,-.001]   0.53   0.24   0.29
                      (-.001,.001)    0.21   0.36   0.32
                      [.001,9999)     0.04   0.43   0.38
TermTerm, K=Optimal   (-9999,-.001]   0.43   0.29   0.31
                      (-.001,.001)    0.48   0.36   0.31
                      [.001,9999)     0.49   0.44   0.32
3.3 Discussion
Our results show strong correlations between higher orders of co-occurrence in the SVD algorithm and the
performance of LSI, a search and retrieval algorithm, particularly when the cosine similarity metric is used.
Higher order co-occurrences play a key role in the effectiveness of many systems used for text mining. We
detour briefly to describe recent applications that are implicitly or explicitly using higher orders of
co-occurrence to improve performance in applications such as Search and Retrieval, Word Sense
Disambiguation, Stemming, Keyword Classification and Word Selection.
Philip Edmonds shows the benefits of using second and third order co-occurrence in (Edmonds,
1997). The application described selects the most appropriate term when a context (such as a sentence) is
provided. Experimental results show that the use of second order co-occurrence significantly improved the
precision of the system. Use of third order co-occurrence resulted in incremental improvements beyond
second order co-occurrence.
Zhang, et al. explicitly used second order term co-occurrence to improve an LSI-based search and
retrieval application (Zhang, Berry and Raghavan, 2000). Their approach narrows the term and document
space, reducing the size of the matrix that is input into the LSI system. The system selects terms and
documents for the reduced space by first selecting all the documents that contain the terms in the query, then
selecting all terms in those documents, and finally selecting all documents that contain the expanded list of
terms. This approach reduces the nonzero entries in the term-document matrix by an average of 27%.
Unfortunately average precision was also degraded. However, when terms associated with only one document
were removed from the reduced space, the number of nonzero entries was reduced by 65% when compared
to the baseline, and precision degradation was only 5%.
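The three selection steps described above can be sketched as set operations. This is an illustration of the description in the text only, not the implementation of Zhang, Berry and Raghavan; the function name `reduced_space`, the dictionary layout and the toy collection are assumptions, and the additional step of removing terms that occur in a single document is omitted.

```python
def reduced_space(documents, query_terms):
    """Select the terms and documents of the reduced space for one query.

    `documents` maps a document identifier to the set of terms it contains.
    """
    query_terms = set(query_terms)
    # Step 1: documents that contain at least one query term.
    seed_docs = {d for d, terms in documents.items() if terms & query_terms}
    # Step 2: all terms that appear in those documents.
    expanded_terms = set().union(*(documents[d] for d in seed_docs))
    # Step 3: all documents that contain any term from the expanded list.
    selected_docs = {d for d, terms in documents.items() if terms & expanded_terms}
    return expanded_terms, selected_docs

# Hypothetical toy collection, used only for illustration.
collection = {
    "D1": {"A", "B"},
    "D2": {"B", "C"},
    "D3": {"C", "D"},
    "D4": {"E"},
}
terms_kept, docs_kept = reduced_space(collection, {"A"})
```

In the toy collection a query on A pulls in B through D1 and then D2 through B, so terms and documents one co-occurrence step beyond the query enter the reduced space, while D3 and D4 are left out.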
Hinrich Schütze explicitly uses second order co-occurrence in his paper on Automatic Word Sense
Disambiguation (Schütze, 1998). In this paper, Schütze presents an algorithm for discriminating the senses
of a given term. For example, the word senses in the previous sentence can mean the physical senses (sight,
hearing, etc.) or it can mean ‘a meaning conveyed by speech or writing.’ Clearly the latter is a better
definition of this use of senses, but automated systems based solely on keyword analysis would return this
sentence to a query that asked about the sense of smell. The paper presents an algorithm based on use of
second-order co-occurrence of the terms in the training set to create context vectors that represent a specific
sense of a word to be discriminated.
Xu and Croft introduce the use of co-occurrence data to improve stemming algorithms in (Xu and
Croft, 1998). The premise of the system described in this paper is to use contextual (e.g., co-occurrence)
information to improve the equivalence classes produced by an aggressive stemmer, such as the Porter
stemmer. The algorithm breaks down one large class for a family of terms into small contextually based
equivalence classes. Smaller, more tightly connected equivalence classes result in more effective retrieval (in
terms of precision and recall), as well as improved run-time performance (since fewer terms are added to the
query). Xu and Croft’s algorithm implicitly uses higher orders of co-occurrence. A strong correlation
between terms A and B, and also between terms B and C, will result in the placement of terms A, B, and C into
the same equivalence class. The result will be a transitive semantic relationship between A and C. Orders of
co-occurrence higher than two are also possible in this application.
In this section we have empirically demonstrated the relationship between higher orders of
co-occurrence in the SVD algorithm and the performance of LSI. Thus we have provided a model for
understanding the performance of LSI by showing that second-order co-occurrence plays a critical role. In the
conclusion we describe the applicability of this result to applications in information retrieval.
4 Transitivity and the SVD
In this section we present mathematical proof that the SVD algorithm encapsulates term cooccurrence
information. Specifically we show that a connectivity path exists for every nonzero element in the truncated
matrix. This proof was first presented in (Kontostathis and Pottenger, 2002b) and is repeated here.
We begin by setting up some notation. Let $A$ be a term by document matrix. The SVD process
decomposes $A$ into three matrices: a term by dimension matrix, $T$, a diagonal matrix of singular values, $S$,
and a document by dimension matrix, $D$. The original matrix is re-formed by multiplying the components,
$A = T S D^T$. When the components are truncated to $k$ dimensions, a reduced representation matrix, $A_k$, is
formed as $A_k = T_k S_k D_k^T$ (Deerwester et al., 1990).
The term-term cooccurrence matrices for the full matrix and the truncated matrix are (Deerwester
et al., 1990):
$$B = T S S T^T \tag{1}$$
$$Y = T_k S_k S_k T_k^T \tag{2}$$
We note that elements of $B$ represent term cooccurrences in the collection, and $b_{ij} \ge 0$ for all $i$
and $j$. If term $i$ and term $j$ cooccur in any document in the collection, $b_{ij} > 0$. Matrix multiplication results in
equations (3) and (4) for the $ij$th element of the cooccurrence matrix and the truncated matrix, respectively. Here
$u_{ip}$ is the element in row $i$ and column $p$ of the matrix $T$, and $s_p$ is the $p$th largest singular value.
$$b_{ij} = \sum_{p=1}^{m} s_p^2 u_{ip} u_{jp} \tag{3}$$
$$y_{ij} = \sum_{p=1}^{k} s_p^2 u_{ip} u_{jp} \tag{4}$$
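Equations (1) through (4) can be checked numerically. The sketch below uses NumPy and a small hypothetical term by document matrix; it assumes the cooccurrence matrix is computed as $A A^T$, which equals $T S S T^T$ because the columns of $D$ are orthonormal. The matrix values and the choice $k = 2$ are illustrative only.

```python
import numpy as np

# Hypothetical 4-term by 3-document count matrix (rows are terms).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Thin SVD: A = T diag(s) D^T, singular values in decreasing order.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

B = A @ A.T                               # term-term cooccurrence matrix
B_from_svd = T @ np.diag(s ** 2) @ T.T    # equation (1)

k = 2
Tk, sk = T[:, :k], s[:k]
Y = Tk @ np.diag(sk ** 2) @ Tk.T          # equation (2)

# Element-wise forms for one pair of terms.
i, j = 0, 2
b_ij = sum(s[p] ** 2 * T[i, p] * T[j, p] for p in range(len(s)))  # equation (3)
y_ij = sum(s[p] ** 2 * T[i, p] * T[j, p] for p in range(k))       # equation (4)

print(np.allclose(B, B_from_svd), np.isclose(b_ij, B[i, j]), np.isclose(y_ij, Y[i, j]))
```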
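$B^2$ can be represented in terms of $T$ and $S$. Because the columns of $T$ are orthonormal, $T^T T = I$, so:
$$B^2 = (T S S T^T)(T S S T^T) = T S S (T^T T) S S T^T = T S S S S T^T = T S^4 T^T \tag{5}$$
An inductive proof can be used to show:
$$B^n = T S^{2n} T^T \tag{6}$$
And the element $b_{ij}^{n}$, the $ij$th element of $B^n$, can be written:
$$b_{ij}^{n} = \sum_{p=1}^{m} s_p^{2n} u_{ip} u_{jp} \tag{7}$$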
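As a sanity check on equations (6) and (7), the following sketch compares a directly computed matrix power with the SVD form. The matrix is a hypothetical example in which terms 0 and 3 never share a document but are linked through term 2, so the corresponding entry is zero in $B$ and positive in higher powers.

```python
import numpy as np

# Hypothetical 4-term by 3-document matrix; terms 0 and 3 never share a document.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

T, s, _ = np.linalg.svd(A, full_matrices=False)
B = A @ A.T

n = 3
B_n = np.linalg.matrix_power(B, n)                 # B multiplied by itself n times
B_n_from_svd = T @ np.diag(s ** (2 * n)) @ T.T     # equation (6)

i, j = 0, 3
b_ij_n = sum(s[p] ** (2 * n) * T[i, p] * T[j, p] for p in range(len(s)))  # equation (7)

print(B[i, j], B_n[i, j], np.isclose(b_ij_n, B_n[i, j]))
```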
To complete our argument, we need two lemmas related to the powers of the matrix B.
LEMMA 1: Let $i$ and $j$ be terms in a collection. There is a transitivity path of order $\le n$ between
the terms iff the $ij$th element of $B^n$ is nonzero.
LEMMA 2: If there is no transitivity path between terms $i$ and $j$, then the $ij$th element of $B^n$ ($b_{ij}^{n}$) is
zero for all $n$.
The proof of these lemmas can be found in (Kontostathis and Pottenger, 2002a). We are now
ready to present our theorem.
THEOREM 1: If the ijth element of the truncated term by term matrix, Y, is nonzero, then there is
a transitivity path between term i and term j.
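Before the proof, the statement of Theorem 1 can be illustrated numerically. The sketch below is an assumption-laden example, not part of the original study: it builds a hypothetical collection with two unrelated topics, treats nonzero entries of $B$ as graph edges, and uses a breadth-first search (the helper `has_path` is introduced here for illustration) to test for a transitivity path. The tolerance `1e-10` stands in for exact zero in floating point.

```python
import numpy as np
from collections import deque

# Hypothetical collection with two unrelated topics: terms 0-2 occur only in
# documents 0-1, terms 3-4 only in documents 2-3.
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
B = A @ A.T


def has_path(B, i, j):
    """Breadth-first search over the graph whose edges are nonzero entries of B."""
    seen, queue = {i}, deque([i])
    while queue:
        q = queue.popleft()
        if q == j:
            return True
        for r in map(int, np.flatnonzero(B[q])):
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return False


T, s, _ = np.linalg.svd(A, full_matrices=False)
k = 2
Y = T[:, :k] @ np.diag(s[:k] ** 2) @ T[:, :k].T

# Pairs that would contradict Theorem 1: nonzero y_ij without a path.
n_terms = A.shape[0]
violations = [(i, j) for i in range(n_terms) for j in range(n_terms)
              if abs(Y[i, j]) > 1e-10 and not has_path(B, i, j)]
print(violations)
```

In this example terms 1 and 2 never cooccur directly ($b_{12} = 0$) yet receive a nonzero value in $Y$ through their shared neighbour, term 0, while pairs drawn from different topics stay at zero.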
We need to show that if $y_{ij} \ne 0$, then there exist terms $q_1, \ldots, q_n$, $n \ge 0$, such that $b_{i q_1} \ne 0$,
$b_{q_1 q_2} \ne 0$, $\ldots$, $b_{q_n j} \ne 0$. Equivalently, we can show the contrapositive: if there is no path between
terms $i$ and $j$, then $y_{ij} = 0$ for all $k$.
Assume the $T$ and $S$ matrices have been truncated to $k$ dimensions and the resulting $Y$ matrix has
been formed. Furthermore, assume there is no path between term $i$ and term $j$. Equation (4) represents the $y_{ij}$
element. Assume that $s_1 > s_2 > s_3 > \ldots > s_k > 0$. By Lemma 2, $b_{ij}^{n} = 0$ for all $n$. Dividing (7) by $s_1^{2n}$, we
conclude that
$$\frac{b_{ij}^{n}}{s_1^{2n}} = u_{i1} u_{j1} + \sum_{p=2}^{m} \frac{s_p^{2n}}{s_1^{2n}} u_{ip} u_{jp} = 0 \tag{8}$$
We take the limit of this equation as $n \to \infty$, and note that $(s_p/s_1) < 1$ when $2 \le p \le m$. Then as
$n \to \infty$, $(s_p/s_1)^{2n} \to 0$ and the summation term reduces to zero. We conclude that $u_{i1} u_{j1} = 0$. Substituting back
into (7) we have:
$$\sum_{p=2}^{m} s_p^{2n} u_{ip} u_{jp} = 0 \tag{9}$$
Dividing by $s_2^{2n}$ yields:
$$u_{i2} u_{j2} + \sum_{p=3}^{m} \frac{s_p^{2n}}{s_2^{2n}} u_{ip} u_{jp} = 0. \tag{10}$$
Taking the limit as $n \to \infty$, we have that $u_{i2} u_{j2} = 0$. If we apply the same argument $k$ times we will
obtain $u_{ip} u_{jp} = 0$ for all $p$ such that $1 \le p \le k$. Substituting back into (4) shows that $y_{ij} = 0$ for all $k$.
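The limit step in this argument relies on the leading term dominating once (7) is divided by $s_1^{2n}$. The sketch below illustrates that behaviour on a hypothetical matrix with distinct singular values: the scaled power $B^n / s_1^{2n}$ approaches the matrix whose $ij$th entry is $u_{i1} u_{j1}$. The matrix and the values of $n$ are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 4-term by 3-document matrix with distinct singular values.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

T, s, _ = np.linalg.svd(A, full_matrices=False)
B = A @ A.T

u1 = T[:, 0]
limit = np.outer(u1, u1)          # entry (i, j) is u_i1 * u_j1

gaps = {}
for n in (1, 5, 40):
    # Scaling B before taking the power gives B^n / s_1^(2n) without overflow.
    scaled = np.linalg.matrix_power(B / s[0] ** 2, n)
    gaps[n] = float(np.max(np.abs(scaled - limit)))
    print(n, gaps[n])
```

The gap shrinks geometrically in $n$ because every remaining term carries a factor $(s_p/s_1)^{2n}$ with $s_p < s_1$. In the proof the left-hand side is identically zero, so the surviving term $u_{i1} u_{j1}$ must itself be zero.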
The argument thus far depends on our assumption that $s_1 > s_2 > s_3 > \ldots > s_k$. When using SVD it is
customary to truncate the matrices by removing all dimensions whose singular value is below a given
threshold (Dumais, 1993); however, for our discussion, we will merely assume that, if
$s_1 > s_2 > \ldots > s_{z-1} > s_z = s_{z+1} = s_{z+2} = \ldots = s_{z+w} > s_{z+w+1} > \ldots > s_k$ for some $z$ and some $w \ge 1$,
the truncation will either remove all of the dimensions with the duplicate singular value, or keep all of the
dimensions with this value.
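A small constructed example suggests why this all-or-nothing assumption matters. When singular values are tied, the corresponding columns of $T$ are not unique, and keeping only part of the tied group can produce a nonzero $y_{ij}$ for terms with no transitivity path. The matrix, the 45 degree rotation, and the helper `truncated_Y` are illustrative choices, not taken from the original study.

```python
import numpy as np

# Two terms that never cooccur: each appears alone in its own document.
A = np.eye(2)                 # both singular values equal 1
B = A @ A.T                   # b_01 = 0, so there is no transitivity path

# With tied singular values T is not unique. Any rotation of the tied
# subspace is equally valid; 45 degrees is picked here for illustration.
c = np.sqrt(0.5)
T = np.array([[c, -c],
              [c,  c]])
s = np.array([1.0, 1.0])
valid = np.allclose(T @ np.diag(s ** 2) @ T.T, B)   # still reproduces B


def truncated_Y(k):
    """Truncated term-term matrix Y built from the first k dimensions."""
    return T[:, :k] @ np.diag(s[:k] ** 2) @ T[:, :k].T


y_split = truncated_Y(1)[0, 1]   # keeps only one of the tied dimensions
y_whole = truncated_Y(2)[0, 1]   # keeps both tied dimensions
print(valid, y_split, y_whole)
```

Splitting the tied group gives $y_{01} = 0.5$ even though the two terms are unconnected, while keeping the whole group gives $y_{01} = 0$. The individual products $u_{ip} u_{jp}$ need not vanish inside a tied group; only their sum does, which is what the second case of the proof establishes.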
We need to examine two cases. In the first case, $z > k$ and the $z \ldots z+w$ dimensions have been
truncated. In this case, the above argument shows that either $u_{iq} = 0$ or $u_{jq} = 0$ for all $q \le k$ and,
therefore, $y_{ij} = 0$.
To handle the second case, we assume that $z < k$ and the $z \ldots z+w$ dimensions have not been
truncated and rewrite equation (7) as:
$$b_{ij}^{n} = \sum_{p=1}^{z-1} s_p^{2n} u_{ip} u_{jp} + \sum_{p=z}^{z+w} s_p^{2n} u_{ip} u_{jp} + \sum_{p=z+w+1}^{m} s_p^{2n} u_{ip} u_{jp} = 0 \tag{11}$$
The argument above can be used to show that $u_{ip} u_{jp} = 0$ for $p \le z-1$, and the first summation can
be removed. Because $s_p = s_z$ for $z \le p \le z+w$, dividing the remainder of the equation by $s_z^{2n}$ gives:
$$\frac{b_{ij}^{n}}{s_z^{2n}} = \sum_{p=z}^{z+w} u_{ip} u_{jp} + \sum_{p=z+w+1}^{m} \frac{s_p^{2n}}{s_z^{2n}} u_{ip} u_{jp} = 0 \tag{12}$$
Taking the limit as $n \to \infty$, we conclude that $\sum_{p=z}^{z+w} u_{ip} u_{jp} = 0$, and $b_{ij}^{n}$ is reduced to:
$$b_{ij}^{n} = \sum_{p=z+w+1}^{m} s_p^{2n} u_{ip} u_{jp} = 0 \tag{13}$$
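Again using the argument above, we can show that $u_{ip} u_{jp} = 0$ for $z+w+1 \le p \le k$. Furthermore,
$$y_{ij} = \sum_{p=1}^{z-1} s_p^2 u_{ip} u_{jp} + \sum_{p=z}^{z+w} s_p^2 u_{ip} u_{jp} + \sum_{p=z+w+1}^{k} s_p^2 u_{ip} u_{jp} = 0. \tag{14}$$
The first and third sums vanish term by term, and the middle sum vanishes because $s_p = s_z$ throughout it,
so it equals $s_z^2 \sum_{p=z}^{z+w} u_{ip} u_{jp} = 0$. And our proof is complete.
5 Conclusions and Future Work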
Higher order cooccurrences play a key role in the effectiveness of systems used for information retrieval and
text mining. We have explicitly shown the use of higher orders of cooccurrence in the Singular Value
Decomposition (SVD) algorithm and, by inference, in the systems that rely on SVD, such as LSI. Our
empirical and mathematical studies show that term cooccurrence plays a crucial role in LSI. The work
shown here will find many practical applications. Below we describe our own research activities that were
directly influenced by our discovery of the relationship between SVD and higher order term cooccurrence.
Our first example is a novel approach to term clustering. Our algorithm defines term similarity as
the distance between the term vectors in the $T_k S_k$ matrix. We conclude from section 3 that this definition of
term similarity is more directly correlated to improved performance than is use of the reduced dimensional
term-term matrix values. Preliminary results (preprint available from the authors) show that this metric, when
used to identify terms for query expansion, matches or exceeds the retrieval performance of traditional vector
space retrieval or LSI.
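A minimal sketch of this similarity definition follows. The text does not specify which distance is used, so Euclidean distance is assumed here; the matrix, the value of $k$, and the helper `nearest_terms` are likewise hypothetical and do not reproduce the clustering algorithm itself.

```python
import numpy as np

# Hypothetical 4-term by 3-document matrix.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

T, s, _ = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = T[:, :k] * s[:k]     # rows of T_k S_k, one vector per term

# Pairwise Euclidean distances between term vectors (assumed metric).
diff = term_vectors[:, None, :] - term_vectors[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))


def nearest_terms(i, n=2):
    """Indices of the n terms closest to term i, e.g. as expansion candidates."""
    order = np.argsort(dist[i])
    return [int(j) for j in order if j != i][:n]


print(nearest_terms(0))
```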
Our second, and more ambitious, application of these results is the development of an algorithm
for approximating LSI. LSI runtime performance is significantly slower than vector space performance for
two reasons. First, the decomposition must be performed and it is computationally expensive. Second, the
matching of queries to documents in LSI is also computationally expensive. The original document vectors
are very sparse, but the document by dimension vectors used in LSI retrieval are dense, and the query must be
compared to each document vector. Furthermore, the optimal truncation value (k) must be discovered for
each collection. We believe that the correlation data presented here can be used to develop an algorithm that
approximates the performance of an optimal LSI system while reducing the computational overhead.
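The contrast between sparse original vectors and dense reduced vectors can be seen in a small experiment. The sketch below generates a hypothetical random term by document matrix with roughly 5% nonzero entries and measures the fraction of nonzero values before and after reduction; the sizes, the density, the seed, and $k = 10$ are arbitrary assumptions, and real collections will differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse 200-term by 100-document matrix, about 5% nonzero.
A = (rng.random((200, 100)) < 0.05).astype(float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 10
doc_vectors = Dt[:k, :].T            # rows of D_k, one k-dimensional vector per document

original_density = np.count_nonzero(A) / A.size
reduced_density = np.count_nonzero(np.abs(doc_vectors) > 1e-12) / doc_vectors.size
print(original_density, reduced_density)
```

Almost every entry of the reduced document vectors is nonzero, so a query has to be compared against all $k$ coordinates of every document, whereas the original vectors allow most term-document pairs to be skipped.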
6 Acknowledgements
This work was supported in part by National Science Foundation Grant Number EIA0087977. The authors
gratefully acknowledge the assistance of Dr. Kae Kontostathis and Dr. WeiMin Huang in developing the
proof of the transitivity in the SVD as well as in reviewing drafts of this article. The authors also would like
to express their gratitude to Dr. Brian Davison for his comments on a draft. The authors gratefully
acknowledge the assistance of their colleagues in the Computer Science and Engineering Department at
Lehigh University in completing this work. Coauthor William M. Pottenger also gratefully acknowledges his
Lord and Savior, Yeshua (Jesus) the Messiah, for His continuing guidance in and salvation of his life.
7 References
Berry, M.W., Drmac, Z. and Jessup, E.R. (1999). Matrices, Vector Spaces, and Information Retrieval. SIAM Review,
vol. 41, no. 2, pp. 335-362.
Berry, M.W., Dumais, S.T. and O'Brien, G.W. (1995). Using linear algebra for intelligent information retrieval. SIAM
Review, vol. 37, no. 4, pp. 573-595.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990). Indexing by latent semantic
analysis. Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407.
Ding, C.H.Q. (1999). A similarity-based Probability Model for Latent Semantic Indexing. Proceedings of the 22nd ACM
SIGIR ’99 Conference, pp. 59-65.
Dumais, S.T. (1993). LSI meets TREC: A status report. In: D. Harman (Ed.), The First Text REtrieval Conference
(TREC-1), National Institute of Standards and Technology Special Publication 500-207, pp. 137-152.
Dumais, S.T. (1994). Latent Semantic Indexing (LSI) and TREC-2. In: D. Harman (Ed.), The Second Text REtrieval
Conference (TREC-2), National Institute of Standards and Technology Special Publication 500-215, pp. 105-116.
Dumais, S.T. (1995). Using LSI for information filtering: TREC-3 experiments. In: D. Harman (Ed.), The Third Text
REtrieval Conference (TREC-3), National Institute of Standards and Technology Special Publication 500-226,
pp. 219-230.
Edmonds, P. (1997). Choosing the Word Most Typical in Context Using a Lexical Cooccurrence Network.
Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pp. 507-509.
Kontostathis, A. and Pottenger, W.M. (2002a). A Mathematical View of Latent Semantic Indexing: Tracing Term
Cooccurrences. Lehigh University Technical Report, LU-CSE-02-006.
Kontostathis, A. and Pottenger, W.M. (2002b). Detecting Patterns in the LSI Term-Term Matrix. Workshop on
the Foundation of Data Mining and Discovery, The 2002 IEEE International Conference on Data Mining, pp.
243-248.
Martin, D.I. and Berry, M.W. (2001). Parallel General Text Parser. Copyright 2001.
Schütze, H. (1992). Dimensions of Meaning. Proceedings of Supercomputing ’92.
Schütze, H. (1998). Automatic Word Sense Disambiguation. Computational Linguistics, vol. 24, no. 1.
Story, R.E. (1996). An explanation of the Effectiveness of Latent Semantic Indexing by means of a Bayesian Regression
Model. Information Processing and Management, vol. 32, no. 3, pp. 329-344.
van Rijsbergen, C.J. (1979). Information Retrieval. Butterworths, London.
Wiemer-Hastings, P. (1999). How Latent is Latent Semantic Analysis? Proceedings of the 16th International Joint
Conference on Artificial Intelligence, pp. 932-937.
Xu, J. and Croft, W.B. (1998). Corpus-Based Stemming Using Cooccurrence of Word Variants. ACM
Transactions on Information Systems, vol. 16, no. 1, pp. 61-81.
Zelikovitz, S. and Hirsh, H. (2001). Using LSI for Text Classification in the Presence of Background Text. Proceedings
of CIKM01, 10th ACM International Conference on Information and Knowledge Management.
Zha, H. (1998). A Subspace-Based Model for Information Retrieval with Applications in Latent Semantic Indexing.
Technical Report No. CSE-98-002, Department of Computer Science and Engineering, Pennsylvania State
University.
Zhang, X., Berry, M.W. and Raghavan, P. (2000). Level search schemes for information filtering and retrieval.
Information Processing and Management, vol. 37, no. 2, pp. 313-334.