A framework for understanding Latent Semantic Indexing (LSI) performance

Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, United States
Information Processing & Management (Impact Factor: 1.07). 01/2006; DOI: 10.1016/j.ipm.2004.11.007
Source: DBLP

ABSTRACT In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) in search and retrieval applications. Many models for understanding LSI have been proposed; ours is the first to study the values produced by LSI in the term-by-dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation of LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
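The abstract's central idea can be illustrated with a minimal sketch of LSI: compute the SVD of a term-document matrix, truncate to k dimensions, and inspect the resulting term-by-dimension vectors. The toy matrix, term names, and k below are illustrative assumptions, not data from the paper; "car" and "auto" never share a document, but both co-occur with "engine" (a second-order co-occurrence), and the truncated space reflects that link.

```python
import numpy as np

# Toy term-document matrix (rows = terms, cols = documents).
# Illustrative data only, not taken from the paper's experiments.
terms = ["car", "auto", "engine", "fruit", "banana"]
A = np.array([
    [1, 0, 1, 0, 0, 0],   # car
    [0, 1, 0, 1, 0, 0],   # auto
    [1, 1, 1, 1, 0, 0],   # engine
    [0, 0, 0, 0, 1, 1],   # fruit
    [0, 0, 0, 0, 1, 0],   # banana
], dtype=float)

# SVD: A = U diag(s) V^T, with singular values in s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# LSI truncates to k dimensions; the rows of U_k * diag(s_k) are the
# term-by-dimension vectors whose values the paper's framework studies.
k = 2
term_vectors = U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "car" and "auto" never co-occur directly (A[0] @ A[1] == 0), yet both
# co-occur with "engine"; in the truncated space their vectors align
# (cosine near 1), while "banana", which shares no co-occurrence path
# with "car", stays orthogonal (cosine near 0).
print(cosine(term_vectors[0], term_vectors[1]))  # car vs auto: near 1
print(cosine(term_vectors[0], term_vectors[4]))  # car vs banana: near 0
```

This is only a sketch of the mechanism the paper analyzes formally; real LSI systems apply term weighting (e.g. tf-idf) before the SVD and choose k empirically.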


Available from: April Kontostathis, Mar 28, 2014
  • ABSTRACT: Based on the Aristotelian concept of potentiality vs. actuality, which allows for the study of energy and dynamics in language, we propose a field approach to lexical analysis. Drawing on the distributional hypothesis to statistically model word meaning, we use evolving fields as a metaphor for time-dependent changes in a vector space model, combining random indexing with evolving self-organizing maps (ESOM). To monitor semantic drift within the observation period, we carried out an experiment on the term space of a collection of 12.8 million Amazon book reviews. For evaluation, the semantic consistency of ESOM term clusters was compared with their respective neighbourhoods in WordNet and contrasted with distances among term vectors produced by random indexing. We found that, at the 0.05 level of significance, the terms in the clusters showed a high level of semantic consistency. Tracking the drift of distributional patterns in the term space across time periods, we found that consistency decreased, but not at a statistically significant level. Our method is highly scalable and has interpretations in philosophy.
  • ABSTRACT: Latent Semantic Analysis (LSA) is a technique for analyzing latent concepts. LSA is based on the Vector Space Model (VSM) and statistics, and usually takes the Singular Value Decomposition (SVD) as its kernel algorithm. LSA typically increases the scale of the training data to improve system performance; however, this requires many extra operations and generates many unreasonable co-occurrence paths between different features, so noise becomes a serious problem. This paper proposes a new method, called the augmented space model, to optimize the latent semantic space model. It also suggests that multiple models can be combined through integration technology to improve system performance. Through integration technology and space optimization, the models can describe the latent semantic structure more exactly while, to some extent, reducing the probability of generating noisy co-occurrences. Comparative experiments show that system accuracy is higher after adopting integration and space-optimization technology.
    2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems (CCIS); 10/2012
  • ABSTRACT: It is known that latent semantic indexing (LSI) takes advantage of implicit higher-order (or latent) structure in the association of terms and documents; higher-order relations in LSI capture “latent semantics”. These findings inspired a novel Bayesian framework for classification, Higher-Order Naive Bayes (HONB), introduced previously, which can explicitly make use of these higher-order relations. In this paper, we present a novel semantic smoothing method named Higher-Order Smoothing (HOS) for the Naive Bayes algorithm. HOS is built on a graph-based data representation similar to that of HONB, which allows semantics in higher-order paths to be exploited. HOS takes the concept one step further by exploiting relationships between instances of different classes; as a result, we move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. This approach improves parameter estimation when labeled data are insufficient. Results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.
    Journal of Computer Science and Technology 05/2014; 29(3):376-391. DOI:10.1007/s11390-014-1437-6 · 0.64 Impact Factor
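The main paper and several of the works above rest on the distinction between first-order co-occurrence (two terms sharing a document) and second- or higher-order co-occurrence (terms linked only through a chain of intermediate terms). A minimal sketch of that distinction over a binary term-document matrix, using illustrative toy data (the matrix and term indices are assumptions):

```python
import numpy as np

# Binary term-document incidence matrix (rows = terms, cols = documents).
# Illustrative toy data, not drawn from any of the papers above.
A = np.array([
    [1, 1, 0, 0],   # t0: docs 0, 1
    [0, 1, 1, 0],   # t1: docs 1, 2
    [0, 0, 1, 1],   # t2: docs 2, 3
], dtype=int)

# First-order co-occurrence: (A @ A.T)[i, j] counts documents that
# contain both t_i and t_j; zero the diagonal (self co-occurrence).
C1 = A @ A.T
np.fill_diagonal(C1, 0)
first_order = C1 > 0

# Second-order co-occurrence: t_i and t_j are linked through a bridge
# term t_k that co-occurs with both, even though t_i and t_j themselves
# never share a document. Squaring the first-order adjacency counts
# such length-2 paths.
C2 = first_order.astype(int) @ first_order.astype(int)
np.fill_diagonal(C2, 0)
second_order = (C2 > 0) & ~first_order

print(bool(first_order[0, 2]))   # False: t0 and t2 share no document
print(bool(second_order[0, 2]))  # True: linked through bridge term t1
```

Longer chains (the higher-order paths exploited by HONB and HOS above) correspond to higher powers of the adjacency matrix; the paper's framework ties the second-order case to the values the SVD produces.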