A framework for understanding Latent Semantic Indexing (LSI) performance

Ursinus College, PO Box 1000, 601 Main Street, Collegeville, PA 19426, United States; Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, United States
Information Processing & Management (Impact Factor: 1.07). 01/2006; DOI: 10.1016/j.ipm.2004.11.007
Source: DBLP

ABSTRACT In this paper we present a theoretical model for understanding the performance of the Latent Semantic Indexing (LSI) search and retrieval application. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by LSI in the term-by-dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation for LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
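The link between second-order co-occurrence and SVD values described in the abstract can be illustrated with a small sketch. This is not the paper's experimental setup: the three-term, two-document matrix and the helper names `cooccurrence` and `lsi_term_similarity` are assumptions made purely for illustration. Terms A and C never share a document, but both co-occur with B, so they are second-order co-occurring.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (rows: terms A, B, C; columns: documents).
# Document 1 contains A and B; document 2 contains B and C.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

def cooccurrence(X):
    """First-order term co-occurrence counts, with self-co-occurrence removed."""
    C = X @ X.T
    np.fill_diagonal(C, 0.0)
    return C

def lsi_term_similarity(X, k):
    """Term-term similarities from the rank-k truncated SVD of X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    term_vectors = U[:, :k] * s[:k]   # term-by-dimension vectors, scaled by singular values
    return term_vectors @ term_vectors.T

first_order = cooccurrence(X)
second_order = first_order @ first_order   # counts two-step paths such as A -> B -> C
sim = lsi_term_similarity(X, k=1)

print("first-order  A,C:", first_order[0, 2])
print("second-order A,C:", second_order[0, 2])
print("rank-1 LSI   A,C:", sim[0, 2])
```

In this toy case the first-order count for the pair (A, C) is zero, while both the second-order count and the rank-1 LSI similarity are positive, which is the kind of relationship the abstract refers to.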

  • ABSTRACT: Latent Semantic Indexing (LSI) is a well-known Information Retrieval (IR) technique that tries to overcome the problems of lexical matching by using conceptual indexing. LSI is a variant of the vector space model and has proved to be 30% more effective. Many studies have reported that good retrieval performance is related to the use of various retrieval heuristics. In this paper, we focus on optimising two LSI retrieval heuristics: term weighting and rank approximation. The results obtained demonstrate that LSI performance improves significantly with the combination of optimised term weighting and rank approximation.
    Journal of Information & Knowledge Management 11/2011; 05(02).
  • ABSTRACT: In this paper we describe a close analysis of the language used in cyberbullying. We take as our corpus a collection of posts from a social networking site where users can ask questions of other users. It appeals primarily to teens and young adults, and the cyberbullying content on the site is dense; between 7% and 14% of the posts we have analyzed contain cyberbullying content. The results presented in this article are two-fold. Our first experiments were designed to develop an understanding of both the specific words that are used by cyberbullies and the context surrounding these words. We have identified the most commonly used cyberbullying terms, and have developed queries that can be used to detect cyberbullying content. Five of our queries achieve an average precision of 91.25% at rank 100. In our second set of experiments we extended this work by using a supervised machine learning approach for detecting cyberbullying. The machine learning experiments identify additional terms that are consistent with cyberbullying content, and identify an additional querying technique that was able to accurately assign scores to posts from the site. The posts with the highest scores are shown to have a high density of cyberbullying content.
    Proceedings of the 5th Annual ACM Web Science Conference; 05/2013
  • ABSTRACT: The success of machine learning approaches to word sense disambiguation (WSD) is largely dependent on the representation of the context in which an ambiguous word occurs. Typically, the contexts are represented in a vector space using the “Bag of Words (BoW)” technique. Despite its ease of use, the BoW representation suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms. In this paper, we apply the semantic diffusion kernel, which models semantic similarity by means of a diffusion process on a graph defined by lexicon and co-occurrence information, to smooth the BoW representation for WSD systems. The semantic diffusion kernel can be obtained through a matrix exponentiation transformation on the given kernel matrix, and virtually exploits higher-order co-occurrences to infer semantic similarity between terms. The superiority of the proposed method is demonstrated experimentally with several SensEval disambiguation tasks.
    Engineering Applications of Artificial Intelligence 01/2014; 27:167–174. · 1.96 Impact Factor
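The matrix exponentiation mentioned in the last abstract above can be sketched as follows. This is a minimal illustration, not the cited paper's method: the base kernel is assumed here to be a plain term co-occurrence Gram matrix, and the function name `diffusion_kernel`, the decay parameter `lam`, and the toy data are all hypothetical. Because the exponential expands into a weighted sum of all powers of the base matrix, term pairs linked only through higher-order co-occurrence receive a nonzero similarity.

```python
import numpy as np

def diffusion_kernel(G, lam):
    """Matrix exponential exp(lam * G) of a symmetric matrix G, via eigendecomposition."""
    w, V = np.linalg.eigh(G)              # G = V diag(w) V^T for symmetric G
    return (V * np.exp(lam * w)) @ V.T    # V diag(exp(lam * w)) V^T

# Hypothetical toy term-by-document matrix (rows: terms A, B, C).
# A and C never share a document, but both co-occur with B.
X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0]])

G = X_toy @ X_toy.T                       # assumed base kernel: term co-occurrence counts
K = diffusion_kernel(G, lam=0.5)

print("base kernel      A,C:", G[0, 2])
print("diffusion kernel A,C:", K[0, 2])
```

Larger values of `lam` give more weight to longer co-occurrence paths; the value 0.5 is arbitrary.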
