A framework for understanding Latent Semantic Indexing (LSI) performance

Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, United States
Information Processing & Management (Impact Factor: 1.27). 01/2006; 42(1):56-73. DOI: 10.1016/j.ipm.2004.11.007
Source: DBLP


In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) in search and retrieval applications. Many models for understanding LSI have been proposed; ours is the first to study the values produced by LSI in the term-by-dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation of LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
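To make the pipeline the abstract refers to concrete, the sketch below builds a small term-by-document matrix, truncates its SVD, and folds a query into the reduced space. The four-term toy corpus and the rank k = 2 are assumptions chosen for illustration; this is a minimal sketch of standard LSI retrieval, not the paper's data or the authors' implementation.

```python
# Minimal LSI retrieval sketch. The four-term toy corpus and the rank k = 2 are
# illustrative assumptions; they are not the paper's data or parameters.
import numpy as np

terms = ["cat", "feline", "dog", "canine"]
#              D1 D2 D3 D4
A = np.array([[1, 0, 0, 0],   # cat
              [1, 1, 0, 0],   # feline
              [0, 0, 1, 1],   # dog
              [0, 0, 1, 1]],  # canine
             dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

doc_vectors = Vk * sk                     # document coordinates in the LSI space

def fold_in(query_tf):
    """Project a query term-frequency vector into the k-dimensional LSI space."""
    return (query_tf @ Uk) / sk

def similarities(query_terms):
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    qv = fold_in(q)
    return doc_vectors @ qv / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(qv))

# D2 contains only "feline", yet it scores as well as D1 for a "cat" query,
# because "cat" and "feline" co-occur in D1: the behaviour the paper traces
# back to term co-occurrence inside the SVD.
print(np.round(similarities({"cat"}), 3))
```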

Cited by:
    • "The underlying distributional hypothesis is often cited for explaining how word meaning enters information processing [37]. Before attempts to utilize lexical resources for the same purpose, this used to be the sole source of word semantics in information retrieval, inherent in the exploitation of term occurrences – most notably, in the term frequency-inverse document frequency (TFIDF) measure – and co-occurrences [26], [56], [63], including multiple-level term co-occurrences [39]. On the other hand, the referential approach relies these days on lexical resources. "
    ABSTRACT: Based on the Aristotelian concept of potentiality vs. actuality allowing for the study of energy and dynamics in language, we propose a field approach to lexical analysis. Falling back on the distributional hypothesis to statistically model word meaning, we used evolving fields as a metaphor to express time-dependent changes in a vector space model by a combination of random indexing and evolving self-organizing maps (ESOM). To monitor semantic drifts within the observation period, an experiment was carried out on the term space of a collection of 12.8 million Amazon book reviews. For evaluation, the semantic consistency of ESOM term clusters was compared with their respective neighbourhoods in WordNet, and contrasted with distances among term vectors by random indexing. We found that at 0.05 level of significance, the terms in the clusters showed a high level of semantic consistency. Tracking the drift of distributional patterns in the term space across time periods, we found that consistency decreased, but not at a statistically significant level. Our method is highly scalable, with interpretations in philosophy.
    International Joint Conference on Neural Networks, Killarney, Ireland; 07/2015
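The abstract above constructs its term space with random indexing before clustering with evolving self-organizing maps. As a point of reference, the sketch below shows only a generic random indexing step: each term receives a fixed sparse random index vector, and a term's context vector accumulates the index vectors of its window neighbours. The dimensionality, number of nonzero components, window size, and toy sentences are assumptions for illustration, not the authors' settings.

```python
# Generic random indexing sketch; DIM, NONZERO, WINDOW, and the toy corpus are
# illustrative assumptions, not the cited paper's configuration.
import numpy as np

rng = np.random.default_rng(0)
DIM, NONZERO, WINDOW = 64, 4, 2

def index_vector():
    """Sparse ternary random vector: a few +1/-1 entries, the rest zero."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def random_indexing(tokenized_docs):
    index, context = {}, {}          # fixed index vector / accumulated context per term
    for doc in tokenized_docs:
        for term in doc:
            if term not in index:
                index[term] = index_vector()
                context[term] = np.zeros(DIM)
        for i, term in enumerate(doc):
            lo, hi = max(0, i - WINDOW), min(len(doc), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    context[term] += index[doc[j]]   # add neighbours' index vectors
    return context

docs = [["cats", "chase", "mice"], ["dogs", "chase", "cats"]]
vectors = random_indexing(docs)
print(np.round(vectors["cats"][:8], 1))
```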
    • "This considers the aspect of meaning from the documents. It is in contrast to knowledge structure based approaches that consider the aspects of words (Kontostathis & Pottenger, 2006). "
    ABSTRACT: We investigate the impact of idea mining filtering on web-based weak signal detection to improve strategic decision making. Existing approaches for identifying weak signals in strategic decision making use environmental scanning procedures based on standard filtering algorithms. These algorithms discard patterns with low information content; however, they cannot discard patterns with low relevance to a given strategic problem. Idea mining is proposed as an algorithm that identifies textual patterns in documents or websites that are relevant to solving a given (strategic) problem, and it thus makes it possible to estimate a pattern's relevance to that problem. The new methodology, which combines weak signal analysis and idea mining, contrasts with existing methodologies. In a case study, a web-based scanning procedure is implemented to identify textual internet data in the field of self-sufficient energy supply. Idea mining is applied for filtering, and weak signals are identified with the proposed approach. The approach is compared to a previously evaluated approach that does not use idea mining. The results show that idea mining filtering improves the quality of weak signal analysis. This supports decision makers by providing early and suggestive signals of potentially emerging trends, even when those signals have little expressive strength.
    Futures 12/2014; 66. DOI:10.1016/j.futures.2014.12.007 · 1.29 Impact Factor
    • "According to a result by Kontostathis and Pottenger [6], LSI using the truncated SVD can recognize synonyms as long as there is a short path that chain the synonyms together. For example in table I 'mark' and 'twain' are connected to 'samuel' and 'clemens' through Doc3. "
    ABSTRACT: Latent semantic indexing (LSI) is an indexing method that improves the performance of an information retrieval system by indexing terms that appear in related documents and weakening the influence of terms that appear in unrelated documents. LSI is usually carried out with the truncated singular value decomposition (SVD). The main difficulty with this technique is that its retrieval performance depends strongly on the choice of an appropriate decomposition rank. In this paper, observing that the truncated SVD makes related documents more connected, we devise a matrix completion algorithm that mimics this capability. The proposed algorithm is nonparametric, has a convergence guarantee, and produces a unique solution for each input, so it is more practical and easier to use than the truncated SVD. Experimental results on four standard datasets in LSI research show that the retrieval performance of the proposed algorithm is comparable to the best results offered by the truncated SVD over a range of decomposition ranks.
    2013 IEEE International Conference on Control System, Computing and Engineering (ICCSCE); 01/2013
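To illustrate the chaining effect described in the excerpt above, the sketch below uses a made-up six-term, five-document corpus (an assumption for illustration, not the table cited in [6]). 'mark' and 'clemens' never co-occur, yet after a rank-2 truncation their term vectors become nearly identical, because 'twain' and 'samuel' bridge them through a shared document.

```python
# Second-order co-occurrence sketch: "mark" and "clemens" never share a document,
# yet after rank-2 truncation their term vectors are almost identical because
# "twain" and "samuel" bridge them through D3. The corpus is illustrative only.
import numpy as np

terms = ["mark", "twain", "samuel", "clemens", "purple", "colour"]
#              D1 D2 D3 D4 D5
A = np.array([[1, 0, 0, 0, 0],   # mark
              [1, 0, 1, 0, 0],   # twain
              [0, 1, 1, 0, 0],   # samuel
              [0, 1, 0, 0, 0],   # clemens
              [0, 0, 0, 1, 1],   # purple
              [0, 0, 0, 1, 1]],  # colour
             dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]      # term-by-dimension vectors at rank k

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

i, j = terms.index("mark"), terms.index("clemens")
print("first-order co-occurrence:", A[i] @ A[j])                       # 0: never co-occur
print("similarity after rank-2 SVD:", round(cos(term_vecs[i], term_vecs[j]), 3))  # close to 1
```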