A framework for understanding Latent Semantic Indexing (LSI) performance

Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, United States
Information Processing & Management (Impact Factor: 1.27). 01/2006; 42(1):56-73. DOI: 10.1016/j.ipm.2004.11.007
Source: DBLP


In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) in search and retrieval applications. Many models for understanding LSI have been proposed; ours is the first to study the values produced by LSI in the term-by-dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation of LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
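The abstract's two central objects — the term-by-dimension vectors produced by a truncated SVD, and second-order term co-occurrence — can be sketched on a toy term-by-document matrix (a minimal illustration with made-up data, not the paper's experimental setup):

```python
import numpy as np

# Toy term-by-document matrix: rows are terms, columns are documents.
# (Illustrative data only.)
A = np.array([
    [1, 1, 0, 0],   # "car"
    [1, 0, 1, 0],   # "auto"
    [0, 1, 1, 0],   # "engine"
    [0, 0, 0, 1],   # "banana"
], dtype=float)

# First-order co-occurrence: terms that share at least one document.
C1 = A @ A.T

# Second-order co-occurrence: terms connected through a shared third term.
C2 = C1 @ C1

# SVD: A = U S V^T; truncating to k dimensions yields the
# term-by-dimension vectors that LSI operates on.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]   # each row: one term in k-dim LSI space

# Term-term similarities in the truncated space.
sim = term_vectors @ term_vectors.T
```

The paper's claim is that values like those in `sim` track the second-order co-occurrence structure in `C2`.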



Available from: April Kontostathis, Mar 28, 2014
    • "The underlying distributional hypothesis is often cited to explain how word meaning enters information processing [37]. Before attempts to utilize lexical resources for the same purpose, this used to be the sole source of word semantics in information retrieval, inherent in the exploitation of term occurrences – most notably, in the term frequency-inverse document frequency (TFIDF) measure – and co-occurrences [26], [56], [63], including multiple-level term co-occurrences [39]. The referential approach, on the other hand, these days relies on lexical resources. "
    ABSTRACT: Based on the Aristotelian concept of potentiality vs. actuality allowing for the study of energy and dynamics in language, we propose a field approach to lexical analysis. Falling back on the distributional hypothesis to statistically model word meaning, we used evolving fields as a metaphor to express time-dependent changes in a vector space model by a combination of random indexing and evolving self-organizing maps (ESOM). To monitor semantic drifts within the observation period, an experiment was carried out on the term space of a collection of 12.8 million Amazon book reviews. For evaluation, the semantic consistency of ESOM term clusters was compared with their respective neighbourhoods in WordNet, and contrasted with distances among term vectors by random indexing. We found that at 0.05 level of significance, the terms in the clusters showed a high level of semantic consistency. Tracking the drift of distributional patterns in the term space across time periods, we found that consistency decreased, but not at a statistically significant level. Our method is highly scalable, with interpretations in philosophy.
    Full-text · Conference Paper · Jul 2015
    • "This considers the aspect of meaning from the documents. It is in contrast to knowledge structure based approaches that consider the aspects of words (Kontostathis & Pottenger, 2006). "
    ABSTRACT: We investigate the impact of idea mining filtering on web-based weak signal detection to improve strategic decision making. Existing approaches for identifying weak signals in strategic decision making use environmental scanning procedures based on standard filtering algorithms. These algorithms discard patterns with low information content; however, they are not able to discard patterns with low relevance to a given strategic problem. Idea mining is proposed as an algorithm that identifies relevant textual patterns from documents or websites to solve a given (strategic) problem, making it possible to estimate a pattern's relevance to the strategic problem at hand. The new methodology, which combines weak signal analysis and idea mining, contrasts with existing methodologies. In a case study, a web-based scanning procedure is implemented to identify textual internet data in the field of self-sufficient energy supply. Idea mining is applied for filtering, and weak signals are identified based on the proposed approach. The proposed approach is compared to a further, already evaluated, approach processed without idea mining. The results show that idea mining filtering improves the quality of weak signal analysis. This supports decision makers by providing early and suggestive signals of potentially emerging trends, even signals with only little expressive strength.
    No preview · Article · Dec 2014 · Futures
    • "It is applied to find the sub-eigenspace with large eigenvalues [15]. Specifically, it can decompose the term-by-document matrix into three matrices: an m by r term-concept matrix, an r by r singular value matrix and an n by r document-concept matrix [8]. SVD truncates the singular value matrix to size k ≪ r. "
    ABSTRACT: A large amount of software maintenance effort is spent on program comprehension. How to accurately and quickly extract the functional features in a program has become a hot issue in program comprehension. Some studies in this area focus on extracting topics by analyzing linguistic information in the source code using text mining techniques. However, the extracted topics are usually composed of standalone words and are difficult to understand. In this paper, we attempt to solve this problem with a novel program summarization technique. First, we propose to use latent semantic indexing and clustering to group source artifacts with similar vocabulary, in order to analyze the composition of each package in the program. Then, topics composed of vectors of independent words can be extracted based on latent semantic indexing. Finally, we employ Minipar, a natural language parser, to help generate the summaries. The summaries effectively organize the words from the topics into predefined sentence forms based on a set of rules. With such summaries, developers can understand what features the program has and their corresponding source artifacts.
    Full-text · Conference Paper · Jun 2014
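The three-matrix decomposition and truncation to k ≪ r quoted above can be illustrated with a short sketch (random toy matrix; the dimensions and threshold are assumptions for the example, not values from the cited work):

```python
import numpy as np

# Toy m-by-n term-by-document matrix (random binary occurrences).
m, n = 6, 5
rng = np.random.default_rng(0)
A = (rng.random((m, n)) > 0.6).astype(float)

# Full (thin) SVD: A = U S V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))   # numerical rank
k = 2                        # truncation level; k << r in realistic settings

# Rank-k approximation A_k = U_k S_k V_k^T, i.e. the truncated
# term-concept, singular-value, and document-concept matrices.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, the Frobenius-norm error equals the
# root-sum-square of the discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s[k:] ** 2))
```

Keeping only the top k singular values is what projects terms and documents into the shared k-dimensional concept space that LSI queries against.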