A framework for understanding Latent Semantic Indexing (LSI) performance

Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, United States
Information Processing & Management (Impact Factor: 1.07). 01/2006; DOI: 10.1016/j.ipm.2004.11.007
Source: DBLP

ABSTRACT: In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) search and retrieval applications. Many models for understanding LSI have been proposed; ours is the first to study the values produced by LSI in the term-by-dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation of LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
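The abstract's central claim, that the values in the SVD term vectors reflect second-order term co-occurrence, can be illustrated with a minimal sketch. The toy term-by-document matrix and term labels below are invented for illustration and are not the paper's data:

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents).
A = np.array([
    [1, 1, 0, 0],  # "car"    appears in docs 0, 1
    [0, 0, 1, 1],  # "auto"   appears in docs 2, 3
    [1, 0, 1, 0],  # "engine" appears in docs 0, 2
    [0, 0, 0, 1],  # "banana" appears in doc 3
], dtype=float)

# First-order co-occurrence: terms sharing at least one document.
C1 = A @ A.T
# Second-order co-occurrence: terms linked through a shared neighbour.
C2 = C1 @ C1

# Truncated SVD at rank k, the core of LSI.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]  # the term-by-dimension vectors studied in the paper

# "car" and "auto" never co-occur directly (C1[0, 1] == 0), but both
# co-occur with "engine", so their second-order co-occurrence is nonzero
# and LSI tends to place their term vectors closer together.
```

Comparing `C2` against the pairwise products of rows of `term_vectors` is the kind of correlation the framework examines.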

Available from: April Kontostathis, Mar 28, 2014
  • Source
    ABSTRACT: Based on the Aristotelian concept of potentiality vs. actuality, which allows for the study of energy and dynamics in language, we propose a field approach to lexical analysis. Drawing on the distributional hypothesis to statistically model word meaning, we used evolving fields as a metaphor to express time-dependent changes in a vector space model through a combination of random indexing and evolving self-organizing maps (ESOM). To monitor semantic drift within the observation period, an experiment was carried out on the term space of a collection of 12.8 million Amazon book reviews. For evaluation, the semantic consistency of ESOM term clusters was compared with their respective neighbourhoods in WordNet, and contrasted with distances among term vectors produced by random indexing. We found that, at the 0.05 level of significance, the terms in the clusters showed a high level of semantic consistency. Tracking the drift of distributional patterns in the term space across time periods, we found that consistency decreased, but not at a statistically significant level. Our method is highly scalable and has interpretations in philosophy.
  • Source
    ABSTRACT: Latent semantic indexing (LSI) is an indexing method that improves the performance of an information retrieval system by indexing terms that appear in related documents and weakening the influence of terms that appear in unrelated documents. LSI is usually conducted using the truncated singular value decomposition (SVD). The main difficulty with this technique is that its retrieval performance depends strongly on the choice of an appropriate decomposition rank. In this paper, observing that the truncated SVD makes related documents more connected, we devise a matrix completion algorithm that mimics this capability. The proposed algorithm is nonparametric, has a convergence guarantee, and produces a unique solution for each input, so it is more practical and easier to use than the truncated SVD. Experimental results on four standard datasets in LSI research show that the retrieval performance of the proposed algorithm is comparable to the best results offered by the truncated SVD over a range of decomposition ranks.
    2013 IEEE International Conference on Control System, Computing and Engineering (ICCSCE); 01/2013
  • Source
    ABSTRACT: This thesis examines three aspects of source-code plagiarism. The first is concerned with creating a definition of source-code plagiarism; the second with describing the findings gathered from investigating the Latent Semantic Analysis information retrieval algorithm for source-code similarity detection; and the final aspect with proposing and evaluating a new algorithm that combines Latent Semantic Analysis with plagiarism detection tools. A recent review of the literature revealed that there is no commonly agreed definition of what constitutes source-code plagiarism in the context of student assignments. The thesis first analyses the findings of a survey carried out to gain insight into the perspectives of UK Higher Education academics who teach programming on computing courses. Based on the survey findings, a detailed definition of source-code plagiarism is proposed. Secondly, the thesis investigates the application of an information retrieval technique, Latent Semantic Analysis, to derive semantic information from source-code files. Various parameters drive the effectiveness of Latent Semantic Analysis; its performance under various parameter settings, and its effectiveness in retrieving similar source-code files when those parameters are optimised, are evaluated. Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection tools is proposed, and a tool is created and evaluated. The proposed tool, PlaGate, is a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility for investigating the importance of source-code fragments with regard to their contribution towards proving plagiarism. PlaGate provides graphical output indicating clusters of suspicious files and source-code fragments.
    IEEE Transactions on Computers 03/2012; 61(3):379-394. DOI:10.1109/TC.2011.223 · 1.47 Impact Factor
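Several of the entries above rely on the truncated-SVD retrieval pipeline. As background, here is a minimal sketch on invented data; the rank `k`, the fold-in formula, and cosine ranking are the standard textbook choices, not details taken from any of the papers listed:

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents); invented counts.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # decomposition rank; retrieval quality depends strongly on this choice
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Documents represented in the k-dimensional LSI space.
doc_vectors = (np.diag(sk) @ Vtk).T  # shape: (num_docs, k)

def fold_in_query(q):
    """Project a query term-count vector into the LSI space: q' = q U_k S_k^-1."""
    return (q @ Uk) / sk

def rank_documents(q):
    """Rank documents by cosine similarity to the folded-in query."""
    qv = fold_in_query(q)
    sims = (doc_vectors @ qv) / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(qv) + 1e-12)
    return np.argsort(-sims)  # best-matching document first

q = np.array([1, 1, 0, 0], dtype=float)  # query containing terms 0 and 1
order = rank_documents(q)
```

Sweeping `k` and measuring retrieval quality at each rank is exactly the tuning burden that the matrix-completion entry above seeks to remove.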