The exponential growth of textual documents has made information retrieval more difficult, particularly for linear retrieval models based on word matching, which are generally ineffective. Polysemy causes non-relevant documents to be retrieved, while synonymy leaves many relevant documents unretrieved. Document clustering can improve retrieval performance under the hypothesis that documents relevant to the same query tend to fall in the same cluster. This research studied the application of document clustering to improve the effectiveness of document retrieval by using cluster-based retrieval in the vector space model. In the first step, the document collection was clustered using a clustering algorithm and each cluster center was selected as the cluster representative. In the second step, the search process matched the query against all cluster representatives, and all documents in the cluster with the highest similarity to the query were presented to the user. The clustering methods used in this study are partitional methods (the Bisecting K-Means and Buckshot algorithms) and hierarchical agglomerative methods using the UPGMA and Complete Link cluster similarity measures. Retrieval performance was measured using the F-measure, derived from the Precision and Recall of the retrieval process. The test collections were 1000 news text documents with a known cluster structure and 3000 news text documents with an unknown cluster structure. The results showed that, on the test collections, retrieval based on cluster matching improved performance by 12.3% and 9.5% compared with linear retrieval based on word matching.
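The two-step procedure and the F-measure described above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes raw term-frequency vectors, cosine similarity, a centroid as the cluster representative, and cluster labels that are already given (in the study they would come from Bisecting K-Means, Buckshot, UPGMA, or Complete Link). The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector (assumption: lowercase whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Cluster representative: the mean of the member vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def cluster_based_retrieval(query, docs, labels):
    """Return every document id in the cluster whose centroid best matches the query.

    docs maps document id -> text; labels maps document id -> cluster id
    (assumed to be produced beforehand by some clustering algorithm).
    """
    clusters = {}
    for doc_id, label in labels.items():
        clusters.setdefault(label, []).append(doc_id)
    reps = {c: centroid([tf_vector(docs[d]) for d in ids])
            for c, ids in clusters.items()}
    q = tf_vector(query)
    best = max(reps, key=lambda c: cosine(q, reps[c]))
    return sorted(clusters[best])


def f_measure(retrieved, relevant):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy collection with a known two-cluster structure.
docs = {
    "d1": "stock market shares fall",
    "d2": "market investors buy shares",
    "d3": "football team wins match",
    "d4": "team scores in final match",
}
labels = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}

result = cluster_based_retrieval("shares market", docs, labels)
print(result, f_measure(result, ["d1"]))
```

Because the whole best-matching cluster is returned, a document such as `d2` can be retrieved even when it shares few terms with the query, which is the effect the cluster hypothesis relies on.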

131 Reads
    ABSTRACT: Excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts have been created entirely by automatic means. In the exploratory research described, the complete text of an article in machine-readable form is scanned by an IBM 704 data-processing machine and analyzed in accordance with a standard program. Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the “auto-abstract.”
    IBM Journal of Research and Development 05/1958; 2(2):159-165. DOI:10.1147/rd.22.0159 · 0.69 Impact Factor
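The idea of scoring words by frequency and then sentences by the words they contain can be illustrated with a short sketch. It is a simplified stand-in, not the procedure from the cited paper: it assumes a sentence's significance is the average corpus frequency of its non-stopword tokens, and the stopword list, the function name `auto_abstract`, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not one taken from the paper.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "was", "are"}


def content_words(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def auto_abstract(text, n_sentences=1):
    """Extract the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word significance: how often each content word occurs in the whole text.
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        # Sentence significance: mean significance of its content words.
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


sample = ("Clustering helps retrieval. "
          "Clustering groups documents and clustering speeds retrieval. "
          "The weather was nice.")
print(auto_abstract(sample, 2))
```

Averaging rather than summing keeps long sentences from winning purely on length; other weighting schemes are equally plausible under the same description.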

