Conference Paper

Spelling Correction for Search Engine Queries.

DOI: 10.1007/978-3-540-30228-5_33
Conference: Advances in Natural Language Processing, 4th International Conference, EsTAL 2004, Alicante, Spain, October 20-22, 2004, Proceedings
Source: DBLP


Search engines have become the primary means of accessing information on the Web. However, recent studies show that misspelled words are very common in queries to these systems. When users misspell a query, the results are incorrect or provide inconclusive information. In this work, we discuss the integration of a spelling correction component into tumba!, our community Web search engine. We present an algorithm that attempts to select the best choice among all possible corrections for a misspelled term, and discuss its implementation based on a ternary search tree data structure.
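The abstract does not spell out the algorithm, so the following Python sketch is only an illustration of the general idea: a lexicon stored in a ternary search tree, a bounded edit-distance search over that tree, and a rule for picking one candidate. The class and function names (`TernarySearchTree`, `near`, `correct`), the sample lexicon with its frequencies, and the ranking rule (smallest Levenshtein distance first, then highest frequency) are all assumptions made for this example, not the method used in tumba!.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    ch: str
    lo: Optional["Node"] = None   # characters smaller than ch
    eq: Optional["Node"] = None   # next character of the word
    hi: Optional["Node"] = None   # characters larger than ch
    freq: int = 0                 # > 0 marks the end of a stored word


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, freq=1):
        if word:
            self.root = self._insert(self.root, word, 0, freq)

    def _insert(self, node, word, i, freq):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i, freq)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i, freq)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, freq)
        else:
            node.freq = freq
        return node

    def contains(self, word):
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.freq > 0
        return False

    def near(self, word, max_dist):
        """Return (candidate, distance, freq) for stored words within
        max_dist Levenshtein edits of word."""
        results = []

        def visit(node, prefix, prev_row):
            if node is None:
                return
            # lo/hi siblings extend the same prefix, so they reuse prev_row.
            visit(node.lo, prefix, prev_row)
            visit(node.hi, prefix, prev_row)
            # Compute the edit-distance row for prefix + node.ch.
            row = [prev_row[0] + 1]
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == node.ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost)) # substitution
            if node.freq > 0 and row[-1] <= max_dist:
                results.append((prefix + node.ch, row[-1], node.freq))
            # Prune: deeper rows can never drop below the current minimum.
            if min(row) <= max_dist:
                visit(node.eq, prefix + node.ch, row)

        visit(self.root, "", list(range(len(word) + 1)))
        return results


def correct(tree, term, max_dist=2):
    """Return term if it is in the lexicon, else the best-ranked candidate."""
    if tree.contains(term):
        return term
    candidates = tree.near(term, max_dist)
    if not candidates:
        return term
    # Assumed ranking: closest first, then most frequent, then alphabetical.
    return min(candidates, key=lambda c: (c[1], -c[2], c[0]))[0]


# Hypothetical lexicon with made-up frequencies.
tree = TernarySearchTree()
for w, f in [("search", 50), ("engine", 40), ("query", 30),
             ("queries", 20), ("spelling", 10), ("speaking", 5)]:
    tree.insert(w, f)
```

With this lexicon, `correct(tree, "serch")` would return `"search"`, and a term with no candidate within two edits is returned unchanged. A production system would likely draw frequencies from query logs or the indexed corpus and might treat transpositions as a single edit; neither is modeled here.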



Available from: Mário J. Silva
  • "One advantage of using context-based approaches is that the computation time is lower (although the training is costly). However, such context-based approaches depend on proper contexts, which are not always available [47]."
    ABSTRACT: Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine ( to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures. We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons. The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. 
    In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.
    PLoS ONE 01/2011; 6(1):e15338. DOI:10.1371/journal.pone.0015338 · 3.23 Impact Factor
  • "the formation a suitable graph of unprocessed text information [5]. The other common method, for Lexicon representation, is utilization of a tree based data structure [2]. Many researches have been done in order to model the error pattern and specifying its parameters."
    ABSTRACT: The CloniZER spell checker is an adaptive, language-independent, 'built-in error pattern free' spell checker tool based on the 'Ternary Search Tree' data structure. It suggests the proper form of misspelled words using a nondeterministic traversal. In other words, the problem of spell checking is addressed by traversing a tree with variable-weight edges. The proposed method learns the error pattern of the media and improves its suggestions over time. Instead of using expert knowledge to model the error pattern, the proposed algorithm learns it through interaction with the user.
  • ABSTRACT: This paper introduces an adaptive, language-independent, 'built-in error pattern free' spell checker. The proposed system suggests the proper form of misspelled words using a nondeterministic traversal of a 'ternary search tree' data structure. In other words, the problem of spell checking is addressed by traversing a tree with variable-weight edges. The proposed system uses interaction with the user to learn the error pattern of the media. In this way, the system improves its suggestions over time.