
A Technique for Computer Detection and Correction of Spelling Errors

Communications of the ACM (Impact Factor: 3.62). 03/1964; 7(3):171-176. DOI: 10.1145/363958.363994
Source: DBLP

ABSTRACT

The method described assumes that a word which cannot be found in a dictionary has at most one error, which might be a wrong, missing or extra letter or a single transposition. The unidentified input word is compared to the dictionary again, testing each time to see if the words match—assuming one of these errors occurred. During a test run on garbled text, correct identifications were made for over 95 percent of these error types.
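The single-error assumption in the abstract can be illustrated with a short Python sketch. It is not the paper's implementation: the function names `single_error_match` and `suggest` and the sample word list are hypothetical, chosen only for illustration. The sketch tests whether an unidentified word matches a dictionary word once exactly one wrong letter, missing letter, extra letter, or adjacent transposition is assumed.

```python
from typing import List, Optional


def single_error_match(word: str, candidate: str) -> Optional[str]:
    """Name the single error that turns `candidate` into `word`, or return None."""
    if word == candidate:
        return "exact"
    lw, lc = len(word), len(candidate)
    if abs(lw - lc) > 1:
        return None  # a single error changes the length by at most one

    # Find the first position at which the two words differ.
    i = 0
    while i < min(lw, lc) and word[i] == candidate[i]:
        i += 1

    if lw == lc:
        # Same length: either one wrong letter or two adjacent letters swapped.
        if word[i + 1:] == candidate[i + 1:]:
            return "wrong letter"
        if (i + 1 < lw
                and word[i] == candidate[i + 1]
                and word[i + 1] == candidate[i]
                and word[i + 2:] == candidate[i + 2:]):
            return "transposition"
        return None
    if lw + 1 == lc:
        # The input is one letter short: skip one letter of the candidate.
        return "missing letter" if word[i:] == candidate[i + 1:] else None
    # The input is one letter long: skip one letter of the input.
    return "extra letter" if word[i + 1:] == candidate[i:] else None


def suggest(word: str, dictionary: List[str]) -> List[str]:
    """Return dictionary words that match `word` under the single-error assumption."""
    if word in dictionary:
        return [word]
    return [entry for entry in dictionary if single_error_match(word, entry)]


if __name__ == "__main__":
    words = ["word", "world", "would"]  # hypothetical sample dictionary
    print(suggest("wrold", words))
```

Comparing the input against every entry, as `suggest` does, is the simplest reading of "compared to the dictionary again"; a real system would likely narrow the candidates first, for example by word length.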

    • "The second error domain, misspelling, can occur at many points in the process of communicating a name (initial labeling, transcription when digitizing, or transforming for data sharing). Misspellings can be defined as differences in a text string due to character insertions, deletions, substitutions or transpositions of what is otherwise correct [40]. The last of Chapman's domains, issues with "format", is a broad group of errors."
    ABSTRACT: Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
    Full-text · Article · Jan 2016 · PLoS ONE
    • "To identify explicit noise, we embed a misspelling correction module in our unsupervised language model-based biomedical term detection method. For temporality and other types of named entities, we set up seed patterns and run our own bootstrapping method: It detects variants of the seed patterns in the data using Damerau-Levenshtein distance [Damerau 1964]. To identify implicit noise, we use more detailed NLP method employing syntactic analysis and filter out untrustworthy information. "
    ABSTRACT: We explore methods for effectively extracting information from clinical narratives that are captured in a public health consulting phone service called HealthLink. Our research investigates the application of state-of-the-art natural language processing and machine learning to clinical narratives to extract information of interest. The currently available data consist of dialogues constructed by nurses while consulting patients by phone. Since the data are interviews transcribed by nurses during phone conversations, they include a significant volume and variety of noise. When we extract the patient-related information from the noisy data, we have to remove or correct at least two kinds of noise: explicit noise, which includes spelling errors, unfinished sentences, omission of sentence delimiters, and variants of terms, and implicit noise, which includes non-patient information and patient's untrustworthy information. To filter explicit noise, we propose our own biomedical term detection/normalization method: it resolves misspelling, term variations, and arbitrary abbreviation of terms by nurses. In detecting temporal terms, temperature, and other types of named entities (which show patients' personal information such as age and sex), we propose a bootstrapping-based pattern learning process to detect a variety of arbitrary variations of named entities. To address implicit noise, we propose a dependency path-based filtering method. The result of our denoising is the extraction of normalized patient information, and we visualize the named entities by constructing a graph that shows the relations between named entities. The objective of this knowledge discovery task is to identify associations between biomedical terms and to clearly expose the trends of patients' symptoms and concern; the experimental results show that we achieve reasonable performance with our noise reduction methods.
    Full-text · Article · Jul 2015 · ACM Transactions on Intelligent Systems and Technology
    • "The language may also interfere in the order in which words are misspelled [van Berkel and Smedt 1988]. Alternatively, we can classify errors as (1) non-word errors, (which fit into one of Damerau's categories [Damerau 1964]), and (2) real-word errors [Kukich 1992], as previously explained. "
    DESCRIPTION: This technical report presents an OO Writer module for rearranging the spelling suggestion list in Brazilian Portuguese. To do so, the module relies on some statistics collected from texts typed in this language. As it turned out, a comparison between the lists generated by the newly added module and the ones originally suggested by Open Office Writer showed an improvement regarding the order in which suggestions are presented to the user.
    Full-text · Research · May 2015