
Orthographic Error Patterns of Author Names in Catalog Searches

Abstract and Figures

An investigation of error patterns in author names based on data from a survey of library catalog searches. Position of spelling errors was noted and related to length of name. Probability of a name having a spelling error was found to increase with length of name. Nearly half of the spelling mistakes were replacement errors; following, in order of decreasing frequency, were omission, addition, and transposition errors.
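The four error categories named in the abstract can be illustrated with a short Python sketch. This is not the procedure used in the study; it is a minimal example that assumes each misspelling contains exactly one error, and the function name `classify_single_error` and the sample misspellings are hypothetical.

```python
def classify_single_error(intended, typed):
    """Classify one single-character error between two spellings.

    Returns (error_type, position), where position is the 1-based index
    at which the two strings first differ, or None when the difference
    cannot be explained by exactly one error. Comparison ignores case.
    """
    a, b = intended.lower(), typed.lower()
    if a == b:
        return ("none", 0)

    # Length of the common prefix: the first point of disagreement.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1

    if len(a) == len(b):
        # One letter swapped for another, the rest identical.
        if a[i + 1:] == b[i + 1:]:
            return ("replacement", i + 1)
        # Two adjacent letters exchanged, the rest identical.
        if (i + 1 < len(a) and a[i] == b[i + 1] and a[i + 1] == b[i]
                and a[i + 2:] == b[i + 2:]):
            return ("transposition", i + 1)
    elif len(b) == len(a) - 1 and a[i + 1:] == b[i:]:
        # The typed form is missing one letter.
        return ("omission", i + 1)
    elif len(b) == len(a) + 1 and a[i:] == b[i + 1:]:
        # The typed form has one extra letter.
        return ("addition", i + 1)
    return None  # more than one error; outside the scope of this sketch


if __name__ == "__main__":
    for intended, typed in [("Rosenberg", "Rosenburg"), ("Kochen", "Kohen"),
                            ("Taylor", "Tayllor"), ("Tagliacozzo", "Tagilacozzo")]:
        print(intended, typed, classify_single_error(intended, typed))
```

Returning the position alongside the category makes it possible to relate error position to name length, as the abstract describes.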
... Based on Cutter's objects, a catalog is arranged on the assumption that searchers arrive at the catalog knowing at least one of the three access points (author, title, or subject). However, studies of information-seeking behavior in both manual and automated environments show that people arrive at a catalog with incomplete information for any of the access points (Borgman & Siegfried, 1992; Chen & Dhar, 1990; Tagliacozzo, Kochen, & Rosenberg, 1970; Taylor, 1984). They must use information external to the catalog (e.g., bibliographies, lists of subject headings) to obtain sufficient data to express their search within the scope of Cutter's objects. ...
Article
We return to arguments made 10 years ago (Borgman, 1986a) that online catalogs are difficult to use because their design does not incorporate sufficient understanding of searching behavior. The earlier article examined studies of information retrieval system searching for their implications for online catalog design; this article examines the implications of card catalog design for online catalogs. With this analysis, we hope to contribute to a better understanding of user behavior and to lay to rest the card catalog design model for online catalogs. We discuss the problems with query matching systems, which were designed for skilled search intermediaries rather than end-users, and the knowledge and skills they require in the information-seeking process, illustrated with examples of searching card and online catalogs. Searching requires conceptual knowledge of the information retrieval process—translating an information need into a searchable query; semantic knowledge of how to implement a query in a given system—the how and when to use system features; and technical skills in executing the query—basic computing skills and the syntax of entering queries as specific search statements. In the short term, we can help make online catalogs easier to use through improved training and documentation that is based on information-seeking behavior, with the caveat that good training is not a substitute for good system design. Our long term goal should be to design intuitive systems that require a minimum of instruction. Given the complexity of the information retrieval problem and the limited capabilities of today's systems, we are far from achieving that goal. If libraries are to provide primary information services for the networked world, they need to put research results on the information-seeking process into practice in designing the next generation of online public access information retrieval systems. © 1996 John Wiley & Sons, Inc.
... Name-matching is just one area of research in IR and Web search engines (Hayes, 1994; Hermansen, 1985; Navarro et al., 2003; Pfeiffer, Poersch, & Fuhr, 1996; Spink, Jansen, & Pedersen, 2004). The identification of PN variants is a recurring problem for the retrieval of information from online catalogs and bibliographic databases (Borgman & Siegfried, 1992; Bouchard & Pouyez, 1980; Bourne, 1977; Rogers & Willett, 1991; Ruiz-Perez, Delgado López-Cózar, & Jiménez-Contreras, 2002; Siegfried & Bernstein, 1991; Strunk, 1991; Tagliacozzo, Kochen, & Rosenberg, 1970; Tao & Cole, 1991; Taylor, 1984; Weintraub, 1991). In general, the techniques for identifying variants are included under the common heading of authority work (Auld, 1982; Taylor, 1989; Tillett, 1989). ...
Article
This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of inaccuracy that affect information retrieval, the quality of information from databases, and the citation statistics used for the evaluation of scientists' work. A number of approximate string matching techniques have been developed to validate variant forms, based on similarity and equivalence relations. We classify the personal name variants as nonvalid and valid forms. In establishing an equivalence relation between valid variants and the standard form of its equivalence class, we defend the application of finite-state transducers. The process of variant identification requires the elaboration of: (a) binary matrices and (b) finite-state graphs. This procedure was tested on samples of author names from bibliographic records, selected from the Library and Information Science Abstracts (LISA) and Science Citation Index Expanded (SCI-E) databases. The evaluation involved calculating the measures of precision and recall, based on completeness and accuracy. The results demonstrate the usefulness of this approach, although it should be complemented with methods based on similarity relations for the recognition of spelling variants and misspellings.
Book
Full text of the book From Gutenberg to the Global Information Infrastructure, free for everyone to use.
Article
An analysis of library use studies previously published by this author (Kilgour 1989) revealed that of every hundred user requests for a book, seven are not satisfied because the library has not acquired the book, eleven because there is a defect in the catalog or error in its use (3.45 to deficiency and 7.55 to user error), and twenty-three because the book is not on the shelf. This paper demonstrates types of online catalog access that can reduce the failures caused by card catalog flaws or by user search errors by half.
Article
Systems for automatic detection and correction of spelling errors in natural language texts are considered. The development of such systems for both English and Russian (and for inflected languages in general, including all Slavic languages) is discussed. An approach associated with morphological analysis of the wordforms in the given text is described. The topics considered in the paper include the main methods of automatic spelling correction, levels of automation of the spelling error correction process, the effect of the type of computer used, the use of spelling error correctors in a stand-alone mode and in combination with word-processing software, and the maintenance of auxiliary dictionaries.
Article
A concept of a catalog that is hospitable to mechanized descriptive cataloging is presented, together with four areas of research programs that will yield findings useful in the development of such a catalog. (Author/JB)
Article
Introduction -- Automated catalogs: definition, characteristics, components, and structure -- Evolution over time: historical development of catalogs and generations of OPACs -- The culture of evaluation in the Internet era -- The conceptual approach to the evaluation process: the user-based approach -- Methods of analysis and data collection techniques -- Indicators for the evaluation of functionality -- The evaluation of information retrieval -- Subject access in OPACs: analysis of problems and proposed solutions -- Evaluation of the content of OPAC databases -- Interfaces: visualization and design techniques -- OPAC interfaces: typology, evaluation, and visualization of information -- Recommendations for the presentation of information in interfaces -- Recommendations for the presentation of bibliographic information in OPAC interfaces -- The future: development trends of OPACs. On title page: Impresa da Universidade de Coimbra
Article
A description and comparison is presented of four compression techniques for word coding having application to information retrieval. The emphasis is on codes useful in creating directories to large data files. It is further shown how differing application objectives lead to differing measures of optimality for codes, though compression may be a common quality.
Article
The relative frequencies of spelling errors as a function of letter position have been examined for 7-, 9-, and 11-letter words selected at random from the Thorndike-Lorge word list. These were administered to 150 8th graders, and 89 junior college freshmen, respectively. The distribution of errors according to letter position was found to closely approximate the classical skewed, bow-shaped, serial-position curve for errors generally found in serial rote learning. Other features in common between spelling and serial learning were discussed. It is suggested that a theory of serial learning and of the serial-position effect may be germane to the psychology of spelling.
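A distribution of errors by letter position, grouped by word length, could be tallied along the following lines. This Python sketch is illustrative only: it records just the first point of disagreement in each misspelling, and the function names and sample word pairs are hypothetical rather than drawn from the study.

```python
from collections import Counter, defaultdict


def first_error_position(intended, typed):
    """Return the 1-based position of the first differing letter, or None.

    When one spelling is a prefix of the other, the error is placed one
    position past the end of the shorter spelling.
    """
    for idx, (x, y) in enumerate(zip(intended, typed), start=1):
        if x != y:
            return idx
    if len(intended) != len(typed):
        return min(len(intended), len(typed)) + 1
    return None  # spellings are identical


def position_profile(pairs):
    """Map word length -> Counter of first-error positions."""
    profile = defaultdict(Counter)
    for intended, typed in pairs:
        pos = first_error_position(intended, typed)
        if pos is not None:
            profile[len(intended)][pos] += 1
    return profile


if __name__ == "__main__":
    sample = [("receive", "recieve"), ("library", "libary"),
              ("calendar", "calender"), ("believe", "believe")]
    for length, counts in sorted(position_profile(sample).items()):
        print(length, dict(counts))
```

Plotting each length's counts against position would show whether errors cluster toward the middle of the word, which is the bow-shaped pattern the abstract reports.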
Article
Journal of Library Automation, Vol. 1/4, December 1968

A study of problems associated with bibliographic retrieval using unverified input data supplied by requesters. A code derived from compression of title and author information to four, four-character abbreviations each was used for retrieval tests on an IBM 1401 computer. Retrieval accuracy was 98.67%.

Current acquisitions systems which utilize computer processing have been oriented toward handling the order request only after it has been manually verified. Systems, such as that of Texas A & I University (1), have proven useful in reducing certain clerical routines and in handling fund accounting (2). Lack of a larger bibliographic data base and lack of adequate computer time have prevented many libraries from studying more sophisticated acquisitions systems.

At the time the MARC Pilot Project (3) was started, the Fondren Library at Rice University did not have operating computer applications in acquisitions, serials, or cataloging. The University administration and the Research Computation Center provided sufficient access to the IBM 7040 to permit the study of problems associated with bibliographic retrieval using input data which has varying accuracy.

In 1966, Richmond expressed the concern of many librarians about the lack of specific statements describing the techniques by which on-line retrieval could be accomplished without complicating the problems presented by the current card catalog (4). She had previously described some of the problems created by the kind and quality of data being utilized as references by library users (5).

An examination of the pertinent literature indicates that most of the current work in retrieval, while related to problems of bibliographic retrieval, does not offer much assistance when the input data is suspect (6, 7, 8). Tainiter and Toyoda, for example, have described different techniques of addressing storage using known input data (9, 10).

One of the best-known retrieval systems is that of the Chemical Abstracts Service, which provides a fairly sophisticated title-scan of journal articles with a surprising degree of flexibility in the logic and term structure used as input. Comparable systems are used by the Defense Documentation Center, Medlars Centers, and NASA Technology Centers. These systems have one specific feature in common: a high level of accuracy in the input data.

USER-SUPPLIED BIBLIOGRAPHIC DATA

The reliability of bibliographic data supplied to university libraries from faculty and students has long been questioned (5). Any search system which accepts such data must be designed 1) to increase the level of confidence through machine-generated search structures and variable thresholds and 2) to reduce the dependence upon spelling accuracy, punctuation, spacing and word order.

The initial task of formulating an approach to this problem is to determine the type, quality, and quantity of data generally supplied by a user. To derive a controlled set of data for this purpose, the Acquisition Department of the Fondren Library provided Xerox copies of all English language requests dated 1965 or later and a random sample of 295 requests was drawn from that file of 5000 items.

This random sample was compared to the manually-verified, original order-requests to determine 1) the frequency with which data was supplied by the requestor and 2) the accuracy of the provided information. Results of this study are given in Table 1.
Article
Patterns of searching in library catalogues were analysed, using the data from a large survey of the use of three university library catalogues and one public library catalogue. ‘Known-item’ searches were the object of the study. Success or failure of the search was correlated to degree of correctness and completeness of the searcher's information about title and author of the item that he wished to locate. Factors involved in searching strategies were discussed. The double role played by both the title and the author as a way of access to the catalogue and as a means for identifying the right entry was examined.
Article
A study of the feasibility of applying an experimental dictionary and a digital computer to proofreading led to investigation of the nature of conversion errors and development of computer programs for correction of unorthographic machine-readable text. The correction programs were tested with a sample of unproofread technical abstracts with a large number of possible errors. The error-correction program has three levels: the first level corrects commonly misspelled words through matching with a stored dictionary; the second level treats common misspelling patterns not readily amenable to direct dictionary correction. Error patterns most probably associated with human operation of transcription devices, or with limitations or malfunctions of conversion equipment, are dealt with by the third level. Beyond the first level, the program can apply nine error-correction procedures. The general structure and organization of the program is illustrated in Fig. 1 of the text. Because selections of alternative spellings are combinatorial, the correction algorithms must test several hundred candidates per error, on the average. Hence a study explored system design characteristics to increase the efficiency of correction procedures.
Patterns of Searching in Library Catalogs. In: Integrative Mechanisms in Literature Growth
  • R Tagliacozzo
Tagliacozzo, R., et al.: "Patterns of Searching in Library Catalogs." In: Integrative Mechanisms in Literature Growth. Vol IV. (University of Michigan, Mental Health Research Institute, January 1970). Report to the National Science Foundation, GN 716.
An Algorithm for Noisy Matches in Catalog Searching. In: A Study of the Organization and Search of Bibliographic Holdings Records in On-Line Computer Systems: Phase I
  • James L Dolby
Dolby, James L.: "An Algorithm for Noisy Matches in Catalog Searching." In: A Study of the Organization and Search of Bibliographic Holdings Records in On-Line Computer Systems: Phase I. (Berkeley, Cal.: Institute of Library Research, University of California, March 1969), 119-136.