Casey Whitelaw’s research while affiliated with Google Inc. and other places


Publications (2)


Using the Web for Language Independent Spellchecking and Autocorrection.

Figure 1: Spelling process, and knowledge sources used.
Figure 2: Effect of corpus size used to train the error model.
  • Conference Paper
  • Full-text available

January 2009 · 427 Reads · 129 Citations

Casey Whitelaw et al.
We have designed, implemented and evaluated an end-to-end spellchecking and autocorrection system that does not require any manually annotated training data. The World Wide Web is used as a large noisy corpus from which we infer knowledge about misspellings and word usage. This is used to build an error model and an n-gram language model. A small secondary set of news texts with artificially inserted misspellings is used to tune confidence classifiers. Because no manual annotation is required, our system can easily be instantiated for new languages. When evaluated on human-typed data with real misspellings in English and German, our web-based systems outperform baselines that use candidate corrections based on hand-curated dictionaries. Our system achieves a 3.8% total error rate in English. We show similar improvements in preliminary results on artificial data for Russian and Arabic.
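The abstract's combination of an error model with an n-gram language model follows the classic noisy-channel formulation. The paper's actual models are trained from web-scale data; the sketch below is a minimal stand-in, with a toy corpus and a crude "known words one edit away" error model in place of the web-trained ones.

```python
from collections import Counter

# Toy corpus standing in for web-derived word counts (hypothetical data).
CORPUS = "the quick brown fox jumps over the lazy dog the fox".split()
COUNTS = Counter(CORPUS)
TOTAL = sum(COUNTS.values())

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Noisy-channel pick: argmax over candidates of P(candidate) * P(word | candidate).
    Here the error model treats all in-vocabulary words one edit away as
    equally likely, so only the language-model term decides."""
    if word in COUNTS:
        return word
    candidates = [c for c in edits1(word) if c in COUNTS]
    if not candidates:
        return word  # no confident correction; leave the input unchanged
    return max(candidates, key=lambda c: COUNTS[c] / TOTAL)
```

The paper goes further in two ways this sketch omits: the edit probabilities themselves are learned from web data rather than assumed uniform, and separate confidence classifiers decide whether to autocorrect, suggest, or leave the word alone.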


Web-scale named entity recognition

October 2008 · 332 Reads · 81 Citations

Automatic recognition of named entities such as people, places, organizations, books, and movies across the entire web presents a number of challenges, both of scale and scope. Data for training general named entity recognizers is difficult to come by, and efficient machine learning methods are required once we have found hundreds of millions of labeled observations. We present an implemented system that addresses these issues, including a method for automatically generating training data, and a multi-class online classification training method that learns to recognize not only high level categories such as place and person, but also more fine-grained categories such as soccer players, birds, and universities. The resulting system gives precision and recall performance comparable to that obtained for more limited entity types in much more structured domains such as company recognition in newswire, even though web documents often lack consistent capitalization and grammatical sentence construction.
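The abstract mentions a multi-class online classification method that scales to hundreds of millions of labeled observations. The paper's exact learner is not specified in this listing; a simple multiclass perceptron over sparse feature dictionaries illustrates the online setting, with hypothetical features and labels.

```python
from collections import defaultdict

class MulticlassPerceptron:
    """Online multiclass learner over sparse feature dicts.

    Keeps one weight vector per label; on a mistake, promotes the
    gold label's weights and demotes the predicted label's. Each
    example is seen once per pass, so memory is independent of the
    number of observations."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.w = {y: defaultdict(float) for y in self.labels}

    def score(self, label, feats):
        return sum(self.w[label][f] * v for f, v in feats.items())

    def predict(self, feats):
        return max(self.labels, key=lambda y: self.score(y, feats))

    def update(self, feats, gold):
        pred = self.predict(feats)
        if pred != gold:
            for f, v in feats.items():
                self.w[gold][f] += v
                self.w[pred][f] -= v
        return pred
```

A usage sketch: stream (features, label) pairs from the automatically generated training data and call `update` on each; the fine-grained categories (soccer players, birds, universities) are simply additional labels.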

Citations (2)


... Within this framework, the basic idea is to use machine learning to extract entities with similar contextual features from web pages, then classify or cluster them. For example, Whitelaw et al. [16] proposed an iterative approach for expanding an entity corpus in a web environment. This approach relied on feature models built from known entities, enabling the processing of massive datasets. ...

Reference:

A Survey of Knowledge Graph Construction Using Machine Learning
Web-scale named entity recognition
  • Citing Conference Paper
  • October 2008

... In SB, the user completes or corrects the word by selecting an option from the Suggestion Bar, which is located above the keyboard. ITE methods are built on language models (LMs) such as N-gram models [1] or complex deep neural network models trained on large amounts of text data [2][3][4]. LMs can be used to calculate the most likely words in a given context, making them ideal for error correction and prediction tasks. ...

Using the Web for Language Independent Spellchecking and Autocorrection.
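The citing excerpt above notes that N-gram language models rank the most likely words in a given context. A minimal maximum-likelihood bigram model shows the mechanism; the corpus and helper names here are illustrative, not from either paper.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """MLE bigram model: P(w2 | w1) = count(w1, w2) / count(w1)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    probs = defaultdict(dict)
    for (w1, w2), count in bigrams.items():
        probs[w1][w2] = count / unigrams[w1]
    return probs

def most_likely_next(probs, w1):
    """Return the highest-probability next word, or None for unseen context."""
    if w1 not in probs:
        return None
    return max(probs[w1], key=probs[w1].get)
```

Real keyboard prediction and spellchecking systems add smoothing for unseen n-grams and interpolate higher-order models, but the ranking step is the same.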