Takao Fujikawa’s research while affiliated with Osaka University and other places


Publications (2)


The Trove search interface to get the “public meeting” advertisement pages
Average number of lines in an article per year in the annotated OCR error correction dataset
Average WER in a single article per year in the annotated OCR error correction dataset
Visual differences in example articles in 1840, 1881, and 1938
Overview of the pretrained OCR error correction model by Dong and Smith [11]


Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction

September 2022 · 78 Reads · 6 Citations

SN Computer Science

Koji Tanaka · Chenhui Chu · Tomoyuki Kajiwara · [...] · Takao Fujikawa

Large text corpora are indispensable for natural language processing. However, in fields such as literature and the humanities, many documents of interest exist only as scanned images and have not been converted to text data. Optical character recognition (OCR) is a technology that converts scanned document images into text data. However, OCR often misrecognizes characters because of the low quality of the scanned images, which is a crucial factor degrading the quality of the resulting text corpora. This paper addresses corpus construction for historical newspapers. We present a corpus construction method based on a pipeline of image processing, OCR, and filtering. To improve quality, we further propose integrating OCR error correction. To this end, we manually construct an OCR error correction dataset in the historical newspaper domain, propose methods to improve a neural OCR correction model, and compare various OCR error correction models. We evaluate our corpus construction method by its accuracy in extracting articles on a specific topic to construct a historical newspaper corpus. As a result, our method improves the article extraction F score by 1.7% via OCR error correction compared to previous work. This verifies the effectiveness of OCR error correction for corpus construction.
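The figure captions for this paper report a word error rate (WER) per article. As a rough illustration of what that metric measures, the sketch below computes WER as the word-level edit distance between a corrected reference and an OCR output, divided by the number of reference words. This is the standard definition, not code from the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. The paper's own
    tokenization and normalization are not described on this page.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from OCR output
            insertion = curr[j - 1] + 1  # extra word in OCR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One misrecognized word out of five reference words.
print(word_error_rate("a public meeting was held", "a pubic meeting was held"))
```

Because insertions are counted, WER can exceed 1.0 when the OCR output contains many spurious words.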


Overview of the public meeting article extraction method (Small columns in articles are shown as blue lines in the “trimming” sub-figure, and OCR errors are shown in blue fonts in the “OCR” sub-figure)
Example of a public meeting article. Information corresponding to each question is shown in the red boxes (information corresponding to question number 1 is shown in box q1 and so on)
Example of annotation information from an extracted public meeting article
Illustration of fine-tuning ALBERT on our public meeting information extraction task
Information Extraction from Public Meeting Articles

SN Computer Science

Public meeting articles are key to understanding the history of public opinion and the public sphere in Australia. Information extraction from public meeting articles can yield new insights into Australian history. In this paper, we create an information extraction dataset in the public meeting domain. We manually annotate 1258 public meeting articles with the date and time, place, purpose, people who requested the meeting, people who convened the meeting, and people who were convened. We further present an information extraction system, which formulates information extraction from public meeting articles as a machine reading comprehension task. Experiments indicate that our system can achieve an F1 score of 74.98% for information extraction from public meeting articles.
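The abstract does not say how the F1 score is computed. Machine reading comprehension systems are commonly scored with a token-overlap F1 between the predicted and gold answer spans, so the sketch below shows that variant as one plausible reading. The function name `token_f1`, the lowercasing, and the whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer span and a gold span.

    Assumption: lowercase and split on whitespace. The paper's actual
    scoring procedure is not given on this page.
    """
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty is a miss.
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A hypothetical "place" question where the prediction has one extra token.
print(token_f1("the town hall", "town hall"))
```

Averaging this score over all annotated questions would give a corpus-level figure comparable in kind, though not necessarily in definition, to the one reported above.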

Citations (1)


... For example, the methods of constructing corpora described in [1][2][3][4] require a large amount of natural text data based on which the corpus is generated. This requirement significantly limits the possibility of their use in developing information systems because the initial data must be stored somewhere additionally. ...

Reference:

DICTIONARY-BASED DETERMINISTIC METHOD OF GENERATION OF TEXT CORPORA
Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction

SN Computer Science