Article

Digitization and Data Frames for Card Index Records


Abstract

We develop a methodology for converting card index archival records into usable data frames for statistical and textual analyses. Leveraging machine learning and natural-language processing tools from Amazon Web Services (AWS), we overcome hurdles associated with character recognition, inconsistent data reporting, column misalignment, and irregular naming. In this article, we detail the step-by-step conversion process and discuss remedies for common problems and edge cases, using historical records from the Reconstruction Finance Corporation.
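As a rough illustration of the OCR step such a pipeline starts from, the sketch below (not the authors' code) calls AWS Textract through the boto3 Python SDK and keeps line-level text for downstream parsing. It assumes AWS credentials are already configured; the file name is hypothetical.

# Minimal sketch: OCR one scanned card with AWS Textract (synchronous API).
import boto3

def ocr_card(image_path: str) -> list[str]:
    textract = boto3.client("textract")
    with open(image_path, "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})
    # Each LINE block corresponds to one printed or typed line on the card.
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

print(ocr_card("card_0001.png"))  # hypothetical scan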


... However, the procedures for adapting and tuning existing document image datasets can be complicated to reproduce due to the use of proprietary services or the need to manage numerous software dependencies. For example, a pipeline was recently developed for digitizing scanned card index records (Amujala et al., 2023), but its reliance on proprietary OCR and natural language processing (NLP) services available through Amazon Web Services makes it difficult for other researchers to inspect or adapt the underlying models. ...
... inconsistent data entry) and computational error (e.g. inaccurate character recognition) (Amujala et al., 2023). A key challenge of extracting structured textual information from semi-structured historical records is incorporating their layout information into digitization workflows (Shen et al., 2021). ...
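For readers looking for a fully open-source, layout-aware alternative, a minimal sketch using the Layout Parser toolkit (Shen et al., 2021) with a Tesseract OCR agent might look like the following; the PubLayNet checkpoint, score threshold and file name are illustrative choices, not anything prescribed by the cited works.

# Sketch: detect layout regions first, then OCR each region separately
# so that column structure survives. Requires layoutparser, detectron2,
# pytesseract and opencv-python.
import layoutparser as lp
import cv2

image = cv2.imread("card_0001.png")[..., ::-1]  # BGR -> RGB; hypothetical scan
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
ocr_agent = lp.TesseractAgent(languages="eng")

for region in model.detect(image):
    segment = region.crop_image(image)  # isolate one detected block
    print(region.type, ocr_agent.detect(segment))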
Article
Purpose
Many libraries and archives maintain collections of research documents, such as administrative records, in paper-based formats that limit access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes the authors' approach using digital scanning, optical character recognition (OCR) and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen's Readjustment Act of 1944, also known as the G.I. Bill.

Design/methodology/approach
The authors used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers and the name and location of the bank handling the loan. The authors extracted structured information from these scanned historical records to create a tabular data file and link the records to other authoritative individual-level data sources.

Findings
The authors compared the flexible character accuracy of five OCR methods, then compared the character error rate (CER) of three text extraction approaches: regular expressions, DIA and named entity recognition (NER). The highest-quality structured text output was obtained using DIA with the Layout Parser toolkit, post-processed with regular expressions. Through this project, the authors demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.

Originality/value
The authors' workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR and DIA processes, the authors created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from the loans, the communities built with them and the institutions that implemented the program.
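The Findings above mention post-processing OCR output with regular expressions; a toy version of that step is sketched below. The card layout and every pattern here are hypothetical stand-ins, not the authors' actual expressions for the G.I. Bill records.

# Sketch: pull structured fields out of one OCR'd card line with a regex.
import re

LINE_PATTERN = re.compile(
    r"(?P<name>[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)+),\s+"  # mortgagor name
    r"(?P<city>[A-Z][A-Za-z .]+?)\s+"                   # city
    r"\$(?P<amount>[\d,]+)\s+"                          # mortgage amount
    r"(?P<loan_id>\d{2}-\d{4,6})"                       # identification number
)

def parse_line(ocr_line: str) -> dict | None:
    match = LINE_PATTERN.search(ocr_line)
    return match.groupdict() if match else None

print(parse_line("John A. Smith, Des Moines $6,500 42-10387"))
# -> {'name': 'John A. Smith', 'city': 'Des Moines', 'amount': '6,500', 'loan_id': '42-10387'}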
... The tables were transcribed automatically using artificial intelligence and the AWS Textract text extractor, which relies on cloud computing and predefined OCR (Optical Character Recognition) algorithms for machine learning and the recognition of written characters in documentary collections of varied nature (Amujala et al., 2023; Correia and Luck, 2023). Textract is particularly well suited to transcribing historical documents that cannot be transcribed automatically or semi-automatically with more common office tools, as is the case with the digital copies of the typewritten dispatches of the CTV (Figure 3). ...
Article
The battle of Turó del Balís is a little-known episode of the Catalonia offensive and the military occupation of the Maresme in the final stages of the Spanish Civil War. Republican troops established a defensive line on the coastal hills between Sant Andreu de Llavaneres, Sant Vicenç de Montalt and Caldes d'Estrac. The bulk of the Italian divisional forces of the Fletxes Blaves (Blue Arrows) met strong resistance at Turó del Balís, where the recently formed 77th Republican Division had dug in. On 28 and 29 January 1939 an intense engagement with artillery crossfire took place until the Republican side withdrew towards the new front established on the Tordera river. This article explores the historical and spatial context of the battle for the first time. First, the engagement is compared with the rest of the Italian offensive using automated data extraction from the documentary collection of the Fletxes Blaves. The article then examines the surroundings of the hill and the transformation of the landscape in recent years. Urban development pressure in the area puts at risk the preservation of any material and structural remains associated with the fighting. The archaeological study and conservation of this site is a primary goal for the recovery of historical memory in the Maresme, and represents a step forward in highlighting the combative resistance of the Republican army during the defence of Catalonia.
Article
Full-text available
We examine how media reports influenced trading volumes and order imbalances on the Sydney Stock Exchange (SSX) from 1901 to 1950, focusing on wool market reports as a substitute for broader financial advice in the absence of a specialised investment press. Given wool's status as Australia's primary export and its integration with various sectors, we construct a weekly media sentiment index based on news about wool sales and auctions from the Sydney Morning Herald. Our findings reveal that positive news about the wool market correlates with increased trading volumes and reduced order imbalances on the SSX. This relationship persisted during significant events such as the UK government's wool purchase plans, the 1929 Wall Street Crash, World War II-related trading restrictions, and the short selling ban.
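The abstract does not spell out how the weekly index is built; the toy sketch below shows one simple lexicon-based way such an index could be assembled with pandas. The word lists, dates and article texts are all invented.

# Sketch: score dated news snippets and aggregate to a weekly index.
import re
import pandas as pd

POSITIVE = {"firm", "strong", "rose", "demand", "brisk"}
NEGATIVE = {"weak", "fell", "dull", "decline", "slump"}

def score(text: str) -> int:
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

articles = pd.DataFrame({
    "date": pd.to_datetime(["1923-05-01", "1923-05-03", "1923-05-09"]),
    "text": ["Wool demand strong, prices rose",
             "Dull sales, values fell",
             "Brisk bidding at auction"],
})
articles["sentiment"] = articles["text"].map(score)
print(articles.set_index("date")["sentiment"].resample("W").mean())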
Chapter
The Key Information Extraction (KIE) task is increasingly important in natural language processing, yet only a few well-defined problems serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved F1-scores of 81.77% on Kleister NDA and 83.57% on Kleister Charity. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
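To give a concrete feel for what NER-style baselines do on such documents, the minimal sketch below runs a generic pretrained English NER model through the Hugging Face transformers pipeline. The model name is an off-the-shelf stand-in, not one of the chapter's fine-tuned Kleister baselines, and the sentence is invented.

# Sketch: tag entities in a contract-like sentence with a pretrained model.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",  # generic stand-in model
               aggregation_strategy="simple")

text = ("This Non-Disclosure Agreement is entered into by Acme Ltd, "
        "registered in London, effective 1 March 2019.")
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))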
Article
This article develops a framework for estimating multivariate treatment effect models in the presence of sample selection. The methodology deals with several important issues prevalent in policy and program evaluation, including application and approval stages, non-random treatment assignment, endogeneity, and discrete outcomes. This paper presents a computationally efficient estimation algorithm and techniques for model comparison and treatment effects. The framework is applied to evaluate the effectiveness of bank recapitalization programs and their ability to resuscitate the financial system. The analysis of lender of last resort (LOLR) policies is not only complicated due to econometric challenges, but also because regulator data are not easily obtainable. Motivated by these difficulties, this paper constructs a novel bank-level data set and employs the new methodology to jointly model a bank’s decision to apply for assistance, the LOLR’s decision to approve or decline the assistance, and the bank’s performance following the disbursements. The paper offers practical estimation tools to unveil new answers to important regulatory and policy questions.
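The application, approval and outcome stages described above can be oriented with a generic two-stage selection skeleton, written in LaTeX below; the paper's actual multivariate specification is richer, so treat this purely as a rough map of the structure.

\begin{align*}
  a_i^* &= x_i'\beta_a + \varepsilon_{ai}, \quad a_i = \mathbf{1}\{a_i^* > 0\} && \text{(bank applies for assistance)} \\
  r_i^* &= w_i'\beta_r + \varepsilon_{ri}, \quad r_i = \mathbf{1}\{r_i^* > 0\} && \text{(regulator approves; observed only if } a_i = 1\text{)} \\
  y_i   &= z_i'\gamma + \delta r_i + \varepsilon_{yi} && \text{(post-assistance outcome)}
\end{align*}

Correlation among $(\varepsilon_{ai}, \varepsilon_{ri}, \varepsilon_{yi})$ is what captures non-random application and approval; estimating the system jointly avoids the selection bias of modelling the outcome equation alone.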
Article
Managers proceeded from compiling simple descriptive reports to performing increasingly complex statistical analyses, not just in specialized areas such as cost accounting, but throughout the manufacturing and marketing functions. The new emphasis on using large amounts of data required improved methods of information handling, and this period saw a revolution in office equipment and methods. From forms to projecting lanterns, and from vertical files to the Hollerith machine, the new tools were predecessors of today's computerized information systems. Innovations in information technology enabled managers to use large amounts of data effectively and efficiently. This paper uses published literature of the period and archival materials from two manufacturing firms, E.I. du Pont de Nemours and Company and the Scovill Manufacturing Company, to trace the evolution and use of information-handling systems in American manufacturing firms. After briefly discussing the relationship between managerial methods and the uses of information, I will explore the new techniques and devices that emerged to support the collection, storage, analysis, and presentation of data.
Article
An analysis of innovations in the eighteenth-century British textile industry is the basis for an evaluation of aggregate studies of invention during the Industrial Revolution, derived from patent evidence alone. Disaggregation of the data challenges recent generalizations concerning the pace and pattern of technical change over the period. Discontinuities in the nature of invention, promoting an acceleration in total factor productivity growth, are traced to the 1790s. Prior to that date, industrial development conformed to a pattern of Smithian growth, as manufacturers diversified their output in response to an expanding domestic market for consumer goods.