
Wordgrid: Extending Chargrid with Word-level Information


Abstract

Chargrid is a recently proposed approach to understanding documents with 2-dimensional structure. It represents a document with a grid, thereby preserving its spatial structure for the processing model. Text is embedded in the grid with one-hot encoding on character level. With Wordgrid we extend Chargrid by employing a grid on word level. For embedding words with semantically meaningful vectors, we propose a novel method for estimating dense word vectors, called word2vec-2d. It is a fork of word2vec that is trained on 2D document corpora rather than 1D text sequences. The notion of context is redefined to be the variably-sized set of words that are spatially located within a certain distance to the center word. BERTgrid, our most enhanced Wordgrid version, uses contextualized word piece vectors. The concrete vector chosen for a position in the grid is retrieved from the hidden representations of a BERT language model. This model has access to the neighboring text, as opposed to mapping every symbol 1:1 to its corresponding representation, irrespective of position and contextual meaning. Both new methods benefit greatly from unsupervised pre-training. We apply them to two proprietary SAP invoice datasets, a large unlabeled and a smaller labeled one. The task is key-value extraction, e.g. determining the invoice date or vendor name. The best Wordgrid model improves over the Chargrid baseline by a margin of 0.91 percentage points; BERTgrid achieves even better performance, 3.73 percentage points above Chargrid.
Baden-Württemberg Cooperative State University Karlsruhe
T2 3300
Bachelor’s Thesis
Author: Timo I. Denk
Student ID, course: 1441403, TINF16B1
Course of studies: Computer Science
Company: SAP SE (Berlin, Germany)
Supervisors: Dr. Christian Reisswig, SAP Deep Learning Center of Excellence;
PD Dr. Markus Reischl, Karlsruhe Institute of Technology (KIT)
Date of submission: September 2, 2019
Keywords: 2D document understanding, embedding, contextualization
Author’s Declaration. Unless otherwise indicated in the text or references,
or acknowledged above, this thesis is entirely the product of my own scholarly
work. The use of we as a subject is a stylistic choice and refers to me. This
thesis has not been submitted either in whole or part, for a degree at this or
any other university or institution.
Berlin, Germany; August 28, 2019
Timo I. Denk
Contents
1 Introduction
2 Background
2.1 Embedding in 1D NLP
2.2 2D Document Understanding
3 Prerequisites
3.1 Chargrid: Towards Understanding 2D Documents
3.2 Word2vec: Estimation of Word Representations in Vector Space
3.3 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
4 Novel Methods
4.1 Word2vec-2d: Word2vec for 2D Corpora
4.2 Document Understanding with Wordgrid
4.2.1 2D Document Representation with Wordgrid
4.2.2 Non-contextualized Embedding
4.2.3 BERTgrid: Contextualized Embedding
4.2.4 Combining Chargrid and Wordgrid
5 Application: Information Extraction from Invoices
5.1 Concur Travel and Expense Management
5.2 Invoice Datasets
5.3 Evaluation Measure
5.4 Implementation Details
6 Results
6.1 Experiments
6.2 Model Complementarity
7 Discussion
8 Future Work
9 Conclusion
References
List of Figures
1 Chargrid network architecture
2 Word2vec CBOW model
3 BERT model architecture
4 Serialization of a 2D document
5 Context in 2D vs. 1D
6 Visualization of Chargrid and Wordgrid for a sample invoice
7 Invoice serialization methods
8 BERTgrid creation pipeline
9 Model complementarity Venn diagrams
10 Concur “New Expense” user interface
11 Dataset ground truth
12 Spatial distribution of invoice fields
13 Word piece box size histograms
14 Word piece count per invoice histogram
15 Comparison of convergence speed
16 Model complementarity on the line item description field
17 Model complementarity on the line item quantity field
18 Model complementarity on the vendor address field
List of Tables
1 Comparison of embedding in 1D and 2D
2 Comparison of Wordgrid with non-contextualized and contextualized embedding
3 Main results
4 Embedding choice results for plain Wordgrid
5 Chargrid-Wordgrid combination method results
List of Algorithms
1 Distance between two bounding boxes
2 Wordgrid pre-processing function
3 BERTgrid pre-processing function
1 Introduction
The digitalization of industry processes and consumer interaction comes with
the transition away from paper documents for information carriage to digital
representations. Instead of sending letters or printing files, it is becoming increasingly
common to use digital information processing. A substantial set of processes,
however, still relies on printed documents. When they interface with digital
systems, the information must be digitized. This repetitive task can be done
by humans entering the information manually. Recently, there has been a ten-
dency to automate such processes using computer programs that are capable
of understanding the documents to a degree that allows them to extract the
needed information accurately.
A common way of processing physical documents with computer programs is
scanning them to retrieve a digital representation, namely an image. In the
second step it is common to detect and recognize the individual characters with
an optical character recognition (OCR) engine. While the digital representation
contains character-level information at this stage, the desired information is
still to be extracted. The difficulty of the extraction task, which is the third
step, depends on the type of document that is being processed: standardized
formats can be tackled with rule-based systems that would search for instance
for characters at a pre-defined spatial position in the document. With more
heterogeneous inputs the understanding gets more difficult and requires a certain
degree of intelligence.
An example of the aforementioned document processing is the information
extraction from invoices in travel expense report systems. Customers report
the expenses they had, documented with an uploaded photo or scan of the invoice
they received (e.g. from the hotel they stayed at). For validation purposes,
fields like the invoice vendor, date, and amount must be extracted from the
filed document.
SAP’s document understanding pipeline Chargrid, recently proposed by Katti
et al. (2018), is a method for solving the information extraction (third) step in
the processing procedure. It relies solely on the information provided by the
OCR engine, in that it constructs a tensor representation from it, where the
spatial information of the original document is preserved. This tensor serves as
the input to a convolutional neural network (CNN), trained to predict bounding
boxes and segmentation masks for the fields to be extracted from the document.
The choice of the input tensor is one of the key components of Chargrid: Com-
pared to a raw image, as retrieved from a scan, it is a much more compact
representation of a document that enables the machine learning (ML) model to
learn how to extract the desired fields more easily.
Chargrid’s information extraction is sometimes erroneous and below human-
level performance. As part of the constant effort to improve the quality of
Chargrid’s key-value extraction, we extend it with word-level information. That
is, we modify the input representation to embed words or sub-words of the orig-
inal document rather than individual characters. In a second advancement, we
lift the representation richness to a higher level by using so-called
contextualized embeddings. That means we construct a given spatial position of the input
tensor not only based on the word that is located at the corresponding position
in the invoice, but also based on its neighborhood, also known as context. We
show that the new representations improve the extraction performance of the
ML model significantly and lead to convergence after fewer training steps.
In the field of natural language processing (NLP), commonly used input repre-
sentations are subject to change: The popular word2vec embedding computation
algorithms by Mikolov et al. (2013) slowly made way for more advanced, con-
textualized embeddings such as ELMo (Peters et al. (2018)) or BERT (Devlin
et al. (2018)). The transition came with significant improvements in perfor-
mance. The best performing models on common NLP benchmarks are using
contextualized embeddings to represent the input text sequences. Another
aspect that has changed is the level on which input is embedded: word piece
embedding or variants of it prevail over character- and word-level embedding.
The NLP tasks in GLUE and other popular NLP benchmarks consist of word
sequences where the words are arranged in one dimension. The position of a
word within the input can be stored as its index in the sequence. Our document
understanding problem differs fundamentally in that its input symbols (charac-
ters, words, or word pieces) are positioned within a 2-dimensional document, i.e.
each symbol is associated with an x- and a y-coordinate. Because the recent,
popular NLP research focuses almost exclusively on the 1-dimensional domain,
we see our work as an important and successful transfer of non-character-level
embedding and contextualization from 1D to 2D.
Another tendency in the NLP research community is the utilization of large unla-
beled corpora for pre-training. An example is OpenAI’s GPT-2 model (Radford
et al. (2019)) which is pre-trained on “slightly over 8 million documents for a
total of 40 GB of text”. Research has shown that extensive pre-training aids
a model’s performance on smaller tasks. From a practical point of view, pre-
training is often particularly useful when there is limited labeled, but abundant
unlabeled data available.
In SAP’s concrete situation, the subsidiary Concur, a travel and expense man-
agement service provider, processes more than 100k invoice documents per day.
Processing means that information about the invoice, such as its date, the total
amount, or the list of line items, is extracted. The automation of the
invoice field extraction has several business benefits for Concur: most notable
are faster processing of documents, processing with fewer errors and difficulties,
and cost savings. Concur’s mode of operation leads to the accumulation of large
amounts of unlabeled invoices, which we seek to leverage for training our invoice
field extraction system by pre-training models on it.
This Bachelor’s thesis was written as part of a cooperative integrated degree
program between the Cooperative State University Karlsruhe (DHBW) and
SAP. The respective department at SAP is the Deep Learning Center of Excel-
lence, located in Berlin, Germany. It is a team of data scientists who perform
ML prototype development as well as ML research. An ongoing project of the
department is the enhancement and extension of the invoice parsing system
Chargrid, on top of which this work is building.
This work was inspired by a suggestion made in the original Chargrid paper
(Katti et al. (2018)) to represent a document on word level rather than character
level. The goal was to improve the information extraction performance on the
invoice task. The thesis is a combination of research and practical application as
the developed methods are applicable to 2D document understanding in general,
despite being motivated by and validated on the invoice key-value extraction
task.
Listed below are the main scientific and economic contributions made within
the scope of this Bachelor’s thesis:
• Word2vec-2d: Lifting of the word2vec method (Mikolov et al. (2013))
  from 1D to 2D. While word2vec is being applied to corpora of sequential
  text, word2vec-2d uses a novel notion of context allowing it to operate on
  corpora comprised of 2D documents.
• Development of Wordgrid; an extension of Chargrid (Katti et al. (2018))
  enriched with word-level embedding.
• Contextualization in 2D: Introduction of contextualized 2D document
  representations, which greatly profit from unsupervised pre-training on
  large, unlabeled document datasets.
• Implementation of the new methods and integration into the existing
  Chargrid code base at SAP.
• Improvement of the existing SAP invoice parsing system by a significant
  margin of 3.73 percentage points.
The remainder of this work is structured as follows: Section 2 positions this
work with respect to other research. It discusses related tendencies in the
research community and sheds light on current methodologies and similarities
between 1D NLP and 2D document understanding. In Section 3 we recapitulate
the previous work required to understand our research, namely the document
understanding pipeline Chargrid, the word2vec model for word embedding
computation, and the language representation model BERT. That section may be
skipped by readers familiar with the matter. Section 4 introduces our novel
methods. It is split into the presentation of word2vec-2d in Section 4.1 and
Wordgrid (both non-contextualized and contextualized) in Section 4.2. The
concrete application of our research to information extraction from invoices is
discussed in Section 5. There, we also define the evaluation measure
(Section 5.3) associated with the invoice application and used in most
experiments, and discuss noteworthy implementation details (Section 5.4). The
listing of results (Section 6) is followed by its discussion (Section 7).
Suggestions for future work are discussed in Section 8. Section 9 draws a
conclusion.
The mathematical notation used throughout the document is the one defined
in the “Notation” chapter in Goodfellow et al. (2016). Most importantly, sets
use the $\mathbb{A}, \mathbb{B}, \mathbb{C}$ font, and vectors, tensors, and
matrices are bold, e.g. $\mathbf{v}$ or $\mathbf{M}$.
When indexing a scalar entry in a matrix the letter is not written in bold, e.g.
$M_{i,j}$.
2 Background
The work on Wordgrid is centered around the representation of 2D documents
for subsequent processing with a neural network. It therefore touches on the
aspects of embedding/representation and 2D document understanding. There is plenty
of work on embedding in 1D natural language processing (NLP), in contrast to
relatively little research on the analogous aspects in 2D.
This section is split into two parts: First, we recapitulate and analyze the
recent (roughly since 2014) research in the field of embedding for 1D NLP. We
mainly name and discuss two key aspects of embedding that are relevant for
this work: (1) Different embedding levels, i.e. whether to embed characters,
words, or word pieces; (2) embedding choices, i.e. how embedding vectors can
be chosen, where we put particular focus on the distinction of non-contextualized
and contextualized embedding. After that we do the same analysis for the field
of 2D document understanding and examine how current approaches tackle the
problem.
We point out that the 2D domain is lagging behind the 1D domain in terms of
embedding levels and embedding choices. This suggests that transferring suc-
cessful methods from 1D to 2D would open up a field of new methods. Those
might, analogous to the improvements made in the 1D domain, improve 2D
document understanding. In general, the processing of natural language with
2D information attached to it has received little attention. Neither are there
widespread, publicly available benchmarks/datasets, nor are many papers be-
ing published. Also, the terminology 2D document understanding itself was
introduced just recently by Katti et al. (2018); an equivalent term is document
intelligence. This is in stark contrast to the rich history and strong presence of
normal, 1D NLP.
2.1 Embedding in 1D NLP
In the field of NLP, embedding refers to the method of representing a sym-
bol with a vector. A symbol is commonly a character, word, or word piece.
Sentences or larger chunks of text can also be embedded, however, this is not
of relevance here. The embedded representation of natural language is then
processable by ML models. Over the past years, different ways of embedding
were developed and usage patterns shifted. In this section, we shed light on the
developments, name relevant work, and put it into the context of our research.
Embedding Levels. A basic NLP task is text classification. A model is asked
to predict a class for a given piece of text, e.g. whether or not a movie review
is positive. Neural text classification models operating on different embedding
levels were developed. Zhang et al. (2015) were the first to show that
character-level CNNs can solve text classification tasks comparably well. In their case,
characters are embedded with 70-dimensional one-hot encoded vectors. The
authors show that the CNN is able to learn to understand the language, by
looking at character sequences, to an extent sufficient to solve the classification
tasks. Kim (2014) trains a CNN on text classification tasks but chooses to
embed the sentences on word level. The results are also comparably good.
Devlin et al. (2018) use word pieces: Words are split into parts and those parts
are then embedded. Smaller or well known words might not be split at all, for
instance the sentences
How are you doing? Extraordinarily well!
would be split¹ into
where commas separate tokens. The longer word Extraordinarily would
be split into four word pieces. Another example are the German words
Äußerst solides Halbwissen!
which are tokenized to the word pieces
The aforementioned embedding levels have individual advantages and
disadvantages: Character-level embedding is universal in that every character can be
embedded (for languages similar to English or German in nature) and anything
can be learned, if dataset and model capacity are sufficiently large. Conversely,
word-level embedding typically requires fewer training samples for the
processing neural model to converge, because information about the language is already
given to the model. The model does not need to learn that the consecutive
characters h, e, r, e, delimited by spaces, belong together and mean “here”; instead
it can learn directly that here means “here”. In addition, semantically similar
words may be grouped. Furthermore, the sequence length is reduced by a
significant factor (approximately five for English; see Bochkarev et al. (2012)) when
embedding on word level. Models such as recurrent neural networks (RNNs) or
models with attention are known to work better and run faster
with shorter sequences (see Section 10.7 in Goodfellow et al. (2016)).
The main problem of word-level embedding is the handling of out-of-vocabulary
words. A fixed-size vocabulary cannot contain all words of a natural language;
for some languages like German that is by definition, as new words can be
constructed by concatenating other ones. Depending on the task, spelling mistakes
can also occur which make an embedding impossible or wrong. This problem
was mitigated by Lample et al. (2016) with a word-level embedding that is a
combination of a normal word-level lookup table and a pooled, fixed-size
embedding computed from the characters of the given word using a character-level
model.

¹ Using tokenizer and vocabulary from the BERT repository at
Word pieces are currently the most successful approach, see Yang et al. (2019).
They have the benefit of shortening the sequence relative to character-level
embedding, while being able to embed most words simply by splitting them
into chunks.
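The splitting of words into chunks can be illustrated with a minimal sketch of greedy longest-match-first word piece tokenization. The function name, the toy vocabulary, and the resulting pieces are assumptions made for illustration; they are not the real BERT vocabulary and do not reproduce the exact splits of the examples above. The "##" prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (NOT the real BERT vocabulary).
TOY_VOCAB = {"how", "are", "you", "well", "extra", "##ord", "##ina", "##rily"}

short_word = wordpiece_tokenize("how", TOY_VOCAB)
long_word = wordpiece_tokenize("extraordinarily", TOY_VOCAB)
unknown_word = wordpiece_tokenize("xyz", TOY_VOCAB)
```

With this toy vocabulary the short word stays intact, the long word is split into several pieces, and a word without any matching piece falls back to the unknown token.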
Embedding Choices. Let $V$ denote the number of different symbols that
can be embedded; $N$ is the dimensionality of the embedding. The simplest way
of embedding a symbol is to use a one-hot encoding. Here, $V = N$ and the
embedding for the $i$th word in the vocabulary is a vector that is all-zero except
for the $i$th dimension where it is defined to be 1. While being simple, one-hot
embedding is impractical for large $V$. Also, prior knowledge cannot be injected
in the vectors as they are all orthogonal to each other.
Alternatively, $N$ can be set to a value of choice and the $V \times N$ embedding
matrix can be generated randomly. While the embedding still does not carry
any prior information, this method is simple and $N$ can be chosen freely. Kocmi
and Bojar (2017) experiment with different random initializations and achieve
best results by sampling from $\mathcal{N}(0, 0.01)$ or $\mathcal{N}(0, 0.001)$.
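The two embedding choices described so far can be sketched as follows. This is a minimal illustration, not code from the thesis: the toy vocabulary and helper names are hypothetical, and the second parameter of the normal distribution is assumed to be a variance (the text does not say whether it is a variance or a standard deviation).

```python
import random


def one_hot(index, vocab_size):
    """One-hot vector: all-zero except for a 1 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def random_embedding_matrix(vocab_size, dim, variance=0.01, seed=0):
    """V x N matrix with entries drawn from a zero-mean normal distribution.

    Assumption: `variance` is interpreted as a variance, so the standard
    deviation passed to the sampler is its square root.
    """
    rng = random.Random(seed)
    std = variance ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(vocab_size)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


vocab = ["invoice", "date", "total", "vendor"]  # hypothetical toy vocabulary
V = len(vocab)

one_hot_vectors = [one_hot(i, V) for i in range(V)]  # here N equals V
random_matrix = random_embedding_matrix(V, dim=3)    # here N = 3 is chosen freely

# Any two distinct one-hot vectors are orthogonal, so they carry no similarity.
orthogonality = dot(one_hot_vectors[0], one_hot_vectors[1])
```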
Since Mikolov et al. (2013) introduced word2vec, efficient methods for
estimation of word representations in vector space, embedding with semantic meaning
has become very popular, see e.g. GloVe (Pennington et al. (2014)) and
FastText (Bojanowski et al. (2017)). Instead of pointing into random directions,
these methods compute embeddings which carry semantic meaning: Vectors of
similar² words have a high cosine similarity. Downstream models, i.e. the ones
operating on the embedding, achieve better scores and converge faster, thanks
to the prior information held in the embedding. Kim (2014) finds embedding
with semantics to be superior to random embedding. While $N = 400$ is a
common choice for word2vec and its variants, there is also research by Yin and
Shen (2018) investigating mathematically what the optimal dimensionality for
embedding spaces is.
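Cosine similarity, the measure referred to above, can be computed as in the following sketch. The three-dimensional vectors are made-up values chosen for illustration; they are not taken from a trained word2vec, GloVe, or FastText model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical embedding vectors (illustrative values only).
embedding = {
    "king": [0.9, 0.8, 0.1],
    "prince": [0.8, 0.9, 0.2],
    "goals": [-0.2, 0.1, 0.9],
}

sim_related = cosine_similarity(embedding["king"], embedding["prince"])
sim_unrelated = cosine_similarity(embedding["king"], embedding["goals"])
```

In a semantic embedding space one would expect the first value to be clearly larger than the second, which is the case for these hand-picked vectors.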
² Similarity refers to similar usage in the word’s language, e.g. “king” and “prince” are
similar. The words “goes” and “goals” are not similar in this sense, as they appear in different
contexts.

Embeddings computed with word2vec, GloVe, or FastText have an important
shortcoming: Single words can have different meanings which humans identify
based on the context. The named models, however, map a word to exactly one
vector (conceivable as a hash map). They are non-contextualized. For example,
it would be desirable to embed the word “May” differently in the following two
example sentences: (1) “May I visit you?”, (2) “In May 2020 you can!”. In
(1), “May” refers to the enquirer asking for permission. In stark contrast, (2)
uses “May” to refer to the month. With contextualized embedding methods, the
words would be embedded differently. A model would produce the embedding
vector for each occurrence of “May” dynamically based on the context.
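The difference can be made concrete with a toy sketch. The static lookup is a plain dictionary, as described above. The "contextual" function is only a stand-in that averages a word's vector with those of its direct neighbors; it is an assumption made for illustration and is not how ELMo or BERT compute their representations. All vectors are made-up values.

```python
# Hypothetical 2-dimensional static embedding table (illustrative values only).
static_embedding = {
    "may": [1.0, 0.0], "i": [0.0, 1.0], "visit": [0.5, 0.5], "you": [0.2, 0.8],
    "in": [0.9, 0.1], "2020": [1.0, 1.0], "can": [0.3, 0.3],
}


def embed_static(tokens):
    """Non-contextualized: every occurrence of a word maps to the same vector."""
    return [static_embedding[t] for t in tokens]


def embed_contextual_toy(tokens, window=1):
    """Toy contextualization: average each vector with its neighbors' vectors."""
    static = embed_static(tokens)
    result = []
    for i in range(len(tokens)):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighborhood = static[lo:hi]
        dims = len(static[i])
        result.append([sum(v[d] for v in neighborhood) / len(neighborhood)
                       for d in range(dims)])
    return result


sentence_1 = ["may", "i", "visit", "you"]
sentence_2 = ["in", "may", "2020", "you", "can"]

same_static = embed_static(sentence_1)[0] == embed_static(sentence_2)[1]
same_contextual = (embed_contextual_toy(sentence_1)[0]
                   == embed_contextual_toy(sentence_2)[1])
```

The static lookup yields identical vectors for both occurrences of "may", whereas the context-dependent variant yields different ones.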
Contextualized embedding was introduced by Peters et al. (2018) with ELMo,
where the embedding vector is a concatenation of the hidden state of a forward
and backward neural language model; more specifically two Long Short-Term
Memory (LSTM; Hochreiter and Schmidhuber (1997)) networks reading the
neighboring words (context) from left and right, respectively. Recent extensions
of ELMo are OpenAI’s GPT (Radford et al. (2018)) and its second version (Radford
et al. (2019)), Google’s BERT (Devlin et al. (2018)), which uses Transformers
(Vaswani et al. (2017)), the very recent BERT generalization XLNet (Yang
et al. (2019)), and RoBERTa (Liu et al. (2019b)). The enumerated research
has constantly pushed the state of the art on challenging 1D NLP benchmarks, e.g.
GLUE (Wang et al. (2018)). Also on comparably simple text classification tasks,
BERT-based models like XLNet outperform older methods such as the CNN approaches
of Zhang et al. (2015) (character-level) and Kim (2014) (word-level with word2vec)
by a considerable margin.
Both non-contextualized word embedding methods (e.g. word2vec, GloVe) and
contextualized ones (e.g. ELMo, BERT) have in common that they require a
large corpus in the target language to be pre-trained on. The training
objectives are similar. Word2vec-based methods try to predict a word based
on its context with a fully-connected neural network with a single hidden layer.
The context size is commonly below 20; the parameter count is $2 \times N \times V$.
Contextualized models like BERT in turn are trained on a language modeling
task and other auxiliary tasks. They predict a word based on a much larger
context (up to 512 word pieces). They have significantly more capacity; BERT-Large
for instance has 340M parameters.
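To give a feeling for the difference in capacity, the parameter count formula can be evaluated for concrete numbers. The vocabulary size below is a hypothetical value chosen for illustration; the embedding dimensionality of 400 and the BERT-Large parameter count are the figures named in the text.

```python
def word2vec_parameter_count(vocab_size, dim):
    """Input and output weight matrices, each with vocab_size x dim entries."""
    return 2 * dim * vocab_size


N = 400        # embedding dimensionality named in the text
V = 30_000     # hypothetical vocabulary size (assumption)

w2v_params = word2vec_parameter_count(V, N)
bert_large_params = 340_000_000  # as stated in the text

capacity_ratio = bert_large_params / w2v_params
print(f"word2vec: {w2v_params:,} parameters, ratio: {capacity_ratio:.1f}x")
```

Under these assumptions the word2vec model has 24 million parameters, roughly an order of magnitude fewer than BERT-Large.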
Pre-processing. Before embedding natural language it is common to apply
some pre-processing to it. The following list is composed from several chapters
of Manning et al. (2008) which describe pre-processing in the context of search
queries. Common types of pre-processing are (1) lowercasing, where all
characters are converted to lowercase, (2) removal of stop words, i.e. “extremely
common words which would appear to be of little value”, (3) removal of
diacritics, e.g. Señorita becomes Senorita, (4) stemming and lemmatization, which
“reduce inflectional forms and sometimes derivationally related forms of a word
to a common base form”, e.g. goes is converted to go.
The listed pre-processing methods are particularly important when dealing with
fixed-size embedding vocabularies on word level. They tend to lower the
number of out-of-vocabulary words. With word piece embedding, pre-processing is
less important.
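The first three pre-processing steps can be sketched with the Python standard library. The stop word list is a tiny hypothetical sample, whitespace splitting stands in for a real tokenizer, and stemming or lemmatization is deliberately left out because it requires language-specific rules.

```python
import unicodedata

# Hypothetical, deliberately tiny stop word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "is", "to"}


def strip_diacritics(text):
    """Decompose characters and drop the combining marks (e.g. the tilde)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Lowercase, remove diacritics, and drop stop words."""
    tokens = text.lower().split()
    tokens = [strip_diacritics(t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The Señorita goes to the hotel")
```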
Power of Embedding. Embedding has proven to be a very useful concept.
The best solutions researchers can find for challenging NLP benchmarks
rely heavily on the chosen type of embedding. That is mainly for two
reasons: First, with semantic embedding, information can be encoded in the text
representation. Conneau et al. (2017) have shown that word embedding spaces,
learned from text corpora of different languages, can even be aligned without
any supervision from parallel data, to allow for word translation. This work
underlines how powerful embedding spaces are. Second, the embedding
computation is unsupervised, i.e. it works with unlabeled corpora. These are available
in abundance on the internet, in sizes so large that even models like GPT-2
do not overfit on the training data of 40 GB of text despite having more than 1.5B
parameters (see Radford et al. (2019)).
2.2 2D Document Understanding
We define the term 2D document to be a document with textual information
which is partly contained in the spatial arrangement of the text: Words are
not positioned in a 1D sequence but in a 2-dimensional coordinate system.
Human readers are guided by font sizes, positioning of text boxes, and the like.
Document understanding, also referred to as document intelligence, deals with
the information extraction from such documents.
Examples of 2D documents are presentation slides, layout-rich websites, posters,
cover pages, flyers, and invoices. Concrete tasks could be key-value extraction
from invoices (Katti et al. (2018)), election poster classification, or vision-based
web page rank estimation (Denk and Güner (2019)). Many more are conceivable.
Previous Work. A large share of NLP research and benchmarks focuses
on serialized text, where spatial information does not exist. The transfer to 2D
imposes new requirements on the ML models that perform the understanding. At
the same time, the representation of a document is a choice left to the data
scientist. In the following we look at multiple 2D document understanding
methods with special focus on their embedding choices. The list is sorted in
temporal order (old to recent):
• Esser et al. (2012), Automatic Indexing: The authors work with a
  database of templates where a template defines the graphical structure of
  a certain type of document, including the positions of fields of interest.
  New documents are assigned to a template and (after OCR application)
  fields are extracted based on it. The method is rule-based and works well
  on documents that can be clustered into groups of homogeneous layout.
• Schuster et al. (2013), Intellix: Similar to Automatic Indexing, Intellix
  performs a document classification first and applies rule-based information
  extraction in a second step. They make the assumption that “a large
  number of interesting fields [...] are always placed at the same position
  within documents of the same template”. Depending on the domain this
  assumption does not hold unless the number of templates is very large.
• Palm et al. (2017), CloudScan: In contrast to Automatic Indexing and
  Intellix, CloudScan does not use templates as it is the first work in this
  enumeration that is using a neural network. The document is processed
  sequentially by an n-grammer. Each n-gram is classified by an LSTM
  which reads the invoice lines from left to right on word level. Words are
  embedded in a trainable, 500-dimensional embedding space. The embedding
  is initialized randomly.
• Katti et al. (2018), Chargrid: Chargrid preserves the 2D structure of the
  input document by converting it into a 3-axis tensor, where characters
  are embedded along the depth axis with 54-dimensional one-hot encoding.
  This tensor is then processed by a CNN predicting bounding boxes and
  segmentation masks. The application of a CNN is similar to Zhang et al.
  (2015), where the CNN is also operating on character level, so the lower-level
  convolutional layers are presumably learning to recognize words relevant
  for the task.
• Liu et al. (2019a), Graph Convolution (GC): The authors convert a
  2D document into a graph of text segments. Each node in the graph
  corresponds to one text segment and is attributed with the text as well
  as its 2D position. The graph is fully connected and the edges’ attributes
  include the visual distance of the two text segments. An LSTM is used to
  convert the text in each node into a feature vector. It works on word level
  with word2vec embedding. For the node classification they use a graph
  convolutional network.
• Zhao et al. (2019), CUTIE: The authors use an approach very similar to
  Chargrid/Wordgrid. Scans are processed with an OCR engine and the text
  is split into word pieces; the vocabulary size is 20,000. The word pieces
  are embedded into a 128-dimensional vector space, presumably randomly
  initialized and tuned during training. The word piece tensor is processed
  by a CNN.
The enumeration of 2D document understanding methods shows a tendency
from rule-based systems to neural ones (CNNs, LSTMs). The neural methods
embed on character (Chargrid), word (CloudScan and Graph Convolution), or
word piece level (CUTIE). Note, though, that only Chargrid, Graph Convolution,
and CUTIE make the 2D information accessible to the processing neural network.
It is hard to rank the listed work because it is commonly evaluated on pro-
prietary datasets and tasks. However, it is certain that rule-based systems
cannot deal with heterogeneous document layouts without an enormous rule
base. Consequently, Automatic Indexing and Intellix are not applicable in such
cases. The rule-based approach was replaced by the neural CloudScan method.
All following work is neural too. At the current stage of research there is a lack
of methods that can gain performance improvements from unlabeled data. The
general objective is the overall information extraction performance as today’s
state-of-the-art is below human level performance.
CUTIE by Zhao et al. (2019) uses the same concept as Chargrid except on
word piece level, which is in turn close to this work which experiments with
word and word piece level. There are, however, two fundamental differences
that distinguish Wordgrid from CUTIE: (1) our embedding space is not nec-
essarily randomly initialized, we introduce word2vec-2d, and (2) we propose a
contextualized Wordgrid.
Transfer from 1D to 2D. After analyzing the recent work on embedding
in 1D NLP (Section 2.1) and 2D document understanding (above) we draw the
lines between the two fields. Specifically, we hypothesize that the 2D domain
could benefit from the recent advancements made in the 1D domain which have
pushed the state-of-the-art there.
Table 1 compares the embedding methods used in 1D NLP and 2D document
understanding. The comparison shows that there is a lack of research on 2D
document understanding that uses semantically rich embedding to represent
the document. The existing methods (Chargrid and CUTIE) use either one-hot
or random embedding³. This not only requires more training data, but also has
the significant disadvantage that unlabeled 2D data is hard to use. Pre-training
on unlabeled data is, however, a huge part of the success of (non-)contextualized
methods in the 1D domain, e.g. word2vec or BERT. That is where we position
our research on word2vec-2d and Wordgrid/BERTgrid.
³ We acknowledge that the Graph Convolution work is using semantically rich
embedding; however, it processes the vectors with a 1D architecture, namely an LSTM.

Embedding        1D                          2D
One-hot          Character-level models      Chargrid
Random           Rarely used                 CUTIE, Wordgrid (ours)
Non-contextual.  word2vec, GloVe, FastText   Wordgrid w/ word2vec-2d (ours)
Contextualized   ELMo, GPT, BERT, XLNet      BERTgrid (ours)

Table 1: Comparison of embedding in classical 1D NLP and 2D document
understanding. One-hot embedding is only applicable on character level as
the vector size equals the number of distinct symbols and the latter is too
large on word level. Random embedding is commonly used with fine-tuning
so the embedding is eventually being learned. Non-contextualized embedding
used to be the de-facto standard in 1D NLP, where each symbol (word or word
piece) is mapped to a semantically meaningful vector representation. The recent
invention of contextualized embedding was followed by its broad adoption. Our
novel methods cast embedding methods from 1D to 2D by filling three cells in
the right-most column.

In terms of embedding levels CUTIE is the only work which uses word piece
embedding, the level used by the best models in NLP. Chargrid relies on its
CNN to learn the meaning of words and the methods CloudScan and Graph
Convolution cannot deal with out-of-vocabulary words.
To the best of our knowledge the only research casting from 1D NLP to 2D is
the Image Transformer work by Parmar et al. (2018). In contrast to our work,
they operate on images rather than on 2D documents/text. They mitigate the
problem of attending over too many spatial locations in an image by restricting
the attention area of a Transformer block to a local neighborhood. Nevertheless,
thanks to the Transformer block stacking, the receptive field grows faster than
the one of CNNs would in a comparable setup.
By bringing more meaningful word-piece-level embedding to the 2D document
understanding domain, we expect to see significant performance improvements
analogous to the 1D successes of ELMo and BERT. Furthermore, we make
unlabeled data more usable. In particular in business applications, unlabeled
documents are often readily and plentifully available, while manual annotation
is rather expensive.
3 Prerequisites
This section summarizes selected previous research that is needed to understand
the novel methods suggested in this thesis. It assumes the reader is familiar with
the fundamentals of ML, in particular neural networks, convolutional layers,
embedding, backpropagation, and gradient descent. Goodfellow et al. (2016) is
a comprehensive reference.
The content of this section is taken mainly from five papers, namely the Char-
grid paper by Katti et al. (2018), the original word2vec paper by Mikolov et al.
(2013) and an explanation of word2vec’s parameter updates (Rong (2014)), and
the work by Devlin et al. (2018) on BERT, which is grounded on the Trans-
former (Vaswani et al. (2017)). Credit for the content of this section goes to the
respective authors of the named papers. Readers may either read the original
publications or the summary provided in the following.
3.1 Chargrid: Towards Understanding 2D Documents
Chargrid is a 2D document understanding pipeline proposed by Katti et al.
(2018). Instead of serializing a document into a 1D text sequence, the proposed
method, named Chargrid, preserves the spatial structure of the document by
representing it as a sparse 2D grid of characters. In this section we repeat the
most important points of the method.
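Let $\mathcal{C}$ denote the set of all characters and let a document be
$$D := \{(c_i, x_i, y_i, w_i, h_i) \mid i \in \{1, \dots, n\}\}, \tag{1}$$
consisting of $n$ characters $c_i \in \mathcal{C}$ positioned at
$x_i, y_i \in \mathbb{N}$ with width $w_i \in \mathbb{N}$ and height
$h_i \in \mathbb{N}$. The documents can be retrieved from scans with an OCR engine
or from PDFs directly. Different documents may contain different numbers of
characters.
The Chargrid tensor $\mathbf{C} \in \{0,1\}^{w \times h \times d}$ is constructed
from a document of size $w \times h$ according to
$$\mathbf{C}_{x,y,:} := \begin{cases} e_{\mathrm{cg}}(c_i) & \text{if } x_i \le x \le x_i + w_i \,\land\, y_i \le y \le y_i + h_i \\ \mathbf{0}_d & \text{otherwise,} \end{cases} \tag{2}$$
with the character one-hot embedding function
$e_{\mathrm{cg}} : \mathcal{C} \to \{0,1\}^d$. In cases where several character
bounding boxes enclose the coordinate $x, y$, the one with the bounding box
center closest to $x, y$ is chosen.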
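The construction rule above can be sketched in C. This is a minimal illustration and not the implementation used in the thesis: the names `char_box` and `chargrid_fill` are hypothetical, the grid is assumed to be stored row-major, and instead of a full one-hot vector each cell stores the class index of its character (0 for background), from which the one-hot vector follows directly.

```c
/* One OCR character: class index (1..d; 0 is reserved for background)
 * and its bounding box (top-left corner, width, height) in grid cells.
 * All names in this sketch are hypothetical. */
struct char_box { int cls; int x, y, w, h; };

/* Fills `grid` (row-major, width * height ints) with the class index of the
 * character whose box encloses each cell. If several boxes enclose a cell,
 * the one with the closest box center wins; cells without a box stay 0. */
void chargrid_fill(const struct char_box *chars, int n,
                   int *grid, int width, int height)
{
    for (int gy = 0; gy < height; gy++) {
        for (int gx = 0; gx < width; gx++) {
            int best_cls = 0;    /* background unless a box encloses the cell */
            long best_d2 = -1;   /* -1 means no enclosing box found yet */
            for (int i = 0; i < n; i++) {
                const struct char_box *c = &chars[i];
                if (gx < c->x || gx > c->x + c->w ||
                    gy < c->y || gy > c->y + c->h)
                    continue;    /* cell lies outside this box */
                /* Squared distance to the box center, with all coordinates
                 * doubled so that the computation stays in integers. */
                long dx = 2L * gx - (2L * c->x + c->w);
                long dy = 2L * gy - (2L * c->y + c->h);
                long d2 = dx * dx + dy * dy;
                if (best_d2 < 0 || d2 < best_d2) {
                    best_d2 = d2;
                    best_cls = c->cls;
                }
            }
            grid[gy * width + gx] = best_cls;
        }
    }
}
```

The triple loop is deliberately naive; iterating over the characters and painting only their boxes would be faster, but the tie-breaking rule is easier to read in this form.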
[Figure 1: diagram of the chargrid-net. Input panels: "Raw data" and "Chargrid". Block labels: 3x3 Conv, 3x3 Conv with dilation (d = 2, 4, 8), 1x1 Conv; channel labels include 4C and 8C. Output branches: "Decoder: Semantic Segmentation" and "Decoder: Bounding Box Regression".]
Figure 1: Network architecture for document understanding, the chargrid-net.
Each convolutional block in the network is represented as a box. The height
of a box is a proxy for feature map resolution while the width is a proxy for
the number of output channels. $C$ corresponds to the number of base channels,
which in turn corresponds to the number of output channels in the first encoder
block. $d$ denotes dilation rate. Figure source: Katti et al. (2018)
Intuitively, $\mathbf{C}$ can be understood as a representation of $D$ which
preserves the spatial positions of characters and uses the depth dimension to
encode which character there is at the given position. At places where there is
no character in the document, the Chargrid tensor contains zeros (referred to as
background). Raw data and Chargrid are compared on the left-hand side of
Figure 1. The different embedding vectors are encoded with colors. Note that
lines, colors, and any information other than the characters are not contained
in the Chargrid.
The Chargrid representation $\mathbf{C}$ is the input to a fully convolutional
neural network which performs semantic segmentation and bounding box regression.
The former is implemented as a probability distribution over all target classes
for each pixel $x, y$ in the input. The bounding box regression is needed to
distinguish a plurality of instances of the same class and to group entity
predictions e.g. into the same invoice line item. The network architecture is
depicted in Figure 1.
The network is trained on a loss composed of three equally weighted terms
$$l_{\mathrm{total}} := l_{\mathrm{seg}} + l_{\mathrm{boxmask}} + l_{\mathrm{boxcoord}}, \tag{3}$$
described in greater detail in the following.
First, for each pixel the model outputs a probability distribution over all classes,
denoted by $\mathbf{P} \in \mathbb{R}^{w \times h \times l}$ where $l$ is the number
of labels. The ground truth $\hat{\mathbf{Y}}$ has the same shape as $\mathbf{P}$
but is a one-hot encoding of the correct class.
$$l_{\mathrm{seg}} := -\sum_{x=1}^{w} \sum_{y=1}^{h} \sum_{c=1}^{l} \hat{Y}_{x,y,c} \log P_{x,y,c} \tag{4}$$
is the cross entropy loss for segmentation (e.g. Ronneberger et al. (2015); Long
et al. (2014)).
Second, for each pixel the model outputs a binary probability distribution
indicating whether it assumes a bounding box to be present in the ground truth.
Let the model’s prediction be $\mathbf{P}^{\mathrm{box}}$ and the binary ground
truth label be $\hat{\mathbf{Y}}^{\mathrm{box}}$.
$$l_{\mathrm{boxmask}} := -\sum_{x=1}^{w} \sum_{y=1}^{h} \left[ \hat{Y}^{\mathrm{box}}_{x,y} \log P^{\mathrm{box}}_{x,y} + \left(1 - \hat{Y}^{\mathrm{box}}_{x,y}\right) \log\left(1 - P^{\mathrm{box}}_{x,y}\right) \right] \tag{5}$$
is the binary cross entropy loss for box masks.
Third, for each pixel the model outputs a single bounding box, represented by
four coordinates. Recall the Huber loss
$$L_\delta(y, \hat{y}) := \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| < \delta \\ \delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right) & \text{otherwise.} \end{cases} \tag{6}$$
Let $\mathbf{B} \in \mathbb{R}^{w \times h \times 4}$ be the bounding box
regression output of the model where the last dimension holds x-coordinate,
y-coordinate, width, and height. Further, let $\hat{\mathbf{B}}$ be the
corresponding ground truth which is only defined at places where there is a
bounding box present in the image, i.e. where $\hat{Y}^{\mathrm{box}}_{x,y} = 1$.
$$l_{\mathrm{boxcoord}} := \sum_{x=1}^{w} \sum_{y=1}^{h} \left[\hat{Y}^{\mathrm{box}}_{x,y} = 1\right] \sum_{i=1}^{4} L_\delta\left(B_{x,y,i}, \hat{B}_{x,y,i}\right) \tag{7}$$
is the Huber loss for box coordinate regression, see Ren et al. (2015), where $[\cdot]$
is the indicator function which is 1 if the enclosed boolean expression is true, 0
otherwise. In the equation it ensures that only the bounding box predictions made
at places where there is a bounding box in the ground truth are used for
training. In the actual implementation we predict the offset relative to a
so-called anchor box, which eases the regression task.
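The masked Huber loss for box coordinates described above can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions, not the thesis code: the function names are hypothetical, the threshold `delta` is a free parameter whose value the text does not specify, and the absolute value is computed by hand so that no math library is needed.

```c
/* Huber loss for one scalar prediction y and target y_hat:
 * quadratic for small errors, linear for errors of at least delta.
 * Function names in this sketch are hypothetical. */
double huber(double y, double y_hat, double delta)
{
    double a = y - y_hat;
    if (a < 0)
        a = -a;                        /* |y - y_hat| */
    if (a < delta)
        return 0.5 * a * a;
    return delta * (a - 0.5 * delta);
}

/* Box coordinate loss of a single pixel: the sum of the Huber losses over
 * the four box values (x, y, width, height). Pixels without a ground truth
 * box (has_box == 0) contribute nothing, mirroring the indicator function. */
double box_coord_loss_pixel(const double pred[4], const double truth[4],
                            int has_box, double delta)
{
    if (!has_box)
        return 0.0;
    double sum = 0.0;
    for (int i = 0; i < 4; i++)
        sum += huber(pred[i], truth[i], delta);
    return sum;
}
```

Summing `box_coord_loss_pixel` over all pixels gives the total box coordinate term; the linear branch keeps a single badly placed box from dominating the gradient.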
3.2 Word2vec: Estimation of Word Representations in Vector Space
In this section we recapitulate the word2vec models by Mikolov et al. (2013)
and focus on the aspects relevant for our work. The mathematical formulations
are taken from Rong (2014).
Word2vec comprises two models for computing continuous vector representations of
words from large, unlabeled data sets. In other words, given some corpus of
text, the algorithms compute a vector representation for each of the words of
a vocabulary. The vocabulary is commonly chosen to contain the top-$V$ most
frequently occurring words in the corpus.
The vector representations are helpful for downstream tasks because they carry
semantic meaning after training. For instance, vectors of synonyms are expected
to point in a very similar direction in the vector space, i.e. have a high cosine
similarity. Therefore, a model processing them is likely to interpret them in
a similar fashion. Alternatives such as random embedding or one-hot encoding
of words do not offer these advantages.
The two word2vec models are called continuous bag-of-words (CBOW) and skip-
gram. They are conceptually similar. We focus on CBOW in the following as it
is faster than skip-gram and was the method used in our experiments.
The objective of CBOW is to predict a word based on its context. Given for
example the (fairly small) corpus “the car is driving down the street”,
the model would be trained to predict the word “driving” based on “car is
down the”. It would also try to predict “down” based on “is driving the
street”, and so on. In the example the model has access to the two adjacent
words on both sides. These words are referred to as context and two is the
window size.
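For simplicity we select only one word from the context, i.e. the model tries to
predict the center word based on some word in the surrounding context. Let
$\mathbf{W}, \mathbf{W}' \in \mathbb{R}^{V \times N}$ be two weight matrices,
where $V$ is the vocabulary size, $N$ is the embedding dimensionality, and the
$i$th row of $\mathbf{W}$ is the vector representation $\mathbf{v}_{w_i}$ for
the $i$th word $w_i$ in the vocabulary:
$$\mathbf{v}_{w_i} := \mathbf{W}_{i,:}. \tag{8}$$
Likewise $\mathbf{v}'_{w_i}$ is the vector representation for $w_i$ in $\mathbf{W}'$.
For a given context word $w_c$ the algorithm computes a score
$$u_j := \mathbf{v}'_{w_j} \cdot \mathbf{v}_{w_c} \tag{9}$$
for each word in the vocabulary. The scores are normalized with softmax to
retrieve a probability distribution
$$\Pr(w_j \mid w_c) = \frac{e^{u_j}}{\sum_{i=1}^{V} e^{u_i}} = \frac{\exp\left(\mathbf{v}'_{w_j} \cdot \mathbf{v}_{w_c}\right)}{\sum_{i=1}^{V} \exp\left(\mathbf{v}'_{w_i} \cdot \mathbf{v}_{w_c}\right)} =: y_j. \tag{10}$$
[Figure 2: network diagram with the three columns "Input layer", "Hidden layer", and "Output layer".]
Figure 2: Word2vec CBOW model with only one word in the context. Figure
source: Rong (2014)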
Note that both $\mathbf{W}$ and $\mathbf{W}'$ contain a vector representation
for each word, so $\mathbf{v}_{w_i}$ and $\mathbf{v}'_{w_i}$ are two different
vectors belonging to the same vocabulary word.
Verbally explained, Equation 10 computes a probability distribution over all
vocabulary words based on a single given context word $w_c$. It does so by
computing the dot product of the vector representations of context and vocabulary
word, and normalizing it to retrieve a probability distribution. A given
combination of context and vocabulary word will have a high value for $y_j$ if the
two vectors $\mathbf{v}'_{w_j}$ and $\mathbf{v}_{w_c}$ point in a similar
direction. The vector magnitude matters as well.
The mathematical formulations from above can be interpreted as a neural
network, see Figure 2. The input is a one-hot encoded version of the context word,
the hidden layer is its vector representation $\mathbf{v}_{w_c}$ and the output
layer yields the probabilities for all vocabulary words. The hidden layer’s
activation function is linear and the output layer uses a softmax.
In order to learn meaningful values for the parameters
$\theta = (\mathbf{W}, \mathbf{W}')$, the neural network is trained with
stochastic gradient descent (SGD). The training objective (for a single training
sample) is to maximize Equation 10 for the actual output word $w_O$:
$$\max_\theta \Pr(w_O \mid w_c) = \max_\theta \log \Pr(w_O \mid w_c). \tag{11}$$
The maximum is over all possible parameterizations for $\theta$. The logarithm is
taken to break the fraction (Equation 10) up into a subtraction, convert the
multiplication of the multi-word context case into a sum, and improve numerical
stability. The resulting loss function is
$$E := -\log \Pr(w_O \mid w_c), \tag{12}$$
from which the parameter updates can be derived with backpropagation, see
Rong (2014).
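As a short worked step, the gradient of the loss with respect to the scores can be written out; it is the starting point of the backpropagation derivation in Rong (2014). The index $j^*$ (position of $w_O$ in the vocabulary) and the one-hot target $t_j$ are notation introduced here for illustration only.

```latex
% Sketch: gradient of the single-context loss with respect to the scores u_j.
% j^* and t_j are notation introduced for this example.
\begin{align*}
E &= -\log \Pr(w_O \mid w_c)
   = -u_{j^*} + \log \sum_{i=1}^{V} \exp(u_i), \\
\frac{\partial E}{\partial u_j}
  &= \frac{\exp(u_j)}{\sum_{i=1}^{V} \exp(u_i)} - t_j
   = y_j - t_j,
\qquad
t_j := \begin{cases} 1 & \text{if } j = j^* \\ 0 & \text{otherwise.} \end{cases}
\end{align*}
```

The gradient is the difference between predicted and target distribution, so a score is pushed up for the actual output word and pushed down for every other word in proportion to its predicted probability.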
In the case of a multi-word context, the equations remain the same except the
model seeks to maximize the probability
$$\Pr(w_O \mid w_{-s}, w_{-s+1}, \dots, w_{-1}, w_{+1}, \dots, w_{s-1}, w_s), \tag{13}$$
where $s$ is the window size, $w_i$ is the $i$th word in the right context of a
particular occurrence of $w_O$ in the corpus, and $w_{-i}$ is the $i$th word in
the left context of $w_O$. Given the corpus example from above (“the car is
driving down the street”) and the output word “driving”, the model would maximize
$$\Pr(\text{driving} \mid \text{car}, \text{is}, \text{down}, \text{the}), \tag{14}$$
with $s = 2$. Note that this is just a single example from the corpus. Analogously,
the model maximizes the probability of all other words in the sequence given
their contexts: Pr(down | is, driving, the, street), Pr(the | ...), Pr(is | ...), etc.
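The 1D notion of context can be sketched as an index computation over the corpus. The function name `context_1d` is hypothetical, and the sketch assumes a non-empty corpus addressed by word positions; near the corpus boundaries the context simply becomes smaller.

```c
#include <stddef.h>

/* Writes the corpus positions that form the context of the word at position
 * `center` for window size `s` into `out` (capacity of at least 2 * s) and
 * returns how many positions were written.
 * Assumes corpus_len > 0 and center < corpus_len. */
size_t context_1d(size_t corpus_len, size_t center, size_t s, size_t *out)
{
    size_t n = 0;
    /* Clamp the window to the corpus boundaries. */
    size_t lo = center >= s ? center - s : 0;
    size_t hi = center + s < corpus_len ? center + s : corpus_len - 1;
    for (size_t i = lo; i <= hi; i++)
        if (i != center)               /* the center word is not its own context */
            out[n++] = i;
    return n;
}
```

For the seven-word example corpus with the center at position 3 (“driving”) and `s = 2`, the function yields the positions of “car”, “is”, “down”, and “the”.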
3.3 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[Figure 3: grid diagram with a row of word pieces $w_1, w_2, \dots, w_n$, a row of embeddings $e_1, e_2, \dots, e_n$, and $l$ rows of Transformer blocks $T_{1,1}, \dots, T_{1,n}$ up to $T_{l,1}, \dots, T_{l,n}$.]
Figure 3: The BERT model architecture: A sequence of word pieces $w_1, \dots, w_n$
is converted into embedding vectors $e_1, \dots, e_n$ which are combined with the
corresponding positional encodings $p_1, \dots, p_n$. The resulting inputs
traverse $l$ layers of Transformer blocks $T$. The arrows interconnecting the
blocks indicate the information flow of BERT’s bidirectional attention.
When the Transformer model was published by Vaswani et al. (2017) it was
revolutionary: It dropped recurrent and convolutional architectures to instead
rely solely on attention mechanisms in the sequence-to-sequence (seq2seq)
domain. In seq2seq the input to a model is a variably-sized sequence of input
symbols (e.g. words) which is mapped to an also variably-sized sequence of
output symbols. Seq2seq models are being applied to many NLP tasks, for instance
translation. Let $n$ denote the length of the input and $m$ the length of the output.
Vaswani et al. (2017) use an embedding layer which maps a 1D sequence of $n$
input symbols to a sequence of vectors. Subsequently, the individual symbols in
the sequence are enriched with a position-dependent positional encoding. The
positional encoding is simply added to the symbol embedding. The resulting
vector sequence serves as the input to the model. Positional encodings allow
the model to attend to words relative⁴ to themselves. The details of attention
are not relevant at this point, but the special property of the positional encoding
is. In short: The vector sequence fed into the model is computed as the sum of
a positional encoding and a symbol embedding (e.g. word embedding).
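For concreteness, the input sum can be written out with the sinusoidal positional encoding of Vaswani et al. (2017). The symbols $\mathbf{x}_t$ (input vector at position $t$) and $k$ (index over pairs of dimensions) are introduced here for illustration; $\mathbf{e}_t$ and $\mathbf{p}_t$ follow the naming of Figure 3.

```latex
% x_t: model input at position t; e_t: symbol embedding; p_t: positional encoding.
% k indexes pairs of dimensions, 0 <= k < d_model / 2.
\begin{align*}
\mathbf{x}_t &= \mathbf{e}_t + \mathbf{p}_t, \\
p_{t,2k}   &= \sin\!\left(t / 10000^{2k/d_{\mathrm{model}}}\right), \\
p_{t,2k+1} &= \cos\!\left(t / 10000^{2k/d_{\mathrm{model}}}\right).
\end{align*}
```

Because each pair of dimensions is a sine and cosine of the same frequency, the encoding of position $t + \Delta$ is a fixed linear function of the encoding of position $t$, which is the property that makes relative attention possible.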
The Transformer architecture itself consists of Transformer blocks. Each block
(in the Transformer encoder) is a function
$$b_l : \mathbb{R}^{n \times d_{\mathrm{model}}} \to \mathbb{R}^{n \times d_{\mathrm{model}}} \tag{15}$$
which converts an $n \times d_{\mathrm{model}}$ matrix to another matrix of the
same shape. The Transformer paper uses $d_{\mathrm{model}} \in \{512, 1024\}$.
Let $\mathbf{T}_{l,:}$ denote the output of the $l$th Transformer block.
While we refer the interested reader to Vaswani et al. (2017) to read about the
block internals, the key points we want to make here are that the Transformer
processes sequences and encodes them with an embedding that is composed of
position and symbol, and that the Transformer consists of stacked Transformer
blocks. At the time it was published, it achieved state-of-the-art performance
on translation tasks.
⁴ This post explains mathematically why the relative attention works with the chosen
positional encoding, which is based on trigonometric functions:
linear-relationships- in-the- transformers- positional-encoding/
Bidirectional Encoder Representations from Transformers (BERT) by Devlin
et al. (2018) builds on top of the Transformer. Instead of using an
encoder-decoder architecture it consists of $l \times n$ Transformer blocks
converting a sequence of $n$ input symbols into a sequence of $n$ output vectors.
$l$ denotes the number of stacked blocks, also called layers⁵. Figure 3
illustrates the BERT architecture’s grid-like structure.
BERT is pre-trained in an unsupervised fashion on two tasks simultaneously.
The training data is a corpus of natural language, e.g. retrieved from
Wikipedia. The two tasks are:
1. Task #1: Masked Language Model (MLM): The MLM objective asks
   the model to predict the words in a sentence that were masked out. Given
   for instance the sentence “mens sana in corpore sano”⁶, a random word
   would be replaced with a mask token: “mens sana [MASK] corpore sano”.
   The stack of Transformer blocks processes the word sequence where “in”
   is masked out and outputs a probability distribution over all vocabulary
   words. The loss is formulated such that the model learns to infer
   potentially fitting words given the rest of the sequence that was not masked.
   It boils down to the same negative log-likelihood loss used by word2vec
   (Equation 12).
   The actual MLM formulation is slightly more complicated (multiple words
   are being masked out simultaneously, sometimes they are replaced with
   random words, sometimes just kept as-is). In essence the MLM task trains
   the model to infer missing words based on their context. Solving this
   task might seem easy; it does, however, require some understanding of
   language. In fact, language models were shown to be very performant on a
   variety of different tasks when pre-trained long enough on the MLM task,
   see Radford et al. (2019), “Language Models are Unsupervised Multitask
   Learners”.
2. Task #2: Next Sentence Prediction: Given two sentences the model
   is asked to classify whether they appeared in the corpus next to each
   other or not. For example the first sentence might be “the man went to
   the store” and the second sentence could be “he bought a liter of milk”,
   in which case the ground truth would be “is next” because the second
   sentence did indeed follow the first sentence in the corpus. If the
   second sentence was “cats land on their feet” the model should output
   “not next”.
⁵ Technically, the term layer is ambiguous here: Each block itself is made up of several fully
connected layers. Still, the BERT terminology refers to that group of layers (a block) as a
layer.
⁶ Latin phrase meaning a healthy mind in a healthy body.
BERT uses word pieces and embeds them with randomly initialized vectors.
Positional encodings are used as in the Transformer; however, they are
learned rather than initialized from periodic functions.
Trained on the two mentioned tasks, BERT models learn something about the
language used in the domain of the corpus. They learn to model language. In
order to predict missing words, information about the context is passed between
the Transformer blocks (see arrows in Figure 3). The internal representations
$\mathbf{T}$ contain this contextualized information. It is therefore common to use these
contextualized word vectors to embed language for other models. Note that, as
opposed to word2vec and similar methods, there is no more 1:1 correspondence
between entries of the vocabulary and vectors. A symbol’s vector representation
is computed dynamically based on the context.
When published in October 2018, BERT achieved state-of-the-art performance
on a range of 1D NLP benchmarks, including the GLUE dataset (Wang et al.
(2018)), the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al. (2018)),
and Situations With Adversarial Generations (SWAG; Zellers et al. (2018)).
Since then it has been the basis of subsequent work such as TransformerXL (Dai
et al. (2019)), XLNet (Yang et al. (2019)), and most recently RoBERTa (Liu
et al. (2019b)).
4 Novel Methods
In this section we present the research contributions made as part of this Bach-
elor’s thesis.
The first is an adaptation of word2vec that makes it applicable to 2D documents,
see Section 4.1. We redefine the word2vec notion of context, so an output word
is being predicted based on surrounding words in a 2D sense, as opposed to
words to its left and right within a 1D sequence.
The second is Wordgrid (formally introduced in Section 4.2.1), a Chargrid-based
method for representing 2D documents on word level with 2D context preser-
vation. For Wordgrid there are several possible choices for non-contextualized
symbol embedding, which we name in Section 4.2.2. One of them is the
aforementioned word2vec-2d. We bring contextualized embedding from the 1D
domain to the 2D document domain in the form of a contextualized Wordgrid, see
Section 4.2.3. The contextualization allows for document representations more
powerful than the ones possible with the non-contextualized Chargrid. In Sec-
tion 4.2.4, we suggest ways of combining Chargrid and Wordgrid and a measure
for the complementarity.
4.1 Word2vec-2d: Word2vec for 2D Corpora
Word2vec by Mikolov et al. (2013) is a method for estimating word
representations in vector space. A summary of the aspects relevant for
understanding our extension can be found in Section 3.2. Word2vec operates on
text corpora that are sequential in nature. That means a corpus is an array of
words, where each word has a 1D position. The documents processed by
Chargrid/Wordgrid, however, are 2-dimensional, which motivates extending word2vec
to work with 2D data as well. We dub the new method word2vec-2d.
Motivation. To clarify what the limitation of word2vec in its vanilla form
is, suppose one wanted to train vector representations on a corpus of patent
cover pages (examples of 2D documents). In a first step, one would construct
a vocabulary by retrieving the top $V$ most frequently occurring words from
the patent corpus (words like “patent”, “number”, “date”, or “abstract” would
likely be among them). After constructing the vocabulary, one would continue
by feeding the documents into the word2vec algorithm. However, word2vec requires
the data to be a long sequence of words. The patent cover pages would need
to be serialized, e.g. by reading them line-by-line into a string. Figure 4 shows
how words can be taken out of context when serializing a 2D document: Labels
that are above the word they refer to are in a different line and will end up in
an entirely different position in the 1D sequence.
[Figure 4: a patent cover page as 2D layout (left) with the text blocks “United States”, “Patent No.:”, “US 4,241,342 B4”, “Related Patents”, “Date of Patent:”, “June 17, 1997”, and “This is a short abstract describing the patent.”, next to its line-by-line serialization (right): “United States Patent No.: Patent US 4,241,342 B4 Related Patents Date of Patent: 4,223,563 June 17, 1997 5,123,543 Abstract This is a short abstract describing the patent.”]
Figure 4: Line-by-line serialization of a 2D document. It is apparent that due
to the 2D arrangements in the document (left) some words in the serialized,
1D representation (right) are taken out of context. For instance, “Date
of Patent:” is pulled away from the “June 17, 1997” to which it belongs.
While the word2vec algorithm can without doubt operate on serialized versions
of 2D documents, it is presumably less performant at capturing the seman-
tics. Motivated by this assumption we introduce word2vec-2d, which operates
directly on the 2D document, so the serialization step becomes unnecessary.
Word2vec-2d. Let $\mathcal{V} = \{w_1, \dots, w_V\}$ denote the vocabulary
constructed from the corpus. The corpus $C = \{D^{(i)}\}_{i=1}^{N}$ consists of
$N$ documents, each of which is a set of words together with their spatial
positions
$$D^{(i)} := \left\{ \left( w^{(j)}, \underbrace{x^{(j)}_{\min}, y^{(j)}_{\min}, x^{(j)}_{\max}, y^{(j)}_{\max}}_{\text{bounding box } b^{(j)}} \right) \;\middle|\; j \in \left\{1, \dots, N^{(i)}\right\} \right\}, \tag{16}$$
where $N^{(i)}$ denotes the number of words in the $i$th document, $w^{(j)}$ is
the $j$th word in the $i$th document, and the remaining four scalars in the
tuple are the word’s bounding box $b^{(j)}$.
We define a radial context surrounding a given word $w^{(j)}$ as a function $c_r$,
which maps from a word of a given document to a set of words from the same
document. The output set is what we define to be the context of the argument.
The function is
$$c_r\left(w^{(j)}\right) := \left\{ w^{(j')} \in D^{(i)} \;\middle|\; d\left(b^{(j)}, b^{(j')}\right) \le r,\; j' \ne j \right\} \tag{17}$$
with $w^{(j)} \in D^{(i)}$, $d$ being a distance measure between two bounding
boxes, and $r$ being the radius, i.e. our variant of the word2vec window size
for 2D. $w \in D$ is short for “there is a 5-tuple in $D$ where the first
element, i.e. the word, is $w$”.
[Figure 5: the patent cover page from Figure 4 shown twice, as 2D layout (left) and as line-by-line serialized text (right), with the center word “June” and its respective context words highlighted.]
Figure 5: Comparison of different notions of context in word2vec-2d (left) vs.
word2vec (right) on the same document. In both cases $w^{(j)}$ = “June” (red and
bold). In word2vec-2d the context words are the ones which lie partly within a
circle with flat sides around $w^{(j)}$. In word2vec, the context of $w^{(j)}$
consists of the $s$ words to its left and right. The context is blue and italic.
Note that in the example, the word2vec-2d context of “June” includes the word
“Date”, which is desirable. Normal word2vec (with line-by-line serialization) is
not able to capture this.
The intuition behind Equation 17 is that for a given center word $w^{(j)}$ it
returns a set containing all words that lie within a certain distance $r$ around
the center word. The returned context words belong to the same document as the
center word. Figure 5 visualizes how the new context definition can potentially
capture semantics of 2D documents better than the normal word2vec methods.
The notion of distance is generic at this point. Possible implementations of $d$
are the distance of two bounding box centers or the minimum distance between
the bounding boxes’ edges. We use the latter; Listing 1 is a C implementation.
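    /* Helper definitions assumed by the listing (not shown in the original). */
    #define min(a, b) ((a) < (b) ? (a) : (b))
    #define max(a, b) ((a) > (b) ? (a) : (b))
    struct bbox { int xmin, ymin, xmax, ymax; };

    int bbox_dist(struct bbox a, struct bbox b) {
        struct bbox r;  /* smallest box enclosing both a and b */
        int inner_width, inner_height;

        r.xmin = min(a.xmin, b.xmin);
        r.ymin = min(a.ymin, b.ymin);
        r.xmax = max(a.xmax, b.xmax);
        r.ymax = max(a.ymax, b.ymax);

        /* Gap between the boxes along each axis; negative if they overlap. */
        inner_width = (r.xmax - r.xmin) - (a.xmax - a.xmin) - (b.xmax - b.xmin);
        inner_height = (r.ymax - r.ymin) - (a.ymax - a.ymin) - (b.ymax - b.ymin);

        /* Clamp negative gaps (overlap along an axis) to zero. */
        inner_height = max(0, inner_height);
        inner_width = max(0, inner_width);
        return inner_width + inner_height;
    }

Listing 1: The function bbox_dist computes the minimum L1 distance
between two bounding boxes a and b. If the bounding boxes overlap, the
minimum distance is defined to be zero (the two max(0, ...) clamps).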
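Building on the bounding box distance, the radial context function $c_r$ can be sketched as a linear scan over the words of one document. This is an illustrative sketch, not the word2vec-2d source: the `word` struct and the function `radial_context` are hypothetical, and `bbox_dist` is repeated with small helper functions so that the block is self-contained.

```c
#include <stddef.h>

/* All type and function names in this sketch are hypothetical. */
struct bbox { int xmin, ymin, xmax, ymax; };
struct word { const char *text; struct bbox box; };

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Minimum L1 distance between two boxes, zero along any axis where they
 * overlap (same computation as Listing 1). */
static int bbox_dist(struct bbox a, struct bbox b)
{
    int outer_w = imax(a.xmax, b.xmax) - imin(a.xmin, b.xmin);
    int outer_h = imax(a.ymax, b.ymax) - imin(a.ymin, b.ymin);
    int gap_x = outer_w - (a.xmax - a.xmin) - (b.xmax - b.xmin);
    int gap_y = outer_h - (a.ymax - a.ymin) - (b.ymax - b.ymin);
    return imax(0, gap_x) + imax(0, gap_y);
}

/* Radial context of doc[center]: writes the indices of all other words of
 * the same document whose box is at most r away into `out` (capacity of at
 * least n - 1) and returns the context size, which varies per word. */
size_t radial_context(const struct word *doc, size_t n, size_t center,
                      int r, size_t *out)
{
    size_t count = 0;
    for (size_t j = 0; j < n; j++) {
        if (j == center)
            continue;                  /* a word is not part of its own context */
        if (bbox_dist(doc[center].box, doc[j].box) <= r)
            out[count++] = j;
    }
    return count;
}
```

The scan costs time linear in the document length per center word; for long documents a spatial index over the boxes would avoid comparing every pair, at the price of more code.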
The adjusted training objective $E_{2D}$ of word2vec-2d uses the context
definition from Equation 17 in the loss function of the original word2vec
(Equation 12). With $w_O$ being a center word, the loss for one sample is
$$E_{2D} := -\log \Pr(w_O \mid c_r(w_O)). \tag{18}$$
Both word2vec’s CBOW and skip-gram model can be used with the word2vec-2d
notion of context.
Word2vec-2d can be summarized as follows: It is a modification of the notion of
context so a word is predicted based on⁷ neighboring words in a 2-dimensional
sense. Hence, word2vec-2d is more applicable to documents with rich 2D
structure. The parameter update rules are the word2vec ones, see Rong (2014).
4.2 Document Understanding with Wordgrid
Wordgrid is our adaption of Chargrid. We use a different representation method
for the input document: Instead of embedding the individual characters in a grid
arrangement, Wordgrid embeds on word level. The embedding is not one-hot,
but dense, and can potentially be contextualized. The latter means a word’s
embedding in the grid may also depend on its context (e.g. neighboring words).
We also introduce a method where both Chargrid and Wordgrid are combined.
While the changes suggested with Wordgrid do not require any modification of
the network architecture (depicted in Figure 1), a combination of both Chargrid
and Wordgrid does.
Wordgrid differs from Chargrid in the following aspects:
- Representation level (Section 4.2.1): Wordgrid embeds on word level
  rather than character level. When referring to Wordgrid embedding words,
  we also include word pieces.
- Embedding type (Section 4.2.2): Wordgrid works with dense embedding,
  e.g. randomly initialized or word2vec/word2vec-2d pre-trained.
- Contextualization (Section 4.2.3): The contextualized Wordgrid, dubbed
  BERTgrid, can embed the same word with different vectors in different
  locations, depending on the surrounding context.
- Combination of representation levels (Section 4.2.4): Wordgrid can
  be used as a hybrid, where character- and word-level information is combined.
⁷ This applies for CBOW; with skip-gram the context is predicted based on the center word.
(a) Invoice (b) Chargrid (c) Wordgrid
Figure 6: Visualization of Chargrid (b) and Wordgrid (c) computed for a sample
invoice (a). Each symbol (character or word) has a unique color representing its
embedding vector. Note that two identical characters/words will have the same
color at any location in the document. It is visually apparent that Chargrid
embeds on a more fine-grained level than Wordgrid.
In the following, let
$$\mathbf{C} \in \{0,1\}^{w \times h \times d_c} \tag{19}$$
denote the Chargrid tensor for some document, introduced in Section 3.1. $w$
and $h$ are the document representation’s width and height, respectively. The
embedding depth of Chargrid is denoted by $d_c$.
4.2.1 2D Document Representation with Wordgrid
Wordgrid represents a 2D document as a tensor
$\mathbf{W} \in \mathbb{R}^{w \times h \times d}$ where the vector at
$\mathbf{W}_{x,y,:}$ is an embedding of the word at the spatial location $(x, y)$
in the original document. $w$ and $h$ denote width and height, $d$ is the tensor
depth, i.e. the embedding dimensionality. Figure 6 visualizes the difference
between Chargrid and Wordgrid. With Wordgrid the area where a word lies is filled
with the same embedding vector; Chargrid does the same on a more fine-grained
level, namely on character level.
Formally, let $\mathcal{W} = \{w_1, \dots, w_V\}$ be the set of all words, where $V$ is the
vocabulary size. Analogous to Section 4.1, we define a document to consist of words
and their bounding boxes
$$D^{(i)} = \Big\{ \big( w^{(j)}, \underbrace{x^{(j)}_{\min}, y^{(j)}_{\min}, x^{(j)}_{\max}, y^{(j)}_{\max}}_{\text{bounding box } b^{(j)}} \big) \;\Big|\; j \in \{1, \dots, N^{(i)}\} \Big\} \quad (20)$$
where $N^{(i)}$ denotes the number of words in the $i$th document, $w^{(j)} \in \mathcal{W}$ is the
$j$th word in the $i$th document, and the remaining four scalars in the tuple are
the word's bounding box $b^{(j)}$. $w \in D$ is short for "there is a 5-tuple in $D$ where
the first element, i.e. the word, is $w$".
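Using an embedding function $e_{\text{fixed}} : \mathcal{W} \to \mathbb{R}^d$ we construct Wordgrid for a
document $D^{(i)}$ as
$$W_{x,y,:} = \begin{cases} e_{\text{fixed}}\big(w^{(j)}\big) & \text{if } x^{(j)}_{\min} \le x \le x^{(j)}_{\max} \,\land\, y^{(j)}_{\min} \le y \le y^{(j)}_{\max} \\ 0_d & \text{otherwise.} \end{cases} \quad (21)$$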
Verbally explained, the fixed word embedding for $w^{(j)}$ is chosen at $(x, y)$ if the
word's bounding box encloses $(x, y)$. Otherwise the tensor contains zeros
(indicating "background"). The formulation assumes that no two bounding boxes
overlap.
Our formulation admits embedding fine-tuning: Gradients of the downstream
model's loss with respect to $W$ can be backpropagated into the embedding
function. There they can be used to update parameters, e.g. a word embedding.
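The construction of the Wordgrid tensor from words and bounding boxes can be sketched in a few lines of plain Python. This is a minimal sketch, assuming integer pixel coordinates and inclusive bounding box bounds; the names `build_wordgrid`, `toy_embedding`, and `toy_document` are hypothetical and not taken from the Chargrid code base, and a plain dictionary stands in for the fixed embedding function.

```python
from typing import Dict, List, Tuple

# (word, x_min, y_min, x_max, y_max), mirroring the 5-tuples of a document
Word = Tuple[str, int, int, int, int]


def build_wordgrid(document: List[Word],
                   embedding: Dict[str, List[float]],
                   width: int, height: int,
                   depth: int) -> List[List[List[float]]]:
    """Return a width x height x depth grid; background cells stay zero."""
    grid = [[[0.0] * depth for _ in range(height)] for _ in range(width)]
    for word, x_min, y_min, x_max, y_max in document:
        vector = embedding[word]  # assumption: every word is in the dictionary
        # Fill the word's bounding box (clipped to the grid) with its vector.
        for x in range(max(x_min, 0), min(x_max, width - 1) + 1):
            for y in range(max(y_min, 0), min(y_max, height - 1) + 1):
                grid[x][y] = list(vector)  # copy so cells do not share state
    return grid


toy_embedding = {"invoice": [0.1, 0.2], "total": [0.3, 0.4]}  # hypothetical vectors
toy_document = [("invoice", 0, 0, 2, 0), ("total", 1, 2, 3, 2)]
toy_grid = build_wordgrid(toy_document, toy_embedding, width=4, height=3, depth=2)
```

Cells covered by no bounding box keep the zero vector, which corresponds to the background case of the definition.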
4.2.2 Non-contextualized Embedding
The embedding function uses a dictionary $D \in \mathbb{R}^{(V+1) \times d}$ holding the word
embedding vectors. We define the first row $D_{0,:}$ to hold the placeholder vector
that is used to embed out-of-vocabulary words. Parameterized by the dictionary,
the fixed embedding function is
$$e_{\text{fixed}}(w_j; D) = \begin{cases} D_{j,:} & \text{if } j \le V \\ D_{0,:} & \text{otherwise.} \end{cases} \quad (22)$$
$D$ can be initialized in different ways. In all listed initialization schemes the
embedding dimensionality $d$ can be chosen freely.
Random Initialization. The simplest is a random initialization where for
all $i, j$
$$D_{i,j} \sim \mathcal{N}\left(0, \sigma^2\right), \quad (23)$$
e.g. with $\sigma^2 = 0.001$, following Kocmi and Bojar (2017). The random
initialization does not carry any semantic information, which is why it is commonly
used in combination with embedding training so the embedding space is
incrementally adjusted using the gradient of the loss with respect to $D$. Wordgrid
in combination with this initialization mode is identical to the input document
representation proposed by Zhao et al. (2019).
(a) OCR (b) Sorted by ymin (c) Sorted by xmin
Figure 7: Comparison of three invoice serialization methods. The colors indicate
the ordering of the words: blue is early in the sequence, yellow is closer to the
end. The raw order as output by the OCR engine (a) differs slightly from (b)
in the two column layout area at the top of the invoice.
Initialization from Word2vec. Here the embedding dictionary is initialized
from vectors computed by word2vec (see Section 3.2) or a method of the same
kind. The word2vec algorithm itself requires a corpus of data to be trained on,
where available options can be split into two categories: (1) data from other
domains and (2) the 2D documents which are input to the model. The former
is relatively simple to use as there are many pre-trained word2vec embeddings
available, which were computed e.g. on Wikipedia text. For (2) the word2vec
algorithm is not directly applicable since it needs a sequential text corpus. 2D
documents, however, are 2-dimensional in nature and must therefore be serialized.
The best method of serializing a 2D document depends on the domain.
Line-by-line serialization is often applicable. In Figure 7 we compare three different
serialization methods on examples from the invoice domain. After serialization,
word2vec can be used to train an embedding to initialize $D$ from.
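Two of the serialization orders compared in Figure 7 can be sketched as simple sorts over the bounding boxes. This is a minimal sketch, not the thesis implementation: the function names are hypothetical, documents are assumed to be lists of `(word, x_min, y_min, x_max, y_max)` tuples, and breaking ties by the other coordinate is an assumption.

```python
from typing import List, Tuple

Word = Tuple[str, int, int, int, int]  # (word, x_min, y_min, x_max, y_max)


def serialize_by_y_min(document: List[Word]) -> List[str]:
    """Order words top to bottom; ties are broken left to right (assumption)."""
    return [t[0] for t in sorted(document, key=lambda t: (t[2], t[1]))]


def serialize_by_x_min(document: List[Word]) -> List[str]:
    """Order words left to right; ties are broken top to bottom (assumption)."""
    return [t[0] for t in sorted(document, key=lambda t: (t[1], t[2]))]


toy_document = [("Total", 10, 20, 30, 25),
                ("Invoice", 0, 0, 30, 5),
                ("Date", 40, 0, 60, 5)]
by_rows = serialize_by_y_min(toy_document)      # reading order, line by line
by_columns = serialize_by_x_min(toy_document)   # column-wise order
```

Sorting by `y_min` approximates line-by-line reading order only when words on the same line share the same top coordinate; real OCR output may need a tolerance.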
Despite capturing some semantic meaning, the word2vec embedding is seemingly
incongruous with the 2D domain. With option (1), the vectors' semantics do not
match the target domain unless 1D corpora exist which contain domain-specific
language that is close to the one used in the 2D documents. The serialization
in option (2), on the other hand, removes some semantics from structured documents.
Consequently, the word2vec model is trained on a new domain.
Word2vec computes vectors for words in the corpus. To initialize $D_{0,:}$ we choose
the element-wise average of the other vectors in the embedding space:
$$D_{0,i} := \frac{1}{V} \sum_{j=1}^{V} D_{j,i}. \quad (24)$$
It is important to normalize the embedding space because the magnitudes of
the vectors produced by the original word2vec implementation can be large,
e.g. $\|D_{i,:}\| > 10$ to provide an order of magnitude. This harms the proper
functioning of weight decay: because their input is disproportionately large,
the weights of the subsequent layer can easily be small. The vectors can either
be normalized jointly or individually; in the latter case information is only carried
in the vectors' directions. We choose to do the former and scale all vectors so that
the values in $D$ roughly follow a normal distribution with $\sigma^2 = 0.001$, as in the
random initialization.
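The placeholder initialization and the joint normalization can be sketched as follows. This is a minimal sketch under stated assumptions: `build_dictionary` and `entry_variance` are hypothetical names, the placeholder row is included when measuring the variance, and a single scale factor is applied to all entries so that their variance matches the target of 0.001.

```python
import math
from typing import List


def entry_variance(matrix: List[List[float]]) -> float:
    """Variance over all entries of a matrix."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build_dictionary(word_vectors: List[List[float]],
                     target_variance: float = 0.001) -> List[List[float]]:
    """Prepend the placeholder row and scale all rows jointly."""
    vocab_size, depth = len(word_vectors), len(word_vectors[0])
    # Row 0: element-wise average of all word vectors.
    placeholder = [sum(row[i] for row in word_vectors) / vocab_size
                   for i in range(depth)]
    dictionary = [placeholder] + [list(row) for row in word_vectors]
    # Joint normalization: one scale factor shared by every entry.
    scale = math.sqrt(target_variance / entry_variance(dictionary))
    return [[v * scale for v in row] for row in dictionary]


toy_vectors = [[10.0, -20.0], [30.0, 5.0], [-15.0, 12.0]]  # hypothetical word2vec output
toy_dictionary = build_dictionary(toy_vectors)
```

Because the same factor multiplies every row, relative vector lengths and directions are preserved, and the placeholder row remains the average of the other rows after scaling.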
Word2vec-2d. Word2vec-2d (see Section 4.1) allows for training of word vec-
tors on 2D documents. It is therefore intuitively a more applicable choice for
The embedding space initialization remains the same as with word2vec: The
vectors output by the word2vec-2d algorithm are used as rows in D. Vec-
tor normalization and placeholder vector initialization are also equivalent to
4.2.3 BERTgrid: Contextualized Embedding
For BERTgrid, a variant of Wordgrid, we redefine the embedding function $e$ to
have access to the context. The new embedding function
$$e\left(w^{(j)}, D^{(i)}\right) \quad (25)$$
differs from the fixed one (Equation 22) in that it has access to the entire
document $D^{(i)}$ that the given word $w^{(j)}$ belongs to. The document is the word's
context (or a superset of it, depending on the definition of context).
With $e_{\text{fixed}}$ there is a 1:1 correspondence between the vocabulary and the
embedding space. The contextualized embedding function, however, is much more
powerful, because it can choose a word's vector representation at a given
position depending on its context, e.g. adjacent words or the absolute position
within the invoice.
We reify the generic definition from above by providing a concrete description of
an implementation of $e$ using a BERT (Devlin et al. (2018)) model. It assumes
there is a dataset of unlabeled 2D documents available, as well as a 2D
document understanding target task. BERTgrid is constructed from contextualized
vectors, retrieved from a BERT model, giving BERTgrid its name. The steps
are written down in the enumeration below and are additionally visualized in
Figure 8.
Figure 8: Visualization of the BERTgrid creation pipeline. The thick boxes are
parameterized functions (here neural networks), ovals indicate pure functions.
The two datasets are on the left-hand side, one is unlabeled and the other one
is labeled.
1. A pre-trained BERT model is downloaded.8 The model might have been
pre-trained on an entirely different task, but some information transfer to
the target domain might be possible anyway.9 The downloaded model
comes with a BERT configuration and a vocabulary. For the official models
the vocabulary contains word pieces.
2. A dataset of 2D documents from the target domain is serialized, pre-
processed, and tokenized. The serialization is the same as the one needed
to train word2vec on 2D documents. Let serialize(D) denote a serialization
function which outputs a sequence of |D|words, given a document.
8 Pre-trained BERT models are available online, for instance in the official BERT
repository's pre-trained models section: research/bert#pre-trained-models
9 We make this assumption because we saw a faster convergence of BERT language models
on invoice data when they were initialized with weights pre-trained on Wikipedia data.
3. The downloaded BERT weights are pre-trained on the serialized dataset
of 2D documents until convergence. When speaking of pre-training we use
the BERT terminology where it refers to training the BERT model on the
loss described in the BERT paper, while fine-tuning is the propagation of
gradients into the BERT model from a target task. So at this stage we
pre-train a BERT model that is initialized from a checkpoint with pre-trained
weights.
4. The trained BERT model is run in inference mode on the labeled 2D documents of
the target task, for which they must be serialized first (like the unlabeled
documents above). Each serialized document is fed into the BERT model,
which computes internal representations that carry semantic meaning and
depend on the context. For each word the corresponding hidden layer
activations (e.g. from the last or second to last hidden layer) are being
extracted.
Let the BERT model fixed feature vector extraction function be denoted
$$\text{bert}_l : V^{|D|} \to \mathbb{R}^{|D| \times d_{\text{model}}}, \quad (26)$$
where $l$ indicates which layer to extract the features from and $d_{\text{model}}$
is the size of the hidden representation.
5. The contextualized embedding function yields the BERT activations of
the given word in the given invoice:
$$e\left(w^{(j)}, D^{(i)}\right) = \text{bert}_l\left(\text{serialize}\left(D^{(i)}\right)\right)_{j,:} \quad (27)$$
The embedding for the $j$th word in a document is computed by serializing
the document and feeding it into a BERT model. From the activation
matrix the row is chosen that corresponds to the word at the $j$th position
in the serialized sequence.
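The last step of the enumeration above can be sketched without an actual BERT model. In this minimal sketch `toy_bert` is a hypothetical stand-in for the feature extraction function: it returns one vector per token that depends on the neighboring tokens, which is the only property needed to illustrate contextualization. The serialization order (by `y_min`, then `x_min`) is likewise an assumption.

```python
from typing import Callable, List, Tuple

Word = Tuple[str, int, int, int, int]  # (word, x_min, y_min, x_max, y_max)


def serialize(document: List[Word]) -> List[str]:
    """Return the words in reading order (assumption: y_min, then x_min)."""
    return [t[0] for t in sorted(document, key=lambda t: (t[2], t[1]))]


def toy_bert(tokens: List[str]) -> List[List[float]]:
    """Stand-in for a BERT layer: each row depends on the token and its neighbors."""
    rows = []
    for j, token in enumerate(tokens):
        left = tokens[j - 1] if j > 0 else ""
        right = tokens[j + 1] if j + 1 < len(tokens) else ""
        rows.append([float(len(left)), float(len(token)), float(len(right))])
    return rows


def contextual_embedding(j: int, document: List[Word],
                         bert: Callable[[List[str]], List[List[float]]] = toy_bert
                         ) -> List[float]:
    """Pick the activation row of the j-th word in the serialized sequence."""
    activations = bert(serialize(document))
    return activations[j]


toy_document = [("invoice", 0, 0, 30, 5), ("date", 40, 0, 60, 5),
                ("due", 0, 10, 15, 15), ("date", 20, 10, 40, 15)]
first_date = contextual_embedding(1, toy_document)
second_date = contextual_embedding(3, toy_document)
```

The word "date" occurs twice but receives two different vectors, which a fixed dictionary lookup cannot produce.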
Analogous to the non-contextualized Wordgrid, the embedding model (BERT)
can be fine-tuned on the target task. In order to do so, the gradients must be
backpropagated into the BERT model, which requires running it online during training.
Note that this is computationally very costly as a BERT model itself can already
fill the RAM of a modern GPU at a relatively low batch size.
Table 2 compares some aspects of the non-contextualized Wordgrid with the
contextualized BERTgrid. We acknowledge the fact that the contextualized
embedding vectors are computed on a 1-dimensional representation of the input
document. The corresponding equivalent in the non-contextualized section is the
construction of a Wordgrid with vectors from word2vec, trained on serialized
documents. While by definition (Equation 25) the contextualized embedding
function has access to all the bounding boxes of the words in the document, the
flow described above does not make full use of it because of the application of
the serialize function.
Aspect | Non-contextualized | Contextualized
Section | 4.2.2 | 4.2.3
Embedding function | $e_{\text{fixed}}(w^{(j)}; D)$ | $e(w^{(j)}, D^{(i)})$
Serialized pre-training | word2vec | BERT
Pre-training on 2D data | word2vec-2d | -
Fine-tuning | $D$ update w/ gradients | Gradient propagation to BERT
Embedding computation | offline (fast) | online (slow)
Table 2: Comparison of Wordgrid with non-contextualized and contextualized
embedding.
4.2.4 Combining Chargrid and Wordgrid
Chargrid and Wordgrid/BERTgrid can be combined to provide the model with
information on both, character and word level. We suggest and discuss several
combination methods:
Concat: Concatenation is a straightforward method of combining $C$ and
$W$. The model is provided the tensor $X^{(\text{concat})} \in \mathbb{R}^{w \times h \times (d_c + d)}$ defined as
$$X^{(\text{concat})}_{x,y,:} := \langle C_{x,y,:}, W_{x,y,:} \rangle, \quad (28)$$
where $\langle \cdot \rangle$ denotes vector concatenation. For the network to use
information from both tensors, it is desirable to scale the embedding magnitudes
relative to the depth of the respective tensor so neither is predominant.
Add: If $d = d_c$, $C$ and $W$ can be added element-wise:
$$X^{(\text{add})}_{x,y,z} := C_{x,y,z} + W_{x,y,z}. \quad (29)$$
Pre-process: Both tensors can be processed separately by two functions
$f_{\theta_1}$, $g_{\theta_2}$ with trainable parameters. The resulting tensors can then be
concatenated or added (if the dimensionalities match). One configuration
we experimented with was to define $f$ and $g$ to be one or more 1×1
convolutions with the same number of output channels.
For the one-hot encoded Chargrid, applying a 1×1 convolution
corresponds to choosing a dense embedding for the characters. Wordgrid in
turn is being compressed into a lower-dimensional vector space, assuming
the 1×1 convolution has fewer output than input channels.
Parallel U-net: We duplicate the first $n$ blocks of the encoder U-net (the
first $n$ blue rectangles in the encoder in Figure 1) to process $C$ and $W$
separately. At a later stage in the U-net, namely after $n$ blocks, we merge
the outputs of both $n$th blocks by adding them together. For the skip
connections from encoder to decoder we add the hidden representations
of the two parallel branches.
Intuitively, this method is motivated by the different information levels
of Chargrid and Wordgrid. While the first convolutional layers of the
network are expected to learn to recognize words, when sliding spatially
across C, the Wordgrid tensor already contains word-level information.
Merging the representations at a later stage introduces a bias, making the
network process them independently at first.
Another way of looking at the tensors is their frequency: Wordgrid con-
tains larger areas with the same values than Chargrid, as can be seen very
well in Figure 6. The OctConv by Chen et al. (2019) looks at natural
images in the same way and splits the processing into two branches, one
for high- and one for low-frequency information.
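The Concat, Add, and Pre-process combinations listed above can be illustrated on tiny tensors in plain Python. This is a minimal sketch: the function names and the toy values are hypothetical, grids are nested lists indexed as `grid[x][y][channel]`, and the 1×1 convolution is written without a bias term.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as grid[x][y][channel]


def combine_concat(chargrid: Grid, wordgrid: Grid) -> Grid:
    """Concatenate the two vectors at every position (Concat)."""
    return [[c_cell + w_cell for c_cell, w_cell in zip(c_col, w_col)]
            for c_col, w_col in zip(chargrid, wordgrid)]


def combine_add(chargrid: Grid, wordgrid: Grid) -> Grid:
    """Add the two tensors element-wise (Add); requires equal depth."""
    if len(chargrid[0][0]) != len(wordgrid[0][0]):
        raise ValueError("Add requires both tensors to have the same depth")
    return [[[c + w for c, w in zip(c_cell, w_cell)]
             for c_cell, w_cell in zip(c_col, w_col)]
            for c_col, w_col in zip(chargrid, wordgrid)]


def conv1x1(grid: Grid, kernel: List[List[float]]) -> Grid:
    """Bias-free 1x1 convolution: multiply each cell by a d_in x d_out kernel."""
    d_out = len(kernel[0])
    return [[[sum(v * kernel[i][o] for i, v in enumerate(cell))
              for o in range(d_out)]
             for cell in column]
            for column in grid]


# Toy 1x2 document: one-hot Chargrid with depth 3, Wordgrid with depth 2.
chargrid = [[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]]
wordgrid = [[[0.5, -0.5], [0.25, 0.75]]]
kernel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical learned weights

concatenated = combine_concat(chargrid, wordgrid)
dense_chargrid = conv1x1(chargrid, kernel)  # one-hot input selects a kernel row
pre_processed = combine_add(dense_chargrid, wordgrid)
```

For a one-hot cell the 1×1 convolution returns exactly one row of the kernel, which is why it acts as a dense character embedding.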
Model Complementarity. Alongside the combination methods we suggest
a way of determining which information a combination method actually
uses. Is Chargrid the primary source of information for the model or is
Wordgrid being used much more? Our model complementarity method helps
answer such questions by visualizing the usage in a Venn diagram with three
sets: documents that Chargrid-based models perform well on ($C$), documents
that Wordgrid-based models perform well on ($W$), and documents that a
model using a combination of both performs well on ($B$). First off, "performing
well on" means the model using Chargrid/Wordgrid/both has a high extraction
accuracy on the fields of that particular document. Second, $C$ and $W$ are
typically not identical. While the intersection is not empty (some documents are
presumably easy and can be solved by both models) there is a decent number
of documents which only one of the Chargrid models and Wordgrid models
understands correctly. A requirement for any combination method is that it makes the
information contained in both Chargrid and Wordgrid accessible to the model, so
the resulting $B$ should overlap with both $C$ and $W$ as much as possible.
Let $a^{(M)}_d \in [0,1]$ be the accuracy of a model $M \in \{C, W, B\}$ on the $d$th
sample in the set of documents. To mitigate noise caused by the random weight
initialization and stochastic gradient computations, $a^{(M)}$ can be computed as
a performance average across multiple training steps and training runs (of the
same model).
Given two models $M_1$ and $M_2$ we define the fuzzy sets of correctly processed
documents as $M_1$ and $M_2$. Note that the documents belong partly to the sets
depending on the value of $a^{(M)}$. Based on fuzzy set theory (see e.g. Beg
and Ashraf (2009)) we can calculate
$$|M_1| = \sum_{d=1}^{n} a^{(M_1)}_d \quad (30)$$
$$|\overline{M_1}| = \sum_{d=1}^{n} \left(1 - a^{(M_1)}_d\right) = n - |M_1| \quad (31)$$
$$|M_1 \cap M_2| = \sum_{d=1}^{n} \min\left(a^{(M_1)}_d, a^{(M_2)}_d\right) \quad (32)$$
$$|M_1 \cup M_2| = \sum_{d=1}^{n} \max\left(a^{(M_1)}_d, a^{(M_2)}_d\right) \quad (33)$$
where $n$ is the number of documents.
Intuitively, $|M_1|$ is a measure for the overall performance of model $M_1$. $|M_1 \cap M_2|$
is a measure for the negative complementarity of $M_1$ and $M_2$. $|M_1 \setminus M_2|$
indicates how much $M_1$ improves over $M_2$; and $|M_1 \setminus M_2| + |M_2 \setminus M_1|$ is the measure
of model complementarity.
Figure 9: Sample Venn diagrams showing the model complementarity of $M_1$,
$M_2$, and $M_3$. Suppose $M_3$ is a combination of $M_1$ and $M_2$. In (a) $M_3$ is able to
capture aspects from both $M_1$ and $M_2$. It exploits the complementarity of $M_1$
and $M_2$ by performing well on most fields that either performs well on. In (b)
$M_3$ additionally performs well on new fields that neither $M_1$ nor $M_2$ was able
to achieve high accuracies on alone. (c) illustrates a skewed model where $M_3$ is
mainly good at fields on which $M_1$ is good as well.
An ideal model $M_3$ combines $M_1$ and $M_2$ and achieves performances close to
$|M_1 \cup M_2|$. $M_3$ could even exceed the union by combining the capabilities of
both models to achieve high accuracies on samples that neither could solve
alone. Figure 9 shows example Venn diagrams.
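The fuzzy set quantities behind the Venn diagrams can be computed directly from per-document accuracies. This is a minimal sketch with hypothetical function names and toy accuracies; the intersection and union use the standard minimum and maximum operators of fuzzy set theory, while the set difference is computed as $|M_1| - |M_1 \cap M_2|$, which is an assumption made here so that the regions of the diagram add up.

```python
from typing import List


def cardinality(a: List[float]) -> float:
    """Fuzzy cardinality: the sum of the membership values."""
    return sum(a)


def intersection(a1: List[float], a2: List[float]) -> float:
    """Cardinality of the fuzzy intersection (element-wise minimum)."""
    return sum(min(x, y) for x, y in zip(a1, a2))


def union(a1: List[float], a2: List[float]) -> float:
    """Cardinality of the fuzzy union (element-wise maximum)."""
    return sum(max(x, y) for x, y in zip(a1, a2))


def difference(a1: List[float], a2: List[float]) -> float:
    """Assumption: what the first model gets right beyond the overlap."""
    return cardinality(a1) - intersection(a1, a2)


def complementarity(a1: List[float], a2: List[float]) -> float:
    """Sum of both set differences."""
    return difference(a1, a2) + difference(a2, a1)


# Hypothetical per-document accuracies of two models on four documents.
acc_m1 = [1.0, 0.5, 0.0, 1.0]
acc_m2 = [1.0, 0.0, 0.5, 0.0]
overlap = intersection(acc_m1, acc_m2)
combined_target = union(acc_m1, acc_m2)
complement_score = complementarity(acc_m1, acc_m2)
```

With this definition of the difference, the overlap and the two differences sum to the cardinality of the union, the value an ideal combined model would be expected to approach.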
5 Application: Information Extraction from Invoices
In this section we briefly explain where SAP’s subsidiary Concur is processing
invoices (Section 5.1). Afterwards we describe the two datasets we are using
(Section 5.2) and the task associated with them, namely key-value extraction
from invoices. We then formulate in Section 5.3 what the evaluation measure
for that task is. Lastly, we elaborate on implementation details (Section 5.4).
That is mostly about how the existing Chargrid code base was extended as part
of this work. Focus is on the interesting/relevant implementation details, so the
descriptions are not exhaustively naming all the changes that were made.
5.1 Concur Travel and Expense Management
SAP’s subsidiary Concur10 is a travel and expense management service provider.
Travelers who have had expenses use the tool to claim them back so their
company can reimburse them. Concur is the provider of the proprietary invoice data
that we use. Concur users across the globe upload invoices which Concur stores
in its systems.
When creating a new expense in the Concur system, the user is asked to upload
the corresponding receipt(s); shown in the screenshot in Figure 10. In a manual
workflow, the information entered in the amount field would be checked against
the invoice line items and prices by a human. Likewise the date of the trans-
action would be expected to be around the same time the invoice was filed and
the line item description should match the expense type. This process requires
human workers and would be both inefficient and impractical because of the sheer
volume of uploaded documents.
In Concur’s application, SAP’s 2D document understanding system is employed
to extract the needed information from the invoice so the expense validation
can be automated and payments to the claimer can be made faster. Only in
case of mismatches are humans asked to manually perform the work. Concur
automatically processes more than 100k invoices per day.
Invoice processing is only one application of 2D document understanding where
information must be extracted from invoices. It has motivated the initial devel-
opment of Chargrid.
Figure 10: When creating a new expense, users can attach the corresponding
receipt(s) to it. The screenshot shows the “New Expense” user interface. Infor-
mation is manually entered on the left-hand side, the proof for the claim is on
the right.
(a) Invoice (b) Labels (c) Bounding boxes
Figure 11: Dataset ground truth sample. Each text field detected by the OCR
engine in the original invoice (a) is annotated with a label (b). Most text is
labeled as “other” (here indicated by the white color). Rows and columns are
additionally annotated with bounding boxes (c).
5.2 Invoice Datasets
We use two SAP proprietary datasets, in the following referred to as the “la-
beled dataset” and the “700k invoices dataset”. Both datasets were provided
by Concur and contain invoices.
700k Invoices Dataset. The dataset consists of approximately 700k scans
of invoices. An OCR engine was applied to each sample so the invoices are
available as images and 2D text data. Despite not being labeled the dataset has
value because of its size. It allows for the pre-training of language models and
word vector representations.
The invoices are from different vendors. The layouts and positions of fields vary
a lot across the dataset. Figure 12 illustrates the heterogeneity of the invoice
layouts. Some invoices span across multiple pages. Languages are mixed, the
majority, however, is English. The OCR data is about 17 GB in size.
Note that Chargrid by Katti et al. (2018) is not utilizing the 700k dataset at
all. One objective of this work was to use the information contained in it to
achieve better performance on the actual information extraction task.
Labeled Dataset. The dataset consists of 12k invoices split into 10k for train-
ing, 1k for validation, and 1k for testing. For each invoice the OCR data is
Like in the 700k invoice dataset, invoices are from a variety of different vendors
and the sets of vendors contained in training, validation, and testing samples
are disjoint. Most vendors occur only once or twice so the invoice layouts are
very heterogeneous. Languages are mixed, with the majority being English.
Figure 12 illustrates the heterogeneity of the invoice layouts.
Besides being smaller, the labeled dataset differs from the 700k invoices dataset
in that each sample comes with segmentation and bounding box labels. Fig-
ure 11 shows the ground truth for a sample invoice. The dataset was labeled
by human annotators who marked invoice fields and drew bounding boxes. La-
beling is rather costly and it would not be financially feasible to label the 700k
dataset as well. We distinguish two kinds of fields: (1) header fields and (2) line
item fields.
The former includes invoice number, invoice date, invoice amount, vendor name,
and vendor address. Each of these fields is annotated once per invoice.
The line item fields can exist multiple times per invoice. Line items are the
items that are commonly listed on an invoice; in the example in Figure 11 there
are four line items. For each line item the fields line item description, line
item quantity, and line item amount are annotated. Invoices do not necessarily
contain them all, in which case only the present ones are labeled.
(a) Invoice amount (b) Line-item quantity
Figure 12: Spatial distribution of two invoice fields over the invoice: Invoice
amount (a) and line item quantity (b). This depicts the variation in the invoice
layouts contained in our dataset. Figure source: Katti et al. (2018)
Invoice Corpora Specifics. Both invoice datasets share some characteristics
that distinguish them from other common NLP datasets. In the following we
list them and discuss the implications for models operating on the corpora.
The invoices are written in different languages, hence a vocabulary must contain words
from all of them. However, when following the standard approach of
constructing the vocabulary based on the word frequency, i.e. choosing the top $V$
most frequently occurring words to be a part of it, the imbalance of languages
can prevent important keywords of a rare language from making it into the
vocabulary. If for instance only 10% of the invoices are German, the word "Straße"
might not make it into the vocabulary, despite its importance for recognizing
and extracting the vendor address field. This could harm performance on the
German invoices significantly. Word piece embedding mitigates this problem
as it typically does not have out-of-vocabulary problems.
Without any pre-processing applied to it, the invoices of the 1k validation
dataset contain 56,998 distinct words (and a total of 217,489). The language of
the invoices is very different from other corpora such as Wikipedia. A GloVe
(Pennington et al. (2014)) embedding pre-trained on Wikipedia with a
vocabulary size of 400k can only embed 7k of the 57k distinct words of the invoices.
After transforming the words to lowercase and removing special characters the
number of distinct words drops to 40k, out of which 14k can be embedded with
the pre-trained GloVe embedding. The high percentage of out-of-vocabulary
words renders embeddings pre-trained on Wikipedia or similar corpora
inapplicable. The most frequently occurring words after lowercasing are
1 (#3084), invoice (#2367), to (#2367), of (#1814), date (#1678), total (#1540),
the (#1505), no (#1330), and (#1092), amount (#1005), 000 (#977), number (#966), st
(#961), de (#938), i (#913), a (#907), for (#884), due (#883), tax (#857), payment
(#789), po (#771), medical (#769), jude (#746), account (#741), 2 (#736),
where the numbers in brackets indicate the occurrence count out of 210k words.
The difference between the language used in invoices and normal, spoken
language poses a challenge to embedding words in invoices.
(a) Frequency vs. word piece width (b) Frequency vs. word piece height
Figure 13: Histograms of the word piece box sizes in the validation dataset after
tokenization with the BERT (Devlin et al. (2018)) tokenizer. The document size
is 256 × 336 (width × height). Width ($x_{\max} - x_{\min}$) and height ($y_{\max} - y_{\min}$) of
the bounding boxes are used as defined in Equation 20. In (b) it can be clearly
seen that few different font sizes exist.
In the 2D domain, properties of the dataset can be analyzed which do
not exist in 1D: Figure 13 shows the histograms of word piece bounding box
heights and widths of the labeled invoices validation dataset. The size of the
Wordgrid should be chosen as small as possible, i.e. such that the bounding boxes of the
symbols (words or word pieces) are small but do not yet overlap or fall into the
same pixel.
Figure 14 shows the distribution of invoice lengths in terms of word pieces. With
a maximum sequence length of 512, a BERT model can embed more than 76%
of the invoices entirely.
Figure 14: Histogram of the number of word pieces per invoice on the validation
set (x-axis: number of word pieces). Few outliers beyond 1k are sliced off.
5.3 Evaluation Measure
We did not modify the evaluation measure introduced in the Chargrid paper,
because it is equally applicable to Wordgrid. In the following we reproduce the content
of Section 4.3 from Katti et al. (2018) with minor additions.
For evaluating our model, we would like to measure how much work would be
saved by using the extraction system, compared to performing the field
extraction manually. To capture this, we use a measure similar to the word error rate
(Prabhavalkar et al. (2017)) used in speech recognition or translation tasks.
For a given field, we count the number of insertions, deletions, and modifications
of the predicted instances (pooled across the entire test set) to match the ground
truth instances. Evaluations are made on the string level. We compute this
measure as
$$1 - \frac{\#[\text{insertions}] + \#[\text{deletions}] + \#[\text{modifications}]}{N} \quad (35)$$
where $N$ is the total number of instances occurring in the ground truth of the
entire test set. This measure can be negative, meaning that it would be less
work to perform the extraction manually. The best value it can achieve is 1,
corresponding to flawless extraction.
In our present case, the error caused by the OCR engine does not affect this
measure, because the same errors are present in the prediction and in the ground
truth and are not considered a mismatch.
We extract a range of fields from an invoice and compute the measure from
Equation 35 for each. To condense the reporting even more, an "all fields"
metric is used which is defined to be the average extraction performance on all
fields across a single invoice.
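Given the edit counts for one field, the measure from Equation 35 is a one-line computation. This is a minimal sketch with a hypothetical function name; how insertions, deletions, and modifications are counted from predictions and ground truth is not shown and is assumed to be done beforehand.

```python
def extraction_score(insertions: int, deletions: int, modifications: int,
                     ground_truth_instances: int) -> float:
    """Return 1 minus the number of required edits per ground truth instance."""
    if ground_truth_instances <= 0:
        raise ValueError("the ground truth must contain at least one instance")
    edits = insertions + deletions + modifications
    return 1.0 - edits / ground_truth_instances


flawless = extraction_score(0, 0, 0, 10)      # no corrections needed
halfway = extraction_score(2, 1, 2, 10)       # five edits for ten instances
worse_than_manual = extraction_score(8, 4, 3, 10)  # more edits than instances
```

The third call yields a negative value, matching the remark that the measure drops below zero when fixing the predictions takes more work than extracting the field by hand.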
5.4 Implementation Details
Word2vec-2d. We forked Mikolov’s word2vec C-implementation11 and mod-
ified it according to the specification of word2vec-2d from Section 4.1. The
script takes a corpus and outputs word vectors trained on it.
Originally the input corpus was a simple text file where each document was a
single line and the words were space-separated. A line would look like this:
United States Patent No.: Patent [...]. In the word2vec-2d
implementation we represent each 2D document in an individual line as well;
however, a word is always followed by four space-separated integer values which
correspond to its bounding box. For example: United 20 5 50 15 States
55 5 100 15 Patent 200 5 230 15 [...]. The word2vec parser and
data structures were adjusted to account for the additional information encoded
in the input file.
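The line format described above can be parsed as follows. The actual parser is part of the C implementation; this Python sketch only illustrates the format, `parse_2d_line` is a hypothetical name, and the order of the four integers (`x_min`, `y_min`, `x_max`, `y_max`) is an assumption based on the tuple order used for documents.

```python
from typing import List, Tuple

Word = Tuple[str, int, int, int, int]  # (word, x_min, y_min, x_max, y_max)


def parse_2d_line(line: str) -> List[Word]:
    """Parse one document line: every word is followed by four integers."""
    tokens = line.split()
    if len(tokens) % 5 != 0:
        raise ValueError("expected groups of one word and four integers")
    document = []
    for i in range(0, len(tokens), 5):
        x_min, y_min, x_max, y_max = (int(t) for t in tokens[i + 1:i + 5])
        document.append((tokens[i], x_min, y_min, x_max, y_max))
    return document


example_line = "United 20 5 50 15 States 55 5 100 15 Patent 200 5 230 15"
parsed_document = parse_2d_line(example_line)
```

The sketch relies on words containing no whitespace, which holds for the space-separated format.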
While iterating over the corpus the context of a word is needed to compute
the gradient of the weight matrices. In Mikolov's implementation the context is
computed on the fly as the $s$ words to the left and to the right of the center word.
In our case the computation of a context is more complex: For each document
we compute a distance matrix between any two words. A word's context consists of all
words which have a distance below the defined radius. The distance function is
shown in Listing 1. For each word we pre-compute its context before starting the
actual training so it is quickly accessible in the main training loop. Memory-wise
this comes at a noticeable yet bearable cost.
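The context pre-computation can be sketched as follows. The distance function of Listing 1 is not reproduced here; as a stand-in this sketch assumes the Euclidean distance between bounding box centers. All names are hypothetical, and each context is returned sorted by ascending distance.

```python
import math
from typing import List, Tuple

Word = Tuple[str, int, int, int, int]  # (word, x_min, y_min, x_max, y_max)


def box_distance(a: Word, b: Word) -> float:
    """Assumed distance: Euclidean distance between the box centers."""
    ax, ay = (a[1] + a[3]) / 2, (a[2] + a[4]) / 2
    bx, by = (b[1] + b[3]) / 2, (b[2] + b[4]) / 2
    return math.hypot(ax - bx, ay - by)


def precompute_contexts(document: List[Word], radius: float) -> List[List[int]]:
    """For each word, list the indices of all other words within the radius."""
    contexts = []
    for i, center in enumerate(document):
        neighbors = [(box_distance(center, other), j)
                     for j, other in enumerate(document) if j != i]
        # Keep only close words, nearest first.
        contexts.append([j for dist, j in sorted(neighbors) if dist < radius])
    return contexts


toy_document = [("United", 20, 5, 50, 15),
                ("States", 55, 5, 100, 15),
                ("Patent", 200, 5, 230, 15)]
narrow_contexts = precompute_contexts(toy_document, radius=50)
wide_contexts = precompute_contexts(toy_document, radius=150)
```

Unlike the fixed 1D window, the number of context words varies per center word: with the narrow radius "Patent" has no context at all.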
Given a center word and its neighbors, e.g. Left3 Left2 Left1 Center
Right1 Right2 Right3, word2vec randomly resizes the window size $s$ to
any integral value $s'$ with $1 \le s' \le s$. The center word's vector representation
can therefore be updated with either Left1 Center Right1, Left2 Left1
Center Right1 Right2, or all words ($s' = s$). We port this method to 2D
as follows. A center word's context is sorted by distance in ascending order.
Let $s$ be the number of words within radius of the center word. We choose a
random number $1 \le s' \le s$ like Mikolov's code and interpret it as selecting the
$s'$ words that are closest to the center word for computing the gradient. That
way the measure of choosing closer words more often is ported from 1D to 2D.
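The 2D variant of the random window resizing amounts to drawing a prefix of the distance-sorted context. This is a minimal sketch with a hypothetical function name; it assumes the context is already sorted by ascending distance and uses Python's `random.Random` in place of the random number generator of the C code.

```python
import random
from typing import List


def sample_context(sorted_context: List[int], rng: random.Random) -> List[int]:
    """Keep the s' closest context words, with s' drawn uniformly from 1..s."""
    s = len(sorted_context)
    if s == 0:
        return []  # no words within the radius
    s_prime = rng.randint(1, s)  # inclusive on both ends
    return sorted_context[:s_prime]


rng = random.Random(0)
context_by_distance = [4, 7, 2, 9]  # hypothetical word indices, nearest first
draws = [sample_context(context_by_distance, rng) for _ in range(100)]
```

The nearest word is part of every draw while the farthest one is only included when $s' = s$, so closer words contribute to more gradient updates.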
Mikolov’s implementation is multi-threaded and optimized for efficiency. We
left the multi-threading aspects as-is, so our implementation is benefiting from
the performant implementation too. Likewise we did not need to modify the
gradient computation.
Assessing the quality of a computed word embedding is naturally hard. Mikolov
et al. (2013) use a word similarity task. Due to the nature of our invoice dataset
we cannot use an open benchmark (words would not be contained, semantics are
different) so we see no option other than evaluating the embeddings output by
word2vec-2d qualitatively. For that we use TensorFlow’s embedding projector12
and look at similar words and clusters in the embedding space manually to make
an assessment.
Chargrid Code Base. We build on top of the existing Chargrid code base
which is the basis of Katti et al. (2018). Unless indicated otherwise, the imple-
mentation details named there apply to us as well. During the research phase
of this thesis the code was modified and extended and the following paragraphs
explain a few selected, preliminary aspects of it.
Chargrid and Wordgrid are both implemented in Python using the TensorFlow
(TF) framework. Given the labeled invoice dataset (invoice images, extracted
OCR text, and labels) a model can be trained by following these steps: (1)
generate data, (2) compute class and bounding box weights, and (3) start the
training.
1. Generate data: The data generation script loads the OCR data and ap-
plies Chargrid-specific pre-processing to it, e.g. removal of unknown char-
acters, lower-casing, and such. It then generates a Chargrid representation
from the OCR data, i.e. a grayscale PNG file (similar to Figure 6(b))
with values ∈ {0,...,255}, where each value identifies the character at
the given position. It is a way of representing the Chargrid tensor
(Equation 19). The advantage of converting the OCR data into a PNG image
is greater performance during training as PNGs can be loaded without
much CPU time consumption. The storage overhead of using PNGs is
comparably small because the images contain many large areas of equal
color which are compressed efficiently by PNG's lossless (DEFLATE) compression.
2. Compute weights: The script feeds the generated data through a model
and determines the occurrence frequency of individual classes. It computes
a weight factor for each class used during training to counteract class
imbalance. Modification of this step is not needed in this work.
3. Training: The training script uses the tf.estimator API. It constructs
the model graph (which we modify) and trains the model weights
iteratively. At regular intervals, e.g. every 5k steps, it evaluates the model performance on
validation data. When training on a particular sample, the OCR image is
loaded and scaled down to the desired size on GPU. It is then fed into the
Chargrid network (depicted in Figure 1). The outputs combined with the
ground truth yield a loss which is backpropagated to update the model
weights.
The dataset is provided to the TF Estimator with the API. The
data cannot be held in RAM at once, so the initial TF dataset consists
of strings pointing to the files of the individual invoices. While training,
samples are being loaded on the fly; pre-fetching is being used.
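The idea of storing the Chargrid as an 8-bit index image and expanding it to a one-hot tensor only at training time can be sketched as follows. This is a minimal NumPy illustration, not the TensorFlow code used in this work; the character set, the helper names, and the convention that index 0 denotes background are assumptions made for the example.

```python
import numpy as np

# Hypothetical character set; index 0 is reserved for background (assumption).
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
CHAR_TO_INDEX = {c: i + 1 for i, c in enumerate(CHARSET)}


def make_index_image(height, width, chars_with_boxes):
    """Paint each character's index into its bounding box (uint8, like a grayscale PNG)."""
    image = np.zeros((height, width), dtype=np.uint8)
    for char, (x0, y0, x1, y1) in chars_with_boxes:
        image[y0:y1, x0:x1] = CHAR_TO_INDEX.get(char, 0)
    return image


def to_one_hot(index_image, num_classes):
    """Expand an index image of shape (h, w) into a one-hot tensor (h, w, num_classes)."""
    return np.eye(num_classes, dtype=np.float32)[index_image]


index_image = make_index_image(4, 6, [("a", (0, 0, 2, 2)), ("7", (3, 1, 5, 3))])
chargrid = to_one_hot(index_image, num_classes=len(CHARSET) + 1)
```

The index image needs one byte per cell, whereas the one-hot tensor needs one float per cell and class, which is why only the former is written to disk.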
Wordgrid Pipeline. We use the notation from Equation 8. Given an embedding
space D, we read the OCR data and apply pre-processing to the individual
words, see Listing 2. Instead of writing the Wordgrid tensor W from
Equation 21 into a file, we store the index i ∈ {1, ..., V} of the word w_i.
By using three channels of the PNG, V is limited to 2^24, which is more than
necessary. The fourth channel is still usable, for instance if parallel
storage of Wordgrid and Chargrid is needed.
import re


def preprocess(s: str) -> str:
    s = s.lower()  # lower case
    s = ''.join(e for e in s if e.isalnum())  # remove special chars
    s = re.sub(r"[0-9]+", "1", s)  # replace numbers, e.g. "123" -> "1"
    return s

Listing 2: Wordgrid pre-processing function
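Storing a word index in three 8-bit PNG channels amounts to splitting the integer into three bytes, which is where the limit of 2^24 distinct indices comes from. A minimal NumPy sketch of the packing and unpacking is given below; the function names and the channel order (most significant byte first) are assumptions for illustration, and writing the array to an actual PNG file is left out.

```python
import numpy as np


def pack_indices(indices):
    """Split word indices (< 2**24) into three uint8 channels, shape (..., 3)."""
    indices = np.asarray(indices, dtype=np.uint32)
    if indices.max() >= 2 ** 24:
        raise ValueError("index does not fit into three 8-bit channels")
    channels = [(indices >> shift) & 0xFF for shift in (16, 8, 0)]
    return np.stack(channels, axis=-1).astype(np.uint8)


def unpack_indices(rgb):
    """Recover the word indices from the three channels."""
    rgb = rgb.astype(np.uint32)
    return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]


grid = np.array([[0, 1, 255], [256, 39_999, 2 ** 24 - 1]])
packed = pack_indices(grid)
restored = unpack_indices(packed)
```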
In the training script we use the indices stored in the PNG file as well as D to
construct W before feeding it into the model. This step is implemented to run
on GPU. In the model code W is then fed into the model as a replacement of
the Chargrid tensor.
In order to not store D as a constant in the TensorFlow graph we use a
tf.train.Scaffold. This allows us to use embedding dictionaries exceeding
the maximum graph size of 2 GB. Also, the embedding loading is lazy, i.e. it
does not happen before the embedding is needed.
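Constructing W from the stored indices and the embedding dictionary D is a gather operation: every grid cell looks up the row of D that belongs to its word index. The sketch below shows this with NumPy on CPU; in TensorFlow the same lookup could be expressed with tf.gather. Reserving row 0 of D for cells without a word, the random values in D, and the grid size are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 40_000, 32                        # vocabulary size and embedding dimensionality
D = rng.normal(size=(V + 1, d)).astype(np.float32)
D[0] = 0.0                               # row 0: background, no word (assumption)

index_grid = np.zeros((336, 256), dtype=np.int64)   # h x w grid of word indices
index_grid[10:14, 20:60] = 1234          # one word covering a rectangular region

W = D[index_grid]                        # gather: shape (h, w, d)
```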
BERTgrid Pipeline. Given both the 700k and the labeled invoice dataset,
the procedure for training a BERT model is described in the following. It has
a structure similar to the steps enumerated in Section 4.2.3 but describes the
implementation details. We use several scripts from the BERT repository
(see footnote 13).
1. We serialize the 700k invoices so each of them is a 1D sequence of words.
The invoices are available as OCR files and we read them in the order
output by the OCR (as opposed to sorting by bounding box coordinates;
see Figure 7). The resulting sequences of words are written into a text file
line-by-line. Each word is pre-processed with the function from Listing 3.
2. We use the create pretraining script from the repository
to convert the text files into TF Records. The script expects a vocabulary
file for which we choose the one associated with the “BERT-Base, Uncased”
model (see footnote 14) in the repository. We do not use casing (i.e. all words are
converted to lowercase) as we do not expect the casing to carry important
semantic meaning in our task. The maximum sequence length is set to
512, justified by the histogram in Figure 14. We randomly mask word
pieces with a probability of 15% but at most 77 at a time.
3. Pre-training is started with the run script to which
the TF Records generated in the previous step are passed. We use the
BERT configuration file and pre-trained weights from “BERT-Base, Uncased”.
In summary, that is a 12-layer, 768-hidden representation size,
12 attention head, 110M parameter BERT model which was pre-trained
on Wikipedia. It converges on the 700k dataset after approximately 2M
steps with a learning rate of 1 × 10⁻⁴ for the first 1M and 4 × 10⁻⁵ for
the next 1M steps. We use a batch size of 10 on an Nvidia V-100 GPU,
fully utilizing its memory. The pre-training script stores checkpoints of
the trained model.
4. Next we serialize the labeled invoices in the same fashion as we serialized
the 700k invoices. We use the extract script to extract
fixed feature vectors for the serialized invoices. Again the maximum se-
quence length is set to 512 which leads to around 20% of the invoices being
embedded only partially. We experiment with exporting the last and the
second to last hidden layer. By default the official script exports the
vectors in JSON format. We found, however, that using pickle reduces
the size by a factor of two, so we modified the script slightly. The output is
one pickle file for each invoice which contains a sequence of contextualized
feature vectors. We decided against storing the plain BERTgrid tensors
because their size would be about 256 × 336 × 768 × 4 bytes (BERTgrid spatial
width × height × feature vector size × four bytes per float32) ≈ 250 MB
per invoice.
13 research/bert
14 Model and vocabulary download link:
5. Training a model requires the on-the-fly construction of the BERTgrid
tensor. We load the pickle files exported in the previous step dynamically
once an invoice is contained in the current batch. We then serialize
the OCR data, tokenize it, and match it with the vectors stored in the
pickle file. This algorithm is implemented on CPU and is therefore
relatively slow. To improve performance we use the pre-fetching feature of
the API to keep the CPU busy constructing BERTgrids for the
upcoming batches while the GPU trains on the current batch. To lower
the GPU memory requirements and to increase performance, we added an option
to use BERTgrid with 16 bit floating point precision (instead of 32 bit).
The model casts back to 32 bit precision after the first convolutional
layer, where the internal representation is already much smaller and the
higher precision is affordable.
import string


def preprocess_bert(s: str) -> str:
    s = ''.join(c for c in s if c in string.printable)  # drop non-printable chars
    s = s.lower()  # lower case
    s = s.replace(" ", "")  # remove spaces
    s = s.replace("|||", "")  # remove separator char sequence
    return s

Listing 3: BERTgrid pre-processing function
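Two of the numbers in the steps above can be checked with short arithmetic. The masking limit of 77 coincides with 15% of the maximum sequence length of 512, rounded up, and the size of a plain BERTgrid tensor follows directly from its shape. The variable names below are chosen for this example only.

```python
import math

max_seq_length = 512
masked_lm_prob = 0.15
# 15% of 512 word pieces, rounded up to the next integer.
max_masked = math.ceil(max_seq_length * masked_lm_prob)

width, height, depth = 256, 336, 768
bytes_float32 = width * height * depth * 4   # four bytes per float32
bytes_float16 = width * height * depth * 2   # two bytes per float16
mib_float32 = bytes_float32 / 2 ** 20
```

The float32 tensor occupies 264,241,152 bytes, i.e. 252 MiB, which matches the roughly 250 MB per invoice stated in step 4. The 16 bit option mentioned in step 5 halves that figure.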
The flow described above does not admit gradient backpropagation into the
BERT model because the extraction of contextualized feature vectors (BERT
inference) happens offline. Updating the BERT weights on the downstream task
(the labeled invoice dataset in our case) is called fine-tuning. We decided not
to fine-tune because it simplifies the pipeline, requires less GPU memory, and
fine-tuning a model as large as BERT on a dataset with merely 10k labeled
samples seems likely to result in overfitting.
At the time of submission of this thesis the BERTgrid pipeline had not been
productized yet. Hence, no inference mode was implemented. Next we describe how
BERTgrid could be used in an inference setting. It is worth mentioning that
the sheer size of a BERT model makes it hardly usable for inference in many
15 During the internship I had the chance to attend a talk on BERT by its inventor Jacob
Devlin at Google’s office in Berlin, during which he remarked that Google does not use BERT
for inference in production because it is too costly and slow. Instead they fit a smaller
student model to match the outputs of a BERT model and use it as a faster, more
lightweight drop-in replacement for the large model.
Given an invoice scan and trained weights of a BERT and Wordgrid model,
during inference an OCR engine would be applied first to retrieve the word
positions. This is done just as with Chargrid. For BERTgrid the OCR output
would then be serialized, pre-processed, and tokenized to be subsequently fed
into the BERT model. The extracted feature vectors can be held in RAM and
are passed on to the Wordgrid code which constructs BERTgrid therefrom and
forwards it to the neural network.
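The last step, constructing a BERTgrid from feature vectors held in RAM, can be sketched as painting each vector into the grid cells covered by the corresponding bounding box. This is a simplified NumPy illustration and not the implementation used in this work: it assumes that every word piece already has its own box in grid coordinates and that cells without text stay zero.

```python
import numpy as np


def build_bertgrid(height, width, boxes, vectors, dtype=np.float16):
    """Paint one contextualized vector per word piece into its bounding box.

    boxes:   list of (x0, y0, x1, y1) cell coordinates, one per word piece
    vectors: array of shape (num_pieces, depth), e.g. exported BERT hidden states
    Cells without text stay zero (assumption of this sketch).
    """
    vectors = np.asarray(vectors)
    grid = np.zeros((height, width, vectors.shape[1]), dtype=dtype)
    for (x0, y0, x1, y1), vec in zip(boxes, vectors):
        grid[y0:y1, x0:x1, :] = vec
    return grid


# Toy example with depth 4 instead of 768.
vectors = np.arange(8, dtype=np.float32).reshape(2, 4)
boxes = [(0, 0, 3, 1), (4, 2, 6, 3)]
bertgrid = build_bertgrid(height=4, width=8, boxes=boxes, vectors=vectors)
```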
It would be useful to monitor the average number of tokens comprising the
invoices. A shift of the values shown in the histogram in Figure 14 would
inevitably lead to a deterioration of extraction accuracy, because the BERT
model only processes sequences of maximum length 512. Similarly, it seems
reasonable to monitor and detect any shift in bounding box sizes (Figure 13).
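Such monitoring could be as simple as tracking the mean token count and the share of invoices that exceed the sequence limit. The following sketch assumes the token counts per invoice are already available as a list; the function name is hypothetical.

```python
def truncation_stats(token_counts, max_seq_length=512):
    """Return the mean token count and the share of documents longer than the limit."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    mean_tokens = sum(token_counts) / len(token_counts)
    share_truncated = sum(n > max_seq_length for n in token_counts) / len(token_counts)
    return mean_tokens, share_truncated


mean_tokens, share_truncated = truncation_stats([120, 300, 512, 700, 900])
```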
6 Results
In the following we present the results of our new methods. Since the dataset
is proprietary, the baseline we compare to is our previously best model, namely
Chargrid (see also the “Experiments and Results” section in Katti et al. (2018)).
The discussion of the results is separated from the pure reporting and can be
found in the subsequent section. Unless indicated otherwise, the performance
is measured according to the evaluation measure introduced in Section 5.3.
In short, the Chargrid model, our baseline to compare to, achieves
an averaged extraction accuracy of 0.6176 ± 0.0072. The best Wordgrid model
combines Chargrid with word2vec-2d word vectors and has an accuracy of
0.6267 ± 0.0033, improving upon the baseline by 0.0091. The contextualized
version, BERTgrid, is even better with an absolute performance of
0.6548 ± 0.0058, which is 0.0373 above the baseline.
This section is split into two parts: First, we list the results of our experiments
with Wordgrid with different embedding choices, BERTgrid, and combination
methods. Second, we show the results of our model complementarity analysis
which we ran for Chargrid, Wordgrid, and combination methods.
All trainings were run on an Nvidia DGX-1 machine equipped with eight Nvidia
Tesla V-100 GPUs with 16 GB RAM each. The machine has two 20-core Intel
Xeon CPUs and 500 GB of RAM. The hyperparameter configuration is the same
as in Katti et al. (2018) unless pointed out otherwise. We train with a batch
size of six.
We use the U-net architecture shown in Figure 1 for all experiments except the
main results (Table 3). That is because we have a (not yet published) CU-net
architecture (see Tang et al. (2018)) in use internally, which performs better.
Results on the U-net are transferable to the new architecture, except that all
scores are typically slightly higher (by around 0.02 for Chargrid models).
We compute the reported scores as follows: If we ran multiple trainings, we
average the reported performances at each step. Otherwise we use the single
training run that is available as-is. We then slide a window of size 20k steps
over the (averaged) curve and compute the performance for each position of the
window as the mean of the part of the curve that is within the window. Since
we run a validation every 5k steps, this corresponds to averaging five
consecutively reported results. Out of these performances we use the maximum
as the model’s performance measure. The dataset used is the validation set of
the labeled invoice dataset. We typically train for either 800k or 400k steps
depending on how fast the model converges.
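The score computation described above can be sketched in a few lines of NumPy. The function name and the toy validation curves are made up for illustration; the window of five evaluations corresponds to 20k steps at one validation every 5k steps.

```python
import numpy as np


def model_score(runs, window=5):
    """runs: shape (num_runs, num_evaluations), one value per 5k-step evaluation.

    Averages the runs per evaluation step, takes the mean over every window of
    `window` consecutive evaluations, and returns the maximum of these means.
    """
    curve = np.atleast_2d(np.asarray(runs, dtype=np.float64)).mean(axis=0)
    if curve.size < window:
        raise ValueError("curve is shorter than the window")
    window_means = np.convolve(curve, np.ones(window) / window, mode="valid")
    return float(window_means.max())


runs = [
    [0.10, 0.30, 0.50, 0.60, 0.62, 0.63, 0.61, 0.62],
    [0.12, 0.28, 0.52, 0.58, 0.64, 0.61, 0.63, 0.62],
]
score = model_score(runs)
```

Taking the maximum of windowed means rather than the single best evaluation makes the reported number less sensitive to one unusually good validation result.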
When reporting the results in tables, we use the following abbreviations for
header field names in the column headers: all fields: the mean performance
across all fields, including header fields and line item fields; Amount: the
amount stated on the invoice; Number: the invoice number/identifier; Date: the
date the invoice was issued; VName: the vendor name, e.g. “Denk Development”
in Figure 6; VAddress: the address that belongs to the vendor.
For line item fields, the ones that can occur multiple times, we use: all fields:
the mean performance across all line item fields (excluding header fields);
Descr: the line item description; Quantity: the line item quantity; TotalPr: the
total price of the line item; ItemQuote: additional information about a line
item; VatRate: the VAT rate of the line item, i.e. what tax percentage applies.

Header fields   all fields  Amount  Number    Date     VName      VAddress
[Chargrid]      0.6176      0.9142  0.8390    0.8574   0.4091     0.4510
[C+Wordgrid]    0.6267      0.9053  0.8430    0.8699   0.4163     0.4672
[C+BERTgrid]    0.6548      0.9238  0.8625    0.8846   0.4722     0.5018

Line items      all fields  Descr   Quantity  TotalPr  ItemQuote  VatRate
[Chargrid]      0.5659      0.5240  0.6697    0.7288   0.1907     0.2655
[C+Wordgrid]    0.5764      0.5226  0.6612    0.7325   0.2507     0.3238
[C+BERTgrid]    0.6042      0.5529  0.7318    0.7366   0.2948     0.3405

Table 3: Main results on all header (top) and line item fields (bottom); CU-net
architecture. [C+BERTgrid], the model with contextualization, consistently
outperforms both the baseline and the Wordgrid variant.
6.1 Experiments
Main Results. In Table 3 we show the main results achieved by three models
we compare with each other:
[Chargrid] the best Chargrid model available; the neural network architecture
is an advancement of the one presented in Section 3.1, namely a CU-net.
[C+Wordgrid] a combination of Chargrid and Wordgrid. For combining,
the “Parallel U-net” method is chosen (see Section 4.2.4), where we replicate
the first CU-net block. Wordgrid uses vectors from word2vec-2d
pre-training with r = 9.
[C+BERTgrid] a combination of Chargrid and BERTgrid. We combine
the vectors as in [C+Wordgrid]. The BERT model is pre-trained for 2M
steps, feature vectors from the second to last hidden layer are used, the
floating point precision of BERTgrid is 16 bit, and the entire vector is used
(d_model = 768).
[C+BERTgrid] achieves the best performance by a considerable margin. On
all fields, the mean across all fields, it outperforms the [Chargrid] baseline
by 0.0373. There is not a single field on which we see a deterioration in
performance when using [C+BERTgrid]. The largest improvement can be
observed on the line item field ItemQuote, where [C+BERTgrid] is better by
0.1041. On the header fields, [C+BERTgrid] improves the performance by 0.0631
on the Vendor Name (VName) field.
Figure 15: Comparison of the convergence of three different models over the
first 400k steps (we trained until 800k); CU-net architecture. The error intervals
indicate the standard deviation across four training runs (for each model).
[C+Wordgrid] outperforms the [Chargrid] baseline as well, albeit by a smaller
margin of 0.0091. It is noteworthy that we see a performance loss on the fields
Amount, Descr, Quantity, and TotalPr.
We also observe that both [C+Wordgrid] and [C+BERTgrid] converge faster
than [Chargrid]. Faster is measured in terms of the number of weight updates
(i.e. training steps), not wall-clock time. Figure 15 shows the convergence
of the three models on all fields over the first 400k steps. [C+Wordgrid]
converges by far the fastest and starts to saturate after about 100k steps. It
surpasses an accuracy of 0.6 after 60k steps, [C+BERTgrid] after 95k steps, and
[Chargrid] much later, after 215k steps.
Header fields   all fields  Amount  Number    Date     VName      VAddress
[rand]          0.5247      0.8750  0.7098    0.7854   0.3199     0.3219
[w2v]           0.5838      0.8998  0.7561    0.8408   0.4053     0.4178
[w2v-2d-3]      0.5821      0.9100  0.7596    0.8468   0.3645     0.4071
[w2v-2d-9]      0.5898      0.8964  0.7589    0.8470   0.3889     0.4204

Line items      all fields  Descr   Quantity  TotalPr  ItemQuote  VatRate
[rand]          0.4766      0.4647  0.4566    0.6892   0.1374     0.1343
[w2v]           0.5353      0.4939  0.6130    0.7050   0.2434     0.1787
[w2v-2d-3]      0.5336      0.4887  0.6311    0.7183   0.1754     0.1940
[w2v-2d-9]      0.5451      0.4929  0.6527    0.7075   0.2093     0.1940

Table 4: Embedding choice results for plain Wordgrid on all header (top) and
line item fields (bottom); U-net architecture
Embedding Choices. We run experiments comparing Wordgrid variants with
different choices for the non-contextualized embedding dictionary, see Section 4.2.2.
All embedding spaces have a dimensionality of d = 32 and the vocabulary size
is V = 40,000. The embedding dictionary initializations we use are:
[rand] random embedding, initialized from N(0, 0.01).
[w2v] semantic embedding, pre-trained with word2vec on the 700k invoice
dataset; cbow method, window size 4, negative sampling 25, 100 iterations
on 50 threads, minimum count of 50 for the vocabulary words, vocabulary
reduced to 40k in a post-processing step.
[w2v-2d-3] semantic embedding, pre-trained with word2vec-2d (see
Section 4.1) on the 700k invoice dataset, with context radius r = 3, cbow,
negative sampling 25, for 150 iterations, minimum count of 50 for the
vocabulary words, vocabulary reduced to 40k in a post-processing step.
[w2v-2d-9] semantic embedding, same as [w2v-2d-3] except the radius is
set to r = 9.
Table 4 lists the results for the different embedding methods on all fields. Note
that the values are generally lower than the ones in Table 3 because we use
the U-net architecture and do not provide the model with character-level
information (no combination). [rand] consistently performs worst; on the
all fields measure (also for line items) the novel word2vec-2d with r = 9,
[w2v-2d-9], dominates. On all fields but VName, Descr, and ItemQuote,
word2vec-2d is better than word2vec.
Combination Methods. In Table 5 we report the results achieved by different
combination methods for Chargrid and Wordgrid as described in
Section 4.2.4. In addition, we want to state that the outputs of the two 1 × 1
convolutions used in [1x1conv] are added and that [par-b1] uses the first
U-net block in parallel and adds the results, yielding the input to the second
U-net block.

Header fields   all fields  Amount  Number    Date     VName      VAddress
[concat]        0.6020      0.9144  0.8398    0.8702   0.4040     0.4385
[1x1conv]       0.6005      0.9054  0.8398    0.8631   0.4011     0.4572
[par-b1]        0.6113      0.9225  0.8437    0.8816   0.4151     0.4434

Line items      all fields  Descr   Quantity  TotalPr  ItemQuote  VatRate
[concat]        0.5437      0.4860  0.6561    0.7109   0.1860     0.2493
[1x1conv]       0.5421      0.4663  0.6579    0.7193   0.2071     0.3053
[par-b1]        0.5547      0.4689  0.6753    0.7345   0.1876     0.2993

Table 5: Chargrid-Wordgrid combination method results on all header (top)
and line item fields (bottom); U-net architecture
Figure 16 (Venn diagrams; panels [concat], [1x1conv], [par-b1]): Model
complementarity on the line item description field (Descr). All three
combination methods show similar patterns of Chargrid/Wordgrid usage.
As measured by all fields, the combination method [par-b1] leads to the
best results. On some fields, [concat] or [1x1conv] perform better.
6.2 Model Complementarity
We apply the model complementarity definition (see Section 4.2.4) to Chargrid
(M1), Wordgrid (M2), and combination methods (M3). The raw results of the
latter can be found in Table 5.
As examples, we pick three fields and show the Venn diagrams for each
combination method: Figure 16 for the Descr field, on which [concat] performs
best; Figure 17 for Quantity, where [par-b1] is the strongest; and Figure 18
for VAddress, on which [1x1conv] yields the best results.
The complementarity analysis suggests that the weakness of [concat] and [1x1conv]
lies in their inability to use the information from both Chargrid and Wordgrid.
While they do so on some fields, they tend to rely more on the Chargrid input on
others, thereby losing performance compared to [par-b1].
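The regions of such a Venn diagram can be counted from per-sample correctness flags of the three models. The sketch below is an illustration under the assumption that complementarity is assessed per validation sample; the exact definition is the one given in Section 4.2.4, and the function name and toy data are hypothetical. C denotes Chargrid, W Wordgrid, and B the combined model.

```python
from collections import Counter


def venn_regions(correct_c, correct_w, correct_b):
    """Count samples per Venn region, keyed by the models that extracted them correctly.

    Each argument is a list of booleans, one entry per validation sample.
    """
    regions = Counter()
    for c, w, b in zip(correct_c, correct_w, correct_b):
        key = "".join(name for name, ok in (("C", c), ("W", w), ("B", b)) if ok)
        regions[key or "none"] += 1
    return regions


regions = venn_regions(
    correct_c=[True, True, False, False, True],
    correct_w=[True, False, True, False, True],
    correct_b=[True, False, True, True, False],
)
```

A large "CW" region, i.e. samples that both single-input models solve but the combination does not, would indicate the kind of loss discussed for the combination methods.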
Figure 17 (Venn diagrams; panels [concat], [1x1conv], [par-b1]): Model
complementarity on the line item quantity field (Quantity). All three
combination methods use C and W in a balanced way. [par-b1] performs best
because it captures more of the samples in the intersection of C and
W than [1x1conv]. [concat] is weaker at using all Wordgrid information.
Figure 18 (Venn diagrams; panels [concat], [1x1conv], [par-b1]): Model
complementarity on the vendor address field (VAddress). We observe a clear
imbalance in information usage for [concat], which leans towards
C and away from W. [par-b1] performs best at covering both C and W.
7 Discussion
Our novel Wordgrid method pushes the state-of-the-art performance on SAP’s
proprietary invoice dataset. Its first variant, the non-contextualized grid of
word vectors, achieves better extraction accuracies when combined with Char-
grid than plain Chargrid. The conceptually fundamentally different BERTgrid
is even better and sets a new state-of-the-art for key-value extraction from in-
voices. Our methods were both motivated by recent advancements in 1D NLP.
By applying them to a 2D problem we have successfully shown that the field
of 2D document understanding can benefit from methods cast from 1D to 2D.
Specifically, the usage of contextualized embedding for 2D document represen-
tation leads to performance gains analogous to the contextualized embedding
of text sequences in 1D NLP.
Due to the lack of publicly available 2D document understanding benchmarks,
we were unable to compare our method to other published approaches. We do,
however, hypothesize that the performance boost seen on the invoice problem
translates to other 2D document understanding problems as well. We assume
word2vec-2d and Wordgrid are both beneficial in other areas as well. Key-
value extraction from invoices can be seen as a representative benchmark as
invoices rely heavily on 2D structure. A human reader would search for the
vendor name at the top of a document first, possibly printed in a larger font
size. Line items are typically positioned in the middle, arranged in a tabular
layout. These properties of invoices support our assumption that Wordgrid is
indeed an advancement of 2D document understanding methods in general.
Embedding Choices. Experiments with different embedding initializations
show that semantic meaning improves the downstream model’s performance
considerably. Most likely due to the lack of prior, semantic information, random
embedding ([rand]) is left far behind word2vec and word2vec-2d. The informa-
tion of the 700k invoices dataset is in some sense contained in the non-random
embedding spaces and is presumably responsible for the gains in accuracy.
The novel word2vec-2d method performs better than word2vec, which we see as
support for our assumption that the representation quality of vectors computed
by word2vec suffers from serialization of 2D documents. By circumventing the
serialization, word2vec-2d calculates embedding spaces which are semantically
meaningful for 2D documents.
The radius hyperparameter r of word2vec-2d does influence the quality of the
embedding: We observe differences between the embedding space trained with
r = 9 (better) and the one trained with r = 3. Other than performing grid
search, the hyperparameter can be tuned by computing the average context
size, i.e. the average cardinality of the set in Equation 17, across the training
data. A domain expert can make an assessment of how many words typically
belong to the semantic context of a word in a document and choose r based
on that. As a second option, the 2D documents can be inspected visually and
distances between semantically distinct sections can be measured. The radius
should be chosen to be smaller than these distances.
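The average context size for a candidate radius can be estimated with a few lines of code. The sketch below assumes that a word's context consists of all other words whose positions lie within Euclidean distance r; the exact definition is the one in Equation 17, and the function name and coordinates are illustrative only.

```python
import math


def average_context_size(centers, radius):
    """Mean number of other words within `radius` of each word (Euclidean distance).

    centers: list of (x, y) word positions of one document.
    """
    if not centers:
        return 0.0
    total = 0
    for i, (xi, yi) in enumerate(centers):
        for j, (xj, yj) in enumerate(centers):
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                total += 1
    return total / len(centers)


words = [(0, 0), (2, 0), (0, 2), (10, 10)]
small = average_context_size(words, radius=3)
large = average_context_size(words, radius=20)
```

Averaging this value over the training documents for several radii gives a curve from which r can be picked to match the context size a domain expert considers plausible.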
Combination Methods. We consistently see the combination of Chargrid
and Wordgrid/BERTgrid perform better than plain Wordgrid/BERTgrid.
Naturally, neural networks are good at handling additional information and
weighing the importance of input channels. The difficulty of combining Chargrid
and Wordgrid (this includes BERTgrid) lies in making both kinds of information as
accessible to the model as possible. Our model complementarity method shows
that parallel processing of Chargrid and Wordgrid with subsequent addition
is most beneficial. Intuitively, this seems reasonable as the semantic level on
which information is contained in the tensors differs a lot. The parallel U-net
branches can presumably convert both representations to a similar level, using
several stacked convolutional layers.
We observe that all combination methods lose some accuracy on fields which
Chargrid and/or Wordgrid worked well on. These areas are red, dark yellow, and
green in the Venn diagrams, Figure 16, Figure 17, and Figure 18. Despite the
gain on new fields (blue shaded areas at the bottom), this loss is a clear indicator
for sub-optimal combination. None of the proposed combination methods uses
the complementarity of Chargrid and Wordgrid to its full extent.
The most promising combination method is [par-b1], not only because it
achieves the best results quantitatively, but also because it has no bias towards
Chargrid or Wordgrid. On the other hand, [concat] makes it hard for the
downstream model to use all available information. The [1x1conv] method is
somewhere in between. That supports our assumption that some parallel
processing is necessary, because the 1 × 1 convolution is in some sense, albeit
being just one layer deep, a parallel U-net branch too.
Drawbacks of Wordgrid. The performance gains of Wordgrid come at a
cost. There are several disadvantages and requirements associated with Word-
grid which we elaborate on in the following.
Unless used in combination with random embedding, Wordgrid requires data for
embedding pre-training. While the labeled dataset can be used, it may often be
too small. In business applications, this requirement might be less of a problem
as unlabeled data is commonly available in abundance. It does, however, set
Wordgrid apart from Chargrid, which works solely with a labeled dataset.
With Wordgrid, the overall wall-clock duration of a training increases. We
measure the training speed in seconds elapsed per 100 update steps (the lower
the better). Numbers reported are approximate and for the U-net architecture.
For Chargrid the number of seconds per 100 steps is 31, Wordgrid with a static
embedding function is a bit higher at 38. Contextualized embedding is very
costly: The full-size BERTgrid model needs 277. For reference, training such
a BERTgrid model for 800k steps (which is about where it converges on the
invoice task) takes about 26 days. The training time can be reduced to 100
seconds per 100 steps by cutting the spatial size along the x and y directions
in half. It can be further reduced to 56 by lowering the floating point precision
to 16 bit and slicing off half of the contextualized vector. The long training
times of BERTgrid pose a serious challenge in terms of rapid experimentation.
We have not benchmarked inference times of Wordgrid but assume them to be
similarly higher.
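The reported speeds translate into wall-clock durations as follows. The helper below is a plain arithmetic sketch; applying 800k steps to every configuration is an assumption made for comparability, since some models are trained for only 400k steps.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def training_days(seconds_per_100_steps, steps):
    """Wall-clock training duration in days for a given speed."""
    return seconds_per_100_steps * (steps / 100) / SECONDS_PER_DAY


durations = {
    "Chargrid": training_days(31, 800_000),
    "Wordgrid (static embedding)": training_days(38, 800_000),
    "BERTgrid (full size)": training_days(277, 800_000),
    "BERTgrid (half spatial size)": training_days(100, 800_000),
    "BERTgrid (half spatial size, 16 bit, half depth)": training_days(56, 800_000),
}
```

For the full-size BERTgrid this gives 277 × 8,000 s = 2,216,000 s, or about 25.6 days, in line with the roughly 26 days stated above, compared with just under three days for Chargrid.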
With BERTgrid in particular, the overall training pipeline gets more complex.
That is due to the evaluation of the contextualized embedding function, which is
considerably more complex than the fixed embedding function, which can be
implemented as a simple dictionary lookup. For computing a contextualized vector, inference
of a BERT model must be run, which can be done offline or online. In the offline
case (which we have implemented), the contextualized feature vectors must be
matched with the current invoice’s OCR data to construct the BERTgrid tensor
on-the-fly. In the online case an entire BERT model must be held in RAM.
The usage of Wordgrid typically increases the number of model parameters. The
Chargrid network has 20.3M parameters, using Wordgrid with concatenation
results in 23.8M. The full-size BERTgrid model (full depth and full spatial size)
has 26.6M parameters. Additional parameters forced us to reduce the batch
size from 7 to 6, which is a relatively small sacrifice.
The pre-trained BERT models are limited to a maximum sequence length of
512. That is a strong limitation of the BERT pipeline we use. In our case it
is not very pronounced only because the document lengths happen to lie
mostly below that limit (see Figure 14). For applications with larger 2D documents,
the BERT model would need to be applied several times, similar to Dai et al.
Effectiveness of Unsupervised Pre-training. Unsupervised pre-training
is in large part responsible for the performance improvements of Wordgrid.
When comparing, for instance, the two Wordgrid training runs with random
embedding ([rand]) and word2vec ([w2v]) embedding, we see significant differences
in performance. On a technical level, both are very similar: A word in the
invoice is mapped to one of 40k corresponding embedding vectors. The only
difference is the values of these vectors: the word2vec ones carry semantic
meaning. The resulting performance increase of 0.0591 is remarkable and credit
goes solely to the word2vec pre-training. BERTgrid can be seen in a similar way.
Even though it does not map 1:1 from word to vector anymore, it improves the
performance of the Chargrid network just by providing it with semantically
meaningful, contextualized input representations.
Based on these findings we conclude that unsupervised pre-training is an integral
part of the effectiveness of Wordgrid and BERTgrid. The latter profits even
more from pre-training because its neural language model can capture much
more complex linguistic rules and dependencies than the small word2vec
embedding space can. Consequently, the size of the unlabeled corpus must be
larger too. In our case the 700k invoices dataset is sufficiently large.
Effectiveness of Contextualization. It was previously well known that con-
textualization is very beneficial in 1D NLP. Our results show that this translates
to 2D: Our best model uses contextualized embedding.
8 Future Work
While working on Wordgrid we have touched several aspects that potentially
qualify as subjects of future work. We describe them in this section.
Using BERT-2D End-to-end. The BERT model was shown to be very
successful on a range of different tasks. After pre-training on sufficiently large
data, its feature vectors are often contextually rich enough such that a single
layer alone can perform classification very well. Hence, we believe instead of
feeding contextualized word piece vectors into a Chargrid network, the entire
Chargrid network could be replaced by a BERT-2D model altogether. For that,
two bigger changes must be made to BERT: (1) 2D contextualized embedding
and (2) sparse input representation.
While we have arranged the Transformer blocks in a sequential order in Fig-
ure 3, in reality the information about a word’s position in the sequence is only
provided to the network in the positional encoding. The Transformer blocks in
a BERT model attend to all blocks irrespective of their position. We therefore
suggest using a two-dimensional positional encoding where a word piece em-
bedding is a composition of (a) a randomly initialized, trainable vector (lookup
table), (b) a positional encoding for the xaxis, and (c) a positional encoding
for the yaxis. Whether or not to use trainable or non-trainable (e.g. sinusoidal
as in Vaswani et al. (2017)) positional encodings would need to be determined
In 1D NLP a sentence is a sequence of words and there are no gaps in between
words. 2D documents, like invoices, differ as they have blank background (see
Figure 6). The Chargrid network is provided symbols and background alike. For
BERT, however, this seems impractical because there would be too many input
symbols. The computational complexity of attention grows in O(n²) with n
being the number of inputs. Having too many inputs is therefore unaffordable,
which is also why the BERT model caps the maximum sequence length at 512.
By feeding only the Wordgrid cells into the BERT model where there is a word
(or word piece) present, we represent a 2D document in a much sparser way and
will be unlikely to reach the computational limits.
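To make the quadratic growth concrete, the snippet below compares the number of query-key pairs for a dense grid of 256 × 336 cells (the BERTgrid spatial size used in this work) with a sparse sequence of 512 word pieces. Treating every grid cell as one attention input is an assumption made for the comparison.

```python
def attention_pairs(num_inputs):
    """Number of query-key pairs in full self-attention, which grows as O(n^2)."""
    return num_inputs ** 2


dense_cells = 256 * 336           # every grid cell as an input, background included
sparse_tokens = 512               # only cells that hold a word piece

ratio = attention_pairs(dense_cells) / attention_pairs(sparse_tokens)
```

The dense representation has 86,016 inputs, 168 times as many as the sparse one, so full attention would need 168² = 28,224 times as many pairwise interactions.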
Mathematically described, BERT-2D processes a document as defined in
Equation 20 as follows. Let e_2D : R × R → R^d be a function computing a positional
encoding vector for any given (x, y) coordinate tuple. Further, let e_fixed be a
non-contextualized embedding function as defined in Equation 22. The input to the
first Transformer block b_1 (see Equation 15) is a sequence of length N^(i), where
for each j ∈ {1, ..., N^(i)} position

X_{j,:} := e_{2D}\left(\frac{x^{(j)}_{\max} + x^{(j)}_{\min}}{2}, \frac{y^{(j)}_{\max} + y^{(j)}_{\min}}{2}\right) + e_{\text{fixed}}\left(w^{(j)}\right)

is a combination of positional encoding and word embedding vector. The tensor
X is processed as in normal BERT models by several stacked Transformer
blocks. The last layer could be modified to match the Chargrid network, i.e.
have two branches, for segmentation and bounding boxes, respectively. The
model could then be trained with the Chargrid loss, see Equation 3.
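A possible form of the BERT-2D input is sketched below with non-trainable sinusoidal encodings in the style of Vaswani et al. (2017). Several choices are assumptions of this example rather than part of the proposal: the position of a word is taken to be the center of its bounding box, one half of the encoding vector is devoted to x and the other half to y, positional encoding and word vector are added, and the lookup table stands in for e_fixed. The embedding size d must be divisible by four in this sketch.

```python
import numpy as np


def sinusoidal(position, dim):
    """1D sinusoidal encoding of a scalar position; dim must be even."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = position * freqs
    out = np.empty(dim, dtype=np.float32)
    out[0::2] = np.sin(angles)
    out[1::2] = np.cos(angles)
    return out


def e_2d(x, y, d):
    """Assumed composition: half of the vector encodes x, the other half encodes y."""
    return np.concatenate([sinusoidal(x, d // 2), sinusoidal(y, d // 2)])


def bert2d_input(boxes, word_ids, lookup):
    """One row per word: positional encoding of the box center plus the word vector."""
    d = lookup.shape[1]
    rows = []
    for (x_min, y_min, x_max, y_max), word_id in zip(boxes, word_ids):
        center_x, center_y = (x_min + x_max) / 2, (y_min + y_max) / 2
        rows.append(e_2d(center_x, center_y, d) + lookup[word_id])
    return np.stack(rows)


lookup = np.zeros((10, 8), dtype=np.float32)   # toy stand-in for e_fixed
X = bert2d_input([(0, 0, 0, 0), (2, 4, 6, 8)], [3, 7], lookup)
```

Because only positions that hold a word contribute a row, the sequence length equals the number of words and not the number of grid cells.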
Pre-training on 2D Data. Several methods are available for pre-training
on our invoice datasets. Word2vec works with serialized 2D documents and
was used to produce a fixed lookup table with non-contextualized word vec-
tors. With our word2vec-2d we allowed for pre-training on non-serialized 2D
documents and produced non-contextualized embeddings as well. For the con-
textualized embedding we suggested to use a BERT model and pre-train it
on serialized 2D documents. There is, however, no published method for pre-
training models, capable of computing contextualized embedding vectors, on
raw 2D data, i.e. without serialization. This lack of methods is apparent in
Table 2 where there is a gap in the “Contextualized”-“Pre-training on 2D data”
cell. We suggest two ways of filling that gap: (1) using the Chargrid network
as a denoising autoencoder (DAE) and (2) casting BERT to 2D.
The Chargrid (and equivalently Wordgrid) network (Figure 1) is a CNN which
converts, among other things, a tensor of spatial extent w × h into a
segmentation mask of the same spatial size. In DAE (Vincent et al. (2008)) mode,
the bounding box regression part of the network is discarded and the
segmentation mask depth is set to equal the input depth. The model is then a function
f_DAE : R^(w×h×d) → R^(w×h×d). The training objective is, given a Wordgrid X and
a corrupted version X̃ of it, to reconstruct the original X from X̃.
The corrupted version of the input can be computed in many different ways, e.g.
by randomly masking out words or replacing them. The contextualized feature
vector could then be extracted from an internal representation of fDAE.
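One way of computing the corrupted input is to zero out the cells of randomly selected words. The sketch below assumes that the word bounding boxes are known in grid coordinates and that masking means replacing the word's vectors with zeros; replacing words with other words would follow the same pattern. The function name is hypothetical.

```python
import numpy as np


def corrupt_wordgrid(wordgrid, boxes, mask_prob, rng):
    """Zero out the cells of randomly chosen words.

    Returns the corrupted copy and a boolean (h, w) mask of the corrupted cells.
    """
    corrupted = wordgrid.copy()
    mask = np.zeros(wordgrid.shape[:2], dtype=bool)
    for x0, y0, x1, y1 in boxes:
        if rng.random() < mask_prob:
            corrupted[y0:y1, x0:x1, :] = 0.0
            mask[y0:y1, x0:x1] = True
    return corrupted, mask


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8, 4)).astype(np.float32)
boxes = [(0, 0, 3, 1), (4, 0, 8, 1), (0, 3, 5, 4)]
X_tilde, mask = corrupt_wordgrid(X, boxes, mask_prob=0.5, rng=rng)
```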
Casting BERT (Devlin et al. (2018)) to 2D is very similar to the DAE. The
BERT model also reconstructs the document from a corrupted version. However,
the loss is only computed for the places at which the input was
corrupted. By default, BERT cannot cope with 2D input, so the BERT-2D version
described above would need to be used.
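The difference between the two objectives lies in where the reconstruction error is measured. The sketch below contrasts a loss over all cells with a loss restricted to corrupted cells. The squared error is chosen only for illustration; BERT itself uses a cross-entropy loss over word pieces, and the loss for the DAE variant is not fixed here.

```python
import numpy as np


def reconstruction_losses(original, reconstruction, mask):
    """Mean squared error over all cells and over corrupted cells only."""
    squared_error = ((reconstruction - original) ** 2).mean(axis=-1)   # shape (h, w)
    full_loss = float(squared_error.mean())
    masked_loss = float(squared_error[mask].mean()) if mask.any() else 0.0
    return full_loss, masked_loss


original = np.zeros((2, 2, 3))
reconstruction = np.zeros((2, 2, 3))
reconstruction[0, 0] = 1.0          # error at a corrupted cell
reconstruction[1, 1] = 2.0          # error at an uncorrupted cell
mask = np.array([[True, False], [False, False]])
full_loss, masked_loss = reconstruction_losses(original, reconstruction, mask)
```

In this toy case the error at the uncorrupted cell contributes to the loss over all cells (1.25) but not to the loss over corrupted cells (1.0).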