Figure 3 - uploaded by Timo Immanuel Denk

# Web page rank estimation model architecture

Input to the model is a graph representing the web pages of a domain (1). Each node is attributed with two screenshots, and the nodes are interconnected by web links. A screenshot feature extractor CNN (2) is applied node-wise, thereby converting the screenshot graph into a graph of feature vectors (3). The feature vector graph is the input to a GN (4), which processes the graph as a whole (as opposed to node-wise). It reduces the graph to a single scalar (5) from which the relative ranking can be inferred. The internals of module (2) are explained in Section 3.1; those of (4) in Section 3.2.
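The five stages in the caption can be illustrated with a toy sketch. This is not the authors' implementation: the "CNN" is replaced by a mean over pixel values, the GN by a single round of neighbor averaging, and the readout by a global mean. All function names and the `graph` dictionary layout are assumptions made for illustration only.

```python
from statistics import mean


def extract_features(screenshots):
    """Stand-in for the feature extractor (2): one number per screenshot.

    Assumption: each screenshot is a flat list of pixel intensities.
    """
    return [mean(pixels) for pixels in screenshots]


def message_pass(features, edges):
    """Toy GN step: average each node's vector with those of its in-neighbors."""
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])
    updated = {}
    for node, own in features.items():
        vectors = [own] + incoming[node]
        updated[node] = [mean(column) for column in zip(*vectors)]
    return updated


def estimate_rank_score(graph):
    """Screenshot graph (1) -> feature vector graph (3) -> single scalar (5)."""
    features = {n: extract_features(s) for n, s in graph["nodes"].items()}
    features = message_pass(features, graph["edges"])
    # Global readout: collapse all node features into one scalar.
    return mean(mean(vector) for vector in features.values())


# Hypothetical two-page domain; each node carries two tiny "screenshots".
domain = {
    "nodes": {
        "home": ([0.2, 0.4], [0.6, 0.8]),
        "about": ([0.0, 0.2], [0.4, 0.6]),
    },
    "edges": [("home", "about")],  # web link from home to about
}
score = estimate_rank_score(domain)
```

The point of the sketch is the shape of the data flow: per-node processing first, whole-graph processing second, and a scalar at the end.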

## Source publication

Based on an available list of the top 100,000 most popular domains on the web, we define a novel vision-based page rank estimation task: A model is asked to predict the rank of a given web domain purely based on screenshots of its web pages and information about the web link graph that interconnects them. This work is a feasibility study seeking to...

## Contexts in source publication

**Context 1**

... goal of our model's inference mode is either to estimate the absolute popularity of a web page with respect to the top 100,000 web pages or to compare two (or more) given web pages relative to one another. The remainder of this section describes the components of our model in greater detail; Figure 3 gives an overview of the model's internal processing pipeline. We also introduce our loss function and the evaluation metric we use to measure the performance of our model. ...

**Context 2**

... converting the graph of screenshots into a graph of feature vectors, the entire graph is fed into the GN; component (4) in Figure 3. The GN module is an integral component of our model: It utilizes the graph structure of the dataset samples and outputs the actual model prediction, thereby converting a graph into a continuous scalar, interpretable as a rank estimation. ...

**Context 3**

... a dataset consisting of $N \geq 2$ graphs and corresponding ranks. Let $f_{\text{model}}(G)$ be the model function mapping a graph to a scalar in $\mathbb{R}$ (visualized in Figure 3). $\mathbf{f} \in \mathbb{R}^N$ is the vector induced by applying the model function to the entire dataset: ...
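As a rough illustration of the induced vector, the sketch below applies a placeholder model function to every graph in a small dataset and then derives a relative ordering from the resulting scalars. The body of `f_model` is a made-up stand-in, not the model from the publication, and treating a higher score as "more popular" is an assumption; the excerpt does not state the direction.

```python
def f_model(graph):
    """Hypothetical stand-in for the model function: graph -> real scalar."""
    return len(graph["edges"]) / max(len(graph["nodes"]), 1)


def induced_vector(dataset):
    """Apply the model function to each of the N graphs, giving a length-N vector."""
    return [f_model(graph) for graph in dataset]


def relative_order(scores):
    """Indices of the graphs sorted by score, highest first (assumed direction)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy dataset with N = 3 graphs; node and edge contents are placeholders.
dataset = [
    {"nodes": ["a", "b"], "edges": [("a", "b")]},
    {"nodes": ["a", "b", "c", "d"], "edges": [("a", "b")] * 6},
    {"nodes": ["a"], "edges": []},
]
f = induced_vector(dataset)
order = relative_order(f)
```

Only the ordering of the entries of `f` matters for a relative comparison, which is why a single scalar per graph is sufficient.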

**Context 4**

... it is a common method to look at the kernels of CNNs (see e.g. Figure 3 in [KSH12]), this approach is not reasonably applicable to our architecture because our filter tensors have a spatial dimensionality of at most 3×3. Instead, we analyze the CNN feature maps and retrieve hard and easy samples. ...

## Citations

... Examples of 2D documents are presentation slides, layout-rich websites, posters, cover pages, flyers, and invoices. Concrete tasks could be key-value extraction from invoices (Katti et al. (2018)), election poster classification, or vision-based web page rank estimation (Denk and Güner (2019)). Many more are conceivable. ...

Chargrid is a recently proposed approach to understanding documents with 2-dimensional structure. It represents a document as a grid, thereby preserving its spatial structure for the processing model. Text is embedded in the grid with one-hot encoding at the character level. With Wordgrid, we extend Chargrid by employing a grid at the word level.

For embedding words with semantically meaningful vectors, we propose a novel method for estimating dense word vectors, called word2vec-2d. It is a fork of word2vec that is trained on 2D document corpora rather than 1D text sequences. The notion of context is redefined to be the variably sized set of words that are spatially located within a certain distance of the center word.

BERTgrid, our most advanced Wordgrid variant, uses contextualized word piece vectors. The concrete vector chosen for a position in the grid is retrieved from the hidden representations of a BERT language model. This model has access to the neighboring text, as opposed to mapping every symbol 1:1 to its corresponding representation, irrespective of position and contextual meaning.

Both new methods benefit greatly from unsupervised pre-training. We apply them to two proprietary SAP invoice datasets, a large unlabeled one and a smaller labeled one. The task is key-value extraction, e.g. determining the invoice date or vendor name. The best Wordgrid model improves over the Chargrid baseline by a margin of 0.91 percentage points; BERTgrid achieves even better performance, 3.73 percentage points above Chargrid.
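
The spatial notion of context described for word2vec-2d can be sketched as follows. This is a minimal illustration, not the word2vec-2d code: it assumes each word is reduced to a single (x, y) point, uses Euclidean distance, and the function name, tuple layout, and sample words are all hypothetical.

```python
from math import hypot


def spatial_context(words, center_index, max_distance):
    """Return the texts of all words within max_distance of the center word.

    Assumption: each entry is (text, x, y) with coordinates in page units.
    The result size varies with the layout, unlike a fixed 1D window.
    """
    _, cx, cy = words[center_index]
    return [
        text
        for i, (text, x, y) in enumerate(words)
        if i != center_index and hypot(x - cx, y - cy) <= max_distance
    ]


# Made-up invoice fragment for illustration.
words = [
    ("Invoice", 0.0, 0.0),
    ("Date:", 0.0, 2.0),
    ("2019-01-31", 3.0, 2.0),
    ("Total", 0.0, 10.0),
]
context = spatial_context(words, center_index=1, max_distance=3.0)
# context == ["Invoice", "2019-01-31"]
```

Here the word above and the word to the right of "Date:" both fall inside the radius, while "Total" further down the page does not, even though it might be adjacent in a linearized 1D reading order.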