Utilizing HTML-analysis and computer vision on a corpus of website screenshots to investigate design developments on the web
Thomas Schmidt | Anastasiia Mosiienko | Raffaela Faber | Juliane Herzog | Christian Wolff

Media Informatics Group, University of Regensburg, Regensburg, Germany

Correspondence: Thomas Schmidt, Media Informatics Group, University of Regensburg, Regensburg, Germany.
We present preliminary results of a project investigating the design development of popular websites between 1996 and 2020 via HTML analysis and basic computer vision methods. We acquired a corpus of website screenshots of the current top 47 popular websites. We crawled one snapshot per month of these websites via the Wayback Machine of the Internet Archive platform, covering the entire period for which snapshots are stored, to gather 7,953 screenshots and HTML pages. We report quantitative analysis results concerning HTML elements, color distributions, and visual complexity throughout the years.
KEYWORDS: colors, html, visual complexity, web design, web history, websites
The World Wide Web has become an important part of modern media infrastructure and society. The web itself has also become an object of research in human-computer interaction (Jørgensen & Myers, 2008) as well as in cultural and media studies (Brügger, 2012). Investigating the history of web interfaces is an important task for web and media historians, but it can also give current web designers inspiration on how developments might continue. One important aspect of this research area is the archival of the web; the preservation of this digital heritage has been addressed by UNESCO. This led to the Internet Archive platform, which, via its Wayback Machine, intends to archive the web and enables design researchers to investigate trends on large-scale corpora. We present work-in-progress results of a project investigating web design developments via quantitative and qualitative analysis. We report on our current approach and the first quantitative results we acquired.
2.1 | Corpus acquisition
We decided to analyze the top 50 most popular websites as of December 2019 according to the analytical platform. We filtered out adult websites, which resulted in a list of 47 websites. We acquired one snapshot per day of these websites stored in the Wayback Machine, if a snapshot was available, for the timespan from 1996 to 2020. A snapshot is a stored representation of the website at a given time. This results in a corpus of 151,682 snapshots. However, the websites are represented rather unequally, with the most popular and oldest being most
DOI: 10.1002/pra2.392
83rd Annual Meeting of the Association for Information Science & Technology October 25-29, 2020.
Author(s) retain copyright, but ASIS&T receives an exclusive publication license
Proc Assoc Inf Sci Technol. 2020;57:e392.
frequent. For this study, we limited the corpus to one snapshot per month per website (the first available snapshot per month) to avoid problems with this unequal distribution. This subcorpus consists of 7,953 snapshots (Tables 1 and 2), of which we scraped the HTML and took a screenshot as a TIFF file with a width of 1,920 pixels and a height according to the size of the website. To enable comparisons, we sliced the screenshots at 3,000 pixels height. The sample size for the early years is rather limited. However, beginning in 2003, the sample size is more representative, with around 300 snapshots per year. Figures 1 and 2 show two snapshots of
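The per-month sampling described above can be sketched as a small helper that, given the Wayback Machine timestamps available for one site (in the `YYYYMMDDhhmmss` format the archive uses), keeps only the first snapshot of each month. The function name and setup are our illustration; the authors' actual crawling tooling is not published.

```python
def first_snapshot_per_month(timestamps):
    """Given Wayback Machine timestamps (strings in YYYYMMDDhhmmss
    format), return the earliest snapshot for each (year, month)."""
    selected = {}
    for ts in sorted(timestamps):
        month_key = ts[:6]  # YYYYMM
        if month_key not in selected:
            selected[month_key] = ts
    return list(selected.values())

# Example: three snapshots in January 1998, one in February 1998
stamps = ["19980115083000", "19980101120000", "19980130000000",
          "19980203090000"]
print(first_snapshot_per_month(stamps))
# ['19980101120000', '19980203090000']
```

Applied per website, this reduces the 151,682 raw snapshots to the monthly subcorpus of 7,953.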
2.2 | Analysis metrics
We analyzed multiple quantitative metrics: (a) HTML metrics, (b) screenshot size, and (c) color metrics. We counted the number of images via the img-tag, the number of hyperlinks via the a-tag, and the overall number of words in the HTML. Another metric is the size of the screenshots measured in kilobytes after transforming
TABLE 1 Websites of the corpus and number of snapshots (per-site snapshot counts range from 92 to 255)
TABLE 2 Website distribution per year
Year #snapshots Year #snapshots
1996 10 2009 378
1997 23 2010 412
1998 26 2011 487
1999 70 2012 494
2000 150 2013 500
2001 183 2014 487
2002 182 2015 491
2003 284 2016 484
2004 317 2017 503
2005 349 2018 498
2006 362 2019 501
2007 394 2020 30
2008 338
FIGURE 1 Snapshot of (1998)
them to PNG-files. This is an established metric to measure visual complexity and has been shown to correlate with perceived complexity (Purchase et al., 2012); the larger the file, the more complex the image. We also calculated the proportions of the base colors red, green, blue, and yellow as well as black and white. Table 3 gives an overview of the RGB ranges we used.
We averaged the number of image tags, hyperlink tags, and the amount of text per year (see Figures 3–5).
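The three HTML metrics can be computed with the Python standard library alone. The sketch below is our own minimal reimplementation, not the authors' code; it counts img-tags, a-tags, and whitespace-separated words in the text content of a page.

```python
from html.parser import HTMLParser

class MetricCounter(HTMLParser):
    """Counts img tags, a tags, and whitespace-separated words
    in the text content of an HTML page."""
    def __init__(self):
        super().__init__()
        self.images = 0
        self.links = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "a":
            self.links += 1

    def handle_data(self, data):
        self.words += len(data.split())

page = '<p>Hello world</p><a href="/x">link</a><img src="a.png">'
counter = MetricCounter()
counter.feed(page)
print(counter.images, counter.links, counter.words)  # 1 1 3
```

Running this over every snapshot's HTML and grouping by year yields the averages plotted in Figures 3–5.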
For the number of images and hyperlinks, we can identify a steady increase up until 2014, followed by a substantial decrease. A similar development can be found for text up until 2016.
Figure 6 shows the development of the average visual complexity as measured via the file size.
For this metric, we identified a steady increase until 2020, with the largest leap in the time span from 2009 until 2014.
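The intuition behind the file-size metric is that lossless compression removes redundancy, so visually busy images compress less well and yield larger files. A toy illustration using zlib (the DEFLATE compressor PNG uses internally) on raw pixel bytes; this stand-in is our own, not the paper's pipeline:

```python
import random
import zlib

def complexity_proxy(pixel_bytes):
    """Size in bytes after lossless (zlib/DEFLATE) compression,
    the same scheme PNG applies to its image data."""
    return len(zlib.compress(pixel_bytes))

flat = bytes([255]) * 30_000                               # uniform white area
random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(30_000))  # busy area

# The uniform region compresses far better than the noisy one
print(complexity_proxy(flat) < complexity_proxy(noisy))  # True
```

In the study itself, the metric is simply the size of the PNG-converted screenshot in kilobytes.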
The color analysis shows that black and white are the most dominant colors, since white is the typical background color and black the basic font color. Figure 7 illustrates the average proportion of black and white among the websites and shows that most websites consist of around 80% white. However, we observe a small decrease toward the present.
While the overall proportion is much lower, we identified red and blue as the most used among the base colors, though without a consistent trend (Figure 8). Beginning in 2013, we can see a more diverse distribution among the analyzed colors, albeit without a striking trend or development. Please note that the proportions are overall very low, so heavy use of one color by very popular websites in a given year can lead to strong manifestations of that color; this is especially the case for years with few snapshots (1996–2003).
We identified a steady increase of HTML tags up until 2014 and a remarkable decrease afterwards. A similar but less pronounced trend is found for the number of words. We hypothesize that these results represent a trend in web design to include more content up until 2014, which is
FIGURE 2 Snapshot of (2020)
TABLE 3 Overview of the used RGB color ranges
Color Lower limit (R,G,B) Upper limit (R,G,B)
Red (140, 0, 0) (255, 56, 50)
Green (0, 170, 0) (130, 255, 70)
Blue (0, 0, 145) (60, 115, 255)
Yellow (230, 220, 0) (255, 255, 55)
White (240, 240, 240) (255, 255, 255)
Black (0, 0, 0) (25, 25, 25)
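The ranges in Table 3 translate into a simple per-pixel membership test. The sketch below is our illustration, under the assumption that a pixel counts toward a base color when each of its channels lies inside the corresponding interval; the paper does not publish its implementation.

```python
# RGB ranges as listed in Table 3: (lower limit, upper limit) per color
COLOR_RANGES = {
    "red":    ((140, 0, 0),     (255, 56, 50)),
    "green":  ((0, 170, 0),     (130, 255, 70)),
    "blue":   ((0, 0, 145),     (60, 115, 255)),
    "yellow": ((230, 220, 0),   (255, 255, 55)),
    "white":  ((240, 240, 240), (255, 255, 255)),
    "black":  ((0, 0, 0),       (25, 25, 25)),
}

def classify_pixel(rgb):
    """Return the base colors (per Table 3) whose RGB range
    contains this pixel, or an empty list if none match."""
    r, g, b = rgb
    matches = []
    for name, ((rl, gl, bl), (ru, gu, bu)) in COLOR_RANGES.items():
        if rl <= r <= ru and gl <= g <= gu and bl <= b <= bu:
            matches.append(name)
    return matches

print(classify_pixel((250, 250, 250)))  # ['white']
print(classify_pixel((200, 20, 20)))    # ['red']
```

Counting matches over all pixels of a screenshot and dividing by the pixel total gives the per-color proportions shown in Figures 7 and 8.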
FIGURE 3 Average number of images per year
FIGURE 4 Average number of hyperlinks per year
FIGURE 5 Average number of words per year
certainly connected to increased technical possibilities.
Snapshots of in our corpus are representative of
this trend (Figures 9 and 10).
After 2014, this trend is followed by a tendency toward more minimalistic designs. This runs counter to the steady increase in visual complexity. The development of this metric, however, might also be explained by increasing bandwidth enabling designers to include more complex images and graphics. While the usage of black and white has remained dominant on websites, we identified a higher usage of red and blue in the early days and a more diverse color usage beginning in 2013. Our color analysis is, however, very limited, since we neglect a wide range of other colors.
We want to continue our research by including other quantitative metrics, such as alternatives for visual complexity and additional colors, and by analyzing the websites by category. We plan to increase the corpus, especially with websites that were popular in the years 1996–2010 but are not today, to get a more representative sample of these times. We pursue a mixed-methods approach and want to integrate qualitative analysis of a subset of the corpus to get a better understanding of design developments.
FIGURE 6 Average size of PNG-files per year
FIGURE 7 The average proportion of black and white among the websites per year
FIGURE 8 The average proportion of red, blue, green and
yellow per year
FIGURE 9 Snapshot of (1996)
FIGURE 10 Snapshot of (2014)
REFERENCES

Brügger, N. (2012). When the present web is later the past: Web historiography, digital history, and internet studies. Historical Social Research/Historische Sozialforschung, 37(4 (142)), 102–117.
Jørgensen, A. H., & Myers, B. A. (2008). User interface history. In CHI '08 Extended Abstracts on Human Factors in Computing Systems (pp. 2415–2418).
Purchase, H. C., Freeman, E., & Hamer, J. (2012). An exploration of visual complexity. In International Conference on Theory and Application of Diagrams (pp. 200–213). Berlin, Heidelberg: Springer.
How to cite this article: Schmidt T, Mosiienko A, Faber R, Herzog J, Wolff C. Utilizing HTML-analysis and computer vision on a corpus of website screenshots to investigate design developments on the web. Proc Assoc Inf Sci Technol. 2020;57:e392.