POSTERS
Utilizing HTML-analysis and computer vision on a corpus of website screenshots to investigate design developments on the web
Thomas Schmidt | Anastasiia Mosiienko | Raffaela Faber | Juliane Herzog | Christian Wolff
Media Informatics Group, University of Regensburg, Regensburg, Germany
Correspondence
Thomas Schmidt, Media Informatics Group, University of Regensburg, Regensburg, Germany. Email: thomas.schmidt@ur.de
Abstract
We present preliminary results of a project investigating the design development of popular websites between 1996 and 2020 via HTML analysis and basic computer vision methods. We acquired a corpus of website screenshots of 47 of the currently most popular websites. For each website, we crawled one snapshot per month via the Wayback Machine of the Internet Archive, covering the entire period for which snapshots are stored, and gathered 7,953 screenshots and HTML pages. We report on quantitative analysis results concerning HTML elements, color distributions and visual complexity throughout the years.
KEYWORDS
colors, html, visual complexity, web design, web history, websites
1 | INTRODUCTION
The World Wide Web has become an important part of modern media infrastructure and society. The web itself has also become an object of research in human-computer interaction (Jørgensen & Myers, 2008) as well as in cultural and media studies (Brügger, 2012). Investigating the history of web interfaces is an important task for web and media historians, but it can also give current web designers inspiration on how developments might continue. One important aspect of this research area is the archiving of the web; the preservation of this digital heritage has been addressed by UNESCO.¹ This led to the platform Internet Archive,² which, via its Wayback Machine,³ intends to archive the web and enables design researchers to investigate trends on large-scale corpora. We present work-in-progress results of a project investigating web design developments via quantitative and qualitative analysis. We report on our current approach and the first quantitative results we acquired.
2 | METHODS
2.1 | Corpus acquisition
We decided to analyze the top 50 most popular websites as of December 2019 according to the analytics platform Alexa.⁴ We filtered out any adult websites, which resulted in a list of 47 websites. We acquired one snapshot per day of these websites stored in the Wayback Machine, if a snapshot was available, for the timespan from 1996 to 2020. A snapshot is a stored representation of a website at a given time. This results in a corpus of 151,682 snapshots. However, the websites are represented rather unequally, with the most popular and oldest being most frequent. For this study, we limited this corpus to one snapshot per month per website (the first available snapshot per month) to avoid problems with the unequal distribution. This subcorpus consists of 7,953 snapshots (Tables 1 and 2), for which we scraped the HTML and took a screenshot as a TIFF file with a width of 1,920 pixels and a height according to the size of the website. To enable comparisons, we sliced the screenshots at a height of 3,000 pixels. The sample size for the early years is rather limited. However, beginning in 2003 the sample size is more representative, with around 300 snapshots per year. Figures 1 and 2 show two snapshots of amazon.com.

DOI: 10.1002/pra2.392
83rd Annual Meeting of the Association for Information Science & Technology, October 25-29, 2020. Author(s) retain copyright, but ASIS&T receives an exclusive publication license.
Proc Assoc Inf Sci Technol. 2020;57:e392. wileyonlinelibrary.com/journal/pra2. https://doi.org/10.1002/pra2.392
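The monthly snapshot selection described above can be sketched as follows. The paper does not describe its crawling implementation, so this is an assumption: we use the Wayback Machine's CDX API, whose collapse parameter deduplicates captures by a timestamp prefix (the first six digits of a Wayback timestamp are YYYYMM, i.e., one capture per month); the helper names are our own.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def cdx_query_url(site, start="1996", end="2020"):
    """Build a CDX API query that returns the first capture per month
    for `site` (query shape is our assumption, not the authors' code)."""
    params = {
        "url": site,
        "from": start,
        "to": end,
        "output": "json",
        "filter": "statuscode:200",
        # A Wayback timestamp is YYYYMMDDhhmmss; collapsing on the first
        # 6 digits keeps only the first capture of each month.
        "collapse": "timestamp:6",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

def first_per_month(timestamps):
    """Offline equivalent of the collapse step: keep the earliest
    14-digit timestamp for each YYYYMM prefix."""
    kept, seen_months = [], set()
    for ts in sorted(timestamps):
        month = ts[:6]
        if month not in seen_months:
            seen_months.add(month)
            kept.append(ts)
    return kept
```

Each kept timestamp can then be fetched as https://web.archive.org/web/&lt;timestamp&gt;/&lt;site&gt; and rendered to produce the screenshot.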
2.2 | Analysis metrics
We analyzed multiple quantitative metrics: (a) HTML metrics, (b) screenshot size, and (c) color metrics. We counted the number of images via the img-tag, the number of hyperlinks via the a-tag, and the overall number of words in the HTML. Another metric is the size of the screenshots, measured in kilobytes after transforming them to PNG files. This is an established metric to measure visual complexity and has been shown to correlate with human perception of complexity (Purchase et al., 2012); the larger the file, the more complex the image. We calculated the proportions of the base colors red, green, blue and yellow as well as white and black via OpenCV.⁵ Table 3 gives an overview of the RGB ranges we included.

TABLE 1  Websites of the corpus and number of snapshots

Website          #snapshots   Website              #snapshots   Website              #snapshots   Website          #snapshots
360.cn           152          Facebook.com         200          Office.com            98          Twitter.com      183
Aliexpress.com   121          Google.co.in         194          Okezone.com          155          Vk.com           169
Alipay.com       121          Google.com.hk        199          qq.com               202          Weibo.com        103
Amazon.co.jp     161          Google.com           242          Reddit.com           179          Wikipedia.org    191
Amazon.com       203          Instagram.com        114          Sina.com             218          Wordpress.com    184
Apple.com        242          Jd.com               132          Sohu.com             227          Xinhuanet.com    228
Babytree.com     184          Live.com             102          Stackoverflow.com    138          Yahoo.co.jp      245
Baidu.com        216          Login.tmall.com       92          Taobao.com           126          Yahoo.com        255
Bing.com         159          Microsoftonline.com  124          Tianya.cn            169          Yandex.ru        220
Blogspot.com     107          msn.com              241          Tmall.com            128          Youtube.com      178
Csdn.net         214          Naver.com            229          Tribunnews.com       110
Ebay.com         240          Netflix.com          153          Twitch.tv            105

TABLE 2  Website distribution per year

Year   #snapshots   Year   #snapshots
1996    10          2009   378
1997    23          2010   412
1998    26          2011   487
1999    70          2012   494
2000   150          2013   500
2001   183          2014   487
2002   182          2015   491
2003   284          2016   484
2004   317          2017   503
2005   349          2018   498
2006   362          2019   501
2007   394          2020    30
2008   338

FIGURE 1  Snapshot of Amazon.com (1998)
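The HTML metrics and the file-size complexity measure above can be sketched with the Python standard library. The tag names (img, a) and the kilobyte measure follow the text; the parsing details and function names are our assumptions, not the authors' code.

```python
import os
from html.parser import HTMLParser

class MetricCounter(HTMLParser):
    """Counts img and a tags and collects visible text from one page."""
    def __init__(self):
        super().__init__()
        self.images = 0
        self.links = 0
        self._text = []
        self._skip_depth = 0  # inside <script>/<style>, not visible text

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag == "a":
            self.links += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._text.append(data)

def html_metrics(html):
    """Number of images, hyperlinks and words for one HTML page."""
    parser = MetricCounter()
    parser.feed(html)
    words = len(" ".join(parser._text).split())
    return {"images": parser.images, "links": parser.links, "words": words}

def visual_complexity_kb(png_path):
    """PNG file size in kilobytes -- the visual-complexity proxy."""
    return os.path.getsize(png_path) / 1024
```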
3 | RESULTS
We averaged the number of image tags, hyperlink tags and words per year (see Figures 3-5). For the number of images and hyperlinks, we can identify a steady increase up until 2014, followed by a substantial decrease. A similar development can be found for text up until 2016.

Figure 6 shows the development of the average visual complexity as measured via the file size. For this metric, we identified a steady increase until 2020, with the highest leap in the time span from 2009 until 2014.

The color analysis shows that black and white are the most dominant colors, since white is the general background color and black the basic font color. Figure 7 illustrates the average proportion of black and white among the websites and shows that most websites consist of around 80% white. However, we observe a small decrease up until now.

While the overall proportion is much lower, we identified red and blue as the most used among the base colors, however without a consistent trend (Figure 8). Beginning in 2013, we can see a more diverse distribution among our analyzed colors, though without a striking trend or development. Please note that the proportions are overall very low, so the significant use of one color by very popular websites in a given year can lead to strong manifestations of this color, which is especially the case for years for which we do not have many snapshots (1996-2003).
4 | DISCUSSION
We identified a steady increase of HTML tags up until 2014 and a remarkable decrease afterwards. A similar but less pronounced trend is found for the number of words. We hypothesize that these results represent a trend in web design to include more and more content up until 2014, which is certainly connected to increased technical possibilities. Snapshots of msn.com in our corpus are representative of this trend (Figures 9 and 10). After 2014, this trend is followed by a tendency toward more minimalistic designs. This contrasts with the steady increase in visual complexity; the development of this metric, however, might also be explained by designers including more complex images and graphics as bandwidth increased. While the use of black and white remains dominant on websites to this day, we identified a higher use of red and blue in the early days and a more diverse color usage beginning in 2013. Our color analysis is, however, very limited, since we neglect a wide range of other colors.

We want to continue our research by including other quantitative metrics, such as alternatives for visual complexity and additional colors, and by analyzing the websites by category. We plan to increase the corpus, especially with websites that were popular in the years 1996-2010 but are not anymore, in order to get a more representative sample of these times. We pursue a mixed-methods approach and want to integrate qualitative analysis of a subset of the corpus to get a better understanding of design developments.

FIGURE 2  Snapshot of Amazon.com (2020)

TABLE 3  Overview of the used RGB color ranges

Color    Lower limit (R,G,B)   Upper limit (R,G,B)
Red      (140, 0, 0)           (255, 56, 50)
Green    (0, 170, 0)           (130, 255, 70)
Blue     (0, 0, 145)           (60, 115, 255)
Yellow   (230, 220, 0)         (255, 255, 55)
White    (240, 240, 240)       (255, 255, 255)
Black    (0, 0, 0)             (25, 25, 25)

FIGURE 3  Average number of images per year
FIGURE 4  Average number of hyperlinks per year
FIGURE 5  Average number of words per year
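The ranges in Table 3 translate directly into per-pixel color masks. A minimal sketch, assuming plain NumPy in place of the authors' OpenCV pipeline (an elementwise comparison against inclusive bounds mirrors cv2.inRange):

```python
import numpy as np

COLOR_RANGES = {  # (lower RGB, upper RGB), inclusive, from Table 3
    "red":    ((140, 0, 0), (255, 56, 50)),
    "green":  ((0, 170, 0), (130, 255, 70)),
    "blue":   ((0, 0, 145), (60, 115, 255)),
    "yellow": ((230, 220, 0), (255, 255, 55)),
    "white":  ((240, 240, 240), (255, 255, 255)),
    "black":  ((0, 0, 0), (25, 25, 25)),
}

def color_proportions(rgb_image):
    """Fraction of pixels per base color for an (H, W, 3) uint8 RGB array."""
    pixels = rgb_image.reshape(-1, 3)
    total = len(pixels)
    props = {}
    for name, (lo, hi) in COLOR_RANGES.items():
        # True where all three channels fall inside the inclusive range
        mask = np.all((pixels >= lo) & (pixels <= hi), axis=1)
        props[name] = mask.sum() / total
    return props
```

Note that, as in the paper, pixels outside all six ranges are simply not counted, so the proportions need not sum to one.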
ENDNOTES
¹ https://en.unesco.org/themes/information-preservation/digital-heritage
² https://archive.org/
³ https://archive.org/web/
⁴ https://www.alexa.com/
⁵ https://opencv.org/
FIGURE 6  Average size of PNG-files per year
FIGURE 7  The average proportion of black and white among websites
FIGURE 8  The average proportion of red, blue, green and yellow per year
FIGURE 9  Snapshot of Msn.com (1996)
FIGURE 10  Snapshot of Msn.com (2014)
REFERENCES
Brügger, N. (2012). When the present web is later the past: Web historiography, digital history, and internet studies. Historical Social Research / Historische Sozialforschung, 37(4), 102-117.
Jørgensen, A. H., & Myers, B. A. (2008). User interface history. In CHI '08 Extended Abstracts on Human Factors in Computing Systems (pp. 2415-2418).
Purchase, H. C., Freeman, E., & Hamer, J. (2012). An exploration of visual complexity. In International Conference on Theory and Application of Diagrams (pp. 200-213). Berlin, Heidelberg: Springer.
How to cite this article: Schmidt T, Mosiienko A, Faber R, Herzog J, Wolff C. Utilizing HTML-analysis and computer vision on a corpus of website screenshots to investigate design developments on the web. Proc Assoc Inf Sci Technol. 2020;57:e392. https://doi.org/10.1002/pra2.392