Abstract— Over the last decade, defacement attacks on
websites and web applications have been considered a
critical threat by many private and public organizations. A
defacement attack can have severe effects on the owner's
website, such as immediate interruption of website
operations and damage to the owner's reputation, which in
turn may lead to large financial losses. Many solutions have
been studied and deployed for monitoring and detecting
defacement attacks, ranging from simple comparison
methods to more complex ones. However, some solutions
only work on static web-pages, while others can handle
dynamic web-pages but generate a high level of false
alarms. This paper proposes a Convolutional Neural
Network (CNN)-based detection model for website
defacements. The model is an extension of previous models
based on traditional supervised machine learning
techniques, and it aims to improve the detection rate and
reduce the false alarm rate. Experiments conducted on a
dataset of 100,000 web-pages show that the proposed model
performs significantly better than models based on
traditional supervised machine learning.
Keywords— CNN-based Model for Defacement
Detection, Defacement Attacks on Websites, Detection of
Website Defacements.
I. INTRODUCTION
Defacements of websites and web applications are a
class of web attacks that amend the web content and
thus change its appearance [1][2]. Fig. 1 shows the website
of the UK National Health Service (NHS), which was
defaced in 2018 with the message "Hacked by AnoaGhost
– Typical Idiot Security" left on the site [1]. It was
reported that the NHS website may have been defaced for
as long as 5 days. In 2019, 15,000 websites of government
organizations, banks, press agencies and television
broadcasters in Georgia, a small European country, were
defaced and taken offline [1]. According to a recent report,
the number of defacement attacks on websites globally
rose sharply during the coronavirus lockdown, with
increases of about 51% and 65% in April and May of 2020
compared to the same months of 2019, respectively [3].
Fig. 2 shows the website of a UK-based canoe and kayak
club that was defaced in 2020 [3].
Contact author: Hoang Xuan Dau, email: dauhx@ptit.edu.vn
Received: / /2021, Revised: / /2021, Accepted: / /2021.
A number of reasons why websites and web
applications are defaced have been pointed out. Among
them, the prime reason is severe security vulnerabilities
existing in websites, web applications, or hosting servers
that allow attackers to upload files to the servers or to gain
access to the websites' administrative pages. Common
website vulnerabilities include XSS (Cross-Site
Scripting), SQLi (SQL injection), file inclusion,
inappropriate account and password administration, and
out-of-date software [1][2].
Fig. 1. The UK NHS website defaced in 2018 [1]
Fig. 2. A website of a UK-based canoe and kayak club recently
defaced [3]

A CNN-BASED MODEL FOR DETECTING
WEBSITE DEFACEMENTS
Hoang Xuan Dau*, Nguyen Trong Hung+
* Posts and Telecommunications Institute of Technology
+ Academy of People's Security

Defacement attacks on websites can cause serious
damages to their owners. The attacks can cause instant
discontinuance to the normal operations of websites,
harm the reputation of website owners and lead to
possible data leakages. These in turn may result in large
financial losses [1][2]. Because of the wide spread of
defacement attacks and their serious consequences, many
defensive measures have been researched and deployed
in practice. Current defensive measures against defacement
attacks can be classified into three main groups: (1) scanning
and fixing website security vulnerabilities; (2) using
website defacement monitoring and detecting tools, such
as Nagios Web Application Monitoring Software [4],
Site24x7 Website Defacement Monitoring [5] and
WebOrion Defacement Monitor [6]; and (3) using
various methods to detect website defacement attacks.
This paper proposes a detection model for website
defacements, which is based on the Convolutional Neural
Network (CNN). The proposed CNN-based model is an
alternative to the traditional machine learning-based
model proposed in [11], in which we exploit the
power of the CNN-based text classification scheme to
solve the problem of website defacement detection. In the
proposed model, CNN learning is used to construct the
model from the training data and then the model is used
to classify monitored web-pages into either Normal or
Defaced class.
The remainder of this paper is structured as follows:
Section II discusses some closely related works; Section
III describes our proposed model; Section IV shows
experiments and results, and Section V is the paper
conclusion.
II. RELATED WORKS
There have been some proposed techniques and tools
for monitoring and detecting defacement attacks on
websites and web applications. However, due to the scope
of this paper, this section provides a review of some typical
approaches belonging to group (3) mentioned in Section
I. These approaches comprise traditional methods and
more complex, advanced methods, which are discussed
in the following sub-sections.
A. Traditional Methods for Detecting Defacements
Traditional methods for website defacement detection
include checksum comparison, diff comparison and
DOM tree analysis. Checksum comparison is the
simplest technique for detecting changes in web-pages.
First, the checksum of the web-page content is calculated
using a hashing algorithm, such as MD5 or SHA1, and
saved to the detection profile. Then, as the web-page is
monitored, a new checksum is computed and compared
with the checksum stored in the detection profile. If the
two checksums differ, a defacement alarm is raised. This
technique works well for static web-pages. For dynamic
web-pages, such as e-commerce or forum pages, it is not
applicable because their content changes frequently
[11][12][13].
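The checksum-comparison technique described above can be sketched in a few lines of Python using the standard hashlib module. This is a minimal illustration of the idea, not the implementation of any particular monitoring tool:

```python
import hashlib

def page_checksum(html: str) -> str:
    """Compute a SHA-1 checksum of a web-page's HTML content."""
    return hashlib.sha1(html.encode("utf-8")).hexdigest()

def is_defaced(current_html: str, profile_checksum: str) -> bool:
    """Raise an alarm when the current checksum differs from the profile."""
    return page_checksum(current_html) != profile_checksum

# Build the profile from the page in its normal state.
profile = page_checksum("<html><body>Welcome</body></html>")

# Any later modification changes the checksum and triggers an alarm.
assert not is_defaced("<html><body>Welcome</body></html>", profile)
assert is_defaced("<html><body>Hacked!</body></html>", profile)
```

As the text notes, even a legitimate content update changes the checksum, which is why this scheme fails for dynamic pages.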
In the diff comparison method, the DIFF tool, which is
widely available on Linux and UNIX systems, is used to
find the difference between two versions of a web-page's
content. The most difficult task is determining an anomaly
threshold as the input to the monitoring process for each
web-page. This technique is relatively effective and works
well for dynamic websites if the anomaly detection
threshold is chosen properly [11][12][13].
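A similar difference-based check can be sketched in Python with the standard difflib module in place of the Unix diff tool; the 0.2 threshold below is an illustrative assumption, since, as noted above, the threshold must be tuned per page:

```python
import difflib

def change_ratio(old_html: str, new_html: str) -> float:
    """Fraction of content that differs between two page versions (0.0 = identical)."""
    similarity = difflib.SequenceMatcher(None, old_html, new_html).ratio()
    return 1.0 - similarity

# Hypothetical per-page anomaly threshold (the hard part noted above).
THRESHOLD = 0.2

old = "<html><body><h1>News</h1><p>Today's stories...</p></body></html>"
minor_update = "<html><body><h1>News</h1><p>Tomorrow's stories...</p></body></html>"
defaced = "<html><body><h1>HACKED BY ANONYMOUS</h1></body></html>"

assert change_ratio(old, minor_update) < THRESHOLD   # normal edit: no alarm
assert change_ratio(old, defaced) > THRESHOLD        # large change: alarm
```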
Document Object Model (DOM) is an Application
Programming Interface (API) that defines the logical
structure of HTML documents, or web-pages. DOM can
be used to scan and analyze the structure of the web-page.
The DOM tree analysis technique is used to detect changes
in the page structure, rather than changes in the page
content. First, the page structure is extracted from the
page content under normal working conditions and stored
in the detection profile. Then, the structure of the
monitored page is extracted and compared with the
structure saved in the detection profile. If a significant
difference between the two structures is found, a
defacement alarm is raised. Generally, this method works
well for web-pages with stable structures. However, it
cannot detect unauthorized modifications to the
web-pages' content [11][12][13].
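A simplified version of this structure check can be written with Python's standard html.parser. The sketch below approximates the DOM tree by the sequence of opening tags, which is enough to illustrate why content-only changes go undetected:

```python
from html.parser import HTMLParser

class StructureExtractor(HTMLParser):
    """Collect the sequence of opening tags, approximating the page structure."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def page_structure(html: str) -> list:
    parser = StructureExtractor()
    parser.feed(html)
    return parser.tags

normal = "<html><body><div><h1>Title</h1><p>Text</p></div></body></html>"
same_structure = "<html><body><div><h1>Other</h1><p>More</p></div></body></html>"
restructured = "<html><body><h1>Hacked</h1></body></html>"

profile = page_structure(normal)
assert page_structure(same_structure) == profile  # content changed, structure intact
assert page_structure(restructured) != profile    # structure changed: alarm
```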
B. Complicated Methods for Detecting Defacements
Complicated methods for detecting website
defacements include those based on statistics [7],
genetic programming [8][9], page screenshot analysis
[10] and supervised machine learning [11][12][13]. Kim
et al. [7] proposed a statistical model based on the 2-gram
technique to construct a profile from normal web-pages
for monitoring and detecting defacements, as shown in
Fig. 3. Each normal web-page of the training set is
converted to a vector, in which the page's HTML content
is split into substrings using the 2-gram method and the
substrings' occurrence frequencies are counted. The
detection profile is composed of the vectors of all normal
pages of the training set. Each monitored web-page is
then also converted to a vector, which is compared with
the page's vector stored in the profile using the cosine
distance. If the difference is greater than a threshold, an
alarm is raised. The paper also proposed an algorithm to
generate a dynamic threshold for each web-page to reduce
false alarms. The method's major shortcoming is that, for
monitored pages whose content changes frequently, the
periodically adjusted thresholds are not appropriate, so
the method still generates a high level of false alarms.
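The core of this scheme, 2-gram frequency vectors compared by cosine distance, can be sketched as follows. This is an illustration of the general technique only; Kim et al.'s dynamic per-page thresholding is omitted:

```python
import math
from collections import Counter

def bigram_vector(text: str) -> Counter:
    """Count occurrence frequencies of character 2-grams."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse frequency vectors."""
    dot = sum(a[g] * b[g] for g in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

profile = bigram_vector("<html><body><p>Daily news update</p></body></html>")
normal = bigram_vector("<html><body><p>Daily news report</p></body></html>")
defaced = bigram_vector("HACKED BY SOMEONE -- YOUR SECURITY IS BAD")

# A normal update stays much closer to the profile than a defacement.
assert cosine_distance(profile, normal) < cosine_distance(profile, defaced)
```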
Bartoli et al. [8] and Davanzo et al. [9] proposed using
genetic programming to construct the detection profile
for defacement attacks. In their approaches, information
from monitored web-pages is collected and extracted
using 43 sensors embedded in the web-pages. Each
web-page's information is then converted into a
1466-element vector. In the training stage, normal
web-pages are collected and vectorized to construct the
profile based on genetic programming. In the detection
stage, the monitored page is collected, vectorized and
compared with the profile to look for differences. The
major drawback of their approaches is that they require
extensive computing resources to construct the detection
profile, since very large page vectors and slowly-converging
genetic programming are used.
Fig. 3. Detection process for web-page defacements proposed by Kim
et al. [7]
Fig. 4. Meerkat's architecture based on deep neural network [10]
Borgolte et al. [10] proposed Meerkat shown in Fig. 4,
which is a system based on image object recognition of
web-page screenshots using computer vision techniques
for detecting defacement attacks. The system first builds
a profile of screenshots of normal web-pages. It then
takes the monitored web-page’s screenshot and conducts
analysis to find the difference between the page’s current
screenshot and its normal screenshots stored in the profile
based on high-level screenshot features using advanced
learning methods, such as stacked auto-encoder and deep
neural network. Experiments conducted on 10 million
defaced web-pages and 2.5 million normal web-pages
show that the system achieves a high detection accuracy
of 97.422% to 98.816% and a low false positive rate of
0.547% to 1.528%. Meerkat's advantages are that the
profile can be constructed automatically and that the
system was tested on a large dataset. However, its major
disadvantage is that it requires extensive computational
resources for highly complex image processing and
recognition.
Hoang et al. [11][12][13] proposed several models for
detecting website defacements, including a machine
learning-based model, a hybrid model and a multi-layer
model. The main idea behind these models is that they
use traditional supervised machine learning algorithms,
such as naive Bayes, decision tree and random forest, to
construct the detection models. Specifically, the problem
of defacement detection is transformed into a text
classification problem over web-pages' HTML content.
The dataset used to train the detection model is a
combination of normal web-pages and defaced
web-pages. The detection model is then used to classify
monitored web-pages into either the Normal or the
Attacked class. The approach's strong points are that (1)
the detection model can be built automatically from the
training data and (2) the overall detection accuracy is
high. However, its main drawbacks are that (1) the false
positive and false negative rates are still relatively high
and (2) the experimental datasets of only about
1,000-3,000 web-pages are too small to guarantee a high
level of reliability in the reported results.
In this paper, we extend the defacement detection
model proposed in [11][12][13] by using a CNN, a deep
learning method, instead of traditional supervised
machine learning algorithms to build our detection
model, in order to increase the detection rate and reduce
the false alarm rate. Furthermore, we prepare a much
larger dataset for our experiments in order to
comprehensively validate the proposed model.
III. PROPOSED MODEL FOR DETECTING
DEFACEMENT ATTACKS
A. The Proposed Detection Model
The proposed detection model for defacement attacks
is composed of two stages: the training stage and the
detection stage. The training stage, as shown in Fig. 5,
consists of the following three steps:
• Collection of the training dataset: The dataset for
training is a combination of normal web-pages
and defaced web-pages. Normal web-pages are
downloaded from various websites under normal
working conditions. Defaced web-pages are
downloaded from Zone-H.org [17].
• Pre-processing: In this step, we use the n-gram
technique to extract the training features from each
web-page's full content, including HTML code
and pure text. Based on the analysis in previous
research [11][12][13], we select 2-grams and
3-grams to extract the page features and then use
TF-IDF (Term Frequency - Inverse Document
Frequency) [16] to compute the value of each
feature. As a result, each web-page is converted to
a vector and the training dataset is transformed into
the training array.
• Training: The CNN is used as the training
algorithm to construct the Classifier (Model) from
the training array.
Fig. 5. Proposed detection model for defacement attacks: Training stage
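The pre-processing step can be sketched as follows. This is a simplified, pure-Python illustration of character n-gram extraction with TF-IDF weighting; the paper's 8,000-feature selection and the exact TF-IDF variant used are not reproduced here:

```python
import math
from collections import Counter

def ngrams(text: str, sizes=(2, 3)) -> list:
    """Extract character 2-grams and 3-grams from a page's content."""
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]

def tfidf_vectors(pages: list) -> list:
    """Convert each page into a {feature: tf-idf} vector."""
    counts = [Counter(ngrams(p)) for p in pages]
    n_docs = len(pages)
    df = Counter()                      # document frequency of each n-gram
    for c in counts:
        df.update(set(c))
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({g: (f / total) * math.log(n_docs / df[g])
                        for g, f in c.items()})
    return vectors

pages = ["<html><body>Normal page</body></html>",
         "<html><body>HACKED BY X</body></html>"]
vecs = tfidf_vectors(pages)
assert len(vecs) == 2
# n-grams shared by every page get idf = log(1) = 0
assert vecs[0]["<h"] == 0.0
```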
The detection stage, as illustrated in Fig. 6, also
includes three steps:
• Collection of the monitored web-page: The
HTML code of the monitored page is downloaded
for pre-processing.
• Pre-processing: The monitored web-page's
content is processed to extract features and form
the page vector, using the same method applied to
each page of the training dataset.
• Classification: The page vector is classified using
the Classifier built in the training stage. The result
of this step is the page status: either Normal or
Defaced.
Fig. 6. Proposed detection model for defacement attacks: Detection
stage
B. Training the Detection Model Using CNN
As previously mentioned, we use the CNN algorithm to
construct our detection model for website defacements
from the training data. The CNN algorithm is selected
because it is fast and has been widely used with good
performance in many areas of computer science, such as
image processing and recognition, and natural language
processing [14][15]. Fig. 7 describes the CNN structure
used in the proposed model, in which a Conv_1D layer
is followed by a Flatten layer and 4 fully-connected
layers (Dense 1, 2, 3 and 4) that generate the output. The
ELU activation function and the Softmax output function
are used in the layers.
Fig. 7. The CNN structure used in the proposed detection model for
website defacements
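The structure in Fig. 7 can be sketched in Keras as follows. This is a hedged illustration: the paper specifies only the layer types (one Conv_1D layer, a Flatten layer, four Dense layers, ELU activations and a Softmax output), so the filter count, kernel size and layer widths below are assumed values, not the paper's configuration:

```python
from tensorflow.keras import layers, models

# Layer sequence follows Fig. 7: Conv_1D -> Flatten -> 4 Dense layers.
# Filter counts, kernel size and widths are illustrative assumptions.
model = models.Sequential([
    layers.Input(shape=(8000, 1)),           # one 8000-feature page vector
    layers.Conv1D(64, kernel_size=3, activation="elu"),
    layers.Flatten(),
    layers.Dense(128, activation="elu"),     # Dense 1
    layers.Dense(64, activation="elu"),      # Dense 2
    layers.Dense(32, activation="elu"),      # Dense 3
    layers.Dense(2, activation="softmax"),   # Dense 4: Normal vs Defaced
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```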
The ELU (Exponential Linear Unit) [14] activation
function is defined as follows:

    f(x) = x,              if x > 0
    f(x) = α(exp(x) - 1),  if x ≤ 0          (1)

where α = 1, as recommended in [14]. We select the ELU
function because it can produce relatively low error rates
with average training time.
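Equation (1) with α = 1 can be checked with a few lines of Python:

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU activation: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

assert elu(2.0) == 2.0             # identity for positive inputs
assert elu(0.0) == 0.0             # alpha * (e^0 - 1) = 0
assert -1.0 < elu(-5.0) < 0.0      # saturates toward -alpha for large negatives
```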
C. Performance Measurement
We use 6 metrics, including TPR (True Positive Rate,
or Recall), FPR (False Positive Rate), FNR (False
Negative Rate), PPV (Positive Predictive Value, or
Precision), F1 (F1-Score) and ACC (Overall Accuracy),
to measure the proposed model's performance, as follows:

    TPR = TP / (TP + FN)                       (2)
    FPR = FP / (FP + TN)                       (3)
    FNR = FN / (FN + TP)                       (4)
    PPV = TP / (TP + FP)                       (5)
    F1 = 2 x (PPV x TPR) / (PPV + TPR)         (6)
    ACC = (TP + TN) / (TP + TN + FP + FN)      (7)
where TP, FP, FN and TN are elements of the
confusion matrix given in Table I.
TABLE I. TP, FP, FN AND TN IN THE CONFUSION MATRIX

                        Actual: Defaced        Actual: Normal
Predicted: Defaced      TP (True Positives)    FP (False Positives)
Predicted: Normal       FN (False Negatives)   TN (True Negatives)
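The six metrics follow directly from the confusion-matrix counts; a small sketch with hypothetical counts (not values from the paper's experiments):

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the six performance metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)                      # recall, Eq. (2)
    fpr = fp / (fp + tn)                      # Eq. (3)
    fnr = fn / (fn + tp)                      # Eq. (4)
    ppv = tp / (tp + fp)                      # precision, Eq. (5)
    f1 = 2 * ppv * tpr / (ppv + tpr)          # Eq. (6)
    acc = (tp + tn) / (tp + tn + fp + fn)     # Eq. (7)
    return {"TPR": tpr, "FPR": fpr, "FNR": fnr,
            "PPV": ppv, "F1": f1, "ACC": acc}

# Hypothetical counts for illustration only.
m = metrics(tp=90, fp=5, fn=10, tn=95)
assert m["TPR"] == 0.9
assert m["FPR"] == 0.05
assert round(m["ACC"], 3) == 0.925
```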
IV. EXPERIMENTS AND RESULTS
A. Experimental Dataset
The experimental dataset used in this paper consists of
a subset of normal web-pages and another subset of
defaced web-pages. We developed a small tool, written in
JavaScript and running on Node.js with the Puppeteer
library, to download and process the HTML code of
web-pages. Specifically, the two subsets of the dataset
of 100,000 web-pages are as follows:
• The normal web-pages are composed of 40,000
web-pages in normal working conditions. These
web-pages are home pages of well-known
websites in Vietnam and in the world, including
news portals, e-commerce sites, online services
sites and forum sites. These websites are selected
from the top 1 million websites listed by Alexa [18].
• The defaced web-pages consist of 60,000 web-
pages, which are collected from Zone-H.org [17].
Downloaded defaced web-pages are checked and
any duplicated pages are removed.
B. Pre-processing, Training and Validation Testing
The collected dataset is pre-processed using the n-gram
and TF-IDF techniques to convert web-pages into the
training array of web-page vectors. Based on previous
works [11][12][13], we select a set of 8,000 n-gram
features to create the web-page vectors. Vectors from
normal web-pages are labelled "normal" and those from
defaced web-pages are labelled "defaced". The training
array is then ready for the training stage, in which the
detection model is constructed and validated.
We use 2 traditional machine learning algorithms,
decision tree and random forest, as proposed in
[11][12][13] for defacement detection, together with the
CNN algorithm, in the training stage to build different
defacement detection models for performance
comparison. For the models based on decision tree and
the 50-tree random forest, 10-fold cross-validation is
used. For the CNN-based model, epochs = 64 and
batch_size = 32 are used in training and validation. For
each run, 75% of the dataset is used for training and 25%
for validation testing. The final performance metrics are
computed as the average of the measured values over all
runs.
C. Experimental Results and Comments
Table II compares the detection performance of our
CNN-based model with that of the decision tree-based
model [11] and the random forest (RF)-based model
[12][13]. From the experimental results given in Table II,
we can draw the following observations:
• Our CNN-based model performs better than
previous models based on the traditional
supervised machine learning methods of decision
tree [11] and random forest [12][13]. Specifically,
our model's measurements are considerably
higher than those of the decision tree-based model
[11]. However, the proposed model's ACC and F1
are only slightly better than those of the random
forest-based model [12][13].
• Although the proposed model's ACC and F1 are
only slightly better than those of the random
forest-based model [12][13], its false alarm rates
(FPR and FNR) are significantly lower than those
of both the decision tree-based model [11] and the
random forest-based model [12][13]. Low false
alarm rates are very important for any practical
solution.
TABLE II. THE PROPOSED MODEL'S DETECTION PERFORMANCE
VERSUS THE PERFORMANCE OF [11][12][13] (IN %)

Detection models                  PPV     TPR     FPR    FNR    ACC     F1
Decision tree-based model [11]    97.47   97.85   3.82   2.15   97.18   97.66
RF-based model [12][13]           98.91   98.15   1.63   1.85   98.24   98.53
Our CNN-based model               98.55   98.61   0.97   1.39   98.86   98.61
V. CONCLUSION
This paper proposes a CNN-based model for detecting
website defacement attacks. In our model, we exploit the
CNN's strong classification capability to solve the
problem of website defacement detection. Experiments
conducted on a dataset of 100,000 web-pages show that
the proposed CNN-based detection model outperforms
previous models based on the traditional supervised
machine learning methods of decision tree [11] and
random forest [12][13]. In particular, the false alarm
rates, including the false positive and false negative rates,
are reduced significantly compared to those of the
previous models.
One shortcoming of our model is that it requires more
computing resources, because CNN is generally more
computationally intensive than traditional supervised
machine learning counterparts, such as decision tree and
random forest. For future work, we will carry out an
extensive assessment of all execution steps of the model
and find solutions to lower its computational
requirements.
REFERENCES
[1] Imperva, Website Defacement Attack, https://www.imperva.com/
learn/application-security/website-defacement-attack/, last
accessed 2020/11/10.
[2] Trend Micro, The Motivations and Methods of Web Defacement,
https://www.trendmicro.com/en_us/research/18/a/hacktivism-
web-defacement.html, last accessed 2020/11/10.
[3] Government Technology, The Coronavirus Pandemic Moved Life
Online – a Surge in Website Defacing Followed,
https://www.govtech.com/security/The-Coronavirus-Pandemic-
Moved-Life-Online--a-Surge-in-Website-Defacing-
Followed.html, last accessed 2020/11/10.
[4] Nagios Enterprises, LLC. Web Application Monitoring Software
with Nagios. https://www.nagios.com/solutions/web-application-
monitoring/, last accessed 2020/11/10.
[5] Site24x7. Website Defacement Monitoring.
https://www.site24x7.com/monitor-webpage-defacement.html,
last accessed 2020/11/10.
[6] Banff Cyber Technologies. WebOrion Defacement Monitor.
https://www.weborion.io/website-defacement-monitor/, last
accessed 2020/11/10.
[7] W. Kim, J. Lee, E. Park, S. Kim. 2006. Advanced Mechanism for
Reducing False Alarm Rate in Web Page Defacement Detection.
National Security Research Institute, Korea.
[8] A. Bartoli, G. Davanzo and E. Medvet. 2010. A Framework for
Large-Scale Detection of Web Site Defacements. ACM
Transactions on Internet Technology, Vol.10, No.3, Art.10.
[9] G. Davanzo, E. Medvet and A. Bartoli. 2011. Anomaly detection
techniques for a web defacement monitoring service. Journal of
Expert Systems with Applications, 38 (2011) 12521–12530,
doi:10.1016/j.eswa.2011.04.038, Elsevier.
[10] K. Borgolte, C. Kruegel and G. Vigna. 2015. Meerkat: Detecting
Website Defacements through Image-based Object Recognition.
In: Proceedings of the 24th USENIX Security Symposium
(USENIX Security).
[11] X.D. Hoang. 2018. A Website Defacement Detection Method
Based on Machine Learning Techniques. In SoICT ’18: Ninth
International Symposium on Information and Communication
Technology, December 6–7, 2018, Da Nang City, Viet Nam.
ACM, New York, NY, USA, 6 pages.
https://doi.org/10.1145/3287921.3287975.
[12] X.D. Hoang, N.T. Nguyen. 2019. Detecting Website Defacements
Based on Machine Learning Techniques and Attack Signatures,
Computers 2019, 8, 35; doi:10.3390/computers8020035.
[13] X.D. Hoang, N.T. Nguyen. 2019. A Multi-layer Model for
Website Defacement Detection. In SoICT’19: Tenth
International Symposium on Information and Communication
Technology, December 4 – 6, 2019 | Hanoi - Ha Long Bay,
Vietnam. ACM, New York, NY, USA, 6 pages.
https://doi.org/10.1145/3368926.3369730.
[14] D-A. Clevert, T. Unterthiner and S. Hochreiter. 2015. Fast and
accurate deep network learning by exponential linear units (elus).
Available online: https://arxiv.org/abs/1511.07289.
[15] N.K. Sangani, H. Zarger. 2017. “Machine Learning in Application
Security,” Book chapter in "Advances in Security in Computing
and Communications", IntechOpen.
[16] X.D. Hoang. 2021. Detecting Common Web Attacks Based on
Machine Learning Using Web Log. In: Sattler KU., Nguyen D.C.,
Vu N.P., Long B.T., Puta H. (eds) Advances in Engineering
Research and Application. ICERA 2020. Lecture Notes in
Networks and Systems, vol 178. Springer, Cham.
https://doi.org/10.1007/978-3-030-64719-3_35
[17] Zone-H.org, http://zone-h.org/?hz=1, last accessed 2020/11/10.
[18] DN Pedia – Top Alexa one million domains. Available online:
https://dnpedia.com/tlds/topm.php, last accessed 2020/11/10.
Hoang Xuan Dau received his bachelor degree in
informatics in 1994 from the Hanoi University of Science
and Technology. He then received his master and PhD
degrees in computer science from RMIT University,
Australia, in 2000 and 2006, respectively. He is currently
a senior lecturer in the Faculty of Information
Technology, Posts and Telecommunications Institute of
Technology. His research interests include attack and
intrusion detection, malware detection, system and
software security, web security, and machine
learning-based applications for information security.
Nguyen Trong Hung received his bachelor degree in
information technology in 2013 from the Academy of
People's Security. He then received his master degree in
information security from the Academy of Cryptographic
Techniques in 2018. He is currently a lecturer in the
Faculty of Information Technology and Security,
Academy of People's Security. His research interests
include attack and intrusion detection, malware detection,
and web security.