All content in this area was uploaded by Khaled Elleithy on Dec 18, 2013
Content may be subject to copyright.
Novel Steganography over HTML Code
Ammar Odeh, Khaled Elleithy, Miad Faezipour, and Eman Abdelfattah
Department of Computer Science & Engineering,
University of Bridgeport
Bridgeport, CT 06604, USA
{aodeh, elleithy, faezipour, eman}@bridgeport.edu
Abstract—Different security strategies have been developed
to protect the transfer of information between users. This has
become especially important after the tremendous growth of
internet use. Encryption techniques convert readable data into
a ciphered form. Other techniques hide the message in another
file, and some powerful techniques combine hiding and
encryption concepts. In this paper, a new security algorithm is
presented that uses Steganography over HTML pages. Hiding
the information inside HTML page code comments and
employing encryption can reduce the possibility of discovering
the hidden data. The proposed algorithm applies some
statistical concepts to create a frequency array to determine
the occurrence frequency of each character. The encryption
step depends on two simple logical operations to change the
data form to increase the complexity of the hiding process. The
last step is to embed the encrypted data as comments inside the
HTML page. This new algorithm comes with many
advantages, such as generality and applicability to different
spoken languages, and it can be extended to other Web
programming pages such as XML and ASP.
Key words—Steganography, Carrier file, Encryption, HTML
code
I. INTRODUCTION
The rapid growth of the Internet has led to an increasing
demand for security mechanisms to facilitate the
transfer of sensitive information through different
networks. Since the Internet is a public medium used to
transfer information between different parties [1], hackers
can exploit the contents of messages exchanged between communicating
parties. In response, different methods have been
developed to prevent attempts to break or expose actual
messages. Encryption algorithms reported in the literature
protect sensitive information by converting plaintext into
ciphertext. Modern encryption algorithms depend on
sophisticated mathematical operations to change the
form of the information. Other techniques depend on concealing
the existence of the message, which is called Steganography [2].
As Figure 1 shows, Steganography consists of three main
components: the embedding algorithm, the carrier file, and the
hidden message.
Figure 1. Embedding Algorithm
The carrier file plays an important role in designing
steganography algorithms. Image, audio, video, and text are
different media used frequently over the Internet [3]. Each of
these carrier file types has certain characteristics that enable
the user to insert data inside it. Image files are the most
widely used carrier files, since they contain a high ratio of
redundant data [4]. On the other hand, it is not easy to use
the same image to hide different messages, since comparing
similar images may allow attackers to expose the concealed
data. Audio files are represented as sine or cosine waves.
Some techniques suggest shifting the phase to hide zeros and
ones [5]. Text files are the most difficult carrier files,
since they contain little redundant data compared to
other carriers [6].
Text Steganography is classified into different categories.
One of the most popular text Steganography methods is
semantic Steganography [7]. This technique makes use of
synonyms in the same language or similar languages such as
American English and British English. This is done by
creating a dictionary of synonyms and exchanging words to
pass a zero or a one. Other categories hide data depending on the
language syntax. These are among the earliest techniques,
and they employ the physical format of text to conceal
information. Other scenarios employ linguistic properties to
hide data and depend on the file generation to convey the
information [8].
In section II of this paper, prior work is presented and
compared. The proposed algorithm is discussed in section
III. Experimentation and results are demonstrated in section
IV. The algorithm is analyzed in section V. Finally, section
VI offers conclusions.
II. PRIOR WORKS
HTML, or HyperText Markup Language, is the basic
markup language for web pages, and it can be
combined with other technologies such as Macromedia Flash
and JavaScript for animation purposes [9]. Moreover, HTML
does not need special software for authoring. Most of the
new web programming languages are based on HTML
concepts. Generally, HTML is used to create the static part
of websites. HTML code consists of two parts: i) tags, which
are surrounded by angle brackets (< >), and ii) the
information between tags. Internet browsers display only the
content without the tags, since tags control the appearance of the
web page content. Tags determine the page organization and
design, and are not case-sensitive.
HTML represents the source code of a web page.
However, Internet users are concerned only with the web
page information. Based on this assumption, most
Steganography algorithms over web pages work on the
code of the web page and not on the page's information.
In [10], a text Steganography technique was presented that uses HTML
files. The authors classified HTML attributes into two categories: primary
attributes and secondary attributes. If a secondary attribute is
followed by a primary attribute, then a 0 bit is detected; otherwise,
a 1 is detected. The authors suggested applying two steps.
The first step is encryption to improve the message security.
The second step applies HTML Steganography
to hide the bits. The HTML Steganography
algorithm consists of three main steps. The first step is a
scanning process that searches all tags in the web page and
classifies their attributes into two categories: primary attributes and
secondary attributes. After the analysis, the hidden bit is read
from the hidden file. If the hidden bit is 1, the consecutive primary and
secondary attributes are swapped; otherwise, no
change is applied. The main advantages of this algorithm are
that it can be applied in different languages without any
change in the file format or size, and that the changes in the code
file have no effect on the displayed web information. In addition,
many web pages contain a lot of information that is publicly
available and coded in HTML. Moreover, most of the new
versions of web programming languages apply the same
HTML concepts, since all of them use tags.
Most programming languages improve their readability
and code documentation using non-compiled notes called
comments [11]. HTML files support this feature by
adding "<!--" at the beginning of the comment and "-->"
at the end of the comment. Comments do not appear in the
Internet browser, so Internet users see no
change in the website appearance because of the comments.
This implies that a large amount of data can be inserted inside
the web page without being noticed by the users. In addition,
comments can be added at any location inside the file.
The main advantage of this method is that a large amount
of data can be inserted in the carrier file, and the hidden data
can be inserted anywhere in the HTML document. The hidden
data is readable in the page source; however, Internet users generally do not
explore the page code. Only programmers read
the comments, in order to understand the programming
methods.
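The property this method relies on, that comment text is kept apart from the displayed text, can be checked with Python's standard html.parser module. The following is a minimal sketch; the class name CommentSplitter, the function split_page, and the sample page are illustrative and do not come from the cited work.

```python
from html.parser import HTMLParser


class CommentSplitter(HTMLParser):
    """Collects displayed text and comment text separately (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.visible = []    # text a browser would display
        self.comments = []   # text found between <!-- and -->

    def handle_data(self, data):
        self.visible.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


def split_page(html):
    """Return (displayed text, list of comment bodies) for an HTML string."""
    parser = CommentSplitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.visible), parser.comments


page = "<html><body><p>Hello</p><!-- hidden note --></body></html>"
text, comments = split_page(page)
print(text)      # Hello
print(comments)  # [' hidden note ']
```

The displayed text is the same whether or not the comment is present, which is why the carrier page looks unchanged to a viewer, while anyone who opens the page source can read the comment directly.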
In [12] [13], other HTML Steganography algorithms were
presented that use one of the HTML characteristics,
changing the letter case of tags, to hide data. In these algorithms,
uppercase corresponds to 1 and lowercase to 0.
For web page viewers, there is no difference between upper and lower case
letters in the HTML code. The
advantages of this method are similar to those of other methods
in which the hidden data does not appear in the web browser.
Moreover, a large amount of data can be hidden inside the
HTML files. On the other hand, printing the file
removes the hidden data.
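The tag-case idea can be sketched as follows. This is a simplified illustration and not the implementation from [12] or [13]: it assumes one bit per tag (the whole tag name is upper-cased for 1 and lower-cased for 0), and the function names and regular expression are hypothetical.

```python
import re

# Matches the name of an opening or closing tag, e.g. "<p" or "</body".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def hide_bits_in_tag_case(html, bits):
    """Write one bit per tag: uppercase tag name for '1', lowercase for '0'."""
    remaining = iter(bits)

    def recase(match):
        bit = next(remaining, None)
        if bit is None:
            return match.group(0)  # no bits left, leave the tag untouched
        name = match.group(2)
        name = name.upper() if bit == "1" else name.lower()
        return "<" + match.group(1) + name

    return TAG.sub(recase, html)


def read_bits_from_tag_case(html, count):
    """Read back the first `count` bits from the case of the tag names."""
    names = [m.group(2) for m in TAG.finditer(html)]
    return "".join("1" if n.isupper() else "0" for n in names[:count])


page = "<html><body><p>Hi</p></body></html>"
stego = hide_bits_in_tag_case(page, "1011")
print(stego)                              # <HTML><body><P>Hi</P></body></html>
print(read_bits_from_tag_case(stego, 4))  # 1011
```

Under this one-bit-per-tag assumption the capacity equals the number of tags in the page, and the receiver must know how many bits were written.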
Other HTML algorithms were proposed that
use HTML tags and employ varying combinations or
gaps to hide data [13]. An example is as follows:
<img></img> hide 0
<img/> hide 1
By using this method, each tag can pass a bit by adding "/" to
the end of the tag. The main strength of this method is that it is not
suspicious, as "/" is commonly used in HTML
tags.
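The two tag forms in the example above can be written and read back with a few lines of Python. This is only a sketch of the idea, assuming one bit per empty element; the function names and the regular expression are hypothetical and are not taken from [13].

```python
import re


def encode_bit_as_tag(name, bit):
    """Return <name/> for bit '1' and <name></name> for bit '0'."""
    if bit == "1":
        return "<{0}/>".format(name)
    return "<{0}></{0}>".format(name)


# First alternative: self-closed form; second: explicit open/close pair.
EMPTY = re.compile(r"<(\w+)\s*/>|<(\w+)></\2>")


def decode_bits(html):
    """Read one bit per empty element, in document order."""
    return "".join("1" if m.group(1) else "0" for m in EMPTY.finditer(html))


fragment = "".join(encode_bit_as_tag("img", b) for b in "101")
print(fragment)               # <img/><img></img><img/>
print(decode_bits(fragment))  # 101
```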
The end of the HTML file is also used to hide data. HTML files
usually start with <html> and end with </html> [14]. One of
the simplest methods is to hide data inside HTML by
inserting the whole hidden data after the closing
</html> tag. This way, data will not appear in the browser
output, and the whole hidden data can be read from the end
of the HTML file.
In [14], a technique was introduced that uses
one of the HTML properties, HTML attributes,
where each tag on the HTML page has an ID attribute. The
file is usually compressed to reduce the memory space. Each
tag ID consists of three parts: the object name, the title of the
HTML page, and four coded characters. By employing some
bytes from the ID attribute, 2 bytes can be hidden. The main
advantage of the ID attribute algorithm is the large number
of HTML files over the Internet. Moreover, the ID attribute is a
common way to compress HTML files. This method
can also be applied to other web design languages such as
XML and ASP.
III. PROPOSED ALGORITHM
In this paper, we employ cryptography and
steganography techniques to pass secure information. Since
web pages are used as the carrier for data, and since the
pages are published over the Internet, authenticated users can
access the hidden data. The proposed algorithm consists of
three main steps, as shown in Figure 2, where the first and
third steps represent inverse operations.
Figure 2. Steganography Process
The Conceal operation consists of the following six steps:
1. Statistical operation:
This step creates an array of 26 elements to count
the frequency of each character. The frequency array can
be extended or shrunk depending on the language
used in the web page. Our experiments are applied
to the English language.
2. Character representation:
After the frequency array has been generated, the
two characters with the lowest frequency are each
represented by one bit. If the two characters have
the same frequency, the character order
specifies which one is zero. For example, if letters
X and Z appear 10 and 6 times, then Z is
represented by 0 and X by 1. If both
letters have the same number of occurrences, then X is
represented by 0 and Z is represented by 1.
Similarly, the next four characters are
represented by two bits, and so on.
3. Embedding process:
In this step, the secret bits are appended to the
character representation so that the result is 8 bits long. In other words, if
the character representation is 0 and the hidden
information is 0111011, the code will be 00111011.
4. Encryption process:
This step consists of three simple binary operations.
The binary representation is first complemented,
then an exclusive OR (XOR) is performed with the
key. The output of the XOR gate is shifted left by one bit
(circularly, so that the leftmost bit re-enters on the right,
as Step 6 of the example below shows) and
enters the XOR again as input. The
key creation depends on the page index where each
page has rear index. The XOR and shift operations are repeated
twice, as shown in Figure 3.
Figure 3. Encryption gates
The following is a numerical example where the
input is (C) 01000011 and the key is 10001100:
Step 1: Binary representation for C => 01000011
Step 2: 1's complement of C => 10111100
Step 3: 10001100 XOR 10111100 => 00110000
Step 4: Shift left 00110000 => 01100000
Step 5: 10001100 XOR 01100000 => 11101100
Step 6: Shift left 11101100 => 11011001
5. Decoding process (convert binary code to ASCII
code):
The next step after encryption is decoding, which
converts the binary code to text form. In the running
example, 11011001 is decoded to (Ù).
6. Insertion operation:
The last step in our algorithm is the insertion
operation, where the output of step 5 is inserted into
the web page code as a comment. The
comments do not appear in the page output view.
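The six conceal steps can be sketched in Python as follows. This is one possible reading of the description and not a reference implementation. Several points are assumptions: ties in frequency are broken alphabetically, the letters that remain after the 1-bit, 2-bit, and 3-bit groups all receive 4-bit codes, the left shift is treated as a circular shift (which is what Step 6 of the numerical example requires), the byte is decoded as Latin-1 so that 11011001 maps to Ù, and the comment is placed just before a lowercase </body> tag. All function names are illustrative.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Step 1: count how often each of the 26 English letters occurs."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    return {letter: counts.get(letter, 0) for letter in string.ascii_uppercase}


def assign_codes(freq):
    """Step 2: the 2 rarest letters get 1-bit codes, the next 4 get 2-bit codes,
    and so on. Ties are broken alphabetically (assumption)."""
    ordered = sorted(freq, key=lambda letter: (freq[letter], letter))
    codes, start, width = {}, 0, 1
    while start < len(ordered):
        group = ordered[start:start + 2 ** width]
        for i, letter in enumerate(group):
            codes[letter] = format(i, "0{}b".format(width))
        start += 2 ** width
        width += 1
    return codes


def embed(code, secret_bits):
    """Step 3: character code followed by secret bits, 8 bits in total."""
    if len(code) + len(secret_bits) != 8:
        raise ValueError("code plus secret bits must be exactly 8 bits")
    return int(code + secret_bits, 2)


def rotate_left(byte):
    """Circular left shift of an 8-bit value by one position."""
    return ((byte << 1) | (byte >> 7)) & 0xFF


def encrypt(byte, key):
    """Step 4: complement, then two rounds of XOR with the key and rotate left."""
    value = ~byte & 0xFF
    for _ in range(2):
        value = rotate_left(value ^ key)
    return value


def insert_comment(html, payload):
    """Steps 5 and 6: decode the bytes to text (Latin-1 assumed) and insert
    them as a comment before a lowercase </body>, or at the end if absent."""
    comment = "<!--" + bytes(payload).decode("latin-1") + "-->"
    cut = html.rfind("</body>")
    if cut == -1:
        return html + comment
    return html[:cut] + comment + html[cut:]


cipher = encrypt(0b01000011, 0b10001100)  # the numerical example: C, key 10001100
print(format(cipher, "08b"))              # 11011001
```

A practical implementation would also need to handle encrypted bytes that happen to form the sequence "-->" or control characters, since these would break the comment; the description above does not address this case.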
At the other end, the user can extract the hidden
information by the following procedure:
1. Statistical operation:
This is the same as the first step in the conceal operation.
2. Reading comments:
The next step is to read the comments from the web
page.
3. Encoding process:
This step converts the comments from text into
binary representation.
4. Decryption process:
This step is similar to the encryption process, with
the same number of iterations.
5. Exclude character code representation:
This step is performed by using the frequency array
created in step 1 and comparing it with the binary
output of step 4. The embedded information is
acquired by removing the character representation.
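The extraction side can be sketched in the same spirit. The decryption below is an assumption: it undoes the encryption by applying the inverse operations in reverse order (rotate right, XOR with the key, twice, then complement), treating the shift as circular as in the numerical example. The sketch also assumes that the receiver already knows the length of the character code for each byte, which the procedure above obtains from the frequency array. The function names and the Latin-1 encoding are illustrative choices.

```python
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def read_comment_bytes(html):
    """Steps 2 and 3: read comment bodies and convert them to byte values."""
    data = []
    for body in COMMENT.findall(html):
        data.extend(body.encode("latin-1"))
    return data


def rotate_right(byte):
    """Circular right shift of an 8-bit value by one position."""
    return ((byte >> 1) | (byte << 7)) & 0xFF


def decrypt(byte, key):
    """Step 4: two rounds of rotate right and XOR with the key, then complement."""
    value = byte
    for _ in range(2):
        value = rotate_right(value) ^ key
    return ~value & 0xFF


def strip_code(byte, code_length):
    """Step 5: drop the leading character code and keep the secret bits."""
    return format(byte, "08b")[code_length:]


plain = decrypt(0b11011001, 0b10001100)  # inverse of the numerical example
print(format(plain, "08b"))              # 01000011
print(strip_code(0b00111011, 1))         # 0111011
```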
IV. EXPERIMENTS AND RESULTS
In this section, we explore some of the results acquired
by applying our algorithm and considering the concept of
hidden ratio.
Figure 5 shows some experiments on different websites
and the corresponding statistical information for the visited
websites.
The number of hidden bits is 122 bits regardless of the
website size. In addition, the suggested algorithm
applies non-pure Steganography by employing encryption
gates to improve the system transparency, which increases
the difficulty of identifying the message content. Moreover,
the statistical operation improves system robustness,
preventing changes to the sensitive message during
the transmission process.
V. ALGORITHM ANALYSIS
Our proposed algorithm has a number of advantages over
other algorithms. This section explores some of its benefits.
1. Language independence: The suggested algorithm
can be applied to any language. This enables users to employ
it regardless of the language used in the web page. Different
languages will have different frequency array sizes. For
example, if the web page contains Arabic letters, then the
array size is 28, and if it contains English text, the array size is 26.
2. Algorithm transparency: One of the most
important criteria for measuring the performance of a
Steganography algorithm is the ability to avoid suspicion.
This algorithm improves transparency by hiding
data inside the code, where the hidden data is also
encrypted. In addition, the embedded bits are inserted as
comments, and the comments do not appear in the web page
output. The proposed technique also avoids changing the file
format, which reduces intruder suspicion.
Figure 4. Letter Frequency
3. Hidden ratio capacity: The presented algorithm
hides a different number of bits inside each web page. For
example, if we assume the web page has English text, at
most 126 bits can be hidden in each page.
4. Algorithm reusability: The presented algorithm
enables the user to create his/her own web page or reuse the
same web page to hide different messages.
5. Algorithm robustness: The proposed algorithm
prevents any change to the carrier page code during the
transmission process, as the hidden data is stored in the page
code as comments.
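The 126-bit figure given under hidden ratio capacity is consistent with the following reading, which is an assumption and not a derivation stated in the paper: each of the 26 letters is used once as a carrier, the 2 rarest letters use 1-bit codes, the next 4 use 2-bit codes, the next 8 use 3-bit codes, the remaining 12 use 4-bit codes, and every code is padded with secret bits up to 8 bits. The function names below are illustrative.

```python
def code_lengths(alphabet_size):
    """Code length per letter: 2 letters with 1 bit, 4 with 2 bits, 8 with 3 bits,
    and so on until the alphabet is exhausted."""
    lengths, width, remaining = [], 1, alphabet_size
    while remaining > 0:
        group = min(2 ** width, remaining)
        lengths.extend([width] * group)
        remaining -= group
        width += 1
    return lengths


def page_capacity(alphabet_size, byte_size=8):
    """Secret bits per page if every letter carries one byte (assumption)."""
    return sum(byte_size - n for n in code_lengths(alphabet_size))


# English: 2*7 + 4*6 + 8*5 + 12*4 = 126
print(page_capacity(26))  # 126
```

Under the same assumption, a 28-letter alphabet such as Arabic would give 2*7 + 4*6 + 8*5 + 14*4 = 134 bits.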
VI. CONCLUSION
Different algorithms have been presented to hide data
inside text files. Some of these methods were designed to be
applied in specific languages, while others can be applied
regardless of the language. In this paper, we presented a
promising algorithm that can be applied to different
languages over HTML pages. The proposed algorithm offers
high hiding capacity compared to other algorithms. In
addition, the algorithm offers robustness, as the hidden data
is inserted inside the page as comments, and the Internet
browser does not show it. Moreover, the algorithm enhances
transparency by using an encryption mechanism.
REFERENCES
[1] M. Venkata, "Cryptography and Steganography,"
International Journal of Computer Applications
(0975–8887), vol. 1, pp. 626-630, 2010.
[2] J. Neil and J. Sushil, "Exploring steganography:
Seeing the unseen," IEEE computer, vol. 31, pp.
26-34, 1998.
[3] P. Niels and H. Peter, "Hide and seek: An
introduction to steganography," Security &
Privacy, IEEE, vol. 1, pp. 32-44, 2003.
[4] C. Abbas, C. Joan, C. Kevin, and M. K. Paul,
"Digital image steganography: Survey and analysis
of current methods," Signal Processing, vol. 90,
pp. 727-752, 2010.
[5] D. Poulami, B. Debnath, and K. Tai-hoon, "Data
Hiding in Audio Signal: A Review," International
journal of database theory and application, vol. 2,
pp. 1-8, 2009.
[6] A. Odeh, A. Alzubi, Q. Bani, and K. Elleithy,
"Steganography by multipoint Arabic letters," in
Systems, Applications and Technology Conference
(LISAT), 2012 IEEE Long Island, 2012, pp. 1-7.
[7] A. Odeh, K. Elleithy, and M. Faezipour, "Text
Steganography Using Language Remarks,"
presented at The American Society of
Engineering Education, 2013.
[8] A. Odeh and K. Elleithy, "Steganography in Arabic
Text Using Zero Width and Kashidha Letters,"
International Journal of Computer Science &
Information Technology (IJCSIT), vol. 4, pp. 1-11,
2012.
[9] B. Lawson and R. Sharp, Introducing html5, 2 ed.:
Amazon, 2011.
[10] M. Garg, "A Novel Text Steganography Technique
Based on Html Documents," International Journal
of Advanced Science and Technology, vol. 35, pp.
129-138, 2011.
[11] K. Matz, "Designing and evaluating an intention-
based comment enforcement scheme for Java," 15
September, 2010.
[12] K. Bennett, "Linguistic steganography: Survey,
analysis, and robustness concerns for hiding
information in text," CERIAS Technical Report
2004-13, Purdue University, pp. 1-30, 2004.
[13] P. Singh, R. Chaudhary, and A. Agarwal, "A Novel
Approach of Text Steganography based on null
spaces," IOSR Journal of Computer Engineering,
vol. 3, pp. 11-17, 2012.
[14] M. Shahreza, "A New Method for Steganography
in HTML Files," Advances in Computer,
Information, and Systems Sciences, and
Engineering, pp. 247-252, 2006.
[Figure: letter frequency chart. Horizontal axis: letters A to Z. Vertical axis: 0 to 600. Series: Microsoft, CNN, Nytimes, Ctpost, Dailyfreepress.]