Conference Paper

Content-preserving Text Watermarking through Unicode Homoglyph Substitution

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Digital watermarking has become crucial for the authentication and copyright protection of digital content, since more and more data are generated and shared online every day through digital archives, blogs and social networks. Text watermarking is more difficult than watermarking other media: text cannot always be converted into an image, it accounts for a far smaller amount of data (e.g., social network posts), and changes to short texts strongly affect their meaning or overall visual form. In this paper we propose a text watermarking technique based on homoglyph character substitution for Latin symbols. The proposed method efficiently embeds a password-based watermark in short texts while strictly preserving the content. In particular, it uses alternative Unicode symbols to ensure visual indistinguishability and length preservation, which together we call content preservation. To evaluate our method, we use a real dataset of 1.8 million New York articles. The results show the effectiveness of our approach: on average, 101 characters are needed to embed a 64-bit password-based watermark.
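The substitution idea in the abstract can be illustrated with a minimal sketch. It assumes a small, hypothetical confusables table (the `HOMOGLYPHS` mapping below is illustrative and is not the paper's table), a plain ASCII cover text, and one bit per substitutable character: a homoglyph encodes 1 and the original character encodes 0. The function names `embed` and `extract` are likewise assumptions made for this example.

```python
# Hypothetical confusables table: original character -> look-alike Unicode character.
# These pairs are illustrative only; the paper's actual table is not reproduced here.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    " ": "\u2004",  # THREE-PER-EM SPACE
}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}


def embed(text: str, bits: str) -> str:
    """Write one bit per substitutable character: '1' swaps in the homoglyph."""
    out = []
    i = 0  # index of the next watermark bit to embed
    for ch in text:
        if ch in HOMOGLYPHS and i < len(bits):
            out.append(HOMOGLYPHS[ch] if bits[i] == "1" else ch)
            i += 1
        else:
            out.append(ch)
    if i < len(bits):
        raise ValueError("cover text has too few substitutable characters")
    return "".join(out)


def extract(text: str, n_bits: int) -> str:
    """Read back the first n_bits carrier characters of a marked text."""
    bits = []
    for ch in text:
        if len(bits) == n_bits:
            break
        if ch in REVERSE:        # homoglyph present -> bit 1
            bits.append("1")
        elif ch in HOMOGLYPHS:   # original carrier character -> bit 0
            bits.append("0")
    return "".join(bits)
```

Because every substitution replaces one code point with one code point, `len(embed(text, bits)) == len(text)`, which matches the length-preservation property the abstract describes. A cover text that already contains characters from the right-hand side of the table would be misread by `extract`, so this sketch only suits ASCII covers.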


... • Where there is a doubt about integrity and originality of text-based information shared on social media, the watermarking algorithm can embed an invisible watermark into the sensitive information before sharing and extract it whenever it is required [5,6] . ...
... Text hiding involves concealing the presence of a secret message/watermark and, in some cases, by extension, also the extraction of invisible symbols when decoding the hidden information [1,3,5-12]. Herein, the contributions involve text steganography and text watermarking on text messages. ...
... However, the proposed algorithms will be considered in the structural based category. To identify proper methods, we evaluated the state-of-the-art algorithms which utilize the Unicode encoding characteristics of the for concealing an / [4][5][6][7][8][9] . In general, a concealment system includes two functions: embedding or encoding and extraction or decoding. ...
Thesis
Text hiding is an intelligent programming technique, which embeds a secret message (SM) or watermark (ω) into a cover text file or message (CM/CT) in an imperceptible way to protect confidential information. Recently, text hiding in forms of watermarking and steganography has found broad applications in, for instance, covert communication, copyright protection, content authentication, and so on. It has also been widely considered as an attractive technology to improve the use of conventional cryptography algorithms in the area of multimedia security by concealing information into a cover being protected. In general, information hiding or data hiding can be categorized into two classifications: watermarking and steganography. While watermarking attempts to concern the robustness of the embedded watermark/signature at the expense of embedding capacity, steganography tries to embed as much secret information as feasible into a cover media. In contrast to text hiding, text steganalysis is the process and science of identifying whether a given carrier text file/message has a hidden message (HM) in it, and, if possible, extracting/detecting the embedded hidden information. In practice, steganalysis evaluates the efficiency of information hiding algorithms, meaning a robust watermarking/steganography algorithm should be invisible (or irremovable) not only to Human Vision Systems (HVS) but also to intelligent data processing attacks. Since the digital text is one of the most widely used digital media on the Internet, the significant part of Web sites, social media, articles, eBooks, and so on is only plain text. Thus, copyrights protection of plaintexts is still a remaining issue that must be improved to provide proof of ownership and obtain the integrity rate. 
During the last decade, digital watermarking and steganography techniques have been used as alternatives to prevent tampering, distortion, and media forgery attacks and also to protect both copyright and authentication. As yet, text hiding and steganalysis have drawn relatively less attention compared to data hiding in other media such as image, video, and audio. This dissertation aims to focus on this relatively neglected research area and has three main objectives as follows. 1) We discuss various types of text hiding algorithms, and their limitations in digital text documents and messages as well as the definition of the common evaluation criteria. We theoretically analyze the efficiency of the existing text hiding methods concerning the evaluation criteria. Then, we conduct a set of experiments on the real examples to evaluate the efficiency of existing techniques and their limitations and investigate the performance of structural-based text hiding techniques. Our findings confirm that the structural-based text hiding approaches provide better efficiency compared to other state-of-the-art methods. Thus, we outline some guidelines and directions to enhance the efficiency of structural-based techniques in digital texts for future works. 2) We propose a novel text steganography technique called AITSteg, which affords end-to-end secure conversation via SMS or social media between smartphone users. To meet this requirement, we investigate the trade-off between invisibility, embedding capacity, and distortion robustness criteria by considering proper embeddable locations for hiding the SM into the CM using Unicode Zero Width characters (ZWC). We then experiment the proposed technique concerning evaluation criteria by implementing it on some real CM examples. The experiments confirm that the AITSteg can prevent different attacks, including man-in-the-middle attack, message disclosure, and manipulation by readers. 
Also, we compare the experimental results with the existing approaches for showing the superiority of the proposed technique. To the best of our knowledge, this is the first technique that provides end-to-end hidden transmission of SM in the cover of text message using symmetric keys via social media. 3) We present an intelligent watermarking technique called ANiTW which utilizes an instance-based learning algorithm to hide an invisible watermark (ω) into Latin cover text-based information (CT) such that the ω can be extracted, even if a malicious user manipulates a portion of the watermarked information. We experiment with the ANiTW by implementing it on 16 social media applications (SMAs) and some real CT examples concerning evaluation criteria. Experiments demonstrate that the ANiTW can identify the integrity rate and ownership of watermarked information on social media, where there is a doubt about its originality. To the best of our knowledge, this is the first intelligent text watermarking technique that provides an invisible signature for forensic identification of spurious information on social media by evaluating the manipulation rate of watermarked information, while the other existing approaches only consider the robust/fragile marking of signature into the CT.
... In this method, the pixels of letters' curves are changed according to the values of the watermark bits. Although a wide variety of text watermarking methods have been proposed [12][13][14][15][16][17][18][19][20], some of these methods [12][13][14][15]20] are vulnerable to text reordering attacks, in which the watermark is distorted by changing the order of words of the watermarked text. In references [16,17], watermarking methods that utilize the inter-word spacing to hide the watermark bits were proposed. ...
... The capacity of the proposed method is shown in Table 4 for 10 randomly selected documents from each class in the Reuters-8 dataset, using a 16-bit integer representation for d 1i (i.e., m = 16). It can be observed from this table that the capacity of the proposed watermarking method is approximately 1 bit/character, which is higher than the capacity that can be achieved with other watermarking methods [14,18] without affecting the visibility of the watermark. ...
Article
Full-text available
Due to the rapid growth of the Internet and content development services, digital text has become the most extensively used type of media in digital communication. However, digital text is at risk of being illegally tampered with, copied or redistributed, which could lead to many security threats related to privacy and ownership protection. Over the last few years, digital watermarking has been used to authenticate and protect the ownership of digital media such as images, audio, and videos. Compared to other digital media, digital text faces greater challenges in regard to watermarking due to its low capacity for information hiding and its high sensitivity to modification. While limited research has been conducted on text watermarking, the preliminary results indicate that the available schemes do not achieve an acceptable trade-off between imperceptibility, robustness, capacity, and security. In this paper, we propose a novel text watermarking method called Bloom Filter and Text Similarity Watermarking (BFTSW). A Bloom filter is employed in our proposed BFTSW method to reduce the size of the watermark that is generated based on a vector space model (VSM) representation of a text document. To verify the watermark, the text similarity between the VSM representation of the original text (which is recovered from the Bloom filter) and the VSM representation of a suspect text is computed and compared against a predefined threshold value to determine whether the watermark is present in the suspect text. Experimental results obtained with a prototype implementation show that the proposed BFTSW method is effective in terms of its robustness against malicious attacks, information capacity, and text quality preservation.
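A rough sketch of the verification flow described in this abstract is given below. It is a simplification under stated assumptions, not the BFTSW algorithm itself: the watermark is a Bloom filter over the document's term set, and a containment ratio (the fraction of suspect terms found in the filter) stands in for the VSM text similarity. The class `BloomFilter`, the helper names, the filter size `m`, the hash count `k`, and the threshold `0.8` are all hypothetical.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def terms(text: str) -> set:
    """Bag-of-words reduced to a set of lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def make_watermark(text: str) -> BloomFilter:
    bf = BloomFilter()
    for t in terms(text):
        bf.add(t)
    return bf


def similarity(bf: BloomFilter, suspect: str) -> float:
    """Fraction of the suspect's terms that the filter reports as present."""
    s = terms(suspect)
    return sum(t in bf for t in s) / len(s) if s else 0.0


def is_watermarked(bf: BloomFilter, suspect: str, threshold: float = 0.8) -> bool:
    return similarity(bf, suspect) >= threshold
```

Since the filter is built from a term set, reordering words in the suspect text does not change the score, which is consistent with the robustness to reordering that motivates this kind of scheme. Bloom filters can report false positives, so the score is an upper-biased estimate of the true overlap.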
... Moreover, these techniques provide different variations and improvements in the multimedia security area that cannot be addressed by the traditional cryptosystems [9]- [22]. Digital Watermarking has many common attributes with the related but basically somewhat different data-hiding technology called steganography [13], [23]- [34]. Although both digital watermarking and steganography are employed to hide data stream in the cover media, the primary goal of steganography is to conceal the existence of confidential information. ...
... Since text messages are one of the most common communication media between end users on social media and are also easy to manipulate, verifying authenticity, authorship attribution and the integrity of invalid information is becoming crucial. Because digital texts have limited structural characteristics, such as language dependency, limited length, and different types of encoding, text watermarking is a much more difficult task than watermarking other digital media [17], [19], [23]. ...
... In particular, if an SMA employs the Unicode standard to process digital texts in different languages, then the ZWCs will be unnoticeable, i.e., they show invisible written symbols. Otherwise, they might display some unusual symbols [23]- [27]. In this research, the proposed technique employs four ZWCs for marking the watermark into cover text which are depicted in Table 2. ...
Article
Full-text available
Digital Watermarking is required in multimedia applications where access to sensitive information has to be protected against malicious attacks. Since the digital text is one of the most widely used digital media on the Internet, the significant part of Web sites, social media, articles, eBooks, and so on is only plain text. Thus, copyrights protection of plain-texts is still a remaining issue that must be improved to provide proof of ownership and verify content integrity of vulnerable digital texts. In this research, we propose a novel intelligent text watermarking technique called ANiTW which utilizes an instance-based learning algorithm to hide an invisible watermark into Latin text-based information such that the hidden watermark can be extracted, even if a malicious user manipulates a portion of the watermarked information. Extensive experiments demonstrate the superior efficiency of the ANiTW with a significant improvement, especially in the short text domain. To the best of our knowledge, this is the first intelligent text watermarking technique that provides an invisible signature for forensic identification of spurious information on social media by evaluating the manipulation rate of watermarked information, while the other existing approaches only consider the robust/fragile marking of signature into cover text.
... The proposed method is based on homoglyphs substitutions for Latin-based languages. 3 We also show how to specialize this approach for text information hiding in SM by adapting its core algorithm on the visual and content requirements of the set of SM platforms. The experimental results reveal that the proposed method can hold up to copying and pasting and is invisible to the human observer. ...
... We showed that it is possible to embed a 64-bit sequence in texts of 46-101 characters. 3 This feature makes the proposed method suitable even for the most demanding SM, which is Twitter with its 280-character limit. an optimal prefix code is used to binarize words, numbers, and symbols in the vocabulary, resulting in a codebook ( Table 2). ...
Article
A Unicode homoglyph is one of two or more characters with shapes that appear very similar to the human observer. If used in social media posts, homoglyphs allow users to implement an efficient method for hiding text information, turning anyone's post into a potential carrier of hidden messages.
... On the other hand, hackers are regularly trying to break the safety of communication channels (e.g., network protocols, SMS, etc.) to access sensitive information during data transmission. Therefore, demand is growing for intelligence and multimedia security studies that involve not only encryption, but also covert communication whose essence lies in concealing data [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. Recently, information hiding or data hiding in digital texts, known as text hiding, has drawn considerable attention due to its extensive usage, and potential applications in the cybersecurity and network communication industries . ...
... During the last two decades, many text hiding algorithms have been introduced in terms of text steganography and text watermarking for covert communication [1,6,8,[9][10][11][12][13][14]20,31,36,39,51,91], copyright protection [3][4][5]7,18,[20][21][22][23][24][25][26][27][28][29]44,[49][50][51][52][53][54][55][56][57][58][59][60][61][62][63][64][65][66][67][68][72][73][74][75][87][88][89][90][91][92][98][99][100][101][102][103][104][105][106][107][108][109], copy control and authentication [31,57,60,74,78,[93][94][95][96][97][98]. ...
Article
Full-text available
Abstract: Modern text hiding is an intelligent programming technique which embeds a secret message/watermark into a cover text message/file in a hidden way to protect confidential information. Recently, text hiding in the form of watermarking and steganography has found broad applications in, for instance, covert communication, copyright protection, content authentication, etc. In contrast to text hiding, text steganalysis is the process and science of identifying whether a given carrier text file/message has hidden information in it, and, if possible, extracting/detecting the embedded hidden information. This paper presents an overview of state of the art of the text hiding area, and provides a comparative analysis of recent techniques, especially those focused on marking structural characteristics of digital text message/file to hide secret bits. Also, we discuss different types of attacks and their effects to highlight the pros and cons of the recently introduced approaches. Finally, we recommend some directions and guidelines for future works.
... Over the last two decades, many information hiding techniques have been proposed in terms of text watermarking and text steganography for copyright protection [11][12][13][14], proof of ownership [15][16][17][18][19][20][21][22][23], and copy control and authentication [24][25][26][27][28][29][30][31]. Although the aim of steganography is different, it also can be used for the copyright protection of digital texts like watermarking. ...
... high imperceptibility but they used two spaces with different lengths, which makes more gaps between words in the watermarked text [19]. Rizzo et al. (2016) presented a text watermarking technique which is able to embed a password-based watermark in Latin-based texts. This technique blends the original text and a user password through a hash function in order to compute the watermark. ...
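The step of blending the text and a password through a hash function can be sketched as follows. The snippet does not name the hash function used by Rizzo et al., so HMAC-SHA256 truncated to 64 bits is used here purely as a stand-in; the function name `watermark_bits` and its parameters are assumptions for illustration.

```python
import hashlib
import hmac


def watermark_bits(text: str, password: str, n_bits: int = 64) -> str:
    """Derive an n_bits watermark from the text, keyed by the password.

    HMAC-SHA256 is an assumed stand-in; the original hash is not specified here.
    """
    if not 0 < n_bits <= 256:
        raise ValueError("n_bits must be between 1 and 256")
    digest = hmac.new(
        password.encode("utf-8"), text.encode("utf-8"), hashlib.sha256
    ).digest()
    # Keep the most significant n_bits of the 256-bit digest.
    value = int.from_bytes(digest, "big") >> (256 - n_bits)
    return format(value, f"0{n_bits}b")
```

The resulting bit string depends on both the text and the password, so a verifier who knows the password can recompute it from the original (unmarked) text and compare it with the bits extracted from a suspect copy.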
... The authors claimed that this technique can hide a watermark (64 bit) into a short text with only 46 characters and, moreover, it provides high imperceptibility and high capacity. However, it is vulnerable against reformatting (e.g., changing the font type of watermarked text causes the watermark bits to be lost), tampering, and retyping attacks [29]. Due to utilizing the homoglyph Unicode characters, this method has low robustness against all the conventional attacks. ...
Article
Full-text available
With the ceaseless usage of the web and other online services, it has become extremely simple to copy, share, and transmit digital media over the Internet. Since text is one of the main available data sources and the most widely used digital media on the Internet, a significant part of websites, books, articles, daily papers, etc. is just plain text. Therefore, copyright protection of plain texts is still a remaining issue that must be improved in order to provide proof of ownership and obtain the desired accuracy. During the last decade, digital watermarking and steganography techniques have been used as alternatives to prevent tampering, distortion, and media forgery and also to protect both copyright and authentication. This paper presents a comparative analysis of information hiding techniques, especially those focused on modifying the structure and content of digital texts. Herein, the characteristics of various text watermarking and text steganography techniques are highlighted along with their applications. In addition, various types of attacks are described and their effects are analyzed in order to highlight the advantages and weaknesses of current techniques. Finally, some guidelines and directions are suggested for future works.
... In Section V, we specialize our previous method [1], to ensure visual indistinguishability and length preservation of text in SM while being robust to copy and paste. Firstly, our findings reveal that in general SM do not perform any text watermarking. ...
... All the selected SM will be the subject of a set of experiments in order to answer our two initial questions. 1 The dataset is available from http://smartdata.cs.unibo.it/datasets#tw ...
... The number of characters required to embed a watermark strictly depends on (i) how many confusable symbols are among those characters and (ii) how many symbol restrictions are applied. In [1] we showed that a 64-bit hash watermark can be embedded in texts of 46 to 101 characters by using both letters and white spaces, making our method suitable even for the most demanding SM, namely Twitter with its 140-character limit. ...
Conference Paper
Full-text available
One of the most shared types of content in Social Media (SM) is text, making it vulnerable to copying and authorship misappropriation. Due to the low data noise, watermark embedding is very hard. This problem is exacerbated in the context of SM, where the amount of data in a single message can be extremely small, as on Twitter. In this paper we first investigate whether SM apply watermarks to texts. Then, we propose a text watermarking method able to work on all the SM platforms considered, while ensuring visual indistinguishability and length preservation of the original text and robustness to copy and paste. We conduct an extended evaluation on eighteen different SM platforms by using 6,000 posts from six public figures' profiles.
... The parameters such as invisibility, capacity and robustness are used to test the efficiency of the technique. Rizzo et al. (2016) proposed a text watermarking technique based on replacing Unicode Homoglyph for Latin text. The technique does not change the original text and therefore preserves the length of the cover text. ...
... Unicode is a standard used for representation, encoding and handling of digital text. Unicode has special ZWCs that is used to manage specific entities such as zero-width joiner in particular the script, which combines two supporting characters [12] [15]. Eight ZWCs are used to insert the encoding value into the cover text, as shown in Table 1. ...
Article
Full-text available
A novel watermarking technique for the tamper detection of English text images is proposed in this paper. The frequency of maximum occurring vowel in every sentence is counted and converted to Unicode Zero Width Characters (ZWCs) by using a lookup table. These ZWCs and total length of each sentence are added at the end of the sentence. ZWCs of the Hash value of the cover text is calculated and inserted at the end of the cover text. These values are extracted from the received data on the receiver side for the tamper detection. The hash value of the received text as well as frequency of maximum occurring vowel of each sentence are again calculated and compared to their extracted corresponding values to prove the authentication of the image. Comparison with existing state-of-the-art techniques shows the effectiveness of the proposed technique.
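The per-sentence part of this scheme can be sketched as follows. This is only an illustration under assumptions: four zero-width characters act as base-4 digits in place of the paper's lookup table (which is not reproduced here), each value is written with a fixed width of four digits (so values must stay below 256), and the whole-text hash step is omitted. The names `mark` and `verify` are hypothetical.

```python
# Assumed alphabet: four zero-width code points used as base-4 digits.
ZWCS = ["\u200b", "\u200c", "\u200d", "\u2060"]  # ZWSP, ZWNJ, ZWJ, WORD JOINER
DIGIT = {z: i for i, z in enumerate(ZWCS)}
WIDTH = 4  # digits per encoded number, so values range over 0..255


def to_zwc(n: int) -> str:
    """Encode n as WIDTH base-4 digits, most significant first."""
    if not 0 <= n < 4 ** WIDTH:
        raise ValueError("value out of range for the fixed-width encoding")
    digits = []
    for _ in range(WIDTH):
        digits.append(ZWCS[n % 4])
        n //= 4
    return "".join(reversed(digits))


def from_zwc(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 4 + DIGIT[ch]
    return n


def max_vowel_count(sentence: str) -> int:
    """Frequency of the most frequent vowel in the sentence."""
    low = sentence.lower()
    return max(low.count(v) for v in "aeiou")


def mark(sentence: str) -> str:
    """Append the invisible vowel-frequency and length codes to the sentence."""
    return sentence + to_zwc(max_vowel_count(sentence)) + to_zwc(len(sentence))


def verify(marked: str) -> bool:
    """Recompute both values from the visible text and compare with the codes."""
    if len(marked) < 2 * WIDTH:
        return False
    body, tail = marked[: -2 * WIDTH], marked[-2 * WIDTH:]
    if any(ch not in DIGIT for ch in tail):
        return False
    return (
        from_zwc(tail[:WIDTH]) == max_vowel_count(body)
        and from_zwc(tail[WIDTH:]) == len(body)
    )
```

An edit that changes neither the sentence length nor the top vowel frequency passes this per-sentence check, which is presumably why the abstract also appends a hash of the whole cover text.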
... Methodologies based on images The cover text is viewed as an image and embedded with a watermark according to the image-based approach described in [24]. The watermarked logos and images are converted into text strings and the data is generated. ...
... In spite of the fact that optical character recognition (OCR) is considered safe for formatting attacks, it has limited applicability due to the fact that it ruins hidden information [25]. A technique described by Rizzo et al. [24] encrypts a short piece of text with a hidden watermark while preserving its content strictly. Images cannot be altered in either their content or appearance when converted from the text. ...
Chapter
Full-text available
Watermarking is a modern technology in which identifying information is embedded in a data carrier. The embedded information is not easy to notice and does not affect data usage. A text watermark is an approach to inserting a watermark into text documents. This is an extremely complex undertaking, especially given the scarcity of research in this area. Conducting an in-depth analysis and implementing the evaluation is essential for its success. The overall aim of this chapter is to develop an understanding of the theory, methods, and applications of text watermarking, with a focus on procedures for defining, embedding, and extracting watermarks, as well as requirements, approaches, and linguistic implications. A detailed examination of the new classification of text watermarks is provided in this chapter, as are the integration process and related issues of attacks and language applicability. Open research challenges and forward-looking directions are also explored, with emphasis on information integrity, information accessibility, originality preservation, information security, and sensitive data protection. The topics include sensing, document conversion, cryptographic applications, and language flexibility.
... Others hide messages by changing the character scale and color or adding underline styles in a document [Panda et al. 2015;Stojanov et al. 2014], although those changes are generally noticeable. More recent methods exploit special ASCII codes and Unicodes that are displayed as an empty space in a PDF viewer [Chaudhary et al. 2016;Rizzo et al. 2016]. ...
... These methods are not format independent (FI); they are bounded to a specific file format (such as Word or PDF) and text viewer. The concealed messages would be lost if the document was converted ... Alattar 2004; Brassil et al. 1995; Gutub and Fattani 2007; Kim et al. 2003] FSF [Bhaya et al. 2013; Chaudhary et al. 2016; Panda et al. 2015; Rizzo et al. 2016] Our work. Table 1. A summary of related text steganographic methods. ...
Article
We introduce FontCode, an information embedding technique for text documents. Provided a text document with specific fonts, our method embeds user-specified information in the text by perturbing the glyphs of text characters while preserving the text content. We devise an algorithm to choose unobtrusive yet machine-recognizable glyph perturbations, leveraging a recently developed generative model that alters the glyphs of each character continuously on a font manifold. We then introduce an algorithm that embeds a user-provided message in the text document and produces an encoded document whose appearance is minimally perturbed from the original document. We also present a glyph recognition method that recovers the embedded information from an encoded document stored as a vector graphic or pixel image, or even on a printed paper. In addition, we introduce a new error-correction coding scheme that rectifies a certain number of recognition errors. Lastly, we demonstrate that our technique enables a wide array of applications, using it as a text document metadata holder, an unobtrusive optical barcode, a cryptographic message embedding scheme, and a text document signature.
... Hiding information in documents is more challenging than images. Earlier approaches focused on slightly adjusting the format of electronic documents to embed information, such as word and line spacing (Brassil, Low, and Maxemchuk 1999;Rizzo, Bertini, and Montesi 2016), which are fragile to realworld distortions. To remedy it, some studies (Ueoka, Murawaki, and Kurohashi 2021;Abdelnabi and Fritz 2021;Yang et al. 2022) tried to make semantic modifications, but there is no guarantee that the semantics of the modified text remain exactly the same as the original. ...
Article
Hiding information in text documents has been a hot topic recently, with the most typical schemes of utilizing fonts. By constructing several fonts with similar appearances, information can be effectively represented and embedded in documents. However, due to the unstructured characteristic, font vectors are more difficult to synthesize than font images. Existing methods mainly use handcrafted features to design the fonts manually, which is time-consuming and labor-intensive. Moreover, due to the diversity of fonts, handcrafted features are not generalizable to different fonts. Besides, in practice, since documents might be distorted through transmission, ensuring extractability under distortions is also an important requirement. Therefore, three requirements are imposed on vector font generation in this domain: automaticity, generalizability, and robustness. However, none of the existing methods can satisfy these requirements well and simultaneously. To satisfy the above requirements, we propose AutoStegaFont, an automatic vector font synthesis scheme for hiding information in documents. Specifically, we design a two-stage and dual-modality learning framework. In the first stage, we jointly train an encoder and a decoder to invisibly encode the font images with different information. To ensure robustness, we target designing a noise layer to work with the encoder and decoder during training. In the second stage, we employ a differentiable rasterizer to establish a connection between the image and the vector modality. Then, we design an optimization algorithm to convey the information from the encoded image to the corresponding vector. Thus the encoded font vectors can be automatically generated. Extensive experiments demonstrate the superior performance of our scheme in automatically synthesizing vector fonts for hiding information in documents, with robustness to distortions caused by low-resolution screenshots, printing, and photography. 
Besides, the proposed framework has better generalizability to fonts with diverse styles and languages.
... However, it is more challenging to embed watermarks with imperceptible perturbations in text due to its inherently discrete nature. Traditional text watermarking schemes embed watermarks by slightly altering image features such as text format (Brassil, Low, and Maxemchuk 1999; Rizzo, Bertini, and Montesi 2016) and fonts (Xiao, Zhang, and Zheng 2018; Qi et al. 2019), which are fragile to cross-media transmissions like OCR. Considering this, natural language watermarking (NLW) schemes choose to manipulate the semantics of text, which is inherently robust to OCR-style transmissions. ...
Article
Text content created by humans or language models is often stolen or misused by adversaries. Tracing text provenance can help claim the ownership of text content or identify the malicious users who distribute misleading content like machine-generated fake news. There have been some attempts to achieve this, mainly based on watermarking techniques. Specifically, traditional text watermarking methods embed watermarks by slightly altering text format like line spacing and font, which, however, are fragile to cross-media transmissions like OCR. Considering this, natural language watermarking methods represent watermarks by replacing words in original sentences with synonyms from handcrafted lexical resources (e.g., WordNet), but they do not consider the substitution’s impact on the overall sentence's meaning. Recently, a transformer-based network was proposed to embed watermarks by modifying the unobtrusive words (e.g., function words), which also impair the sentence's logical and semantic coherence. Besides, one well-trained network fails on other different types of text content. To address the limitations mentioned above, we propose a natural language watermarking scheme based on context-aware lexical substitution (LS). Specifically, we employ BERT to suggest LS candidates by inferring the semantic relatedness between the candidates and the original sentence. Based on this, a selection strategy in terms of synchronicity and substitutability is further designed to test whether a word is exactly suitable for carrying the watermark signal. Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences and has a better transferability than existing methods. Besides, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark.
... Modifying the file structure has also been investigated, since it leaves the content of the text unchanged and thus offers good concealment, but it cannot resist re-editing attacks [18,19]. Researchers have also tried to transform sentence characteristics [20,21] or vocabulary [22,23] to generate the stego text, most commonly through synonym substitution [24,25]. In particular, when using word (or token) replacement to embed secret information, one should carefully design both the synonym dictionary and the information encoding strategy. ...
Article
Full-text available
Linguistic steganography (LS) conceals the presence of communication by embedding secret information into a text. How to generate a high-quality text carrying secret information is a key problem. With the widespread application of deep learning in natural language processing, recent algorithms use a language model (LM) to generate the steganographic text, which provides a higher payload than much of the prior art. However, the security still needs to be enhanced. To tackle this problem, we propose a novel autoregressive LS algorithm based on BERT and consistency coding, which achieves a better trade-off between embedding payload and system security. In the proposed work, building on the masked LM and given a text, we use consistency coding to make up for the shortcomings of the block coding used in previous work, so that we can encode a candidate token set of arbitrary size and take advantage of the probability distribution for information hiding. The masked positions to be embedded are filled with tokens determined in an autoregressive manner to strengthen the connection between contexts and therefore maintain the quality of the text. Experimental results show that, compared with related works, the proposed work improves the fluency of the steganographic text while guaranteeing security and also increases the embedding payload to a certain extent.
... Rizzo et al. [19] proposed a document watermarking technique that uses Unicode homoglyph substitution for Latin document images. The technique does not alter the text and preserves the length of the cover text. ...
Article
Full-text available
In this paper, a hashing-based watermarking technique for the protection and authentication of document images is proposed. Message Digest 5 (MD5) hashing is applied to the cover document image to produce its hash value. This hash value is then translated to Unicode Zero Width Characters (ZWC) using a lookup table. The watermark is also translated to ZWC prior to embedding in the input image. Both ZWC sequences are inserted at the end of each sentence in the cover image. At the receiver end, the watermark is regenerated from its embedded ZWCs and used for security purposes. The hash value of the cover image is used to detect tampering. The embedded data does not degrade the cover image, as invisible Unicode white spaces are used in place of the original watermark and hash value. The proposed technique's effectiveness is demonstrated by comparison to existing state-of-the-art techniques.
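The pipeline described in this abstract (MD5 digest, lookup table to zero-width characters, insertion at sentence ends, tamper check on extraction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 2-bit lookup table, the choice of four zero-width code points, the sentence-end rule (`.`, `!`, `?`), and the function names are all assumptions made for illustration, and the sketch works on plain text rather than a document image.

```python
import hashlib
import re

# Hypothetical 2-bit lookup table (an assumption, not the paper's table).
ZWC_TABLE = {
    "00": "\u200b",  # ZERO WIDTH SPACE
    "01": "\u200c",  # ZERO WIDTH NON-JOINER
    "10": "\u200d",  # ZERO WIDTH JOINER
    "11": "\u2060",  # WORD JOINER
}
ZWC_REVERSE = {v: k for k, v in ZWC_TABLE.items()}
ZWC_RUN = "[" + "".join(ZWC_REVERSE) + "]+"  # regex matching a run of ZWCs


def bytes_to_zwc(data: bytes) -> str:
    """Translate bytes to zero-width characters, two bits per character."""
    bits = "".join(f"{b:08b}" for b in data)
    return "".join(ZWC_TABLE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def zwc_to_bytes(zwc: str) -> bytes:
    """Inverse of bytes_to_zwc."""
    bits = "".join(ZWC_REVERSE[c] for c in zwc)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def embed(text: str, watermark: str) -> str:
    """Append watermark + MD5(cover) as ZWCs after each sentence end."""
    digest = hashlib.md5(text.encode("utf-8")).digest()  # 16 bytes
    payload = bytes_to_zwc(watermark.encode("utf-8") + digest)
    return re.sub(r"([.!?])", lambda m: m.group(1) + payload, text)


def extract(marked: str):
    """Return (watermark, intact) from the first ZWC run, or None."""
    run = re.search(ZWC_RUN, marked)
    if run is None:
        return None
    payload = zwc_to_bytes(run.group(0))
    watermark, digest = payload[:-16], payload[-16:]
    cover = re.sub(ZWC_RUN, "", marked)  # visible text only
    intact = hashlib.md5(cover.encode("utf-8")).digest() == digest
    return watermark.decode("utf-8"), intact


marked = embed("Hello world. Second sentence.", "ID7")
print(extract(marked))                            # watermark and integrity flag
print(extract(marked.replace("world", "earth")))  # same watermark, hash mismatch
```

Repeating the payload after every sentence is what lets extraction succeed from any single surviving sentence end; the sketch reads only the first run for brevity.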
... Modifying the file structure has also been investigated, since it leaves the content of the text unchanged and thus offers good concealment, but it cannot resist re-editing attacks [11], [12]. Researchers have also tried to transform sentence characteristics [13], [14] or vocabulary [15], [16] to generate the stego text, most commonly through synonym substitution [17], [18]. In particular, when using word (or token) replacement to embed secret information, one should carefully design both the synonym dictionary and the information encoding strategy. ...
Preprint
Linguistic steganography (LS) conceals the presence of communication by embedding secret information into a text. How to generate a high-quality text carrying secret information is a key problem. With the widespread application of deep learning in natural language processing, recent algorithms use a language model (LM) to generate the steganographic text, which provides a higher payload than much of the prior art. However, the security still needs to be enhanced. To tackle this problem, we propose a novel autoregressive LS algorithm based on BERT and consistency coding, which achieves a better trade-off between embedding payload and system security. In the proposed work, building on the masked LM and given a text, we use consistency coding to make up for the shortcomings of the block coding used in previous work, so that we can encode a candidate token set of arbitrary size and take advantage of the probability distribution for information hiding. The masked positions to be embedded are filled with tokens determined in an autoregressive manner to strengthen the connection between contexts and therefore maintain the quality of the text. Experimental results show that, compared with related works, the proposed work improves the fluency of the steganographic text while guaranteeing security, and also increases the embedding payload to a certain extent.
... However, it is more challenging to embed watermarks with imperceptible perturbations in text due to its inherently discrete nature. Traditional text watermarking schemes embed watermarks by slightly altering image features such as text format (Brassil, Low, and Maxemchuk 1999; Rizzo, Bertini, and Montesi 2016) and fonts (Xiao, Zhang, and Zheng 2018; Qi et al. 2019), which are fragile to cross-media transmissions like OCR. Considering this, natural language watermarking (NLW) schemes choose to manipulate the semantics of text, which is inherently robust to OCR-style transmissions. ...
Preprint
Full-text available
Text content created by humans or language models is often stolen or misused by adversaries. Tracing text provenance can help claim the ownership of text content or identify the malicious users who distribute misleading content like machine-generated fake news. There have been some attempts to achieve this, mainly based on watermarking techniques. Specifically, traditional text watermarking methods embed watermarks by slightly altering text format like line spacing and font, which, however, are fragile to cross-media transmissions like OCR. Considering this, natural language watermarking methods represent watermarks by replacing words in original sentences with synonyms from handcrafted lexical resources (e.g., WordNet), but they do not consider the substitution's impact on the overall sentence's meaning. Recently, a transformer-based network was proposed to embed watermarks by modifying the unobtrusive words (e.g., function words), which also impair the sentence's logical and semantic coherence. Besides, one well-trained network fails on other different types of text content. To address the limitations mentioned above, we propose a natural language watermarking scheme based on context-aware lexical substitution (LS). Specifically, we employ BERT to suggest LS candidates by inferring the semantic relatedness between the candidates and the original sentence. Based on this, a selection strategy in terms of synchronicity and substitutability is further designed to test whether a word is exactly suitable for carrying the watermark signal. Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences and has a better transferability than existing methods. Besides, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark.
... LSB embedding has been found to introduce more distortion and to be less secure due to its sequential mapping. Rizzo et al. [32] presented a method that hides multiple text images in a single coloured image by applying a modified LSB substitution method. A total of 6 text images were used for hiding. ...
Article
The emergence of the Internet of Things (IoT) and modern digital applications such as digital financial services and deliveries makes it easy to reproduce and redistribute digital content, giving room to many copyright violations and illegal uses of content that need to be resolved. Researchers have presented watermarking algorithms that are applied to a document before distribution to prevent these illicit activities. However, several issues have been identified for digital transactions in the IoT. Thus, this research proposes a new text document image watermarking algorithm that emphasizes the two most important measures, visual quality and robustness. To boost these measures, the third least significant bit is used for insertion. In addition, to further strengthen the technique, the Pascal Triangle is applied to determine the best position for embedding. Experimental results on the standard dataset show that the proposed watermarking achieves very encouraging results, with PSNR and NCC averaging 54.95 dB and 0.98, respectively.
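To make the third-least-significant-bit insertion and the PSNR measure in this abstract concrete, here is a small Python sketch under stated assumptions. It treats the image as a flat list of 8-bit grayscale values and embeds bits sequentially from the first pixel; the Pascal Triangle position selection mentioned in the abstract is not specified here and is deliberately left out. The function names are hypothetical.

```python
import math


def embed_third_lsb(pixels, bits):
    """Write each watermark bit into bit 2 (the third LSB) of one pixel.

    Assumption: sequential embedding; the abstract's Pascal Triangle
    position selection is not reproduced.
    """
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels for the payload")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~0b100) | (bit << 2)  # clear bit 2, then set it
    return out


def extract_third_lsb(pixels, n):
    """Read back the first n embedded bits."""
    return [(p >> 2) & 1 for p in pixels[:n]]


def psnr(original, marked, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length images."""
    mse = sum((a - b) ** 2 for a, b in zip(original, marked)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


cover = [200, 13, 255, 0]
payload = [1, 0, 1, 1]
marked = embed_third_lsb(cover, payload)
print(marked, extract_third_lsb(marked, len(payload)))
print(round(psnr(cover, marked), 2), "dB")
```

Changing the third LSB alters a pixel by at most 4 grey levels, which is the trade-off this kind of scheme makes: somewhat more distortion than first-LSB embedding in exchange for bits that are less likely to be wiped by low-order noise.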
... In this approach, the watermark information is treated as an image or logo [16]. This approach is considered safe against formatting attacks, but it has limited applicability because it is not robust against re-typing attacks [17]. Rizzo et al. [18] suggest a password-based method that embeds the watermark in short text and preserves the appearance and content without converting the text to an image. ...
Article
Full-text available
Digital text is the most frequently interchanged form of data and may hold sensitive information for organizations such as audit firms, banks, and educational institutes. The integrity and originality of this sensitive information must be preserved, both to secure the data and to help identify the ownership of text documents. This paper presents a novel and invisible digital watermarking approach for the secure exchange of text documents over the internet. Over the last decade, digital watermarking has been used to detect forgery and tampering in digital text documents and has successfully supported copyright protection and authentication. Many state-of-the-art watermarking techniques achieve high imperceptibility, robustness, or hiding capacity, but fail to maintain a balance among these three conflicting parameters. As a solution, we propose an intelligent Three-Level Digital Watermarking (3LDW) system for copyright protection of text documents. The 3LDW system can be applied to Microsoft Word objects, document open spaces, and text feature coding without affecting the content of the original document. Experimental results reveal that the proposed 3LDW system strongly resists formatting attacks and efficiently preserves imperceptibility. Additionally, embedding capacity analysis demonstrates a prominent improvement of the proposed system compared to other similar approaches.
... Most text steganography methods can be divided into two types: modification-based [4] and generation-based [11,28,32]. Modification-based methods usually embed secret information by modifying the cover text, for example through synonym substitution [5,15]. Generation-based text information hiding methods can automatically generate steganographic texts according to the secret information [6,27,32,34,36]. ...
Chapter
Full-text available
With the rapid development of natural language processing technology, various linguistic steganographic methods have been proposed increasingly, which may bring great challenges in the governance of cyberspace security. The previous linguistic steganalysis methods based on neural networks with word embedding layer could only extract the context-independent word-level features, which are insufficient for capturing the complex semantic dependencies in sentences, thus may limit the performance of text steganalysis. In this paper, we propose a novel linguistic steganalysis model. We first employ the BERT or Glove component to extract the contextualized association relationships of words in the sentences. Then we put these extracted features into BiLSTM to further get context information. We use the attention mechanism to find out local parts that may be discordant in text. Finally, based on these extracted features, we use the softmax classifier to decide if the input sentence is cover or stego. Experimental results show that the proposed model can achieve currently the best performance of text steganalysis and hidden capacity estimation. Further experiments found that proposed model can even locate where the secret information may be embedded in the text to a certain extent. To the best of our knowledge, we made the first attempt to achieve text steganography positioning in the field of text steganalysis (Code and datasets are available at https://github.com/YangzlTHU/Linguistic-Steganography-and-Steganalysis).
... Linguistic steganography can usually be divided into two steganographic strategies: carrier-modification based steganography (CMS) and carrier-generation based steganography (CGS). The CMS strategy mainly replaces lexical [13], [14] or sentence-level semantic units [15]-[17] in the text with synonymous ones to embed specific secret information. The CGS strategy automatically generates a semantically complete and sufficiently natural carrier based on the secret information that needs to be transmitted [8], [18]-[25]. ...
Article
Full-text available
In recent years, linguistic generative steganography has been greatly developed. The previous works are mainly to optimize the perceptual-imperceptibility and statistical-imperceptibility of the generated steganographic text, and the latest developments show that they have been able to generate steganographic texts that look authentic enough. However, we noticed that these works generally cannot control the semantic expression of the generated steganographic text, and we believe this will bring potential security risks. We named this kind of security challenges as cognitive-imperceptibility. We think this is a new challenge that the generative steganography models must strive to overcome in the future. In this paper, we conduct some preliminary attempts to solve this challenge. Experimental results show that the proposed methods can further constrain the semantic expression of the generated steganographic text on the basis of ensuring certain perceptual-imperceptibility and statistical-imperceptibility, so as to enhance its cognitive-imperceptibility.
... Rizzo et al. [14] proposed a technique that substitutes Latin symbols with homoglyph symbols to prevent any visual change in the content of the text document. They use alternative Unicode symbols that look nearly the same as the symbols being substituted, and they embed a secret password through these substitutions. In [15], glyphs are also used to hide the watermark. ...
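The homoglyph substitution summarized in this excerpt can be sketched in a few lines of Python. This is an illustration of the general idea only, not the scheme of Rizzo et al.: the Latin-to-Cyrillic table below is a small hypothetical subset, the one-bit-per-carrier encoding is an assumption, and the derivation of the bit string from a password is omitted. The sketch also assumes the cover text contains no Cyrillic look-alikes to begin with.

```python
# Hypothetical homoglyph table: Latin letter -> visually similar Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}
REVERSE = {v: k for k, v in HOMOGLYPHS.items()}


def embed_bits(text, bits):
    """Encode one bit per carrier letter: 1 = homoglyph, 0 = original."""
    out = []
    remaining = iter(bits)
    pending = next(remaining, None)
    for ch in text:
        if pending is not None and ch in HOMOGLYPHS:
            out.append(HOMOGLYPHS[ch] if pending else ch)
            pending = next(remaining, None)
        else:
            out.append(ch)
    if pending is not None:
        raise ValueError("cover text too short for the payload")
    return "".join(out)


def extract_bits(text, n):
    """Read the first n bits back from the carrier positions."""
    bits = []
    for ch in text:
        if len(bits) == n:
            break
        if ch in REVERSE:
            bits.append(1)
        elif ch in HOMOGLYPHS:
            bits.append(0)
    return bits


cover = "a peaceful ocean, explored by canoe"
payload = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_bits(cover, payload)
print(len(marked) == len(cover), extract_bits(marked, len(payload)) == payload)
```

Because every substitution swaps one code point for another, the character count of the text is unchanged, which matches the length-preservation property the citing excerpts attribute to the method.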
Article
Full-text available
Today, when everyone is surrounded by digital data, the rapid development of internet-related technologies such as cloud databases and social media platforms raises the problem of information security. Anyone can easily generate identical copies of any text document. To protect the data or content, that is, someone's intellectual property, measures are needed to verify the authenticity of a work. Copyright protection is one of the most important and difficult challenges for researchers. For verification, many researchers have proposed algorithms to embed a watermark in a text document (in simple terms, embedding a watermark means hiding secret information in the document that is hard to detect). The proposed approach uses the application class property of an MS-Word document, the RGB colour values of the text, and the spacing between lines to hide the watermark in a monocoloured MS-Word text document. Several different documents, subjected to different types of modification attacks, are used for evaluation. © SSRG International Journal of Engineering Trends and Technology. All rights reserved.
... These features are its ability to encode hierarchical data using plain text and the fact that, thanks to the detailed tags it uses, the document can be understood without a special reader or interpreter. XML is frequently used in information exchange, web services, business-to-consumer or business-to-business communication, and even medical applications. Considering example studies in the literature and the properties of XML fields, in this work we propose a new, hard-to-detect watermarking scheme that is applicable to XML fields, causes no distortion, can embed a watermark regardless of the type of data the tags contain, does not require the watermark data to be certified and stored as external data in a separate location, does not change the size of the data, and allows recovery of the original data after watermark extraction. To reduce the perceptibility of the watermark, increase the watermarking capacity, and perform watermarking independently of the data type when embedding watermarks into XML data fields in databases, the proposed method makes use of the Homoglyph Transformation method also used in [17]. Most studies in the literature can watermark the text types found in the content of XML data. ...
... Rizzo et al. [10] proposed a technique that embeds a password-based watermark in short text while strictly preserving the content. When the text is converted into an image, the content and appearance do not change. ...
Article
Full-text available
In the current era, information security is a top priority for all organizations. With the rapid development of Internet technologies such as the Internet of Things (IoT), Big Data, and Cloud Computing, individuals, government officials, and the military face data security problems. Given the massive rate of data growth, it is a challenging task for researchers to manage vast amounts of data safely and effectively while designing smart cities. It has become quite easy to produce illegal copies of digital content. Verification of digital content is one of the major issues because digital content is generated daily and shared via the Internet. Limited techniques are available for document copyright protection, and most existing techniques produce distortion during watermark insertion or lack capacity. From this perspective, a digital watermarking technique is proposed for document copyright protection and ownership verification with the help of data mining. Data mining techniques are applied to find suitable properties of the document for embedding the watermark. The proposed system provides copyright protection for text documents in local and cloud computing paradigms. To evaluate the proposed technique, twenty different text documents are subjected to attacks such as formatting, insertion, and deletion attacks. The proposed technique attains a high level of imperceptibility, with Peak Signal-to-Noise Ratio (PSNR) values between 64.67% and 71.03% and Similarity (SIM) percentages between 99.92% and 99.99%. The proposed technique is robust against formatting attacks, and its capacity is also improved compared to previous techniques.
... LSB embedding is applied to watermark an image for its security. However, LSB is not considered a reliable image watermarking technique, because it works in the spatial domain and the secret data in an LSB-based watermarked image can easily be identified [23]. Multiple text images have been hidden in a single coloured image using a modified LSB substitution method. ...
Article
Full-text available
Nowadays, information hiding is becoming a useful technique and attracts more attention due to the fast growth of internet use; it is applied to send secret information using different techniques. Watermarking is one of the most important techniques in information hiding. Watermarking hides secret data in a carrier medium to provide privacy and integrity of information, so that no one except the sender and receiver can recognize or detect it. In watermarking, many carrier formats can be used, such as image, video, audio, and text. Text is the most widely used carrier because of its frequency on the internet. Many techniques are available for text watermarking, each with its own strengths and weaknesses. In this study, we conducted a review of text watermarking in the spatial domain, exploring the topic by reviewing, collecting, synthesizing, and analyzing the challenges of related studies published from 2013 to 2018. The aim of this paper is to provide an overview of text watermarking and a comparison of the selected studies according to Arabic text characters, payload capacity, imperceptibility, authentication, and embedding technique, in order to open important research issues for future work toward a robust method.
... Lower imperceptibility and robustness. 11 [90] x x x x x Memory complexity Good imperceptibility but high memory complexity 12 [ This algorithm shows high imperceptibility as well as robustness for conversion, copying, and addition and deletion attacks. The robustness evaluation proves that the proposed algorithm tolerates most of the possible attacks and is able to extract the watermark with high accuracy. ...
Article
Full-text available
In recent years, preserving the integrity of digital text has become a focus of interest in the transmission of online content on the Internet. Watermarking has become a useful tool in the protection of digital text content, as it addresses the problems of tampering, duplication, unauthorized access, and security breaches. The rapid development currently observable in information transfer and access is a consequence of the widespread usage of the Internet. Among the different types of digital data, text is the most complex and challenging type to which watermarking can be applied. Text watermarking is a highly complex task, above all because only limited research has been done in this field. To ensure successful evaluation, analysis, and implementation, comprehensive research needs to be performed. This article studies the theory, methods, and applications of text watermarking, including a discussion of the definition, embedding and extracting processes, requirements, approaches, and language applications of established text watermarking methods. The article reviews in detail a new classification of text watermarking, based on the embedding process and its related issues of attacks and language applicability. Open research challenges and future directions are also investigated, with a focus on information integrity, information availability, originality preservation, information confidentiality, protection of sensitive information, document transformation, cryptography application, and language flexibility.
... Unformatted text, such as Notepad files and computer source code, carries no formatting information beyond some basic information. Unlike formatted text, it is difficult to embed a watermark in it, so there is little research in this field. Some even think that it is unachievable to embed a watermark in unformatted documents [9]. ...
Chapter
We introduce the novel Nearest Pattern Constrained String (NPCS) problem of finding a minimum set Q of character mutation, insertion, and deletion edit operations sufficient to modify a string x to contain all contiguous words in a pattern set P and no contiguous words in a forbidden pattern set F. Letting Σ be the alphabet of allowed characters, and letting η and Υ be the longest string length and sum of all string lengths in P∪F, respectively, we show that NPCS is fixed-parameter tractable in |P| with time complexity O(2^|P| · Υ · |Σ| · (|P| + η)(|x| + 1)).
Chapter
Security plays an important role in many sectors and industries. Nowadays, scams and illegal activities are spreading around the world. Copyright protection for PDF documentation is one of the focuses of digital watermarking. Hence, a study was conducted on related research in digital watermarking of PDF, which is a text documentation format. This paper presents a review of watermarking techniques, their characteristics, and possible attacks on text document formats. It also compares existing schemes across different domains. The results show that the dominant approaches in text digital watermarking are the structural approach and the hybrid approach. In the experimental phase, attacks take a few forms, such as insertion attacks and removal attacks. These are frequently used to test the robustness of digital watermarks embedded in an object.
Conference Paper
A web extension is software that can be installed on a web browser. A web-extension link is displayed as an icon on the toolbar of the browser. Based on browsing activity, the extension works automatically or when the extension icon is clicked, depending on the functionality built into the extension software. In this work, we developed a web extension for the Google Chrome browser to verify online texts by simply clicking on an extension button. Upon clicking the button, the underlying algorithm in the extension software retrieves the texts from the web page currently being displayed. Verification and authentication of texts are performed by comparing the retrieved texts with a text database. According to the comparison, texts are highlighted in colors. We consider authentication of Arabic Hadith texts as a case study. Authentic Hadith texts are highlighted in green; authentic texts with partial diacritics, in yellow; and unauthentic texts, in red. This technique can also be used to authenticate laws, constitutions and government documents in any language.
Book
Because intelligent data hiding, in the form of digital watermarking and steganography, has a wide range of applications in the modern digital world, almost all available books focus on applications of information hiding in multimedia such as image, video, audio, and network. This book is written to give newcomers a clear understanding of all the possible applications of intelligent text hiding in the digital world. For this purpose, the existing techniques and new directions in digital text watermarking and text steganography are described in simple and understandable language. Since digital text hiding contributes broadly to cybersecurity science, the authors believe that each sort of digital text hiding for digital contents requires a separate chapter for complete discussion. Hence, every chapter of this book is assigned to a specific application of text hiding within text documents, source codes, text messages, Cyber Physical Systems (CPS), Blockchain, Bitcoin, and password security. This book is appropriate for newcomers and beginners who have no prior knowledge of digital text hiding concepts. To motivate newcomers to choose a text hiding field, become familiar with it, and grasp its fundamental concepts, the book is organized so that they can concentrate on a single chapter. For deeper detail on a chapter, newcomers should study the state-of-the-art references in that chapter. For course usage, the authors suggest that students become familiar with the fundamental concepts of natural language processing, digital signal processing, cryptography, software engineering, network engineering, machine learning, digital design, and related subjects.
Conference Paper
Watermarking natural language is still a challenge in the domain of digital watermarking. Here, only the textual information may be used as a cover. No format changes or modified illustrations are accepted. Still, natural language watermarking (NLW) has some important applications, especially in leakage tracking, where a small set of individually marked copies of a confidential text is distributed. Properties of watermarking schemes such as imperceptibility, blindness or adaptability to non-English languages are of importance here. In order to address these three simultaneously, we present a blind NLW scheme, consisting of four independent embedding methods, which operate on the phonetical, morphological, lexical and syntactical layers of German texts. An evaluation based on 1,645 assessments provided by 131 test persons reveals promising results.
Article
Many have argued that technologies used to protect copyrighted works usually go beyond the letter of the law and subsequently impinge on interests relating to freedom of information and expression, privacy and free choice. Classic examples are technologies that prevent CDs or DVDs from being accessed or copied under certain conditions, or that block or filter-out copyright-protected materials. This article assesses digital text-watermarking, which does not restrict users’ access to or use of works, but individualises every user’s copy by changing the formatting or words in a text (e.g. “not visible” for “invisible”). Every purchaser/user receives a unique version of the work, meaning that, if there is any illegal upload or usage, it is possible to determine which user the copy came from. The technology thereby allows legal (and illegal) use to be undertaken, but serves as a tool for enforcement when there is illegal use. This article assesses digital text-watermarking from a comparative law perspective, particularly the Civil Law and the Common Law traditions.
Article
This paper proposes a text-based data hiding method to insert external information into a Microsoft Word document. First, the drawback of low embedding efficiency in the existing text-based data hiding methods is addressed, and a simple attack, DASH, is proposed to reveal the information inserted by the existing text-based data hiding methods. Then, a new data hiding method, UniSpaCh, is proposed to counter DASH. The characteristics of Unicode space characters with respect to embedding efficiency and DASH are analyzed, and the selected Unicode space characters are inserted into inter-sentence, inter-word, end-of-line and inter-paragraph spacings to encode external information while improving embedding efficiency and imperceptibility of the embedded information. UniSpaCh is also reversible, in that the embedded information can be removed to completely reconstruct the original Microsoft Word document. Experiments were carried out to verify the performance of UniSpaCh and to compare it with the existing space-manipulating data hiding methods. Results suggest that UniSpaCh offers higher embedding efficiency while exhibiting higher imperceptibility of white space manipulation when compared to the existing methods considered. In the best case scenario, UniSpaCh produces an output document almost 9 times smaller than that of the existing method.
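The idea of encoding bits in the choice of Unicode space character can be illustrated with a minimal Python sketch. This is not UniSpaCh itself: the four-character alphabet `SPACES`, the two-bits-per-gap encoding, and the function names `embed`, `extract` and `strip_mark` are all assumptions made for illustration, and the sketch assumes a cover text whose only separators are plain U+0020 spaces.

```python
# Hypothetical 2-bit alphabet of Unicode space characters (an assumption,
# not the set of characters selected in the paper).
SPACES = ["\u0020", "\u2004", "\u2005", "\u2009"]


def embed(cover: str, bits: str) -> str:
    """Replace each inter-word space with the space that encodes the next 2 bits."""
    words = cover.split(" ")
    capacity = 2 * (len(words) - 1)
    if len(bits) % 2 or len(bits) > capacity:
        raise ValueError("payload must be an even number of bits within capacity")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        chunk = bits[2 * i: 2 * i + 2]
        # Gaps beyond the payload keep the ordinary space.
        sep = SPACES[int(chunk, 2)] if chunk else SPACES[0]
        out.append(sep + word)
    return "".join(out)


def extract(stego: str, n_bits: int) -> str:
    """Read 2 bits per space character; the payload length must be known."""
    bits = "".join(format(SPACES.index(ch), "02b") for ch in stego if ch in SPACES)
    return bits[:n_bits]


def strip_mark(stego: str) -> str:
    """Reverse the embedding by mapping every alternative space back to U+0020."""
    for space in SPACES[1:]:
        stego = stego.replace(space, " ")
    return stego
```

Because every substitution is one character for one character, the marked text has the same length as the cover, and `strip_mark` restores the cover exactly, which mirrors the reversibility property the abstract describes.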
Book
Until recently, information hiding techniques received very much less attention from the research community and from industry than cryptography. This situation is, however, changing rapidly and the first academic conference on this topic was organized in 1996. The main driving force is concern over protecting copyright; as audio, video and other works become available in digital form, the ease with which perfect copies can be made may lead to large-scale unauthorized copying, and this is of great concern to the music, film, book and software publishing industries. At the same time, moves by various governments to restrict the availability of encryption services have motivated people to study methods by which private messages can be embedded in seemingly innocuous cover messages. This book surveys recent research results in the fields of watermarking and steganography, two disciplines generally referred to as information hiding. Included are chapters about the following topics: Chapter 1: Introduction to information hiding (Fabien A. P. Petitcolas) gives an introduction to the field of information hiding, thereby discussing the history of steganography and watermarking and possible applications to modern communication systems. Chapter 2: Principles of steganography (Stefan Katzenbeisser) introduces a model for steganographic communication (the "prisoners' problem") and discusses various steganographic protocols (such as pure steganography, secret key steganography, public key steganography and supraliminal channels). Chapter 3: A survey of steganographic techniques (Neil F. Johnson and Stefan Katzenbeisser) discusses several information hiding methods useable for steganographic communication, among them substitution systems, hiding methods in two-colour images, transform domain techniques, statistical steganography, distortion and cover generation techniques. Chapter 4: Steganalysis (Neil F. Johnson) introduces the concepts of steganalysis – the task of detecting and possibly removing steganographic information. Included is also an analysis of common steganographic tools. Chapter 5: Introduction to watermarking techniques (Martin Kutter and Frank Hartung) introduces the requirements and design issues for watermarking software. The authors also present possible applications for watermarks and discuss methods for evaluating watermarking systems. Chapter 6: A survey of current watermarking techniques (Jean-Luc Dugelay and Stéphane Roche) presents several design principles for watermarking systems, among them the choice of host locations, psychovisual aspects, the choice of a workspace (DFT, DCT, wavelet), the format of the watermark bits (spread spectrum, low-frequency watermark design), the watermark insertion operator and optimizations of the watermark receiver. Chapter 7: Robustness of copyright marking systems (Scott Craver, Adrian Perrig and Fabien A. P. Petitcolas) discusses the crucial issue of watermark robustness to intentional attacks. The chapter includes a taxonomy of possible attacks against watermarking systems, among them protocol attacks like inversion, oracle attacks, limitations of WWW spiders and system architecture issues. Chapter 8: Fingerprinting (Jong-Hyeon Lee) discusses principles and applications of fingerprinting to the traitor tracing problem, among them statistical fingerprinting, asymmetric fingerprinting and anonymous fingerprinting. Chapter 9: Copyright on the Internet and watermarking (Stanley Lai and Fabrizio Marongiu Buonaiuti) finally discusses watermarking systems from a legal point of view and addresses various other aspects of copyright law on the Internet.
Conference Paper
In this paper we discuss natural language watermarking, which uses the structure of the sentence constituents in natural language text in order to insert a watermark. This approach is different from techniques, collectively referred to as "text watermarking," which embed information by modifying the appearance of text elements, such as lines, words, or characters. We provide a survey of the current state of the art in natural language watermarking and introduce terminology, techniques, and tools for text processing. We also examine the parallels and differences of the two watermarking domains and outline how techniques from the image watermarking domain may be applicable to the natural language watermarking domain.
Conference Paper
Information-hiding in natural language text has mainly consisted of carrying out approximately meaning-preserving modifications on the given cover text until it encodes the intended mark. A major technique for doing so has been synonym-substitution. In these previous schemes, synonym substitutions were done until the text "confessed", i.e., carried the intended mark message. We propose here a better way to use synonym substitution, one that is no longer entirely guided by the mark-insertion process: It is also guided by a resilience requirement, subject to a maximum allowed distortion constraint. Previous schemes for information hiding in natural language text did not use numeric quantification of the distortions introduced by transformations, they mainly used heuristic measures of quality based on conformity to a language model (and not in reference to the original cover text). When there are many alternatives to carry out a substitution on a word, we prioritize these alternatives according to a quantitative resilience criterion and use them in that order. In a nutshell, we favor the more ambiguous alternatives. In fact not only do we attempt to achieve the maximum ambiguity, but we want to simultaneously be as close as possible to the above-mentioned distortion limit, as that prevents the adversary from doing further transformations without exceeding the damage threshold; that is, we continue to modify the document even after the text has "confessed" to the mark, for the dual purpose of maximizing ambiguity while deliberately getting as close as possible to the distortion limit. The quantification we use makes possible an application of the existing information- ...
∗ Portions of this work were supported by Grants IIS-0325345, IIS-0219560, IIS-0312357, and IIS-0242421 from the National Science Foundation, and by sponsors of the Center for Education and Research in Information Assurance and Security.
Conference Paper
We describe a scheme for watermarking natural language text by embedding small portions of the watermark bit string in the syntactic structure of a number of selected sentences in the text, with both the selection and embedding keyed (via quadratic residue) to a large prime number. Meaning-preserving transformations of sentences of the text (e.g., translation to another natural language) cannot damage the watermark. Meaning-modifying transformations have a probability of damaging the watermark that is proportional to the watermark length over the number of sentences. Having the key is all that is required for reading the watermark. The approach is best suited for longish meaning- rather than style-oriented "expository" texts (e.g., reports, directives, manuals, etc.), which governments and industry produce in abundance and which need protection more frequently than fiction or poetry, which are not so tolerant of the small meaning-preserving syntactic changes that the scheme implements.
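The abstract states that selection and embedding are keyed via quadratic residue to a large prime but gives no further detail. As a hedged illustration of the number-theoretic test only, the sketch below uses Euler's criterion to decide whether a value is a quadratic residue modulo an odd prime. The function names `is_quadratic_residue` and `keyed_bit`, the idea of deriving one bit per numeric value, and the small prime 23 are assumptions for readability, not the scheme's actual procedure.

```python
def is_quadratic_residue(a: int, p: int) -> bool:
    """Euler's criterion: for an odd prime p and a not divisible by p,
    a is a quadratic residue mod p iff a^((p-1)/2) == 1 (mod p)."""
    return pow(a % p, (p - 1) // 2, p) == 1


def keyed_bit(value: int, p: int) -> int:
    """Hypothetical helper: derive one key-dependent bit from a numeric value."""
    return 1 if is_quadratic_residue(value, p) else 0


# With the secret prime p = 23, the residues among 1..22 are the squares mod 23.
residues = [a for a in range(1, 23) if is_quadratic_residue(a, 23)]
```

Without knowledge of the prime, an observer cannot tell which values map to 1 and which to 0, which is the sense in which such a test can act as a key.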
Article
Watermarking allows robust and unobtrusive insertion of information in a digital document. Very recently, techniques have been proposed for watermarking relational databases or XML documents, where information insertion must preserve a specific measure on data (e.g. mean and variance of numerical attributes). In this paper we investigate the problem of watermarking databases or XML while preserving a set of parametric queries in a specified language, up to an acceptable distortion. We first observe that unrestricted databases cannot be watermarked while preserving trivial parametric queries. We then exhibit query languages and classes of structures that allow guaranteed watermarking capacity, namely 1) local query languages on structures with bounded degree Gaifman graph, and 2) monadic second-order queries on trees or tree-like structures. We relate these results to an important topic in computational learning theory, the VC-dimension. We finally consider incremental aspects of query-preserving watermarking.
Article
Modern computer networks make it possible to distribute documents quickly and economically by electronic means rather than by conventional paper means. However, the widespread adoption of electronic distribution of copyrighted material is currently impeded by the ease of unauthorized copying and dissemination. In this paper we propose techniques that discourage unauthorized distribution by embedding each document with a unique codeword. Our encoding techniques are indiscernible by readers, yet enable us to identify the sanctioned recipient of a document by examination of a recovered document. We propose three coding methods, describe one in detail, and present experimental results showing that our identification techniques are highly reliable, even after documents have been photocopied.
Article
A way to discourage illicit reproduction of copyrighted or sensitive documents is to watermark each copy before distribution. A unique mark is embedded in the text whose recipient is registered. The mark can be extracted from a possibly noisy illicit copy, identifying the registered recipient. Most image marking techniques are vulnerable to binarization attack and, hence, not suitable for text marking. We propose a different approach where a text document is marked by shifting certain text lines slightly up or down or words slightly left or right from their original positions. The shifting pattern constitutes the mark and is different on different copies. In this paper we develop and evaluate a method to detect such minute shifts. We describe a marking and identification prototype that implements the proposed method. We present preliminary experimental results which suggest that centroid detection performs remarkably well on line shifts even in the presence of severe distortions introduced by printing, photocopying, scanning, and facsimile transmission.
Article
Multimedia watermarking technology has evolved very quickly during the last few years. A digital watermark is information that is imperceptibly and robustly embedded in the host data such that it cannot be removed. A watermark typically contains information about the origin, status, or recipient of the host data. In this tutorial paper, the requirements and applications for watermarking are reviewed. Applications include copyright protection, data monitoring, and data tracking. The basic concepts of watermarking systems are outlined and illustrated with proposed watermarking methods for images, video, audio, text documents, and other media. Robustness and security aspects are discussed in detail. Finally, a few remarks are made about the state of the art and possible future developments in watermarking technology.
Conference Paper
Text steganography is hiding text in text. A hidden text gets hidden in a cover text to produce a plain looking stego text. This plain looking stego text is posted as the message which no one suspects to contain anything concealed. Today, text messages are a common mode of communication over the internet and it is associated with a huge amount of traffic. Steganography is an added layer of protection that can be used for security and privacy. In this paper, we describe a text steganography approach that provides a good capacity and maintains a high difficulty of decryption. We make use of approaches of space manipulation, linguistic translation and Unicode homoglyphs in our algorithm. Our implementation is in Python. Also, we explain a parallel approach for hiding large hidden text messages in large cover text messages.
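The homoglyph component mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of this paper or of the watermarking method the page describes: the six-entry `HOMOGLYPHS` table (Latin letters mapped to similar-looking Cyrillic letters) and the function names `embed` and `extract` are hypothetical, and the sketch assumes the cover text contains no Cyrillic letters to begin with.

```python
# Hypothetical table: Latin letter -> visually similar Cyrillic letter.
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435",
    "o": "\u043e", "p": "\u0440", "x": "\u0445",
}
REVERSE = {glyph: latin for latin, glyph in HOMOGLYPHS.items()}


def embed(cover: str, bits: str) -> str:
    """Each substitutable letter carries one bit: original = 0, homoglyph = 1."""
    out, used = [], 0
    for ch in cover:
        if ch in HOMOGLYPHS and used < len(bits):
            out.append(HOMOGLYPHS[ch] if bits[used] == "1" else ch)
            used += 1
        else:
            out.append(ch)
    if used < len(bits):
        raise ValueError("cover text too short for payload")
    return "".join(out)


def extract(stego: str, n_bits: int) -> str:
    """Recover the first n_bits by checking which form each carrier letter has."""
    bits = []
    for ch in stego:
        if ch in HOMOGLYPHS:
            bits.append("0")
        elif ch in REVERSE:
            bits.append("1")
    return "".join(bits)[:n_bits]
```

The marked string has the same number of characters as the cover and looks the same in many fonts, but it is a different code-point sequence, so any normalization step that folds confusable characters would erase the payload.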
Conference Paper
SipHash is a family of pseudorandom functions optimized for short inputs. Target applications include network traffic authentication and hash-table lookups protected against hash-flooding denial-of-service attacks. SipHash is simpler than MACs based on universal hashing, and faster on short inputs. Compared to dedicated designs for hash-table lookup, SipHash has well-defined security goals and competitive performance. For example, SipHash processes a 16-byte input with a fresh key in 140 cycles on an AMD FX-8150 processor, which is much faster than state-of-the-art MACs. We propose that hash tables switch to SipHash as a hash function.
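To make the construction concrete, here is a compact pure-Python transcription of SipHash-2-4 (two compression rounds per 8-byte word, four finalization rounds), producing a 64-bit value from a 16-byte key. It is a readable sketch for illustration, not a fast or constant-time implementation; the function names `siphash24`, `_rotl` and `_sipround` are chosen here and are not part of any library.

```python
MASK = 0xFFFFFFFFFFFFFFFF  # keep all arithmetic in 64 bits


def _rotl(x: int, b: int) -> int:
    return ((x << b) | (x >> (64 - b))) & MASK


def _sipround(v0: int, v1: int, v2: int, v3: int):
    v0 = (v0 + v1) & MASK; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
    v2 = (v2 + v3) & MASK; v3 = _rotl(v3, 16); v3 ^= v2
    v0 = (v0 + v3) & MASK; v3 = _rotl(v3, 21); v3 ^= v0
    v2 = (v2 + v1) & MASK; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)
    return v0, v1, v2, v3


def siphash24(key: bytes, data: bytes) -> int:
    """Return the 64-bit SipHash-2-4 value of data under a 16-byte key."""
    if len(key) != 16:
        raise ValueError("key must be 16 bytes")
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:], "little")
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    # Split into little-endian 64-bit words; the final word holds the
    # leftover bytes plus the message length (mod 256) in its top byte.
    body_end = len(data) - (len(data) % 8)
    words = [int.from_bytes(data[i:i + 8], "little") for i in range(0, body_end, 8)]
    words.append(int.from_bytes(data[body_end:], "little") | ((len(data) & 0xFF) << 56))

    for m in words:
        v3 ^= m
        for _ in range(2):
            v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        v0 ^= m

    v2 ^= 0xFF
    for _ in range(4):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    return v0 ^ v1 ^ v2 ^ v3
```

A quick sanity check is the test vector published with the SipHash design: key bytes `00 01 ... 0f` and message bytes `00 01 ... 0e` should give `0xa129ca6149be45e5`.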
Article
Digital watermarking is a copyright protection technique used to embed specific data in a cover file to prevent illegal use. In this research, invisible digital watermarking based on the text information contained in a webpage is proposed. Watermarks are based on predefined semantic and syntactic rules, which are encrypted and then converted into whitespace using binary controlled characters before embedding into a webpage. Structural means of HTML (Hyper Text Markup Language) are used as a cover file to embed the formulated watermarks. The proposed system has been validated against various attacks to find optimum robustness.
Article
Copyright protection of plain text while traveling over the internet is crucial. Digital watermarking provides a complete copyright protection solution for this problem. Text, being the most dominant medium travelling over the internet, needs absolute protection. Text watermarking techniques have been developed in the past to protect text from illegal copying and redistribution and to prevent copyright violations. This paper presents a review of some of the recent research in watermarking techniques for plain text documents. The reviewed approaches are classified into three categories: the image based approach, the syntactic approach and the semantic approach. This paper discusses the main contributions, advantages and drawbacks of different methods used for text watermarking in the past.
Conference Paper
Security issues of text watermarking are greatly different from those of other multimedia watermarking, in terms of its specific requirements and characteristics of text watermarking. The security theory of text watermarking is proposed in this paper, and the following security topics are discussed: (i) the classification and application of text watermarking; (ii) the classification and analysis of attacks; (iii) the watermarking model and security countermeasures. Other open issues and further challenges related to text watermarking are also addressed.
Article
We develop a morphosyntax-based natural language watermarking scheme. In this scheme, a text is first transformed into a syntactic tree diagram where the hierarchies and the functional dependencies are made explicit. The watermarking software then operates on the sentences in syntax tree format and executes binary changes under control of Wordnet and Dictionary to avoid semantic drops. A certain level of security is provided via key-controlled randomization of morphosyntactic tools and the insertion of void watermark. The security aspects and payload aspects are evaluated statistically while the imperceptibility is measured using edit-hit counts based on human judgments. It is observed that agglutinative languages are somewhat more amenable to morphosyntax-based natural language watermarking and the free word order property of a language, like Turkish, is an extra bonus.
Article
This paper proposes a novel watermarking algorithm for grayscale text document images. The algorithm inserts the watermark signals through edge direction histograms. The concept of sub-image consistency is developed. The concept means the sub-images have similar-shaped edge direction histograms and it is shown to be valid over a wide range of document images. Algorithms to insert and detect watermark signals are proposed. The experiments performed with various document images produced plausible results in terms of robustness.
Conference Paper
In this paper, we present a scheme for embedding data in copies (color or monochrome) of predominantly text pages that may also contain color images or graphics. Embedding data imperceptibly in documents or images is a key ingredient of watermarking and data hiding schemes. It is comparatively easy to hide a signal in natural images since the human visual system is less sensitive to signals embedded in noisy image regions containing high spatial frequencies. In other instances, e.g. simple graphics or monochrome text documents, additional constraints need to be satisfied to embed signals imperceptibly. Data may be embedded imperceptibly in printed text by altering some measurable property of a font such as position of a character or font size. This scheme, however, is not very useful for embedding data in copies of text pages, as that would require accurate text segmentation and possibly optical character recognition, both of which would deteriorate the error rate performance of the data-embedding system considerably. Similarly, other schemes that alter pixels on text boundaries have poor performance due to boundary-detection uncertainties introduced by scanner noise, sampling and blurring. The scheme presented in this paper ameliorates the above problems by using a text-region based embedding approach. Since the bulk of documents reproduced today contain black on white text, this data-embedding scheme can form a print-level layer in applications such as copy tracking and annotation.
Conference Paper
This paper describes a feature calibration scheme for use in embedding and detecting watermarks in document images. Such watermarks have a variety of uses, including copyright protection, content identification, and tamper-proofing. In general, a watermark is encoded as a displacement of certain features that can be extracted from target document images. One of the technical challenges is reliable detection of the displacement when images are distorted by print-and-scan processes. We propose a calibration method that uses the difference between two features extracted from two sets of partitions arranged symmetrically. Since this method counter-balances the cumulative effects on the features of distortions added in the print-and-scan process, the displacement can be reliably detected. The feasibility of the method was investigated by using the average width of character strokes as a feature.
Article
Digital watermarking is widely believed to be a valid means to discourage illicit distribution of information content. Digital watermarking methods for text documents are limited because of the binary nature of text documents. A distinct feature of a text document is its space patterning. We propose a new approach in text watermarking in which interword spaces of different text lines are slightly modified. After the modification, the average spaces of various lines have the characteristics of a sine wave and the wave constitutes a mark. Both nonblind and blind watermarking algorithms are discussed. Preliminary experiments have shown promising results. Our experiments suggest that space patterning of text documents can be a useful tool in digital watermarking.
E. Sandhaus. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752, 2008.
M. Davis and M. Suignard. Unicode security mechanisms. Unicode technical standard #39, Unicode. http://www.unicode.org/reports/tr39/.