Conference Paper

Objectionable Content Filtering by Click-Through Data


Abstract

This paper explores users’ browsing intents to predict the category of a user’s next access during web surfing and applies the results to objectionable content filtering. A user’s access trail, represented as a sequence of URLs, reveals the contextual information of web browsing behavior. We extract behavioral features of each clicked URL, i.e., hostname, bag-of-words, gTLD, IP, and port, to develop a linear-chain CRF model for context-aware category prediction. Large-scale experiments show that our method achieves a promising accuracy of 0.9396 for objectionable access identification without requesting the corresponding page content. Error analysis indicates that our proposed model yields a low false positive rate of 0.0571. In real-life filtering simulations, our proposed model achieves a macro-averaged blocking rate of 0.9271 while maintaining a favorably low macro-averaged over-blocking rate of 0.0575 for collaborative filtering of objectionable content as the dynamic web changes over time.
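The behavioral features the abstract names (hostname, bag-of-words, gTLD, IP, port) can be sketched as a small URL feature extractor. This is a minimal illustration assuming straightforward tokenization and default-port rules, not the paper's exact preprocessing:

```python
import re
from urllib.parse import urlparse

def extract_url_features(url):
    """Extract the behavioral features named in the abstract from one
    clicked URL. Tokenization and port defaults are illustrative."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    # bag-of-words: alphanumeric tokens drawn from host and path
    words = re.findall(r"[a-z0-9]+", (host + " " + parsed.path).lower())
    is_ip = bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))
    gtld = "" if is_ip else host.rsplit(".", 1)[-1]
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return {"hostname": host, "words": words, "gTLD": gtld,
            "is_ip": is_ip, "port": port}

feats = extract_url_features("https://example.com/news/article-42")
```

Each clicked URL in the trail would be mapped to such a feature dictionary before being fed to the sequence model.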


... Preventing the display of inappropriate results while also avoiding the over-filtering of results that may appear objectionable but are not, e.g., an article on breast cancer [5], requires a solution that goes beyond safe search. To account for the large variety of objectionable material present online, and inspired by prior strategies to detect objectionable resources [14,47], we treat as objectionable for children in the classroom any resource that relates to a category in ObjCat: Abortion, Drugs, Hate Speech, Illegal Affairs, Gambling, Pornography, and Violence. Note that the Drugs category covers not only drugs but also alcohol, tobacco, and marijuana. ...
Preprint
Full-text available
We introduce a novel re-ranking model that aims to augment the functionality of standard search engines to support classroom search activities for children (ages 6 to 11). This model extends the known listwise learning-to-rank framework by balancing risk and reward. Doing so enables the model to prioritize Web resources of high educational alignment, appropriateness, and adequate readability by analyzing the URLs, snippets, and page titles of Web resources retrieved by a given mainstream search engine. Experimental results, including an ablation study and comparisons with existing baselines, showcase the correctness of the proposed model. The outcomes of this work demonstrate the value of considering multiple perspectives inherent to the classroom setting, e.g., educational alignment, readability, and objectionability, when applied to the design of algorithms that can better support children's information discovery.
... The main advantage of URL-based filtering is the speed at which it filters incoming Internet traffic, though some of these approaches may lack the desired accuracy. Some approaches have employed both URL- and content-based filtering: first the URL is analyzed; in the case of a positive detection, the page content is then analyzed, and the decision whether to block that particular URL or let it pass the filter is made afterwards [7]. Although this approach has higher accuracy, analyzing the content of every suspicious URL is expensive in terms of both time and processing. ...
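The two-stage URL-then-content decision described in this excerpt can be sketched as a short pipeline. All four callables below are hypothetical stand-ins, not any cited system's actual classifiers:

```python
def two_stage_filter(url, fetch_content, url_suspicious, content_objectionable):
    """Two-stage filtering: a cheap URL check first; only on a positive
    hit is the (expensive) page content fetched and analyzed."""
    if not url_suspicious(url):
        return "allow"            # fast path: URL looks clean
    page = fetch_content(url)     # expensive step, suspects only
    return "block" if content_objectionable(page) else "allow"

# toy stand-ins for the two classifiers and the fetcher
suspect = lambda u: "casino" in u
bad = lambda text: "jackpot" in text
pages = {"http://casino.example/x": "win the jackpot now",
         "http://casino.example/faq": "responsible gaming help",
         "http://news.example/y": "daily headlines"}
fetch = pages.get

decisions = {u: two_stage_filter(u, fetch, suspect, bad) for u in pages}
```

The content analyzer runs only for URLs flagged as suspicious, which is exactly the cost trade-off the excerpt describes.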
Article
Full-text available
Web content filtering is one among many techniques to limit the exposure of selective content on the Internet. Such filtering has become commonplace over time, yet filtering multilingual web content is still a difficult task, especially in a big-data landscape. The enormity of the data increases the challenge of developing an effective content filtering system that can work in real time. There are several systems that can filter URLs based on artificial intelligence techniques to identify sites with objectionable content. Most of these systems classify URLs only in the English language; they either fail to respond when multilingual URLs are processed, or over-blocking is experienced. This paper introduces a filtering system that can classify multilingual URLs based on predefined criteria for the URL, title, and metadata of a web page. Ontological approaches along with local multilingual dictionaries are used as the knowledge base to facilitate the challenging task of blocking URLs that do not meet the filtering criteria. The proposed work shows high accuracy in classifying multilingual URLs into two categories, white and black. Evaluation results on a large dataset show that the proposed system achieves promising accuracy, on a par with the state-of-the-art literature on semantic-based URL filtering.
Article
This study explores URL click-through behaviour to predict the category of users’ online information accesses and applies the results to progressively filter objectionable accesses during web surfing. Each clicked URL is represented with an embedding technique and fed into a Bidirectional Long Short-Term Memory neural network cascaded with a Conditional Random Field (BiLSTM-CRF) model to predict the category of a user’s access. Large-scale experiments on click-through data from nearly one million real users show that our proposed BiLSTM-CRF model achieves promising results. The proposed method outperforms related approaches with a high accuracy of 0.9492 (nearly 27% relative improvement) for context-aware category prediction and an F1-score of 0.8995 (about 29% relative improvement) for objectionable access identification. In addition, in real-time filtering simulations, our model gradually achieves a macro-averaged blocking rate of 0.9221 while maintaining a favourably low false-positive rate of 0.0041.
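The CRF layer of such a BiLSTM-CRF selects the highest-scoring label sequence via Viterbi decoding, letting an ambiguous click be resolved by its context. A minimal pure-Python sketch; the emission/transition numbers are toy log-potentials, not the paper's learned parameters:

```python
def viterbi(emissions, transitions, labels):
    """Viterbi decoding over per-step label scores (emissions) and
    label-to-label transition scores, as a CRF output layer does."""
    best = {l: emissions[0][l] for l in labels}
    back = []
    for em in emissions[1:]:
        prev, best, ptr = best, {}, {}
        for l in labels:
            p = max(prev, key=lambda k: prev[k] + transitions[(k, l)])
            best[l] = prev[p] + transitions[(p, l)] + em[l]
            ptr[l] = p
        back.append(ptr)
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

labels = ["benign", "porn"]
# transitions favour staying in the same category (toy values)
T = {(a, b): (0.5 if a == b else -0.5) for a in labels for b in labels}
E = [{"benign": 2.0, "porn": 0.0},
     {"benign": 0.2, "porn": 0.3},   # ambiguous middle click
     {"benign": 0.0, "porn": 2.0}]
path = viterbi(E, T, labels)
```

Here the ambiguous middle step is pulled toward "porn" by the strong evidence at the final step, which is the context-aware effect the abstract describes.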
Conference Paper
Full-text available
Children spend significant amounts of time on the Internet. Recent studies have shown that during these periods they are often not under adult supervision. This work presents an automatic approach to identifying suitable web pages for children based on topical and non-topical web page aspects. We discuss the characteristics of children's web sites with respect to recent findings in children's psychology and cognitive science. We finally evaluate our approach in a large-scale user study, finding that it compares favourably to state-of-the-art methods while approximating human performance.
Conference Paper
Full-text available
With the rise of large-scale digital video collections, automatically detecting adult video content has become increasingly important for applications such as content filtering or the detection of illegal material. While most systems represent videos with keyframes and then apply techniques well known for static images, we investigate motion as another discriminative clue for pornography detection. A framework is presented that combines conventional keyframe-based methods with a statistical analysis of MPEG-4 motion vectors. Two general approaches are followed to describe motion patterns, one based on the detection of periodic motion and one on motion histograms. Our experiments on real-world web video data show that this combination with motion information improves the accuracy of pornography detection significantly (the equal error rate is reduced from 9.9% to 6.0%). Comparing both motion descriptors, histograms outperform periodicity detection.
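The motion-histogram descriptor mentioned above can be illustrated as a magnitude-weighted orientation histogram of motion vectors. The binning choices here are assumptions for illustration, not the paper's exact layout:

```python
import math

def motion_histogram(vectors, bins=8):
    """Orientation histogram of MPEG-style (dx, dy) motion vectors,
    weighted by magnitude and L1-normalized."""
    hist = [0.0] * bins
    for dx, dy in vectors:
        mag = math.hypot(dx, dy)
        if mag == 0:
            continue                     # skip zero-motion blocks
        ang = math.atan2(dy, dx) % (2 * math.pi)
        hist[min(int(ang / (2 * math.pi) * bins), bins - 1)] += mag
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# two horizontal vectors and one vertical vector, plus a zero vector
h = motion_histogram([(1, 0), (2, 0), (0, 3), (0, 0)])
```

A classifier would then consume such per-shot histograms alongside keyframe features.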
Conference Paper
Full-text available
We present a method to classify images into different categories of pornographic content to create a system for filtering pornographic images from network traffic. Although different systems for this application have been presented in the past, most are based on simple skin-colour features and have rather poor performance. Recent advances in the image recognition field, in particular for the classification of objects, have shown that bag-of-visual-words approaches are a good method for many image classification problems. The system we present here is based on this approach, uses a task-specific visual vocabulary, and is trained and evaluated on an image database of 8,500 images from different categories. It is shown that it clearly outperforms earlier systems on this dataset, and further evaluation on two novel web-traffic collections shows the good performance of the proposed system.
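The core bag-of-visual-words step is quantizing local descriptors against a visual vocabulary into a histogram. In this sketch, tiny 2-D descriptors and a 3-word vocabulary stand in for real local features (e.g. SIFT) and a learned codebook:

```python
def bovw_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (squared
    Euclidean distance) and return the normalized word-count histogram."""
    def nearest(d):
        return min(range(len(vocabulary)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(d, vocabulary[i])))
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(d)] += 1
    total = sum(hist) or 1
    return [c / total for c in hist]

vocab = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # toy 3-word codebook
h = bovw_histogram([(0.1, 0.1), (0.9, 0.1), (1.1, -0.2), (0.2, 0.9)], vocab)
```

The resulting fixed-length histogram is what a standard classifier (e.g. an SVM) is trained on.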
Article
Full-text available
Along with the ever-growing Web comes the proliferation of objectionable content, such as sex, violence, racism, etc. We need efficient tools for classifying and filtering undesirable Web content. In this paper, we investigate this problem and describe WebGuard, an automatic machine learning-based pornographic Web site classification and filtering system. Unlike most commercial filtering products, which are mainly based on textual content-based analysis such as indicative keywords detection or manually collected black list checking, WebGuard relies on several major data mining techniques associated with textual, structural content-based analysis, and skin color related visual content-based analysis as well. Experiments conducted on a testbed of 400 Web sites including 200 adult sites and 200 nonpornographic ones showed WebGuard's filtering effectiveness, reaching a 97.4 percent classification accuracy rate when textual and structural content-based analysis was combined with visual content-based analysis. Further experiments on a black list of 12,311 adult Web sites manually collected and classified by the French Ministry of Education showed that WebGuard scored a 95.62 percent classification accuracy rate. The basic framework of WebGuard can apply to other categorization problems of Web sites which combine, as most of them do today, textual and visual content.
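Skin-colour visual analysis of the kind WebGuard combines with textual and structural features can be illustrated with a simple per-pixel RGB rule. The thresholds below are a commonly cited heuristic, not WebGuard's actual model:

```python
def skin_fraction(pixels):
    """Fraction of (r, g, b) pixels matching a simple RGB skin rule;
    a high fraction is weak evidence of skin-dominated imagery."""
    def is_skin(r, g, b):
        return (r > 95 and g > 40 and b > 20 and
                max(r, g, b) - min(r, g, b) > 15 and
                abs(r - g) > 15 and r > g and r > b)
    if not pixels:
        return 0.0
    return sum(is_skin(*p) for p in pixels) / len(pixels)

# two skin-like pixels, one dark pixel, one pure blue pixel
frac = skin_fraction([(220, 170, 140), (10, 10, 10),
                      (230, 180, 150), (0, 0, 255)])
```

On its own such a rule over-fires (faces, beaches), which is why the abstract stresses combining it with textual and structural analysis.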
Article
This article presents a search-intent-based method to generate pornographic blacklists for collaborative cyberporn filtering. A novel porn-detection framework that can find newly appearing pornographic web pages by mining search query logs is proposed. First, suspected queries are identified along with their clicked URLs by an automatically constructed lexicon. Then, a candidate URL is determined if its number of clicks satisfies majority voting rules. Finally, a candidate whose URL contains at least one categorical keyword is included in a blacklist. Several experiments are conducted on an MSN search porn dataset to demonstrate the effectiveness of our method. The resulting blacklist generated by our search-intent-based method achieves high precision (0.701) while maintaining a favorably low false-positive rate (0.086). The experiments of a real-life filtering simulation reveal that our proposed method with its accumulative update strategy can achieve a macro-averaged blocking rate of 44.15% when the update frequency is set to 1 day. In addition, the over-blocking rates remain below 9% over time due to the strong advantages of our search-intent-based method. This user-behavior-oriented method can be easily applied to search engines, incorporating only implicit collective intelligence from query logs without additional effort. In practice, it is complementary to intelligent content analysis for keeping up with the changing trails of objectionable websites from users' perspectives.
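The pipeline described above (suspected queries, click counting, then a categorical-keyword check) can be sketched as a short candidate-selection routine. The click threshold and keyword list are illustrative, not the paper's tuned settings:

```python
from collections import Counter

def candidate_urls(click_log, suspected_queries, keywords, min_clicks=2):
    """Count the clicks each URL received from suspected queries, keep
    URLs whose click count meets a majority-style threshold, then
    require at least one categorical keyword in the URL."""
    clicks = Counter(url for query, url in click_log
                     if query in suspected_queries)
    return sorted(url for url, n in clicks.items()
                  if n >= min_clicks and any(k in url for k in keywords))

# toy (query, clicked-URL) log
log = [("q1", "http://bad.example/porn/a"), ("q1", "http://ok.example/news"),
       ("q2", "http://bad.example/porn/a"), ("q2", "http://other.example/xxx1")]
blacklist = candidate_urls(log, {"q1", "q2"}, ["porn", "xxx"])
```

Only the URL that is both repeatedly clicked from suspected queries and keyword-bearing survives, mirroring the two filtering stages in the abstract.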
Conference Paper
This paper presents an intent conformity model to collaboratively generate blacklists for cyberporn filtering. A novel porn detection framework via searches and clicks is proposed to explore the collective intelligence embedded in query logs. First, the clicked pages are represented in terms of weighted queries that reflect their degree of relatedness to pornography. These weighted queries are then regarded as discriminative features for calculating a pornography indicator via an inverse chi-square method for candidate determination. Finally, a candidate whose URL contains at least one pornographic keyword is included in our collaborative blacklists. The experiments on an MSN porn data set indicate that the generated blacklist achieves high precision while maintaining a favorably low false positive rate. In addition, real-life filtering simulations reveal that our blacklist is more effective than some publicly released blacklists.
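An inverse chi-square combination of per-feature probabilities follows Fisher's method: under the null, -2·Σ ln(p) is chi-square distributed with 2n degrees of freedom. The sketch below (in the style of Robinson's spam-filtering use of this statistic) is an illustrative assumption about the computation, with toy probabilities:

```python
import math

def inv_chi_square(chi, df):
    """P(X > chi) for a chi-square variate with even df degrees of
    freedom, via the closed-form series for even df."""
    m = chi / 2.0
    term = math.exp(-m)
    s = term
    for i in range(1, df // 2):
        term *= m / i
        s += term
    return min(s, 1.0)

def pornography_indicator(probs):
    """Combine per-feature pornography probabilities with Fisher's
    method; a LOW output means strong combined evidence."""
    chi = -2.0 * sum(math.log(p) for p in probs)
    return inv_chi_square(chi, 2 * len(probs))

weak = pornography_indicator([0.5, 0.5, 0.5])      # uninformative features
strong = pornography_indicator([0.01, 0.02, 0.05])  # consistent evidence
```

A page whose indicator falls below a chosen threshold would become a blacklist candidate.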
Conference Paper
This paper presents a user intent method to generate blacklists for collaborative cyberporn filtering. A novel porn detection framework that finds new pornographic web pages by mining user search behaviors is proposed. It employs users' clicks in search query logs to select suspected web pages without extra human effort to label training data, and determines their categories with the help of URL host name and path information, but without web page content. We adopt an MSN porn data set to explore the effectiveness of our method. This user intent approach achieves high precision while maintaining a favorably low false positive rate. In addition, a real-life filtering simulation reveals that our user intent method with its accumulative update strategy achieves a blocking rate of 43.36% while maintaining an over-blocking rate steadily below 7%.
Conference Paper
By analyzing a set of access attempts by teenagers to pornographic Web sites, we found that more than half of them are image searches and visits to Web sites with little text information. Clearly, textual content-based filters cannot correctly categorize such access attempts. This paper describes a novel URL-based objectionable content categorization approach and its application to Web filtering. In this approach, we break the URL into a sequence of n-grams with a range of n's, and then a machine learning algorithm is applied to the n-gram representation of the URLs to learn a classifier of pornographic Web sites. We showed empirically that the URL-based approach is able to correctly identify many of the objectionable Web pages. We also demonstrated that the optimum Web filtering results can be achieved when it is used together with a content-based approach in a production environment.
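The URL n-gram representation described here is straightforward to reproduce. The sketch below assumes character n-grams with n from 3 to 4 and lowercasing; the paper's exact range may differ:

```python
def url_ngrams(url, n_range=(3, 4)):
    """Character n-grams of a URL over a range of n's — the feature
    representation a machine-learning classifier is trained on."""
    s = url.lower()
    grams = set()
    for n in range(n_range[0], n_range[1] + 1):
        grams.update(s[i:i + n] for i in range(len(s) - n + 1))
    return grams

g = url_ngrams("SexyPics.example")
```

Each URL becomes a sparse bag of such n-grams, so a linear classifier can pick up discriminative substrings without ever fetching the page.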
Conference Paper
A bipartite query-URL graph, where an edge indicates that a document was clicked for a query, is a useful construct for finding groups of related queries and URLs. Here we use this behavior graph for classification. We choose a click graph sampled from two weeks of image search activity, and the task of "adult" filtering: identifying content in the graph that is inappropriate for minors. We show how to perform classification using random walks on this graph, and two methods for estimating classifier parameters.
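Random-walk classification on a bipartite query-URL click graph can be approximated by iterative score propagation from labelled seed nodes, with each unlabelled node repeatedly averaging its neighbours' scores. The graph, seeds, and step count below are toy illustrations of the idea, not the paper's algorithm:

```python
def propagate_labels(edges, seeds, steps=10):
    """Propagate 'adult' scores over a bipartite query-URL click graph:
    labelled seeds stay clamped; every other node takes the mean of its
    neighbours' scores each step (a random-walk-style smoothing)."""
    neigh = {}
    for q, u in edges:
        neigh.setdefault(q, []).append(u)
        neigh.setdefault(u, []).append(q)
    score = {n: seeds.get(n, 0.5) for n in neigh}
    for _ in range(steps):
        new = {}
        for n, ns in neigh.items():
            if n in seeds:                 # clamp labelled nodes
                new[n] = seeds[n]
            else:
                new[n] = sum(score[m] for m in ns) / len(ns)
        score = new
    return score

edges = [("adult q", "u1"), ("adult q", "u2"),
         ("safe q", "u3"), ("q?", "u2"), ("q?", "u3")]
seeds = {"adult q": 1.0, "safe q": 0.0}   # two labelled queries
s = propagate_labels(edges, seeds)
```

URLs clicked mostly from adult-labelled queries end up with high scores, which is the intuition behind classifying via walks on the click graph.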
Article
The authors review a log of billions of Web queries that constituted the total query traffic for a 6-month period of a general-purpose commercial Web search service. Previously, query logs were studied from a single, cumulative view. In contrast, this study builds on the authors' previous work, which showed changes in popularity and uniqueness of topically categorized queries across the hours in a day. To further their analysis, they examine query traffic on a daily, weekly, and monthly basis by matching it against lists of queries that have been topically precategorized by human editors. These lists represent 13% of the query traffic. They show that query traffic from particular topical categories differs both from the query stream as a whole and from other categories. Additionally, they show that certain categories of queries trend differently over varying periods. The authors' key contribution is twofold: they outline a method for studying both the static and topical properties of a very large query log over varying periods, and they identify and examine topical trends that may provide valuable insight for improving both retrieval effectiveness and efficiency.
Article
This study presented an inverse chi-square based web content classification system that works along with an incremental update mechanism for the incremental generation of a pornographic blacklist. The proposed system, as indicated by the experimental results, can classify bilingual (English and Chinese) web pages at an average precision rate of 97.11% while maintaining a favorably low false positive rate. This satisfactory performance was obtained under a cost-effective parameter configuration for the inverse chi-square calculations. The proposed incremental update mechanism operates on the linking structure of pornographic hubs to locate newly added pornographic sites. The resulting blacklist has been empirically verified to be more responsive to the growth dynamics of pornography sites than three public-domain blacklists.
Article
With the proliferation of harmful Internet content such as pornography, violence, and hate messages, effective content-filtering systems are essential. Many Web-filtering systems are commercially available, and potential users can download trial versions from the Internet. However, the techniques these systems use are insufficiently accurate and do not adapt well to the ever-changing Web. To solve this problem, we propose using artificial neural networks to classify Web pages during content filtering. We focus on blocking pornography because it is among the most prolific and harmful Web content. However, our general framework is adaptable for filtering other objectionable Web material.
Bag-of-visual-words models for adult image classification and filtering. ICPR'08
  • T. Deselaers
  • L. Pimenidis
  • H. Ney