Fig 2 - uploaded by Grigori Sidorov
Source publication
Article
Full-text available
It is well known that the lack of quality data is a major problem for information retrieval engines. Web articles are flooded with irrelevant data such as advertising and related links. Moreover, some of these ads are loaded in a randomized way every time a page is requested, so the HTML document will be different and hashing of the content will be no...
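The hashing problem described above can be seen in a minimal sketch (hypothetical HTML snippets, standard library only): hashing the raw HTML of a page whose ad block changes on every load yields a different digest each time, while hashing only the extracted main text stays stable. The toy `main_text` extractor below simply drops the ad container and remaining tags; it stands in for a real content extractor.

```python
import hashlib
import re

def digest(text):
    """SHA-256 hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def main_text(html):
    # Toy extractor: drop the ad container, strip remaining tags,
    # collapse whitespace.
    html = re.sub(r"<div class='ad'>.*?</div>", "", html)
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())

# Two loads of the "same" article; only the randomized ad differs.
load_a = "<div class='ad'>Buy A-123!</div><p>The actual article text.</p>"
load_b = "<div class='ad'>Buy B-987!</div><p>The actual article text.</p>"

print(digest(load_a) == digest(load_b))                        # False: raw hashes differ
print(digest(main_text(load_a)) == digest(main_text(load_b)))  # True: main text is stable
```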

Citations

... In recent years, web page segmentation has attracted wide attention and produced a wealth of research results [12]. Rule-based web page segmentation draws on human visual perception: it summarizes rules that reflect how people visually partition a page and then segments pages according to these rules [13], [14]. Many researchers have since proposed improved segmentation techniques based on this approach [15], [16], but the underlying idea of rule-based segmentation has not fundamentally changed. ...
Article
Full-text available
Usually, in addition to the main content, web pages contain additional information in the form of noise, such as navigation elements, sidebars and advertisements. This noise is unrelated to the main content and interferes with data mining and information retrieval tasks, because sensors end up processing wrong data and interference noise. Because web page structures are so diverse, detecting relevant information and noise, and thereby improving the reliability of sensor networks, is a challenge. In this paper, we propose a visual block construction method based on page type conversion (VB-PTC). The method combines site-level noise reduction based on a hash tree with page-level noise reduction based on linked clusters to eliminate noise in web articles, and it successfully converts multi-record complex pages into multi-record simple pages, effectively simplifying the rules of visual block construction. For multi-record content extraction, we apply different extraction methods according to the characteristics of each field, combining regular expressions, natural language processing and symbol density detection, which greatly improves the accuracy of multi-record content extraction. VB-PTC can be effectively used for information retrieval, content extraction and page rendering tasks.
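The symbol density heuristic mentioned in the abstract can be sketched as follows (a simplified illustration, not the VB-PTC implementation): body prose contains sentence punctuation at a fairly steady rate, while navigation menus and link lists contain almost none, so punctuation symbols per word separates the two. The threshold value here is an assumption for this example.

```python
PUNCT = set(".,;:!?")

def symbol_density(text):
    """Punctuation symbols per word: body prose scores well above
    zero, navigation link lists score at or near zero."""
    words = text.split()
    if not words:
        return 0.0
    symbols = sum(1 for ch in text if ch in PUNCT)
    return symbols / len(words)

nav   = "Home About Contact Login Register"
prose = ("Usually, in addition to the main content, web pages "
         "contain additional information in the form of noise.")

# Keep only blocks whose density clears an (assumed) threshold.
content = [b for b in (nav, prose) if symbol_density(b) > 0.05]
```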
... Our Web-Article Miner (Web-AM) algorithm first passes all web pages through Boilerpipe's "Article Extractor" algorithm. Boilerpipe extracts a significant portion of the main content; however, it also retrieves a considerable amount of noise [22]. The noise that Boilerpipe returns as main content is marked with a green circle in Figure 1. ...
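One shallow feature that Boilerpipe-style extractors rely on is link density: the fraction of a block's text that sits inside anchor tags. Residual noise such as "related links" blocks scores near 1.0, while article prose scores near 0.0. A minimal standard-library sketch (the example blocks are hypothetical, and this illustrates the heuristic rather than Web-AM's actual filter):

```python
from html.parser import HTMLParser

class LinkDensity(HTMLParser):
    """Measure how much of a block's text sits inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.text_chars = 0
        self.link_chars = 0
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self.in_link:
            self.link_chars += n

def link_density(block_html):
    p = LinkDensity()
    p.feed(block_html)
    return p.link_chars / p.text_chars if p.text_chars else 1.0

related = ("<ul><li><a href='/a'>Read more</a></li>"
           "<li><a href='/b'>Trending now</a></li></ul>")
body    = "<p>The algorithm removes residual noise after extraction.</p>"
```

Blocks whose density exceeds some cutoff (say 0.5) would be discarded as residual noise.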
Article
We present a method for gender and language variety identification using a convolutional neural network (CNN). We compare its performance with a traditional machine learning algorithm, support vector machines (SVM), trained on character n-grams (n = 3–8) and lexical features (unigrams and bigrams of words), and their combinations. We use a single multi-labeled corpus composed of news articles in different varieties of Spanish, developed specifically for these tasks. Our CNN architecture, trained on word- and sentence-level embeddings, can be successfully applied to gender and language variety identification on a relatively small corpus (fewer than 10,000 documents). Our experiments show that the deep learning approach outperforms the traditional machine learning approach on both tasks when named entities are present in the corpus. However, when all named entities are reduced to a single symbol "NE" to avoid topic-dependent features, the drop in accuracy is larger for the deep learning approach.
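The character n-gram features behind the SVM baseline can be illustrated with a small sketch (in practice a vectorizer with TF-IDF weighting over a whole corpus would be used; this shows only the raw counting step, and the sample string is hypothetical):

```python
from collections import Counter

def char_ngrams(text, n_min=3, n_max=8):
    """Count character n-grams for every n in [n_min, n_max],
    the feature family the SVM baseline is trained on."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

# Hypothetical sample; real input would be a whole document.
feats = char_ngrams("que onda", 3, 4)
print(feats["que"], feats["onda"], feats["e o"])  # 1 1 1
```

Note that n-grams spanning the space (like "e o") capture word-boundary cues, which is one reason character n-grams work well for variety identification.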
Article
Full-text available
In this paper, we construct paraphrase graphs for news text collections (clusters). Our aims are, first, to show that the paraphrase graph construction method can be used for news cluster identification and, second, to analyze and compare stylistically different news collections. Our news collections include dynamic, static and combined (dynamic and static) texts, and their respective paraphrase graphs reflect their main characteristics. We also automatically extract the most informationally important linked fragments of news texts; these fragments characterize news texts as either informative, conveying some information, or publicistic, trying to affect readers emotionally.
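A paraphrase graph of the kind described can be sketched with a simple lexical-overlap criterion (an assumption made for illustration; the paper's actual paraphrase detection is more sophisticated): sentences become nodes, and an edge links any pair whose word-set Jaccard similarity clears a threshold.

```python
def jaccard(a, b):
    """Word-set overlap between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def paraphrase_graph(sentences, threshold=0.5):
    """Nodes are sentence indices; an edge links every pair whose
    lexical overlap reaches the threshold (a stand-in for a real
    paraphrase detector)."""
    edges = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if jaccard(sentences[i], sentences[j]) >= threshold:
                edges.append((i, j))
    return edges

news = [
    "the president signed the trade agreement today",
    "today the president signed the trade agreement",
    "local team wins championship final",
]
g = paraphrase_graph(news)
print(g)  # [(0, 1)]: the two rewordings link up, the unrelated story stays isolated
```

Connected components of such a graph then correspond to candidate news clusters.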