Question
Asked 11th Sep, 2014

What tags are more suitable for main content extraction from HTML webpages?

Hello, everyone,
I am interested in content extraction from HTML web pages. At present I use HTML tags to divide a web page into blocks, and I use the tag-to-text ratio, the anchor-text-to-text ratio, and title density to extract the main content. However, not all HTML tags are appropriate for content extraction. So I would like to know which tags are more accurate and more suitable for web page cleaning. Thank you all.
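To make the two ratios concrete, here is a minimal sketch of how they might be computed per line of HTML source. The function names, the regular expressions, and the thresholds (`min_ratio`, `max_anchor`) are assumptions made for illustration, not values from any published method, and a regex is only a rough stand-in for a real parser.

```python
import re

# Rough patterns: good enough for a sketch, not a substitute for a real parser.
TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def line_features(line):
    """Return the text-to-tag ratio and anchor-text ratio for one source line."""
    tag_count = len(TAG_RE.findall(line))
    text = TAG_RE.sub("", line).strip()
    # Text that sits inside <a>...</a>, with any nested tags stripped.
    anchor_text = "".join(TAG_RE.sub("", m) for m in ANCHOR_RE.findall(line)).strip()
    text_len = len(text)
    return {
        "text_to_tag": text_len / max(tag_count, 1),
        "anchor_ratio": len(anchor_text) / text_len if text_len else 0.0,
    }


def looks_like_content(line, min_ratio=10.0, max_anchor=0.5):
    """Hypothetical rule: much text per tag and little of it inside links."""
    f = line_features(line)
    return f["text_to_tag"] >= min_ratio and f["anchor_ratio"] <= max_anchor


article = "<p>Main article text that runs on for a while.</p>"
menu = '<li><a href="/a">Home</a></li><li><a href="/b">About</a></li>'
print(line_features(article))
print(line_features(menu))
```

Navigation lines tend to score low on the first ratio and high on the second, which is why the two are usually combined rather than used alone.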

All Answers (7)

Dr. Senthilvel Vasudevan
Sri Venkateshwaraa Medical College Hospital and Research Centre
Hi, good morning.
Your question is very interesting. In content extraction from HTML web pages, two issues can be observed:
1.  The source base of the content extraction. Typically, a content extraction method uses either the Document Object Model (DOM) representation or the plain HTML source code.
2.  The general approach to content extraction. Gottron divides content extraction approaches into two categories: single-document extraction and multiple-document extraction.
Since our operational settings require that our content extraction module run during data acquisition, we need a relatively lightweight and fast extraction method. In general, single-document extraction is faster than multiple-document extraction because it considers only the document at hand, without looking at other documents from the same host.
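As a rough illustration of single-document extraction on a DOM-like representation, the sketch below builds a small tree with Python's standard `html.parser` and picks the element whose direct `<p>` children hold the most non-link text. The scoring rule and the names `Node`, `TreeBuilder`, and `best_container` are assumptions chosen for the example, not Gottron's method or any library's API.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # elements with no end tag


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text directly inside this element


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.own_text += len(data.strip())


def text_len(node):
    """Text length of a subtree, ignoring links, scripts, and styles."""
    if node.tag in ("a", "script", "style"):
        return 0
    return node.own_text + sum(text_len(child) for child in node.children)


def best_container(html):
    """Return the element whose direct <p> children carry the most text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0
    stack = [builder.root]
    while stack:
        node = stack.pop()
        score = sum(text_len(c) for c in node.children if c.tag == "p")
        if score > best_score:
            best, best_score = node, score
        stack.extend(node.children)
    return best
```

Only the page itself is inspected, so no other documents from the same host are needed, which is what keeps this style of extraction cheap.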
Kindly see the attachment for more information.
Thanks.  
1 Recommendation
Dr. Senthilvel Vasudevan
Sri Venkateshwaraa Medical College Hospital and Research Centre
Hi,
Kindly see the following attachment as well for your question. I hope it is helpful to you.
1 Recommendation
Fabio Gasparetti
Università Degli Studi Roma Tre
Have you checked out the Boilerpipe open-source project? It may help you with full-text extraction from HTML pages.
1 Recommendation
Panei San
University of Computer Studies, Yangon
Thank you, Senthilvel Vasudevan!
I welcome your advice, and thank you for the link you provided.
Panei San
University of Computer Studies, Yangon
Thank you, Fabio Gasparetti!
Thank you for your valuable advice and kind response.
1 Recommendation
Panei San
University of Computer Studies, Yangon
Thank you all!
What I really want to ask about is content extraction based on the line-block concept. A line block runs from a start tag to its end tag, for example <div>...</div>, <p>...</p>, and so on. I am writing and testing code for this, and it works correctly for well-formed HTML files. But if the HTML is malformed, for example a start tag is present without its matching end tag, the code produces wrong results and errors. So how should I handle such cases in code, and which tags should be used for content extraction?
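One possible way to make line blocks tolerant of missing end tags, given here only as a sketch: instead of pairing each start tag with its end tag, treat every block-level tag (opening or closing) as a boundary and flush the text collected so far. The tag set in `BLOCK_TAGS` and the names `BlockSplitter` and `split_blocks` are assumptions for illustration; the parser is Python's standard `html.parser`, which does not raise on unclosed tags.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; adjust to the pages being cleaned.
BLOCK_TAGS = {"p", "div", "article", "section", "h1", "h2", "h3", "h4",
              "li", "td", "th", "blockquote", "pre", "br"}
SKIP_TAGS = {"script", "style"}  # their content is never main text


class BlockSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()  # a new block starts even if the previous one never closed

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)


def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep trailing text that no tag closed
    return parser.blocks


# Neither <p> is closed, and the last one runs to the end of the input.
print(split_blocks("<div><p>First block<p>Second block</div><p>Third"))
```

Because boundaries depend only on where block-level tags occur, an unclosed <p> or <div> changes nothing about how the text is split, and inline tags such as <a>, <b>, or <span> stay inside their block.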
Saurabh Gayali
Institute of Genomics and Integrative Biology
Try visual ping.
You can visually select what you want to extract.
1 Recommendation
