What tags are more suitable for main content extraction from HTML webpages?

Question

Hello, everyone.
I am interested in content extraction from HTML web pages. Currently I use HTML tags to divide a web page into blocks, and then use the tag-to-text ratio, the anchor-text-to-text ratio, and title density to extract the main content. However, not all HTML tags are appropriate for content extraction. So I would like to know which tags are more accurate and more suitable for cleaning web pages. Thank you all.
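
For illustration, the block-level measurements mentioned in the question could be sketched roughly as follows. This is only a minimal sketch using Python's standard `html.parser`; the exact formulas used by the asker are not given, so the set `BLOCK_TAGS`, the names `BlockStats` and `score_blocks`, and the definitions of the two ratios (characters of text per tag, and share of text that sits inside links) are assumptions made for the example. Title density is left out.

```python
from html.parser import HTMLParser

# Assumed set of tags that start a new block; adjust to taste.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}
# Content of these tags is never visible text.
SKIP_TAGS = {"script", "style"}


class BlockStats(HTMLParser):
    """Collect text length, anchor-text length and tag count per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished blocks that contain some text
        self._cur = None     # block currently being filled
        self._in_anchor = 0  # depth of open <a> tags
        self._skip = 0       # depth of open <script>/<style> tags

    def _flush(self):
        # Keep the current block only if it contains text.
        if self._cur is not None and self._cur["text"] > 0:
            self.blocks.append(self._cur)
        self._cur = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
            return
        if tag == "a":
            self._in_anchor += 1
        if tag in BLOCK_TAGS:
            # Flat segmentation: every block tag starts a fresh block.
            self._flush()
            self._cur = {"tag": tag, "text": 0, "anchor": 0, "tags": 1}
        elif self._cur is not None:
            self._cur["tags"] += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_anchor = max(0, self._in_anchor - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip or self._cur is None:
            return
        n = len(data.strip())
        self._cur["text"] += n
        if self._in_anchor:
            self._cur["anchor"] += n


def score_blocks(html):
    """Return one dict of ratios per non-empty block."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a block that was still open at end of input
    return [
        {
            "tag": b["tag"],
            "text_len": b["text"],
            "text_to_tag": b["text"] / b["tags"],     # characters per tag
            "anchor_ratio": b["anchor"] / b["text"],  # share of link text
        }
        for b in parser.blocks
    ]


sample = (
    '<div><a href="/a">Home</a> <a href="/b">News</a></div>'
    "<p>This is the main article text with enough words.</p>"
)
for row in score_blocks(sample):
    print(row)
```

In this sketch a navigation-like block ends up with a high anchor ratio and few characters per tag, while a paragraph of running text shows the opposite pattern. Any thresholds for deciding what counts as main content would still have to be tuned on real pages.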

Dr. Senthilvel Vasudevan · Answer

Hi, good morning.
Your question is very interesting. For content extraction from HTML web pages, two issues can be observed.
1. The source base of the content extraction. Typically, a content extraction method uses either the Document Object Model (DOM) representation or the plain HTML source code.
2. The general approach to content extraction. Gottron divides content extraction approaches into two categories: single-document extraction and multiple-document extraction.
Since our operational settings require that our content extraction module runs during data acquisition, we need a relatively lightweight and fast extraction method. In general, single-document extraction is faster than multiple-document extraction because it considers only the document at hand, without looking into other documents from the same host.
Kindly see the attachment for more information.
Thanks.

Dr. Senthilvel Vasudevan · Answer

Hi,
Kindly see the following attachment as well for your question. I hope it is very helpful to you.

Fabio Gasparetti · Answer

Have you checked out the Boilerpipe open source project? It may help you with full-text extraction from HTML pages.
https://code.google.com/p/boilerpipe/

Panei San · Answer

Thank you, Senthilvel Vasudevan!
I welcome your advice, and thank you for the link you gave.

Panei San · Answer

Thank you, Fabio Gasparetti!
Valuable advice and a kind response.

Panei San · Answer

Thank you all!
What I really want to know concerns content extraction based on the line-block concept. A line block spans from a start tag to its end tag, for example <div>...</div>, <p>...</p>, and so on. When I test my code, it works correctly on well-formed HTML files. But if the HTML is malformed, for example a start tag is present but its end tag is missing, the code gives wrong results and errors. So how should I handle such cases in code, and which tags should be used for content extraction?
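
One possible way to cope with missing end tags, sketched here only as an illustration, is to keep a stack of open blocks and close them implicitly instead of waiting for the end tag. The example below uses Python's standard `html.parser`, which is lenient about malformed markup. The names `TolerantBlocks` and `extract_blocks` and the set `BLOCK_TAGS` are hypothetical, and the closing rules are a simplification of what browsers do (a new <p> or <li> closes an open one of the same kind, a closing ancestor closes everything inside it, and the end of input closes whatever is left). Filtering of script and style content is omitted to keep it short.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as line blocks.
BLOCK_TAGS = {"div", "p", "li", "td", "article", "section"}


class TolerantBlocks(HTMLParser):
    """Split a page into text blocks even when end tags are missing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # [tag, [text parts]] for each open block
        self.blocks = []  # (tag, text) for each closed, non-empty block

    def _close_top(self):
        tag, parts = self.stack.pop()
        text = " ".join(" ".join(parts).split())  # normalize whitespace
        if text:
            self.blocks.append((tag, text))

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        # A new <p> or <li> implies that an open one of the same kind ended.
        if tag in ("p", "li") and self.stack and self.stack[-1][0] == tag:
            self._close_top()
        self.stack.append([tag, []])

    def handle_endtag(self, tag):
        if tag not in BLOCK_TAGS:
            return
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray end tag without a start tag: ignore it
        # Close unclosed inner blocks first, then the matching block.
        while self.stack:
            closed = self.stack[-1][0]
            self._close_top()
            if closed == tag:
                break

    def handle_data(self, data):
        # Text belongs to the innermost open block.
        if self.stack:
            self.stack[-1][1].append(data)

    def close(self):
        super().close()  # lets the parser deliver any buffered text
        while self.stack:  # end of input closes everything still open
            self._close_top()


def extract_blocks(html):
    parser = TolerantBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


broken = "<div><p>First paragraph<p>Second paragraph</div><p>Third</p></p>"
print(extract_blocks(broken))
```

With this approach the choice of tags matters less than the closing rules: the code never assumes that an end tag will arrive, so a missing </p> or </div> shortens or merges a block at worst instead of causing an error.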

Saurabh Gayali · Answer

Try Visualping:
https://visualping.io/
You can visually select what you want to extract.
