Web Content Extraction

  • William M. Marcellino added an answer:
    What is the difference between stopwords density and token count?

    Hello, could you please explain how to count the stop words and tokens in a text? I would appreciate clarification with examples. Thanks.

    William M. Marcellino · RAND Corporation

    I'd like to respectfully complicate the way we're talking about stopwords. What constitutes a stopword depends on the corpus you are working on and on the goal of your research. If my approach is primarily semantic, and I want to index what texts are about, then there is a good chance that many of the very common words in my corpus, what Hope & Witmore (2010) call the "gloop" of language, are not of interest. In that case, I may want to filter out the gloop so I can find the "plums." But if I am doing pragmatic work, trying to detect latencies in text such as affective or epistemic stance, then the gloop may be critical to my questions. Biber, Conrad, & Reppen (2000) point out that human readers naturally detect plums and ignore gloop, and thus machine-based corpus approaches have an important advantage in doing truly accurate empirical work. I looked at the stopword lists Mayur pointed to, and for the kind of work I do, almost every one of those words has some important function, either alone or in a larger lexical bundle, that I want to detect and count.

    Ultimately, you will need to decide on a case-by-case basis what is of interest and what is not in your text analysis. I think this issue reflects larger divisions between linguists and computer scientists in our theories and assumptions about language use and how it can be investigated.
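
    Since the question asked for an example, here is a minimal sketch of the two measures, assuming "token count" means the number of word tokens and "stopword density" means the share of those tokens that are stopwords. The tiny STOPWORDS set, the regex tokenizer, and the function names are illustrative assumptions, not a recommended list or method; as argued above, the right list depends on the corpus and the research goal.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "on", "to"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def stopword_stats(text, stopwords=STOPWORDS):
    """Return (token_count, stopword_count, stopword_density)."""
    tokens = tokenize(text)
    token_count = len(tokens)
    stopword_count = sum(1 for token in tokens if token in stopwords)
    # Density is the fraction of tokens that are stopwords.
    density = stopword_count / token_count if token_count else 0.0
    return token_count, stopword_count, density


sample = "The cat is on the mat and the dog is in the garden"
token_count, stopword_count, density = stopword_stats(sample)
print(token_count, stopword_count, round(density, 3))
```

    Swapping in a different stopword set changes the density but not the token count, which is the practical difference between the two numbers.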

  • Michal Meina added an answer:
    Where can I get a stopwords list for webpage categorizations?

    Dear all, I want to get a stopword list for web page classification, to use when training classifiers. If you know of a link or a way to get such stopwords, could you please share it with me? Thanks all.

    Michal Meina · University of Warsaw

    I would strongly recommend using the stopwords corpus from NLTK [ http://www.nltk.org/book/ch02.html ].

    It has 2,400 stopwords for 11 languages.
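
    A minimal sketch of how that corpus might be used, assuming NLTK is installed and its stopwords data has been fetched (for example with nltk.download("stopwords")). The helper names and the small FALLBACK_STOPWORDS set are hypothetical; the fallback only keeps the sketch runnable when NLTK or its data is missing.

```python
# Tiny fallback list (hypothetical), used only when NLTK or its data is missing.
FALLBACK_STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def load_stopwords(language="english"):
    """Return NLTK's stopword set for a language, or the fallback set."""
    try:
        from nltk.corpus import stopwords  # third-party: pip install nltk
        return set(stopwords.words(language))
    except (ImportError, LookupError):
        # LookupError is raised when the corpus data is not available locally.
        return set(FALLBACK_STOPWORDS)


def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopword_set]


tokens = "The price of the laptop is high".split()
print(remove_stopwords(tokens, load_stopwords()))
```

    For web page classification, the filtered tokens would then feed whatever feature extraction the classifier uses.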

  • Efstratios Kontopoulos added an answer:
    Can anyone suggest what can be done in the area of semantic web scraping?

    I am interested in doing some work in the area of semantic web crawling/scraping and using that semantic data to do some discovery.

  • Panei San added an answer:
    What kind of java is appropriate for information extraction and web page classification?

    Hello everyone!

    Can you advise me on which kind of Java would be best for me to learn?

    Panei San · University of Computer Studies, Yangon


  • Panei San added an answer:
    What tags are more suitable for main content extraction from HTML webpages?

    Hello, everyone

    I am interested in content extraction from HTML web pages. Currently I use HTML tags to divide a web page into blocks, and I use the tag-to-text ratio, the anchor-text-to-text ratio, and title density to extract the main content. However, not all HTML tags are appropriate for content extraction, so I want to know which tags are more accurate and more suitable for web page cleaning. Thank you all.

    Panei San · University of Computer Studies, Yangon

    Thank you all!

    I would really like to know how to approach content extraction based on the line-block concept. A line block runs from a start tag to its end tag, for example <div>...</div>, <p>...</p>, and so on. I am writing and testing code for this, and it works correctly on well-formed HTML files. But if the HTML is malformed, for example when a start tag is present without its end tag, the code gives wrong answers and errors. How should I handle such cases in code, and which tags should be used for content extraction?
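
    One possible way to tolerate missing end tags is sketched below. It is not the poster's code: it assumes Python's standard html.parser, treats the start of any new block tag as the implicit end of the block that is still open, and keeps blocks whose anchor-text-to-text ratio is low. The BLOCK_TAGS set, the BlockExtractor class, and the 0.5 threshold are illustrative assumptions.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; adjust for the pages being cleaned.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collect text blocks; a new block tag implicitly closes the open one."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, anchor_text_ratio)
        self._pieces = []      # raw text pieces of the current block
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append((text, min(1.0, self._link_chars / len(text))))
        self._pieces, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()      # a missing end tag no longer matters
        elif tag == "a":
            self._in_link += 1
        elif tag in SKIP_TAGS:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in SKIP_TAGS and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return
        self._pieces.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))

    def close(self):
        super().close()        # lets the parser emit any buffered text
        self._flush()


# Malformed on purpose: no </p> and no </div> anywhere.
html = """
<div><a href="/">Home</a> <a href="/news">News</a>
<p>Content extraction keeps the main text of a page.
<p>It should also survive missing end tags.
<div><a href="/terms">Terms</a>
"""
parser = BlockExtractor()
parser.feed(html)
parser.close()
main_content = [text for text, ratio in parser.blocks if ratio < 0.5]
print(main_content)
```

    Because blocks are closed by the next block start rather than by a matching end tag, the navigation and footer links end up in their own high-ratio blocks and are filtered out.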

  • Mustapha Bouakkaz added an answer:
    What is the difference between Hadoop and Data Warehouse?
    I have read a couple of articles which try to sell the idea that an organization should basically choose between implementing Hadoop (a powerful tool for unstructured and complex datasets) and implementing a Data Warehouse (a powerful tool for structured datasets). But my question is, can't they actually go together, since Big Data is about both structured and unstructured data?
    Mustapha Bouakkaz · Université Amar Telidji Laghouat

    The second one includes the first one.

  • Stephane RIBOT added an answer:
    What's the easiest way to collect data from Twitter and Facebook?
    I'm developing a strategy as an MSc project. I will be monitoring, collecting, and analyzing the data of a Facebook page (posts, comments, likes, shares) and a Twitter profile (tweets, retweets, mentions, and public tweets containing only one or two keywords). Any suggestions would be great. Also, what mining techniques do you recommend? I'm thinking of sentiment analysis and would like to use one or two more techniques.

    Stephane RIBOT · Université Jean Moulin Lyon 3
    The easiest is GNIP, http://gnip.com/
    Note that you may have to pay, but it may be worth it for its simplicity (if you don't program, that is).
  • Ian Kennedy added an answer:
    Can a probabilistic timed automaton be used to model the underlying network in query routing?
    The network for routing the query is based on a Markov process. If we want to model the time taken to answer a query, is a probabilistic timed automaton a better model?
    Ian Kennedy · Independent Researcher
    A stochastic queue would be indicated. Look up queueing theory. Start here: http://en.wikipedia.org/wiki/Queueing_theory. If you wanted to use a number of probabilistic timed automata, you would then have the complexity of having to build in the appropriate statistical properties.
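
    As a rough illustration of the queueing view, the sketch below computes steady-state figures for the simplest textbook case, an M/M/1 queue. That choice is an assumption made here, not something stated in the thread: it supposes queries arrive as a Poisson process, a single responder answers them with exponentially distributed service times, and the arrival rate is below the service rate. The function name and the example rates are hypothetical.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue (rates per unit time)."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival_rate >= service_rate")
    utilization = arrival_rate / service_rate
    # Mean time a query spends waiting plus being answered.
    mean_time_in_system = 1.0 / (service_rate - arrival_rate)
    # Mean number of queries waiting or being answered.
    mean_queries_in_system = utilization / (1.0 - utilization)
    return utilization, mean_time_in_system, mean_queries_in_system


# Hypothetical rates: 2 queries arrive per hour, 5 can be answered per hour.
utilization, mean_time, mean_queries = mm1_metrics(2.0, 5.0)
print(utilization, mean_time, mean_queries)
```

    The two outputs are tied together by Little's law (mean number in system equals arrival rate times mean time in system), which is a quick sanity check on any such model.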
  • Akila Gopu asked a question:
    What is the best model to capture Query passing/tossing in CQA?
    CQA is Community Query Answering, for example stackoverflow.com or Yahoo Answers. The current model used to capture this query passing is the Markov model. Is there any alternative to the Markov model?
  • Andrew Meyerhoff added an answer:
    Is it possible to extract the code for a specific segment of a webpage?
    Suppose I need to extract the code for only the voting portion of a webpage. Is there any tool for doing this?
    Andrew Meyerhoff · Indiana University-Purdue University Indianapolis
    Use Python with BeautifulSoup: just make a URL request with Python and parse the response with BeautifulSoup for the needed element. It makes life much easier than you would believe.
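
    A minimal sketch of that suggestion, assuming the third-party beautifulsoup4 package is installed and that the voting portion can be found by an element id. The id "voting", the sample page, and the helper names are made up for illustration; a real page needs its own lookup.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_segment(html, element_id):
    """Return the markup of the element with the given id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    return str(element) if element is not None else None


def extract_segment_from_url(url, element_id):
    """Fetch a page over HTTP and extract one segment from it."""
    with urlopen(url) as response:
        return extract_segment(response.read(), element_id)


# Hypothetical page used instead of a live request.
page = """
<html><body>
  <div id="article">Some story text.</div>
  <div id="voting"><button>Up</button><button>Down</button></div>
</body></html>
"""
segment = extract_segment(page, "voting")
print(segment)
```

    extract_segment_from_url does the same for a live page, though sites that build the voting widget with JavaScript will not expose it in the fetched HTML.
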
  • Alexandre Beauvois added an answer:
    Looking for an efficient algorithm available for web crawling
    I need to extract specific data from related websites. For example, I need to extract data from a specific website that provides positive feedback about a type of vehicle. Kindly suggest some good code or an algorithm for this.
    Alexandre Beauvois · Entropic Synergies
    Cheerio, for the JavaScript (Node.js) programming language: see http://vimeo.com/31950192
  • Fotis Kokkoras added an answer:
    Data mining and web content mining
    How can data mining algorithms be implemented for web content mining?
    Fotis Kokkoras · Technological Educational Institute of Thessaly
    Web content extraction is a task applied to web pages, not to databases. You scrape unstructured data from the web, put it in structured storage (databases), and then apply data mining algorithms to it. That's the order.
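
    The order described above can be sketched end to end with the standard library only. Everything concrete here is an assumption for illustration: the PAGES snippets stand in for crawled pages, a regular expression stands in for a real HTML parser, an in-memory SQLite table stands in for the database, and a grouped count stands in for an actual data mining algorithm.

```python
import re
import sqlite3

# Hypothetical scraped snippets; in practice these come from a crawler.
PAGES = [
    '<li class="review"><b>Car A</b> <span>positive</span></li>',
    '<li class="review"><b>Car B</b> <span>negative</span></li>',
    '<li class="review"><b>Car A</b> <span>positive</span></li>',
]

# A regex keeps the sketch short; real pages call for a proper HTML parser.
RECORD = re.compile(r"<b>(?P<item>.*?)</b>\s*<span>(?P<label>.*?)</span>")


def extract(pages):
    """Step 1: turn unstructured markup into (item, label) records."""
    for page in pages:
        match = RECORD.search(page)
        if match:
            yield match.group("item"), match.group("label")


def store(records):
    """Step 2: load the records into structured storage."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reviews (item TEXT, label TEXT)")
    conn.executemany("INSERT INTO reviews VALUES (?, ?)", records)
    conn.commit()
    return conn


def mine(conn):
    """Step 3: a stand-in mining step that counts labels per item."""
    query = (
        "SELECT item, label, COUNT(*) FROM reviews "
        "GROUP BY item, label ORDER BY item, label"
    )
    return conn.execute(query).fetchall()


conn = store(extract(PAGES))
summary = mine(conn)
print(summary)
```
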
  • Debajyoti Mukhopadhyay asked a question:
    Semantic Search Engine with user friendly output and ranking needed.
    We have witnessed the power of a regular search engine like Google. There is a semantic search engine like Swoogle as well. However, we are trying to build a semantic search engine with more user friendly display capability and relevant ranking algorithm. Can anybody suggest ideas?
  • Andras Kornai added an answer:
    Chinese textmining software
    Does anybody know of any usable text mining software that does topic modeling and also covers Chinese as a language? This seems harder to find than I had thought. I found things like FudanNLP (http://code.google.com/p/fudannlp/) and Ictclas (http://www.ictclas.org/ictclas_download.aspx), neither of which I have been able to make work so far. Pingar (http://apidemo.pingar.com/AnalyzeDocument.aspx) doesn't seem to have topic extraction. Mallet does seem to have a Chinese module and does have topic modeling, but I have yet to figure that one out too. Does anybody have any other suggestions?
    Andras Kornai · Hungarian Academy of Sciences
    Have you considered commercial software vendors like Basis Tech?
  • Massimo Ruffolo added an answer:
    Making effective Web Content Extraction technologies
    One of my principal research and development interests is Web Content Extraction. I founded a start-up in this field, www.altiliagroup.com. If anyone is interested in collaborating with us on this topic or in working as principal software architect for Altilia, please let me know.
    Massimo Ruffolo · National Research Council
    Hi all,

    thanks for your interest in my post.

    We are searching for companies interested in becoming resellers of our content extraction and management technologies, and for technical people with deep expertise in web content extraction technologies who are interested in working with us as software architects.

About Web Content Extraction

Content extraction is the process of identifying the Main Content and/or removing the additional items, such as advertisements, navigation bars, design elements or legal disclaimers. The rapid growth of text based information on the Web and various applications making use of this data motivates the need for efficient and effective methods to identify and separate the Main Content (MC) from the additional content items.