Web Content Extraction


  • Frances Hodgkins added an answer:
    How to get twitter historical data?

I am doing research on Twitter sentiment analysis related to financial predictions, and I need a historical dataset from Twitter going back three years. Last year Twitter announced that they would release historical data for scientific purposes.

    I am asking if anybody has an idea about how to get this data?

    Frances Hodgkins

Prachi Dev · Did you ask Twitter?

    I am researching Ebola, and I asked about a data grant. I have forms that I need to fill out as an academic to explain to Twitter the reason for my research and why I need the data. Have you been down this road already? If so, I will need to modify my prospectus! If not... give it a try!

  • Victoria Rubin added an answer:
    What are the best ways to content-analyze social media streams?

    I'm looking for recent developments in automated analysis of Twitter, Facebook, or any other text-based social media streams. What are researchers able to extract? How are facts gathered, summarized, visualized?

    If you can point me to recent research, technologies, and specifically conferences dealing with automation of social media content, I'd much appreciate it. VR

    Victoria Rubin

    @Dr. Muhammad Zubair Asghar,

    Thank you for sharing these publications. I'll certainly be looking at those methods.

@Olga Buchel, thanks for the PCA tip.

    @Emmanuel Mogaji, could you please expand on the NCapture capabilities?

    Thank you, VR

  • Dr. Muhammad Zubair Asghar added an answer:
Is it possible to extract the code for a specific segment of a webpage?
    Suppose I need to extract the code for only the voting portion of a webpage. Is there any tool for doing this?
    Dr. Muhammad Zubair Asghar

We use a Python-based scraper: http://www.crummy.com/software/BeautifulSoup/. After scraping, perform some manual editing to get things done per your specific requirements.
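For illustration, here is a minimal BeautifulSoup sketch that pulls out one segment of a page by its `id` (the HTML and the `voting` id are made up for the example; it assumes the `beautifulsoup4` package is installed):

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

html = """
<html><body>
  <nav>Site menu</nav>
  <div id="voting">
    <h3>Poll</h3>
    <button>Upvote</button> <span class="count">42</span>
  </div>
  <footer>Legal</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Select only the voting segment by its id (hypothetical id, for illustration)
voting = soup.find(id="voting")
print(voting.prettify())  # the markup of just that segment, nothing else
```

From there you can keep drilling down (e.g. `voting.find("span", class_="count")`) or hand-edit the extracted markup as the answer suggests.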

  • Dr. Muhammad Zubair Asghar added an answer:
    Where can I get the web page datasets such as BBC or NY Times datasets for web page classification?

    Hello, everyone

I am implementing web page classification. At the moment I am testing on a small dataset of about 50 web pages (sport, business, etc.) downloaded from the BBC website, but I need more web pages for further implementation and for calculating classification accuracy. Therefore, if you know of or have any web page datasets, please share them or give links.


Panei San

    Dr. Muhammad Zubair Asghar

    here is the required dataset:


  • Dr. Muhammad Zubair Asghar added an answer:
How can I recognize whether a document is positive or negative based on polarity words?

    I have:

    - Polarity words.


    - Good: Pol 5.
    - Bad: Pol -5.

    My assignment:

Determine whether a document is negative or positive. How should I do this? Please tell me; I'm a newbie in NLP (sentiment analysis).
    I want to use polarity words for this, not Naive Bayes. Can anyone suggest an algorithm based on polarity words?

    Thanks for your time.

    Dr. Muhammad Zubair Asghar

There are a number of ways to accomplish this task; however, a simple solution is to use a set of if-then rules, which can be easily implemented in Python. A rule-based implementation in Python is easy, and one can customize it according to specific requirements. The following studies were conducted in this context.

    1. https://www.researchgate.net/publication/283318830_Lexicon-Based_Sentiment_Analysis_in_the_Social_Web

    2. https://www.researchgate.net/publication/281735672_Lexicon_based_approach_for_sentiment_classification_of_user_reviews?ev=prf_pub

    3. https://www.researchgate.net/publication/283318926_Sentiment_Classification_through_Semantic_Orientation_Using_SentiWordNet?ev=prf_pub

    • Source
      ABSTRACT: Sentiment analysis is a compelling issue for both information producers and consumers. We are living in the "age of the customer", where customer knowledge and perception are key to running a successful business. The goal of sentiment analysis is to recognize and express emotions digitally. This paper presents a lexicon-based framework for sentiment classification, which classifies tweets as positive, negative, or neutral. The proposed framework also detects and scores the slang used in the tweets. The comparative results show that the proposed system outperforms existing systems. It achieves 92% accuracy in binary classification and 87% in multi-class classification.
      Full-text · Article · Jan 2014

    + 2 more attachments
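The simplest form of the rule-based approach described above can be sketched in a few lines: sum the polarity scores of the known words and decide by the sign of the total (the two-word lexicon below is taken from the question; a real lexicon would have thousands of entries):

```python
# Tiny lexicon from the question; a real one (e.g. SentiWordNet) is far larger.
LEXICON = {"good": 5, "bad": -5}

def classify(document, lexicon=LEXICON):
    """Sum the polarity of known words; the sign of the total decides the label."""
    words = (w.strip(".,!?;:") for w in document.lower().split())
    score = sum(lexicon.get(w, 0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

classify("The food was good, really good!")  # -> "positive"
```

Refinements such as negation handling ("not good") and intensifiers ("very good") are exactly the kind of if-then rules the answer refers to, and are covered in the linked studies.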

  • Virginia Angelica Garcia-Vega added an answer:
    What is your favorite search engine to find ontologies?

    I'm interested in finding ontologies in the domain of sustainable territories

    Virginia Angelica Garcia-Vega

Have you tried https://duckduckgo.com? Some of the search engines above don't work.

  • Fadoua Ataa Allah added an answer:
    Where can I find the web pages dataset for information extraction?


Maybe someone knows where I can find a webpage dataset for evaluating information extraction. I need a set like:

    - domain_1 = { {web_page_1, {relevant entities}}, ..., {web_page_2, {relevant entities}} }

    I created a wrapper induction algorithm based on a domain's web pages. The algorithm can extract important entities from these pages (for example, from a movie domain it extracts information such as the film title, actor names, etc. from each page). I created a reference dataset (I labeled 3 domains and 200 documents), but maybe there is a better reference dataset?

    Maybe someone knows where I can find software to compare against my solution (semi-supervised information extraction from web pages based on HTML structure)?

    Fadoua Ataa Allah

    Hope these links could help you.

    Good luck.

    + 2 more attachments

  • Udit Chakraborty added an answer:
    Any advice on the calculation of weights for training vs test set in a feature vector?

I am working on text classification using an ant colony algorithm, but I am confused about computing the feature vector for the test set.

    For the training feature vectors, I took the TF-IDF vector of each training document and constructed a feature matrix [docs x terms] using the TF-IDF values.

    But how should I compute the test set's feature vectors? Should I just reuse the TF-IDF statistics from the training set?

    E.g.: in the training set, a particular word "apple" has a document frequency of 5. For the test set, should I use the value 5 for "apple", or recompute TF-IDF based on the test set? Or am I going about computing the feature vector the wrong way?

    Thanks in advance!

    Udit Chakraborty

    I agree with Qasem
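For what it's worth, the usual convention is the first option the asker mentions: compute document frequencies (and hence IDF) on the training set only, then reuse those weights for test documents; terms never seen in training are simply dropped. A minimal sketch in plain Python (unsmoothed IDF, for brevity):

```python
import math
from collections import Counter

def idf(train_docs):
    """IDF weights computed from the TRAINING corpus only."""
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    return {t: math.log(n / df[t]) for t in df}

def tfidf(doc, idf_weights):
    """TF-IDF for one document using pre-computed (training) IDF.
    Terms unseen in training get no weight and are dropped."""
    tf = Counter(doc)
    return {t: tf[t] * idf_weights[t] for t in tf if t in idf_weights}

train = [["apple", "pie"], ["apple", "juice"], ["pie", "chart"]]
w = idf(train)
vec = tfidf(["apple", "banana"], w)  # "banana" never seen in training -> dropped
```

Recomputing IDF on the test set would leak test statistics into the features and make train and test vectors incomparable, which is why the training IDF is reused.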

  • Bash Badawi added an answer:
    What is the difference between Hadoop and Data Warehouse?
I have read a couple of articles which try to sell the idea that an organization should basically choose between implementing Hadoop (a powerful tool for unstructured and complex datasets) or a Data Warehouse (a powerful tool for structured datasets). But my question is: can't they actually go together, since Big Data is about both structured and unstructured data?
    Bash Badawi

Financially speaking, Hadoop can be a very inexpensive alternative to a data warehouse. You can store your structured data across cheap computing/storage nodes as opposed to adding large servers, disaster recovery, fail-over, etc.

    Whereas in Hadoop, you can break up the data and let HDFS, the Hadoop Distributed File System, handle the 3x copies of each chunk of data. You can use Pig, Hive, Ambari, or Flume to run queries against it just as if you were using a DW. I have implemented this at a client site where we were migrating a COBOL/DB2 system and decided to go with Hadoop just to save money. It's also a great transition point until you know where the data should be housed, and Hadoop's 3x data replication and ease of scaling out versus scaling up are nice features.

  • Saurabh Gayali added an answer:
    What tags are more suitable for main content extraction from HTML webpages?

    Hello, everyone

I am interested in content extraction from HTML web pages. Currently I use HTML tags to divide a web page into blocks, and I use the tag-to-text ratio, anchor-text-to-text ratio, and title density to extract the main content. But not all HTML tags are appropriate for content extraction. So I want to know: which tags are more accurate and more suitable for cleaning web pages? Thank you all...

    Saurabh Gayali

Try Visual Ping; you can visually select what you want to extract.
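As a concrete example of the tag-to-text ratio the question mentions, it can be computed with nothing more than the standard library: blocks with a high ratio (navigation bars, menus) are candidates for removal, while low-ratio blocks tend to be main content. A rough sketch:

```python
from html.parser import HTMLParser

class TextDensity(HTMLParser):
    """Counts opening tags and visible characters in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.tags = 0
        self.chars = 0
    def handle_starttag(self, tag, attrs):
        self.tags += 1
    def handle_data(self, data):
        self.chars += len(data.strip())

def tag_to_text_ratio(html):
    """Tags per visible character; higher means more markup, less content."""
    p = TextDensity()
    p.feed(html)
    return p.tags / max(p.chars, 1)

# A nav bar is tag-heavy; an article paragraph is text-heavy.
nav = '<ul><li><a href="/">Home</a></li><li><a href="/x">X</a></li></ul>'
article = "<p>Content extraction keeps the long readable text of a page.</p>"
```

Comparing the two ratios (nav scores far higher than the article paragraph) is the basic signal the block-filtering heuristics in the question build on.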

  • Mustapha Bouakkaz added an answer:
    Can anyone suggest some concept extraction tools?

Friends, I want to extract concepts from a large collection of text. Are there any tools available for this? As far as I know, topics and concepts are different, and I can't use topic modeling tools to extract concepts. Please help me by suggesting some tools for concept extraction.

    Mustapha Bouakkaz

    TAG : Textual Aggregation by Graph 

  • Dr. Vaishali S. Parsania added an answer:
Any online data sources for download on which various data mining algorithms can be performed?

    A data source which is free of charge and permitted for download...

    Dr. Vaishali S. Parsania

    Thanks Lalitha....

  • Supratip Ghose added an answer:
Is there any open-source Moodle dataset that can be used for research purposes?

    I want to get a Moodle learning dataset in CSV format. Is there any open-source Moodle dataset for research purposes, and can anyone suggest tools to extract Moodle web data in CSV format?

    Supratip Ghose

Thanks for your answer. Actually, I am talking about integrating data mining tools into course management systems. My student followed the paper "Data mining in course management systems: Moodle case study and tutorial" by Cristóbal Romero, Sebastián Ventura, and Enrique García, published by Elsevier, to learn how to preprocess the data and use it for knowledge discovery. But they failed to extract the data with the tools described in the paper.

  • Sandeep R Sirsat added an answer:
Are there any algorithms for extracting aspects from text data?

I am currently working on topic-modeling-based, aspect-specific sentiment analysis of product reviews. The topics returned by topic modeling tools need not be aspects. So how can I find the aspects from this information? Do I need to find aspects manually, or are there any tools or algorithms available?

    Sandeep R Sirsat

Yes, there are many approaches available for extracting aspects from textual data. You can use semantics-based sentiment analysis to identify and extract aspects from textual corpora.

  • Viktor Dmitriyev added an answer:
    I want to download tweets for my research... Can you please help me to do the same?

I require a large dataset of tweets for Big Data analysis. Please guide me on how I can get those tweets.

    Viktor Dmitriyev

Follow the link to find a description of Twitter timeline extraction, with a Python script that extracts the tweets. Here is the link: https://github.com/rasbt/datacollect/tree/master/twitter_timeline .

    In addition, you will need to install the following packages: twitter, pandas, pyprind. This is described in the README file there.

    In case you are using Windows, it's worth checking the conda package manager (http://conda.pydata.org/docs/) shipped with Anaconda (https://store.continuum.io/cshop/anaconda/). Installing it will really simplify your life in terms of installing custom Python packages that otherwise require a lot of workarounds with compilers, etc.
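Once collected, each tweet is a JSON object; for analysis you usually flatten a handful of fields into a row. A small sketch (the field names follow Twitter's classic v1.1 payload, so treat them as an assumption):

```python
import json

# One downloaded tweet as raw JSON (minimal, illustrative payload).
raw = ('{"id": 1, "created_at": "Mon Jan 01 00:00:00 +0000 2024", '
       '"text": "hello", "user": {"screen_name": "alice"}}')

def flatten(tweet_json):
    """Pick out the fields typically used in tweet analysis."""
    t = json.loads(tweet_json)
    return {
        "id": t["id"],
        "created_at": t["created_at"],
        "user": t["user"]["screen_name"],
        "text": t["text"],
    }

row = flatten(raw)
```

A list of such rows drops straight into `pandas.DataFrame` for the kind of analysis the linked script feeds.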

  • Ahmad T Siddiqui added an answer:
    Looking for an old paper on a circuit-board information retrieval system implementation?

    Many years ago I read a paper on a hardware implementation of an information retrieval system. It was implemented as a circuit board, where the query would be set by putting jumpers on one side of the board and the result would be indicated by LEDs or the equivalent on another side of the board. The math behind it was very insightful, and I'd love to find it again, but I've been unable to. The paper was written (probably well) before 1975, perhaps even in the 1950's. I vaguely remember that the primary author's name began with an S but that's as far as I've gotten. (I'm not thinking of Vannevar Bush's Memex.)

    Can anyone help?

    Ahmad T Siddiqui

    Dear Sir,

    Check the file attached. May be it can help you.


  • Madhan Kumar Srinivasan added an answer:
    How do I get the DBLP and SIGMOD query set?

    Hello Everyone,

I want to know how to get the DBLP and SIGMOD query sets. If you know the links, please share them with me. And if the query sets cannot be obtained from links, did you create the test queries yourself when processing the queries? Please share. Thank you all.

    Madhan Kumar Srinivasan

I am not sure about the DBLP dataset. But if you can explore it, the following link is useful for getting good datasets for typical analytical problems. I hope this may be of use.


  • Arockiya Selvi added an answer:
    What is the difference between stopwords density and token count?

Hello, can you please share information with me about how to count the stopwords and tokens in a text? I would like clarification with examples. Thanks.

    Arockiya Selvi

Stopword density measures the proportion of words in a text that are stopwords (very common function words repeated many times). A high stopword density can confuse search engines about which keywords a page is actually about.

    Token count returns the number of tokens (the smallest units of a text) in your text.
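As a concrete example, both quantities can be computed in a few lines (the stopword list here is a tiny illustrative sample, not a real corpus):

```python
import re

# Tiny illustrative stopword list; real lists (e.g. NLTK's) are much larger.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def token_count(text):
    return len(tokenize(text))

def stopword_density(text):
    """Fraction of tokens that are stopwords."""
    tokens = tokenize(text)
    return sum(t in STOPWORDS for t in tokens) / len(tokens) if tokens else 0.0

text = "The cat sat on the mat"
token_count(text)       # -> 6
stopword_density(text)  # -> 0.5  ("the", "on", "the" out of 6 tokens)
```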

  • Michal Meina added an answer:
    Where can I get a stopwords list for webpage categorizations?

Dear all, I want to get some stopwords for web page classification to use when training classifiers. If you know of any links or how to get these stopwords, can you please share them with me? Thanks all.

    Michal Meina

I would strongly recommend using the stopword corpus from NLTK [ http://www.nltk.org/book/ch02.html ].

    It has 2,400 stopwords for 11 languages

  • Efstratios Kontopoulos added an answer:
Can anyone suggest to me what can be done in the area of semantic web scraping?

    I am interested in doing some work in the area of semantic web crawling/scraping and using that semantic data for discovery.

  • Panei San added an answer:
What kind of Java is appropriate for information extraction and web page classification?

    Hello everyone!

    Can you advise me which Java tools are best to learn, in your opinion?

    Panei San


  • Stephane RIBOT added an answer:
    What's the easiest way to collect data from Twitter and Facebook?
I'm developing a strategy as an MSc project. I will be monitoring, collecting, and analyzing the data of a Facebook page (posts, comments, likes, shares) and a Twitter profile (tweets, retweets, mentions, and public tweets containing only one or two keywords). Any suggestions would be great. Also, what mining techniques do you recommend? I'm thinking of sentiment analysis and would like to use one or two more techniques. What techniques do you recommend?

    Stephane RIBOT
The easiest is GNIP, http://gnip.com/.
    Note that you may have to pay, but it may be worth it due to its simplicity (if you don't program, that is!).
  • Ian Kennedy added an answer:
    Can a Probabilistic timed automaton be used to model the underlying network in query routing?
The network for routing the query is based on a Markov process. If we want to model the time taken to answer a query, is a probabilistic timed automaton a better model?
    Ian Kennedy
A stochastic queue would be indicated. Look up queueing theory; start here: http://en.wikipedia.org/wiki/Queueing_theory. If you wanted to use a number of probabilistic timed automata, you would then have the complexity of having to build in the appropriate statistical properties.
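As a concrete starting point from queueing theory: in the simplest M/M/1 model (Poisson query arrivals at rate λ, exponential service at rate μ), the mean time a query spends in the system is 1/(μ − λ). A quick sketch:

```python
def mm1_mean_time_in_system(arrival_rate, service_rate):
    """Mean time a job spends in an M/M/1 queue (waiting + service).
    Valid only for a stable queue, i.e. arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be < service rate")
    return 1.0 / (service_rate - arrival_rate)

# 2 queries/sec arriving, 5 queries/sec served -> 1/3 sec mean response time
mm1_mean_time_in_system(2.0, 5.0)
```

Real query-routing networks need richer models (networks of queues, general service times), but this formula is the usual first sanity check.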
  • Akila Gopu asked a question:
    What is the best model to capture Query passing/tossing in CQA?
    CQA is Community Query Answering, example stackoverflow.com or Yahoo answers. The current model that is used to capture this query passing is the Markov model. Is there any other alternative to the Markov model?
  • Alexandre Beauvois added an answer:
    Looking for an efficient algorithm available for web crawling
I need to extract specific data from related websites. For example, I need to extract data from a specific website providing positive feedback about a type of vehicle. Kindly suggest some good code or an algorithm for this.
    Alexandre Beauvois
Cheerio for the JavaScript (Node.js) programming language: see http://vimeo.com/31950192
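At its core, a crawler is a breadth-first traversal of the link graph with a visited set; fetching, parsing, politeness, and relevance filtering all hang off that loop. A sketch over a toy in-memory "web" (a real crawler would fetch pages, extract links, and respect robots.txt):

```python
from collections import deque

# Toy "web": page -> outgoing links (stands in for fetching + link extraction).
WEB = {
    "/": ["/reviews", "/about"],
    "/reviews": ["/reviews/car-a", "/"],
    "/reviews/car-a": ["/reviews"],
    "/about": [],
}

def crawl(start, get_links):
    """Breadth-first crawl with a visited set so each page is fetched once."""
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)  # in a real crawler: fetch, extract data, extract links
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

order = crawl("/", WEB.get)
```

For the "positive feedback about a vehicle" use case, a relevance check on each page (keyword or classifier) would decide what to keep and which links to enqueue (a focused crawler).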
  • Fotis Kokkoras added an answer:
    Data mining and web content mining
How can data mining algorithms be implemented for web content mining?
    Fotis Kokkoras
Web content extraction is a task applied to web pages, not to databases. You scrape unstructured data from the web, put it into structured storage (databases), and then apply data mining algorithms to it. That's the order.
  • Debajyoti Mukhopadhyay asked a question:
    Semantic Search Engine with user friendly output and ranking needed.
    We have witnessed the power of a regular search engine like Google. There is a semantic search engine like Swoogle as well. However, we are trying to build a semantic search engine with more user friendly display capability and relevant ranking algorithm. Can anybody suggest ideas?
  • Andras Kornai added an answer:
    Chinese textmining software
Does anybody know of any usable text mining software that does topic modeling and also covers Chinese? This seems harder to find than I had thought. I found things like FudanNLP (http://code.google.com/p/fudannlp/) and ICTCLAS (http://www.ictclas.org/ictclas_download.aspx), neither of which I have been able to make work so far. Pingar (http://apidemo.pingar.com/AnalyzeDocument.aspx) doesn't seem to have topic extraction. MALLET does seem to have a Chinese module and does have topic modeling, but I have yet to figure that one out too. Does anybody have any other suggestions?
    Andras Kornai
    Have you considered commercial software vendors like Basis Tech?
  • Massimo Ruffolo added an answer:
    Making effective Web Content Extraction technologies
One of my principal research and development interests is Web Content Extraction. I founded a start-up in this field, www.altiliagroup.com. If anyone is interested in collaborating with us on this topic or in working as the principal software architect for Altilia, please let me know.
    Massimo Ruffolo
    Hi all,

Thanks for your interest in my post.

    We are searching for companies interested in becoming resellers of our content extraction and management technologies, and for technical people with deep expertise in web content extraction technologies who are interested in working with us as software architects.

About Web Content Extraction

Content extraction is the process of identifying the Main Content of a page and/or removing additional items such as advertisements, navigation bars, design elements, or legal disclaimers. The rapid growth of text-based information on the Web, and the various applications making use of this data, motivate the need for efficient and effective methods to identify and separate the Main Content (MC) from the additional content items.
