Science topic

Web Mining - Science topic

Explore the latest questions and answers in Web Mining, and find Web Mining experts.
Questions related to Web Mining
  • asked a question related to Web Mining
Question
3 answers
I have a number of ongoing research projects on adaptive web mining techniques and online social network analysis, with applications.
Collaboration, with funding support for presenting research outputs at top conferences, workshops and international journals, is highly solicited.
You can contact me via temitayo.fagbola@fuoye.edu.ng
Thank you
Relevant answer
Answer
I am interested in working on online social networks and social network analysis.
  • asked a question related to Web Mining
Question
2 answers
chenliang li
Relevant answer
Answer
Hello, please check the MATLAB and Python software.
  • asked a question related to Web Mining
Question
25 answers
Dear all,
Do you know of any available dataset for text summarization, i.e., one that includes reference summaries?
Relevant answer
Answer
Dear Keramatfar,
Luis Adrián Cabrera-Diego is right. Please go through this.
  • asked a question related to Web Mining
Question
10 answers
We have a dataset collected from multiple users and would like to measure similarity and distance between users in order to build user profiles. Currently we are using some common clustering approaches such as k-means, hierarchical clustering and GMM, but we would like to hear from other active researchers whether there are useful techniques we haven't thought of.
Relevant answer
Answer
You can use factor analysis. It provides a clustering of items that have similarities. Once they are clustered, you can think of names for the different clusters. SPSS can do this.
  • asked a question related to Web Mining
Question
6 answers
Can anyone suggest an open-source, real-time data set for applying fuzzy clustering?
Relevant answer
Answer
Glad to be of help.
  • asked a question related to Web Mining
Question
17 answers
I am interested in finding out the frequency of updates of several websites from a central point, instead of scanning every page and link for dates or contacting the owners. I have tried the Wayback Machine, but I'm not sure crawl dates are the same as update dates.
Relevant answer
Answer
Just open the page in question, then type
javascript:alert(document.lastModified)
into your address bar.
Cheers!
  • asked a question related to Web Mining
Question
2 answers
The paper "Tweet Segmentation and its Application to Named Entity Recognition" does not explain how the meaningful phrases are split.
Relevant answer
Answer
Do you just need the sentences split up? You can use Python's NLTK for that (after downloading the 'punkt' models with nltk.download('punkt')):
Code:
from nltk.tokenize import sent_tokenize
text = "You rule. It's the truth! Someone should have said that before."
phrases_list = sent_tokenize(text)
phrases_list
Result:
["You rule.", "It's the truth!", "Someone should have said that before."]
  • asked a question related to Web Mining
Question
2 answers
Can anyone make a suggestion?
Relevant answer
Answer
That really depends on what you are mining for. Are you doing opinion mining, topic classification, or something else? Sequential pattern analysis is based on the Apriori algorithm for mining patterns in event sequences. So it really depends on the problem at hand.
  • asked a question related to Web Mining
Question
1 answer
We have witnessed the power of a regular search engine like Google. There are semantic search engines like Swoogle as well. However, we are trying to build a semantic search engine with a more user-friendly display and a relevant ranking algorithm. Can anybody suggest ideas?
Relevant answer
Answer
Where can I find more formal information about the semantic search of the RG search engine?
  • asked a question related to Web Mining
Question
3 answers
I have the MovieLens dataset containing ratings of 1682 movies by 973 users. I want to build a movie recommendation system. How can I do this project in MATLAB or Python?
Relevant answer
Answer
Hi Atta,
There are various approaches to movie recommendation and to recommender systems in general. Broadly, these divide into content-based approaches or social/collaborative approaches.
Content-based approaches are easier for recommending other content types (like text) where analysis techniques are more advanced. Social/collaborative approaches are based on the assumptions that (for example) people in the same demographic segment as you or people who have previously liked similar movies to you will share your tastes. This can then be used as a basis for recommendation.
An early survey of relevant approaches including some which are relatively straightforward to implement can be found here: http://ewic.bcs.org/content/ConWebDoc/4843
A more recent survey is here:
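As a concrete starting point, the collaborative idea (users with similar past ratings predict each other's tastes) can be sketched in plain Python; the ratings below are toy values, not parsed MovieLens data:

```python
from math import sqrt

# Toy user -> movie -> rating data in the MovieLens spirit (illustrative values).
ratings = {
    "alice": {"Heat": 5, "Alien": 4, "Up": 1},
    "bob":   {"Heat": 4, "Alien": 5, "Up": 2},
    "carol": {"Heat": 1, "Up": 5, "Cars": 4},
}

def cosine(u, v):
    """Cosine similarity over the movies two users rated in common."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[m] * v[m] for m in common)
    den = sqrt(sum(u[m] ** 2 for m in common)) * sqrt(sum(v[m] ** 2 for m in common))
    return num / den

def recommend(user, k=1):
    """Rank movies the target user has not seen by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for movie, r in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # ['Cars'] -- the only movie alice has not rated
```

On the real MovieLens files the same loop applies once the ratings are loaded into the nested-dict shape above.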
  • asked a question related to Web Mining
Question
2 answers
1) I need to extract movie genres from DBpedia with SPARQL; can anyone provide links or materials on this?
2) I want a user-feature matrix with genres as data points (18) and users as observations (6040). I need the procedure to get this done. Relevant links and documents would be appreciated.
Thanks
Relevant answer
Answer
1) EXAMPLE
movie x genre (the property names below are the usual DBpedia ones; verify them against the current ontology):
SELECT * WHERE {
  ?movie a dbo:Film ;
         dbo:genre ?genre .
}
EXAMPLE
movie x genre for all Italian movies:
SELECT * WHERE {
  ?movie a dbo:Film ;
         dbo:genre ?genre ;
         dbo:country dbr:Italy .
}
2) In Java? A 2-dimensional array of Genre objects?
  • asked a question related to Web Mining
Question
6 answers
Is there any API available for collecting Facebook datasets to implement sentiment analysis?
Relevant answer
Answer
You can use the aforementioned datasets, or, if you want to scrape the data yourself, there is the Facebook Graph API.
PYLON provides access to previously unavailable Facebook topic data, but it is a paid service.
If you don't exclusively want Facebook datasets, you can easily get data from other sources (Twitter, Google, Wikipedia) using the Pattern library.
  • asked a question related to Web Mining
Question
2 answers
I'm brand new to social network analysis. I'm trying to identify meme creators on Twitter. Is there a way to do this using data downloaded from Twitter?
Relevant answer
Answer
From data downloaded via the Twitter Streaming API, you can check whether a tweet is a retweet through the 'retweeted' field included in the JSON of the status (a boolean value). In that case, the media may be included in the original tweet, which you can access through 'retweeted_status' (the value of this field is a representation of the original tweet). However, if the user has not directly retweeted the media, but stored it locally and tweeted it later, it is not possible to ascertain the originality of the media.
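That check can be sketched as follows; the field names follow the Twitter v1.1 status JSON, and the sample status itself is made up:

```python
import json

# A made-up status dict shaped like a Twitter v1.1 retweet.
status_json = '''{
  "id": 2, "text": "RT @alice: look at this meme",
  "retweeted_status": {"id": 1, "user": {"screen_name": "alice"},
                       "text": "look at this meme"}
}'''
status = json.loads(status_json)

def original_tweet(status):
    """Return the original tweet for a retweet, else the status itself."""
    return status.get("retweeted_status", status)

orig = original_tweet(status)
print(orig["user"]["screen_name"])  # alice -- the candidate meme creator
```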
  • asked a question related to Web Mining
Question
4 answers
I am looking for multiple documents drawn from the same domain. I would like to aggregate information from multiple documents for summarization.
  • asked a question related to Web Mining
Question
7 answers
I have engineering data where I need to classify Event vs. Non-Event based on operational parameters. My Event class makes up about 1% of the data and the Non-Event class 99%. I read an article about oversampling and undersampling, but in my case these methods don't work: Events look very similar to Non-Events because the data comes from sensors sampled very frequently.
How can I classify Event vs. Non-Event in this imbalanced classification problem?
Relevant answer
Answer
Probably you can try a One-Class SVM, since it is used for outlier/novelty detection; you can treat your Event class as the outliers. You can also look at the probability estimates produced by a logistic regression classifier and check how they differ for Events vs. Non-Events. Then you can set a threshold: for a probability estimate > x, the input is an Event, otherwise a Non-Event... something along those lines.
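A One-Class SVM fits on the majority (Non-Event) class and flags points that fall far from it. The same fit-on-normal, flag-the-distant pattern can be sketched with a simple distance-to-centroid score; this is an illustrative stand-in for the SVM, and the sensor values are made up:

```python
from math import sqrt

# Fit on Non-Event (majority) readings only, then flag readings whose distance
# to the Non-Event centroid exceeds a threshold -- the same overall pattern a
# One-Class SVM follows, with a much cruder decision boundary.
non_events = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0)]

centroid = tuple(sum(c) / len(non_events) for c in zip(*non_events))

def distance(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Threshold: largest distance seen on the normal data, plus a margin.
threshold = max(distance(p, centroid) for p in non_events) * 1.5

def is_event(reading):
    return distance(reading, centroid) > threshold

print(is_event((5.0, 5.0)))  # True  -- far from normal operation
print(is_event((1.0, 1.0)))  # False
```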
  • asked a question related to Web Mining
Question
12 answers
I am doing research on tweet and hashtag analysis related to influenza prediction, and I need a historical dataset from Twitter covering the influenza period (December to March).
Does anybody have an idea about how to get this data?
Relevant answer
Answer
Hi Askoum,
There are certain tools (free as well as paid) which allow you to search and store Twitter data retrieved for a specific hashtag. Below are some of the tools that you can try:
  • Keyhole (www.keyhole.co) is a great tool for searching hashtags on various social media platforms. It allows you to download the respective data for your search into an excel or comma-separated values (CSV) file.
  • Ritetag (www.ritetag.com) is another online search tool that allows you to search hashtags (including how the tag has been used as well as the influencers) 
  • NodeXL (www.codeplex.com) can be used to analyze Twitter data for research purposes.
There are various tutorials available on Youtube as well as various websites that are helpful. For example, you can try http://social-metrics.org/downloading-tweets-with-a-specific-hashtag/
I hope it helps. 
  • asked a question related to Web Mining
Question
3 answers
Kindly share links to e-learning weblog datasets.
Relevant answer
Answer
Can you please check quandl.com?
  • asked a question related to Web Mining
Question
2 answers
I want a dataset covering types of hackers based on their behavior on websites.
Alternatively, I want to build such a dataset myself, but I don't have any idea how to build one.
Relevant answer
Answer
Thank you Mr Martin Gilje Jaatun for answering my question. I hope that will help me to find any information about my research.
  • asked a question related to Web Mining
Question
4 answers
Hi all,
I want to use big data clustering algorithms in my PhD work, but I don't know which topic is appropriate for applying them. What are good applications of big data clustering algorithms?
If you can help me, I will be grateful.
Relevant answer
Answer
One of the biggest generators of big data is mobile network operators. :)
  • asked a question related to Web Mining
Question
4 answers
I want to fetch online news from different news sources, from today back to one month ago. How can I download this news? Is there any news API available in Python for Hindi news sources such as AajTak, Dainik Jagran, Dainik Bhaskar, etc.?
Relevant answer
Answer
Mr. Santosh Kumar,
I am not sure about an API for Hindi news, and it is rather difficult to find one covering all news providers.
But you can easily find an RSS feed and fetch news from, e.g., Dainik Bhaskar (1st link below), which can also be done using the Python feedparser library (2nd link).
For more help with RSS in Python, you can search the web.
Hope this helps.
Thank you.
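For reference, an RSS 2.0 feed can also be parsed with the standard library alone (feedparser is more robust). The feed XML below is inline and made up; in practice you would fetch the provider's real RSS URL with urllib.request:

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed; a real one would be fetched over HTTP.
rss = """<rss version="2.0"><channel>
  <title>Demo Hindi News</title>
  <item><title>Headline one</title><link>http://example.com/1</link></item>
  <item><title>Headline two</title><link>http://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# Each <item> carries one story: pull out its title and link.
headlines = [(item.findtext("title"), item.findtext("link"))
             for item in root.iter("item")]
for title, link in headlines:
    print(title, link)
```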
  • asked a question related to Web Mining
Question
5 answers
The result is obtained only by typing the keywords into the search box.
Relevant answer
Answer
Thanks, everyone.
  • asked a question related to Web Mining
Question
14 answers
Please, could anyone provide a link to an Algerian dialect tweets dataset annotated for sentiment analysis?
Relevant answer
Answer
Salam Alikoum,
In 2016 we developed a sentiment analysis system for vernacular Algerian Arabic (the Algerian dialect).
Among the components of our system:
1- The lexicons (L1, L2, L3 and L4),
2- The dataset (from Twitter),
3- The different algorithms (subjectivity detection, polarity formula, etc.).
* We made our code available to other researchers at the following link:
* Our work was published as a paper at the CICLing 2016 conference, and then in the RCS journal; here is the link:
I am providing the resources as an attachment.
I wish you good luck.
-------------------------------
Dr. M'hamed MATAOUI
  • asked a question related to Web Mining
Question
4 answers
I am building an analysis tool and would like to see how it behaves with real-world data. Other traces, such as car GPS data or check-in data (e.g., when you use a bus card to pay for a ride), might also help.
I am already planning to use Twitter data collected from geolocated tweets.
  • asked a question related to Web Mining
Question
4 answers
At this point, our research does not wish to perform blind text mining; however, we may wish to provide some indication of the type of text content in which we are interested.
Relevant answer
I see. 
An important question, then, is what unit of text should be used for this task?
  • asked a question related to Web Mining
Question
3 answers
The content is from existing technology manuals and blogs, written both by paid technology writers and by external contributors. We intend to build an FAQ corpus out of these. Thanks.
Relevant answer
Answer
Pretend that you lecture the corpus as a course, and that you have to test an employee on every sentence in the corpus. Manually, all you do is convert each sentence into a question. The question can seek to establish the subject, verb or object of the sentence, as pleases you. The research problem is how to do this automatically with NLP. I attach the beginning of your reading list, and a link to ArikIturri: An Automatic Question Generator Based on Corpora and NLP Techniques. Have fun!
  • asked a question related to Web Mining
Question
3 answers
Hello,
I'm looking for a dataset with (obviously) features and genre tags for every song. I already have the subset retrievable from the official website (http://labrosa.ee.columbia.edu/millionsong/lastfm), but it seems the entire set has gone missing and it's hard to locate a source.
The Google group dedicated to the dataset is inactive and closed.
I wonder if someone is working on the same data or can point me to a similar dataset.
Many thanks in advance.
Relevant answer
Answer
Thanks, everyone. I actually needed a ready-made dataset of features plus musical genre labels. I found an image of the MSD here, and the related JSON files (tags/labels) are downloadable from the website.
  • asked a question related to Web Mining
Question
7 answers
Hi 
I know the somewhat dated "A Comparison of Open Source Search Engines" by Christian Middleton and Ricardo Baeza-Yates; it does not cover the newer open-source libraries.
Is there a library faster than Lucene for information retrieval at the moment?
Also, what term-weighting schemes does the Lucene package support?
Thanks
Osman
Relevant answer
Answer
Consider Solr (built on the top of Lucene) and Elasticsearch also (depending on your needs). Both Solr and Elasticsearch are solid, high-performance applications (with proper configurations, of course). Another issue to take into account is scalability. Many features of popular SE are depicted here: http://db-engines.com/en/system/Elasticsearch%3BSolr%3BSphinx
  • asked a question related to Web Mining
Question
4 answers
Hi all,
I am trying to find patterns among a website's visitors. Though I could extract all the data I'd want, I see the convenience of working with sample data.
The first step is setting the date range; considering that a website is a dynamic environment, it may be misleading to take a wide period.
So, for this data type, what date range would be appropriate? (I normally take 1 to 3 months.) Once the time frame is selected, what sampling methods should I use to ensure sample representativeness?
Many thanks!
Relevant answer
Answer
That's a good question. In this case, I think the first thing you must consider is the objective of your research, and depending on that objective decide whether to take seasonality into account. I actually doubt it exists here, but depending on the nature of the website it could be a factor to consider.
  • asked a question related to Web Mining
Question
8 answers
My thesis is about Analysis and Auto generation of FAQ lists in different domains. For conducting experiments, I need high volume of FAQs. That's the reason I am looking for a publicly available data-set containing FAQs in various domain (or even one specific domain).
Relevant answer
Answer
Hello Fatemeh Razzaghi,
I am about to start my thesis, in which I need the same dataset you describe. How did you manage to solve this requirement? Did you use an existing dataset, or did you use a crawler?
Thanks!
  • asked a question related to Web Mining
Question
9 answers
I'd like to mine web pages to build a dataset of pages taken from a particular website (e.g., news sites). It would target articles not only from one section but also from the other sections of the site (for instance, politics, tech, etc. from CNN.com). All of these articles combined would be retrieved from three years of publication, meaning I'd have all of the articles published in that three-year period. What tools and techniques can I use?
Relevant answer
Answer
I suggest https://scrapy.org/, a Python-based web crawler.
Also look at quora answer about web crawling services: https://www.quora.com/What-are-the-best-web-crawling-services
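At its core, a crawler fetches a page, extracts the article links, and follows them section by section; Scrapy packages that whole loop. The link-extraction step alone can be sketched with the standard library (the HTML snippet is made up):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

page = '<html><body><a href="/politics/a1">A1</a> <a href="/tech/a2">A2</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/politics/a1', '/tech/a2']
```

In a real crawl you would fetch each extracted link in turn (respecting robots.txt and rate limits) and archive the article bodies.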
  • asked a question related to Web Mining
Question
2 answers
A tool which can accept Tamil documents for classification and other mining preprocessing steps.
Relevant answer
Answer
Honestly, I think that you'd have a lot more luck using [Google's] automated tools to translate the documents to English and then use the standard best-of-breed tools for the rest of the steps.  The magnitude of "market/mind-space" share of English simply overwhelms any other choice.
  • asked a question related to Web Mining
Question
3 answers
What is the difference between the Rand index computed by this function
(Rand <- function(clust1, clust2) clv.Rand(std.ext(clust1, clust2)))
and the Rand index from cluster.stats in the fpc package? When I apply them to my clustering I get different results; which is right?
Relevant answer
There are two expressions (two formulas) available for the definition of the Rand index, depending on the denominators you are dealing with. Recall that this index measures the association between two partitions (two nominal variables, two categorical variables, or two "equivalence relations" in the binary-relation framework).
The first expression takes into account the reflexivity of the equivalence relations, and consequently the denominator of this version is N², so this version of the Rand index varies within the interval [1/N, 1]. The second version does not account for the reflexivity of an equivalence relation, so its denominator excludes the diagonal pairs, and it varies in the interval [0, 1]. This difference can generate discrepancies between the two index values; maybe that is the problem you have been faced with.
By the way, it is interesting to note that the Rand index (introduced in 1971) is nothing but the Condorcet coefficient invented by that famous French scientist almost 200 years earlier (in 1785). Here are three references where this index is studied in depth.
Ah-Pine J. , Marcotorchino F.: «Overview of the Relational Analysis approach in Data-Mining and multi-criteria Decision Making», Web Intelligence and Intelligent Agents, book edited by: Zeeshan-ul-hassan Usmani , INTECH Publisher: (February 2010). (in English)
Marcotorchino F., El Ayoubi N. : «Paradigme logique des écritures relationnelles de quelques critères fondamentaux d'association », Revue de Statistique Appliquée, Vol 39, n°2, pp :25-46 (1991) (in French)
Ah-Pine J., Marcotorchino F.: «Unifying Some Association Criteria between Partitions by Using Relational Matrices », Communications in Statistics: Theory and Methods, pp: 531-542, Taylor &Francis Publisher, Philadelphia, (September 2009)
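The two denominators described above can be compared directly. A minimal Python sketch on toy labels, where the pair-based version divides by the N(N-1)/2 unordered pairs (range [0, 1]) and the reflexive version counts all N² ordered pairs including the always-agreeing diagonal (range [1/N, 1]):

```python
from itertools import combinations

# Toy cluster labels, one per object, for two partitions of the same 4 objects.
clust1 = [1, 1, 2, 2]
clust2 = [1, 1, 2, 3]

def rand_pairs(a, b):
    """Pair-based Rand index: agreements over the N(N-1)/2 unordered pairs."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)

def rand_reflexive(a, b):
    """Reflexive variant: all N^2 ordered pairs, including the diagonal."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(n))  # i == j always agrees
    return agree / n ** 2

print(rand_pairs(clust1, clust2))      # 5/6 ~ 0.833
print(rand_reflexive(clust1, clust2))  # 14/16 = 0.875
```

The two values differ on the same input, which is exactly the kind of discrepancy two R implementations using different conventions would show.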
  • asked a question related to Web Mining
Question
5 answers
I'm looking for suggestions of research that analyzes big data volumes with the aim of discovering association rules or patterns that enhance and further improve the teaching and learning process for students.
Relevant answer
Answer
Opinion mining in the context of teaching and learning methodologies/environments and evaluation processes, including sentiment analysis.
  • asked a question related to Web Mining
Question
3 answers
I am looking for a classification method to build a binary classifier for web documents, i.e., a classifier that predicts whether a document belongs to the domain of interest or not. A domain here is a broad category, e.g., science. I am wondering whether there is any work in the neural network community that does this efficiently with training data of ~10K webpages labeled 0/1.
A simple language-model-based approach has not proved useful so far. Would an NN-based model make more sense for this task?
Relevant answer
Answer
Hello,
I found this post and the accompanying code very useful to implement a Deep Learning based text classifier:
The code is available here:
It is in Python using the Tensorflow framework.
It is based on this article:
Convolutional Neural Networks for Sentence Classification
Yoon Kim, EMNLP 2014
There is a very good follow-up article providing practical recommendation as well:
A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification
Ye Zhang, Byron Wallace
This approach works well for sentence classification.
I assume it will be more problematic for document classification.
A more recent paper proposes a very nice trick to design document classifiers:
Document Classification by Inversion of Distributed Language Representations
Matt Taddy, ACL 2015
Python code is available here:
This only uses word2vec-like dense distributed representations of words, trained using a neural network, and derives an effective document classifier from that lexical representation.
Michael
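Before any neural model, a bag-of-words baseline for the binary in-domain/out-of-domain task is worth running. A minimal Naive Bayes sketch (illustrative only, not the CNN approach from the papers above; the training texts are made up, but with ~10K labeled pages the same code applies unchanged):

```python
from collections import Counter
from math import log

# Tiny labeled corpus: 1 = in-domain (science), 0 = out-of-domain.
train = [("quantum physics experiment results", 1),
         ("gene expression in cell biology", 1),
         ("celebrity gossip and fashion tips", 0),
         ("football match final score", 0)]

counts = {0: Counter(), 1: Counter()}  # word counts per class
docs = Counter()                       # document counts per class
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())

vocab = set(counts[0]) | set(counts[1])

def predict(text):
    """Multinomial Naive Bayes with Laplace smoothing, in log space."""
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = log(docs[label] / sum(docs.values()))  # class prior
        for w in text.split():
            # +1 smoothing so unseen words do not zero out a class.
            score += log((counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("physics experiment"))  # 1 (science)
print(predict("fashion score"))       # 0
```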
  • asked a question related to Web Mining
Question
4 answers
I am doing my research on ambiguity and disambiguation in information science, in representation systems, sociocognitive systems and new info-communication paradigms. Do you have any advice, please?
Relevant answer
Answer
Mr. Oluwarotimi,
thanks for your suggestions; they are very useful for my research.
Thanks.
  • asked a question related to Web Mining
Question
2 answers
Meta objects: likes, dislikes, number of comments, etc.
Visual objects: persons, places, etc.
Relevant answer
Answer
Sir, it is all a matter of the metadata associated with the objects.
  • asked a question related to Web Mining
Question
6 answers
Related to web mining.
Relevant answer
Answer
Dear Hemamalini,
Dear researcher, you didn't mention which language you are going to consider. For a generic language (e.g., English, Hindi, Tamil), relevant-information mining proceeds as follows. First, it is important to understand natural language processing techniques such as text preprocessing, which removes words that carry no value, such as "is" and "am", using a stop-word lexicon. Second, the text should be stemmed to find the root of each word. Then comes part-of-speech tagging: determining whether each word is a noun, adjective, adverb, etc. Then you can prepare a lexicon of the features/relevant words you want, which will help you find occurrences of those words in your web text. Accordingly, you can categorize your web text or select the relevant text you need from unstructured web text.
You can refer to some of my papers, which might help you:
Regards,
Tulu Tilahun
Lecturer at Arba Minch University
Ethiopia
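The pipeline described above (stop-word removal, stemming, then matching against a feature lexicon) can be sketched in a few lines. The stop-word list, the crude suffix-stripper (a stand-in for a real stemmer such as Porter's) and the lexicon below are all illustrative:

```python
STOPWORDS = {"is", "am", "the", "a", "of", "in"}
# Feature lexicon, stored as the stems that crude_stem() below produces.
RELEVANT = {"min", "crawl", "clust"}

def crude_stem(word):
    """Strip a few common suffixes; a real stemmer gives better roots."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def relevant_terms(text):
    """Stop-word removal -> stemming -> lexicon lookup."""
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]
    stems = [crude_stem(w) for w in tokens]
    return [s for s in stems if s in RELEVANT]

print(relevant_terms("The crawler is mining clusters of pages"))
# ['crawl', 'min', 'clust']
```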
  • asked a question related to Web Mining
Question
1 answer
I'm looking for any publicly available annotated data for microblogs or tweets in Arabic or English, covering the entity types Person, Organization and Place (location).
I've been looking through the community's papers and there appears to be no such data publicly available.
  • asked a question related to Web Mining
Question
4 answers
I am using a Tesla K20. I got an error that shared memory is limited to 16 KB, although the K20 supports up to 48 KB. How do I configure the GPU and the NVCC compiler to use 48 KB of shared memory instead of 16 KB?
Relevant answer
Answer
You will probably want to also take a look at cudaFuncSetCacheConfig, documented here:
For shared memory use on the K20 you may want to configure this to "cudaFuncCachePreferShared".
  • asked a question related to Web Mining
Question
8 answers
I am trying to reproduce experimental results from
[1] G. Guo, G. Mu, Y. Fu, and T. S. Huang, “Human age estimation using bio-inspired features,” 2009 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work. CVPR Work. 2009, pp. 112–119, 2009.
They use the Yamaha Gender and Age (YGA) database as well as the FG-NET aging database, and it seems all of the links to their supposed locations are down.
The link for the FG-NET aging database should be the following:
and for the YGA database I cannot even find a reference.
Do you know of other databases used for age estimation that include age and gender labels?
Thank you very much for your time
Pascal
Relevant answer
Answer
Hi Muhammad and Volkmar, thank you for your help,
Concerning the FG-NET aging database, I am currently in contact with Mr. Andreas Lanitis, one of the researchers from whom the database originates. He pointed me to that link, through which I have to apply to get access to the database.
I would still like to find the Yamaha Gender and Age database, if anyone can help with that.
Best regards,
Pascal
  • asked a question related to Web Mining
Question
7 answers
Student prediction, educational data mining.
Relevant answer
Answer
Have a look at the Cross-Industry Standard Process for Data Mining (CRISP-DM):
It is concerned with the data mining process and project as a whole.
  • asked a question related to Web Mining
Question
5 answers
I am in search of a good standard dataset for music recommendation systems; it should contain music together with places and tags.
Relevant answer
Answer
Hello Sapan, if you're still looking for it, you may want to take a look at these resources:
Best,
Danilo
  • asked a question related to Web Mining
Question
3 answers
Noisy links that lead the user to a false target.
Relevant answer
Answer
Hi, Taghandiky,
About your question, you need to be more specific. I don't know if you can open the following link, but it is research from 2010 titled "Combating Link Spam by Noisy Link Analysis".
There they define a noisy link as a non-voting link: a link that does not endorse or give support to the target page it points to. One example is links from within the same site (in the paper, called "in-link pages"). I am also linking the ResearchGate reference for this paper.
Anyway, I'm assuming that you are talking about search engine ranking. Can you confirm whether you are asking about spamdexing, or give more information, so we can better understand your question?
  • asked a question related to Web Mining
Question
11 answers
I want to download random tweets from Twitter for a specific time period (two years, 2011-2013). I have tried the statuses/sample API, but couldn't specify the time period.
Relevant answer
Answer
Hi Meenal Lanke, Twitter is designed in such a way that it only returns tweets based on an event/user/keyword. I guess your research will be confined to a particular domain. Once you receive the tweets (e.g., using twimemachine.com), you can use regexes to extract the tweets from a particular period of time. I wrote a parser to fetch particular tweets and clean them as required; please share your email address and I will send it to you.
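Once tweets are in hand, restricting them to a period is a date comparison. A minimal sketch, where the sample tweets are made up and 'created_at' uses the Twitter v1.1 timestamp format:

```python
from datetime import datetime, timezone

# Twitter v1.1 "created_at" format, e.g. "Tue Jun 05 12:30:00 +0000 2012".
FMT = "%a %b %d %H:%M:%S %z %Y"
start = datetime(2011, 1, 1, tzinfo=timezone.utc)
end = datetime(2013, 12, 31, tzinfo=timezone.utc)

tweets = [
    {"text": "old tweet", "created_at": "Mon Jan 03 10:00:00 +0000 2005"},
    {"text": "in range", "created_at": "Tue Jun 05 12:30:00 +0000 2012"},
]

def in_period(tweet):
    created = datetime.strptime(tweet["created_at"], FMT)
    return start <= created <= end

kept = [t["text"] for t in tweets if in_period(t)]
print(kept)  # ['in range']
```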
  • asked a question related to Web Mining
Question
15 answers
I am doing research in the e-learning field based on tweets. For this purpose, I need a huge collection of tweets, e.g. 20,000.
Is there any Twitter API for searching domain-specific results, such as Computer Science?
Thanks in advance.
Relevant answer
Answer
You need to either:
1) Find a publicly available corpus (there are many pre-filtered and already collected tweets)
2) Use Twitter's API and select the hashtag you want to get data for. The hashtag is what 'categorizes' the tweet. This tutorial will help you greatly (http://adilmoujahid.com/posts/2014/07/twitter-analytics/)
  • asked a question related to Web Mining
Question
1 answer
At the moment, I was able to find these papers:
1. Prototype a Knowledge Discovery Infrastructure by Implementing Relational Grid Monitoring Architecture (R-GMA) on European Data Grid (EDG) by Frank Wang, Na Helian, Yike Guo, Steve Thompson, John Gordon.
2. Knowledge grid-based problem-solving platform by Lu Zhen, Zuhua Jiang, Jun Liang.
Thank you in advance for any help.
Relevant answer
Answer
Hello Pawel,
Look into the attached paper, and this link for the application.
The paper gives you a well-structured idea of the application and extension of Grid technology to knowledge discovery in Grid databases.
If you are working on larger datasets, I'm certain OLAP could help you and provide better results than the alternatives.
Furthermore, you also need to work on the performance results and usability of such applications.
Regards,
Manish
  • asked a question related to Web Mining
Question
3 answers
I am looking for old web news, blogs and forums from between 1995 and 2007. Any suggestions, please?
Relevant answer
Answer
Hi Manal,
Look at the article "How 20 popular websites looked when they launched" at the link below.
Jan Grzegorek
  • asked a question related to Web Mining
Question
75 answers
Hello all, I am working on a project and want to download Twitter data. Using the Twitter API, I am able to download only 3 tweets. Is there a way to download at least 1,000 tweets?
Relevant answer
Answer
You can use:
  1. Twitter APIs:
  2. R packages, e.g. twitteR, RTwitterAPI
See also blog posts on Twitter data on R-bloggers: http://www.r-bloggers.com/search/twitter
  • asked a question related to Web Mining
Question
1 answer
I used GS with an image-processing function that calculates the symmetry of two images. The number of function executions increases dramatically when the size of the image is doubled. Does anyone have an explanation?
Relevant answer
Answer
When the size of your image doubles (say from 300×400 to 600×800), the number of pixels goes up by a factor of 4. Now, if your algorithm compares every pixel of one image with every pixel of the other, the number of comparisons (i.e., of function evaluations) goes up by a factor of 16.
This is for a straight all-pairs comparison. If your algorithm additionally compares, say, the neighboring 2% of the height and width, this factor will be even larger. If, on the other hand, you take a short-cut, such as using averaged pixel values, the factor will be somewhat smaller. It depends on the algorithm you adopt.
Hope this explains the complexity of your approach!
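The arithmetic above can be checked directly:

```python
# Doubling both image dimensions multiplies the pixel count by 4
# and all-pairs pixel comparisons by 16.
w, h = 300, 400
pixels_small = w * h
pixels_large = (2 * w) * (2 * h)
print(pixels_large // pixels_small)  # 4

comparisons_small = pixels_small ** 2  # every pixel vs. every pixel
comparisons_large = pixels_large ** 2
print(comparisons_large // comparisons_small)  # 16
```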
  • asked a question related to Web Mining
Question
11 answers
Rank aggregation algorithms, etc., for the recommendation process.
Relevant answer
Answer
Hi, you can assign a weight to each approach. Then simply add up the weighted scores from each approach, re-rank, and take the best solution. So first you need to specify an appropriate weight for each approach.
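A minimal sketch of that weighted-sum aggregation, where the approach names, weights and per-item scores are all made up for illustration:

```python
# Each approach scores each candidate item; scores are combined with
# per-approach weights, and the items are then re-ranked by combined score.
scores_by_approach = {
    "content":       {"itemA": 0.9, "itemB": 0.4, "itemC": 0.1},
    "collaborative": {"itemA": 0.2, "itemB": 0.8, "itemC": 0.5},
}
weights = {"content": 0.3, "collaborative": 0.7}

combined = {}
for approach, scores in scores_by_approach.items():
    for item, s in scores.items():
        combined[item] = combined.get(item, 0.0) + weights[approach] * s

ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)  # ['itemB', 'itemA', 'itemC'] -- best first
```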
  • asked a question related to Web Mining
Question
3 answers
How can I tell that a particular tweet is a rumor? I don't want to use any supervised knowledge to identify rumors.
Relevant answer
Answer
Dear Abhishek,
You can start by trying to understand the factors that give rise to rumors.
Some works that may interest you are:
Best Regards,
Andreas
  • asked a question related to Web Mining
Question
10 answers
In symbolic logic, you can translate proper sentences into logical constructs: propositional or first-order. This is very difficult to do with tweets because they are not written in proper English, at least for the most part. So I am designing an algorithm to convert tweets into logical constructs. If you have ideas or would like to collaborate, let me know. Thanks!
Relevant answer
Answer
That's a very good project, Ahmed. First you need to make sure what the tweet really means. That's the most difficult task. Once you know exactly what it means, it would be rather straightforward to put it in symbolic form.
But finding out what some tweets mean is beyond the capability of any algorithm, I think, because in many cases you have to rely on context to know the actual meaning, and an algorithm does not always have a means to know that. Maybe before you input the tweets to the machine you would need to supply the context, and so on. In short, you have to interpret the tweet first and then let the machine do the translation.
  • asked a question related to Web Mining
Question
3 answers
I have searched many papers on web mining with fuzzy logic, but I am not able to find the most recent ones. Can anyone please tell me how I can get the most recent papers?
Relevant answer
Answer
Dear Ameet,
Below you can find references for some very recent related papers:
Lin, C. W., & Hong, T. P. (2013). A survey of fuzzy web mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(3), 190-199.
Romero, C., Espejo, P. G., Zafra, A., Romero, J. R., & Ventura, S. (2013). Web usage mining for predicting final marks of students that use Moodle courses. Computer Applications in Engineering Education, 21(1), 135-146.
Matthews, S. G., Gongora, M. A., Hopgood, A. A., & Ahmadi, S. (2013). Web usage mining with evolutionary extraction of temporal fuzzy association rules. Knowledge-Based Systems, 54, 66-72.
Al-Rawi, S. S., Farhan, R. N., & Hajim, W. I. (2013). Enhancing Semantic Search Engine by Using Fuzzy Logic in Web Mining. Advances in Computing, 3(1), 1-10.
Liu, Y. (2014). Fuzzy-Clustering Web based on Mining. Journal of Multimedia, 9(1), 123-129.
Best regards,
Yaakov
  • asked a question related to Web Mining
Question
4 answers
Many years ago I read a paper on a hardware implementation of an information retrieval system. It was implemented as a circuit board, where the query would be set by putting jumpers on one side of the board and the result would be indicated by LEDs or the equivalent on another side of the board. The math behind it was very insightful, and I'd love to find it again, but I've been unable to. The paper was written (probably well) before 1975, perhaps even in the 1950's. I vaguely remember that the primary author's name began with an S but that's as far as I've gotten. (I'm not thinking of Vannevar Bush's Memex.)
Can anyone help?
Relevant answer
Answer
Dear Sir,
Check the attached file. Maybe it can help you.
Thanks
  • asked a question related to Web Mining
Question
23 answers
I would like to know which free online text mining tools I can use for user profiling.
Can user-profile mining be applied online?
Relevant answer
Answer
Dear friend,
I got a few more to share with you.
Here is the list:
Carrot2 – text and search results clustering framework.
GATE – General Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering
Gensim - large-scale topic modelling and extraction of semantic information from unstructured text (Python)
OpenNLP - natural language processing
Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
RapidMiner with its Text Processing Extension – data and text mining software.
Unstructured Information Management Architecture (UIMA) – a component framework to analyze unstructured content such as text, audio and video, originally developed by IBM.
The programming language R provides a framework for text mining applications in the package tm. The Natural Language Processing task view contains tm and other text mining library packages.
The KNIME Text Processing extension.
KH Coder - For content analysis, text mining or corpus linguistics.
The PLOS Text Mining Collection
  • asked a question related to Web Mining
Question
11 answers
Dear respected scientists and colleagues,
I am looking for a climate change corpus to do text analysis on. If you have one, or you know a journal whose abstracts I can download, I would be very much obliged.
Thanks,
-Dr.Hamed
Relevant answer
Answer
Thank you so much!
Email:
-Ahmed
  • asked a question related to Web Mining
Question
3 answers
We are working on the incremental timetabling problem. We found a timetable that satisfies the hard constraints and optimizes the soft ones. After accepting the timetable, new constraints appear.
Is there a solution that suggests the minimum change needed to obtain a new table with the incremental constraints, without destroying the old table, in order to minimize the disturbance to stakeholders?
Relevant answer
  • asked a question related to Web Mining
Question
5 answers
I am doing my research in web usage mining. I can't get the exact data sets. The World Cup '98 log files are present in the Internet Traffic Archive, but I don't know the file format needed to open them.
Relevant answer
Answer
What I did for my PhD/postdoc was to build a web app and then mine the data generated there. I suggest a similar approach, as it will bring you much more satisfaction than using open (freely available) data.
  • asked a question related to Web Mining
Question
5 answers
During any association mining process, it is a big challenge to remove uninteresting rules. We are interested in effective formal and experimental methods for determining the interestingness of multilevel rules.
Relevant answer
Answer
You could read this paper:
Maybe it helps you.
  • asked a question related to Web Mining
Question
7 answers
I am interested in using machine learning to recognize social interaction patterns such as disagreements, and potentially use those patterns to generate new simulated interactions. I've been working with crowd sourced descriptions of social interactions, but these are more narrative and less action driven.
Are you aware of publicly available datasets of annotated social interactions?
Types of data that might be good candidates are annotated movie scripts or forum threads. Skeletal/gesture data could also be interesting.
Relevant answer
Answer
I suppose it is no longer relevant, but as you mentioned forum threads, I believe this work uses social networks of forum-like environments annotated for agreement/disagreement:
The data sets should be available at:
  • asked a question related to Web Mining
Question
1 answer
I want to know the proper use of sentiwordnet by using wordnet.
Relevant answer
Answer
You should read about it in "SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining" and then "SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining". The first one explains how SWN was created; it is reasonable, well written and well explained. The latter is a paper about the 3.0 version of SWN, if you want to know the main differences.
  • asked a question related to Web Mining
Question
15 answers
Big Data and Data Science have continued to emerge among practitioners and researchers. But the foundation of these concepts involves large volumes and a variety of data created at high velocity. Hence, the focus has generally been on bigger organisations that generate such data. However, small and medium-sized organisations are also active adopters of ICT. Can Big Data and Data Science benefit small and medium enterprises as well, and how?
Relevant answer
Answer
Hi Kayode - I appreciate your question; it appears that there is much debate about 'Big Data' these days, but perhaps comparatively little analysis of its actual uses for Small to Medium Enterprises (SMEs).
I suppose the applicability of 'Big Data' to SMEs might best relate to the business fundamentals: which sets of strategic actions might actually make a business more adaptive, resilient, sustainable, in better service of its customer ecosystems, and ultimately more profitable in the long term.
What role might 'Big Data' play in this context?
Some obvious areas of compatibility might be semantic / sentiment analysis of the Social Cloud data, inferring 'what your customers are thinking about your SME brand' - as well as perhaps many other applications around hyper-localization.
Wouldn't it be nice if SMEs could very quickly and easily understand the emergent opportunities in their hyper-local contexts - what is it that their customers are actually looking / hoping / asking for, etc?
The analysis and utilization of Big Data is already starting to play a key role in this context; while extending these types of capabilities to the SMEs might correspondingly yield significant benefits.
  • asked a question related to Web Mining
Question
2 answers
I need to extract information from HTML web pages with distinct templates.
Relevant answer
Answer
It depends on whether you want a hierarchy of clusters or a simple partitioning. If you want hierarchical clustering, then I suggest using the complete-linkage and single-linkage algorithms. These algorithms are easy to understand and implement, and have shown promise in many domains.
For partitioning, K-Means will be the better choice due to its simplicity, ease of implementation and good accuracy.
You will also need a measure to calculate the association between any two pages. You may use the Jaccard or Euclidean distance.
Another technique, latent semantic analysis, may show better performance, as web pages resemble documents.
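The Jaccard measure mentioned above is simple enough to sketch directly; the page term lists below are made up for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets:
    |intersection| / |union|, in the range [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# hypothetical term sets extracted from three pages
page1 = "price product cart checkout".split()
page2 = "price product review rating".split()
page3 = "login password account settings".split()

print(round(jaccard(page1, page2), 2))  # pages sharing template terms
print(jaccard(page1, page3))            # disjoint vocabularies -> 0.0
```

The resulting pairwise similarities (or 1 - similarity as a distance) can then be fed into a linkage or K-Means-style clustering step.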
  • asked a question related to Web Mining
Question
3 answers
Web mining research areas.
Relevant answer
Answer
Web Data Mining can be categorized into: Web Content Mining, Web Structure Mining, and Web Usage Mining.
Web semantics is one of the recent areas in which many people are working, along with web analytics.
  • asked a question related to Web Mining
Question
5 answers
I am working on web log mining processes. I need a tool that performs pre-processing (data cleaning, user identification, session identification) of server log files.
Relevant answer
Answer
Which variables do you have in your log file?
Please paste a sample in your post. I think I can help you.
I've got a few tools for such a task.
  • asked a question related to Web Mining
Question
1 answer
In the PageRank algorithm, is it necessary that a page be connected directly to every other page when using a damping factor?
Relevant answer
Answer
The pages do not need to be all connected.
In the PageRank algorithm, the popularity score of a page depends on the popularity of the pages that point to it through hyperlinks (i.e. directed edges).
By extension, the popularity of a page A depends on the popularity of all pages B that can lead to it through a directed path, even if A and B are not directly connected.
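A minimal power-iteration sketch (a simplified illustration, not any particular library's implementation) shows rank flowing along a directed path even without direct links; the three-page graph is hypothetical:

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration on a link graph {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # damping: teleport share
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
            else:  # dangling page: spread its rank over all pages
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# A never links to C directly, yet A's rank reaches C via B
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})
print(abs(sum(ranks.values()) - 1.0) < 1e-9)  # ranks form a distribution
```

With the damping factor d, every page always receives the (1 - d)/n teleport share, so the scores stay well defined even on sparsely connected graphs.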
  • asked a question related to Web Mining
Question
2 answers
Currently I am working on web log mining techniques. I want to choose a clustering technique, but clustering has already been used extensively in web log mining.
  • asked a question related to Web Mining
Question
2 answers
How do I generate a document-term matrix from different types of web pages? Is there any source available that provides CSV files for document-term matrix generation?
Relevant answer
Answer
I'd recommend using the sklearn library for Python; its text feature extraction tools are useful for transformations like this: http://scikit-learn.org/stable/modules/feature_extraction.html (look under text feature extraction). I've used this library a lot and it has not let me down yet.
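Before reaching for sklearn, the underlying idea of a document-term matrix can be sketched in plain Python (the helper name and sample documents are made up):

```python
from collections import Counter

def doc_term_matrix(docs):
    """Build a count-based document-term matrix.
    Returns (vocabulary, rows), one row of term counts per document."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([counts.get(t, 0) for t in vocab])
    return vocab, rows

docs = ["web mining is fun", "web usage mining"]
vocab, matrix = doc_term_matrix(docs)
print(vocab)   # ['fun', 'is', 'mining', 'usage', 'web']
print(matrix)  # [[1, 1, 1, 0, 1], [0, 0, 1, 1, 1]]
```

The rows can be written out with the standard csv module if a CSV file is needed; sklearn's CountVectorizer does the same job with proper tokenization and scaling options.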
  • asked a question related to Web Mining
Question
6 answers
I want to implement PageRank and various improved PageRank algorithms on graph data, but I am unable to find a simulator or real implementation of a PageRank algorithm.
Relevant answer
Answer
If you are interested in small to medium scale graphs that evolve in time (e.g. according to local characteristics of nodes and some rules), then a multiagent simulation toolkit may be useful. I can recommend two of them for such a purpose:
- Repast Simphony (or Repast for HPC) - http://repast.sourceforge.net/
None of these tools will work for really big datasets.
Regards,
Radek
  • asked a question related to Web Mining
Question
4 answers
Many web pages aim to publish the latest news and spread summaries of news using feeds or other related technologies. So once you have the summary, the title and the link to the main page, the next step is retrieving the associated text, knowing that the web page contains non-relevant information such as banners, news ratings, advertising, etc. What are the best tools to achieve this goal of extracting the associated text of a news item, given the title, summary and web link?
Relevant answer
Answer
My suggestion would be to:
1) Segregate all links and images from the main text. The objective in this step is to isolate the main text. Depending on the source, the main body will be easy to identify (take for instance the difference between a CNN news article and a Bloomberg one: Bloomberg content is more difficult to determine since most articles have multiple sections that may differ in relevance).
2) Take the body of the news and determine word relevance.
3) Match the words found in images and links against the body's relevance index. If a threshold is reached, then segregate them as relevant.
Step 3 will vary from site to site, and so will the tweaking of the algorithm to determine relevance. The same can be done with all other items you mentioned.
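The threshold matching in step 3 can be sketched roughly as follows (the helper name, the body text and the anchor texts are all hypothetical):

```python
from collections import Counter

def relevant_links(body_text, link_texts, threshold=1):
    """Keep only links whose anchor text overlaps the article body
    by at least `threshold` words (a crude relevance index)."""
    body_words = Counter(body_text.lower().split())
    kept = []
    for anchor in link_texts:
        overlap = sum(body_words[w] > 0 for w in set(anchor.lower().split()))
        if overlap >= threshold:
            kept.append(anchor)
    return kept

body = "central bank raises interest rates amid inflation concerns"
links = ["bank rates explained", "celebrity gossip weekly", "inflation tracker"]
print(relevant_links(body, links))
# ['bank rates explained', 'inflation tracker']
```

A real relevance index would weight words (e.g. by TF-IDF) instead of counting raw overlap, and the threshold would be tuned per site, as the answer notes.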
Hope this helps
  • asked a question related to Web Mining
Question
1 answer
I am doing a project involving the Google search engine. However, I do not know how to export the results from Google and then store them in a text file or a database.
Relevant answer
Answer
In your browser, Ctrl-S will let you save the results page to an HTML file.
Beforehand, you may wish to go to the advanced settings and customise the search settings to give you 100 results per page.
  • asked a question related to Web Mining
Question
5 answers
I want to arrange documents for automatic e-learning that suggests new topics to students.
Relevant answer
Answer
Yes, this has been done already. We are finishing a paper on the subject.
1. To operationalise the idea of an educational topic, we homed in on a headword in the glossary.
2. We prepared an exemplary glossary. We checked that the headwords in the glossary were in the text. We ensured that the definitions employed headwords in preference to expanded definitions.
3. We formed an ontology from the exemplary glossary. We used this to further refine the glossary, by e.g., considering the deletion of islanded headwords that were not referenced by or pointing to other headwords.
4. We used the ontology to draw a concept map for the course, which we slowly refined. The map let us identify the foundation headwords and identify the capstone headwords. It also showed us where long and short chains of dependency (threads and bushes) existed. In particular, the map showed us the headword hierarchy: which headwords depended on other headwords being defined beforehand.
5. Our prototype (currently being tested) recommends a reading order for the student, based on presenting first the foundation headwords, then the superior headwords and finally the capstone concepts.
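The reading order in step 5 amounts to a topological sort of the headword dependency graph: foundations first, capstones last. A minimal sketch (the glossary below is hypothetical, and it assumes every dependency is itself a glossary headword):

```python
from collections import deque

def reading_order(depends_on):
    """Kahn's topological sort: each headword appears only after
    every headword its definition depends on."""
    indeg = {w: len(deps) for w, deps in depends_on.items()}
    dependents = {w: [] for w in depends_on}
    for w, deps in depends_on.items():
        for d in deps:
            dependents[d].append(w)
    queue = deque(sorted(w for w, k in indeg.items() if k == 0))
    order = []
    while queue:
        w = queue.popleft()
        order.append(w)
        for nxt in sorted(dependents[w]):  # sorted for deterministic output
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return order

glossary = {"set": [], "relation": ["set"], "function": ["relation", "set"]}
print(reading_order(glossary))  # ['set', 'relation', 'function']
```

Islanded headwords (step 3) show up here as nodes with no dependencies and no dependents, which is one way to flag them for deletion.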
  • asked a question related to Web Mining
Question
1 answer
Many authors use perplexity/entropy to validate their model, but I'm not fully satisfied with this. Some authors use topic coherence (pointwise mutual information) instead. Can anyone suggest the most accurate method to test a topic model?
Relevant answer
Answer
Here are some suggested ways to validate a topic model:
1. Use the Generalized Fowlkes-Mallows Index (requires computing the probability of intersecting events).
2. Partial class match precision (measures the probability of randomly selecting two documents from the same class taken from a randomly sampled cluster).
3. Clustering recall (probability that a relevant document is retrieved).
4. Single metric performance
For more about this, see the attached pdf file.
  • asked a question related to Web Mining
Question
3 answers
To analyze customer behaviour and customer segmentation in telecommunications.
Relevant answer
Answer
Certainly, I found it a fascinating and rewarding field in which to work.
You can try, but the telco will have to spend time and resources on anonymising the data, so I really doubt it.