Questions related to Web Mining
I have several ongoing research projects on adaptive web mining techniques and online social network analysis with applications.
Collaboration, including funding support for presenting research outputs at top conferences, workshops, and international journals, is warmly invited.
Please feel free to contact me at email@example.com.
We have a dataset collected from multiple users and would like to measure similarity and distance between users in order to build user profiles. Currently we are using common clustering approaches such as k-means, hierarchical clustering, and GMMs, but we would like to hear from other active researchers whether there are useful techniques we haven't considered.
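Beyond centroid- and mixture-based methods, density- and graph-based clustering can reveal user groups that are not spherical. A minimal sketch with scikit-learn, assuming a numeric user-feature matrix (the data below is synthetic and stands in for real user features):

```python
# Sketch: DBSCAN (no preset cluster count) and spectral clustering
# as alternatives to k-means/GMM for user profiling.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 4)),   # one synthetic user group
               rng.normal(3, 0.3, (50, 4))])  # a second, well-separated group

X = StandardScaler().fit_transform(X)

# DBSCAN finds the number of clusters itself; eps/min_samples need tuning.
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# Spectral clustering works on a similarity graph (here an RBF kernel).
sp_labels = SpectralClustering(n_clusters=2, affinity="rbf",
                               random_state=0).fit_predict(X)

print("DBSCAN clusters:", len(set(db_labels) - {-1}))
print("silhouette (spectral):", round(silhouette_score(X, sp_labels), 2))
```

A silhouette or similar internal index gives a way to compare these against the k-means/GMM results already in use.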
I am interested in finding out the update frequency of several websites from a central point, instead of scanning every page and link for dates or contacting the owners. I have tried the Wayback Machine, but I'm not sure its crawling information reflects actual updates.
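One cheap, central option is to read the HTTP Last-Modified header with a HEAD request, which avoids crawling page content. This is a sketch only: many dynamic sites omit the header, so a missing value must be treated as "unknown", and the URL in `check_site` is a placeholder:

```python
# Sketch: read a site's Last-Modified header via a HEAD request (stdlib only).
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

def last_modified_from_headers(headers):
    """Return a datetime parsed from a Last-Modified header dict, or None."""
    value = headers.get("Last-Modified")
    return parsedate_to_datetime(value) if value else None

def check_site(url):
    # HEAD fetches headers only, not the page body.
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=10) as resp:
        return last_modified_from_headers(dict(resp.headers))

print(last_modified_from_headers(
    {"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}))
```

Sitemaps (`sitemap.xml` with `<lastmod>` entries) and RSS feeds are other low-cost signals when the header is absent.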
We have witnessed the power of a regular search engine like Google, and there are semantic search engines such as Swoogle. However, we are trying to build a semantic search engine with a more user-friendly display and a more relevant ranking algorithm. Can anybody suggest ideas?
1) I need to extract movie genres from DBpedia with SPARQL; can anyone provide links or materials on this?
2) I want a user–feature matrix with genres as features (18) and users as observations (6,040). I need the procedure to get this done. Relevant links and documents would be appreciated.
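A sketch of both steps. The SPARQL query is meant for the public DBpedia endpoint (run it with e.g. the SPARQLWrapper package); note that `dbo:genre` is an assumption, as many films carry genre information only through `dct:subject` categories. The matrix builder uses invented toy data in the MovieLens style (user, movie) ratings plus a movie-to-genres map:

```python
# Part 1: a SPARQL query for film genres (to run against https://dbpedia.org/sparql).
# Part 2: turning ratings + genre lookups into a users x genres count matrix.
import numpy as np

GENRE_QUERY = """
SELECT ?film ?genre WHERE {
  ?film a dbo:Film ;
        dbo:genre ?genre .
} LIMIT 100
"""

def user_genre_matrix(ratings, movie_genres, n_users, genres):
    """ratings: iterable of (user_id, movie_id); returns genre counts per user."""
    idx = {g: j for j, g in enumerate(genres)}
    M = np.zeros((n_users, len(genres)))
    for user, movie in ratings:
        for g in movie_genres.get(movie, []):
            M[user, idx[g]] += 1
    return M

genres = ["Action", "Comedy", "Drama"]          # in practice: all 18 genres
movie_genres = {0: ["Action"], 1: ["Comedy", "Drama"]}
ratings = [(0, 0), (0, 1), (1, 1)]
M = user_genre_matrix(ratings, movie_genres, n_users=2, genres=genres)
print(M)
```

With the full MovieLens-1M data the same function yields the 6,040 × 18 matrix; rows can be normalized to proportions if counts are not wanted.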
I'm brand new to social network analysis. I'm trying to identify meme creators on Twitter. Is there a way to do this using data downloaded from Twitter?
I am looking for multiple documents drawn from the same domain. I would like to aggregate information from multiple documents for summarization.
I have engineering data where I need to classify Event vs. Non-Event based on operational parameters. My Event class makes up about 1% of the data and the Non-Event class 99%. I read an article about oversampling and undersampling, but in my case these methods don't work: the Event class strongly resembles the Non-Event class, because the data come from sensors sampled very frequently.
How can I classify Event vs. Non-Event in such an imbalanced classification problem?
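When resampling fails on highly autocorrelated sensor data, cost-sensitive learning is a common alternative: weight errors on the rare Event class more heavily instead of duplicating or dropping rows. A sketch with scikit-learn, using synthetic data that stands in for the real operational parameters:

```python
# Sketch: cost-sensitive classification of a ~1% minority class
# via class_weight='balanced' (errors reweighted inversely to class frequency).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 0.01).astype(int)   # ~1% Events
X[y == 1] += 2.0                         # shift Events so they are learnable

clf = LogisticRegression(class_weight="balanced").fit(X, y)
pred = clf.predict(X)
print("Event recall:", round(recall_score(y, pred), 2))
```

For real sensor streams, evaluate with a time-ordered split (train on earlier data, test on later) rather than a random one, and report precision/recall or PR-AUC, since accuracy is meaningless at 99:1.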
I want to use big data clustering algorithms in my PhD work, but I don't know which topic is appropriate to apply them to. I mean, what is a good application of big data clustering algorithms? If you can help me, I will be grateful.
I want to fetch online news from different sources, from today back to one month ago. How can I download those news articles? Is there any news API available in Python for Hindi news sources such as AajTak, Dainik Jagran, Dainik Bhaskar, etc.?
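Many outlets expose RSS feeds that can be fetched with a feed reader such as the third-party `feedparser` package (the feed URLs for specific Hindi outlets would need to be found on each site; none are assumed here). One caveat is that most feeds only go back days, not a month, so a month of history usually requires ongoing collection or an archive/API. A small helper for the date-window step, on already-parsed entries:

```python
# Sketch: filter parsed feed entries to a date window
# (feed fetching itself would be done with e.g. feedparser).
from datetime import datetime, timedelta

def entries_in_window(entries, start, end):
    """entries: list of dicts with a 'published' datetime; keep those in [start, end]."""
    return [e for e in entries if start <= e["published"] <= end]

end = datetime(2024, 1, 31)
start = end - timedelta(days=30)
entries = [
    {"title": "story A", "published": datetime(2024, 1, 15)},
    {"title": "story B", "published": datetime(2023, 11, 1)},
]
recent = entries_in_window(entries, start, end)
print([e["title"] for e in recent])
```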
I am building an analysis tool and would like to see how it behaves with real-world data. Other traces, such as car GPS tracks, or even check-in data (e.g. when you use a bus card to pay for a ride), might also help.
I am already planning to use Twitter data collected from geolocated tweets.
At this point, we do not wish to perform blind text mining; however, we may wish to give some indication of the type of text content we are interested in.
The content comes from existing technology manuals and blogs, written both by paid technology writers and by external contributors. We intend to build a FAQ corpus out of these. Thanks.
I'm looking for a dataset with (obviously) features and genre tags for every song. I already have the subset retrievable from the official website (http://labrosa.ee.columbia.edu/millionsong/lastfm), but it seems the entire set went missing and it's hard to locate a source.
The Google group dedicated to the dataset is inactive and closed.
I wonder if someone is working on the same data or can point me to a similar dataset.
Many thanks in advance.
I know of the dated (no longer updated) "A Comparison of Open Source Search Engines" by Christian Middleton and Ricardo Baeza-Yates, but it does not cover the newer open-source libraries.
Is there a library faster than Lucene for information retrieval at the moment?
Also, what term-weighting schemes does the Lucene package support?
I am trying to find patterns among a website's visitors. Although I could extract all the data I want, I see the convenience of working with sample data.
The first step is setting the date range; considering that a website is a dynamic environment, a wide period may be misleading.
So, for this type of data, what date range would be appropriate? (I normally take 1 to 3 months.) Once the time frame is selected, what sampling methods should I use to ensure sample representativeness?
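Because traffic mixes drift over time, stratified sampling by time period (and, if available, by traffic source) protects representativeness better than a simple random draw over the whole window. A sketch with pandas on a synthetic visit log:

```python
# Sketch: stratified sampling of web visits by ISO week,
# taking the same fraction from every week.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
visits = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=1000, freq="h"),
    "pages": rng.integers(1, 10, 1000),
})
visits["week"] = visits["ts"].dt.isocalendar().week

# 10% from each week, so no period is over- or under-represented.
sample = visits.groupby("week", group_keys=False).sample(frac=0.1, random_state=0)
print(len(sample), "of", len(visits))
```

The same pattern works with traffic source or device type as the stratification key.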
My thesis is about the analysis and automatic generation of FAQ lists in different domains. To conduct experiments, I need a large volume of FAQs. That is why I am looking for a publicly available dataset containing FAQs in various domains (or even one specific domain).
I'd like to mine web pages to build a dataset of pages taken from a particular website (e.g. a news site). It would target articles not only from one section but from all sections of the site (for instance, politics, tech, etc. on CNN.com), combining every article published over a three-year window. What tools and techniques can I use to do this?
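A common recipe is: (1) enumerate section index or archive pages, (2) harvest article links from each, (3) deduplicate, then fetch. Frameworks such as Scrapy automate the whole loop; the stdlib-only sketch below shows just the link-harvesting step on an embedded sample page (the site name and the date-based URL pattern are illustrative assumptions, not CNN's actual layout):

```python
# Sketch: harvest article links matching a URL pattern from a section page.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkHarvester(HTMLParser):
    def __init__(self, base, pattern):
        super().__init__()
        self.base, self.pattern, self.links = base, pattern, set()
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            url = urljoin(self.base, href)
            if self.pattern in url:      # keep only article-like URLs
                self.links.add(url)

SECTION_HTML = """
<html><body>
  <a href="/2023/politics/story-one.html">Story one</a>
  <a href="/about">About us</a>
  <a href="/2022/tech/story-two.html">Story two</a>
</body></html>
"""

h = LinkHarvester("https://example-news.com", pattern="/20")
h.feed(SECTION_HTML)
print(sorted(h.links))
```

For a three-year backlog, the site's own archive pages or sitemap files are usually a more complete entry point than the live section fronts.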
I'm looking for suggestions of research that analyzes large data volumes with the aim of discovering association rules or patterns that enhance and further improve the teaching and learning process for students.
I am looking for a classification method to build a binary classifier for web documents, i.e. a classifier that predicts whether a document belongs to the domain of interest or not. A domain here is a broad category, e.g. science. I am wondering whether there is any work in the neural network community on doing this efficiently with ~10K labelled web pages (labels 0/1) as training data.
A simple language-model-based approach has not proved useful so far. Would an NN-based model make more sense for this task?
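With ~10K labelled pages, a TF-IDF representation feeding a small feed-forward network is a reasonable first NN baseline before trying heavier architectures. A sketch with scikit-learn on a tiny invented corpus (real web pages would be stripped of HTML first):

```python
# Sketch: TF-IDF + small MLP as a binary in-domain/out-of-domain classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "quantum physics experiment results measurement",
    "genome sequencing laboratory study of cells",
    "celebrity gossip fashion week red carpet",
    "football match final score and transfer news",
] * 10                         # repeated so the model has something to fit
labels = [1, 1, 0, 0] * 10     # 1 = in-domain (science), 0 = out-of-domain

clf = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
clf.fit(docs, labels)
print(clf.predict(["a new physics measurement of the cell genome"]))
```

Comparing this pipeline against a plain linear classifier on held-out pages shows quickly whether the nonlinearity buys anything at this data size.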
I am doing research on ambiguity and disambiguation in information science: representation systems, sociocognitive approaches, and new infocommunication paradigms. Do you have any advice, please?
Can someone help me find information or papers about how Facebook uses big data, including Hadoop?
I have tried searching but can't find anything. Any help would be appreciated.
I'm looking for publicly available annotated data for microblogs or tweets in Arabic or English, covering the entity types Person, Organization, and Place (Location).
I've been looking through the community's papers, and there appears to be no such data publicly available.
I am using a Tesla K20. I got an error saying that shared memory is limited to 16 KB, although the K20 supports up to 48 KB per block. How do I configure the GPU and the NVCC compiler to use 48 KB of shared memory instead of 16 KB?
I am trying to reproduce experimental results from
 G. Guo, G. Mu, Y. Fu, and T. S. Huang, “Human age estimation using bio-inspired features,” 2009 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work. CVPR Work. 2009, pp. 112–119, 2009.
They use the Yamaha Gender and Age (YGA) database as well as the FG-NET aging database, and it seems that all of the links to their supposed locations are down.
The link for the FG-NET aging database should be the following
and for the YGA database I cannot even find a reference.
Do you know of other databases used for age estimation that include both age and gender labels?
Thank you very much for your time.
I am in search of a good standard dataset for music recommendation systems, consisting of music items with associated places and tags.
I want to download random tweets from Twitter for a specific time period (two years, 2011–2013). I have tried using the statuses/sample API, but couldn't specify the time period.
At the moment, I was able to find these papers:
1. Prototype a Knowledge Discovery Infrastructure by Implementing Relational Grid Monitoring Architecture (R-GMA) on European Data Grid (EDG) by Frank Wang, Na Helian, Yike Guo, Steve Thompson, John Gordon.
2. Knowledge grid-based problem-solving platform by Lu Zhen, Zuhua Jiang, Jun Liang.
Thank you in advance for any help.
I used GS with an image-processing function that calculates the symmetry of two images. The number of function executions increases dramatically when the size of the image is doubled. Does anyone have an explanation?
In symbolic logic, you can translate well-formed sentences into logical constructs, propositional or first-order. This is very difficult to do with tweets, because for the most part they are not written in proper English. So I am designing an algorithm to convert tweets into logical constructs. If you have ideas or would like to collaborate, let me know. Thanks!
I want to know about real case studies of privacy threats caused by association rule mining (on distributed or centralized databases).
I have searched through many papers on web mining with fuzzy logic, but I am not able to find the most recent ones. Can anyone tell me how I can find the most recent papers?
Many years ago I read a paper on a hardware implementation of an information retrieval system. It was implemented as a circuit board, where the query would be set by putting jumpers on one side of the board and the result would be indicated by LEDs or the equivalent on another side of the board. The math behind it was very insightful, and I'd love to find it again, but I've been unable to. The paper was written (probably well) before 1975, perhaps even in the 1950's. I vaguely remember that the primary author's name began with an S but that's as far as I've gotten. (I'm not thinking of Vannevar Bush's Memex.)
Can anyone help?
Dear respected scientists and colleagues,
I am looking for a climate change corpus to do text analysis on. If you have one, or know of a journal from which I can download abstracts, I would be much obliged.
We are working on the incremental timetabling problem. We have found a timetable that satisfies the hard constraints and optimizes the soft ones. After the timetable is accepted, new constraints appear.
Is there an approach that suggests the minimum change needed to produce a new timetable satisfying the additional constraints, without discarding the old one, so as to minimize the disturbance to stakeholders?
I am doing my research in web usage mining, but I can't obtain the exact datasets. The World Cup '98 log files are available in the Internet Traffic Archive, but I don't know their file format, so I cannot open them.
In any association rule mining process, removing uninteresting rules is a big challenge. We are interested in effective formal and experimental methods for determining the interestingness of multilevel rules.
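A standard experimental filter is to compute objective interestingness measures per rule and prune rules that sit close to statistical independence (lift near 1, leverage near 0). A sketch of the usual measures for a rule A → B, with invented co-occurrence counts:

```python
# Sketch: objective interestingness measures for a rule A -> B
# from transaction counts (n total, n_a with A, n_b with B, n_ab with both).
def measures(n, n_a, n_b, n_ab):
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    confidence = p_ab / p_a
    lift = p_ab / (p_a * p_b)           # 1.0 means independence
    leverage = p_ab - p_a * p_b         # 0.0 means independence
    conviction = (1 - p_b) / (1 - confidence) if confidence < 1 else float("inf")
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

m = measures(n=1000, n_a=100, n_b=200, n_ab=80)
print({k: round(v, 2) for k, v in m.items()})
```

For multilevel rules, the same measures can additionally be compared against the rule's ancestors in the taxonomy, pruning descendants whose lift is not meaningfully higher than the parent rule's.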
I am interested in using machine learning to recognize social interaction patterns, such as disagreements, and potentially using those patterns to generate new simulated interactions. I've been working with crowd-sourced descriptions of social interactions, but these are more narrative and less action-driven.
Are you aware of publicly available datasets of annotated social interactions?
Types of data that might be good candidates are annotated movie scripts or forum threads. Skeletal/gesture data could also be interesting.
Big Data and Data Science have continued to gain ground among practitioners and researchers. The foundation of these concepts involves large volumes and a variety of data created at high velocity, so the focus has generally been on bigger organisations that generate such data. However, small and medium-sized organisations are also active adopters of ICT. Can Big Data and Data Science benefit small and medium enterprises as well, and how?
I am working on web log mining processes. I need a tool that performs preprocessing (data cleaning, user identification, session identification) of a server log file.
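The session-identification step is small enough to sketch directly: after cleaning (dropping image/bot requests) and identifying users (typically by IP plus user agent), sessions are cut at a fixed inactivity timeout, 30 minutes being the common convention. The events below are invented:

```python
# Sketch: sessionization of (user, timestamp) events with a 30-minute timeout.
from datetime import datetime, timedelta

def sessionize(events, timeout=timedelta(minutes=30)):
    """events: list of (user, datetime), pre-sorted by time.
    Returns {user: [[timestamp, ...], ...]} split on inactivity gaps."""
    sessions = {}
    for user, ts in events:
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= timeout:
            user_sessions[-1].append(ts)   # continue the current session
        else:
            user_sessions.append([ts])     # gap too long: start a new session
    return sessions

t0 = datetime(2024, 1, 1, 10, 0)
events = [("u1", t0),
          ("u1", t0 + timedelta(minutes=10)),
          ("u1", t0 + timedelta(hours=2))]   # gap > 30 min -> new session
print({u: len(s) for u, s in sessionize(events).items()})
```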
Currently I am working on web log mining techniques. I want to choose clustering techniques, but clustering has already been used extensively in web log mining.
How can I generate a document-term matrix from different types of web pages? Is there any source available that provides CSV files for document-term matrix generation?
I want to implement PageRank and various improved PageRank algorithms on graph data, but I am unable to find a simulator or a real implementation of a PageRank algorithm.
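PageRank is only a few lines of power iteration, so a reference implementation is easy to write (networkx also ships one as `nx.pagerank`). A sketch on a toy four-node graph:

```python
# Sketch: PageRank by power iteration on an adjacency matrix.
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10):
    """adj[i][j] = 1 if page i links to page j. Returns the PageRank vector."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    A[A.sum(axis=1) == 0] = 1.0            # dangling nodes link everywhere
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - d) / n + d * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

adj = [[0, 1, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0]]
r = pagerank(adj)
print(np.round(r, 3))
```

Improved variants (topic-sensitive, weighted, personalized) mostly change the teleport term `(1 - d) / n` or the transition matrix `P`, so this loop doubles as a test bed for them.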
Many web pages publish the latest news and spread summaries via feeds and related technologies. Once you have the summary, the title, and the link to the main page, the next step is retrieving the associated text, given that the page also contains irrelevant content such as banners, news ratings, advertising, etc. What are the best tools for extracting the article text associated with a given title, summary, and link?
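Libraries such as newspaper3k, trafilatura, and readability-lxml are built for exactly this. The heuristic underneath many of them, keep the paragraph-dense content and drop navigation/ad blocks, can be sketched with the stdlib alone; the embedded page below is a stand-in for a fetched article:

```python
# Sketch: extract article text by keeping only <p> content,
# which skips banners, ads, and navigation markup.
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect text found inside <p> tags, ignoring everything else."""
    def __init__(self):
        super().__init__()
        self.in_p, self.chunks = False, []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p and data.strip():
            self.chunks.append(data.strip())

PAGE = """
<html><body>
  <div class="banner">Subscribe now!</div>
  <p>First paragraph of the actual news story.</p>
  <p>Second paragraph with the details.</p>
  <div class="ads">Buy things</div>
</body></html>
"""

p = ParagraphText()
p.feed(PAGE)
article = " ".join(p.chunks)
print(article)
```

Since the title and summary are already known from the feed, they can also be used to validate that the extracted text is the right block (e.g. it should contain most of the summary's terms).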
I am doing a project involving the Google search engine. However, I do not know how to export the results from Google and store them in a text file or a database.
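Scraping Google's result pages violates its terms of service; the supported route is the Custom Search JSON API, which needs an API key and a search-engine id ("cx"), both placeholders below. A sketch where the network fetch is shown but left commented out, and a helper flattens the JSON results into lines for a text file:

```python
# Sketch: query the Google Custom Search JSON API and flatten the results.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def search_url(query, key, cx):
    params = urlencode({"key": key, "cx": cx, "q": query})
    return "https://www.googleapis.com/customsearch/v1?" + params

def results_to_lines(payload):
    """payload: parsed JSON from the API; one tab-separated line per hit."""
    return ["\t".join([it["title"], it["link"]])
            for it in payload.get("items", [])]

# With real credentials:
#   payload = json.load(urlopen(search_url("web mining", KEY, CX)))
sample = {"items": [{"title": "Web mining", "link": "https://example.org/wm"}]}
lines = results_to_lines(sample)
print(lines)
```

Writing `lines` with `"\n".join(lines)` gives the text file; the same dicts map directly onto a database table with title/link columns.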