Computational Data Mining - Science topic
Explore the latest questions and answers in Computational Data Mining, and find Computational Data Mining experts.
Questions related to Computational Data Mining
Hi There!
My data has a number of features (which contain continuous data) and a response feature (class label) of categorical, binary data. My intention is to study the variation of the response feature (class) due to all the other features, using a variety of feature selection techniques. Kindly help in pointing out the right techniques for this purpose. The data looks like this:
------------------------------------------------------------------
f1 f2 f3 f4 ... fn class
------------------------------------------------------------------
0.2 0.3 0.87 0.6 ... 0.7 0
0.2 0.3 0.87 0.6 ... 0.7 1
0.2 0.3 0.87 0.6 ... 0.7 0
0.2 0.3 0.87 0.6 ... 0.7 1
-------------------------------------------------------------------
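For illustration, a minimal scikit-learn sketch of two common filter methods for exactly this setup (continuous features, binary class); the synthetic data stands in for f1..fn:

```python
# Sketch: filter-style feature selection for continuous features vs. a binary class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# ANOVA F-test: scores the linear separation of each feature by class
f_scores = SelectKBest(f_classif, k=5).fit(X, y).scores_
# Mutual information: also captures non-linear dependence on the class
mi_scores = mutual_info_classif(X, y, random_state=0)

print("Top 5 by F-test:", np.argsort(f_scores)[::-1][:5])
print("Top 5 by MI:    ", np.argsort(mi_scores)[::-1][:5])
```

Wrapper methods (recursive feature elimination) and embedded ones (L1-regularized logistic regression, tree importances) are the usual next steps to compare against.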
I think that Generative Adversarial Networks can be used as a means of data farming. What do you know about such an approach? Can you give another example of a means for data farming?
I'm quite new to GMDH, and based on my first reading on this technique I feel I want to know more. Here are some of the claimed benefits of the GMDH approach:
1. The optimal complexity of the model structure is found, adequate to the level of noise in the data sample. For real problems with noisy or short data, simplified forecasting models are more accurate.
2. The number of layers and neurons in hidden layers, the model structure, and other optimal network parameters are determined automatically.
3. It guarantees that the most accurate or unbiased models will be found - the method does not miss the best solution while sorting all variants (in the given class of functions).
4. Any non-linear functions or features that may influence the output variable can be used as input variables.
5. It automatically finds interpretable relationships in data and selects effective input variables.
6. GMDH sorting algorithms are rather simple to program.
7. TMNN (twice-multilayered) neural nets are used to increase the accuracy of other modelling algorithms.
8. The method uses information directly from the data sample and minimizes the influence of a priori author assumptions about the results of modelling.
9. The approach makes it possible to find an unbiased physical model of the object (law or clusterization) - one and the same for future samples.
It seems that items 1, 2, 6, and 7 are really interesting and could be extended to ANNs.
Any suggestions or experience from others?
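The core GMDH loop behind points 1, 2, and 6 is short enough to sketch. The version below is a minimal sketch assuming two-input quadratic (Ivakhnenko) neurons and a hold-out set as the external criterion; it returns only the selected network's validation error, whereas a full implementation would also retain the winning neurons for prediction:

```python
import numpy as np
from itertools import combinations

def quad(a, b):
    # Ivakhnenko polynomial basis for a pair of inputs
    return np.column_stack([np.ones_like(a), a, b, a * b, a * a, b * b])

def gmdh(X_tr, y_tr, X_va, y_va, keep=6, max_layers=5):
    Z_tr, Z_va, best_err = X_tr, X_va, np.inf
    for _ in range(max_layers):
        cands = []
        for i, j in combinations(range(Z_tr.shape[1]), 2):
            coef, *_ = np.linalg.lstsq(quad(Z_tr[:, i], Z_tr[:, j]), y_tr, rcond=None)
            err = np.mean((quad(Z_va[:, i], Z_va[:, j]) @ coef - y_va) ** 2)
            cands.append((err, i, j, coef))
        cands.sort(key=lambda c: c[0])          # external criterion: hold-out MSE
        if cands[0][0] >= best_err:             # noise-adequate complexity reached
            break
        best_err = cands[0][0]
        top = cands[:keep]                      # surviving neurons feed the next layer
        Z_tr = np.column_stack([quad(Z_tr[:, i], Z_tr[:, j]) @ c for _, i, j, c in top])
        Z_va = np.column_stack([quad(Z_va[:, i], Z_va[:, j]) @ c for _, i, j, c in top])
    return best_err

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)
print(gmdh(X[:100], y[:100], X[100:], y[100:]))
```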
I would like to dive into the research domain of explainable AI. What are some of the recent trending methodologies in this domain? What would be a good starting point for diving into this field?
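Trending families include post-hoc attribution (SHAP, LIME, integrated gradients), counterfactual explanations, and inherently interpretable models. A gentle hands-on start is model-agnostic permutation importance, sketched here with scikit-learn:

```python
# Sketch: permutation feature importance - shuffle one feature at a time
# and measure the drop in held-out accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(imp.importances_mean.argsort()[::-1][:5])   # five most influential features
```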
Hi
I am trying to segment a Sentinel-2 image.
At this stage, I want to run a binary classifier that assigns each pixel to either a farm or non-farm class. For this purpose, I have four 10 m bands: R/G/B/NIR. I have also generated an NDVI raster for each month (8 months in total) with values ranging from -1 to 1 (it can be normalized to 0-255).
I am looking for a classifier that can accurately classify the pixels using NDVI and/or any combination of my four 10 m bands.
Thanks in advance.
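As a baseline, a random forest over the stacked band + NDVI features is a common choice for per-pixel classification. A sketch assuming the rasters are already loaded as numpy arrays (the random arrays below are placeholders for real data, e.g., read via rasterio):

```python
# Sketch: per-pixel farm / non-farm classification with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

H, W = 100, 100                                   # toy raster size for the sketch
bands = np.random.rand(4, H, W)                   # R, G, B, NIR (placeholders)
ndvi = np.random.rand(8, H, W) * 2 - 1            # monthly NDVI in [-1, 1]
labels = np.random.randint(0, 2, (H, W))          # stand-in training labels

X = np.concatenate([bands, ndvi]).reshape(12, -1).T   # (H*W, 12) feature matrix
y = labels.ravel()

train = np.random.rand(H * W) < 0.1               # pretend 10% of pixels are labeled
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[train], y[train])
farm_map = clf.predict(X).reshape(H, W)           # classified raster
```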
In order to take the industrial revolution to new horizons in mining and mineral processing, artificial intelligence and machine learning can play a critical role. Big data is a fundamental requirement for these techniques; what challenges does a researcher have to keep in mind while handling data coming from industry?
Thanking all research fellows in advance.
Hi,
I'm looking for a research project topic for my master's degree in data analytics. Below are a couple of areas I'm interested in; please suggest some project ideas related to these subject areas:
- Big Data
- Database/No SQL Database/Data Warehouse
- Cloud Computing
- Data Mining
- Machine Learning/Deep Learning/NLP
Regards,
Richard
Does anyone know where I can find real-life instances of the OntoSensor ontology? I am looking for one that has both a sensor description and data measured with it.
Hi Folks,
I need your help regarding artificial intelligence in the context of information retrieval tools, big data, and data mining in libraries. Can you share any dissertations/theses, research papers, conference papers, book chapters, research projects, or articles on this? I will also welcome your comments, thoughts, and feedback on how university libraries support this area, to help me design my PhD questionnaire.
-Yousuf
I want to execute the Apriori algorithm for association rule mining in MATLAB.
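As far as I know, MATLAB has no built-in Apriori, though File Exchange implementations exist. The level-wise logic is compact enough to port to MATLAB cell arrays; here it is sketched in Python:

```python
def apriori(transactions, min_support=0.5):
    """Return all frequent itemsets with their support (level-wise search)."""
    T = [frozenset(t) for t in transactions]
    n = len(T)
    support = lambda s: sum(1 for t in T if s <= t) / n

    current = {frozenset([i]) for t in T for i in t}          # candidate 1-itemsets
    current = {s for s in current if support(s) >= min_support}
    frequent, k = {}, 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # join step: combine surviving itemsets into k-itemsets, prune by support
        current = {a | b for a in current for b in current if len(a | b) == k}
        current = {s for s in current if support(s) >= min_support}
    return frequent

baskets = [{"milk", "bread"}, {"milk", "diaper", "beer"},
           {"milk", "bread", "diaper"}, {"bread", "diaper"}]
for itemset, sup in apriori(baskets, 0.5).items():
    print(set(itemset), sup)
```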
Data Mining: Sources of Data that can be mined
1.Files
- Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
- Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored as flat files, there are no relations between the tables.
- Flat files are described by a data dictionary. E.g.: CSV files.
- Applications: used in data warehousing to store data, in carrying data to and from a server, etc.
2.Relational Databases
- A relational database is defined as a collection of data organized in tables with rows and columns.
- The physical schema of a relational database defines the structure of the tables.
- The logical schema of a relational database defines the relationships among the tables.
- The standard API for relational databases is SQL.
- Applications: data mining, the ROLAP model, etc.
3.DataWarehouse
- A data warehouse is defined as a collection of data integrated from multiple sources that supports queries and decision making.
- There are three types of data warehouse: enterprise data warehouse, data mart, and virtual warehouse.
- Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
- Application: Business decision making, Data mining, etc.
4.Transactional Databases
- Transactional databases are collections of data organized by timestamps, dates, etc., to represent transactions in databases.
- This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed.
- It is a highly flexible system where users can modify information without changing any sensitive information.
- It follows the ACID properties of a DBMS.
- Application: Banking, Distributed systems, Object databases, etc.
5.Multimedia Databases
- Multimedia databases consist of audio, video, image, and text media.
- They can be stored in object-oriented databases.
- They are used to store complex information in pre-specified formats.
- Application: Digital libraries, video-on demand, news-on demand, musical database, etc.
6.Spatial Database
- Store geographical information.
- Stores data in the form of coordinates, topology, lines, polygons, etc.
- Application: Maps, Global positioning, etc.
7.Time-series Databases
- Time-series databases contain data such as stock exchange data and user-logged activities.
- They handle arrays of numbers indexed by time, date, etc.
- They often require real-time analysis.
- Examples: eXtremeDB, Graphite, InfluxDB, etc.
8.WWW
- WWW refers to the World Wide Web, a collection of documents and resources (audio, video, text, etc.) identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessible via the Internet through web browsers.
- It is the most heterogeneous repository, as it collects data from multiple sources.
- It is dynamic in nature, as the volume of data is continuously increasing and changing.
- Application: Online shopping, Job search, Research, studying, etc.
I have many singleton species (species represented by only one DNA sequence). These singletons lack resolution potential because they are single; when machine learning classifiers are applied, singletons are not considered, as there is no reference for them. Even using other relevant methods, we cannot resolve singleton species confidently.
Please suggest any method/program that could simulate or generate reference sequences from those singleton species. These reference sequences could then be used to resolve singleton species in a multi-species sequence dataset.
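As a heavily simplified illustration (not a substitute for model-based sequence simulators such as Seq-Gen or INDELible, which are preferable when an evolutionary model is known), one could generate pseudo-references by random point mutations of the singleton; the mutation rate below is an assumed parameter:

```python
# Naive sketch: pseudo-reference sequences via random point mutations.
# mut_rate is an assumed per-site substitution probability.
import random

def augment(seq, n_copies=20, mut_rate=0.01, alphabet="ACGT", seed=0):
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        s = list(seq)
        for k, c in enumerate(s):
            if rng.random() < mut_rate:
                s[k] = rng.choice([a for a in alphabet if a != c])
        copies.append("".join(s))
    return copies

print(augment("ACGTACGTACGTACGT", n_copies=3))
```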
Price optimization methods and algorithms are used to determine the best price or set of prices for a company's business offerings. In our project https://www.researchgate.net/project/Dynamic-Pricing-Algorithms-and-Models-using-Artificial-Intelligence
we are working on dynamic pricing algorithms and models using artificial intelligence. However, we would like to hear from expert researchers about dynamic pricing models and algorithms. What are the best-of-breed dynamic pricing algorithms and models using artificial intelligence?
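One widely used family is multi-armed bandits over a discrete price grid; contextual bandits and reinforcement learning generalize this. A minimal epsilon-greedy sketch with a hypothetical demand curve standing in for real customer responses:

```python
# Sketch: epsilon-greedy bandit choosing among fixed price points to
# maximize average revenue per offer (demand curve below is a stand-in).
import random

rng = random.Random(0)
prices = [9.99, 12.99, 14.99, 19.99]
revenue, pulls = [0.0] * len(prices), [0] * len(prices)

def buy_prob(p):                                   # hypothetical demand curve
    return max(0.0, 1.0 - p / 25.0)

for t in range(10_000):
    if rng.random() < 0.1 or 0 in pulls:           # explore
        i = rng.randrange(len(prices))
    else:                                          # exploit best average revenue
        i = max(range(len(prices)), key=lambda j: revenue[j] / pulls[j])
    r = prices[i] if rng.random() < buy_prob(prices[i]) else 0.0
    pulls[i] += 1
    revenue[i] += r

print(max(zip(prices, (r / p for r, p in zip(revenue, pulls))), key=lambda x: x[1]))
```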
I have a dataset with 56 variables: 4 dependent and 52 independent. Among the independent variables, 45 are categorical (nominal) and the rest are continuous; 3 of the dependent variables are categorical (ordinal) and the remainder are continuous. Each variable has 1500 observations. I want to check whether the independent variables have any effect on each dependent variable.
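For a nominal independent variable against an ordinal dependent variable, a chi-square test of independence is a common first screen (ordinal logistic regression would be the modelling counterpart). A sketch with hypothetical column names; repeat per IV/DV pair with a multiple-testing correction (e.g., Bonferroni) across the many pairs:

```python
# Sketch: chi-square test of independence for one nominal IV vs. one ordinal DV.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"iv": ["a", "b", "a", "c", "b", "a"] * 50,
                   "dv": [1, 2, 2, 3, 1, 3] * 50})      # toy stand-in data
table = pd.crosstab(df["iv"], df["dv"])                 # contingency table
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```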
Greetings to everyone.
I have to select relevant features from the KDD99 dataset. I am going to use the bat algorithm. To use the bat algorithm, is it necessary to convert the dataset into binary form or not? I don't know how to proceed. Can anyone please advise?
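For feature selection, the bat algorithm is usually run in its binary variant: each bat is a 0/1 mask over the features, and a transfer function maps velocities to bit-flip probabilities, so it is the search space (not the dataset) that is binary. KDD99's symbolic attributes (protocol, service, flag) still need a numeric encoding before any classifier-based fitness can run. A simplified sketch (loudness and pulse-rate updates omitted) with a k-NN wrapper fitness:

```python
# Simplified binary bat algorithm for feature selection: sigmoid transfer
# function, greedy acceptance, k-NN cross-val accuracy as fitness.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y):
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, mask == 1], y, cv=3).mean()

def binary_bat(X, y, n_bats=10, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = rng.integers(0, 2, (n_bats, d))
    vel = np.zeros((n_bats, d))
    fit = np.array([fitness(p, X, y) for p in pos])
    g, gfit = pos[fit.argmax()].copy(), fit.max()      # global best mask
    for _ in range(n_iter):
        freq = rng.random(n_bats)                      # random frequencies
        vel += (pos - g) * freq[:, None]
        prob = 1.0 / (1.0 + np.exp(-vel))              # sigmoid transfer function
        cand = (rng.random((n_bats, d)) < prob).astype(int)
        for b in range(n_bats):
            f = fitness(cand[b], X, y)
            if f > fit[b]:                             # greedy acceptance
                pos[b], fit[b] = cand[b], f
                if f > gfit:
                    g, gfit = cand[b].copy(), f
    return g

X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           random_state=0)
print(binary_bat(X, y))                                # selected feature mask
```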
I need to do some comparisons with other methods for a new rule of combination under Dempster-Shafer theory. I would like to use the same data used in 'Combining Multiple Hypotheses for Identifying Human Activities' by Young-Woo Seo and Katia Sycara. Unfortunately, those data are no longer available at http://www.cs.utexas.edu/users/sherstov/pdmc/ . This dataset was originally released for a Physiological Data Modeling Contest (PDMC) at the site cited above. Can someone provide me with the data or point me to a site where I can get it?
I have data in which the change in the weight of materials is recorded over time. Unfortunately, because of a special condition, I cannot record the weight during the first 75 seconds.
- Is there any way to predict the initial missing data (i.e., the change in weight during the first 75 seconds)?
- How can I find the equation of the curve that fits the data points?
Any solution with MATLAB, SPSS, or Excel is appreciated.
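In any of those tools this is parametric curve fitting followed by evaluating the fitted model at t < 75 s (MATLAB's lsqcurvefit or fit work analogously; Excel trendlines for simple forms). A SciPy sketch, assuming an exponential-saturation model form; the extrapolation is only as good as the chosen form, so it should be justified by the physics:

```python
# Sketch: fit a parametric curve to t >= 75 s, then extrapolate to t < 75 s.
# The exponential-saturation form below is an assumption, not a given.
import numpy as np
from scipy.optimize import curve_fit

def model(t, a, b, w0):
    return w0 + a * (1 - np.exp(-b * t))

rng = np.random.default_rng(0)
t_obs = np.linspace(75, 600, 40)                       # recorded after 75 s
w_obs = model(t_obs, 5.0, 0.01, 0.2) + rng.normal(0, 0.05, t_obs.size)

params, cov = curve_fit(model, t_obs, w_obs, p0=[1.0, 0.01, 0.0])
t_missing = np.linspace(0, 75, 16)
print(model(t_missing, *params))                       # reconstructed early weights
```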
How can the location of tweets be retrieved even when the user's location sharing is turned off, and which features can be used to infer location? Is there any existing work on this?
Is there a way to automate such a process whilst having all the additional data of the sequence stored in a nice legible manner?
Are "accuracy" and "correctly classified instances" the same measure? If they are the same, will their formulas also be the same in Weka?
Hi. I have a query regarding text classification. I have a list of words with the following attributes: word, weight, class. The class can be positive or negative, and the weight is between -1 and 1. How can I train a classifier like an SVM using this word list to classify unseen documents? An example in any tool is welcome.
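The word list alone cannot feed an SVM directly, but it can score documents; one common route is to pseudo-label a corpus with the lexicon score and then train the SVM on bag-of-words features so it generalizes beyond the listed words. A sketch with a toy lexicon:

```python
# Sketch: use the (word, weight) lexicon to pseudo-label documents, then
# train an SVM on tf-idf features of those documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

lexicon = {"excellent": 0.9, "good": 0.5, "bad": -0.6, "awful": -0.8}  # toy list

def lexicon_score(doc):
    return sum(lexicon.get(w, 0.0) for w in doc.lower().split())

docs = ["excellent service and good value", "awful experience bad support",
        "good food excellent staff", "bad room awful noise"]
pseudo = [1 if lexicon_score(d) > 0 else 0 for d in docs]   # 1 = positive

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(docs), pseudo)
print(clf.predict(vec.transform(["the staff was excellent"])))
```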
Could you please suggest some good articles (original research or reviews) on comparative studies (comparing performance w.r.t. accuracy, ranking, etc.) of text similarity measures such as cosine similarity, BM25, and language modeling?
Is there any API available for collecting Facebook datasets to use for sentiment analysis?
I have a finite set of subjects. I want to find which subject a tweet from a given Twitter user belongs to, so that I can learn the user's topics of interest. Which classifier would be most suitable for tweets, which contain only a small number of words?
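For short texts, tf-idf with word n-grams plus multinomial naive Bayes (or a linear SVM) is a standard baseline; a toy sketch:

```python
# Baseline sketch: tf-idf + multinomial naive Bayes for short-text topics.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["great match tonight", "the election results are in",
          "new phone launch today", "goal in the final minute",
          "parliament votes on the bill", "this laptop is so fast"]
subjects = ["sports", "politics", "tech", "sports", "politics", "tech"]

pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
pipe.fit(tweets, subjects)
print(pipe.predict(["who scored the goal"]))     # expected: ['sports']
```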
I want to cluster about 3 million tweets by their content; that is, I need to cluster similar tweets for event detection. How can I use Weka for this purpose?
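Weka's StringToWordVector + SimpleKMeans can do this in principle, but 3 million tweets usually exceeds its in-memory design. One scalable alternative, sketched here with scikit-learn rather than Weka, is a hashing vectorizer feeding mini-batch k-means:

```python
# Sketch: hash tweets into a fixed-width sparse space (no vocabulary kept in
# memory) and cluster incrementally with mini-batch k-means.
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

tweets = ["quake hits the coast", "earthquake near the coast today",
          "team wins the cup", "champions again this year"] * 1000

vec = HashingVectorizer(n_features=2**18, alternate_sign=False, norm="l2")
X = vec.transform(tweets)

km = MiniBatchKMeans(n_clusters=2, batch_size=1024, n_init=3, random_state=0)
labels = km.fit_predict(X)
print(labels[:4])
```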
For incremental clustering of tweets, how can I get the cosine similarity between the clusters and a tweet, so I can add the tweet to one of the clusters? As I understand it, we can create a tf-idf weight vector for a tweet and another vector of the same length for each cluster, then compute the similarity between the two equal-length vectors. I can create the vector for the tweet, but I do not know how to do it for each cluster.
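The cluster vector is simply the centroid - the mean of its members' tf-idf rows - so it has the same length as any tweet vector; for incremental clustering, keep a running sum and member count per cluster so the centroid updates cheaply. A sketch:

```python
# Sketch: a cluster's vector is the mean of its members' tf-idf rows.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cluster_tweets = ["storm hits the city", "heavy storm damage downtown"]
vec = TfidfVectorizer().fit(cluster_tweets + ["match tonight"])

members = vec.transform(cluster_tweets)
centroid = np.asarray(members.mean(axis=0))        # 1 x vocab, same length as a tweet

new_vec = vec.transform(["storm damage in the city"]).toarray()
print(cosine_similarity(centroid, new_vec)[0, 0])
```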
I want to do some experiments with wireless sensor networks. I am looking for datasets, e.g., from the UCI ML repository or other web resources. Ideally, I would like a dataset consisting of a large number of sensors measuring different values at different times.
We need three high-dimensional datasets (500 to 1000 features) with small sample sizes (< 500 samples) for our project, so that we can run our feature selection algorithm on them. Any ideas or download links would be great for us.
Compared with other analytics/recommender tools and libraries like WEKA, RStudio, GitHub, Surprise, MyMediaLite, LensKit, LibRec, etc., how good is Azure Machine Learning Studio?
I have a text file on which clustering has to be performed; after clustering, the cosine similarity between the vectors has to be determined.
I want a dataset with a parent-child hierarchy for event/sub-event detection. Can anyone point me to a link?
If I have 1200 instances in total and I want to do binary classification, with one class ("yes") having 900 instances and the second class ("no") having 300, how can I tell that the dataset is unbalanced, and what techniques could be applied in Weka to resolve this issue?
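A 900:300 split is a 3:1 imbalance, which is enough to bias many learners toward the majority class. In Weka you can use the supervised Resample filter (with biasToUniformClass), the SMOTE package filter, or CostSensitiveClassifier. The equivalent SMOTE step, sketched with Python's imbalanced-learn:

```python
# Sketch: SMOTE oversampling of the minority class (imbalanced-learn);
# Weka's SMOTE package filter does the equivalent from the Preprocess panel.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1200, weights=[0.75, 0.25], random_state=0)
print(Counter(y))                      # roughly 900 vs 300 -> 3:1 imbalance

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                  # balanced via synthetic minority points
```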
I want to find the similarity of data items based on a set of features.
E.g., if I am looking at a watch, then the make, brand, type, and style can be the set of influencers.
I can label-encode these features because they are all discrete variables.
On top of this, I want to add continuous variables like discount given, price, and stock availability to filter out alternative possibilities.
How do we achieve this? Any thoughts and feedback would be appreciated.
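One common recipe: one-hot encode the categoricals (label encoding would impose a fake ordering on brands or styles), min-max scale the continuous variables, then compute similarity on the combined vector with a weight on each part; Gower distance is the classical formulation for such mixed data. A sketch with a hypothetical watch table:

```python
# Sketch: similarity over mixed categorical + continuous features.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

items = pd.DataFrame({
    "brand": ["casio", "seiko", "casio"],
    "style": ["digital", "analog", "digital"],
    "price": [49.0, 250.0, 55.0],
    "discount": [0.10, 0.05, 0.15],
})

cat = OneHotEncoder(handle_unknown="ignore").fit_transform(
    items[["brand", "style"]]).toarray()
num = MinMaxScaler().fit_transform(items[["price", "discount"]])
X = np.hstack([cat, 0.5 * num])        # 0.5 = assumed weight on continuous part

print(cosine_similarity(X))            # items 0 and 2 come out most similar
```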
In the training phase, I applied the k-means clustering algorithm to a dataset (k=10). I want to apply a decision tree to each cluster and generate a separate performance model for each cluster.
In the testing phase, I want to compare each test instance with the centroid of each cluster and use the appropriate model to classify the instance.
Is there a way I can achieve this in WEKA or the WEKA API? Any link/resource will be appreciated.
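The overall pattern (cluster, train one tree per cluster, route test instances by nearest centroid) can be wired up via the Weka API with SimpleKMeans and J48; the same flow, sketched with scikit-learn for clarity:

```python
# Pattern sketch: k-means at training time, one decision tree per cluster,
# routing of test instances to the tree of their nearest centroid.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_tr)
trees = {c: DecisionTreeClassifier(random_state=0)
            .fit(X_tr[km.labels_ == c], y_tr[km.labels_ == c])
         for c in range(10)}

route = km.predict(X_te)                     # nearest centroid per test instance
pred = np.array([trees[c].predict(x[None, :])[0] for c, x in zip(route, X_te)])
print((pred == y_te).mean())
```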
Hello All,
How can I program the Cola benchmark function in MATLAB?
Hi, I am working on uncertain data. When I convert a certain value into an uncertain interval value such as [19,21) based on an equation, under which datatype should I define this interval, especially in an ARFF file in Weka?
Thanks
I'm trying to use MLlib in Java (SVM), but I have input files with string and nominal attributes. Any solutions?
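MLlib needs numeric feature vectors, so the usual fix is to index and one-hot encode the string/nominal columns before assembling features. A PySpark sketch using the Spark 3.x API (the Java API mirrors these same classes):

```python
# Sketch (PySpark, Spark 3.x): index and one-hot encode nominal columns,
# assemble a numeric vector, then fit the linear SVM.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("svm-nominal").getOrCreate()
df = spark.createDataFrame(
    [("red", 1.2, 0.0), ("blue", 0.4, 1.0), ("red", 0.9, 0.0), ("green", 2.1, 1.0)],
    ["color", "x1", "label"])

stages = [
    StringIndexer(inputCol="color", outputCol="color_idx"),
    OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"]),
    VectorAssembler(inputCols=["color_vec", "x1"], outputCol="features"),
    LinearSVC(featuresCol="features", labelCol="label"),
]
model = Pipeline(stages=stages).fit(df)
model.transform(df).select("color", "prediction").show()
```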
I understand that Weka's J48 uses gain ratio as the method for determining how to split the tree, but when I compare the first attribute in the attribute ranking with the attribute on which my decision tree is split first, they differ (the top-ranked attribute in the attribute selector is "Domain", while the attribute on which the tree was split first was "MgO"). Why is this? Aren't they supposed to be the same?
For decision trees, we can compute either the information gain (entropy) or the Gini index to decide which attribute should be the splitting attribute. Can anyone share a worked-out example of the Gini index?
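A small worked example: a parent node with 6 "yes" and 4 "no", split into a left child (4 yes, 1 no) and a right child (2 yes, 3 no):

```python
# Worked example: Gini index of a parent node vs. a candidate binary split.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

parent = ["yes"] * 6 + ["no"] * 4     # 1 - (0.6^2 + 0.4^2) = 0.48
left   = ["yes"] * 4 + ["no"] * 1     # 1 - (0.8^2 + 0.2^2) = 0.32
right  = ["yes"] * 2 + ["no"] * 3     # 1 - (0.4^2 + 0.6^2) = 0.48

weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent), weighted)         # 0.48 -> 0.40, so the split gains 0.08
```

The attribute whose split yields the lowest weighted child impurity (largest reduction from the parent's 0.48) is chosen.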
I am new to data mining research - actually, I know the theory and process, i.e., what to do and how to do it. My purpose is to:
1. calculate Pearson's item-based correlation coefficient to measure the relations among items;
2. look at the target users and use a technique to obtain predictions - for this I want to use the weighted sum, which computes the prediction on an item i for a user u as the sum of the ratings given by the user on the items similar to i, weighted by the similarity S(i,j) between items i and j.
But I don't quite know how to realize these techniques using software like Weka or others. I found correlationAttributeEval() in Weka's feature selection, but that calculates the correlation between the class attribute and the other attributes, which is exactly what I don't want. I want to calculate correlations between items within the same attribute; for example, ProductID is an attribute, and I want to calculate correlations among different products identified by ProductID, not calculate and rank correlations between ProductID and other attributes.
Please help me with how I can do this, what the steps are, and which software is best suited (my preference is Weka, if it is possible there). Thanks in advance.
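Weka's CorrelationAttributeEval indeed only ranks attributes against the class, so it will not give item-item correlations. What you describe needs a user x item rating matrix, which is much easier to build in pandas (or R) than in Weka. Both of your steps, sketched on a toy matrix:

```python
# Sketch: (1) item-item Pearson correlations from a user x item rating
# matrix, (2) weighted-sum prediction for one (user, item) pair.
import numpy as np
import pandas as pd

R = pd.DataFrame(                       # rows = users, columns = ProductIDs
    {"P1": [5, 4, np.nan, 2],
     "P2": [4, np.nan, 3, 2],
     "P3": [1, 2, 4, np.nan]},
    index=["u1", "u2", "u3", "u4"])

S = R.corr(method="pearson")            # step 1: pairwise item-item similarity

def predict(user, item, k=2):
    rated = R.loc[user].dropna().index.drop(item, errors="ignore")
    sims = S.loc[item, rated].dropna()
    sims = sims[sims > 0]               # keep positively correlated items
    top = sims.nlargest(k)              # k most similar items the user rated
    if top.empty:
        return float(R.loc[user].mean())            # fallback: user's mean rating
    return float((R.loc[user, top.index] * top).sum() / top.sum())  # weighted sum

print(S.round(2))
print(predict("u3", "P1"))
```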
I am looking for a web log file from a web server to use for web usage mining.
Can I use a predictive or machine learning approach to improve the quality of health care, or use it for disease prediction?
Can someone recommend how to divide the data into training and testing sets?
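A common default is a stratified 70/30 or 80/20 split (stratification preserves the class proportions); for small datasets, k-fold cross-validation is usually preferable to a single split:

```python
# Sketch: stratified 70/30 split with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(X_tr.shape, X_te.shape)
```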
Given a user-defined constraint on a sequential database, how can the association rules be found?
I am currently working on mining sequential classification rules with Hadoop. I am searching for big labeled sequential data, with each sequence being a sequence of items like <a,b,a,b,a> and each sequence having a class label, i.e., (<a,b,a,b,a> : class).
Does anyone have access to such big data?
I was trying to implement the USD algorithm (paper title: 'Discretization Oriented to Decision Rules'). However, I have some doubts:
1. At line 20 of the algorithm it is written that Ii has the same majority class as Ii+1. What does this mean?
2. At line 20 of the algorithm it is written that there is a tie in Ii or Ii+1. A tie of what?
3. What is the purpose of lines 14 & 15 when lines 11 & 12 already cover all consecutive intervals?
We need an example of a real-life dataset, ideally related to engineering, where we have binary attributes for each element (we are going to analyze the data with formal concept analysis techniques).
This dataset has to generate some knowledge that lets us accept or discard new elements whose attributes differ from those of the previous elements.
Thanks for your support.
Dear ResearchGaters,
I'm trying to identify associations between keywords in PubMed. A prototypical search would be: listing kinases (ranked by number of publications) that are associated with the keyword cancer or inflammation. Does anybody have an idea of an easily accessible tool that can perform such a search? Thank you for your help.
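NCBI's E-utilities can do this directly; with Biopython it is a few lines to rank kinases by the number of PubMed hits for "<kinase> AND cancer". A sketch (the kinase list is illustrative; set a real email and respect NCBI rate limits):

```python
# Sketch with Biopython's Entrez (NCBI E-utilities): rank kinases by the
# number of PubMed records co-mentioning "cancer".
from Bio import Entrez

Entrez.email = "you@example.org"        # required by NCBI

def pubmed_count(term):
    handle = Entrez.esearch(db="pubmed", term=term, retmax=0)
    count = int(Entrez.read(handle)["Count"])
    handle.close()
    return count

kinases = ["EGFR", "AKT1", "MAPK1"]     # illustrative starting list
ranking = {k: pubmed_count(f"{k} AND cancer") for k in kinases}
print(sorted(ranking.items(), key=lambda kv: -kv[1]))
```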
I want to find the tightest lower and upper bounds for frequent subgraphs in uncertain graph data, and also the densest frequent subgraph. Please advise.
I compute the weights using the dot product of the HOG features, according to an article I found.
I am trying to analyze the text corpus of tickets from a ticket system in order to cluster the requests according to the words used in them. I use the vector space model and have calculated the values for tf, idf, tf-idf, and entropy.
Due to the number of dimensions (about 8,500) I would like to extract the key words. Is there a statistical approach based on a specific threshold for entropy/tf-idf? Of course I could require that document frequency be at least x and entropy at least y, but how can I justify that choice?
Thanks for your help!
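Rather than fixing x and y by hand, one defensible approach is to rank terms by aggregate tf-idf (after document-frequency filtering) and validate the cutoff empirically: re-run the clustering at several candidate vocabulary sizes and keep the one with the best internal quality (e.g., silhouette). A sketch of the ranking step:

```python
# Sketch: select key words by aggregate tf-idf; validate the top-N cutoff by
# comparing downstream clustering quality across candidate values of N.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tickets = ["printer not working in office", "cannot login to vpn",
           "vpn connection drops", "replace printer toner office"]

vec = TfidfVectorizer(max_df=0.9)             # drop near-ubiquitous terms
X = vec.fit_transform(tickets)

scores = np.asarray(X.mean(axis=0)).ravel()   # aggregate tf-idf per term
terms = vec.get_feature_names_out()
top_n = 5                                     # candidate cutoff to validate
print([terms[i] for i in scores.argsort()[::-1][:top_n]])
```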
I am currently working on a project that requires detailed knowledge of work that has been done on computer screen capturing, screen recording during usability testing sessions, etc. The focus is on published work, not on the available tools and their documentation.
We create a new set of labels for classifier B as follows:
If the prediction by A is a true positive or true negative, the label for B is 0.
If the prediction by A is a false positive or false negative, the label for B is 1.
We train B accordingly (using the same features as for A).
Then, on new data, we modify the outcome of A based on the outcome of B.
However, this expectedly worsens performance if B itself gives too many false outcomes.
I am new to machine learning, so I am not aware of better ways to do this. Please help.
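Two standard fixes for exactly that failure mode: train B on out-of-fold predictions of A (so B's error labels are not optimistically biased by A's training fit), and only override A when B is confident. A sketch of the whole cascade:

```python
# Sketch of the described cascade: B learns where A errs (from out-of-fold
# predictions), and A's output is flipped only when B is confident.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=3000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

A = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
oof = cross_val_predict(LogisticRegression(max_iter=1000), X_tr, y_tr, cv=5)
err = (oof != y_tr).astype(int)                 # 1 where A tends to be wrong
B = RandomForestClassifier(random_state=0).fit(X_tr, err)

pa = A.predict(X_te)
p_wrong = B.predict_proba(X_te)[:, 1]           # B's estimate that A errs
corrected = np.where(p_wrong > 0.9, 1 - pa, pa) # flip only when B is confident
print((pa == y_te).mean(), (corrected == y_te).mean())
```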
I am just starting to get into data mining, and I want to know where I should start my work.
Thank You.
I want to implement an ECC algorithm in the Cooja simulator and compare the performance of RSA and ECC on IoT (Internet of Things) nodes. I am using Contiki and Cooja.
In one of my recent studies (I will post it here after it gets published) I found a correlation (0.42, moderate) between the number of inhabitants of a city and the level of development of the official city hall website. When I explained that to my students, they came up with a different perspective: they never (or rarely) visit the web page, but they know almost everything that is new in their city through Facebook. If that is true, the correlation should be much stronger. So my questions are: How can I find out how involved the younger generation is in city life through social media networks? How can I measure that?
I am using R to create a DocumentTermMatrix and then convert it to a normal matrix for further processing. There are 26,000 records and 4 columns, and while converting the DocumentTermMatrix to a normal matrix it hangs. I am using the syntax abc <- as.matrix(dtm), where dtm contains the document term matrix. The RAM of my computer is sufficient. Can anybody give me a solution?
I want to know the best tool for machine learning on big data. I have used the MLlib API in the Spark framework, and I want to test another tool.
I need ways to statistically measure the quality of the clusters formed after applying a clustering technique from data mining, such as k-means.
After forming clusters of bank customers using various algorithms such as k-means and decision trees, what are the ways we can validate the clusters formed by our method?
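Standard internal validity indices are the silhouette coefficient (higher is better), Davies-Bouldin (lower is better), and Calinski-Harabasz (higher is better); if reference labels exist, external indices such as ARI/NMI apply, and stability under resampling is another check. A sketch:

```python
# Sketch: internal cluster validity indices on a k-means result.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette (higher=better):       ", silhouette_score(X, labels))
print("Davies-Bouldin (lower=better):    ", davies_bouldin_score(X, labels))
print("Calinski-Harabasz (higher=better):", calinski_harabasz_score(X, labels))
```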
For example: user 1 plays items 1 to 6 with the following numbers of plays, respectively: 5, 0, 3, 3, 6, 1.
I want to obtain a value in the range 0 to 1 for the user's interest in each item, based on the number of plays. I do not want to divide the number of clicks on an item by the total number of clicks, nor by the maximum.
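A saturating transform uses only the item's own count - no totals or maxima are involved; the constant tau below is an assumed "half-saturation" parameter to tune:

```python
# Sketch: per-item saturating transforms to [0, 1) that use neither the
# total nor the maximum number of plays.
import numpy as np

plays = np.array([5, 0, 3, 3, 6, 1])
tau = 3.0

interest_a = plays / (plays + tau)          # 0 plays -> 0, saturates toward 1
interest_b = 1 - np.exp(-plays / tau)       # smoother exponential variant
print(interest_a.round(2), interest_b.round(2))
```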
I am looking for a translation lexicon covering different languages.
Hi all,
I have installed and set up RHadoop in a VMware environment.
Recently, I tried to install the following packages:
1) the rhdfs_1.0.8 package, which prompted this error message:
(as ‘lib’ is unspecified)
Error in getOctD(x, offset, len) : invalid octal digit..
2) Rmpi_0.6-5, which failed with:
configure: error: "Cannot find mpi.h header file"
ERROR: configuration failed for package ‘Rmpi’
* removing ‘/home/aisyah/R/x86_64-redhat-linux-gnu-library/3.2/Rmpi’
Has anyone had the same experience, and do you have any suggestions?
Let's say you have a few data visualization options: pie chart, bar chart, tabular, histogram.
Using the features and data values, how can a program "calculate" the most suitable data visualization for a given dataset?
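A simple starting point is a rule-based heuristic over each column's dtype and cardinality (research systems such as Draco and VizML learn such rules from data); the thresholds below are assumptions to tune:

```python
# Minimal rule-based sketch: pick a chart from a column's dtype/cardinality.
import pandas as pd

def suggest_chart(col: pd.Series) -> str:
    if pd.api.types.is_numeric_dtype(col):
        return "histogram"            # continuous -> show the distribution
    n = col.nunique()
    if n <= 5:
        return "pie chart"            # few categories, part-of-whole reading
    if n <= 25:
        return "bar chart"            # moderate categories, compare counts
    return "tabular"                  # too many categories to plot cleanly

df = pd.DataFrame({"age": [23, 31, 44], "country": ["PE", "US", "PE"]})
print({c: suggest_chart(df[c]) for c in df.columns})
```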
We tried using PSO, but it takes a long time, especially with 6 variables and a lag of 2. We also tried using linear regression to get the intercepts and coefficients, but we believe better results can still be achieved with other parameter search techniques.
I am working on text classification ('prediction') using ML algorithms such as multinomial naive Bayes and ensembles. I have already trained the classifiers on my training dataset (70% of the total) and used the remaining 30% as a test dataset, obtaining an output (error percentage) for each classifier. Could you suggest 2-3 more techniques for testing, and also for evaluation? I am already planning to use cross-validation, precision, recall, and F-measure.
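Beyond the single 70/30 split: k-fold cross-validation over several metrics at once, a confusion matrix, ROC-AUC, and (for comparing two classifiers on the same test set) McNemar's test. A sketch of multi-metric cross-validation:

```python
# Sketch: 5-fold cross-validation reporting several metrics in one pass.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=600, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
res = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=5,
                     scoring=scoring)
print({m: round(res[f"test_{m}"].mean(), 3) for m in scoring})
```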
I am trying to normalize a flow matrix. The goal is to make the matrix pre-magic, such that the row sums equal the column sums. I rounded the entries to the nearest integer, but the result is no longer pre-magic. What is a possible approach to solving rounding issues like this so that I can normalize properly?
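Iterative proportional fitting (Sinkhorn-style scaling) alternately rescales rows and columns toward a common target sum; after rounding, the small residuals can be repaired, e.g., by largest-remainder adjustments on the biggest entries. A sketch assuming a positive square matrix:

```python
# Sketch: Sinkhorn-style scaling drives every row and column sum of a
# positive square matrix to the same target; round afterwards and repair
# any small integer residuals if exact balance is required.
import numpy as np

def balance(F, n_iter=500):
    M = F.astype(float).copy()
    target = M.sum() / M.shape[0]                 # common row/column sum
    for _ in range(n_iter):
        M *= (target / M.sum(axis=1))[:, None]    # match row sums
        M *= (target / M.sum(axis=0))[None, :]    # match column sums
    return M

F = np.array([[4.0, 1.0, 3.0], [2.0, 6.0, 1.0], [1.0, 2.0, 7.0]])
B = balance(F)
print(B.sum(axis=1), B.sum(axis=0))               # all ~equal to the target
```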
I am looking for work on student performance prediction in educational data mining.
Lift is a measure used in association rule mining: lift(A => B) = support(A ∪ B) / (support(A) × support(B)) = confidence(A => B) / support(B); a lift greater than 1 indicates that A and B occur together more often than expected under independence.