Science topic

Computational Data Mining - Science topic

Explore the latest questions and answers in Computational Data Mining, and find Computational Data Mining experts.
Questions related to Computational Data Mining
  • asked a question related to Computational Data Mining
Question
11 answers
Hi there!
My data has a number of features (which contain continuous data) and a response feature (class label) containing categorical, binary data. My intention is to study the variation of the response feature (class) as a function of all the other features, using a variety of feature selection techniques. Kindly help by pointing out the right techniques for this purpose. The data looks like this:
------------------------------------------------------------------
f1 f2 f3 f4 ... fn class
------------------------------------------------------------------
0.2 0.3 0.87 0.6 ... 0.7 0
0.2 0.3 0.87 0.6 ... 0.7 1
0.2 0.3 0.87 0.6 ... 0.7 0
0.2 0.3 0.87 0.6 ... 0.7 1
-------------------------------------------------------------------
Relevant answer
Answer
You can select the best algorithm based on a measure of performance from a number of data mining algorithms. An exhaustive list may be found here: https://www.kdnuggets.com/2015/05/top-10-data-mining-algorithms-explained.html
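As a concrete starting point for the question above, filter-style feature selection scores each continuous feature against the binary class independently of any classifier. The sketch below uses scikit-learn on synthetic data shaped like the table in the question (the data and feature count are placeholders, not the asker's real dataset):

```python
# Feature selection for continuous features vs. a binary class label.
# Minimal illustrative sketch with scikit-learn; the data is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 6))                      # f1 ... f6, continuous in [0, 1]
y = (X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(200) > 0.8).astype(int)

# Filter methods: ANOVA F-test and mutual information score each feature
# against the binary class, without training a classifier.
f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

selector = SelectKBest(f_classif, k=3).fit(X, y)
print("ANOVA F-scores:", np.round(f_scores, 2))
print("Mutual information:", np.round(mi_scores, 3))
print("Top-3 features by F-test:", selector.get_support(indices=True))
```

Wrapper methods (recursive feature elimination) and embedded methods (L1-regularized logistic regression, tree-based importances) are the usual next steps once a filter pass has narrowed the candidates.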
  • asked a question related to Computational Data Mining
Question
4 answers
I think that Generative Adversarial Networks can be used as a means of data farming. What do you know about such an approach? Can you give another example of a means for data farming?
Relevant answer
Answer
Other approaches exist, mostly application-based. For instance:
  • Extreme Data Mining.
  • A strategy to apply machine learning to small datasets in materials science.
  • Machine learning on small size samples: A synthetic knowledge synthesis.
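The simplest form of data farming is to fit a parametric model to a small real sample and draw synthetic rows from it; a GAN plays the same role for complex distributions. The sketch below is only an illustration of that idea with a multivariate Gaussian (the data and dimensions are made up):

```python
# "Data farming" in miniature: estimate a distribution from a small real
# sample, then sample many synthetic rows from it. A GAN generalizes this
# to complex, non-Gaussian data; this Gaussian version is for illustration.
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(loc=[2.0, -1.0], scale=[0.5, 1.5], size=(50, 2))  # small real sample

mu = real.mean(axis=0)                 # fitted mean
cov = np.cov(real, rowvar=False)       # fitted covariance

synthetic = rng.multivariate_normal(mu, cov, size=500)  # farmed data

print("real mean:", np.round(mu, 2))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 2))
```

The synthetic sample reproduces the fitted moments; whether it reproduces the aspects of the data that matter for a downstream task is exactly the question GAN-based data farming tries to answer for richer distributions.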
  • asked a question related to Computational Data Mining
Question
2 answers
I'm quite new to GMDH and, based on my first reading about this technique, I feel I want to know more. Here are some of the claimed benefits of the GMDH approach:
1. The optimal complexity of the model structure is found, adequate to the level of noise in the data sample. For real problems with noisy or short data, simplified forecasting models are more accurate.
2. The number of layers and neurons in hidden layers, the model structure and other optimal NN parameters are determined automatically.
3. It guarantees that the most accurate or unbiased models will be found - the method doesn't miss the best solution while sorting all variants (in a given class of functions).
4. Any non-linear functions or features which can influence the output variable can be used as input variables.
5. It automatically finds interpretable relationships in the data and selects effective input variables.
6. GMDH sorting algorithms are rather simple to program.
7. TMNN neural nets are used to increase the accuracy of other modelling algorithms.
8. The method uses information directly from the data sample and minimizes the influence of a priori author assumptions about the results of modeling.
9. The approach makes it possible to find an unbiased physical model of the object (a law or clusterization) - one and the same for future samples.
It seems that items 1, 2, 6 and 7 are really interesting and could be extended to ANNs.
Any suggestion or experience from others?
Relevant answer
Answer
In my experience it is not very accurate. Consider using ANFIS-PSO instead.
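Whichever method one ultimately prefers, the core GMDH mechanism described in the question (item 6: the sorting algorithm is simple to program) can be sketched compactly. Below is a minimal, illustrative single GMDH layer in pure NumPy: fit an Ivakhnenko polynomial neuron for every pair of inputs and rank neurons by error on a held-out split (the "external criterion"). The data is synthetic and this is not a full multi-layer implementation:

```python
# One GMDH layer: a quadratic polynomial neuron per input pair, selected by
# validation error. Illustrative sketch only, on synthetic data.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (300, 4))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + 0.05 * rng.standard_normal(300)

train, valid = slice(0, 200), slice(200, 300)

def neuron_features(a, b):
    # Ivakhnenko polynomial terms: 1, a, b, ab, a^2, b^2
    return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])

results = []
for i, j in combinations(range(X.shape[1]), 2):
    Ft = neuron_features(X[train, i], X[train, j])
    Fv = neuron_features(X[valid, i], X[valid, j])
    w, *_ = np.linalg.lstsq(Ft, y[train], rcond=None)   # fit on training split
    mse = np.mean((Fv @ w - y[valid]) ** 2)             # external (validation) criterion
    results.append(((i, j), mse))

results.sort(key=lambda r: r[1])
print("best input pairs:", [p for p, _ in results[:2]])
```

A full GMDH network would feed the outputs of the best neurons into the next layer and stop when the external criterion no longer improves, which is what gives the method its automatic complexity control (items 1 and 2 in the question).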
  • asked a question related to Computational Data Mining
Question
7 answers
I would like to dive into the research domain of explainable AI. What are some of the recent trending methodologies in this domain? What would be a good starting point for entering this field?
Relevant answer
Answer
These papers will help you:
1. Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers: https://arxiv.org/pdf/1801.06889.pdf
2. Visual Analytics for Explainable Deep Learning:
3. CNN Explainer: Learning Convolutional Neural Networks with Interactive Visualization:
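Beyond the surveys above, a common hands-on entry point into explainable AI is permutation importance: shuffle one feature at a time and measure how much the model's score drops. The sketch below uses scikit-learn on synthetic data (the model and data are placeholders):

```python
# Model-agnostic explanation via permutation importance: features whose
# shuffling hurts the score most are the ones the model relies on.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.random((300, 5))
y = (X[:, 1] > 0.5).astype(int)          # only feature 1 matters here

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("importances:", np.round(result.importances_mean, 3))
```

From there, the usual next steps are local explanation methods such as LIME and SHAP, and saliency/attribution methods for deep networks, which the surveys listed above cover in depth.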
  • asked a question related to Computational Data Mining
Question
12 answers
Hi
I am trying to segment a Sentinel-2 image.
At this stage, I want to run a binary classifier that assigns each pixel to either a farm or a non-farm class. For this purpose, I have four 10 m bands: R/G/B/NIR. I have also generated an NDVI raster for each month (8 months in total) with values ranging from -1 to 1 (they can be normalized to 0-255).
I am looking for a classifier that can accurately classify the pixels using NDVI and/or any combination of my four 10 m bands.
Thanks in advance.
Relevant answer
Answer
Convolutional Neural Networks (CNNs) are the most popular neural network model for image classification problems. The big idea behind CNNs is that a local understanding of an image is good enough.
Top 5 classification algorithms in machine learning:
  • Logistic Regression.
  • Naive Bayes Classifier.
  • K-Nearest Neighbors.
  • Decision Tree / Random Forest.
  • Support Vector Machines.
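For the pixel-wise setting described in the question, a Random Forest on stacked band and NDVI values is a strong, simple baseline. The sketch below shows the data layout; the array sizes, the synthetic values and the toy label rule are placeholders for the real Sentinel-2 rasters and ground-truth mask:

```python
# Pixel-wise farm / non-farm classification: stack the four 10 m bands with
# the eight monthly NDVI layers into one 12-dimensional feature vector per
# pixel, then train a Random Forest on labelled pixels. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
h, w = 64, 64
bands = rng.random((4, h, w))            # R, G, B, NIR
ndvi = rng.uniform(-1, 1, (8, h, w))     # one NDVI layer per month

features = np.concatenate([bands, ndvi]).reshape(12, -1).T   # (pixels, 12)
labels = (ndvi.mean(axis=0).ravel() > 0).astype(int)         # toy farm mask

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)
farm_map = clf.predict(features).reshape(h, w)               # binary farm map
print("farm pixels:", int(farm_map.sum()))
```

In practice the labels would come from digitized training polygons, and training/evaluation pixels should be split spatially to avoid optimistic accuracy estimates; a CNN becomes worthwhile when spatial context around each pixel matters.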
  • asked a question related to Computational Data Mining
Question
9 answers
In order to take the industrial revolution to new horizons in mining and mineral processing, artificial intelligence and machine learning can play a critical role. Big data is a fundamental requirement for these techniques; what challenges does a researcher have to keep in mind while handling data coming from industry?
Thanking you all, research fellows, in advance.
Relevant answer
Answer
There are multiple articles on that specific topic. The main challenge lies in the fact that there are multiple perspectives from which you can look at the problem: algorithmic (data science), including preprocessing, databases, learning algorithms, etc.; acceptance of AI-based systems within an organization, including linked topics such as explainability (explainable AI, XAI); and security of such AI-based systems (e.g. adversarial attacks). Depending on the perspective you are interested in, you might want to look up the terms mentioned above. Furthermore, there are some articles that try to summarize some of the challenges (I linked them at the bottom of this answer). You can also have a look at our recent papers that deal with decision criteria for AI-based systems in an industrial context, as well as a summary paper on AI security and explainable AI.
  • asked a question related to Computational Data Mining
Question
8 answers
Hi,
I'm looking for a research project topic for my master's degree in data analytics. Below are a couple of areas I'm interested in; please suggest some project ideas related to these subject areas:
  • Big Data
  • Database/No SQL Database/Data Warehouse
  • Cloud Computing
  • Data Mining
  • Machine Learning/Deep Learning/NLP
Regards,
Richard
Relevant answer
Answer
As far as the current scenario is concerned, nothing can be a more apt topic than SARS-CoV-2 / COVID-19. Try using your computational skills and expertise in this field.
  • asked a question related to Computational Data Mining
Question
9 answers
Does anyone know where I can find real-life instances of the Ontosensor ontology? I am looking for one that has both a sensor description and data measured with it.
Relevant answer
Answer
Have a look at this paper:
C. Roussey, S. Bernard, G. André, D. Boffety. Weather Data Publication on the LOD Using SOSA/SSN Ontology. Semantic Web Journal, 2019. http://www.semantic-web-journal.net/content/weather-data-publication-lod-using-sosassn-ontology-0
  • asked a question related to Computational Data Mining
Question
8 answers
Hi Folks,
I need your help regarding the artificial intelligence context of information retrieval tools, and big data & data mining in libraries. Can you share a dissertation/thesis, research paper, conference paper, book chapter, research project or article with me? I would also welcome your comments, thoughts and feedback in the context of university libraries, to support me in designing my PhD questionnaire.
-Yousuf
Relevant answer
Answer
Dear Colleagues and Friends from RG,
In my opinion, in the coming years, one of the key applications of artificial intelligence integrated with other Industry 4.0 technologies, including Big Data Analytics, will be improvement of information search on the Internet.
Conducted scientific research confirms the strong correlation between the development of Big Data technology, Data Science and Data Analytics on the one hand and the effectiveness of the use of knowledge resources on the other. I believe that the development of Big Data, Data Science, Data Analytics and other ICT technologies, multi-criteria analysis, advanced processing of large information sets and Industry 4.0 technology increases the efficiency of using knowledge resources, including in the fields of economics, finance and organization management. In recent years, these technologies have been developing dynamically, particularly in knowledge-based economies, where they are used in scientific research and in business applications in commercial enterprises and in financial and public institutions. Due to the growing importance of this issue, an important research question is the analysis of the correlation between the development of Big Data, Data Science, Data Analytics and Business Intelligence and the effectiveness of using knowledge resources to solve key problems of civilization and business development. The use of these technologies and of advanced Industry 4.0 data processing in the processing of knowledge resources should contribute to increasing the efficiency of knowledge resource processing in knowledge-based economies, including in the fields of economics and finance.
In recent years, the scope of applications of Big Data and Data Science analytics in economics, finance and the management of organizations, including enterprises and financial and public institutions, has been increasing. Therefore, the importance of implementing analytical instruments for advanced processing of large data sets, i.e. building Big Data Analytics platforms to support organization management processes in various aspects of operations, including the improvement of customer relations, is also growing.
In recent years, the use of 5G technology to collect data from the Internet has begun to contribute significantly to improving the sentiment analysis of Internet users' opinions and to extending the research techniques carried out with Business Intelligence, Big Data Analytics, Data Science and other techniques using ICT, the Internet and the advanced data processing typical of the current fourth technological revolution, referred to as Industry 4.0.
In recent years, organization management processes have been improved through the implementation of Industry 4.0 information and advanced data processing technologies into IT analytical platforms such as Business Intelligence and Big Data Analytics. Advanced analysis of big data sets and research processes carried out on Business Intelligence platforms are also used to improve business management processes, and data collection processes on the Internet can be supported by 5G technology. Information technology management models in organizations have been enriched with advanced Industry 4.0 data processing technologies, including cloud computing, the Internet of Things, artificial intelligence, machine learning and more. The use of these technologies already covers almost the entire functioning of business entities, from computerized sales support systems, logistics, accounting, reporting and risk management to marketing activities on the Internet and the design of new products and innovative solutions in information systems. Online banking, whose development is determined by technological progress in ICT and Industry 4.0 technologies, is starting to dominate, and computerization increasingly affects public sector institutions servicing tax systems and the settlements of business entities. Business Intelligence analytical platforms have also been developed for several years in the SME sector; Business Intelligence systems supporting analytical processes and organization management are produced by IT companies not only for large corporations.
Analytics of large information sets in Big Data databases is also developing. Big Data Analytics and Data Science analytical systems are used by more and more types of business entities to analyze both the markets in which they operate and the complex processes that are conducted, diagnosed and researched in these enterprises. Computerization also covers financial and economic risk management processes. In all these areas of ICT application, building and improving IT technology management models in organizations is an important issue; specific information technology management models should therefore be tailored to the specifics of the operations of a particular business entity, enterprise, company, corporation, public institution or financial institution.
On the other hand, the collection of large data sets about users of specific websites and portals in Big Data database systems generates new categories of information security risk. The database of a social media portal such as Facebook is already a powerful collection of information. Some research centers specializing in the use of large Big Data sets downloaded from social media portals prepare, through sentiment analysis, reports that can be helpful in forecasting phenomena and processes. Medicine is one of the areas with great opportunities in this regard. For example, insurance companies and commercial banks that grant loans may be interested in information posted by users on Facebook and possibly also on other social media sites. Apparently, some insurance companies and commercial banks, during the analysis of an application for insurance or credit, look at the information content of the accounts and profiles that the applicant, potential client or contractor has posted on social media portals.
Another area of application of analytics carried out on large data sets collected in Big Data database systems is sentiment analysis: surveying the opinions of Internet users regarding specific products and/or services and the companies producing them. Large amounts of information downloaded from comments, entries and posts on social media portals are processed in Big Data database systems to determine, for example, consumer awareness of the product and service offer of specific companies. This type of information is of great importance for planning advertising campaigns informing about the mission, idea, product offer and usability features of a given company's offer, and may be relevant to forecasting changing consumer preferences for specific companies' offers. Techniques for collecting analytical data on the Internet can be supported by the use of 5G technology.
I am also involved in research on knowledge management using computerized Big Data database platforms. In my publications available on the Research Gate portal, I have described the key determinants of the development of Big Data technology and the security of information obtained from the Internet, collected and processed in Big Data databases. I have also described the development of analytics using the Business Intelligence platforms that are used in enterprises. Business Intelligence-based analytics, as well as Data Science and Big Data Analytics, are increasingly being used to improve business management processes. The development of this analytics, based on the implementation of ICT and Industry 4.0 technologies into analytical processes, has a great future ahead of it in the coming years. I invite you to cooperate.
One of the areas in which the possibilities of applying market analytical technology, including data downloaded from Internet portals, are growing is the marketing of enterprises and institutions. In recent years, the development of marketing has been determined by the development of Industry 4.0 technology and the development of open innovations on the Internet. Open innovations developed on the Internet concern, among others, free information and marketing services. The possibility of publishing specific content, texts, banners, comments, etc. on the Internet and obtaining free information are key determinants of the development of information services on the Internet. On the other hand, the largest Internet technology corporations earn income mainly from paid marketing services. Therefore, the Internet environment is a kind of mix of free and paid information and marketing services, which are developed simultaneously and in an interrelated way by various Internet companies. Currently, research is being conducted into the development of open innovations in the field of free information services, which are the main factor in the business success of the largest online technology companies, including concerns such as Google and social media portals such as Facebook, Instagram, YouTube, Twitter, LinkedIn and others.
The development of Internet information services will be determined by technological progress in new ICT, communication technologies and the advanced data processing techniques typical of the current technological revolution referred to as Industry 4.0. The development of information processing technology in this era is determined by the use of new information techniques, for example in e-commerce and e-marketing. These solutions are the basis for the business success of the largest online technology concerns that offer information search, data collection and cloud processing services (e.g. Google) and that provide information services on platforms developed as social media portals (e.g. Facebook, Instagram, YouTube, Twitter, LinkedIn, Pinterest, and more).
The current technological revolution referred to as Industry 4.0 is driven by the development of the following factors: Big Data database technologies, cloud computing, machine learning, the Internet of Things, artificial intelligence, Business Intelligence and other advanced Data Mining technologies.
The information technologies mentioned above, combined with the improvement of ICT and communication technologies and the progressive increase in the computing power of computers, will become an important determinant of technological progress in various branches of industry in the coming years. Based on these new technological solutions, innovatively organized analyses of large information collections gathered in Big Data database systems and in cloud computing have been developing dynamically in recent years for applications in fields such as machine learning, the Internet of Things, artificial intelligence and Business Intelligence. To this can be added other areas of advanced technologies for analyzing large data sets, such as Medical Intelligence, Life Science, Green Energy, etc. Processing and multi-criteria analysis of large data sets in Big Data database systems is performed according to the 4V concept: Volume (a large amount of data), Value (large values of specific parameters of the analyzed information), Velocity (the high speed at which new information appears) and Variety (high information diversity). The above-mentioned advanced technologies for processing and analyzing information are increasingly used for the marketing activities of various business entities that advertise their offer on the Internet or that analyze the needs reported in this regard by other entities, including companies, corporations, and financial and public institutions. More and more commercially operating business entities and financial institutions conduct marketing activities on the Internet, including on social media portals. The possibilities of collecting market data on the Internet can be significantly expanded in the coming years by using 5G technology.
The information and communication technologies listed above, combined with the improvement of ICT and the implementation of Business Intelligence analytics into economic, financial, macroeconomic and market analyses, can be important instruments for the efficient and effective management of economic and investment processes and enterprises, including analyses carried out to improve marketing activities in enterprises. More and more companies, banks and other entities need to carry out multi-criteria analyses on large data sets downloaded from the Internet that describe the markets in which they operate and the contractors and clients with whom they cooperate. There are also already specialized technology companies offering this type of analytical service: they prepare commissioned reports resulting from such multi-criteria analyses of large data sets obtained from various websites and from entries and comments on social media portals. An important research technique that has been developing in recent years, the results of which are used for companies' marketing activities, is sentiment analysis carried out on large data sets collected from the Internet and stored in Big Data database systems.
In order to group the behavior of social media users into specific classes of behavior, these classes must first be defined. Sentiment analysis using large data sets collected from entries and comments on social media portals and transferred to Big Data database platforms can be helpful here. Then, when observing changes in certain types of behavior of social media users, you can analyze the data collected in Big Data according to these observations. In addition, a useful tool can be an analysis of the behavior of users of social media portals based on current posts, entries and comments on specific social media pages, and statistical analysis of comments on specific topics of posts. This type of research is carried out by the online technology companies that run social media portals, and the results are used to develop their viral marketing services, because this field of marketing is a key determinant of the revenue these companies generate from advertising sales on social media portals. The basis of marketing activities conducted in this way is market research: collecting market data from the Internet regarding the offer of individual companies, their competition and the demand for specific products and services among Internet users, and then collecting, processing and analyzing this data in Big Data Analytics database and analytical systems. The process of collecting market data from specific websites can be improved by using 5G technology.
Industry 4.0 technologies are also used in the development of transaction systems and transaction security in e-commerce and online banking. The key determinants of globally developing e-commerce relate primarily to the implementation of ICT and of the advanced data processing technologies typical of the current Industry 4.0 revolution into the computerized, automated transaction systems supporting online trading, as well as the use of blockchain technology for transaction security systems and data transfer on the Internet. The use of these technologies in online transaction systems supporting e-commerce already applies to almost the entire functioning of online stores, from computerized sales support systems, logistics, accounting, reporting and risk management to Internet marketing activities and the improvement of security systems for online transactions. Another important determinant of e-commerce development is the development of online mobile banking available on mobile devices and new solutions related to Internet of Things technology. Online banking, whose development is determined by technological progress in ICT and Industry 4.0 technologies, is starting to dominate. Computerization is also increasingly affecting public sector institutions servicing tax systems and the settlements of business entities. In addition, Business Intelligence analytical platforms supporting the management processes of companies operating in the e-commerce sector have been developed for several years. Analytics of large information sets in Big Data databases is also developing, and Big Data Analytics and Data Science analytical systems are also used by businesses operating in the field of e-commerce.
In recent years, new Internet marketing instruments have also been developed, used mainly on social media portals and also by companies operating in the e-commerce sector. Internet technology and fintech companies are also emerging that offer information services on the Internet to support marketing management, including the planning of advertising campaigns for products sold via the Internet. To this end, sentiment analyses are used to survey Internet users' opinions regarding the awareness, recognition, brand image, mission and offer of specific companies. Sentiment analysis is carried out on large data sets downloaded from various websites, including millions of social media pages, collected in Big Data systems. The analytical data collected in this way are very helpful in planning advertising campaigns carried out in new media, including social media portals; these campaigns advertise products and services sold via the Internet, available in online stores. In view of the above, the development of e-commerce is determined mainly by technological progress in ICT and advanced Industry 4.0 data processing technologies, and by new technologies used to secure financial transactions carried out via the Internet, including e-commerce transactions, e.g. blockchain technology. I have described the above issues concerning various aspects of the application of information systems and ICT, including Big Data and Business Intelligence, in companies operating on the Internet in my scientific publications available on the Research Gate portal. I invite you to cooperate.
According to the above, in my opinion, the use of 5G technology to collect data from the Internet will significantly contribute to improving the sentiment analysis of Internet users' opinions and to extending the research techniques carried out with Business Intelligence, Big Data Analytics, Data Science and other techniques using ICT, the Internet and the advanced data processing typical of the current fourth technological revolution, referred to as Industry 4.0. At present, however, not all the potential applications of 5G technology in economic and other fields are known. These applications will be wide-ranging, both in the business processes carried out by technological Internet companies and in those of security institutions. Globally operating Internet technology companies, thanks to the use of 5G technology in research processes, will improve their offer of information, Internet and marketing services addressed to Internet users. On the other hand, national security institutions and the IT systems risk management departments operating in companies can also obtain a tool enabling a significant improvement of the instruments ensuring a high level of security of information transferred via the Internet and other aspects of cybersecurity. Therefore, research on cybersecurity and e-commerce will be expanded to include the impact of 5G technology on the development of many aspects of these areas of activity of business entities, institutions and citizens increasingly using the Internet in various areas of business.
In view of the above, in my opinion in the coming years one of the key applications of artificial intelligence integrated with other Industry 4.0 technologies, including Big Data Analytics, will be improvement of information search on the Internet.
Best wishes.
Dariusz Prokopowicz
  • asked a question related to Computational Data Mining
Question
4 answers
I want to execute the Apriori algorithm for association rule mining in data mining through MATLAB.
Relevant answer
Answer
I work on the same topic and I use
it works with a small dataset, but with a large database like mushroom or BMS1 or other datasets it takes time.
Who has a solution for that? @Shafagat Mahmudova
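For reference, the algorithm itself is compact regardless of language; the question asks about MATLAB, but the sketch below shows the level-wise logic in plain Python (toy transactions, and without the subset-pruning step a full implementation would add). For large datasets such as "mushroom", FP-Growth is usually much faster than Apriori:

```python
# Compact Apriori sketch: count candidate itemsets, keep the frequent ones,
# and extend them level by level. Illustrative only (no subset pruning).
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}       # level 1 candidates
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # join frequent k-itemsets into (k+1)-itemset candidates
        keys = list(level)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
    return frequent

data = [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}, {"bread"}]
freq = apriori(data, min_support=0.5)
print(sorted((sorted(k), v) for k, v in freq.items()))
```

The running time blows up with the number of frequent itemsets, which is why dense datasets like mushroom are slow: each level rescans all transactions for every candidate. Vertical representations (tid-lists) or FP-trees avoid these rescans.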
  • asked a question related to Computational Data Mining
Question
2 answers
Data Mining: Sources of Data that can be mined
1.Files
  • Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
  • Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored in flat files, there will be no relations between the tables.
  • Flat files are described by a data dictionary. E.g.: CSV file.
  • Application: used in data warehousing to store data, used for carrying data to and from a server, etc.
2.Relational Databases
  • A relational database is a collection of data organized in tables with rows and columns.
  • The physical schema of a relational database defines the structure of the tables.
  • The logical schema of a relational database defines the relationships among the tables.
  • The standard API of relational databases is SQL.
  • Application: Data Mining, ROLAP model, etc.
3.DataWarehouse
  • A data warehouse is a collection of data integrated from multiple sources that supports querying and decision making.
  • There are three types of data warehouse: enterprise data warehouse, data mart and virtual warehouse.
  • Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
  • Application: Business decision making, Data mining, etc.
4.Transactional Databases
  • A transactional database is a collection of data organized by time stamp, date, etc. to represent transactions in databases.
  • This type of database can roll back or undo an operation when a transaction is not completed or committed.
  • It is a highly flexible system where users can modify information without changing any sensitive information.
  • Follows ACID property of DBMS.
  • Application: Banking, Distributed systems, Object databases, etc.
5.Multimedia Databases
  • Multimedia databases consist of audio, video, image and text media.
  • They can be stored in object-oriented databases.
  • They are used to store complex information in pre-specified formats.
  • Application: Digital libraries, video-on demand, news-on demand, musical database, etc.
6.Spatial Database
  • Store geographical information.
  • Stores data in the form of coordinates, topology, lines, polygons, etc.
  • Application: Maps, Global positioning, etc.
7.Time-series Databases
  • Time-series databases contain data such as stock exchange data and user-logged activities.
  • Handles array of numbers indexed by time, date, etc.
  • It requires real-time analysis.
  • Application: eXtremeDB, Graphite, InfluxDB, etc.
8.WWW
  • WWW refers to the World Wide Web, a collection of documents and resources (audio, video, text, etc.) identified by Uniform Resource Locators (URLs) through web browsers, linked by HTML pages, and accessible via the Internet.
  • It is the most heterogeneous repository as it collects data from multiple resources.
  • It is dynamic in nature as Volume of data is continuously increasing and changing.
  • Application: Online shopping, Job search, Research, studying, etc.
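To make the flat-file case (source 1 above) concrete, here is a minimal Python sketch (the column names are invented for illustration) that parses CSV rows into records a mining algorithm could consume:

```python
import csv
import io

# in-memory stand-in for a flat file on disk; in practice use open("data.csv")
flat_file = io.StringIO("id,age,label\n1,34,yes\n2,51,no\n")

# each row becomes a plain dict -- no relations between records,
# exactly the "flat file" situation described above
records = list(csv.DictReader(flat_file))
```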
  • asked a question related to Computational Data Mining
Question
3 answers
I have many singleton species (species represented by only one DNA sequence). These singletons lack resolution potential because they are single; when machine learning classifiers are applied, singletons are not considered, as there is no reference for them. Even using other relevant methods, we cannot resolve singleton species confidently.
Please suggest any method/program that could simulate or generate reference sequences from those singleton species. These reference sequences could then be used to resolve singleton species in a multi-species sequence dataset.
Relevant answer
Answer
I have read the question carefully, and what you have written in brackets does not mean singleton species.
  • asked a question related to Computational Data Mining
Question
6 answers
Price optimization methods and algorithms are used to determine the best price or set of prices for companies' business offerings. In our project https://www.researchgate.net/project/Dynamic-Pricing-Algorithms-and-Models-using-Artificial-Intelligence
we are working on dynamic pricing algorithms and models using artificial intelligence. We would like to hear from expert researchers about dynamic pricing models and algorithms. What are the best-of-breed dynamic pricing algorithms and models using artificial intelligence?
  • asked a question related to Computational Data Mining
Question
13 answers
I have a dataset with 56 variables: 4 dependent and 52 independent, each with 1,500 observations. Of the independent variables, 45 are categorical (nominal) and the rest continuous; 3 of the dependent variables are categorical (ordinal) and the rest continuous. I want to check whether the independent variables have any effect on each dependent variable.
Relevant answer
Answer
If you are interested in the general properties of your data set, like any multivariate data set with non-trivial signals encoded in a matrix, its strength to infer groups, you should give network-based exploratory data analysis a try.
You'll find a bunch of related posts with ideas and links to literature on the Genealogical World of Phylogenetic Networks (GWoN)
Especially the neighbour-net is a most versatile tool for multivariate analysis, however, largely unknown and massively underused. Its applications in phylogenetics or linguistics are multi-fold but it can be very revealing for entirely different data (you didn't mention whether you are working with biological or other data), a few of the examples we explored on GWoN:
A network of gun legislation in the U.S. (to illustrate diversity) – https://phylonetworks.blogspot.com/2018/03/visualizing-us-gun-laws.html
A network of moons (for the purpose of classification) – https://phylonetworks.blogspot.com/2018/06/to-boldy-go-where-no-one-has-gone.html
A network of party programmes (to illustrate overall similarity and dissimilarity, also over time) – https://phylonetworks.blogspot.com/2018/10/jumping-political-parties-in-germanys.html
Visualisation of the data behind rankings, e.g.
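For the original question (categorical predictors vs. a categorical response), a standard first step is a chi-square test of independence on each predictor/response contingency table. A minimal sketch (statistic and degrees of freedom only; for the p-value you would compare against a chi-square distribution, e.g. via scipy.stats):

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an
    r x c contingency table (rows: one variable's levels, columns: the other's)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof
```

A table like [[20, 0], [0, 20]] gives statistic 40.0 with 1 degree of freedom (strong association), while [[10, 10], [10, 10]] gives 0.0 (independence).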
  • asked a question related to Computational Data Mining
Question
2 answers
chenliang li
Relevant answer
Answer
Hello, please check the MATLAB and Python software.
  • asked a question related to Computational Data Mining
Question
3 answers
Greetings everyone,
I have to select relevant features from the KDD99 dataset. I am going to use the bat algorithm. To use the bat algorithm, is it necessary to convert the dataset into binary or not? I don't know how to proceed. Can anyone please tell me?
Relevant answer
Answer
Yes, for sure; this application has already been made.
  • asked a question related to Computational Data Mining
Question
23 answers
I need to do some comparison with other methods for a new rule of combination under Dempster-Shafer theory. I would like to use the same data used in ‘Combining Multiple Hypotheses for Identifying Human Activities’ by Young-Woo Seo and Katia Sycara. Unfortunately, those data are no longer available at http://www.cs.utexas.edu/users/sherstov/pdmc/ . This data set was originally released for a Physiological Data Modeling Contest (PDMC) at the site cited above. Is there someone who can provide me with the data or point me to a site where I can get it?
Relevant answer
Answer
Dear Ibrahim Musa I'm interested in using the PDMC dataset for my thesis. Could I get a copy too? Thank you!!
  • asked a question related to Computational Data Mining
Question
13 answers
I have data in which the change in the weight of materials is recorded over time. Unfortunately, because of a special condition, I cannot record the weight during the first 75 seconds.
- Is there any way to predict the missing initial data (the change in weight during the first 75 seconds)?
- How can I find the equation of the curve that fits the data points?
Any solution in MATLAB, SPSS, or Excel is appreciated.
Relevant answer
Answer
Hello,
please try standard statistical forecasting techniques and check which one is suitable for your data.
Good luck!
  • asked a question related to Computational Data Mining
Question
3 answers
How can we retrieve the location of tweets even when users have turned location off, and which features can be used to infer location? Is there any existing work on this?
Relevant answer
Answer
Hi Umair Arshad,
Please follow the papers below.
1. Srivastava, S. K., Gupta, R., & Singh, S. K. (2018). Simple Term Filtering for Location-Based Tweets Classification. In Speech and Language Processing for Human-Machine Communications (pp. 145-152). Springer, Singapore.
2. Liu, R., Cong, G., Zheng, B., Zheng, K., & Su, H. (2018, July). Location Prediction in Social Networks. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data (pp. 151-165). Springer, Cham.
3. Ozdikis, O., Ramampiaro, H., & Nørvåg, K. (2018, March). Spatial Statistics of Term Co-occurrences for Location Prediction of Tweets. In European Conference on Information Retrieval (pp. 494-506). Springer, Cham.
4. Stowe, K., Anderson, J., Palmer, M., Palen, L., & Anderson, K. (2018). Improving Classification of Twitter Behavior During Hurricane Events. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media(pp. 67-75).
5. Li, P., Lu, H., Kanhabua, N., Zhao, S., & Pan, G. (2018). Location Inference for Non-geotagged Tweets in User Timelines. IEEE Transactions on Knowledge and Data Engineering.
Thanks,
Sobhan
  • asked a question related to Computational Data Mining
Question
5 answers
Is there a way to automate such a process whilst having all the additional data of the sequence stored in a nice legible manner?
Relevant answer
Answer
Dear Asif,
Please have a look at these Python packages:
pydna
and
Biopython
Documentation for Biopython can be found at
Hope this helps.
  • asked a question related to Computational Data Mining
Question
6 answers
Are accuracy and correctly classified instances the same measure? If so, are their formulas also the same in Weka?
Relevant answer
Answer
They are the same: correctly classified instances are the sum of TP and TN, and incorrectly classified instances are the sum of FP and FN. The number of correctly classified instances divided by the total number of instances gives the accuracy. In Weka, the percentage of correctly classified instances is the accuracy of the model. Cheers!
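The relationship can be sketched in a few lines, with the confusion-matrix cell counts as inputs:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = correctly classified instances / all instances."""
    correct = tp + tn            # Weka's "correctly classified instances"
    total = tp + tn + fp + fn
    return correct / total
```

For example, accuracy(50, 40, 5, 5) is 0.9, which Weka would report as 90% correctly classified instances.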
  • asked a question related to Computational Data Mining
Question
9 answers
Hi. I have a query regarding text classification. I have a list of words with the following attributes: word, weight, class. The class can be positive or negative; the weight is between -1 and 1. How can I train a classifier like SVM using this word list to classify unseen documents? An example in any tool is welcome.
Relevant answer
Answer
Weka, RapidMiner, and the sklearn Python library are easy to use for classification.
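Before training an SVM, a word list of (word, weight, class) entries already yields a simple lexicon baseline: sum the weights of the words that appear in a document and classify by the sign. A hedged sketch (the lexicon below is invented for illustration):

```python
def classify(tokens, lexicon):
    """Sum lexicon weights (in [-1, 1]) over the document's tokens;
    a positive total is classified as positive, otherwise negative."""
    score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return "positive" if score > 0 else "negative"

# toy lexicon: word -> weight (made up for illustration)
lexicon = {"good": 0.8, "great": 0.9, "bad": -0.9, "awful": -1.0}
```

The same per-word weights can equally be used as feature values in the document vectors fed to an SVM.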
  • asked a question related to Computational Data Mining
Question
5 answers
Could you please suggest some good articles (original research or reviews) on comparative studies (comparing performance with respect to accuracy, ranking, etc.) of text similarity measures like cosine similarity, BM25, and language modeling?
Relevant answer
Answer
Perhaps, you should concentrate on filtering models. Also, consider a couple of parallel corpora of a different nature.
  • asked a question related to Computational Data Mining
Question
6 answers
Is there any API available for collecting the Facebook data-sets to implement Sentiment analysis.
Relevant answer
Answer
You can use the aforementioned datasets, or, if you want to scrape the data yourself, there is the Facebook Graph API.
PYLON provides access to previously unavailable Facebook topic data, at a price.
If you don't exclusively want Facebook datasets, you can easily get data from other sources (Twitter, Google, Wikipedia) using Pattern.
  • asked a question related to Computational Data Mining
Question
16 answers
I have a finite set of subjects. I want to find which subject a tweet of a given Twitter user belongs to, so that I can learn the user's topics of interest. Which classifier would be most suitable for tweets, which have a small number of words?
Relevant answer
Answer
Basically, it is a topic classification problem with many nuances. Random Forest and SVM have been giving quite good results. However, note that the pre-processing steps, the feature selection technique, and the text representation scheme (bag of words, topic model, n-grams) will have a bearing, apart from the model you choose.
  • asked a question related to Computational Data Mining
Question
4 answers
I want to cluster about 3 million tweets by their content; I need to cluster similar tweets for event detection. How can I use Weka for this purpose?
Relevant answer
Answer
The purpose of clustering is to detect grouping structure. The k-means algorithm can easily be converted to incremental clustering; more details can be found here: https://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/sk_means.htm
I think it is a good idea to apply batch clustering first and then compare it with incremental clustering.
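The sequential (incremental) k-means step referenced above can be sketched as: assign each arriving point to its nearest centroid, then move that centroid towards the point by 1/n, where n is the cluster's updated size. A minimal sketch on plain numeric vectors:

```python
import math

def sequential_kmeans_step(centroids, counts, point):
    """One incremental k-means update: assign `point` to the nearest
    centroid and shift that centroid by a 1/n running-mean step."""
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(centroids[i], point))
    counts[nearest] += 1
    step = 1.0 / counts[nearest]
    centroids[nearest] = [c + step * (x - c)
                          for c, x in zip(centroids[nearest], point)]
    return nearest
```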
  • asked a question related to Computational Data Mining
Question
4 answers
For incremental clustering of tweets, how can I get the cosine similarity between the clusters and a tweet, so that the tweet can be added to one of the clusters? As I understand it, we can create a tf-idf weight vector for a tweet and the same for each cluster, and then compute the similarity between two equal-length vectors. I can create the vector for the tweet, but I do not know how to do it for each cluster.
Relevant answer
Answer
You may follow these papers on incremental clustering algorithms.
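One common choice is to represent each cluster by the centroid of its members' tf-idf vectors, so the cluster vector has the same form as a tweet vector. A minimal sketch with sparse vectors as dicts (the example terms are invented):

```python
import math

def centroid(vectors):
    """Cluster vector = component-wise mean of the member tf-idf vectors."""
    terms = set().union(*vectors)
    n = len(vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / n for t in terms}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

cluster = [{"storm": 1.0, "rain": 1.0}, {"storm": 1.0}]
tweet = {"storm": 1.0}
similarity = cosine(tweet, centroid(cluster))
```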
  • asked a question related to Computational Data Mining
Question
6 answers
I want to do some experiments with wireless sensor networks. I am looking for dataset repositories, like the UCI ML repository or other web resources, from which to download such datasets. Ideally, I would like a dataset with a large number of sensors measuring different values at different times.
Relevant answer
Answer
You can also contact the TPC at www.tpc.org. They are coming up with an IoT benchmark that includes a kit that generates artificial data. The kit is still under development, but you can show your interest.
  • asked a question related to Computational Data Mining
Question
6 answers
We need three high-dimensional datasets (500 to 1000 features) with small sample sizes (< 500 samples) for our project, on which we will run our feature selection algorithm. If you have any ideas or download links, that would be great for us.
Relevant answer
Answer
Hello Salih,
I think gene expression microarray data is a good choice for you; the features may range from 6,000 to 60,000.
  • asked a question related to Computational Data Mining
Question
3 answers
Compared with other analytics/recommender tools and libraries like WEKA, RStudio, GitHub, Surprise, MyMediaLite, LensKit, LibRec, etc., how good is Azure Machine Learning Studio?
Relevant answer
Answer
Disclaimer: I have worked for a company working on both the Amazon Web Services and Azure machine learning toolboxes. At the time of writing, Azure Machine Learning Studio is richer with respect to capabilities, a conclusion reached while comparing the tools at the aforementioned company. Amazon might catch up, but currently I do not see that happening. Happy to hear if someone disagrees with me; I myself work on AWS all of my working days.
Ease of use: please go and test it and compare it to the tools you mentioned. Frankly, I don't get how GitHub is included in a list of machine learning tools; GitHub and Bitbucket are both online code version control repository hosting services, not ML tools.
  • asked a question related to Computational Data Mining
Question
3 answers
I have a text file on which clustering has to be performed; finally, after clustering, the cosine similarity between the vectors has to be determined.
Relevant answer
Answer
If you need MATLAB, please contact me.
  • asked a question related to Computational Data Mining
Question
1 answer
I want a dataset with a parent-child hierarchy for event/sub-event detection. Can anyone point me to a link?
Relevant answer
Answer
You should try looking at the data sets in the Machine Learning Repository as there are many different ones there. Plus the bonus is that other people will be able to use the data set to repeat your experiments if you publish. Also you will be referred to previous users of the data set you choose. http://archive.ics.uci.edu/ml/ 
  • asked a question related to Computational Data Mining
Question
10 answers
I have 1200 instances in total and I want to do binary classification. One class, "yes", has 900 instances, and the second class, "NO", has 300. How can I know the dataset is unbalanced, and which techniques could be applied in Weka to resolve this issue?
Relevant answer
Answer
Your dataset is imbalanced, since your class distribution is not 50/50 or 60/40. If you use decision trees, you might not need to balance your dataset. Otherwise, you can use undersampling to balance it (i.e., use all of the smaller class and randomly select the same number of majority-class instances, several times, to make multiple datasets, then combine all classification results); this is the best way to do it without losing information. If you use boosting, you could alter the weights and balance the data that way. As mentioned above, the best way to decide is to explore the data fully. If any other attributes are also imbalanced over their values, this will also affect classification results.
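The undersampling idea above (keep the full minority class, draw an equally sized random subset of the majority class) can be sketched like this:

```python
import random

def undersample(majority, minority, seed=0):
    """Build one balanced training set: the whole minority class plus
    a random sample of the majority class of the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

yes_instances = list(range(900))          # stand-in for the 900 "yes" rows
no_instances = list(range(900, 1200))     # stand-in for the 300 "NO" rows
balanced = undersample(yes_instances, no_instances)
```

Repeating this with different seeds gives the multiple balanced datasets mentioned above, whose classification results can then be combined. In Weka, the same effect is available via the SpreadSubsample or Resample filters.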
  • asked a question related to Computational Data Mining
Question
4 answers
I want to find similarity of data based on analysis of a set of features.
E.g., let's say I am looking at a watch; then the make, brand, type, and style can be the set of influencers.
I can label-encode these features because they are all discrete variables.
On top of this, I want to add continuous variables like discount given, price, and stock availability to filter out alternative possibilities.
How do we achieve this ? Any thoughts and feedback would be appreciated.
Relevant answer
Answer
OK, thanks everyone. I am going through the two docs from @Dmitri and @Mateo. I will start with a KNN model and evolve this further. Will keep you posted.
  • asked a question related to Computational Data Mining
Question
4 answers
In the training phase, I applied the k-means clustering algorithm to a dataset (k=10). I want to apply a decision tree to each cluster and generate a separate performance model for each cluster.
In the testing phase, I want to compare each test instance with the centroid of each cluster and use the appropriate model to classify the instance.
Is there a way I can achieve this in WEKA or WEKA API. Any link/resource will be appreciated
Relevant answer
Answer
Hi Johnson,
There are several ways to do that; for example, you can use WEKA's KnowledgeFlow as follows:
the training set is passed into SimpleKMeans, after which a prediction model is built using J48 based on the generated clusters. The whole process is evaluated using cross-validation.
NB: The attached image shows the flow, where FilteredClassifier contains both the clustering and classification algorithms.
HTH.
Samer
  • asked a question related to Computational Data Mining
Question
4 answers
Hi, I am working on uncertain data. When I convert a certain value into an uncertain interval value such as [19,21) based on an equation, the question is: under which datatype do I have to define this interval, especially in an ARFF file in Weka?
Thanks
Thanks 
Relevant answer
Answer
Dear Hamdi, I appreciate your help. This interval is just a simple example showing how to convert a certain numerical attribute value into an uncertain one. As you know, each attribute has many values, and the range between the upper and lower bounds is not the same for all values, so perhaps the suggestion to split the intervals will work; but the problem we may then face is within the classifier: how does it deal with two attributes as one?
  • asked a question related to Computational Data Mining
Question
2 answers
I'm trying to use MLlib in Java (SVM), but my input files contain string and nominal attributes. Any solutions?
Relevant answer
Answer
Hi,
you can use LIBSVM, which is developed as a C program.
  • asked a question related to Computational Data Mining
Question
2 answers
I understand that weka's J48 uses Gain Ratio as a method in determining how to split the tree, but when I compare the first attribute in the attribute ranking and the attribute by which my decision tree becomes split first, they are different (the first in rank in the attribute selector is the attribute "Domain", while the attribute by which the tree was split first was "MgO"). Why is it so? Aren't they supposed to be the same?
Relevant answer
Answer
Oh. I get it now. Thank you very much! :D
  • asked a question related to Computational Data Mining
Question
6 answers
For decision trees, we can compute either the information gain and entropy or the Gini index to decide which attribute should be the splitting attribute. Can anyone send a worked-out example of the Gini index?
Relevant answer
Answer
Lets assume we have 3 classes and 80 objects. 19 objects are in class 1, 21 objects in class 2, and 40 objects in class 3 (denoted as (19,21,40) ).
The Gini index would be: 1- [ (19/80)^2 + (21/80)^2 + (40/80)^2] = 0.6247 i.e. costbefore = Gini(19,21,40) = 0.6247
In order to decide where to split, we test all possible splits. For example splitting at 2.0623, which results in a split (16,9,0) and (3,12,40):
After testing x1 < 2.0623:
costL =Gini(16,9,0) = 0.4608
costR =Gini(3,12,40) = 0.4205
Then we weight branch impurity by empirical branch probabilities:
costx1<2.0623 = 25/80 costL + 55/80 costR = 0.4331
We do that for every possible split, for example x1 < 1:
costx1<1 = FractionL Gini(8,4,0) + FractionR Gini(11,17,40) = 12/80 * 0.4444 + 68/80 * 0.5653 = 0.5472
After that, we choose the split with the lowest cost. Here that is the split x1 < 2.0623, with a cost of 0.4331.
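The worked example above can be checked with a few lines of code (any language; Python shown):

```python
def gini(*counts):
    """Gini impurity of a node given its per-class object counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_cost(left, right):
    """Branch impurities weighted by empirical branch probabilities."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return n_left / n * gini(*left) + n_right / n * gini(*right)
```

gini(19, 21, 40) reproduces 0.6247, and split_cost((16, 9, 0), (3, 12, 40)) reproduces 0.4331 from the example.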
  • asked a question related to Computational Data Mining
Question
3 answers
I am new to data mining research; I know the theory and process, what to do and how. My purpose is to:
1. calculate Pearson's item-based correlation coefficient to measure the relations among items;
2. look at the target users and use a technique to obtain predictions. For that I want to use Weighted Sum: this method computes the prediction on an item i for a user u as a sum of the preferences given by the user on the items similar to i, weighted by the similarity S(i,j) between items i and j.
But I don't quite know how to realize this technique using software like Weka or others. I found CorrelationAttributeEval() in Weka's feature selection, but that calculates the correlation between the class attribute and the other attributes, which is exactly what I don't want. I want to calculate the correlation between items within the same attribute; for example, ProductID is an attribute, and I want to calculate correlations among the different products identified by ProductID, not calculate and rank correlations between ProductID and other attributes.
Please help me with how to do that, the steps involved, and which software is best suited (my preference is Weka, if it is possible there). Thanks in advance.
Relevant answer
Answer
You may follow these articles on item similarity in data mining:
Sarwar et al., "Item-Based Collaborative Filtering Recommendation Algorithms" (WWW10; www10_sarwar.pdf)
IJERT paper V2I6-IJERTV2IS60377
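The two steps in the question (item-item Pearson correlation, then a weighted-sum prediction) may be easier to script directly than to force into Weka's attribute-selection panels. A minimal sketch, with rating values invented for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long rating lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x)) *
           math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den if den else 0.0

def weighted_sum(user_ratings, similarities):
    """Prediction for a target item i: sum over rated items j of
    S(i, j) * r_uj, normalised by the sum of |S(i, j)|."""
    num = sum(similarities[j] * r for j, r in user_ratings.items())
    den = sum(abs(similarities[j]) for j in user_ratings)
    return num / den if den else 0.0

# co-rating vectors of two items over the same users (invented)
sim_ab = pearson([5, 3, 4, 4], [4, 2, 4, 3])
prediction = weighted_sum({"a": 4.0, "b": 2.0}, {"a": 1.0, "b": 0.5})
```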
  • asked a question related to Computational Data Mining
Question
5 answers
I am looking for a web log file from a web server, to be used for web usage mining.
Relevant answer
Answer
Thank you, Mr. Alaa Khalaf.
  • asked a question related to Computational Data Mining
Question
8 answers
Can I use any predictive or machine learning approach to improve the quality of health care, or can I use it for disease prediction?
Relevant answer
Answer
There are several health-related datasets available publicly; you can look at them and try applying machine learning/modeling. They will serve as an excellent starting point.
  • asked a question related to Computational Data Mining
Question
5 answers
Can someone recommend how to divide the data into training and testing sets?
Relevant answer
Answer
I know the most realistic division can be achieved by taking 10% of the data for testing and 90% for learning, to avoid such problems as overfitting and learning failure. You can consult my dissertation on opinion mining and lexical affect sensing for a list of corpora I used for this purpose.
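The 90/10 split mentioned above can be sketched as follows (shuffling first so the held-out 10% is not biased by record order):

```python
import random

def train_test_split(data, test_frac=0.1, seed=42):
    """Shuffle the indices, then hold out test_frac of the records."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cut = int(len(data) * test_frac)
    test = [data[i] for i in indices[:cut]]
    train = [data[i] for i in indices[cut:]]
    return train, test
```

k-fold cross-validation (Weka's default is 10-fold) is usually preferred over a single split, since every record then appears once in a test fold.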
  • asked a question related to Computational Data Mining
Question
1 answer
any 
Relevant answer
Answer
Dear Girish Babu M C,
I have done some similar tasks with R. In that case we extracted from the database the factors of the points where we knew that (for example) rice/sugarcane exists. Then, knowing those factors, we built a linear regression model with the "glm" function. In the end we applied the model to the image, and it showed the locations where the presence of rice/sugarcane was predicted.
Best of luck,
Ali Madad
  • asked a question related to Computational Data Mining
Question
9 answers
Given a user-defined constraint on a sequential database, how can the association rules be found?
Relevant answer
Answer
Thanks a lot Sir.
  • asked a question related to Computational Data Mining
Question
2 answers
I am currently working on mining sequential classification rules with Hadoop. I am searching for big labeled sequential data, with sequences of items like <a,b,a,b,a> and a class label for each sequence, i.e., (<a,b,a,b,a> : class).
Does anyone have access to such big data?
Relevant answer
Answer
1- The New York Times Annotated Corpus
2- Clueweb dataset
3- TREC Web track dataset
  • asked a question related to Computational Data Mining
Question
1 answer
I was trying to implement the USD algorithm (paper title: Discretization Oriented to Decision Rules). However, I have some doubts:
1. At line 20 of the algorithm it is written that Ii has the same majority class as Ii+1. What is the meaning of this?
2. At line 20 of the algorithm it is written that there is a tie in Ii or Ii+1. A tie of what?
3. What is the purpose of lines 14 & 15 when lines 11 & 12 already cover all consecutive intervals?
Relevant answer
Loop testing algorithms are suitable.
  • asked a question related to Computational Data Mining
Question
9 answers
We need an example of a real-life dataset, preferably related to engineering, with binary attributes for each element (we are going to analyze the data with formal concept analysis techniques).
This dataset should let us derive knowledge with which to accept or discard new elements whose attributes differ from those of the previous elements.
Thanks for your support.
Relevant answer
Answer
Hi Jose,
This UCI dataset may be useful for your experiements: https://archive.ics.uci.edu/ml/datasets/SPECT+Heart
Kind regards,
Olivier PARISOT
  • asked a question related to Computational Data Mining
Question
3 answers
Dear ResearchGater,
I'm trying to identify associations between keywords in PubMed. A prototypical search would be: listing kinases (ranked by number of publications) that are associated with the keyword cancer or inflammation. Does anybody know of an easily accessible tool that can perform such a search? Thank you for your help.
Relevant answer
Answer
Thank you, Laurent, for the information. I will take a look at it. I hope it is user-friendly, because my knowledge of programming is limited. Regards
  • asked a question related to Computational Data Mining
Question
2 answers
I want to find the tightest lower and upper bounds for frequent subgraphs in uncertain graph data, and also the densest frequent subgraph. Please advise.
Relevant answer
Answer
Dear all,
I would like a large graph dataset for graph mining in MATLAB.
Thanks a lot.
  • asked a question related to Computational Data Mining
Question
5 answers
I compute the weights using the dot product of the HOG features, according to an article I found.
Relevant answer
Answer
Thanks, that really helped me.
  • asked a question related to Computational Data Mining
Question
4 answers
I am trying to analyze the text corpus of tickets from a ticket system in order to cluster the requests according to the words used in them. I use the vector space model and have calculated the values for tf, idf, tf-idf, and entropy.
Due to the number of dimensions (about 8,500), I would like to extract the key words. Is there any statistical approach based on a specific threshold for entropy/tf-idf? Of course I could require the document frequency to be at least x and the entropy at least y, but how can I verify these choices?
Thanks for your help!
Relevant answer
Answer
Dear Yaakov and Amina,
thanks for your answers. Your papers are very interesting and gave me a good overview of the various techniques.
First I calculated the tf-idf for the whole collection in order to get the most specific words of the corpus.
Now I calculate the tf-idf for every document and choose the top n elements by tf-idf value; I still have to evaluate a "good" value for n.
I hope this approach leads to a good result.
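The per-document top-n selection described above can be sketched as follows (log-idf weighting; n could then be tuned by checking downstream clustering quality, e.g. with silhouette scores):

```python
import math
from collections import Counter

def tfidf_top_n(docs, n):
    """For each tokenised document, keep the n terms with the
    highest tf-idf weight (tf normalised by document length)."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    keep = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] / len(doc) * math.log(N / df[t]) for t in tf}
        keep.append(sorted(weights, key=weights.get, reverse=True)[:n])
    return keep

docs = [["a", "a", "b"], ["b", "c"], ["c", "c", "d"]]
top_terms = tfidf_top_n(docs, 1)
```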
  • asked a question related to Computational Data Mining
Question
1 answer
I am currently working on a project that requires detailed knowledge of work that has been done on computer screen capturing, screen recording during usability testing sessions, etc. The focus is on the work done, not on the available tools and their documentation.
Relevant answer
Answer
  • asked a question related to Computational Data Mining
Question
2 answers
We create a new set of labels for classifier B as follows:
if the prediction by A is a true positive or true negative, the label for B is 0;
if the prediction by A is a false positive or false negative, the label for B is 1.
We train B accordingly (using the same features as for A).
Then, on new data, we modify the outcome of A based on the outcome of B.
However, this expectedly worsens performance if B itself gives too many false outcomes.
I am new to machine learning, so not aware of better ways to do this. Please help.
Relevant answer
You have to look at stacked generalization: http://www.cs.utsa.edu/~bylander/cs6243/wolpert92stacked.pdf
  • asked a question related to Computational Data Mining
Question
15 answers
I am just starting to get into data mining, and I want to know where I should start my work.
Thank You.
Relevant answer
Answer
Hello,
Data mining is part of the knowledge discovery process presented by Fayyad, so I would advise studying the whole framework first in order to get a better understanding of what data mining is.
To my understanding, data mining is about using some kind of algorithm (closely associated with machine learning) to retrieve patterns from your data.
Data mining is bound to your data. By knowing the type of data you want to mine, you may get insight into the kind of data mining you need. For example, if you use textual data as input, you will have to focus on "text mining".
Hope it helps a bit.
  • asked a question related to Computational Data Mining
Question
1 answer
I want to implement an ECC algorithm in the Cooja simulator and compare the performance of RSA and ECC on IoT (Internet of Things) nodes. I am using Contiki and Cooja.
Relevant answer
Answer
  • asked a question related to Computational Data Mining
Question
4 answers
In one of my recent studies (I will post it here after it gets published), I discovered a correlation (0.42, moderate) between the number of inhabitants of a city and the level of development of the official city hall website. When I explained that to my students, they offered a different view: they never (or rarely) visit the web page, yet they know almost everything new in their city through Facebook. If that's true, the correlation should be much stronger. So my questions are: How can I find out how involved the new generation is in city life through social media networks? How can I measure that?
Relevant answer
Answer
Dear Catalin, 
One easy method would be to take a suitable sample of the new generation and survey their degree of involvement in city life through their individual social media usage.
Hope this is useful.
Regards- 
  • asked a question related to Computational Data Mining
Question
1 answer
I have to select relevant features from the KDD99 dataset. I am going to use the bat algorithm. To use the bat algorithm, is it necessary to convert the dataset into binary or not? I don't know how to proceed. Can anyone please tell me?
Relevant answer
Answer
Yes, it is used in the bat algorithm.
  • asked a question related to Computational Data Mining
Question
6 answers
I am using R to create a DocumentTermMatrix and then converting it to an ordinary matrix for further processing. There are 26,000 records and 4 columns, and the conversion hangs. I am using the syntax abc <- as.matrix(dtm), where dtm contains the document-term matrix. My computer's RAM should be sufficient. Can anybody suggest a solution?
Relevant answer
Answer
You should probably check the memory status while as.matrix() is executing, depending on your operating system (on Windows, leave the Task Manager open). How much RAM do you have in GB? Are you using a 32- or 64-bit computer?
  • asked a question related to Computational Data Mining
Question
9 answers
I want to know what the best tool for machine learning on big data is. I have used the MLlib API in the Spark framework and would like to test another tool.
Relevant answer
Answer
This depends on your capability in handling the tools. To make this clear: Mahout is a machine learning tool that lets you execute a number of algorithms, but it was developed to run on top of Hadoop, which means that over time you may need to write or modify code to address some problems, and you will face the challenge of writing MapReduce code. Spark, however, removes the headache of always having to write MapReduce code. Spark supports many machine learning algorithms that are easy to use, so I would advise you to go for it. You can see Spark in action by visiting this link:
Like I said, it all depends on the problem you are trying to solve and your capability in handling the tools. You can also use ECL-ML (https://hpccsystems.com/download/free-modules/ecl-ml) to process large volumes of data. Also note that WEKA, the popular machine learning tool, supports distributed data mining, which means you can run many WEKA algorithms in distributed mode (http://weka.sourceforge.net/packageMetaData/distributedWekaHadoop/index.html).
All the best.
  • asked a question related to Computational Data Mining
Question
5 answers
any researchers
Relevant answer
Alternatively, you can also use RapidMiner. WEKA is good, but RapidMiner offers a better GUI.
Cheers,
Diky
  • asked a question related to Computational Data Mining
Question
15 answers
I need ways to statistically measure the quality of the clusters formed after applying a data mining clustering technique such as k-means.
Relevant answer
Answer
It really depends on whether you know the ground-truth clusters or not.
In the first case, you can use "external clustering quality metrics". A widely used metric here is Purity, but there are others, such as Homogeneity, Completeness, V-Measure, Precision, Recall, Normalized Mutual Information, and CMM (for streaming environments).
In the second case, that is, when no ground truth is known, you can use metrics that account only for the distribution of the data itself. These are called "internal clustering quality metrics". Examples include the Silhouette coefficient and the Sum of Squared Distances (SSQ).
My master's dissertation covers these subjects at least partially, but it is in Portuguese. I will leave a link here in case someone else needs it too.
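To make the external-metric case concrete, here is a minimal pure-Python sketch of Purity (my own illustration with made-up cluster assignments and labels, not taken from any of the cited works):

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Purity: for each cluster, count its most frequent true class,
    sum those counts, and divide by the total number of points."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(true_labels)

# Toy example: 6 points assigned to 2 clusters.
clusters = [0, 0, 0, 1, 1, 1]
truth    = ["a", "a", "b", "b", "b", "a"]
print(purity(clusters, truth))  # 4/6, since each cluster has a 2-point majority
```

A purity of 1.0 means every cluster contains points of a single class; note that purity alone rewards trivially many small clusters, which is why metrics like V-Measure are often reported alongside it.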
  • asked a question related to Computational Data Mining
Question
7 answers
After forming clusters of bank customers using various algorithms, such as k-means and decision trees, what are the ways we can validate the clusters formed by our method?
Relevant answer
Answer
Dear Mayur, here is the link to clustering validation in general:
I recently published a paper on clustering in banking using the SOM methodology, where we followed a predetermined methodology provided by Viscovery (the software we used). However, we also provided a logical validation of the clusters. It would be even better if you could have the clusters validated by experts from the field and present this in the paper (we do not have this in ours).
Here are the links:
Bach, M. P., Juković, S., Dumičić, K., & Šarlija, N. (2014). Business Client Segmentation in Banking Using Self-Organizing Maps. South East European Journal of Economics and Business, 8(2), 32-41.
  • asked a question related to Computational Data Mining
Question
2 answers
For example: user 1 plays items 1 to 6 with the following numbers of plays, respectively: 5, 0, 3, 3, 6, 1.
I want to derive values in the range 0 to 1 for the user's interests based on the number of plays. I do not want to divide the number of clicks on an item by the total number of clicks, or by the maximum.
Relevant answer
Answer
Do you want to derive a rating for items over all users' clicks, or evaluate items for each user separately?
If you are evaluating each user separately, in my opinion the best way is simply to divide by the maximum number of clicks.
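If dividing by the maximum or the total is ruled out, one alternative worth considering (my own suggestion, purely illustrative) is a saturating transform such as x / (x + k), which maps any play count into [0, 1) without reference to the other items; the constant k is an arbitrary tuning choice:

```python
def saturating_score(count, k=3):
    """Map a non-negative play count to [0, 1) independently of
    other items: score = count / (count + k).
    Smaller k makes scores approach 1 faster."""
    return count / (count + k)

plays = [5, 0, 3, 3, 6, 1]
print([round(saturating_score(c), 2) for c in plays])
# [0.62, 0.0, 0.5, 0.5, 0.67, 0.25]
```

Unlike max-division, this keeps scores comparable across users with different activity levels, at the cost of never reaching exactly 1.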
  • asked a question related to Computational Data Mining
Question
2 answers
translation lexicon in different languages 
Relevant answer
Answer
Is it Arabic-English, English-Arabic, or German-Arabic?
  • asked a question related to Computational Data Mining
Question
1 answer
Hi all,
I have installed and set up RHadoop in a VMware environment.
Recently, I tried to install the following packages:
1) rhdfs_1.0.8, which failed with this error message:
(as ‘lib’ is unspecified)
Error in getOctD(x, offset, len) : invalid octal digit
2) Rmpi_0.6-5, which failed with:
configure: error: "Cannot find mpi.h header file"
ERROR: configuration failed for package ‘Rmpi’
* removing ‘/home/aisyah/R/x86_64-redhat-linux-gnu-library/3.2/Rmpi’
Has anyone had the same experience, or any suggestions?
Relevant answer
Answer
Usually, in situations like this you need to install the development ("dev" or "-devel") package of the relevant library for your Linux distribution, which contains the required header files. For the Rmpi error above, that means the package providing mpi.h (the MPI development headers).
  • asked a question related to Computational Data Mining
Question
1 answer
To start off research work.
Relevant answer
Answer
It is possible to reformulate physics formulas such that the parameter spaces used by the functions that represent fields are Euclidean spaces instead of spacetime geometries with a Minkowski signature. When formulated with this much simpler parameter space, most equations of GRT become simpler and much more compact. For example, with this choice the multi-dimensional differentiation operator can be treated as a multiplying operator.
  • asked a question related to Computational Data Mining
Question
9 answers
Let's say you have a few data visualization options: pie chart, bar chart, tabular, histogram.
Using the features and data values, how can a program "calculate" the most suitable data visualization for that dataset?
Relevant answer
Answer
I think the challenge is in arriving at a reasonable metric. By way of example, take the interpretation objectives which Benno suggested (which I also think are a good feature to include). Interpretation objectives have to be defined in a way that makes them separable by the classifier given a training dataset for a supervised learning algorithm, a task which I think has less to do with ML than with human nature and the task at hand (not impossible, but difficult).
Just a thought
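As a toy starting point before any learned metric, a program can encode such choices as explicit rules over dataset features. The function below is a rough, purely illustrative heuristic; the feature names and thresholds are my own assumptions, not an established method:

```python
def suggest_chart(n_categories=None, is_numeric=False, parts_of_whole=False):
    """Very rough rule-based chooser, purely illustrative:
    - numeric continuous data        -> histogram
    - few categories summing to 100% -> pie chart
    - categorical data otherwise     -> bar chart
    - anything else                  -> tabular"""
    if is_numeric:
        return "histogram"
    if n_categories is not None:
        if parts_of_whole and n_categories <= 6:
            return "pie chart"
        return "bar chart"
    return "tabular"

print(suggest_chart(is_numeric=True))                      # histogram
print(suggest_chart(n_categories=4, parts_of_whole=True))  # pie chart
print(suggest_chart(n_categories=20))                      # bar chart
```

A learned approach would replace these hand-written rules with a classifier trained on (dataset features, preferred chart) pairs, which is exactly where the labeling problem discussed above comes in.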
  • asked a question related to Computational Data Mining
Question
3 answers
We tried using PSO, but it takes a long time, especially with 6 variables and a lag of 2. We also tried using linear regression to obtain the intercepts and coefficients, but we believe better results can be achieved with other parameter search techniques.
Relevant answer
Answer
To set up a VAR model, one needs to make sure that:
1. The variables are stationary (by differencing if necessary). This is very important, as any linear regression on non-stationary series may produce a spurious association.
2. If there is a seasonal effect, you can either add seasonality to your model or filter it out using Census X-13. Differencing to remove a seasonal effect, or adding one, might influence your result if you are trying to infer a causal relation.
3. Select the best lag order. For me this is the most difficult part.
 i) You can use an information criterion such as AIC or HQ to select the lag order, or
 ii) start with a lag order based on theory or knowledge of the field and increase it gradually.
4. After selecting the lag order, test for autocorrelation in the residuals using the Portmanteau test or the Lagrange multiplier test. If autocorrelation exists, change the lag order by one until there is none.
5. You can also run heteroskedasticity and normality tests on the residuals.
6. Prior to all of this, log-transform the data to remove non-linearity (e.g., exponential growth) if present.
These are the steps I would follow, but I am not sure, so take my advice at your own risk :p. You can also test for cointegration; if it exists, you can set up a VEC model without differencing the data, even where it is non-stationary in levels. One good thing about the VECM is that it captures the long-term association in addition to the short-term ones. :p But again, I am just doing my master's :p I am still learning.
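Step 1 (differencing to obtain stationarity) can be sketched in a few lines of Python; the example below is a minimal illustration with made-up data, not a substitute for a proper unit-root test such as ADF:

```python
def difference(series, order=1):
    """Difference a series `order` times: y'_t = y_t - y_{t-1}.
    First differencing removes a linear trend (one unit root);
    differencing twice removes a quadratic trend, and so on."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A series with a linear trend is non-stationary;
# its first difference is constant (stationary).
trend = [2 * t for t in range(6)]   # 0, 2, 4, 6, 8, 10
print(difference(trend))            # [2, 2, 2, 2, 2]
```

In practice one would difference each variable until a unit-root test no longer rejects stationarity, then fit the VAR on the differenced series.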
  • asked a question related to Computational Data Mining
Question
7 answers
I am working on text classification ("prediction") using ML algorithms such as Multinomial Naive Bayes and ensemble methods. I trained the classifiers on 70% of the data and tested them on the remaining 30%, recording the error rate for each. Could you suggest 2-3 more techniques for testing and evaluation? I am already planning to use cross-validation, recall, precision, and F-measure.
Relevant answer
Answer
Why not try purity, entropy, accuracy, and mutual information?
For more information about these measures, you can consult the following thesis.
Good luck.
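Since the question mentions precision, recall, and F-measure, here is a minimal pure-Python sketch of those three for a binary task (the labels are made up for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # TP=2, FP=1, FN=1
```

With cross-validation, these would be computed per fold and averaged; for multi-class text data, they are typically macro- or micro-averaged over classes.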
  • asked a question related to Computational Data Mining
Question
5 answers
I am trying to normalize a flow matrix. The goal is to make the matrix pre-magic, such that the sum of each row equals the sum of the corresponding column. I rounded the values to the nearest integer, but the result is no longer pre-magic. What is a possible approach to handle rounding issues like this so that I can normalize properly?
Relevant answer
Answer
Linear min-max normalization: (X - min)/(max - min). Also check whether min == max to avoid division by zero.
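For the rounding part of the question, one common remedy (a sketch under my own assumptions, not necessarily the only approach) is largest-remainder rounding, which rounds a row of floats to integers while preserving the row's rounded total:

```python
import math

def round_preserving_sum(row):
    """Round a list of floats to integers while preserving the
    (rounded) total: floor every entry, then hand the remaining
    units to the entries with the largest fractional parts."""
    floors = [math.floor(x) for x in row]
    remainder = round(sum(row)) - sum(floors)
    order = sorted(range(len(row)),
                   key=lambda i: row[i] - floors[i], reverse=True)
    for i in order[:remainder]:
        floors[i] += 1
    return floors

print(round_preserving_sum([1.4, 2.3, 3.3]))  # [2, 2, 3], total stays 7
```

Applying this row by row keeps row sums exact; making the rounded matrix simultaneously respect the column sums generally needs an extra balancing pass over the columns.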
  • asked a question related to Computational Data Mining
Question
7 answers
Students prediction, educational data mining.
Relevant answer
Answer
Have a look at the Cross Industry Standard Process for Data Mining (CRISP-DM):
It is more concerned with the data mining process and project.
  • asked a question related to Computational Data Mining
Question
2 answers
Lift is a measure used in association rule mining.
Relevant answer
Answer
Thanks for the answer.
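For reference, lift(A → B) = support(A ∪ B) / (support(A) · support(B)); a value above 1 means A and B co-occur more often than independence would predict. A minimal pure-Python sketch with made-up baskets:

```python
def lift(transactions, antecedent, consequent):
    """Compute lift(A -> B) over a list of transactions (sets of items):
    lift = support(A ∪ B) / (support(A) * support(B))."""
    n = len(transactions)
    supp_a = sum(1 for t in transactions if antecedent <= t) / n
    supp_b = sum(1 for t in transactions if consequent <= t) / n
    supp_ab = sum(1 for t in transactions if (antecedent | consequent) <= t) / n
    return supp_ab / (supp_a * supp_b)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
print(lift(baskets, {"bread"}, {"milk"}))  # 0.5 / (0.75 * 0.75)
```

Here lift is below 1, suggesting bread and milk co-occur slightly less often than chance in these toy baskets.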
  • asked a question related to Computational Data Mining
Question
6 answers
** ** ** ** ** **?
Relevant answer
Answer
doesn't work in my case. See PM