Data Mining and Knowledge Discovery


  • Grzegorz Czapnik added an answer:
    Does anybody know about DM research based on data sets generated from Integrated Library Systems / library management systems (LMS)?
    A few years ago S. Nicholson and J. Stanton defined "bibliomining" as the "application of DM tools to large amounts of data associated with library systems". I'm looking for implementations of this concept (from the last 10 years): information about research, projects, tools, algorithms, etc.
    Grzegorz Czapnik

    Thank you, Haleema, for your answer. I found some literature; more of it is listed in my other Q&A. See the first link for details.

    A few days ago I received information about an experiment carried out by my colleagues from the PIONIER Network Digital Libraries Federation (Poznań, Poland). See the second link for details.


  • Waldemar W Koczkodaj added an answer:
    What is the difference between probabilistic approach and knowledge-based approach?


    I have found definitions of both concepts in several papers, but I don't have a clear picture of how to differentiate the two and apply them in data mining / machine learning techniques. I would appreciate it if someone could differentiate both approaches and relate them to data mining and machine learning techniques.

    Thank you in advance.
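    A toy contrast may make the distinction concrete. In a probabilistic approach the model's parameters are estimated from data (here, a Naive Bayes classifier learning P(class | features)); in a knowledge-based approach, expert rules are encoded directly with no learning step. The loan data and the rule below are entirely invented for illustration:

```python
from sklearn.naive_bayes import GaussianNB

# Toy loan records: [income in thousands, number of late payments]
X = [[20, 5], [25, 4], [60, 0], [80, 1], [30, 6], [90, 0]]
y = [1, 1, 0, 0, 1, 0]  # 1 = risky client, 0 = safe client

# Probabilistic: estimate P(risky | features) from the data.
nb = GaussianNB().fit(X, y)
prob_pred = nb.predict([[22, 5]])[0]

# Knowledge-based: a hand-written expert rule, no learning involved.
def expert_rule(income_k, late_payments):
    return 1 if income_k < 40 and late_payments >= 3 else 0

rule_pred = expert_rule(22, 5)
print(prob_pred, rule_pred)
```

    Both routes classify the same client; the difference is only in where the decision boundary comes from (data vs. domain knowledge).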

  • Bin Jiang added an answer:
    Does anybody have experience with data mining over big data?
    I would like to work on optimizing classification, clustering, and logistic regression algorithms for data mining over big data.
    Bin Jiang

    I have some experience working with big data; see examples below:

    Jiang B. (2015, accepted), Wholeness as a hierarchical graph to capture the nature of space, International Journal of Geographical Information Science, xx(x), xx-xx, Preprint:

    Jiang B., Yin J. and Liu Q. (2014, accepted), Zipf’s Law for all the natural cities around the world, International Journal of Geographical Information Science, xx(x), xx-xx, Preprint:

    Jiang B. and Miao Y. (2014, accepted), The evolution of natural cities from the perspective of location-based social media, The Professional Geographer, xx(xx), xx-xx, DOI: 10.1080/00330124.2014.968886,  Preprint:

    Jiang B. and Ma D. (2015), Defining least community as a homogeneous group in complex networks, Physica A: Statistical Mechanics and its Applications, 428, 154-160.

    Jiang B. (2015b), Geospatial analysis requires a different way of thinking: The problem of spatial heterogeneity, GeoJournal, 80(1), 1-13.

    Jiang B. (2015a), Head/tail breaks for visualization of city structure and dynamics, Cities, 43, 69-77.

  • Manish Tembhurkar added an answer:
    Where can I find any useful materials about knowledge grid implementations?

    So far, I have been able to find these papers:

    1. Prototype a Knowledge Discovery Infrastructure by Implementing Relational Grid Monitoring Architecture (R-GMA) on European Data Grid (EDG) by Frank Wang, Na Helian, Yike Guo, Steve Thompson, John Gordon.

    2. Knowledge grid-based problem-solving platform by Lu Zhen, Zuhua Jiang, Jun Liang.

    Thank you in advance for any help.

    Manish Tembhurkar

    Hello Pawel,

    Look into the attached paper, and this link for the application.

    This paper gives you a well-structured idea of the application and extension of Grid technology to knowledge discovery in Grid databases.

    If you are working on larger datasets, I'm certain OLAP could help you and provide better results than any other approach.

    Furthermore, you also need to work on the performance results and usability of such applications.



  • Bhuvaneswari Velumani added an answer:
    How can you make an application form (formulary) for Data Mining problem detection?
    I am having problems making an application form that can be used by the non-technical areas of a financial/insurance company. The idea is to have a filter that lets the Data Mining area identify the possible applications.

    The thing is that users in non-technical areas don't know anything about Data Mining, and the Data Mining area doesn't know much about the business.

    Some of the questions so far are:
    1) Are there any business rules that let you know in advance whether a client will have a problem?
    2) Is there any classification of clients into risky and less risky based on any client characteristic?
    Bhuvaneswari Velumani

    The rules would be hidden in the databases of the deployed applications. You can add a filter to your form for mining. But first, be clear about what you are looking for, and choose the correct data mining task.

  • Raghunandan Palakodety added an answer:
    What is Minimal Transversal of a Hypergraph?

    Hi, this is my first question posted on ResearchGate, so please go easy on me. I tried working through the Dualize and Advance algorithm for generating maximal frequent itemsets. I considered an example as follows



    with a minimum frequency threshold of 2.

    Now, I have a problem understanding how to generate the 'minimal transversals' part of the algorithm.

    I understand that a transversal is a subset of the vertices of the hypergraph that intersects every hyperedge. So the initial set of minimal transversals should be {a, b, c, d, e}, if I am not wrong.

    Could you please explain this 'minimal transversal' part with respect to the transactions?

    Raghunandan Palakodety

    Hi Suvdev Naduvath. Thanks a lot, sir, for the reference paper. But I solved it, as shown above.
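    Since the original example transactions are not shown above, here is a brute-force sketch on an invented hypergraph. A transversal (hitting set) is a vertex subset that intersects every hyperedge; it is minimal if no proper subset is also a transversal. Enumerating candidates by increasing size makes the minimality check easy:

```python
from itertools import combinations

def minimal_transversals(vertices, hyperedges):
    """Enumerate all minimal transversals: vertex subsets that hit every
    hyperedge and contain no smaller subset that also does."""
    found = []
    for r in range(1, len(vertices) + 1):
        for cand in combinations(vertices, r):
            s = set(cand)
            if all(s & e for e in hyperedges):
                # minimal iff no smaller hitting set found earlier is inside it
                if not any(h <= s for h in found):
                    found.append(s)
    return found

# Hypothetical hypergraph (the poster's own example is not visible above)
V = ['a', 'b', 'c', 'd', 'e']
E = [{'a', 'b'}, {'b', 'c'}, {'c', 'd', 'e'}]
print(minimal_transversals(V, E))
```

    For this toy input the minimal transversals are {a,c}, {b,c}, {b,d} and {b,e}; note the full vertex set {a,b,c,d,e} is a transversal but not a minimal one. This exhaustive approach is exponential and only meant to make the definition concrete.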

  • Udai Pratap Rao added an answer:
    Can anyone tell me about real life live scenarios where Privacy Preserving Distributed Data Mining has been actually applied?

    There are numerous hypothetical examples of "Privacy Preservation in Distributed Data Mining" in literature. However, in practice can anyone give me scenarios where it has been actually applied?

    Udai Pratap Rao

    You may see below scenarios:

    Source: Internet

    1. Suppose we have a server and many clients, where each client has a set of sold items (e.g., books, movies, etc.). The clients want the server to gather statistical information about associations among items in order to provide recommendations to the clients. However, the clients do not want the server to know some strategic patterns (also called sensitive association rules). In this context, the clients represent companies and the server is a recommendation system for an e-commerce application, for example, the fruit of the clients' collaboration. In the absence of ratings, which are used in collaborative filtering for automatic recommendation building, association rules can be effectively used to build models for on-line recommendation. When a client sends its frequent itemsets or association rules to the server, it must protect the sensitive itemsets according to some specific policies. The server then gathers statistical information from the non-sensitive itemsets and recovers from them the actual associations. These companies may benefit from such collaboration by sharing association rules while preserving some sensitive association rules.

    2. Two organizations, an Internet marketing company and an on-line retail company, have datasets with different attributes for a common set of individuals. These organizations decide to share their data for clustering to find the optimal customer targets so as to maximize return on investment. These organizations learn about their clusters using each other's data without learning anything about the attribute values of each other.
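    The client-side protection step in scenario 1 can be sketched as follows. The itemsets and the sensitivity policy below are invented; a mere filter is not a full sanitization algorithm, but it shows the idea of withholding sensitive itemsets and their supersets (which would let the server re-derive them):

```python
def share_itemsets(frequent, sensitive):
    """Return only the itemsets that are safe to send to the server:
    drop every sensitive itemset and every superset of one."""
    return [s for s in frequent
            if not any(sens <= s for sens in sensitive)]

# Invented client-side data
frequent = [{'bread'}, {'milk'}, {'bread', 'milk'}, {'beer'}, {'beer', 'milk'}]
sensitive = [{'beer'}]  # strategic pattern the client wants to keep private
print(share_itemsets(frequent, sensitive))
```

    The server then aggregates only the disclosed itemsets; real rule-hiding schemes additionally perturb supports so that hidden patterns cannot be inferred from what remains.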

  • Ahmad T Siddiqui added an answer:
    Is there any article which discussed case study / application of privacy in distributed data mining?

    I want to know about real case studies of privacy threats caused by association rule mining (on distributed or centralized databases).

    Ahmad T Siddiqui


    try these links:

    Hope it helps...

  • Saeed Khazaee added an answer:
    How to normalize the KDD Cup dataset for training?

    To identify intrusions in the dataset, normalizing the data is necessary. But I don't know how to normalize the 42 fields. Which are the important fields I have to take to process the data? Please help me.

    Saeed Khazaee

    Certainly, data pre-processing should be performed. I strongly recommend that you employ feature selection, sampling, conversion, normalization, and so on. But to focus on your question: you can use min-max normalization for all features after the necessary conversions.
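    Min-max normalization itself is a one-liner per feature. A minimal sketch; the three KDD-style rows are invented, and symbolic fields such as protocol_type would first need the conversions mentioned above:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column (feature) of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)  # avoid division by zero
    return (X - mn) / rng

# Toy rows with numeric KDD-style fields (duration, src_bytes, count)
X = [[0, 181, 8], [2, 239, 19], [0, 235, 29]]
print(min_max_normalize(X))
```

    Note the min and max must come from the training split only, and the same values must be reused for the test split, otherwise information leaks between the two.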

  • Sebastián García added an answer:
    Can anyone provide suggestion on the collection of a dataset similar to KDDcup 99 dataset?

    How can we collect a dataset similar to the KDD Cup 99 dataset in a real environment? We want to check the performance of an unsupervised anomaly detection algorithm on a real dataset.

  • M. Ramakrishna Murty added an answer:
    From where I can get / download dataset for semantic trajectories mining?


    M. Ramakrishna Murty

    Please try the following links:

  • Mohammad Ayaz Ahmad added an answer:
    How to predict an evolution of a network?
    I'm working on a data mining topic. I have an attributed graph of the inflation network of developing countries. My vertex attributes include Population, Exchange rate to US $, Price level of consumption (PC), Ratio of GNP to GDP (%), etc. Can the covariation among vertex descriptors tell me whether the vertex attributes are structurally correlated or not? Is there any way to predict the future evolution of my graph after a given period? If yes, what do my vertices stand for?
    Mohammad Ayaz Ahmad

    Dear Uwizeye

    Please read this article for your nice question:

    Predicting the evolution of spreading on complex networks

    and the web link is:

  • C.T.A. Schmidt added an answer:
    Can anyone recommend a course specifically designed for going into the big data business?
    Many researchers come to big data from other areas but are not specifically "tagged" for big data jobs or research. Any angle on teaching big data sciences in a higher-education institute would be appropriate.
    C.T.A. Schmidt

    Thank you for these answers. As I recently became interested in big data, I wrote a commentary about its social aspects; cf. my publications within RG or the discussion here and in :

  • Lukas Pfäffle added an answer:
    Calculating Bonacich´s power in Social Network Analysis. Should we look at the absolute value of the index scores or negative scores have relevance?

    Hi! I am performing a social network analysis using UCINET and trying to calculate Bonacich's power (beta < 0). However, it is not clear to me (according to the literature I have reviewed) how the outputs of UCINET should be interpreted: just looking at the absolute values, as we do with Bonacich's centrality, or taking negative scores into account? Thank you very much in advance for your responses.

    Lukas Pfäffle

    Dear Amanda Jiménez,

    A late answer, but maybe it can still help you or someone else. For the beta score you have to take both positive and negative values into account. Negative values mean less power for those actors, since they are connected to other well-connected actors; high positive values are explained by connections to many other actors which are not well connected overall.
    Here is an explanation for UCINET:

    Best regards.
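    For readers without UCINET: the standard Bonacich formula c(alpha, beta) = alpha (I - beta A)^(-1) A 1 can be evaluated directly. A sketch on a hypothetical 4-node path graph; UCINET's normalization may differ, so compare signs and relative magnitudes rather than raw numbers:

```python
import numpy as np

def bonacich_power(A, beta, alpha=1.0):
    """Bonacich power index c = alpha * (I - beta*A)^(-1) A 1.
    With beta < 0, being tied to well-connected actors lowers the score."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - beta * A, A @ np.ones(n))

# Hypothetical path graph 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
scores = bonacich_power(A, beta=-0.3)
print(scores)
```

    On this path the endpoints (each tied to a relatively well-connected middle node) score lower than the middle nodes, which matches the interpretation in the answer above.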

  • Andreas Briese added an answer:
    Is there an existing open source code or library for hyperheuristic data mining?

    Hi ,

    Please, is there an existing open source code or library for hyperheuristic data mining?

    Thank you

    Andreas Briese

    I know WEKA and ELKI aren't explicitly hyperheuristic, but if you're interested in practical use rather than a theoretical approach, you might find a solution there.




    Hope that helps.


    EDIT: Maybe Eureqa would do the job. I remember there is an open-source version free of charge, but I can't find it now; maybe you need to go through Hod Lipson's video introductions to Eureqa to find the link.

  • John F. Wilhite added an answer:
    Should studies published in ResearchGate and refereed by ResearchGate peer reviewers be counted as scholarly publications?
    Yes, in my humble opinion. Firstly, because scientists have an interest in disseminating their intellectual work through an interactive, secure and free medium. Secondly, traditional journals use the same pool of academics as referees to review submitted articles. And thirdly, it allows for post-publication peer review.
    John F. Wilhite

    RG is not a publisher, it is a web site.  Uploading material to RG is not "publishing," it is "posting," like posting a notice on a bulletin board.  To be a legitimate publishing web site there must be in place a qualified editorial and review staff.  With some modification RG has the potential to become a "pre-publication" sounding board or a "publication preparation conduit."  This would be a useful service to both writers and publishers.  Allowing anyone, qualified or not, to write a review of RG "postings" is not professionally adequate; RG would have to limit reviews to those with the necessary expertise.

  • Afaq Ahmad added an answer:
    Are PhD/Masters Theses and Dissertations or Journals/Conference proceedings the best sources for a Literature Review? Why?
    When reviewing the literature for trends in technological advancement and future directions, we may use either PhD and Master's theses and dissertations or journal and conference papers. But which will be more useful, or give the most rigorous review?
    Afaq Ahmad

    Dear John K. Marco Pima

    PhD and Master's theses and dissertations give us quick access to methodologies and tools, along with the scope of future work, since these documents are quickly available to research groups at the same institute. Hence, referring to them is essential if their research idea, methodology, tool or comparison benchmark is used.

    Since theses and dissertations are not accessible or publicized immediately, they are not commonly referred to by other researchers in the same field. And because most research papers are drawn from thesis and dissertation work, referring to the theses themselves is generally not a prime issue.

    Conference papers commonly discuss new ideas, tools, methodologies, technologies and comments, but not to the extent of completeness. Some good conferences adhere to a standard, like IEEE Xplore. Standard conference papers, and/or conference papers from the same research groups or fields, are justified for referencing.

    Standard journal papers are the best source of research breakthroughs; make a practice of reading them. They are therefore the best source of references.

  • Sanjay Garg added an answer:
    Is there any effective algorithm known for applying rule hiding techniques on temporal multilevel association rules?

    We are working on privacy preserving issues in temporal multilevel association mining and want to know which algorithm is the most effective in practice/real deployment/research for this purpose at present.

    Sanjay Garg

    Mr. Robert,

    Thanks for giving this answer, your suggestion is really useful to me.

  • Remy Fannader added an answer:
    Is it possible to explain the operational/tactical/strategic classification in terms of real-time KM ?

    The basic idea is to use events and the processing of associated data to bridge the gap between operational and decision-making systems.

    Remy Fannader

    I will assume that internal events are meant to be managed, which means that risks are by definition external. Then I will use the classic distinction:

    Operational: full information on the external state of affairs allows for immediate appraisal of prospective states.

    Tactical: a partially defined external state of affairs allows for periodic appraisal of prospective states in sync with production cycles.

    Strategic: an undefined external state of affairs does not allow for periodic appraisal of prospective states in sync with production cycles; its definition may also be affected through feedback.

  • Ibrahim Abubakar added an answer:
    How do you classify documents using WEKA?

    Please, I need help on how to go about classifying documents using WEKA. I am classifying the dissertations of my department. The assignment is to do an ontology-based classification of 250 dissertations against Gartner's Hype Cycle. The aim is to determine which field of IT each dissertation falls into.


    Ibrahim Abubakar

    Thank you, it really helps. I will try using some of the suggested classifiers.
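    The usual WEKA route is the StringToWordVector filter followed by a classifier. The same bag-of-words pipeline can be sketched in scikit-learn; the documents and Hype-Cycle-style field labels below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy dissertation abstracts with invented IT-field labels
docs = [
    "neural network training deep learning models",
    "cloud computing virtual machines deployment",
    "convolutional networks image recognition learning",
    "cloud storage scalable infrastructure services",
]
labels = ["AI", "Cloud", "AI", "Cloud"]

# TF-IDF vectorization (WEKA's StringToWordVector analogue) + Naive Bayes
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["deep learning image models"])[0])
```

    With 250 real dissertations the same two steps apply: vectorize the text, then train any classifier on the labeled subset; the ontology would supply the label vocabulary.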

  • Marc Scheibel added an answer:
    Can anyone help to determine the time lag between flood gauges from upstream to downstream?
    Can anyone suggest how we should calculate time lags of historical flow data between upstream and downstream gauges for urban flood prediction (considering that, for various flood events, the time lags between upstream and downstream flow stations can change depending on precipitation characteristics and the hydrologic condition of the catchments)?
    Marc Scheibel

    There are flood routing methods like Muskingum and Kalinin-Milijukov, with which you can estimate the translation and retention effects in the river section.
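    Alongside routing models, a purely data-driven estimate for a single event is the lag at which the cross-correlation of the two gauge series peaks. A sketch on a synthetic flood wave; real series would need resampling to a common time step first, and the lag should be re-estimated per event since it varies with hydrologic conditions:

```python
import numpy as np

def estimate_lag(upstream, downstream):
    """Lag (in time steps) at which the cross-correlation of the two
    mean-removed flow series peaks; positive means downstream trails upstream."""
    u = np.asarray(upstream, float) - np.mean(upstream)
    d = np.asarray(downstream, float) - np.mean(downstream)
    corr = np.correlate(d, u, mode="full")
    return int(np.argmax(corr) - (len(u) - 1))

# Synthetic Gaussian flood wave arriving 3 steps later, attenuated, downstream
t = np.arange(50)
upstream = np.exp(-0.5 * ((t - 20) / 4.0) ** 2)
downstream = np.roll(upstream, 3) * 0.8
print(estimate_lag(upstream, downstream))
```

    For the example the estimated lag is 3 steps, matching the imposed delay; attenuation of the peak does not bias the estimate because correlation is scale-invariant in the argmax.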

  • Ian Kennedy added an answer:
    Domain specific search?
    I have been guiding research scholars in the area of domain-specific ontology searching techniques. If anyone would like to share their expertise with published papers, I would really appreciate it.
    Ian Kennedy

    We have a paper in press where we used an exemplary glossary as the basis for extracting the ontology of a domain. (We were then able to establish the hierarchy of dependencies.)

  • Mohamed Mohsen Gammoudi added an answer:
    How can one filter uninteresting rules in multilevel association mining processes?

    During any association mining process it is a big challenge to remove uninteresting rules. We are interested in effective formal and experimental methods for determining the interestingness of multilevel rules.

    Mohamed Mohsen Gammoudi

    You could read this paper:

    Maybe it will help you.
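    One simple formal filter, regardless of level, is to prune rules whose lift is close to 1, meaning the antecedent and consequent are nearly independent. The supports and the threshold below are invented for illustration:

```python
def lift(support_ab, support_a, support_b):
    """lift(A -> B) = P(A and B) / (P(A) * P(B)); ~1 means independence."""
    return support_ab / (support_a * support_b)

# Invented rules with their joint and marginal supports
rules = [
    ("bread -> milk", lift(0.30, 0.50, 0.55)),    # ~1.09: barely above chance
    ("diapers -> beer", lift(0.12, 0.15, 0.20)),  # 4.0: strong association
]
interesting = [name for name, l in rules if l >= 1.5]
print(interesting)
```

    For multilevel rules, an additional common filter drops a descendant rule whose support and confidence are close to what its ancestor rule already predicts, since it adds no new information.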

  • Dr. Indrajit Mandal added an answer:
    Is there any way to do manual pruning of the result (decision trees) of a trained Random Forest model?

    The idea is to do online pruning using a continuous timestamped dataset. I want to train my model using some data, then improve it during the day with other information that I may receive (i.e. active learning, weather conditions, etc.). It could be by pruning (i.e. removing some tree branches) or by adding more branches to the current leaves. Are there any R packages that support such an implementation? Which would be the best way to do so? Thank you in advance.

    Dr. Indrajit Mandal

    Hello friend, yes, it's possible if the dataset contains a small number of attributes; otherwise it may become a highly tedious job.

    You can see my publications for more details about applications of trees:

    Mandal, I., Sairam, N. New machine-learning algorithms for prediction of Parkinson's disease (2014) International Journal of Systems Science, 45 (3), pp. 647-666. DOI: 10.1080/00207721.2012.724114

    Mandal, I., Sairam, N. Accurate telemonitoring of Parkinson's disease diagnosis using robust inference system (2013) International Journal of Medical Informatics, 82 (5), pp. 359-377. DOI: 10.1016/j.ijmedinf.2012.10.006

    Mandal, I., Sairam, N. Accurate prediction of coronary artery disease using reliable diagnosis system (2012) Journal of Medical Systems, 36 (5), pp. 3353-3373. DOI: 10.1007/s10916-012-9828-0

    Mandal, I., Sairam, N. Enhanced classification performance using computational intelligence (2011) Communications in Computer and Information Science, 204 CCIS, pp. 384-391. DOI: 10.1007/978-3-642-24043-0_39

    Mandal, I. Software reliability assessment using artificial neural network (2010) ICWET 2010 - International Conference and Workshop on Emerging Trends in Technology 2010, Conference Proceedings, pp. 698-699. DOI: 10.1145/1741906.1742067

    Mandal, I. A low-power content-addressable memory (CAM) using pipelined search scheme (2010) ICWET 2010 - International Conference and Workshop on Emerging Trends in Technology 2010, Conference Proceedings, pp. 853-858. DOI: 10.1145/1741906.1742103

    Mandal, I. Developing new machine learning ensembles for quality spine diagnosis (2014) Knowledge-Based Systems, available online 19 October 2014, ISSN 0950-7051.

    Mandal, I. A novel approach for accurate identification of splice junctions based on hybrid algorithms (2014) Journal of Biomolecular Structure and Dynamics
    pp. 1-10 | DOI: 10.1080/07391102.2014.944218 PMID: 25203504

    Thank you.

    Best regards,

    Dr. Indrajit Mandal
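    The question above asked about R packages; as a language-agnostic illustration of the pruning idea, here is a scikit-learn sketch using cost-complexity pruning (ccp_alpha) at training time, which shrinks every tree in the forest. This is not the online, post-hoc branch removal the poster asked for, but it shows how pruning changes forest size:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Same forest twice: unpruned vs cost-complexity pruned
full = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
pruned = RandomForestClassifier(n_estimators=25, ccp_alpha=0.01,
                                random_state=0).fit(X, y)

def total_nodes(forest):
    """Total node count across all trees in the forest."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)

print(total_nodes(full), total_nodes(pruned))
```

    True online updating (growing or cutting branches as timestamped data arrives) is closer to incremental/streaming tree methods such as Hoeffding trees than to standard random forests.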

  • Surya Narayan Panda added an answer:
    What kind of "Research Tools" have you used during your PhD journey?
    I have collected over 700 tools that can help researchers do their work efficiently. It is assembled as an interactive Web-based mind map, titled "Research Tools", which is updated periodically. I would like to know your personal experience on any kind of "Research Tools" that you are using for facilitating your research.
    Surya Narayan Panda

    Using Google Scholar, Google Patents, IEEE, Springer, etc.

  • Peter Kokol added an answer:
    Do anyone know where to find links to public nursing data-sets?
    We are performing knowledge extraction from various health care data-sets. We have found just one nursing data-set, but we need more.
    Peter Kokol

    Thank you so much, the links are very helpful.

  • Brian Prasad added an answer:
    What techniques do you consider the most effective for capturing heuristics rules and best-practices during product design & development and why?
    Engineers often like to use "best practices" (data, information, knowledge, wisdom) during product development. Some of the data/information come from their experiences working on the job. Others (best practices) are derived from analytical, functional, logical or physical phenomena.
    Brian Prasad


    Is your article listed on ResearchGate or on any web site? Can you add a link to the article you have referred to here?

  • Sanjay Garg added an answer:
    Is OOB error rate always random?

    A bootstrapped sample, and a tree built using the same. Why is the OOB error estimate a random value?

    Sanjay Garg

    OOB predictions are based on a small subsample of the trees in the forest and are thus at a disadvantage relative to predictions that can legitimately be based on the entire forest.

    OOB predictions are not expected to exhibit better performance, because their distribution is not the same as the distribution on which the individual trees are grown.

    The advantage of randomness is obviously to get a better ensemble. The following article may help you in this regard.
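    The randomness is easy to observe: refitting the same forest on the same data with a different seed gives a different OOB estimate, because the bootstrap samples (and hence the out-of-bag sets) change. A small scikit-learn sketch with an invented dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Same data, same settings; only the seed (hence the bootstraps) varies.
scores = [RandomForestClassifier(n_estimators=30, oob_score=True,
                                 random_state=seed).fit(X, y).oob_score_
          for seed in range(3)]
print(scores)
```

    The spread across seeds shrinks as n_estimators grows, since each sample is then out-of-bag for more trees and its OOB prediction averages over a larger committee.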

  • James F Peters added an answer:
    How to classify indexing techniques?
    The index is one of the main design concepts, and indexes allow access to the stream data in a multitude of ways. When a stream is originally recorded, an index is created. This is called the primary index and is used for access to the stream of data.

    The question is: Can we classify the indexing techniques in two categories (such as non-intelligent, and Intelligent method)?
    James F Peters

    One way to solve this problem is to start by selecting clue words that represent subject categories and can be used to pigeonhole each document into the subject category corresponding to a clue word. For more about this, see page 408 in

    M.E. Maron, Automatic indexing: An experimental inquiry:

    Maron makes the following observation:

    The fundamental thesis says, in effect, that statistics on kind, frequency, location, order, etc., of selected words are adequate to make reasonably good predictions about the subject matter of documents containing those words (p. 405).
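    Maron's clue-word idea can be sketched in a few lines. The categories and clue lists below are invented, and a real system would weight words by the statistics Maron describes (frequency, location, etc.) rather than by raw overlap counts:

```python
# Invented clue-word lists mapping categories to characteristic terms
clue_words = {
    "medicine": {"patient", "clinical", "diagnosis"},
    "computing": {"algorithm", "software", "database"},
}

def index_document(text):
    """Assign the document to the category whose clue words it mentions most."""
    words = set(text.lower().split())
    return max(clue_words, key=lambda cat: len(clue_words[cat] & words))

print(index_document("A new algorithm for database query optimization"))
```

    On the classification question in the thread: this rule-based variant would sit on the "non-intelligent" side of the proposed split, while a learned model over the same clue-word statistics would sit on the "intelligent" side.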
