# Data Mining and Knowledge Discovery

3
Calculating Bonacich's power in Social Network Analysis: should we look at the absolute value of the index scores, or are negative scores relevant?

Hi! I am performing a social network analysis using UCINET and trying to calculate Bonacich's power (beta < 0). However, it is not clear to me (from the literature I have reviewed) how the UCINET outputs should be interpreted: should we look only at the absolute values, as we do with Bonacich's centrality, or take the negative scores into account? Thank you very much in advance for your responses.

Dear Amanda Jiménez,

A late answer, but maybe it can still help you or someone else. For the beta score you have to take both positive and negative values into account. Negative values mean less power for those actors, because they are connected to other well-connected actors; high positive values result from connections to many other actors who are not themselves well connected.
Here is an explanation for UCINET:
http://faculty.ucr.edu/~hanneman/nettext/C10_Centrality.html#Bonacich

Best regards.
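As a concrete check of that sign behaviour, Bonacich's index c(beta) = alpha * (I - beta*A)^(-1) A 1 can be computed directly from its definition. A minimal NumPy sketch on a hypothetical star graph (the graph and the beta value are illustrative, not UCINET output):

```python
import numpy as np

def bonacich_power(A, beta, alpha=1.0):
    """Bonacich power centrality: c = alpha * (I - beta*A)^(-1) @ A @ 1."""
    n = A.shape[0]
    ones = np.ones(n)
    return alpha * np.linalg.solve(np.eye(n) - beta * A, A @ ones)

# Star graph: node 0 is tied to three pendant nodes 1..3.
A = np.zeros((4, 4))
A[0, 1:] = A[1:, 0] = 1

c = bonacich_power(A, beta=-0.5)
print(c)  # centre is positive, pendants are negative: [ 6. -2. -2. -2.]
```

With beta < 0, the centre (connected to poorly connected actors) scores high and positive, while the pendants (connected to a well-connected actor) score negative, which is exactly the interpretation above.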

3
Is there an existing open source code or library for hyperheuristics in data mining?

Hi ,

Please, is there an existing open source code or library for hyperheuristics in data mining?

Thank you

I know WEKA and ELKI aren't explicitly hyperheuristic, but if you're interested in practical use rather than a theoretical approach, you might find a solution here.

WEKA http://www.cs.waikato.ac.nz/~ml/weka/

ELKI http://elki.dbs.ifi.lmu.de/

YALE/RapidMiner http://rapidminer.com

Hope that helps.

Andreas

EDIT: maybe Eureqa would do the job: http://www.nutonian.com. I remember there is an open source version free of charge, but I can't find it now; maybe you need to go through Hod Lipson's video introductions to Eureqa to find the link.

9
How can I recognize whether a document is positive or negative based on polarity words?

I have:

- Polarity words.

Example:

- Good: Pol 5.

My assignment:

Determine whether a document is negative or positive. How should I do this? Please tell me; I'm a newbie in NLP (sentiment analysis).
I want to use polarity scores for this, not Naive Bayes. Can anyone tell me about an algorithm based on polarity words?

Since you have the prior sentiment orientation of words (e.g., good: +5; bad: -5), you can simply follow this assumption:

The sentiment orientation of a document/sentence is an average sum of the sentiment orientations of its words and phrases [1]

As such:

Sentiment(Document) = Average(polarity scores of the document's words); a positive average indicates a positive document, a negative average a negative one.

The above solution is rather simple and naive since it does not consider the context of words in the document. However, you can start experimenting with this solution since you are still new to this field.
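A minimal sketch of that averaging approach, using a tiny hypothetical polarity lexicon (the words and scores below are illustrative only, not a real resource):

```python
# Hypothetical polarity lexicon: word -> prior sentiment orientation.
POLARITY = {"good": 5, "great": 4, "bad": -5, "terrible": -4}

def document_sentiment(text):
    # Collect the polarity scores of the lexicon words that occur in the text.
    scores = [POLARITY[w] for w in text.lower().split() if w in POLARITY]
    if not scores:
        return 0.0  # no lexicon words found: treat the document as neutral
    return sum(scores) / len(scores)  # positive average -> positive document

print(document_sentiment("the movie was good really great"))  # -> 4.5
print(document_sentiment("a bad and terrible plot"))          # -> -4.5
```

Note that this ignores context entirely (negation, intensifiers), which is exactly the limitation mentioned above.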

Also, note that the above solution follows the lexicon-based approach to sentiment analysis. A recent and important work in this vein is by Thelwall et al. [2]

Hope this helps ;)

[1] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135, 2008.

[2] Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology, 63(1):163–173, 2012.

12
Should studies published in ResearchGate and refereed by ResearchGate peer reviewers be counted as scholarly publications?
Yes, in my humble opinion. Firstly, scientists have an interest in disseminating their intellectual work through an interactive, secure and free medium. Secondly, traditional journals use the same pool of academics as referees to review submitted articles. And thirdly, it allows for post-publication peer review.

RG is not a publisher; it is a web site. Uploading material to RG is not "publishing," it is "posting," like posting a notice on a bulletin board. To be a legitimate publishing web site, there must be a qualified editorial and review staff in place. With some modification, RG has the potential to become a "pre-publication" sounding board or a "publication preparation conduit." This would be a useful service to both writers and publishers. Allowing anyone, qualified or not, to write a review of RG "postings" is not professionally adequate; RG would have to limit reviews to those with the necessary expertise.

29
Are PhD/Masters theses and dissertations, or journal/conference papers, the best sources for a literature review? Why?
When reviewing the literature for trends in technological advancement and future directions, we may use either PhD and Masters theses and dissertations or published papers. But which will be most useful, or give the most rigorous review?

Dear John K. Marco Pima

PhD and Masters theses and dissertations give us quick access to methodologies and tools, along with the scope of future work, and they are readily available to research groups at the same institute. Citing them is essential whenever their research ideas, methodologies, tools, or comparison benchmarks are used.

However, since theses and dissertations are often not accessible or publicized immediately, they are not commonly cited by other researchers in the field. And because most research papers are derived from thesis and dissertation work anyway, citing the theses themselves is generally not the prime concern.

Conference papers commonly discuss new ideas, tools, methodologies, technologies and comments, but not to the point of completeness. Some good conferences adhere to a standard, such as those indexed in IEEE Xplore. Citing standard conference papers, and/or conference papers from the same research groups or fields, is justified.

Standard journal papers are the best source of research breakthroughs; make a practice of reading them, as they are the best source of references.

2
Is there any effective algorithm known for applying rule hiding techniques on temporal multilevel association rules?

We are working on privacy-preserving issues in temporal multilevel association mining and want to know which algorithm is currently the most effective in practice/real deployment/research for this purpose.

Mr. Robert,

Thanks for giving this answer, your suggestion is really useful to me.

6
Where can I find huge data sets for mining frequent item sets?
I am working on Distributed Association Rule Mining. I need data sets to simulate my program on it.

http://archive.ics.uci.edu/ml/ (UC Irvine Machine Learning Repository)
http://data.gov.in/
http://library.case.edu/databases/ (Case Western University)
http://www.indiastat.com/default.aspx
http://www.usgs.gov/
http://data.worldbank.org/


7
Is it possible to explain the operational/tactical/strategic classification in terms of real-time KM?

The basic idea is to use events and the processing of associated data to bridge the gap between operational and decision-making systems.

I will assume that internal events are meant to be managed, which means that risks are by definition external. Then I will use the classic distinction:

Operational: full information on the external state of affairs allows for immediate appraisal of prospective states.

Tactical: a partially defined external state of affairs allows for periodic appraisal of prospective states in sync with production cycles.

Strategic: an undefined external state of affairs does not allow for periodic appraisal of prospective states in sync with production cycles; its definition may also be affected through feedback.

4
How do you classify documents using WEKA?

Please I need help on how to go about classifying documents using WEKA. I am doing classification of dissertations of my department. The assignment is to do an Ontology-based classification of 250 dissertations against Gartner's Hype cycle. The aim is to determine which field of IT each dissertation falls into.

Thanks
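The general pipeline is: turn each document into a word-vector representation, then train a classifier on labelled examples; in WEKA this is the StringToWordVector filter followed by a classifier of your choice. A sketch of the same idea in Python/scikit-learn (the documents, labels, and field names are hypothetical):

```python
# Text-classification pipeline sketch: TF-IDF word vectors + Naive Bayes.
# (In WEKA the analogous steps are StringToWordVector + e.g. NaiveBayes.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cloud storage architecture", "neural network training",
        "virtual machine migration", "deep learning models"]
labels = ["cloud", "ai", "cloud", "ai"]  # hypothetical IT fields

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["distributed cloud migration"]))  # -> ['cloud']
```

For 250 dissertations you would replace the toy strings with the dissertation texts (or abstracts) and the labels with your Gartner Hype Cycle fields.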

Thank you. It really helps. I will try using some of the suggested classifiers.

Thank you

6
Can anyone help to determine the time lag between flood gauges from upstream to downstream?
Can anyone suggest how we should calculate time lags of historical flow data between upstream and downstream gauges for urban flood prediction (considering that, across flood events, the time lags between upstream and downstream flow stations can change depending on precipitation characteristics and the hydrologic condition of the catchments)?

There are flood routing methods, such as Muskingum and Kalinin-Milijukov, with which you can estimate the translation and retention effects in the river section.
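The Muskingum method mentioned above can be sketched in a few lines; the coefficient formulas are the standard ones, while the K, X, dt values and the demo hydrograph are hypothetical:

```python
def muskingum_route(inflow, K, X, dt, O0=None):
    """Route an inflow hydrograph through a river reach (Muskingum method).

    K: storage time constant (same units as dt); X: weighting factor (0..0.5).
    O_{t} = c0*I_{t} + c1*I_{t-1} + c2*O_{t-1}, with c0 + c1 + c2 = 1.
    """
    denom = K - K * X + 0.5 * dt
    c0 = (-K * X + 0.5 * dt) / denom
    c1 = (K * X + 0.5 * dt) / denom
    c2 = (K - K * X - 0.5 * dt) / denom
    out = [inflow[0] if O0 is None else O0]
    for t in range(1, len(inflow)):
        out.append(c0 * inflow[t] + c1 * inflow[t - 1] + c2 * out[-1])
    return out

# A hypothetical inflow pulse: the routed peak arrives later and attenuated,
# which is one way to estimate the lag between upstream and downstream gauges.
print(muskingum_route([0, 100, 40, 10, 0, 0], K=2.0, X=0.2, dt=1.0))
```

Comparing the timing of the inflow and outflow peaks gives an estimate of the reach travel time for that event.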

3
Domain-specific search?
I have been guiding research scholars in the area of domain-specific ontology searching techniques. If anyone would like to share their expertise with published papers, I would really appreciate it.

We have a paper in press where we used an exemplary glossary as the basis for extracting the ontology of a domain. (We were then able to establish the hierarchy of dependencies.)

4
How can one filter uninteresting rules in multilevel association mining processes?

During any association mining process, it is a big challenge to remove uninteresting rules. We are interested in effective formal and experimental methods for assessing the interestingness of multilevel rules.

http://www.oalib.com/paper/2651823#.VEkLAiKsXKM

Maybe it helps you.

1
Is there any way to do manual pruning of the resulting decision trees of a trained Random Forest model?

The idea is to do online pruning using a continuous timestamped dataset. I want to train my model on some data and then improve it during the day with other information that I may receive (e.g., active learning, weather conditions, etc.), either by pruning (i.e., removing some tree branches) or by adding more branches to the current leaves. Are there any R packages that support such an implementation? Which would be the best way to do so? Thank you in advance.

Hello friend,

Yes, it is possible if the dataset contains a small number of attributes; otherwise it may become a highly tedious job.
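On the pruning question itself: I am not aware of an R package for manual post-hoc pruning of a fitted forest, but a related, supported route in Python's scikit-learn is cost-complexity pruning of each tree at fit time via the ccp_alpha parameter (a sketch on synthetic data; the parameter values are illustrative):

```python
# Cost-complexity pruning of every tree in a random forest at fit time.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

full = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
pruned = RandomForestClassifier(n_estimators=20, ccp_alpha=0.02,
                                random_state=0).fit(X, y)

# Pruned trees have fewer nodes in total than their unpruned counterparts.
print(sum(t.tree_.node_count for t in full.estimators_),
      sum(t.tree_.node_count for t in pruned.estimators_))
```

This is not the online branch editing asked about (refitting is required when new data arrives), but it gives controlled, reproducible pruning of every tree in the ensemble.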

You can see my publications for more details about applications of trees:

Mandal, I., Sairam, N. New machine-learning algorithms for prediction of Parkinson's disease (2014) International Journal of Systems Science, 45 (3), pp. 647-666. DOI: 10.1080/00207721.2012.724114

Mandal, I., Sairam, N. Accurate telemonitoring of Parkinson's disease diagnosis using robust inference system (2013) International Journal of Medical Informatics, 82 (5), pp. 359-377. DOI: 10.1016/j.ijmedinf.2012.10.006

Mandal, I., Sairam, N. Accurate prediction of coronary artery disease using reliable diagnosis system (2012) Journal of Medical Systems, 36 (5), pp. 3353-3373. DOI: 10.1007/s10916-012-9828-0

Mandal, I., Sairam, N. Enhanced classification performance using computational intelligence (2011) Communications in Computer and Information Science, 204 CCIS, pp. 384-391. DOI: 10.1007/978-3-642-24043-0_39

Mandal, I. Software reliability assessment using artificial neural network (2010) ICWET 2010 - International Conference and Workshop on Emerging Trends in Technology 2010, Conference Proceedings, pp. 698-699. DOI: 10.1145/1741906.1742067

Mandal, I. A low-power content-addressable memory (CAM) using pipelined search scheme (2010) ICWET 2010 - International Conference and Workshop on Emerging Trends in Technology 2010, Conference Proceedings, pp. 853-858. DOI: 10.1145/1741906.1742103

Indrajit Mandal, Developing New Machine Learning Ensembles for Quality Spine Diagnosis, Knowledge-Based Systems, Available online 19 October 2014, ISSN 0950-7051, http://dx.doi.org/10.1016/j.knosys.2014.10.012.
http://www.sciencedirect.com/science/article/pii/S0950705114003797

Mandal, I. A novel approach for accurate identification of splice junctions based on hybrid algorithms (2014) Journal of Biomolecular Structure and Dynamics, pp. 1-10. DOI: 10.1080/07391102.2014.944218. PMID: 25203504
http://bit.ly/1vQZsei

Thank you.

Best regards,

Dr. Indrajit Mandal

89
What kind of "Research Tools" have you used during your PhD journey?
I have collected over 700 tools that can help researchers do their work efficiently. It is assembled as an interactive Web-based mind map, titled "Research Tools", which is updated periodically. I would like to know your personal experience on any kind of "Research Tools" that you are using for facilitating your research.

2
Does anyone know where to find links to public nursing data sets?
We are performing knowledge extraction from various health care data sets; we have found just one nursing data set, but we need more.

6
What techniques do you consider the most effective for capturing heuristics rules and best-practices during product design & development and why?
Engineers often like to use "best practices" (data, information, knowledge, wisdom) during product development. Some of the data/information come from their experiences working on the job. Others (best practices) are derived from analytical, functional, logical or physical phenomena.

Thompson:

Is your article listed on ResearchGate or on any web site? Can you add a link to the article you have referred to here?

2
Is OOB error rate always random?

A bootstrapped sample, and a tree built from it: why is the OOB error estimate a random value?

OOB predictions are based on a small subsample of the trees in the forest and are thus at a disadvantage relative to predictions that can legitimately be based on the entire forest.

OOB predictions are not expected to exhibit better performance, because their distribution is not the same as the distribution on which the individual trees are grown.

http://pages.bangor.ac.uk/~mas00a/papers/lkml.pdf
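To see the randomness concretely: the OOB estimate depends on the bootstrap draws, so refitting with different random seeds typically yields slightly different OOB accuracies (a scikit-learn sketch; the dataset and sizes are arbitrary):

```python
# OOB accuracy under different bootstrap draws of the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

scores = [RandomForestClassifier(n_estimators=50, oob_score=True,
                                 bootstrap=True, random_state=s
                                 ).fit(X, y).oob_score_
          for s in range(3)]
print(scores)  # three OOB accuracies, one per bootstrap realisation
```

Each element of `scores` is computed from a different set of out-of-bag samples, which is exactly why the estimate is a random quantity.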

2
How to classify indexing techniques?
An index is one of the main design concepts; indexes allow access to stream data in a multitude of ways. When a stream is originally recorded, an index is created. This is called the primary index and is used for access to the stream of data. (http://www.cl.cam.ac.uk/~jac22)

The question is: can we classify indexing techniques into two categories (such as non-intelligent and intelligent methods)?

One way to approach this problem is to start by selecting clue words representing subject categories, which can then be used to pigeonhole each document into the subject category corresponding to a clue word. For more about this, see page 408 in

M.E. Maron, Automatic indexing: An experimental inquiry:

http://sci2s.ugr.es/keel/pdf/algorithm/classification-algorithm/Maron1961.pdf

Maron makes the following observation:

The fundamental thesis says, in effect, that statistics on kind, frequency,
location, order, etc., of selected words are adequate to make reasonably good
predictions about the subject matter of documents containing those words (p. 405).
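A toy version of Maron-style pigeonholing can be sketched as follows (the clue-word lists are hypothetical, and ties go to the first category listed):

```python
# Hypothetical clue words per subject category.
CLUES = {"medicine": {"patient", "clinical", "disease"},
         "computing": {"algorithm", "index", "query"}}

def pigeonhole(text):
    """Assign a document to the category whose clue words it mentions most."""
    words = set(text.lower().split())
    return max(CLUES, key=lambda c: len(CLUES[c] & words))

print(pigeonhole("an index structure for fast query evaluation"))  # computing
```

Maron's actual method goes further, using word statistics (frequency, location, order) rather than bare overlap counts, but the pigeonholing idea is the same.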

9
How can I classify data with missing values using conventional classifiers?

The tolerance rough set model deals well with missing values in data sets. But how can I use the tolerance rough set model to classify data with a conventional classifier such as KNN?

Dear all,

In the context of support vector machines it is possible to also change the problem formulation in the case of missing values, see e.g.

Pelckmans K., De Brabanter J., Suykens J.A.K, De Moor B., ``Handling Missing Values in Support Vector Machine Classifiers'', Neural Networks, vol. 18, 2005, pp. 684-692.

Best regards,

Johan Suykens

----------------------

Prof. Dr.ir. Johan Suykens
Katholieke Universiteit Leuven
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee)
Belgium
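This is not the tolerance rough set model, but a common practical alternative for the original question: impute the missing values first (here with scikit-learn's KNNImputer), then apply the conventional classifier. The tiny dataset below is purely illustrative:

```python
# Impute-then-classify: KNN imputation followed by a k-NN classifier.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [8.0, np.nan]])
y = [0, 0, 1, 1]

# Fill each missing entry from the nearest complete-enough neighbour.
X_filled = KNNImputer(n_neighbors=1).fit_transform(X)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_filled, y)
print(clf.predict([[7.5, 6.5]]))  # -> [1]
```

The same two-step pattern works with any conventional classifier; only the imputation step changes if you swap in a different missing-value model.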

8
Why is knowledge not highly valued as a decisive element in sustaining a competitive organizational position?
I draw your attention to the five organizational elements presented by D. McFarland of Stanford University in his Organizational Analysis MOOC class: technology, participants, goals, social structure, and environment. Drucker considers knowledge the primary resource, and land, labor, and capital secondary resources. The general categorizations are physical, social and technical.

Our study has shown that, at the management level of organizations, knowledge (especially epistemologically advanced knowledge that is decision-relevant) is valued very highly. This does not apply to the operational level of organizations.

Another phenomenon is that the more knowledge is tacit and bound to persons, the more highly it is valued.

6
What is the difference between global and local search in terms of machine learning and computer science?
Global search vs local search.

Ideally speaking, a global search technique promises to find the best global solution, but this is mostly achieved at the cost of a long search time; in reality, such methods are run until a stopping criterion is met. Examples of this kind of search include particle swarm optimization, simulated annealing, and genetic algorithms. Local search algorithms, by contrast, do not explore the whole search space but attempt to move from a current solution to a neighboring, improving solution; the outcome depends heavily on the initial search space and the initial solution. An example of local search is hill climbing, an iterative algorithm that starts with a random solution and then tries to find a better one by incrementally altering a single element of the solution. If this alteration yields a better solution, the incremental change is kept as the new solution, and the process is repeated until no further improvement is found.
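The hill-climbing procedure just described can be sketched in a few lines (the objective function, step size, and iteration count are illustrative choices, not canonical ones):

```python
import random

def hill_climb(f, x0, step=0.1, iters=1000):
    """Maximize f by repeated single-step alterations, keeping improvements."""
    x = x0
    for _ in range(iters):
        candidate = x + random.choice([-step, step])  # alter one element
        if f(candidate) > f(x):                       # keep only improvements
            x = candidate
    return x

random.seed(0)
# Toy objective with a single optimum at x = 3.
best = hill_climb(lambda x: -(x - 3) ** 2, x0=0.0)
print(round(best, 1))  # climbs to the optimum, x = 3.0
```

Because only improving moves are accepted, the search stops at the nearest local optimum; with a multimodal objective the result would depend on the starting point, which is exactly the local-search limitation described above.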

5
What is the significance and future scope of high-dimensional data?
High-dimensional microarray data.

Eirini Ntoutsi, thank you, ma'am.

Could you please tell me how high dimensionality affects data mining techniques, and what other methods we can apply to mitigate high dimensionality and do efficient data analysis?

7
How do you avoid the curse of dimensionality problems during the feature reduction step in data mining?
How do you know if the number of features reduced is sufficient? Is there any rule of thumb for it? Any good idea in this direction is highly appreciated.

Hi,

Book: An Introduction to Pattern Recognition: A MATLAB Approach

By: Theodoridis, Sergios

Good luck!
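On the rule-of-thumb part of the question: one widely used heuristic is to keep enough principal components to explain about 95% of the variance. A scikit-learn sketch on synthetic data (the dimensions, scales, and 95% threshold are illustrative assumptions):

```python
# PCA with a variance-retention threshold instead of a fixed component count.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(scale=10.0, size=(200, 2)),   # 2 informative axes
               rng.normal(scale=0.1, size=(200, 8))])   # 8 near-noise axes

pca = PCA(n_components=0.95)  # float: keep components covering 95% variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])  # -> 2: two components suffice for this data
```

There is no universal answer to "how many features is enough"; thresholds like 90% or 95% retained variance (or cross-validated downstream accuracy) are the usual practical criteria.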

4
How do mathematics and statistics explain the logic behind association rules, which is essential for the decision-making process?

There are many different forms of mathematics, and different forms underpin different understandings of association rules. The effectiveness and efficiency of the same association rules differ across these mathematical explanations. So how is it possible that association rules, which have many different mathematical explanations, can support a single decision? And how can I know which of these explanations are correct and which are wrong?

The key is to use interestingness measures that capture association between items or events beyond random chance, given that these items or events co-occur at an acceptable level of frequency (aka support). Acceptable interestingness measures include confidence, lift, conviction, etc.
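The measures named above can be computed directly from transaction data; a small sketch for a rule A -> B over a hypothetical transaction list:

```python
# Support, confidence, and lift for an association rule A -> B.
transactions = [{"bread", "butter"},
                {"bread", "milk"},
                {"bread", "butter", "milk"},
                {"milk"}]

def measures(antecedent, consequent):
    n = len(transactions)
    both = sum(antecedent | consequent <= t for t in transactions) / n
    p_a = sum(antecedent <= t for t in transactions) / n
    p_b = sum(consequent <= t for t in transactions) / n
    support = both            # P(A and B): co-occurrence frequency
    confidence = both / p_a   # P(B | A)
    lift = confidence / p_b   # > 1 means association beyond random chance
    return support, confidence, lift

print(measures({"bread"}, {"butter"}))  # support 0.5, confidence 2/3, lift 4/3
```

Here lift = 4/3 > 1, so buying bread is positively associated with buying butter beyond what chance co-occurrence would predict.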

1
What is the role of spatial data in mobile governance?
I need good research articles on the role or use of spatial data in good/effective mobile governance.
I have already worked on a mobile application project that solves some local problems using spatial data about the locations of accidents, fire stations, etc.
You can access my LinkedIn profile and take a look at the project entitled "Problem Locator": http://www.linkedin.com/profile/view?id=55881961&trk=nav_responsive_tab_profile
3
A transformation technique to convert original data into perturbed data
I need to transform original data into perturbed data to protect individual privacy, then perform data mining on the perturbed data such that the extracted knowledge is equivalent to the knowledge obtainable from the original data.
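One common, simple perturbation technique is additive random noise: individual records are masked, while aggregate statistics (and hence many mining results) are approximately preserved. A NumPy sketch (the distributions and the noise scale are illustrative assumptions):

```python
# Additive-noise perturbation of a numeric attribute.
import numpy as np

rng = np.random.default_rng(42)
original = rng.normal(loc=50.0, scale=5.0, size=10_000)  # hypothetical column

# Perturb each record with zero-mean noise; the scale trades privacy
# (more distortion per record) against utility (aggregate accuracy).
perturbed = original + rng.normal(scale=2.0, size=original.shape)

# Individual values change, but the mean is nearly unchanged.
print(round(original.mean(), 2), round(perturbed.mean(), 2))
```

Stronger guarantees (e.g., calibrating the noise for differential privacy) require choosing the noise distribution and scale according to a formal privacy definition rather than ad hoc.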
Does anybody know how to run LDA (Latent Dirichlet Allocation) code for the Reuters-21578 and 20 Newsgroups datasets?