MohammadNoor Injadat, Fadi Salo and Ali Bou Nassif, Data Mining Techniques in Social Media: A
Survey, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.06.045
Data Mining Techniques in Social Media: A Survey
MohammadNoor Injadat1, Fadi Salo1, Ali Bou Nassif2
1Department of Electrical and Computer Engineering, University of Western Ontario, 1151 Richmond St, London, Ontario N6A 3K7 Canada
2Department of Electrical and Computer Engineering, University of Sharjah, Sharjah, United Arab Emirates
ABSTRACT
Today, the use of social networks is growing ceaselessly and rapidly. More alarming is the fact that these networks have become
a substantial pool for unstructured data that belong to a host of domains, including business, governments and health. The
increasing reliance on social networks calls for data mining techniques that are likely to facilitate restructuring the unstructured data
and placing them within a systematic pattern. The goal of the present survey is to analyze the data mining techniques that were
utilized by social media networks between 2003 and 2015. Espousing criterion-based research strategies, 66 articles were
identified to constitute the source of the present paper. After a careful review of these articles, we found that 19 data mining
techniques have been used with social media data to address 9 different research objectives in 6 different industrial and services
domains. However, the data mining applications in the social media are still raw and require more effort by academia and
industry to adequately perform the job. We suggest that more research be conducted by both the academia and the industry since
the studies done so far are not sufficiently exhaustive of data mining techniques.
Keywords: Data Mining, Social Media, Social Media Networks Analysis, Survey
1. INTRODUCTION
Undoubtedly, the world is shrinking into a small village owing to the tangible influence of social media. It connects people
from different parts of the world, ages, and nationalities and allows them to share their opinions, experiences, feelings, hobbies,
pictures, and videos. This has opened the door for public and private organizations from all domains to promote, benefit, analyze,
learn, and improve their organizations based on the data provided in social media. Thus, the significance of social media for
academia and industry is quite conspicuous in the amount of research done by these two sectors, seeking answers to pivotal
questions.
Social media data is unstructured and appears in different forms, such as text, voice, images, and
videos [1]. Moreover, social media provides an enormous amount of continuous real-time data, which makes traditional
statistical methods unsuitable for analyzing it [2]. Therefore, data mining techniques can play an important role
in overcoming this problem.
Despite the large body of empirical research on data mining techniques and social media, only a small number of studies
compare data mining techniques in terms of accuracy, performance, and suitability. For instance, we observed that the
accuracy of certain machine learning techniques is calculated in different ways, which makes it difficult to determine the
suitability of the data mining techniques.
Many researchers have selected their data mining techniques based solely on expert judgment (A31, A56). The few surveys that
have been conducted in this area do not give full justification for using data mining techniques in social media [3,4]. However,
some studies discussed certain aspects of the data mining techniques used in social media. In [5], Vilma Vuori et al. discussed
information gathering and knowledge and information sharing through social media for companies. In [6], Rafeeque P C et al.
reviewed the work and challenges related to short text analysis. Akin to this study, in [7], Mikalai Tsytsarau et al.
reviewed the development of opinion mining and sentiment analysis, providing a summary of the proposed methods of
contradiction analysis. In [8], Sheela Gole et al. discussed mining big data in social media and the challenges that result from
big data features such as Volume, Velocity, Variety, Veracity, and Value.
To the best of our knowledge, there is no previous study that systematically concentrates on the implemented data mining
techniques in social media research, which has triggered the idea of the present survey. The review presented in this paper
discusses the published research in the period from January 1, 2003 to January 7, 2015. The goal of this study is to probe the
available articles with regard to: (I) the data mining techniques used to extract social media data, (II) the research areas that
require mining data from social media, (III) a comparison between machine learning and non-machine learning data mining
techniques, (IV) a comparison between different data mining techniques, and (V) the strengths and weaknesses of the recommended
data mining techniques in social media.
This manuscript is divided into five sections. Section 2 explains the implemented methodology. Section 3 describes our
findings. Section 4 discusses the limitations of this review. Finally, Section 5 presents our conclusions, recommendations, and
future work.
2. METHODOLOGY
In this review, we conducted a survey based on the Systematic Literature Review (SLR) methodology proposed by Kitchenham and
Charters [9], which consists of planning, conducting, and reporting phases, where each phase consists of several
1. minjadat@uwo.ca (MohammadNoor Injadat), fsalo@uwo.ca (Fadi Salo)
2. anassif@sharjah.ac.ae (Ali Bou Nassif). Corresponding author
stages. At the planning phase we created a review protocol which consists of six stages: specifying research questions, designing
the search strategy, identifying the study selection procedures, specifying the quality assessment rules, detailing the data
extraction strategy, and synthesizing the extracted data. Fig. 1 shows the review protocol stages.
The research questions have been specified based on the objectives of this review. At the next stage, we designed the search
strategy referring to the first stage to retrieve the required and related articles. We also identified the search terms and article
selection process, which is required for an accurate search. Stage three covered the selection criteria which specify the inclusion
and exclusion rules; we also included more related articles from the references in the articles we used to enrich our literature
resources related to the research questions. Stage four included the quality questions to filter the related articles. In stage five, we
described the extraction strategy used to obtain the required data which could answer the research questions. Finally, in the last
stage, we identified the methodologies used to synthesize the extracted data.
As indicated by Kitchenham and Charters [9], the review protocol is a critical element of any SLR.
Therefore, to avoid researcher bias and to ensure the quality of the review protocol, the authors held regular meetings.
Subsections 2.1 to 2.6 describe in detail the review protocol followed in this review.
2.1. RESEARCH QUESTIONS
Summarizing and providing evidence of implementing the data mining techniques in social media is our main goal in this
work. Thus, we identified the following five research questions (RQs):
1. RQ1: Which data mining techniques have been used in Social Media?
The role of this question is to specify the data mining techniques that were implemented in mining social network data.
2. RQ2: In which research areas have data mining techniques been applied?
The aim of this question is to identify the domains where the data mining techniques were applied and the research
objectives among these domains. The most frequent domain will be identified as well as any new domains suggested.
3. RQ3: Do machine learning techniques perform better than non-machine learning techniques in data mining?
RQ3 compares machine learning and non-machine learning methods implemented in mining social media in terms of
accuracy. Few articles made a comparison between machine learning and non-machine learning methods. As mentioned in
[10,11], only statistical techniques were considered as non-machine learning, whereas the other computational techniques
are considered as machine learning methods.
4. RQ4: Is there any comparison that has been performed among different data mining techniques?
The aim of RQ4 is to specify the data mining technique with high performance. The results produced by the answer of this
question will be considered as evidence of the recommended techniques.
5. RQ5: What are the strengths and weaknesses of the implemented data mining techniques in social media?
This question identifies suitable uses of the selected data mining techniques in social media tasks such as text mining,
media mining, content-based mining, context-aware mining, graph data mining, and multimedia mining.
2.2. SEARCH STRATEGY
The search strategy that we followed in this survey is explained in detail as follows:
2.2.1. SEARCH TERMS
To construct the search terms, we followed this procedure [9]:
1. The main terms were derived from the research questions.
2. We defined new terms that can replace the main terms, such as jargon, alternative spellings, and synonyms.
3. The top ten data mining algorithms were selected from published papers and books [12,13].
4. We used Boolean search operators (ANDs and ORs) to limit the search results in addition to “ ” for specific phrases.
We included in our search terms the top ten data mining techniques identified by [12,13]. Fig. 1 shows the stages of the
review protocol.
The search terms used to retrieve the related publications are as follows. Note that different search terms were used to
retrieve more related publications. The last search was conducted on January 9, 2015.
“data mining” AND “techniques” OR “technique” AND “social media”
“data mining” AND “machine learning” AND “social media”
“social media” AND “fuzzy” AND “data mining”
“social media” OR “social network” AND (“C4.5” OR “J48” OR “K-Means” OR “SVM” OR “support vector machines”
OR “Apriori” OR “EM” OR “expectation maximization” OR “PageRank” OR “AdaBoost” OR “KNN” OR “k-NN” OR
“k-nearest neighbors” OR “Naive Bayes” OR “CART”)
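The last search string above can also be assembled programmatically, which helps keep the synonym lists consistent across libraries. The following is a minimal sketch only: the helper names (`quote`, `any_of`, `all_of`) are hypothetical, and the explicit parentheses around the two context terms are our assumption, since each digital library resolves AND/OR precedence in its own way.

```python
def quote(term: str) -> str:
    """Wrap a term in double quotes so it is searched as an exact phrase."""
    return f'"{term}"'


def any_of(terms):
    """Join quoted terms with OR and group them in parentheses."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def all_of(parts):
    """Join already-built query parts with AND."""
    return " AND ".join(parts)


# Terms copied from the search string listed above.
CONTEXT = ["social media", "social network"]
ALGORITHMS = [
    "C4.5", "J48", "K-Means", "SVM", "support vector machines",
    "Apriori", "EM", "expectation maximization", "PageRank",
    "AdaBoost", "KNN", "k-NN", "k-nearest neighbors",
    "Naive Bayes", "CART",
]

# Assumption: the context terms are grouped explicitly so that the
# query does not depend on a library's operator precedence.
query = all_of([any_of(CONTEXT), any_of(ALGORITHMS)])
print(query)
```

Grouping each synonym list before combining the lists with AND makes the intended logic visible in the query itself.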
2.2.2. SURVEY RESOURCES
The following digital libraries were searched for the required articles:
IEEE Xplore
Google Scholar
Science Direct
ACM Digital Library
Computing Research Repository
Web of Science
SPIE
The first search process included journals and Tier I social network-related conferences, such as the International Conference on
Advances in Social Networks Analysis and Mining (ASONAM), the ACM Conference on Online Social Networks (COSN), the
International World Wide Web Conference (WWW), and the International Conference on Data Engineering (ICDE), from the
above-mentioned digital libraries. The search terms were matched against any part of the articles (metadata), and the search was
restricted to articles published between January 2003 and January 2015, because the most popular social networks (Facebook,
Twitter, LinkedIn, and MySpace) began after 2002 [14].
2.2.3. SEARCH PHASES
We used the specified search terms to retrieve the primary related articles from these digital libraries. Moreover, a quick scan
of the references of the selected papers helped to enrich the resources for answering the research questions. The inclusion criteria
are explained in detail in Section 2.3.
The Google document platform was used to share and manage the search results and documents among the authors. Based on the
inclusion criteria, 147 relevant publications were chosen as candidate publications: 83 journal papers and 64 conference papers.
Fig. 2 illustrates the breakdown of the identified articles at each search and selection phase.
2.3. STUDY SELECTION
We obtained 1187 articles in the first search process. Because many articles did not provide sufficient information to answer
the research questions, we performed another filtration step (see Fig. 2).
The filtration process was conducted individually by the authors and the results were discussed in scheduled meetings to
ensure the accuracy and to resolve any differences. The selection and filtration steps are explained below:
1. Step 1: remove the duplicated articles obtained by authors and/or different libraries.
2. Step 2: apply inclusion and exclusion criteria to the candidate papers to avoid any irrelevant articles.
3. Step 3: apply the quality assessment rules to include the qualified articles that give the best answers to the research
questions.
4. Step 4: search for additional related articles from the article references obtained from step 3 and repeat step 3 on the extra
articles.
The inclusion and exclusion criteria applied in this survey are defined below:
Fig. 1 Review Protocol Stages
Inclusion criteria:
Use data mining techniques in social media.
Use machine learning and non-machine learning data mining techniques in social media.
Comparative studies that compare data mining techniques with one another.
Comparative studies that compare data mining and non-data mining techniques.
Consider the latest edition of the article of the same research (if different versions are available).
Consider only articles published between January 2003 and January 2015.
Exclusion criteria:
Exclude articles that include data mining that is not related to social media.
Exclude articles that do not include data mining but are related to social media.
Exclude non-journal and non-conference articles.
Finally, after applying all filtration steps, 66 articles were considered as the
resources for this review. The selected articles are listed in Appendix (A), Table
A1.
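The duplicate-removal and inclusion/exclusion steps described in this subsection can be sketched as a small screening function. This is an illustrative sketch rather than the tooling used in the review: the `Record` fields, the title-based duplicate key, and the sample titles are all hypothetical, the publication window is simplified to whole years, and the "latest edition" criterion is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    year: int
    venue_type: str           # "journal", "conference", or anything else
    uses_data_mining: bool
    about_social_media: bool


def deduplicate(records):
    """Step 1: drop repeated titles (case- and whitespace-insensitive)."""
    seen, unique = set(), []
    for record in records:
        key = " ".join(record.title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def is_included(record):
    """Step 2: apply the inclusion and exclusion criteria."""
    return (
        2003 <= record.year <= 2015                       # publication window
        and record.venue_type in {"journal", "conference"}
        and record.uses_data_mining
        and record.about_social_media
    )


def screen(records):
    return [r for r in deduplicate(records) if is_included(r)]


# Hypothetical records, for illustration only.
records = [
    Record("Mining Tweets with SVM", 2012, "journal", True, True),
    Record("mining tweets  with svm", 2012, "journal", True, True),   # duplicate
    Record("Clustering Sensor Data", 2010, "journal", True, False),   # not social media
    Record("Social Media Usage Trends", 2011, "conference", False, True),  # no data mining
    Record("Early Forum Mining", 2001, "conference", True, True),     # outside window
    Record("Blog Post Classification", 2014, "workshop", True, True), # wrong venue type
    Record("Sentiment Analysis of Forum Posts", 2014, "conference", True, True),
]
selected = [r.title for r in screen(records)]
print(selected)
```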
2.4. QUALITY ASSESSMENT RULES (QARs)
The QARs were applied in the selected studies to evaluate article suitability
in accordance with the research questions. Ten QARs were identified, and each
one is worth 1 mark out of 10. Each QAR is scored as follows: “fully
answered” = 1, “above average” = 0.75, “average” = 0.5, “below average” =
0.25, “not answered” = 0. The overall score of the article will be the
summation of the marks obtained for the 10 QARs. If the result was 5 or
higher, the article was considered; otherwise it was excluded.
1. QAR 1: Are the research objectives clearly defined?
2. QAR 2: Is the data mining background clearly addressed?
3. QAR 3: Are the data mining techniques used clearly defined?
4. QAR 4: Is the design of the experiment suitable and acceptable?
5. QAR 5: Is the study performed on sufficient social media data?
6. QAR 6: Is the data mining technique measured and reported?
7. QAR 7: Is the proposed data mining technique compared with other techniques?
8. QAR 8: Are the conclusions of the experiment clearly identified and reported?
9. QAR 9: Are the methods used to analyze the results appropriate?
10. QAR 10: Does the experiment enrich academia or industry?
The scores that resulted from applying the QARs on the selected articles are shown in Appendix (A), Table A2.
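The scoring rule above amounts to summing ten marks, each drawn from {0, 0.25, 0.5, 0.75, 1}, and accepting an article whose total is at least 5. A minimal sketch follows; the function names and the two example mark lists are hypothetical and do not correspond to any article in Table A2.

```python
ALLOWED_MARKS = {0.0, 0.25, 0.5, 0.75, 1.0}
NUM_QARS = 10
PASS_MARK = 5.0


def quality_score(marks):
    """Sum the marks for the ten QARs after validating them."""
    if len(marks) != NUM_QARS:
        raise ValueError(f"expected {NUM_QARS} marks, got {len(marks)}")
    for mark in marks:
        if mark not in ALLOWED_MARKS:
            raise ValueError(f"invalid mark: {mark}")
    return sum(marks)


def is_accepted(marks):
    """An article is considered when its total score is 5 or higher."""
    return quality_score(marks) >= PASS_MARK


# Hypothetical mark lists, for illustration only.
strong_article = [1, 0.75, 0.5, 0.5, 0.25, 1, 0, 0.5, 0.75, 0.5]
weak_article = [0.5] * 9 + [0.25]

print(quality_score(strong_article), is_accepted(strong_article))
print(quality_score(weak_article), is_accepted(weak_article))
```

Because every mark is a multiple of 0.25, the totals are exact in floating point and the comparison against the threshold is safe.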
2.5. DATA EXTRACTION STRATEGY
In this stage, we explored the articles selected to extract the information required to answer the research questions. Therefore,
we have designed an extraction form (see Table 1) to extract the needed data [9].
Based on the extraction form, two authors played the role of extraction and checking. In case of a disagreement between the
extractor and checker, group meetings were conducted between all authors to resolve any issue.
Some difficulties occurred during the extraction process. For instance,
different terminology was used for the same data mining technique: J48 is
the name that the WEKA tool (which is commonly used by researchers) gives to
its implementation of the C4.5 algorithm [15], so the same technique appears
under both names (A26). Moreover, some articles used different names and
abbreviations for the same technique, such as KNN, K-NN, and Nearest Neighbor
(A12, A34), or Naïve Bayes, Naive Bayes, and NB (A2, A37). Furthermore, many
researchers compared their techniques with other common techniques without
mentioning the technique names or, if mentioned, the reason for picking a
certain technique (A31, A42, A53, A55).
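The naming variants listed above can be reconciled during extraction with a small alias table. The sketch below is one possible approach, not the procedure used in this review: the canonical labels are our own choice, and the table covers only the variants named in the text.

```python
import re
import unicodedata

# Alias keys are lowercase with accents and punctuation removed.
CANONICAL = {
    "knn": "k-NN",
    "nearestneighbor": "k-NN",
    "naivebayes": "Naive Bayes",
    "nb": "Naive Bayes",
    "j48": "C4.5",
    "c45": "C4.5",
}


def _key(name: str) -> str:
    """Strip accents (Naïve -> Naive), lowercase, drop non-alphanumerics."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]", "", without_accents.lower())


def canonical_name(name: str) -> str:
    """Return the canonical label, or the trimmed input if it is unknown."""
    return CANONICAL.get(_key(name), name.strip())


for raw in ["KNN", "K-NN", "Nearest Neighbor", "Naïve Bayes", "NB", "J48", "SVM"]:
    print(raw, "->", canonical_name(raw))
```

Unknown names pass through unchanged, so the table can be extended as new variants are met during extraction.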
Not all selected articles answered all five RQs. Appendix (A), Table A3
illustrates the RQs that were answered by each selected study.
2.6. SYNTHESIS OF EXTRACTED DATA
TABLE 1
DATA EXTRACTION FORM
Article ID
Data Extractor
Data Checker
Publication Year
Authors
Article Source
Article Title
Article Type
Domain
RQ1
RQ2
RQ3
RQ4
RQ5
Fig. 2 Search and Selection Process
To synthesize the data extracted from the selected articles, we used
different procedures to aggregate the evidence that answers the RQs. The
synthesis procedures are as follows:
For RQ1 and RQ2, we used the narrative synthesis method [9], where the
extracted information was tabulated according to RQ1 and RQ2.
For the quantitative data extracted for RQ3 and RQ4, which came from
different articles that calculate accuracy in different ways, we used
binary outcomes to measure the results so that they can be presented in a
comparable way [9].
For RQ5, the strengths and weaknesses of the data mining techniques often
have the same meaning but are worded differently across articles. Therefore, to unify
these points, we followed the reciprocal translation method [9], which is one of the techniques for
synthesizing qualitative data.
3. RESULTS AND DISCUSSION
In this section, we will discuss the results obtained from this review. The
first subsection gives an overview of the selected articles. The result of each
RQ will be discussed in detail in the next five subsections, 3.1-3.5.
The total number of the selected studies was 66 articles (see Appendix
(A), Table A4) that implemented data mining techniques used in social media.
The selected articles were retrieved only from journals published between
January 2003 and January 2015. Appendix (A), Table A4 shows the number
of articles and the percentage grouped by publisher name. The types of
articles considered in this survey are: experiment, case study, and survey.
Table 2 shows the distribution of the selected articles among the three types.
With regard to the quality of the selected articles, we applied the quality
assessment criteria to screen the articles based on the marks gained.
Articles with a score of five or higher (out of ten) were taken into consideration (see Table 3).
3.1. TYPES OF DATA MINING TECHNIQUES (RQ1)
We identified 19 data mining techniques that had been applied by researchers in the area of social media. The list of these
techniques is below.
AdaBoost
Artificial Neural Network (ANN)
Apriori
Bayesian Networks (BN)
Decision Trees (DT)
Density Based Algorithm (DBA)
Fuzzy
Genetic Algorithm (GA)
Hierarchical Clustering (HC)
K-Means
k-nearest Neighbors (k-NN)
Linear Discriminant Analysis (LDA)
Linear-Regression (Lin-R)
Logistic Regression (LR)
Markov
Maximum Entropy (ME)
Novel
Support Vector Machine (SVM)
Wrapper
Fig. 3 shows that SVM, BN, and DT are the most applied techniques in the area of social media, together accounting for 51% of
the techniques used in the selected articles. Novel techniques, at 9%, were not counted among the most applied because each
article proposes its own dedicated novel technique. Table 4 includes detailed information about the frequencies of the data mining
techniques used by the selected articles in this review.
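The percentages quoted for Fig. 3 can be recomputed from the frequencies in Table 4. The sketch below copies those frequencies; the observation that the shares appear to be taken over the 129 technique uses, rather than over the 66 articles, is our own reading of the numbers.

```python
# Frequencies copied from Table 4.
FREQUENCIES = {
    "AdaBoost": 2, "ANN": 8, "Apriori": 1, "BN": 26, "DT": 11,
    "DBA": 3, "Fuzzy": 1, "GA": 1, "HC": 2, "K-Means": 6,
    "k-NN": 9, "LDA": 9, "Lin-R": 1, "LR": 4, "Markov": 1,
    "ME": 2, "Novel": 12, "SVM": 29, "Wrapper": 1,
}

total = sum(FREQUENCIES.values())


def share(name: str) -> float:
    """Percentage of all technique uses accounted for by one technique."""
    return 100 * FREQUENCIES[name] / total


# "Novel" is left out of the ranking, as in the text, because it groups
# distinct one-off techniques rather than a single method.
ranked = sorted(
    (name for name in FREQUENCIES if name != "Novel"),
    key=FREQUENCIES.get,
    reverse=True,
)
top3 = ranked[:3]
top3_share = sum(share(name) for name in top3)

print(total, top3, round(top3_share))
```

With the Table 4 values, the three leading techniques together round to the 51% reported above.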
Appendix (A), Fig. A1 illustrates the distribution of the data mining techniques per year during the considered period.
The figure shows that the number of data mining techniques adopted by researchers in the social media area increased
sharply in 2012 and 2014, with 39 and 35 techniques respectively, and dropped to 24 techniques in 2013. Moreover, many
novel techniques arose between 2012 and early 2015, for a total of 12 new techniques.
3.2. DATA MINING TECHNIQUES RESEARCH AREAS (RQ2)
TABLE 2
SELECTED ARTICLES’ TYPES DISTRIBUTION
Freq.: 4, 60, 2 (Total: 66)
TABLE 3
CANDIDATE ARTICLES’ QUALITY DISTRIBUTION
Classification Criteria    Freq.    %
Between 0 to 2.5           53       36%
Between 2.75 to 4.75       28       19%
Between 5 to 6.75          35       24%
Between 7 to 8.5           22       15%
Between 8.75 to 10         9        6%
Grand Total                147      100%
Fig. 3 Data Mining Techniques among Selected Papers (pie chart: AdaBoost 2%, ANN 6%, Apriori 1%, BN 20%, DT 9%, DBA 2%, Fuzzy 1%, GA 1%, HC 2%, K-Means 5%, k-NN 7%, LDA 7%, Lin-R 1%, LR 3%, Markov 1%, ME 2%, Novel 9%, SVM 22%, Wrapper 1%)
From the selected articles, we identified six general domains which applied
various techniques in nine different research areas to mine the flow of big data
gathered from social media. The list of these domains follows:
Business and Management (BM)
Education (EDU)
Finance (FIN)
Government and Public (GP)
Medical and Health (MH)
Social Networks (SN)
Fig. 4 shows that social networks and business and management were the most
active domains in which data mining techniques were applied, together accounting for 79% of
all domains. Government and public, with 9%, is the third most
active domain. Appendix (A), Table A5 includes detailed information about all
domains.
For further analysis of Table 2, we investigated the
experiments of the selected articles and plotted Fig. 5, which
shows the popularity of the various types of social media
applications in the research. Some experiments mined and
analyzed data from one or more social media applications.
Microblogging applications such as Twitter were the most
popular among researchers with 31 experiments,
followed by social networks such as Facebook with 12
experiments. Appendix (A), Table A6 includes detailed
information about the frequencies of social media
applications used by the selected articles in this review.
Fig. 6 provides further information about the findings
by illustrating the distribution, per year, of the domains applying data
mining techniques. The figure shows that the number of
publications increased sharply in 2012 and 2014, with 19
articles in 5 domains in each of these years. In 2013, the
number fell to 12 articles in 5 domains. Social network data
analysis remained the most active domain throughout the considered period.
Among the selected articles, we identified 9 active
research objectives that adopted data mining techniques. The list
of these research objectives follows:
Biometric
Content Analysis
Cyber Crime
Disease Awareness
Geolocating
Quality Improvement
Risk Management
TABLE 4
DATA MINING TECHNIQUES FREQUENCIES AMONG ARTICLES
Technique    Frequencies    Technique    Frequencies
AdaBoost     2              k-NN         9
ANN          8              LDA          9
Apriori      1              Lin-R        1
BN           26             LR           4
DT           11             Markov       1
DBA          3              ME           2
Fuzzy        1              Novel        12
GA           1              SVM          29
HC           2              Wrapper      1
K-Means      6
Fig. 5 Popularity of various social media application in researches (bar chart; axes: social media application, no. of researches. Blogs 8, Forums and Discussion… 9, Microblogging 31, Product Reviews 1, Social Networks 12, Video and Photosharing 11)
Fig. 4 Domains among Articles (pie chart: Business and Management 17%, Educational 1%, Finance 3%, Government and Public 9%, Medical and health 8%, Social Networks 62%)
Semantic Analysis
Sentiment Analysis
Fig. 7 illustrates the distribution of these research areas. Sentiment analysis and quality improvement were the most
active areas among the articles, with frequencies of 21 and 14 respectively.
3.3. MACHINE LEARNING VERSUS NON-MACHINE LEARNING METHODS IN MINING SOCIAL MEDIA DATA (RQ3)
Data mining is the process of extracting hidden knowledge from data [16]. This can be done in many ways,
for example with machine learning methods such as KNN, K-means, and SVM. Statistical methods, which are also used to
discover patterns, are in some cases considered non-machine learning methods. As Berson et al. mentioned [11], “statistical
techniques are driven by the data and are used to discover patterns and build predictive models”.
Out of the 66 papers identified, only three contain either experimental or theoretical knowledge about non-machine
learning methods. Two of these papers (A11, A19) integrated non-machine learning methods with machine learning methods to
improve the results of their proposed solutions. The third paper (A53) stated that text mining techniques that depend on
machine learning methods differ from non-machine learning methods for three reasons: (i) in traditional quantitative analysis
methods, conclusions are derived from a sample of the population, whereas machine learning methods allow the researcher to derive
conclusions from the entire population; (ii) traditional quantitative methods require the researcher to analyze the data using a
theoretical platform, while machine learning methods give the researcher the ability to extract the actual meaning of the mined
data contained in natural language text; and (iii) machine learning methods investigate the textual data without human interaction,
whereas traditional quantitative methods need the researcher to interpret the data before analyzing it.
However, we disagree with the authors of paper A53 because the definition of data mining consists of three concepts [17]:
Statistics, Data (Big or Small), and Machine Learning and Lifting. Thus, data mining includes all statistics (the descriptive and
non-inferential parts of classical statistics) and Exploratory Data Analysis (EDA), using the power of computers for the
purpose of lifting and learning the patterns of the data [17].
Consequently, machine learning data mining techniques and non-machine learning data mining techniques such as traditional
quantitative methods in statistics are complementary to each other in data mining.
3.4. DATA MINING TECHNIQUES VERSUS OTHER DATA MINING TECHNIQUES (RQ4)
This RQ compares the different data mining techniques used in the selected articles. Since most of the articles
based their findings on weak statistical analysis or on no statistical analysis at all, we built our comparison on the authors' own
judgments, which relied either on the experiments they conducted or on their cited references. For instance, papers (A31, A53)
indicate that the SVM technique is one of the best categorization and feature selection techniques available, relying on references
published in 1998 and 2003, although the citing paper was published in 2013. Further details are provided in Section 5.
After reviewing the selected papers, we found that many papers report common findings about the same data mining techniques.
For instance, papers (A31, A45, A53, A59) found that SVM outperforms other techniques such as Naïve Bayes. In contrast,
papers (A41, A51) claimed that Naïve Bayes and MLP perform better than SVM. Other papers (A3, A20, A35)
claimed that K-Means performs better than other techniques such as C4.5. Finally, (A42, A60) found that the DBA technique
outperforms other techniques when working with noisy data.
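Findings of this kind can be compared as binary outcomes by recording, for each article, which technique was reported to perform better and counting the votes. The sketch below tallies only the SVM versus Naïve Bayes findings quoted above. Treating each article as one pairwise vote is a simplifying assumption on our part, and vote counting ignores effect sizes and dataset differences.

```python
from collections import Counter

# (article, reported winner, reported loser), taken from the findings above.
FINDINGS = [
    ("A31", "SVM", "Naive Bayes"),
    ("A45", "SVM", "Naive Bayes"),
    ("A53", "SVM", "Naive Bayes"),
    ("A59", "SVM", "Naive Bayes"),
    ("A41", "Naive Bayes", "SVM"),
    ("A51", "Naive Bayes", "SVM"),
]


def tally(findings, first, second):
    """Count how many articles favour each of the two techniques."""
    wins = Counter()
    for _article, winner, loser in findings:
        if {winner, loser} == {first, second}:
            wins[winner] += 1
    return wins[first], wins[second]


svm_wins, nb_wins = tally(FINDINGS, "SVM", "Naive Bayes")
print(f"SVM favoured by {svm_wins} articles, Naive Bayes by {nb_wins}")
```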
3.5. STRENGTHS AND WEAKNESSES OF DATA MINING TECHNIQUES (RQ5)
Fig. 6 Domains Distribution per Year (stacked bar chart; x-axis: year, 2008 to 2015; y-axis: no. of articles per year; series: Business and Management, Educational, Finance, Government and Public, Medical and health, Social Networks)
This part of the review serves as a source of information on best practices for implementing the primary data mining
techniques. Table 5 summarizes the data mining techniques that could be implemented in the social media
area. In addition to the traditional data mining techniques, Appendix (A), Table A7 summarizes the description and the main
features of the novel techniques proposed by the researchers.
4. LIMITATIONS OF THIS REVIEW
This study is restricted to journal papers and papers from Tier I social network-related conferences in the field of data mining
techniques and social media. By applying our search filtration strategy, we obtained a large number of articles, the majority of
which were found to be irrelevant. We considered a small number of papers to ensure that the selected papers fully match
our research objectives. Nevertheless, including more related papers would have enriched our conclusions.
We considered only the data mining techniques that were recommended by more than one paper, as mentioned in Section 3.6.
In addition, we applied rigorous quality assessment criteria to select the related articles that could provide synthesized results.
One more limitation is that obtaining public social media datasets with clear descriptions is a challenging task, because
social media data is unstructured and comes in different data types such as text, images, and videos [18]; this makes social media
datasets complex and heterogeneous in format [2].
5. CONCLUSIONS, RECOMMENDATIONS, AND FUTURE WORK
Our survey explored journal and Tier I conference papers that applied data mining techniques in social media between the
period 2003 and 2015; 66 articles were selected to answer the five RQs of this review. Our conclusions are summarized as
follows:
RQ1: the most frequent data mining techniques used in social media articles are SVM, BN, and DT.
RQ2: social network data analysis and business and management were the most active domains requiring mining of
social media data. Within these domains, sentiment analysis and quality improvement were the most active research
objectives.
RQ3: machine learning data mining techniques and non-machine learning data mining techniques are both required for data
mining purposes.
RQ4: SVM and BN are the techniques most often recommended for mining social media data and are used by most of the papers.
RQ5: data mining techniques have various strengths and weaknesses which make the selection of certain techniques
dependent on the type of the informative data required.
An immediate recommendation is that the area of social media still calls for deeper research into the accurate
implementation of data mining techniques in both the academic and industrial sectors. A thorough investigation of the
literature written in this area reveals that a significant number of the studies have not applied any statistical tests.
Research in the social media domain should therefore adopt a twin-focus method that combines accurate recording of
experimental results with appropriate statistical analysis.
The systematic literature review conducted in this study reveals that quite a few articles applied statistical tests, such as
ANOVA, MANOVA, and the t-test; these parametric statistical tests require normally distributed data [11]. The majority
of the studies reviewed apparently failed to meet this condition and, therefore, their results can hardly be considered reliable.
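As an illustration of a test that does not rely on the normality assumption, the following minimal sketch runs an exact two-sided permutation test on the difference between two group means. It is not taken from any of the reviewed articles; the function name `exact_permutation_p_value` and the accuracy scores are hypothetical and serve only to show the idea.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)

    extreme = 0
    splits = 0
    # Enumerate every way of relabelling n_a of the pooled values as "group A".
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so float rounding does not drop ties with the observed value.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical accuracy scores (percent) of two classifiers on four datasets each.
scores_a = [71, 72, 73, 74]
scores_b = [80, 81, 82, 83]
p_value = exact_permutation_p_value(scores_a, scores_b)
print(f"p = {p_value:.4f}")
```

With these hypothetical scores only 2 of the 70 possible relabellings are as extreme as the observed split, so the script prints `p = 0.0286`. Exhaustive enumeration is practical only for small samples; larger samples would call for random resampling or a rank-based test.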
Our study also found that very few surveys and case studies have shed light on data mining techniques in social media from
the software engineering perspective. By way of illustration, most of the published papers in the health domain were conducted
by health researchers, who barely provide any information about the method utilized in their papers.
In addition to the method-related gap, a further gap concerns the domains covered. The domains of
Education, Customer Relationship Management (CRM), and Human Resource Management (HRM), among others, have not yet
been explored by software engineers. We recommend that future research bridge this gap by investigating CRM and
HRM using data mining techniques. Such studies are anticipated to yield a more generic view and understanding of data mining
techniques.
Fig. 7 Research Objective among Domains
[Bar chart omitted. Vertical axis: No. of articles (0 to 12). Horizontal axis: research objectives (Biometric, Content Analysis, Cyber Crime, Disease Awareness, Finance, Geolocating, Quality Improvement, Risk Management, Semantic Analysis, Sentiment Analysis) grouped by domain (BM, EDU, GP, MH, SN).]
TABLE 5
STRENGTHS AND WEAKNESSES
DM Tech. | Strength | Article ID | Weakness | Article ID
SVM | One of the best techniques for solving classification problems. | A31, A41, A48, A49, A53, A55, A56, A66 | Suffers from a problem with sparse context links. | A34
SVM | Performs well with a high-dimensional feature space and a small training set size. | A66 | |
SVM | Suitable for offline clustering. | A60 | |
ANN | Self-Organizing Map (SOM): High-level capabilities that greatly facilitate high-dimensional data analysis. | A4, A14, A44 | Median SOM: Induces maps of lesser quality than maps obtained by the kernel version. | A14
ANN | SOM: Has visual benefits. | A4 | |
DT | Random Forest (RF): Effective in giving estimates of which variables are important in the classification. | A1 | |
DT | RF: Robust technique that performs well on a variety of learning tasks. | A33 | |
BN | Very effective for text clustering. | A3, A15 | |
BN | Simple classification algorithm. | A3, A41 | |
BN | Very efficient in terms of computation time. | A41 | |
k-NN | One of the simplest and most discriminative classifiers in pattern recognition. | A29 | Inferior performance on small datasets. Performance degrades for high-dimensional data. Dependent on the chosen feature and distance measure. | A43
Fuzzy | Specialized in modeling with vague modes of social reasoning and takes into account the stochastic component of human reasoning. | A18 | Requires expertise in the semantic web and fuzzy systems to manually handle the semantic fuzzy rule through an offline process. | A18
K-Means | k-medoids: Less sensitive to outliers. | A16 | Requires the number of clusters as an input. | A21, A64
K-Means | Uses as few clusters as possible and captures statistically and commercially important cluster characteristics. Suitable for a fixed number of groups with unknown characteristics based on variables that one defines. | A20 | When the number of clusters increases, the quality of discovered clusters quickly deteriorates. | A21
K-Means | Performs well at finding a very small number of clusters. | A21 | Often converges to a local minimum. | A32
K-Means | SK-Means: Efficient in terms of speed. Works well with high-dimensional datasets. Can be efficiently parallelized and converges to local maxima quickly. Can be a model, which allows it to be re-used in future classifications. | A35 | |
DBA | Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Does not require a pre-specified number of clusters and noise filtering. | A10 | DBSCAN: Includes all the density-reachable points to a cluster. | A10
DBA | Groups data based on their density connectivity. Treats noise as outliers, which are not included in any cluster. Capable of detecting arbitrary-shaped clusters. | A42, A60 | Unsuitable for some real-world applications, because there is no assumption about the number of clusters with fixed topics. | A42, A60
LDA | Characterizes documents in addition to data clustering. Useful for developing multimedia applications. Designed to exploit term frequency. | A40 | Often converges to a local minimum. | A32
LDA | | | Suffers from a problem with sparse context links. | A34
Wrapper | | | Web Wrapper: Requires high-level automation strategies. Wrapper maintenance becomes unsuitable if the pool of Web pages largely increases. | A65
HC | | | Does not scale with growing data size, because it relies on a fully specified similarity matrix. | A64
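Two of the K-Means weaknesses listed in Table 5, namely that the number of clusters must be supplied as an input and that the algorithm often converges to a local minimum, can be illustrated with a small sketch. The code below is a plain Lloyd-style K-Means on one-dimensional data written for this illustration; the function name `kmeans_1d`, the data points, and the initial centroids are all hypothetical and are not drawn from the reviewed articles.

```python
def kmeans_1d(points, initial_centroids, max_iter=100):
    """Lloyd-style K-Means on 1-D data; k is fixed by len(initial_centroids)."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Sum of squared distances from each point to its nearest centroid.
    sse = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, sse


# Three well-separated groups; k = 3 must be chosen by the caller.
points = [0, 1, 2, 10, 11, 12, 20, 21, 22]
good_centroids, good_sse = kmeans_1d(points, [1, 11, 21])
poor_centroids, poor_sse = kmeans_1d(points, [0, 1, 2])
print(good_centroids, good_sse)
print(poor_centroids, poor_sse)
```

Both runs use the same data and the same k, yet the second initialization settles on centroids 0, 1.5, and 16 with a sum of squared errors of 154.5 instead of 6, which is the local-minimum behaviour noted in the table. Running several initializations and keeping the best result is a common mitigation.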
APPENDIX (A)
FIG. A1 DATA MINING TECHNIQUES ITERATION PER YEAR
[Bar chart omitted. One axis lists the data mining techniques used in each year from 2008 to 2015; the other shows the iteration of data mining techniques (number of uses) on a scale of 0 to 10.]
TABLE A1
SELECTED ARTICLES
ID | Title | Year | Ref
A1 | #tag: Meme or Event? | 2014 | [19]
A2 | @Phillies tweeting from philly? Predicting twitter user locations with spatial word usage | 2012 | [20]
A3 | A framework for building web mining applications in the world of blogs: A case study in product sentiment analysis | 2012 | [21]
A4 | A Novel Data-Mining Approach Leveraging Social Media to Monitor Consumer Opinion of Sitagliptin | 2015 | [22]
A5 | A probabilistic generative model for mining cybercriminal networks from online social media | 2014 | [23]
A6 | A semantic triplet based story classifier | 2012 | [24]
A7 | An algorithm for local geoparsing of microtext | 2013 | [25]
A8 | An interests discovery approach in social networks based on semantically enriched graphs | 2012 | [26]
A9 | An Unsupervised Feature Selection Framework for Social Media Data | 2014 | [27]
A10 | Analyzing and visualizing web opinion development and social interactions with density-based clustering | 2011 | [28]
A11 | Analyzing the political landscape of 2012 Korean presidential election in twitter | 2014 | [29]
A12 | Anónimos: An LP-Based Approach for Anonymizing Weighted Social Network Graphs | 2012 | [30]
A13 | Ant colony based approach to predict stock market movement from mood collected on Twitter | 2013 | [31]
A14 | Batch kernel SOM and related Laplacian methods for social network analysis | 2008 | [32]
A15 | Bayesian filters for mobile recommender systems | 2011 | [33]
A16 | Big Data for Big Business? A Taxonomy of Data-driven Business Models used by Start-up Firms | 2014 | [34]
A17 | BTM: Topic Modeling over Short Texts | 2014 | [35]
A18 | Building dynamic social network from sensory data feed | 2010 | [36]
A19 | Business Intelligence from Social Media A Study from the VAST Box Office Challenge | 2014 | [37]
A20 | Classifying ecommerce information sharing behaviour by youths on social networking sites | 2011 | [38]
A21 | Clustering memes in social media | 2013 | [39]
A22 | Collaborative filtering based on collaborative tagging for enhancing the quality of recommendation | 2010 | [40]
A23 | Collaborative visual modeling for automatic image annotation via sparse model coding | 2012 | [41]
A24 | Confucius and its intelligent disciples: integrating social with search | 2010 | [42]
A25 | Content Feature Enrichment for Analyzing Trust Relationships in Web Forums | 2013 | [43]
A26 | Content Matters: A study of hate groups detection based on social networks analysis and web mining | 2013 | [44]
A27 | Co-training over Domain-independent and Domain-dependent features for sentiment analysis of an online cancer support community | 2013 | [45]
A28 | Data-Mining Twitter and the Autism Spectrum Disorder: A Pilot Study | 2014 | [46]
A29 | Decision Fusion for Multimodal Biometrics Using Social Network Analysis | 2014 | [47]
A30 | Detecting Deception in Online Social Networks | 2014 | [48]
A31 | Enhancing financial performance with social media: An impression management perspective | 2013 | [49]
A32 | Enriching short text representation in microblog for clustering | 2012 | [50]
A33 | Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics | 2011 | [51]
A34 | Exploring Context and Content Links in Social Media: A Latent Space Method | 2012 | [52]
A35 | Gaining customer knowledge in low cost airlines through text mining | 2014 | [53]
A36 | Intelligent Social Media Indexing and Sharing Using an Adaptive Indexing Search Engine | 2012 | [54]
A37 | Latent Co-interests' Relationship Prediction | 2013 | [55]
A38 | Learning by expansion: Exploiting social media for image classification with few training examples | 2012 | [56]
A39 | Learning Stochastic Models of Information Flow | 2012 | [57]
A40 | Mining Crowdsourced First Impressions in Online Social Video | 2014 | [58]
A41 | Mining Social Media Data for Understanding Students' Learning Experiences | 2014 | [59]
A42 | Mining spatio-temporal information on microblogging streams using a density-based online clustering method | 2012 | [60]
A43 | Nearest-neighbor method using multiple neighborhood similarities for social media data mining | 2012 | [61]
A44 | Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care | 2015 | [62]
A45 | OMG U got flu? Analysis of shared health messages for bio-surveillance | 2011 | [63]
A46 | Optimizing an organized modularity measure for topographic graph clustering: A deterministic annealing approach | 2010 | [64]
A47 | Predicting Time-sensitive User Locations from Social Media | 2013 | [65]
A48 | Resource discovery through social tagging: a classification and content analytic approach | 2009 | [66]
A49 | Rumors Detection in Chinese via Crowd Responses | 2014 | [67]
A50 | Search engine reinforced semi-supervised classification and graph-based summarization of microblogs | 2015 | [68]
A51 | Sentimental causal rule discovery from Twitter | 2014 | [69]
A52 | Social Network Analysis in Enterprise | 2012 | [70]
A53 | Spreading Social Media Messages on Facebook: An Analysis of Restaurant Business-to-Consumer Communications | 2013 | [71]
A54 | Studying user footprints in different online social networks | 2012 | [72]
A55 | The Information Ecology of Social Media and Online Communities | 2008 | [73]
A56 | The potential of social media in delivering transport policy goals | 2014 | [74]
A57 | The social media genome: modeling individual topic-specific behavior in social media | 2013 | [75]
A58 | Topic-sensitive influencer mining in interest-based social media networks via hypergraph learning | 2014 | [76]
A59 | Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media | 2012 | [77]
A60 | Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams | 2012 | [78]
A61 | Using explicit linguistic expressions of preference in social media to predict voting behavior | 2013 | [79]
A62 | Using inter-comment similarity for comment spam detection in Chinese blogs | 2011 | [80]
A63 | Using Sentiment to Detect Bots on Twitter: Are Humans more Opinionated than Bots? | 2014 | [81]
A64 | Using social media to enhance emergency situation awareness | 2012 | [82]
A65 | Web data extraction, applications and techniques: A survey | 2014 | [83]
A66 | What's in twitter: I know what parties are popular and who you are supporting now! | 2012 | [84]
TABLE A2
QARS MARKS FOR THE SELECTED ARTICLES
ID
QAR1
QAR2
QAR3
QAR4
QAR5
QAR6
QAR7
QAR8
QAR9
QAR10
Total
A1
0.75
0.25
0.25
0.75
0.75
1
1
0.75
0.75
0.75
7
A2
0.75
0
0.25
0.5
0.75
0.75
0
0.75
0.5
0.75
5
A3
1
0.75
0.75
0.75
0.75
0.5
0.25
0.75
0.25
0.5
6.25
A4
1
0.75
1
0.75
1
0.75
1
0.5
0.25
0.75
7.75
A5
1
1
0.5
1
1
1
1
1
0.75
0.75
9
A6
0.75
0.25
1
0.5
0.75
0.25
0
0.5
0.25
0.75
5
A7
1
0.5
0.75
0.75
0.5
0.75
0.5
0.75
0.25
0.5
6.25
A8
0.75
0
0.25
0.5
0.75
0.5
0.5
0.75
0.25
0.75
5
A9
1
0.75
1
0.75
1
1
1
1
0.5
0.75
8.75
A10
1
1
1
0.75
0.75
1
1
0.75
0.25
0.5
8
A11
1
1
1
0.75
0.75
0.75
0.75
0.75
0.25
0.5
7.5
A12
1
0.75
1
0.75
0.75
0.5
1
0.75
0.25
0.5
7.25
A13
0.75
0.5
0.5
0.5
0.5
0.25
0.25
0.75
0.25
0.75
5
A14
0.75
0.25
1
0.75
0.5
0.75
0.25
0.75
0.75
0.75
6.5
A15
0.75
0.5
0.75
0.5
0.75
0.25
0.25
0.5
0.25
0.75
5.25
MohammadNoor Injadat, Fadi Salo and Ali Bou Nassif, Data Mining Techniques in Social Media: A
Survey, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.06.045
ID
QAR1
QAR2
QAR3
QAR4
QAR5
QAR6
QAR7
QAR8
QAR9
QAR10
Total
A16
0.75
0.75
0.5
0.75
0.5
0.75
0
0.75
0.5
0.5
5.75
A17
1
0.75
0.75
0.75
1
1
0.5
0.5
0.25
0.75
7.25
A18
1
0.75
1
0.75
1
0.5
0.5
0.5
0.25
0.5
6.75
A19
1
0.75
0.75
1
0.75
1
1
1
0.75
0.75
8.75
A20
0.75
0.75
0.5
1
1
1
0
1
1
1
8
A21
0.5
0.5
0.5
0.75
0.75
1
1
0.75
0.5
0.75
7
A22
1
1
0.75
0.75
0.5
1
0
0.75
0.5
0.75
7
A23
0.75
0
0.25
0.5
0.75
0.75
0.75
0.75
0.5
0.75
5.75
A24
1
0.25
0.25
1
0.25
0.75
1
0.75
0.75
1
7
A25
0.75
0.25
0.25
1
0.25
0.75
1
0.75
1
0.75
6.75
A26
0.75
1
0.25
0.75
0.75
1
1
0.75
0.75
0.75
7.75
A27
0.75
0.25
0.25
0.5
0.75
0.75
0.5
0.5
0.5
0.75
5.5
A28
0.75
0
0.25
0.5
0.75
0.75
1
0.75
0.5
0.75
6
A29
1
0.75
0.75
1
1
0.75
0.75
0.75
0.25
0.75
7.75
A30
0.75
0.5
0.5
0.75
0.75
0
0
0.75
0.25
0.75
5
A31
1
1
1
0.75
1
1
1
1
1
1
9.75
A32
0.75
0.75
0.75
1
1
1
0.5
0.75
0.5
0.5
7.5
A33
1
0.5
0.5
1
0.75
0.75
0.75
1
0.5
1
7.75
A34
1
0.75
0.75
1
0.75
1
1
1
0.75
0.75
8.75
A35
1
1
1
1
0.75
1
0.75
0.75
0.25
1
8.5
A36
1
0.25
0.25
0.75
0.75
0.75
0
1
0.5
0.5
5.75
A37
1
1
0.75
0.75
0.75
0.75
1
0.75
0.5
0.5
7.75
A38
0.75
0.75
0.25
0.75
0.5
0.75
0.5
0.5
0.5
0.5
5.75
A39
0.75
0.75
0.75
0.75
0.5
0.25
0.25
0.75
0.25
0.75
5.75
A40
1
1
1
1
0.75
1
0
0.75
1
0.5
8
A41
1
1
1
1
0.75
1
0.75
1
0.5
0.75
8.75
A42
1
1
1
1
0.75
0.75
0
0.75
0.5
0.75
7.5
A43
0.75
0.75
0.75
0.5
0.75
0.75
0.5
0.5
0.5
0.5
6.25
A44
1
0.75
0.75
0.75
0.5
0.75
0.75
0.75
0.5
0.75
7.25
A45
1
1
1
1
0.75
1
1
1
0.75
0.75
9.25
A46
0.75
0
0.75
0.75
0.75
0.75
0.5
0.75
0.5
0.75
6.25
A47
0.75
0.75
0.5
0.5
0.75
1
0.5
0.75
0.5
0.75
6.75
A48
1
0.75
0.75
0.5
0.75
0.75
0
0.75
0.75
0.5
6.5
A49
0.75
0.5
0.75
0.75
0.5
0.5
0
0.5
0.25
0.75
5.25
A50
0.75
0
0.25
0.75
0.5
1
0.75
0.75
0.75
0.75
6.25
A51
1
0.5
0.5
0.5
0.75
0.5
0.75
0.5
0.25
0.5
5.75
A52
1
0.5
0.75
0.5
0.75
0.75
0
0.5
0.25
0.5
5.5
A53
1
1
0.75
0.75
1
1
1
1
1
0.75
9.25
A54
0.5
0.75
0.25
0.5
0.5
0.75
0
0.75
0.5
0.75
5.25
A55
1
1
0.75
1
0.75
0.75
0
0.75
0.5
0.75
7.25
A56
1
1
0.75
0.75
0.75
0.75
0
0.75
0.25
0.5
6.5
A57
0.5
0.5
0.25
0.75
0.75
0.5
0.25
0.5
0.5
0.5
5
A58
1
0.75
0.75
0.75
0.75
1
0.75
1
0.75
0.5
8
A59
1
1
1
1
1
1
1
0.75
0.75
0.5
9
MohammadNoor Injadat, Fadi Salo and Ali Bou Nassif, Data Mining Techniques in Social Media: A
Survey, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.06.045
ID
QAR1
QAR2
QAR3
QAR4
QAR5
QAR6
QAR7
QAR8
QAR9
QAR10
Total
A60
1
1
0.75
1
0.75
1
1
0.5
0.25
0.75
8
A61
0.75
0.5
0.25
0.5
0.75
0.75
0
0.75
0.75
0.5
5.5
A62
0.75
0.75
0.5
0.75
0.5
0.75
0.75
0.75
0.75
0.5
6.75
A63
0.75
0.5
0.25
0.75
0.75
0.25
0.5
0.25
0.25
0.75
5
A64
1
1
0.75
0.5
0.75
0.5
0.5
0.5
0.25
0.5
6.25
A65
1
1
1
0.75
1
0.75
0
0.5
0.25
0.5
6.75
A66
0.75
0.75
0.5
0.75
0.5
0.5
0.75
0.5
0.5
0.75
6.25
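The Total column of Table A2 appears to be the sum of the ten QAR marks in each row. The short sketch below checks that reading on three rows transcribed from the table; the helper name `check_totals` is hypothetical, and the check is only an illustration, not part of the original quality assessment procedure.

```python
# Three rows transcribed from Table A2: marks for QAR1..QAR10 and the reported total.
qar_rows = {
    "A1": ([0.75, 0.25, 0.25, 0.75, 0.75, 1, 1, 0.75, 0.75, 0.75], 7),
    "A5": ([1, 1, 0.5, 1, 1, 1, 1, 1, 0.75, 0.75], 9),
    "A31": ([1, 1, 1, 0.75, 1, 1, 1, 1, 1, 1], 9.75),
}


def check_totals(rows):
    """Return the IDs whose reported total differs from the sum of the ten marks."""
    return [
        article_id
        for article_id, (marks, total) in rows.items()
        if sum(marks) != total
    ]


mismatches = check_totals(qar_rows)
print(mismatches)
```

All marks are multiples of 0.25, which binary floating point represents exactly, so a direct equality comparison is safe here. For the three rows shown, the list of mismatches is empty.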
TABLE A3
RQS ANSWERED BY ARTICLES
ID
RQ1
RQ2
RQ3
RQ4
RQ5
A1
1
1
0
1
0
A2
1
1
0
0
0
A3
1
1
0
0
1
A4
1
1
0
0
1
A5
1
1
1
1
0
A6
1
1
0
0
0
A7
1
1
0
0
0
A8
1
1
0
0
0
A9
1
1
0
0
0
A10
1
1
0
1
1
A11
1
1
1
0
0
A12
1
1
0
1
1
A13
1
1
0
0
0
A14
1
1
0
0
1
A15
1
1
0
0
0
A16
1
1
0
1
1
A17
1
1
0
1
1
A18
1
1
0
0
1
A19
1
1
1
0
0
A20
1
1
0
0
1
A21
1
1
0
1
1
A22
1
1
0
0
0
A23
1
1
0
1
0
A24
1
1
0
1
0
A25
1
1
0
1
0
A26
1
1
0
1
0
A27
1
1
0
0
0
A28
1
1
0
1
0
A29
1
1
0
1
1
A30
1
1
0
0
0
A31
1
1
0
1
0
A32
1
1
0
1
1
A33
1
1
0
1
0
A34
1
1
0
1
1
MohammadNoor Injadat, Fadi Salo and Ali Bou Nassif, Data Mining Techniques in Social Media: A
Survey, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.06.045
ID
RQ1
RQ2
RQ3
RQ4
RQ5
A35
1
1
0
1
1
A36
1
1
0
0
1
A37
1
1
0
1
1
A38
1
1
0
0
0
A39
1
1
0
0
0
A40
1
1
0
0
1
A41
1
1
0
1
0
A42
1
1
0
1
1
A43
1
1
0
1
1
A44
1
1
0
1
1
A45
1
1
0
1
0
A46
1
1
0
1
0
A47
1
1
0
1
0
A48
1
1
0
0
1
A49
1
1
0
0
0
A50
1
1
0
1
0
A51
1
1
0
1
0
A52
1
1
0
0
1
A53
1
1
1
0
1
A54
1
1
0
1
0
A55
1
1
0
0
1
A56
1
1
0
0
1
A57
1
1
0
0
0
A58
1
1
0
1
0
A59
1
1
0
1
0
A60
1
1
0
0
1
A61
1
1
0
0
0
A62
1
1
0
1
0
A63
1
1
0
1
0
A64
1
1
0
1
1
A65
1
1
0
0
1
A66
1
1
0
1
1
TABLE A4
ARTICLES PERCENTAGE PER JOURNAL
Publication Venue | Type | Freq. | %
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY | Jour. | 2 | 3
AI MAGAZINE | Jour. | 1 | 2
CAMBRIDGE SERVICE ALLIANCE BLOG | Jour. | 1 | 2
CORNELL HOSPITALITY QUARTERLY | Jour. | 1 | 2
DECISION SUPPORT SYSTEMS | Jour. | 1 | 2
ELECTRONIC COMMERCE RESEARCH AND APPLICATIONS | Jour. | 1 | 2
EXPERT SYSTEMS WITH APPLICATIONS | Jour. | 4 | 6
FRONTIERS OF COMPUTER SCIENCE IN CHINA | Jour. | 1 | 2
GEOINFORMATICA | Jour. | 1 | 2
IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | Jour. | 1 | 2
IEEE COMPUTER GRAPHICS AND APPLICATIONS | Jour. | 1 | 2
IEEE INTELLIGENT SYSTEMS | Jour. | 2 | 3
IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE) | Conf. | 1 | 2
IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS | Jour. | 2 | 3
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT | Jour. | 1 | 2
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | Jour. | 4 | 6
IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES | Jour. | 1 | 2
IEEE TRANSACTIONS ON MULTIMEDIA | Jour. | 2 | 3
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | Jour. | 1 | 2
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS | Jour. | 2 | 3
IEEE/ACM INTERNATIONAL CONFERENCE ON ADVANCES IN SOCIAL NETWORKS ANALYSIS AND MINING | Conf. | 20 | 27
INDUSTRIAL MANAGEMENT & DATA SYSTEMS | Jour. | 1 | 2
JOURNAL OF BIOMEDICAL SEMANTICS | Jour. | 1 | 2
JOURNAL OF INFORMATION SCIENCE | Jour. | 1 | 2
KNOWLEDGE-BASED SYSTEMS | Jour. | 1 | 2
NEUROCOMPUTING | Jour. | 6 | 9
ONLINE INFORMATION REVIEW | Jour. | 1 | 2
PROCEEDINGS OF THE IEEE | Jour. | 1 | 2
PROCEEDINGS OF THE VLDB ENDOWMENT | Jour. | 1 | 2
TRANSPORT POLICY | Jour. | 1 | 2
TSINGHUA SCIENCE AND TECHNOLOGY | Jour. | 1 | 2
Grand Total | | 66 | 100
TABLE A5
DOMAINS FREQUENCIES AMONG ARTICLES
Domain | Frequency | %
Business and Management | 11 | 17%
Education | 1 | 2%
Finance | 2 | 3%
Government and Public | 6 | 9%
Medical and health | 5 | 8%
Social Networks | 41 | 62%
Grand Total | 66 | 100%
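The percentages in Table A5 follow from dividing each domain frequency by the 66 selected articles and rounding to the nearest whole percent. A minimal sketch of that arithmetic is given below; the variable names are illustrative only.

```python
# Domain frequencies as listed in Table A5.
domain_counts = {
    "Business and Management": 11,
    "Education": 1,
    "Finance": 2,
    "Government and Public": 6,
    "Medical and health": 5,
    "Social Networks": 41,
}

total_articles = sum(domain_counts.values())

# Share of each domain, rounded to the nearest whole percent.
percentages = {
    domain: round(100 * count / total_articles)
    for domain, count in domain_counts.items()
}

for domain, pct in percentages.items():
    print(f"{domain}: {domain_counts[domain]} ({pct}%)")
print(f"Total: {total_articles}")
```

The rounded shares reproduce the table entries (17%, 2%, 3%, 9%, 8%, 62%). Because each share is rounded separately, they add up to 101 rather than 100; the 100% grand total refers to the unrounded shares.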
TABLE A6
POPULARITY OF VARIOUS SOCIAL MEDIA APPLICATION IN RESEARCHES
Social Media Application | Frequency | Ref
Blogs | 8 | A6, A12, A18, A22, A31, A37, A48, A55
Forums and Discussion Boards | 9 | A4, A5, A6, A24, A25, A27, A31, A44, A52
Microblogging | 31 | A1, A2, A5, A7, A9, A11, A13, A15, A17, A21, A28, A30, A32, A35, A39, A41, A42, A45, A47, A49, A50, A51, A54, A57, A59, A60, A61, A62, A63, A64, A66
Product Reviews | 1 | A33
Social Networks | 12 | A8, A10, A12, A14, A15, A18, A26, A32, A46, A53, A54, A59
Video and Photo sharing | 11 | A9, A12, A15, A23, A29, A34, A36, A38, A40, A43, A58
TABLE A7
NOVEL TECHNIQUES FEATURES
ID | Novel Tech. | Features | Compared with
A12 | Anónimos | Applied to preserve linear properties by generation of inequalities corresponding to decisions made by the algorithm during its execution. Preserve multiple linear properties in a single anonymized graph. |
A17 | Biterm Topic Model (BTM) | Capture the topics within short texts by explicitly modeling word co-occurrence patterns in the whole corpus. Discover more prominent and coherent topics than the state-of-the-art competitors. | Outperforms the online LDA in terms of effectiveness.
A37 | Interest-based Factor Graph Model (I-FGM) | Proposed to take both network topology and node features into consideration. Makes the most of the strong inference abilities of the probability model and the graph model. |
A58 | Topic-Sensitive Influencer Mining (TSIM) | Aims to find the influential nodes in the networks. Improves the performance significantly in the applications of friends' suggestion and photo recommendation. | Outperforms LDA in terms of friends' suggestion and photo recommendation.
A34 | Latent Space Method | Discovers the latent semantic space from both context and content links in multimedia information networks. Solve the problem with sparse context. The learned latent semantic space can be applied for many applications, such as multimedia annotation and retrieval. | Extends the traditional LSI algorithm by low-rank approximation.
A32 | Novel | The proposed framework performs language knowledge integration and feature reduction simultaneously. Improves the short texts clustering performance. Scales linearly with the number of short texts and the number of integrated languages. |
A64 | Online Incremental Clustering Algorithm | Provide useful situation awareness information through a set of tightly integrated components. Enhance timely situation awareness across a range of crisis types. | Resolves the weaknesses in K-Means and EM.
A10 | Scalable Distance-Based Clustering (SDC) | Proposed SDC technique for Web opinion clustering. Ensures that a required density must be reached in the initial clusters and uses scalable distances to expand the initial clusters. Does not require a predefined number of clusters. Able to filter noise. |
A29 | Decision Fusion for Multimodal Biometrics | Reduces the false acceptance rate for both single biometric traits and multimodal biometrics when the social network analysis is employed. Independently classify an actor from the relationship among actors. |
A9 | Unsupervised Feature Selection Framework (LUFS) | Exploit link information effectively in comparison with the state-of-the-art unsupervised feature selection methods. |
A43 | Neighborhood Similarity Measure | Encodes both the local density information and semantic information. Enhances the scalability to conduct approximated nearest neighbor search. Enhance the robustness on diversified genres of images. | Outperforms the k-NN methods using the labeled data only.
A8 | Semantic Social Graph (SSG) | Discovers the implicit semantic relations between entities in text messages. Enriches graph representation of entities contained in text messages generated by a user. | Significantly outperforms Naive Bayes classifier in accuracy and reliability.
ACKNOWLEDGEMENT:
MohammadNoor Injadat and Fadi Salo would like to thank the University of Western Ontario for
supporting this research.
Dr. Ali Bou Nassif would like to thank the University of Sharjah for supporting this research.
REFERENCES
[1] A.L. Kavanaugh, E.A. Fox, S.D. Sheetz, S. Yang, L.T. Li, D.J. Shoemaker, et al., Social media use by government: From the routine to the critical, Gov. Inf. Q. 29 (2012) 480–491. doi:10.1016/j.giq.2012.06.002.
[2] H. Chen, R.H.L. Chiang, V.C. Storey, Business Intelligence and Analytics: From Big Data To Big Impact, MIS Q. 36 (2012) 1165–1188.
[3] M. Zuber, A Survey of Data Mining Techniques for Social Network Analysis, Int. J. Res. Comput. Eng. Electron. 3 (2014) 1–8.
[4] S. Yu, S. Kak, A survey of prediction using social media, arXiv Prepr. arXiv1203.1647. (2012) 1–20. http://arxiv.org/abs/1203.1647.
[5] V. Vuori, J. Väisänen, The use of social media in gathering and sharing competitive intelligence, ICEB 2009 Proc. (2009) 1–8.
[6] P.C. Rafeeque, S. Sendhilkumar, A survey on Short text analysis in Web, 2011 Third Int. Conf. Adv. Comput. (2011) 365–371. doi:10.1109/ICoAC.2011.6165203.
[7] M. Tsytsarau, T. Palpanas, Survey on mining subjective data on the web, Data Min. Knowl. Discov. 24 (2012) 478–514. doi:10.1007/s10618-011-0238-6.
[8] S. Gole, B. Tidke, A survey of Big Data in social media using data mining techniques, in: 2015 Int. Conf. Adv. Comput. Commun. Syst. (ICACCS -2015), 2015: pp. 1–5. doi:10.1109/ICACCS.2015.7324059.
[9] B. Kitchenham, S. Charters, Guidelines for performing Systematic Literature Reviews in Software Engineering, Tech. Rep. EBSE-2007-01, Keele Univ. Univ. Durham. (2007). doi:10.1145/1134285.1134500.
[10] D. Hand, Statistics and data mining: intersecting disciplines, ACM SIGKDD Explor. Newsl. 1 (1999) 16–19. doi:10.1145/846170.846171.
[11] A. Berson, S.J. Smith, Building Data Mining Applications for CRM, McGraw-Hill, Inc., New York, NY, USA, 2002.
[12] X. Wu, V. Kumar, The top ten algorithms in data mining, CRC Press, 2009.
[13] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, et al., Top 10 algorithms in data mining, Knowl. Inf. Syst. 14 (2008) 1–37. doi:10.1007/s10115-007-0114-2.
[14] D.M. Boyd, N.B. Ellison, Social network sites: Definition, history, and scholarship, J. Comput. Commun. 13 (2007) 210–230. doi:10.1111/j.1083-6101.2007.00393.x.
[15] M.G. Smith, L. Bull, Feature construction and selection using genetic programming and a genetic algorithm, in: Genet. Program., Springer, 2003: pp. 229–237.
[16] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, others, Knowledge Discovery and Data Mining: Towards a Unifying Framework., in: KDD, 1996: pp. 82–88.
[17] B. Ratner, Statistical and machine-learning data mining: Techniques for better predictive modeling and analysis of big data, CRC Press, 2011.
[18] D. Pohl, A. Bouchachia, H. Hellwagner, Social media for crisis management: clustering approaches for sub-event detection, Multimed. Tools Appl. (2013) 1–32. doi:10.1007/s11042-013-1804-2.
[19] D. Kotsakos, P. Sakkos, I. Katakis, D. Gunopulos, #tag: Meme or Event?, in: 2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2014: pp. 391–394. doi:10.1109/ASONAM.2014.6921615.
[20] H.W. Chang, D. Lee, M. Eltaher, J. Lee, Phillies tweeting from philly? Predicting twitter user locations with spatial word usage, in: Proc. 2012 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 111–118. doi:10.1109/ASONAM.2012.29.
[21] E. Costa, R. Ferreira, P. Brito, I.I. Bittencourt, O. Holanda, A. MacHado, et al., A framework for building web mining applications in the world of blogs: A case study in product sentiment analysis, Expert Syst. Appl. 39 (2012) 4813–4834. doi:10.1016/j.eswa.2011.09.135.
[22] A. Akay, A. Dragomir, A Novel Data-Mining Approach Leveraging Social Media to Monitor Consumer Opinion of Sitagliptin, IEEE J. Biomed. Heal. Informatics. 19 (2015) 389–396. doi:10.1109/JBHI.2013.2295834.
[23] R.Y.K. Lau, Y. Xia, Y. Ye, A probabilistic generative model for mining cybercriminal networks from online social media, IEEE Comput. Intell. Mag. 9 (2014) 31–43. doi:10.1109/MCI.2013.2291689.
[24] B. Ceran, R. Karad, A. Mandvekar, S.R. Corman, H. Davulcu, A semantic triplet based story classifier, in: Proc. 2012 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 573–580. doi:10.1109/ASONAM.2012.97.
[25] J. Gelernter, S. Balaji, An algorithm for local geoparsing of microtext, Geoinformatica. 17 (2013) 635–667. doi:10.1007/s10707-012-0173-8.
[26] A. Al-Kouz, S. Albayrak, An interests discovery approach in social networks based on semantically enriched graphs, in: Proc. 2012 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 1272–1277. doi:10.1109/ASONAM.2012.219.
[27] J. Tang, H. Liu, An Unsupervised Feature Selection Framework for Social Media Data, IEEE Trans. Knowl. Data Eng. 4347 (2014) 2914–2927. doi:10.1109/TKDE.2014.2320728.
[28] C.C. Yang, T.D. Ng, Analyzing and visualizing web opinion development and social interactions with density-based clustering, IEEE Trans. Syst. Man, Cybern. Part A Systems Humans. 41 (2011) 1144–1155. doi:10.1109/TSMCA.2011.2113334.
[29] M. Song, M.C. Kim, Y.K. Jeong, Analyzing the political landscape of 2012 korean presidential election in twitter, IEEE Intell. Syst. 29 (2014) 18–26. doi:10.1109/MIS.2014.20.
[30] S. Das, Ö. Eǧecioǧlu, A. El Abbadi, Anónimos: An LP-based approach for anonymizing weighted social network graphs, IEEE Trans. Knowl. Data Eng. 24 (2012) 590–604. doi:10.1109/TKDE.2010.267.
[31] S. Bouktif, M.A. Awad, Ant colony based approach to predict stock market movement from mood collected on Twitter, in: 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min. Ant, 2013: pp. 837–845. doi:10.1145/2492517.2500282.
[32] R. Boulet, B. Jouve, F. Rossi, N. Villa, Batch kernel SOM and related Laplacian methods for social network analysis, Neurocomputing. 71 (2008) 1257–1273. doi:10.1016/j.neucom.2007.12.026.
[33] M. Saravanan, S. Buveneswari, S. Divya, V. Ramya, Bayesian filters for mobile recommender systems, in: Proc. - 2011 Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2011, 2011: pp. 715–721. doi:10.1109/ASONAM.2011.51.
[34] P.M. Hartmann, M. Zaki, N. Feldmann, A. Neely, Big Data for Big Business? A Taxonomy of Data-driven Business Models used by Start-up Firms, Cambridge Serv. Alliance Blog. (2014) 1–29. http://cambridgeservicealliance.blogspot.co.uk/2014/04/big-data-for-big-business_3.html.
[35] X. Cheng, X. Yan, Y. Lan, J. Guo, BTM: Topic Modeling over Short Texts, IEEE Trans. Knowl. Data Eng. 26 (2014) 2928–2941. doi:10.1109/TKDE.2014.2313872.
[36] M.A. Rahman, A. El Saddik, W. Gueaieb, Building dynamic social network from sensory data feed, IEEE Trans. Instrum. Meas. 59 (2010) 1327–1341. doi:10.1109/TIM.2009.2038307.
[37] B.I. Analytics, Business Intelligence from Social Media A Study from the VAST Box Office Challenge, IEEE Comput. Graph. Appl. 34 (2014) 58–69. doi:10.1109/MCG.2014.61.
[38] B.J. Jansen, K. Sobel, G. Cook, Classifying ecommerce information sharing behaviour by youths on social networking sites, J. Inf. Sci. 37 (2011) 120–136. doi:10.1177/0165551510396975.
[39] E. Ferrara, M. JafariAsbagh, O. Varol, V. Qazvinian, F. Menczer, A. Flammini, Clustering memes in social media, in: Proc. 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min. - ASONAM '13, 2013: pp. 548–555. doi:10.1145/2492517.2492530.
[40] H.-N. Kim, A.-T. Ji, I. Ha, G.-S. Jo, Collaborative filtering based on collaborative tagging for enhancing the quality of recommendation, Electron. Commer. Res. Appl. 9 (2010) 73–83. doi:10.1016/j.elerap.2009.08.004.
[41] M. Wang, F. Li, M. Wang, Collaborative visual modeling for automatic image annotation via sparse model coding, Neurocomputing. 95 (2012) 22–28. doi:10.1016/j.neucom.2011.04.049.
[42] X. Si, E.Y. Chang, Z. Gyöngyi, M. Sun, Confucius and its intelligent disciples: integrating social with search, Proc. VLDB Endow. 3 (2010) 1505–1516. doi:10.1145/1645953.1645955.
[43] J. Piorkowski, L. Zhou, Content Feature Enrichment for Analyzing Trust Relationships in Web Forums, in: 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min. Content, 2013: pp. 1486–1487.
[44] I. Ting, S. Wang, Content Matters: A study of hate groups detection based on social networks analysis and web mining, in: 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2013: pp. 1196–1201. doi:10.1145/2492517.2500254.
[45] P. Biyani, C. Caragea, P. Mitra, C. Zhou, J. Yen, G.E. Greer, et al., Co-training over Domain-independent and Domain-dependent features for sentiment analysis of an online cancer support community, in: 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2013, August 25, 2013 - August 28, 2013, 2013: pp. 413–417. doi:10.1145/2492517.2492606.
[46] A. Beykikhoshk, T. Caelli, Data-Mining Twitter and the Autism Spectrum Disorder: A Pilot Study, in: 2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2014: pp. 349–356.
[47] P.P. Paul, M.L. Gavrilova, R. Alhajj, Decision Fusion for Multimodal Biometrics Using Social Network Analysis, IEEE Trans. Syst. Man, Cybern. Syst. 44 (2014) 1522–1533.
[48] J.S. Alowibdi, U.A. Buy, P.S. Yu, L. Stenneth, Detecting Deception in Online Social Networks, in: 2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2014: pp. 383–390.
[49] D. Schniederjans, E.S. Cao, M. Schniederjans, Enhancing financial performance with social media: An impression management
perspective, Decis. Support Syst. 55 (2013) 911918. doi:10.1016/j.dss.2012.12.027.
[50] J. Tang, X. Wang, H. Gao, X. Hu, H. Liu, Enriching short text representation in microblog for clustering, Front. Comput. Sci. China. 6
(2012) 88101. doi:10.1007/s11704-011-1167-7.
[51] A. Ghose, P.G. Ipeirotis, Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics,
IEEE Trans. Knowl. Data Eng. 23 (2011) 14981512. doi:10.1109/TKDE.2010.188.
[52] G. Qi, C. Aggarwal, Q. Tian, S. Member, Exploring Context and Content Links in Social Media: A Latent Space Method, IEEE Trans.
Pattern Anal. Mach. Intell. 34 (2012) 850862.
[53] B. Yee Liau, P. Pei Tan, Gaining customer knowledge in low cost airlines through text mining, Ind. Manag. Data Syst. 114 (2014)
13441359. doi:10.1108/IMDS-07-2014-0225.
[54] C.H.C. Leung, A.W.S. Chan, A. Milani, J. Liu, Y. Li, Intelligent Social Media Indexing and Sharing Using an Adaptive Indexing
Search Engine, ACM Trans. Intell. Syst. Technol. 3 (2012) 127. doi:10.1145/2168752.2168761.
[55] F. Tan, L. Li, Z. Zhang, Y. Guo, Latent Co-interests ’ Relationship Prediction, Tsinghua Sci. Technol. 18 (2013) 379–386.
[56] S.Y. Wang, W.S. Liao, L.C. Hsieh, Y.Y. Chen, W.H. Hsu, Learning by expansion: Exploiting social media for image classification
with few training examples, Neurocomputing. 95 (2012) 117125. doi:10.1016/j.neucom.2011.05.043.
[57] L. Dickens, I. Molloy, J. Lobo, Learning Stochastic Models of Information Flow, in: 2012 IEEE 28th Int. Conf. Data Eng., 2012: pp.
570581.
[58] J. Biel, D. Gatica-perez, Mining Crowdsourced First Impressions in Online Social Video, IEEE Trans. ONMULTIMEDIA. 16 (2014)
20622074.
[59] X. Chen, M. Vorvoreanu, K. Madhavan, Mining Social Media Data for Understanding Students’ Learning Experiences, IEEE Trans.
Learn. Technol. 7 (2014) 246259. http://web.ics.purdue.edu/~chen654/pub/XinChen_etal_IEEETrans_tlt-
cs_Mining_Twitter.pdf\nhttp://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6697807.
[60] C.H. Lee, Mining spatio-temporal information on microblogging streams using a density-based online clustering method, Expert Syst.
Appl. 39 (2012) 96239641. doi:10.1016/j.eswa.2012.02.136.
[61] S. Wang, Q. Huang, S. Jiang, Q. Tian, L. Qin, Nearest-neighbor method using multiple neighborhood similarities for social media data
mining, Neurocomputing. 95 (2012) 105116. doi:10.1016/j.neucom.2011.06.039.
[62] A. Akay, A. Dragomir, B.-E. Erlandsson, Network-Based Modeling and Intelligent Data Mining of Social Media for Improving Care,
IEEE J. Biomed. Heal. INFORMATICS. 19 (2015) 210218.
[63] N. Collier, N.T. Son, N.M. Nguyen, OMG U got flu? Analysis of shared health messages for bio-surveillance, J. Biomed. Semantics. 2
(2011) 110. doi:10.1186/2041-1480-2-S5-S9.
[64] F. Rossi, N. Villa-Vialaneix, Optimizing an organized modularity measure for topographic graph clustering: A deterministic annealing
approach, Neurocomputing. 73 (2010) 11421163. doi:10.1016/j.neucom.2009.11.023.
[65] A. Jaiswal, W. Peng, T. Sun, Predicting Time-sensitive User Locations from Social Media, in: 2013 IEEE/ ACM Int. Conf. Adv. Soc.
Networks Anal. Min., 2013: pp. 870877. doi:10.1145/2492517.2500229.
[66] D.H.-L. Goh, A. Chua, C.S. Lee, K. Razikin, Resource discovery through social tagging: a classification and content analytic approach,
Online Inf. Rev. 33 (2009) 568583. doi:10.1108/14684520910969961.
[67] G. Cai, H. Wu, R. Lv, Rumors Detection in Chinese via Crowd Responses, in: 2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal.
Min., 2014: pp. 912917.
[68] Y. Chen, X. Zhang, Z. Li, J.-P. Ng, Search engine reinforced semi-supervised classification and graph-based summarization of
microblogs, Neurocomputing. 152 (2015) 274286. doi:10.1016/j.neucom.2014.10.068.
[69] R. Dehkharghani, H. Mercan, A. Javeed, Y. Saygin, Sentimental causal rule discovery from Twitter, Expert Syst. Appl. 41 (2014)
49505958. doi:10.1016/j.eswa.2014.02.024.
[70] C.-Y. Lin, L. Wu, Z. Wen, H. Tong, V. Griffiths-Fisher, L. Shi, et al., Social Network Analysis in Enterprise, Proc. IEEE. 100 (2012)
27592776. doi:10.1109/JPROC.2012.2203090.
[71] L. Kwok, B. Yu, Spreading Social Media Messages on Facebook: An Analysis of Restaurant Business-to-Consumer Communications,
Cornell Hosp. Q. 54 (2013) 8494. doi:10.1177/1938965512458360.
[72] A. Malhotra, L. Totti, W. Meira, P. Kumaraguru, V. Almeida, Studying user footprints in different online social networks, Proc. 2012
MohammadNoor Injadat, Fadi Salo and Ali Bou Nassif, Data Mining Techniques in Social Media: A
Survey, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.06.045
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012. (2012) 10651070. doi:10.1109/ASONAM.2012.184.
[73] T. Finin, A. Joshi, P. Kolari, A. Java, A. Kale, A. Karandikar, The Information Ecology of Social Media and Online Communities, AI
Mag. 29 (2008) 7792. doi:10.1609/aimag.v29i3.2158.
[74] A. Gal-Tzur, S.M. Grant-Muller, T. Kuflik, E. Minkov, S. Nocera, I. Shoor, The potential of social media in delivering transport policy
goals, Transp. Policy. 32 (2014) 115123. doi:10.1016/j.tranpol.2014.01.007.
[75] P. Bogdanov, M. Busch, J. Moehlis, A.K. Singh, B.K. Szymanski, The social media genome: modeling individual topic-specific
behavior in social media, in: Proc. 2013 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2013: pp. 236242.
doi:10.1145/2492517.2492621.
[76] Q. Fang, J. Sang, C. Xu, Y. Rui, Topic-sensitive influencer mining in interest-based social media networks via hypergraph learning,
IEEE Trans. Multimed. 16 (2014) 796812. doi:10.1109/TMM.2014.2298216.
[77] G. Paltoglou, M. Thelwall, Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media, ACM Trans. Intell. Syst.
Technol. 3 (2012) 119. doi:10.1145/2337542.2337551.
[78] C.H. Lee, Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams,
Expert Syst. Appl. 39 (2012) 1333813356. doi:10.1016/j.eswa.2012.05.068.
[79] S. O’Banion, L. Birnbaum, Using explicit linguistic expressions of preference in social media to predict voting behavior, in: 2013
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2013: pp. 207214. doi:10.1145/2492517.2492538.
[80] J.H. Wang, M.S. Lin, Using inter-comment similarity for comment spam detection in Chinese blogs, in: Proc. - 2011 Int. Conf. Adv.
Soc. Networks Anal. Mining, ASONAM 2011, 2011: pp. 189194. doi:10.1109/ASONAM.2011.49.
[81] J. Dickerson, V. Kagan, V. Subrahmanian, Using Sentiment to Detect Bots on Twitter: Are Humans more Opinionated than Bots?, in:
2014 IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min., 2014: pp. 620627. http://jpdickerson.com/pubs/dickerson14using.pdf.
[82] J. Yin, A. Lampert, M. Cameron, B. Robinson, R. Power, Using social media to enhance emergency situation awareness, IEEE Intell.
Syst. 27 (2012) 5259. doi:10.1109/MIS.2012.6.
[83] E. Ferrara, P. De Meo, G. Fiumara, R. Baumgartner, Web data extraction, applications and techniques: A survey, Knowledge-Based
Syst. 70 (2014) 301323. doi:10.1016/j.knosys.2014.07.007.
[84] A. Boutet, H. Kim, E. Yoneki, What’s in twitter: I know what parties are popular and who you are supporting now!, in: Proc. 2012
IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Mining, ASONAM 2012, 2012: pp. 132139. doi:10.1109/ASONAM.2012.32.
MohammadNoor Injadat received the BSc and MSc degrees in computer science from Al al-Bayt University in Jordan and
University Putra Malaysia in Malaysia in 2000 and 2002, respectively. He obtained a Master of Engineering
in Electrical and Computer Engineering from the University of Western Ontario in 2015. He is currently working toward
his PhD degree in Software Engineering at the Department of Electrical and Computer Engineering, University of
Western Ontario, Canada. His research interests include data mining, machine learning, social network analysis, data
analytics, and cloud computing. MohammadNoor is a member of the IEEE Computer Society.
Fadi Salo received the BSc and MSc degrees in computer science from Al-Ahliyya Amman University in Jordan and University
Putra Malaysia in Malaysia in 1999 and 2005, respectively. He obtained a Master of Engineering in
Electrical and Computer Engineering from the University of Western Ontario in 2015. He is currently working toward his
PhD degree in Software Engineering at the Department of Electrical and Computer Engineering, University of Western
Ontario, Canada. His research interests include data mining, text mining, machine learning, social network analysis,
data analytics, and intrusion detection systems. Fadi is a member of the IEEE Computer Society.
Ali Bou Nassif is currently an Assistant Professor at the University of Sharjah, UAE. He obtained a Master’s degree in
Computer Science and a Ph.D. degree in Electrical and Computer Engineering from Western University in 2009 and
2012, respectively. Ali’s research interests include the applications of statistical and artificial intelligence models in
different areas such as software engineering, electrical engineering, e-learning and social media, as well as cloud
computing and mobile computing. Ali is a registered professional engineer in Ontario, as well as a member of the IEEE
Computer Society and the Association for Computing Machinery (ACM).