Conference Paper

A Cloud Based Framework for Identification of Influential Health Experts from Twitter

Assad Abbas1, Muhammad U. S. Khan1, Mazhar Ali2, Samee U. Khan1, and Laurence T. Yang3
1North Dakota State University, Fargo, ND, USA
e-mail: {assad.abbas, ushahid.khan, samee.khan}@ndsu.edu
2COMSATS Institute of Information Technology, Abbottabad, Pakistan
e-mail: mazhar@ciit.net.pk
3St. Francis Xavier University Antigonish, NS, Canada
e-mail: ltyang@stfx.ca
Abstract: The ever-increasing growth in health related data
has necessitated the development of pervasive tools and
technologies to manage the huge data volumes. Likewise,
conventional healthcare services are transforming into
patient-centric services that offer ubiquitous access to health related
information. However, there is a need to extend the capabilities
of the existing health services and tools so that users can
become aware of their health, devise wellness plans, and
seek experts’ advice at no or low cost using social media. In
this paper, we propose a cloud based framework that uses
Twitter data to offer recommendations about the most
influential health experts. We employ a variant of the
Hyperlink-Induced Topic Search (HITS) approach to identify
the candidate health experts based on health related keywords
used in the tweets. Subsequently, we propose an influence
metric that calculates the influence of the candidate experts
based on various parameters. The proposed approach attained
high accuracy when compared to other approaches for expert
user identification. Moreover, experimental results show that
the approach is highly scalable for workloads of varying sizes.
Keywords: Expert users, influence, cloud computing, tweets
I. INTRODUCTION
The increased demand for utilizing the electronic
healthcare services has resulted in enormous growth of
health data on the Internet [1]. The current volumes of health
related data include the diagnosis and prescriptions data,
laboratory data, pharmacy records, health insurance claims
data, gene extraction and sequencing data [2], [3].
Consequently, such high volumes of diverse data give rise to
the term health Big-data [2]. Besides the aforementioned
sources of health related data, online health communities and
social media platforms, such as Twitter and Facebook, have
also emerged as rich sources of health data. Users of
social media websites discuss various topics of common
interest, including health issues. Twitter, for example, is
used by both doctors and patients to exchange
their experiences and feelings with others. Moreover, Twitter
contains various health communities that are meant to
answer and resolve the concerns of users or patients with
different diseases. There are also other online health
communities, for example PatientsLikeMe [4], that people
use to exchange their experiences with different diseases
and to seek support from other patients. According to the
Pew Internet survey of 2013, a key benefit of such Internet
based health communities is to help people seek advice from
the current or past patients and health experts at no cost [5].
The increased trends of finding online health information are
due to the growth in the numbers of smartphone users.
Moreover, the Pew Internet survey of 2014 reveals that
around 90% of the U.S. adults own mobile phones and
approximately 72% of the users have searched online for the
information pertaining to health related issues [6].
Considering the widespread use of computing and mobile
devices for searching the health related information from the
online health communities and social media networks, it is
the appropriate time to enhance the potential of the online
health communities. Therefore, developing methodologies
that enable the interaction between patients and health
experts through social media will help users seek advice at
low or no cost.
The enormous amount of health data requires scalable
solutions to efficiently process and store large amounts of
data. The scalability issues of the traditional Web based
systems not only result in inefficient processing but also
affect the accuracy [7]. Therefore, using the cloud computing
based scalable and elastic services to manage large amounts
of health data is important to offer efficient and scalable
services [8]. Besides the performance benefits of the cloud
services in healthcare, financial advantages are also of
paramount significance and can help reduce
healthcare costs [9]. A 2013 survey conducted by McKinsey
shows that the healthcare expenses of the U.S. are
approximately 17.6% of the total GDP [10]. Therefore,
cloud computing services are an inexpensive alternative for
delivering quality healthcare services. Moreover, it is
expected that in the near future more and more computing and
mobile devices will generate gigantic volumes of health data,
which calls for the use of Big-data analytics tools in the
healthcare domain.
In this paper, we propose a cloud based scalable
framework that supports both the desktop and mobile users
to seek advice related to health affairs from the health
experts who frequently use Twitter. The framework analyzes
the tweets related to different diseases by various doctors and
determines the most suitable health experts for a particular
disease in that geographical area. Twitter has emerged as a
vibrant health information source containing more than
784,893,181 health related tweets, around 10,000 doctors, and
over 6,200 healthcare communities [11]. The aforementioned
figures are evidence of the increased use of Twitter for health
related issues, which enables quick information exchange
without cost. The framework mainly comprises two
modules: (a) candidate experts identification module and (b)
influential user identification module. The candidate experts
are identified by using a variant of Hyperlink-Induced Topic
Search (HITS) [12] approach. Subsequently, the candidate
experts are further analyzed to determine the influential
experts for a disease. The influential users are identified
according to the prioritized criteria indicated in the query of
the querying user. The users can find the influential health
experts based on multiple criteria, such as: (a) number of
followers of the expert, (b) health related tweets by the
expert, (c) analyzing the followers’ sentiments in replies to
the tweets by expert, and (d) the retweets of the experts’
tweets. The rationale for offering multiple selection criteria is
that a single criterion cannot truly characterize the
expertise of an individual. For example, the following
relationship on Twitter is somewhat casual: some
individuals might just randomly follow others, who may
follow them back out of courtesy. Therefore, the reciprocity of
the following relationship is not a strong indicator of an
individual’s expertise [13]. Our framework exhibits great
potential to turn Twitter into a collaborative online health
community where people can discuss their health matters
with the experts without any cost.
The framework performs the identification of multiple
influential users simultaneously across different geographical
locations. Maintaining large tweet repositories requires
scalable infrastructure with massive storage and efficient
processing. Therefore, cloud computing services are utilized
because of their ability to dynamically scale up and scale
down according to the workload characteristics. The
framework executes the periodic jobs to update and maintain
tweet repositories and to subsequently identify the health
experts. The reason to perform the offline processing for
identification of candidate experts and the influential users is
that it may incur high time overheads if the processing is
performed online. Therefore, offline processing avoids the
limitations of online processing and minimizes the query
response time. The key contributions of the paper are as
follows:
• We present a scalable framework that utilizes the cloud
  computing services to identify the influential health
  experts from Twitter.
• A variant of the HITS approach is employed to identify the
  candidate health experts based on the health related
  keywords in their tweets.
• We also propose an influence metric that calculates the
  influence of the experts in terms of the number of
  followers, sentiment analysis of the replies to the tweets
  by followers, health related tweets, and the retweets of
  the experts’ tweets.
• The framework is capable of managing multiple queries
  simultaneously by executing parallel jobs to identify the
  experts from different geographical areas.
• We also demonstrate the scalability of the framework
  for workloads of different sizes.
The paper is organized as follows. Section II discusses
the related work. The architecture of the proposed system is
presented in Section III. Section IV presents results and
discussion whereas Section V concludes the paper.
II. RELATED WORK
There has been plentiful research on expert
identification from various online communities and
microblog systems. However, identification of experts
from online health communities has received comparatively
little attention. An approach to find influential users in online
health communities is proposed by Zhao et al. [14]. The
approach determines the influence of a user through
sentiment dynamics in the threaded community discussions
and introduces a metric called Influential Responding
Replies (IRR) to determine the influence of others in the
community. Weng et al. [13] proposed an extension of the
PageRank algorithm called the TwitterRank that finds the
influential users on Twitter. TwitterRank uses link structures
and topical similarities to compute ranking for the
influential users on a particular topic. The aforementioned
approaches face scalability issues, whereas our
approach is capable of finding the influential users by
executing parallel jobs over a huge tweet corpus.
Another approach that utilizes probabilistic clustering
to identify the topical authorities from the microblogs is
presented in [15]. Ghosh et al. [16] proposed a
crowdsourcing based approach called Cognos to identify the
influence of the users by using the Twitter lists. However,
the approach in [16] is constrained by hourly limits on
tweet extraction. A link analysis based approach to identify
the influential users from an online healthcare social
network is proposed in [17]. The approach quantifies the
influence of the users in a small social network. In
contrast, our proposed approach is two-fold: it first
identifies the candidate experts through a variant of the HITS
methodology and subsequently determines the influence of
the experts based on multiple criteria, such as the number of
followers, number of tweets, sentiments, and retweets.
III. PROPOSED SYSTEM ARCHITECTURE
The proposed framework utilizes the cloud computing
services to identify health experts from Twitter that best
match with users’ queries. The Software as a Service (SaaS)
implementation of the framework allows the availability of
the health expert recommendation service by means of
Internet. The tweets repositories are maintained by
periodically executing the jobs to retrieve the tweets from
Twitter. To identify the expert users, the following tasks are
performed: (a) identification of candidate experts and (b)
calculation of influential users. The architecture of the
proposed framework is presented in Fig. 1. The steps to
identify the experts are presented in Algorithm 1.
A. Identification of Candidate Experts
Based on a user query, the tweets from the health experts
are analyzed and parsed to extract disease specific keywords.
For the disease specific terminologies to analyze the tweets,
we used the WordNet database [18]. The benefit of using
WordNet is that it is capable of identifying the relationships
between different keywords by using the hypernym,
hyponym, meronym, holonym, and derivationally related
terms [19]. Interested readers are encouraged to consult [18]
and [19] for more details on the hypernym, hyponym,
meronym, holonym, and derivationally related terms. Based
on the frequency of health related keywords by the health
experts in their tweets, a keyword popularity matrix is
generated. The set of users for a particular disease is
represented as below:
$$U_d = \{\, u_i \mid u_i \text{ uses at least one keyword } k_j \in K_d \,\} \qquad (1)$$
where $k_j$ is the j-th keyword used by the user for a
particular disease $d$. However, the popularity of the health
experts based on the keywords count is not an exact
depiction of the real health experts because it only considers
the total number of keywords in the tweets used by a
particular user. Consequently, the users who frequently
repeat a few keywords in tweets may emerge as the top
experts. Therefore, to accurately identify the health experts,
it is essential to consider the frequency of keywords,
importance of keywords, and the importance of the
particular experts who use the keywords. To this end, we
use a variant of the hubs and authorities based approach to
identify the candidate expert users. The concept of hubs and
authorities is based on a Hyperlink-Induced Topic Search
(HITS) approach that has been used in Web search such that
the page that points to several other pages is called hub
whereas the pages that are pointed to by several other pages
are called authorities [12]. The proposed framework
considers the health experts as the hubs and the keywords as
the authorities. An issue with the HITS approach is that the
good hubs point mostly to the good authorities. Therefore,
the ranking decisions using the HITS for experts are mostly
based on the frequency of keywords used by important
experts. However, multiple parameters contribute to the
identification of good hubs. The parameters
include the usage of multiple different keywords by an
expert, importance (frequency) of the particular keywords,
and the importance of the hubs using those keywords.
Therefore, we modify the HITS approach by multiplying the
hub scores by the number of distinct authorities pointed
to by the hubs. Consequently, the final ranking score for the
hubs is more balanced and does not depend merely on the
frequency of keywords. To identify the candidate experts for
a particular disease $d$, we construct a matrix $M_d$ with $|U_d|$ rows
and $|K_d|$ columns. We calculate the authority and hub scores
using Eq. 2 and Eq. 3, respectively.
$$a_j = \sum_{i=1}^{|U_d|} M_d[i,j]\, h_i \qquad (2)$$
$$h_i = c_i \sum_{j=1}^{|K_d|} M_d[i,j]\, a_j \qquad (3)$$
where $h_i$ and $a_j$ represent the hubs and authorities,
respectively, $M_d[i,j]$ is the frequency of keyword $k_j$ in the
tweets of user $u_i$, and $c_i$ is the number of distinct authorities
pointed to by hub $i$. The approach works iteratively,
with the hub and authority scores initially
equal to 1. In each iteration, the hub and authority
scores are updated, and the scores at the converging iteration
are considered the final hub and authority scores.
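The keyword popularity matrix described above can be sketched in a few lines of Python. This is a minimal illustration rather than the framework's implementation: the function name `build_popularity_matrix`, the sample tweets, and the regex tokenizer are assumptions made for the example, and keyword expansion through WordNet is omitted.

```python
import re
from collections import Counter


def build_popularity_matrix(tweets_by_user, keywords):
    """Return (users, keywords, matrix); matrix[i][j] counts how often
    user i used keyword j across all of that user's tweets."""
    users = sorted(tweets_by_user)
    keywords = list(keywords)
    matrix = []
    for user in users:
        counts = Counter()
        for tweet in tweets_by_user[user]:
            # Lowercase word tokens; a stand-in for the real tweet parser.
            counts.update(re.findall(r"[a-z]+", tweet.lower()))
        matrix.append([counts[k] for k in keywords])
    return users, keywords, matrix


# Hypothetical sample data, not taken from the paper's corpus.
tweets_by_user = {
    "dr_a": ["Insulin dosing tips for diabetes", "Diabetes and diet"],
    "dr_b": ["Insulin insulin insulin!"],
}
users, keywords, matrix = build_popularity_matrix(
    tweets_by_user, ["diabetes", "insulin", "glucose"]
)
```

Each row of `matrix` corresponds to one user and each column to one keyword, which is the layout the hub and authority updates operate on.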
Fig. 1. Architecture of proposed cloud based framework
An Illustrative Example of Candidate Experts Identification
Let U and K be two sets such that U = {U1, U2, U3, U4}
and K = {K1, K2, K3, K4, K5, K6} represent the Twitter
based expert users and the keywords used by the experts,
respectively. The initial scores for the hubs and authorities are
assumed to be h = (1, 1, 1, 1) and a = (1, 1, 1, 1, 1, 1),
respectively. Table I shows a matrix comprising four
users and six different keywords. At first glance,
user U4 appears to be the most popular of the four users
because it uses 18 keywords in total. Likewise, U1 is in 2nd
position with 16 keywords, and U3 is the lowest in terms of
keyword count. The algorithm for finding the candidate
experts is applied iteratively such that in each iteration the
hub and authority scores are updated. Table II and Table III
respectively show the hub and authority scores at the first and
the convergence iteration. Table II shows that the hub score
for U3 is the highest after the first iteration whereas U1 is in 2nd
position. However, after the 41st iteration, U1 emerges as the
hub with the highest score whereas U3 is in 2nd position.
Table III presents the authority scores at the first and
converging (36th) iteration. It can be observed from Table
I that the keyword count for each of K4, K5, and K6 is
equal to 15 whereas K1 has the 2nd highest keyword count.
Table III: Authority score
Iteration No.   K1      K2      K3      K4      K5      K6
1               0.689   0.389   0.160   1       0.964   0.621
36              0.578   0.342   0.127   0.991   1       0.202
After the 1st iteration, K4 emerges as the
keyword with the highest authority score whereas the
authority score for K6 becomes even lower than that of K1,
which has a considerably lower count than K6. More obvious
differences in authority scores can be observed at the
converging (36th) iteration, where K5 and K4 evolve as
the authorities with the highest scores. Interestingly, K6, which
has the highest keyword count (tied with K4 and K5), turns out
extremely low in terms of authority score. Likewise, the hub score for U4, which
actually used the most keywords, becomes the lowest at the
converging iteration. The reason is that U4 used only two
keywords and one of them was used repeatedly. On the
other hand, U1, which used several distinct keywords with
different frequencies, evolves as the hub with the highest
score despite having fewer disease specific keywords than
U4. Another interesting observation concerns U3, which used
even more distinct keywords with varying frequencies yet
attained only the 2nd highest hub score. The reason is that U3 also
used K2 and K3, which were not as important as K4 and
K5 used by U1. Consequently, it can be concluded that there is no single
factor that individually helps in accurate identification of the
candidate experts. Instead, the keyword frequency, the
use of distinct keywords, and the importance of the hubs
pointing to the authorities all contribute to accurately
identifying the candidate health experts.
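The example can be reproduced with a short Python sketch. It assumes, based on the values reported in Tables II and III, that the hub scores are updated before the authority scores in each iteration and that both vectors are divided by their maximum after every update. The function name `hits_variant`, the convergence tolerance, and the iteration cap are choices made here, not details given in the text.

```python
def hits_variant(matrix, tol=1e-9, max_iter=1000):
    """Iterate the modified HITS updates on a user-by-keyword count matrix.

    Returns (hubs, authorities, history); history[t] holds the scores
    after iteration t + 1.
    """
    n_users, n_keys = len(matrix), len(matrix[0])
    # Number of distinct keywords (authorities) each user (hub) points to.
    distinct = [sum(1 for v in row if v > 0) for row in matrix]
    hubs = [1.0] * n_users
    auths = [1.0] * n_keys
    history = []
    for _ in range(max_iter):
        # Hub update: weighted keyword sum scaled by the distinct-keyword count.
        raw_hubs = [
            distinct[i] * sum(matrix[i][j] * auths[j] for j in range(n_keys))
            for i in range(n_users)
        ]
        top = max(raw_hubs)
        new_hubs = [h / top for h in raw_hubs]
        # Authority update: sum of the hub scores of the users using the keyword.
        raw_auths = [
            sum(matrix[i][j] * new_hubs[i] for i in range(n_users))
            for j in range(n_keys)
        ]
        top = max(raw_auths)
        new_auths = [a / top for a in raw_auths]
        change = max(
            abs(x - y) for x, y in zip(hubs + auths, new_hubs + new_auths)
        )
        hubs, auths = new_hubs, new_auths
        history.append((hubs, auths))
        if change < tol:
            break
    return hubs, auths, history


# User-keyword counts of Table I, with "-" entries written as 0.
TABLE_I = [
    [3, 2, 0, 5, 6, 0],   # U1
    [2, 0, 0, 6, 7, 0],   # U2
    [3, 3, 2, 4, 2, 0],   # U3
    [3, 0, 0, 0, 0, 15],  # U4
]
hubs, auths, history = hits_variant(TABLE_I)
first_hubs, first_auths = history[0]
```

Under these assumptions, `first_hubs` and `first_auths` agree with the first rows of Tables II and III to about three decimals, and the converged hub scores rank the users U1, U3, U2, U4.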
B. Influential User Identification
After the candidate experts have been identified through the
hubs and authorities based approach, we further refine the
process of expert user identification to ensure that the
querying users are recommended the most relevant experts.
Therefore, we introduce a metric that computes the
influence of each of the candidate experts. The influence of
a user is calculated based on: (a) the number of followers of
the expert on Twitter, (b) total health related tweets, (c)
sentiments of the followers in replies to the tweets by
experts, and (d) retweets. The intuition behind using the
aforementioned multiple criteria is that a single criterion,
for example the number of followers, is not sufficient to
determine the influence or popularity of an expert on
Twitter. Therefore, it is important to evaluate the influence
of an expert based on several different criteria. This will
also enable the querying users to evaluate the influence of
an expert based on multiple prioritized criteria. The replies
of the followers of a health expert are important in
determining the influence and reputation of a health expert.
The users in their replies to the tweets by the health expert
express their sentiments. The sentiments expressed in the
tweets may be positive, negative, or neutral. To classify the
sentiments in the replies to the tweets as positive,
negative, or neutral, we used the Stanford CoreNLP library
[20]. However, we only used the positive sentiment scores of
the replies to all of the tweets of a particular health
expert. The reason for considering the health related tweets
as one of the influence criteria is that a health expert may
also tweet about some matters different from the health.
Therefore, considering the total number of tweets on all
topics by the health experts may significantly affect the total
influence calculated for that expert. Likewise, the numbers
of retweets by the followers of an expert are also an
important factor that can portray the influence and
popularity of an expert.
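As a rough illustration of how the four criteria could be gathered for one candidate expert, the Python sketch below works on a hypothetical record layout. The `Tweet` class, the field names, and the use of the fraction of positive replies as the sentiment score are all assumptions; the sentiment labels are taken as already produced by a classifier and are not computed here.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    """Hypothetical record for one tweet of a candidate expert."""
    is_health_related: bool
    retweets: int = 0
    # Labels of the replies: "positive", "negative", or "neutral".
    reply_sentiments: list = field(default_factory=list)


def expert_criteria(followers, tweets):
    """Collect the four influence criteria for a single expert."""
    total = len(tweets)
    health = sum(1 for t in tweets if t.is_health_related)
    replies = [label for t in tweets for label in t.reply_sentiments]
    positive = sum(1 for label in replies if label == "positive")
    return {
        "followers": followers,
        # Ratio of health related tweets to all tweets of the expert.
        "health_tweet_ratio": health / total if total else 0.0,
        # Only positive replies contribute to the sentiment criterion.
        "positive_sentiment": positive / len(replies) if replies else 0.0,
        "retweets": sum(t.retweets for t in tweets),
    }


sample = [
    Tweet(True, 3, ["positive", "negative"]),
    Tweet(False, 1, ["positive"]),
    Tweet(True, 0, []),
]
criteria = expert_criteria(1200, sample)
```

Because the criteria are on different scales (counts versus ratios), some form of normalization would be needed before they are combined into a single score.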
Table I: User-keyword matrix
User   K1   K2   K3   K4   K5   K6   Ktotal
U1     3    2    -    5    6    -    16
U2     2    -    -    6    7    -    15
U3     3    3    2    4    2    -    14
U4     3    -    -    -    -    15   18

Table II: Hub score
Iteration No.   U1      U2      U3      U4
1               0.914   0.643   1       0.514
41              1       0.790   0.839   0.178

The users that are interested in finding the health
experts based on the number of followers assign high
importance to that criterion in their queries. The users are
returned a ranked list of the health experts that best match
their query. The criterion with the high importance or
priority indicated by the user is assigned a higher weight
whereas those with low importance are assigned lower
weights while ranking the experts. Weight assignment is an
important task in ranking the experts based on given criteria.
We used the Rank Order Centroid (ROC) method [21] to assign
weights to the different criteria. In the ROC method, the
weights of the different attributes or decision criteria are
assigned according to their relative importance. The weight
assignment using the ROC is performed as follows:
$$w_i = \frac{1}{n} \sum_{k=i}^{n} \frac{1}{k} \qquad (4)$$
where $n$ represents the number of different decision criteria.
The final influence is calculated as follows:
$$I = \sum_{i=1}^{n} w_i \, c_i \qquad (5)$$
where $c_i$ refers to the particular criterion and $w_i$ is the
weight assigned to that criterion.
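A small Python sketch of the ROC weighting and the weighted influence sum follows. The ROC weights use the standard Rank Order Centroid formula; the criterion names, the sample values, and the assumption that each criterion has already been normalized to the range [0, 1] are illustrative choices, not details stated in the text.

```python
def roc_weights(n):
    """Rank Order Centroid weights for n criteria, most important first."""
    return [sum(1.0 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]


def influence(criteria, priority):
    """Weighted sum of criterion values.

    criteria: dict mapping criterion name to a value assumed to be in [0, 1].
    priority: criterion names ordered from most to least important.
    """
    weights = roc_weights(len(priority))
    return sum(w * criteria[name] for w, name in zip(weights, priority))


weights = roc_weights(4)

# Hypothetical, already normalized criterion values for one expert.
expert = {"followers": 0.8, "health_tweets": 0.6, "sentiment": 0.9, "retweets": 0.4}
score = influence(expert, ["sentiment", "followers", "health_tweets", "retweets"])
```

For four criteria the weights are roughly 0.521, 0.271, 0.146, and 0.063, so the top-priority criterion contributes a little over half of the final score.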
Algorithm 1 presents the steps to identify and rank the
influential health expert users from Twitter using the
variant of the HITS approach and the proposed influence
metric. Line 2 of Algorithm 1 executes in , where is
the number of keywords. Lines 3-6 search the
repositories and have complexity   , where
represents the tweets. The operations in Lines 7-10
extract the users and have complexity   . Lines 11-16
execute in    , where is the number of
tokens. Line 17 and Line 18 execute in   and
  , where represents the number of
iterations required by the variant of HITS to converge. Line
20 executes in and each of Lines 21-25 takes
to execute. Therefore, the total complexity of Lines
19-26 becomes   , where is the
number of candidate experts. Lines 27-29 execute in
 , where  is the time complexity
to sort the list of top ranked experts. The total complexity of
the algorithm to find the experts for a disease becomes
         
    .
IV. RESULTS AND DISCUSSION
We evaluated the effectiveness of our approach in terms
of recommendation accuracy and scalability against varying
workloads. Evaluation results for the expert user
recommendation and scalability are presented in subsequent
subsections.
A. Evaluation of Expert User Recommendation Module
The performance of the expert user recommendation
module in terms of accuracy was evaluated and precision,
recall, and F-measure [22] were used as the evaluation
metrics.
Precision: The ratio of the accurately identified health
experts (True Positives) to the total occurrences (True
Positive (TP) + False Positive (FP)) is termed precision
and is given as:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)$$
Recall: The probability that a randomly
selected health expert from the total training set (True
Positive (TP) + False Negative (FN)) is identified is called
recall and is given as:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (7)$$
F-measure: The F-measure is the harmonic mean of the
precision and the recall values and is represented as:
$$\text{F-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$
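The three metrics can be computed for a Top-k recommendation list with a few lines of Python. The function name and the sample user identifiers below are hypothetical; the sketch simply treats the recommended list and the set of known experts as sets.

```python
def precision_recall_f(recommended, relevant):
    """Return (precision, recall, f_measure) for a recommendation list.

    recommended: experts returned by the system (e.g. the Top-k list).
    relevant: experts known to be true health experts.
    """
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)   # correctly identified experts
    fp = len(recommended - relevant)   # recommended but not true experts
    fn = len(relevant - recommended)   # true experts that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical Top-5 list checked against ten known experts.
top5 = ["u1", "u2", "u3", "u4", "x9"]
known = ["u%d" % i for i in range(1, 11)]
precision, recall, f_measure = precision_recall_f(top5, known)
```

With four of the five recommendations correct and ten known experts, precision is 0.8 and recall is 0.4, which illustrates why recall stays low for small k even when precision is high.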
___________________________________________________________
Algorithm 1: Expert User Identification
___________________________________________________________
Output: List of health experts
Definitions: D = set of diseases, K_d = set of keywords for
disease d, T_d = tweets for disease d, T_k = tweet collection for a
keyword k, U_d = set of users who tweet about a particular disease
d, freq(u, k) = frequency of a keyword k in the tweets of a user u for
disease d, M_d = user to keyword popularity matrix
for disease d, N = number of required expert users, f = number of
followers, Ş = sentiment score, ҥt = ratio of
health related tweets to the total tweets, r = retweets, w = weight
assigned to each decision criterion, I = influence matrix, and W =
weighted influence matrix for all possible combinations of weights.
1: PARFOR each d ∈ D do
2:   K_d ← keyWordsSearch(d)
3:   PARFOR each k ∈ K_d do
4:     T_k ← searchTweetRepository(k)
5:     T_d ← T_d ∪ T_k
6:   end PARFOR
7:   PARFOR tweet t ∈ T_d do
8:     u ← extractUser(t)
9:     U_d ← U_d ∪ {u}
10:  end PARFOR
11:  PARFOR user u ∈ U_d do
12:    tokens ← tokenize(tweets of u)
13:    PARFOR keyword k ∈ K_d do
14:      M_d[u, k] ← freq(u, k)
15:    end PARFOR
16:  end PARFOR
17:  a ← authority scores from M_d (Eq. 2)
18:  h ← hub scores from M_d (Eq. 3)
19:  PARFOR each candidate expert u do
20:    f ← getFollowers(U_d)
21:    Ş ← getSentimentsScore()
22:    ҥt ← getHealthTweets()
23:    r ← getRetweets()
24:    I ← calculateInfluence(f, Ş, ҥt, r)
25:    W ← calculateWeightedMatrix()
26:  end PARFOR
27:  PARFOR each weight combination in W do
28:    sort the experts by weighted influence
29:  end PARFOR
30:  Update
31: end PARFOR
___________________________________________________________
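The outer PARFOR loop of Algorithm 1 runs one independent job per disease. A minimal Python sketch of that pattern is shown below, using a thread pool from the standard library as a stand-in for the cloud workers. The per-disease job `rank_users_for` is a deliberately simplified, hypothetical placeholder (it only counts matching tweets per user) and does not implement the HITS variant or the influence metric.

```python
from concurrent.futures import ThreadPoolExecutor


def rank_users_for(disease, repository):
    """Hypothetical per-disease job: order users by number of matching tweets."""
    counts = {}
    for user, text in repository:
        if disease in text.lower():
            counts[user] = counts.get(user, 0) + 1
    # Most matching tweets first; ties broken alphabetically.
    return sorted(counts, key=lambda u: (-counts[u], u))


def run_parallel(diseases, repository, workers=4):
    """Run one job per disease concurrently and collect the results."""
    diseases = list(diseases)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda d: rank_users_for(d, repository), diseases)
        return dict(zip(diseases, results))


# Hypothetical tweet repository of (user, text) pairs.
repository = [
    ("dr_a", "Diabetes diet advice"),
    ("dr_b", "Asthma inhaler tips"),
    ("dr_a", "More on diabetes"),
    ("dr_c", "diabetes news"),
]
ranked = run_parallel(["diabetes", "asthma"], repository)
```

Threads are used here only to keep the sketch self-contained; CPU-bound jobs at the scale discussed in the paper would more plausibly run as separate processes or on separate cloud instances.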
The tweets were collected by using the twitteR package
of R [23]. We evaluated the performance for correctly
identifying the health experts by collecting over 20,000
profiles of Twitter users who used health related
terminologies in their tweets. Around 400,000 tweets related
to diabetes were collected from Twitter by using the
hypernyms, hyponyms, meronyms, holonyms, and
derivationally related terms obtained through WordNet. The
aforementioned numbers also include the tweets that were
provided by Symplur on request. The tweets repositories
are maintained and updated by periodically executing the
jobs in offline mode. The framework also performs the
computations of the hub and authority scores to identify the
candidate experts and the influential users in offline mode.
The reason to perform the aforementioned tasks offline is
that they require huge amounts of storage and processing, which
would result in high query response times if performed online. Therefore,
our cloud based framework effectively stores the large
amounts of Twitter data and performs intensive computation
operations in offline mode for the identification of health
experts. Moreover, to minimize the query response time, the
tweet repositories are preprocessed based on the
geographical locations.
The performance of our approach was evaluated in terms
of accuracy by comparing with the approaches presented in
[15] and [24]. In addition, we compared our approach
with a popularity based ranking approach, called the
RowSum method, that only considers the frequency of
keywords used by the health experts. The precision, recall,
and F-measure for each of the approaches are presented in
Fig. 2, Fig. 3, and Fig. 4, respectively, where our proposed
approach is termed Influential User Recommendation
(IUR). The IUR approach was observed
to perform better than the other approaches
in terms of precision, recall, and F-measure for Top-k experts
with k = (5, 10, 15, 20). However, the approach by Cheng et
al. [24] also achieved high accuracy compared to the
approach presented in [15] and the popularity based
approach. The popularity based approach attained low
accuracy, particularly for Top-k experts with k = (10, 15, 20).
Interestingly, the proposed IUR approach exhibited relatively
high accuracy even at large k, such as k = (15, 20). The
comparison of results shows that our proposed approach, which
first identifies the candidate experts and then calculates the
influence of the candidates, offers more accurate
recommendations. In addition, offering users the facility to
search and evaluate the experts by specifying four different
criteria helps to obtain personalized recommendations about
health experts. Moreover, we also compared the complexities
of the proposed IUR approach with the three approaches
used for comparison. The approach presented in [15] takes
     to execute whereas the approach by
Cheng et al. [24] executes in 
   
, where is the distance between the users.
Similarly, the complexity of RowSum   
  +   . Apparently it seems that
the proposed IUR methodology has higher complexity than
the three approaches. However, this includes the
complexity of tweet parsing, candidate expert
identification, influential user identification, and weight
assignment. On the other hand, the compared approaches
consider only the single task of expert identification.
Therefore, considering that most of the time consuming tasks
are performed offline, the complexity of responding to real-time
queries for the IUR approach is reasonably acceptable.
Fig. 2: Precision comparison of IUR with other approaches (x-axis: No. of Recommendations; y-axis: Precision; series: IUR, Cheng et al., Pal et al., RowSum)
Fig. 3: Recall comparison of IUR with other approaches (x-axis: No. of Recommendations; y-axis: Recall; series: IUR, Cheng et al., Pal et al., RowSum)
Fig. 4: F-measure comparison of IUR with other approaches (x-axis: No. of Recommendations; y-axis: F-measure Score; series: IUR, Cheng et al., Pal et al., RowSum)
B. Scalability Analysis
Systems based on centralized computing models
encounter scalability issues because of their inability
to cope with ever changing processing requirements.
Consequently, the deployment of decentralized cloud based
methodologies that enable the concurrent processing of large
data volumes is becoming inevitable. For a parallel algorithm
to be scalable, as the number of resources (for
example, the processors) and the workload increase, the performance in
terms of time efficiency and resource utilization must remain
consistent or should not degrade substantially [25].
Therefore, we utilized the cloud services because they can be
procured on-demand and according to requirements. Amazon
Elastic Compute Cloud (EC2) [26] is an example of
commercial cloud service provider that provides the
processors, storage, and memory to host applications based
on different pricing models. We evaluated the scalability of
our approach by analyzing the effects of increasing the
workload and processors on the time consumption for: (a)
the candidate expert identification module, (b) calculation of
the influential users by considering all the possible
permutations for a single query, and (c) weight assignment to
four prioritized criteria.
Each of the aforementioned tasks is performed offline
and the repositories are updated periodically to avoid the
overheads arising due to online processing. The influence is
calculated based on the importance of the criteria indicated
by the users. Because there are four criteria over which users
can view the ranking decisions, it makes a total of 24
different possible combinations to evaluate the final ranking
or influence of an expert for a single query. Obviously, it is
impractical to calculate ranking score for each of the
combinations at run-time to manage the queries of users from
different geographical regions. Therefore, executing the
parallel and periodic jobs not only avoids high processing
delays but also ensures the availability of updated
information at all times. We evaluated the performance
of all of the modules in terms of time consumption by
increasing the number of users and the number of processors.
The time consumption to identify the expert users from
20,542, 41,084, and 82,168 user profiles was observed while
varying the number of processors.
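The 24 combinations mentioned above are the 4! orderings of the four criteria. A minimal Python sketch of precomputing one ranked list per ordering is given below. The criterion names, the sample experts, and the assumption that criterion values are already normalized to [0, 1] are illustrative, and the ROC formula is the standard one rather than code from the paper.

```python
from itertools import permutations

CRITERIA = ("followers", "health_tweets", "sentiment", "retweets")


def roc_weights(n):
    """Rank Order Centroid weights for n criteria, most important first."""
    return [sum(1.0 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]


def precompute_rankings(experts):
    """Map every priority ordering of the criteria to a ranked expert list.

    experts: dict of expert name -> dict of criterion -> normalized value.
    """
    weights = roc_weights(len(CRITERIA))
    table = {}
    for order in permutations(CRITERIA):
        weight_of = dict(zip(order, weights))
        scores = {
            name: sum(weight_of[c] * values[c] for c in CRITERIA)
            for name, values in experts.items()
        }
        # Highest weighted influence first; ties broken by name.
        table[order] = sorted(scores, key=lambda n: (-scores[n], n))
    return table


# Hypothetical normalized criterion values for two experts.
experts = {
    "A": {"followers": 1.0, "health_tweets": 0.1, "sentiment": 0.1, "retweets": 0.1},
    "B": {"followers": 0.2, "health_tweets": 0.9, "sentiment": 0.9, "retweets": 0.9},
}
rankings = precompute_rankings(experts)
```

At query time, answering a request then reduces to a dictionary lookup on the user's priority ordering, which is the motivation for performing this step offline.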
Fig. 5 shows the scalability results with different
workloads and number of processors to identify the
candidate health experts using the variant of HITS approach.
The results show that doubling the number of users
resulted in a sharp increase in the processing time. However,
substantial decreases in time consumption were observed by
increasing the number of processors. On average,
doubling the number of user profiles increases the
time consumption by approximately 38.72% whereas
adding one processor resulted in an average decrease of
16.27% for the candidate experts identification task. It is also
important to note that by increasing the number of processors
more than a certain limit, relatively small decreases in
processing time were observed. The reason is that this time
also includes the overheads, such as the processor start up
time and the communication time between the two
processors. For large number of processors, the
aforementioned overheads also increase and consequently
affect the total execution time [25]. Fig. 6 shows the
execution time corresponding to the three workloads for the
influential user identification module. This module
calculates the number of followers of each of the experts,
performs sentiment analysis, and calculates the health
related tweets and the retweets. Consequently, four different
tasks have to be performed for each candidate expert, which
requires parallel task processing to speed up the query
response time. When the number of profiles was doubled, the
average combined increase in time consumption was 72.03%,
whereas adding one processor at a time produced an average
decrease of approximately 66.37%. Fig. 7 shows the
processing time for weight assignment to the various
decision criteria. For each user query, the framework
performs weight assignment according to 24 different
combinations. This requires substantial computation, which
would increase the processing time if performed online.
Fig. 7 also shows that the time consumption for the weight
assignment task is considerably lower than that of the other
two modules. The reason is that weight assignment is only a
subtask of the influential user identification process, one
that has to be performed repeatedly.
Fig. 5: Execution time analysis for different no. of users
and processors to identify candidate experts
Fig. 6: Execution time analysis for different no. of
users and processors to identify influential users
Fig. 7: Execution time analysis for different no. of
users and processors for weight assignment
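The four independent per-expert tasks of the influential user identification module lend themselves to concurrent execution. The sketch below shows one possible structure using a thread pool. All function names and the profile layout are hypothetical, the sentiment step is a placeholder that reads a pre-computed label instead of running an analyzer, and a real deployment would more likely distribute the work across processes or cloud instances.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the four per-expert tasks.
def count_followers(profile):
    return profile["followers"]


def sentiment_score(profile):
    # Placeholder: fraction of tweets already labelled as positive.
    tweets = profile["tweets"]
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if t["positive"]) / len(tweets)


def count_health_tweets(profile):
    return sum(1 for t in profile["tweets"] if t["health_related"])


def count_retweets(profile):
    return sum(t["retweets"] for t in profile["tweets"])


TASKS = {
    "followers": count_followers,
    "sentiment": sentiment_score,
    "health_tweets": count_health_tweets,
    "retweets": count_retweets,
}


def evaluate_expert(profile, pool):
    """Submit the four tasks for one candidate expert and collect the results."""
    futures = {name: pool.submit(task, profile) for name, task in TASKS.items()}
    return {name: future.result() for name, future in futures.items()}


# Made-up profile, purely for illustration.
profile = {
    "followers": 1200,
    "tweets": [
        {"health_related": True, "retweets": 3, "positive": True},
        {"health_related": False, "retweets": 1, "positive": False},
        {"health_related": True, "retweets": 0, "positive": True},
    ],
}

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = evaluate_expert(profile, pool)

print(scores)
```

Because the tasks share no state, the same `evaluate_expert` structure applies unchanged if the thread pool is swapped for a process pool or a distributed job queue.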
It is evident from the above discussion and results that
all the tasks, from tweet extraction to influential user
identification, require substantial processing time and
resources. Therefore, query response time can only be
reduced if all the computationally heavy tasks are
preprocessed and periodically updated so that the most
recent information about health experts is available. The
experimental results also show that, as the workload and the
number of processors increase, our algorithm largely
maintains its efficiency in terms of time consumption.
Therefore, the proposed cloud based approach is highly
effective and can scale up and down depending upon the
workloads.
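The diminishing returns from additional processors noted in the scalability results can be illustrated with a toy cost model. The model is an assumption made here for illustration only: it is not taken from the experiments or from [25], and the work and overhead values are arbitrary. It splits the execution time into a parallelizable part that shrinks with the number of processors and an overhead part (start-up and communication) that grows with it.

```python
def execution_time(work, processors, overhead_per_processor):
    """Toy model: parallel work divided across processors plus linear overhead."""
    return work / processors + overhead_per_processor * processors


# Arbitrary illustrative values: 120 time units of work, 2 units of
# overhead for each processor that is started.
times = {p: execution_time(120.0, p, 2.0) for p in range(1, 9)}

# Time saved by going from p to p + 1 processors.
gains = [times[p] - times[p + 1] for p in range(1, 8)]

for p in range(1, 9):
    print(p, round(times[p], 2))
```

Under these assumed values, the saving from each extra processor shrinks steadily, which mirrors the qualitative behavior reported for large processor counts.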
V. CONCLUSIONS
In this paper, we proposed a framework that enables users
to interact with health experts on Twitter to seek advice at
no cost. The framework utilizes cloud infrastructure to
manage huge tweet repositories. A variant of the HITS
algorithm is employed to identify the candidate experts. The
approach effectively identified the candidate experts by
considering the use of distinctive keywords, the importance
of the keywords, and the importance of the experts using the
keywords. To make the ranking process more effective, we
further introduced an influence metric that identifies the
influential users from the list of candidate experts.
Experimental results demonstrate that the proposed framework
is highly accurate compared to other approaches. Moreover,
the performance of the system in terms of execution time is
preserved at high workloads, which indicates the scalability
of the system. We are optimistic that this research will
help to fully utilize the potential of online health
communities by offering free interaction services between
patients and doctors.
REFERENCES
[1] A. Abbas, K. Bilal, L. Zhang, and S. U. Khan, “A cloud based health insurance plan recommendation system: A user centered approach,” Future Generation Computer Systems, vol. 43, 2015, pp. 99-109.
[2] K. Miller, “Big Data Analytics in Biomedical Research,” Biomedical Computation Review, 2012, pp. 14-21.
[3] H. Chen, R. H. L. Chiang, and V. C. Storey, “Business Intelligence and Analytics: From Big Data to Big Impact,” MIS Quarterly, vol. 36, no. 4, 2012, pp. 1165-1188.
[4] “Patientslikeme,” http://www.patientslikeme.com/, accessed on March 7, 2015.
[5] S. Fox and M. Duggan, “Health online 2013,” http://www.pewinternet.org/files/old-media/Files/Reports/PIP_HealthOnline.pdf, accessed on March 25, 2015.
[6] “Health Fact Sheet,” http://www.pewinternet.org/fact-sheets/health-fact-sheet/, accessed on March 7, 2015.
[7] A. Abbas, L. Zhang, and S. U. Khan, “A Survey on Context-aware Recommender Systems Based on Computational Intelligence Techniques,” Computing, 2015, DOI 10.1007/s00607-015-0448-7.
[8] A. Abbas and S. U. Khan, “A Review on the State-of-the-Art Privacy Preserving Approaches in E-Health Clouds,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 4, 2014, pp. 1431-1441.
[9] M. Ali, S. U. Khan, and A. V. Vasilakos, “Security in Cloud Computing: Opportunities and Challenges,” Information Sciences, vol. 305, 2015, pp. 357-388.
[10] B. Kayyali, D. Knott, and S. V. Kuiken, “The big-data revolution in US health care: Accelerating value and innovation,” McKinsey & Company, 2013, pp. 1-13.
[11] “Healthcare Social Media Analytics,” http://www.symplur.com/healthcare-social-media-analytics/, accessed on March 10, 2015.
[12] D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge University Press, 2010.
[13] J. Weng, E. P. Lim, J. Jiang, and Q. He, “Twitterrank: finding topic-sensitive influential twitterers,” in Proceedings of the third ACM international conference on Web search and data mining, 2010, pp. 261-270.
[14] K. Zhao, J. Yen, G. Greer, B. Qiu, P. Mitra, and K. Portier, “Finding influential users of online health communities: a new metric based on sentiment influence,” Journal of the American Medical Informatics Association, 2014, pp. 1-7.
[15] A. Pal and S. Counts, “Identifying topical authorities in microblogs,” in Proceedings of the fourth ACM international conference on Web search and data mining, 2011, pp. 45-54.
[16] S. Ghosh, N. Sharma, F. Benevenuto, N. Ganguly, and K. Gummadi, “Cognos: crowdsourcing search for topic experts in microblogs,” in Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, 2012, pp. 575-590.
[17] X. Tang and C. C. Yang, “Identifing influential users in an online healthcare social network,” in IEEE International Conference on Intelligence and Security Informatics (ISI), 2010, pp. 43-48.
[18] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, 1995, pp. 39-41.
[19] H. Park, J. Yoon, and K. Kim, “Identifying patent infringement using SAO based semantic technological similarities,” Scientometrics, vol. 90, no. 2, 2012, pp. 515-529.
[20] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55-60.
[21] T. Solymosi and J. Dombi, “A method for determining the weights of criteria: the centralized weights,” European Journal of Operational Research, vol. 26, no. 1, 1986, pp. 35-41.
[22] P. Bedi and R. Sharma, “Trust based recommender system using ant colony for trust computation,” Expert Systems with Applications, vol. 39, no. 1, 2012, pp. 1183-1190.
[23] “twitteR: R based Twitter client,” http://cran.r-project.org/web/packages/twitteR/index.html, accessed on March 21, 2015.
[24] Z. Cheng, J. Caverlee, H. Barthwal, and V. Bachani, “Who is the Barbecue King of Texas? A Geo-Spatial Approach to Finding Local Experts on Twitter,” http://faculty.cse.tamu.edu/caverlee/pubs/cheng_sigir14.pdf, accessed on March 21, 2015.
[25] M. Ahmed, I. Ahmad, and S. U. Khan, “A Theoretical Analysis of Scalability of the Parallel Genome Assembly Algorithms,” in IEEE/EMB/ESEM/BMES International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS), Rome, Italy, January 2011, pp. 234-237.
[26] “Amazon EC2 Pricing,” http://aws.amazon.com/ec2/pricing/, accessed on March 19, 2015.