Mining Massive Hierarchical Data Using a Scalable
Probabilistic Graphical Model
Khalifeh AlJadda1*, Mohammed Korayem2, Camilo Ortiz3, Trey Grainger3, John A. Miller1, Khaled
Rasheed1, Krys J. Kochut1, William S. York4, Rene Ranzinger4 and Melody Porterfield4
*Correspondence:
aljadda@uga.edu
1Department of Computer
Science, University of Georgia,
Athens,GA, USA
Full list of author information is
available at the end of the article
†Equal contributor
Abstract
Probabilistic Graphical Models (PGM) are very useful in the fields of machine
learning and data mining. The crucial limitation of these models, however, is their
scalability. The Bayesian Network, which is one of the most common PGMs used
in machine learning and data mining, demonstrates this limitation when the
training data consists of random variables, each of which has a large set of
possible values. In the big data era, one would expect new extensions to the
existing PGMs to handle the massive amounts of data produced these days by
computers, sensors and other electronic devices. With hierarchical data - data
that is arranged in a treelike structure with several levels - one would expect to
see hundreds of thousands or millions of values distributed over even just a small
number of levels. When modeling this kind of hierarchical data across large data
sets, Bayesian Networks become infeasible for representing the probability
distributions. In this paper we introduce an extension to Bayesian Networks to
handle massive sets of hierarchical data in a reasonable amount of time and
space. The proposed model achieves a perfect precision of 1.0 and a high recall of
0.93 when it is used as a multi-label classifier for the annotation of mass
spectrometry data. On another data set of 1.5 billion search logs provided by
CareerBuilder.com, the model was able to predict latent semantic relationships
between search keywords with accuracy up to 0.80.
Introduction
Probabilistic graphical models (PGM) consist of a structural model and a set of
conditional probabilities [1, 2]. They are widely used in machine learning and data
mining techniques, like classification, speech recognition [3], bioinformatics [4, 5],
Natural Language Processing (NLP) [6, 7], etc. Scalability and restricted domain
size (e.g., propositional domain) are the major challenges for PGMs. To overcome
these challenges one would expect extensions to the existing PGMs. One extension
is offered by the hierarchical probabilistic graphical models (HPGM) which aim to
extend the PGM to work with more structured domains [8, 9]. However, this extension
tackles the restricted domain size problem, but not scalability. Consider hierarchical
data, where the data can be divided into several levels arranged in a tree-like structure,
and data items in each level depend on or are influenced only by the data items in
the upper levels. Since the dependencies in this kind of data are not bidirectional,
a Bayesian Network (BN) is the most appropriate PGM to represent its probability
distribution. However, a BN is often infeasible, as it may not provide a concise enough
representation of a large probability distribution. When dealing with the kind of
massive hierarchical data that is becoming increasingly common in the big data
era, this is true because each level represents a random variable, while each node
in that level represents an outcome (possible value) of that random variable, so the
data can grow horizontally (in the number of values) faster than vertically (in the number of
random variables). Moreover, since the dependency between the random variables
is pre-defined in hierarchical data, the structure of the network is predefined.
Hence, the first phase of building a Bayesian Network, finding the optimal structure,
is not applicable.
For example, consider the glycan ontology "GlycO" [10], which describes 1300
glycan structures (see section ) whose theoretical tandem mass spectra (MS) can
be predicted by GlycoWorkbench [11]. If the maximum number of cleavages is set to two
and the number of cross-ring cleavages is set to one, then the theoretical MS2
spectrum contains 2,979,334 ions, which themselves can be fragmented to form tens
of millions of ions in MS3. To represent this data set of only two levels of the MS
data using a Bayesian Network (BN), the network will be composed of two nodes,
MS1 and MS2, with a single path MS1 → MS2, while the conditional probability
table (CPT) for MS2 will contain 3,873,134,200 (2,979,334 × 1300) entries. For
this kind of data, we propose a simple probabilistic graphical model for massive
hierarchical data (PGMHD), which we consider an extension to the Bayesian
Network (BN) that can represent massive hierarchical data in a more efficient way.
We successfully apply PGMHD in two different domains: bioinformatics (for
multi-label classification) and search log analytics (for latent semantic discovery of
related terms, as well as semantically ambiguous terms).
The main contributions of this paper are as follows: We propose a simple, efficient,
and scalable probabilistic model that extends the Bayesian Network for massive
hierarchical data. We successfully apply this model to the bioinformatics domain, in
which we automatically classify and annotate high-throughput mass spectrometry
data. We also apply this model to large-scale latent semantic discovery and discovery of
semantically ambiguous terms using 1.6 billion search log entries provided by
CareerBuilder.com, using the Hadoop Map/Reduce framework.
Background
Graphical models can be classified into two major categories: (1) directed graphical
models (the focus of this paper), which are often referred to as Bayesian Networks,
or belief networks, and (2) undirected graphical models which are often referred
to as Markov Random Fields, Markov networks, Boltzmann machines, or log-linear
models [12]. Probabilistic graphical models (PGMs) consist of both graph structure
and parameters. The graph structure represents a set of conditionally independent
relations for the probability model, while the parameters consist of the joint prob-
ability distributions [1]. Probabilistic graphical models are often considered to be
more convenient than numerical representations for two main reasons [13]:
1 To encode a joint probability distribution P(X_1, \ldots, X_n) for n propositional
random variables with a numerical representation, we need a table with 2^n
entries.
2 Inadequacy in addressing the notion of independence: to test independence
between X and Y, one needs to test whether the joint distribution of x and
y is equal to the product of their marginal probabilities.
Figure 1 Bayesian Network
PGMs are used in many domains. For example, Hidden Markov Models (HMM)
are considered a crucial component for most of the speech recognition systems [3]. In
bioinformatics, probabilistic graphical models are used in RNA sequence analysis [4].
In natural language processing (NLP), HMM and Bayesian models are used for part
of speech (POS) tagging [6]. The problem with PGMs in general, and Bayesian
Networks in particular, is that they are not suitable for representing massive data
due to the time complexity of learning the structure of the network and the space
complexity of storing a network with thousands of random variables or random
variables taking many values. In general, finding a network that maximizes the
Bayesian score, which maximizes the posterior probability, or the Minimum Description
Length (MDL) score, which gives preference to a simple BN over a complex one, is
an NP-hard problem [14].
Bayesian Network
A Bayesian Network is a concise representation of a probability distribution that is too large
to be handled using traditional techniques such as tables and equations [15]. The
graph of a Bayesian Network is a directed acyclic graph (DAG) [2]. A Bayesian
Network consists of two components: a DAG representing the structure (as shown
in Figure 1), and a set of conditional probability tables (CPTs). Each node in a
Bayesian Network must have a CPT which quantifies the relationship between the
variable represented by that node and its parents in the network. Completeness and
consistency are guaranteed in a Bayesian Network since there is only one probabil-
ity distribution that satisfies the Bayesian Network constraints [15]. The constraints
that guarantee a unique probability distribution are the numerical constraints repre-
sented by CPT and the independence constraints represented by the structure itself.
The independence constraints are illustrated in Figure 1: each variable in the structure
is independent of all variables other than its parents, once its parents are
known. For example, once the information about A is known, the probability of L
will not be affected by any new information about F or T, so we call L independent
of F and T once A is known.
Bayesian Networks are widely used for modeling causality in a formal way, for
decision-making under uncertainty, and for many other applications [15].
C"
A1"
A2"
A3"
Figure 2 Naive Bayes
Related Work
Our research is closely related to Bayesian Network classifiers. In this section we review
different forms of Bayesian Network classifiers to understand how PGMHD
extends BN in a way different from the existing models. We will cover the following
BN classifiers:
1 Naive Bayes Classifier (NB).
2 Selective Naive Bayes (SNB)
3 Tree Augmented Naive Bayes (TAN).
4 Hidden Naive Bayes (HNB).
We will also cover how we applied PGMHD to other data mining problems, such
as latent semantic discovery of related search terms in users' search logs, and
discovery of semantically ambiguous keywords by analyzing users' search logs.
Naive Bayes (NB)
Naive Bayes is the simplest form of the BN classifiers and the most common one.
This classifier is based on the assumption that all the features are independent given
the class. Figure 2 shows an example of NB. An NB classifier is defined as follows:

P(c \mid x) \propto P(c) \prod_{j=1}^{n} P(x_j \mid c)

where x = (x_1, \ldots, x_n), P(c) is the prior probability of class c, and P(x_j \mid c) is the
conditional probability of feature/variable x_j. The value of c that maximizes the
right hand side is chosen. A Naive Bayes classifier's performance depends upon the
quality of the predictor features, such that the performance improves when the
predictor features are relevant and non-redundant.
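As a concrete illustration of this scoring rule, the following minimal sketch (hypothetical
function and parameter names, simple dictionaries of pre-computed probabilities assumed;
not the implementation used in the paper) selects the class that maximizes P(c) \prod_j P(x_j | c):

    import math

    def naive_bayes_predict(instance, class_prior, cond_prob):
        # class_prior: dict mapping class -> P(c)
        # cond_prob:   dict mapping (class, feature_index, value) -> P(x_j = value | c)
        # Log-probabilities are summed to avoid numerical underflow.
        best_class, best_score = None, float("-inf")
        for c, prior in class_prior.items():
            score = math.log(prior)
            for j, value in enumerate(instance):
                score += math.log(cond_prob.get((c, j, value), 1e-9))  # tiny floor for unseen values
            if score > best_score:
                best_class, best_score = c, score
        return best_class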
Selective Naive Bayes (SNB)
In order to improve the performance of the BN classifier by selecting predictive
features that are relevant and not redundant, Selective Naive Bayes (SNB) [16] was
proposed as a feature subset selection problem. Let us define x_F as the projection
of x onto a selected feature subset F \subset \{1, 2, \ldots, n\}. The classification equation
becomes

P(c \mid x) \propto P(c \mid x_F) \propto P(c) \prod_{j \in F} P(x_j \mid c)

Figure 3 Tree Augmented Naive Bayes
Tree Augmented Naive Bayes (TAN)
This form of Bayesian Network classifier extends NB by allowing each attribute
to have at most one attribute parent in addition to its class parent. This extension
represents the fact that in some cases there is a dependency or influence between
features, such that the value of a feature x_j depends on the value of another feature
y. The TAN classifier is defined as follows:

P(c \mid x) \propto P(c) \prod_{j=1}^{n} P(x_j \mid p_{x_j}, c)

where p_{x_j} is the attribute parent of x_j. Figure 3 shows an example of TAN.
Hidden Naive Bayes (HNB)
HNB (Figure 4) is another extension of NB. In this extension, each attribute A_i
gets a hidden parent A_{hp_i} that integrates the influences from all other attributes. The
definition of the HNB classifier is as follows:

P(c \mid x) \propto P(c) \prod_{j=1}^{n} P(x_j \mid x_{hp_j}, c),

where

P(x_j \mid x_{hp_j}, c) = \sum_{i=1, i \neq j}^{n} w_{ji} \cdot P(x_j \mid x_i, c).
Probabilistic Graphical Model for Massive Hierarchical Data
(PGMHD)
In this section we describe PGMHD. We discuss the structure of the model, its
learning algorithm, and how it extends BN.
Figure 4 Hidden Naive Bayes
Model Structure
Consider a multi-level directed graph G = (V, A), where V and A \subset V \times V denote
the sets of nodes and arcs, respectively, such that:
1 V is partitioned into m levels L_0, \ldots, L_{m-1} such that V = \cup_{i=0}^{m-1} L_i, and
L_i \cap L_j = \emptyset for i \neq j.
2 The arcs in A only connect one level to the next, i.e., if a \in A then a \in
L_{i-1} \times L_i for some i = 1, \ldots, m-1.
3 An arc a = (v_{i-1}, v_i) \in L_{i-1} \times L_i represents the dependency of v_i on its
parent v_{i-1}, i = 1, \ldots, m-1. Moreover, let pa : V \to \mathcal{P}(V) be the function
that maps every node to its parents, i.e.,
pa(v) = \{w : (w, v) \in A\} \quad \forall v \in V.
4 The nodes in each level L_i represent all the possible outcomes of a finite
discrete random variable, namely X_i, i = 0, \ldots, m-1.
Note that the nodes in the first level L_0 can be seen as root nodes and the ones in
L_{m-1} as leaves. Also, an observation x in our probabilistic model is an outcome of
a random variable, namely X \in L_0 \times \cdots \times L_{m-1}, defined as
X = (X_0, X_1, \ldots, X_{m-1}), (1)
which represents a path from L_0 to L_{m-1} such that (X_{i-1}, X_i) \in A.
In addition, we assume that there are t observations of X, namely x_1, \ldots, x_t, and
let f : V \times V \to \mathbb{N} be a frequency function defined as f(w, v) = the frequency of
co-occurrence of w and v. Moreover, these latter t observations are the ones used to
train our model, so that f(w, v) > 0 for every (w, v) \in A.
It should be observed that the proposed model can be seen as a special case of a
Bayesian Network by considering a network consisting of directed predefined paths.
However, we believe that a leveled directed graph that explicitly defines one node per
outcome of the random variables (as described above): i) leads to an easily scalable
(and distributable) implementation of the problems we consider; ii) improves the
readability and expressiveness of the implemented network; and iii) simplifies and
facilitates the training of the model.
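To make the definitions above concrete, the following is a minimal illustrative sketch
(hypothetical class and method names, plain in-memory dictionaries; not the distributed
implementation described later) of such a leveled graph with frequency-weighted arcs.
The worked examples added later in this paper reuse these names.

    from collections import defaultdict

    class PGMHD:
        # Leveled directed graph: one node per outcome, arcs only between consecutive
        # levels, and each arc (w, v) carries the co-occurrence frequency f(w, v).
        # Outcome identifiers are assumed to be distinct across levels in this sketch.
        def __init__(self, num_levels):
            self.num_levels = num_levels
            self.levels = [set() for _ in range(num_levels)]  # nodes of L_0 .. L_{m-1}
            self.freq = defaultdict(int)                      # f(w, v) per arc (w, v)
            self.parents = defaultdict(set)                   # pa(v)
            self.children = defaultdict(set)

        def observe(self, path):
            # Record one observation x = (x_0, ..., x_{m-1}), i.e. a path from L_0 to L_{m-1}.
            assert len(path) == self.num_levels
            for i, node in enumerate(path):
                self.levels[i].add(node)
            for w, v in zip(path, path[1:]):
                self.freq[(w, v)] += 1
                self.parents[v].add(w)
                self.children[w].add(v)

        def in_count(self, v):
            # In(v) = sum of f(u, v) over parents u of v
            return sum(self.freq[(u, v)] for u in self.parents[v])

        def out_count(self, w):
            # Out(w) = sum of f(w, u) over children u of w
            return sum(self.freq[(w, u)] for u in self.children[w])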
Probabilistic-based Classification
Given an outcome at level i \in \{1, \ldots, m-1\}, namely v \in L_i, we calculate the
classification score Cl_i(w \mid v) of v to the parent outcome w \in L_{i-1} by estimating the
conditional probability P(X_{i-1} = w \mid X_i = v) as follows:

Cl_i(w \mid v) := \frac{f(w, v)}{In(v)} = \frac{\frac{f(w, v)}{Out(w)} \cdot \frac{Out(w)}{t}}{\frac{In(v)}{t}} \approx \frac{P(X_i = v \mid X_{i-1} = w) \cdot P(X_{i-1} = w)}{P(X_i = v)} = P(X_{i-1} = w \mid X_i = v),

where

In(v) := \sum_{u \in pa(v)} f(u, v), \quad \forall v \in V,

and

Out(w) := \sum_{u : (w, u) \in A} f(w, u), \quad \forall w \in V.
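Computed directly from the stored frequencies of the hypothetical sketch above, the score
for every candidate parent of an observed outcome v is simply its share of In(v); the
highest-scoring parents are the predicted labels in the multi-label setting. A sketch:

    def classification_scores(model, v):
        # Estimate Cl_i(w | v) = f(w, v) / In(v) = P(X_{i-1} = w | X_i = v) for each parent w of v.
        total_in = model.in_count(v)
        if total_in == 0:
            return {}
        return {w: model.freq[(w, v)] / total_in for w in model.parents[v]}

    # Example usage: rank candidate parent labels for an observed outcome v.
    # scores = classification_scores(model, v)
    # top_labels = sorted(scores, key=scores.get, reverse=True)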
Probabilistic-based Similarity Scoring
Fix a level i \in \{1, \ldots, m-1\}, and let X, Y \in L_0 \times \cdots \times L_{m-1} be identically
distributed random variables as in (1). We define the probabilistic-based similarity
score CO (Co-Occurrence) between two outcomes x_i, y_i \in L_i by computing the
conditional joint probability

CO(x_i, y_i) := P(X_i = x_i, Y_i = y_i \mid X_{i-1} \in pa(x_i) \cap pa(y_i), Y_{i-1} \in pa(x_i) \cap pa(y_i))

as

CO(x_i, y_i) = \prod_{w \in pa(x_i) \cap pa(y_i)} p_i(w, x_i) \cdot \prod_{w \in pa(x_i) \cap pa(y_i)} p_i(w, y_i),

where p_i(w, v) = P(X_{i-1} = w, X_i = v) for every (w, v) \in L_{i-1} \times L_i. Recalling that
t is the number of observations, we can naturally estimate the probabilities p_i(w, v)
with \hat{p}(w, v) defined as

\hat{p}(w, v) := \frac{f(w, v)}{t}.

Hence, we can obtain the related outcomes of x_i \in L_i (at level i) by finding all the
y \in L_i with a large estimated probabilistic similarity score CO(x_i, y_i).
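Under the same assumptions as the sketch above, the co-occurrence similarity between two
same-level outcomes can be estimated from the shared-parent frequencies:

    def co_similarity(model, x, y, t):
        # Estimate CO(x, y) as the product of p_hat(w, x) * p_hat(w, y) over shared parents w,
        # where p_hat(w, v) = f(w, v) / t and t is the total number of observations.
        shared_parents = model.parents[x] & model.parents[y]
        if not shared_parents:
            return 0.0
        score = 1.0
        for w in shared_parents:
            score *= (model.freq[(w, x)] / t) * (model.freq[(w, y)] / t)
        return score

    # Related outcomes of x at level i: the outcomes y in the same level with a large CO(x, y).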
Progressive Learning
PGMHD is designed to allow progressive learning, which is shown in Algorithm 1.
Figure 5 Input hierarchical data and PGMHD
Progressive learning is a technique that allows a model to learn gradually over time.
Training data does not need to be given to the model all at once. Instead,
the model can learn from any available data and integrate the new knowledge
incrementally. This learning technique is very attractive in the big data age for the
following reasons:
1 Training the model does not require processing all the data upfront.
2 It can easily learn from new data without the need to re-include the previous
training data in the learning.
3 The training session can be distributed instead of being done in one long-running
session.
4 It supports recursive learning, which allows the results of the model to be used
as new training data, provided they are judged to be accurate by the user.
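Because the model stores raw frequencies rather than normalized probabilities, integrating
a new batch of observations is just a matter of incrementing counters; a minimal sketch
reusing the hypothetical observe method introduced above:

    def progressive_update(model, new_paths):
        # New training data can arrive at any time; each observed path simply increments
        # node/arc frequencies, so previously seen data never needs to be reprocessed.
        for path in new_paths:
            model.observe(path)

    # Recursive learning: model outputs that a user judges correct can be fed back in
    # as additional training paths, e.g. progressive_update(model, approved_predictions).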
PGMHD as an Extension of NB
PGMHD extends NB in different directions to improve its scalability and ability to
handle massive hierarchical data as follows:
1 It enables multi-label classification.
2 It enables multi-level representation of the predictive features.
3 It enables lazy classification.
The first dimension along which PGMHD extends NB is multi-label classification. Our model
allows more than one class to be in the root level of the classifier, where any instance
can be classified into more than one class. The second dimension of this extension is
multi-level classification, which allows the classifier to represent the predictive
features in m levels instead of only two levels as in the regular NB. This extension
allows the hierarchical modeling to preserve the structure of the data, which our
experiments show is important for improving the quality of the classification. The
last dimension of this extension is lazy classification, in contrast to the eager NB. PGMHD
is considered a lazy classifier since the calculation of the classification score of a
new instance is all done during the classification process, unlike NB where all the
CPTs are pre-calculated and stored. This extension makes PGMHD suitable for
progressive learning, which can be very important for scalability.
Experiments and Results
Glycans (Figure 6) are the third major class of biological macromolecules besides
nucleic acids and proteins [17].
Data: Input Hierarchical Data
Result: PGMHD Instance
begin
    currentLevel = 0
    while currentLevel < maxInputLevel do
        foreach inputNode ∈ input(currentLevel) do
            if inputNode exists in PGMHD then
                get pgmhdNode where pgmhdNode.data = inputNode.data
                pgmhdNode.frequency += 1
            else
                pgmhdNode = new node
                pgmhdNode.data = inputNode.data
                pgmhdNode.frequency = 1
            end
            childrenLevel = currentLevel + 1
            foreach inputChildNode ∈ inputNode.children do
                foreach pgmhdChildNode ∈ pgmhdNode.children do
                    if inputChildNode.data = pgmhdChildNode.data then
                        edge = edge(pgmhdNode, pgmhdChildNode)
                        edge.frequency += 1
                    else
                        if inputChildNode ∈ pgmhd(childrenLevel) then
                            pgmhdChildNode = node where node.data = inputChildNode.data
                            edge = createNewEdge(pgmhdNode, pgmhdChildNode)
                            edge.frequency = 1
                        else
                            pgmhdChildNode = new node
                            pgmhdChildNode.data = inputChildNode.data
                            pgmhdChildNode.frequency = 1
                            edge = createNewEdge(pgmhdNode, pgmhdChildNode)
                            edge.frequency = 1
                        end
                    end
                end
            end
        end
        currentLevel = currentLevel + 1
    end
end
Algorithm 1: Learning algorithm for PGMHD. currentLevel represents the current
level in the input hierarchical data; we start with level 0. maxInputLevel is
the highest level in the input hierarchical data. In Figure 5, Ain is an inputNode,
while Apg is the pgmhdNode; C1 is an inputChildNode, C3 is a pgmhdChildNode,
and F1 is edge.frequency. C1, C2 ∈ inputParentNode.children and C3, C4, C5 ∈
pgmhdParentNode.children.
Figure 6 Glycan structure in CFG format. The circles and squares represent the monosaccharides
which are the building blocks of a glycan while the lines are the linkages between them
Glycomics refers to the scientific attempts to characterize and study glycans, as defined
in [17], or to an integrated systems approach to study structure-function relationships
of glycans, as defined in [18].
Mass spectrometry (MS) is an analytical technique used to identify the composition
of a sample [19]. Although MS has become the major analytical technique
for glycans, no general method has been developed for the automated identification
of glycan structures using MS and tandem MS data. MSn refers to a sequence of
MS selection steps with some form of fragmentation; it is also called tandem MS. The
relative ease of peptide identification using tandem MS is mainly due to the linear
structure of peptides and the availability of reliable peptide sequence databases. In
proteomic analyses, a mostly complete series of high abundance fragment ions is
often observed. In such tandem mass spectra, the mass of each amino acid in the
sequence corresponds to the mass difference between two high-abundance peaks,
allowing the amino acid sequence to be deduced. In glycomics MS data, ion series
are disrupted by the branched nature of the molecule, significantly complicating the
extraction of sequence information. In addition, groups of isomeric monosaccharides
commonly share the same mass, making it impossible to distinguish them by MS
alone. Databases for glycans exist but are limited, minimally curated, and suffer
badly from pollution from glycan structures that are not produced in nature or are
irrelevant to the organism of study. PGMHD attempts to employ machine learning
techniques (mainly probabilistic-based multi-label classification) to find a solution
for the automated identification of glycans using MS data.
We recently implemented the Glycan Elucidation and Annotation Tool (GELATO),
which is a semi-automated MS annotation tool for glycomics integrated within our
MS data processing framework called GRITS (http://www.grits-toolbox.org/). Figures 7
and 8 show screenshots from GELATO for annotated spectra. Figure 7 shows
the MS profile level and Figure 8 shows the annotation of MS2 peaks using fragments
of a selected candidate glycan for annotation of the MS1 data. The output
of GELATO represents all the possible annotations of the given spectra. The user may
select a subset of those possible annotations as the correct ones, but then he/she
needs a smarter tool that can learn the correct selection and eliminate the incorrect
ones in the future. PGMHD is successfully applied for that purpose as we show in
this section.
To represent the MS data annotation using PGMHD, each annotation of MS1
data (which is a glycan) is represented as a node in the top layer of PGMHD. All
Figure 7 MS1 annotation using GELATO. Scan is the ID number of the scan in the MS file, peak
charge is the charge state of that peak in the MS file, peak intensity represents the abundance of
an ion at that peak, peak m/z is the mass over charge of the given peak, cartoon is the
annotation of that peak (glycan) in CFG format, feature m/z is the mass over charge for the
glycan, and glycanID is the ID of the glycan in the Glycan Ontology (GlycO).
the fragments generated by that glycan and used to annotate peaks in MS2 are
represented by nodes in the lower layer and connected by edges with the parent node
in the upper layer, and this pattern can be extended until MSn. Each fragment at
level MSi is represented by a node in layer L_{i-1} and connected by an edge with its
parent node at layer L_{i-2}. The edge's weight represents the co-occurrence frequency
between a child and a parent, and storing frequencies rather than probabilities
facilitates progressive learning. Figure 9 shows the PGMHD for MS data with three
levels (MS1, MS2, and MS3). As shown in the model, three layers
are created: one for the MS1 level, a second one for the MS2 level, and a third
for MS3. Several different nodes at the MS1 level can be annotated with the same
fragment ion at the MS2 level, so MS2 nodes can have several parents. The frequency
values are shown on the edges.
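Assuming the hypothetical PGMHD sketch introduced earlier, training on this data amounts
to recording one (MS1 glycan, MS2 fragment, MS3 fragment) path per approved annotation;
the identifiers below are purely illustrative:

    # Each approved annotation path: (MS1 glycan, MS2 fragment, MS3 fragment).
    ms_model = PGMHD(num_levels=3)
    ms_model.observe(("glycan_G1", "fragment_F3", "fragment_F7"))
    ms_model.observe(("glycan_G1", "fragment_F4", "fragment_F8"))
    ms_model.observe(("glycan_G2", "fragment_F3", "fragment_F9"))

    # Multi-label classification of a newly observed MS2 fragment: rank its candidate
    # parent glycans, e.g. classification_scores(ms_model, "fragment_F3")
    # gives {"glycan_G1": 0.5, "glycan_G2": 0.5}.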
We annotated 3314 MS scans of pancreatic cancer samples using GELATO. Then
an expert manually approved 1990 scan annotations, which we used to train and test
PGMHD. We split this data set into training and test sets: 1779 scans for training
and 121 scans for testing. We trained PGMHD and compared it against
leading classifiers including Naive Bayes [20], SVM [21], Decision
Tree [22], K-NN [23], Neural Network [24], Radial Basis Function network (RBF
Network) [25] and Bayesian Network [26] from Weka [27]. We then provided the test
set to each classifier to predict the best glycan that annotates each scan in the test
set. We also used Mulan [28], which is a Java library that extends Weka classifiers
to handle multi-label classification problems. Also, we applied the m-estimate as a
probability estimation technique to help PGMHD overcome the zero-frequency problem
that is common to any Bayesian model [29]. Figure 10 shows
the precision and recall for the different classifiers compared to PGMHD after we
used Mulan for multi-label classification and the m-estimate for PGMHD. Another
important aspect of our experiment besides accuracy is scalability. In order to
measure the scalability of PGMHD compared to the other classifiers, we measured
the space and time complexity. Figure 11 shows that PGMHD was the fastest model
in the training phase; however, it was not the best in classification time, as shown
Figure 8 Fragments of a selected glycan at the MS2 level. Each ion observed in MS1 is selected
and fragmented in MS2 to generate smaller ions, which can be used to identify the glycan
structure that most appropriately annotates the MS1 ion. Theoretical fragments of the glycan
structure that had been used to annotate the MS1 spectrum are used to annotate the
corresponding MS2 spectrum.
in Figure 12, though it did get the third best time. Most important is the space
complexity of each model, which is shown in Figure 13. The memory usage,
which is the most important aspect of scalability, shows that PGMHD is much
better than all the other classifiers, especially the Bayesian ones. Due to the difficulty
of obtaining a larger manually curated MS annotation dataset for testing the scalability
of our model in comparison to other machine learning models, we synthesized
a dataset using GELATO. To synthesize a dataset with a massive number of MS
annotations, we used GELATO to generate all the possible annotations for the MS
experiments that had been manually curated before. We assume that all the
annotations generated by GELATO are valid and correct. Our focus in this part
of the experiment is scalability, not accuracy, since the accuracy was already
tested using the manually curated dataset. The new dataset includes 6776 instances
for training and 392 instances for testing. The number of features is 2952 while the
number of classes is 1340. As a result of this extension of the training data, the
Bayesian Network classifier, K-NN, and RBF network ran out of memory, which means they
cannot handle this dataset in 4 GB of main memory. On the other hand, PGMHD
used only 160 MB to represent this dataset in memory. Figure 14 shows the memory
usage of the models that scaled successfully to handle the new dataset.
Semantically Related Keywords in Search Logs
Semantic similarity is a metric that is defined over documents or terms in which
the distance between them reflects the likeness of their meaning [30], and it is
widely used in Natural Language Processing (NLP) and Information Retrieval (IR)
[31]. Generally, there are two major techniques used to compute semantic similarity:
one is computed using a semantic network (Knowledge-based approach) [32], and
Figure 9 PGMHD representing MS annotations. The root nodes are the glycans that annotate
the peaks at the MS1 level, while the level 2 and 3 nodes are the glycan fragments that annotate the
peaks at the MS2 and MS3 levels, respectively, and the edges represent the dependency associating the
glycans with their MS2 fragments and the MS2 fragments with their MS3 fragments.
the other is based on computing the relatedness of terms within a large corpus
of text (corpus-based approach) [31]. The major techniques classified as corpus-
based approaches are Pointwise Mutual Information (PMI) [33] and Latent Semantic
Analysis (LSA) [34], though PMI outperforms LSA on mining the web for synonyms
[35]. A group of Google researchers proposed two efficient models which can discover
semantic word similarities [36]. The two novel models are the following:
1 Continuous Bag-of-Words model
2 Continuous Skip-gram model
These models aim to use large-scale neural networks to learn word vectors. The
two models have restrictions that make them unsuitable for our use case. The first
restriction is that both models require words and the context in which those words are
used. In their experiments, the authors built vectors of at least 50 words around
the given word (words before and after the given word from the text in which
that word was used). One more restriction is that they allow only single-token
words to be processed (no phrases). In our case, the two models are not applicable
since the searches conducted by the users usually contain a single phrase with no
context or other words surrounding it. Also, we care about phrases as opposed to
single words, since small phrases are most commonly used in our search engine.
For example "Java Developer" should be considered as a single phrase when we
discover the semantically related phrases. In our experiment, we discovered high
quality semantic relationships using a data set of 1.6 billion search logs (search
keywords used to search for jobs on CareerBuilder.com). PGMHD completed this
task in 45 minutes.
Motivation
We would like to create a language-independent algorithm for modeling semantic re-
lationships between search phrases that provides output in a human-understandable
format. It is important that the person searching can be assisted by an augmented
query without us creating a black-box system in which that person is unable to un-
derstand and adjust the query augmentation. CareerBuilder[1] operates job boards
[1]http://www.careerbuilder.com/
Figure 10 Precision and recall after the multi-label classification. PGMHD was applied with the
m-estimate where m = 1 and p = 0.1.
in many countries and receives tens of millions of search queries every day. Given the
tremendous volume of search data in our logs, we would like to discover the latent
semantic relationships between search terms and phrases for different region-specific
websites using a novel technique that avoids the need to use natural language pro-
cessing (NLP). We wish to avoid NLP in order to make it possible to apply the
same technique to different websites supporting many languages without having to
change the algorithms or the libraries per-language.
It is tempting to suggest using a synonym dictionary since the problem sounds like
finding synonyms, but the problem here is more complicated than finding synonyms,
since the search terms or phrases on our site are often job titles, skills, and company
names, which are not, in most cases, regular words from any dictionary. For example,
if a user searches for "java developer", we would not find any synonyms for this
phrase in a dictionary. Another user may search for "hadoop", which is also not a
word that would be found in a typical English dictionary.
Probabilistic Semantic Similarity Scoring using PGMHD
We applied the proposed PGMHD model to discover the semantically related search
terms by measuring probabilistic-based semantic similarity between those search
terms. Given the search logs for all the users and the users’ classifications as shown
in Table 1, PGMHD can represent this kind of data by placing the classes of the
users as root nodes and placing the search terms for all the users in the second
level as children nodes. Then, an edge will be formed linking each search term back
to the class of the user who searched for it. The frequency of each search term
(how many users search for it) will be stored in the node of that term, while the
frequency of a specific search term searched for by users of a specific class (how
many users belonging to that class searched for the given term) will be stored in
the edge between the class and the term. The frequency of the root node is the
Figure 11 Training time (ms) for different classifiers: PGMHD 1679, Naive Bayes 6681, Bayes Net 13769,
Decision Tree (J48) 15542, K-NN (K=1) 3690, SVM 18987, RBF Network 56475. Lazy classifiers (PGMHD
and K-NN) are much faster in the training phase because no complicated calculation is required.
Table 1 Input data to PGMHD over Hadoop
UserID Classification Search Terms
user1 Java Developer Java, Java Developer, C#, Software Engineer
user2 Nurse RN, Rigistered Nurse, Health Care
user3 .NET Developer C#, ASP, VB, Software Engineer, SE
user4 Java Developer Java, JEE, Struts, Software Engineer, SE
user5 Health Care Health Care Rep, HealthCare
summation of the frequencies on the edges that connect that root node with its
children (Figure 15).
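Using the same hypothetical sketch as before, the two-level model for the data in Table 1
can be built and queried as follows (the similarity call assumes the co_similarity helper
given earlier; strings are taken directly from the table):

    search_model = PGMHD(num_levels=2)
    # One (classification, search term) path per term each user searched for (data from Table 1).
    table1 = [
        ("Java Developer", ["Java", "Java Developer", "C#", "Software Engineer"]),
        ("Nurse", ["RN", "Rigistered Nurse", "Health Care"]),
        (".NET Developer", ["C#", "ASP", "VB", "Software Engineer", "SE"]),
        ("Java Developer", ["Java", "JEE", "Struts", "Software Engineer", "SE"]),
        ("Health Care", ["Health Care Rep", "HealthCare"]),
    ]
    for user_class, terms in table1:
        for term in terms:
            search_model.observe((user_class, term))

    t = sum(len(terms) for _, terms in table1)  # 19 (classification, term) observations
    print(co_similarity(search_model, "Java", "Software Engineer", t))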
Distributed PGMHD
In order to process 1.6 billion search logs (each search log contains one or more key-
words used by a user to search for jobs on careerbuilder.com) provided by Career-
builder in reasonable time, we designed a distributed PGMHD using Hadoop HDFS
[37], Hadoop Map/Reduce [38] and Hive [39]. The design of distributed PGMHD is
shown in figure 16. Basically we use Hive to store the intermediate data while we
are buidling and training PGMHD. Once it is trained we can then run our inquires
to get an ordered list of the semantically related keywords for a specific term(s).
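The production pipeline uses Java MapReduce jobs and Hive tables as shown in Figure 16;
the following is only a local sketch of the first counting pass (a hypothetical tab-separated
row layout of UserID, Classification, comma-separated SearchTerms is assumed), emulating
the map and reduce steps in a single process:

    from collections import Counter

    def map_search_logs(lines):
        # Map step: emit one ((classification, term), 1) pair per search term in a log row.
        for line in lines:
            user_id, classification, terms = line.rstrip("\n").split("\t")
            for term in terms.split(","):
                yield (classification, term.strip()), 1

    def reduce_counts(pairs):
        # Reduce step: sum the (classification, term) counts; these become PGMHD edge frequencies.
        counts = Counter()
        for key, value in pairs:
            counts[key] += value
        return counts

    # Example:
    rows = ["user1\tJava Developer\tJava, Java Developer, C#, Software Engineer"]
    edge_frequencies = reduce_counts(map_search_logs(rows))
    # edge_frequencies[("Java Developer", "Java")] == 1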
Experiment Setup and Results
The experiment performing latent semantic discovery among search terms using
PGMHD was run on a Hadoop cluster with 69 data nodes, each having a 2.6 GHz
AMD Opteron Processor with 12 to 32 cores and 32 to 128 GB RAM. Table 2 shows
sample results of 10 terms with their top 5 related terms discovered by PGMHD. To
evaluate the model’s accuracy, we sent the results to data analysts at CareerBuilder
who reviewed 1000 random pairs of discovered related search terms and returned
the list with their feedback about whether each pair of discovered related terms
was "related" or "unrelated". We then calculated the accuracy (precision) of the
model based upon the ratio of the number of related results to the total number
Figure 12 Classification time (ms) for different classifiers: PGMHD 5218, Naive Bayes 13923, Bayes Net 6020,
Decision Tree (J48) 168, K-NN (K=1) 5535, SVM 1128, RBF Network 4663. The eager classifiers (Decision Tree,
SVM, and RBF) are faster than the lazy ones because the complicated computations are done during the
training phase, which makes the classification time faster. Naive Bayes and Bayesian Networks did not do
well due to the multi-label classification, for which they are not suitable.
of results. The results show the accuracy of the discovered semantic relationships
among search terms using the PGMHD model to be 0.80.
Table 2 PGMHD results for latent semantic discovery
Term: Related Terms
hadoop: big data, hadoop developer, obiee, java, python
registered nurse: rn registered nurse, rn, registered nurse manager, nurse, nursing, director of nursing
data mining: machine learning, data scientist, analytics, business intellegence, statistical analyst
solr: lucene, hadoop, java
software engineer: software developer, programmer, .net developer, web developer, software
big data: nosql, data science, machine learning, hadoop, teradata
realtor: realtor assistant, real estate, real estate sales, sales, real estate agent
data scientist: machine learning, data analyst, data mining, analytics, big data
plumbing: plumber, plumbing apprentice, plumbing maintenance, plumbing sales, maintenance
agile: scrum, project manager, agile coach, pmiacp, scrum master
Discovering Semantic Ambiguity of a Keyword
The semantic ambiguity of a keyword can be defined as the likelihood of seeing
different meanings of the same keyword in different contexts [40, 41]. The techniques
mentioned in the literature focus on the utilization of ontologies and dictionaries
like WordNet, as described in [40, 41]. Those solutions are not applicable when the
keywords are from a domain like job search. In the job search domain, the
keywords used are typically job titles, skills, company names, etc., which are not regular
English keywords. For example, java can mean a programming language as well as
coffee, but an English dictionary would not provide both of those meanings.
PGMHD is applied successfully to discover the semantic ambiguity of a keyword.
About 1.6 billion search logs were used in this experiment. The search keywords extracted
from those 1.6 billion logs were used to train PGMHD, which was then used to
calculate the normalized PMI score for each term with all of its parents. The initial
Figure 13 Memory usage by each model in MB for a dataset of 1779 instances annotated by 468
glycans (classes).
Figure 14 Memory usage in MB by each model for a training dataset of 6776 instances, 1640
features, and 1340 glycan classes: PGMHD 160, Naive Bayes 1408, SVM 1403, Decision Tree 702.
Figure 15 PGMHD representing job search keywords (root nodes: Java Developer, .NET Developer,
Nurse, Health Care; child nodes: search terms such as Java, J2EE, C#, Care giver, RN, Senior Home;
edge weights give co-occurrence frequencies).
results of this use case are promising, though work to improve the implementation
Figure 16 PGMHD over Hadoop. One MapReduce job reads input rows of (UserId, Classification,
SearchTerms) and counts term frequencies and (term, class) frequencies; a second job counts
classification frequencies; the resulting Hive tables are then joined to build the PGMHD.
is still ongoing. We plan to publish a separate paper about this use case of
PGMHD soon.
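The exact normalization used in production is not spelled out here; one common form of
normalized PMI, computed over the same frequency counts and assuming the hypothetical
PGMHD sketch above, is:

    import math

    def npmi(model, term, parent, t):
        # Normalized PMI between a term and one of its parent classifications, estimated
        # from co-occurrence frequencies. One common normalization: pmi / (-log p(term, parent)).
        p_joint = model.freq[(parent, term)] / t
        if p_joint == 0:
            return -1.0
        if p_joint == 1.0:
            return 1.0
        p_parent = model.out_count(parent) / t
        p_term = model.in_count(term) / t
        pmi = math.log(p_joint / (p_parent * p_term))
        return pmi / (-math.log(p_joint))

    # A term whose association is spread over several unrelated parent classes (for
    # example, "java" under both software and food-service classifications) is a
    # candidate semantically ambiguous keyword.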
Table 3 shows sample results of the discovered semantically ambiguous terms
using PGMHD.
Table 3 PGMHD results for semantic ambiguity discovery. Each keyword is followed by the related
keywords of each of its possible meanings (one meaning per line).
architect:
  enterprise architect, java architect, data architect, telecommute, oracle, java, .net
  architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
account:
  bookkeeper, accountant, analyst, finance
  sales executive, account executive, insurance, account manager, outside sales, medical sales, manager, sales
designer:
  design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
  graphic, web designer, design, web design, graphic design, graphic designer
  design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, drafter drafting designer autocad, structural designer, revit
Conclusions and Future Work
Probabilistic graphical models are very important in many modern applications
such as data mining and data analytics. The major issue with existing probabilistic
graphical models is their scalability to handle large data sets, making this a very
important area for research, especially given the tremendous modern focus on big
data due to the number of data points produced by modern computer systems and
sensors. PGMHD is a probabilistic graphical model that attempts to solve the scal-
ability problems with existing models in scenarios where massive hierarchical data
is present. PGMHD is designed to fit hierarchical data sets of any size, regardless of
the domain to which the data belongs. PGMHD can represent hierarchical data
with any number of levels, it can handle multi-label classification, and it is suitable
for progressive learning since it is considered to be a lazy classifier. In this paper we
present three experiments from different domains: the first being the automated tagging
of high-throughput mass spectrometry data in bioinformatics, the second being latent
semantic discovery using search logs from the largest job board in the U.S., and
the last being identification of semantically ambiguous keywords. The three use
cases in which we tested PGMHD show that this model is robust and can scale from
a few thousand entries to billions of entries, and that it can also run on a single
computer (for smaller data sets), as well as in a parallelized fashion on a large clus-
ter of servers (69 were used in our experiment). PGMHD is used in production at
CareerBuilder.com for discovery of semantically related keywords and semantically
ambiguous keywords. The work on discovering semantically ambiguous keywords is
ongoing, and we plan to publish a separate paper about it. We plan to compare
machine learning algorithms implemented in Apache Spark with PGMHD.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
KA carried out the design, implementation, and experiments related to MS annotation, Discovering
semantically related keywords, and discovering semantically ambiguous keywords. MK participated in the
design, implementation, and experiments related to discovering semantically related keywords, and discovering
semantically ambiguous keywords. CO participated in the design, implementation, and experiments related to
discovering semantically related keywords, and discovering semantically ambiguous keywords. TG participated
in the design, and experiments related to discovering semantically related keywords, and discovering
semantically ambiguous keywords. RR, WY, and MP participated in the design and validation of the MS
annotation experiments. JM, KR, KK, and WY all contributed to writing this manuscript and validating
the model as well as the results. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank David Crandall from Indiana University for providing very helpful comments
and suggestions to improve this paper. We also would like to thank Kiyoko Aoki Kinoshita from Soka University
for the valuable discussions and suggestions to improve this model. Deep thanks to the search team, the big
data team, and the data science team at CareerBuilder.com for their support while implementing and testing this
model on their Hadoop cluster.
Author details
1Department of Computer Science, University of Georgia, Athens,GA, USA. 2School of Informatics and
Computing, Indiana University, Bloomington, IN, USA. 3CareerBuilder.com, Norcross, GA, USA. 4Complex
Carbohydrate Research Center, University of Georgia, Athens,GA, USA.
References
1. Smyth, P.: Belief networks, hidden markov models, and markov random fields: A unifying view. Pattern
recognition letters 18(11), 1261–1268 (1997)
2. Hamelryck, T.: An overview of bayesian inference and graphical models. In: Bayesian Methods in
Structural Bioinformatics, pp. 3–48. Springer, ??? (2012)
3. Korayem, M., Badr, A., Farag, I.: Optimizing hidden markov models using genetic algorithms and artificial
immune systems. Computing and Information Systems 11(2), 44 (2007)
4. Eddy, S.R., Durbin, R.: Rna sequence analysis using covariance models. Nucleic acids research 22(11),
2079–2088 (1994)
5. Söding, J.: Protein homology detection by hmm–hmm comparison. Bioinformatics 21(7), 951–960 (2005)
6. Christodoulopoulos, C., Goldwater, S., Steedman, M.: A bayesian mixture model for part-of-speech
induction using multiple features. In: Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pp. 638–647 (2011). Association for Computational Linguistics
7. Kupiec, J.: Robust part-of-speech tagging using a hidden markov model. Computer Speech & Language
6(3), 225–242 (1992)
8. Gyftodimos, E., Flach, P.A.: Hierarchical bayesian networks: an approach to classification and learning for
structured data. In: Methods and Applications of Artificial Intelligence, pp. 291–300. Springer, ??? (2004)
9. Fine, S., Singer, Y., Tishby, N.: The hierarchical hidden markov model: Analysis and applications. Machine
learning 32(1), 41–62 (1998)
10. Thomas, C.J., Sheth, A.P., York, W.S.: Modular ontology design using canonical building blocks in the
biochemistry domain. Frontiers in Artificial Intelligence and Applications 150, 115 (2006)
11. Ceroni, A., Maass, K., Geyer, H., Geyer, R., Dell, A., Haslam, S.M.: Glycoworkbench: a tool for the
computer-assisted annotation of mass spectra of glycans. Journal of proteome research 7(4),
1650–1659 (2008)
12. Jordan, M.I., et al.: Graphical models. Statistical Science 19(1), 140–155 (2004)
13. Pearl, J.: Markov and Bayes Networks: a Comparison of Two Graphical Representations of Probabilistic
Knowledge. Computer Science Department, University of California, ??? (1986)
14. Friedman, N., Nachman, I., Peér, D.: Learning bayesian network structure from massive datasets: the
"sparse candidate" algorithm. In: Proceedings of the Fifteenth Conference on Uncertainty in Artificial
Intelligence, pp. 206–215 (1999). Morgan Kaufmann Publishers Inc.
15. Darwiche, A.: Bayesian networks. Communications of the ACM 53(12), 80–90 (2010)
16. Bielza, C., Larrañaga, P.: Discrete bayesian network classifiers: A survey. ACM Computing Surveys (CSUR)
47(1), 5 (2014)
17. Aoki-Kinoshita, K.F.: An introduction to bioinformatics for glycomics research. PLoS computational
biology 4(5), 1000075 (2008)
18. Raman, R., Raguram, S., Venkataraman, G., Paulson, J.C., Sasisekharan, R.: Glycomics: an integrated
systems approach to structure-function relationships of glycans. Nature Methods 2(11), 817–824 (2005)
19. Ma, B.: Challenges in computational analysis of mass spectrometry data for proteomics. Journal of
Computer Science and Technology 25(1), 107–123 (2010)
20. Rish, I.: An empirical study of the naive bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in
Artificial Intelligence, vol. 3, pp. 41–46 (2001)
21. Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297 (1995)
22. Quinlan, J.R.: Simplifying decision trees. International journal of man-machine studies 27(3), 221–234
(1987)
23. Altman, N.S.: An introduction to kernel and nearest-neighbor nonparametric regression. The American
Statistician 46(3), 175–185 (1992)
24. Werbos, P.J.: Generalization of backpropagation with application to a recurrent gas market model. Neural
Networks 1(4), 339–356 (1988)
25. Poggio, T., Girosi, F.: A theory of networks for approximation and learning. Technical report, DTIC
Document (1989)
26. Pearl, J.: Bayesian networks: A model of self-activated memory for evidential reasoning (1985)
27. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining
software: an update. ACM SIGKDD explorations newsletter 11(1), 10–18 (2009)
28. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: A java library for multi-label
learning. The Journal of Machine Learning Research 12, 2411–2414 (2011)
29. Jiang, L., Wang, D., Cai, Z.: Scaling up the accuracy of bayesian network classifiers by m-estimate. In:
Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence, pp.
475–484. Springer, ??? (2007)
30. Harispe, S., Ranwez, S., Janaqi, S., Montmain, J.: Semantic measures for the comparison of units of
language, concepts or entities from text and knowledge base analysis. arXiv preprint arXiv:1310.1285
(2013)
31. Mihalcea, R., Corley, C., Strapparava, C.: Corpus-based and knowledge-based measures of text semantic
similarity. In: AAAI, vol. 6, pp. 775–780 (2006)
32. Budanitsky, A., Hirst, G.: Semantic distance in wordnet: An experimental, application-oriented evaluation
of five measures. In: Workshop on WordNet and Other Lexical Resources, vol. 2 (2001)
33. Bouma, G.: Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of the
Biennial GSCL Conference, pp. 31–40 (2009)
34. Dumais, S.T.: Latent semantic analysis. Annual review of information science and technology 38(1),
188–230 (2004)
35. Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Proceedings of the 12th
European Conference on Machine Learning. EMCL ’01, pp. 491–502. Springer, London, UK (2001).
http://dl.acm.org/citation.cfm?id=645328.650004
36. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781 (2013)
37. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: Mass Storage
Systems and Technologies (MSST), 2010 IEEE 26th Symposium On, pp. 1–10 (2010). IEEE
38. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Communications of the
ACM 51(1), 107–113 (2008)
39. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.:
Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2),
1626–1629 (2009)
40. Jayadianti, H., Nugroho, L.E., Pinto, C.S., Santosa, P.I., Widayat, W.: Solving problem of ambiguity terms
using ontology (2013)
41. Gracia, J., Lopez, V., d’Aquin, M., Sabou, M., Motta, E., Mena, E.: Solving semantic ambiguity to
improve semantic web based ontology matching (2007)