
PGMHD: A Scalable Probabilistic Graphical Model for Massive

Hierarchical Data Problems

Khalifeh AlJadda∗  Mohammed Korayem†  Camilo Ortiz‡  Trey Grainger§

John A. Miller¶  William S. York‖

Abstract

In the big data era, scalability has become a crucial requirement for any useful computational

model. Probabilistic graphical models are very useful for mining and discovering data insights, but

they are not scalable enough to be suitable for big data problems. Bayesian Networks particularly

demonstrate this limitation when their data is represented using few random variables while each

random variable has a massive set of values. With hierarchical data - data that is arranged in a

treelike structure with several levels - one would expect to see hundreds of thousands or millions of

values distributed over even just a small number of levels. When modeling this kind of hierarchical

data across large data sets, Bayesian networks become infeasible for representing the probability

distributions for the following reasons: i) Each level represents a single random variable with hundreds

of thousands of values, ii) The number of levels is usually small, so there are also few random variables,

and iii) The structure of the network is predeﬁned since the dependency is modeled top-down from

each parent to each of its child nodes, so the network would contain a single linear path for the

random variables from each parent to each child node. In this paper we present a scalable probabilistic

graphical model to overcome these limitations for massive hierarchical data. We believe the proposed

model will lead to an easily-scalable, more readable, and expressive implementation for problems that

require probabilistic-based solutions for massive amounts of hierarchical data. We successfully applied

this model to solve two diﬀerent challenging probabilistic-based problems on massive hierarchical data

sets for diﬀerent domains, namely, bioinformatics and latent semantic discovery over search logs.

1 Introduction

Probabilistic graphical models (PGM) refer to a family of techniques that merge concepts from graph

structures and probability models [35]. They represent the conditional dependencies among sets of random

variables [19]. In the age of big data, PGMs can be very useful for mining and extracting insights from

large-scale and noisy data. The major challenges that PGMs face in this emerging ﬁeld are the scalability

and the restriction that they can only be applied on a propositional domain [18, 7]. Some extensions have

already been proposed to address these challenges, such as hierarchical probabilistic graphical models

(HPGM) which aim to extend the PGM to work with non-propositional domains [18, 14]. The focus

of these models is to make Bayesian networks applicable to non-propositional domains, but they do not

solve the scalability issues that arise when they are applied to massive data sets.

Massive data sets often exhibit hierarchical properties, where data can be divided into several levels

arranged in tree-like structures. Data items in each level depend on, or are influenced by, only the data

items in the immediate upper level. For this kind of data the most appropriate PGM to represent the

probability distribution would be a Bayesian network (BN), since the dependencies in this kind of data

are not bidirectional. A Bayesian network is considered to be feasible when it can provide a concise

representation of a large probability distribution where the need cannot be eﬃciently handled using

traditional techniques such as tables and equations [10]. Such a scenario is not the case with massive

hierarchical data, however, since each level only represents one random variable, while the data items in

that level are outcomes of that random variable. For example, consider that the hierarchical data are

∗Department of Computer Science, University of Georgia, Athens, Georgia. Email: aljadda@uga.edu, jam@cs.uga.edu

†School of Informatics and Computing, Indiana University, Bloomington, IN. Email: mkorayem@cs.indiana.edu

‡CareerBuilder, Norcross, GA. Email: Camilo.Ortiz@careerbuilder.com

§CareerBuilder, Norcross, GA. Email: trey.grainger@careerbuilder.com

¶Department of Computer Science, University of Georgia, Athens, Georgia. Email: jam@cs.uga.edu

‖Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia. Email: will@ccrc.uga.edu


arXiv:1407.5656v1 [cs.AI] 21 Jul 2014

organized as follows: The data items in the top level (root level) represent US cities, while the data items

in the second level represent diseases, where each city is connected with the set of diseases that appears in

that city. In this case assume we have 19000 cities and 50000 diseases. If we would like to represent this

data in a BN, we will consider all of the cities in the root level to be outcomes of one random variable City

and all the data items in the second level to be outcomes of another random variable Disease. Thus, the

BN for this data will be composed of two nodes with a single path City → Disease, while the conditional

probability table (CPT) for the Disease will contain 950,000,000 (50,000 × 19,000) entries. For this

kind of data, we propose a simple probabilistic graphical model (PGMHD) that can represent massive

hierarchical data in more eﬃcient way. We successfully apply the PGMHD in two diﬀerent domains:

bioinformatics (for multi-class classiﬁcation) and search log analytics (for latent semantic discovery).

The main contributions of this paper are as follows:

•We propose a simple, eﬃcient and scalable probabilistic-based model for massive hierarchical data.

•We successfully apply this model to the Bioinformatics domain in which we automatically classify

and annotate high-throughput mass spectrometry data.

•We also apply this model for large-scale latent semantic discovery using 1.6 billion search log entries

provided by CareerBuilder.com, using the Hadoop Map/Reduce framework.

2 Background

Graphical models can be classiﬁed into two major categories: (1) directed graphical models, which are

often referred to as Bayesian networks, or belief networks, and (2) undirected graphical models which

are often referred to as Markov Random Fields, Markov networks, Boltzmann machines, or log-linear

models [23]. Probabilistic graphical models (PGMs) consist of both graph structure and parameters.

The graph structure represents a set of conditionally independent relations for the probability model,

while the parameters consist of the joint probability distributions [35]. Probabilistic graphical models are

often considered to be more convenient than numerical representations for two main reasons [31]:

1. To encode a joint probability distribution P(x_1, ..., x_n) for n propositional variables with a numerical representation, we need a table with 2^n entries.

2. Numerical representations are inadequate for addressing the notion of independence: to test independence between x and y, one needs to test whether the joint distribution of x and y is equal to the product of their marginal probabilities.
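As a concrete illustration of both points, the following Python sketch (not from the paper; the toy joint distribution is an assumption) shows the 2^n table blow-up and an explicit independence test:

```python
# Illustrative sketch (not from the paper): the joint-table blow-up and an
# explicit independence test; the toy joint distribution below is made up.
import itertools

n = 20
table_entries = 2 ** n  # a full joint table over 20 binary variables
print(table_entries)    # 1048576

# Toy joint distribution over two binary variables x and y.
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

# Marginal probabilities of x and y.
px = {x: sum(p for (xv, _), p in joint.items() if xv == x) for x in (0, 1)}
py = {y: sum(p for (_, yv), p in joint.items() if yv == y) for y in (0, 1)}

# x and y are independent iff the joint equals the product of marginals.
independent = all(
    abs(joint[(x, y)] - px[x] * py[y]) < 1e-12
    for x, y in itertools.product((0, 1), repeat=2)
)
print(independent)  # True
```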

PGMs are used in many domains. For example, Hidden Markov Models (HMM) are considered

a crucial component for most of the speech recognition systems [24]. In bioinformatics, probabilistic

graphical models are used in RNA sequence analysis [12], protein homology detection and sequence

alignment [36], and for genome-wide identiﬁcation [39]. In natural language processing (NLP), HMM

and Bayesian models are used for part of speech (POS) tagging [8, 25]. The problem with PGMs in

general, and Bayesian networks in particular, is that they are not suitable for representing massive data

due to the time complexity of learning the structure of the network and the space complexity of storing a

network with thousands of random variables. In general, ﬁnding a network that maximizes the Bayesian

and Minimum Description Length (MDL) scores is an NP-hard problem [15].

2.1 Bayesian Networks

A Bayesian network is a concise representation of a probability distribution that is too large to be handled using

traditional techniques such as tables and equations [10]. The graph of a Bayesian network is a directed

acyclic graph (DAG) [19]. A Bayesian network consists of two components: a DAG representing the

structure, and a set of conditional probability tables (CPTs) as shown in Figure 1. Each node in a Bayesian

network must have a CPT which quantiﬁes the relationship between the variable represented by that node

and its parents in the network. Completeness and consistency are guaranteed in a Bayesian network

since there is only one probability distribution that satisﬁes the Bayesian network constraints [10]. The

constraints that guarantee a unique probability distribution are the numerical constraints represented by

CPT and the independence constraints represented by the structure itself. The independence constraint

is shown in Figure 1. Each variable in the structure is independent of any other variables other than its

parents, once its parents are known. For example, once the information about A is known, the probability


Figure 1: Bayesian Network [10]

Figure 2: Markov Network

of L will not be aﬀected by any new information about F or T, so we call L independent of F and T once

A is known. These independence constraints are known as the Markovian assumptions.

Bayesian networks are widely used for modeling causality in a formal way, for decision-making under

uncertainty, and for many other applications [10].

2.2 Markov Random Fields (MRFs)

MRFs, which are known also as Markov networks, are the most well-known graphical models in which

the graph is undirected. In this graphical model, the random variables are represented as vertices while

the edges represent dependency. However, because there is no clear causal inﬂuence from one node to

the other (i.e. the link represents a direct dependency between two variables, but neither one of them

is a cause for the other) the edges are undirected. In an undirected graph any two nodes without a

direct link are always conditionally independent variables, whereas any two nodes with a direct link are

always dependent [19, 31]. In MRFs the joint probability distribution can be calculated by multiplying a

normalization factor by potential functions, which assign a positive value to a set of fully connected nodes

called a clique. A clique is a fully connected subset of nodes that is associated with a non-negative

potential function φ. Potential functions are derived from the notion of conditional independence, so any

potential function must refer only to the nodes that are directly connected (i.e. form a maximal clique).

According to cliques and potential functions, the joint probability in the undirected graph shown in Figure 2 is calculated using the following equation:

p(a, b, c, d) = (1/Z) φ_{acd}(a, c, d) φ_{ab}(a, b)

where Z is a normalization factor that is calculated by summing or integrating the product of the potential functions:

Z = Σ_a Σ_b Σ_c Σ_d φ_{acd}(a, c, d) φ_{ab}(a, b)

MRFs are common in many ﬁelds like spatial statistics, natural language processing, and communication

networks that have little causal structure to guide the construction of a directed graph.
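To make the computation above concrete, here is a small Python sketch (illustrative only; the potential functions phi_acd and phi_ab over the cliques {a, c, d} and {a, b} are made-up assumptions) that evaluates the joint probability and the normalization factor Z by brute force over binary variables:

```python
# Illustrative sketch of the MRF joint probability above; the potential
# functions over the cliques {a, c, d} and {a, b} are made-up assumptions.
import itertools

def phi_acd(a, c, d):
    return 1.0 + a + c + d      # any non-negative potential works

def phi_ab(a, b):
    return 2.0 if a == b else 1.0

# Normalization factor Z: sum the product of potentials over all states.
states = list(itertools.product((0, 1), repeat=4))
Z = sum(phi_acd(a, c, d) * phi_ab(a, b) for a, b, c, d in states)

def p(a, b, c, d):
    """p(a, b, c, d) = (1/Z) * phi_acd(a, c, d) * phi_ab(a, b)."""
    return phi_acd(a, c, d) * phi_ab(a, b) / Z

# By construction the probabilities sum to one.
total = sum(p(*s) for s in states)
print(round(total, 10))  # 1.0
```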

2.3 Hidden Markov Models

A Hidden Markov Model (HMM) is a statistical time series model which is used to model dynamic systems

whose states are not observable, but whose outputs are. HMMs are widely used in speech recognition,


handwriting recognition and text-to-speech synthesis [40]. HMMs rely on three main assumptions. First,

the observation at time t is generated by a process whose state S_t is hidden from observation.

Second, the state of that hidden process satisfies the Markov assumption that once the state of the

system at t is known, its states and outputs at times after t are no longer dependent on states before

t. In other words, the state at a specific time contains all needed information about the history of the

process to predict the future of the process. Upon those assumptions, the joint probability distribution

of a sequence of states and observations can be factored as follows [10, 40]:

P(S_{1:T}, Y_{1:T}) = P(S_1) P(Y_1 | S_1) ∏_{t=2}^{T} P(S_t | S_{t−1}) P(Y_t | S_t)

where S_t refers to the hidden state, Y_t refers to the observation at time t, and the notation 1:T means (1, 2, ..., T).

The third assumption is that the hidden state variables are discrete (i.e. S_t can take on K values). So, to define the probability distribution over observation sequences, we need to specify a probability distribution over the initial state P(S_1), the K × K state transition matrix defining P(S_t | S_{t−1}), and the output model defining P(Y_t | S_t). HMMs are considered a subclass of Bayesian networks known as

dynamic Bayesian networks (DBN), which are Bayesian networks that model systems that evolve over

time [16].
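The factored joint probability above can be evaluated directly; the following Python sketch (the parameters pi, trans, and emit are illustrative assumptions, with K = 2 hidden states and two output symbols) computes P(S_{1:T}, Y_{1:T}) for a short sequence:

```python
# Sketch of the HMM factorization above with K = 2 hidden states and two
# output symbols; the parameters pi, trans, and emit are illustrative.
pi = [0.6, 0.4]                   # initial distribution P(S_1)
trans = [[0.7, 0.3], [0.2, 0.8]]  # transition matrix P(S_t | S_{t-1})
emit = [[0.9, 0.1], [0.3, 0.7]]   # output model P(Y_t | S_t)

def joint(states, observations):
    """P(S_1:T, Y_1:T) = P(S_1) P(Y_1|S_1) * prod_{t=2}^T P(S_t|S_{t-1}) P(Y_t|S_t)."""
    p = pi[states[0]] * emit[states[0]][observations[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][observations[t]]
    return p

print(joint([0, 0, 1], [0, 0, 1]))  # 0.6*0.9 * 0.7*0.9 * 0.3*0.7
```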

3 Related Work

This section describes the work most closely related to the proposed model, from different perspectives. First,

we describe the related hierarchical probabilistic models, then we describe the current techniques used to

automate the annotation of Mass Spectrometry (MS) data for glycomics, which is one of the scenarios

that we use to test the proposed model. We close this section by describing how we applied the proposed

model to discover the latent semantic similarity between keywords extracted from search logs for the

purposes of building a semantic search system.

3.1 Probabilistic Graphical Models for Hierarchical Data

Probabilistic graphical models require propositional domains [18]. To overcome this limitation some

extensions were proposed to extend those models to non-propositional domains. A Bayesian hierarchical

model has been used for natural scene categorization where it performs well on large sets of complex

scenes [13]. This model has also been applied for event recognition of human actions and interactions

[30]. Another application of the hierarchical Bayesian network is for identifying changes in gene expression

from microarray experiments [4].

In [18] the authors introduced a hierarchical Bayesian network which extends the expressiveness of a

regular Bayesian network by allowing a node to represent an aggregation of simpler types which enables

the modeling of complex hierarchical domains. The main idea is to use a small number of hidden variables

as a compressed representation for a set of observed variables with the following restrictions:

1. Any parent of a variable should be in the same or immediate upper layer.

2. At most one parent from the immediate upper layer is allowed for each variable.

So, the idea is mainly to compress the observed data. Although hierarchical Bayesian network models

extended the regular Bayesian network to represent non-propositional domains, they have not been able

to solve the issue of the scalability of Bayesian networks for massive amounts of hierarchical data.

3.2 Automated Annotation of Mass Spectrometry Data for Glycomics

One use case of the proposed model is the automated annotation of Mass Spectrometry (MS) data for

glycomics. Glycans (Figure 3) are the third major class of biological macro-molecules besides nucleic

acids and proteins [1]. Glycomics refers to the scientiﬁc attempts to characterize and study glycans, as

deﬁned in [1] or an integrated systems approach to study structure-function relationships of glycans as

deﬁned in [32]. The importance of this emerging ﬁeld of study is clear from the accumulated evidence for

the roles of glycans in cell growth and metastasis, cell-cell communication, and microbial pathogenesis.

Glycans are more diverse in terms of chemical structure and information density than nucleic acids


Figure 3: Glycan structure in CFG format. The circles and squares represent the monosaccharides which

are the building blocks of a glycan while the lines are the linkages between them

and proteins [32]. Glycan identiﬁcation is much more diﬃcult than protein identiﬁcation, and it is a

proven NP-hard problem [34] since, unlike protein structures, glycan structures are trees rather than

linear sequences. This leads to a large diversity of glycan structures, which, along with the absence of

a standard representation of glycans, has resulted in many incomplete databases, each of which stores

glycan structures and glycan-related data in a diﬀerent format. For example KEGG [21] uses the KCF

format, Glycosciences.de [27] uses the LINUCS format, and CFG [33] uses the IUPAC format.

Although MS has become the major analytical technique for glycans, no general method has been

developed for the automated identiﬁcation of glycan structures using MS and tandem MS data. The

relative ease of peptide identiﬁcation using tandem MS is mainly due to the linear structure of peptides

and the availability of reliable peptide sequence databases. In proteomic analyses, a mostly complete

series of fragment ions with high abundance is often observed. In such tandem mass spectra, the mass of

each amino acid in the sequence corresponds to the mass diﬀerence between two high-abundance peaks,

allowing the amino acid sequence to be deduced. In glycomics MS data, ion series are disrupted by

the branched nature of the molecule, signiﬁcantly complicating the extraction of sequence information.

In addition, groups of isomeric monosaccharides commonly share the same mass, making it impossible

to distinguish them by MS alone. Databases for glycans exist but are limited, minimally curated, and

suﬀer badly from pollution from glycan structures that are not produced in nature or are irrelevant

to the organism of study. Several algorithms have been developed in attempts to semi-automate the

process of glycan identiﬁcation by interpreting tandem MS spectra, including CartoonistTwo [17], GLYCH

[37], GlycoPep ID [22], GlycoMod [9], GlycoPeakFinder [28], GlycoWork-bench [6], and SimGlycan [2]

(commercially available from Premier Biosoft). However, each of these programs produces incorrect

results when using polluted databases to annotate large MS^n datasets containing hundreds or thousands

of spectra. Inspection of the current literature indicates that machine learning and data mining techniques

have not been used to resolve this issue, although they have a great potential to be successful in doing

so. PGMHD attempts to employ machine learning techniques (mainly probabilistic-based classiﬁcation)

to ﬁnd a solution for the automated identiﬁcation of glycans using MS data.

3.3 Semantic Similarity

Semantic similarity, which is a metric that is deﬁned over documents or terms in which the distance

between them reﬂects the likeness of their meaning [20], is well deﬁned in Natural Language Processing

(NLP) and Information Retrieval (IR) [29]. Generally there are two major techniques used to compute

the semantic similarity: one is computed using a semantic network (Knowledge-based approach) [5], and

the other is based on computing the relatedness of terms within a large corpus of text (corpus-based

approach) [29]. The major techniques classiﬁed under corpus-based approach are Pointwise Mutual

Information (PMI) [3] and Latent Semantic Analysis (LSA) [11], though PMI outperforms LSA at mining

the web for synonyms [38]. We applied the proposed PGMHD model to discover related search terms by

measuring probabilistic-based semantic similarity between those search terms.

4 Model Structure

Consider a (leveled) directed graph G = (V, A), where V and A ⊂ V × V denote the sets of nodes and

arcs, respectively, such that:


1. The nodes V are partitioned into m levels L_1, ..., L_m and a root node v_0, such that V = ∪_{i=0}^{m} L_i, L_i ∩ L_j = ∅ for i ≠ j, and L_0 = {v_0}.

2. The arcs in A only connect one level to the next, i.e., if a ∈ A then a ∈ L_{i−1} × L_i for some i = 1, ..., m.

3. An arc a = (v_{i−1}, v_i) ∈ L_{i−1} × L_i represents the dependency of v_i on its parent v_{i−1}, i = 1, ..., m. Moreover, let the function pa : V → P(V) be defined such that pa(v) is the set of all the parents of v, i.e.,

pa(v) = {w : (w, v) ∈ A}  ∀ v ∈ V.

4. The nodes in each level L_i represent all the possible outcomes of a finite discrete random variable, namely V_i, i = 1, ..., m.

We now make some remarks about the above assumptions. First, the node v_0 in the first level L_0 can be seen as the root node and the ones in L_m as leaves. Second, an observation x in our probabilistic model is an outcome of a random variable, namely X ∈ L_0 × ··· × L_m, defined as

X = (V_0 := v_0, V_1, ..., V_m),

which represents a path from v_0 to the last level L_m such that (V_{i−1}, V_i) ∈ A a.s. Hence, P(X = x) = 0 and P(V_{i−1} = v_{i−1}, V_i = v_i) = 0 whenever x_{i−1} = v_{i−1}, x_i = v_i, and (v_{i−1}, v_i) ∉ A.

In addition, we assume that there are n observations of X, namely x^1, ..., x^n, and we let f : V × V → ℕ be a frequency function defined as

f(a) = |{x^j : (x^j_{i−1}, x^j_i) = a, i = 1, ..., m, j = 1, ..., n}|.

Clearly, f(a) = 0 if a ∉ A. These latter observations are the ones used to train our model.
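A minimal Python sketch of this structure, assuming toy observations over two levels (cities and diseases, as in the introduction), might count the arc frequencies f and recover the parent function pa as follows:

```python
# Minimal sketch of the leveled-graph structure above: arcs connect one
# level to the next, and f counts arc traversals over n observed paths.
# The observations below (cities and diseases) are illustrative.
from collections import Counter

observations = [
    ("root", "city_A", "disease_X"),
    ("root", "city_A", "disease_Y"),
    ("root", "city_B", "disease_X"),
]

f = Counter()                       # frequency function f over arcs
for x in observations:
    for i in range(1, len(x)):
        f[(x[i - 1], x[i])] += 1    # arc (x_{i-1}, x_i) was traversed

def pa(v):
    """pa(v): the set of all parents of v implied by the observed arcs."""
    return {w for (w, u) in f if u == v}

print(f[("city_A", "disease_X")])   # 1
print(sorted(pa("disease_X")))      # ['city_A', 'city_B']
```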

It should be observed that the proposed model can be seen as a special case of a Bayesian network

by considering a network consisting of a single directed path with m nodes. However, we believe that a

leveled directed graph that explicitly deﬁnes one node per outcome of the random variables (as described

above): i) leads to an easily scalable (and distributable) implementation of the problems we consider; ii)

improves the readability and expressiveness of the implemented network; and iii) more easily facilitates

the training of the model.

4.1 Probabilistic-based Classiﬁcation

Let X ∈ L_0 × ··· × L_m be defined as earlier in Section 4. Our model can predict the outcome at a parent level i−1 given an observation¹ at level i with a classification score. Given an outcome at level i−1, namely l_{i−1} ∈ L_{i−1}, we define the classification score between l_{i−1} and an observation w ∈ L_i at level i by estimating the conditional probability Cl_i(l_{i−1} | w) := P(X_{i−1} = l_{i−1} | X_i = w) as

Cl_i(l_{i−1} | w) ≈ f(l_{i−1}, w) / T(w),

where

T(w) = Σ_{v ∈ pa(w)} f(v, w).
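A small Python sketch of this classification score, using made-up arc frequencies, could look as follows:

```python
# Sketch of the classification score Cl_i(l_{i-1} | w) ~ f(l_{i-1}, w) / T(w);
# the arc frequencies below are made-up assumptions.
from collections import Counter

f = Counter({
    ("Java Developer", "java"): 8,
    (".NET Developer", "java"): 2,
})

def T(w):
    """T(w): total frequency of w summed over all of its parents."""
    return sum(freq for (v, u), freq in f.items() if u == w)

def cl(parent, w):
    """Estimated conditional probability P(X_{i-1} = parent | X_i = w)."""
    return f[(parent, w)] / T(w)

print(cl("Java Developer", "java"))  # 0.8
```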

4.2 Probabilistic-based Semantic Similarity scoring

Fix a level i ∈ {1, ..., m}, and let X and Y be identically distributed random variables such that X ∈ L_0 × ··· × L_m is defined as earlier in Section 4. We define the probabilistic-based semantic similarity score between two outcomes x_i, y_i ∈ L_i by approximating the conditional joint probability CO_i(x_i, y_i) := P(X_i = x_i, Y_i = y_i | X_{i−1} ∈ pa(x_i), Y_{i−1} ∈ pa(y_i)) as

CO_i(x_i, y_i) ≈ ∏_{v ∈ pa(x_i)} p_i(v, x_i) · ∏_{v ∈ pa(y_i)} p_i(v, y_i),   (1)

where p_i(v, w) = P(X_{i−1} = v, X_i = w) for every (v, w) ∈ L_{i−1} × L_i. We can naturally estimate the probabilities p_i(v, w) with p̂(v, w) defined as

p̂(v, w) := f(v, w) / n.

Hence, we can obtain the related outcomes of x_i ∈ L_i (at level i) by finding all the w ∈ L_i with a large estimated probabilistic-based semantic similarity score CO_i(x_i, w).

¹Different from the observations used to train our model.

Figure 4: PGMHD for tandem MS data. The root nodes are the glycans that annotate the peaks at the MS1 level, while the level 2 nodes are the glycan fragments that annotate the peaks at the MS2 level, and the edges represent dependency between the glycans that generate the fragments.
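The similarity score of Section 4.2 can be sketched in Python as follows (the frequencies and n are illustrative assumptions):

```python
# Sketch of the similarity score CO_i: each factor p_i(v, w) is estimated
# as f(v, w) / n. The frequencies and n below are illustrative assumptions.
from collections import Counter

n = 10
f = Counter({
    ("Java Developer", "java"): 4,
    ("Java Developer", "jee"): 2,
    ("Software Engineer", "java"): 2,
    ("Software Engineer", "jee"): 2,
})

def pa(w):
    return {v for (v, u) in f if u == w}

def p_hat(v, w):
    return f[(v, w)] / n            # estimate of p_i(v, w)

def co(x, y):
    """Product of estimated joint probabilities over the parents of x and y."""
    score = 1.0
    for v in pa(x):
        score *= p_hat(v, x)
    for v in pa(y):
        score *= p_hat(v, y)
    return score

print(co("java", "jee"))            # large scores mean related terms
```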

4.3 Progressive Learning

PGMHD is designed to allow progressive learning, a technique that allows a model to learn gradually over time. The training data does not need to be given to the model all at once; instead, the model can learn from any available data and integrate the new knowledge with what it has already represented. This learning technique is very attractive in the big data age for the following reasons:

1. Training data of any size can be accommodated.

2. The model can easily learn from new data without the need to re-include the previous training data in the learning.

3. The training can be distributed over multiple sessions instead of being done in one long-running session.

4. Recursive learning allows the results of the model to be used as new training data, provided they are judged to be accurate by the user.

The progressive learning approach for PGMHD is shown in Algorithm 1.
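Because the model consists only of frequency counts, progressive learning reduces to incrementing counters batch by batch; a minimal Python sketch (the batches below are made up) is:

```python
# Minimal sketch of progressive learning: since the model is just frequency
# counters, a new batch of observations can be folded in at any time without
# revisiting earlier training data. The batches below are made up.
from collections import Counter

node_freq = Counter()   # how often each parent node was observed
edge_freq = Counter()   # co-occurrence frequency on each edge

def train(batch):
    """Integrate one batch of (parent, child) observations into the model."""
    for parent, child in batch:
        node_freq[parent] += 1
        edge_freq[(parent, child)] += 1

train([("city_A", "disease_X"), ("city_A", "disease_Y")])  # first session
train([("city_A", "disease_X")])                           # later session
print(edge_freq[("city_A", "disease_X")])                  # 2
```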

5 Experimental Results

Once built and trained, PGMHD can be used for different purposes. PGMHD can be used to predict the class from level l for the observations of random variables at level l + 1. For example, in the annotation of the MS data, PGMHD is used to predict the best glycan at the MS1 level to annotate a spectrum by evaluating the annotated peaks at the MS2 level with probability scores that represent how well the selected glycan correlates to the manually curated annotations that were used to train the model.

5.1 PGMHD to automate the MS annotation

This model is well suited for representing MS data. We recently implemented the Glycan Elucidation and

Annotation Tool (GELATO), which is a semi-automated MS annotation tool for glycomics integrated

within our MS data processing framework called GRITS. Figures 4, 5, 6 and 7 show screen shots from

GELATO for annotated spectra. Figure 5 shows the MS proﬁle level and Figures 6, 7, and 8 show the

annotated MS2peaks using fragments of the glycans that were chosen as candidate annotations to the

MS proﬁle data (i.e. level 1).


Algorithm 1: Progressive Learning for PGMHD

Data: Hierarchical Data
Result: Probabilistic Graphical Model
currentInputLevel ← 1
currentGraphLayer ← 1
while currentInputLevel < maxInputLevel do
    foreach dataItem in currentInputLevel do
        read dataItem
        if dataItem exists in currentGraphLayer then
            parentNode ← the node where node.data = dataItem
            parentNode.frequency ← parentNode.frequency + 1
        else
            parentNode ← new node with parentNode.data = dataItem
            parentNode.frequency ← 1
        end
        childrenLevel ← currentInputLevel + 1
        foreach child in dataItem.children do
            if child exists in parentNode.children then
                childNode ← the node where node.data = child.data
                edge ← edge(parentNode, childNode)
                edge.frequency ← edge.frequency + 1
            else if child exists in childrenLevel then
                childNode ← the node where node.data = child.data
                edge ← createNewEdge(parentNode, childNode)
                edge.frequency ← edge.frequency + 1
            else
                childNode ← new node with childNode.data = child
                childNode.frequency ← 0
                edge ← createNewEdge(parentNode, childNode)
                edge.frequency ← 1
            end
        end
    end
    currentInputLevel ← currentInputLevel + 1
    currentGraphLayer ← currentGraphLayer + 1
end


Figure 5: MS1 annotation using GELATO. Scan# is the ID number of the scan in the MS ﬁle, peak

charge is the charge state of that peak in the MS ﬁle, peak intensity represents the abundance of an ion

at that peak, peak m/z is the mass over charge of the given peak, cartoon is the annotation of that peak

(glycan) in CFG format, feature m/z is the mass over charge for the glycan, and glycanID is the ID of

the glycan in the Glycan Ontology (GlycO).


Figure 6: Fragments of Glycan GOG166 at the MS2 level. Each ion observed in MS1 is selected and fragmented in MS2 to generate smaller ions, which can be used to identify the glycan structure that most appropriately annotates the MS1 ion. Theoretical fragments of the glycan structure that had been used to annotate the MS1 spectrum are used to annotate the corresponding MS2 spectrum.

Figure 7: Fragments of Glycan GOG120 whose peaks were annotated at the MS2 level. See Figure 5 for

annotation scheme.

Figure 8: Fragments of Glycan GOG516 whose peaks were annotated at the MS2 level. See Figure 5 for

annotation scheme.


Table 1: Precision and Recall for PGMHD in the MS annotation experiment

Size of training set Precision Recall

5 0.891 0.621

6 0.870 0.609

7 0.865 0.619

8 0.868 0.632

9 0.867 0.618

Figure 9: Precision and Recall of PGMHD, plotted against the size of the training data (number of experiments).

To represent the data shown in these figures using the proposed model, a top-layer node is assigned to each row in the MS profile table, which corresponds to the MS1 data. Then, for each row in the MS2 tables, a unique node is created and connected with its parent node using a directed edge from the parent node (at the MS profile layer) to the child node (at the MS2 layer). Each top-layer node stores a value representing how frequently that parent has been seen in the training data. However, each child node in the MS2 layer has more than one parent. The edge's weight represents the co-occurrence frequency between a child and a parent. The child node stores the total frequency of observing that child regardless of the identity of its parents. The combined frequency data makes it possible to design a progressive learning algorithm that can extract information from massive data sets. Figure 4 shows the PGMHD for the MS data given in these figures. As shown in the model, two layers are created: one for the MS1 level and a second one for the MS2 level. The nodes at the MS2 level may have many parents as long as they have the same annotation. The frequency values are not shown because of space constraints.

We ran our experiments using MS data which is collected from stem cell samples. The size of this

data set is 1,746,278 peaks distributed over 1713 MS scans from 10 MS experiments. Figure 11 shows the

learning time using the progressive learning technique. In this test we introduced one new experiment

at a time to the model for training, and we recorded the total time required to train the model. These

performance results demonstrate how eﬃciently the progressive learning works with PGMHD.

To test the accuracy of PGMHD, we trained the model by randomly selecting one of 10 available

experiments, while the other 9 experiments were used to test the trained model by annotating the ex-

periments’ peaks using PGMHD. The baseline in our evaluation was the annotations generated by the

commercial tool SimGlycan. The results of the accuracy test are shown in Table 1. Figure 10 shows the

average precision and recall for PGMHD compared to the average precision and recall of GELATO using

the same dataset of 1,746,278 peaks distributed over 10 MS experiments.

5.2 PGMHD for latent semantic discovery over Hadoop

We also implemented a version of PGMHD over Hadoop [26] to be used for latent semantic discovery

between users’ search terms extracted from search logs provided by CareerBuilder.com.


Figure 10: Average precision and recall of PGMHD and GELATO

Figure 11: Progressive Learning Time Over Diﬀerent Experiments


Figure 12: PGMHD Over Hadoop. The first Map/Reduce job reads input rows (UserID, Classification, SearchTerms) and counts the term frequency and the (term, class) frequency; the second job counts the classification frequency. The two resulting Hive tables are joined to build the PGMHD.

Table 2: Input data to PGMHD over Hadoop

UserID | Classification | Search Terms
user1 | Java Developer | Java, Java Developer, C, Software Engineer
user2 | Nurse | RN, Registered Nurse, Health Care
user3 | .NET Developer | C#, ASP, VB, Software Engineer, SE
user4 | Java Developer | Java, JEE, Struts, Software Engineer, SE
user5 | Health Care | Health Care Rep, HealthCare

5.2.1 Problem Description

CareerBuilder operates the largest job board in the U.S. and has an extensive and growing global presence, with millions of job postings, more than 60 million actively-searchable resumes, over one billion searchable documents, and more than a million searches per hour. The search relevancy and recommendations team wants to discover latent semantic relationships among the search terms entered by their users in order to build a semantic search engine that understands a user's query intent and provides more relevant results than a traditional keyword search engine. To tackle this problem, CareerBuilder cannot use a typical synonyms dictionary, since most of the keywords used in the employment search domain represent job titles, skills, and companies that would not be found in a traditional English dictionary. Additionally, CareerBuilder's search engine supports over a dozen languages, so they were in search of a model that is language-independent.

5.2.2 PGMHD over Hadoop

Given the search logs for all the users and the users' classifications as shown in Table 2, PGMHD can represent this kind of data by placing the classes of the users as root nodes and placing the search terms for all the users in the second level as children nodes. Then, an edge is formed linking each search term back to the class of each user who searched for it. The frequency of each search term (how many users searched for it) is stored in the node of that term, while the frequency of a specific search term searched for by users of a specific class (how many users belonging to that class searched for the given term) is stored in the edge between the class and the term. The frequency of a root node is the summation of the frequencies on the edges that connect that root node with its children (Figure 13).
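The two-level structure described above can be sketched in a few lines; the edge counts here are illustrative, not taken from the real data set:

```python
# Sketch of the two-level PGMHD structure: classes as root nodes,
# search terms as child nodes, with counts stored on nodes and edges.
from collections import defaultdict

# (class, term) -> number of users in that class who searched the term
edge_freq = {("Java Developer", "Java"): 2,
             ("Java Developer", "Software Engineer"): 2,
             (".NET Developer", "Software Engineer"): 1}

term_freq = defaultdict(int)  # child-node frequency of each search term
root_freq = defaultdict(int)  # root frequency = sum of its edge counts

for (cls, term), f in edge_freq.items():
    term_freq[term] += f
    root_freq[cls] += f
```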

Figure 12 shows how PGMHD was implemented over Hadoop using Map/Reduce jobs and Hive tables. After we created PGMHD on Hadoop, we calculated the probabilistic semantic similarity score between each pair of terms with shared parents. The data set we analyzed in this experiment contains 1.6 billion search records. To decrease the noise in the data set, we applied a pre-filtering technique, removing any search term used by fewer than 10 distinct users. The final graph representing this data contains 1,931 root nodes, 16,414 child nodes, and 439,435 edges.
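The probabilistic similarity score itself is defined earlier in the paper; as one concrete stand-in, the sketch below scores a pair of shared-parent terms by the cosine of their per-class edge-frequency vectors, applying the same pre-filter that drops terms used by fewer than 10 distinct users. The scoring function is an assumption for illustration, not the paper's formula.

```python
# Illustrative similarity between two terms that share parent classes.
# edge_freq maps (class, term) -> edge frequency; term_freq maps
# term -> number of distinct users who searched it.
import math

def class_vector(term, edge_freq):
    """Map each parent class of `term` to the (class, term) edge count."""
    return {c: f for (c, t), f in edge_freq.items() if t == term}

def similarity(t1, t2, edge_freq, term_freq, min_users=10):
    if term_freq.get(t1, 0) < min_users or term_freq.get(t2, 0) < min_users:
        return 0.0                      # pre-filtering: ignore rare terms
    v1, v2 = class_vector(t1, edge_freq), class_vector(t2, edge_freq)
    shared = set(v1) & set(v2)
    if not shared:
        return 0.0                      # only shared-parent pairs are scored
    dot = sum(v1[c] * v2[c] for c in shared)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2)
```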

5.2.3 Results of latent semantic discovery using PGMHD

The experiment performing latent semantic discovery among search terms using PGMHD was run on a Hadoop cluster with 63 data nodes, each having a 2.6 GHz AMD Opteron processor with 12 to 32 cores and 32 to 128 GB of RAM. Table 3 shows sample results of 10 terms with their top 5 related terms discovered


Figure 13: PGMHD representing the search log data

Table 3: PGMHD results for latent semantic discovery

Term              | Related Terms
hadoop            | big data, hadoop developer, OBIEE, Java, Python
registered nurse  | rn registered nurse, rn, registered nurse manager, nurse, nursing, director of nursing
data mining       | machine learning, data scientist, analytics, business intellegence, statistical analyst
Solr              | lucene, hadoop, java
Software Engineer | software developer, programmer, .net developer, web developer, software
big data          | nosql, data science, machine learning, hadoop, teradata
Realtor           | realtor assistant, real estate, real estate sales, sales, real estate agent
Data Scientist    | machine learning, data analyst, data mining, analytics, big data
Plumbing          | plumber, plumbing apprentice, plumbing maintenance, plumbing sales, maintenance
Agile             | scrum, project manager, agile coach, pmiacp, scrum master

by PGMHD. To evaluate the model's accuracy, we sent the results to data analysts at CareerBuilder, who reviewed 1000 random pairs of discovered related search terms and returned the list with their feedback about whether each pair of discovered related terms was “related” or “unrelated”. We then calculated the accuracy (precision) of the model as the ratio of the number of related results to the total number of results. The results show the accuracy of the discovered semantic relationships among search terms using the PGMHD model to be 0.80.

6 Conclusion

Probabilistic graphical models are very important in many modern applications such as data mining and data analytics. The major issue with existing probabilistic graphical models is their scalability to handle large data sets, making this a very important area for research given the tremendous modern focus on big data due to the number of data points produced by modern computer systems and sensors. PGMHD is a probabilistic graphical model that attempts to solve the scalability problems in existing models in scenarios where massive hierarchical data is present. PGMHD is designed to fit hierarchical data sets of any size, regardless of the domain to which the data belongs. In this paper we present two experiments from different domains: one being the automated tagging of high-throughput mass spectrometry data in bioinformatics, and the other being latent semantic discovery using search logs from the largest job board in the U.S. The two use cases in which we tested PGMHD show that this model is robust and can scale from a few thousand entries to at least billions of entries, and can run on a single computer (for smaller data sets) as well as in a parallelized fashion on a large cluster of servers (63 were used in our experiment).

Acknowledgment

The authors would like to deeply thank David Crandall from Indiana University for providing very helpful comments and suggestions to improve this paper. We would also like to thank Kiyoko Aoki-Kinoshita from Soka University and Khaled Rasheed from the University of Georgia for the valuable discussions and suggestions to improve this model. Deep thanks to Melody Porterfield and Rene Ranzinger from the Complex Carbohydrate Research Center (CCRC) at the University of Georgia for providing the MS data, and for the valuable time and information they shared with us to help us understand the annotation process of the MS data.

References

[1] K. F. Aoki-Kinoshita. An introduction to bioinformatics for glycomics research. PLoS computational

biology, 4(5):e1000075, 2008.

[2] A. Apte and N. S. Meitei. Bioinformatics in glycomics: Glycan characterization with mass spectrometric data using SimGlycan. In Functional Glycomics, pages 269–281. Springer, 2010.

[3] G. Bouma. Normalized (pointwise) mutual information in collocation extraction. In Proceedings of

the Biennial GSCL Conference, pages 31–40, 2009.

[4] P. Broët, S. Richardson, and F. Radvanyi. Bayesian hierarchical model for identifying changes in

gene expression from microarray experiments. Journal of Computational Biology, 9(4):671–683, 2002.

[5] A. Budanitsky and G. Hirst. Semantic distance in wordnet: An experimental, application-oriented

evaluation of ﬁve measures. In Workshop on WordNet and Other Lexical Resources, volume 2, 2001.

[6] A. Ceroni, K. Maass, H. Geyer, R. Geyer, A. Dell, and S. M. Haslam. Glycoworkbench: a tool for the computer-assisted annotation of mass spectra of glycans. Journal of proteome research, 7(4):1650–1659, 2008.

[7] D. M. Chickering, D. Heckerman, and C. Meek. Large-sample learning of bayesian networks is

np-hard. The Journal of Machine Learning Research, 5:1287–1330, 2004.

[8] C. Christodoulopoulos, S. Goldwater, and M. Steedman. A bayesian mixture model for part-of-

speech induction using multiple features. In Proceedings of the conference on empirical methods in

natural language processing, pages 638–647. Association for Computational Linguistics, 2011.

[9] C. A. Cooper, E. Gasteiger, and N. H. Packer. Glycomod–a software tool for determining glycosy-

lation compositions from mass spectrometric data. Proteomics, 1(2):340–349, 2001.

[10] A. Darwiche. Bayesian networks. Communications of the ACM, 53(12):80–90, 2010.

[11] S. T. Dumais. Latent semantic analysis. Annual review of information science and technology,

38(1):188–230, 2004.

[12] S. R. Eddy and R. Durbin. Rna sequence analysis using covariance models. Nucleic acids research,

22(11):2079–2088, 1994.

[13] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In

Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference

on, volume 2, pages 524–531. IEEE, 2005.

[14] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden markov model: Analysis and applications.

Machine learning, 32(1):41–62, 1998.

[15] N. Friedman, I. Nachman, and D. Pe'er. Learning bayesian network structure from massive datasets: the “sparse candidate” algorithm. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 206–215. Morgan Kaufmann Publishers Inc., 1999.


[16] Z. Ghahramani. An introduction to hidden markov models and bayesian networks. International

Journal of Pattern Recognition and Artiﬁcial Intelligence, 15(01):9–42, 2001.

[17] D. Goldberg, M. Bern, B. Li, and C. B. Lebrilla. Automatic determination of o-glycan structure

from fragmentation spectra. Journal of proteome research, 5(6):1429–1434, 2006.

[18] E. Gyftodimos and P. A. Flach. Hierarchical bayesian networks: an approach to classiﬁcation and

learning for structured data. In Methods and Applications of Artiﬁcial Intelligence, pages 291–300.

Springer, 2004.

[19] T. Hamelryck. An overview of bayesian inference and graphical models. In Bayesian Methods in

Structural Bioinformatics, pages 3–48. Springer, 2012.

[20] S. Harispe, S. Ranwez, S. Janaqi, and J. Montmain. Semantic measures for the comparison of

units of language, concepts or entities from text and knowledge base analysis. arXiv preprint

arXiv:1310.1285, 2013.

[21] K. Hashimoto, S. Goto, S. Kawano, K. F. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki,

and M. Kanehisa. Kegg as a glycome informatics resource. Glycobiology, 16(5):63R–70R, 2006.

[22] J. Irungu, E. P. Go, D. S. Dalpathado, and H. Desaire. Simpliﬁcation of mass spectral analysis of

acidic glycopeptides using glycopep id. Analytical chemistry, 79(8):3065–3074, 2007.

[23] M. I. Jordan et al. Graphical models. Statistical Science, 19(1):140–155, 2004.

[24] M. Korayem, A. Badr, and I. Farag. Optimizing hidden markov models using genetic algorithms

and artiﬁcial immune systems. Computing and Information Systems, 11(2):44, 2007.

[25] J. Kupiec. Robust part-of-speech tagging using a hidden markov model. Computer Speech & Lan-

guage, 6(3):225–242, 1992.

[26] C. Lam. Hadoop in action. Manning Publications Co., 2010.

[27] T. Lütteke, A. Bohne-Lang, A. Loss, T. Goetz, M. Frank, and C.-W. von der Lieth. Glycosciences.de: an internet portal to support glycomics and glycobiology research. Glycobiology, 16(5):71R–81R, 2006.

[28] K. Maass, R. Ranzinger, H. Geyer, C.-W. von der Lieth, and R. Geyer. “Glyco-Peakfinder”–de novo composition analysis of glycoconjugates. Proteomics, 7(24):4435–4444, 2007.

[29] R. Mihalcea, C. Corley, and C. Strapparava. Corpus-based and knowledge-based measures of text

semantic similarity. In AAAI, volume 6, pages 775–780, 2006.

[30] S. Park and J. K. Aggarwal. A hierarchical bayesian network for event recognition of human actions

and interactions. Multimedia systems, 10(2):164–179, 2004.

[31] J. Pearl. Markov and Bayes networks: a comparison of two graphical representations of probabilistic

knowledge. Computer Science Department, University of California, 1986.

[32] R. Raman, S. Raguram, G. Venkataraman, J. C. Paulson, and R. Sasisekharan. Glycomics: an inte-

grated systems approach to structure-function relationships of glycans. Nature Methods, 2(11):817–

824, 2005.

[33] R. Raman, M. Venkataraman, S. Ramakrishnan, W. Lang, S. Raguram, and R. Sasisekharan. Ad-

vancing glycomics: implementation strategies at the consortium for functional glycomics. Glycobiol-

ogy, 16(5):82R–90R, 2006.

[34] B. Shan, B. Ma, K. Zhang, and G. Lajoie. Complexities and algorithms for glycan sequencing using

tandem mass spectrometry. Journal of bioinformatics and computational biology, 6(01):77–91, 2008.

[35] P. Smyth. Belief networks, hidden markov models, and markov random ﬁelds: A unifying view.

Pattern recognition letters, 18(11):1261–1268, 1997.

[36] J. Söding. Protein homology detection by hmm–hmm comparison. Bioinformatics, 21(7):951–960,

2005.


[37] H. Tang, Y. Mechref, and M. V. Novotny. Automated interpretation of ms/ms spectra of oligosac-

charides. Bioinformatics, 21(suppl 1):i431–i439, 2005.

[38] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning, EMCL '01, pages 491–502, London, UK, 2001. Springer-Verlag.

[39] H. Xu, C.-L. Wei, F. Lin, and W.-K. Sung. An hmm approach to genome-wide identiﬁcation of

diﬀerential histone modiﬁcation sites from chip-seq data. Bioinformatics, 24(20):2344–2349, 2008.

[40] J. Yamagishi. An introduction to hmm-based speech synthesis. Technical report, Tokyo Institute of Technology, 2006.
