PGMHD: A Scalable Probabilistic Graphical Model for Massive
Hierarchical Data Problems
Khalifeh AlJadda*, Mohammed Korayem†, Camilo Ortiz‡, Trey Grainger§, John A. Miller¶, William S. York‖
Abstract
In the big data era, scalability has become a crucial requirement for any useful computational
model. Probabilistic graphical models are very useful for mining and discovering data insights, but
they are not scalable enough to be suitable for big data problems. Bayesian Networks particularly
demonstrate this limitation when their data is represented using few random variables while each
random variable has a massive set of values. With hierarchical data - data that is arranged in a
treelike structure with several levels - one would expect to see hundreds of thousands or millions of
values distributed over even just a small number of levels. When modeling this kind of hierarchical
data across large data sets, Bayesian networks become infeasible for representing the probability
distributions for the following reasons: i) Each level represents a single random variable with hundreds
of thousands of values, ii) The number of levels is usually small, so there are also few random variables,
and iii) The structure of the network is predefined since the dependency is modeled top-down from
each parent to each of its child nodes, so the network would contain a single linear path for the
random variables from each parent to each child node. In this paper we present a scalable probabilistic
graphical model to overcome these limitations for massive hierarchical data. We believe the proposed
model will lead to an easily-scalable, more readable, and expressive implementation for problems that
require probabilistic-based solutions for massive amounts of hierarchical data. We successfully applied
this model to solve two different challenging probabilistic-based problems on massive hierarchical data
sets for different domains, namely, bioinformatics and latent semantic discovery over search logs.
1 Introduction
Probabilistic graphical models (PGM) refer to a family of techniques that merge concepts from graph
structures and probability models [35]. They represent the conditional dependencies among sets of random
variables [19]. In the age of big data, PGMs can be very useful for mining and extracting insights from
large-scale and noisy data. The major challenges that PGMs face in this emerging field are the scalability
and the restriction that they can only be applied on a propositional domain [18, 7]. Some extensions have
already been proposed to address these challenges, such as hierarchical probabilistic graphical models
(HPGM) which aim to extend the PGM to work with non-propositional domains [18, 14]. The focus
of these models is to make Bayesian networks applicable to non-propositional domains, but they do not
solve the scalability issues that arise when they are applied to massive data sets.
* Department of Computer Science, University of Georgia, Athens, Georgia. Email: aljadda@uga.edu, jam@cs.uga.edu
† School of Informatics and Computing, Indiana University, Bloomington, IN. Email: mkorayem@cs.indiana.edu
‡ CareerBuilder, Norcross, GA. Email: Camilo.Ortiz@careerbuilder.com
§ CareerBuilder, Norcross, GA. Email: trey.grainger@careerbuilder.com
¶ Department of Computer Science, University of Georgia, Athens, Georgia. Email: jam@cs.uga.edu
‖ Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia. Email: will@ccrc.uga.edu
arXiv:1407.5656v1 [cs.AI] 21 Jul 2014
Massive data sets often exhibit hierarchical properties, where data can be divided into several levels
arranged in tree-like structures. Data items in each level depend on, or are influenced by, only the data
items in the immediate upper level. For this kind of data the most appropriate PGM to represent the
probability distribution would be a Bayesian network (BN), since the dependencies in this kind of data
are not bidirectional. A Bayesian network is considered feasible when it provides a concise
representation of a large probability distribution that cannot be handled efficiently using
traditional techniques such as tables and equations [10]. That is not the case with massive
hierarchical data, however, since each level represents only one random variable, while the data items in
that level are outcomes of that random variable. For example, suppose the hierarchical data are
organized as follows: the data items in the top level (root level) represent US cities, while the data items
in the second level represent diseases, where each city is connected with the set of diseases that appear in
that city. Assume we have 19,000 cities and 50,000 diseases. To represent this
data in a BN, we would treat all of the cities in the root level as outcomes of one random variable, City,
and all the data items in the second level as outcomes of another random variable, Disease. Thus, the
BN for this data will be composed of two nodes with a single path City → Disease, while the conditional
probability table (CPT) for Disease will contain 950,000,000 (50,000 × 19,000) entries. For this
kind of data, we propose a simple probabilistic graphical model (PGMHD) that can represent massive
hierarchical data in a more efficient way. We successfully apply the PGMHD in two different domains:
bioinformatics (for multi-class classification) and search log analytics (for latent semantic discovery).
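The arithmetic behind the City/Disease example can be checked with a short sketch. The city and disease counts come from the text; the average of 200 diseases per city is a purely hypothetical figure, used only to illustrate how storing observed (city, disease) pairs could compare with a dense CPT.

```python
# Sizes taken from the City/Disease example in the text.
n_cities = 19_000
n_diseases = 50_000

# A dense CPT for P(Disease | City) needs one entry per (city, disease) pair.
dense_cpt_entries = n_cities * n_diseases

# Hypothetical assumption (not from the paper): each city is linked to
# about 200 diseases, so only those pairs would need to be stored.
avg_diseases_per_city = 200
observed_pair_entries = n_cities * avg_diseases_per_city

print(f"dense CPT entries:      {dense_cpt_entries:,}")
print(f"observed pairs (assumed): {observed_pair_entries:,}")
```

Under that assumed sparsity the dense table would be 250 times larger than the list of observed pairs; the real ratio depends entirely on the data.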
The main contributions of this paper are as follows:
• We propose a simple, efficient and scalable probabilistic-based model for massive hierarchical data.
• We successfully apply this model to the bioinformatics domain, in which we automatically classify
and annotate high-throughput mass spectrometry data.
• We also apply this model to large-scale latent semantic discovery using 1.6 billion search log entries
provided by CareerBuilder.com, using the Hadoop Map/Reduce framework.
2 Background
Graphical models can be classified into two major categories: (1) directed graphical models, which are
often referred to as Bayesian networks, or belief networks, and (2) undirected graphical models which
are often referred to as Markov Random Fields, Markov networks, Boltzmann machines, or log-linear
models [23]. Probabilistic graphical models (PGMs) consist of both graph structure and parameters.
The graph structure represents a set of conditional independence relations for the probability model,
while the parameters consist of the joint probability distributions [35]. Probabilistic graphical models are
often considered to be more convenient than numerical representations for two main reasons [31]:
1. To encode a joint probability distribution $P(x_1, \dots, x_n)$ for $n$ propositional variables with a
numerical representation, we need a table with $2^n$ entries.
2. Inadequacy in addressing the notion of independence: to test independence between $x$ and $y$, one
needs to test whether the joint distribution of $x$ and $y$ is equal to the product of their marginal
probabilities.
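Both points can be illustrated with a minimal sketch. The function names and the two small joint tables are made up for the example; the independence check is the plain numerical test described in item 2, not a method from the cited references.

```python
from itertools import product


def joint_table_size(n: int) -> int:
    """Number of entries in a full joint table over n binary variables."""
    return 2 ** n


def is_independent(joint, tol=1e-9):
    """Check whether P(x, y) == P(x) * P(y) for every pair of outcomes.

    `joint` maps (x, y) tuples to probabilities; missing pairs count as 0.
    """
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals obtained by summing the joint over the other variable.
    px = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
        for x, y in product(xs, ys)
    )


# Hypothetical joint tables: the first factorizes, the second does not.
independent = {(0, 0): 0.12, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.42}
dependent = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

print(joint_table_size(30))
print(is_independent(independent), is_independent(dependent))
```

Even this two-variable check has to visit every cell of the joint table, which is the cost a graph structure avoids by stating independence directly.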
PGMs are used in many domains. For example, Hidden Markov Models (HMM) are considered
a crucial component of most speech recognition systems [24]. In bioinformatics, probabilistic
graphical models are used in RNA sequence analysis [12], protein homology detection and sequence
alignment [36], and for genome-wide identification [39]. In natural language processing (NLP), HMM
and Bayesian models are used for part of speech (POS) tagging [8, 25]. The problem with PGMs in
general, and Bayesian networks in particular, is that they are not suitable for representing massive data
due to the time complexity of learning the structure of the network and the space complexity of storing a
network with thousands of random variables. In general, finding a network that maximizes the Bayesian
and Minimum Description Length (MDL) scores is an NP-hard problem [15].
2.1 Bayesian Networks
A Bayesian network is a concise representation of a probability distribution that is too large to be
handled using traditional techniques such as tables and equations [10]. The graph of a Bayesian network is a directed
acyclic graph (DAG) [19]. A Bayesian network consists of two components: a DAG representing the
structure, and a set of conditional probability tables (CPTs) as shown in Figure 1. Each node in a Bayesian
network must have a CPT which quantifies the relationship between the variable represented by that node
and its parents in the network. Completeness and consistency are guaranteed in a Bayesian network
since there is only one probability distribution that satisfies the Bayesian network constraints [10]. The
constraints that guarantee a unique probability distribution are the numerical constraints represented by
CPT and the independence constraints represented by the structure itself. The independence constraint
is shown in Figure 1. Each variable in the structure is independent of its non-descendants
once its parents are known. For example, once the information about A is known, the probability
of L will not be affected by any new information about F or T, so we call L independent of F and T once
A is known. These independence constraints are known as the Markovian assumptions.
Figure 1: Bayesian Network [10]
Figure 2: Markov Network
Bayesian networks are widely used for modeling causality in a formal way, for decision-making under
uncertainty, and for many other applications [10].
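For reference, the Markovian assumptions lead to the standard chain-rule factorization of a Bayesian network, written here with $\mathrm{pa}(x_i)$ denoting the parents of $x_i$ (notation chosen for this note). The second line restates the A, L, F, T example from the text as an equation; it relies only on the independence stated there, not on the full structure of Figure 1.

```latex
% Standard factorization of a Bayesian network over x_1, ..., x_n
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)

% The example from the text: L is independent of F and T once A is known
P(L \mid A, F, T) = P(L \mid A)
```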
2.2 Markov Random Fields (MRFs)
MRFs, which are known also as Markov networks, are the most well-known graphical models in which
the graph is undirected. In this graphical model, the random variables are represented as vertices while
the edges represent dependency. However, because there is no clear causal influence from one node to
the other (i.e. the link represents a direct dependency between two variables, but neither one of them
is a cause for the other) the edges are undirected. In an undirected graph any two nodes without a
direct link are always conditionally independent variables, whereas any two nodes with a direct link are
always dependent [19, 31]. In MRFs the joint probability distribution is calculated by multiplying a
normalization factor by potential functions, each of which assigns a non-negative value to a clique,
i.e. a fully connected subset of nodes. Potential functions are derived from the notion of conditional
independence, so any potential function must refer only to nodes that are directly connected (i.e. that
form a maximal clique).
According to cliques and potential functions, the joint probability in the undirected graph shown in Figure
2 is calculated using the following equation:
$$p(a, b, c, d) = \frac{1}{Z}\, \phi_{a,c,d}(a, c, d)\, \phi_{a,b}(a, b)$$
where $Z$ is a normalization factor that is calculated by summing or integrating the product of the
potential functions:
$$Z = \sum_{a} \sum_{b} \sum_{c} \sum_{d} \phi_{a,c,d}(a, c, d)\, \phi_{a,b}(a, b)$$
MRFs are common in many fields like spatial statistics, natural language processing, and communication
networks that have little causal structure to guide the construction of a directed graph.
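As a small numerical illustration of the two formulas above, the sketch below evaluates $Z$ and the joint probability for binary variables. The two potential functions are invented for the example (any non-negative functions would do); only the clique structure {a, c, d} and {a, b} is taken from the text.

```python
from itertools import product

STATES = (0, 1)  # assumption: all four variables are binary


def phi_acd(a, c, d):
    """Hypothetical potential on the clique {a, c, d}."""
    return 1.0 + a + c + d


def phi_ab(a, b):
    """Hypothetical potential on the clique {a, b}; favors agreement."""
    return 2.0 if a == b else 1.0


def partition_function():
    """Z: sum of the product of potentials over every joint assignment."""
    return sum(
        phi_acd(a, c, d) * phi_ab(a, b)
        for a, b, c, d in product(STATES, repeat=4)
    )


def joint(a, b, c, d, z):
    """p(a, b, c, d) = (1 / Z) * phi_acd(a, c, d) * phi_ab(a, b)."""
    return phi_acd(a, c, d) * phi_ab(a, b) / z


Z = partition_function()
total = sum(joint(a, b, c, d, Z) for a, b, c, d in product(STATES, repeat=4))
print(Z, total)
```

Dividing by $Z$ is what turns the unnormalized product of potentials into a distribution that sums to one.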
2.3 Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical time series model which is used to model dynamic systems
whose states are not observable, but whose outputs are. HMMs are widely used in speech recognition,
handwriting recognition and text-to-speech synthesis [40]. HMMs rely on three main assumptions. First,
the observation at time $t$ is generated by a process whose state $S_t$ is hidden from observation.
Second, the state of that hidden process satisfies the Markov assumption that once the state of the
system at $t$ is known, its states and outputs at times after $t$ are no longer dependent on states before
$t$. In other words, the state at a specific time contains all needed information about the history of the
process to predict the future of the process. Upon those assumptions, the joint probability distribution
of a sequence of states and observations can be factored as follows [10, 40]:
$$P(S_{1:T}, Y_{1:T}) = P(S_1)\, P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(Y_t \mid S_t)$$
where $S_t$ refers to the hidden state, $Y_t$ refers to the observation at time $t$, and the notation $1:T$ means
$(1, 2, \dots, T)$.
The third assumption is that the hidden state variables are discrete (i.e. $S_t$ can take on $K$ values).
So, to define the probability distribution over observation sequences, we need to specify a probability
distribution over the initial state $P(S_1)$, the $K \times K$ state transition matrix defining $P(S_t \mid S_{t-1})$ and the
output model defining $P(Y_t \mid S_t)$. HMMs are considered a subclass of Bayesian networks known as
dynamic Bayesian networks (DBN), which are Bayesian networks that model systems that evolve over
time [16].
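The factorization above translates directly into code. The following is a minimal sketch for a discrete HMM; the two-state parameters are invented for illustration and states and observations are encoded as integer indices.

```python
def hmm_joint(states, observations, initial, transition, emission):
    """Joint probability P(S_{1:T}, Y_{1:T}) of one state/observation sequence.

    initial[s]        = P(S_1 = s)
    transition[r][s]  = P(S_t = s | S_{t-1} = r)   (K x K matrix)
    emission[s][y]    = P(Y_t = y | S_t = s)
    """
    # First factor: P(S_1) * P(Y_1 | S_1)
    p = initial[states[0]] * emission[states[0]][observations[0]]
    # Remaining factors for t = 2, ..., T
    for t in range(1, len(states)):
        p *= transition[states[t - 1]][states[t]]
        p *= emission[states[t]][observations[t]]
    return p


# Hypothetical parameters with K = 2 hidden states and 2 output symbols.
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]

p = hmm_joint([0, 0, 1], [0, 0, 1], initial, transition, emission)
print(p)
```

For the sequence shown, the product is 0.6 · 0.9 · (0.7 · 0.9) · (0.3 · 0.8), one factor per term of the equation.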
3 Related Work
This section describes the work most closely related to the proposed model from different perspectives. First,
we describe the related hierarchical probabilistic models, then we describe the current techniques used to
automate the annotation of Mass Spectrometry (MS) data for glycomics, which is one of the scenarios
that we use to test the proposed model. We close this section by describing how we applied the proposed
model to discover the latent semantic similarity between keywords extracted from search logs for the
purposes of building a semantic search system.
3.1 Probabilistic Graphical Models for Hierarchical Data
Probabilistic graphical models require propositional domains [18]. To overcome this limitation some
extensions were proposed to extend those models to non-propositional domains. A Bayesian hierarchical
model has been used for natural scene categorization where it performs well on large sets of complex
scenes [13]. This model has also been applied for event recognition of human actions and interactions
[30]. Another application of the hierarchical Bayesian network is for identifying changes in gene expression
from microarray experiments [4].
In [18] the authors introduced a hierarchical Bayesian network which extends the expressiveness of a
regular Bayesian network by allowing a node to represent an aggregation of simpler types which enables
the modeling of complex hierarchical domains. The main idea is to use a small number of hidden variables
as a compressed representation for a set of observed variables with the following restrictions:
1. Any parent of a variable should be in the same or immediate upper layer.
2. At most one parent from the immediate upper layer is allowed for each variable.
So, the idea is mainly to compress the observed data. Although hierarchical Bayesian network models
extended the regular Bayesian network to represent non-propositional domains, they have not been able
to solve the issue of the scalability of Bayesian networks for massive amounts of hierarchical data.
3.2 Automated Annotation of Mass Spectrometry Data for Glycomics
One use case of the proposed model is the automated annotation of Mass Spectrometry (MS) data for
glycomics. Glycans (Figure 3) are the third major class of biological macro-molecules besides nucleic
acids and proteins [1]. Glycomics refers to the scientific attempts to characterize and study glycans, as
defined in [1] or an integrated systems approach to study structure-function relationships of glycans as
defined in [32]. The importance of this emerging field of study is clear from the accumulated evidence for
the roles of glycans in cell growth and metastasis, cell-cell communication, and microbial pathogenesis.
Glycans are more diverse in terms of chemical structure and information density than nucleic acids
and proteins [32]. Glycan identification is much more difficult than protein identification, and it is a
proven NP-hard problem [34] since, unlike protein structures, glycan structures are trees rather than
linear sequences. This leads to a large diversity of glycan structures, which, along with the absence of
a standard representation of glycans, has resulted in many incomplete databases, each of which stores
glycan structures and glycan-related data in a different format. For example, KEGG [21] uses the KCF
format, Glycosciences.de [27] uses the LINUCS format, and CFG [33] uses the IUPAC format.
Figure 3: Glycan structure in CFG format. The circles and squares represent the monosaccharides, which
are the building blocks of a glycan, while the lines are the linkages between them.
Although MS has become the major analytical technique for glycans, no general method has been
developed for the automated identification of glycan structures using MS and tandem MS data. The
relative ease of peptide identification using tandem MS is mainly due to the linear structure of peptides
and the availability of reliable peptide sequence databases. In proteomic analyses, a mostly complete
series of fragment ions with high abundance is often observed. In such tandem mass spectra, the mass of
each amino acid in the sequence corresponds to the mass difference between two high-abundance peaks,
allowing the amino acid sequence to be deduced. In glycomics MS data, ion series are disrupted by
the branched nature of the molecule, significantly complicating the extraction of sequence information.
In addition, groups of isomeric monosaccharides commonly share the same mass, making it impossible
to distinguish them by MS alone. Databases for glycans exist but are limited, minimally curated, and
suffer badly from pollution from glycan structures that are not produced in nature or are irrelevant
to the organism of study. Several algorithms have been developed in attempts to semi-automate the
process of glycan identification by interpreting tandem MS spectra, including CartoonistTwo [17], GLYCH
[37], GlycoPep ID [22], GlycoMod [9], GlycoPeakFinder [28], GlycoWorkbench [6], and SimGlycan [2]
(commercially available from Premier Biosoft). However, each of these programs produces incorrect
results when using polluted databases to annotate large MS$^n$ datasets containing hundreds or thousands
of spectra. Inspection of the current literature indicates that machine learning and data mining techniques
have not been used to resolve this issue, although they have a great potential to be successful in doing
so. PGMHD attempts to employ machine learning techniques (mainly probabilistic-based classification)
to find a solution for the automated identification of glycans using MS data.
3.3 Semantic Similarity
Semantic similarity, which is a metric that is defined over documents or terms in which the distance
between them reflects the likeness of their meaning [20], is well defined in Natural Language Processing
(NLP) and Information Retrieval (IR) [29]. Generally there are two major techniques used to compute
the semantic similarity: one is computed using a semantic network (Knowledge-based approach) [5], and
the other is based on computing the relatedness of terms within a large corpus of text (corpus-based
approach) [29]. The major techniques classified under the corpus-based approach are Pointwise Mutual
Information (PMI) [3] and Latent Semantic Analysis (LSA) [11], though PMI outperforms LSA when mining
the web for synonyms [38]. We applied the proposed PGMHD model to discover related search terms by
measuring probabilistic-based semantic similarity between those search terms.
4 Model Structure
Consider a (leveled) directed graph $G = (V, A)$ where $V$ and $A \subseteq V \times V$ denote the sets of nodes and
arcs, respectively, such that:
1. The nodes $V$ are partitioned into $m$ levels $L_1, \dots, L_m$ and a root node $v_0$ such that
$V = \bigcup_{i=0}^{m} L_i$, $L_i \cap L_j = \emptyset$ for $i \neq j$, and $L_0 = \{v_0\}$.
2. The arcs in $A$ only connect one level to the next, i.e., if $a \in A$ then $a \in L_{i-1} \times L_i$ for some
$i = 1, \dots, m$.
3. An arc $a = (v_{i-1}, v_i) \in L_{i-1} \times L_i$ represents the dependency of $v_i$ on its parent $v_{i-1}$, $i = 1, \dots, m$.
Moreover, let the function $pa : V \to \mathcal{P}(V)$ be defined such that $pa(v)$ is the set of all the parents of
$v$, i.e.,
$$pa(v) = \{w : (w, v) \in A\} \quad \forall v \in V.$$
4. The nodes in each level $L_i$ represent all the possible outcomes of a finite discrete random variable,
namely $V_i$, $i = 1, \dots, m$.
We now make some remarks about the above assumptions. First, the node $v_0$ in the first level $L_0$ can be
seen as the root node and the ones in $L_m$ as leaves. Second, an observation $x$ in our probabilistic model
is an outcome of a random variable, namely $X \in L_0 \times \cdots \times L_m$, defined as
$$X = (V_0 := v_0, V_1, \dots, V_m),$$
which represents a path from $v_0$ to the last level $L_m$ such that $(V_{i-1}, V_i) \in A$ a.s. Hence, $P(X = x) = 0$
and $P(V_{i-1} = v_{i-1}, V_i = v_i) = 0$ whenever $x_{i-1} = v_{i-1}$, $x_i = v_i$ and $(v_{i-1}, v_i) \notin A$.
In addition, we assume that there are $n$ observations of $X$, namely $x^1, \dots, x^n$, and we let $f : V \times V \to \mathbb{N}$
be a frequency function defined as
$$f(a) = \left| \{ x^j : (x^j_{i-1}, x^j_i) = a,\ i = 1, \dots, m,\ j = 1, \dots, n \} \right|.$$
Clearly, $f(a) = 0$ if $a \notin A$. These latter observations are the ones used to train our model.
It should be observed that the proposed model can be seen as a special case of a Bayesian network
by considering a network consisting of a single directed path with mnodes. However, we believe that a
leveled directed graph that explicitly defines one node per outcome of the random variables (as described
above): i) leads to an easily scalable (and distributable) implementation of the problems we consider; ii)
improves the readability and expressiveness of the implemented network; and iii) more easily facilitates
the training of the model.
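The definitions of $f$ and $pa$ can be made concrete with a minimal sketch. It assumes each observation is given as a tuple of node labels forming a root-to-leaf path, that labels are unique across levels, and that the arc set $A$ is simply the set of arcs seen in the observations; the function names and the city/disease paths are illustrative only.

```python
from collections import Counter


def arc_frequencies(observations):
    """Frequency function f: count each arc (x_{i-1}, x_i) over all paths."""
    f = Counter()
    for path in observations:
        # Consecutive pairs of a path are its arcs between adjacent levels.
        for parent, child in zip(path, path[1:]):
            f[(parent, child)] += 1
    return f


def parents(f, v):
    """pa(v): every node w for which the arc (w, v) has been observed."""
    return {w for (w, child), count in f.items() if child == v and count > 0}


# Hypothetical observations: root v0 -> city -> disease.
observations = [
    ("v0", "Atlanta", "flu"),
    ("v0", "Atlanta", "cold"),
    ("v0", "Boston", "flu"),
    ("v0", "Atlanta", "flu"),
]

f = arc_frequencies(observations)
print(f[("Atlanta", "flu")], f[("Boston", "cold")])
print(sorted(parents(f, "flu")))
```

Only arcs that actually occur are stored, and an unseen arc such as ("Boston", "cold") has frequency zero, matching $f(a) = 0$ for $a \notin A$.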
4.1 Probabilistic-based Classification
Let $X \in L_0 \times \cdots \times L_m$ be defined as in Section 4. Our model can predict the outcome at a parent
level $i-1$ given an observation¹ at level $i$ with a classification score. Given an outcome at level $i-1$,
namely $l_{i-1} \in L_{i-1}$, we define the classification score between $l_{i-1}$ and an observation $w \in L_i$ at level $i$
by estimating the conditional probability $Cl_i(l_{i-1} \mid w) := P(X_{i-1} = l_{i-1} \mid X_i = w)$ as
$$Cl_i(l_{i-1} \mid w) \approx \frac{f(l_{i-1}, w)}{T(w)},$$
where
$$T(w) = \sum_{v \in pa(w)} f(v, w).$$
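The classification score is a ratio of arc frequencies, so it can be sketched in a few lines. The sketch assumes the frequency function is held as a mapping from (parent, child) pairs to counts; the function name and the glycan/fragment labels are hypothetical.

```python
from collections import Counter


def classification_scores(f, w):
    """Estimate Cl(v | w) = f(v, w) / T(w) for every parent v of w."""
    # Arcs into w with a positive count define pa(w).
    into_w = {v: c for (v, child), c in f.items() if child == w and c > 0}
    # T(w): total frequency of w summed over all of its parents.
    t_w = sum(into_w.values())
    if t_w == 0:
        return {}  # w was never observed, so no parent can be scored
    return {v: c / t_w for v, c in into_w.items()}


# Hypothetical trained frequencies: glycans G1, G2 and their fragments.
f = Counter({("G1", "frag_a"): 3, ("G2", "frag_a"): 1, ("G2", "frag_b"): 5})

scores = classification_scores(f, "frag_a")
best = max(scores, key=scores.get)
print(scores, best)
```

The scores for a given child sum to one over its parents, so the highest-scoring parent is the predicted class for that observation.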
4.2 Probabilistic-based Semantic Similarity scoring
Fix a level $i \in \{1, \dots, m\}$, and let $X$ and $Y$ be identically distributed random variables such that
$X \in L_0 \times \cdots \times L_m$ is defined as in Section 4. We define the probabilistic-based semantic similarity
score between two outcomes $x_i, y_i \in L_i$ by approximating the conditional joint probability
$CO_i(x_i, y_i) := P(X_i = x_i, Y_i = y_i \mid X_{i-1} \in pa(x_i), Y_{i-1} \in pa(y_i))$ as
$$CO_i(x_i, y_i) \approx \prod_{v \in pa(x_i)} p_i(v, x_i) \cdot \prod_{v \in pa(y_i)} p_i(v, y_i), \tag{1}$$
where $p_i(v, w) = P(X_{i-1} = v, X_i = w)$ for every $(v, w) \in L_{i-1} \times L_i$. We can naturally estimate the
probabilities $p_i(v, w)$ with $\hat{p}(v, w)$ defined as
$$\hat{p}(v, w) := \frac{f(v, w)}{n}.$$
Hence, we can obtain the related outcomes of $x_i \in L_i$ (at level $i$) by finding all the $w \in L_i$ with a large
estimated probabilistic-based semantic similarity score $CO_i(x_i, w)$.
¹ Different from the observations used to train our model.
Figure 4: PGMHD for tandem MS data. The root nodes are the glycans that annotate the peaks at the
MS1 level, while the level 2 nodes are the glycan fragments that annotate the peaks at the MS2 level, and the
edges represent the dependency between the glycans and the fragments they generate.
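Equation (1) can be transcribed literally as a product of estimated joint probabilities. In the sketch below the function names, the search-term labels and the value of $n$ are hypothetical, and returning 0.0 for a term with no observed parents is a choice made for the example rather than something the text specifies.

```python
from collections import Counter
from math import prod


def parent_product(f, n, w):
    """Product over v in pa(w) of the estimate p_hat(v, w) = f(v, w) / n."""
    estimates = [c / n for (v, child), c in f.items() if child == w and c > 0]
    # Assumption for this sketch: a term with no parents scores 0.
    return prod(estimates) if estimates else 0.0


def similarity(f, n, x, y):
    """Estimated CO(x, y) following Equation (1)."""
    return parent_product(f, n, x) * parent_product(f, n, y)


# Hypothetical counts: (classification, search term) pairs from n = 10 observations.
n = 10
f = Counter({
    ("Java Developer", "java"): 4,
    ("Java Developer", "jee"): 2,
    (".NET Developer", "c#"): 3,
})

score = similarity(f, n, "java", "jee")
print(score)
```

Related terms for a given term would then be found by computing this score against every other term at the same level and keeping the largest values.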
4.3 Progressive Learning
PGMHD is designed to allow progressive learning. Progressive learning is a learning technique that
allows a model to learn gradually over time. Training data does not need to be given to the model all
at once; instead, the model can learn from any available data and integrate the new knowledge with the
knowledge it already represents. This learning technique is very attractive in the big data age for the following reasons:
1. Any size of the training data can fit.
2. It can easily learn from new data without the need to re-include the previous training data in the
learning.
3. The training session can be distributed instead of doing it in one long-running session.
4. Recursive learning allows the results of the model to be used as new training data, provided they
are judged to be accurate by the user.
The progressive learning approach for PGMHD is shown in Algorithm 1.
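The key property of progressive learning is that frequency counts can be updated batch by batch. The following is a simplified two-level sketch, not a transcription of Algorithm 1: the class name `ProgressiveCounts`, its `update` method and the glycan/fragment labels are hypothetical, and node labels are assumed to be unique across levels.

```python
from collections import Counter


class ProgressiveCounts:
    """Two-level frequency store that can be trained incrementally."""

    def __init__(self):
        self.node_freq = Counter()  # node -> number of times observed
        self.edge_freq = Counter()  # (parent, child) -> co-occurrence count

    def update(self, batch):
        """Fold a batch of (parent, children) records into the counts."""
        for parent, children in batch:
            self.node_freq[parent] += 1
            for child in children:
                self.node_freq[child] += 1
                self.edge_freq[(parent, child)] += 1


# Hypothetical training data delivered in two separate batches.
batch1 = [("G1", ["f1", "f2"]), ("G2", ["f2"])]
batch2 = [("G1", ["f1"]), ("G3", ["f3"])]

incremental = ProgressiveCounts()
incremental.update(batch1)
incremental.update(batch2)  # earlier data is not revisited

at_once = ProgressiveCounts()
at_once.update(batch1 + batch2)

print(incremental.edge_freq == at_once.edge_freq)
```

Because the model state is only additive counts, training on the batches one after another gives the same result as training on all the data in a single session.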
5 Experimental Results
PGMHD can be used for different purposes once it is built and trained. PGMHD can be used to predict
the class from level lfor the observations of random variables at level l+1. For example, in the annotation
of the MS data, PGMHD is used to predict the best Glycan at level MS1to annotate a spectrum by
evaluating the annotated peaks at level MS2with probability scores that represent how well the selected
glycan correlates to the manually curated annotations that were used to train the model.
5.1 PGMHD to automate the MS annotation
This model is well suited for representing MS data. We recently implemented the Glycan Elucidation and
Annotation Tool (GELATO), which is a semi-automated MS annotation tool for glycomics integrated
within our MS data processing framework called GRITS. Figures 5, 6, 7 and 8 show screenshots from
GELATO for annotated spectra. Figure 5 shows the MS profile level and Figures 6, 7, and 8 show the
annotated MS2 peaks using fragments of the glycans that were chosen as candidate annotations to the
MS profile data (i.e. level 1).
Algorithm 1: Progressive Learning for PGMHD
Data: Hierarchical data
Result: Probabilistic graphical model
currentInputLevel ← 1
currentGraphLayer ← 1
while currentInputLevel < maxInputLevel do
    foreach dataItem in currentInputLevel do
        read dataItem
        if dataItem exists in currentGraphLayer then
            retrieve the node where node.data = dataItem
            parentNode ← node
            node.frequency ← node.frequency + 1
        else
            parentNode ← new node
            parentNode.frequency ← 1
        end
        childrenLevel ← currentInputLevel + 1
        foreach child in dataItem.children do
            if child exists in parentNode.children then
                childNode ← node where node.data = child.data
                edge ← edge(parentNode, childNode)
                edge.frequency ← edge.frequency + 1
            else
                if child exists in childrenLevel then
                    childNode ← node where node.data = child.data
                    edge ← createNewEdge(parentNode, childNode)
                    edge.frequency ← edge.frequency + 1
                else
                    childNode ← new node
                    childNode.data ← child
                    childNode.frequency ← 0
                    edge ← createNewEdge(parentNode, childNode)
                    edge.frequency ← 1
                end
            end
        end
    end
    currentInputLevel ← currentInputLevel + 1
    currentGraphLayer ← currentGraphLayer + 1
end while
Figure 5: MS1 annotation using GELATO. Scan# is the ID number of the scan in the MS file, peak
charge is the charge state of that peak in the MS file, peak intensity represents the abundance of an ion
at that peak, peak m/z is the mass over charge of the given peak, cartoon is the annotation of that peak
(glycan) in CFG format, feature m/z is the mass over charge for the glycan, and glycanID is the ID of
the glycan in the Glycan Ontology (GlycO).
Figure 6: Fragments of Glycan GOG166 at the MS2 level. Each ion observed in MS1 is selected and
fragmented in MS2 to generate smaller ions, which can be used to identify the glycan structure that most
appropriately annotates the MS1 ion. Theoretical fragments of the glycan structure that had been used
to annotate the MS1 spectrum are used to annotate the corresponding MS2 spectrum.
Figure 7: Fragments of Glycan GOG120 whose peaks were annotated at the MS2 level. See Figure 5 for
annotation scheme.
Figure 8: Fragments of Glycan GOG516 whose peaks were annotated at the MS2 level. See Figure 5 for
annotation scheme.
Table 1: Precision and Recall for PGMHD in the MS annotation experiment
Size of training set | Precision | Recall
5                    | 0.891     | 0.621
6                    | 0.870     | 0.609
7                    | 0.865     | 0.619
8                    | 0.868     | 0.632
9                    | 0.867     | 0.618
Figure 9: Precision and Recall of PGMHD
To represent the data shown in these figures using the proposed model, a top-layer node is assigned
to each row in the MS profile table, which corresponds to the MS1 data. Then, for each row in the MS2
tables, a unique node is created and connected with its parent node using a directed edge from the parent
node (at the MS profile layer) to the child node (at the MS2 layer). Each top-layer node stores a value
representing how frequently that parent has been seen in the training data. A child node
in the MS2 layer, however, may have more than one parent. The edge's weight represents the co-occurrence frequency
between a child and a parent. The child node stores the total frequency of observing that child regardless
of the identity of its parents. The combined frequency data makes it possible to design a progressive
learning algorithm that can extract information from massive data sets. Figure 4 shows the PGMHD for
the MS data given in these figures. As shown in the model, two layers are created: one for the MS1 level
and a second one for the MS2 level. The nodes at the MS2 level may have many parents as long as they
have the same annotation. The frequency values are not shown because of space constraints.
We ran our experiments using MS data which is collected from stem cell samples. The size of this
data set is 1,746,278 peaks distributed over 1713 MS scans from 10 MS experiments. Figure 11 shows the
learning time using the progressive learning technique. In this test we introduced one new experiment
at a time to the model for training, and we recorded the total time required to train the model. These
performance results demonstrate how efficiently the progressive learning works with PGMHD.
To test the accuracy of PGMHD, we trained the model by randomly selecting one of 10 available
experiments, while the other 9 experiments were used to test the trained model by annotating the
experiments' peaks using PGMHD. The baseline in our evaluation was the annotations generated by the
commercial tool SimGlycan. The results of the accuracy test are shown in Table 1. Figure 10 shows the
average precision and recall for PGMHD compared to the average precision and recall of GELATO using
the same dataset of 1,746,278 peaks distributed over 10 MS experiments.
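As a quick check on the numbers, the unweighted means of the Table 1 rows can be computed as below. Whether Figure 10 averages the results in exactly this way is an assumption; the sketch only summarizes the values listed in Table 1.

```python
from statistics import mean

# (size of training set, precision, recall) rows copied from Table 1.
table1 = [
    (5, 0.891, 0.621),
    (6, 0.870, 0.609),
    (7, 0.865, 0.619),
    (8, 0.868, 0.632),
    (9, 0.867, 0.618),
]

# Unweighted mean over the five rows (assumed averaging scheme).
mean_precision = mean(p for _, p, _ in table1)
mean_recall = mean(r for _, _, r in table1)

print(f"mean precision: {mean_precision:.4f}")
print(f"mean recall:    {mean_recall:.4f}")
```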
5.2 PGMHD for latent semantic discovery over Hadoop
We also implemented a version of PGMHD over Hadoop [26] to be used for latent semantic discovery
between users’ search terms extracted from search logs provided by CareerBuilder.com.
[Bar chart: precision and recall on a y-axis from 0 to 1, with one bar each for PGMHD and GELATO]
Figure 10: Average precision and recall of PGMHD and GELATO
Figure 11: Progressive Learning Time Over Different Experiments
Figure 12: PGMHD Over Hadoop (flow diagram of two Map/Reduce jobs. The first reads input rows of UserID, Classification, and Search Terms, emits Key: Term and Value: Classification, and counts the term frequency and the (Term, Class) frequency. The second reads input rows of Term, Term Freq, Class, and (Term, Class) Freq, emits Key: Classification and Value: UserID, and counts the classification frequency. Hive Table 1, with columns Term, Term Freq, Class, and (Term, Class) Freq, and Hive Table 2, with columns Class and Class Freq, are joined to form the PGMHD.)
Table 2: Input data to PGMHD over Hadoop
UserID | Classification | Search Terms
user1  | Java Developer | Java, Java Developer, C, Software Engineer
user2  | Nurse          | RN, Registered Nurse, Health Care
user3  | .NET Developer | C#, ASP, VB, Software Engineer, SE
user4  | Java Developer | Java, JEE, Struts, Software Engineer, SE
user5  | Health Care    | Health Care Rep, HealthCare
5.2.1 Problem Description
CareerBuilder operates the largest job board in the U.S. and has an extensive and growing global presence,
with millions of job postings, more than 60 million actively-searchable resumes, over one billion searchable
documents, and more than a million searches per hour. The search relevancy and recommendations team
wants to discover latent semantic relationships among the search terms entered by its users. The goal is to
build a semantic search engine that understands a user’s query intent and therefore provides more relevant
results than a traditional keyword search engine. To tackle this problem, CareerBuilder cannot use a
typical synonyms dictionary, since most of the keywords used in the employment search domain represent
job titles, skills, and companies that would not be found in a traditional English dictionary. Additionally,
CareerBuilder’s search engine supports over a dozen languages, so the team needed a model that is
language-independent.
5.2.2 PGMHD over Hadoop
Given the search logs for all the users and the users’ classifications as shown in Table 2, PGMHD can
represent this kind of data by placing the classes of the users as root nodes and placing the search terms
of all the users in the second level as child nodes. An edge is then formed linking each search
term back to the class of the user who searched for it. The frequency of each search term (how many
users searched for it) is stored in the node of that term, while the frequency of a specific search term
searched for by users of a specific class (how many users belonging to that class searched for the given
term) is stored in the edge between the class and the term. The frequency of the root node is the
summation of the frequencies on the edges that connect that root node with its children (Figure 13).
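The three kinds of frequencies described above can be worked through on the rows of Table 2. The sketch below is illustrative only: the function name `build_counts` is hypothetical, and it assumes each user ID appears in exactly one row.

```python
from collections import Counter

# Rows from Table 2: (user ID, classification, search terms).
ROWS = [
    ("user1", "Java Developer", ["Java", "Java Developer", "C", "Software Engineer"]),
    ("user2", "Nurse", ["RN", "Registered Nurse", "Health Care"]),
    ("user3", ".NET Developer", ["C#", "ASP", "VB", "Software Engineer", "SE"]),
    ("user4", "Java Developer", ["Java", "JEE", "Struts", "Software Engineer", "SE"]),
    ("user5", "Health Care", ["Health Care Rep", "HealthCare"]),
]


def build_counts(rows):
    """Return (term_freq, edge_freq, root_freq) for a list of search log rows."""
    term_freq = Counter()  # node: number of users who searched for the term
    edge_freq = Counter()  # edge: users of a class who searched for the term
    for _user, cls, terms in rows:
        for term in set(terms):  # count each user at most once per term
            term_freq[term] += 1
            edge_freq[(cls, term)] += 1
    root_freq = Counter()  # root: sum of the frequencies on its outgoing edges
    for (cls, _term), freq in edge_freq.items():
        root_freq[cls] += freq
    return term_freq, edge_freq, root_freq


term_freq, edge_freq, root_freq = build_counts(ROWS)
print(term_freq["Software Engineer"])                      # 3
print(edge_freq[("Java Developer", "Software Engineer")])  # 2
print(root_freq["Java Developer"])                         # 9
```

Here "Software Engineer" was searched by three users, two of whom are classified as Java Developer, and the Java Developer root accumulates the nine edge counts contributed by user1 and user4.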
Figure 12 shows how PGMHD was implemented over Hadoop using Map/Reduce jobs and Hive
tables. After we created PGMHD on Hadoop, we calculated the probabilistic-based semantic similarity
score between each pair of terms with shared parents. The data set we analyzed in
this experiment contains 1.6 billion search records. To decrease the noise in the given data set, we applied a
pre-filtering technique that removes any search term used by fewer than 10 distinct users. The final graph
representing this data contains 1,931 root nodes, 16,414 child nodes, and 439,435 edges.
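The map, group, and reduce stages of the term-counting job, together with the distinct-user pre-filter, can be imitated in plain Python. This is a single-process sketch and does not use any Hadoop or Hive API. All function names are hypothetical, and carrying the user ID in the mapped value (so that distinct users can be counted) is an assumption made for this example.

```python
from collections import defaultdict


def map_row(user_id, classification, terms):
    """Map step: emit one (term, (classification, user_id)) pair per search term."""
    for term in terms:
        yield term, (classification, user_id)


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_term(term, values, min_users):
    """Reduce step: count distinct users per term and per (term, class).

    Returns no rows for terms searched by fewer than `min_users` distinct users.
    """
    users = {user for _cls, user in values}
    if len(users) < min_users:
        return []
    users_by_class = defaultdict(set)
    for cls, user in values:
        users_by_class[cls].add(user)
    return [
        (term, len(users), cls, len(class_users))
        for cls, class_users in sorted(users_by_class.items())
    ]


def run(rows, min_users=10):
    """Produce rows shaped like (Term, Term Freq, Class, (Term, Class) Freq)."""
    emitted = [pair for row in rows for pair in map_row(*row)]
    table = []
    for term, values in sorted(shuffle(emitted).items()):
        table.extend(reduce_term(term, values, min_users))
    return table


# Hypothetical log rows; a threshold of 2 keeps the example small.
rows = [
    ("u1", "Java Developer", ["java", "se"]),
    ("u2", "Java Developer", ["java"]),
    ("u3", ".NET Developer", ["se", "c#"]),
]
for record in run(rows, min_users=2):
    print(record)
```

With the threshold set to 2, the term "c#" is dropped because only one user searched for it, while "java" and "se" survive with their per-class counts.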
5.2.3 Results of latent semantic discovery using PGMHD
The experiment performing latent semantic discovery among search terms using PGMHD was run on a
Hadoop cluster with 63 data nodes, each having a 2.6 GHz AMD Opteron processor with 12 to 32 cores
and 32 to 128 GB RAM. Table 3 shows sample results of 10 terms with their top 5 related terms discovered
by PGMHD. To evaluate the model’s accuracy, we sent the results to data analysts at CareerBuilder, who
reviewed 1000 random pairs of discovered related search terms and returned the list with their feedback
on whether each pair of discovered related terms was “related” or “unrelated”. We then calculated
the accuracy (precision) of the model as the ratio of the number of related results to the total number
of results. The accuracy of the semantic relationships discovered among search terms
using the PGMHD model was 0.80.
Figure 13: PGMHD representing the search log data (graph with the user classes Java Developer, .NET Developer, Nurse, and Health Care as root nodes and search terms such as Java, C, Software Engineer, ASP, VB, SE, JEE, Struts, RN, Registered Nurse, Health Care, and Health Care Rep as child nodes)
Table 3: PGMHD results for latent semantic discovery
Term | Related Terms
hadoop | big data, hadoop developer, OBIEE, Java, Python
registered nurse | rn registered nurse, rn, registered nurse manager, nurse, nursing, director of nursing
data mining | machine learning, data scientist, analytics, business intellegence, statistical analyst
Solr | lucene, hadoop, java
Software Engineer | software developer, programmer, .net developer, web developer, software
big data | nosql, data science, machine learning, hadoop, teradata
Realtor | realtor assistant, real estate, real estate sales, sales, real estate agent
Data Scientist | machine learning, data analyst, data mining, analytics, big data
Plumbing | plumber, plumbing apprentice, plumbing maintenance, plumbing sales, maintenance
Agile | scrum, project manager, agile coach, pmiacp, scrum master
6 Conclusion
Probabilistic graphical models are important in many modern applications such as data mining and
data analytics. The major issue with existing probabilistic graphical models is their limited scalability to
large data sets, which makes this an important area for research given the current focus on big
data and the volume of data points produced by modern computer systems and sensors. PGMHD
is a probabilistic graphical model that attempts to solve the scalability problems of existing models in
scenarios where massive hierarchical data is present. PGMHD is designed to fit hierarchical data sets of
any size, regardless of the domain to which the data belongs. In this paper we presented two experiments
from different domains: the automated tagging of high-throughput mass spectrometry data
in bioinformatics, and latent semantic discovery using search logs from the largest job
board in the U.S. The two use cases in which we tested PGMHD show that this model is robust and can
scale from a few thousand entries to at least billions of entries, and that it can run on a single computer
(for smaller data sets) as well as in a parallelized fashion on a large cluster of servers (63 were used in
our experiment).
Acknowledgment
The authors would like to deeply thank David Crandall from Indiana University for providing very helpful
comments and suggestions to improve this paper. We would also like to thank Kiyoko Aoki-Kinoshita
from Soka University and Khaled Rasheed from the University of Georgia for the valuable discussions and
suggestions to improve this model. Deep thanks to Melody Porterfield and Rene Ranzinger from the Complex
Carbohydrate Research Center (CCRC) at the University of Georgia for providing the MS data and for the
valuable time and information they shared with us to understand the annotation process of the MS data.
References
[1] K. F. Aoki-Kinoshita. An introduction to bioinformatics for glycomics research. PLoS computational
biology, 4(5):e1000075, 2008.
[2] A. Apte and N. S. Meitei. Bioinformatics in glycomics: Glycan characterization with mass
spectrometric data using SimGlycan™. In Functional Glycomics, pages 269–281. Springer, 2010.
[3] G. Bouma. Normalized (pointwise) mutual information in collocation extraction. In Proceedings of
the Biennial GSCL Conference, pages 31–40, 2009.
[4] P. Broët, S. Richardson, and F. Radvanyi. Bayesian hierarchical model for identifying changes in
gene expression from microarray experiments. Journal of Computational Biology, 9(4):671–683, 2002.
[5] A. Budanitsky and G. Hirst. Semantic distance in wordnet: An experimental, application-oriented
evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, volume 2, 2001.
[6] A. Ceroni, K. Maass, H. Geyer, R. Geyer, A. Dell, and S. M. Haslam. Glycoworkbench: a tool
for the computer-assisted annotation of mass spectra of glycans. Journal of proteome research,
7(4):1650–1659, 2008.
[7] D. M. Chickering, D. Heckerman, and C. Meek. Large-sample learning of bayesian networks is
np-hard. The Journal of Machine Learning Research, 5:1287–1330, 2004.
[8] C. Christodoulopoulos, S. Goldwater, and M. Steedman. A bayesian mixture model for
part-of-speech induction using multiple features. In Proceedings of the conference on empirical methods in
natural language processing, pages 638–647. Association for Computational Linguistics, 2011.
[9] C. A. Cooper, E. Gasteiger, and N. H. Packer. Glycomod–a software tool for determining
glycosylation compositions from mass spectrometric data. Proteomics, 1(2):340–349, 2001.
[10] A. Darwiche. Bayesian networks. Communications of the ACM, 53(12):80–90, 2010.
[11] S. T. Dumais. Latent semantic analysis. Annual review of information science and technology,
38(1):188–230, 2004.
[12] S. R. Eddy and R. Durbin. Rna sequence analysis using covariance models. Nucleic acids research,
22(11):2079–2088, 1994.
[13] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference
on, volume 2, pages 524–531. IEEE, 2005.
[14] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden markov model: Analysis and applications.
Machine learning, 32(1):41–62, 1998.
[15] N. Friedman, I. Nachman, and D. Pe’er. Learning bayesian network structure from massive datasets:
the “sparse candidate” algorithm. In Proceedings of the Fifteenth conference on Uncertainty in
artificial intelligence, pages 206–215. Morgan Kaufmann Publishers Inc., 1999.
[16] Z. Ghahramani. An introduction to hidden markov models and bayesian networks. International
Journal of Pattern Recognition and Artificial Intelligence, 15(01):9–42, 2001.
[17] D. Goldberg, M. Bern, B. Li, and C. B. Lebrilla. Automatic determination of o-glycan structure
from fragmentation spectra. Journal of proteome research, 5(6):1429–1434, 2006.
[18] E. Gyftodimos and P. A. Flach. Hierarchical bayesian networks: an approach to classification and
learning for structured data. In Methods and Applications of Artificial Intelligence, pages 291–300.
Springer, 2004.
[19] T. Hamelryck. An overview of bayesian inference and graphical models. In Bayesian Methods in
Structural Bioinformatics, pages 3–48. Springer, 2012.
[20] S. Harispe, S. Ranwez, S. Janaqi, and J. Montmain. Semantic measures for the comparison of
units of language, concepts or entities from text and knowledge base analysis. arXiv preprint
arXiv:1310.1285, 2013.
[21] K. Hashimoto, S. Goto, S. Kawano, K. F. Aoki-Kinoshita, N. Ueda, M. Hamajima, T. Kawasaki,
and M. Kanehisa. Kegg as a glycome informatics resource. Glycobiology, 16(5):63R–70R, 2006.
[22] J. Irungu, E. P. Go, D. S. Dalpathado, and H. Desaire. Simplification of mass spectral analysis of
acidic glycopeptides using glycopep id. Analytical chemistry, 79(8):3065–3074, 2007.
[23] M. I. Jordan et al. Graphical models. Statistical Science, 19(1):140–155, 2004.
[24] M. Korayem, A. Badr, and I. Farag. Optimizing hidden markov models using genetic algorithms
and artificial immune systems. Computing and Information Systems, 11(2):44, 2007.
[25] J. Kupiec. Robust part-of-speech tagging using a hidden markov model. Computer Speech &
Language, 6(3):225–242, 1992.
[26] C. Lam. Hadoop in action. Manning Publications Co., 2010.
[27] T. Lütteke, A. Bohne-Lang, A. Loss, T. Goetz, M. Frank, and C.-W. von der Lieth. Glycosciences.de:
an internet portal to support glycomics and glycobiology research. Glycobiology, 16(5):71R–81R,
2006.
[28] K. Maass, R. Ranzinger, H. Geyer, C.-W. von der Lieth, and R. Geyer. “Glyco-peakfinder”–de
novo composition analysis of glycoconjugates. Proteomics, 7(24):4435–4444, 2007.
[29] R. Mihalcea, C. Corley, and C. Strapparava. Corpus-based and knowledge-based measures of text
semantic similarity. In AAAI, volume 6, pages 775–780, 2006.
[30] S. Park and J. K. Aggarwal. A hierarchical bayesian network for event recognition of human actions
and interactions. Multimedia systems, 10(2):164–179, 2004.
[31] J. Pearl. Markov and Bayes networks: a comparison of two graphical representations of probabilistic
knowledge. Computer Science Department, University of California, 1986.
[32] R. Raman, S. Raguram, G. Venkataraman, J. C. Paulson, and R. Sasisekharan. Glycomics: an
integrated systems approach to structure-function relationships of glycans. Nature Methods,
2(11):817–824, 2005.
[33] R. Raman, M. Venkataraman, S. Ramakrishnan, W. Lang, S. Raguram, and R. Sasisekharan.
Advancing glycomics: implementation strategies at the consortium for functional glycomics.
Glycobiology, 16(5):82R–90R, 2006.
[34] B. Shan, B. Ma, K. Zhang, and G. Lajoie. Complexities and algorithms for glycan sequencing using
tandem mass spectrometry. Journal of bioinformatics and computational biology, 6(01):77–91, 2008.
[35] P. Smyth. Belief networks, hidden markov models, and markov random fields: A unifying view.
Pattern recognition letters, 18(11):1261–1268, 1997.
[36] J. Söding. Protein homology detection by hmm–hmm comparison. Bioinformatics, 21(7):951–960,
2005.
[37] H. Tang, Y. Mechref, and M. V. Novotny. Automated interpretation of ms/ms spectra of
oligosaccharides. Bioinformatics, 21(suppl 1):i431–i439, 2005.
[38] P. D. Turney. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th
European Conference on Machine Learning, ECML ’01, pages 491–502, London, UK, 2001.
Springer-Verlag.
[39] H. Xu, C.-L. Wei, F. Lin, and W.-K. Sung. An hmm approach to genome-wide identification of
differential histone modification sites from chip-seq data. Bioinformatics, 24(20):2344–2349, 2008.
[40] J. Yamagishi. An introduction to hmm-based speech synthesis. Technical report,
Tokyo Institute of Technology, 2006.