Improving the Quality of Semantic Relationships Extracted from Massive User Behavioral Data


As the ability to store and process massive amounts of user behavioral data increases, new approaches continue to arise for leveraging the wisdom of the crowds to gain insights that were previously very challenging to discover by text mining alone. For example, through collaborative filtering, we can learn previously hidden relationships between items based upon users' interactions with them, and we can also perform ontology mining to learn which keywords are semantically-related to other keywords based upon how they are used together by similar users as recorded in search engine query logs. The biggest challenge to this collaborative filtering approach is the variety of noise and outliers present in the underlying user behavioral data. In this paper we propose a novel approach to improve the quality of semantic relationships extracted from user behavioral data. Our approach utilizes millions of documents indexed into an inverted index in order to detect and remove noise and outliers.
Khalifeh AlJadda, Mohammed Korayem, and Trey Grainger
CareerBuilder, Norcross, GA, USA
khalifeh.aljadda, mohammed.korayem,
Data cleaning is one of the most important steps needed
when working with large, real-world data sets [4], [5]. According
to [6], cleaning the data often consumes between 40% and
80% of the time involved in data mining efforts. Data cleaning,
sometimes called data cleansing or scrubbing, is the process
of removing incomplete, noisy and inaccurate data points in
order to improve the quality of the data, and it is usually treated
as a preprocessing step [4], [7]. Data cleaning problems often
appear when data from heterogeneous sources is combined into
composite sources of information [6].
Data cleaning is a process usually consisting of different
phases, including determining error types, identifying
instances of data errors, and finally correcting these errors [8].
The data cleaning approaches chosen mainly depend on the
type of data under consideration. For instance, cleaning sensor
data is much different than cleaning textual data due to the
nature of data collection by sensor devices [9], [10]. Detecting
and removing outliers is another form of data cleaning [11],
[12]. There are specialized cleaning tools available to deal
with specific domains (e.g., name and address data) [7], [13],
but these tools do not fit our use case well, as we are dealing
specifically with user behavioral data that is by nature context-
specific. We refer the reader to [7], [14] for more details about
common data cleaning approaches.
Khalifeh et al. in [1], [2] presented a semantic search
system based on mining user behavior from search logs to
extract related keywords. The extracted semantically-related
keywords are considered related due to an overlapping usage
of keywords across searches conducted by similar users. The
system utilizes a probabilistic graphical model [3] to discover
such relationships, but that model struggles to detect and
remove certain categories of noise and outliers, which tend to
show up due to some common searches like microsoft office, or
customer service. To overcome the noise and outliers present
in the semantic search system, we proposed a post-filtering
technique which we discuss in this paper. In our previous work,
we briefly mentioned the content-based filter as a post cleaning
phase. Here, we will focus on the details of that post-filtering
technique used to remove the noise and outliers extracted from
the user behavioral data.
Our methodology requires a set of documents related to
the domain of the extracted semantic relationships. That set
of documents collectively serves as a litmus test to accept or
reject a relationship extracted from user behavioral data. The
collection must be representative: it needs to be rich and
diverse enough to cover most of the semantic relationships
within the given domain.
Indexing is also very important, as this enables us to
quickly conduct searches over a massive set of documents.
To do so, we utilized Apache Solr [15] to create an inverted
index containing millions of documents relevant to the domain
covered within our mined query logs. Utilizing Apache Solr
brings many benefits:
1) It is free and open source
2) It is both vertically and horizontally scalable
3) It is user friendly, and thus easy to set up and maintain
4) It is incredibly fast (milliseconds) at both indexing
documents and searching keyword phrases
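To illustrate what an inverted index provides for this purpose, the following is a minimal in-memory sketch in Python. It is not Solr and not our production setup: the function names `build_inverted_index` and `doc_count` and the three sample documents are hypothetical, and terms are simplified to single whitespace-separated tokens rather than keyword phrases.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def doc_count(index, *terms):
    """Count the documents that contain every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return 0
    return len(set.intersection(*postings))


# Hypothetical toy corpus, for illustration only.
docs = {
    1: "java developer with hadoop and hive experience",
    2: "hadoop administrator hive hbase",
    3: "registered nurse night shift",
}
index = build_inverted_index(docs)
```

With such a structure, counting the documents for one term, or for the intersection of two terms, reduces to set lookups, which is why these counts stay cheap even over a large collection.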
Figure 1 shows the proposed algorithm to score the relevancy
of a given term (T1) and a semantically related term (T2). As
shown in the figure, the input is a pair of terms. The first step
is to conduct a search within the document set for the given
term. The next step is to conduct a search for the semantically-
related term using the same set of documents. The third step is
to conduct a search for the intersection of both terms and count
the number of documents that fall within that intersection.
Once we have these three numbers (the number of documents
that have the given term, the number of documents that have
the related term, and the number of documents that have both
terms), we can then calculate the relevancy score as follows:
$$\mathrm{rel}(T_1, T_2) = \frac{\mathrm{num}(T_1, T_2)}{\min(\mathrm{num}(T_1), \mathrm{num}(T_2))}.$$
In this function, num(T1, T2) is the number of documents in
the corpus that contain both T1 and T2, num(T1) is the number
of documents that contain T1, and num(T2) is the number of
documents that contain T2. We use the minimum of num(T1) and
num(T2) in the denominator so that, when one term is much more
popular than the other, the larger count of the popular term
does not drag the score down. The calculated relevancy score is
then compared with a threshold: if the score meets or exceeds
that threshold, the semantic relationship between the two terms
is considered valid; otherwise the relationship is considered
invalid and can be discarded as noise.
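The three searches and the score can be sketched as follows. This is an illustrative sketch under stated assumptions, not our production code: the field name `text` and all function names are hypothetical, the phrases are assumed to contain no characters that need escaping, and the request itself is left out so that the example runs without a Solr server. The only Solr features relied upon are the `q`, `rows`, and `wt` parameters of the select handler and the `numFound` field of its JSON response.

```python
def count_queries(t1, t2, field="text"):
    """Return the three query strings: T1, T2, and their intersection."""
    q1 = f'{field}:"{t1}"'
    q2 = f'{field}:"{t2}"'
    return q1, q2, f"{q1} AND {q2}"


def select_params(query):
    """Parameters for a select request; rows=0 asks for the count only."""
    return {"q": query, "rows": 0, "wt": "json"}


def num_found(response):
    """Read the number of matching documents from a parsed JSON response."""
    return response["response"]["numFound"]


def relevancy(num_t1, num_t2, num_both):
    """rel(T1, T2) = num(T1, T2) / min(num(T1), num(T2)).

    Returns 0.0 when either term matches no document (an assumption made
    here to avoid dividing by zero).
    """
    smaller = min(num_t1, num_t2)
    return num_both / smaller if smaller > 0 else 0.0
```

As a worked example with made-up counts, if T1 appears in 1000 documents, T2 in 50, and both in 40, then `relevancy(1000, 50, 40)` gives 40 / 50 = 0.8. Dividing by the larger count instead would give 0.04, which would penalize T2 merely for being paired with a popular term.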
IV. EXPERIMENT AND RESULTS
CareerBuilder is the largest job board in the US with
millions of active jobs, over 60 million publicly searchable
resumes, over 1.5 billion actively searchable documents, and
millions of searches an hour. We conducted our experiment using
the semantic relationships extracted using CareerBuilder’s
semantic search platform. CareerBuilder has built a semantic
search engine which utilizes user behavioral data collected
through search logs to discover semantic relationships between
search terms. To clean up the noise and outliers present in the
extracted relationships, we indexed 100 million job postings
in a Solr index. We then used that index to calculate the
relevancy score for 840,000 semantically-related pairs of search
terms. The threshold was determined experimentally to perform
best when set to 0.2. If a relevancy score falls below 0.2, we
drop that relationship; otherwise we keep it. Table I shows
examples of terms that were kept after the proposed technique
was applied, while Table II shows some of the noise that was
filtered out. A data analyst reviewed 1000 pairs both before and
after we applied this cleanup system. The results demonstrated
that the proposed technique removes 60% of the remaining noise
and outliers in the given data set.
Fig. 1. The cleaning system starts by leveraging Apache Solr to index a set of
documents related to the domain of the extracted set of semantic relationships.
Each pair of keywords previously extracted to be semantically-related based
upon collaborative filtering of user behavioral data is then sent as three search
requests to the Solr index. The number of documents returned by each search
is used to calculate the relevancy score between the two keywords T1 and T2.
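The thresholding step can be sketched as below. The function name `filter_relationships` is hypothetical, and the scores in the example are invented for illustration; they are not measurements from our experiment. Following the rule stated above, a pair is dropped only when its score falls below the threshold.

```python
def filter_relationships(scored_pairs, threshold=0.2):
    """Split scored term pairs into kept and dropped relationships.

    scored_pairs maps (term, related_term) to a relevancy score.
    A pair is kept when its score is at least the threshold.
    """
    kept, dropped = {}, {}
    for pair, score in scored_pairs.items():
        target = kept if score >= threshold else dropped
        target[pair] = score
    return kept, dropped


# Invented scores, for illustration only.
scored = {
    ("hadoop", "hive"): 0.62,
    ("hadoop", "network engineer"): 0.03,
    ("lpn", "lvn"): 0.20,
}
kept, dropped = filter_relationships(scored)
```

Returning the dropped pairs alongside the kept ones makes it straightforward to hand both sets to an analyst for review when tuning the threshold.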
With the increasing availability of large datasets of user
behavioral data, algorithms like collaborative filtering make
it possible to derive interesting relationships between items.
While these relationships are of reasonably high quality and
provide tremendous insights that often could not be easily
discovered from textual data alone, these algorithms suffer
from both noise (i.e. the same user demonstrating unrelated
activities) and outliers (i.e. automated processes or users who
have significant deviations from common user behavior). We
have demonstrated a data cleaning approach which leverages
an underlying textual corpus related to the domain to post-
process relationships learned through collaborative filtering of
user search logs. This post-processing stage was demonstrated
to eliminate a further 60% of all remaining noise from an
already mostly clean system. In our specific example, we
cleaned pairs of keyword phrases derived from user search
log mining, and we searched against an inverted index for
each of the keyword phrases (as well as for the overlap) to
determine a content-based filter as a post-processing filtering
step. The proposed technique is being successfully used in production
within CareerBuilder’s semantic search engine in order
to enhance the quality of machine-learned semantic relationship
mappings between keyword phrases within the human capital

Table I. Examples of related terms that were kept after the proposed technique was applied.
Term             Related terms kept
j2ee             java, jsp, struts, hibernate, sql, jboss, python, .net, javascript, unix, java ee
lpn              lpn nurse, nursing, lvn, licensed practical nurse, practical nurse, vocational nurse, lpn case manager, rn
hadoop           hadoop developer, map/reduce, hive, hbase, pig, big data, obiee, sqoop, hdfs, oozie
project manager  project management, senior project manager, program manager, project coordinator, business analyst, pmp
dentist          general dentist, dental, associate dentist, orthodontist, dmd, oral surgeon, periodont, endodon, doctor of medical dentistry, malpractice

Table II. Examples of noise that was filtered out by the proposed technique.
Term             Related terms filtered out
j2ee             microsoft, mobile web, online, paypal, quality center, scale, source control, web based
lpn              admin, admission, advertising, agriculture, analyst, attorney, autocad, banking, bartender, blood, call center
hadoop           .net, html, informatica, machine to machine, microstrategy, multi-thread, network engineer, qa, oracle, rest, semantic
project manager  account, admin assitant, adminstration assistant, advertisting sale, application analyst, associate, anthem, at&t, attorney, auto cad
dentist          chemist, child care, dental receptionist, nurse, lab technician, part time, front office
The authors would like to thank the Search Development group
at CareerBuilder for their assistance setting up and running
the Apache Solr cluster used for this system, as well as David
Lin for his superb data analysis of countless results before and
after this cleaning technique was applied.
