Improving the Quality of Semantic Relationships
Extracted from Massive User Behavioral Data
Khalifeh AlJadda∗, Mohammed Korayem∗, and Trey Grainger∗
∗CareerBuilder, Norcross, GA, USA
khalifeh.aljadda, mohammed.korayem, trey.grainger@careerbuilder.com
Abstract—As the ability to store and process massive amounts
of user behavioral data increases, new approaches continue to
arise for leveraging the wisdom of the crowds to gain insights
that were previously very challenging to discover by text mining
alone. For example, through collaborative filtering, we can learn
previously hidden relationships between items based upon users’
interactions with them, and we can also perform ontology
mining to learn which keywords are semantically-related to other
keywords based upon how they are used together by similar users
as recorded in search engine query logs. The biggest challenge
to this collaborative filtering approach is the variety of noise and
outliers present in the underlying user behavioral data. In this
paper we propose a novel approach to improve the quality of
semantic relationships extracted from user behavioral data. Our
approach utilizes millions of documents indexed into an inverted
index in order to detect and remove noise and outliers.
I. INTRODUCTION
As the ability to store and process massive amounts of
user behavioral data increases, new approaches continue to
arise for leveraging the wisdom of the crowds to gain insights
that were previously very challenging to discover by text
mining alone. For example, through collaborative filtering, we
can learn previously hidden relationships between items based
upon users’ interactions with them, and we can also perform
ontology mining to learn which keywords are semantically-
related to other keywords based upon how they are used
together by similar users as recorded in search engine query
logs. Many organizations are now seeking to capture and retain
as much user log data as possible to extract and take advantage
of these valuable insights. Some of the biggest challenges
to the successful utilization of these data stores, however,
are the quality and generalizability of the data, since data
collected from log files, sensors, cameras, etc. usually contains
significant noise and outliers. For example, when mining user
search logs to discover similar keywords commonly entered by
the same users, the data can be misleading if some of the users
are spamming the search engine with keywords or crawling
the search page in a seemingly random order that results
in search sessions containing many unrelated queries. There
may be thousands of unique variations of anomalous behavior
like this that would have to be independently identified and
handled to truly eliminate the noise from the system, making
it quite challenging to avoid learning false connections between
items when mining user-behavior alone. In this paper we
discuss a novel technique that was used to systematically detect
and remove outliers in lists of semantically-related keywords
discovered through query log mining. This technique was
applied at CareerBuilder, the largest job board in the US, and
is integrated as a key part of their semantic search engine
platform [1]–[3]. Our approach utilizes Apache Solr to index
millions of documents into an inverted index that can be used
to find intersections between interesting concepts. These inter-
sections (or lack thereof) are then used to systematically filter
out anomalous relationships derived from users whose search
patterns do not represent a meaningful, overlapping search
intent across their search sessions. This system successfully
cleans up many of the rough edges of collaborative filtering.
II. RELATED WORK
Data cleaning is one of the most important steps needed
when working with large, real-world data sets [4], [5]. According to [6], cleaning the data often consumes between 40% and
80% of the time involved in data mining efforts. Data cleaning,
sometimes called data cleansing or scrubbing, is the process
of removing incomplete, noisy and inaccurate data points in
order to improve the quality of the data, and it is usually treated
as a preprocessing step [4], [7]. Data cleaning problems often
appear when data from heterogeneous sources is combined into
composite sources of information [6].
Data cleaning is a process usually consisting of different phases, including determining error types, identifying instances of data errors, and finally correcting these errors [8].
The data cleaning approaches chosen mainly depend on the
type of data under consideration. For instance, cleaning sensor
data is much different than cleaning textual data due to the
nature of data collection by sensor devices [9], [10]. Detecting
and removing outliers is another form of data cleaning [11],
[12]. There are specialized cleaning tools available to deal
with specific domains (e.g., name and address data) [7], [13], but these tools do not fit our use case well, as we are dealing
specifically with user behavioral data that is by nature context-
specific. We refer the reader to [7], [14] for more details about
common data cleaning approaches.
AlJadda et al. [1], [2] presented a semantic search
system based on mining user behavior from search logs to
extract related keywords. The extracted keywords are considered semantically related due to an overlapping usage of keywords across searches conducted by similar users. The
system utilizes a probabilistic graphical model [3] to discover
such relationships, but that model struggles to detect and
remove certain categories of noise and outliers, which tend to
show up due to common searches like "microsoft office" or "customer service". To overcome the noise and outliers present
in the semantic search system, we proposed a post-filtering
technique which we discuss in this paper. In our previous work,
we briefly mentioned the content-based filter as a post cleaning
phase. Here, we will focus on the details of that post-filtering
technique used to remove the noise and outliers extracted from
the user behavioral data.
III. METHODS
Our methodology requires a set of documents related to
the domain of the extracted semantic relationships. That set
of documents collectively serves as a litmus test to accept or
reject a relationship extracted from user behavioral data. A
representative set of documents is critical, since the main objective of that collection is to be rich and diverse enough to include most of the semantic
relationships within the given domain.
Indexing is also very important, as this enables us to
quickly conduct searches over a massive set of documents.
To do so, we utilized Apache Solr [15] to create an inverted
index containing millions of documents relevant to the domain
covered within our mined query logs. Utilizing Apache Solr
brings many benefits:
1) It is free and open source
2) It is both vertically and horizontally scalable
3) It is user friendly, and thus easy to set up and maintain
4) It is incredibly fast (milliseconds) at both indexing
documents and searching keyword phrases
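To make the indexing step concrete, the following is a minimal sketch (in Python) of how domain documents might be pushed into Solr through its standard JSON update API. The Solr URL, the collection name (job_postings), and the field names are illustrative assumptions for this example rather than the configuration used in production.

import requests

# Hypothetical Solr collection holding the domain documents.
SOLR_UPDATE_URL = "http://localhost:8983/solr/job_postings/update"

# A few illustrative documents; in practice millions of domain documents are indexed.
docs = [
    {"id": "1", "title": "Senior Java Developer", "description": "j2ee, java, spring, hibernate, sql"},
    {"id": "2", "title": "Hadoop Engineer", "description": "hadoop, hive, pig, hbase, big data"},
]

# POST the documents as JSON and commit so they become immediately searchable.
response = requests.post(SOLR_UPDATE_URL, json=docs, params={"commit": "true"})
response.raise_for_status()

Once the inverted index is built, every keyword-phrase lookup required by the cleaning algorithm reduces to a fast count query against this collection.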
Figure 1 shows the proposed algorithm to score the relevancy
of a given term (T1) and a semantically related term (T2). As
shown in the figure, the input is a pair of terms. The first step
is to conduct a search within the document set for the given
term. The next step is to conduct a search for the semantically-
related term using the same set of documents. The third step is
to conduct a search for the intersection of both terms and count
the number of documents that fall within that intersection.
Once we have these three numbers (the number of documents
that have the given term, the number of documents that have
the related term, and the number of documents that have both
terms), we can then calculate the relevancy score as follows:
rel(T1, T2) = num(T1, T2) / min(num(T1), num(T2))
In this function, num(T1, T2) is the number of co-occurrences of term T1 with term T2 in the corpus, num(T1) is the number of occurrences of T1 in the corpus, and num(T2) is the number of occurrences of T2 in the corpus. We use the minimum of num(T1) and num(T2) in the denominator so that when one term is much less popular than the other, the more popular term does not unfairly drag the score down. The calculated relevancy score is
then compared with a threshold: if the score exceeds the threshold, the semantic relationship between the two terms is considered valid; otherwise the relationship is considered invalid and can be discarded as noise.
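As a rough illustration of the algorithm in Figure 1, the sketch below issues the three count queries against the Solr index and applies the relevancy formula and threshold check. The select endpoint, the field name (description), and the default threshold value are assumptions made for this example; the production system may differ in these details.

import requests

# Hypothetical Solr collection and text field used for the count queries.
SOLR_SELECT_URL = "http://localhost:8983/solr/job_postings/select"
FIELD = "description"

def doc_count(query):
    # rows=0 returns only the total hit count (numFound), which is all we need here.
    resp = requests.get(SOLR_SELECT_URL, params={"q": query, "rows": 0, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

def relevancy(t1, t2):
    # rel(T1, T2) = num(T1, T2) / min(num(T1), num(T2))
    n1 = doc_count('%s:"%s"' % (FIELD, t1))
    n2 = doc_count('%s:"%s"' % (FIELD, t2))
    both = doc_count('%s:"%s" AND %s:"%s"' % (FIELD, t1, FIELD, t2))
    denominator = min(n1, n2)
    return both / float(denominator) if denominator else 0.0

def is_valid_pair(t1, t2, threshold=0.2):
    # Keep the semantic relationship only if its relevancy score exceeds the threshold.
    return relevancy(t1, t2) > threshold

For instance, a mined pair such as ("hadoop", "hive") would be kept or discarded simply by evaluating is_valid_pair("hadoop", "hive") against the indexed corpus.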
IV. EXPERIMENT AND RESULTS
CareerBuilder.com is the largest job board in the US with
millions of active jobs, over 60 million publicly searchable
resumes, over 1.5 billion actively searchable documents, and
millions of searches an hour. We conducted our experiment using the semantic relationships extracted using CareerBuilder's semantic search platform.

Fig. 1. The cleaning system starts by leveraging Apache Solr to index a set of documents related to the domain of the extracted set of semantic relationships. Each pair of keywords previously extracted to be semantically-related based upon collaborative filtering of user behavioral data is then sent as three search requests to the Solr index. The number of documents returned by each search is used to calculate the relevancy score between the two keywords T1 and T2.

CareerBuilder has built a semantic search engine which utilizes user behavioral data collected through search logs to discover semantic relationships between search terms. To clean up the noise and outliers present in the
extracted relationships, we indexed 100 million job postings into a Solr index. We then used that index to calculate the relevancy score for 840,000 semantically-related pairs of search terms. The optimal threshold was determined experimentally to be 0.2: if a relevancy score falls below 0.2, we drop that relationship; otherwise we keep it. Table I shows examples of terms that were kept after applying the proposed technique, while Table II shows some of
the noise that was filtered out. A data analyst analyzed 1000
pairs both before we applied this cleanup system and after.
The results demonstrated that the proposed technique removes
60% of the remaining noise and outliers in the given data set.
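As a purely illustrative example of how the threshold separates such pairs (the counts below are hypothetical, not measurements from the production index): if "hadoop" matched 40,000 postings, "hive" matched 15,000, and the two terms co-occurred in 9,000 postings, the score would be 9,000 / min(40,000, 15,000) = 0.6, well above 0.2, so the relationship is kept. If "hadoop" (40,000 postings) and ".net" (60,000 postings) co-occurred in only 800 postings, the score would be 800 / 40,000 = 0.02, and the relationship would be discarded as noise, consistent with the examples in Tables I and II.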
V. CONCLUSION
With the increasing availability of large datasets of user
behavioral data, algorithms like collaborative filtering make
it possible to derive interesting relationships between items.
While these relationships are of reasonably high quality and
provide tremendous insights that often could not be easily
discovered from textual data alone, these algorithms suffer
from both noise (i.e. the same user demonstrating unrelated
activities) and outliers (i.e. automated processes or users who
have significant deviations from common user behavior). We
have demonstrated a data cleaning approach which leverages
an underlying textual corpus related to the domain to post-
process relationships learned through collaborative filtering of
user search logs. This post processing stage was demonstrated
to eliminate a further 60% of all remaining noise from an
already mostly clean system. In our specific example, we
cleaned pairs of keyword phrases derived from user search
log mining, and we searched against an inverted index for
each of the keyword phrases (as well as for the overlap) to
determine a content-based filter as a post-processing filtering
step. The proposed technique is being successfully used in production within CareerBuilder's semantic search engine in order to enhance the quality of machine-learned semantic relationship mappings between keyword phrases within the human capital domain.
TABLE I. (SAMPLE) GOOD TERMS REMAINING AFTER PROPOSED FILTERING TECHNIQUE (> 0.2)
TERM | RELATED TERMS
j2ee | java, jsp, struts, hibernate, sql, jboss, python, .net, javascript, unix, java ee
lpn | lpn nurse, nursing, lvn, licensed practical nurse, practical nurse, vocational nurse, lpn case manager, rn
hadoop | hadoop developer, map/reduce, hive, hbase, pig, big data, obiee, sqoop, hdfs, oozie
project manager | project management, senior project manager, program manager, project coordinator, business analyst, pmp
dentist | general dentist, dental, associate dentist, orthodontist, dmd, oral surgeon, periodont, endodon, doctor of medical dentistry, malpractice
TABLE II. (SAMPLE) BAD TERMS REMOVED BY THE PROPOSED FILTERING TECHNIQUE (< 0.2)
TERM | REMOVED NOISY/OUTLIER TERMS
j2ee | microsoft, mobile web, online, paypal, quality center, scale, source control, web based
lpn | admin, admission, advertising, agriculture, analyst, attorney, autocad, banking, bartender, blood, call center
hadoop | .net, html, informatica, machine to machine, microstrategy, multi-thread, network engineer, qa, oracle, rest, semantic
project manager | account, admin assitant, adminstration assistant, advertisting sale, application analyst, associate, anthem, at&t, attorney, auto cad
dentist | chemist, child care, dental receptionist, nurse, lab technician, part time, front office
ACKNOWLEDGMENT
The authors would like to thank the Search Development group
at CareerBuilder for their assistance setting up and running
the Apache Solr cluster used for this system, as well as David
Lin for his superb data analysis of countless results before and
after this cleaning technique was applied.
REFERENCES
[1] K. AlJadda, M. Korayem, C. Ortiz, T. Grainger, J. Miller, W. S.
York, et al., “Pgmhd: A scalable probabilistic graphical model for
massive hierarchical data problems,” in Big Data (Big Data), 2014 IEEE
International Conference on, pp. 55–60, IEEE, 2014.
[2] K. AlJadda, M. Korayem, T. Grainger, and C. Russell, “Crowdsourced
query augmentation through semantic discovery of domain-specific
jargon,” in Big Data (Big Data), 2014 IEEE International Conference
on, pp. 808–815, IEEE, 2014.
[3] K. AlJadda, M. Korayem, C. Ortiz, C. Russell, D. Bernal, L. Payson,
S. Brown, and T. Grainger, “Augmenting recommendation systems us-
ing a model of semantically-related terms extracted from user behavior,”
arXiv preprint arXiv:1409.2530, 2014.
[4] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks
and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[5] M. A. Hernández and S. J. Stolfo, “Real-world data is dirty: Data
cleansing and the merge/purge problem,” Data mining and knowledge
discovery, vol. 2, no. 1, pp. 9–37, 1998.
[6] D. V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting relationships
for domain-independent data cleaning.,” in SDM, pp. 262–273, SIAM,
2005.
[7] E. Rahm and H. H. Do, “Data cleaning: Problems and current ap-
proaches,” IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3–13, 2000.
[8] J. I. Maletic and A. Marcus, “Data cleansing: Beyond integrity analy-
sis.,” in IQ, pp. 200–209, Citeseer, 2000.
[9] S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom, A
pipelined framework for online cleaning of sensor data streams. IEEE,
2006.
[10] E. Elnahrawy and B. Nath, “Online data cleaning in wireless sensor
networks,” in Proceedings of the 1st International conference on Em-
bedded networked sensor systems, pp. 294–295, ACM, 2003.
[11] A. Loureiro, L. Torgo, and C. Soares, “Outlier detection using clus-
tering methods: a data cleaning application,” in Proceedings of KDNet
Symposium on Knowledge-based systems for the Public Sector, 2004.
[12] H. Liu, S. Shah, and W. Jiang, “On-line outlier detection and data
cleaning,” Computers & chemical engineering, vol. 28, no. 9, pp. 1635–
1647, 2004.
[13] T. Milo and S. Zohar, “Using schema matching to simplify heteroge-
neous data translation,” in VLDB, vol. 98, pp. 24–27, Citeseer, 1998.
[14] T. Dasu and T. Johnson, Exploratory data mining and data cleaning,
vol. 479. John Wiley & Sons, 2003.
[15] T. Grainger and T. Potter, Solr in Action. Manning Publications Co.,
2014.