Improving the Quality of Semantic Relationships
Extracted from Massive User Behavioral Data
Khalifeh AlJadda, Mohammed Korayem, and Trey Grainger
CareerBuilder, Norcross, GA, USA
{khalifeh.aljadda, mohammed.korayem, trey.grainger}@careerbuilder.com
Abstract—As the ability to store and process massive amounts
of user behavioral data increases, new approaches continue to
arise for leveraging the wisdom of the crowds to gain insights
that were previously very challenging to discover by text mining
alone. For example, through collaborative filtering, we can learn
previously hidden relationships between items based upon users’
interactions with them, and we can also perform ontology
mining to learn which keywords are semantically-related to other
keywords based upon how they are used together by similar users
as recorded in search engine query logs. The biggest challenge
to this collaborative filtering approach is the variety of noise and
outliers present in the underlying user behavioral data. In this
paper we propose a novel approach to improve the quality of
semantic relationships extracted from user behavioral data. Our
approach utilizes millions of documents indexed into an inverted
index in order to detect and remove noise and outliers.
I. INTRODUCTION
As the ability to store and process massive amounts of
user behavioral data increases, new approaches continue to
arise for leveraging the wisdom of the crowds to gain insights
that were previously very challenging to discover by text
mining alone. For example, through collaborative filtering, we
can learn previously hidden relationships between items based
upon users’ interactions with them, and we can also perform
ontology mining to learn which keywords are semantically-
related to other keywords based upon how they are used
together by similar users as recorded in search engine query
logs. Many organizations are now seeking to capture and retain
as much user log data as possible to extract and take advantage
of these valuable insights. Some of the biggest challenges
to the successful utilization of these data stores, however,
are the quality and generalizability of the data, since data
collected from log files, sensors, cameras, etc. usually contains
significant noise and outliers. For example, when mining user
search logs to discover similar keywords commonly entered by
the same users, the data can be misleading if some of the users
are spamming the search engine with keywords or crawling
the search page in a seemingly random order that results
in search sessions containing many unrelated queries. There
may be thousands of unique variations of anomalous behavior
like this that would have to be independently identified and
handled to truly eliminate the noise from the system, making
it quite challenging to avoid learning false connections between
items when mining user behavior alone. In this paper we
discuss a novel technique that was used to systematically detect
and remove outliers in lists of semantically-related keywords
discovered through query log mining. This technique was
applied at CareerBuilder, the largest job board in the US, and
is integrated as a key part of their semantic search engine
platform [1]–[3]. Our approach utilizes Apache Solr to index
millions of documents into an inverted index that can be used
to find intersections between interesting concepts. These inter-
sections (or lack thereof) are then used to systematically filter
out anomalous relationships derived from users whose search
patterns do not represent a meaningful, overlapping search
intent across their search sessions. This system successfully
cleans up many of the rough edges of collaborative filtering.
II. RELATED WORK
Data cleaning is one of the most important steps needed
when working with large, real-world data sets [4], [5]. Based
on [6], cleaning the data often consumes between 40% and
80% of the time involved in data mining efforts. Data cleaning,
sometimes called data cleansing or scrubbing, is the process
of removing incomplete, noisy and inaccurate data points in
order to improve the quality of the data, and it is usually treated
as a preprocessing step [4], [7]. Data cleaning problems often
appear when data from heterogeneous sources is combined into
composite sources of information [6].
Data cleaning is a process usually consisting of different phases, including determining error types, identifying
instances of data errors, and finally correcting these errors [8].
The data cleaning approaches chosen mainly depend on the
type of data under consideration. For instance, cleaning sensor
data is much different than cleaning textual data due to the
nature of data collection by sensor devices [9], [10]. Detecting
and removing outliers is another form of data cleaning [11],
[12]. There are specialized cleaning tools available to deal
with specific domains (e.g., name and address data) [7], [13],
but these tools do not fit our use case well, as we are dealing
specifically with user behavioral data that is by nature context-
specific. We refer the reader to [7], [14] for more details about
common data cleaning approaches.
AlJadda et al. [1], [2] presented a semantic search
system based on mining user behavior from search logs to
extract related keywords. The extracted semantically-related
keywords are considered related due to an overlapping usage
of keywords across searches conducted by similar users. The
system utilizes a probabilistic graphical model [3] to discover
such relationships, but that model struggles to detect and
remove certain categories of noise and outliers, which tend to
show up due to some common searches like microsoft office or
customer service. To overcome the noise and outliers present
in the semantic search system, we proposed a post-filtering
technique which we discuss in this paper. In our previous work,
we briefly mentioned the content-based filter as a post-cleaning
phase. Here, we focus on the details of that post-filtering
technique used to remove the noise and outliers extracted from
the user behavioral data.
III. METHODS
Our methodology requires a set of documents related to
the domain of the extracted semantic relationships. That set
of documents collectively serves as a litmus test to accept or
reject a relationship extracted from user behavioral data. A
representative set of documents within this collection is critical
since the main objective of that set of documents is to
be rich and diverse enough to include most of the semantic
relationships within the given domain.
Indexing is also very important, as this enables us to
quickly conduct searches over a massive set of documents.
To do so, we utilized Apache Solr [15] to create an inverted
index containing millions of documents relevant to the domain
covered within our mined query logs. Utilizing Apache Solr
brings many benefits:
1) It is free and open source
2) It is both vertically and horizontally scalable
3) It is user friendly, and thus easy to set up and maintain
4) It is incredibly fast (milliseconds) at both indexing
documents and searching keyword phrases
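As a minimal illustration of this indexing step, the sketch below posts a small batch of documents to Solr's JSON update handler. The Solr URL, the collection name (jobs), and the field names (id, content) are assumptions for the example rather than details from the paper; only Solr's standard /update endpoint is relied upon.

    import requests

    SOLR_URL = "http://localhost:8983/solr/jobs"  # assumed host and collection name

    def index_documents(docs):
        # Post a batch of documents to Solr's JSON update handler and commit.
        resp = requests.post(
            SOLR_URL + "/update?commit=true",
            json=docs,
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()

    # Hypothetical job-posting documents; "content" is an assumed searchable text field.
    index_documents([
        {"id": "1", "content": "Senior Java developer with J2EE, JSP, and Hibernate experience"},
        {"id": "2", "content": "Hadoop engineer working with Hive, HBase, Pig, and big data pipelines"},
    ])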
Figure 1 shows the proposed algorithm to score the relevancy
of a given term (T1) and a semantically related term (T2). As
shown in the figure, the input is a pair of terms. The first step
is to conduct a search within the document set for the given
term. The next step is to conduct a search for the semantically-
related term using the same set of documents. The third step is
to conduct a search for the intersection of both terms and count
the number of documents that fall within that intersection.
Once we have these three numbers (the number of documents
that have the given term, the number of documents that have
the related term, and the number of documents that have both
terms), we can then calculate the relevancy score as follows:
rel(T_1, T_2) = \frac{num(T_1, T_2)}{\min(num(T_1), num(T_2))}
In this function, num(T1, T2) is the number of documents in
the corpus containing both T1 and T2, num(T1) is the number
of documents containing T1, and num(T2) is the number of
documents containing T2. We use the minimum of num(T1)
and num(T2) in the denominator so that, when one term is
less popular than the other, the more popular term does not
unfairly depress the score. The calculated relevancy score is
then compared with a threshold: if the score exceeds that
threshold, the semantic relationship between the two terms
is considered valid; otherwise the relationship is considered
invalid and can be discarded as noise.
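To make the scoring procedure concrete, the following sketch issues the three searches against a Solr index and computes the relevancy score defined above. The Solr URL, collection name, and field name are illustrative assumptions; the queries use only Solr's standard /select handler and its numFound response field.

    import requests

    SOLR_URL = "http://localhost:8983/solr/jobs"  # assumed host and collection name
    FIELD = "content"                             # assumed text field holding document content

    def count_docs(query):
        # rows=0: we only need the matching document count (numFound), not the documents.
        resp = requests.get(SOLR_URL + "/select", params={"q": query, "rows": 0, "wt": "json"})
        resp.raise_for_status()
        return resp.json()["response"]["numFound"]

    def relevancy(t1, t2):
        # rel(T1, T2) = num(T1, T2) / min(num(T1), num(T2)); absent terms yield a score of 0.
        n1 = count_docs(f'{FIELD}:"{t1}"')
        n2 = count_docs(f'{FIELD}:"{t2}"')
        if min(n1, n2) == 0:
            return 0.0
        n_both = count_docs(f'{FIELD}:"{t1}" AND {FIELD}:"{t2}"')
        return n_both / min(n1, n2)

    def is_valid_pair(t1, t2, threshold):
        # The relationship is kept only if its relevancy score exceeds the chosen threshold.
        return relevancy(t1, t2) > threshold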
IV. EXPERIMENT AND RESULTS
CareerBuilder.com is the largest job board in the US with
millions of active jobs, over 60 million publicly searchable
resumes, over 1.5 billion actively searchable documents, and
millions of searches an hour. We conducted our experiment us-
ing the semantic relationships extracted using CareerBuilder’s
semantic search platform.

Fig. 1. The cleaning system starts by leveraging Apache Solr to index a set of
documents related to the domain of the extracted set of semantic relationships.
Each pair of keywords previously extracted to be semantically-related based
upon collaborative filtering of user behavioral data is then sent as three search
requests to the Solr index. The number of documents returned by each search
is used to calculate the relevancy score between the two keywords T1 and T2.

CareerBuilder has built a semantic search engine which utilizes user behavioral data collected
through search logs to discover semantic relationships between
search terms. To clean up the noise and outliers present in the
extracted relationships, we indexed 100 million job postings
in a Solr index. We then used that index to calculate the rel-
evancy score for 840,000 semantically-related pairs of search
terms. The threshold was determined experimentally to perform
best when set to 0.2. If a relevancy score falls below 0.2, we
drop that relationship; otherwise we keep it. Table I shows
examples of related terms that were kept after applying the
proposed filtering technique, while Table II shows some of
the noise that was filtered out. A data analyst analyzed 1,000
pairs both before and after we applied this cleanup system.
The results demonstrated that the proposed technique removes
60% of the remaining noise and outliers in the given data set.
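Applied at this scale, the filter amounts to streaming the previously extracted pairs through the relevancy function sketched in Section III and keeping those whose score exceeds 0.2. The sketch below assumes that function is available and uses hypothetical file names for the input and output pair lists.

    # Hypothetical batch filter over previously extracted pairs (tab-separated: term1<TAB>term2),
    # reusing the relevancy() function from the sketch in Section III.
    THRESHOLD = 0.2  # experimentally determined cut-off

    with open("related_pairs.tsv") as src, open("cleaned_pairs.tsv", "w") as dst:
        for line in src:
            t1, t2 = line.rstrip("\n").split("\t")
            if relevancy(t1, t2) > THRESHOLD:
                dst.write(t1 + "\t" + t2 + "\n")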
V. CONCLUSION
With the increasing availability of large datasets of user
behavioral data, algorithms like collaborative filtering make
it possible to derive interesting relationships between items.
While these relationships are of reasonably high quality and
provide tremendous insights that often could not be easily
discovered from textual data alone, these algorithms suffer
from both noise (i.e. the same user demonstrating unrelated
activities) and outliers (i.e. automated processes or users who
have significant deviations from common user behavior). We
have demonstrated a data cleaning approach which leverages
an underlying textual corpus related to the domain to post-
process relationships learned through collaborative filtering of
user search logs. This post-processing stage was demonstrated
to eliminate a further 60% of all remaining noise from an
already mostly clean system. In our specific example, we
cleaned pairs of keyword phrases derived from user search
log mining, and we searched against an inverted index for
each of the keyword phrases (as well as for the overlap) to
determine a content-based filter as a post-processing filtering
step. The proposed technique is being successfully used in
production within CareerBuilder's semantic search engine in
order to enhance the quality of machine-learned semantic
relationship mappings between keyword phrases within the
human capital domain.
TABLE I. (SAMPLE) GOOD TERMS REMAINING AFTER PROPOSED FILTERING TECHNIQUE (> 0.2)
TERM RELATED TERMS
j2ee java, jsp, struts, hibernate, sql, jboss, python, .net, javascript, unix, java ee
lpn lpn nurse, nursing, lvn, licensed practical nurse, practical nurse, vocational nurse, lpn case manager, rn
hadoop hadoop developer, map/reduce, hive, hbase, pig, big data, obiee, sqoop, hdfs, oozie
project manager project management, senior project manager, program manager, project coordinator, business analyst, pmp
dentist general dentist, dental, associate dentist, orthodontist, dmd, oral surgeon, periodont, endodon, doctor of medical dentistry, malpractice
TABLE II. (SAMPLE) BAD TERMS REMOVED BY THE PROPOSED FILTERING TECHNIQUE (< 0.2)
TERM REMOVED NOISY/OUTLIER TERMS
j2ee microsoft, mobile web, online, paypal, quality center, scale, source control, web based
lpn admin, admission, advertising, agriculture, analyst, attorney, autocad, banking, bartender, blood, call center
hadoop .net, html, informatica, machine to machine, microstrategy, multi-thread, network engineer, qa, oracle, rest, semantic
project manager account, admin assitant, adminstration assistant, advertisting sale, application analyst, associate, anthem, at&t, attorney, auto cad
dentist chemist, child care, dental receptionist, nurse, lab technician, part time, front office
ACKNOWLEDGMENT
The authors would like to thank the Search Development group
at CareerBuilder for their assistance setting up and running
the Apache Solr cluster used for this system, as well as David
Lin for his superb data analysis of countless results before and
after this cleaning technique was applied.
REFERENCES
[1] K. AlJadda, M. Korayem, C. Ortiz, T. Grainger, J. Miller, W. S.
York, et al., “Pgmhd: A scalable probabilistic graphical model for
massive hierarchical data problems,” in Big Data (Big Data), 2014 IEEE
International Conference on, pp. 55–60, IEEE, 2014.
[2] K. AlJadda, M. Korayem, T. Grainger, and C. Russell, “Crowdsourced
query augmentation through semantic discovery of domain-specific
jargon,” in Big Data (Big Data), 2014 IEEE International Conference
on, pp. 808–815, IEEE, 2014.
[3] K. AlJadda, M. Korayem, C. Ortiz, C. Russell, D. Bernal, L. Payson,
S. Brown, and T. Grainger, “Augmenting recommendation systems us-
ing a model of semantically-related terms extracted from user behavior,”
arXiv preprint arXiv:1409.2530, 2014.
[4] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks
and Applications, vol. 19, no. 2, pp. 171–209, 2014.
[5] M. A. Hernández and S. J. Stolfo, “Real-world data is dirty: Data
cleansing and the merge/purge problem,” Data Mining and Knowledge
Discovery, vol. 2, no. 1, pp. 9–37, 1998.
[6] D. V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting relationships
for domain-independent data cleaning.,” in SDM, pp. 262–273, SIAM,
2005.
[7] E. Rahm and H. H. Do, “Data cleaning: Problems and current ap-
proaches,” IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3–13, 2000.
[8] J. I. Maletic and A. Marcus, “Data cleansing: Beyond integrity analy-
sis.,” in IQ, pp. 200–209, Citeseer, 2000.
[9] S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom, A
pipelined framework for online cleaning of sensor data streams. IEEE,
2006.
[10] E. Elnahrawy and B. Nath, “Online data cleaning in wireless sensor
networks,” in Proceedings of the 1st International conference on Em-
bedded networked sensor systems, pp. 294–295, ACM, 2003.
[11] A. Loureiro, L. Torgo, and C. Soares, “Outlier detection using clus-
tering methods: a data cleaning application,” in Proceedings of KDNet
Symposium on Knowledge-based systems for the Public Sector, 2004.
[12] H. Liu, S. Shah, and W. Jiang, “On-line outlier detection and data
cleaning,” Computers & chemical engineering, vol. 28, no. 9, pp. 1635–
1647, 2004.
[13] T. Milo and S. Zohar, “Using schema matching to simplify heteroge-
neous data translation,” in VLDB, vol. 98, pp. 24–27, Citeseer, 1998.
[14] T. Dasu and T. Johnson, Exploratory data mining and data cleaning,
vol. 479. John Wiley & Sons, 2003.
[15] T. Grainger and T. Potter, Solr in Action. Manning Publications Co.,
2014.