Augmenting Recommendation Systems Using a Model of
Semantically-related Terms Extracted from User Behavior
Khalifeh AlJadda*, Mohammed Korayem**, Camilo Ortiz,
Chris Russell, David Bernal, Lamar Payson,
Scott Brown, and Trey Grainger
Search Relevancy and Recommendations, CareerBuilder
*Department of Computer Science, University of Georgia, GA
**School of Informatics & Computing, Indiana University, IN
khalifeh.aljadda, mohammed.korayem, camilo.ortiz@careerbuilder.com
chris.russell, david.bernal, lamar.payson@careerbuilder.com
scott.brown, trey.grainger@careerbuilder.com
Abstract
Common difficulties like the cold-start problem and a lack of sufficient information about users due to
their limited interactions have been major challenges for most recommender systems (RS). To overcome
these challenges and many similar ones that result in low accuracy (precision and recall) recommen-
dations, we propose a novel system that extracts semantically-related search keywords based on the
aggregate behavioral data of many users. These semantically-related search keywords can be used to
substantially increase the amount of knowledge about a specific user’s interests based upon even a few
searches and thus improve the accuracy of the RS. The proposed system is capable of mining aggregate
user search logs to discover semantic relationships between key phrases in a manner that is language
agnostic, human understandable, and virtually noise-free. These semantically related keywords are ob-
tained by looking at the links between queries of similar users which, we believe, represent a largely
untapped source for discovering latent semantic relationships between search terms.
1 Introduction
Recommender systems are now widely used in many industries, such as e-commerce, video streaming, and job portals. They automate the process of discovering the interests of a user and, subsequently, what is relevant to his or her needs [5, 6, 7].
Many companies like Netflix (http://www.netflix.com), Amazon (http://www.amazon.com), and CareerBuilder (http://www.careerbuilder.com) depend on recommender systems (RS) to help drive their revenue. For example, Netflix, a movie rental and video streaming website, offered a prize (known as the Netflix Prize) of 1 million dollars in 2006 for any recommendation algorithm that could beat their RS, named Cinematch [2]. Netflix, like many other websites, depends heavily on recommendations in
order to keep its customers interested in its service. One of the major challenges for any RS is the cold-start problem [6], which occurs when there is a lack of information linking new users or items, such that the RS is unable to determine how those users or items are related and is therefore unable to provide useful recommendations. One possible way to solve this is to perform classification or matching based upon the description (i.e., keywords in the text) of the items that could be recommended, but this approach tends not to perform as well as a collaborative filtering approach, which links related items together based upon the collective intelligence of many users.
[Figure 1: System Architecture — components: Job Seeker Search Terms Extractor (combines a user's search terms), Crowdsourced Latent Semantic Discovery Engine, Content-Based Filtering, Search Log Analyzer, Recommendation System, and Augmentation Engine.]
Instead of directly linking users and items, we propose a system that utilizes the wisdom of the crowd. This
system incorporates a language-agnostic technique for discovering the intent of user searches by revealing the
latent semantic relationships between the terms and phrases within keyword queries. Our hope is to express these relationships in common human language so that we can automatically augment recommendations when the only data available about a user is a few search keywords extracted from the searches that he or she conducted. Mining the search history of millions of users allows us to discover the relationships between search terms and the most common meaning of each term. Once we know the semantic relationships between terms according to these users, we can then use those relationships to enhance the recommendation features in order to more accurately express the interests of each user.
Our use case for the proposed technique is the recommendation system at CareerBuilder, which operates
the largest job board in the U.S. and has an extensive and growing global presence, with millions of job
postings, more than 60 million actively-searchable resumes, over one billion searchable documents, and more
than a million searches per hour. Our solution to this challenge makes use of the wisdom of the crowd to
discover domain-specific relationships. Using the query logs of more than a billion user searches, we can
discover the family of related keyword phrases for any particular term or phrase for which a search has been
conducted. We are not attempting to fit these terms into an artificial taxonomy; instead, we are discovering the existing relationships between terms according to our users.
2 Search Log Analyzer
Our proposed system is applicable to websites and services that receive searches from massive numbers of users (i.e., millions), as is the case with CareerBuilder.com. We propose a search log analyzer (SLA) that aims to discover latent semantic relationships among the users' search terms in order to build a semantic dictionary that can more expressively interpret a user's query intent, which in turn provides more relevant results in both the search engine and the RS. One important feature of the proposed SLA is that it is language agnostic, as there is no dependency on any language-specific natural language processing (NLP) techniques. This makes the system applicable to any website or system that receives a large number of searches, regardless of the languages the system supports.
Table 1: Input data to PGMHD

UserID | Classification | Search Terms
user1 | Java Developer | Java, Java Developer, C, Software Engineer
user2 | Nurse | RN, Registered Nurse, Health Care
user3 | .NET Developer | C#, ASP, VB, Software Engineer, SE
user4 | Java Developer | Java, JEE, Struts, Software Engineer, SE
user5 | Health Care | Health Care Rep, HealthCare
[Figure 2: PGMHD representing the search log data — root nodes are the user classifications (Java Developer, .NET Developer, Nurse, Health Care); child nodes are the search keywords (Java, Java Developer, C, Software Engineer, ASP, VB, SE, JEE, Struts, RN, Registered Nurse, Health Care, Health Care Rep).]
[Figure 3: Augmentation Engine Architecture — flowchart: given the search keyword(s) from the search engine, the Augmentation Engine (AE) sends a request to the Search Log Analyzer (SLA) to (1) grab the semantically related keywords (SRKs), (2) classify this set of keywords using PGMHD, (3) return the class C with the highest probability score p, and (4) return, if they exist, the nearest neighbor users (NNUs) within a threshold distance based on the vector of SRKs. If p > threshold, users with classification C (or the NNUs, when non-empty) are selected, their search vectors are found, the SRKs are enriched with the keywords in those vectors, and this set of users is sent to the recommender system to run collaborative filtering; otherwise the SRKs are sent to the search engine, a search is run, and the result is returned to the user.]
2.1 Crowdsourced Latent Semantic Discovery Engine
The most significant component in the proposed SLA is the crowdsourced latent semantic discovery engine.
This engine is designed to utilize the wisdom of the crowd. The model we used to
represent the data is a probabilistic graphical model for massive hierarchical data (PGMHD) [1], which we
designed and implemented to obtain a scalable variant of a Bayesian Network that is suitable for this kind
of massive data. In order to represent search log data in this model, several pre-processing phases are required, as shown in Figure 1 (a minimal sketch of these phases in code follows the list):
1. classify the users.
2. grab the query strings of their searches.
3. extract the search terms of those query strings.
4. aggregate the search terms of each user.
5. combine each user's classification with his or her search terms in one table, to be used as the input table for building the PGMHD.
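As a concrete illustration of these phases, the following Python sketch builds the Table 1-style input from classified users and raw query logs. It is a minimal in-memory stand-in for the Hadoop-based pipeline described in Section 4; the comma-based term extraction and the function name are illustrative assumptions, not the production implementation.

```python
from collections import defaultdict

def build_pgmhd_input(classifications, search_logs):
    """Build the Table 1-style input: one row per user combining the user's
    classification with his or her aggregated search terms.

    classifications: dict of user_id -> class label (phase 1 output)
    search_logs: iterable of (user_id, query_string) pairs (phase 2 output)
    """
    terms_by_user = defaultdict(set)
    for user_id, query in search_logs:
        # Phases 3-4: extract terms from the query string and aggregate them
        # per user. Comma-splitting is an illustrative stand-in; the paper
        # does not specify the exact term extraction.
        for term in query.split(","):
            term = term.strip()
            if term:
                terms_by_user[user_id].add(term)
    # Phase 5: join each user's classification with the aggregated terms.
    return [
        (uid, classifications[uid], sorted(terms))
        for uid, terms in terms_by_user.items()
        if uid in classifications  # only classified users enter the model
    ]

# Example mirroring user1 in Table 1:
print(build_pgmhd_input(
    {"user1": "Java Developer"},
    [("user1", "Java, Java Developer"), ("user1", "C, Software Engineer")],
))
```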
Table 1 shows the processed data to be represented using PGMHD, and Figure 2 shows an example of a PGMHD representing search log data. The root nodes in this model represent the classifications of the users who conducted searches, while the second-level nodes represent the keywords used in the searches conducted by those users. An edge between a root node (a user classification) and a child node (a search keyword) represents the usage of that search keyword by users of that classification.
To obtain the estimates in our probabilistic model, we store the number of searches $f_{ck}$ for a keyword $k$ by users of class $c$ at every edge that connects $c$ and $k$. This way, we can naturally estimate the joint probabilities

$$P(k, c) = \frac{f_{ck}}{\sum_{ij} f_{ij}},$$

and similarly, the conditional probabilities

$$P(k \mid c) = \frac{f_{ck}}{\sum_{j} f_{cj}}, \qquad P(c \mid k) = \frac{f_{ck}}{\sum_{i} f_{ik}},$$
required by the PGMHD. The SLA is implemented using the following technologies: HDFS [8], Map/Reduce jobs [3], Hive [9], and Solr (http://lucene.apache.org/solr).
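To make these estimates concrete, the sketch below builds the edge counts $f_{ck}$ from a few Table 1-style rows and evaluates the three probabilities above. It is a toy in-memory illustration of the formulas, assuming only the count structure described in this section, not the actual Hive/Map-Reduce implementation.

```python
from collections import defaultdict

# Toy rows shaped like Table 1: (user, classification, aggregated search terms).
rows = [
    ("user1", "Java Developer", ["Java", "Java Developer", "C", "Software Engineer"]),
    ("user2", "Nurse", ["RN", "Registered Nurse", "Health Care"]),
]

# f[c][k] = number of searches for keyword k by users of class c,
# i.e. the count stored on the edge connecting c and k.
f = defaultdict(lambda: defaultdict(int))
for _, c, terms in rows:
    for k in terms:
        f[c][k] += 1

total = sum(sum(kw.values()) for kw in f.values())  # sum_{ij} f_ij over all edges

def p_joint(k, c):
    """P(k, c) = f_ck / sum_{ij} f_ij"""
    return f[c].get(k, 0) / total

def p_k_given_c(k, c):
    """P(k | c) = f_ck / sum_j f_cj"""
    return f[c].get(k, 0) / sum(f[c].values())

def p_c_given_k(k, c):
    """P(c | k) = f_ck / sum_i f_ik"""
    column = sum(f[ci].get(k, 0) for ci in f)
    return f[c].get(k, 0) / column if column else 0.0

print(p_joint("Java", "Java Developer"))      # 1/7: one edge out of seven searches
print(p_k_given_c("Java", "Java Developer"))  # 1/4: one of four Java Developer searches
print(p_c_given_k("Java", "Java Developer"))  # 1.0: only Java Developers searched "Java"
```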
Table 2: Sample Results using Solr for content-based filtering (CBF)

Term | Before Solr | Removed Terms (Outlier) | Final List
data scientist | machine learning, data analyst, data mining, analytics, big data, statistics, ... | data analyst, data mining, analytics, statistics | machine learning, big data
cashier | retail, retail cashier, customer service, cashiers, receptionist, cashier jobs, teller, ... | receptionist, teller | retail, retail cashier, customer service, cashiers, cashier jobs
collections Rep | collector, collections specialist, call center rep, credit and collections, ... | call center rep | collector, collections specialist, credit and collections
repair tech | repair technician, maintenance technician, call center | maintenance technician, call center | repair technician
front end development | web developer, productivity, Monster | productivity, monster | web developer
2.2 Content-based Filtering
In our content-based post-filtering phase, we apply the search engine [4] itself as part of the SLA. Since our ultimate goal is to find the most related search terms that have actual semantic similarity, we examine all of the relationships for each term. For each relationship, we run one query using the original term and another query using the related term. We then compare the two resulting document sets and measure their intersection; if the intersection is too small, we consider the relationship invalid and remove it. Table 2 shows how the content-based filtering improves the accuracy of the discovered semantic relationships between search terms.
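A minimal sketch of this post-filter follows, assuming a hypothetical search(term) helper that returns the set of document IDs matched by a query for that term; the overlap measure and threshold are illustrative choices, since the paper does not state the exact cutoff used in production.

```python
def filter_related_terms(term, candidates, search, min_overlap=0.1):
    """Post-filter candidate related terms by document-set intersection.

    search: a callable mapping a query string to the set of document IDs it
            matches (e.g. a thin wrapper around a Solr request) -- an assumed
            helper, not an actual Solr API.
    min_overlap: illustrative threshold on the fraction of the smaller
                 result set that must also appear in the other result set.
    """
    base_docs = search(term)
    kept, removed = [], []
    for candidate in candidates:
        cand_docs = search(candidate)
        smaller = min(len(base_docs), len(cand_docs)) or 1  # avoid div by zero
        overlap = len(base_docs & cand_docs) / smaller
        (kept if overlap >= min_overlap else removed).append(candidate)
    return kept, removed

# With a stub index, a candidate like "receptionist" shares few postings with
# "cashier" results, falls below the threshold, and is removed, as in Table 2.
```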
3 Augmentation Engine
There are several approaches to follow in order to improve the accuracy of the RS using the proposed SLA.
In our design, the augmentation engine (AE) decides which combination of approaches to follow in order
to improve the recommendations generated by the RS. Our implementation of the AE focuses on deciding
which combination of the following three augmentations (to the RS) should be used:
1. a search query augmentation,
2. a user classification augmentation, or
3. a nearest neighbor augmentation.
These three augmentation approaches tackle the special “cold-start” cases in which the only information available about a user is the set of search queries that he or she has performed through the search engine. The next three subsections describe each of these augmentation approaches in detail.
3.1 Search Query Augmentation
In the case when the RS is not able to recommend items to a user (possibly due to the “cold-start” problem),
it is possible to use the search engine to assist the RS based on the user’s search keywords. However, it
might be the case that the search keywords from the user are not enough for the search engine to retrieve a sufficient number of items to feed to the RS. In such a case, the semantically-related keywords represent a
crucial source of information to augment the search query. The search query augmentation that we propose
uses the semantically-related keywords obtained from the SLA to enhance the search queries that will be
used on the search engine that feeds the RS.
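As an illustration, the sketch below expands a user's query with the SRKs as lower-weighted optional clauses. The Lucene-style OR syntax and boost value are assumptions; the paper does not specify how the augmented query is constructed.

```python
def augment_query(user_terms, related_terms, boost=0.5):
    """Build a Lucene/Solr-style OR query: the user's original terms at full
    weight, and the semantically-related keywords (SRKs) as lower-boosted
    optional clauses. The boost scheme is an illustrative assumption."""
    parts = [f'"{t}"' for t in user_terms]
    parts += [f'"{t}"^{boost}' for t in related_terms]
    return " OR ".join(parts)

print(augment_query(["hadoop"], ["big data", "hadoop developer"]))
# '"hadoop" OR "big data"^0.5 OR "hadoop developer"^0.5'
```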
3.2 User Classification Augmentation
In the previous subsection, we presented a simple augmentation approach that only enriches the search
query for the search engine with the semantically related search keywords, ignoring any information that
can be obtained about the users who share similar searching behavior. To go beyond that basic search query
augmentation, we propose to utilize the search keywords to classify the users with the PGMHD model [1]
used within the SLA. Given a set of search keywords Kand a set Cof pre-defined classifications of the
users, PGMHD calculates the probability scores pcfor every class of user cC. The user classification
augmentation that we propose can be combined with the search query augmentation described in the last
subsection to refine the decision space in order to return more precise results which fall within the same
classification as the user.
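One plausible way to realize this classification step is sketched below: score each class by combining the per-keyword conditionals $P(c \mid k)$ from Section 2.1. This averaging rule is an illustrative stand-in for PGMHD's actual inference, which is described in [1].

```python
def classify_user(keywords, classes, p_c_given_k):
    """Score every class c in C for the user's keyword set K.

    p_c_given_k: callable (k, c) -> P(c | k), e.g. the estimate from
                 Section 2.1. Averaging the per-keyword conditionals is an
                 illustrative scoring rule, not PGMHD's exact inference [1].
    """
    return {
        c: sum(p_c_given_k(k, c) for k in keywords) / len(keywords)
        for c in classes
    }

# scores = classify_user({"Java", "JEE"}, {"Java Developer", "Nurse"}, p_c_given_k)
# c_max = max(scores, key=scores.get)   # the highest-scoring classification
```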
3.3 Nearest Neighbor Augmentation
We can further improve the precision of the recommendations obtained from the user classification augmen-
tation presented in the previous subsection by considering a specific subset of the users from the class $c_{\max}$, based on a k-nearest neighbor approach. More specifically, by representing each user as a vector of his or her search keywords, we can define a distance (e.g., Hamming, Euclidean, etc.) between a given user and other users, so as to discover the most similar/nearest ones. The nearest neighbor augmentation that we propose enables the RS to use a collaborative filtering (CF) approach in a more precise way by using the most similar users from the class $c_{\max}$.
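The following sketch uses Jaccard distance over keyword sets as the distance function; the paper leaves the metric open (Hamming, Euclidean, etc.), so this particular choice, along with the threshold value, is an assumption.

```python
def jaccard_distance(a, b):
    """Distance between two users, each represented as a set of keywords."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def nearest_neighbor_users(target_keywords, class_users, threshold=0.8):
    """Return the users of class c_max whose search-keyword vectors lie
    within the threshold distance of the target user's vector.

    class_users: dict of user_id -> set of search keywords, pre-filtered
                 to the class c_max; the threshold value is illustrative.
    """
    target = set(target_keywords)
    return [
        uid for uid, kws in class_users.items()
        if jaccard_distance(target, kws) <= threshold
    ]
```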
3.4 Augmentation Engine Architecture
Let $K$ be the set of the most relevant search keywords from the user for whom the RS will provide recommendations, and let $C$ be the set of pre-defined classifications of the users. Then, the proposed AE operates as follows (see Figure 3). First, based on the set of search keywords $K$, the AE sends a request to the SLA to obtain: i) a set $R$ of semantically related keywords; ii) the probability scores $p_c$, obtained from a PGMHD classification of the keywords, for every class of user $c \in C$; and iii) the set of nearest neighbor users (NNUs) from the class $c_{\max}$ within a threshold distance. If the maximum probability score is larger than a given threshold $\alpha > 0$, i.e.,

$$\max_{c \in C} p_c = p_{c_{\max}} > \alpha, \tag{1}$$

and the set of NNUs is not empty, the nearest neighbor augmentation is chosen. If (1) is satisfied but the set of NNUs is empty, the user classification augmentation is applied (to refine the decision space), and the search query augmentation is applied within the decision space corresponding to the user's classification. Otherwise, if (1) is not satisfied, the search query augmentation is chosen without restricting the decision space to a specific classification corresponding to the user.
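This decision logic (also depicted in Figure 3) condenses to a few branches, as in the sketch below; the dictionary shape of the SLA response and its field names are illustrative assumptions.

```python
def choose_augmentation(sla_response, alpha):
    """Pick an augmentation per Section 3.4 / Figure 3.

    sla_response: assumed dict with keys
      'srks'  - the semantically related keywords R,
      'p_max' - max_{c in C} p_c from the PGMHD classification,
      'c_max' - the arg-max class,
      'nnus'  - nearest neighbor users within the threshold distance.
    alpha: the probability threshold from condition (1).
    """
    if sla_response["p_max"] > alpha:          # condition (1) satisfied
        if sla_response["nnus"]:
            # Collaborative filtering over the most similar users.
            return ("nearest_neighbor", sla_response["nnus"])
        # Refine the decision space to the user's class, then search.
        return ("user_classification", sla_response["c_max"])
    # Fall back to plain query augmentation with the SRKs.
    return ("search_query", sla_response["srks"])
```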
4 Experiment and Results
We have implemented the Search Log Analyzer (SLA) using the Hadoop Map/Reduce framework. The SLA was
applied to analyze 1.6 billion search logs obtained from CareerBuilder.com. After applying the content-based
filtering, the discovered related search terms had an error rate of at most 2%. Table 3 shows sample results
of the SLA. Currently, we are implementing tests on the RS that incorporate the proposed AE to evaluate
the impact of the proposed system at CareerBuilder.com.
Table 3: Results of SLA

Term | Related Terms
hadoop | big data, hadoop developer, OBIEE, Java, Python
registered nurse | rn registered nurse, rn, registered nurse manager, nurse, nursing, director of nursing
data mining | machine learning, data scientist, analytics, business intellegence, statistical analyst
Solr | lucene, hadoop, java
Software Engineer | software developer, programmer, .net developer, web developer, software
big data | nosql, data science, machine learning, hadoop, teradata
Realtor | realtor assistant, real estate, real estate sales, sales, real estate agent
Data Scientist | machine learning, data analyst, data mining, analytics, big data
Plumbing | plumber, plumbing apprentice, plumbing maintenance, plumbing sales, maintenance
Agile | scrum, project manager, agile coach, pmiacp, scrum master
5 Conclusions
In this paper we propose a novel system to augment the recommendations generated by an RS when the only available information about a user is a small set of search keywords. The proposed
system relies on a model that discovers the semantic relationships between search keywords using aggregate
behavioral data from millions of users. Moreover, the proposed system is language-agnostic and could be
integrated with most modern recommendation engines.
6 Acknowledgments
We would like to thank the Big Data and Data Science teams at CareerBuilder for their support and
feedback during implementation of the Search Log Analyzer. Also, many thanks to the Search Development
Group at CareerBuilder for their help with integrating the proposed system within the recommendation
engine.
References
[1] K. AlJadda, M. Korayem, C. Ortiz, T. Grainger, J. A. Miller, and W. S. York. PGMHD: A scalable
probabilistic graphical model for massive hierarchical data problems. CoRR, abs/1407.5656, 2014.
[2] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, 2007.
[3] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[4] T. Grainger and T. Potter. Solr in Action. Manning Publications Co., 2014.
[5] J. A. Konstan. Introduction to recommender systems: Algorithms and evaluation. ACM Transactions
on Information Systems (TOIS), 22(1):1–4, 2004.
[6] S.-T. Park and W. Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of
the third ACM conference on Recommender systems, pages 21–28. ACM, 2009.
[7] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation
algorithms. In Proceedings of the 10th international conference on World Wide Web, pages 285–295.
ACM, 2001.
[8] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.
[9] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.