Conference Paper

Video suggestion and discovery for YouTube: Taking random walks through the view graph

Authors: Shumeet Baluja, Rohan Seth, D. Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, Mohamed Aly

Abstract

The rapid growth of the number of videos in YouTube provides enormous potential for users to find content of interest to them. Unfortunately, given the difficulty of searching videos, the size of the video repository also makes the discovery of new content a daunting task. In this paper, we present a novel method based upon the analysis of the entire user-video graph to provide personalized video suggestions for users. The resulting algorithm, termed Adsorption, provides a simple method to efficiently propagate preference information through a variety of graphs. We extensively test the results of the recommendations on a three month snapshot of live data from YouTube.
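
For intuition, the averaging view of Adsorption admits a very small implementation: each node repeatedly takes the weighted average of its neighbors' label distributions, while seed nodes keep re-injecting their own labels. Below is a minimal, illustrative Python sketch under those simplifications; the toy co-view graph and node ids are invented, and the injection/continuation/abandonment probabilities of the full algorithm are omitted.

    from collections import defaultdict

    def adsorption(edges, seed_labels, iterations=20):
        # Build a weighted, undirected adjacency list.
        adj = defaultdict(list)
        for u, v, w in edges:
            adj[u].append((v, w))
            adj[v].append((u, w))
        # Label distributions: node -> {label: weight}.
        dist = {n: dict(seed_labels.get(n, {})) for n in adj}
        for _ in range(iterations):
            new_dist = {}
            for node, nbrs in adj.items():
                agg, total = defaultdict(float), 0.0
                for nbr, w in nbrs:
                    for label, p in dist[nbr].items():
                        agg[label] += w * p
                    total += w
                labels = {l: p / total for l, p in agg.items()} if total else {}
                # Injection step (simplified): clamp seeds to their known labels.
                if node in seed_labels:
                    labels = dict(seed_labels[node])
                new_dist[node] = labels
            dist = new_dist
        return dist

    # Toy user-video co-view graph (hypothetical ids and edge weights).
    edges = [("u1", "v1", 3.0), ("u1", "v2", 1.0), ("u2", "v2", 2.0), ("u2", "v3", 2.0)]
    seeds = {"v1": {"v1": 1.0}, "v3": {"v3": 1.0}}  # seed videos inject their own id as a label
    print(adsorption(edges, seeds)["u1"])  # u1's affinity over the seed videos
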


... In the past few decades, many methods have been proposed for the tasks defined above. For node classification, there are broadly two categories of approaches: methods which use random walks to propagate the labels [9,10], and methods which extract features from nodes and apply classifiers on them [11,12]. Approaches for link prediction include similarity-based methods [13,14], maximum likelihood models [15,16], and probabilistic models [17,18]. ...
... Feature-based models [11,12,87] generate features for nodes based on their neighborhood and local network statistics and then apply a classifier such as Logistic Regression [88] or Naive Bayes [89] to predict the labels. Random walk-based models [9,10] propagate the labels with random walks. ...
... The labels represent blogger interests inferred through the metadata provided by the bloggers. The network has 10 ... This is a network of biological interactions between proteins in humans. This network has 3,890 nodes and 38,739 edges. ...
Preprint
Graphs, such as social networks, word co-occurrence networks, and communication networks, occur naturally in various real-world applications. Analyzing them yields insight into the structure of society, language, and different patterns of communication. Many approaches have been proposed to perform the analysis. Recently, methods which use the representation of graph nodes in vector space have gained traction from the research community. In this survey, we provide a comprehensive and structured analysis of various graph embedding techniques proposed in the literature. We first introduce the embedding task and its challenges such as scalability, choice of dimensionality, and features to be preserved, and their possible solutions. We then present three categories of approaches based on factorization methods, random walks, and deep learning, with examples of representative algorithms in each category and analysis of their performance on various tasks. We evaluate these state-of-the-art methods on a few common datasets and compare their performance against one another. Our analysis concludes by suggesting some potential applications and future directions. We finally present the open-source Python library we developed, named GEM (Graph Embedding Methods, available at https://github.com/palash1992/GEM), which provides all presented algorithms within a unified interface to foster and facilitate research on the topic.
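
As a concrete instance of the random-walk category discussed in this survey, the sketch below trains DeepWalk-style node embeddings by feeding truncated random walks to a skip-gram model. It assumes networkx and gensim are installed and uses a small built-in toy graph rather than the survey's datasets; it illustrates the technique and is not the GEM library's implementation.

    import random
    import networkx as nx
    from gensim.models import Word2Vec

    def random_walks(G, num_walks=10, walk_len=20, seed=42):
        # One corpus "sentence" per truncated uniform random walk.
        rng = random.Random(seed)
        walks, nodes = [], list(G.nodes())
        for _ in range(num_walks):
            rng.shuffle(nodes)
            for start in nodes:
                walk, cur = [str(start)], start
                for _ in range(walk_len - 1):
                    nbrs = list(G.neighbors(cur))
                    if not nbrs:
                        break
                    cur = rng.choice(nbrs)
                    walk.append(str(cur))
                walks.append(walk)
        return walks

    G = nx.karate_club_graph()  # small built-in toy graph
    # Skip-gram (sg=1) over the walks, as in DeepWalk.
    model = Word2Vec(random_walks(G), vector_size=64, window=5, min_count=0, sg=1, epochs=5)
    print(model.wv["0"][:5])  # first five embedding dimensions of node 0
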
... When it comes to human online activities, many theoretical studies curiously assume uncorrelated random events on the part of the users [15, 24-26], which makes their behavior rather unpredictable. Moreover, that literature assumes that a user's future partners in comments and reviews, or how web pages are visited, are independent of the history of the process, or depend at best on the previous time step. ...
... Moreover, that literature assumes that a user's future partners in comments and reviews, or how web pages are visited, are independent of the history of the process, or depend at best on the previous time step. While these assumptions work well for page ranking in web search [24], online recommendation systems [25], link prediction [27], and advertising [26], it is not clear that they apply to more interactive processes such as contacting friends within online social networks, participating in online discourse, and exchanges of email and text messages. Even in cases where a Markovian assumption seems to yield good results, the discovery of deterministic components to online browsing and searching can improve existing algorithms [28]. ...
... Using ideas first articulated in studies of gene expression [29], predictability is here defined as the degree to which one can forecast a user's interactions based on observations of their previous activity. The main focus of this study is to be contrasted with existing studies of online social behavior, such as recommender systems [25] and link prediction [27], which use statistical learning models to improve the prediction accuracy of novel links and recommendations. By examining datasets from user commenting activities and place-visiting logs, we found that the observed activity sequences deviate from a random walk model with deterministic components. ...
Preprint
The massive amounts of data that social media generates have facilitated the study of online human behavior on a scale unimaginable a few years ago. At the same time, the much-discussed apparent randomness with which people interact online makes it appear as if these studies cannot reveal predictive social behaviors that could be used for developing better platforms and services. We use two large social databases to measure the mutual information entropy that both individual and group actions generate as they evolve over time. We show that users' interaction sequences have strong deterministic components, in contrast with existing assumptions and models. In addition, we show that individual interactions are more predictable when users act on their own rather than when attending group activities.
... For example, in Web search, Deng et al. [8] modeled queries and URLs for query suggestion, Cao et al. [20] considered the co-occurrence between entities and queries for entity ranking, Li et al. [10] modeled users and their search sessions for detecting click spam, and Rui et al. [21] mined visual features and the surrounding texts for Web image annotation. In practical recommender systems, bipartite graph methods have been used for Twitter user recommendation [22] and YouTube video recommendation [23]. In the domain of natural language processing, Parveen et al. [24] generated multi-document summarization based on the relationship of sentences and lexical entities. ...
... The use of symmetric normalization is a key characteristic of BiRank, allowing edges connected to a high-degree vertex to be suppressed through normalization, lessening the contribution of high-degree vertices. This has the beneficial effect of toning down the dependence of top rankings on high-degree vertices, a known defect of the random walk-based diffusion methods [23]. This gives rise to better quality results. ...
... Co-HITS [8] normalizes each column of W (and W^T) stochastically, having an explanation of simulating random walks on the graph. However, random walk methods can be biased towards high-degree vertices [23]. BGER [20] avoids this defect by normalizing each row of W (and W^T) stochastically, yielding an effect of suppressing the scores of high-degree vertices. ...
Preprint
The bipartite graph is a ubiquitous data structure that can model the relationship between two entity types: for instance, users and items, queries and webpages. In this paper, we study the problem of ranking vertices of a bipartite graph, based on the graph's link structure as well as prior information about vertices (which we term a query vector). We present a new solution, BiRank, which iteratively assigns scores to vertices and finally converges to a unique stationary ranking. In contrast to the traditional random walk-based methods, BiRank iterates towards optimizing a regularization function, which smooths the graph under the guidance of the query vector. Importantly, we establish how BiRank relates to the Bayesian methodology, enabling the future extension in a probabilistic way. To show the rationale and extendability of the ranking methodology, we further extend it to rank for the more generic n-partite graphs. BiRank's generic modeling of both the graph structure and vertex features enables it to model various ranking hypotheses flexibly. To illustrate its functionality, we apply the BiRank and TriRank (ranking for tripartite graphs) algorithms to two real-world applications: a general ranking scenario that predicts the future popularity of items, and a personalized ranking scenario that recommends items of interest to users. Extensive experiments on both synthetic and real-world datasets demonstrate BiRank's soundness (fast convergence), efficiency (linear in the number of graph edges) and effectiveness (achieving state-of-the-art in the two real-world tasks).
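
For concreteness, here is a minimal NumPy sketch of a BiRank-style iteration on an item-user bipartite graph, following the abstract's description: symmetrically normalize the weight matrix by the degrees of both vertex sides, then alternate score updates that blend graph smoothing with the query (prior) vectors. The toy matrix and the damping values alpha/beta are illustrative assumptions, not the paper's experimental settings.

    import numpy as np

    def birank(W, p0, u0, alpha=0.85, beta=0.85, iters=100, tol=1e-8):
        # Symmetric normalization: S = D_p^{-1/2} W D_u^{-1/2}.
        Dp = np.maximum(W.sum(axis=1), 1e-12)
        Du = np.maximum(W.sum(axis=0), 1e-12)
        S = W / np.sqrt(Dp)[:, None] / np.sqrt(Du)[None, :]
        p, u = p0.copy(), u0.copy()
        for _ in range(iters):
            p_new = alpha * (S @ u) + (1 - alpha) * p0      # smooth item scores toward prior
            u_new = beta * (S.T @ p_new) + (1 - beta) * u0  # smooth user scores toward prior
            if np.abs(p_new - p).sum() + np.abs(u_new - u).sum() < tol:
                p, u = p_new, u_new
                break
            p, u = p_new, u_new
        return p, u

    # Toy 3-item x 2-user interaction matrix and uniform priors (query vectors).
    W = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    p, u = birank(W, np.full(3, 1 / 3), np.full(2, 1 / 2))
    print(p)  # converged item scores
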
... Making sense of these relational data is critical for companies and organizations to make better business decisions and even bring convenience to our daily life. Recent advances in data mining, machine learning, and data analytics have led to a flurry of graph analytic techniques that typically require an iterative refinement process [6,34,25,9]. However, the massive amount of data involved and potentially numerous iterations required make performing data analytics in a timely manner challenging. ...
... Adsorption [6] is a graph-based label propagation algorithm that provides personalized recommendations for content (e.g., video, music, documents, products). The concept of a label indicates a certain common feature of the entities. ...
Preprint
A myriad of graph-based algorithms in machine learning and data mining require parsing relational data iteratively. These algorithms are implemented in a large-scale distributed environment in order to scale to massive data sets. To accelerate these large-scale graph-based iterative computations, we propose delta-based accumulative iterative computation (DAIC). Different from traditional iterative computations, which iteratively update the result based on the result from the previous iteration, DAIC updates the result by accumulating the "changes" between iterations. With DAIC, we can process only the "changes" and avoid negligible updates. Furthermore, we can perform DAIC asynchronously to bypass the high-cost synchronous barriers in heterogeneous distributed environments. Based on the DAIC model, we design and implement an asynchronous graph processing framework, Maiter. We evaluate Maiter on a local cluster as well as on the Amazon EC2 Cloud. The results show that Maiter achieves as much as 60x speedup over Hadoop and outperforms other state-of-the-art frameworks.
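
To make the delta-based accumulative idea concrete, here is a small, hypothetical single-machine sketch of DAIC-style PageRank: each vertex accumulates only the "changes" received since its last activation and pushes deltas to its out-neighbors, skipping negligible updates. Maiter's distributed, asynchronous machinery is omitted.

    def daic_pagerank(adj, d=0.85, eps=1e-10):
        # Accumulate only the per-vertex "changes" instead of recomputing ranks.
        n = len(adj)
        rank = {v: 0.0 for v in adj}
        delta = {v: (1 - d) / n for v in adj}  # initial change = teleport mass
        active = set(adj)
        while active:
            v = active.pop()
            change, delta[v] = delta[v], 0.0
            rank[v] += change              # fold the change into the result
            if not adj[v]:
                continue
            spread = d * change / len(adj[v])
            for w in adj[v]:
                delta[w] += spread         # propagate only the delta
                if delta[w] > eps:         # skip negligible updates
                    active.add(w)
        return rank

    # Toy directed graph: vertex -> out-neighbors (hypothetical).
    print(daic_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
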
... In practice, the vertex update method in Eq. 2 has been applied in many fields, such as Gauss-Seidel iteration [17] in linear algebra, Adsorption [18], Katz metric [19], SimRank [20], Belief propagation [21] and so on [22], [23]. ...
... edges between super-vertices have weights that are equal to the number of edges between the subgraphs (line 14). Then we compute the val of each super-vertex (lines 17-19). After that, we sort the subgraphs in ascending order of the val of the super-vertex (line 21). ...
Preprint
Enhancing the efficiency of iterative computation on graphs has garnered considerable attention in both industry and academia. Nonetheless, the majority of efforts focus on expediting iterative computation by minimizing the running time per iteration step, ignoring the optimization of the number of iteration rounds, which is a crucial aspect of iterative computation. We experimentally verified the correlation between the vertex processing order and the number of iterative rounds, thus making it possible to reduce the number of execution rounds for iterative computation. In this paper, we propose a graph reordering method, GoGraph, which can construct a well-formed vertex processing order effectively reducing the number of iteration rounds and, consequently, accelerating iterative computation. Before delving into GoGraph, a metric function is introduced to quantify the efficiency of vertex processing order in accelerating iterative computation. This metric reflects the quality of the processing order by counting the number of edges whose source precedes the destination. GoGraph employs a divide-and-conquer mindset to establish the vertex processing order by maximizing the value of the metric function. Our experimental results show that GoGraph outperforms current state-of-the-art reordering algorithms by 1.83x on average (up to 3.34x) in runtime.
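
The metric function described in the abstract is easy to state in code: count the edges whose source precedes its destination in the candidate processing order. A tiny hypothetical sketch (the function name and toy graph are ours, not GoGraph's):

    def order_quality(edges, order):
        # Count edges whose source precedes its destination in the processing order.
        pos = {v: i for i, v in enumerate(order)}
        return sum(1 for u, v in edges if pos[u] < pos[v])

    edges = [("a", "b"), ("b", "c"), ("c", "a")]
    print(order_quality(edges, ["a", "b", "c"]))  # 2 of 3 edges point "forward"
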
... (2) Intuitively, MF-applicable algorithms like PageRank can derive cancelation and compensation messages without recording any intermediate vertex states due to two key factors: (a) these algorithms employ accumulative aggregation, and (b) there exists an inverse function capable of eliminating the effects of previously accumulated messages from the vertex states. Apart from PageRank, many other algorithms are also MF-applicable, such as SimRank [20], Penalized Hitting Probability (PHP) [16], the Katz Metric [22], Belief Propagation [39] and Adsorption [2], i.e., they can be incrementalized with Algorithm 1. ...
... (lines 2-5). Starting with the transmission of these messages to designated neighbors, it restores the iterative computation of A over G ⊕ ΔG to get the updated results, i.e., applying the same functions H, U and G as the batch counterpart A (line 6). Example 7: Continuing with Example 3, we use Algorithm 1 to generate the cancelation and compensation messages as shown in Fig. ...
Article
Full-text available
The graph data keep growing over time in real life. The ever-growing amount of dynamic graph data demands efficient techniques of incremental graph computation. However, incremental graph algorithms are challenging to develop. Existing approaches usually require users to manually design nontrivial incremental operators, or choose different memoization strategies for certain specific types of computation, limiting the usability and generality. In light of these challenges, we propose Ingress, an automated system for incremental graph processing. Ingress is able to deduce the incremental counterpart of a batch vertex-centric algorithm, without the need of redesigned logic or data structures from users. Underlying Ingress is an automated incrementalization framework equipped with four different memoization policies, to support all kinds of vertex-centric computations with optimized memory utilization. We identify sufficient conditions for the applicability of these policies. Ingress chooses the best-fit policy for a given algorithm automatically by verifying these conditions. In addition to the ease-of-use and generalization, Ingress outperforms state-of-the-art incremental graph systems by 12.14× on average (up to 49.23×) in efficiency.
... Actual classification of nodes is performed through an iterative algorithm (Algorithm 1) (Baluja et al. 2008; Bhagat, Cormode, and Muthukrishnan 2011) equivalent to performing random walks over an absorbing Markov chain. Their formulation is similar to the iterative formulation of the PageRank algorithm (Page et al. 1999), which is preferred over the closed-formula solution for scalability reasons. We will now elaborate on the Markov chain formulation of the problem at hand as it gives insight into the inner workings of the method. ...
... Their labeled nodes represent absorbing states in the Markov chain; they did not exploit the inherent connections in their data but relied on a metric to provide edge weights. Baluja et al. (2008) propose an algorithm to recommend YouTube videos based on randomly walking over the co-view graph. Bhagat, Cormode, and Muthukrishnan (2011) provide a survey on node classification methods in social networks. ...
Article
We derive the political climate of the social circles of Twitter users using a weakly-supervised approach. By applying random walks over a sub-sample of Twitter's social graph we infer a distribution indicating the presence of eight Flemish political parties in users' social circles in the months before the 2014 elections. The graph structure is induced through a combination of connection and retweet features and combines information of over a million tweets and 14 million follower connections. We solely exploit the social graph structure and do not rely on tweet content. For validation we compare the affiliation of politically active Twitter users with the most-influential party in their network. On a validation set of around 700 politically active individuals we achieve F_1 scores of 0.85 and greater. We asked the Twitter community to evaluate our classification performance. More than half of the 2258 users who responded reported a score higher than 60 out of 100.
... The core approach involves organizing user-item interaction data into a bipartite graph and learning node representations from the graph structure. Earlier efforts [2,13] extract the graph information using random walk strategies. With the development of GNNs, research has shifted towards designing effective message-passing mechanisms to propagate user/item embeddings over the graph [40,47,56]. ...
Preprint
Graph neural network (GNN) has been a powerful approach in collaborative filtering (CF) due to its ability to model high-order user-item relationships. Recently, to alleviate the data sparsity and enhance representation learning, many efforts have been conducted to integrate contrastive learning (CL) with GNNs. Despite the promising improvements, the contrastive view generation based on structure and representation perturbations in existing methods potentially disrupts the collaborative information in contrastive views, resulting in limited effectiveness of positive alignment. To overcome this issue, we propose CoGCL, a novel framework that aims to enhance graph contrastive learning by constructing contrastive views with stronger collaborative information via discrete codes. The core idea is to map users and items into discrete codes rich in collaborative information for reliable and informative contrastive view generation. To this end, we initially introduce a multi-level vector quantizer in an end-to-end manner to quantize user and item representations into discrete codes. Based on these discrete codes, we enhance the collaborative information of contrastive views by considering neighborhood structure and semantic relevance respectively. For neighborhood structure, we propose virtual neighbor augmentation by treating discrete codes as virtual neighbors, which expands an observed user-item interaction into multiple edges involving discrete codes. Regarding semantic relevance, we identify similar users/items based on shared discrete codes and interaction targets to generate the semantically relevant view. Through these strategies, we construct contrastive views with stronger collaborative information and develop a triple-view graph contrastive learning approach. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed approach.
... The similarity measure between users and videos, established through specific feature representation methods, serves as the foundation for recommendation systems [11]. Miller et al. [12] leveraged the content and hyperlink structure of Wikipedia to establish the similarity between two movies, employing the k-nearest neighbor algorithm for movie recommendation [13,14]. However, experimental findings indicated that incorporating Wikipedia knowledge did not notably enhance the performance of the recommender system. ...
Preprint
Full-text available
The rise of the Internet has revolutionized the way information is accessed and consumed, leading to an era characterized by data abundance and complexity. In this landscape, users are often inundated with numerous choices, creating a challenge known as information overload. To address this challenge, recommendation systems have emerged as indispensable tools across various online platforms. These systems, by analyzing user attributes and behaviors, aim to provide personalized recommendations, thus assisting users in navigating the vast array of available information. However, conventional recommendation algorithms often overlook valuable user-item features embedded in additional data sources such as review texts. Moreover, integrating auxiliary information to enhance recommendation accuracy adds complexity to the process. In light of these considerations, this paper introduces an interest-aware message passing recommendation model. By partitioning the graph into subgraphs based on user and item interactions, the model leverages graph convolution operations to learn node representations, thereby enhancing the expression of users' interests and preferences. Through empirical evaluation on benchmark datasets, our model outperforms existing baselines, demonstrating its efficacy in addressing the challenges of recommendation in the age of data abundance.
... In the realm of multimodal recommendation, prevailing models can be categorized into collaborative filtering-based video recommendation [11], content-based video recommendation [12], and hybrid video recommendation [13]. An early example is YouTube's user-video graph tour algorithm in 2008, which utilized collaborative filtering to propagate video labels on a graph [14]. However, this approach only considered video label information and faced the cold start problem [15]. ...
Preprint
Full-text available
In the dynamic landscape of contemporary social media, short videos have emerged as a dominant form of content consumption, prompting intensified research focus on short video recommendation scenarios. Users engaged with short videos exhibit unique characteristics, marked by diverse and multi-level dynamic interests. Addressing the challenges inherent in short video recommendation systems, this paper introduces a hybrid recommendation algorithm model that capitalizes on multimodal information. The model incorporates user-side auxiliary information into its network structure, delving into the profound interests of users. It assesses the significance of each dimension within user and item feature representations during the scoring prediction task. Furthermore, the application of graph neural networks in the recommendation system is enhanced through the integration of an attention mechanism. This mechanism facilitates the fusion of multi-layer state output information, enabling more effective participation of shallow structural features provided by the intermediate layer in the prediction task. Through extensive experimentation on different datasets, the proposed model demonstrates improved recommendation accuracy compared to traditional recommendation algorithms, affirming the feasibility and effectiveness of the approach.
... However, asking users to rate or rank jobs based on their significance is not accurate or applicable in real-world recruiting systems. Online recommendation systems use a variety of explicit and implicit information sources, such as purchasing histories in e-commerce systems [5], [6], browsing and clicking actions in news recommendation [7], or views in online video recommendation [8]. One major drawback of relying on such data sources alone is the high level of data sparsity, under which typical item-item similarity measures may fail [9]. ...
... Hence, Jing et al. [11] studied the VisualRank reranking algorithm, which uses visual features of videos to construct a weighted graph structure on a large-scale video set and applies the random walk model of the PageRank algorithm; the experimental results show that user satisfaction and retrieval precision improved significantly. Baluja et al. [12] applied a random walk model on a user-video graph to provide personalized video suggestions for users, and then generated a list of recommended videos according to the user's preferences. In order to avoid the one-sidedness of using a single mode to extract image features, the algorithms in Refs. [13,14]. ...
Article
Full-text available
In order to further improve the performance of image retrieval, a novel image reranking algorithm based on the discrete-time quantum walk is proposed. In this algorithm, a discrete-time quantum walk model based on a weighted undirected complete graph is first constructed, in which the nodes of the graph represent the images and the weights of the edges are the similarity values between the images. Then the average probability of the walker reaching each node of the graph is used as the relevance score of the corresponding image, and the images are reranked according to the relevance scores. Finally, our experimental results show that the proposed reranking algorithm yields a significant improvement over the initial ranking algorithm in both visual comparison and relevance scores. Furthermore, the effectiveness of our algorithm is evaluated by the average precision (AP) and the mean average precision (MAP), where the AP of our algorithm is increased by 18.23% and 37.61% for the two types of query image in a randomly selected image group, respectively, and the MAP of our algorithm is increased by 22.24% for all image groups compared with the initial ranking algorithm.
... In the age of information explosion, recommender systems play an important role in discovering user preferences and providing online services efficiently (Baluja et al. 2008). ...
Article
Full-text available
The multimodal side information such as images and text has been commonly used as a supplement to improve graph collaborative filtering recommendations. However, there is often a semantic gap between multimodal information and collaborative filtering information. Previous works often directly fuse or align these types of information, which results in semantic distortion or degradation. Additionally, multimodal information also introduces additional noise, and previous methods lack explicit supervision to identify this noise. To tackle these issues, we propose a novel contrastive learning approach to improve graph collaborative filtering, named Multimodal-Side-Information-enriched Contrastive Learning (MSICL), which does not fuse multimodal information directly, but still explicitly captures users' potential preferences for similar images or text by contrasting ID embeddings, and filters noise in multimodal side information. Specifically, we first search for samples with similar images or text as positive contrastive pairs. Secondly, some searched sample pairs may be irrelevant, so we distinguish the noise by filtering out sample pairs that have no interaction relationship. Thirdly, we contrast the ID embeddings of the true positive sample pairs to excavate the potential similarity relationship in multimodal side information. Extensive experiments on three datasets demonstrate the superiority of our method in multimodal recommendation. Moreover, our approach significantly reduces computation and memory cost compared to previous work.
... The traditional methods [11,17] of initializing user and item embeddings in recommendation systems often rely on random or heuristic techniques. These approaches, however, may not adequately capture the intrinsic structure of user-item interaction data, which can lead to less than optimal performance in the early stages of training and slower convergence. ...
Article
Full-text available
Graph Collaborative Filtering (GCF) methods have emerged as an effective recommendation approach, capturing users' preferences over items by modeling user-item interaction graphs. However, these methods suffer from data sparsity in real scenarios, and their performance can be improved using contrastive learning. In this paper, we propose an optimized method, named LoRA-NCL, for GCF based on Neighborhood-enriched Contrastive Learning (NCL) and low-rank dimensionality reduction. We incorporate low-rank features obtained through matrix factorization into the NCL framework and employ LightGCN to extract high-dimensional representations. Extensive experiments on five public datasets demonstrate that the proposed method outperforms a competitive graph collaborative filtering base model, achieving 4.6% performance gains on the MovieLens dataset.
... Within everyday living, we very much depend on recommendations from other people, whether by hearsay or through responses to general questionnaires. People regularly use browser recommendation systems to make buying choices for items that are relevant to their tastes [3]. Suggestion expert systems are computer tools and procedures with the goal of making relevant and sensible suggestions for things or commodities which might be of importance to a group of people [4]. ...
Article
Full-text available
People are puzzled about which movie to watch these days because there are so many movies available on various OTT platforms. A recommender system solves this problem by recommending the best movie to the user based on his or her genre, actor, director, and rating preferences. The recommendation system is guided by the cosine similarity principle. Apart from that, this work uses the TfidfTransformer and CountVectorizer from the scikit-learn library in Python. The constraints of all of the approaches are described in this study. All of this work was done using datasets from several OTT platforms that were available on Kaggle.
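
As a sketch of the pipeline this abstract describes, the following minimal scikit-learn example builds TF-IDF vectors over concatenated genre/actor/director metadata and ranks movies by cosine similarity; the tiny inline catalogue is invented stand-in data, not the Kaggle datasets used in the study.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical catalogue: each entry concatenates genre, actor, and director tokens.
    movies = {
        "Movie A": "action adventure actor-tom director-x",
        "Movie B": "action thriller actor-tom director-x",
        "Movie C": "romance drama actor-jane director-y",
    }
    titles = list(movies)
    X = TfidfVectorizer().fit_transform(movies.values())  # movie-by-term matrix
    sim = cosine_similarity(X)                            # pairwise cosine similarity

    def recommend(title, top_n=2):
        i = titles.index(title)
        ranked = sim[i].argsort()[::-1]                   # most similar first
        return [titles[j] for j in ranked if j != i][:top_n]

    print(recommend("Movie A"))  # -> ['Movie B', 'Movie C']
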
... Many real-world systems, including those controlling large-scale applications such as video recommendation [134] and traffic prediction [135], use ad hoc versions of iterated retraining, relying on online and offline collection processes that are largely agnostic to learning potential. As such, their online collection can suffer from the bootstrap problem, falling into premature equilibria. ...
Article
Full-text available
We are at the cusp of a transition from ‘learning from data’ to ‘learning what data to learn from’ as a central focus of artificial intelligence (AI) research. While the first-order learning problem is not completely solved, large models under unified architectures, such as transformers, have shifted the learning bottleneck from how to effectively train models to how to effectively acquire and use task-relevant data. This problem, which we frame as exploration, is a universal aspect of learning in open-ended domains like the real world. Although the study of exploration in AI is largely limited to the field of reinforcement learning, we argue that exploration is essential to all learning systems, including supervised learning. We propose the problem of generalized exploration to conceptually unify exploration-driven learning between supervised learning and reinforcement learning, allowing us to highlight key similarities across learning settings and open research challenges. Importantly, generalized exploration is a necessary objective for maintaining open-ended learning processes, which in continually learning to discover and solve new problems, provides a promising path to more general intelligence.
... Figs. 10 & 12 show how a significant proportion of views are generated from the recommender system rather than from direct search. Secondly, the recommender system accounts for approximately 60% of all video clicks from YouTube's homepage [9] [21]; thus, many viewers watch YouTube videos without necessarily performing a search query, watching videos based solely on YouTube's suggestions, an "unarticulated want" [9, p. 293]. Thirdly, the recommender system may promote an educator's video for only a short time (as shown in Fig. 10), which may give the educator a false sense of growth. ...
Conference Paper
Full-text available
Abstract— Videos are an effective means of knowledge delivery and educators are increasingly using videos as part of their pedagogy. While many universities have digital teaching and learning platforms, most of these platforms are not specifically designed for video-based tuition. Social media platforms, however, have become a popular choice for the uploading of educational content. YouTube is the largest media sharing platform worldwide and provides educators the capacity to share their knowledge with a global audience. In so doing, YouTube provides extensive metrics as part of its social media analytics which are powerful tools that educators can use to improve their educational content and increase their impact on the platform. While there has been some scholarly activity on social media analytics, few, if any, publications are aimed at explaining what YouTube’s social media analytics are and how educators can use them to improve their content. This article aims to critically explain and analyse 14 of YouTube’s social media analytics to assist educators in increasing their impact. The article reports on the success of numerous engineering tutorial videos published in 2020 and 2021 that have accrued over 1 million views on YouTube. The aim of this article is to provide the reader with practical tools to improve their own offerings on public networks such as YouTube. Since as much as 60% of all YouTube views originate from YouTube’s recommendations rather than from direct search queries, YouTube’s recommender system is also presented in this article.
... Since the data in most recommendation systems is essentially a graph structure, more and more graph learning methods have been applied to learn the inter-object relations in recommendation systems [40]. Random walk-based recommendation systems [41] have been widely adopted to capture complex, higher-order and indirect relations among a variety of nodes on the graph [42]. Graph representation learning-based recommendation systems [43] encode each node into a latent representation and then analyze the complex relations between them. ...
Preprint
Full-text available
Many patients with chronic diseases resort to multiple medications to relieve various symptoms, which raises concerns about the safety of multiple medication use, as severe drug-drug antagonism can lead to serious adverse effects or even death. This paper presents a Decision Support System, called DSSDDI, based on drug-drug interactions to support doctors' prescribing decisions. DSSDDI contains three modules: the Drug-Drug Interaction (DDI) module, the Medical Decision (MD) module and the Medical Support (MS) module. The DDI module learns safer and more effective drug representations from the drug-drug interactions. To capture the potential causal relationship between DDI and medication use, the MD module considers the representations of patients and drugs as context, DDI and patients' similarity as treatment, and medication use as outcome to construct counterfactual links for the representation learning. Furthermore, the MS module provides drug candidates to doctors with explanations. Experiments on the chronic data collected from the Hong Kong Chronic Disease Study Project and the public diagnostic dataset MIMIC-III demonstrate that DSSDDI can be a reliable reference for doctors in terms of safety and efficiency of clinical diagnosis, with significant improvements compared to baseline methods.
Conference Paper
Many patients with chronic diseases resort to multiple medications to relieve various symptoms, which raises concerns about the safety of multiple medication use, as severe drug-drug antagonism can lead to serious adverse effects or even death. This paper presents a Decision Support System, called DSSDDI, based on drug-drug interactions to support doctors' prescribing decisions. DSSDDI contains three modules: the Drug-Drug Interaction (DDI) module, the Medical Decision (MD) module and the Medical Support (MS) module. The DDI module learns safer and more effective drug representations from the drug-drug interactions. To capture the potential causal relationship between DDI and medication use, the MD module considers the representations of patients and drugs as context, DDI and patients' similarity as treatment, and medication use as outcome to construct counterfactual links for the representation learning. Furthermore, the MS module provides drug candidates to doctors with explanations. Experiments on the chronic data collected from the Hong Kong Chronic Disease Study Project and the public diagnostic dataset MIMIC-III demonstrate that DSSDDI can be a reliable reference for doctors in terms of safety and efficiency of clinical diagnosis, with significant improvements compared to baseline methods. Source code of the proposed DSSDDI is publicly available at https://github.com/TianBian95/DSSDDI.
... As a kind of unstructured data, the rich content and various forms of expression of video bring certain challenges to the recommendation system. Currently, video recommendation techniques can be broadly classified into collaborative filtering-based video recommendation [6,7], content-based video recommendation [8,9,10,11,12], and hybrid video recommendation [13]. Existing video recommendation research focuses on the following issues: (1) data representation: video feature representation and user model representation. ...
Preprint
Full-text available
The exponential growth of the internet has led to the creation of a large volume of information, making it challenging for users to navigate through. As a solution to this challenge, recommendation systems have emerged. In video recommendation, besides applying some basic interaction data (including image data, behavior data, context data, etc.) to the recommendation model, many studies also try to apply the video content data to the model for video recommendation. In this paper, we propose different types of features for different modal data, and select the most suitable feature types according to different tasks. The internal representation of features is learned by a multi-headed self-attentive mechanism and the cross-representation of features is learned by attention. On this basis, the multimodal features of video data are represented and modeled in a unified way, which forms the foundation for multi-view video recommendation. Building on this, we add multimodal video content to mine richer and more comprehensive descriptions, so as to provide more accurate and personalized recommendations. The results of experiments conducted on real data sets demonstrate the effectiveness of the model proposed in this paper.
... YouTube analysis has been used to apply sentiment analysis to the recent US 2020 elections on a limited dataset of approximately 200 comments from YouTube [98], to discover irrelevant and misleading metadata [99], to identify spam campaigns [100], to discover extremist videos and hidden communities [101], to propagate personalized video preference information [102], to estimate causality between user profiles [103], to spread political advertisement [39,104,105], and to apply opinion mining [106]. For example, [104] explores the 2008 Senate campaign. ...
Article
Full-text available
Most studies analyzing political traffic on social networks focus on a single platform, while campaigns and reactions to political events produce interactions across different social media. Ignoring such cross-platform traffic may lead to analytical errors, missing important interactions across social media that e.g. explain the cause of trending or viral discussions. This work links the Twitter and YouTube social networks using cross-postings of video URLs on Twitter to discover the main tendencies and preferences of the electorate, distinguish users' and communities' favouritism towards an ideology or candidate, study the sentiment towards candidates and political events, and measure political homophily. This study shows that Twitter communities correlate with YouTube comment communities: that is, Twitter users belonging to the same community in the Retweet graph tend to post YouTube video links with comments from YouTube users belonging to the same community in the YouTube Comment graph. Specifically, we identify Twitter and YouTube communities, measure their similarities and differences, and show the interactions and the correlation between the largest communities on YouTube and Twitter. To achieve that, we gathered a dataset of approximately 20M tweets and the comments of 29K YouTube videos; we present the volume, the sentiment, and the communities formed in the YouTube and Twitter graphs, and publish a representative sample of the dataset, as allowed by the corresponding Twitter policy restrictions.
... Based on the ROC curve, the largest AUC is at k = 10. Another good choice of k is 5, but it can never achieve a high TPR [1]. This means that even if the k value is set high, the algorithm will not be able to recommend a large percentage of items that the user liked. ...
Article
In recent years there has been a drastic increase in information over the internet, and users struggle to find the products that best match their interests. Here the recommender system helps to filter the information and gives relevant recommendations to users, so that the user community can find the item(s) of interest from a huge collection of available data. But filtering information from the reviews users give for various items is a challenging task when recommending items of interest to the user. In general, similarities between users are considered for recommendations in collaborative filtering techniques. This paper describes a new collaborative filtering technique called the Adaptive Similarity Measure Model (ASMM) to identify similarity between users for the selection of unseen items. Out of all the available items, ASMM sorts out the most similar ones for recommendation, and these vary from user to user.
... Many real-world systems, including those controlling large-scale applications such as video recommendation [7] and traffic prediction [60], use ad hoc versions of iterated retraining, relying on online and offline collection processes that are largely agnostic to learning potential. As such, their online collection can suffer from the bootstrap problem, falling into premature equilibria. ...
Preprint
We are at the cusp of a transition from "learning from data" to "learning what data to learn from" as a central focus of artificial intelligence (AI) research. While the first-order learning problem is not completely solved, large models under unified architectures, such as transformers, have shifted the learning bottleneck from how to effectively train our models to how to effectively acquire and use task-relevant data. This problem, which we frame as exploration, is a universal aspect of learning in open-ended domains, such as the real world. Although the study of exploration in AI is largely limited to the field of reinforcement learning, we argue that exploration is essential to all learning systems, including supervised learning. We propose the problem of generalized exploration to conceptually unify exploration-driven learning between supervised learning and reinforcement learning, allowing us to highlight key similarities across learning settings and open research challenges. Importantly, generalized exploration serves as a necessary objective for maintaining open-ended learning processes, which in continually learning to discover and solve new problems, provides a promising path to more general intelligence.
Article
Bug triaging is a vital process in software maintenance, involving assigning bug reports to developers in the issue tracking system. Current studies predominantly treat automatic bug triaging as a classification task, categorizing bug reports using developers as labels. However, this approach deviates from the essence of triaging, which is establishing bug–developer correlations. These correlations should be explicitly leveraged, offering a more comprehensive and promising paradigm. Our bug triaging model utilizes graph collaborative filtering (GCF), a method known for handling correlations. However, GCF encounters two challenges in bug triaging: data sparsity in bug fixing records and semantic deficiency in exploiting input data. To address them, we propose PCG, an innovative framework that integrates prototype augmentation and contrastive learning with GCF. With bug triaging modeled as predicting links on the bipartite graph of bug–developer correlations, we introduce prototype clustering‐based augmentation to mitigate data sparsity and devise a semantic contrastive learning task to overcome semantic deficiency. Extensive experiments against competitive baselines validate the superiority of PCG. This work may open new avenues for investigating correlations in bug triaging and related scenarios.
Conference Paper
The sheer volume of data makes it challenging to locate pertinent information in today's age of abundant YouTube videos. Recommendation systems were developed to improve this process by helping users find the information most interesting to them. Algorithms, computations, and implicit user input are the backbone of most recommendation systems. These techniques work well, except when a video lacks implicit input, in which case the algorithms will likely fail to identify significant material. Cold start occurs when a video is first posted and has no history or viewer comments attached to it. Users also struggle daily with the issue of discovery, which is hampered by the fact that videos need labels or a large number of views to be easily found, because the search engine's method relies on the description tags and keywords assigned to the video footage rather than the video itself. In this research, we offer a content-based recommendation system that is capable of identifying in-video objects and sounds and providing users with the option to search for relevant material using submitted scenes or to filter results using keywords. Further experiments have been conducted using a wide range of situations to prove the efficacy of the suggested system for video content suggestion.
Article
Graph Collaborative Filtering is a widely adopted approach for recommendation, which captures similar behavior features through graph neural network. Recently, Contrastive Learning (CL) has been demonstrated as an effective method to enhance the performance of graph collaborative filtering. Typically, CL-based methods first perturb users’ history behavior data (e.g., drop clicked items), then construct a self-discriminating task for behavior representations under different random perturbations. However, for widely existing inactive users, random perturbation makes their sparse behavior information more incomplete, thereby harming the behavior feature extraction. To tackle the above issue, we design a novel directional perturbation-based CL method to improve the graph collaborative filtering performance. The idea is to perturb node representations through directionally enhancing behavior features. To do so, we propose a simple yet effective feedback mechanism, which fuses the representations of nodes based on behavior similarity. Then, to avoid irrelevant behavior preferences introduced by the feedback mechanism, we construct a behavior self-contrast task before and after feedback, to align the node representations between the final output and the first layer of GNN. Different from the widely-adopted self-discriminating task, the behavior self-contrast task avoids complex message propagation on different perturbed graphs, which is more efficient than previous methods. Extensive experiments on three public datasets demonstrate that the proposed method has distinct advantages over other contrastive learning methods on recommendation accuracy.
Chapter
This chapter introduces four types of classic recommendation algorithms: content-based recommendation algorithms, classic collaborative filtering algorithms, matrix factorization methods, and factorization machines. Before the emergence of deep learning, these methods were the most mainstream techniques for recommender systems, widely recognized by both academia and industry. Although these technologies are no longer the industry's first choice since the emergence of deep learning, the basic ideas and practical experience distilled from them still influence follow-up research. Therefore, in many deep learning-based recommendation algorithms, we can often see reflections of the above approaches.
Chapter
The explosively generated micro-videos on content sharing platforms call for recommender systems to permit personalized micro-video discovery with ease. Recent advances in micro-video recommendation have achieved remarkable performance in mining users' current preferences based on historical behaviors. However, most of them neglect the dynamic and time-evolving nature of users' preferences, and prediction on future micro-videos with historically mined preferences may deteriorate the effectiveness of recommender systems. In this paper, we devise the DMR framework, which comprises: 1) the implicit user network module, which identifies sequence fragments from other users with similar interests and extracts the sequence fragments that are chronologically behind the identified fragments; 2) the multi-trend routing module, which assigns each extracted sequence fragment to a trend group and updates the corresponding trend vector; 3) the history-future trend prediction module, which jointly uses the history preference vectors and future trend vectors to yield the final click-through rate. We validate the effectiveness of DMR against multiple state-of-the-art micro-video recommenders on two publicly available real-world datasets. Extensive analysis further demonstrates the superiority of modeling dynamic multi-trends for micro-video recommendation.
Article
The music domain is among the most important ones for adopting recommender systems technology. In contrast to most other recommendation domains, which predominantly rely on collaborative filtering (CF) techniques, music recommenders have traditionally embraced content-based (CB) approaches. In the past years, music recommendation models that leverage collaborative and content data – which we refer to as content-driven models – have been replacing pure CF or CB models. In this survey, we review 55 articles on content-driven music recommendation. Based on a thorough literature analysis, we first propose an onion model comprising five layers, each of which corresponds to a category of music content we identified: signal, embedded metadata, expert-generated content, user-generated content, and derivative content. We provide a detailed characterization of each category along several dimensions. Second, we identify six overarching challenges, according to which we organize our main discussion: increasing recommendation diversity and novelty, providing transparency and explanations, accomplishing context-awareness, recommending sequences of music, improving scalability and efficiency, and alleviating cold start. Each article addresses one or more of these challenges and is categorized according to the content layers of our onion model, the article’s goal(s), and main methodological choices. Furthermore, articles are discussed in temporal order to shed light on the evolution of content-driven music recommendation strategies. Finally, we provide our personal selection of the persisting grand challenges which are still waiting to be solved in future research endeavors.
Chapter
Recently, Contrastive Learning (CL) has become a mainstream approach to reduce the influence of data sparsity in recommendation systems. However, existing methods do not fully explore the relationship between the outputs of different Graph Neural Network (GNN) layers and fail to fully utilize the capacity of combining GNN and CL for better recommendation. In this paper, we introduce a novel CL-based approach, called efficient Graph collaborative filtering with multi-layer output-enhanced Contrastive Learning (GmoCL). It maximizes the benefits derived from the information propagation property of GNN with multi-layer aggregation to obtain better node representations. Specifically, the construction of CL tasks involves considerations from both intra-layer and inter-layer perspectives. The goal of the intra-layer CL task is to exploit the semantic similarities of different users (or items) on a certain GNN layer. The inter-layer CL task aims to make the outputs of different GNN layers of the same user (or item) more similar. Additionally, we propose a negative sampling strategy in the inter-layer CL task to learn better node representations. The efficacy of the suggested approach is validated through comprehensive experiments conducted on five publicly available datasets.
Article
Full-text available
Video platforms have become indispensable components within a diverse range of applications, serving various purposes in entertainment, e-learning, corporate training, online documentation, and news provision. As the volume and complexity of video content continue to grow, the need for personalized access features becomes an inevitable requirement to ensure efficient content consumption. To address this need, recommender systems have emerged as helpful tools providing personalized video access. By leveraging past user-specific video consumption data and the preferences of similar users, these systems excel in recommending videos that are highly relevant to individual users. This article presents a comprehensive overview of the current state of video recommender systems (VRS), exploring the algorithms used, their applications, and related aspects. In addition to an in-depth analysis of existing approaches, this review also addresses unresolved research challenges within this domain. These unexplored areas offer exciting opportunities for advancements and innovations, aiming to enhance the accuracy and effectiveness of personalized video recommendations. Overall, this article serves as a valuable resource for researchers, practitioners, and stakeholders in the video domain. It offers insights into cutting-edge algorithms, successful applications, and areas that merit further exploration to advance the field of video recommendation.
Article
Full-text available
In this era of big data explosion, humans widely use the movie recommendation system as an information tool. Two common issues found in machine learning movie recommendation systems remain undeniable: first, cold start, and second, data sparsity. To minimize these problems, a research study was conducted to find a decision-making algorithm to solve the cold start problem in a movie recommendation system with precise parameters. It involves the implementation of the proposed demographic filtering technique with the k-means clustering method. The research findings present the effects of demographic filtering for movie recommendations. Demographic filtering can group users into clusters based on gender, age group, and occupation. The cluster distributions yield representative groups based on the top 100 results of the experiment. The user with the least distance to the cluster center is chosen as the representative of that cluster. Three clusters were examined: Cluster 0, Cluster 1, and Cluster 2. Cluster 0 has a representative group of male college or graduate students aged 25 to 34. Cluster 1 has a representative group of female executive or managerial workers aged 25 to 34. Cluster 2 has a representative group of males in sales or marketing aged 35 to 44. Users from different clusters prefer different movie genres. The preferred movie genres in Cluster 0 are action, adventure, comedy, drama, and war. Cluster 1 prefers the comedy, crime, drama, horror, romance, and sci-fi movie genres. Cluster 2 prefers the action, comedy, drama, film-noir, mystery, and thriller movie genres. This research contributes to demographic filtering studies as an alternative solution for future technical development work.
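
A minimal sketch of the demographic clustering step described above (assuming scikit-learn >= 1.2 for the sparse_output argument): one-hot encode gender, age group, and occupation, run k-means with three clusters, and take the user nearest each centroid as that cluster's representative. The sample rows are invented for illustration.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical user rows: (gender, age_group, occupation).
    users = np.array([
        ["M", "25-34", "college/grad student"],
        ["F", "25-34", "executive/managerial"],
        ["M", "35-44", "sales/marketing"],
        ["M", "25-34", "college/grad student"],
        ["F", "25-34", "executive/managerial"],
    ])
    X = OneHotEncoder(sparse_output=False).fit_transform(users)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    dists = km.transform(X)  # distance of each user to each centroid
    for c in range(3):
        members = np.where(km.labels_ == c)[0]
        rep = members[np.argmin(dists[members, c])]  # user closest to the center
        print(f"Cluster {c}: representative = {users[rep].tolist()}")
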
Conference Paper
Full-text available
During the COVID-19 lockdowns in South Africa, undergraduate laboratory sessions were forbidden; in turn, video-based tutorials were proposed as a tentative solution to the lack of in-person practical demonstration sessions. Five videos on electrical engineering topics were filmed, uploaded, and publicly shared on YouTube. An investigation was then conducted into whether such videos are useful for teaching practical engineering content in the university context. This article reports the findings of using YouTube as a platform for sharing and evaluating engineering educational practical tutorial videos. The goal of this article is to introduce YouTube's social media analytics as a tool for educators to evaluate their educational videos. The findings suggest that educators may consider evaluating their videos using social media analytics, but these analytics should be reviewed critically and should comprise several metrics measured over time. Understanding YouTube's recommender system and its influence on the platform is also an important factor in evaluating one's video content.
Article
Programming online judges (POJs) are widely used to train programming skills, and exercise-recommendation algorithms in POJs have attracted wide attention. Current programming recommendation algorithms cannot make full use of the feedback on user-item pairs and cannot effectively express students' mastery of exercises. We therefore propose a dual-track feedback aggregation recommendation model for programming training (DTFARec). In this model, a multiple-type feedback fusion mechanism (MTFFM) and a dual-track method (DTM) are proposed to address this problem and better express students' mastery of exercises. The MTFFM uses an attention mechanism to learn from different feedback information, and the DTM fuses information from both the feedback and interactive aspects. Experimental results on a real-world dataset show that the model outperforms the best-performing benchmark and that our method effectively models students' mastery of exercises.
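A minimal sketch of the attention-based fusion idea behind the MTFFM follows; the class name, tensor shapes, and feedback types are illustrative assumptions, not the paper's code. Several feedback embeddings for a user-exercise pair are fused with learned attention weights.

```python
# Sketch of attention-weighted fusion over multiple feedback embeddings.
import torch
import torch.nn as nn

class FeedbackAttentionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # scores each feedback type

    def forward(self, feedback):              # feedback: [batch, n_types, dim]
        weights = torch.softmax(self.score(feedback), dim=1)  # [batch, n_types, 1]
        return (weights * feedback).sum(dim=1)                # fused: [batch, dim]

# e.g. fusing "correct/incorrect", "attempt count", "time spent" embeddings:
fusion = FeedbackAttentionFusion(dim=32)
out = fusion(torch.randn(8, 3, 32))           # -> [8, 32]
```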
Chapter
This book presents an integrated collection of representative approaches for scaling up machine learning and data mining methods on parallel and distributed computing platforms. Demand for parallelizing learning algorithms is highly task-specific: in some settings it is driven by the enormous dataset sizes, in others by model complexity or by real-time performance requirements. Making task-appropriate algorithm and platform choices for large-scale machine learning requires understanding the benefits, trade-offs and constraints of the available options. Solutions presented in the book cover a range of parallelization platforms from FPGAs and GPUs to multi-core systems and commodity clusters, concurrent programming frameworks including CUDA, MPI, MapReduce and DryadLINQ, and learning settings (supervised, unsupervised, semi-supervised and online learning). Extensive coverage of parallelization of boosted trees, SVMs, spectral clustering, belief propagation and other popular learning algorithms and deep dives into several applications make the book equally useful for researchers, students and practitioners.
Article
Neural graph collaborative filtering has received great recent attention due to its power of encoding the high-order neighborhood via the backbone graph neural networks. However, their robustness against noisy user-item interactions remains largely unexplored. Existing work on robust collaborative filtering mainly improves the robustness by denoising the graph structure, while recent progress in other fields has shown that directly adding adversarial perturbations in the embedding space can significantly improve the model robustness. In this work, we propose to improve the robustness of neural graph collaborative filtering via both denoising in the structure space and perturbing in the embedding space. Specifically, in the structure space, we measure the reliability of interactions and further use it to affect the message propagation process of the backbone graph neural networks; in the embedding space, we add in-distribution perturbations by mimicking the behavior of adversarial attacks and further combine them with contrastive learning to improve performance. Extensive experiments have been conducted on four benchmark datasets to evaluate the effectiveness and efficiency of the proposed approach. The results demonstrate that the proposed approach outperforms recent neural graph collaborative filtering methods, especially when there are injected noisy interactions in the training data.
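A minimal sketch of the embedding-space perturbation idea, as an assumption-laden illustration rather than the authors' implementation: perturb embeddings along the loss gradient, then treat the clean and perturbed views as a contrastive positive pair.

```python
# Sketch of FGSM-style in-distribution perturbation of embeddings.
import torch
import torch.nn.functional as F

def perturb_embeddings(emb, loss, eps=0.1):
    """emb: embedding tensor in the computation graph; loss: scalar
    recommendation loss (e.g. BPR). Returns a gradient-aligned perturbed view."""
    grad = torch.autograd.grad(loss, emb, retain_graph=True)[0]
    delta = eps * F.normalize(grad, dim=-1)   # small, bounded step along the gradient
    return emb + delta

# Hypothetical usage inside a training step:
# rec_loss = bpr_loss(user_emb, pos_emb, neg_emb)
# adv_user = perturb_embeddings(user_emb, rec_loss)
# cl_loss  = info_nce(user_emb, adv_user)    # clean vs. perturbed as positives
```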
Article
Full-text available
Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is infeasible for large video collections. In this paper we survey several methods aiming to automate this time- and resource-consuming process. Good reviews of single-modality video indexing have appeared in the literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in collaborative fashion. Therefore, instead of separately treating the different information sources involved and their specific algorithms, we focus on the similarities and differences between the modalities. To that end we put forward a unifying multimodal framework, which views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types for which automatic methods are found in the literature. It furthermore forms the basis for categorizing these different methods.
Conference Paper
Full-text available
Collaborative filtering aims at helping users find items they should appreciate from huge catalogues. In that field, we can distinguish user-based, item-based and model-based approaches. For each of them, many options play a crucial role in their performance, in particular the similarity function defined between users or items, the number of neighbors considered for user- or item-based approaches, the number of clusters for model-based approaches using clustering, and the prediction function used. In this paper, we review the main collaborative filtering methods proposed in the literature and compare them on the same widely used real dataset, MovieLens, using the same widely used performance measure, Mean Absolute Error (MAE). This study thus allows us to highlight the advantages and drawbacks of each approach, and to propose some default options that we think should be used when applying a given approach or designing a new one.
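As a concrete example of the user-based approach being compared, here is a minimal sketch; the rating matrix, the zero-means-unrated convention, and cosine similarity over full rating rows are simplifying assumptions for illustration. A rating is predicted as the similarity-weighted average of the k most similar co-raters' ratings; MAE is then the mean absolute gap between predictions and held-out ratings.

```python
# Sketch of user-based collaborative filtering with cosine neighbors.
import numpy as np

def predict(R, u, i, k=2):
    raters = np.where(R[:, i] > 0)[0]         # users who rated item i
    raters = raters[raters != u]
    sims = np.array([
        R[u] @ R[v] / (np.linalg.norm(R[u]) * np.linalg.norm(R[v]) + 1e-9)
        for v in raters
    ])
    order = np.argsort(-sims)[:k]             # k nearest neighbors
    top, w = raters[order], sims[order]
    return (w @ R[top, i]) / (w.sum() + 1e-9)

R = np.array([[5, 3, 0], [4, 0, 4], [1, 1, 5]], dtype=float)
print(predict(R, u=0, i=2))                   # predicted rating for user 0, item 2
# MAE over held-out pairs: np.mean(np.abs(predicted - actual))
```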
Article
This paper presents an overview of the field of recommender systems and describes the current generation of recommendation methods that are usually classified into the following three main categories: content-based, collaborative, and hybrid recommendation approaches. This paper also describes various limitations of current recommendation methods and discusses possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications. These extensions include, among others, an improvement of understanding of users and items, incorporation of the contextual information into the recommendation process, support for multicriteria ratings, and a provision of more flexible and less intrusive types of recommendations.
Conference Paper
An approach to semi-supervised learning is proposed that is based on a Gaussian random field model. Labeled and unlabeled data are represented as vertices in a weighted graph, with edge weights encoding the similarity between instances. The learning problem is then formulated in terms of a Gaussian random field on this graph, where the mean of the field is characterized in terms of harmonic functions, and is efficiently obtained using matrix methods or belief propagation. The resulting learning algorithms have intimate connections with random walks, electric networks, and spectral graph theory. We discuss methods to incorporate class priors and the predictions of classifiers obtained by supervised learning. We also propose a method of parameter learning by entropy minimization, and show the algorithm's ability to perform feature selection. Promising experimental results are presented for synthetic data, digit classification, and text classification tasks.
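The harmonic-function mean has a well-known closed form: with the combinatorial Laplacian L = D - W partitioned into labeled (l) and unlabeled (u) blocks, the labels on the unlabeled nodes are f_u = L_uu^{-1} W_ul f_l. A minimal numerical sketch on a toy graph (the graph and labels are made up for illustration):

```python
# Sketch of the harmonic-function solution for semi-supervised labels.
import numpy as np

W = np.array([[0, 1, 1, 0],                   # symmetric similarity graph
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
labeled, unlabeled = [0, 1], [2, 3]
f_l = np.array([1.0, 0.0])                    # known labels on nodes 0 and 1

D = np.diag(W.sum(axis=1))
L = D - W
f_u = np.linalg.solve(L[np.ix_(unlabeled, unlabeled)],
                      W[np.ix_(unlabeled, labeled)] @ f_l)
print(f_u)                                    # harmonic labels for nodes 2, 3
```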
Conference Paper
We consider the problem of multiclass classification where both labeled and unlabeled data points are given. We introduce and demonstrate a new approach for estimating a distribution over the missing labels where data points are viewed as nodes of a graph, and pairwise similarities are used to derive a transition probability matrix P for a Markov random walk between them. The algorithm associates each point with a particle which moves between points according to P. Labeled points are set to be absorbing states of the Markov random walk, and the probability of each particle to be absorbed by the different labeled points, as the number of steps increases, is then used to derive a distribution over the associated missing label. A computationally efficient algorithm to implement this is derived and demonstrated on both real and artificial data sets, including a numerical comparison with other methods.
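A minimal sketch of the absorption computation on a toy graph (the transition blocks are made-up numbers): with the transient-to-transient block Q and transient-to-absorbing block R of P, the absorption probabilities are B = (I - Q)^{-1} R, and row i of B is exactly the label distribution for unlabeled point i.

```python
# Sketch of absorption probabilities in an absorbing Markov random walk.
import numpy as np

# Row-stochastic transition matrix from pairwise similarities (assumed).
# States 0-1 are labeled (absorbing); states 2-3 are unlabeled (transient).
Q = np.array([[0.0, 0.4],                     # transient -> transient
              [0.4, 0.0]])
R = np.array([[0.5, 0.1],                     # transient -> absorbing
              [0.1, 0.5]])

B = np.linalg.solve(np.eye(2) - Q, R)         # B = (I - Q)^{-1} R
print(B)                                      # row i: P(absorbed at each label)
```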
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
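The model reduces to two user-supplied functions; the following minimal in-process sketch simulates the runtime's shuffle phase with a dictionary (the distribution, scheduling, and fault tolerance described above are elided) using the classic word-count example:

```python
# Sketch of the MapReduce programming model, simulated in one process.
from collections import defaultdict

def map_fn(doc):
    for word in doc.split():
        yield (word, 1)                       # emit intermediate (key, value)

def reduce_fn(word, counts):
    return (word, sum(counts))                # aggregate values per key

docs = ["the quick brown fox", "the lazy dog", "the fox"]
groups = defaultdict(list)                    # "shuffle": group values by key
for doc in docs:
    for key, value in map_fn(doc):
        groups[key].append(value)
print(dict(reduce_fn(k, v) for k, v in groups.items()))
```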
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google.
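The PageRank computation at the heart of the system can be sketched as a damped power iteration; this is a toy illustration, not Google's implementation, and it assumes every page has at least one outlink (no dangling pages):

```python
# Sketch of PageRank via damped power iteration.
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-8):
    n = adj.shape[0]
    P = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic link matrix
    r = np.full(n, 1.0 / n)                   # start from the uniform vector
    while True:
        r_new = (1 - damping) / n + damping * (r @ P)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

adj = np.array([[0, 1, 1],                    # tiny example web graph
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
print(pagerank(adj))                          # stationary importance scores
```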
Article
Abstract: "In traditional machine learning approaches to classification, one uses only a labeled set to train the classifier. Labeled instances however are often difficult, expensive, or time consuming to obtain, as they require the efforts of experienced human annotators. Meanwhile unlabeled data may be relatively easy to collect, but there has been [sic] few ways to use them. Semi-supervised learning addresses this problem by using large amount [sic] of unlabeled data, together with the labeled data, to build better classifiers. Because semi-supervised learning requires less human effort and gives higher accuracy, it is of great interest both in theory and in practice. We present a series of novel semi-supervised learning approaches arising from a graph representation, where labeled and unlabeled instances are represented as vertices, and edges encode the similarity between instances. They address the following questions: How to use unlabeled data? (label propagation); What is the probabilistic interpretation? (Gaussian fields and harmonic functions); What if we can choose labeled data? (active learning); How to construct good graphs? (hyperparameter learning); How to work with kernel machines like SVM? (graph kernels); How to handle complex data like sequences? (kernel conditional random fields); How to handle scalability and induction? (harmonic mixtures). An extensive literature review is included at the end." "May 2005." Thesis (Ph. D.)--Carnegie Mellon University, 2005. Includes bibliographical references and index.
Conference Paper
Communications between individuals can be represented by (weighted, multi-) graphs. Many applications operate on communication graphs associated with telephone calls, emails, instant messages (IM), blogs, web forums, e-business relationships and so on. These applications include identifying repetitive fraudsters, message board aliases, multiusage of IP addresses, etc. Tracking electronic identities in communication networks can be achieved if we have a reliable "signature" for nodes and activities. While many examples of ad hoc signatures can be proposed for particular tasks, what is needed is a systematic study of the principles behind the usage of signatures for any task. We develop a formal framework for the use of signatures in communication graphs and identify three fundamental properties that are natural to signature schemes: persistence, uniqueness and robustness. We argue for the importance of these properties by showing how they impact a set of applications. We then explore several signature schemes - previously defined and new - in our framework and evaluate them on real data in terms of these properties. This provides insights into suitable signature schemes for desired applications. Finally, as case studies, we focus on two concrete applications in enterprise network traffic. We apply signature schemes to these problems and demonstrate their effectiveness.
Article
An n-point metric space (X, D) can be represented by an n × n table specifying the distances. Such tables arise in many diverse areas. For example, consider the following scenario in microbiology: X is a collection of bacterial strains, and for every two strains, one is given their dissimilarity (computed, say, by comparing their DNA). It is difficult to see any structure in a large table of numbers, and so we would like to represent a given metric space in a more comprehensible way. For example, it would be very nice if we could assign to each x ∈ X a point f(x) in the plane in such a way that D(x, y) equals the Euclidean distance of f(x) and f(y). Such a representation would allow us to see the structure of the metric space: tight clusters, isolated points, and so on. Another advantage would be that the metric would now be represented by only 2n real numbers, the coordinates of the n points in the plane, instead of the n(n-1)/2 numbers as before. Moreover, many quantities concern...
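A standard way to compute such a planar representation is classical multidimensional scaling; a minimal sketch follows (the toy distance table is made up, and general metrics embed only approximately, which is the point of the theory surveyed above):

```python
# Sketch of classical MDS: embed an n x n distance table into the plane.
import numpy as np

def classical_mds(D, dim=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]        # keep the top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

D = np.array([[0, 1, 2],                      # toy 3-point metric
              [1, 0, 1],
              [2, 1, 0]], dtype=float)
print(classical_mds(D))                       # 2-D coordinates per point
```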
Article
To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classification problems.
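A minimal sketch of this representation (the similarity graph and time scale t below are illustrative): each point is represented by its t-step random-walk distribution P^t, and an unlabeled point is scored by the walk mass it places on each class of labeled points; t is the regularization knob described above.

```python
# Sketch of t-step Markov random-walk representations for classification.
import numpy as np

W = np.array([[0, 1, 1, 0],                   # similarity graph (assumed)
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
P = W / W.sum(axis=1, keepdims=True)          # one-step transition matrix
t = 3                                         # time scale of the walk
Pt = np.linalg.matrix_power(P, t)             # t-step walk distributions

labels = {0: 0, 1: 1}                         # labeled nodes and their classes
for u in [2, 3]:                              # unlabeled nodes
    score = [Pt[u, [n for n, c in labels.items() if c == k]].sum()
             for k in (0, 1)]
    print(f"node {u}: class scores {score}")
```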
China Post. YouTube launches site in traditional Chinese. http://www.chinapost.com.tw/.
KDD Cup 2007. Who rated what in the Netflix challenge. http://www.cs.uic.edu/~liub/netflix-kdd-cup-2007.html.