Sergey Brin's research while affiliated with Stanford University and other places

Publications (25)

Article
Association rules are useful for determining correlations between attributes of a relation and have applications in the marketing, financial, and retail sectors. Furthermore, optimized association rules are an effective way to focus on the most interesting characteristics involving certain attributes. Optimized association rules are permitted to co...
Article
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can i...
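To make the itemset-counting task concrete, here is a minimal sketch in the classic Apriori style that this paper improves on (one full pass per itemset size; the toy transactions and min_support threshold are invented for illustration, and this is not the fewer-pass algorithm the abstract describes):

```python
from itertools import combinations
from collections import Counter

# Toy market-basket data: each transaction is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk", "beer"},
    {"bread", "beer"},
]
min_support = 2  # minimum number of transactions an itemset must appear in

def frequent_itemsets(size):
    # One full pass over the data per itemset size, Apriori style.
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

print(frequent_itemsets(1))  # frequent single items
print(frequent_itemsets(2))  # frequent pairs
```

Counting every combination, as above, is exactly the cost that candidate pruning and pass reduction aim to avoid on realistic data.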
Article
A great challenge for data mining techniques is the huge space of potential rules which can be generated. If there are tens of thousands of items, then potential rules involving three items number in the trillions. Traditional data mining techniques rely on downward-closed measures such as support to prune the space of rules. However, in many appli...
Article
Mining for association rules in market basket data has proved a fruitful area of research. Measures such as conditional probability (confidence) and correlation have been used to infer rules of the form “the existence of item A implies the existence of item B.” However, such rules indicate only a statistical relationship between A and B. They do no...
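To see the difference between a confidence-style implication and a genuine statistical relationship, here is a hedged sketch of testing item correlation with a 2×2 contingency table and a chi-squared statistic; the counts are invented, and the paper's actual test setup may differ:

```python
# Chi-squared test of independence for two items A and B in basket data.
# The 2x2 contingency table below is made up: rows = A present/absent,
# columns = B present/absent.
observed = [[60, 40],   # A present:  B present, B absent
            [10, 90]]   # A absent:   B present, B absent
n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Compare observed counts to the counts expected under independence.
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)
print(f"chi-squared = {chi2:.2f}")  # large values argue against independence
```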
Article
The PageRank citation ranking: Bringing order to the Web. Work in progress. URL: http://google.stanford.edu/#backrub/pageranksub.ps
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human...
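As a rough illustration of the computation the abstract describes, here is a minimal PageRank power-iteration sketch; the damping factor of 0.85, the iteration count, and the toy link graph are assumptions for illustration, not values taken from the paper:

```python
# Minimal PageRank power iteration over a toy link graph.
# `graph` maps each page to the pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        # Each page keeps a baseline (1 - d)/n and receives a share of
        # the rank of every page that links to it.
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, links in graph.items():
            share = damping * rank[page] / len(links)
            for target in links:
                new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(graph))  # "c" ends up highest: it collects the most in-links
```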
Conference Paper
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages...
Article
The amount of information available online has grown enormously over the past decade. Fortunately, computing power, disk capacity, and network bandwidth have also increased dramatically. It is currently possible for a university research project to store and process the entire World Wide Web. Since there is a limit on how much text humans can gener...
Conference Paper
The World Wide Web is a vast resource for information. At the same time it is extremely distributed. A particular type of data such as restaurant lists may be scattered across thousands of independent information sources in many different formats. In this paper, we consider the problem of extracting a relation for such a data type from all of the...
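As a loose illustration of pattern-based extraction of such a relation, here is a hedged sketch that pulls (restaurant, city) pairs out of free text with a single hand-written pattern; the sample text and regex are invented, and the paper's approach bootstraps patterns from a small seed set of tuples rather than hard-coding one:

```python
import re

# Invented sample text containing two instances of the target relation.
text = (
    "Chez Panisse, a restaurant in Berkeley, is famous. "
    "Zuni Cafe, a restaurant in San Francisco, serves roast chicken."
)

# A single lexical pattern standing in for the patterns a system would learn.
pattern = re.compile(r"([A-Z][\w' ]+), a restaurant in ([A-Z][\w ]+?),")

for name, city in pattern.findall(text):
    print((name, city))
# ('Chez Panisse', 'Berkeley')
# ('Zuni Cafe', 'San Francisco')
```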
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages...
Article
One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B.” Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some...
Conference Paper
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can i...
Article
One of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, "A customer purchasing item A often also purchases item B." Motivated by the goal of generalizing beyond market baskets...
Article
Given user data, one often wants to find approximate matches in a large database. A good example of such a task is finding images similar to a given image in a large collection of images. We focus on the important and technically difficult case where each data element is high dimensional, or more generally, is represented by a point in a large metr...
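To illustrate the kind of trick that makes search in a general metric space tractable, here is a hedged sketch of pivot-based pruning via the triangle inequality; it is not the data structure from the paper, and the toy points are invented:

```python
import math

def dist(x, y):
    return math.dist(x, y)  # Euclidean here, but any metric works

points = [(0, 0), (1, 1), (5, 5), (9, 9), (2, 0)]
pivot = points[0]
# Precompute each point's distance to the pivot once, at index time.
pivot_dist = [dist(pivot, p) for p in points]

def nearest(query):
    d_qp = dist(query, pivot)
    best, best_d = None, float("inf")
    for p, d_pp in zip(points, pivot_dist):
        # Triangle inequality: |d(q,pivot) - d(p,pivot)| <= d(q,p), so p
        # cannot beat the current best if this lower bound already exceeds it.
        if abs(d_qp - d_pp) >= best_d:
            continue
        d = dist(query, p)
        if d < best_d:
            best, best_d = p, d
    return best, best_d

print(nearest((1.5, 1.0)))  # -> ((1, 1), 0.5)
```

The payoff is that distances to the pivot are computed once, and many candidates are rejected without ever calling the (possibly expensive) distance function on them.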
Article
In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and d...
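As a rough sketch of the detection philosophy, copying can be estimated by breaking documents into overlapping word chunks, hashing them, and measuring overlap; the chunk size, toy documents, and Jaccard-style score below are illustrative assumptions, not the registration scheme of the system in the paper:

```python
def shingles(text, k=4):
    # Set of hashes of overlapping k-word chunks of a document.
    words = text.lower().split()
    return {hash(tuple(words[i:i + k])) for i in range(len(words) - k + 1)}

doc_a = "the quick brown fox jumps over the lazy dog near the river"
doc_b = "a quick brown fox jumps over the lazy dog in the field"

a, b = shingles(doc_a), shingles(doc_b)
# Jaccard overlap of the shingle sets as a rough copying score.
print(f"estimated overlap: {len(a & b) / len(a | b):.2f}")
```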

Citations

... Machine learning tasks on graphs are usually solved in three different ways. Traditionally, tasks on graphs are solved using label propagation [31], PageRank [32], and proximity-based measures such as Adamic/Adar [33] or the Jaccard coefficient [34]. Another group of approaches embeds graphs into a vector space, and the embeddings are then used with traditional machine learning methods such as logistic regression to generate predictions. ...
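For concreteness, here is a small sketch of the two proximity measures named in this excerpt, computed for a node pair (u, v) over an invented toy graph, using the usual definitions:

```python
import math

# Adjacency sets for a small invented undirected graph.
neighbors = {
    "u": {"a", "b", "c"},
    "v": {"b", "c", "d"},
    "a": {"u"}, "b": {"u", "v"}, "c": {"u", "v"}, "d": {"v"},
}

def jaccard(x, y):
    # Fraction of the combined neighborhoods that is shared.
    return len(neighbors[x] & neighbors[y]) / len(neighbors[x] | neighbors[y])

def adamic_adar(x, y):
    # Shared neighbors count more when they themselves have few neighbors.
    return sum(1.0 / math.log(len(neighbors[z]))
               for z in neighbors[x] & neighbors[y])

print(jaccard("u", "v"))      # 2 shared of 4 total -> 0.5
print(adamic_adar("u", "v"))  # b and c each have degree 2 -> 2 / ln(2)
```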
... The problem of plagiarism has grown in the digital era, with resources readily available on the World Wide Web. Plagiarism detection in natural languages by statistical or computerized methods began in the 1990s, pioneered by studies of copy-detection mechanisms in digital documents [1], [2]. Even earlier, detection of code clones and software misuse began in the 1970s with studies that detected programming-code plagiarism in Pascal and C [3], [4]-[7]. ...
... Influence maximization (Kempe & Kleinberg, 2003) covers various optimization methods for selecting a subset of nodes in a network, commonly referred to as seed nodes, so as to maximize the total exposure. Common heuristics for identifying influential nodes use graph-based centrality measures, such as betweenness centrality (Brandes, 2001) or PageRank (Page et al., 1998). However, these algorithms merely seek to maximize connectivity and consider neither differences among users nor the existence of rival forces. ...
... It addresses the problems of multiple dataset scans and huge candidate generation. Other major extensions of Apriori include correlation-based mining [20] and SPADE, a fast and efficient sequence-pattern discovery algorithm [21]. All of the above-mentioned approaches generate a large set of candidate itemsets, and, in most cases, multiple scans of the dataset are required. ...
... Dynamic Itemset Counting [19] is a variation of the Apriori algorithm that scans the database multiple times, requiring more I/O operations, and hence is not cost-effective. The Partition algorithm [20], a two-scan algorithm, generates all potentially large itemsets by logically dividing the database into mutually exclusive partitions. The frequent-pattern tree [21], a prefix-tree structure that avoids candidate generation, requires two scans of the database to mine frequent itemsets. ...
... The equation for the confidence is given in (Han et al. 2011; Agrawal et al. 1993) as confidence(X → Y) = support(X ∪ Y) / support(X). Here, support(X ∪ Y) is the number of transactions containing both itemsets X and Y, and support(X) is the number of transactions containing the itemset X (Prajapati et al. 2017). Lift/interest measures how often X and Y occur together relative to what would be expected if they were statistically independent of each other (Brijs et al. 2003; Brin et al. 1997). The lift of rule X → Y is defined as lift(X → Y) = support(X ∪ Y) / (support(X) · support(Y)), with supports expressed as fractions of all transactions. ...
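A small worked example of these measures over invented transactions (a sketch; the data and resulting numbers are not from the cited works, and support is normalized to a fraction so the lift formula needs no extra factor):

```python
# Four invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk"},
    {"bread", "beer"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / n

x, y = {"bread"}, {"milk"}
confidence = support(x | y) / support(x)
lift = support(x | y) / (support(x) * support(y))
print(f"confidence = {confidence:.2f}, lift = {lift:.2f}")
# confidence = 0.67, lift = 0.89: bread and milk co-occur slightly less
# often than independence would predict on this toy data.
```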
... An importance measure could be based on the number of downloads or an explicit rating system by users. Our system allows using such "base" measures but also computes importance in the style of PageRank [7]. A mashlet acquires importance from its use in important GPs; GPs similarly acquire importance from using important mashlets. ...
... common technical method of IR is to map textual language into symbol vectors which can be easily manipulated mathematically. The result set generated by IR is a rank-ordered list of documents which likely contain information that the user has specified. Examples of common IR systems include Web search engines such as Google (http://www.google.com) [57] and Altavista (http://www.altavista.com), both of which use schemes that include tf-idf. The IR systems provide fast, but not always accurate, answers to the questions posed by Web users. Search engines are designed to handle generic collections of text, based on word frequency, regardless of the content of the collections. Articles ab ...
... The number of references (citations) to a thing is evidence of its importance; many Nobel Prizes are awarded on this basis. Considering this, we can say that highly linked pages are more "important" than pages with few in-links [36]. L. Page and S. Brin proposed the PageRank algorithm in [8, 36, 37], which calculates the importance of web pages using the link structure of the web. ...
... This phenomenon is also encountered in non-traditional education; it is estimated that 10% of certificates earned in massive open online courses (MOOCs) were earned at least partially by creating a separate account to harvest answers (Alexandron et al., 2017; Northcutt et al., 2016). However, new technologies simultaneously provide better opportunities for staff to monitor potential cases of plagiarism (Brin et al., 1995; Ercegovac & Richardson, 2004; Shivakumar & Garcia-Molina, 1995). In particular, where original resources are available in digital form and students submit digital copies of their work, advanced software tools can assist in the automation of plagiarism detection for large classes in higher education (Batane, 2010). ...