# ArnetMiner: extraction and mining of academic social networks



Jie Tang, Jing Zhang

Computer Science Dept.

Tsinghua University, China

jietang@tsinghua.edu.cn

zhangjing0544@gmail.com

Limin Yao, Juanzi Li

Computer Science Dept.

Tsinghua University, China

ylm@keg.cs.tsinghua.edu.cn

ljz@keg.cs.tsinghua.edu.cn

Li Zhang, Zhong Su

IBM, China Research Lab

Beijing, China

lizhang@cn.ibm.com

suzhong@cn.ibm.com

ABSTRACT

This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Specifically, the system focuses on: 1) Extracting researcher profiles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire academic network; and 4) Providing search services for the academic network. So far, 448,470 researcher profiles have been extracted using a unified tagging approach. We integrate publications from online Web databases and propose a probabilistic framework to deal with the name ambiguity problem. Furthermore, we propose a unified modeling approach to simultaneously model topical aspects of papers, authors, and publication venues. Search services such as expertise search and people association search have been provided based on the modeling results. In this paper, we describe the architecture and main features of the system. We also present the empirical evaluation of the proposed methods.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Text Mining, Digital Libraries; H.2.8 [Database Management]: Database Applications

General Terms

Algorithms, Experimentation

Keywords

Social Network, Information Extraction, Name Disambiguation, Topic Modeling, Expertise Search, Association Search

1. INTRODUCTION

Extraction and mining of academic social networks aims at providing comprehensive services in the scientific research field. In an academic social network, people are not only interested in searching for different types of information (such as authors, conferences, and papers), but are also interested in finding semantics-based information (such as structured researcher profiles).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

KDD’08, August 24–27, 2008, Las Vegas, Nevada, USA.

Copyright 2008 ACM 978-1-60558-193-4/08/08 ...$5.00.

Many issues in academic social networks have been investigated and several systems have been developed (e.g., DBLP, CiteSeer, and Google Scholar). However, the issues were usually studied separately and the methods proposed are not sufficient for mining the entire academic network. Two reasons are as follows: 1) Lack of semantics-based information. The social information obtained from user-entered profiles or by extraction using heuristics is sometimes incomplete or inconsistent; 2) Lack of a unified approach to efficiently model the academic network. Previously, different types of information in the academic network were modeled individually, so the dependencies between them could not be captured accurately.

In this paper, we try to address these two challenges with novel approaches. We have developed an academic search system, called ArnetMiner (http://www.arnetminer.org). Our objective in this system is to answer the following questions: 1) how to automatically extract researcher profiles from the Web? 2) how to integrate the extracted information (e.g., researchers’ profiles and publications) from different sources? 3) how to model different types of information in a unified approach? and 4) how to provide powerful search services based on the constructed network?

(1) We extend the Friend-Of-A-Friend (FOAF) ontology [9] as the profile schema and propose a unified approach based on Conditional Random Fields to extract researcher profiles from the Web.

(2) We integrate the extracted researcher profiles and the crawled publication data from the online digital libraries. We propose a unified probabilistic framework for dealing with the name ambiguity problem in the integration.

(3) We propose three generative probabilistic models for simultaneously modeling topical aspects of papers, authors, and publication venues.

(4) Based on the modeling results, we implement several search services such as expertise search and association search.

We conducted empirical evaluations of the proposed methods. Experimental results show that our proposed methods significantly outperform the baseline methods for dealing with the above issues.

Our contributions in this paper include: (1) a unified tagging approach to researcher profile extraction, (2) a unified probabilistic framework for name disambiguation, and (3) three probabilistic topic models to simultaneously model the different types of information.

The paper is organized as follows. In Section 2, we review the related work. In Section 3, we give an overview of the system. In Section 4, we present our approach to researcher profiling. In Section 5, we describe the probabilistic framework for name disambiguation. In Section 6, we propose three generative probabilistic models to model the academic network. Section 7 illustrates several search services provided in ArnetMiner based on the modeling results. We conclude the paper in Section 8.

2. RELATED WORK

2.1 Person Profile Extraction

Several research efforts have been made for extracting person profiles. For example, Yu et al. [32] propose a two-stage extraction method for identifying personal information from resumes. The first stage segments a resume into different types of blocks and the second stage extracts the detailed information such as Address and Email from the identified blocks. However, the method formalizes the profile extraction as several separate steps and conducts extraction in a more or less ad-hoc manner.

A few efforts have also focused on the extraction of contact information from emails or from the Web. For example, Kristjansson et al. [19] have developed an interactive information extraction system to assist the user to populate a contact database from emails. In comparison, profile extraction consists of contact information extraction as well as other different subtasks.

2.2 Name Disambiguation

A number of approaches have been proposed for name disambiguation. For example, Bekkerman and McCallum [6] present two unsupervised methods to distinguish Web pages of different persons with the same name: one is based on the link structure of the Web pages and the other is based on the textual content. However, the methods cannot incorporate the relationships between data.

Han et al. [15] propose an unsupervised learning approach using K-way spectral clustering. Tan et al. [27] propose a method for name disambiguation based on hierarchical clustering. However, methods of this kind cannot capture the relationships either.

Two supervised methods are proposed by Han et al. [14]. For each given name, the methods learn a specific classification model from the training data and use the model to predict whether a new paper is authored by a specific author with the name. However, the methods are user-dependent. It is impractical to train thousands of models for all individuals in a large digital library.

2.3 Topic Modeling

Considerable work has been conducted for investigating topic models or latent semantic structures for text mining. For example, Hofmann [17] proposes the probabilistic latent semantic indexing (pLSI) and applies it to information retrieval (IR).

Blei et al. [8] introduce a three-level Bayesian network, called Latent Dirichlet Allocation (LDA). The basic generative process of LDA closely resembles pLSI except that in pLSI, the topic mixture is conditioned on each document while in LDA, the topic mixture is drawn from a conjugate Dirichlet prior that remains the same for all documents.
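As a rough illustration of the LDA generative process summarized above, the following sketch samples one toy document. The vocabulary, the topic-word distributions `phi`, and the hyperparameter `alpha` are made-up values for illustration, and the Dirichlet draw is implemented by normalizing Gamma samples; none of this is taken from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, phi, vocab, length, rng):
    """Generate one document: topic mixture from the prior, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # topic of this word
        w = rng.choices(vocab, weights=phi[z])[0]              # word drawn from that topic
        words.append(w)
    return theta, words

# Hypothetical two-topic setup over a four-word vocabulary.
vocab = ["network", "graph", "query", "index"]
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
alpha = [0.5, 0.5]
theta, doc = generate_document(alpha, phi, vocab, 8, random.Random(0))
```

The same `alpha` is shared by every document, which is the property the text contrasts with pLSI.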

Some other work has been conducted for modeling both author interests and document contents together. For example, the Author model [21] is aimed at modeling the author interests with a one-to-one correspondence between topics and authors. The Author-Topic model [25] [26] integrates the authorship into the topic model and can find a topic mixture over documents and authors.

Compared with the previous topic modeling work, in this paper, we propose a unified topic model to simultaneously model the topical aspects of different types of information in the academic network.

2.4 Academic Search

For academic search, several research issues have been intensively investigated, for example expert finding and association search. Expert finding is one of the most important issues for mining social networks. For example, both Nie et al. [24] and Balog et al. [4] propose extended language models to address the expert finding problem. Since 2005, the Text REtrieval Conference (TREC) has provided a platform with the Enterprise Search Track for researchers to empirically assess their methods for expert finding [13].

Association search aims at finding connections between people. For example, the ReferralWeb [18] system helps people search and explore social networks on the Web. Adamic and Adar [1] have investigated the problem of association search in email networks. However, existing work mainly focuses on how to find connections between people and ignores how to rank the found associations.

In addition, a few systems have been developed for academic search, such as scholar.google.com, libra.msra.cn, citeseer.ist.psu, and Rexa.info. Though much work has been performed, to the best of our knowledge, the issues we focus on in this work (i.e., profile extraction, name disambiguation, and academic network modeling) have not been sufficiently investigated. Our system addresses all these problems holistically.

3. OVERVIEW OF ARNETMINER

Figure 1 shows the architecture of our ArnetMiner system. The system consists of five main components:

1. Extraction: it focuses on extracting researcher profiles from the Web automatically. It first collects and identifies one’s homepage from the Web, then uses a unified approach to extract the profile properties from the identified document. It extracts publications from online digital libraries using rules.

2. Integration: it integrates the extracted researchers’ profiles and the extracted publications by using the researcher name as the identifier. A probabilistic framework has been proposed to deal with the name ambiguity problem in the integration. The integrated data is stored into a researcher network knowledge base (RNKB).

3. Storage and Access: it provides storage and indexing for the extracted/integrated data in the RNKB. Specifically, for storage it employs MySQL and for indexing it employs the inverted file indexing method [3].

4. Modeling: it utilizes a generative probabilistic model to simultaneously model different types of information. It estimates a topic distribution for each type of information.

5. Search Services: based on the modeling results, it provides several search services: expertise search and association search. It also provides other services, e.g., author interest finding and academic suggestion (such as paper suggestion and citation suggestion).
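The inverted file indexing mentioned for the Storage and Access component can be sketched as follows. This is a minimal, generic illustration with made-up documents and whitespace tokenization; it is not the index implementation used in ArnetMiner.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical toy collection.
docs = {
    1: "expertise search in academic networks",
    2: "name disambiguation in digital libraries",
    3: "academic search with topic models",
}
index = build_inverted_index(docs)
hits = search(index, "academic search")
```

Looking up a query then only touches the posting lists of its terms rather than scanning every stored record.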

It is challenging in many ways to implement these components. First, previous extraction work has usually been conducted on a specific data set. It is not immediately clear whether such methods can be directly adapted to the global Web. Secondly, it is unclear how to deal with the disambiguation problem by making full use of the extracted information, for example, how to use the relationships between publications. Thirdly, there is no existing model that can simultaneously model the different types of information in the academic network. Finally, different strategies for modeling the academic network have different behaviors. It is necessary to study how different they are and which one would be the best for academic search.

Based on these considerations, for profile extraction, name disambiguation, and modeling, we propose new approaches to overcome the drawbacks that exist in the traditional methods. For storage and access, we utilize the classical methods, because these issues have been intensively investigated and the existing methods can result in good performance in our system.

Figure 1: Architecture of ArnetMiner. The diagram shows the five components and their parts: 1. Extraction (document collection, profile extraction, and publication collection from the Web and from ACM, DBLP, and Citeseer); 2. Integration (name disambiguation); 3. Storage and Access (RNKB, metadata, storage, indexing, access interface); 4. Modeling (modeling academic social networks); 5. Search Services (expert search, association search, paper search, conference search, author interest finding, hot-topic finding, survey paper finding, academic suggestion).

4. RESEARCHER PROFILE EXTRACTION

4.1 Problem Definition

Profile extraction is the process of extracting the value of each property in a person profile. We define the schema of the researcher profile (as shown in Figure 2) by extending the FOAF ontology [9].

Figure 2: The schema of researcher profile. A Researcher has the properties Name, Homepage, Person Photo, Position, Affiliation, Address, Phone, Fax, Email, Research_Interest, Phduniv, Phdmajor, Phddate, Msuniv, Msmajor, Msdate, Bsuniv, Bsmajor, and Bsdate, and is linked through the author / authored_by relations to Publication, which has the properties Title, Publication_venue, Start_page, End_page, Date, Publisher, and Download_URL.

We perform a statistical study on 1,000 randomly selected researchers from ArnetMiner and find that it is non-trivial to perform profile extraction from the Web. We observed that 85.62% of the researchers are faculty members from universities and 14.38% are from company research centers. Researchers from the same company may share a template-based homepage. However, different companies have different templates. For researchers from universities, the layout and the content of their homepages vary largely. We have also found that 71.88% of the 1,000 Web pages are researchers’ homepages and the rest are pages introducing the researchers. Characteristics of the two types of pages significantly differ from each other.

We also analyze the content of the Web pages and find that about 40% of the profile properties are presented in tables/lists and the others are presented in natural language text. This suggests that a method that does not use global context information in the page would be ineffective. The statistical study also reveals that strong dependencies exist between different profile properties. For example, in 1,325 cases (14.54%) in our data, extracting a property requires the extraction results of other properties. An ideal method should consider processing all the subtasks holistically.

4.2 A Unified Approach to Profiling

4.2.1 Process

The proposed approach consists of three steps: relevant page identification, preprocessing, and extraction. In relevant page identification, given a researcher name, we first get a list of web pages by a search engine (we use the Google API) and then identify the homepage/introducing page using a binary classifier. We use Support Vector Machines (SVM) [12] as the classification model and define features such as whether the title of the page contains the person name and whether the URL address (partly) contains the person name. The classifier achieves an F1-measure of 92.39%. In preprocessing, (a) we separate the text into tokens and (b) we assign possible tags to each token. The tokens form the basic units and the pages form the sequences of units in the tagging problem. In tagging, given a sequence of units, we determine the most likely corresponding sequence of tags by using a trained tagging model. Each tag corresponds to a property defined in Figure 2, e.g., ‘Position’. In this paper, we make use of Conditional Random Fields (CRFs) [20] as the tagging model. Next we describe the steps (a) and (b) in detail.

(a) We identify tokens in the Web page using heuristics. We define five types of tokens: ‘standard word’, ‘special word’, ‘<image>’ token, term, and punctuation mark. Standard words are unigram words in natural language. Special words include email, URL, date, number, percentage, words containing special terms (e.g. ‘Ph.D.’ and ‘.NET’), and special symbols (e.g. ‘===’ and ‘###’). We identify special words by using regular expressions. ‘<image>’ tokens (used for identifying person photos and email addresses) are ‘<image>’ tags in the HTML file. Terms are base noun phrases extracted from the Web page by using a tool based on technologies proposed in [30].

(b) We assign tags to each token based on the token type. For example, for a standard word, we assign all possible tags corresponding to all properties. For a special word, we assign tags indicating Position, Affiliation, Email, Address, Phone, Fax, Bsdate, Msdate, and Phddate. For a ‘<image>’ token, we assign two tags: Photo and Email (because an email address is sometimes shown as an image). After each token is assigned several possible tags, we can perform most of the profiling tasks using the tags (extracting 19 properties defined in Figure 2).
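Steps (a) and (b) can be sketched roughly as below. The regular expressions are hypothetical stand-ins for a few of the special-word kinds named in the text, term tokens (base noun phrases) are left out because they need a separate chunker, and returning no candidate tags for punctuation is an assumption rather than something the paper states.

```python
import re

# Hypothetical patterns for some "special word" kinds; the paper does not list its expressions.
SPECIAL_PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("url", re.compile(r"https?://\S+")),
    ("percentage", re.compile(r"\d+(?:\.\d+)?%")),
    ("number", re.compile(r"\d+(?:\.\d+)?")),
    ("symbol_run", re.compile(r"([=#*_-])\1{2,}")),  # e.g. '===' or '###'
]

# Candidate tags for special words and image tokens, as listed in step (b).
SPECIAL_TAGS = ["Position", "Affiliation", "Email", "Address", "Phone",
                "Fax", "Bsdate", "Msdate", "Phddate"]
IMAGE_TAGS = ["Photo", "Email"]

def token_type(token):
    """Classify a token as image, special, punctuation, or standard."""
    if token == "<image>":
        return "image"
    if any(pattern.fullmatch(token) for _name, pattern in SPECIAL_PATTERNS):
        return "special"
    if re.fullmatch(r"[^\w\s]+", token):
        return "punctuation"
    return "standard"

def candidate_tags(token, all_tags):
    """Return the tags a token may take; standard words keep every tag."""
    kind = token_type(token)
    if kind == "special":
        return SPECIAL_TAGS
    if kind == "image":
        return IMAGE_TAGS
    if kind == "standard":
        return all_tags
    return []  # assumption: punctuation carries no property tag

tokens = ["Professor", ",", "jietang@tsinghua.edu.cn", "<image>", "==="]
types = [token_type(t) for t in tokens]
```

Restricting the candidate tags per token type shrinks the label space that the tagging model has to search at each position.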

4.2.2 CRF model and Features

We employ Conditional Random Fields (CRF) as the tagging model. A CRF models the conditional probability of a sequence of tags given a sequence of observations [20]. For tagging, a trained CRF model is used to find the sequence of tags $Y^{*}$ having the highest likelihood, $Y^{*} = \arg\max_{Y} P(Y \mid X)$. The CRF model is built with the labeled data by means of an iterative algorithm based on Maximum Likelihood Estimation.

Three types of features were defined in the CRF model: content features, pattern features, and term features. The features were defined for different kinds of tokens. Table 1 shows the defined features. We incorporate the defined features into the CRF model by defining Boolean-valued feature functions. Finally, 108,409 features were used in our experiments.
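Finding the highest-scoring tag sequence in a linear-chain model is commonly done with Viterbi decoding. The sketch below assumes additive log-potentials supplied as two hypothetical callables, `node_score` and `edge_score`, standing in for the weighted feature sums of a trained CRF; the toy scores and tag set are invented for illustration.

```python
def viterbi(n_positions, tags, node_score, edge_score):
    """Return the highest-scoring tag sequence under additive log-potentials."""
    best = [{tag: node_score(0, tag) for tag in tags}]
    back = []
    for t in range(1, n_positions):
        scores, pointers = {}, {}
        for tag in tags:
            # Best previous tag for reaching `tag` at position t.
            prev = max(tags, key=lambda p: best[-1][p] + edge_score(p, tag))
            scores[tag] = best[-1][prev] + edge_score(prev, tag) + node_score(t, tag)
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy example: three tokens, three tags, made-up scores.
tags = ["Position", "Affiliation", "Other"]
node = [
    {"Position": 2.0, "Affiliation": 0.0, "Other": 0.5},   # "Professor"
    {"Position": 0.0, "Affiliation": 0.2, "Other": 1.0},   # "at"
    {"Position": 0.0, "Affiliation": 2.0, "Other": 0.3},   # "Tsinghua"
]
path = viterbi(
    len(node), tags,
    node_score=lambda t, tag: node[t][tag],
    edge_score=lambda prev, tag: 0.5 if prev == tag else 0.0,
)
```

The `edge_score` term plays the role of transition features, which couple the decision for one token to the decisions for its neighbors.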

Table 1: Content features, pattern features, and term features.

| Feature type | Token type | Feature | Description |
|---|---|---|---|
| Content | Standard token | Word | Word in the token |
| Content | Standard token | Morphology | Morphology of the word |
| Content | Image token | Size | The size of the image |
| Content | Image token | Height/width ratio | The ratio of height/width of the image |
| Content | Image token | Image format | The format of the image (e.g., ‘JPG’) |
| Content | Image token | Image color | The number of unique colors; the number of bits per pixel |
| Content | Image token | Filename | Words in the filename |
| Content | Image token | Face detection | If the image contains a person face recognized by (opencvlibrary.sf.net) |
| Content | Image token | ALT | Words in ‘alt’ attribute of the image |
| Pattern | All tokens | Positive word | If the token contains a pre-defined positive word |
| Pattern | All tokens | Negative word | If the token contains a pre-defined negative word |
| Pattern | All tokens | Special token | If the token contains a special pattern |
| Pattern | All tokens | Name | If the token contains the researcher name |
| Pattern | All tokens | #Line break | How many line breaks before the current line |
| Term | Term token | Term | If the token contains a base noun phrase |
| Term | Term token | Dictionary | If the token contains a word in a dictionary |

4.3 Profile Extraction Performance

For evaluating our profiling method, we randomly chose 1,000 researcher names in total from our researcher network. We used the method described in Section 4.2.1 to find the researchers’ homepages or introducing pages. If the method could not find a Web page for a researcher, we removed the researcher name from the data set. We finally obtained 898 Web pages (one for each researcher). Seven human annotators conducted annotation on the Web pages. A specification was created to guide the annotation process. Disagreements in the annotation were resolved by majority voting. In the experiments, we conducted evaluations in terms of precision, recall, and F1-measure for each profile property.

Figure 3: Contribution of features (%). Bar chart of average F1-measure (axis from 68 to 84) for four feature sets: content, content + term, content + pattern, and all.

We defined baselines for profile extraction. We used the rule learning and the classification based approaches as baselines. For the former, we employed the Amilcare tool, which is based on a rule induction algorithm: LP2 [11]. For the latter, we trained a classifier to identify the value of each property. We employed Support Vector Machines (SVM) [12] as the classification model.

Experimental results show that our method achieves an average F1-measure of 83.37%, while Amilcare and SVM achieve 53.44% and 73.57%, respectively. Our method clearly outperforms the two baseline methods. We have also found that the performance of the unified method decreases (−11.28% by F1) when removing the transition features, which indicates that a unified approach is necessary for researcher profiling.

We investigated the contribution of each feature type in profile extraction. We employed only content features, content+term features, content+pattern features, and all features to train the models and conducted the profile extraction. Figure 3 shows the average F1-scores of profile extraction with different feature types. The results reveal the contributions of individual feature types to the extraction. We see that using a single type of feature alone does not yield accurate profiling results. Detailed evaluations can be found in [28].

5. NAME DISAMBIGUATION

5.1 Problem Definition

We integrate the publication data from online databases including the DBLP bibliography, the ACM Digital Library, CiteSeer, and others. For integrating the researcher profiles and the publications, we use the researcher name and the publication author name as the identifier. This method inevitably suffers from the name ambiguity problem.

We give a formal definition of the name disambiguation task in our context. Given a person name $a$, we denote all publications having the author name $a$ as $P = \{p_1, p_2, \cdots, p_n\}$. Each publication $p_i$ has six attributes: paper title ($p_i$.title), publication venue ($p_i$.pubvenue), published year ($p_i$.year), abstract ($p_i$.abstract), authors ($\{a_i^{(0)}, a_i^{(1)}, \cdots, a_i^{(u)}\}$), and references ($p_i$.references). For the authors of a paper $\{a_i^{(0)}, a_i^{(1)}, \cdots, a_i^{(u)}\}$, we call the author name we are going to disambiguate the principal author (denoted as $a_i^{(0)}$) and the others secondary authors. Suppose there are $k$ actual researchers having the name $a$; our task is then to assign papers with the author name $a$ to their actual researcher $y_h$, $h \in [1, k]$.

We define five types of relationships between papers (Table 2). Relationship $r_1$ means that two papers are published at the same venue. Relationship $r_2$ means that two papers have a secondary author with the same name, and relationship $r_3$ means that one paper cites the other paper. Relationship $r_4$ indicates a constraint-based relationship supplied via user feedback. For instance, the user can specify that two specific papers should be assigned to the same person. We use an example to explain relationship $r_5$. Suppose $p_i$ has authors ‘David Mitchell’ and ‘Andrew Mark’, and $p_j$ has authors ‘David Mitchell’ and ‘Fernando Mulford’. We are to disambiguate ‘David Mitchell’. If ‘Andrew Mark’ and ‘Fernando Mulford’ also coauthor a paper, then we say $p_i$ and $p_j$ have a 2-CoAuthor relationship. In our current experiments, we empirically set the weights of relationships $w_1 \sim w_5$ as 0.2, 0.7, 0.3, 1.0, and $0.7^{\tau}$.

Table 2: Relationships between papers.

| R | W | Relation Name | Description |
|---|---|---|---|
| $r_1$ | $w_1$ | CoPubvenue | $p_i$.pubvenue = $p_j$.pubvenue |
| $r_2$ | $w_2$ | CoAuthor | $\exists r, s > 0,\ a_i^{(r)} = a_j^{(s)}$ |
| $r_3$ | $w_3$ | Citation | $p_i$ cites $p_j$ or $p_j$ cites $p_i$ |
| $r_4$ | $w_4$ | Constraints | Feedbacks supplied by users |
| $r_5$ | $w_5$ | τ-CoAuthor | τ-extension co-authorship (τ > 1) |

The publication data with relationships can be modeled as a graph comprising nodes and edges. Each attribute of a paper is attached to the corresponding node as a feature vector. In the vector, we use words (after stop words filtering and stemming) in the attributes as features and use their numbers of occurrences as the values.

5.2 A Unified Probabilistic Framework

5.2.1 Formalization using HMRF

We propose a probabilistic framework based on Hidden Markov Random Fields (HMRF) [5], which can capture dependencies between observations (with each paper being viewed as an observation). The disambiguation problem is cast as assigning a tag to each paper with each tag representing an actual researcher.

Specifically, we define the a-posteriori probability as the objective function and aim to maximize it. The five types of relationships are incorporated into the objective function. According to HMRF, the conditional distribution of the researcher labels $y$ given the observations $x$ (papers) is

$$P(y \mid x) = \frac{1}{Z} \exp\Big( - \sum_{i,h} D(x_i, y_h) - \sum_{i,\, j \neq i} \Big( D(x_i, x_j) \sum_{r_k} w_k r_k(x_i, x_j) \Big) \Big)$$

where $D(x_i, y_h)$ is the distance between paper $x_i$ and researcher $y_h$ and $D(x_i, x_j)$ is the distance between papers $x_i$ and $x_j$; $r_k(x_i, x_j)$ denotes a relationship between $x_i$ and $x_j$; $w_k$ is the weight of the relationship; and $Z$ is a normalization factor.

Table 3: Data set for name disambiguation.

| Person Name | # Publications | # Actual Persons |
|---|---|---|
| Cheng Chang | 12 | 3 |
| Wen Gao | 286 | 4 |
| Yi Li | 42 | 21 |
| Jie Tang | 21 | 2 |
| Bin Yu | 66 | 12 |
| Rakesh Kumar | 61 | 5 |
| Bing Liu | 130 | 11 |
| Gang Wu | 40 | 16 |
| Jing Zhang | 54 | 25 |
| Kuo Zhang | 6 | 2 |
| Hui Fang | 15 | 3 |
| Lei Wang | 109 | 40 |
| Michael Wagner | 44 | 12 |
| Jim Smith | 33 | 5 |
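The weighted relationship term $\sum_{r_k} w_k r_k(x_i, x_j)$ can be illustrated for the first three relationship types. In this sketch the paper records are plain dictionaries with hypothetical field names (`id`, `pubvenue`, `secondary_authors`, `references`); the user-feedback constraint and the τ-CoAuthor relationship are omitted because they need extra inputs.

```python
# Weights w1..w3 as reported in the text for CoPubvenue, CoAuthor, and Citation.
WEIGHTS = {"CoPubvenue": 0.2, "CoAuthor": 0.7, "Citation": 0.3}

def relationship_weight(p_i, p_j, weights=WEIGHTS):
    """Sum w_k * r_k(p_i, p_j) over CoPubvenue, CoAuthor, and Citation."""
    total = 0.0
    if p_i["pubvenue"] == p_j["pubvenue"]:
        total += weights["CoPubvenue"]
    if set(p_i["secondary_authors"]) & set(p_j["secondary_authors"]):
        total += weights["CoAuthor"]
    if p_j["id"] in p_i["references"] or p_i["id"] in p_j["references"]:
        total += weights["Citation"]
    return total

# Hypothetical records for two papers that share the principal author name.
p1 = {"id": 1, "pubvenue": "KDD", "secondary_authors": ["Andrew Mark"], "references": [2]}
p2 = {"id": 2, "pubvenue": "KDD", "secondary_authors": ["Fernando Mulford"], "references": []}
p3 = {"id": 3, "pubvenue": "WWW", "secondary_authors": ["Andrew Mark"], "references": []}

w12 = relationship_weight(p1, p2)  # same venue and a citation
w13 = relationship_weight(p1, p3)  # shared secondary author only
```

Pairs with a larger summed weight contribute more strongly to the pairwise term of the HMRF objective.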

5.2.2

ThreetasksareexecutedbytheExpectationMaximizationmethod:

estimation of parameters in the distance measure, re-assignment of

papers to researchers, and update of researcher representatives yh.

We define the distance function D(xi,xj) as follows:

EM framework

D(xi,xj) = 1 −

here A is defined as a diagonal matrix, for simplicity. Each element

in A denotes the weight of the corresponding feature in x.

The EM process can be summarized as follows: in the E-step,

given the researcher representatives, each paper is assigned to a

researcher by maximizing P(y|x). In the M-step, the researcher

representative yhis re-estimated from the assignments, and the dis-

tance measure is updated to maximize the objective function again.

In the E-step, assignments of data points to researchers are up-

dated to maximize P(y|x). A greedy algorithm is used to sequen-

tially assign each paper xi to its new assignment yh (h ∈ [1,k])

that minimizes the function (equivalently to maximize P(yh|xi)):

f(yh,xi) = D(xi,yh) +

j?=i

The assignment of a paper is performed while keeping assign-

ments of the other papers fixed. The assignment process is re-

peated after all papers are assigned. This process runs until no

paper changes its assignment between two successive iterations.

In the M-step, each researcher representative is updated by the

arithmetic mean of its points: yh=

Then, each parameter ammin A is updated by (only parameters

on the diagonal): amm = amm+ η?

∂f(yh,xi)

∂amm

∂amm

j?=i

xT

iAxj

?xi?A?xj?A

, where ?xi?A=

?

xT

iAxj

(1)

?

(D(xi,xj)

?

rk

wkrk(xi,xj))

(2)

?

i:yi=hxi

i:yi=hxi?A.

??

i

∂f(yh,xi)

∂amm

, where:

=∂D(xi,yh)

+

?

(∂D(xi,xj)

∂amm

?

rk

wkrk(xi,xj))

(3)

∂D(xi,xj)

∂amm

=

ximxjm?xi?A?xj?A− xT

iAxj

x2

im?xi?2

2?xi?A?xj?A

A+x2

jm?xj?2

A

?xi?2

A?xj?2

A

(4)
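As a minimal sketch of Equation (1) and of the M-step representative update, the following Python code stores the diagonal of $A$ as a plain list of weights. The function names (`a_norm`, `distance`, `representative`) and the toy feature vectors are assumptions made for illustration; they are not taken from the system described here.

```python
import math

def a_norm(x, a):
    """||x||_A = sqrt(x^T A x) for a diagonal A given as a list of weights."""
    return math.sqrt(sum(w * v * v for w, v in zip(a, x)))

def distance(x_i, x_j, a):
    """Parameterized cosine distance of Equation (1)."""
    dot = sum(w * u * v for w, u, v in zip(a, x_i, x_j))
    denom = a_norm(x_i, a) * a_norm(x_j, a)
    # Convention assumed here: an all-zero vector is maximally distant.
    return 1.0 - dot / denom if denom > 0 else 1.0

def representative(points, a):
    """Sum of the assigned points, rescaled to unit A-norm (M-step update)."""
    total = [sum(column) for column in zip(*points)]
    norm = a_norm(total, a)
    return [v / norm for v in total] if norm > 0 else total

# Toy usage with two features and equal feature weights (hypothetical data).
weights = [1.0, 1.0]
print(distance([1.0, 0.0], [1.0, 0.0], weights))  # identical direction
print(distance([1.0, 0.0], [0.0, 1.0], weights))  # orthogonal papers
print(representative([[1.0, 0.0], [0.0, 1.0]], weights))
```

Raising one entry of `weights` makes agreement on that feature count more towards a small distance, which is the effect the M-step update of $a_{mm}$ is meant to learn.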

5.3 Name Disambiguation Performance

To evaluate our method, we created a data set that consists of 14 real person names (six are from the author’s lab and the others are from [31]). Statistics of this data set are shown in Table 3. Five human annotators conducted disambiguation for the names. A specification was created to guide the annotation process. The labeling work was carried out based on the authors’ affiliations, emails, and publications on their homepages.

We defined a baseline based on the method from [27], except that [27] also utilizes a search engine to help the disambiguation. The method is based on hierarchical clustering. We also compared our approach with the DISTINCT method [31]. In all experiments, we suppose that the number of persons $k$ is provided empirically.

Table 4: Results on name disambiguation (%).

| Person Name | Baseline Prec. | Baseline Rec. | Baseline F1 | Our Approach Prec. | Our Approach Rec. | Our Approach F1 |
|---|---|---|---|---|---|---|
| Cheng Chang | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Wen Gao | 96.60 | 62.64 | 76.00 | 99.29 | 98.59 | 98.94 |
| Yi Li | 86.64 | 95.12 | 90.68 | 70.91 | 97.50 | 82.11 |
| Jie Tang | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Gang Wu | 97.54 | 97.54 | 97.54 | 71.86 | 98.36 | 83.05 |
| Jing Zhang | 85.00 | 69.86 | 76.69 | 83.91 | 100.0 | 91.25 |
| Kuo Zhang | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Hui Fang | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| Bin Yu | 67.22 | 50.25 | 57.51 | 86.53 | 53.00 | 65.74 |
| Lei Wang | 68.45 | 41.12 | 51.38 | 88.64 | 89.06 | 88.85 |
| Rakesh Kumar | 63.36 | 92.41 | 75.18 | 99.14 | 96.91 | 98.01 |
| Michael Wagner | 18.35 | 60.26 | 28.13 | 85.19 | 76.16 | 80.42 |
| Bing Liu | 84.88 | 43.16 | 57.22 | 88.25 | 86.49 | 87.36 |
| Jim Smith | 92.43 | 86.80 | 89.53 | 95.81 | 93.56 | 94.67 |
| Average | 82.89 | 78.51 | 80.64 | 90.68 | 92.12 | 91.39 |

Figure 4: Comparison with an existing method. (Bar chart of F1-score (%) by person name for DISTINCT and our approach; names shown: Hui Fang, Rakesh Kumar, Michael Wagner, Bing Liu, Jim Smith, Lei Wang, Bin Yu.)

Table 4 shows the results. We see that our method significantly outperforms the baseline method for name disambiguation (+10.75% in terms of the average F1-score). The baseline method suffers from two disadvantages: 1) it cannot take advantage of relationships between papers and 2) it relies on a fixed distance measure.
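The F1 columns of Table 4 are consistent with the usual harmonic mean of precision and recall. The short sketch below (the helper name `f1_score` is an assumption made for illustration) recomputes two of the reported values and the gain in average F1 over the baseline.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# 'Wen Gao' row of Table 4: baseline and our approach.
baseline_f1 = f1_score(96.60, 62.64)
ours_f1 = f1_score(99.29, 98.59)
print(round(baseline_f1, 2), round(ours_f1, 2))  # 76.0 98.94

# Gain in average F1 ('Average' row of Table 4).
print(round(91.39 - 80.64, 2))  # 10.75
```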

Figure 4 shows the comparison results of our method and DISTINCT [31]. We used the person names evaluated in both [31] and our experiments for comparison. For some names, our approach significantly outperforms DISTINCT (e.g., ‘Michael Wagner’), while for other names our approach underperforms DISTINCT (e.g., ‘Bin Yu’).

We further investigated the contribution of each relationship type. We first removed all relationships and then added them to our approach one by one: CoPubVenue, Citation, CoAuthor, and τ-CoAuthor. At each step, we evaluated the performance of our approach (cf. Figure 5). Without the relationships, the disambiguation performance drops sharply (−44.72% by F1), and adding the relationships yields improvements at each step. This confirms that a framework integrating relationships for name disambiguation is worthwhile and that each relationship defined in our method is helpful. We can also see that the CoAuthor relationship is the major contributor (+24.38% by F1).

6. MODELING ACADEMIC NETWORK

Modeling the academic network is critical to any searching or suggesting task. Traditionally, information is represented based on the ‘bag of words’ (BOW) assumption. This representation tends to be overly specific in terms of matching words.

Figure 5: Contribution of relationships. (Bar chart of Pre., Rec., and F1 (%) for the settings w/o Relationship, +CoPubVenue, +Citation, +CoAuthor, and All.)


Recently, probabilistic topic models such as probabilistic Latent Semantic Indexing (pLSI) [17], Latent Dirichlet Allocation (LDA) [8], and the Author-Topic model [25] [26] have been proposed and successfully applied to multiple text mining tasks such as information retrieval [29], collaborative filtering [8] [16], and paper reviewer finding [22]. However, these models are not sufficient to model the whole academic network, as they cannot model topical aspects of all types of information in the academic network.

We propose a unified topic model for simultaneously modeling the topical distribution of papers, authors, and conferences. For simplicity, we use conference to denote conference, journal, and book hereafter. The learned topic distribution can be used to further estimate the inter-dependencies between different types of information, e.g., the closeness between a conference and an author.

The notations used are summarized as follows. A paper $d$ is represented by a vector $\mathbf{w}_d$ of $N_d$ words, in which each $w_{di}$ is chosen from a vocabulary of size $V$; a vector $\mathbf{a}_d$ of $A_d$ authors, chosen from a set of authors of size $A$; and a published conference $c_d$. A collection of $D$ papers is defined by $D = \{(\mathbf{w}_1, \mathbf{a}_1, c_1), \cdots, (\mathbf{w}_D, \mathbf{a}_D, c_D)\}$. $x_{di}$ denotes an author, chosen from $\mathbf{a}_d$, responsible for the $i$-th word $w_{di}$ in paper $d$. The number of topics is denoted as $T$.

6.1 Our Proposed Topic Models

The proposed model is called the Author-Conference-Topic (ACT) model. Three different strategies are employed to implement the topic model (as shown in Figure 6).

In the first model (ACT1, Figure 6(a)), each author is associated with a multinomial distribution over topics, and each word in a paper and the conference stamp are generated from a sampled topic.

In the second model (ACT2, Figure 6(b)), each author-conference pair is associated with a multinomial distribution over topics, and each word is then generated from a sampled topic.

In the third model (ACT3, Figure 6(c)), each author is associated with a topic distribution, and the conference stamp is generated after topics have been sampled for all word tokens in a paper.

The different implementations reduce the process of writing a scientific paper to different series of probabilistic steps. They behave differently in the academic applications. In the remainder of this section, we describe the three models in more detail.

6.2 ACT Model 1

In the first model (Figure 6(a)), the conference information is viewed as a stamp associated with each word in a paper. The intuition behind the first model is that the coauthors of a paper determine the topics written in this paper, and each topic then generates the words and determines a proportion of the publication venue. The generative process can be summarized as follows:

1. For each topic $z$, draw $\phi_z$ and $\psi_z$ respectively from Dirichlet priors $\beta_z$ and $\mu_z$;

2. For each word $w_{di}$ in paper $d$:

• draw an author $x_{di}$ from $\mathbf{a}_d$ uniformly;

• draw a topic $z_{di}$ from a multinomial distribution $\theta_{x_{di}}$ specific to author $x_{di}$, where $\theta$ is generated from a Dirichlet prior $\alpha$;

• draw a word $w_{di}$ from multinomial $\phi_{z_{di}}$;

• draw a conference stamp $c_{di}$ from multinomial $\psi_{z_{di}}$.
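The per-word part of this generative process (step 2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the distributions `theta`, `phi`, and `psi` are taken as already drawn from their Dirichlet priors and passed in as plain lists, and the names `sample_categorical` and `generate_paper_act1` are hypothetical.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point rounding

def generate_paper_act1(authors, n_words, theta, phi, psi, rng):
    """Generate (author, topic, word, conference stamp) tuples for one paper."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(authors)                 # author drawn uniformly from a_d
        z = sample_categorical(theta[x], rng)   # topic from the author's theta
        w = sample_categorical(phi[z], rng)     # word from the topic's phi
        c = sample_categorical(psi[z], rng)     # conference stamp from the topic's psi
        tokens.append((x, z, w, c))
    return tokens

# Toy usage: 2 authors, 2 topics, 3 words, 2 conferences (hypothetical values).
theta = {0: [0.9, 0.1], 1: [0.2, 0.8]}
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
psi = [[0.8, 0.2], [0.3, 0.7]]
paper = generate_paper_act1([0, 1], 5, theta, phi, psi, random.Random(0))
print(paper)
```

Because every word carries its own conference stamp, a single paper contributes $N_d$ stamp observations, which matches the description of the conference as "a stamp associated with each word".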

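Following [26], we choose Gibbs sampling for inference. As for the hyperparameters $\alpha$, $\beta$, and $\mu$, for simplicity, we take fixed values (i.e., $\alpha = 50/T$, $\beta = 0.01$, and $\mu = 0.1$). In the Gibbs sampling procedure, we first estimate the posterior distribution on just $x$ and $z$ and then use the results to infer $\theta$, $\phi$, and $\psi$. The posterior probability is calculated by the following:

$$P(z_{di}, x_{di} | \mathbf{z}_{-di}, \mathbf{x}_{-di}, \mathbf{w}, \mathbf{c}, \alpha, \beta, \mu) \propto \frac{m^{-di}_{x_{di} z_{di}} + \alpha_{z_{di}}}{\sum_{z} (m^{-di}_{x_{di} z} + \alpha_{z})} \cdot \frac{n^{-di}_{z_{di} w_{di}} + \beta_{w_{di}}}{\sum_{w_v} (n^{-di}_{z_{di} w_v} + \beta_{w_v})} \cdot \frac{n^{-d}_{z_{di} c_d} + \mu_{c_d}}{\sum_{c} (n^{-d}_{z_{di} c} + \mu_{c})} \qquad (5)$$

where the superscript $-di$ denotes a quantity excluding the current instance (e.g., the $di$-th word token in the $d$-th paper).

After Gibbs sampling, the probability of a word given a topic $\phi$, the probability of a conference given a topic $\psi$, and the probability of a topic given an author $\theta$ can be estimated as follows:

$$\phi_{z w_{di}} = \frac{n_{z w_{di}} + \beta_{w_{di}}}{\sum_{w_v} (n_{z w_v} + \beta_{w_v})} \qquad (6)$$

$$\psi_{z c_d} = \frac{n_{z c_d} + \mu_{c_d}}{\sum_{c} (n_{z c} + \mu_{c})} \qquad (7)$$

$$\theta_{x z} = \frac{m_{x z} + \alpha_{z}}{\sum_{z'} (m_{x z'} + \alpha_{z'})} \qquad (8)$$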
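Equations (6) to (8) share one pattern: add the prior to each count and normalize the row. A minimal sketch of that pattern is given below. It assumes symmetric priors (one scalar per distribution, as with the fixed values above) and assumes that $n_{zw}$, $n_{zc}$, and $m_{xz}$ are count matrices of topic-word, topic-conference, and author-topic assignments collected from the final Gibbs sample; the function name `smoothed_rows` and the toy counts are hypothetical.

```python
def smoothed_rows(counts, prior):
    """Normalize each row of a count matrix after adding a symmetric prior."""
    result = []
    for row in counts:
        total = sum(row) + prior * len(row)
        result.append([(value + prior) / total for value in row])
    return result

# Toy counts for T = 2 topics (hypothetical values).
T = 2
n_zw = [[30, 10, 0], [2, 8, 40]]   # topic x word counts
n_zc = [[35, 5], [10, 40]]         # topic x conference counts
m_xz = [[25, 5], [15, 45]]         # author x topic counts

phi = smoothed_rows(n_zw, prior=0.01)        # Equation (6), beta = 0.01
psi = smoothed_rows(n_zc, prior=0.1)         # Equation (7), mu = 0.1
theta = smoothed_rows(m_xz, prior=50.0 / T)  # Equation (8), alpha = 50/T
print(theta[0])
```

With $\alpha = 50/T$ the prior mass is large relative to these toy counts, so `theta` is pulled strongly towards the uniform distribution; on a real corpus the counts dominate.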

6.3 ACT Model 2

In the second model (cf. Figure 6(b)), each topic is chosen from a multinomial topic distribution specific to an author-conference pair, instead of an author as in ACT1. The model is derived from the observation that, when writing a paper, coauthors usually first choose a publication venue and then write the paper based on the themes of the publication venue and the interests of the authors. The corresponding generative process is:

1. For each topic $z$, draw $\phi_z$ from Dirichlet priors $\beta_z$;

2. For each word $w_{di}$ in paper $d$:

• draw an author-conference pair $(x_{di}, c_d)$ from $\{\mathbf{a}_d, c_d\}$ uniformly;

• draw a topic $z_{di}$ from a multinomial distribution $\theta_{(x_{di} c_d)}$ specific to author-conference pair $(x_{di}, c_d)$, where $\theta$ is generated from a Dirichlet prior $\alpha$;

• draw a word $w_{di}$ from multinomial $\phi_{z_{di}}$.

Similarly, we can calculate the posterior conditional probability using a Gibbs sampling procedure analogous to that in ACT1:

$$P(z_{di}, (xc)_{di} | \mathbf{z}_{-di}, \mathbf{x}_{-di}, \mathbf{c}_{-d}, \mathbf{w}, \alpha, \beta) \propto \frac{m^{-di}_{(xc)_{di} z_{di}} + \alpha_{z_{di}}}{\sum_{z} (m^{-di}_{(xc)_{di} z} + \alpha_{z})} \cdot \frac{n^{-di}_{z_{di} w_{di}} + \beta_{w_{di}}}{\sum_{w_v} (n^{-di}_{z_{di} w_v} + \beta_{w_v})} \qquad (9)$$

6.4 ACT Model 3

In the third model (cf. Figure 6(c)), the conference stamp is taken as a numerical value. Each conference stamp of a paper is chosen after topics have been sampled for all word tokens in the paper. Intuitively, this corresponds to a natural way of publishing a scientific paper: authors first write a paper and then determine where to publish it based on the topics discussed in the paper. The corresponding generative process is:

1. For each topic $z$, draw $\phi_z$ from Dirichlet priors $\beta_z$;

2. For each word $w_{di}$ in paper $d$:

• draw an author $x_{di}$ from $\mathbf{a}_d$ uniformly;

• draw a topic $z_{di}$ from a multinomial distribution $\theta_{x_{di}}$ specific to author $x_{di}$;

• draw a word $w_{di}$ from multinomial $\phi_{z_{di}}$.

3. Draw a conference $c_d$ from $z_{1:N_d}$ using a normal linear model $N(\eta^T \tau, \sigma^2)$, where $\tau$ is a vector recording the normalized number of times each topic is sampled in paper $d$. We define it as $\tau[k] = (1/N_d) \sum_{i=1}^{N_d} I[z_{di} = k]$, where $I$ is an indicator function with $I[\text{true}] = 1$ and $I[\text{false}] = 0$.
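To make step 3 concrete, the sketch below computes $\tau$ from the topic assignments of one paper and the mean $\eta^T \tau$ of the normal linear model. The function names and the example values of `z` and `eta` are assumptions made for illustration.

```python
def topic_proportions(z, num_topics):
    """tau[k] = (1/N_d) * sum_i I[z_di = k] for the topic assignments z of one paper."""
    counts = [0] * num_topics
    for k in z:
        counts[k] += 1
    return [count / len(z) for count in counts]

def stamp_mean(eta, tau):
    """Mean eta^T tau of the normal linear model for the conference stamp."""
    return sum(e * t for e, t in zip(eta, tau))

# Four word tokens assigned to topics 0, 2, 2, 1 with T = 3 topics.
tau = topic_proportions([0, 2, 2, 1], 3)
print(tau)                               # [0.25, 0.25, 0.5]
print(stamp_mean([1.0, 2.0, 4.0], tau))  # 2.75
```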


Figure 6: Graphical representation of the three Author-Conference-Topic (ACT) models: (a) ACT1, (b) ACT2, (c) ACT3.

In this model, the conference comes from a normal linear model. The covariates $\tau$ in this model are the frequencies of the topics in the document. The regression coefficients on these frequencies constitute $\eta$. The difference in parameterization from ACT1 is that the conference stamp is sampled from a normal linear distribution after topics have been sampled for all word tokens in a paper.

Inference in ACT3 differs slightly from that in ACT1 and ACT2, as we also need to estimate the parameters $\eta$ and $\sigma^2$. We use a Gibbs EM algorithm [2] for inference of this model. In the E-step, for sampling the topic, the posterior probability is calculated by

$$P(z_{di}, x_{di} | \mathbf{z}_{-di}, \mathbf{x}_{-di}, c_d, \mathbf{w}, \alpha, \beta, \eta, \sigma^2) \propto \frac{m^{-di}_{x_{di} z_{di}} + \alpha_{z_{di}}}{\sum_{z} (m^{-di}_{x_{di} z} + \alpha_{z})} \cdot \frac{n^{-di}_{z_{di} w_{di}} + \beta_{w_{di}}}{\sum_{w_v} (n^{-di}_{z_{di} w_v} + \beta_{w_v})} \cdot P(c_d | z_{1:N_d}, \eta, \sigma^2) \qquad (10)$$

where

$$P(c_d | z_{1:N_d}, \eta, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\Big( -\frac{(c_d - \eta^T \tau)^2}{2 \sigma^2} \Big) \qquad (11)$$

In the M-step, given the sampled topics $z$, the optimal $\eta$ and $\sigma^2$ can be estimated as $\arg\max_{\eta, \sigma} \log P(\mathbf{x}, \mathbf{z}, \mathbf{w}, \mathbf{c} | \alpha, \beta, \eta, \sigma^2)$. Specifically, $\eta$ is updated by

$$\eta_{new} \leftarrow (E[A^T A])^{-1} E[A]^T \mathbf{c} \qquad (12)$$

and $\sigma^2$ is updated by

$$\sigma^2_{new} \leftarrow (1/D) \{ \mathbf{c}^T \mathbf{c} - \mathbf{c}^T E[A] (E[A^T A])^{-1} E[A]^T \mathbf{c} \} \qquad (13)$$

where $E[\cdot]$ is the expectation of the variables and $A$ is a $D \times T$ matrix. The $d$-th row of the matrix is $E[\tau] = \bar{\phi} := (1/N_d) \sum_{i=1}^{N_d} \phi_{di}$, and $E[A^T A] = \sum_{d=1}^{D} E[\tau_d \tau_d^T]$ is a $T \times T$ matrix, where $E[\tau_d \tau_d^T]$ is defined as:

$$E[\tau_d \tau_d^T] = (1/N_d^2) \Big( \sum_{i=1}^{N_d} \sum_{j \neq i} \phi_{di} \phi_{dj}^T + \sum_{i=1}^{N_d} \mathrm{diag}\{\phi_{di}\} \Big) \qquad (14)$$

with $\mathrm{diag}\{\phi_{di}\}$ denoting a matrix whose diagonal is the vector $\phi_{di}$. Note that $\phi_{di}$ denotes a vector of probabilities of topics generating word $w_{di}$. We omit the details of the derivation of Equations (12) and (13). The interested reader is referred to [7] and [23].

7. ACADEMIC SEARCH SERVICES

7.1 Applying ACT Models to Expertise Search

In expertise search, the objective is to find the expertise authors, expertise papers, and expertise conferences for a given query.

7.1.1 Process

Based on the proposed models, we can calculate the likelihood of a paper generating a word, using ACT1 as the example, as follows:

$$P_{ACT1}(w | d, \theta, \phi) = \sum_{z=1}^{T} \sum_{x=1}^{A_d} P(w | z, \phi_z) P(z | x, \theta_x) P(x | d) \qquad (15)$$

The likelihood of an author model and a conference model generating a word can be defined similarly. However, the topics learned by the LDA-style model are usually general and not specific to a given query. Therefore, using ACT alone is too coarse for academic search [29]. Our preliminary experiments also show that employing only ACT or LDA models for information retrieval hurts the retrieval performance. In general, we would like to have a balance between generality and specificity. Therefore, we derive a combination of the ACT model and the word-based language model:

$$P(w | d) = P_{LM}(w | d) \times P_{ACT}(w | d) \qquad (16)$$

where $P_{LM}(w | d)$ is the generating probability of word $w$ from paper $d$ by the language model. It is defined as:

$$P_{LM}(w | d) = \frac{N_d}{N_d + \lambda} \cdot \frac{tf(w, d)}{N_d} + \Big( 1 - \frac{N_d}{N_d + \lambda} \Big) \cdot \frac{tf(w, D)}{N_D} \qquad (17)$$

where $tf(w, d)$ is the frequency of word $w$ in $d$, $tf(w, D)$ is the frequency of word $w$ in the collection $D$, and $N_D$ is the number of word tokens in the collection $D$. $\lambda$ is the Dirichlet prior and is commonly set based on the average paper length in the collection. Finally, given a query $q$, $P(q | d)$ can be computed by $P(q | d) = \prod_{w \in q} P(w | d)$. Similarly, we can define $P(q | a)$ for authors and $P(q | c)$ for conferences.

7.1.2 Expertise Search Performance

We collected a list of the most frequent queries from the log of ArnetMiner for evaluation. We conducted experiments on a subset of the data (including 14,134 persons, 10,716 papers, and 1,434 conferences) from ArnetMiner. For evaluation, we used the method of pooled relevance judgments [10] together with human judgments. Specifically, for each query, we first pooled the top 30 results from three similar systems (Libra, Rexa, and ArnetMiner). Then, two faculty members and five graduate students from CS provided human judgments. Four-grade scores (3, 2, 1, and 0) were assigned, representing definite expertise, expertise, marginal expertise, and no expertise, respectively. Finally, the judgment scores were averaged to obtain the final score.

In all experiments, we conducted evaluation in terms of P@5, P@10, P@20, R-pre, and mean average precision (MAP) [10] [13].

We used the language model (LM), LDA [8], and the Author-Topic (AT) model [25] [26] as the baseline methods. For the language model, we used Equation (17) to calculate the relevance between a query term and a document, and similar equations for an author/conference (an author is represented by his/her published papers and a conference is represented by the papers published at it). For LDA, we used an equation similar to Equation (16) to calculate the relevance of a term and a document. For the AT model, we used equations similar to Equation (16) to calculate the relevance of a query term with a paper or an author. For the LDA and AT models, we performed model estimation with the same setting as that for the ACT models. We empirically set the number of topics as $T = 80$ for all models.
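The query scoring of Section 7.1.1, i.e., Equations (15) to (17) and the product over query words, can be sketched as below. The sketch makes several assumptions that are not stated in the text: $P(x|d)$ is taken as uniform over the authors of the paper, each `phi[z]` is a dictionary from word to probability, and all function names and toy values are hypothetical.

```python
def lm_probability(word, doc_tokens, collection_tf, collection_size, lam):
    """Dirichlet-smoothed language model P_LM(w|d), Equation (17)."""
    n_d = len(doc_tokens)
    weight = n_d / (n_d + lam)
    p_doc = doc_tokens.count(word) / n_d if n_d else 0.0
    p_coll = collection_tf.get(word, 0) / collection_size
    return weight * p_doc + (1.0 - weight) * p_coll

def act1_probability(word, doc_authors, theta, phi):
    """P_ACT1(w|d), Equation (15), assuming P(x|d) is uniform over the authors."""
    p_x = 1.0 / len(doc_authors)
    return sum(phi[z].get(word, 0.0) * theta[x][z] * p_x
               for x in doc_authors
               for z in range(len(phi)))

def query_likelihood(query, doc_tokens, doc_authors, theta, phi,
                     collection_tf, collection_size, lam):
    """P(q|d) as the product over query words of P_LM(w|d) * P_ACT1(w|d)."""
    score = 1.0
    for word in query:
        score *= (lm_probability(word, doc_tokens, collection_tf,
                                 collection_size, lam)
                  * act1_probability(word, doc_authors, theta, phi))
    return score

# Toy usage (hypothetical paper, collection statistics, and model parameters).
doc = ["topic", "model", "topic"]
collection_tf = {"topic": 4, "model": 2, "graph": 2}
theta = {"a": [0.5, 0.5]}
phi = [{"topic": 0.6, "model": 0.4}, {"topic": 0.2, "graph": 0.8}]
print(query_likelihood(["topic"], doc, ["a"], theta, phi, collection_tf, 8, 1.0))
```

For longer queries the product quickly becomes very small, so a practical implementation would typically sum log-probabilities instead; the ranking is unchanged.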


Table 5: Five topics discovered by ACT1 on the ArnetMiner data. Each topic is shown with the top 8 words and their corresponding probabilities. Top 6 authors and top 6 conferences are shown with each topic. The titles are our interpretation of the topics.

Topic #5 (Model 1): “Natural language processing”

| Word | Prob. | Author | Prob. | Conference | Prob. |
|---|---|---|---|---|---|
| language | 0.034820 | Yuji Matsumoto | 0.001389 | ACL | 0.253487 |
| parsing | 0.023766 | Eugene Charniak | 0.001323 | COLING | 0.234435 |
| natural | 0.019029 | Rens Bod | 0.001323 | CL | 0.118136 |
| learning | 0.015871 | Brian Roark | 0.001190 | ANLP | 0.060423 |
| approach | 0.012712 | Suzanne Stevenson | 0.001124 | CoRR | 0.058674 |
| grammars | 0.012712 | Anoop Sarkar | 0.001058 | COLING-ACL | 0.036814 |
| processing | 0.011923 | | | | |
| text | 0.011923 | | | | |

Topic #10 (Model 1): “Semantic web”

| Word | Prob. | Author | Prob. | Conference | Prob. |
|---|---|---|---|---|---|
| semantic | 0.068226 | Steffen Staab | 0.005863 | ISWC | 0.125291 |
| web | 0.048847 | Enrico Motta | 0.004365 | EKAW | 0.122379 |
| ontology | 0.043160 | York Sure | 0.003713 | IEEE Intelligent Systems | 0.071418 |
| knowledge | 0.041497 | Nenad Stojanovic | 0.001824 | CoopIS/DOA/ODBASE | 0.065594 |
| learning | 0.013431 | Alexander Maedche | 0.001824 | K-CAP | 0.054674 |
| framework | 0.012095 | Asuncion Gomez-Perez | 0.001694 | ESWS | 0.023369 |
| approach | 0.011427 | | | | |
| based | 0.010758 | | | | |

Topic #16 (Model 1): “Machine learning”

| Word | Prob. | Author | Prob. | Conference | Prob. |
|---|---|---|---|---|---|
| learning | 0.058056 | Robert E. Schapire | 0.004033 | NIPS | 0.289761 |
| classification | 0.018517 | Yoram Singer | 0.003318 | JMLR | 0.206583 |
| boosting | 0.015881 | Thomas G. Dietterich | 0.002472 | ICML | 0.156389 |
| machine | 0.017797 | Bernhard Scholkopf | 0.001496 | COLT | 0.096157 |
| feature | 0.013904 | Alexander J. Smola | 0.001301 | Neural Computation | 0.023017 |
| classifiers | 0.013904 | Ralf Schoknecht | 0.001236 | MLSS | 0.011545 |
| margin | 0.013245 | | | | |
| selection | 0.012586 | | | | |

Topic #19 (Model 1): “Support vector machines”

| Word | Prob. | Author | Prob. | Conference | Prob. |
|---|---|---|---|---|---|
| support | 0.082669 | Bernhard Scholkopf | 0.003929 | Neural Computation | 0.096707 |
| vector | 0.071373 | Johan A. K. Suykens | 0.003536 | NIPS | 0.094388 |
| machine | 0.064076 | Vladimir Vapnik | 0.002947 | ICANN | 0.084338 |
| kernel | 0.026897 | Olvi L. Mangasarian | 0.002947 | JMLR | 0.083565 |
| regression | 0.020544 | Joos Vandewalle | 0.002030 | Neurocomputing | 0.071197 |
| neural | 0.016308 | Nicola L. C. Talbot | 0.001768 | Machine Learning | 0.067331 |
| classification | 0.012072 | | | | |
| networks | 0.011366 | | | | |

Topic #24 (Model 1): “Information extraction”

| Word | Prob. | Author | Prob. | Conference | Prob. |
|---|---|---|---|---|---|
| learning | 0.065259 | Raymond J. Mooney | 0.010346 | AAAI | 0.295846 |
| information | 0.043527 | Andrew McCallum | 0.004074 | IJCAI | 0.192995 |
| extraction | 0.033592 | Craig A. Knoblock | 0.003492 | ICML | 0.060567 |
| web | 0.019311 | Nicholas Kushmerick | 0.002457 | KDD | 0.058551 |
| semantic | 0.011860 | Ellen Riloff | 0.002199 | JAIR | 0.046451 |
| text | 0.010618 | William W. Cohen | 0.002134 | ECML | 0.033006 |
| rules | 0.010618 | | | | |
| relational | 0.009376 | | | | |

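Table 6: Performance of six expertise search approaches (%).

| Method | Object | P@5 | P@10 | P@20 | R-pre | MAP |
|---|---|---|---|---|---|---|
| LM | Paper | 40.0 | 38.6 | 37.1 | 10.0 | 46.4 |
| LM | Author | 65.7 | 44.3 | 25.0 | 58.8 | 73.4 |
| LM | Conference | 51.4 | 32.9 | 21.4 | 47.6 | 63.1 |
| LM | Average | 52.4 | 38.6 | 27.9 | 38.8 | 61.0 |
| LDA | Paper | 31.4 | 48.6 | 42.9 | 13.5 | 45.8 |
| AT | Paper | 42.9 | 48.6 | 42.9 | 13.1 | 49.3 |
| AT | Author | 82.9 | 45.7 | 25.7 | 73.5 | 78.1 |
| AT | Average | 62.9 | 47.1 | 34.3 | 43.3 | 63.7 |
| ACT1 | Paper | 42.9 | 45.7 | 43.6 | 16.6 | 51.0 |
| ACT1 | Author | 91.4 | 50.0 | 26.4 | 80.0 | 89.6 |
| ACT1 | Conference | 62.9 | 41.4 | 23.6 | 60.7 | 72.3 |
| ACT1 | Average | 65.7 | 45.7 | 31.2 | 52.4 | 71.0 |
| ACT2 | Paper | 42.9 | 47.1 | 39.3 | 15.0 | 47.7 |
| ACT2 | Author | 74.3 | 50.0 | 25.7 | 69.4 | 80.1 |
| ACT2 | Conference | 54.3 | 41.4 | 22.1 | 54.2 | 63.9 |
| ACT2 | Average | 57.1 | 46.2 | 29.1 | 46.2 | 63.9 |
| ACT3 | Paper | 42.9 | 38.6 | 41.4 | 17.1 | 47.0 |
| ACT3 | Author | 71.4 | 47.1 | 25.7 | 70.0 | 78.7 |
| ACT3 | Conference | 57.1 | 38.6 | 23.6 | 58.3 | 65.7 |
| ACT3 | Average | 57.1 | 41.4 | 30.2 | 48.5 | 63.8 |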

Table 5 shows five topics discovered by ACT1.

Table 6 shows the experimental results of retrieving papers, authors, and conferences using our proposed methods and the baseline methods. We see that our three proposed methods outperform the baseline methods. LDA only models documents and thus can support only paper search, while AT supports paper search and author search. Both models underperform our proposed unified models. Our models benefit from the ability to model all kinds of information holistically and thus can capture the dependencies between the different types of information. We can also see that ACT1 achieves the best performance in all evaluation measures.

For comparison purposes, we also evaluated the results of two similar systems: Libra.msra.cn and Rexa.info. The average MAP obtained by Libra and Rexa on our data set is 48.3% and 45.0%, respectively. We see that our methods clearly outperform the two systems.
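For reference, the ranking measures reported in Table 6 can be computed as in the sketch below. It assumes binary relevance labels per ranked result; how the averaged four-grade judgment scores were turned into relevance labels is not stated here, so that step is left out. The function names are hypothetical.

```python
def precision_at_k(ranked, k):
    """Fraction of relevant results among the top k (ranked holds 0/1 labels)."""
    return sum(ranked[:k]) / k

def r_precision(ranked, num_relevant):
    """Precision at rank R, where R is the number of relevant items for the query."""
    return sum(ranked[:num_relevant]) / num_relevant if num_relevant else 0.0

def average_precision(ranked):
    """Mean of the precision values at the ranks where relevant items occur."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP over several queries, each given as a ranked list of 0/1 labels."""
    return sum(average_precision(run) for run in runs) / len(runs)

# Toy usage: one ranked list with relevant results at ranks 1, 3, and 4.
labels = [1, 0, 1, 1, 0]
print(precision_at_k(labels, 5))   # 0.6
print(average_precision(labels))
```

Note that `average_precision` divides by the number of relevant items that were retrieved; with pooled judgments one may prefer to divide by the number of relevant items in the pool.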

7.2 Applying ACT Models to Association Search

Association Search: Given a social network $G = (V, E)$ and an association query $(a_i, a_j)$ (source person, target person), association search is to find and rank possible associations $\{\alpha_k(a_i, a_j)\}$ from $a_i$ to $a_j$. Each association is denoted as a referral chain of persons.

There are two subtasks in association search: finding possible associations between two persons and ranking the associations. Given a large social network, finding all associations is an NP-hard problem. We instead focus on finding the ‘shortest’ associations. Hence, the problem becomes how to estimate the score of an association, and one key issue is how to calculate the distance between persons. We use KL divergence to define the distance as:

$$KL(a_i, a_j) = \sum_{z=1}^{T} \theta_{a_i z} \log \frac{\theta_{a_i z}}{\theta_{a_j z}} \qquad (18)$$
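Equation (18) can be evaluated directly on two author-topic distributions. The sketch below assumes both distributions are strictly positive, which holds when $\theta$ is estimated with a Dirichlet prior as in Equation (8); the function name and the example distributions are hypothetical.

```python
import math

def kl_divergence(theta_i, theta_j):
    """KL(a_i, a_j) = sum_z theta_i[z] * log(theta_i[z] / theta_j[z]), Equation (18)."""
    return sum(p * math.log(p / q) for p, q in zip(theta_i, theta_j) if p > 0)

# Two hypothetical author-topic distributions over T = 2 topics.
theta_a = [0.5, 0.5]
theta_b = [0.25, 0.75]
print(kl_divergence(theta_a, theta_b))
print(kl_divergence(theta_b, theta_a))  # differs: KL divergence is not symmetric
```

Because the divergence is asymmetric, the distance from $a_i$ to $a_j$ generally differs from the distance from $a_j$ to $a_i$, so the resulting network is best treated as a directed weighted graph.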

We use the accumulated distance between persons on an association path as the score of the association. We call the association with the smallest score the shortest association, and our problem can be formalized as that of finding the near-shortest associations. Our approach consists of two stages:

1) Shortest association finding. It aims at finding the shortest associations from all persons $a \in V \setminus a_j$ in the network to the target person $a_j$ (the score of the shortest association from $a_i$ to $a_j$ is denoted as $L_{min} > 0$). We use a heap-based Dijkstra algorithm to find the shortest associations.

2) Near-shortest association finding. Based on the shortest association score $L_{min}$ and a parameter $\gamma$, the algorithm uses a depth-first search to find associations whose scores are less than $(1 + \gamma) L_{min}$. We constrain the length of an association to be less than a pre-defined threshold. Finally, the obtained associations are ranked according to their scores.
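The two stages can be sketched as follows. This is a simplified illustration, not the system's implementation: the network is assumed to be a dictionary of directed edges `graph[u][v] = distance`, the bound is applied as "at most $(1 + \gamma) L_{min}$" so that the shortest association itself is always returned, `max_persons` caps the number of persons on a chain, and all names and toy weights are hypothetical.

```python
import heapq

def shortest_scores_to(target, graph):
    """Stage 1: heap-based Dijkstra on reversed edges, giving the best score to target."""
    reverse = {}
    for u, edges in graph.items():
        for v, w in edges.items():
            reverse.setdefault(v, {})[u] = w
    best = {target: 0.0}
    heap = [(0.0, target)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for prev, w in reverse.get(node, {}).items():
            candidate = dist + w
            if candidate < best.get(prev, float("inf")):
                best[prev] = candidate
                heapq.heappush(heap, (candidate, prev))
    return best

def near_shortest(source, target, graph, gamma, max_persons):
    """Stage 2: depth-first search for associations scoring at most (1 + gamma) * L_min."""
    best = shortest_scores_to(target, graph)
    if source not in best:
        return []
    limit = (1.0 + gamma) * best[source]
    results = []

    def dfs(node, path, score):
        if node == target:
            results.append((score, list(path)))
            return
        if len(path) >= max_persons:
            return
        for nxt, w in graph.get(node, {}).items():
            if nxt in path or nxt not in best:
                continue
            # Prune branches that cannot stay within the limit.
            if score + w + best[nxt] <= limit:
                path.append(nxt)
                dfs(nxt, path, score + w)
                path.pop()

    dfs(source, [source], 0.0)
    return sorted(results)

# Toy network with hypothetical distances.
graph = {"A": {"B": 1.0, "C": 2.0}, "B": {"D": 1.0}, "C": {"D": 2.0}, "D": {}}
print(near_shortest("A", "D", graph, gamma=1.5, max_persons=4))
```

Running Dijkstra from the target on reversed edges gives every node a lower bound on its remaining score, which lets the depth-first search discard a branch as soon as it can no longer meet the limit.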

Our approach can find the near-shortest associations for a query

in less than 3 seconds on a commodity machine with a network of

researchers. In the following, we list two associations ranked by

our approach for the query (‘Hang Li’, ‘Qiang Yang’).

1. Hang Li -> Yong Yu -> Qiang Yang (score: 0.127)

2. Hang Li -> Bin Gao -> Wei-Ying Ma -> Qiang Yang (score: 0.274)

7.3 Other Applications

Our model can support many other applications, e.g., author interest finding and academic suggestion.

For example, Table 7 shows the top 5 words and top 5 authors associated with two conferences found by ACT1. Table 8 shows the top 5 words and top 5 conferences associated with two researchers found by ACT1. The results can be directly used to characterize the conference themes and researcher interests. They can also be used for prediction/suggestion tasks. For example, one can use the model to find the best matching reviewer for a paper submitted to a specific conference. Previously, such work was done only by keyword matching or topic-based retrieval such as [22], without considering the conference. One can also use the model to suggest a venue for submitting a paper based on its content and its authors’ interests, or to suggest popular topics when authors prepare a paper for a conference.

Table 7: Top 5 representative words and top 5 authors associated to two conferences found by ACT1.

| Type | ACL | Prob. | SIGIR | Prob. |
|---|---|---|---|---|
| Word | parsing | 0.030523 | information | 0.036946 |
| Word | semantic | 0.018398 | text | 0.030265 |
| Word | learning | 0.016851 | classification | 0.027953 |
| Word | statistical | 0.014143 | retrieval | 0.025588 |
| Word | information | 0.013620 | web | 0.021703 |
| Author | Christopher D. Manning | 0.003984 | Susan T. Dumais | 0.002432 |
| Author | Dan I. Moldovan | 0.003358 | W. Bruce Croft | 0.002190 |
| Author | Mark Johnson | 0.002837 | Norbert Fuhr | 0.001643 |
| Author | Robert C. Moore | 0.002055 | Fabrizio Sebastiani | 0.001279 |
| Author | Jason Eisner | 0.001933 | Laura A. Granka | 0.001279 |

Table 8: Top 5 representative words and top 5 conferences associated to two researchers found by ACT1.

| Type | Raymond Mooney | Prob. | Bruce Croft | Prob. |
|---|---|---|---|---|
| Word | learning | 0.053442 | information | 0.020554 |
| Word | information | 0.029767 | web | 0.017087 |
| Word | extraction | 0.022361 | learning | 0.016322 |
| Word | web | 0.014841 | text | 0.014615 |
| Word | semantic | 0.009696 | classification | 0.014315 |
| Conference | AAAI | 0.190748 | SIGIR | 0.104724 |
| Conference | IJCAI | 0.126281 | CIKM | 0.099845 |
| Conference | Machine Learning | 0.053669 | Inf. Process. Manage. | 0.024329 |
| Conference | ICML | 0.049556 | AAAI | 0.023232 |
| Conference | KDD | 0.038491 | ECIR | 0.022895 |

8. CONCLUSION

In this paper, we describe the architecture and the main features of the ArnetMiner system. Specifically, we propose a unified tagging approach to researcher profiling. About half a million researcher profiles have been extracted into the system. The system has also integrated more than one million papers. We propose a probabilistic framework to deal with the name ambiguity problem in the integration. We further propose a unified topic model to simultaneously model the different types of information in the academic network. The modeling results have been applied to expertise search and association search. We conduct experiments to evaluate each of the proposed approaches. Experimental results indicate that the proposed methods can achieve high performance.

There are many potential future directions of this work. It would be interesting to further investigate new extraction models for improving the accuracy of profile extraction. It would also be interesting to investigate how to determine the actual number of persons $k$ for name disambiguation. Currently, the number is supplied manually, which is not practical for all author names. In addition, extending the topic model with link information (e.g., citation information) or time information is a promising direction.

9. ACKNOWLEDGMENTS

The work is supported by the National Natural Science Foundation of China (90604025, 60703059), Chinese National Key Foundation Research and Development Plan (2007CB310803), and Chinese Young Faculty Research Funding (20070003093). It is also supported by IBM Innovation funding.

10. REFERENCES

[1] L. A. Adamic and E. Adar. How to search a social network. Social Networks, 27:187–203, 2005.

[2] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine Learning, 50:5–43, 2003.

[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.

[4] K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In Proc. of SIGIR’06, pages 43–55, 2006.

[5] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proc. of KDD’04, pages 59–68, 2004.

[6] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In Proc. of WWW’05, pages 463–470, 2005.

[7] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Proc. of NIPS’07, 2007.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[9] D. Brickley and L. Miller. Foaf vocabulary specification. In Namespace Document, http://xmlns.com/foaf/0.1/, September 2004.

[10] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proc. of SIGIR’04, pages 25–32, 2004.

[11] F. Ciravegna. An adaptive algorithm for information extraction from web-related texts. In Proc. of IJCAI’01 Workshop, August 2001.

[12] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

[13] N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the trec-2005 enterprise track. In TREC’05, pages 199–205, 2005.

[14] H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In Proc. of JCDL’04, pages 296–305, 2004.

[15] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. In Proc. of JCDL’05, pages 334–343, 2005.

[16] T. Hofmann. Collaborative filtering via gaussian probabilistic latent semantic analysis. In Proc. of SIGIR’03, pages 259–266, 2003.

[17] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of SIGIR’99, pages 50–57, 1999.

[18] H. Kautz, B. Selman, and M. Shah. Referral web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3):63–65, 1997.

[19] T. Kristjansson, A. Culotta, P. Viola, and A. McCallum. Interactive information extraction with constrained conditional random fields. In Proc. of AAAI’04, 2004.

[20] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML’01, 2001.

[21] A. McCallum. Multi-label text classification with a mixture model trained by em. In Proc. of AAAI’99 Workshop, 1999.

[22] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proc. of KDD’07, pages 500–509, 2007.

[23] T. Minka. Estimating a dirichlet distribution. In Technique Report, http://research.microsoft.com/~minka/papers/dirichlet/, 2003.

[24] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In Proc. of WWW’07, pages 81–90, 2007.

[25] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proc. of UAI’04, 2004.

[26] M. Steyvers, P. Smyth, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of SIGKDD’04, 2004.

[27] Y. F. Tan, M.-Y. Kan, and D. Lee. Search engine driven author disambiguation. In Proc. of JCDL’06, pages 314–315, 2006.

[28] J. Tang, D. Zhang, and L. Yao. Social network extraction of academic researchers. In Proc. of ICDM’07, pages 292–301, 2007.

[29] X. Wei and W. B. Croft. Lda-based document models for ad-hoc retrieval. In Proc. of SIGIR’06, pages 178–185, 2006.

[30] E. Xun, C. Huang, and M. Zhou. A unified statistical model for the identification of english basenp. In Proc. of ACL’00, 2000.

[31] X. Yin, J. Han, and P. Yu. Object distinction: Distinguishing objects with identical names. In Proc. of ICDE’2007, pages 1242–1246, 2007.

[32] K. Yu, G. Guan, and M. Zhou. Resume information extraction with cascaded hybrid model. In Proc. of ACL’05, pages 499–506, 2005.