CoMiner: An Effective Algorithm for Mining Competitors from the Web
Rui Li, Shenghua Bao, Jin Wang, Yong Yu
Department of Computer Science
Shanghai Jiao Tong University
Shanghai 200240, P.R.China
{rli,shhbao,jwang,yyu}@apex.sjtu.edu.cn
Yunbo Cao
Microsoft Research Asia
5F Sigma Center, No. 49 Zhichun Road,
Haidian Beijing, China, 100080
yucao@microsoft.com
Abstract
This paper attempts to accomplish a novel task: mining competitive
information with respect to an entity (such as a company, product, or
person) from the web. An algorithm called “CoMiner” is proposed, which
first extracts a set of comparative candidates for the input entity,
then ranks them according to their comparability, and finally extracts
the competitive fields. The experimental results show that the proposed
algorithm effectively draws a complete picture of the competitive
relations of a given entity.
1. Introduction
Due to the rapid development and relative maturity of the web, data and
information have become extraordinarily abundant. In this paper, we
define the task of mining competitive information, such as finding
products that are similar to a given entity and share its features, as
the “competitor mining” problem. Traditional search engines (e.g.,
Google¹, Yahoo!²) can solve this problem partially, but they require
users to browse thousands of related pages to find useful information
manually. Some services, such as Froogle³, are also available to help
people obtain competitive information about a given product. However,
they are designed to serve a limited domain, and their services rely on
a manually built database.
We propose an algorithm, called CoMiner, which consists of two parts:
Competitor Discovery and Competitive Domain Mining. In the Competitor
Discovery part (Section 3), CoMiner extracts a set of comparative
candidates for the input entity using predefined linguistic patterns,
and then ranks them according to their comparability. Comparability is
defined here as the probability that the two entities are discussed
together by web users. In the Competitive Domain Mining part
(Section 4), CoMiner extracts the competitive domains in which the given
entity and its competitors play against each other by mining the salient
phrases from a set of phrases found on the web.
¹ http://www.google.com/
² http://www.yahoo.com/
³ http://froogle.google.com
Figure 1. An output of the prototype
A prototype system implementing the CoMiner algorithm is depicted in
Figure 1. The main features of the prototype system include:
1. A list of the competitors. The competitors are ranked according to
comparability. The page describing each competitor is supplied too.
2. A set of comparable features, or competitive fields (domains), for
the pair formed by the given entity (e.g., Sony) and one of its
competitors (e.g., Microsoft). The features/domains are represented by
a set of terms.
For evaluation, 70 entities were fed into this prototype system, and
728 competitors and 3640 competitive fields were discovered. A detailed
analysis is given in Section 5.
Proceedings of the Sixth International Conference on Data Mining (ICDM'06)
0-7695-2701-9/06 $20.00 © 2006
Authorized licensed use limited to: Shanghai Jiao Tong University. Downloaded on March 31, 2009 at 05:08 from IEEE Xplore. Restrictions apply.
2. Related work
To the best of our knowledge, CoMiner is the first algorithm for
discovering competitors and their competitive domains by mining web
resources. Competitor discovery is related to the techniques used for
entity extraction and recognition. The pattern-based approach in this
area, first introduced by Hearst [4] to discover is-a (hyponym)
relations from text, has recently been applied successfully in several
systems [3, 2]. Extracting competitive domains from web pages is related
to identifying salient phrases in text mining [5, 6]. However, we use
short snippets, linguistic characteristics, and other different features
to identify the competitive domains.
There is also some work on comparative search and mining. Liu’s work [7]
mines opinions and extracts sentiment from online discussion forums.
Zhai [9] defines a comparative text mining (CTM) problem, which means
discovering the common themes and the specific themes of an existing set
of comparative text collections. Our work, in contrast, uses a web
search engine to obtain a set of comparative data automatically from the
Web. Sun’s comparative search engine [8] collects comparative
information for two given entities. Our work, however, automatically
discovers highly competitive entities instead.
3. Competitor discovery
The objective of this step is to extract and rank the competitors of the
given entity from a set of pages. Our competitor discovery algorithm is
based on the following observation.
Observation: The expressions that indicate a comparative relationship
are diverse, but because of the redundancy of the web we need only a few
common patterns to extract candidates. Uneven co-occurrence also
provides a measure of how close the relationship is: comparative
entities often occur together more frequently.
3.1. Candidate extraction
We define a set of linguistic patterns for retrieving the pages that may
contain information about competitors and for extracting candidate
competitors. The first three kinds of patterns have been used by Hearst
to identify the is-a relationship between the concepts referred to by
two terms. In such patterns, however, the listed terms are usually
competitors of one another. The last two patterns are often used to
compare two entities. EN refers to Entity Name and CN refers to
Competitor Name.
H1: such as EN (, CN)* (or | and) CN
e.g., “brands of tape such as Sony, Phillips, BASF or TDK should be
used.”
H2: especially EN (, CN)* and (CN)
e.g., “especially Sony and Panasonic projectors”
H3: including EN (, CN)* and (CN)
e.g., “Digital Cameras Reviews On Leading Brands Including Sony and
Canon”
C1: CN vs EN | EN vs CN
e.g., “Nintendo vs. Sony”
e.g., “E3 showdown: Sony vs. Microsoft”
C2: EN or CN | CN or EN
e.g., “Sony or JVC camcorder dilemma”
e.g., “work for Samsung or Sony on 43 Things”
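The patterns above can be approximated with regular expressions. The following is a minimal sketch rather than the implementation used in the prototype: the function name `extract_candidates` and the `NAME` expression are hypothetical, and the sketch assumes that a name is a run of capitalized (or numeric) tokens, which the paper does not state.

```python
import re
from collections import Counter

# Assumption: a name is a run of capitalized (or numeric) tokens,
# e.g. "Sony" or "Real Madrid". The paper does not specify this.
NAME = r"[A-Z][\w&-]*(?: [A-Z0-9][\w&-]*)*"


def extract_candidates(text, entity):
    """Count candidate competitor names (CN) matched together with `entity` (EN)."""
    en = re.escape(entity)
    counts = Counter()

    # H1-H3: "<such as|especially|including> EN (, CN)* and|or CN"
    hearst = re.compile(
        rf"(?i:such as|especially|including)\s+{en}"
        rf"((?:\s*,\s*{NAME})*)\s*,?\s*(?:and|or)\s+({NAME})"
    )
    for m in hearst.finditer(text):
        middle = re.findall(NAME, m.group(1))  # the "(, CN)*" part
        counts.update(middle + [m.group(2)])   # plus the final CN

    # C1 and C2: "EN vs CN", "CN vs EN", "EN or CN", "CN or EN"
    link = r"(?:(?i:vs)\.?|or)"
    for rx in (rf"\b{en}\s+{link}\s+({NAME})", rf"({NAME})\s+{link}\s+{en}\b"):
        for m in re.finditer(rx, text):
            counts[m.group(1)] += 1

    counts.pop(entity, None)  # the entity is not its own competitor
    return counts


if __name__ == "__main__":
    snippet = "brands of tape such as Sony, Phillips, BASF or TDK should be used."
    print(extract_candidates(snippet, "Sony"))
```

Applied to each snippet separately, the per-candidate counts can be summed to obtain a match count for every candidate. A snippet such as “such as Sony or TDK” would match both H1 and C2 in this sketch; whether the prototype counts such overlaps once or twice is not stated.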
3.2. Competitor ranking
To evaluate the comparability between each competitor and the given
entity, we first define three ranking features. In the following, a
candidate competitor and the given entity are denoted by $c$ (or $c_i$)
and $e$, respectively.
1. Match count
$$\mathrm{MC}(c, e) = \sum_{p \in P} \mathrm{count}(c, e, p) \tag{1}$$
Here $P$ is the set of predefined patterns and $\mathrm{count}(c, e, p)$
is the number of hits of pattern $p$ for $c$ and $e$, so
$\mathrm{MC}(c, e)$ is the total number of hits over all predefined
patterns.
2. Mutual information
$$\mathrm{PMI}(c_i, e) = \frac{\mathrm{hits}(e, c_i)}{\mathrm{hits}(e)} \tag{2}$$
Point-wise mutual information (PMI) is often used to measure the
co-occurrence of two terms. $\mathrm{hits}(\cdot)$ is the number of
results returned by the search engine.
3. Candidate confidence
$$\mathrm{CC}(c_i) = \frac{\mathrm{MC}(c_i, e)}{f(c_i)} \tag{3}$$
We use candidate confidence (CC) to keep common words such as “company”
and “web” out of our final result. These words have low match counts
but high appearance counts. $f(c_i)$ is the frequency of $c_i$ in our
data set.
Given the features above, we use a single formula to calculate a single
confidence score (CS) for each competitor.
$$\mathrm{CS}(c_i) = w_1 \mathrm{MC}(c_i, e) + w_2 \mathrm{PMI}(c_i, e) + w_3 \mathrm{CC}(c_i) \tag{4}$$
The weights are tuned with a revised hill-climbing algorithm. In our
experiment, $w_1$, $w_2$, and $w_3$ are set to 0.2, 0.6, and 0.2,
respectively.
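As an illustration of how Equations (1) to (4) combine, the following sketch computes the confidence score from precomputed counts. The function names and all numbers in the example are hypothetical. The paper does not say whether the three features are normalized before weighting, so the sketch uses the raw values as an assumption.

```python
def confidence_scores(match_count, pair_hits, entity_hits, frequency,
                      weights=(0.2, 0.6, 0.2)):
    """Combine MC, PMI and CC into the confidence score CS of Eq. (4).

    match_count[c]: MC(c, e), total pattern hits for candidate c
    pair_hits[c]:   hits(e, c), search results containing both e and c
    entity_hits:    hits(e), search results containing e
    frequency[c]:   f(c), frequency of c in the collected data set
    """
    w1, w2, w3 = weights
    scores = {}
    for c, mc in match_count.items():
        pmi = pair_hits[c] / entity_hits  # Eq. (2)
        cc = mc / frequency[c]            # Eq. (3)
        scores[c] = w1 * mc + w2 * pmi + w3 * cc
    return scores


def rank_competitors(scores):
    """Return the candidates ordered by decreasing confidence score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical counts for one entity; these are not values from the paper.
    scores = confidence_scores(
        match_count={"Samsung": 10, "company": 2},
        pair_hits={"Samsung": 500, "company": 900},
        entity_hits=1000,
        frequency={"Samsung": 20, "company": 200},
    )
    print(rank_competitors(scores))
```

In this toy input the common word “company” co-occurs with the entity often, but its low match count and high overall frequency keep its score below that of a genuine competitor, which is the effect the CC feature is meant to produce.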
4. Competitive domain mining
The objective of this step is to extract the competitive domains for
each pair consisting of the given entity and one of its competitors.
4.1. Generating candidate phrases
By our observation, pages that contain the names of both members of a
competitive pair often discuss the competitive domains. Meaningful
domain names are more likely to be noun phrases, or adjective phrases
combined with noun phrases. We therefore collect the top 100 pages
returned for a query containing the given entity name and the competitor
name. Then we use an NLP tool to parse the data and obtain a list of
phrases as the candidate competitive domains.
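The phrase generation step can be sketched as a simple chunker over part-of-speech tagged text. This is only an illustration under stated assumptions: the input is assumed to be already tagged with Penn Treebank style tags (NN*, JJ*), the function name `candidate_phrases` is hypothetical, and the actual NLP tool and chunking rules used in the prototype are not described in the paper.

```python
def candidate_phrases(tagged, exclude=()):
    """Collect maximal "adjective* noun+" runs from (word, tag) pairs.

    tagged:  list of (word, tag) pairs with Penn Treebank style tags
    exclude: names to skip, e.g. the entity and its competitor
    """
    skip = {name.lower() for name in exclude}
    phrases, current, has_noun = [], [], False
    # A sentinel tag flushes the last open phrase.
    for word, tag in list(tagged) + [("", "END")]:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ") and not has_noun:
            current.append(word)  # adjective in front of the noun run
        else:
            if has_noun:
                phrase = " ".join(current).lower()
                if phrase not in skip:
                    phrases.append(phrase)
            # An adjective following a noun run starts a new phrase.
            current = [word] if tag.startswith("JJ") else []
            has_noun = False
    return phrases


if __name__ == "__main__":
    tagged = [("Sony", "NNP"), ("and", "CC"), ("Samsung", "NNP"),
              ("sell", "VBP"), ("digital", "JJ"), ("cameras", "NNS"),
              ("and", "CC"), ("plasma", "NN"), ("tv", "NN")]
    print(candidate_phrases(tagged, exclude=("Sony", "Samsung")))
```

Excluding the two names of the pair is a design choice of this sketch; without it, the entity and competitor names themselves would be returned as noun phrases.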
4.2. Salient phrase ranking
Since competitive domains are more likely to be salient phrases in the
data set, we improve the existing salient phrase ranking method by
adding new features for extracting domains. We denote the current phrase
by $p$, and the collection of returned results for the pair of the given
entity and one of its competitors by $C(e, c_i)$, where $e$ is the given
entity and $c_i$ is one of its competitors.
1. PF: Phrase Frequency. In general, more frequent phrases are more
likely to be good candidates for salient phrases.
$$\mathrm{PF} = \mathrm{Freq}(p) \tag{5}$$
2. DF: Document Frequency, where $D(p)$ is the number of documents in
which $p$ occurs.
$$\mathrm{DF} = D(p) \tag{6}$$
3. PL: Phrase Length. Intuitively, a longer name is more meaningful for
users’ browsing.
$$\mathrm{PL} = \mathrm{length}(p) \tag{7}$$
4. AD: Average Distance. It is calculated from the distance between the
phrase and the given entity or its competitor. $L(\cdot,\cdot)$ is the
minimal distance between two phrases, $d_j$ is the $j$-th document, and
$C(p, e, c_i)$ is the set of documents in which $p$ occurs.
$$\mathrm{AD} = \sum_{d_j \in C(p, e, c_i)} \frac{L(p, e) + L(p, c_i)}{2 \times D(p)} \tag{8}$$
5. ICS: Intra-Cluster Similarity. This feature is used in [5] for
clustering web search results. First we calculate the centroid for each
phrase, and then we calculate the average cosine similarity between the
documents and the centroid.
6. CE: Cluster Entropy [5]. Cluster entropy represents the distinctness
of a phrase. $P(t \mid p, e, c_i)$ is the probability of term $t$
occurring in the documents that contain $p$. DL is the domain candidate
list.
$$P(t \mid p, e, c_i) = \frac{|C(p, e, c_i) \cap C(t, e, c_i)|}{|C(p, e, c_i)|} \tag{9}$$
$$\mathrm{CE} = -\sum_{t \in DL} P(t \mid p, e, c_i) \log P(t \mid p, e, c_i) \tag{10}$$
7. IND: Phrase Independence [1]. A phrase is independent when its left
and right contexts are random enough. $\mathrm{IND}_{l|r}$ is the
independence value for the left or right context, and IND is the average
of $\mathrm{IND}_l$ and $\mathrm{IND}_r$.
$$\mathrm{IND}_{l|r} = -\sum_{t \in l(p)} \frac{F(t)}{\mathrm{PF}} \log \frac{F(t)}{\mathrm{PF}} \tag{11}$$
Here the sum runs over the terms $t$ in the left context $l(p)$ of $p$
(or the right context for $\mathrm{IND}_r$), and $F(t)$ is the frequency
of the context term $t$.
4.3. Linear regression
For domain extraction, we use linear regression to assign a score to
each candidate domain phrase.
$$y = b_0 + \sum_{j=1}^{n} b_j x_j + e \tag{12}$$
where $x$ = (PF, DF, PL, AD, ICS, CE, IND), $n = 7$, and the “residual”
$e$ is a random variable with mean zero. In our experiment, $b_0$ to
$b_7$ are set to -0.350, 0.138, 0.06, 0.229, -0.073, -0.126, 0.103, and
0.187, respectively.
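The scoring step of the regression model can be sketched as follows, using the coefficients reported above and omitting the residual term. The feature values in the example are hypothetical, the first feature is taken to be the phrase frequency (PF) of Section 4.2, and the function names are illustrative only. How the features are scaled before regression is not stated in the paper, so the example simply assumes values between 0 and 1.

```python
FEATURES = ("PF", "DF", "PL", "AD", "ICS", "CE", "IND")
# b0 followed by b1..b7, as reported in the text.
COEFFS = (-0.350, 0.138, 0.06, 0.229, -0.073, -0.126, 0.103, 0.187)


def domain_score(x, coeffs=COEFFS):
    """Score one candidate phrase: y = b0 + sum_j b_j * x_j."""
    b0, *b = coeffs
    return b0 + sum(bj * x[name] for bj, name in zip(b, FEATURES))


def rank_phrases(candidates):
    """Order candidate phrases by decreasing regression score."""
    return sorted(candidates, key=lambda p: domain_score(candidates[p]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical feature vectors; these are not values from the paper.
    candidates = {
        "laptop": {"PF": 0.9, "DF": 0.8, "PL": 0.2, "AD": 0.1,
                   "ICS": 0.5, "CE": 0.3, "IND": 0.7},
        "click here": {"PF": 0.3, "DF": 0.2, "PL": 0.4, "AD": 0.9,
                       "ICS": 0.1, "CE": 0.8, "IND": 0.2},
    }
    print(rank_phrases(candidates))
```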
5. Experiment
In this section, we evaluate the effectiveness of the proposed
algorithm. We intentionally selected 70 entities distributed over
different fields as our test data: organizations (10 companies, 10
universities, and 10 football clubs), brands of different products (30),
products (5), and persons (5 football stars). This selection also makes
it easier for readers to evaluate our experiment. We use the Google API
to collect the informative pages. The top 100 snippet results are
crawled and tagged with a POS tagger⁴.
5.1. Results of CoMiner
Table 1 shows the output of competitor extraction for 18 entities (due
to space limitations, only a subset of our experimental results is
presented). For each entity, we list only the top 5 ranked competitors.
Table 2 shows the competitive domains extracted for competitive pairs.
For each given entity, we show the pairs formed with its top 3
competitors.
⁴ http://www.cs.jhu.edu/~brill/
Entity        Top 5 competitors (in rank order)
Cornell       Stanford, Harvard, Yale, Georgia, Illinois
Zidane        Ronaldinho, Beckham, Ronaldo, Henry, Figo
BMW           Mercedes, Audi, Honda, Car, Mini
Berkeley      San Francisco, America, House, Stanford, Ucla
Lampard       Gerrard, Vieira, Beckham, Ronaldinho, Deco
Benz          Bmw, Audi, Toyota, Honda, Porsche
Rice          Texas, Bush, Michigan, Fish, Illinois
AC Milan      Barcelona, Juventus, Chelsea, Inter Milan, Real Madrid
Nissan        Toyota, Ford, Infiniti, Honda, Mitsubishi
Olympus       Canon, Sony, Nikon, Kodak, Minolta
Liverpool     London, Birmingham, West Ham, Leeds, Newcastle
Audi          BMW, Mercedes, Mitsubishi, Volkswagen, VW
Kodak         Canon, Fuji, Olympus, HP, Nikon
Juventus      AC Milan, Arsenal, Barcelona, Real Madrid, Chelsea
Canon A70     Olympus, Canon A75, Flash, A60, Nikon 3100
Nikon         Canon, Sony, Minolta, Pentax, Epson
Real Madrid   Barcelona, Arsenal, Chelsea, Bayern Munich, Ac Milan
Xbox          PS2, PC, Play Station, 360, Gamecube
Table 1. Competitor Discovery
Entity       Competitor          Competitive domains
Panasonic    Sony                camera, camcorder, digital video
             Samsung             plasma tv, phone, dvd
             Toshiba             DVD Player, Battery, Plasmas TV
Toshiba      Sony                cells, battery, dvd
             HP                  laptop, toner, battery
             Samsung             tv, dvd, dlp
Motorola     Nokia               ringtone, phones, accessory
             Sony                phones, headset, bluetooth
             Intel               wimax, wireless, chip
Converse     Adidas              shoes, mens, basketball shoes
             Nike                shoes, airforce, sneekers
             Jansport            prices, sports, duffel
Gerrard      Lampard             England, world cup, midfield
             Ballack             world cup, England, Germany
             West Ham            Liverpool, fa cup, goals
IBM T40      Dell i600           laptop, battery, LCD
             HP l2000            notebook, battery, widescreen
             Sony vaio           notebook, processor, weight
Samsung      LG                  mobile phone, ring tong, accessory
             Philips             plasma tv, lcd, hdtv
             Nokia               Ringtong, Mobile Phone, Accesory
DELL         HP                  laptop, computer, servers
             Apple               computer, laptop, storage
             IBM                 laptop, computer, parts
Nokia        Motorola            phones, ringtone, accessory
             Sony                phones, ringtone, games
             Siements            game, phones, ringtone
Addidas      Nike                shoes, sports, clothes
             puma                shoes, bape, wholesale
             Reebok              shoes, sports, footwear
Ronaldo      Ronaldinho          Brazil, world cup, soccer
             Zlatan              skills, soccer, videos
             Henry               world cup, soccer, football
Nokia 7270   6260                phone, accessory, games
             Sonyericcson s700   phone, ringtong, prices
             Samsung D500        mobile, audio, handset
Table 2. Results for Competitive Domain Extraction
Rice        Rice University
Texas       University Houston
Bush        Stanford University
Michigan    Purdue
Fish        University Texas
Illinois    Nevada
Table 3. Results with domain information
IBM         IBM + Notebook
Microsoft   Toshiba
Sun         HP
SCO         Dell
Intel       Apple
Apple       Lenovo
Table 4. Results with Domain Limitation
5.2. Case studies
5.2.1 Disambiguation in competitor mining
The first column of Table 3 shows that “Rice” is an ambiguous entity
name. The ambiguity arises because terms such as “fish” and “beans” may
be competitors of “Rice” when “Rice” means the food, whereas “Texas” and
“Michigan” may be competitors of “Rice” when “Rice” means a university.
To disambiguate “Rice”, we additionally provide the term “university”
and apply the model in Section ??; the result is shown in the second
column of Table 3.
The second column of Table 4 shows the results obtained by using
“notebook” as the domain limitation when searching for IBM’s competitors
in the notebook area, compared with the first column, which shows the
search result without a domain limitation.
5.2.2 Time evaluation in competitor mining
Because the content of the web grows dynamically, our system can
discover and reflect changes in an entity’s competitors by mining the
top 100 returned pages at different times. We call the changes in the
ranking list Competitor Evolution. Table 5 shows the results of our
system when mining the competitors of the entity “Arsenal” on March 15
and on May 28. “Barcelona” bursts into the second list because the
European Champions League final was played between the two teams on
May 18 and attracted enormous attention, which is reflected in the news
on the web.
Arsenal @ 15/3, 2006    Arsenal @ 28/5, 2006
Chelsea                 Barcelona
Liverpool               Chelsea
Manchester United       liverpoll
Aston Villa             France
Real Madrid             Manchester United
Table 5. Competitor Evolution
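The comparison of two ranking lists can be made explicit with a small sketch. The function name `competitor_evolution` is hypothetical, and the example assumes that “liverpoll” in the second list of Table 5 is a misspelling of “Liverpool”; the paper does not describe how the prototype reports these changes.

```python
def competitor_evolution(before, after):
    """Compare two ranked competitor lists mined at different times.

    Returns (new_entries, dropped, shifts), where shifts maps each name
    present in both lists to old_rank - new_rank (positive = moved up).
    """
    old_rank = {name: rank for rank, name in enumerate(before, start=1)}
    new_entries = [name for name in after if name not in old_rank]
    dropped = [name for name in before if name not in after]
    shifts = {name: old_rank[name] - rank
              for rank, name in enumerate(after, start=1)
              if name in old_rank}
    return new_entries, dropped, shifts


if __name__ == "__main__":
    # Lists from Table 5; "liverpoll" is normalized to "Liverpool" (assumption).
    march = ["Chelsea", "Liverpool", "Manchester United",
             "Aston Villa", "Real Madrid"]
    may = ["Barcelona", "Chelsea", "Liverpool", "France",
           "Manchester United"]
    new_entries, dropped, shifts = competitor_evolution(march, may)
    print("new:", new_entries)
    print("dropped:", dropped)
    print("shifts:", shifts)
```

For the two lists of Table 5, “Barcelona” and “France” are new entries, which is consistent with the explanation given above for the burst of “Barcelona”.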
6. Conclusion
In this paper, we studied the problem of competitor mining from the web.
We presented our observations on the distribution of competitors and
domains in the unrestricted web. The experimental results show that the
proposed algorithm is highly effective. In future work, we plan to
evaluate CoMiner in more domains and then improve it to mine more
competitive information from the web.
References
[1] L. F. Chien. Pat-tree based adaptive keyphrase extraction for
intelligent Chinese information retrieval. In SIGIR’97, 1997.
[2] P. Cimiano, S. Handschuh, and S. Staab. Towards the self-annotating
web. In Proceedings of WWW-04, 2004.
[3] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked,
S. Soderland, and S. Weld. Web-scale information extraction in
KnowItAll. In Proceedings of WWW-04, 2004.
[4] M. Hearst. Automatic acquisition of hyponyms from large text
corpora. In Proceedings of the 14th International Conference on
Computational Linguistics, 1992.
[5] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma. Learning to
cluster web search results. In SIGIR’04, 2004.
[6] B. Liu and C. Chin. Mining topic-specific concepts and definitions
on the web. In Proceedings of WWW-03, 2003.
[7] S. Morinaga, K. Yamanishi, K. Tateishi, and T. Fukushima. Mining
product reputations on the web. In KDD-02, 2002.
[8] J.-T. Sun, X. Wang, D. Shen, H.-J. Zeng, and Z. Chen. CWS: A
comparative web search system. In Proceedings of WWW-06, 2006.
[9] C. Zhai, A. Velivelli, and B. Yu. A cross-collection mixture model
for comparative text mining. In Proceedings of KDD’04, 2004.