Link detection results and analysis
ABSTRACT This paper describes the two Story Link Detection systems Carnegie Mellon University (CMU) developed, and examines why their performance on the evaluation data was considerably worse than expected while performance on an alternate evaluation set matched the performance on the training data.
Carnegie Mellon University
Research Showcase
Institute for Software Research
School of Computer Science
1-1-1999
Link Detection – Results and Analysis
Ralf D. Brown
Carnegie Mellon University
Thomas Pierce
Carnegie Mellon University
Yiming Yang
Carnegie Mellon University
Jaime G. Carbonell
Carnegie Mellon University, jgc@cs.cmu.edu
This Working Paper is brought to you for free and open access by the School of Computer Science at Research Showcase. It has been accepted for
inclusion in Institute for Software Research by an authorized administrator of Research Showcase. For more information, please contact
research-showcase@andrew.cmu.edu.
Recommended Citation
Brown, Ralf D.; Pierce, Thomas; Yang, Yiming; and Carbonell, Jaime G., "Link Detection – Results and Analysis" (1999). Institute for
Software Research. Paper 464.
http://repository.cmu.edu/isr/464
Link Detection – Results and Analysis
Ralf D. Brown, Thomas Pierce, Yiming Yang, Jaime G. Carbonell
Language Technologies Institute
Carnegie Mellon University (CMU)
Pittsburgh, PA 15213, USA
{ralf,tomp,yiming,jgc}@cs.cmu.edu
ABSTRACT
This paper describes the two Story Link Detection systems Carnegie
Mellon University (CMU) developed, and examines why their performance
on the evaluation data was considerably worse than expected
while performance on an alternate evaluation set matched the
performance on the training data.
1. Introduction
Carnegie Mellon University submitted two systems to the official
TDT-3 evaluation for the Story Link Detection (SLD) task. The
two systems were independent implementations of essentially the
same method with differing attempted enhancements. Development
of multiple systems was enabled by a common code library for loading
TDT story collections and processing the test data file; the only
code which must be implemented for each new SLD system is the
actual similarity/confidence computation. This common library is
an outgrowth of the DTREE topic tracker from the TDT-2 project
[1, 4].
These systems were run on three distinct data sets. The first (“dry
run”) consisted of story pairs selected from the six months of news
stories originally collected for the TDT-2 project in 1998; the second
(“evaluation”) consisted of pairs selected from an additional three
months of previously unseen news stories; and the third (“alternate”)
consisted of additional pairs selected from the same three months of
new data, and was provided in response to the dismal performance of
all submitted systems on the evaluation set. Both the dry run and
alternate sets selected their story pairs from among those stories
which had received event labels for the tracking task, while the
evaluation set contained 120 candidate matches for each of 180 seed
stories selected at random.
2. System Descriptions
Both of the systems submitted for the TDT evaluation used the common
library to load the story collections and additional common
code to provide those portions of the test and control mechanism
for the story link detection task which are independent of the actual
similarity determination. Because of this common code, the two systems
also have many of the same capabilities, such as stop-wording
and table-based stemming of the stories as they are loaded, and the
ability to automatically select the optimum thresholds for declaring
a pair of stories to be linked.
Note the use of the word “thresholds” in the previous sentence. The
decision threshold is actually a split threshold, with different values
depending on whether or not the two stories come from the same
source; this permits a laxer threshold when the two stories are from
sources which may have different styles. For the purposes of selecting
which threshold to use, multiple TDT sources may be treated as
a single source; training on the dry run data, it was determined that
performance of our first system was best (for the default deferral
of ten source files) if the New York Times and AP newswire were
treated as one source and all other TDT sources combined into a
second source.
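The split-threshold decision described above can be sketched as follows. This is only an illustration: the source codes ("NYT", "APW"), the group names, and the threshold values in the usage note are hypothetical, not the settings used in the evaluation.

```python
# Hypothetical grouping of TDT sources into two combined sources, following
# the text: New York Times and AP newswire together, everything else together.
# The source codes used as keys are assumptions made for this sketch.
SOURCE_GROUP = {"NYT": "newswire", "APW": "newswire"}
DEFAULT_GROUP = "other"


def source_group(source):
    """Map a TDT source code to its combined source."""
    return SOURCE_GROUP.get(source, DEFAULT_GROUP)


def is_linked(similarity, source_a, source_b,
              same_source_threshold, cross_source_threshold):
    """Return True (YES) if the pair's similarity is above the threshold
    that applies to this combination of sources."""
    same = source_group(source_a) == source_group(source_b)
    threshold = same_source_threshold if same else cross_source_threshold
    return similarity > threshold
```

With a laxer cross-source threshold, for example `is_linked(0.07, "NYT", "CNN", 0.08, 0.06)`, a pair from different combined sources can be declared linked at a score that would be rejected for two stories from the same combined source.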
The first of our systems, identified as CMU-1 in the evaluation
and using the system identifier COSINE, uses incremental TF*IDF-weighted
cosine similarity measures to determine whether or not two
documents discuss the same topic. The stop-worded news stories are
converted into binary term vectors (any nonzero number of occurrences
reduced to 1) which are then weighted by the TF*IDF value
of each term. To decide whether two stories are linked, the cosine
similarity measure (the normalized inner, or dot, product) of the two
corresponding term vectors is computed, and a YES decision is output
if the similarity is above a predetermined threshold.
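A minimal sketch of this computation is given below. It assumes the per-term weights are available as a plain dictionary (`term_weight`, a hypothetical name) and represents each story as a sparse dictionary; it is not the COSINE implementation itself.

```python
import math


def weighted_binary_vector(tokens, term_weight):
    """Build a sparse term vector: each distinct term counts once (binary),
    then is scaled by its precomputed weight. Unknown terms are skipped."""
    return {t: term_weight[t] for t in set(tokens) if t in term_weight}


def cosine(u, v):
    """Normalized inner (dot) product of two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)
```

Because the vectors are binary before weighting, repeating a word within a story does not change its score in this sketch; only the set of shared terms and their weights matter.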
For the evaluation, the TF*IDF values for COSINE were initialized
from the complete collection of English documents in the dry-run
data set which was available for training. As the evaluation data was
processed, the TF*IDF values were incrementally updated to adapt
to changing patterns of use over time. The COSINE system can
additionally apply a time-based decay to the similarity score, making
temporally distant story pairs less likely to be declared linked, but
this feature was not used as it was determined to be detrimental to
the cost measure during early testing.
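One way to maintain incrementally updated statistics is sketched below. The class name and the smoothed IDF formula are assumptions chosen for illustration; the paper does not give the exact formula used by COSINE.

```python
import math
from collections import Counter


class IncrementalIdf:
    """Document-frequency table that is updated as each story arrives."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()

    def add_document(self, tokens):
        """Count each distinct term of a new story once."""
        self.doc_count += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        """Smoothed inverse document frequency (assumed form, defined even
        for terms that have not been seen yet)."""
        return math.log((self.doc_count + 1.0) / (self.doc_freq[term] + 0.5))
```

Initializing from a training corpus then amounts to calling `add_document` on every training story before the first test story is processed; deriving the statistics solely from the test stories starts from an empty table instead.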
The second system, identified as CMU-2 in the evaluation, is also
based on weighted cosine similarity measures, though with different
weighting and thresholds. Unlike the COSINE system, the logarithm
of the term frequency was used (‘ltc’ rather than ‘ntc’ in the
terminology of the SMART document-retrieval system [3]), and the
TF*IDF statistics were derived solely from the test stories as they
were processed, rather than having been initialized from the six-month
training corpus. The CMU-2 system additionally contains
a probabilistic modeler, which was disabled for the evaluation.
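The difference between the two SMART schemes can be illustrated with a short sketch. It uses the conventional reading of the codes (n: raw term frequency, l: 1 + log of the term frequency, t: log(N/df), c: cosine normalization); the function and parameter names are hypothetical, and the details of either CMU system may differ.

```python
import math
from collections import Counter


def smart_weights(tokens, doc_freq, n_docs, tf_scheme):
    """Return a unit-length weight vector under SMART 'ntc' (tf_scheme='n')
    or 'ltc' (tf_scheme='l'). Terms with no document frequency are skipped."""
    if tf_scheme not in ("n", "l"):
        raise ValueError("tf_scheme must be 'n' or 'l'")
    raw = {}
    for term, tf in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue
        tf_part = float(tf) if tf_scheme == "n" else 1.0 + math.log(tf)
        raw[term] = tf_part * math.log(n_docs / df)  # 't': idf component
    norm = math.sqrt(sum(w * w for w in raw.values()))  # 'c': cosine norm
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in raw.items()}
```

The logarithm damps the influence of heavily repeated terms: a word occurring eight times outweighs a word occurring once by a factor of eight under 'ntc', but only by about three under 'ltc' when both have the same document frequency.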
3. Performance
Table 1 lists the results of the seven runs CMU submitted to the
December 1999 evaluation, showing the normalized cost measure
$C_{link}$ for each run. While the CMU-2 system performed worse than
it had on the training data, it still did much better than CMU-1,
which had been designated CMU’s official system for the evaluation.
In fact, CMU-1 had a cost measure worse than the strawman
strategy of never indicating that stories are linked (“Just Say No”).
The majority of this difference in performance is due to the different
decision thresholds selected for the two systems: CMU-2 used
much higher thresholds, which proved to be quite close to the optimum
for the evaluation data. Tuning the CMU-1 system on the
evaluation data yields a normalized $C_{link}$ near 0.58, almost exactly
a factor of two improvement.

System   Transcription   Deferral   Norm($C_{link}$)
CMU-1    ASR             1          1.1260
CMU-1    ASR             10         1.0943
CMU-1    ASR             100        1.0921
CMU-1    manual          1          1.1477
CMU-1    manual          10         1.1657
CMU-1    manual          100        1.0974
CMU-2    ASR             10         0.4667
Table 1: Official Evaluation Results
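The paper does not spell out how the normalized cost is computed. The sketch below assumes the usual TDT detection-cost form, a weighted sum of miss and false-alarm probabilities divided by the cost of the better of the two trivial strategies. The default parameter values are illustrative placeholders and are not confirmed as the settings of this evaluation.

```python
def normalized_link_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost under the assumed TDT-style formula.

    c_miss, c_fa and p_target are placeholder values for illustration only.
    """
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Cost of the cheaper trivial system: always NO or always YES.
    baseline = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / baseline
```

Under these assumptions a system that never declares a link misses every linked pair and raises no false alarms, so `normalized_link_cost(1.0, 0.0)` is 1.0; any value above 1.0, as in the CMU-1 rows of Table 1, is therefore worse than the “Just Say No” strawman.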
The current (post-evaluation) best performance of the CMU-1 system
with a ten-file deferral period produces a normalized $C_{link}$ of
0.1399, 1.1320, and 0.1392 for the dry-run, evaluation, and alternate
data sets when tuned on the dry-run data set. Even though the performance
on the training data is now slightly better than it had been
at the time of the official evaluation, performance on the evaluation
data is worse because the selected thresholds have shifted even further
away from the optimum for the evaluation data. The CMU-2
system currently achieves cost values of 0.1267, 1.2867, and 0.1269
for the dry-run, evaluation, and alternate data sets.
Figure 1 shows how misses and false alarms made by the CMU-1
system may be traded against one another by varying the threshold
on the similarity measure for each of the three test sets. Figure 2
plots the equivalent Detection-Error Tradeoff (DET) curves for the
CMU-2 system. On both SLD systems, the December 1999 evaluation
data yields a DET curve which differs markedly from those for the
other two data sets; this will be examined
further in the next section.
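The points of such a tradeoff curve can be obtained by sweeping the decision threshold over scored pairs, as in the sketch below. The function name is hypothetical, and the rates here are pooled over all pairs, whereas the figures report topic-weighted performance.

```python
def det_points(scores, labels, thresholds):
    """For each threshold, return (threshold, miss rate, false-alarm rate).

    labels[i] is True when pair i is truly linked; a pair is declared
    linked when its score is above the threshold.
    """
    n_linked = sum(1 for y in labels if y)
    n_unlinked = len(labels) - n_linked
    if n_linked == 0 or n_unlinked == 0:
        raise ValueError("need at least one linked and one unlinked pair")
    points = []
    for th in thresholds:
        misses = sum(1 for s, y in zip(scores, labels) if y and s <= th)
        false_alarms = sum(1 for s, y in zip(scores, labels) if not y and s > th)
        points.append((th, misses / n_linked, false_alarms / n_unlinked))
    return points
```

Raising the threshold moves along the curve toward more misses and fewer false alarms; the threshold selected during tuning is simply the point on this curve with the lowest cost on the tuning data.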
Another way to present the performance is with the $F_1$ measure
which is commonly used in the broader information-retrieval community.
$F_1$ is defined as $2pr/(p+r)$, where $p$ is precision (proportion
of retrieved documents which should have been retrieved) and $r$
is recall (proportion of documents which should have been retrieved
that actually were retrieved). When tuned for $F_1$ on the dry-run data
set, CMU-1 currently achieves micro-averaged $F_1$ values of 0.92,
0.56, and 0.93 on the dry-run, evaluation, and alternate data sets:
very good except on the evaluation set. Even when tuned on the
evaluation data, $F_1$ for the evaluation set is only 0.70.
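As a worked illustration of the definition, the sketch below computes $F_1$ from precision and recall and shows micro-averaging, in which the decision counts are pooled before precision and recall are computed. The tuple layout (true positives, false positives, false negatives) is an assumption of this example.

```python
def f1(precision, recall):
    """F1 = 2pr / (p + r), defined as 0 when both p and r are 0."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def micro_f1(counts):
    """Micro-averaged F1 over (true_pos, false_pos, false_neg) tuples:
    pool the counts first, then compute precision and recall once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1(precision, recall)
```

For instance, pooling the counts (8, 2, 0) and (1, 0, 9) gives precision 9/11 and recall 9/18, so the micro-averaged $F_1$ is 18/29, about 0.62.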
4. What Went Wrong?
Cross-validation on the training data using the COSINE system led
us to expect that the cost measure would be 20 to 30% higher on
the evaluation data than on the training data, yet the CMU-1 system
had a cost some six times as high and the CMU-2 system had a cost
measure more than twice as high. Why did the two systems fare
so much more poorly on the evaluation data than on the training
data? The simple answer seems to be that the evaluation data is
dramatically different from (and “harder” than) the training data.
The dry-run training data (as well as the alternate test set created
after the December 1999 evaluation) was generated by using the
event-labeled stories in the English portion of the collection and
associating 120 random other labeled stories with one story from each
event. Of the randomly-selected stories, those which had the same
label as the initial story were considered linked, while those which
did not were considered not linked.
In contrast, the evaluation data was not limited to the subset of the
collection which had been labeled. However, it is not clear a priori
whether restricting the test to labeled documents helps or hurts
performance. Using only labeled stories may make the decision easier,
since many confounding stories would not be included in the data
set, e.g. bombings other than those for which labels have been
assigned. On the other hand, similar labeled events which should not
be considered linked will make up a larger proportion of the reduced
data set, increasing their impact.
Figure 3 compares the distribution of similarity scores computed by
the CMU-1 system on each of the three data sets. The score for each
story pair was placed into one of 1000 bins by truncating the score to
3 decimal places, and the number of elements in each bin was then
plotted. The Y axis of each plot has been truncated somewhat to
better illustrate the behavior in the region 0.1 to 0.2. As is clear from
the figure, the distributions of scores are very similar between the
dry run (top) and alternate (bottom) data, and quite different for the
evaluation data (middle). Note the local minimum near 0.1, followed
by a local maximum at 0.2 in the top and bottom graphs; this may be
an indication of two well-separated Gaussian distributions for linked
and non-linked story pairs.
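The binning step can be sketched as follows. The function name is hypothetical, and a score of exactly 1.0 is clamped into the last bin, which is a choice made for this example rather than something stated in the paper.

```python
from collections import Counter


def score_histogram(scores, n_bins=1000):
    """Count similarity scores in [0, 1] per bin by truncation, so that
    with 1000 bins a score such as 0.1234 falls into bin 123."""
    # Floating-point truncation can differ from decimal truncation for
    # scores lying exactly on a bin edge; that is ignored in this sketch.
    counts = Counter(min(int(s * n_bins), n_bins - 1) for s in scores)
    return [counts.get(i, 0) for i in range(n_bins)]
```

Plotting the returned list against the bin index divided by 1000 gives a frequency plot of the kind shown in Figure 3.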
Further indications that the evaluation data is qualitatively different
from the training data are given by the dramatically higher optimal
thresholds (0.18-0.22 versus 0.065-0.075 for CMU-1) and the differing
effects on $C_{link}$ of various parameter settings for those data
sets. Thus, disabling TF*IDF weighting in the CMU-1 system substantially
(40% or more) increases $C_{link}$ for the dry run and alternate
sets, yet moderately decreases the optimum $C_{link}$ for the evaluation
set, from 0.58 to about 0.50. Similarly, applying a time-based decay
to the similarity score worsens $C_{link}$ on the dry run data, but improves
performance on the evaluation data, even when the system
is first tuned on the very data on which it is to be tested. The latter
effect indicates that the events in the evaluation data are much more
temporally focused than the events in either of the other sets.
5. Conclusions and Future Work
Although both Carnegie Mellon SLD systems performed very well
on the training and alternate evaluation data sets, performance on the
official December 1999 evaluation data leaves much to be desired.
In light of these results, it is clearly imperative to find a similarity
measure which is less affected by differences in the data sets.
The common code for processing the test file already supports the
use of multiple decision strategies and a variety of methods for combining
their outputs into a single decision (majority vote, weighted
votes, all-but-one, etc.). By coding additional similarity measures
which make independent errors, such a multi-strategy SLD system
promises better overall performance than any one of its component
strategies. Combining independent decisions has proven to be beneficial
in speech recognition (the ROVER system [2]) and in CMU’s
own investigations on improving performance on the tracking task
(described elsewhere in this volume).
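The combination methods named above might look like the following sketch. The function names are hypothetical, and reading "all-but-one" as "at most one strategy dissents" is an assumption; the common code may define these rules differently.

```python
def majority_vote(votes):
    """YES if strictly more than half of the strategies say YES."""
    return 2 * sum(1 for v in votes if v) > len(votes)


def weighted_vote(votes, weights):
    """YES if the YES votes carry more than half of the total weight."""
    yes_weight = sum(w for v, w in zip(votes, weights) if v)
    return 2 * yes_weight > sum(weights)


def all_but_one(votes):
    """YES if at most one strategy says NO (assumed interpretation)."""
    return sum(1 for v in votes if v) >= len(votes) - 1
```

For three strategies voting `[True, True, False]`, both `majority_vote` and `all_but_one` return a YES decision, while a single confident strategy can prevail under `weighted_vote` if its weight exceeds the combined weight of the others.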
An obvious extension to the existing split threshold would be to use
a different threshold for each possible combination of news sources
(those “sources” possibly encompassing multiple TDT sources). It is
not clear whether there would be sufficient training material to accurately
set the pairwise thresholds for more than three or four sources,
even if the problem of dramatically different optimal thresholds between
training and test sets were not an issue.

Figure 1: Performance Variation by Data Set: CMU-1 System
[DET plot, "CMU-1 Topic-Weighted Performance (tuned on Dry Run)"; axes: Miss probability (in %) and False Alarms probability (in %); curves: Random Performance, Dry Run Data, Evaluation Data, Alternate Data.]

Figure 2: Performance Variation by Data Set: CMU-2 System
[DET plot, "CMU-2 Topic-Weighted Performance"; axes: Miss probability (in %) and False Alarms probability (in %); curves: Random Performance, Dry Run Data, Evaluation Data, Alternate Data.]

Figure 3: Comparing the Distributions of Similarity Scores
[Three panels, "Distribution of Similarity Scores (Dry Run)", "Distribution of Similarity Scores (Evaluation Data)", and "Distribution of Similarity Scores (Alternate Data)"; axes: Frequency and Similarity Score (0 to 1).]
6. Acknowledgements
The authors would like to thank Tom Ault for his contributions to
our Story Link Detection efforts.
References
1. Jaime Carbonell, Yiming Yang, John Lafferty, Ralf D. Brown,
Tom Pierce, and Xin Liu. CMU report on TDT-2: Segmentation,
Detection and Tracking. In Proceedings of the DARPA
Broadcast News Workshop, pages 117–120, San Francisco, CA,
1999. Morgan Kaufmann Publishers, Inc.
2. Jonathan G. Fiscus. A Post-Processing System to Yield Reduced
Word Error Rates: Recognizer Output Voting Error Reduction
(ROVER). In Proceedings of the 1997 IEEE ASRU
Workshop, pages 347–354, December 1997.
3. G. Salton. Automatic Text Processing: The Transformation,
Analysis, and Retrieval of Information by Computer. Addison-Wesley,
Reading, Massachusetts, 1989.
4. Yiming Yang, Jaime Carbonell, Ralf D. Brown, Tom Pierce,
Brian T. Archibald, and Xin Liu. Learning Approaches for Detecting
and Tracking News Events. IEEE Intelligent Systems,
14(4):32–43, July/August 1999. Special Issue on Applications
of Intelligent Information Retrieval.