WORDij: A word-pair approach to information retrieval



James A. Danowski
University of Illinois at Chicago
WORDij is a system based on a linkage, or network, model for representing textual information. The fundamental unit of analysis is the word pair, or bi-gram phrase, rather than the individual term. WORDij also takes a local approach to term cooccurrence. Systems such as SMART historically used the entire document as the field within which to define term cooccurrence. More recent research has suggested that defining cooccurrence within smaller text units such as paragraphs may be better [Salton & Buckley 91]. WORDij is even more local in focus: it defines cooccurrence of terms within three word positions (after dropping stop words). In addition, WORDij uses direct and indirect pair information to compute shortest paths among words in retrieved documents. This counts both direct and indirect matches between queries and documents.
Consider a query Q containing the phrase {t1, t3} and a document D containing the phrases {t1, t2} and {t2, t3}, but not the phrase {t1, t3}. Existing algorithms [Salton & Buckley 91; Croft, Turtle & Lewis 91; Fagan 89] would not consider the dependency between t1 and t3, as there is no match for the phrase. However, tree-dependency models [van Rijsbergen 77; Yu, Buckley, Lam & Salton 83] recognize such indirect dependencies and produce a formula to compute the degree of dependency between t1 and t3. The WORDij approach considers not only the direct phrases but also the indirect phrases.
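The three-word-window pairing can be sketched as follows. This is my minimal Python reading of the description, not the original SPITBOL/PIPELINE implementation, and the stop list is a toy stand-in for the full list the paper describes:

```python
# Toy stop list; the paper uses a 631-word list (see below).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}

def word_pairs(text, window=3):
    """Pairs of words that fall within `window` positions of one
    another after stop words are dropped."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    pairs = set()
    for i, w in enumerate(words):
        # pair w with every word up to `window` positions ahead
        for j in range(i + 1, min(i + 1 + window, len(words))):
            pairs.add((w, words[j]))
    return pairs

pairs = word_pairs("the airbus consortium received government aid and subsidies")
print(("government", "aid") in pairs, ("airbus", "subsidies") in pairs)  # True False
```

After stop-word removal, "airbus" and "subsidies" sit five positions apart, so they form no pair under the three-position constraint.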
TREC work was begun using a network of Sun workstations in the Database and Information Systems Laboratory in the Electrical Engineering and Computer Science Department at the University of Illinois at Chicago. Because the lead research assistant, Nainesh Khimasia, died during the project, software development using C and Unix tools was impeded. Earlier generations of tools had been optimized for an IBM mainframe computer, so work was switched to that platform. The machine used was an IBM 3090/3001 running VM/XA CMS. A virtual machine of 16 megabytes was used, along with three gigabytes of disk space. The CPU cycle time is rated at 14.5 nanoseconds, or 69 MHz.
We modified earlier generations of WORDij software written in SPITBOL [Danowski 82; Danowski & Andrews 85]. These modifications consisted mainly of replacing some SPITBOL code where possible with CMS PIPELINE code, because it runs approximately one thousand times faster. The *.Z text files were uncompressed using a compress utility on CMS that works with Unix-compressed files. WORDij code was run on each uncompressed text file, generating an inverted file of word pairs by document identification numbers. All word pairs occurring only once in each document were dropped to save disk space.
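The inverted file of word pairs keyed by document identification number, with singleton pairs dropped, can be sketched as follows. This is an illustrative Python reading of the description, not the CMS PIPELINE code; the `docs` input is assumed to be already stop-filtered:

```python
from collections import Counter, defaultdict

def build_pair_index(docs, window=3):
    """Inverted file mapping each word pair to {doc_id: frequency}.
    Pairs occurring only once in a document are dropped, as in the
    paper, to save disk space. `docs` maps doc ids to word lists."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        counts = Counter()
        for i, w in enumerate(words):
            for j in range(i + 1, min(i + 1 + window, len(words))):
                counts[(w, words[j])] += 1
        for pair, n in counts.items():
            if n > 1:                 # drop singleton pairs
                index[pair][doc_id] = n
    return dict(index)

docs = {"d1": ["airbus", "aid", "airbus", "aid"], "d2": ["airbus", "aid"]}
print(build_pair_index(docs))  # {('airbus', 'aid'): {'d1': 3}}
```

Note that d2's single occurrence of the pair is dropped entirely, a space saving whose retrieval cost is examined in the "All Pairs" tests below.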
No spell checking, stemming, morphological analysis, parsing, or tokenizing was done. A stop list of 631 words was used, comprising the 570 stop words in SMART v. 10 and some additional stop words drawn from the markup format of the raw text. Processing time to create the word pair index averaged three minutes per file.
Ad hoc queries were automatically processed in the same way as raw documents, except that no singleton pairs were dropped. Query text used to generate word pairs for matching included all text provided, except the factors and definitions, and concepts numbered higher than two. Total CPU time to build a query averaged .26 seconds. For the ad hoc queries, nothing further was done to them, either automatically or manually.
For the routing topics, queries were also constructed automatically, but in a different way. The training sets of relevant and irrelevant documents were separately analyzed to identify all word pairs that occurred in the relevant set but not in the irrelevant set. These unique relevant word pairs were used as routing queries.
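The routing-query construction reduces to a set difference over the training pairs. A minimal sketch, assuming each training document has already been reduced to its set of word pairs:

```python
def routing_query(relevant_pair_sets, irrelevant_pair_sets):
    """Routing query = word pairs seen anywhere in the relevant
    training documents but nowhere in the irrelevant ones."""
    relevant = set().union(*relevant_pair_sets)
    irrelevant = set().union(*irrelevant_pair_sets)
    return relevant - irrelevant

relevant_docs = [{("airbus", "aid"), ("rail", "strike")}, {("airbus", "aid")}]
irrelevant_docs = [{("rail", "strike")}]
print(routing_query(relevant_docs, irrelevant_docs))  # {('airbus', 'aid')}
```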
PIPELINE matching of the query pairs against the pair
files for each text file executed in approximately 16
milliseconds per file per 100 sets of query pairs. This meant
that to run all 100 queries against the entire collection took
approximately five hours of PIPELINE processing on the
word pair index files, or three minutes per query.
Time constraints precluded completing a word and word-pair by document count on the entire collection for inverse document frequency or entropy weighting of words and word pairs. Retrieved documents were ranked from 1 to 200 by counting the number of matching pairs each document had with the query. Frequency of pair occurrence in documents was not used for weighting, except to break ties at the 200-document rank threshold.
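The ranking rule described above can be sketched as follows; `rank_documents` and its index layout are my own illustrative names, not the PIPELINE implementation:

```python
from collections import Counter

def rank_documents(query_pairs, index, top_n=200):
    """Rank documents by the number of distinct query pairs they match.
    Pair frequency serves only as a secondary sort key, mirroring the
    paper's tie-breaking at the 200-document threshold. `index` maps
    each word pair to {doc_id: within-document frequency}."""
    matches, freqs = Counter(), Counter()
    for pair in query_pairs:
        for doc_id, n in index.get(pair, {}).items():
            matches[doc_id] += 1   # one more matching pair
            freqs[doc_id] += n     # frequency, used only to break ties
    ranked = sorted(matches, key=lambda d: (matches[d], freqs[d]), reverse=True)
    return ranked[:top_n]

index = {("airbus", "aid"): {"d1": 3, "d2": 2},
         ("airbus", "subsidies"): {"d1": 1}}
print(rank_documents([("airbus", "aid"), ("airbus", "subsidies")], index))  # ['d1', 'd2']
```

Document d1 matches two query pairs and d2 only one, so d1 ranks first regardless of raw frequencies.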
Time limitations also prevented full implementation of
the indirect matching process. Only directly matching pairs
were used for the main analysis to produce the results.
Indirect matching was, however, later tested. This will be
described after presentation of the basic results.
WORDij results were greater than or equal to the
median levels of performance for seven topics. Our results
were within one standard deviation on 55 topics, and within
two standard deviations on 82 topics. Performance was
significantly lower than the median for 14 topics, as judged
by counting topics whose results were greater than two
standard deviations below the median. Table 1 lists the topics in two categories: those that were better than or equal to the median, and those that were significantly below the median.
Failure Analysis
Query Style.
Several kinds of failure analysis were performed. To investigate whether stylistic features of queries were associated with performance, we computed the following variables for each query using the shareware program, PC-
Number of sentences
Number of words
Words per sentence
Percentage of long words
Percentage of personal words
Percentage of action verbs
Average number of syllables per word
Reading grade level

Table 1: Topic Results Ordered by Performance
TOPIC  Difference (median - result)
Better than or Equal to Median
66   -.08980   Natural Language Processing
29   -.04540   OS/2 Problems
94   -.03180   Computer-aided Crime
95   -.00800   Computer-aided Crime Detection
18    .00000   Global Stock Market Trends
44    .00000   What Makes CASE Succeed or Fail
88    .00000   Crude Oil Price Trends
100   .00000   Controlling High Tech Transfer
50    .00250   Virtual Reality Military Apps.
Significantly Below Median (Failures)
22    .19590   Legal Repercus.-Agrochemicals
58    .20740   Rail Strikes
37    .21290   Role of Minis and Mainframes
20    .21770   Superconductors
77    .23290   Poaching
17    .24350   Japanese Stock Market Trends
93    .24560   What Backing Does the NRA Have
13    .24780   Drug Approval
54    .26840   Satellite Launch Contracts
51    .29490   Airbus Subsidies
10    .33340   Space Program
70    .35440   Surrogate Motherhood
78    .38240   Greenpeace
21    .48710   Counternarcotics
These variables were correlated with a criterion variable, which was the difference between the median and our result. For each query, we subtracted our obtained result from the median result on the 11-point recall-precision averages contained in the official results across systems for the test queries 51-100. Table 2 displays these correlations. None of them is statistically significant at the .01 level. A second criterion variable was created to represent whether the query was in the "failed" category, greater than two standard deviations below the median. A dummy variable was created for each query, using zero to represent success and one to represent failure. Correlations of the style variables were also computed with the failure criterion. No correlations were significant at the .01 level. This suggests that query length, complexity, and other stylistic variables are unrelated to retrieval performance.
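The correlations reported here are ordinary Pearson coefficients; with the 0/1 failure dummy as one variable, this reduces to the point-biserial correlation. A minimal sketch with hypothetical data (the values below are illustrative, not from the study):

```python
import statistics

def pearson_r(xs, ys):
    """Plain Pearson correlation; with a 0/1 failure dummy as ys this
    reduces to the point-biserial correlation used in failure analysis."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

words_per_query = [10, 25, 18, 40]   # hypothetical style measure
failed = [0, 0, 1, 1]                # 1 = failure (> 2 SD below median)
print(round(pearson_r(words_per_query, failed), 4))
```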
Table 2: Query style & performance correlations
Variable              Difference   Failure
Sentences               -.1053     -.1102
Words                   -.1139     -.1616
Words/sentence           .0736     -.0004
Long words              -.1574     -.1086
Personal words           .0846     -.0046
Action words             .1192      .2690
Syllables/word          -.1055     -.0763
Reading grade level     -.0256     -.0629
Query Words.
Additional failure analysis was conducted to explore
whether there were particular words associated with
performance. The frequencies of all words (no stop words)
for each query were correlated with both types of
performance criteria: 1) continuous difference from the
median and 2) failure, indicated by results significantly
below median performance. Table 3 presents the correlations that were significant at the .01 alpha level or better across the 98 topics, and which occurred in at least five different topics.
The words 'to' and 'some' increased in frequency as
performance increased, while frequency of the following
words was associated with lower performance: 'who, more,
type, following, been, two.' For the failure criterion, 'who,
more, been, two' were also significantly associated with
lower performance. In addition, 'national, system, support'
were also negatively associated with it. This analysis of
words from queries associated with performance suggests
that the pair matching approach worked best when the documents used a domain-specific vocabulary.
Proper Name Identification.
At the other extreme, topics that used more domain-general words had lower performance. In particular, queries that asked for a category of documents, as indicated by words such as 'who' and 'type', were more likely to be in the failure category. Words including 'system, national, following, been, and two' were also associated with higher failure rates.
This suggests that proper noun compounds may require special treatment. The names of organizations, products, locations, etc. apparently cannot be easily identified through direct pair matching when these specific proper nouns are not contained in the query. When such specific results are called for by a query, special procedures are probably desirable for identifying proper nouns in documents that match on other query pairs.
Domain Specificity of Words.
Table 3: Query words & performance correlations
WORD         r         No. of Topics
(Difference criterion)
to          -.2743*    15
some        -.2480*    10
who          .3570**    8
more         .2509*     8
type         .3740**    6
following    .3069*     6
been         .2580*     6
two          .3750**    5
(Failure criterion)
national     .2479*    11
system       .2479*     9
who          .3828**    8
more         .2426*     8
been         .2545*     6
two          .4100**    5
support      .2479*     5
* p<.01, ** p<.001
An additional implication is that query expansion may be fruitful when dealing with domain-transcendent words. Through use of thesauri or databases such as WordNet, alternative word senses may be disambiguated. Then synonyms specific to the proper domain could be added to the actual query pairs contained in the original raw query text.
Interestingly, queries that contained the word 'some' showed higher performance. This may suggest that the criteria for relevance were less stringent for such queries, in that they asked not for an exhaustive and complete fit of query to documents, but a more partial overlap. The word 'to' in queries was also associated with higher performance. This may be associated with the specificity of this word in discourse, indicating relationships of direction, degree, state, contact, possession, etc.
Natural Language Processing on Queries.
Together, such query-focused results suggest that future
work may benefit from performing complex natural language
processing such as parsing, sense disambiguation, etc. on the
queries themselves to tune them before matching.
Sophisticated treatment of queries may improve performance to the point that such treatment of the raw texts themselves, which is expensive, may not add much marginal performance improvement.
Stemming.
Tests were run with the training sets for three queries selected at random: 2, 26, and 49. For query 2 the difference was zero. For query 26, the number of relevant documents retrieved increased by 43%, while for query 49 there was a 73% improvement. Average improvement for the three queries was 37% using stemming.
All Pairs.
Tests were run for three different queries to examine the effects of dropping single-occurrence word pairs from documents. Queries 51, 71, and 78 were chosen at random. Retrieval of relevant documents increased on average by 75%, with varied results across queries. Query 51 saw relevant documents retrieved increase by 2.25 times, query 71 decreased to .93 of prior performance, and query 78 increased by 11 times, for an average 1.75-times increase in relevant documents retrieved.
Indirect Match Tests.
The training set of documents for query 51, about Airbus subsidies, was used to test indirectness effects. One-step indirectness was assessed, meaning that two query-pair words were not directly paired in the document but were indirectly connected through an intermediary word.
To illustrate, consider the query pairs including the word "aid," none of which have any direct matches in the documents.
Table 4 contains the direct (one-step) and indirect (two-step) links that "aid" had in the documents. The leftmost pairs are direct links, while the rightmost words were directly linked only to the second word of the direct pairs, thus forming a two-step indirect link to the first word in the pair. For example, "aid" is linked to "government" only indirectly through "Airbus." Also, "aid" is linked to "subsidies" only indirectly through "Airbus." These two sets of indirect links, aid-(Airbus)-subsidies and aid-(Airbus)-government, are meaningful in terms of the content of the query, which generally concerns government aid and subsidies to Airbus. If we had used only directly matching pairs, we would have missed these two conceptually meaningful sets of links. After identifying all indirect pairs in documents matching query pairs in this way, retrieval of relevant documents was 12% higher.
Shortest Paths.
WORDij does not restrict detection of indirect phrases to these dual bi-gram cases. Rather, indirectness can be of n-step lengths [Danowski & Martin 79; van Rijsbergen 77]. For example, if there is an intermediate term between two other terms not otherwise linked, then these two terms have an indirect step linkage of two. If the connection is only through two intermediaries, then the indirect linkage is at step three, and so on. Shortest path algorithms [Gabow & Tarjan 89] find the best set of all direct and indirect links connecting all nodes in a network; here, this is all words in the query.
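A breadth-first search suffices to find such step distances in a word network; this sketch is illustrative, not the scaling algorithms the paper cites:

```python
from collections import deque

def step_distance(adj, source, target):
    """Minimum number of link steps between two words in the
    cooccurrence network (1 = direct pair), via breadth-first search."""
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        word, d = frontier.popleft()
        for nxt in adj.get(word, ()):
            if nxt == target:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None  # not connected

adj = {"aid": {"airbus"},
       "airbus": {"aid", "subsidies", "government"},
       "subsidies": {"airbus"}, "government": {"airbus"}}
print(step_distance(adj, "aid", "subsidies"))  # 2: aid-(airbus)-subsidies
```

This reproduces the query-51 example: "aid" reaches "subsidies" at step two through "airbus."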
We expect that indirectness at the two-step level may
contribute most to recall-precision effectiveness. At larger
numbers of steps the value of indirect information
diminishes. This is because at the extreme lengths, every
word is indirectly connected to every other word. This is
equivalent to a simple within-document cooccurrence of words, such as in traditional approaches, and it renders the local cooccurrence constraints useless. Note also that stop word removal from texts is necessary to represent higher degrees of indirectness. When stop words are present, they increase the connectivity of the word network.
Structural Equivalence and Meaning.
In network analysis, attention to the direct links in a network is called a "cohesion" approach, while examining the degree of similarity in two-step links is called "structural equivalence" [Burt 90]. Two nodes are structurally equivalent to the extent that they share the same indirect links, though they may not be directly linked themselves. For example, if word A is linked to words C, D, E and word B is linked to words C, D, E, then although A and B are not directly linked (i.e., show no cohesion), they are structurally equivalent and maximally similar because they share the same links.
Research in mathematical sociology and network
analysis has found that structural equivalence is usually
equal to or better than cohesion in accounting for system
behavior. In text analysis using words as nodes, two words
can be considered to share more meaning to the extent they
have overlapping two-step links. Therefore, structural
equivalence of words is meaning equivalence.
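One way to operationalize this degree of shared linkage (my choice of overlap measure, not necessarily Burt's formulation) is the Jaccard overlap of two words' link sets:

```python
def structural_equivalence(adj, a, b):
    """Jaccard overlap of two words' link sets. 1.0 means the words
    share exactly the same neighbors, even if they are never directly
    linked themselves. (Jaccard is an illustrative choice; Burt's
    measures differ in detail.)"""
    na = adj.get(a, set()) - {b}
    nb = adj.get(b, set()) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

# The example from the text: A and B each link to C, D, E.
adj = {"A": {"C", "D", "E"}, "B": {"C", "D", "E"},
       "C": {"A", "B"}, "D": {"A", "B"}, "E": {"A", "B"}}
print(structural_equivalence(adj, "A", "B"))  # 1.0
```

A and B score 1.0 despite having no direct link, which is exactly the cohesion-versus-equivalence contrast described above.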
Latent Semantic Indexing and Indirect Pairs.
It is interesting that another approach to indirectness is
Latent Semantic Indexing (LSI) [Deerwester et al. 90;
Dumais 92]. Instead of using a network approach, however, it uses an eigenvector model. Eigenvectors
represent the combined effects of direct and indirect
associations among elements in the matrix. "Latent" refers
Table 4: Direct and indirect links to the word "aid"
aid airbus 1
fdp 1
back. 1
jets 2
adams 1
board 1
crash 1
group 1
plans 1
airbus 1
boeing 2
family 1
german 1
member 1
planes 2
dispute 1
mandate 1
nations 2
partner 2
percent 1
program 2
provide 1
aircraft 1
european 1
products 1
projects 2
amendment 1
executive 1
industrie 9
initially 1
ministers 1
spokesman 2
structure 1
*subsidies 2
violating 1
consortium 5
*government 1
management 1
aid package 1
aid guarantee 1
aid consortium 1
airbus 1
*These are indirect links that create pairs contained in the query pairs: aid-(Airbus)-subsidies and aid-(Airbus)-government. The other indirect links are not meaningful because they do not relate to the query at the two-step level. Nevertheless, they are listed to show the larger context of identifying meaningful indirectness.
to indirect association patterns below the manifest or direct
level. Currently, eigenvector solutions to large matrices are
more computationally limited than shortest path network
solutions. There has been more development of large scale,
parallel algorithms for shortest paths, due to the {M^ctical
needs to aid routing of information in telecommunications
networks. Some work, however, suggests that there is a
mathematical equivalence between eigenvector and network
approaches to reducing matrices of associations to asimpler
underlying structure [Bamett &Richards 91].
Shortest Path Weighting.
Given a set of query word pairs and a list of all documents that contain each word pair, both directly and indirectly, we can take all pairs of nodes and identify the shortest path linking them in the network. These paths are measured for length according to Euclidean distance in graph terms. Such distance is a direct function of the minimum number of link steps required to connect two nodes on their geodesic. Directly linked nodes have a distance of one, nodes linked through one common intermediary node have a distance of two, etc. Documents are counted that are "passed through" or "activated" as each step in the shortest path is traversed. Shortest path algorithms can find these indirect paths with large data sets, provided parallel algorithms and hardware are used. We are further developing such experiments.
After IDF weighting, ranking, and selection of the best words, network analysis is conducted on the word pairs they form. The shortest paths linking every word in the set are found, and each word's centrality in the network is indexed via the average of the minimum number of steps between that word and all other words in the set.
Then each document is given a weight based on the centrality of the words from the query that it contains. The retrieved documents found along the shortest paths between all query pairs are counted and weighted by their constituent word centrality. Documents are then rank-ordered for each query.
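The centrality index described here, the average of minimum step counts to all other query words, can be sketched as follows; `min_steps` is an illustrative BFS helper, not the production algorithm:

```python
from collections import deque

def min_steps(adj, src, dst):
    # BFS geodesic length between two words; None if unconnected.
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        word, d = queue.popleft()
        if word == dst:
            return d
        for nxt in adj.get(word, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

def centrality(adj, words):
    """Average minimum step count from each word to every other query
    word; smaller values mean the word is more central in the network."""
    scores = {}
    for w in words:
        dists = [min_steps(adj, w, v) for v in words if v != w]
        dists = [d for d in dists if d is not None]
        scores[w] = sum(dists) / len(dists) if dists else float("inf")
    return scores

adj = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(centrality(adj, ["hub", "a", "b", "c"])["hub"])  # 1.0
```

In this toy star network the hub word averages one step to every other word, so its inverse-distance weight would dominate the document scores.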
Results showed that even with unexpected limitations
due to the mid-project death of the lead research assistant,
Nainesh Khimasia, we succeeded in processing the entire
TREC collection and doing direct matching of query word
pairs to document word pairs. For 15% of the topics, our
results can be considered failures. Failure analysis suggests
that improvements in future research may result from:
query tuning based on natural language processing
using special procedures for treating proper names of organizations, products, locations, etc.
retaining and using word pairs occurring only once in documents
stemming the documents and queries
doing inverse document frequency (IDF) or entropy weighting on words and using these to weight query pairs
computing additional weights based on shortest paths.
The author is grateful for the contributions of the
following University of Illinois at Chicago faculty, students,
and staff to this project: John Andrews, Robert Goldstein,
Alan Hinds, Nainesh Khimasia, Jin Hong Meng, Stephen
Roy, Gary Singer, Anand Sundaram, George Yanos, and
Clement Yu.
Barnett, G.A. & Richards, W.D. (1991, February). A comparison of NEGOPY's clique detection algorithm with correspondence analysis. Paper presented to the International Social Networks Conference, Tampa, Florida.
Burt, R.S. (1990). Structure. New York: Center for Social Sciences, Columbia University.
Croft, B., Turtle, H. & Lewis, D. (1991). Proceedings of SIGIR '91, 32-45.
Danowski, J. (1982). A network-based content analysis methodology for computer-mediated communication: An illustration with a computer bulletin board. Communication Yearbook, 6, 904-925.
Danowski, J. (1988). Organizational infographics and automated auditing: Using computers to unobtrusively gather and analyze communication. In G. Goldhaber and G. Barnett (Eds.), Handbook of organizational communication (pp. 335-384). Norwood, NJ: Ablex.
Danowski, J. & Andrews, J. (1985, February). A method for automated network analysis of word cooccurrences. Paper presented to the International Social Networks Conference, San Diego.
Danowski, J. & Martin, T.H. (1979). Evaluating the health of information science: Research community and user contexts. Final report to the Division of Information Science of the National Science Foundation, no. IST78-21130.
Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W. & Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:6, 391-407.
Dumais, S.T. (1992). LSI meets TREC: A status report. Paper presented to TREC.
Fagan, J. (1989). The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40:2, 115-132.
Gabow, H.N. & Tarjan, R.E. (1989). Faster scaling algorithms for network problems. SIAM Journal on Computing, 18 (Oct.), 1013-1036.
Salton, G. & McGill, M. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Salton, G. & Buckley, C. (1991). Automatic text structuring and retrieval: Experiments in automatic encyclopedia searching. Proceedings of SIGIR '91, 21-30.
van Rijsbergen, C. (1977). A theoretical basis for the use of cooccurrence data in information retrieval. Journal of Documentation, 33, 106-119.
Yu, C., Buckley, C., Lam, H. & Salton, G. (1983). A generalized term dependence model in information retrieval. Information Technology: Research and Development.
... A key point I saved until now is that proximity co-occurrence indexing (Danowski, 1982, 1993a, 1993b, 2009 avoids the problems of the simplistic 'bag of words' approaches common from Information Science and Information Retrieval. While word bags are useful for document retrieval they blur social meaning by ignoring the relationships of social units within the texts, whether these units are words, people, or other entities. ...
... It is more analytically precise, however, to use a proximity criterion in defining relationships among entities network analyzed. I used a three-word window, based on empirical testing of window size and network strucrture validity (Danowski, 1993b) operating on the text file after all words except the names of departments were automatically removed by the use of the WordLink 'include list' of department names. Table 2 shows an example of a portion of a larger include file. ...
Full-text available
This chapter presents six examples of organization-related social network mining: 1) interorganizational and sentiment networks in the Deepwater BP Oil Spill events, 2) intraorganizational interdepartmental networks in the Savannah College of Art and Design (SCAD), 3) who-to-whom email networks across the organizational hierarchy the Ford Motor Company's automotive engineering innovation: "Sync® w/MyFord Touch", 4) networks of selected individuals who left that organization, 5) semantic associations across email for a corporate innovation in that organization, and 6) assessment of sentiment across its email for innovations over time. These examples are discussed in terms of motivations, methods, implications, and applications.
... Once the data has been split into windows, text within each is transformed into a network using a technique called Wordij [8] (see Figure 3 for an example). In Wordij, strong ties connect adjacent words, weaker ties link words that are adjacent to a common word but are not themselves adjacent, and so forth. ...
... We applied the TEvA and ICI analyses to the collected dataset. Settings for the CPM algorithm and word cooccurrence network creation were chosen following guidance in [24] and [8], and parameters for the windowing strategy were chosen as described above. In all cases, a time windows were ten minutes long and were slid across the data in one-minute increments. ...
Conference Paper
Full-text available
In this article, we present an analysis of communication transcripts from computer-mediated teams that illustrates how different kinds of decision support impact collaborative knowledge construction. Our analysis introduces an algorithmic technique called Topic Evolution Analysis (TEvA), which tracks clusters of words in conversation, and illustrates how these clusters change and merge over time. This analysis is combined with measurements of group dynamics to distinguish between teams using different kinds of decision support. Our analysis offers evidence that some kinds of decision support improve the apparent rationality of a team, but at the cost of collaborative knowledge construction. This result is not apparent when simply measuring team decision performance. We use this finding to motivate the utility and importance of the approach when assessing the impact of technology on collaborative knowledge processing.
... The author followed the convention of using the sliding window threewords-wide on both sides of the word. The corpora were analyzed by countries using WORDij [55] to generate words and word-pairs frequency. Gephi (0.9.2) generated the visualisation and the relevant network measures (Tables 2 and 3, and Figures 1-4). ...
Full-text available
Digital nomadism is emerging as a growing segment of the labor force. It is an insightful framework for understanding work during the pandemic and perhaps into the post-pandemic era because it construes work to be related to the notion of space, time and the instrumentality of work. The present study is about how people understand, relate, and make sense of their work during the early phase of the pandemic lockdown in 2020. The study will report difficulties that arise from work digitalization during the lockdown, and the study conceived the various dimensionality of work to cope with work challenges. Semantic network analysis (SNA) was used to aid the analysis of the contents from four European countries. One hundred and sixty respondents are interviewed using a semi-structured questionnaire. The words and word pairs from the SNA resulted in keywords identified for the four countries. There are common word hubs between the countries, such as hubs revolving around the meaning of ‘time’ and ‘meeting’. However, there are also unique hubs such as ‘task’, ‘office’ and “colleagues”. The results provide a cross-cultural comparison of how people adopted to work change. The organization of the word pairs in the network provided the narratives.
... Before using the start words, we preprocessed the corpus, removing duplicate documents, stop words, and words and word pairs occurring less than three times, selecting these options in the semantic network package WORDij (Danowski 1993b(Danowski , 2013. We did no stemming or lemmatization, to increase linguistic validity. ...
Full-text available
For reason beyond the control of the authors or the editors, the article titled “Scaling constructs with semantic networks” by James A. Danowski and Kenneth Riopelle ( was published.
... Before using the start words, we preprocessed the corpus, removing duplicate documents, stop words, and words and word pairs occurring less than three times, selecting these options in the semantic network package WORDij (Danowski 1993b(Danowski , 2013. We did no stemming or lemmatization, to increase linguistic validity. ...
... Before using the start words, we preprocessed the corpus, removing duplicate documents, stop words, and words and word pairs occurring less than three times, selecting these options in the semantic network package WORDij (Danowski 1993b(Danowski , 2013. We did no stemming or lemmatization, to increase linguistic validity. ...
Full-text available
This paper introduces a method for creating scales of constructs based on word bigram cooccurrences in natural language text. Instead of using a stop-word list to drop less useful words, we use a start-word list that enables computing the cooccurrences of only these “smart words.” In this way, we can create scales to measure communication constructs by first listing the key terms in the conceptual definition, then expanding the terms by looking up synonyms in dictionaries such as WordNet. Following this, we compute the cooccurrence network among these words with a sliding window. Next, we extract the first dimension through principal components analysis and identify the words that load most highly on the first factor. For these words, we sum the frequencies, which produces the final index for the construct. This operationalization yields index scales that have high construct validity, which contributes to external validity. Extending the procedures of classic psychometric index construction into the natural language domain avoids the biases of data based on fixed-choice questionnaires. To demonstrate the procedures for construct operationalization, we use a dataset of news stories about the BP Deepwater Horizon Gulf Oil Spill, scaling environmental uncertainty and organizational responses to it including innovation, strategic planning, and changes in organizational structure.
... The narrative format of the abstract content was suited to a proximity-based word-pair approach. Each file was analyzed with WordLink in WORDij 3.0 [15] [16][17] [18]. AutoMap [19] and Catpac [20] have adopted this proximity approach and could be used in this kind of analysis. ...
Full-text available
We investigated possible causal relationships between a professional association's division network structure based on co-memberships, and the division network structure based on the semantic similarity of papers presented at annual meetings. Data from the International Communication Association (ICA), a basic-research focused organization of academic social scientists with 21 divisions, provide for an analysis at two points in time, 2007 and 2011. QAP correlations among the four networks entered into a quasi-experimental cross-lagged correlation design suggested evidence for possible causality. Compared to the no cause baseline, the time 1 co-membership network structure was a significant predictor of the time 2 semantic division network structure. The reverse relationship was not significant. As well, there is considerable reduction of the size of the synchronous correlation of the semantic and co-membership division networks from time 1 to time 2. Noteworthy was also the pattern of diachronic association of the same kind of network. The semantic division network at time 1 explained only 31% of the same network at time 2. Likewise, the co-membership network at time 1 explained only 25% of that network at time 2. This would be consistent with the basic research focus of the association. Such a focus privileges novelty. The paper uses theory to form the research questions and interpretations. Because of the limitations of the statistical model, and the case study design, these results should be taken as exploratory and suggestive. Future research may reduce these limitations.
This work proposes a novel approach to visually interacting with semantic networks constructed via natural language processing techniques. The proposed web interface, WINS, allows the user to select a textual document to be analyzed, choose the algorithm that constructs the semantic network, and visualize the network with its metrics. Unlike previous works, which typically construct the text network from a co-occurrence matrix, the proposed interface embeds an additional approach that combines network science with distributed representations of words and phrases.
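One way to build the embedding-based variant of such a network is to link words whose vector similarity crosses a threshold. The sketch below uses made-up two-dimensional vectors purely for illustration; a real system would use distributed representations learned from a corpus.

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_network(vectors, threshold=0.7):
    """Edges between word pairs whose embedding similarity passes the threshold."""
    words = sorted(vectors)
    edges = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            s = cosine(vectors[a], vectors[b])
            if s >= threshold:
                edges.append((a, b, round(s, 3)))
    return edges
```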
This research analyzes Muslim nation (MN) networks associated with Jihad for the previous two years. We captured all documents from Lexis-Nexis Academic's BBC International Monitoring - which contains translated transcriptions of web pages, broadcasts, newspapers, and other content - for each of 47 Muslim nations (MNs) using the search term: jihad and MN name. We presented a new kind of semantic network time series analysis of this text. Unlike most semantic network analysis, our nodes were time segments, not words. The link strengths were similarity scores of time nodes across 779,192 word pairs. The time nodes were 105 weekly intervals. We created a two-mode matrix. Columns were the frequencies of time slices' word pairs, appearing in a three-word window. Matrix rows were three-quarters of a million word pairs extracted from the aggregate two-year text file. We converted this two-mode matrix to a one-mode matrix by computing the similarity of each pair of time slices across the rows of word pairs. This resulted in a one-mode network of 105 by 105 time units. Pearson correlations were the similarity coefficients. We conducted social network analysis of the time nodes to find the most central ones. Highly central nodes lie more often on the shortest paths between all pairs of time nodes. They therefore contain in their internal lists of highest frequency word pairs the main themes across the two-years of text. The method is highly automated and efficient. In this case only three central nodes provided the basis for an analyst's interpretations of main propaganda themes.
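The two-mode to one-mode conversion at the heart of this method can be sketched as follows. The toy matrix and the strength-based centrality below are illustrative simplifications: the study itself correlated 105 weekly slices across roughly three-quarters of a million word pairs and used shortest-path (betweenness) centrality rather than the simple total-similarity measure shown here.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def one_mode(two_mode):
    """Time-slice x time-slice network from a time-slice x word-pair frequency matrix."""
    t = len(two_mode)
    net = [[1.0 if i == j else 0.0 for j in range(t)] for i in range(t)]
    for i in range(t):
        for j in range(i + 1, t):
            r = pearson(two_mode[i], two_mode[j])
            net[i][j] = net[j][i] = r
    return net

def most_central(net):
    """Total-similarity strength as a simple stand-in for betweenness centrality."""
    strengths = [sum(row) - 1.0 for row in net]  # drop the self-tie
    return strengths.index(max(strengths))
```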
Analysis of sentiment expressed in political systems' communication over long periods of time has been difficult. This research illustrates a method based on network analysis, the Sentiment Network Analyzer (SNAZ). It identifies weighted shortest paths between seed words and 3,500 target sentiment words as these occur in semantic networks extracted from open-source documents sliced into time intervals. Computing the normalized intensity ratios of positive and negative sentiment for each time slice enables application of the "Losada Line." For a system to be flourishing there must be at least 2.9 times more positive than negative communication. Below that ratio the system is languishing. Excessive positivity above a ratio of 11.6 marks the disintegration of a system into chaotic oscillation. We collected and analyzed five years of propaganda documents mentioning the Taleban from Afghani and Pakistani sources transcribed by BBC International Monitoring. Likewise, we extracted and analyzed stories communicated by Radio Free Europe/Radio Liberty (RFE/RL) connected with Afghanistan over the same five-year period. Semantic network and sentiment network analysis was coupled with the computation of positivity ratios in each time slice during this period. Taleban content generally shows evidence of flourishing, except for a period of oscillation between flourishing and excessive positivity beginning in the third quarter of 2010. RFE/RL is consistently languishing, reaching the 2.9 flourishing level in only one period. We discuss possible reasons. We also consider some implications for perception management and counterterrorism strategy.
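The positivity-ratio classification is simple to compute once per-slice sentiment totals are available. The sketch below applies the 2.9 and 11.6 thresholds from the abstract; the input counts are invented for illustration.

```python
def positivity_ratio(pos, neg):
    """Ratio of positive to negative sentiment intensity in a time slice."""
    return pos / neg if neg else float("inf")

def losada_state(ratio, low=2.9, high=11.6):
    """Classify a time slice's ratio against the Losada Line thresholds."""
    if ratio < low:
        return "languishing"
    if ratio > high:
        return "chaotic oscillation"
    return "flourishing"
```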
This paper provides a foundation for a practical way of improving the effectiveness of an automatic retrieval system. Its main concern is with the weighting of index terms as a device for increasing retrieval effectiveness. Previously index terms have been assumed to be independent for the good reason that then a very simple weighting scheme can be used. In reality index terms are most unlikely to be independent. This paper explores one way of removing the independence assumption. Instead the extent of the dependence between index terms is measured and used to construct a non-linear weighting function. In a practical situation the values of some of the parameters of such a function must be estimated from small samples of documents. So a number of estimation rules are discussed and one in particular is recommended. Finally the feasibility of the computations required for a non-linear weighting scheme is examined.
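One standard way to quantify the dependence between a pair of index terms is expected mutual information over their document cooccurrence. The sketch below is an illustrative dependence measure of this kind, not the paper's exact non-linear weighting function; the toy document sets are invented.

```python
from math import log

def emim(docs, t1, t2):
    """Expected mutual information between two index terms over a document set:
    sums p(x,y) * log(p(x,y) / (p(x) * p(y))) over term presence/absence."""
    n = len(docs)
    joint = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for d in docs:
        joint[(int(t1 in d), int(t2 in d))] += 1
    score = 0.0
    for (a, b), c in joint.items():
        if c == 0:
            continue
        pxy = c / n
        px = sum(v for (x, _), v in joint.items() if x == a) / n
        py = sum(v for (_, y), v in joint.items() if y == b) / n
        score += pxy * log(pxy / (px * py))
    return score
```

The measure is zero when the terms occur independently and grows as their occurrences become coupled, which is exactly the quantity a dependence-aware weighting scheme needs to estimate from document samples.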
It may be possible to improve the quality of automatic indexing systems by using complex descriptors, for example, phrases, in addition to the simple descriptors (words or word stems) that are normally used in automatically constructed representations of document content. This study is directed toward the goal of developing effective methods of identifying phrases in natural language text from which good quality phrase descriptors can be constructed. The effectiveness of one method, a simple nonsyntactic phrase indexing procedure, has been tested on five experimental document collections. The results have been analyzed in order to identify the inadequacies of the procedure, and to determine what kinds of information about text structure are needed in order to construct phrase descriptors that are good indicators of document content. Two primary conclusions have been reached: (1) In the retrieval experiments, the nonsyntactic phrase construction procedure did not consistently yield substantial improvements in effectiveness. It is therefore not likely that phrase indexing of this kind will prove to be an important method of enhancing the performance of automatic document indexing and retrieval systems in operational environments. (2) Many of the shortcomings of the nonsyntactic approach can be overcome by incorporating syntactic information into the phrase construction process. However, a general syntactic analysis facility may be required, since many useful sources of phrases cannot be exploited if only a limited inventory of syntactic patterns can be recognized. Further research should be conducted into methods of incorporating automatic syntactic analysis into content analysis for document retrieval. © 1989 John Wiley & Sons, Inc.
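A minimal version of such a nonsyntactic phrase-construction procedure might look like the sketch below; the stop-word list, window size, and frequency threshold are placeholder choices for illustration, not the study's actual parameters.

```python
from collections import Counter

STOP = {"the", "of", "a", "an", "and", "in", "to", "is"}

def phrase_descriptors(tokens, window=2, min_freq=2):
    """Nonsyntactic phrase indexing: pair content words that cooccur within a
    small positional window, keeping only pairs frequent enough to index."""
    content = [(i, t.lower()) for i, t in enumerate(tokens)
               if t.lower() not in STOP]
    pairs = Counter()
    for k, (i, a) in enumerate(content):
        for j, b in content[k + 1:]:
            if j - i > window:
                break
            pairs[tuple(sorted((a, b)))] += 1
    return {p: c for p, c in pairs.items() if c >= min_freq}
```

Because no syntax is consulted, the procedure happily pairs words that merely happen to be adjacent, which illustrates the study's conclusion that purely statistical phrase descriptors are noisy without syntactic filtering.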
Many conventional approaches to text analysis and information retrieval prove ineffective when large text collections must be processed in heterogeneous subject areas. An alternative text manipulation system is outlined, useful for the retrieval of large heterogeneous texts and for the recognition of content similarities between text excerpts, based on flexible text matching procedures carried out in several contexts of different scope. The methods are illustrated by search experiments...
This paper presents algorithms for the assignment problem, the transportation problem, and the minimum-cost flow problem of operations research. The algorithms find a minimum-cost solution, yet run in time close to the best-known bounds for the corresponding problems without costs. For example, the assignment problem (equivalently, minimum-cost matching in a bipartite graph) can be solved in O(√n m log(nN)) time, where n, m, and N denote the number of vertices, the number of edges, and the largest magnitude of a cost; costs are assumed to be integral. The algorithms work by scaling. As in the work of Goldberg and Tarjan, in each scaled problem an approximate optimum solution is found, rather than an exact optimum.
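For intuition, the assignment problem itself can be solved by brute force on tiny instances, as in the sketch below; the scaling algorithms in the paper reach the stated near-optimal bounds by entirely different means, which this illustration does not attempt.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Brute-force minimum-cost assignment: try every one-to-one mapping of
    rows (workers) to columns (tasks) and keep the cheapest. Exponential time,
    so only suitable for very small square cost matrices."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost
```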
Barnett, G.A., & Richards, W.D. (1991, February). A comparison of NEGOPY's clique detection algorithm with correspondence analysis. Paper presented to the International Social Networks Conference, Tampa, Florida.