About
28
Publications
8,955
Reads
92
Citations
Introduction
I'm interested in applying machine learning and natural language processing to the analysis of one or more texts, a field sometimes called text analytics. I prefer programming in Python and R.
Additional affiliations
August 2001 - May 2004
Education
September 1990 - June 1998
September 1986 - June 1988
August 1984 - May 1986
Publications
These slides introduce how to use Python to analyze texts, a task known as 'text analytics' or 'text mining.'
These slides introduce how to use Python to analyze texts, a task known as 'text analytics' or 'text mining.' Regular expressions are emphasized here.
Although squaring integers is deterministic, squares modulo a prime, $p$, appear to be random. First, because they are all generated by the multiplicative linear congruential equation, $x_{i+1} = g^2 x_i \mod p$, where $x_0 = 1$ and $g$ is any primitive root of $p$, a pseudorandom number heuristic suggests that they are, in fact, unpredictable. Mor...
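The recurrence in this abstract can be tried out directly. The following is a minimal sketch, not the paper's code: the function names `is_primitive_root` and `squares_via_lcg` are hypothetical, and the primitive-root check is a brute-force one that is only practical for small primes.

```python
def is_primitive_root(g, p):
    """Brute-force check that g generates every nonzero residue mod the prime p."""
    seen = set()
    x = 1
    for _ in range(p - 1):
        x = (x * g) % p
        seen.add(x)
    return len(seen) == p - 1


def squares_via_lcg(p, g):
    """List the nonzero squares mod p in the order produced by
    x_{i+1} = g^2 * x_i mod p with x_0 = 1 (g assumed to be a primitive root)."""
    if not is_primitive_root(g, p):
        raise ValueError("g must be a primitive root of p")
    multiplier = (g * g) % p
    x = 1
    out = []
    # There are (p - 1) / 2 nonzero squares modulo an odd prime p.
    for _ in range((p - 1) // 2):
        out.append(x)
        x = (multiplier * x) % p
    return out
```

For example, `squares_via_lcg(11, 2)` visits each nonzero square modulo 11 exactly once, in an order that differs from the order obtained by squaring 1, 2, 3, ... directly.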
Walter Skeat published his critical edition of William Langland's 14th
century alliterative poem, Piers Plowman, in 1886. In preparation for this he
located forty-five manuscripts, and to compare dialects, he published excerpts
from each of these. This paper does three statistical analyses using these
excerpts, each of which mimics a task he did in...
In preparation for his edition of the 14th-century alliterative poem Piers Plowman, the 19th-century philologist Walter Skeat was able to find forty-five manuscripts. These were used in two different ways. First, he studied these with respect to their dialects, which led to his identification of three versions of the poem, denoted as texts A, B, a...
Interest in the mathematical structure of poetry dates back to at least the
19th century: after retiring from his mathematics position, J. J. Sylvester
wrote a book on prosody called The Laws of Verse. Today there is
interest in the computer analysis of poems, and this paper discusses how a
statistical approach can be applied to this tas...
Although textbook publishers offer course management systems, they do so to
promote brand loyalty, and while an open source tool such as WeBWorK is
promising, it requires administrative and IT buy-in. Supported in part by a
College Access Challenge Grant from the Department of Education, we
collaborated with other instructors to create online ho...
Researchers have developed ways to generalize the mean and variance to
situations in which a data metric is available. We apply the tools developed in
Pennec (2006) to categorical data, and show the generality of this approach by
considering two quite different applications. First, spelling variability in
Middle English is quantified. Second, varia...
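As a rough illustration of a metric-based mean and variance for categorical data, the sketch below computes a sample Fréchet mean and variance for a list of spellings. It rests on assumptions that are not taken from the paper: the metric is Levenshtein edit distance, the candidate means are restricted to the observed spellings, and the example spellings are illustrative only. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def frechet_mean_and_variance(sample, dist):
    """Return the observed value minimizing the sum of squared distances to the
    sample, together with that minimum divided by the sample size."""
    best, best_cost = None, None
    for candidate in sorted(set(sample)):
        cost = sum(dist(candidate, x) ** 2 for x in sample)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost / len(sample)


# Illustrative spellings only; not data from the paper.
spellings = ["knyght", "knight", "knyght", "knyghte"]
mean_spelling, variance = frechet_mean_and_variance(spellings, levenshtein)
```

Restricting the candidates to observed values keeps the search finite; a fuller treatment might search a larger space of strings.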
Markov chains are an important example for a course on stochastic processes
because simple board games can be used to illustrate the fundamental concepts.
For example, a looping board game (like Monopoly) consists of all recurrent
states, and a game where players win by reaching a final square (like Chutes
and Ladders) consists of all transient sta...
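The distinction between recurrent and transient states can be checked on a toy board. This is a minimal sketch under stated assumptions: the board size, the set of equally likely moves, and all function names are hypothetical, and the classification uses the standard finite-chain criterion that a state is recurrent exactly when every state reachable from it can reach it back.

```python
def transition_matrix(n, steps, looping):
    """Transition matrix for a toy board with n squares and equally likely moves.
    looping=True wraps around the board; otherwise the last square is absorbing."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if not looping and i == n - 1:
            P[i][i] = 1.0  # the game ends on the final square
            continue
        for s in steps:
            j = (i + s) % n if looping else min(i + s, n - 1)
            P[i][j] += 1.0 / len(steps)
    return P


def reachable(P, start):
    """Set of states reachable from start (including start itself)."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, prob in enumerate(P[i]):
            if prob > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def recurrent_states(P):
    """States i such that every state reachable from i can also reach i."""
    reach = [reachable(P, i) for i in range(len(P))]
    return [i for i in range(len(P)) if all(i in reach[j] for j in reach[i])]
```

On a six-square board with moves of one or two squares, the looping version reports every square as recurrent, while the race-to-the-end version reports only the absorbing final square as recurrent and every other square as transient.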
Statistics pedagogy values using a variety of examples. Thanks to text
resources on the Web, and since statistical packages have the ability to
analyze string data, it is now easy to use language-based examples in a
statistics class. Three such examples are discussed here. First, many types of
wordplay (e.g., crosswords and hangman) involve finding...
In many clinical trials and epidemiological studies, comparing the mean count response of an exposed group to a control group is often of interest. This type of data is often over-dispersed with respect to Poisson variation, and previous studies usually compared groups using confidence intervals (CIs) of the difference between the two means. Howeve...
Extra-dispersion (overdispersion or underdispersion) is a common phenomenon in practice when the variance of count data differs from that of a Poisson model. This can arise when the data come from different subpopulations or when the assumption of independence is violated. This paper develops a procedure for testing the equality of the means of sev...
In this article, we discuss the modeling of count data occurring in biological applications.
We then derive asymptotic procedures for the construction of confidence limits for the
over-dispersion parameter of count data when there is no likelihood available. We also
obtain closed-form asymptotic variance formulae for the estimator of the over-disper...
Introduction; Scalars, Interpolation, and Context in Perl; Arrays and Context in Perl; Word Lengths in Poe's “The Tell-Tale Heart”; Arrays and Functions; Hashes; Two Text Applications; Complex Data Structures; References; First Transition; Problems
92.45 Anasquares: Square anagrams of squares - Volume 92 Issue 524 - Roger Bilisoly
Provides readers with the methods, algorithms, and means to perform text mining tasks. This book is devoted to the fundamentals of text mining using Perl, an open-source programming tool that is freely available via the Internet (www.perl.org). It covers mining ideas from several perspectives--statistics, data mining, linguistics, and information r...
Edgar Allan Poe wrote seventy short stories in his lifetime, and literary critics have categorized these stories in many ways, e.g., by genres such as horror, detective or proto-science fiction. This paper discusses how a computer can group stories by using families of words related by a theme, e.g., words denoting colors. This approach combines tw...
The effect of variable demands at short time scales on the transport of a solute through a water distribution network has not previously been studied. We simulate flow and transport in a small water distribution network using EPANET to explore the effect of variable demand on solute transport across a range of hydraulic time step scales from 1 minu...
Previous work on sample design has been focused on constructing designs for samples taken at
point locations. Significantly less work has been done on sample design for data collected
along transects. A review of approaches to point and transect sampling design shows that
transects can be considered as a sequential set of point samples. Any two sam...
I will reconstruct the ocean currents for a region in the northeast Pacific based on a combination of (i) pre-existing knowledge of the average properties of the currents in this region; (ii) information obtained from floating instrument platforms that freely move with the currents; and (iii) the equations of fluid motion. The reconstruction will be t...
Soil chemical field data typically do not satisfy the required statistical assumptions, and this renders statistical tests based on normal theory either invalid or not particularly powerful. The objective of this study was to compare the t-test and two nonparametric tests (Wilcoxon signed rank and the Sign test) for a theoretical data set and 3 yr...
We address the issue of how to make decisions about the degree of smoothness demanded of a flexible contour used to model the boundary of a 2D object. We demonstrate the use of a Bayesian approach to set the strength of the smoothness prior for a tomographic reconstruction problem. The Akaike Information Criterion is used to determine whether to al...
As demonstrated by the anthrax attack through the United States mail, people infected by the biological agent itself will give the first indication of a bioterror attack. Thus, a distributed information system that can rapidly and efficiently gather and analyze public health data would aid epidemiologists in detecting and characterizing emerging di...
This note is inspired by Numbo-Carrean, which was introduced in Ross Eckler's Word Recreations [1] in the chapter called "Ten Logotopian Lingos." This lingo uses words with the following property: when each letter is replaced by its letter rank (or alphabetic position number), the resulting number is a perfect square. That is, a is replaced by 1, b...
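The property described here is easy to test by machine. The sketch below assumes that the letter ranks are concatenated as decimal digits to form the number (the abstract is truncated before it says so explicitly); the function names are hypothetical.

```python
import math


def letter_ranks_number(word):
    """Concatenate the alphabetic ranks (a=1, ..., z=26) of the letters in word.
    Assumption: the ranks are joined as decimal strings; non-letters are skipped."""
    digits = "".join(str(ord(ch) - ord("a") + 1)
                     for ch in word.lower() if "a" <= ch <= "z")
    return int(digits) if digits else None


def is_numbo_word(word):
    """True if the concatenated letter ranks form a perfect square."""
    n = letter_ranks_number(word)
    return n is not None and math.isqrt(n) ** 2 == n
```

Under this reading, "pi" qualifies because p is letter 16 and i is letter 9, giving 169, which is 13 squared. Filtering a word list with `is_numbo_word` would produce candidate vocabulary for the lingo.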