
Thomas G Dietterich - Oregon State University
214
Publications
41,783
Reads
31,874
Citations
Publications (214)
Unsupervised anomaly detection algorithms search for outliers and then predict that these outliers are the anomalies. When deployed, however, these algorithms are often criticized for high false-positive and high false-negative rates. One main cause of poor performance is that not all outliers are anomalies and not all anomalies are outliers. In th...
These are exciting times for computational sciences with the digital revolution permeating a variety of areas and radically transforming business, science, and our daily lives. The Internet and the World Wide Web, GPS, satellite communications, remote sensing, and smartphones are dramatically accelerating the pace of discovery, engendering globally...
Accurate weather data is important for improving agricultural productivity in developing countries. Unfortunately, weather sensors can fail for a wide variety of reasons. One approach to detecting failed sensors is to identify statistical anomalies in the joint distribution of sensor readings. This powerful method can break down if some of the sens...
Scripts have been proposed to model the stereotypical event sequences found in narratives. They can be applied to make a variety of inferences including filling gaps in the narratives and resolving ambiguous references. This paper proposes the first formal framework for scripts based on Hidden Markov Models (HMMs). Our framework supports robust inf...
Anomaly detectors are often used to produce a ranked list of statistical anomalies, which are examined by human analysts in order to extract the actual anomalies of interest. This can be exceedingly difficult and time consuming when most high-ranking anomalies are false positives and not interesting from an application perspective. In this paper, w...
Anomaly detectors are often used to produce a ranked list of statistical anomalies, which are examined by human analysts in order to extract the actual anomalies of interest. Unfortunately, in real-world applications, this process can be exceedingly difficult for the analyst since a large fraction of high-ranking anomalies are false positives and no...
Anomaly detectors are often used to produce a ranked list of statistical anomalies, which are examined by human analysts in order to extract the actual anomalies of interest. Unfortunately, in real-world applications, this process can be exceedingly difficult for the analyst since a large fraction of high-ranking anomalies are false positives and n...
Occupancy models are employed in species distribution modeling to account for imperfect detection during field surveys. While this approach is popular in the literature, problems can occur when estimating the model parameters. In particular, the maximum likelihood estimates can exhibit bias and large variance for datasets with small sample sizes,...
In many applications, an anomaly detection system presents the most anomalous data instance to a human analyst, who then must determine whether the instance is truly of interest (e.g., a threat in a security setting). Unfortunately, most anomaly detectors provide no explanation about why an instance was considered anomalous, leaving the analyst with...
In standard passive imitation learning, the goal is to learn a policy that performs as well as a target policy by passively observing full execution trajectories of it. Unfortunately, generating such trajectories can require substantial expert effort and be impractical in some cases. In this paper, we consider active imitation learning with the goa...
Scripts have been proposed to model the stereotypical event sequences found in narratives. They can be applied to make a variety of inferences including filling gaps in the narratives and resolving ambiguous references. This paper proposes the first formal framework for scripts based on Hidden Markov Models (HMMs). Our framework supports robust infer...
Bird migration occurs at the largest of global scales, but monitoring such movements can be challenging. In the United States there is an operational network of weather radars providing freely accessible data for monitoring meteorological phenomena in the atmosphere. Individual radars are sensitive enough to detect birds, and can provide insight in...
The Collective Graphical Model (CGM) models a population of independent and identically distributed individuals when only collective statistics (i.e., counts of individuals) are observed. Exact inference in CGMs is intractable, and previous work has explored Markov Chain Monte Carlo (MCMC) and MAP approximations for learning and inference. This pap...
Methods for using task-related information to enhance digital searching are provided. A task-oriented user activity system maintains task-related information about resources accessed by a user and current user task. This task-related information is used to enhance search queries to include task-related search criteria that improve relevance of sear...
This paper presents a learning approach for detecting nematocysts in Scanning Electron Microscope (SEM) images. The image dataset was collected and made available to us by biologists for the purposes of morphological studies of corals, jellyfish, and other species in the phylum Cnidaria. Challenges for computer vision presented by this biological d...
Biologists collect and analyze phenomic (e.g., anatomical or non-genomic) data to discover relationships among species in the Tree of Life. The domain is seeking to modernize this very time-consuming and largely manual process. We have developed an approach to detect and localize object parts in standardized images of bat skulls. This approach has...
Research in anomaly detection suffers from a lack of realistic and publicly-available problem sets. This paper discusses what properties such problem sets should possess. It then introduces a methodology for transforming existing classification data sets into ground-truthed benchmark data sets for anomaly detection. The methodology produces data se...
This paper reports on methods and results of an applied research project by a team consisting of SAIC and four universities to develop, integrate, and evaluate new approaches to detect the weak signals characteristic of insider threats on organizations' information systems. Our system combines structural and semantic information from a real corpora...
The phenotype represents a critical interface between the genome and the environment in which organisms live and evolve. Phenotypic characters also are a rich source of biodiversity data for tree building, and they enable scientists to reconstruct the evolutionary history of organisms, including most fossil taxa, for which genetic data are unavaila...
Archived data from the WSR-88D network of weather radars in the US hold detailed information about the continent-scale migratory movements of birds over the last 20 years. However, significant technical challenges must be overcome to understand this information and harness its potential for science and conservation. We present an approximate Bayesi...
Counting the number of rice pests captured via light traps each day is very important for monitoring the population dynamics of rice pests in paddy fields. This paper focuses on developing a segmentation method for separating the touching insects in the rice light-trap insect image from our imaging system to automatically identify and count rice pe...
In standard passive imitation learning, the goal is to learn a target policy by passively observing full execution trajectories of it. Unfortunately, generating such trajectories can require substantial expert effort and be impractical in some cases. In this paper, we consider active imitation learning with the goal of reducing this effort by query...
In typical real-time strategy (RTS) games, enemy units are visible only when they are within sight range of a friendly unit. Knowledge of an opponent's disposition is limited to what can be observed through scouting. Information is costly, since units dedicated to scouting are unavailable for other purposes, and the enemy will resist scouting attem...
We present a novel ensemble architecture for learning problem-solving techniques from a very small number of expert solutions and demonstrate its effectiveness in a complex real-world domain. The key feature of our “Generalized Integrated Learning Architecture” (GILA) is a set of heterogeneous independent learning and reasoning (ILR) components, co...
Background/Question/Methods
Species distribution models are an important tool for guiding our understanding of ecological systems and how to manage them. A recently introduced method merges two popular but previously distinct classes of species distribution models: site occupancy models (OD) and boosted regression trees (BRT). The new method (OD-...
Background/Question/Methods
In Breiman's "Two Cultures" paper, he contrasted statistical modeling (such as logistic regression) with prediction algorithms (such as random forests) and argued that inferences about traditional statistical models are risky unless the models also demonstrate high predictive accuracy on independent data. He discussed...
When training data is sparse, more domain knowledge must be incorporated into the learning algorithm in order to reduce the effective size of the hypothesis space. This paper builds on previous work in which knowledge about qualitative monotonicities was formally represented and incorporated into learning algorithms (e.g., Clark & Matwin's work wit...
Remote sensors are becoming the standard for observing and recording ecological data in the field. Such sensors can record data at fine temporal resolutions, and they can operate under extreme conditions prohibitive to human access. Unfortunately, sensor data streams exhibit many kinds of errors ranging from corrupt communications to partial or tot...
To avoid ecological collapse, we must manage Earth's ecosystems sustainably. Viewed as a control problem, the two central challenges of ecosystem management are to acquire a model of the system that is sufficient to guide good decision making and then optimize the control policy against that model. This paper describes three efforts aimed at addres...
Previous research has shown that a technique called error-correcting output coding (ECOC) can dramatically improve the classification accuracy of supervised learning algorithms that learn to classify data points into one of k ≫ 2 classes. This paper presents an empirical investigation of why the ECOC technique works, particularly when employed with de...
We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for...
Background/Question/Methods
An important question in contemporary ecology is how the timing of key life history stages, such as insect emergence, may be responding to climate change. This question is difficult to answer because many existing ecological datasets only contain observations of insect abundance over time, but the timing of emergence o...
The ecological sciences have benefited greatly from recent advances in wireless sensor technologies. These technologies allow researchers to deploy networks of automated sensors, which can monitor a landscape at very fine temporal and spatial scales. However, these networks are subject to harsh conditions, which lead to malfunctions in individual s...
Helping computer users rapidly locate files in their folder hierarchies is a practical research problem involving both intelligent systems and user interface design. This article reports on FolderPredictor, a software system that can reduce the cost of locating files in hierarchical folders. FolderPredictor applies a cost-sensitive prediction algor...
Sequential decision tasks present many opportunities for the study of transfer learning. A principal one among them is the existence of multiple domains that share the same underlying causal structure for actions. We describe an approach that exploits this shared causal structure to discover a hierarchical task structure in a source domain, which i...
We consider the problem of learning rules from natural language text sources. These sources, such as news articles, journal articles, and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum inform...
We study the problem of learning probabilistic models of high-level strategic behavior in the real-time strategy (RTS) game StarCraft. The models are automatically learned from sets of game logs and aim to capture the common strategic states and decision points that arise in those games. Unlike most work on behavior/strategy learning and prediction...
Important ecological phenomena are often observed indirectly. Consequently, probabilistic latent variable models provide an important tool, because they can include explicit models of the ecological phenomenon of interest and the process by which it is observed. However, existing latent variable methods rely on hand-formulated parametric models, whi...
Identification and population counts of soil mesofauna can be an important tool for soil ecologists to determine soil biodiversity. The process of performing population counts, which includes classifying and sorting specimens, is very time consuming because of the large diversity and quantities of specimens in soil samples. A mechanical system was...
We present a visually based method for the taxonomic identification of benthic invertebrates that automates image capture, image processing, and specimen classification. The BugID system automatically positions and images specimens with minimal user input. Images are then processed with interest operators (machine-learning algorithms for locating i...
This paper proposes an image classification method based on extracting image features using Haar random forests and combining them with a spatial matching kernel SVM. The method works by combining multiple efficient, yet powerful, learning algorithms at every stage of the recognition process. On the task of identifying aquatic stonefly larvae, the...
We consider the problem of incorporating end-user advice into reinforcement learning (RL). In our setting, the learner alternates between practicing, where learning is based on actual world experience, and end-user critique sessions where advice is gathered. During each critique session the end-user is allowed to analyze a trajectory of the current...
In this paper, we consider the problem of inductively learning rules from specific facts extracted from texts. This problem is challenging due to two reasons. First, natural texts are radically incomplete since there are always too many facts to mention. Second, natural texts are systematically biased towards novelty and surprise, which presents an...
In the field of Human-Computer Interaction, provenance refers to the history and genealogy of a document or file. Provenance helps us to understand the evolution and relationships of files; how and when different versions of a document were created, or how different documents in a collection build on each other through copy-paste events. Though met...
Situation Awareness (SA) for cyber defense consists of at least seven aspects: 1. Be aware of the current situation. This aspect can also be called situation perception. Situation perception includes both situation recognition and identification. Situation identification can include identifying the type of attack (recognition is only recognizing th...
Cyber situation awareness needs to operate at many levels of abstraction. In this chapter, we discuss situation awareness at a very high level: the behavior of desktop computer users. Our goal is to develop an awareness of what desktop users are doing as they work. Such awareness has many potential applications including...
Real-time prediction problems pose a challenge to machine learning algorithms because learning must be fast, the set of classes may be changing, and the relevance of some features to each class may be changing. To learn robust classifiers in such nonstationary environments, it is essential not to assign too much weight to any single feature. We add...
Ecosystem Informatics is the study of computational methods for advancing the ecosystem sciences and environmental policy. This talk will discuss the ways in which machine learning, in combination with novel sensors, can help transform the ecosystem sciences from small-scale hypothesis-driven science to global-scale data-driven science. Example c...
Although machine learning is becoming commonly used in today's software, there has been little research into how end users might interact with machine learning systems, beyond communicating simple “right/wrong” judgments. If the users themselves could work hand-in-hand with machine learning systems, the users’ understanding and trust of the system...
Codebook-based representations are widely employed in the classification of complex objects such as images and documents. Most previous codebook-based methods construct a single codebook via clustering that maps a bag of low-level features into a fixed-length histogram that describes the distribution of these features. This paper describes a s...
Current work in object categorization discriminates among objects that typically possess gross differences which are readily apparent. However, many applications require making much finer distinctions. We address an insect categorization problem that is so challenging that even trained human experts cannot readily categorize images of insects consi...
Intelligent desktop assistants could provide more help for users if they could learn models of the users' workflows. However, discovering desktop workflows is difficult because they unfold over extended periods of time (days or weeks) and they are interleaved with many other workflows because of user multi-tasking. This paper describes an approach...
The TaskTracer system allows knowledge workers to define a set of activities that characterize their desktop work. It then associates with each user-defined activity the set of resources that the user accesses when performing that activity. In order to correctly associate resources with activities and provide useful activity-related services to the...
Visual dictionaries are widely employed in object recognition to map unordered bags of local region descriptors into feature vectors for image classification. Most visual dictionaries have been constructed by unsupervised clustering. This paper presents an efficient discriminative approach, called iterative discriminative clustering (IDC), for dict...
Ecosystem Informatics brings together mathematical and computational tools to address scientific and policy challenges in the ecosystem sciences. These challenges include novel sensors for collecting data, algorithms for automated data cleaning, learning methods for building statistical models from data and for fitting mechanistic models to d...
This paper examines how six online multiclass text classification algorithms perform in the domain of email tagging within the TaskTracer system. TaskTracer is a project-oriented user interface for the desktop knowledge worker. TaskTracer attempts to tag all documents, web pages, and email messages with the projects to which they are relevant. In p...
Intelligent user interfaces employ machine learning to learn and adapt according to user peculiarities. In all these cases, the learning tasks are predefined and a machine-learning expert is involved in the development process. This significantly limits the potential utility of machine learning since there is no way for a user to create new learni...
The modern study of approximate dynamic programming (DP) combines ideas from several research traditions. Among these is the field of Artificial Intelligence, whose earliest period focussed on creating artificial learning systems. Today, Machine Learning is an active branch of Artificial Intelligence (although it includes researchers from many othe...
The field of inductive logic programming (ILP) has made steady progress, since the first ILP workshop in 1991, based on a balance of developments in theory, implementations and applications. More recently there has been an increased emphasis on Probabilistic ILP and the related fields of Statistical Relational Learning (SRL) and Structured Predic...
Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting,...
This paper addresses the problem of learning dynamic Bayesian network (DBN) models to support reinforcement learning. It focuses on learning regression tree (context-specific dependence) models of the conditional probability distributions of the DBNs. Existing algorithms rely on standard regression tree learning methods (both propositional and rela...
This paper describes a computer vision approach to automated rapid-throughput taxonomic identification of stonefly larvae. The long-term goal of this research is to develop a cost-effective method for environmental monitoring based on automated identification of indicator species. Recognition of stonefly larvae is challenging because they a...
We present an algorithm, HI-MAT (Hierarchy Induction via Models And Trajectories), that discovers MAXQ task hierarchies by applying dynamic Bayesian network models to a successful trajectory from a source reinforcement learning task. HI-MAT discovers subtasks by analyzing the causal and temporal relationships among the actions in the trajec...
This paper addresses the question of how statistical learning algorithms can be integrated into a larger AI system both from a practical engineering perspective and from the perspective of correct representation, learning, and reasoning. Our goal is to create an integrated intelligent system that can combine observed facts, hand-written rules,...
Population counts of aquatic insects are a valuable tool for monitoring the water quality of rivers and streams. However, the handling of samples in the lab for species identification is time consuming and requires specially trained experts. An aquatic insect imaging system was designed as part of a system to automate aquatic insect classification...
The emerging field of Ecosystem Informatics applies methods from computer science and mathematics to address fundamental and applied problems in the ecosystem sciences. The ecosystem sciences are in the midst of a revolution driven by a combination of emerging technologies for improved sensing and the critical need for better science to help mana...
The emerging field of Ecosystem Informatics applies methods from computer science and mathematics to address fundamental and applied problems in the ecosystem sciences. The ecosystem sciences are in the midst of a revolution driven by a combination of emerging technologies for improved sensing and the critical need for better science to help manage...
Cognitive networks pose enormous challenges for machine learning research. In this chapter, Thomas G. Dietterich and Pat Langley examine various aspects of machine learning that touch on cognitive approaches to networking. They present the state of the art in machine learning with emphasis on those aspects that are relevant to the emerging vision o...
Many ecological science and environmental monitoring problems can benefit from inexpensive, automated methods of counting insect and mesofaunal populations. Existing methods for obtaining population counts require expensive and tedious manual identification by human experts. This chapter describes the development of general-purpose pattern-recognit...
This paper presents a new structure-based interest region detector called principal curvature-based regions (PCBR) which we use for object class recognition. The PCBR interest operator detects stable watershed regions within the multi-scale principal curvature image. To detect robust watershed regions, we "clean" a principal curvature image by comb...
This paper describes a fully automated stonefly-larvae classification system using a local features approach. It compares the three region detectors employed by the system: the Hessian-affine detector, the Kadir entropy detector and a new detector we have developed called the principal curvature based region detector (PCBR). It introduces a con...
There has been little research into how end users might be able to communicate advice to machine learning systems. If this resource, the users themselves, could somehow work hand-in-hand with machine learning systems, the accuracy of learning systems could be improved and the users' understanding and trust of the system could improve as well. We co...
Intelligent desktop environments allow the desktop user to define a set of projects or activities that characterize the user's desktop work. These environments then attempt to identify the current activity of the user in order to provide various kinds of assistance. These systems take a hybrid approach in which they allow the user to declare their cur...
Desktop users commonly work on multiple tasks. The TaskTracer system provides a convenient, low-cost way for such users to define a hierarchy of tasks and to associate resources with those tasks. With this information, TaskTracer then supports the multi-tasking user by configuring the computer for the current task. To do this, it must detect when...
The volume of electronic information that users accumulate is steadily rising. A recent study [2] found that there were on average 32,000 pieces of information (e-mails, web pages, documents, etc.) for each user. The problem of organizing
It is crucial for the development of high quality products that design requirements are identified and clarified as early as possible in the design process. In many projects the design requirements and design specifications evolve during the project cycle. Shifting needs of the customer, advancing technology, market considerations and even addition...
Knowledge Representation and Reasoning (KRR) has developed a wide range of methods for representing knowledge and reasoning from it to produce expert-level performance. Despite these accomplishments, there is one major problem preventing the wide-spread application of KRR technology: the inability to support learning. This makes KRR systems brittle...
Local feature-based matching is robust to both clutter and occlusion. However, a primary shortcoming of local features is a deficiency of global information that can cause ambiguities in matching. Local features combined with global relationships convey much more information, but global spatial information is often not robust to occlusion and/or no...
Although habitat fragmentation is one of the greatest threats to biodiversity worldwide, virtually no attention has been paid to the quantification of error in fragmentation statistics. Landscape pattern indices (LPIs), such as mean patch size and number of patches, are routinely used to quantify fragmentation and are often calculated using remote-...
The TaskTracer system seeks to help multi-tasking users manage the resources that they create and access while carrying out their work activities. It does this by associating with each user-defined activity the set of files, folders, email messages, contacts, and web pages that the user accesses when performing that activity. The initial TaskTrac...
Helping computer users rapidly locate files in their folder hierarchies has become an important research topic in today's intelligent user interface design. This paper reports on FolderPredictor, a software system that can reduce the cost of locating files in hierarchical folders. FolderPredictor applies a cost-sensitive prediction algorithm to the...
This paper proposes a new generic object recognition system based on multi-scale affine-invariant image regions. Image segments are obtained by a watershed transform of the principal curvature of a contrast-enhanced image. Each region is described by an intensity-based statistical descriptor and a PCA-SIFT descriptor. The spatial relations between re...
Knowledge workers spend the majority of their working hours processing and manipulating information. These users face continual costs as they switch between tasks to retrieve and create information. The TaskTracer project at Oregon State University investigates the possibilities of a desktop software system that will record in detail how knowledge...
This paper studies the problem of learning diagnostic policies from training examples. A diagnostic policy is a complete description of the decision-making actions of a diagnostician (i.e., tests followed by a diagnostic decision) for all possible combinations of test results. An optimal diagnostic policy is one that minimizes the expected total co...
This paper reports on TaskTracer, a software system being designed to help highly multitasking knowledge workers rapidly locate, discover, and reuse past processes they used to successfully complete tasks. The system monitors users' interaction with a computer, collects detailed records of users' activities and resources accessed, associates (au...
A short report on the Dagstuhl seminar on Probabilistic, Logical and Relational Learning -- Towards a Synthesis is given. @InProceedings{deraedt_et_al:DSP:2006:412, author = {Luc De Raedt and Tom Dietterich and Lise Getoor and Stephen H. Muggleton}, title = {05051 Executive Summary -- Probabilistic, Logical and Relational Learning - Towards a Synth...
From 30.01.05 to 04.02.05, the Dagstuhl Seminar 05051 "Probabilistic, Logical and Relational Learning - Towards a Synthesis" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts...
Many real-world domains exhibit rich relational structure and stochasticity and motivate the development of models that combine predicate logic with probabilities. These models describe probabilistic influences between attributes of objects that are related to each other through known domain relationships. To keep these models succinct, each such i...
With transfer learning, one set of tasks is used to bias learning and improve performance on another task. However, transfer learning may actually hinder performance if the tasks are too dissimilar. As described in this paper, one challenge for transfer learning research is to develop approaches that detect and avoid negative transfer using ver...
The standard model of supervised learning assumes that training and test data are drawn from the same underlying distribution. This paper explores an application in which a second, auxiliary, source of data is available drawn from a different distribution. This auxiliary data is more plentiful, but of significantly lower quality, than the training...
Bias-variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well tuned to the properties of a specific base learner. Indeed the effectiveness of ensemble methods critically depends on accuracy, diversity and learning characteristics of base learners. We present an extended experimental ana...