Daniel Gruhl
IBM · IBM Research Almaden

About

77 Publications · 29,723 Reads · 7,733 Citations

Publications (77)
Preprint
Full-text available
Recent advances in text representation have shown that training on large amounts of text is crucial for natural language understanding. However, models trained without predefined notions of topical interest typically require careful fine-tuning when transferred to specialized domains. When a sufficient amount of within-domain text may not be availa...
Article
Rule-based Natural Language Processing (NLP) pipelines depend on robust domain knowledge. Given the long tail of important terminology in radiology reports, it is not uncommon for standard approaches to miss items critical for understanding the image. AI techniques can accelerate the concept expansion and phrasal grouping tasks to efficiently creat...
Article
The Semantic Web movement has produced a wealth of curated collections of entities and facts, often referred to as Knowledge Graphs. Creating and maintaining such Knowledge Graphs is far from a solved problem: it is crucial to constantly extract new information from the vast amount of heterogeneous sources of data on the Web. In this work we add...
Chapter
A considerable amount of scientific and technical content is still locked behind data formats which are not machine readable, especially PDF files - and this is particularly true in the healthcare domain. While the Semantic Web has nourished the shift to more accessible formats, in business scenarios it is critical to be able to tap into this type...
Chapter
In the last decade the Semantic Web initiative has promoted the construction of knowledge resources that are understandable by both humans and machines. Nonetheless considerable scientific and technical content is still locked behind proprietary formats, especially PDF files. While many solutions have been proposed to shift the publishing mechanism...
Conference Paper
In many Information Extraction tasks, dictionaries and lexica are powerful building blocks for sophisticated extractions. The success of the Semantic Web in the last 10 years has produced an unprecedented quantity of available structured data that can be leveraged to produce dictionaries on countless concepts in many domains. While being an invalua...
Chapter
Many Knowledge Extraction systems rely on semantic resources - dictionaries, ontologies, lexical resources - to extract information from unstructured text. A key for successful information extraction is to consider such resources as evolving artifacts and keep them up-to-date. In this paper, we tackle the problem of dictionary expansion and we prop...
Conference Paper
Data exploration is a task that inherently requires high human interaction. The subject matter expert looks at the data to identify a hypothesis, potential questions, and where to look for answers in the data. Virtually all data exploration scenarios can benefit from a tight human-in-the-loop paradigm, where data can be visualized and reshaped, but...
Conference Paper
Many real world analytics problems examine multiple entities or classes that may appear in a corpus. For example, in a customer satisfaction survey analysis there are over 60 categories of (somewhat overlapping) concerns. Each of these is backed by a lexicon of terminology associated with the concern (e.g., “Easy, user friendly process” or ”Process...
Chapter
Extracting relations from unstructured Web content is a challenging task and for any new relation a significant effort is required to design, train and tune the extraction models. In this work, we investigate how to obtain suitable results for relation extraction with modest human efforts, relying on a dynamic active learning approach. We propose a...
Chapter
Ontologies are a basic tool to formalize and share knowledge. However, very often the conceptualization of a specific domain depends on the particular user’s needs. We propose a methodology to perform user-centric ontology population that efficiently includes human-in-the-loop at each step. Given the existence of suitable target ontologies, our met...
Conference Paper
Domain-specific relation extraction requires training data for supervised learning models, and thus, significant labeling effort. Distant supervision is often leveraged for creating large annotated corpora; however, these methods require handling the inherent noise. On the other hand, active learning approaches can reduce the annotation cost by selec...
Conference Paper
Ontologies are dynamic artifacts that evolve both in structure and content. Keeping them up-to-date is a very expensive and critical operation for any application relying on semantic Web technologies. In this paper we focus on evolving the content of an ontology by extracting relevant instances of ontological concepts from text. We propose a novel...
Article
With the rise of social media, learning from informal text has become increasingly important. We present a novel semantic lexicon induction approach that is able to learn new vocabulary from social media. Our method is robust to the idiosyncrasies of informal and open-domain text corpora. Unlike previous work, it does not impose restrictions on the...
Conference Paper
Within social networks, certain messages propagate with more ease or attract more attention than others. This effect can be a consequence of several factors, such as topic of the message, number of followers, real-time relevance, person who is sending the message etc. Only one of these factors is within a user’s reach at authoring time: how to phra...
Patent
Full-text available
A method resolves ambiguous spotted entity names in a data corpus by determining an activation level value for each of a plurality of nodes corresponding to a single ambiguous entity name. The activation levels for each of the nodes may be modified by inputting outside domain knowledge corresponding to the nodes to increase the activation value of...
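The spreading-activation idea this patent describes can be sketched roughly as follows. Note this is a minimal illustration under simplifying assumptions: the function and parameter names are hypothetical, and the decay/propagation scheme is a common textbook choice, not necessarily the one claimed in the patent.

```python
# Hypothetical sketch: candidate nodes for an ambiguous entity name carry
# activation levels; outside domain knowledge boosts the nodes it supports,
# and activation spreads along graph edges before the top node is chosen.

def disambiguate(candidates, edges, evidence, rounds=3, decay=0.5):
    """candidates: {node: initial activation}; edges: {node: [neighbors]};
    evidence: {node: boost derived from outside domain knowledge}."""
    act = dict(candidates)
    for node, boost in evidence.items():        # inject domain knowledge
        act[node] = act.get(node, 0.0) + boost
    for _ in range(rounds):                     # propagate activation
        nxt = dict(act)
        for node, nbrs in edges.items():
            share = act.get(node, 0.0) * decay / max(len(nbrs), 1)
            for n in nbrs:
                nxt[n] = nxt.get(n, 0.0) + share
        act = nxt
    return max(act, key=act.get)                # highest activation wins
```

For example, two candidate senses with equal priors are tipped toward the one that outside evidence supports, directly or via a linked node.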
Patent
Described herein are methods, systems, apparatuses and products for automated information discovery and traceability for evidence generation. An aspect provides for accessing a mapping of a plurality of connected nodes stored in a memory device, said mapping being discovered via a network scan based on a seed set, said plurality of connected nodes...
Patent
Full-text available
Embodiments of the invention relate to creating an operating system and file system independent incremental data backup. A first data backup of a source system and second version of the data on the source system is received. A second data backup of the second version of the data is created by determining differences between the first data backup an...
Patent
Full-text available
Embodiments of the invention provide a system and method for determining preferences from information mashups and, in particular, a system and method for constructing a ranked list from multiple sources. In an exemplary embodiment, the system and method tunably combines multiple ranked lists by computing a score for each item within the list, where...
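The per-item scoring for tunably combining ranked lists can be illustrated with a weighted Borda-style merge. This is only one classic choice of scoring function, offered as a sketch; the patent's actual scoring and tuning scheme may differ.

```python
# Illustrative sketch: each list awards points by position, scaled by a
# per-list weight (the "tuning" knob); items are re-ranked by total score.

def merge_ranked(lists, weights=None):
    """lists: sequence of ranked lists (best first); weights: one per list."""
    weights = weights or [1.0] * len(lists)
    scores = {}
    for ranked, w in zip(lists, weights):
        for rank, item in enumerate(ranked):
            # higher positions earn more points, scaled by the list's weight
            scores[item] = scores.get(item, 0.0) + w * (len(ranked) - rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Raising a source's weight pulls the merged order toward that source's ranking, which is the tunability the abstract refers to.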
Patent
Full-text available
Data deduplication compression in a streaming storage application, is provided. The disclosed deduplication process provides a deduplication archive that enables storage of the archive to, and extraction from, a streaming storage medium. One implementation involves compressing fully sequential data stored in a data repository to a sequential stream...
Patent
Full-text available
A method, system, and article for compressing an input stream of uncompressed data. The input stream is divided into one or more data segments. A hash is applied to a first data segment, and an offset and length are associated with this first segment. This hash, together with the offset and length data for the first segment, is stored in a hash tab...
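The segment-hashing idea in this abstract can be sketched in a few lines. Fixed-size segmentation, the output format, and all names here are simplifying assumptions for illustration, not details taken from the patent.

```python
# Minimal sketch: split the stream into segments, hash each, and record
# (offset, length) for the first occurrence in a hash table; later identical
# segments are emitted as references to the prior copy.
import hashlib

def dedup(data: bytes, seg: int = 4):
    table, out = {}, []            # hash table: digest -> (offset, length)
    for off in range(0, len(data), seg):
        chunk = data[off:off + seg]
        h = hashlib.sha256(chunk).hexdigest()
        if h in table:
            out.append(("ref", table[h]))    # duplicate: reference prior copy
        else:
            table[h] = (off, len(chunk))
            out.append(("raw", chunk))       # first occurrence: store literal
    return out

def inflate(out):
    """Rebuild the stream; refs resolve against what has been rebuilt so far."""
    buf = b""
    for kind, val in out:
        if kind == "raw":
            buf += val
        else:
            off, length = val
            buf += buf[off:off + length]
    return buf
```

Because references always point at earlier offsets, decompression is a single forward pass, which is what makes this style of archive friendly to streaming storage media.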
Patent
Full-text available
A system for transforming domain specific unstructured data into structured data including an intake platform controlled by feed back from a control platform. The intake platform includes an intake acquisition module for acquiring data building baseline data related to a domain and problem of interest, an intake pre-processing module, an intake lan...
Conference Paper
Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. “Understanding” these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs...
Article
Full-text available
Rising costs, decreasing quality of care, diminishing productivity, and increasing complexity have all contributed to the present state of the healthcare industry. The interactions between payers (e.g., insurance companies and health plans) and providers (e.g., hospitals and laboratories) are growing and are becoming more complicated. The constant...
Conference Paper
Family history is an important part of the clinical record. Automatically extracting family history from clinical reports requires an understanding of how clinicians describe family history as well as an understanding of how they discuss diseases. We perform a characteristic analysis of family history sentences and compare them with sentences which...
Conference Paper
Family history information is an important part of understanding a patient's total health, and it is spread in text throughout the clinical record; mostly untagged and often very unstructured. We have performed a syntactic analysis of family history in 1274 sample clinical texts and created an algorithm to extract specific family history informatio...
Article
Full-text available
Social Networks provide one of the most rapidly evolving data sets in existence today. Traditional Business Intelligence applications struggle to take advantage of such data sets in a timely manner. The BBC SoundIndex, developed by the authors and others, enabled real-time analytics of music popularity using data from a variety of Social Networks....
Conference Paper
There is still significant investment in legacy healthcare information technology (HIT). Current systems are a diverse mix of technologies, standards, platforms and versions, many of which were never intended to be used together to achieve a common goal. In order to deliver care effectively and efficiently, point-of-care software must navigate this...
Article
Modern Electronic Medical Record (EMR) systems often integrate large amounts of data from multiple disparate sources. To do so, EMR systems must align the data to create consistency between these sources. The data should also be presented in a manner that allows a clinician to quickly understand the complete condition and history of a patient's hea...
Article
Full-text available
The Sound Index system demonstrates a new way to measure popularity in the world of music by incorporating the web, online communities and social networks. The Sound Index system catalogs the hottest artists and tracks who are most popular on the web. It provides a current view of popular music content online, by incorporating listens, plays, downl...
Conference Paper
Full-text available
There are a large number of websites serving valuable content that can be used by higher-level applications, Web Services, Mashups etc. Yet, due to various reasons (lack of computing resources, financial constraints etc.) they are unable to provide Web Service APIs to access their data. In their desire to incorporate the latest and greatest technol...
Conference Paper
Full-text available
This paper explores the application of restricted relationship graphs (RDF) and statistical NLP techniques to improve named entity annotation in challenging Informal English domains. We validate our approach using on-line forums discussing popular music. Named entity annotation is particularly difficult in this domain because it is characterized by...
Conference Paper
Full-text available
The ever increasing amount of content on the Internet has fostered many efforts seeking to leverage this potentially yottascale information source. Service systems using advanced data and text analytics techniques have been developed to perform knowledge gathering and information discovery over Web data. Information gathered from free and public so...
Article
Full-text available
Over the last decade, companies have been slowly realizing that the World Wide Web represents both a pivotal new source of information on their customers and game-changing technology that will augment their current business operations. In response to this recognition, companies are showing interest in technology that tells them what is happening on...
Conference Paper
Full-text available
Blogs, discussion forums and social networking sites are an excellent source for people's opinions on a wide range of topics. We examine the application of voting theory to "information mashups" - the combining and summarizing of data from the multitude of often-conflicting sources. This paper presents an information mashup in the music domain: a T...
Article
This paper is a first exploration of the relationship between service science and Grid computing. Service science is the study of value co-creation interactions among entities, known as service systems. Within the emerging service science community, service is often defined as the application of competences (resources) for the benefit of another. G...
Conference Paper
Full-text available
The increased use of virtual machines in the enterprise environment presents an interesting new set of challenges for the administrators of today's information systems. In addition to the management of the sheer volume of easily-created new data on physical machines, VMs themselves contain data that is important to the user of the virtual machine....
Conference Paper
Information enrichment is generally considered a modern task of supplying specific-necessity metadata to an existing body of information, with the intent to enable a particular task. In this paper we suggest that information enrichment service systems have been existent for thousands of years, and that information has often been enriched in more ge...
Article
Full-text available
The service sector accounts for more than 80 percent of the US gross domestic product and employs a growing share of the science and engineering workforce. Yet it's one of the least-studied areas of the economy. Some see economics, operations research, industrial engineering, or the science of complex systems as the appropriate starting point for a...
Conference Paper
Popularity based search engines have served to stagnate information retrieval from the web. Developed to deal with the very real problem of degrading quality within keyword based search, they have had the unintended side effect of creating "icebergs" around topics, where only a small minority of the information is above the popularity waterline....
Conference Paper
Full-text available
As systems grow larger in size and complexity, it becomes increasingly difficult for administrators to maintain some shared sense of awareness of what’s going on in the system. We implemented a large public display with appropriately designed visualizations that allow for rapid assessment and peripheral awareness of system health. By placing the vi...
Conference Paper
An increasing fraction of the global discourse is migrating online in the form of blogs, bulletin boards, web pages, wikis, editorials, and a dizzying array of new collaborative technologies. The migration has now proceeded to the point that topics reflecting certain individual products are sufficiently popular to allow targeted online tracking of...
Conference Paper
XML provides a universal and portable format for document and data exchange. While the syntax and specification of XML make documents both human readable and machine parsable, it is often at the expense of efficiency when representing simple data structures. We investigate the "costs" associated with XML serialization from several resource perspe...
Conference Paper
We study the dynamics of information propagation in environments of low-overhead personal publishing, using a large collection of weblogs over time as our example domain. We characterize and model this collection at two levels. First, we present a macroscopic characterization of topic propagation through our corpus, formalizing the notion of long-r...
Article
Full-text available
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the...
Article
WebFountain is a platform for very large-scale text analytics applications. The platform allows uniform access to a wide variety of sources, scalable system-managed deployment of a variety of document-level “augmenters” and corpus-level “miners,” and finally creation of an extensible set of hosted Web services containing information that drives end...
Article
Full-text available
This paper provides an objective evaluation of the performance impacts of binary XML encodings, using a fast stream-based XQuery processor as our representative application. Instead of proposing one binary format and comparing it against standard XML parsers, we investigate the individual effects of several binary encoding techniques that are share...
Article
Vinci is a local-area service-oriented architecture (SOA) designed for rapid development and management of robust Web applications. Based on XML document exchange, Vinci is designed to complement and interoperate with wide-area SOAs such as SOAP/UDDI and .NET. This paper presents the Vinci architecture, the rationale behind its design, and an evalu...
Conference Paper
Full-text available
YouServ is a system that allows its users to pool existing desktop computing resources for high availability web hosting and file sharing. By exploiting standard web and internet protocols (e.g. HTTP and DNS), YouServ does not require those who access YouServ-published content to install special purpose software. Because it requires minimal server-...
Conference Paper
Vinci is a local area service-oriented architecture designed for rapid development and management of robust web applications. Based on XML document exchange, Vinci is designed to complement and interoperate with wide area service-oriented architectures such as E-Speak and .NET. This paper presents the Vinci architecture, the rationale behind its de...
Article
Tunable Tamper Proofing is a method of detecting image tampering that localizes and characterizes the changes that an image has undergone. It is "tunable" in the sense that different image metrics can be individually tuned to trigger at different thresholds of change. This enables a detector to ignore or focus on selected classes of image manipul...
Article
In an earlier paper, “Techniques for Data Hiding,” the overall goals and constraints of information-hiding problem space and a variety of approaches to information hiding in image, audio, and text were described. In this sequel, information-hiding goals and applications are expanded beyond watermarking to encompass the more general concept of infor...
Article
Full-text available
Ideally a computational approach could assist in the human-intensive tasks associated with selecting and presenting timely, relevant information, i.e., news editing. At present this goal is difficult to achieve because of the paucity of effective machine-understanding systems for news. A structure for news that affords a fluid interchange between h...
Conference Paper
Full-text available
Security documents (currency, treasury bills, stocks, bonds, birth certificates, etc.) provide an interesting problem space for investigating information hiding. Recent advances in the quality of consumer printers and scanners have allowed the application of traditional information hiding techniques to printed materials. This paper explores how s...
Article
Homomorphic signal-processing techniques are used to place information imperceivably into audio data streams by the introduction of synthetic resonances in the form of closely-spaced echoes. These echoes can be used to place digital identification tags directly into an audio signal with minimal objectionable degradation of the original signal.
Article
Full-text available
Data hiding, a form of steganography, embeds data into digital media for the purpose of identification, annotation, and copyright. Several constraints affect this process: the quantity of data to be hidden, the need for invariance of these data under conditions where a "host" signal is subject to distortions, e.g., lossy compression, and the degree...
Conference Paper
Homomorphic signal-processing techniques are used to place information imperceivably into audio data streams by the introduction of synthetic resonances in the form of closely-spaced echoes. These echoes can be used to place digital identification tags directly into an audio signal with minimal objectionable degradation of the original signal.
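The echo-hiding technique this abstract describes can be sketched with a toy embedder and detector. The delays, echo gain, and autocorrelation-based detector below are illustrative choices, not the paper's exact scheme (which works in the cepstral domain).

```python
# Toy sketch: embed one bit by adding a faint, delayed copy of the signal at
# one of two candidate delays, then detect which delay carries more
# autocorrelation energy. All parameters here are hypothetical.
import random

D0, D1, ALPHA = 50, 80, 0.4   # candidate echo delays (samples), echo gain

def embed(signal, bit):
    d = D1 if bit else D0
    return [s + ALPHA * (signal[i - d] if i >= d else 0.0)
            for i, s in enumerate(signal)]

def autocorr(x, lag):
    return sum(x[i] * x[i - lag] for i in range(lag, len(x)))

def detect(signal):
    return 1 if autocorr(signal, D1) > autocorr(signal, D0) else 0
```

The added echo raises the signal's correlation with itself at the chosen lag, so comparing the two candidate lags recovers the bit while the echo itself stays close to inaudible.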
Article
Data hiding is the process of embedding data into image and audio signals. The process is constrained by the quantity of data, the need for invariance of the data under conditions where the `host' signal is subject to distortions, e.g., compression, and the degree to which the data must be immune to interception, modification, or removal. We explor...
Article
Full-text available
We describe an approach to measure the popularity of music tracks, albums and artists by analyzing the comments of music listeners in social networking online communities such as MySpace. This measure of popularity appears to be more accurate than the traditional measure based on album sales figures, as demonstrated by our focus group study. We f...
