A Criteria-Driven Method for Architecting
Domain-Specific IE Applications1
T. Neef, S. Morana, B. G. Humm
Department of Computer Science
Hochschule Darmstadt University of Applied Sciences, Germany
{tobias.neef, stefan.morana}@stud.h-da.de, bernhard.humm@h-da.de
Abstract
This paper proposes a method for architecting domain-specific information extraction (IE)
applications focusing on a good cost/benefit ratio for a concrete domain. The method uses
criteria to recommend the appropriate use of rule-based or machine learning based methods in
the IE application architecture. By using an example from the tourism domain, the paper
describes how the evaluation criteria can be applied in practice. An evaluation of the costs and
benefits indicates a good IE recognition rate with reasonable development effort.
Keywords
NLP, supervised Information Extraction, tourism, architecture
1. Introduction
Companies have an increasing need to access semantic information contained in natural
language text. This has been an ongoing trend over the last decade, mainly motivated
by the fact that the Internet enables companies to access massive amounts of
information available in textual form. Semantic information in the context of this
paper is knowledge contained in documents which is relevant to users of a specific
domain. Natural Language Processing (NLP), and in particular Information
Extraction (IE), is the technique to extract semantic information from natural
language text. A profit-oriented institution that develops an IE application needs to
define a domain-specific architecture with a good cost / benefit ratio.
Most IE publications, today, focus on approaches which solely increase the
recognition rate, i.e., the benefit of IE applications. However, the development costs
induced by those approaches are rarely considered. In this paper, we define a criteria-
driven method for specifying the architecture of domain-specific IE applications with
a well-balanced cost / benefit ratio.
[1] This work was funded by HRS Hotel Reservation Service.
The paper is structured as follows. Section 2 presents related work including state-of-
the-art domain-specific IE approaches. Based on these approaches, the method for
architecting domain-specific IE applications is defined (Section 3) and then applied
to a specific domain (Section 4). The evaluation (Section 4.3) shows the
development effort (cost) and recognition rate (benefit) of the resulting application.
Section 5 concludes the paper.
2. Related Work
Recent work has shown the capabilities of various approaches for IE in domain-
specific (supervised) scenarios. Those approaches can be categorized into two logical
components: Entity Mention Detection (EMD) and Relation Mention Detection
(RMD) (Surdeanu et al., 2011). EMD groups the techniques used for detecting
(named) entities. RMD detects (semantic) relations which connect entities.
Publications like (Surdeanu et al., 2008) and others focus on applications which are
solely based on Machine Learning (ML). In ML, the extraction of information is
based on an ML model which was trained with sample data, a so-called corpus.
IE systems like ANNIE (Isabelle et al., 2001) do not use machine learning. Instead,
the system relies on domain-specific resources like gazetteers or taxonomies for
EMD. ANNIE is capable of morphological normalization and basic rule-based
co-reference resolution in order to increase the entity recognition rate. Extended
versions of ANNIE use the annotation pattern language JAPE, which is based on finite
state transducers (Cunningham et al., 2000). With JAPE, rules can be defined which use
syntactic information to realize rule-based EMD and RMD. (Wyner and Wim, 2011)
show how this approach is applied to the legal domain. They propose a
linguistically motivated system which is based on Phrase Structure Parses to
represent syntax.
(Surdeanu et al., 2011) argue that in supervised scenarios, the performance of a
domain-specific IE application can be optimized using domain-specific components
like rules and gazetteers.
In summary, both ML-based and rule-based approaches as well as mixed approaches
have been applied to domain-specific IE. The focus of research in those areas was on
the improvement of recognition rate measures. The following section proposes
evaluation criteria which also consider the effort spent on learning and customizing a
system.
3. A Criteria-Driven Method for Architecting
Domain-Specific IE Applications
3.1. Overview
The method takes as input the prioritization of certain criteria and influencing factors
of the problem domain. The outputs are architecture recommendations for the IE
application to be constructed. See Figure 1.
[Prioritization of Criteria] + [Influencing Factors] -> Method -> [Architecture Recommendations]
Figure 1 - Inputs and outputs of the method
Criteria, influencing factors and architecture alternatives are described in the
following sections.
3.2. Criteria
General criteria for architecting applications are costs and benefits, where a good cost
/ benefit ratio is the goal. In the context of IE applications, these criteria can be
refined as follows.
• Recognition rate: a high recognition rate is the main benefit of an IE
  application.
• Effort: the development effort is the main cost factor for an IE application
  and should be as low as possible. It can be split up into two factors.
  o Customization effort: IE applications are usually built on top of
    off-the-shelf IE components that need to be customized. The
    classic programming effort is usually relatively low.
  o Learning effort: The use of off-the-shelf IE components is usually
    complex and requires proficiency in NLP and the technique used.
    Depending on the NLP expertise in the development team, the
    learning effort may be substantial.
In this paper, we do not consider the cost factor for software licences since our
method is independent of concrete off-the-shelf products and their pricing models.
Depending on the domain under consideration, the criteria may be prioritized
differently. In one project, the recognition rate has top priority and high costs may be
acceptable. In other projects, a reasonable recognition rate is acceptable but the costs
must be limited. The prioritization of the criteria is an important input for
architectural decisions.
3.3. Influencing Factors
Apart from the prioritization of the criteria, there are other factors influencing
architectural decisions for IE applications.
• Linguistic complexity: the domain under consideration may involve
  different linguistic complexities. Aspects are, e.g., grammatical correctness
  incl. the occurrence of typing errors, the use of domain-specific
  terminology, the writing style (length of sentences, nesting depth), or the
  occurrence of co-references.
• Team expertise: The expertise of the development team regarding NLP and
  IE technologies and concrete off-the-shelf components has a strong
  influence on the learning curve and the development effort.
• Domain-specific resources: The availability of domain-specific resources
  like, e.g., dictionaries (taxonomies, ontologies) or corpora has an influence
  on IE approaches to be used.
3.4. Reference Architecture for IE Applications
Figure 2 shows a reference architecture for IE applications, i.e., a blueprint that
shows essential components, inputs and outputs of IE applications in general.
Input: Text
Information Extraction: Entity Mention Detection, Relation Mention Detection
Resources: Corpora, ML Models, Gazetteers, Dictionaries, Rules
Output: Entities, Relations
Figure 2 - Reference architecture for IE applications
The main input for an IE application is text in a natural language. The main outputs
are detected entities and relations. The main components of an IE application refer to
the main IE tasks: Entity Mention Detection (EMD) and Relation Mention Detection
(RMD). Resources may be used to configure the IE components: corpora, ML
models, gazetteers, dictionaries, and rule sets.
EMD and RMD may each be performed by both machine learning (ML) and rule-based
(RB) approaches, each using different resources. See Table 1.
IE Tasks \ IE Approaches         | Machine Learning (ML)                         | Rule-Based (RB)
Entity Mention Detection (EMD)   | Corpora, ML Models (Gazetteers, Dictionaries) | Rules, Gazetteers, Dictionaries
Relation Mention Detection (RMD) | Corpora, ML Models                            | Rules
Table 1 - Resources used by IE approaches for IE tasks
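The reference architecture can be sketched in a few lines of Python. This is only an illustration of the component boundaries described above, not the implementation used in the project: the type names (Entity, Relation, IEApplication) and the two toy detectors are assumptions introduced here. The point of the sketch is that EMD and RMD are exchangeable components, so either one can be realized with an ML-based or a rule-based approach without changing the surrounding application.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Entity:
    text: str      # surface form found in the input text
    category: str  # entity type assigned by EMD
    start: int     # character offset where the mention begins


@dataclass(frozen=True)
class Relation:
    name: str
    arguments: Tuple[Entity, ...]


# Either component may be rule-based or ML-based; only the signature is fixed.
EntityDetector = Callable[[str], List[Entity]]
RelationDetector = Callable[[str, List[Entity]], List[Relation]]


@dataclass
class IEApplication:
    emd: EntityDetector
    rmd: RelationDetector

    def extract(self, text: str) -> Tuple[List[Entity], List[Relation]]:
        # EMD runs first; RMD consumes the text and the detected entities.
        entities = self.emd(text)
        relations = self.rmd(text, entities)
        return entities, relations


def toy_emd(text: str) -> List[Entity]:
    # Hypothetical stand-in for a real EMD component: finds the word "sauna".
    pos = text.lower().find("sauna")
    return [Entity("sauna", "Facility", pos)] if pos >= 0 else []


def toy_rmd(text: str, entities: List[Entity]) -> List[Relation]:
    # Hypothetical stand-in for RMD: attaches every entity to the hotel.
    return [Relation("HotelToUnitRelation", (e,)) for e in entities]


app = IEApplication(emd=toy_emd, rmd=toy_rmd)
entities, relations = app.extract("The hotel offers a Finnish sauna.")
```

Resources such as gazetteers, rules or ML models would be held inside the concrete detector passed in, which keeps the application itself independent of the approach chosen.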
3.5. Architecture Recommendations
Our method gives architecture recommendations for IE applications depending on
the prioritization of criteria and influencing factors of the domain under
consideration. The following Table 2 gives an overview.
Topic                           | Input: Prioritization of criteria and influencing factors | Architecture Recommendation
                                |                                                            | EMD: ML | EMD: RB | RMD: ML | RMD: RB
(A.) Prioritization of criteria | (A.1) Top priority on recognition rate                     |         |         |         |
(B.) Linguistic complexity      | (B.1) High linguistic complexity                           |         |         |         |
                                | (B.2) Low linguistic complexity                            |         |         |         |
(C.) NLP expertise              | (C.1) Limited NLP expertise in development team            |         |         |         |
                                | (C.2) High ML experience in development team               |         |         |         |
                                | (C.3) High linguistic experience in development team       |         |         |         |
(D.) Domain-specific resources  | (D.1) Dictionary available                                 |         |         |         |
                                | (D.2) Annotated corpus available                           |         |         |         |
Table 2 - Overview of architecture recommendations
The column “Input” describes the prioritization of criteria and influencing factors in
a particular application domain. The column “Architecture Recommendation”
indicates a preferable selection of an IE approach (machine learning or rule-based)
for a particular IE task (entity mention detection or relation mention detection). A
tick (✓) denotes a suitable approach; a cross (✗) denotes an approach which is to be
avoided. If, in a concrete application domain, different recommendations are in
conflict, then the application architect has to make an informed decision. The
following paragraphs will help the architect in making such a decision.
Basics
Named entity recognition (NER) based on gazetteers can be broken down into two
parts. The first one consists of pre-defined lists of categorized words (gazetteers).
If there is no doubt as to which category a lookup belongs, it can be
marked as a Named Entity. Often, however, the category to which a Named Entity
can be assigned depends on the context of the token. In addition, problems like word
sense ambiguity, incorrect spelling or co-references make it infeasible to rely solely on
gazetteer lists. Therefore, as the second part, gazetteer-based systems like ANNIE
include context-specific rules in order to yield better EMD performance.
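The two parts of gazetteer-based NER can be illustrated with a minimal Python sketch. It is not how ANNIE works internally: the gazetteer entries (borrowed from the hotel example in Section 4.1), the function names and the single context rule are assumptions made for illustration only. The lookup step marks unambiguous hits directly, while a context rule decides between candidate categories for ambiguous entries.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lower-cased surface form -> candidate categories.
GAZETTEER: Dict[str, List[str]] = {
    "sauna": ["Hotel Facility", "Unit Facility"],
    "solarium": ["Hotel Facility"],
}


def lookup(text: str) -> List[Tuple[str, int, List[str]]]:
    """Return (surface form, offset, candidate categories) for each gazetteer hit."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        categories = GAZETTEER.get(match.group().lower())
        if categories:
            hits.append((match.group(), match.start(), categories))
    return hits


def disambiguate(text: str, offset: int, categories: List[str]) -> str:
    """Pick a category; a toy context rule resolves ambiguous entries."""
    if len(categories) == 1:
        return categories[0]
    # Assumed rule: "room" earlier in the same sentence suggests a unit facility.
    sentence_start = text.rfind(".", 0, offset) + 1
    context = text[sentence_start:offset].lower()
    return "Unit Facility" if "room" in context else "Hotel Facility"


def detect_entities(text: str) -> List[Tuple[str, str]]:
    """Gazetteer lookup followed by context-based category assignment."""
    return [(form, disambiguate(text, off, cats)) for form, off, cats in lookup(text)]
```

A plain dictionary lookup like this fails on misspellings and unseen synonyms, which is the limitation the surrounding text describes.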
Another approach to EMD is the use of machine learning (ML) based NER as
pioneered by (Lafferty, 2001). As with all ML approaches, a large number of pre-
annotated documents of the given domain has to be available to yield good results.
However, those systems can be significantly more robust against various forms of
misspellings. Additionally, they can learn the correct categorization for the given
context and they already solve some co-reference related problems.
When evaluating the complexity of both approaches, it soon becomes obvious that the
complexity characteristics are not linear. Instead, they depend on the quality of the
input data which is available for a given domain, and on the linguistic
complexity of the domain.
Linguistic Complexity (B.)
The rules in an RB approach have to be created manually. The customization effort
for the system increases with the linguistic complexity of the domain documents. An
ML-based application uses a corpus. The complexity of annotating a corpus is not
related to the linguistic complexity of the document. As a consequence, we
recommend ML-based approaches for EMD and RMD when the linguistic
complexity is high (B.1).
Prioritization of criteria (A.)
Rule-based and ML-based approaches can also be combined. According to
(Surdeanu et al., 2011), the combination of gazetteers and ML-based NER is
required to optimize the recognition rate of a domain-specific IE application.
However, this combination has an effect on the system complexity and, hence, on
learning and customization effort. If the recognition rate is prioritized (A.1) then the
combination of ML-based and RB approaches is recommended.
Domain-specific resources (D.)
In the gazetteer approach, gazetteers may be derived from resources like dictionaries
or taxonomies. If such resources do not exist, then domain-specific gazetteer lists
need to be constructed manually by domain experts. Depending on the domain and
the number of relevant entities, this may lead to a customization effort which is no
longer acceptable (Kozareva, 2006) (D.1).
The initial customization effort of the ML-based NER is mainly influenced by the
effort required to annotate documents. They are the input for the learning algorithms.
If those annotated documents are not available then the effort of creating a
sufficiently large corpus of good quality is a high initial investment.
When looking at the customization effort of both approaches, we argue that gazetteer-
based applications have a relatively low initial customization effort. It increases if
the quality of the domain-specific dictionaries is low or no dictionaries are available.
The initial customization effort of ML-based methods is higher but decreases
when an annotated, similar corpus is already available (D.2).
NLP Expertise (C.)
The learning effort of the gazetteer-based approach is mainly related to the linguistic
complexity of the domain and the annotation features available. The learning effort
of creating gazetteer lists is minimal as long as the creator of the lists has enough
domain knowledge. In contrast, the EMD related rules require a solid understanding
of rule systems like JAPE (Cunningham, et al., 2000), as well as the features which
are used as input for the rule system. Features can be part-of-speech (POS)
information as well as various representations of syntax.
The learning effort of ML-based NER is mainly influenced by the abstractions the
tools offer to hide the complex ML algorithms from the developer. This problem has
been actively worked on since research started to focus on this area
(Lafferty, 2001). As a result, tools have emerged like the Stanford NER
(Finkel, et al., 2005) or GATE’s Batch Learning PR (Li, et al., 2005) which an
average developer can use to create domain-specific NER models.
An important factor in the learning effort of the EMD component is the background
of the developers. If they have a more linguistic background, the gazetteer-based
approach is recommendable (C.3). If the developers have some background in
statistics and machine learning, they can soon become productive with the ML-based
approach (C.2). Both techniques have a certain level of initial complexity which
needs to be overcome.
In ML-based RMD, features like Named Entities are required which are detected by
preprocessing steps or EMD. Those features and a corpus annotated with relations are
the input for a learning algorithm. The combination of these features and the
customization is an expert task and, from our perspective, no tool has gained enough
traction to be used outside a specific research group (C.1).
Rule-based RMD is mostly based on syntax information. This information is then
processed by a rule system in order to find the semantic relationships. There are two
commonly used syntactic representations: phrase structure parses and dependency
grammars. Phrase structure parses are trees that are optimized to represent the syntax
in a detailed and linguistically correct way. Therefore, they have been used in many
systems in the past. Publications like (de Marneffe and Manning, 2008) showed that
for most non-linguists the dependency grammar representation is more appealing.
They also proposed a dependency representation which is mainly motivated by the
goal of providing semantically helpful information instead of showing all the
syntactic details. Our experiments indicate that, for developers without a
linguistic background, the acceptance of the dependency-based syntax
representation is higher than that of phrase structure parses. Further evaluation
criteria of syntactic parsers and their
representation have been developed by (Miyao et al., 2008).
Although the same customization effort tradeoffs apply for ML-based RMD as they
do for EMD, we argue that, due to the lack of tool support in the ML-based RMD
area, the learning effort is too high to recommend the approach.
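The way the recommendations in this section combine, and sometimes conflict, can be made concrete with a small Python sketch. It is an interpretation of the prose above and not a transcription of Table 2: the VOTES mapping, the +1/-1 encoding and the function names are assumptions introduced here. In particular, D.2 is read as favouring ML-based EMD, and C.1 as speaking against ML-based RMD.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (task, approach) pairs, e.g. ("EMD", "ML").
Option = Tuple[str, str]

# Votes read off the prose of Section 3.5; +1 = recommended, -1 = to be avoided.
# This is an interpretation of the text, not a transcription of Table 2.
VOTES: Dict[str, Dict[Option, int]] = {
    "A.1": {("EMD", "ML"): +1, ("EMD", "RB"): +1},  # combine both approaches
    "B.1": {("EMD", "ML"): +1, ("RMD", "ML"): +1},  # high linguistic complexity
    "C.1": {("RMD", "ML"): -1},                     # lack of tool support
    "C.2": {("EMD", "ML"): +1},                     # ML background in the team
    "C.3": {("EMD", "RB"): +1},                     # linguistic background
    "D.1": {("EMD", "RB"): +1},                     # dictionary -> gazetteers
    "D.2": {("EMD", "ML"): +1},                     # annotated corpus available
}


def recommend(inputs: Iterable[str]) -> Dict[Option, List[Tuple[str, int]]]:
    """Collect, per (task, approach), the votes of all inputs that apply."""
    collected: Dict[Option, List[Tuple[str, int]]] = defaultdict(list)
    for key in inputs:
        for option, vote in VOTES.get(key, {}).items():
            collected[option].append((key, vote))
    return dict(collected)


def conflicts(recommendation: Dict[Option, List[Tuple[str, int]]]) -> List[Option]:
    """Options that receive both a positive and a negative vote."""
    return [opt for opt, votes in recommendation.items()
            if len({vote for _, vote in votes}) > 1]
```

For a domain with high linguistic complexity and a team with limited NLP expertise, `conflicts(recommend(["B.1", "C.1"]))` reports ML-based RMD as contested. The sketch deliberately does not resolve such conflicts, since the method leaves that decision to the application architect.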
4. The Method Applied in a Sample Domain
4.1. The Hotel Domain
In order to demonstrate the proposed method, we introduce an IE problem in the
tourism domain. This IE task was accomplished within the research project
“Ontology-Based Text Mining” (OBTM) at Hochschule Darmstadt University of
Applied Sciences for the company HRS Hotel Reservation Service, a leading hotel
portal provider in Germany. The task was to extract information about a hotel, its
properties, and also about its rooms and their equipment. Thousands of textual hotel
descriptions are available in catalogues or on web pages. HRS plans to use this
information to offer its customers better search capabilities. In order to achieve
this goal, a semantic understanding of natural language text is required. The OTA
code list (OpenTravel Alliance, 2010), which is an existing classification system for
the tourism domain, defines categories of hotel-related entries. All terms in the OTA
code list have an identifier.
Code  | Name  | Category
GRI42 | Room  | Accommodation Profile, Accommodation Unit
HAC79 | Sauna | Hotel Facility, Unit Facility
Table 3: Examples of OTA codes and categories
Hotel codes are assigned to categories depending on their roles. For example, the
code “GRI42/Room” may be used to indicate the total number of rooms. In the hotel
description, this fact may be formulated as follows.
“Our hotel has 120 well-equipped rooms.”
In this case, it is expected to find a relation between the code “Room” and the
number 120 which quantifies it.
RoomAmountRelation(GRI42/Room, 120)
Another role of a room is an Accommodation Unit, which includes so-called “Unit
Facilities”.
“All our rooms contain a sauna for your personal pleasure.”
In this case it is expected to find relationships between the room and the units which
are contained in this room.
RoomUnitRelation(GRI42/Room, HAC79/Sauna)
Finally, an entity may also be in the category Hotel Facility. In this case, the unit is
not assigned to a room type. Instead, it is a unit of the Hotel:
“Concerning relaxation, you have the possibility to bring harmony between your body
and your soul, thanks to the swimming pool, the Finnish sauna, the solarium, the fitness-
room…”
While the fitness-room is quite obviously assigned to the hotel, it depends on the
context to which entity the sauna is assigned. In this case, the sauna is related to the
hotel.
HotelToUnitRelation(HAC79/Sauna)
But as seen in this section the sauna can also be assigned to a room.
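To make the expected output concrete, the following Python sketch extracts the two room-related relations from the example sentences above. It is a deliberately simplified stand-in: the project described in this paper used rules over syntax information, whereas the regular expressions, the CODES mapping and the function name below are assumptions chosen only to keep the example self-contained.

```python
import re
from typing import List, Tuple

# Hypothetical mini-gazetteer mapping surface forms to the codes of Table 3.
CODES = {"room": "GRI42/Room", "rooms": "GRI42/Room", "sauna": "HAC79/Sauna"}

# Toy pattern: a number, up to two intervening words, then "room(s)".
ROOM_AMOUNT = re.compile(r"\b(\d+)\s+(?:[\w-]+\s+){0,2}?(rooms?)\b", re.IGNORECASE)
# Toy pattern: "room(s)", a containment verb, an article, then a candidate facility.
ROOM_UNIT = re.compile(
    r"\b(rooms?)\s+(?:contains?|includes?|have|has)\s+an?\s+(\w+)", re.IGNORECASE
)


def detect_relations(sentence: str) -> List[Tuple[str, ...]]:
    """Return relation tuples found in one sentence by the two toy patterns."""
    relations: List[Tuple[str, ...]] = []
    for m in ROOM_AMOUNT.finditer(sentence):
        relations.append(("RoomAmountRelation", CODES[m.group(2).lower()], m.group(1)))
    for m in ROOM_UNIT.finditer(sentence):
        # Only keep the match if the candidate word is a known facility.
        facility = CODES.get(m.group(2).lower())
        if facility:
            relations.append(("RoomUnitRelation", CODES[m.group(1).lower()], facility))
    return relations
```

Surface patterns like these cannot decide the sauna example in the enumeration above, where the assignment to the hotel follows from context. That is the kind of case that motivates rules over syntactic structure rather than over token sequences.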
This domain has a sufficiently high complexity to motivate the architecture which is
shown in the following section.
4.2. Applying the Method to the Hotel Domain
The project members were postgraduate students (M.Sc.) without prior NLP
experience and without much knowledge about either machine learning or linguistics
(Criterion C.1). The focus of the project was to find most of the relations while
keeping the costs low. As shown in the previous section, a domain-specific
dictionary is available with the OTA code list (D.1). The linguistic complexity of the
project is also an important factor for choosing the appropriate architecture. In the
given domain, typing and grammar errors are rare. Additionally, co-references could
appear theoretically but could not be identified in the test corpus. Because most hotel
descriptions are created by the hotel marketing departments, the sentences tend to be
lengthy and sometimes have a high nesting depth. Overall, the linguistic complexity
is considered relatively low (B.2).
Entity Mention Detection: Because the OTA code list was available as a mature
domain dictionary, it was easy to create gazetteers (D.1). Including the gazetteers is
also useful because the team had no prior NLP experience. This makes gazetteer lists
a good starting point for EMD (C.1). The only customization effort related to the
gazetteers was the consideration of synonyms. This was done manually based on
sample data. The hotel domain is not linguistically complex. Therefore, it is not
required to include an ML-based NER component (B.2).
Relation Mention Detection: The selection of the RMD approaches is also related
to the considerations in EMD. None of the criteria would motivate an ML-based RMD
system with the given project parameters.
According to the recommendations from the method, the application architecture
chosen included rule-based RMD and EMD together with gazetteers.
4.3. Evaluation
An evaluation in terms of recognition accuracy was performed. The corpus contained
a total of 227 hotel descriptions for 124 hotels. The corpus was split into two disjoint
parts. One part containing 100 documents was used for optimizing the customization
of the application, e.g., adding new synonyms to the gazetteer list and creating
syntax patterns. In order to test the effect on new documents, the recognition rate
was calculated using the second part of the corpus. For assessing the
results, we use Precision (P), Recall (R) and F-Measure (F), which are commonly
applied measures from Information Retrieval. P describes the correctness, R
describes the completeness and F is their weighted harmonic mean. Table 4 shows
the results.
                                     | F (EMD) | F (RMD)
Gazetteers based on OTA              | 0.75    | 0.72
Gazetteers based on OTA and synonyms | 0.87    | 0.85
Table 4: Evaluation result
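For reference, the measures P, R and F used above can be computed as in the following Python sketch. The function name and the set-based representation of extraction results are assumptions made for illustration. The weight beta defaults to 1 (the balanced F1 measure), which is also an assumption, since the paper does not state which weighting was used.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f(
    predicted: Set[Hashable], gold: Set[Hashable], beta: float = 1.0
) -> Tuple[float, float, float]:
    """Precision, recall and weighted harmonic mean F for two sets of items."""
    true_positives = len(predicted & gold)
    # Precision: share of predicted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

With four predicted items of which three are correct, and six items in the gold standard, the function yields P = 0.75, R = 0.5 and F = 0.6 for beta = 1.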
Generally, the results are satisfactory for the hotel domain. The data shows that the
RMD recognition rate is mainly capped by the EMD recognition rate. By enhancing
the gazetteers with synonyms of the OTA codes, the F-measure could be improved
by 0.12 (EMD) and 0.13 (RMD), respectively.
The customization was implemented by four postgraduate computer science students
(M.Sc.) without prior NLP experience in a six-month timeframe. This indicates that
the complexity of this domain could also be handled by a development organization
in an industry project.
5. Conclusions
This paper presents a criteria-driven method for architecting domain-specific IE
applications. We have developed evaluation criteria which balance various IE
techniques in terms of costs and benefits. To the best of our knowledge, such a method
that takes into account IE application development costs has not been presented before.
We showed that this approach has been valuable for a real-world application scenario
from the tourism domain. We argue that other domains could also benefit from it.
As future work, we plan to apply this method in more IE projects in order to review
the validity of our statements. We expect that tooling for ML-based RMD will
emerge, which would require a re-evaluation of the presented criteria.
6. References
Cunningham, H., Maynard, D., and Tablan, V. JAPE: a Java Annotation
Patterns Engine (Second Edition), Technical report CS-00-10, 2000, University
of Sheffield - Department of Computer Science, Sheffield.
Finkel, Jenny R., Grenager, Trond, and Manning, Christopher. Incorporating non-
local information into information extraction systems by Gibbs sampling:
proceedings of the 43rd Annual Meeting of the Association, 2005, Association
for Computational Linguistics, Ann Arbor, Michigan, pp. 363-370.
Isabelle, Pierre, Cunningham, Hamish, Maynard, Diana, Bontcheva, Kalina, and
Tablan, Valentin. GATE: proceedings of the 40th Annual Meeting on Association
for Computational Linguistics 2001, Association for Computational Linguistics,
Stroudsburg, PA, USA, pp. 168-175.
Kozareva, Zornitsa. Bootstrapping named entity recognition with automatically
generated gazetteer lists: proceedings of the Eleventh Conference of the
European Chapter of the Association for Computational Linguistics: Student
Research Workshop, Association for Computational Linguistics, 2006,
Stroudsburg, PA, USA, pp. 15-21.
Lafferty, John. Conditional random fields: Probabilistic models for segmenting and
labeling sequence data: proceedings of the 18th International Conf. on Machine
Learning, 2001, Morgan Kaufmann, San Francisco, CA, pp. 282-289.
Li, Yaoyong, Bontcheva, Kalina, and Cunningham, Hamish. SVM Based Learning
System For Information Extraction: proceedings of Sheffield Machine Learning
Workshop, LNCS, 2005, Springer Verlag, Heidelberg, pp. 319-339.
Marneffe, Marie-Catherine de, and Manning, Christopher D. The Stanford typed
dependencies representation: In Coling 2008: proceedings of the workshop on
Cross-Framework and Cross-Domain Parser Evaluation, 2008, Association for
Computational Linguistics, Stroudsburg, PA, USA, pp. 1-8.
Miyao, Yusuke, Sætre, Rune, Sagae, Kenji, Matsuzaki, Takuya, and Tsujii, Jun’ichi.
Task-oriented Evaluation of Syntactic Parsers and Their Representations:
proceedings of the 46th Annual Meeting of the Association for Computational
Linguistics, The Association for Computer Linguistics 2008, Columbus, Ohio,
USA. pp. 46-54
OpenTravel Alliance. OpenTravel Implementation Guide: Executive Summary, No.
1.5, 2010.
http://www.opentravel.org/Resources/Uploads/PDF/OpenTravel_Implementation
Guide_v1.5_ExecSum.pdf.
Surdeanu, Mihai, Johansson, Richard, Meyers, Adam, Màrquez, Lluis, and Nivre,
Joakim. The CoNLL-2008 shared task on joint parsing of syntactic and semantic
dependencies: proceedings of the Twelfth Conference on Computational Natural
Language Learning, 2008, Association for Computational Linguistics,
Stroudsburg, PA, USA, pp. 159-177.
Surdeanu, Mihai, McClosky, David, Smith, Mason R., Gusev, Andrey, and Manning,
Christopher D. Customizing an Information Extraction System to a New Domain:
proceedings of the Workshop on Relational Models of Semantics, 2011,
Association for Computational Linguistics. Stroudsburg, PA, USA.
Wyner, Adam, and Wim, Peters. On Rule Extraction from Regulations: Proceedings
of the 24th International Conference on Legal Knowledge and Information
Systems. IOS Press, Amsterdam, 2011.