Proceedings of EACL '99
Transducers from Rewrite Rules with Backreferences
Dale Gerdemann
University of Tuebingen
Kl. Wilhelmstr. 113
D-72074 Tuebingen
dg@sfs.nphil.uni-tuebingen.de
Gertjan van Noord
Groningen University
PO Box 716
NL 9700 AS Groningen
vannoord@let.rug.nl
Abstract
Context sensitive rewrite rules have been
widely used in several areas of natural
language processing, including syntax,
morphology, phonology and speech pro-
cessing. Kaplan and Kay, Karttunen,
and Mohri & Sproat have given vari-
ous algorithms to compile such rewrite
rules into finite-state transducers. The
present paper extends this work by al-
lowing a limited form of backreferencing
in such rules. The explicit use of backref-
erencing leads to more elegant and gen-
eral solutions.
1 Introduction
Context sensitive rewrite rules have been widely
used in several areas of natural language pro-
cessing. Johnson (1972) has shown that such
rewrite rules are equivalent to finite state trans-
ducers in the special case that they are not al-
lowed to rewrite their own output. An algo-
rithm for compilation into transducers was pro-
vided by Kaplan and Kay (1994). Improvements
and extensions to this algorithm have been pro-
vided by Karttunen (1995), Karttunen (1997),
Karttunen (1996) and Mohri and Sproat (1996).
In this paper, the algorithm will be ex-
tended to provide a limited form of back-
referencing. Backreferencing has been im-
plicit in previous research, such as in the
"batch rules" of Kaplan and Kay (1994), brack-
eting transducers for finite-state parsing (Kart-
tunen, 1996), and the "LocalExtension" operation
of Roche and Schabes (1995). The explicit use of
backreferencing leads to more elegant and general
solutions.
Backreferencing is widely used in editors, script-
ing languages and other tools employing regular
expressions (Friedl, 1997). For example, Emacs
uses the special brackets \( and \) to capture
strings along with the notation \n to recall the nth
such string. The expression \(a*\)b\1 matches
strings of the form $a^n b a^n$. Unrestricted use of
backreferencing thus can introduce non-regular
languages. For NLP finite state calculi (Kart-
tunen et al., 1996; van Noord, 1997) this is unac-
ceptable. The form of backreferences introduced
in this paper will therefore be restricted.
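The effect of such a backreference can be checked in any regular expression engine that supports it. The sketch below uses Python's re module instead of Emacs (an assumption made here purely for illustration; the helper name is_anban is hypothetical). In Python the Emacs pattern \(a*\)b\1 is spelled (a*)b\1.

```python
import re

# Python spelling of the Emacs pattern \(a*\)b\1:
# a captured run of a's, then b, then the same run again.
ANBAN = re.compile(r"(a*)b\1")

def is_anban(s):
    """Return True iff s has the form a^n b a^n for some n >= 0."""
    return ANBAN.fullmatch(s) is not None
```

Because the second run of a's must have the same length as the first, no finite-state recognizer accepts exactly this set of strings, which is why unrestricted backreferences cannot be admitted into a finite state calculus.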
The central case of an allowable backreference
is:
$x \Rightarrow T(x) / \lambda \_\_ \rho$ (1)
This says that each string x preceded by $\lambda$ and
followed by $\rho$ is replaced by T(x), where $\lambda$ and $\rho$
are arbitrary regular expressions, and T is a
transducer.$^1$ This contrasts sharply with the rewriting
rules that follow the tradition of Kaplan & Kay:
$\phi \Rightarrow \psi / \lambda \_\_ \rho$ (2)
In this case, any string from the language $\phi$ is
replaced by any string independently chosen from
the language $\psi$.
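The difference between (1) and (2) can be illustrated outside the finite state calculus with a callback substitution. The sketch below is only an analogy in Python: t_acr is a hypothetical stand-in for a transducer T that turns a phrase into an acronym, and the lookbehind and lookahead assertions play the role of the left and right contexts. None of these names come from the FSA Utilities.

```python
import re

def t_acr(phrase):
    """Hypothetical stand-in for T: first letter of each word, upper-cased."""
    words = re.split(r"[\s-]+", phrase.strip())
    return "".join(w[0].upper() for w in words if w)

def abbreviate(text):
    """Rewrite every x found between <abbr> and </abbr> as t_acr(x)."""
    # The lookbehind and lookahead act as left and right context,
    # so only x itself is replaced and the tags are left untouched.
    pattern = re.compile(r"(?<=<abbr>)[^<]+(?=</abbr>)")
    return pattern.sub(lambda m: t_acr(m.group(0)), text)
```

The output depends on the particular string that was matched, which is exactly what a rule of type (2) cannot express: there the replacement is chosen independently of the matched string.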
$^1$ The syntax at this point is merely suggestive. As
an example, suppose that $T_{acr}$ transduces phrases into
acronyms. Then
$x \Rightarrow T_{acr}(x) / \langle abbr \rangle \_\_ \langle /abbr \rangle$
would transduce
<abbr>non-deterministic finite automaton</abbr>
into <abbr>NDFA</abbr>.
To compare this with a backreference in Perl,
suppose that $T_{acr}$ is a subroutine that converts
phrases into acronyms and that $R_{acr}$ is
a regular expression matching phrases that can
be converted into acronyms. Then (ignoring
the left context) one can write something
like:
s/(R_acr)(?=</ABBR>)/T_acr($1)/ge;
The backreference variable, $1, will be set to
whatever string $R_{acr}$ matches.
We also allow multiple (non-permuting) backreferences
of the form:
$x_1 x_2 \ldots x_n \Rightarrow T_1(x_1) T_2(x_2) \ldots T_n(x_n) / \lambda \_\_ \rho$ (3)
Since transducers are closed under concatenation,
handling multiple backreferences reduces to the
problem of handling a single backreference:
$x \Rightarrow (T_1 \cdot T_2 \cdot \ldots \cdot T_n)(x) / \lambda \_\_ \rho$ (4)
A problem arises if we want capturing to follow
the POSIX standard requiring a longest-capture
strategy. Friedl (1997) (p. 117), for
example, discusses matching the regular expression
(to|top)(o|polo)?(gical|o?logical) against the
word: topological. The desired result is that
(once an overall match is established) the first set
of parentheses should capture the longest string
possible (top); the second set should then match
the longest string possible from what's left (o),
and so on. Such a left-most longest match con-
catenation operation is described in §3.
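The two capturing strategies can be compared on this example with a small sketch. It assumes Python's re module as a representative of Perl-style (first alternative that works) matching, and uses a brute-force search, longest_capture, as a stand-in for the longest-capture strategy. The function and the GROUPS list are hypothetical illustrations, not the transducer construction described in §3.

```python
import re

def perl_style_capture(word):
    """Captures chosen by a backtracking engine that tries alternatives in order."""
    m = re.fullmatch(r"(to|top)(o|polo)?(gical|o?logical)", word)
    return None if m is None else tuple(g or "" for g in m.groups())

def longest_capture(groups, word):
    """Split word into one piece per group, giving earlier groups the longest
    piece that still allows an overall match. Each group is a list of literal
    alternatives; "" stands for an optional group that matches nothing."""
    def solve(i, pos):
        if i == len(groups):
            return [] if pos == len(word) else None
        # Longest alternative first, so leftward groups get priority.
        for alt in sorted(set(groups[i]), key=len, reverse=True):
            if word.startswith(alt, pos):
                rest = solve(i + 1, pos + len(alt))
                if rest is not None:
                    return [alt] + rest
        return None
    return solve(0, 0)

# The three parenthesized groups of the example, spelled out as literals.
GROUPS = [["to", "top"], ["o", "polo", ""], ["gical", "logical", "ological"]]
```

On topological, perl_style_capture returns ('to', 'polo', 'gical'), whereas longest_capture(GROUPS, "topological") returns ['top', 'o', 'logical'], which is the segmentation the POSIX strategy asks for.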
In the following section, we initially concentrate
on the simple case in (1) and show how (1) may be
compiled assuming left-to-right processing along
with the overall longest match strategy described
by Karttunen (1996).
The major components of the algorithm are
not new, but straightforward modifications of
components presented in Karttunen (1996) and
Mohri and Sproat (1996). We improve upon ex-
isting approaches because we solve a problem con-
cerning the use of special marker symbols (§2.1.2).
A further contribution is that all steps are imple-
mented in a freely available system, the FSA Util-
ities of van Noord (1997) (§2.1.1).
2 The Algorithm
2.1 Preliminary Considerations
Before presenting the algorithm proper, we will
deal with a couple of meta issues. First, we in-
troduce our version of the finite state calculus in
§2.1.1. The treatment of special marker symbols
is discussed in §2.1.2. Then in §2.1.3, we discuss
various utilities that will be essential for the algo-
rithm.
2.1.1 FSA Utilities
The algorithm is implemented in the FSA Util-
ities (van Noord, 1997). We use the notation pro-
vided by the toolbox throughout this paper. Ta-
ble 1 lists the relevant regular expression opera-
tors. FSA Utilities offers the possibility to de-
fine new regular expression operators. For exam-
ple, consider the definition of the nullary operator
vowel as the union of the five vowels:
macro(vowel, {a,e,i,o,u}).

[]              empty string
[E1,...En]      concatenation of E1 ... En
{}              empty language
{E1,...En}      union of E1,...En
E*              Kleene closure
E^              optionality
~E              complement
E1-E2           difference
$ E             containment
E1 & E2         intersection
?               any symbol
A:B             pair
E1 x E2         cross-product
A o B           composition
domain(E)       domain of a transduction
range(E)        range of a transduction
identity(E)     identity transduction
inverse(E)      inverse transduction
Table 1: Regular expression operators.
In such macro definitions, Prolog variables can be
used in order to define new n-ary regular expres-
sion operators in terms of existing operators. For
instance, the lenient_composition operator (Kart-
tunen, 1998) is defined by:
macro(priority_union(Q,R),
      {Q, ~domain(Q) o R}).
macro(lenient_composition(R,C),
      priority_union(R o C,R)).
Here, priority_union of two regular expressions
Q and R is defined as the union of Q and the compo-
sition of the complement of the domain of Q with
R. Lenient composition of R and C is defined as the
priority union of the composition of R and C (on
the one hand) and R (on the other hand).
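The two definitions can be traced on small finite relations. The following sketch models a transducer as a Python set of (input, output) pairs, which is an assumption made only for illustration; real transducers denote possibly infinite relations, and the function names merely mirror the macros.

```python
def compose(r, s):
    """Relational composition: (x, z) whenever (x, y) is in r and (y, z) is in s."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}

def domain(r):
    """Set of inputs for which r produces at least one output."""
    return {x for (x, _) in r}

def priority_union(q, r):
    """q, plus those pairs of r whose input is outside the domain of q."""
    handled = domain(q)
    return q | {(x, y) for (x, y) in r if x not in handled}

def lenient_composition(r, c):
    """Constrain r by c where some output survives, otherwise keep r unchanged."""
    return priority_union(compose(r, c), r)
```

For example, with r = {("a","x"), ("a","y"), ("b","z")} and the constraint c = {("x","x")}, lenient composition keeps only ("a","x") for input a, but retains ("b","z") because no output for b satisfies the constraint.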
Some operators, however, require something
more than simple macro expansion for their def-
inition. For example, suppose a user wanted to
match n occurrences of some pattern. The FSA
Utilities already has the '*' and '+' quantifiers,
but any other operators like this need to be user
defined. For this purpose, the FSA Utilities sup-
plies simple Prolog hooks allowing this general
quantifier to be defined as:
macro(match_n(N,X),Regex) :-
    match_n(N,X,Regex).

match_n(0,_X,[]).
match_n(N,X,[X|Rest]) :-
    N > 0,
    N1 is N-1,
    match_n(N1,X,Rest).
For example: match_n(3,a) is equivalent to the
ordinary finite state calculus expression [a, a, a].
Finally, regular expression operators can be
defined in terms of operations on the un-
derlying automaton. In such cases, Prolog
hooks for manipulating states and transitions
may be used. This functionality has been
used in van Noord and Gerdemann (1999) to pro-
vide an implementation of the algorithm in
Mohri and Sproat (1996).
2.1.2 Treatment of Markers
Previous algorithms for compiling rewrite
rules into transducers have followed
Kaplan and Kay (1994) by introducing special
marker symbols (markers) into strings in
order to mark off candidate regions for replacement.
The assumption is that these markers are
outside the resulting transducer's alphabets. But
previous algorithms have not ensured that the
assumption holds.
This problem was recognized by
Karttunen (1996), whose algorithm starts with
a filter transducer which filters out any string
containing a marker. This is problematic for two
reasons. First, when applied to a string that does
happen to contain a marker, the algorithm will
simply fail. Second, it leads to logical problems in
the interpretation of complementation. Since the
complement of a regular expression R is defined
as $\Sigma^* - R$, one needs to know whether the marker
symbols are in $\Sigma$ or not. This has not been
clearly addressed in previous literature.
We have taken a different approach by providing
a contextual way of distinguishing markers from
non-markers. Every symbol used in the algorithm
is replaced by a pair of symbols, where the second
member of the pair is a 0 if the first member is an
ordinary symbol and a 1 if it is a marker.$^2$
As the first step in the algorithm, 0's are inserted
after every symbol in the input string to indicate
that initially every symbol is a non-marker. This
is defined as:
macro(non_markers, [?, []:0]*).
$^2$ This approach is similar to the idea of laying down
tracks as in the compilation of monadic second-order
logic into automata (Klarlund, 1997, p. 5). In fact, this
technique could possibly be used for a more efficient
implementation of our algorithm: instead of adding
transitions over 0 and 1, one could represent the
alphabet as bit sequences and then add a final 0 bit for
any ordinary symbol and a final 1 bit for a marker
symbol.
Similarly, the following macro can be used to
insert a 0 after every symbol in an arbitrary
expression E.
macro(non_markers(E),
      range(E o non_markers)).
Since E is a recognizer, it is first coerced to
identity(E). This form of implicit conversion is
standard in the finite state calculus.
Note that 0 and 1 are perfectly ordinary alphabet
symbols, which may also be used within a
replacement. For example, the sequence [1,0]
represents a non-marker use of the symbol 1.
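The pairing scheme can be made concrete with a short sketch. It assumes that a string is represented as a Python list of symbols and that the flags are the strings "0" and "1"; the function names other than non_markers are hypothetical helpers, not part of the FSA Utilities.

```python
def non_markers(symbols):
    """Insert the flag "0" after every symbol: a b c => a 0 b 0 c 0."""
    out = []
    for sym in symbols:
        out += [sym, "0"]
    return out

def is_marker_at(tagged, i):
    """True iff the i-th symbol/flag pair of a tagged sequence is a marker."""
    return tagged[2 * i + 1] == "1"

def strip_flags(tagged):
    """Counterpart of inverse(non_markers): drop the 0 flags.
    Fails if the sequence is not made of pairs or still holds a marker."""
    if len(tagged) % 2 != 0:
        raise ValueError("expected symbol/flag pairs")
    symbols, flags = tagged[0::2], tagged[1::2]
    if any(flag != "0" for flag in flags):
        raise ValueError("sequence still contains markers")
    return symbols
```

Since the flag is read by position, an input symbol that happens to be spelled 0, 1 or <1 is never confused with a marker: only the second member of each pair decides.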
2.1.3 Utilities
Before describing the algorithm, it will be
helpful to have at our disposal a few general
tools, most of which were described already in
Kaplan and Kay (1994). These tools, however,
have been modified so that they work with our
approach of distinguishing markers from ordinary
symbols. So to begin with, we provide macros to
describe the alphabet and the alphabet extended
with marker symbols:
macro(sig, [?,0]).
macro(xsig, [?,{0,1}]).
The macro xsig is useful for defining a special-
ized version of complementation and containment:
macro(not(X), xsig* - X).
macro($$(X), [xsig*, X, xsig*]).
The algorithm uses four kinds of brackets, so
it will be convenient to define macros for each of
these brackets, and for a few disjunctions.
macro(lb1, ['<1',1]).
macro(lb2, ['<2',1]).
macro(rb2, ['2>',1]).
macro(rb1, ['1>',1]).
macro(lb, {lb1,lb2}).
macro(rb, {rb1,rb2}).
macro(b1, {lb1,rb1}).
macro(b2, {lb2,rb2}).
macro(brack, {lb,rb}).
As in Kaplan & Kay, we define an Intro(S) op-
erator that produces a transducer that freely in-
troduces instances of S into an input string. We
extend this idea to create a family of Intro operators.
It is often the case that we want to freely
introduce marker symbols into a string at any
position except the beginning or the end.
%% Free introduction
macro(intro(S), {xsig-S, [] x S}*).
%% Introduction, except at begin
macro(xintro(S), {[], [xsig-S, intro(S)]}).
%% Introduction, except at end
macro(introx(S), {[], [intro(S), xsig-S]}).
%% Introduction, except at begin & end
macro(xintrox(S), {[], [xsig-S],
                   [xsig-S, intro(S), xsig-S]}).
This family of Intro operators is useful for defining
a family of Ignore operators:
macro(  ign(E1,S), range(E1 o   intro(S))).
macro( xign(E1,S), range(E1 o  xintro(S))).
macro( ignx(E1,S), range(E1 o  introx(S))).
macro(xignx(E1,S), range(E1 o xintrox(S))).
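Read as languages, the four Ignore operators differ only in where the ignored symbol may appear. The sketch below states this as a membership test. It assumes that E is a finite set of symbol tuples, that each symbol is atomic (the 0/1 flags are left out), and that E itself does not use the marker; make_ignore is a hypothetical helper, not an FSA Utilities operator.

```python
def make_ignore(language, marker, at_begin=True, at_end=True):
    """Return a membership test for E with `marker` freely ignored.
    at_begin / at_end say whether the marker may be first / last."""
    def accepts(tokens):
        tokens = tuple(tokens)
        if tokens and not at_begin and tokens[0] == marker:
            return False
        if tokens and not at_end and tokens[-1] == marker:
            return False
        # Deleting every marker must leave a string of the language.
        stripped = tuple(t for t in tokens if t != marker)
        return stripped in language
    return accepts

E = {("a", "b")}
ign = make_ignore(E, "#")                                   # ign(E,S)
xign = make_ignore(E, "#", at_begin=False)                  # xign(E,S)
ignx = make_ignore(E, "#", at_end=False)                    # ignx(E,S)
xignx = make_ignore(E, "#", at_begin=False, at_end=False)   # xignx(E,S)
```

With this reading, the sequence # a # b # is accepted by ign only, a # b # by ign and xign, # a b by ign and ignx, and a # b by all four.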
In order to create filter transducers to en-
sure that markers are placed in the correct po-
sitions, Kaplan & Kay introduce the operator
P-iff-S(L1,L2). A string is described by this
expression iff each prefix in L1 is followed by a
suffix in L2 and each suffix in L2 is preceded by a
prefix in L1. In our approach, this is defined as:
macro(if_p_then_s(L1,L2),
      not([L1,not(L2)])).
macro(if_s_then_p(L1,L2),
      not([not(L1),L2])).
macro(p_iff_s(L1,L2),
      if_p_then_s(L1,L2)
        &
      if_s_then_p(L1,L2)).
To make the use of p_iff_s more convenient, we
introduce a new operator l_iff_r(L,R), which
describes strings where every string position is
preceded by a string in L just in case it is followed by
a string in R:
macro(l_iff_r(L,R),
      p_iff_s([xsig*,L], [R,xsig*])).
Finally, we introduce a new operator
if(Condition,Then,Else) for conditionals.
This operator is extremely useful, but in order
for it to work within the finite state calculus, one
needs a convention as to what counts as a boolean
true or false for the condition argument. It is
possible to define true as the universal language
and false as the empty language:
macro(true, ? *).
macro(false, {}).
With these definitions, we can use the complement
operator as negation, the intersection operator
as conjunction and the union operator as disjunction.
Arbitrary expressions may be coerced
to booleans using the following macro:
macro(coerce_to_boolean(E),
      range(E o (true x true))).
Here, E should describe a recognizer. E is composed
with the universal transducer, which transduces
from anything (?*) to anything (?*). Now
with this background, we can define the conditional:
macro(if(Cond,Then,Else),
      { coerce_to_boolean(Cond) o Then,
        ~coerce_to_boolean(Cond) o Else
      }).
2.2 Implementation
A rule of the form $x \Rightarrow T(x) / \lambda \_\_ \rho$ will be written
as replace(T,Lambda,Rho). Rules of the more
general form
$x_1 \ldots x_n \Rightarrow T_1(x_1) \ldots T_n(x_n) / \lambda \_\_ \rho$
will be discussed in §3. The algorithm consists
of nine steps composed as in figure 1.
The names of these steps are mostly
derived from Karttunen (1995) and
Mohri and Sproat (1996) even though the
transductions involved are not exactly the same.
In particular, the steps derived from Mohri &
Sproat (r, f, l1 and l2) will all be defined in
terms of the finite state calculus as opposed to
Mohri & Sproat's approach of using low-level
manipulation of states and transitions.$^3$
The first step, non_markers, was already de-
fined above. For the second step, we first consider
a simple special case. If the empty string is in
the language described by Right, then r(Right)
should insert an rb2 in every string position. The
definition of r(Right) is both simpler and more
efficient if this is treated as a special case. To in-
sert a bracket in every possible string position, we
use:
[[[] x rb2,sig]*,[] x rb2]
If the empty string is not in Right, then we
must use intro(rb2) to introduce the marker
rb2, followed by l_iff_r to ensure that such
markers are immediately followed by a string in
Right, or more precisely a string in Right where
additional instances of rb2 are freely inserted in
any position other than the beginning. This
expression is written as:
intro(rb2)
  o
l_iff_r(rb2, xign(non_markers(Right),rb2))
Putting these two pieces together with the con-
ditional yields:
macro(r(R),
  if([] & R,                 % If: [] is in R:
     [[[] x rb2,sig]*,[] x rb2],
     intro(rb2)              % Else:
       o
     l_iff_r(rb2, xign(non_markers(R),rb2)))).
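The intended effect of r can be imitated directly on a string. The sketch below is a simplification under stated assumptions: the right context is given as a Python regular expression rather than an FSA Utilities expression, the 0/1 flags are left out, and the marker is written as the plain token 2>. The function mark_right_contexts is a hypothetical helper.

```python
import re

RB2 = "2>"  # plain-token stand-in for the paper's rb2 marker

def mark_right_contexts(text, right):
    """Insert RB2 at exactly those positions that are followed by a string
    matching the regular expression `right`."""
    pattern = re.compile(right)
    out = []
    for i in range(len(text) + 1):
        # pattern.match(text, i) succeeds iff some match of `right` starts at i.
        if pattern.match(text, i):
            out.append(RB2)
        if i < len(text):
            out.append(text[i])
    return out
```

When the empty string matches `right`, every position qualifies, including the very end of the input. This corresponds to the special case that the macro handles separately by inserting rb2 in every string position.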
$^3$ The alternative implementation is provided in
van Noord and Gerdemann (1999).

macro(replace(T,Left,Right),
  non_markers               % introduce 0 after every symbol
    o                       % (a b c => a 0 b 0 c 0).
  r(Right)                  % introduce rb2 before any string
    o                       % in Right.
  f(domain(T))              % introduce lb2 before any string in
    o                       % domain(T) followed by rb2.
  left_to_right(domain(T))  % lb2 ... rb2 around domain(T) optionally
    o                       % replaced by lb1 ... rb1
  longest_match(domain(T))  % filter out non-longest matches marked
    o                       % in previous step.
  aux_replace(T)            % perform T's transduction on regions marked
    o                       % off by b1's.
  l1(Left)                  % ensure that lb1 must be preceded
    o                       % by a string in Left.
  l2(Left)                  % ensure that lb2 must not occur preceded
    o                       % by a string in Left.
  inverse(non_markers)).    % remove the auxiliary 0's.
Figure 1: Definition of replace operator.

The third step, f(domain(T)), is implemented
as:
macro(f(Phi), intro(lb2)
                o
              l_iff_r(lb2, [xignx(non_markers(Phi),b2),
                            lb2^, rb2])).
The lb2 is first introduced and then, using
l_iff_r, it is constrained to occur immediately
before every instance of (ignoring complexities) Phi
followed by an rb2. Phi needs to be marked as
normal text using non_markers and then xignx
is used to allow freely inserted lb2 and rb2
anywhere except at the beginning and end. The
following lb2^ allows an optional lb2, which occurs
when the empty string is in Phi.
The fourth step is a guessing component which
(ignoring complexities) looks for sequences of the
form lb2 Phi rb2 and converts some of these
into lb1 Phi rb1, where the b1 marking indicates
that the sequence is a candidate for replacement.
The complication is that Phi, as always, must
be converted to non_markers(Phi) and instances
of b2 need to be ignored. Furthermore, between
pairs of lb1 and rb1, instances of lb2 are deleted.
These lb2 markers have done their job and are
no longer needed. Putting this all together, the
definition is:
macro(left_to_right(Phi),
  [[xsig*,
    [lb2 x lb1,
     (ign(non_markers(Phi),b2)
        o
      inverse(intro(lb2))
     ),
     rb2 x rb1]
   ]*, xsig*]).
The fifth step filters out non-longest matches
produced in the previous step. For example (and
simplifying a bit), if Phi is ab*, then a string of
the form ... lb1 a b rb1 b ... should be ruled out
since there is an instance of Phi (ignoring brackets
except at the end) where there is an internal rb1.
This is implemented as:$^4$
macro(longest_match(Phi),
  not($$([lb1,
          (ignx(non_markers(Phi),brack)
             &
           $$(rb1)
          ),       % longer match must be
          rb       % followed by an rb
         ]))       % so context is ok
    o
  %% done with rb2, throw away:
  inverse(intro(rb2))).
$^4$ The line with $$(rb1) can be optimized a bit:
Since we know that an rb1 must be preceded by Phi,
we can write:
[ign_(non_markers(Phi),brack),rb1,xsig*]).
This may lead to a more constrained (hence smaller)
transducer.
The sixth step performs the transduction described
by T. This step is straightforwardly implemented,
where the main difficulty is getting T to
apply to our specially marked string:
macro(aux_replace(T),
  {{sig, lb2},
   [lb1,
    inverse(non_markers)
      o T o
    non_markers,
    rb1 x []
   ]
  }*).
The seventh step ensures that lb1 is preceded
by a string in Left:
macro(l1(L),
  ign(if_s_then_p(
        ignx([xsig*,non_markers(L)],lb1),
        [lb1,xsig*]),
      lb2)
    o
  inverse(intro(lb1))).
The eighth step ensures that lb2 is not preceded
by a string in Left. This is implemented similarly
to the previous step:
macro(l2(L),
  if_s_then_p(
    ignx(not([xsig*,non_markers(L)]),lb2),
    [lb2,xsig*])
    o
  inverse(intro(lb2))).
Finally the ninth step, inverse(non_markers),
removes the 0's so that the final result is not
marked up in any special way.
3 Longest Match Capturing
As discussed in §1 the POSIX standard requires
that multiple captures follow a longest match
strategy. For multiple captures as in (3), one
establishes first a longest match for
$domain(T_1) \cdot \ldots \cdot domain(T_n)$.
Then we ensure that each of
$domain(T_i)$ in turn is required to match as long
as possible, with each one having priority over its
rightward neighbors. To implement this, we define
a macro lm_concat(Ts) and use it as:
replace (lm_concat (Ts), Left, Right)
Ensuring the longest overall match is delegated
to the replace macro, so lm_concat(Ts) needs
only ensure that each individual transducer within
Ts gets its proper left-to-right longest matching
priority. This problem is mostly solved by the
same techniques used to ensure the longest match
within the replace macro. The only complication
here is that Ts can be of unbounded length.
So it is not possible to have a single expression in
the finite state calculus that applies to all possible
lengths. This means that we need something
a little more powerful than mere macro expansion
to construct the proper finite state calculus
expression. The FSA Utilities provides a Prolog
hook for this purpose. The resulting definition of
lm_concat is given in figure 2.
Suppose (as in Friedl (1997)), we want to match
the following list of recognizers against the string
topological and insert a marker in each boundary
position. This reduces to applying:
lm_concat([
    [{[t,o],[t,o,p]},[]:'#'],
    [{o,[p,o,l,o]},[]:'#'],
    {[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}
  ])
This expression transduces the string
topological only to the string top#o#logical.$^5$
4 Conclusions
The algorithm presented here has extended previ-
ous algorithms for rewrite rules by adding a lim-
ited version of backreferencing. This allows the
output of rewriting to be dependent on the form of
the strings which are rewritten. This new feature
brings techniques used in Perl-like languages into
the finite state calculus. Such an integration is
needed in practical applications where simple text
processing needs to be combined with more so-
phisticated computational linguistics techniques.
One particularly interesting example where
backreferences are essential is cascaded deterministic
(longest match) finite state parsing as described
for example in Abney (1996) and
various papers in (Roche and Schabes, 1997a).
Clearly, the standard rewrite rules do not apply in
this domain. If NP is an NP recognizer, it would
not do to say $NP \Rightarrow [NP] / \lambda \_\_ \rho$. Nothing would
force the string matched by the NP to the left of
the arrow to be the same as the string matched
by the NP to the right of the arrow.
One advantage of using our algorithm for finite
state parsing is that the left and right contexts
may be used to bring in top-down filtering.$^6$
An often cited advantage of finite state
parsing is robustness. A constituent is found bottom
up in an early level in the cascade even if
that constituent does not ultimately contribute
to an S in a later level of the cascade. While
this is undoubtedly an advantage for certain
applications, our approach would allow the
introduction of some top-down filtering while
maintaining the robustness of a bottom-up approach.

$^5$ An anonymous reviewer suggested that
lm_concat could be implemented in the framework
of Karttunen (1996) as:
[to | top | o | polo] -> ... #;
Indeed the resulting transducer from this expression
would transduce topological into top#o#logical.
But unfortunately this transducer would also
transduce polotopogical into polo#top#o#gical, since
the notion of left-right ordering is lost in this
expression.
$^6$ The bracketing operator of Karttunen (1996), on
the other hand, does not provide for left and right
contexts.

macro(lm_concat(Ts),mark_boundaries(Domains) o ConcatTs):-
    domains(Ts,Domains), concatT(Ts,ConcatTs).

domains([],[]).
domains([F|R0],[domain(F)|R]):- domains(R0,R).

concatT([],[]).
concatT([T|Ts], [inverse(non_markers) o T,lb1 x []|Rest]):- concatT(Ts,Rest).

%% macro(mark_boundaries(L),Exp): This is the central component of lm_concat. For our
%% "topological" example we will have:
%% mark_boundaries([domain([{[t,o],[t,o,p]},[]:'#']),
%%                  domain([{o,[p,o,l,o]},[]:'#']),
%%                  domain({[g,i,c,a,l],[o^,l,o,g,i,c,a,l]})])
%% which simplifies to:
%% mark_boundaries([{[t,o],[t,o,p]}, {o,[p,o,l,o]}, {[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}]).
%% Then by macro expansion, we get:
%% [{[t,o],[t,o,p]} o non_markers,[] x lb1,
%%  {o,[p,o,l,o]} o non_markers,[] x lb1,
%%  {[g,i,c,a,l],[o^,l,o,g,i,c,a,l]} o non_markers,[] x lb1]
%%   o
%% % Filter 1: {[t,o],[t,o,p]} gets longest match
%% ~ [ignx_1(non_markers({[t,o],[t,o,p]}),lb1),
%%    ign(non_markers({o,[p,o,l,o]}),lb1),
%%    ign(non_markers({[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}),lb1)]
%%   o
%% % Filter 2: {o,[p,o,l,o]} gets longest match
%% ~ [non_markers({[t,o],[t,o,p]}),lb1,
%%    ignx_1(non_markers({o,[p,o,l,o]}),lb1),
%%    ign(non_markers({[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}),lb1)]

macro(mark_boundaries(L),Exp):-
    boundaries(L,Exp0),   % guess boundary positions
    greed(L,Exp0,Exp).    % filter non-longest matches

boundaries([],[]).
boundaries([F|R0],[F o non_markers, [] x lb1 |R]):- boundaries(R0,R).

greed(L,Composed0,Composed) :-
    aux_greed(L,[],Filters), compose_list(Filters,Composed0,Composed).

aux_greed([H|T],Front,Filters):- aux_greed(T,H,Front,Filters,_CurrentFilter).

aux_greed([],F,_,[],[ign(non_markers(F),lb1)]).
aux_greed([H|R0],F,Front,[~L1|R],[ign(non_markers(F),lb1)|R1]) :-
    append(Front,[ignx_1(non_markers(F),lb1)|R1],L1),
    append(Front,[non_markers(F),lb1],NewFront),
    aux_greed(R0,H,NewFront,R,R1).

%% ignore at least one instance of E2 except at end
macro(ignx_1(E1,E2), range(E1 o [[? *,[] x E2]+,? +])).

compose_list([],SoFar,SoFar).
compose_list([F|R],SoFar,Composed):- compose_list(R,(SoFar o F),Composed).
Figure 2: Definition of lm_concat operator.
A second advantage for robust finite state parsing
is that bracketing could also include the notion
of "repair" as in Abney (1990). One might,
for example, want to say something like:
$x y \Rightarrow [_{NP}\ RepairDet(x)\ RepairN(y)\ ] / \lambda \_\_ \rho$ $^7$
so that an NP could be parsed as a slightly malformed
Det followed by a slightly malformed N. RepairDet
and RepairN, in this example, could be doing a
variety of things such as: contextualized spelling
correction, reordering of function words, replacement
of phrases by acronyms, or any other operation
implemented as a transducer.
Finally, we should mention the problem of com-
plexity. A critical reader might see the nine steps
in our algorithm and conclude that the algorithm
is overly complex. This would be a false conclu-
sion. To begin with, the problem itself is complex.
It is easy to create examples where the resulting
transducer created by any algorithm would be-
come unmanageably large. But there exist strate-
gies for keeping the transducers smaller. For ex-
ample, it is not necessary for all nine steps to
be composed. They can also be cascaded. In
that case it will be possible to implement different
steps by different strategies, e.g. by determinis-
tic or non-deterministic transducers or bimachines
(Roche and Schabes, 1997b). The range of possi-
bilities leaves plenty of room for future research.
$^7$ The syntax here has been simplified. The rule
should be understood as:
replace(lm_concat([[]:'[np', repair_det, repair_n, []:']']), lambda, rho).

References
Steve Abney. 1990. Rapid incremental parsing with repair. In Proceedings of the 6th New OED Conference: Electronic Text Research, pages 1-9.
Steven Abney. 1996. Partial parsing via finite-state cascades. In Proceedings of the ESSLLI '96 Robust Parsing Workshop.
Jeffrey Friedl. 1997. Mastering Regular Expressions. O'Reilly & Associates, Inc.
C. Douglas Johnson. 1972. Formal Aspects of Phonological Descriptions. Mouton, The Hague.
Ronald Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331-379.
L. Karttunen, J-P. Chanod, G. Grefenstette, and A. Schiller. 1996. Regular expressions for language engineering. Natural Language Engineering, 2(4):305-328.
Lauri Karttunen. 1995. The replace operator. In 33rd Annual Meeting of the Association for Computational Linguistics, M.I.T. Cambridge Mass.
Lauri Karttunen. 1996. Directed replacement. In 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz.
Lauri Karttunen. 1997. The replace operator. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, pages 117-147. Bradford, MIT Press.
Lauri Karttunen. 1998. The proper treatment of optimality theory in computational phonology. In Finite-state Methods in Natural Language Processing, pages 1-12, Ankara, June.
Nils Klarlund. 1997. Mona & Fido: The logic automaton connection in practice. In CSL '97.
Mehryar Mohri and Richard Sproat. 1996. An efficient compiler for weighted rewrite rules. In 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz.
Emmanuel Roche and Yves Schabes. 1995. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics, 21:227-263. Reprinted in Roche & Schabes (1997).
Emmanuel Roche and Yves Schabes, editors. 1997a. Finite-State Language Processing. MIT Press, Cambridge.
Emmanuel Roche and Yves Schabes. 1997b. Introduction. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing. MIT Press, Cambridge, Mass.
Gertjan van Noord and Dale Gerdemann. 1999. An extendible regular expression compiler for finite-state approaches in natural language processing. In Workshop on Implementing Automata 99, Potsdam Germany.
Gertjan van Noord. 1997. FSA Utilities. The FSA Utilities toolbox is available free of charge under Gnu General Public License at http://www.let.rug.nl/~vannoord/Fsa/.
133
... Particularly we are interested in context string replacement, which is a widely-adopted string operation to secure PHP programs but had not been addressed in [2]. Context string replacement had been discussed in the context of natural language processing [16,11,4,14]. All these works are based on the composition of finite state transducers. ...
... Context String Replacement Context string replacement had been discussed through in Natural Language Processing [6,11,16,4]. Karttunen [6] first proposes the replace operator of regular expressions by applying phonological rewrite rules. ...
... In [16,4], Noord and Gerdemann further propose several transducers to specify delicate replacement operations, like the left-most, the longest match and the first-match semantics. An disadvantage is that the finite state transducer of regular expression is easily becoming quite complicated and hard to prove its correctness. ...
Article
String analysis is a static analysis technique that determines the string values that a variable can hold at specific points in a program. This information is often useful to help program understanding, to detect and fix programming errors and security vulnerabilities, and to solve certain program verification problems. We present a novel approach to perform string analysis on real-world programs.
... The general approach originated by Kaplan & Kay [8] involves steps for introducing markers, constraining the markers in various ways, making the actual replacement , constraining the markers again after the replacement and finally removing the markers. Several variants of Kaplan & Kay's algorithm have been presented ([9], [12], [4]) with the goals being either to improve efficiency or to provide slightly different semantics. ...
... This is understood as a rule for replacing the string x with the transduction T (x) in the specified context, with some unspecified mode of operation such as obligatory left-to-right. Traditionally T is specified as a cross-product of regular languages, but as discussed in Gerdemann & van Noord [4], this is by no means necessary. ...
... The general approach originated by Kaplan & Kay [8] involves steps for introducing markers, constraining the markers in various ways, making the actual replacement , constraining the markers again after the replacement and finally removing the markers. Several variants of Kaplan & Kay's algorithm have been presented ([9], [12], [4]) with the goals being either to improve efficiency or to provide slightly different semantics. In this paper, I present an alternative, more declarative approach. ...
Conference Paper
Full-text available
A flexible construction kit is presented compiling various forms of finite state replacement rules. The approach is simpler and more declarative than algorithms in the tradition of Kaplan & Kay. Simple constraints can be combined to achieve complex effects, including effects based on Optimality Theory.
... Each of these steps can be implemented as a finite-state transducer, and the resulting system is then the composition of these transducers. The implementation of the rules is greatly simplified by the replace operator (Karttunen, 1995; Gerdemann and van Noord, 1999), a finite-state method for implementing phonological rules (i.e. contextually sensitive rules for replacing one symbol sequence by another). ...
... Several authors have given recipes for finite-state transducers that perform a single contextual edit operation (Kaplan and Kay, 1994; Mohri and Sproat, 1996; Gerdemann and van Noord, 1999). Such "rewrite rules" can be individually more expressive than our simple edit operations of section 2; but it is unclear how to train a cascade of them to model p(y | x). ...
Conference Paper
Full-text available
String similarity is most often measured by weighted or unweighted edit distance d(x, y). Ristad and Yianilos (1998) defined stochastic edit distance - a probability distribution p(y | x) whose parameters can be trained from data. We generalize this so that the probability of choosing each edit operation can depend on contextual features. We show how to construct and train a probabilistic finite-state transducer that computes our stochastic contextual edit distance. To illustrate the improvement from conditioning on context, we model typos found in social media text.
... According to Kaplan and Kay [11], such rules could be split into a number of subrules. However, it is arguable [15] that if the centers are defined as regular relations we obtain a more expressive and useful definition that includes, for example, marking rules. Therefore, we will specify the center X_i directly as a same-length relation, i.e. a language over Π. ...
Article
Full-text available
Generalized Two-Level Grammar (GTWOL) provides a new method for compilation of parallel replacement rules into transducers. The current paper identifies the role of generalized lenient composition (GLC) in this method. Thanks to the GLC operation, the compilation method becomes bipartite and easily extendible to capture various application modes. In the light of three notions of obligatoriness, a modification to the compilation method is proposed. We argue that the bipartite design makes implementation of parallel obligatoriness, directionality, length and rank based application modes extremely easy, which is the main result of the paper.
... In case of overlapping rule targets in the input, this operator will replace the leftmost target, and in cases where a rule target contains a prefix which is also a potential target, the longer sequence will be replaced. Gerdemann and van Noord [27] implement leftmost longest-match replacement in FSA as the operator: ...
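The leftmost, longest-match policy described in this excerpt can be illustrated without any finite-state toolkit. The following is a minimal sketch in plain Python, not the FSA operator the excerpt refers to: the function name `replace_leftmost_longest` and the restriction to literal target strings are assumptions made for illustration only.

```python
def replace_leftmost_longest(text, targets, replacement):
    """Replace non-overlapping targets, scanning left to right.

    At each position the longest target starting there wins; after a
    replacement, scanning resumes right after the replaced span.
    (Hypothetical helper: targets are literal strings, not regular languages.)
    """
    out = []
    i = 0
    while i < len(text):
        # Longest non-empty target that starts exactly at position i, if any.
        match = max(
            (t for t in targets if t and text.startswith(t, i)),
            key=len,
            default=None,
        )
        if match is None:
            out.append(text[i])   # no target starts here: copy one symbol
            i += 1
        else:
            out.append(replacement)
            i += len(match)       # skip the whole replaced span
    return "".join(out)


result_longest = replace_leftmost_longest("abcab", {"ab", "abc", "bc"}, "X")
result_leftmost = replace_leftmost_longest("abc", {"ab", "bc"}, "X")
```

In the first call the target `abc` is preferred over its prefix `ab`; in the second, the overlapping candidates `ab` and `bc` are resolved in favour of the leftmost one.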
Article
Full-text available
A hybrid optimization algorithm combining the finite state method (FSM) and a genetic algorithm (GA) is proposed to solve the crude oil scheduling problem. The FSM and GA are combined to take advantage of each method and to compensate for the deficiencies of the individual methods. In the proposed algorithm, the finite state method makes up for the GA's weak local search ability. The heuristic returned by the FSM can guide the GA towards good solutions. The idea behind this is that we can generate promising substructures or partial solutions by using the FSM. Furthermore, the FSM can guarantee that the entire solution space is uniformly covered. Therefore, the combination of the two algorithms has better global performance than the existing GA or FSM operating individually. Finally, a real-life crude oil scheduling problem from the literature is used for simulation. The experimental results validate that the proposed method outperforms the state-of-the-art GA method.
... Although they mostly share the same notation, the HFST replace rules differ in that they were compiled with the .r-glc. operator [7] and preference relations described in [29]. This approach makes it possible to more freely define contexts in parallel rules and to easily add new functionalities in future. ...
Article
Full-text available
HFST, Helsinki Finite-State Technology (http://hfst.sf.net/), is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently collects some of the most important finite-state tools for creating morphologies and spellcheckers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications. In this article, we focus on aspects of HFST that are new to the end user, i.e. new tools, new features in existing tools, or new language applications, in addition to some revised algorithms that increase performance.
... To compile rule blocks into FSMs we represent sets of morphological features as finite state recognizers and realization rules as finite state transducers using the directed replacement operator (Karttunen, 1996; Gerdemann and van Noord, 1999). If we compile a rule block by composing the individual rules, following Karttunen (2003), the result will be an FSM which applies every applicable rule within a block to each input form. ...
Conference Paper
Full-text available
Paradigm Function Morphology (PFM; Stump, 2001) is an elaborated realization-based theory of inflectional morphology which is notable for its empirical scope and formal precision. As Karttunen (2003) shows, most of the apparatus of PFM can be straightforwardly mapped onto regular expressions or finite state machines (FSMs). However, Karttunen's implementation simplifies Stump's theory slightly by assuming that at most one rule per block may be compatible with any given form. This allows rule blocks to be compiled into FSMs simply by composing the FSMs which implement the individual realization rules. However, this precludes the case where more than one potentially applicable rule competes to apply to a particular form. In what Stump argues is a crucial feature of PFM, such rule competition should be resolved according to Pāṇini's principle: within each rule block, only the applicable rule with the narrowest domain is applied. In this talk, we will describe an alternative implementation of PFM as FSMs using van Noord and Gerdemann's (2001) FSA Utilities. This implementation, while otherwise similar in many respects to Karttunen's, uses Pāṇini's principle to resolve rule competition and so is more faithful to Stump's version of PFM.
Article
Full-text available
This paper presents a set of mathematical and computational tools for manipulating and reasoning about regular languages and regular relations and argues that they provide a solid basis for computational phonology. It shows in detail how this framework applies to ordered sets of context-sensitive rewriting rules and also to grammars in Koskenniemi's two-level formalism. This analysis provides a common representation of phonological constraints that supports efficient generation and recognition by a single simple interpreter.
Conference Paper
Full-text available
Finite-state techniques are widely used in various areas of Natural Language Processing (NLP). As Kaplan and Kay [12] have argued, regular expressions are the appropriate level of abstraction for thinking about finite-state languages and finite-state relations. More complex finite-state operations (such as contexted replacement) are defined on the basis of basic operations (such as Kleene closure, complementation, composition). In order to be able to experiment with such complex finite-state operations the FSA Utilities (version 5) provides an extendible regular expression compiler. The paper discusses the regular expression operations provided by the compiler, and the possibilities to create new regular expression operators. The benefits of such an extendible regular expression compiler are illustrated with a number of examples taken from recent publications in the area of finite-state approaches to NLP.
Article
Full-text available
This paper presents a novel formalization of optimality theory. Unlike previous treatments of optimality in computational linguistics, starting with Ellison (1994), the new approach does not require any explicit marking and counting of constraint violations. It is based on the notion of "lenient composition", defined as the combination of ordinary composition and priority union. If an underlying form has outputs that can meet a given constraint, lenient composition enforces the constraint; if none of the output candidates meets the constraint, lenient composition allows all of them. For the sake of greater efficiency, we may "leniently compose" the gen relation and all the constraints into a single finite-state transducer that maps each underlying form directly into its optimal surface realizations, and vice versa. Seen from this perspective, optimality theory is surprisingly similar to the two older strains of finite-state phonology: classical rewrite systems and two-level models. I...
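The behaviour of lenient composition described in this abstract can be sketched on small finite relations. The block below is an illustrative toy, not the finite-state implementation from the paper: relations are modelled as Python sets of (input, output) pairs, the constraint as a predicate on outputs, and the names `compose`, `priority_union`, `lenient_compose`, `GEN` and `no_b` are all assumptions introduced here.

```python
def compose(relation, constraint):
    """Ordinary composition with a constraint: keep pairs whose output passes."""
    return {(i, o) for (i, o) in relation if constraint(o)}


def priority_union(preferred, fallback):
    """All pairs of `preferred`, plus `fallback` pairs for inputs it does not cover."""
    covered = {i for (i, _) in preferred}
    return preferred | {(i, o) for (i, o) in fallback if i not in covered}


def lenient_compose(relation, constraint):
    """Enforce the constraint where some candidate meets it; otherwise keep all."""
    return priority_union(compose(relation, constraint), relation)


# Toy candidate generator (hypothetical data): input form -> candidate outputs.
GEN = {("ab", "ab"), ("ab", "a"), ("b", "b")}


def no_b(output):
    """Toy constraint: the output must not contain the symbol 'b'."""
    return "b" not in output


optimal = lenient_compose(GEN, no_b)
```

For the input `ab` one candidate satisfies the constraint, so the violating candidate is discarded; for the input `b` no candidate satisfies it, so the sole candidate survives.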
Article
Full-text available
Context-dependent rewrite rules are used in many areas of natural language and speech processing. Work in computational phonology has demonstrated that, given certain conditions, such rewrite rules can be represented as finite-state transducers (FSTs). We describe a new algorithm for compiling rewrite rules into FSTs. We show the algorithm to be simpler and more efficient than existing algorithms. Further, many of our applications demand the ability to compile weighted rules into weighted FSTs, transducers generalized by providing transitions with weights. We have extended the algorithm to allow for this.
Article
Stochastic approaches to natural language processing have often been preferred to rule-based approaches because of their robustness and their automatic training capabilities. This was the case for part-of-speech tagging until Brill showed how state-of-the-art part-of-speech tagging can be achieved with a rule-based tagger by inferring rules from a training corpus. However, current implementations of the rule-based tagger run more slowly than previous approaches. In this paper, we present a finite-state tagger, inspired by the rule-based tagger, that operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in a deterministic finite-state machine. This result is achieved by encoding the application of the rules found in the tagger as a nondeterministic finite-state transducer and then turning it into a deterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger whose speed is dominated by the access time of mass storage devices. We then generalize the techniques to the class of transformation-based systems.
Article
Thesis (Ph. D. in Linguistics)--University of California, Mar. 1970. Bibliography: leaves 157-160.
Article
This work describes a method of achieving rapid, reliable parsing of natural text through the application of three techniques: (1) resolving small questions sequentially, (2) repairing errors directly, instead of searching through a non-deterministic space, and (3) recognizing major constituents before analyzing the details of their internal structure. The resulting parser, which I call CASS, is fast and accurate. It parses a million words in 5-6 hours; that is as fast as the fastest parsers reported in the literature. Its accuracy at recognizing chunks (mid-level constituents) and at identifying subjects and predicates is 95% or better.
Article
Finite-state cascades represent an attractive architecture for parsing unrestricted text. Deterministic parsers specified by finite-state cascades are fast and reliable. They can be extended at modest cost to construct parse trees with finite feature structures. Finally, such deterministic parsers do not necessarily involve trading off accuracy against speed; they may in fact be more accurate than exhaustive-search stochastic context-free parsers.