Hyphenation on Demand
Petr Sojka
Faculty of Informatics
Masaryk University Brno
Botanická 68a, 602 00 Brno
Czech Republic
sojka@informatics.muni.cz
Abstract
The need to fully automate the batch typesetting process increases with the use of TeX as the engine for high-volume and on-the-fly typeset documents which, in turn, leads to the need for programmable hyphenation and line-breaking of the highest quality.
An overview of approaches for building custom hyphenation patterns is provided, along with examples. A methodology of the process is given, combining different approaches: one based on morphology and hand-made patterns, and one based on word lists and the program PATGEN. The method aims at modular, easily maintainable, efficient, and portable hyphenation. The bag of tricks used in the process to develop custom hyphenation is described.
Motivation
In principle, whether to hyphenate or not is a style
question and CSS [Cascading Style Sheets] should
develop properties to control hyphenation. In
practice, however, for most languages there
is no algorithm or dictionary that gives
all (and only) correct word breaks, so
some help from the author may
occasionally be needed.
(Bos, 1999)
Separation of content and presentation in today’s open information management style in the sense of SGML/XML (Goldfarb, 1990; Megginson, 1998) is a challenge for TeX as a batch typesetting tool. Attempts to bring TeX’s engine to untangle presentation problems in the WWW arena are numerous (Sutor and Díaz, 1998; Skoupý, 1998).
One bottleneck in high-volume quality publishing is the proofreading stage: line-breaking and hyphenation handling need to be fine-tuned to the layout of the particular publication. Tight deadlines in paper-based document production and high-volume electronic publishing place additional demands on the automation of the typesetting process. The need for multiple presentations of the same data (e.g., for paper and screen) adds another dimension to the problem. Hyphenation problems are often among the most difficult. As most TeX users are perfectionists, fixing and tuning hyphenation for every presentation is a tedious, time-consuming task.
We have already dealt with several issues related to hyphenation in TeX (Sojka and Ševeček, 1995; Sojka, 1995). Having been involved in typesetting tens of thousands of TeX pages of multilingual documents (mostly dictionaries), we want to point out several methods suitable for the development of hyphenation patterns.
Pattern generation
There is no place in the world that is
linguistically homogeneous, despite the
claims of the nationalists around the world.
(Plaice, 1998)
Liang (1983), in his thesis written under Knuth’s supervision, developed a general method to solve the hyphenation problem that was adopted in TeX82 (Knuth, 1986a, App. H). He wrote the PATGEN program (Liang and Breitenlohner, 1999), which takes
• a list of already hyphenated words (if any),
• a set of patterns (if any) that describes “rules”,
• a list of parameters for the pattern generation process,
• a character code translation file (added in PATGEN 2.1; for details see Haralambous (1994)),
and generates
• an enriched set of patterns that “covers” all hyphenation points in the given input word list,
• a word list hyphenated with the enriched set of patterns (optional).
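To make the role of such patterns concrete, the following minimal Python sketch shows how a set of Liang-style patterns can be applied to a word. It is an illustration under stated assumptions, not PATGEN or TeX code: the function names and the small pattern set are chosen for this example only. Digits in a pattern are inter-letter values, the highest value from all matching patterns wins at each position, and an odd final value permits a break.

```python
# Illustrative pattern set (an assumption for this sketch, not PATGEN output).
EXAMPLE_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at",
                    "1na", "n2at", "1tio", "2io", "o2n"]

def parse_pattern(pattern):
    """Split 'hen5at' into its letters and the values between them."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    values, pos = [0] * (len(letters) + 1), 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)   # value sitting before letter number `pos`
        else:
            pos += 1
    return letters, values

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return `word` with hyphens at every position the patterns allow."""
    table = dict(parse_pattern(p) for p in patterns)
    text = "." + word.lower() + "."          # dots mark the word boundaries
    scores = [0] * (len(text) + 1)           # scores[k]: value before text[k]
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            values = table.get(text[start:end])
            if values:
                for offset, value in enumerate(values):
                    scores[start + offset] = max(scores[start + offset], value)
    # A break before word[i] corresponds to scores[i + 1] (leading dot shift).
    cuts = [i for i in range(left_min, len(word) - right_min + 1)
            if scores[i + 1] % 2 == 1]
    parts = [word[a:b] for a, b in zip([0] + cuts, cuts + [len(word)])]
    return "-".join(parts)

print(hyphenate("hyphenation", EXAMPLE_PATTERNS))
```

The even values act as inhibitors: in this toy set, he2n forbids a break after "he", while hen5at overrides the weaker n2at and allows the break after "hen".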
TUGboat, Volume 20 (1999), No. 3 Proceedings of the 1999 Annual Meeting 241
The patterns are loaded into TeX’s memory and stored in a data structure (cf. Knuth, 1986b, parts 40–43) that is also efficient for retrieval: a variant of trie memory (cf. Knuth, 1998, pp. 492–512). This data structure allows hyphenation pattern searching in linear time with respect to the pattern length. The algorithm using a trie that “outputs” possible hyphenation positions may be viewed as a finite automaton with output (a Mealy automaton or transducer).
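The trie idea can be sketched in a few lines of Python. This is a simplified illustration using nested dictionaries, not the packed trie that TeX actually implements; the names build_trie and matches_from are assumptions made for this example. Walking from one starting position visits at most as many nodes as the longest pattern has letters, and values are emitted along the way, which is the transducer view described above.

```python
def build_trie(patterns):
    """Store each pattern's letters as a path; the end node keeps its values."""
    root = {}
    for pattern in patterns:
        letters = [ch for ch in pattern if not ch.isdigit()]
        values, pos = [0] * (len(letters) + 1), 0
        for ch in pattern:
            if ch.isdigit():
                values[pos] = int(ch)
            else:
                pos += 1
        node = root
        for ch in letters:
            node = node.setdefault(ch, {})
        node[None] = values          # the key None marks "a pattern ends here"
    return root

def matches_from(trie, text, start):
    """Yield (letters, values) for every pattern that begins at text[start]."""
    node = trie
    for end in range(start, len(text)):
        node = node.get(text[end])
        if node is None:             # no pattern continues with this letter
            return
        if None in node:             # output attached to this state
            yield text[start:end + 1], node[None]

trie = build_trie(["he2n", "hena4", "hen5at"])
for letters, values in matches_from(trie, "henation", 0):
    print(letters, values)
```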
Pattern development
...problems [with hyphenation] have more or less
disappeared, and I’ve learnt that this is only because,
nowadays, every hyphenation in the newspaper
is manually checked by human proof-readers.
(Jarnefors, 1995)
Studying patterns that are available for various languages shows that PATGEN has only been used for about half of the hyphenation pattern files on CTAN (cf. Table 1 in Sojka and Ševeček, 1995).
There are two approaches to hyphenation pattern development, depending on user preferences. Single authors using TeX as an authoring tool want to minimize system changes and want TeX to behave as a fixed point so that re-typesetting of old articles is easily done, thanks to backwards compatibility. For such users, one set of patterns that is fixed once and for all might be sufficient.
On the other hand, for publishers and corporate users with high-volume output, it is more efficient to make a long-term investment into development of hyphenation patterns for particular purposes. I remember one TeX user saying that my suggestion to enhance standard hyphenation patterns with custom-made ones to allow better hyphenation of chemical formulæ would save his employer thousands of pounds per year. Of course, with this approach, one has to archive full sources for every publication, together with hyphenation patterns and exceptions.
One of the possible reasons PATGEN has not been used more extensively may be the high investment needed to create hyphenated lists of words, or better, a morphological database of a given language.
Pattern bootstrapping and iterative development
The road to wisdom?
Well it’s plain and simple to express:
Err and err and err again
but less and less and less.
(Hein, 1966)
When developing new patterns, it is good to start with the following iterative bootstrapping technique, which should avoid the tedious task of manually marking hyphenation points in huge lists of words:
1. Write down the most obvious initial patterns, if any, and/or collect “the closest” ones (e.g., consonant-vowel rules).
2. Extract a small word list for the given language.
3. Hyphenate the current word list with the current set of patterns.
4. Check all hyphenated words and correct them; in the case of errors return to step 3.
5. Collect a bigger word list.
6. Use the previously generated set of patterns to hyphenate the words in this bigger list.
7. Check hyphenated words, and if there are no errors, move to step 9.
8. Correct word list and return to step 6.
9. Generate final patterns with PATGEN with parameters fitted for the particular purpose (tuned for space or efficiency).
10. Merge/combine new patterns with other modules of patterns to fit the particular publishing project.
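The control flow of steps 2 to 9 can be sketched as a small Python loop. Everything here is a hypothetical stand-in: hyphenate represents hyphenating a word with the current patterns, proofread represents the human check, and generate represents a PATGEN run; none of these is a real PATGEN interface, and max_rounds is a safety limit added for the sketch.

```python
def bootstrap(patterns, word_batches, hyphenate, proofread, generate,
              max_rounds=10):
    """Grow a proofread word list batch by batch, regenerating patterns."""
    checked = {}                                   # word -> proofread form
    for batch in word_batches:                     # steps 2 and 5
        for _ in range(max_rounds):
            # steps 3 and 6: hyphenate with the current patterns
            proposed = {word: hyphenate(word, patterns) for word in batch}
            # steps 4 and 7: a human corrects the proposals
            corrected = {word: proofread(word, guess)
                         for word, guess in proposed.items()}
            checked.update(corrected)
            if corrected == proposed:              # nothing had to be fixed
                break
            patterns = generate(checked)           # step 8: learn and retry
    return generate(checked), checked              # step 9: final patterns

# Toy stand-ins: the "patterns" simply memorise the proofread words.
truth = {"banana": "ba-na-na", "demand": "de-mand"}
final, checked = bootstrap(
    {}, [["banana"], ["banana", "demand"]],
    hyphenate=lambda word, pats: pats.get(word, word),
    proofread=lambda word, guess: truth[word],
    generate=lambda done: dict(done),
)
print(final)
```

With real tools, the payoff is that each round leaves fewer words for the proofreader to correct, since the patterns already generalise from the previous batches.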
To find an initial set of patterns, some basic rules of hyphenation in the specific language should be known. Languages can be grouped into one of two categories: those that derive hyphenation points according to etymology and those that derive hyphenation according to pronunciation (“syllable-based” hyphenation). For the first group of languages, one should start with patterns for the most frequent endings, suffixes, and prefixes. For syllable-based hyphenation, patterns based on sequences of consonants and vowels might be used (cf. Chicago Manual of Style, 1993, Section 6.44, and Haralambous, 1999) as a first approximation of hyphenation patterns.
As using TeX itself for hyphenation of word lists and development of patterns may be preferred to other possibilities, we will start with this portable solution, using hyphenation of phonetic transcriptions as an example of a syllable-based “language”.
Let’s start with some plain TeX code to define consonant-vowel (CV) patterns:
% ... loading plain.tex
% without hyphen.tex patterns ...
\patterns{cv1cv cv2c1c ccv1c cccv1c
ccccv1c cccccv1c v2v1 v2v2v1 v2v2v2v1
...
}
There is a way to typeset words together with their hyphenation points in TeX; the code from Olšák (1997, with minor modifications) looks like this:
\def\showhyphenspar{\begingroup
\overfullrule=0pt \parindent0pt
\hbadness=10000 \tt
\def\par{\setparams\endgraf\composelines}%
\setbox0=\vbox\bgroup
\noindent\hskip0pt\relax}
\def\setparams{\leftskip=0pt
\rightskip=0pt plus 1fil
\linepenalty1000 \pretolerance=-1
\hyphenpenalty=-10000}
\def\composelines{%
\global\setbox1=\hbox{}%
\loop
\setbox0=\lastbox \unskip \unpenalty
\ifhbox0 %
\global\setbox1=\hbox{%
\unhbox0\unskip\hskip0pt\unhbox1}%
\repeat
\egroup % close \setbox0=\vbox
\exhyphenpenalty=10000%
\emergencystretch=4em%
\unhbox1\endgraf
\endgroup}
Now, we will typeset our word list in the typewriter font without ligatures. To use the CV patterns defined above we need to map word characters properly:
% vowels mapping
\lccode`\a=`v \lccode`\e=`v
\lccode`\i=`v \lccode`\o=`v
...
% consonants
\lccode`\b=`c \lccode`\c=`c
\lccode`\d=`c \lccode`\f=`c
...
\raggedbottom \nopagenumbers
\showhyphenspar
The need to fully automate the
batch typesetting process increases
with the use of word in wordlist
...
\par\bye
Finally, extracting hyphenated words from the dvi file via the dvitype program, we get our word list hyphenated by our simple CV patterns.
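The effect of the \lccode mapping can be imitated outside TeX. The Python sketch below is an illustration only: it maps every letter to c or v (with a deliberately crude vowel set, which is an assumption), applies just the CV patterns printed above (the elided part of the list is not guessed at), and projects the break points back onto the original word. The minimum distances of two letters from each word edge are also assumptions of the sketch.

```python
VOWELS = set("aeiou")  # assumption: crude vowel set, for illustration only
# Only the CV patterns shown in the listing above.
CV_PATTERNS = ["cv1cv", "cv2c1c", "ccv1c", "cccv1c", "ccccv1c",
               "cccccv1c", "v2v1", "v2v2v1", "v2v2v2v1"]

def cv_shape(word):
    """Mimic the \\lccode trick: every letter becomes 'c' or 'v'."""
    return "".join("v" if ch in VOWELS else "c" for ch in word.lower())

def parse(pattern):
    """Split a pattern into its letters and inter-letter values."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    values, pos = [0] * (len(letters) + 1), 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values

def cv_hyphenate(word, left_min=2, right_min=2):
    """Hyphenate `word` using only its consonant-vowel shape."""
    shape = cv_shape(word)
    scores = [0] * (len(shape) + 1)          # scores[k]: value before shape[k]
    for letters, values in map(parse, CV_PATTERNS):
        start = shape.find(letters)
        while start != -1:                   # every occurrence, overlaps too
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
            start = shape.find(letters, start + 1)
    cuts = [i for i in range(left_min, len(word) - right_min + 1)
            if scores[i] % 2 == 1]
    parts = [word[a:b] for a, b in zip([0] + cuts, cuts + [len(word)])]
    return "-".join(parts)

for w in ["banana", "pattern", "demand"]:
    print(cv_shape(w), cv_hyphenate(w))
```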
Another way to get the initial word list hyphenated is to use PATGEN with initial patterns and no new level, letting PATGEN hyphenate the word list that was input. PERL addicts may want to use the PERL hyphenation module (Pazdziora, 1997) for the task.
Once the job of proofreading the word list is finished, we can generate new patterns and collect other words in the language. Using new patterns on the new collection will show the efficiency of the process.
Fine tuning of patterns may be iterated, once PATGEN parameters are set, so that nearly 100 % coverage of hyphenation points is achieved in every iteration. The setting of such PATGEN parameters may be difficult to find on the first attempt. Setting of these parameters is discussed in Sojka and Ševeček (1995).
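Coverage can be checked independently of PATGEN by comparing a proofread list with the output of the current patterns. The Python sketch below counts good, bad, and missed hyphenation points; the function names and the two-word sample are assumptions made for this illustration, and the word forms in both lists are assumed to be in the same order.

```python
def break_positions(hyphenated):
    """Return the set of letter offsets at which a hyphenated word breaks."""
    positions, offset = set(), 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(offset)
        else:
            offset += 1
    return positions

def coverage(reference, produced):
    """Count good, bad and missed break points over two parallel lists."""
    good = bad = missed = 0
    for ref, out in zip(reference, produced):
        want, got = break_positions(ref), break_positions(out)
        good += len(want & got)      # found and correct
        bad += len(got - want)       # found but wrong
        missed += len(want - got)    # correct but not found
    total = good + missed
    return {"good": good, "bad": bad, "missed": missed,
            "coverage": good / total if total else 1.0}

stats = coverage(["hy-phen-ation", "de-mand"],
                 ["hy-phen-ation", "dem-and"])
print(stats)
```

Tracking the bad count alongside coverage matters: a parameter setting that reaches full coverage by also inserting wrong breaks is not an improvement.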
Modularity of patterns
It is tractable for some languages to create patterns by hand, simply by writing patterns according to the rules for a given language. This approach is, however, doomed to failure for complex languages with several levels of exceptions. Nevertheless, there are special cases in which we may build pattern modules and concatenate patterns to achieve special-purpose behaviour. This applies especially when additional characters (not handled when the patterns were originally built) may occur in words that we still want to hyphenate.
Patterns generated by Raichle (1997) may serve as an example that can be used with any fonts in the standard LaTeX eight-bit T1 font encoding, to allow hyphenation after an explicit hyphen. Similar pattern modules can be written for words or chemical formulæ that contain braces and parentheses. These can be combined with “standard” patterns in the needed encodings. Some problems might be caused by the fact that TeX does not allow metrics to be defined for \lefthyphenmin and \righthyphenmin properly: we might want to say that ligatures, for instance, count as a single letter only or that some characters should not affect hyphenation at all (e.g. parentheses in words like colo(u)r). We must wait until some naming mechanism for output glyphs (characters) is adopted by the TeX community for handling these issues.
Adding a new primitive for the hyphenmin code (let’s call it \hccode, a calque on \lccode) would cause similar problems: changing it in mid-paragraph would have unpredictable results.¹
It is advisable to create modules or libraries of special-purpose hyphenation patterns, such as the ones mentioned above, to ease the task of pattern development. These patterns might be written in such a way as to be easily adaptable for use with core patterns of a different language.
¹ ε-TeX v2 has a new feature to fix the \lccode values during the pattern read phase.
Common patterns for more languages
Given large hyphenated word lists for several languages, it becomes possible to make multilingual or special-purpose patterns from collections of words by using PATGEN. Joining word lists and generating patterns on demand for particular publications is especially useful when the word databases are structured and split into sublists of personal names, geographic names, abbreviations, etc. Such patterns are needed when typesetting material in which language switching is not properly done (e.g. on the WWW).
Czech and Slovak are very closely related languages. Although they do not share exactly the same alphabet, rules for hyphenation are similar. That has led us to the idea of making one set of hyphenation patterns to work for both languages, saving on space in a format file that supports both. In the Czech/Slovak standard TeX distribution there is support for different font encodings. For every encoding, hyphenation patterns have to be loaded, as no character remapping is possible at the level of the trie. Such Czechoslovak patterns would save one set of patterns for each encoding in use.
It should be mentioned that this approach cannot be taken for an arbitrary set of languages as there may be, in general, identical words that hyphenate differently in different languages; thus, simply merging word lists to feed PATGEN is not sufficient without degrading the performance of patterns by forbidding hyphenation in these conflicting words (e.g. re-cord vs. rec-ord).
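Before merged lists are fed to PATGEN, conflicting words of this kind can be detected mechanically. The following Python sketch is one possible way to do so; the function name and the sample lists are assumptions made for illustration. It merges hyphenated word lists and reports every word that appears with more than one hyphenation.

```python
def merge_word_lists(*word_lists):
    """Merge hyphenated word lists and collect conflicting hyphenations."""
    merged, conflicts = {}, {}
    for words in word_lists:
        for hyphenated in words:
            plain = hyphenated.replace("-", "")
            # Keep the first hyphenation seen for each plain word.
            previous = merged.setdefault(plain, hyphenated)
            if previous != hyphenated:
                conflicts.setdefault(plain, {previous}).add(hyphenated)
    return merged, conflicts

merged, conflicts = merge_word_lists(["re-cord", "de-mand"], ["rec-ord"])
print(sorted(conflicts["record"]))
```

The conflicting words can then be handled separately, for instance by leaving them unhyphenated in the merged list, at the cost described above.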
Phonetic hyphenation
As an example of custom-made hyphenation patterns, the patterns required to hyphenate a phonetic (IPA) transcription are described in this section. Dictionaries use this extensively; see Fig. 1,² taken from Kirsteinová and Borg (1999).
The steps used to develop the hyphenation patterns for this dictionary were similar to those described above in the section on bootstrapping:
1. Write down the most obvious (syllable) patterns.
2. Extract all phonetic words from available texts.
3. Hyphenate this word list with the initial set of patterns.
4. Check and correct all hyphenated words.
5. Generate final quality patterns.
In bigger publishing projects efforts like this pay off very quickly.
² The IPA font used is TechPhonetic, downloadable from http://www.sil.org/ftp/PUB/SOFTWARE/WIN/FONTS/.
Figure 1: Example of phonetic hyphenation in Kirsteinová and Borg (1999).
Hyphenation for an etymological dictionary
In some publications (Rejzek, in prep., for example), a different problem can arise: the possibility of having more than 256 characters used within a single paragraph. This problem cannot, in general, be easily solved³ within the frame of TeX82. We thus tried Ω, the typesetting system by Plaice and Haralambous, for this purpose. One has to create special virtual fonts (e.g., by using the fontinst package) on top of the ones, in order to typeset it; see Fig. 2.
More hyphenation classes
But at least I can point out a minor weakness
of T
E
X’s algorithm: all possible hyphenations
have the same penalty. This might be ok
for english, but for languages like German
that have a lot of composite words there
should be the ability to assign lower penalties
between parts of a composite i.e. Um-brechen
should be favored against Umbre-chen.
(Hars, 1999)
³ One could try to re-encode all fonts used in parallel in some paragraph such that they share the same \lccode mappings, but this exercise would have to be made for each multilingual-intensive publication, again and again.
Figure 2: Using Ω to typeset paragraphs in which
words from languages with more than 256 different
characters may appear and be hyphenated in
parallel.
Some approaches to handling multiple hyphenation classes were suggested in Sojka (1995). A prototype implementation in ε-TeX and PATGEN has recently been done (Classen, 1998). For wider adoption of such improvements, the availability of large word lists and the development of new patterns are crucial. Many of the methods mentioned above could be used to develop such multi-class/multi-purpose patterns. Allen (1990) contains such a word list, which shows that some publishers do pay attention to line-breaking details.
Speed considerations
Even though hyphenation searches using a trie data structure are fast, searching for unnecessary hyphenation points is a waste of time. It is advisable to tell TeX where words shouldn’t be hyphenated. Comparing several possibilities for suppressing hyphenation, the option of setting \lefthyphenmin to 65 is slightly faster than switching to a \language that has no patterns. These solutions outperform the \hyphenpenalty 10000 solution by a fair amount (cf. Arseneau, 1994).
Reuse of patterns
Sometimes we need the same patterns with different \lefthyphenmin and \righthyphenmin parameters. The suggested approach is not to limit hyphens close to word boundaries during the pattern generation phase but to use TeX’s \setlanguage primitive. This can be done to achieve special hyphenation handling for the last word in a paragraph (e.g., a higher \righthyphenmin) given proper markup by a preprocessing filter. For example:
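\newcount\tmpcount
\def\lastwordinpar#1{%
% remember the current value of \righthyphenmin
\tmpcount=\righthyphenmin
% require at least five letters after the last hyphen
\righthyphenmin5
% record the new value in the paragraph, then typeset the word
\setlanguage\language #1
% restore the saved value and record it again
\expandafter\righthyphenmin\the\tmpcount
\setlanguage\language}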
\showhyphens{demand}
\lastwordinpar{demand\showhyphens{demand}}
\bye
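The same principle, generating unrestricted patterns and applying the limits only at the point of use, can be shown outside TeX. The Python sketch below is an illustration under assumptions (the function name and defaults are invented for this example, and it does not model \setlanguage itself): it drops hyphenation points that lie too close to either edge of an already hyphenated word.

```python
def limit_breaks(hyphenated, left_min=2, right_min=3):
    """Drop break points closer than left_min/right_min letters to the edges."""
    total = len(hyphenated.replace("-", ""))
    result, seen = [], 0                 # seen: letters before this position
    for ch in hyphenated:
        if ch == "-":
            if left_min <= seen <= total - right_min:
                result.append("-")
        else:
            result.append(ch)
            seen += 1
    return "".join(result)

print(limit_breaks("de-mand"))               # default limits keep the break
print(limit_breaks("de-mand", right_min=5))  # a stricter right limit drops it
```

One unrestricted pattern set can thus serve several layouts, with stricter limits applied only where a publication calls for them.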
Future work
If you find that you’re spending almost
all your time on practice, start turning
some attention to theoretical things;
it will improve your practice.
(Knuth, 1989)
It seems inevitable that embedding of language-specific support modules will be necessary for the typesetting system in the future. These demands may apply not only to hyphenation but also to spelling or even grammar checkers. As even people using WYSIWYG systems may use tools that help to visualise possible typos (in color, etc.) on the fly, the computing power of today’s machines is surely sufficient to do the same in batch processing with even better results.
The idea of using patterns to capture mappings specific for particular languages or dialect modules can be further generalized for different purposes and mappings. The use of the theory of finite-state transducers (Mohri, 1996; Mohri, 1997; Roche and Schabes, 1996) to implement other classes of language modules looks promising.
Summary
Some computerized typesetting methods in frequent
use today may render a conservative approach
to word division impractical. Compromise may
therefore be necessary pending the development
of more sophisticated technology.
Chicago Manual of Style (1993, Section 6.43)
We have outlined some of the possibilities offered by TeX and PATGEN for the development of customized hyphenation patterns. We have suggested bootstrapping and iterative techniques to facilitate pattern development. We also suggest wider employment of PATGEN and preparation of hyphenated word lists and modules of patterns for easy preparation of hyphenation patterns on demand in today’s age of digital typography (Knuth, 1999).
Acknowledgements. We thank Bernd Raichle for valuable comments and corrections to the paper. We are indebted to the Proceedings editor for wording improvements. The presentation of this work has been made possible through support from the Ministry of Education, Youth and Physical Training (MŠMT ČR grant VS97028).
References
Allen, R.E. The Oxford Spelling Dictionary, volume II of The Oxford Library of English Usage. Oxford University Press, 1990.
Arseneau, Donald. “Benchmarking paragraphs without hyphenation”. Posting to the Usenet group news:comp.text.tex on Dec 13, 1994.
Bos, Bert. “Internationalization / Localization”. http://www.w3.org/International/O-HTML-hyphenation.html, 1999.
Chicago Manual of Style. The Chicago Manual of Style, 14th edition, 1993.
Classen, Matthias. “An extension of TeX’s hyphenation algorithm”. ftp://peano.mathematik.uni-freiburg.de/pub/etex/hyphenation/, 1998.
Goldfarb, Charles F. The SGML Handbook. Clarendon Press, Oxford, 1990.
Haralambous, Yannis. “A Small Tutorial on the Multilingual Features of PATGEN2”. In electronic form, available from CTAN as info/patgen2.tutorial, 1994.
Haralambous, Yannis. “From Unicode to Typography, A Case Study: The Greek Script”. Proceedings of 14th International Unicode Conference, preprint available from http://genepi.louis-jean.com/omega/boston99.pdf, 1999.
Hars, Florian. “Typo-l email discussion list”. 1999.
Hein, Piet. Grooks. MIT Press, Cambridge, Massachusetts, 1966.
Jarnefors, Olle. “ISO-10646 email discussion list”. 1995.
Kirsteinová, Blanka and B. Borg. Dánsko-český slovník, Dansk-Tjekkisk Ordbog [Danish-Czech dictionary]. LEDA, Prague, Czech Republic, 1999.
Knuth, Donald E. The TeXbook, volume A of Computers and Typesetting. Addison-Wesley, Reading, MA, USA, 1986a.
Knuth, Donald E. TeX: The Program, volume B of Computers and Typesetting. Addison-Wesley, Reading, MA, USA, 1986b.
Knuth, Donald E. “Theory and Practice”. Keynote address for the 11th World Computer Congress (Information Processing ’89), 1989.
Knuth, Donald E. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1998.
Knuth, Donald E. Digital Typography. CSLI Lecture Notes 78. Center for the Study of Language and Information, Stanford, California, 1999.
Liang, Frank. Word Hy-phen-a-tion by Com-put-er. Ph.D. thesis, Department of Computer Science, Stanford University, 1983.
Liang, Frank and P. Breitenlohner. “PATtern GENeration Program for the TeX82 Hyphenator”. Electronic documentation of PATGEN program version 2.3 from web2c distribution on CTAN, 1999.
Megginson, David. Structuring XML Documents. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1998.
Mohri, Mehryar. “On some applications of finite-state automata theory to natural language processing”. Natural Language Engineering 2(1), 61–80, 1996.
Mohri, Mehryar. “Finite-State Transducers in Language and Speech Processing”. Computational Linguistics 23(2), 269–311, 1997.
Olšák, Petr. TeXbook naruby [TeXbook topsy-turvy]. Konvoj, Brno, 1997.
Pazdziora, Jan. “TeX::Hyphen - hyphenate words using TeX’s patterns”. CPAN: modules/by-authors/Jan_Pazdziora/TeX-Hyphen-0.10.tar.gz, 1997.
Plaice, John. “pdftex email discussion list”. http://www.tug.org/archives/pdftex/msg01913.html, 1998.
Raichle, Bernd. “Hyphenation patterns for words containing explicit hyphens”. CTAN/language/hyphenation/hypht1.tex, 1997.
Rejzek, Jan. Etymologický slovník českého jazyka [Czech Etymological Dictionary]. LEDA, Prague, Czech Republic, in prep.
Roche, Emmanuel and Y. Schabes. Finite-State Language Processing. MIT Press, 1996.
Skoupý, Karel. “NTS: A New Typesetting System”. TUGboat 18(3), 318–322, 1998.
Sojka, Petr. “Notes on Compound Word Hyphenation in TeX”. TUGboat 16(3), 290–297, 1995.
Sojka, Petr and P. Ševeček. “Hyphenation in TeX – Quo Vadis?”. TUGboat 16(3), 280–289, 1995.
Sutor, Robert S. and A. L. Díaz. “IBM techplorer: Scientific Publishing for the Internet”. Cahiers Gutenberg 28–29, 295–308, 1998.
... Ze slovníku velikosti několika MB lze vytvořit vzory velikostiřádověvelikostiřádově desítek KB pokrývající nad 98 % dělicích bodů a s chybovostí pod 0,1 %. ˇ Cetné experimenty ukázaly, že se vystačívystačíčasto sě ctyřmi úrovněmi [21]. Pomocí vhodných technik (bootstrapping, stratifikace) a strategií nastavení parametrů pro lineární prahování bylo ukázáno [21,18,19] jak se dají generované vzory optimalizovat. Příklady statistik z generování variantčeskýchvariantčeských vzorů dělení jsou na obrázcích 1, 2 a 3. ...
... V následujícím popisu zanedbáváme detaily, které nejsou podstatné pro následující úvahy. Pro přesný popis odkazujeme na [11], nebočlánkynebočlánky [18,19]. ...
Article
Full-text available
Abstrakt: ˇ Clánek popisuje techniku vzor˚ uj ako prostredek pro získávání informace z rozsáhl˝ch dat a zpetné rozpoznávání. Typickou aplikací této techniky je delení slov. Dosud chybí generátor vzorudelení pro systém  (pro UNICODE) a rozöíˇ rení programu PATGEN ,o mezeného osmibitov˝m ASCII, není únosné. Proto vyvíjíme knihovnu PATLIB pro obecnou manipulaci se vzory a na ní postavíme generátor vzor˚ udelení slov v UNICODE. Popíöeme architekturu pripravovaného systému a dále méne známou datovou strukturu dynamic packed trie, kterou lze v˝hodnep ouûít pro efektivní ukládání konecn˝ch jazyk˚ u s v˝stupy. Vzory lze pouûít i pro rozpoznávání hranic sloûen˝ch slov, proto zmíníme návrhy na rozöíˇ rení následníku TeXu o klasifikované delení s více typy delících boduao automatické potlacování ligatur na övech sloûen˝ch slov.
... In this paper, we evaluate the feasibility of the development of universal phonology-based (syllabic) hyphenation patterns. We describe the development from word lists of Czech [11,15,12] and Slovak [14] used on the web pages. We describe the reproducible approach, and document the reproducible workflow and resources in the public repositories as a language resource and methods to be followed. ...
Article
Space- and time-effective segmentation (word hyphenation) of natural languages remain at the core of every document rendering system, be it TeX, web browser, or mobile operating system. In most languages, segmentation mimicking syllabic pronunciation is a pragmatic preference today. As language switching is often not marked in rendered texts, the typesetting engine needs universal syllabic segmentation. In this article, we show the feasibility of this idea by offering a prototypical solution to two main problems: A) Patgen generation process for several languages at once; B) no wide character support in tools like Patgen or TeX hyphenation, e.g. internal Unicode compliance is missing. For A), we have applied it to generating universal syllabic patterns from wordlists of nine syllabic, as opposed to etymology-based, languages. For B), we have created a version of Patgen that uses the Judy array data structure and compared its effectiveness with the trie implementation. With the data from nine languages (Czech, Slovak, Georgian, Greek, Polish, Russian, Turkish, Turkmen, and Ukrainian) we showed that A) developing universal, up-to-date, high-coverage, and highly generalized universal syllabic segmentation patterns is possible, with high impact on virtually all typesetting engines, including web page renderers. B) bringing wide character support into the hyphenation part of the TeX suite of programs is possible by using the Judy array.
... In this paper, we evaluate the feasibility of the development of universal phonology-based (syllabic) hyphenation patterns. As a case study, we describe the development of Czechoslovak hyphenation patterns from word lists of Czech [11,12,13] and Slovak [14]. We document our reproducible workflow and resources in a public repository. ...
Article
Full-text available
Space- and time-effective segmentation and hyphenation of natural languages stay at the core of every document preparation system, web browser, or mobile rendering system. Recently, the unreasonable effectiveness of pattern generation has been shown – it is possible to use hyphenation patterns to solve the dictionary problem for a single language without compromise. In this article, we will show how we applied the marvelous effectiveness of patgen for the generation of the new Czechoslovak hyphenation patterns that cover two languages. We show that the development of more universal hyphenation patterns is feasible, allows for significant quality improvements and space savings. We evaluate the new approach and the new Czechoslovak hyphenation patterns.
... The idea of competing patterns is taken from the method developed by Liang [13] for his English hyphenation algorithm. It has been shown by extensive studies [18,19,20] that the method scales well and that parameters of the pattern generator -PATGEN program [21] -could be fine-tuned so that virtually all hyphenation points are covered, leading to about 99.9% efficiency. ...
Article
Full-text available
Many tasks in natural language processing (NLP) require segmentation algorithms: segmentation of paragraph into sentences, segmentation of sentences into words is needed in languages like Chinese or Thai, segmentation of words into syllables (hyphenation) or into morphological parts (e.g. getting word stem for indexing), and many other tasks (e.g. tagging) could be formulated as segmentation problems. We evaluate methodology of using competing patterns for these tasks and decide on the complexity of creation of space-optimal (minimal) patterns that completely (100 %) implement the segmentation task. We formally define this task and prove that it is in the class of non-polynomial optimization problems. However, finding space-efficient competing patterns for real NLP tasks is feasible and gives efficient scalable solutions of segmentation task: segmentation is done in constant time with respect to the size of segmented dictionary. Constant time of access to segmentations makes competing patterns attractive data structure for many NLP tasks.
... The first version of this paper was presented at the TUG 1999 conference in Vancouver (Sojka, 1999a) and final version appeared as the journal publication (Sojka, 1999b): ...
Article
Full-text available
The goal of this dissertation is to explore models, methods and methodologies for machine learning of the compact and effective storage of empirical data in the areas of language engineering and computer typesetting, with a focus on the massive exception handling. Research has focused on the pattern-driven approach. The whole methodology of so called \emph{competing patterns} capable of handling exceptions to be found so widely in natural language data and computer typesetting, is further developed. Competing patterns can store \emph{context dependent} information and can be learnt from data, or written by experts, or combined together. In the first part of the thesis, the theory of competing patterns is built; competing patterns are defined, cornerstones of methodology based on stratified sampling, bootstrapping and problem modeling by competing patterns are described. Segmentation problems (hyphenation) and problems of disambiguation of tagged data in corpus linguistics are used as examples when developing formal model of the competing patterns method. The second part consist of a series of seven published papers that describe problems addressed by the proposed methods: applications of competing patterns and related learning methods in areas of hyphenation, hyphenation of compound words and, for example, the segmentation of Thai texts.
... An article [1] presented at the SLT seminar has already been devoted to the problem of pattern generation, so we repeat only the main principles and refer the kind reader to further articles devoted to this and related topics [11,25,20,21]. Generation proceeds in phases called levels. ...
Logical analysis of natural language makes it possible to extract semantic relations that are not revealed by standard full-text search methods. Intensional logic systems, such as Transparent Intensional Logic (TIL), can rigorously describe even the higher-order relations between the speaker and the content or meaning of the discourse. In this paper, we concentrate on the mechanism of logical analysis of direct and indirect discourse by means of TIL. We explicate the procedure within the Normal Translation Algorithm (NTA) for TIL, which covers language analysis on the syntactic and semantic levels. Particular examples in the text are presented in a syntactically complicated free-word-order language, namely Czech.
This paper presents a few ideas on how to solve certain geometrical problems that arise very often in character design and are not directly solvable by METAFONT's plain macros. The first part of the paper presents two geometrical problems, the "k problem" and the "x problem", their solutions using dichotomy, and a different solution using path intersections. The latter was proposed earlier on the net by the author; although geometrically correct, it does not work in real-world METAFONT practice: a nice example of METAFONT code ... to avoid. The second part of the paper presents two simple macros for drawing "loose" Bezier curves; in a sense, the opposite of the tension operator. Finally, the third part solves a problem stated by Alan Hoenig: how to extract text and data out of a METAFONT run without using the log file. This is done in a straightforward manner by running a Flex-generated preprocessor over the GF file; the Flex code for this utility is given in appendix B. 1 Two Geometrical Problems, Solved by Iterated Calculations 1.1 Description Suppose you want to design a character `K', as on the left part of fig. 1. The character should fit inside a box of width $w$ and height $h$, and should consist of three strokes: the vertical stroke $z_0$--$z_0'$, and the two oblique strokes $z_1$--$z_2$ and $z_1'$--$z_2'$. The only constraint is that the point $z_{1l} = z'_{1r}$ should be fixed (for example, its coordinates can be $(0, h/2)$). So, here is the problem: find a stroke $z_1$--$z_2$ with fixed $z_{1l}$, $y_{2l}$, $x_{2r}$. This problem is not trivial, because METAFONT cannot compute pen positions without knowing in advance the angle of the pen (this holds both for defining a new pen with the command pickup pen and for defining a simulated pen with the command penpos)....
The problems of automatic compound-word and discretionary hyphenation in TeX are discussed. So far, these hyphenation points have had to be marked manually in the TeX source file. Several methods of tackling these problems are examined. The results obtained from experiments with a German word list are discussed.
Significant progress has been made in the hyphenation ability of TeX since its first version in 1978. However, in practice, we still face problems in many languages such as Czech, German, Swedish etc. when trying to adopt local typesetting industry standards. In this paper we discuss problems of hyphenation in multilingual documents in general, we show how we've made Czech and Slovak hyphenation patterns and we describe our results achieved using the program PATGEN for hyphenation pattern generation. We show that hyphenation of compound words may be partially solved even within the scope of TeX82. We discuss possible enhancements of the process of hyphenation pattern generation and describe features that might be reasonable to think about to be incorporated in or another successor to TeX82. Motivation "Go forth and make masterpieces of hyphenation patterns ..." (Haralambous, 1994) Editors' and publishers' typographical requirements for camera-ready prepared doc...
In this paper we will try to synthesize, from a typographical point of view, the various instances of Greek script. Globally, one can say that the Greek script is used for the following cases:
We describe new applications of the theory of automata to natural language processing: the representation of very large scale dictionaries and the indexation of natural language texts. They are based on new algorithms that we introduce and describe in detail. In particular, we give pseudocodes for the determinization of string to string transducers, the deterministic union of p-subsequential string to string transducers, and the indexation by automata. We report several experiments illustrating the applications. 1 Introduction The theory of automata provides efficient and convenient tools for the representation of linguistic phenomena. Natural language processing can even be considered as one of the major fields of application of this theory (Perrin 1993). The use of finite-state machines has already been shown to be successful in various areas of computational linguistics: lexical analysis (Silberztein 1993), morphology and phonology (Koskenniemi 1985; Karttunen et al. 1992; Kaplan a...
Finite-state machines have been used in various domains of natural language processing. We consider here the use of a type of transducer that supports very efficient programs: sequential transducers. We recall classical theorems and give new ones characterizing sequential string-to-string transducers. Transducers that output weights also play an important role in language and speech processing. We give a specific study of string-to-weight transducers, including algorithms for determinizing and minimizing these transducers very efficiently, and characterizations of the transducers admitting determinization and the corresponding algorithms. Some applications of these algorithms in speech recognition are described and illustrated.