Automating XML mark-up using a two stage
machine learning technique
Shazia Akhtar, Ronan G. Reilly, John Dunnion
Department of Computer Science,
University College Dublin
Belfield, Dublin 4.
E-mail: shazia.akhtar@ucd.ie
Abstract. We introduce a novel two-stage automatic XML mark-up system that combines the
WEBSOM approach to document categorisation with the C5 inductive learning algorithm. The
WEBSOM method clusters XML marked-up documents so that semantically similar documents lie
close together on a Self-Organizing Map (SOM). The C5 algorithm automatically learns and
applies mark-up rules derived from the nearest SOM neighbours of an unmarked document. The
system learns from mark-up errors to improve accuracy. The automatically marked-up documents
produced by the system are also categorised on the Self-Organizing Map, to further refine the
SOM's document coverage.
Keywords. automatic mark-up, C5/See5, machine learning, Self-Organizing Map, WEBSOM,
XML.
1. Introduction
With the advent of hypertext and hypermedia, vast amounts of information are now accessible in
electronic form. This ever-increasing quantity of information has created a need for intelligent
management and retrieval techniques. The requirement has become more pressing with the
dramatic growth of the World Wide Web. Hypertext is a continuously developing technology for a
variety of textual and non-textual applications in the fields of education, computer science,
information science, and many others. Currently, there are many hypertext retrieval systems but
the techniques they use are not content-sensitive, and consequently not optimal for accurate
retrieval. The most widely used mark-up language, Hypertext Mark-up Language (HTML), deals
with the format and layout of documents, but is quite inflexible in the aspects of document structure
that it can represent. Standard Generalized Markup Language (SGML) is extensible and can be used
to overcome the limitations of HTML. It is, however, complex, expensive to use, and not compatible
with the web.
Extensible Mark-up Language (XML) is a bridge between the complexity of SGML and the
simplicity of HTML. It can represent the hierarchical structure of the document; it is flexible,
portable and provides a tool for information exchange not only via the Internet but also in other
information technology areas. The adoption of XML as a standard for publication and data
interchange provides an opportunity for the development of more powerful intelligent information
retrieval systems. Currently, most search engines look for exact matches between the query and
words in the document, producing search results that are often irrelevant. If documents are marked
up in XML, intelligent search can be performed by exploiting the XML structure of the document. For
example, searching can be made sensitive to structural rather than physical proximity of key words;
the low-level semantic aspects of the document tagging structure can be exploited to provide
context sensitivity. In addition, there is also a possibility of using metadata that might have been
included in the mark-up.
Marking up documents manually in XML is difficult and tedious, and there is, as yet, no tool
available that can perform this task automatically. We present a novel system that automatically
marks up documents in XML by combining the Self-Organizing Map (SOM)
(Kohonen 1997a, 1997b) with the inductive learning algorithm C5/See5 (Quinlan, 1993, 2000). We
describe these two techniques in the next section.
2. Techniques used by the system
2.1. The WEBSOM method
WEBSOM is an application of SOM to document classification and retrieval. It is a full-text
information retrieval system and an effective browsing tool (Honkela, Kaski, Lagus, Kohonen,
1996).
Full-text documents are used as input to the WEBSOM algorithm. Non-contextual and irrelevant
information, such as punctuation marks, stop-words, signatures, images, numbers, headers, e-mail
addresses and URLs, is removed from the text of all the documents. Rarely occurring words and the
most frequently occurring words are also removed from the text corpus. Each remaining
word is then represented by a unique n-dimensional real vector x_i (e.g. n = 90) with random
components, where x_i denotes the ith word in the corpus. The relation between the words is
determined by forming the average context over all occurrences of each word. The average
context vector X(i) for each word x_i at position i can be read as
\[ X(i) = \begin{bmatrix} E\{x_{i-1} \mid x_i\} \\ \varepsilon\, x_i \\ E\{x_{i+1} \mid x_i\} \end{bmatrix} \]
where E denotes the average evaluated over the text corpus and ε is the scaling or balancing
parameter. The purpose of ε is to control the relative weight of averages in determining the
similarity of context (Kohonen 1997a). The average context vectors X(i) for all the words in the
text corpus form a training set for the SOM algorithm, and the words x_i with similar context
vectors are mapped closer to each other on the word category map. The SOM algorithm clusters
words with similar context on the same unit of the word category map; therefore the word category
map has far fewer units than there are distinct words in the text corpus. Each document is encoded
by mapping its text, word by word, onto the word category map, whereby a histogram of "hits" on it
is formed. The histograms for all documents are then used as input to a second SOM, and a document
map is formed. Documents addressing similar topics generally lie close to each other on the
document map.
Figure 1: Constructing a Self-Organizing Map by using WEBSOM method
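The averaging step that produces the context vectors X(i) can be illustrated with a minimal
sketch. It is not the WEBSOM implementation: the function name, the Gaussian random codes, the
zero padding at the ends of the token list and the default value of eps are all assumptions made
for illustration.

```python
import random


def average_context_vectors(tokens, dim=90, eps=0.2, seed=0):
    """Return {word: X} with X = [E{x_(i-1)|x_i}, eps * x_i, E{x_(i+1)|x_i}].

    Assumptions (not from the paper): Gaussian random codes, zero vectors
    as neighbours at the corpus boundaries, and eps = 0.2.
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    # One random n-dimensional code vector per distinct word.
    code = {w: [rng.gauss(0.0, 1.0) for _ in range(dim)] for w in vocab}
    zero = [0.0] * dim
    prev_sum = {w: [0.0] * dim for w in vocab}
    next_sum = {w: [0.0] * dim for w in vocab}
    count = {w: 0 for w in vocab}

    # Accumulate the codes of the left and right neighbours of every occurrence.
    for i, w in enumerate(tokens):
        left = code[tokens[i - 1]] if i > 0 else zero
        right = code[tokens[i + 1]] if i + 1 < len(tokens) else zero
        for k in range(dim):
            prev_sum[w][k] += left[k]
            next_sum[w][k] += right[k]
        count[w] += 1

    # Average the neighbour codes and stack them around the scaled word code.
    contexts = {}
    for w in vocab:
        n = count[w]
        contexts[w] = (
            [s / n for s in prev_sum[w]]
            + [eps * c for c in code[w]]
            + [s / n for s in next_sum[w]]
        )
    return contexts
```

In this sketch, two words that always occur between the same neighbours receive identical
first and last thirds of their context vectors, which is the property the word category map
relies on when it places such words on nearby units.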
2.2. C5 Learning rules
C5/See5 and its predecessor C4.5 (Quinlan, 1993) are extensions of Quinlan's well-known
Iterative Dichotomizer 3 (ID3) algorithm (Quinlan 1979). Both ID3 and C5 induce classification
models, in the form of decision trees, from data. The core of ID3 is its classification-learning
procedure. ID3 uses a fixed set of examples to build a decision tree, and the resulting tree is
used to classify future samples. Each example has several attributes and belongs to a class. A
leaf node of the tree contains a class name, whereas a non-leaf node is a decision node: an
attribute test in which each branch (leading to another decision tree) corresponds to a possible
value of the attribute. ID3 uses a statistical property called Gain to select the attribute tested
at each decision node. Gain measures how well a given attribute separates the training examples
into their target classes.
Information gain is defined in terms of entropy, which measures how informative a node of a
decision tree is. For a training set S, the entropy, or Info(S), is the average amount of
information required to identify the class of a case in S. For two classes it can be expressed as
\[ \mathrm{Info}(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus} \]
where p_⊕ denotes the proportion of positive examples in S and p_⊖ denotes the proportion
of negative examples in S. When the training set S is partitioned into subsets S_1, ..., S_n
according to the n values or outcomes of a test attribute A, the expected information is the
weighted sum over the subsets:
\[ \mathrm{Info}(S, A) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, \mathrm{Info}(S_i) \]
The term
Gain (S, A) = Info (S) - Info (S, A)
is defined as the difference between the information needed to identify an element of S, and the
information needed to identify an element of S after the value of the attribute A has been obtained.
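A small worked sketch may make the definitions of Info and Gain concrete. The function names and
the representation of examples as dictionaries are assumptions chosen for illustration; the
entropy is written for any number of classes, of which the two-class case is a special case.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy Info(S) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_after_split(rows, labels, attr):
    """Info(S, A): weighted entropy of the subsets induced by attribute attr."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    total = len(labels)
    return sum(len(s) / total * info(s) for s in subsets.values())


def gain(rows, labels, attr):
    """Gain(S, A) = Info(S) - Info(S, A)."""
    return info(labels) - info_after_split(rows, labels, attr)
```

For four examples labelled +, +, -, -, Info(S) is 1 bit. An attribute whose values coincide with
the classes has a gain of 1 bit, whereas an attribute that splits the set into two subsets, each
containing one positive and one negative example, has a gain of 0.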
For a few years after the development of ID3, the gain criterion was used for selecting test
attributes because it worked well. It was later realized that this criterion has a strong bias
towards test attributes with many values or outcomes. To solve this problem, Quinlan suggested
using the following ratio instead of Gain:
\[ \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)} \]
where SplitInfo(S, A) is the information due to the split of S on the basis of A:
\[ \mathrm{SplitInfo}(S, A) = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \]
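The effect of the ratio on many-valued attributes can be sketched as follows. The function names
are illustrative assumptions, and returning 0 when the split information is 0 is a convention
adopted here, not something stated in the text.

```python
from collections import Counter
from math import log2


def split_info(attribute_values):
    """SplitInfo(S, A): entropy of the partition induced by the attribute values."""
    total = len(attribute_values)
    counts = Counter(attribute_values).values()
    return -sum((c / total) * log2(c / total) for c in counts)


def gain_ratio(gain_value, attribute_values):
    """Gain(S, A) / SplitInfo(S, A), with 0.0 returned for a single-valued split."""
    denominator = split_info(attribute_values)
    if denominator == 0:
        return 0.0  # assumed convention: the attribute does not split S at all
    return gain_value / denominator
```

An attribute that takes a different value for each of four examples has a split information of
2 bits, so a gain of 1 bit is reduced to a gain ratio of 0.5. A two-valued attribute that splits
the same examples evenly has a split information of 1 bit and keeps a ratio of 1.0.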
The basic idea of C5 is the same as that of ID3: it uses the concept of Gain, the effective
decrease in entropy, to choose the attributes tested at different levels and so produce a
classifier in the form of a decision tree according to the previously chosen classification. In
addition, C5 deals with unavailable values, continuous attribute value ranges, pruning of decision
trees, rule derivation, etc.
Simplified decision trees are easily understood by human beings, but occasionally decision
trees, even though accurate, grow to unwieldy proportions and become too complex to be understood.
C5 can generate a set of production rules from a decision tree. These rules often express the
classification model better than trees. The advantages of the C5 algorithm are that it is very
fast, it is not sensitive to missing features, it can deal with a large number of features and it
is incremental.
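The relation between a decision tree and a set of production rules can be sketched by listing
one rule per root-to-leaf path. This is only the simplest form of rule derivation, not the
procedure used by C5, which also simplifies and prunes the rules it derives. The tree encoding,
the attribute names and the class names below are hypothetical.

```python
def tree_to_rules(tree, conditions=()):
    """Return a list of (conditions, class_name) pairs, one per leaf.

    A tree is either a class name (leaf) or a pair
    (attribute, {value: subtree}); this encoding is an assumption.
    """
    if isinstance(tree, str):
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


# Hypothetical tree over two made-up attributes of a text chunk.
example_tree = (
    "word_count",
    {
        "low": "salutation",
        "high": ("contains_year", {"yes": "dateline", "no": "paragraph"}),
    },
)

for conditions, class_name in tree_to_rules(example_tree):
    tests = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"if {tests} then {class_name}")
```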
3. System Architecture
Figure 2: System Architecture: First phase deals with the formation of map using WEBSOM algorithm. Second phase
deals with the mark-up of an incoming unmarked document using the inductive learning algorithm.
Our system combines the SOM algorithm and an inductive learning algorithm for the adaptive
automatic mark-up of documents in XML, as shown in Figure 2. The first phase of the system is
the formation of the map, which is implemented using the WEBSOM method described in
Section 2.1. The second phase has also been implemented as an independent automatic
mark-up system, and is described in Section 4. Eventually these two systems will be combined to
form the hybrid system described below.
Once a map of marked-up documents has been formed, an incoming document is automatically
mapped into the cluster of documents most similar to it. Our system then captures the mark-up
information from all the elements of neighbouring marked documents, and learns classifiers by
applying an inductive learning algorithm. The incoming document is then automatically marked-up
according to the learned classifiers.
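The step of mapping an incoming document and collecting its marked-up neighbours might be
sketched as below. The paper does not specify how neighbours are selected, so the Euclidean
best-matching unit, the square grid neighbourhood, the radius parameter and all names are
assumptions made for illustration.

```python
from math import dist


def best_matching_unit(prototypes, vector):
    """Return the grid position of the map unit whose prototype is nearest to vector."""
    return min(prototypes, key=lambda unit: dist(prototypes[unit], vector))


def neighbouring_documents(prototypes, docs_on_unit, vector, radius=1):
    """Collect the marked-up documents on units within `radius` grid steps of the BMU."""
    row, col = best_matching_unit(prototypes, vector)
    found = []
    for (r, c), docs in sorted(docs_on_unit.items()):
        if max(abs(r - row), abs(c - col)) <= radius:
            found.extend(docs)
    return found
```

The documents returned by such a lookup would then supply the training elements from which the
inductive learner derives its classifiers for the incoming document.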
Figure 3: A detailed architecture of the system
The system is adaptive in that it learns from its mark-up errors and adjusts the mark-up
accordingly. The detailed architecture of the hybrid system is shown in Figure 3.
4. Mark-up rule extraction and application
The rule-based system is fully automatic and comprises two main modules – a rule learner and a
mark-up module (see Figure 4).
4.1. Rule Learner
In this module, rules are learned automatically from a collection of XML (see Note 1) marked-up
documents using an inductive learning approach. The documents should have valid mark-up
conforming to a single Document Type Definition (DTD) and should come from a specific domain.
The training examples for our system comprise all the elements containing text. Rules are
learned for all the elements (having different tag names) from the set of training instances, which
is pre-classified by the tag name of the element. Each instance corresponds to an element
containing text from the collection of marked-up documents. The text enclosed between the start
and end tags of each element is encoded as a fixed-width feature vector. In our experiments we
used 22 numerical features, such as word count and character count, to encode the instances. The
encoded instances are processed by the inductive learning algorithm to develop classifiers, which
are used to identify text as the content of particular elements of the XML documents. For this
purpose we selected the C5/See5 learning algorithm. Sets of rules can be generated from
collections of XML marked-up documents from different domains and saved in separate files. The
rule set generated from the marked-up documents of a specific domain can then be consulted when
marking up an unmarked document from the same domain.
Figure 4: Automatic mark-up system
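The encoding of text elements as training instances can be sketched as follows. Only word count
and character count are named in the text; the other three features, the function names and the
sample tag names are assumptions, and the sketch uses five features rather than the 22 used in
the experiments.

```python
import xml.etree.ElementTree as ET


def encode_text(text):
    """Encode a piece of text as a fixed-width list of numerical features."""
    words = text.split()
    return [
        len(words),                        # word count (named in the text)
        len(text),                         # character count (named in the text)
        sum(ch.isdigit() for ch in text),  # digit count (assumed feature)
        sum(ch.isupper() for ch in text),  # upper-case letter count (assumed feature)
        text.count(","),                   # comma count (assumed feature)
    ]


def training_instances(xml_string):
    """Return (feature_vector, tag_name) pairs for every element containing text."""
    root = ET.fromstring(xml_string)
    instances = []
    for element in root.iter():
        text = (element.text or "").strip()
        if text:
            instances.append((encode_text(text), element.tag))
    return instances


# Hypothetical marked-up letter; the tag names are not taken from letter.dtd.
sample = "<letter><date>12 May 1930</date><salutation>Dear Tom,</salutation></letter>"
for features, tag in training_instances(sample):
    print(tag, features)
```

Each pair produced this way corresponds to one pre-classified case of the kind that would be
passed to the inductive learner, with the tag name acting as the class.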
4.2. Mark-up
The second module marks up in XML a text document whose structure is similar to that of the
marked-up documents used for generating the rules. The text of an unmarked document is divided
into chunks using delimiters such as blank lines. Each chunk is identified as the text of an
XML element (specified by its tag name) by applying the rules learned by the C5/See5 classifier.
The rules of the DTD (to which the collection of marked-up documents conforms) are then applied
to mark up the text document in XML. The mark-up produced by the system should
be valid according to the same DTD. Once a document is marked up, it can be validated to check
the accuracy of the mark-up.
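The chunking and tagging steps can be sketched as below. The classifier is passed in as a
function; `toy_classifier` is a hand-written stand-in for the learned C5/See5 rules, and the tag
names and the flat document structure are assumptions. Applying the DTD rules and validating the
result are omitted from the sketch.

```python
import re
import xml.etree.ElementTree as ET


def split_into_chunks(text):
    """Split plain text into chunks at blank lines, dropping empty chunks."""
    return [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]


def mark_up(text, classify, root_tag="letter"):
    """Wrap every chunk in the element named by classify(chunk) under one root."""
    root = ET.Element(root_tag)
    for chunk in split_into_chunks(text):
        child = ET.SubElement(root, classify(chunk))
        child.text = chunk
    return ET.tostring(root, encoding="unicode")


def toy_classifier(chunk):
    """Hypothetical stand-in for the learned rules."""
    if chunk.lower().startswith("dear"):
        return "salutation"
    return "paragraph"


print(mark_up("Dear Tom,\n\nThank you for your letter.", toy_classifier))
```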
4.3. Experiments
We have used documents from a few different domains (complying with simple DTDs) as an
initial test bed for our experiments. For example, marked-up letters from the MacGreevy archive
(Schreibman, 1998, 2000) are used by the system to learn classifiers for elements containing text.
The letters are marked up in XML according to a simple DTD, letter.dtd (Rees, 1999), to
generate training examples for our system. We have tested the system by marking up unmarked
letters from the same domain (see Figure 5).
Figure 5: Unmarked letter and valid XML markup of the same letter (produced by the system)
4.4. Performance and Analysis
In the earlier version of this system we worked with a set of well-formed documents comprising
letters from the MacGreevy archive. The C5/See5 algorithm was applied to 200 training cases and 50
test cases, and 98% accuracy was achieved. Each case represents one element
(containing text) from the collection of marked-up letters. We have also tested the system on the
elements of about 20 letters and achieved an accuracy of 94%. The accuracy is calculated as
the number of correctly marked-up elements expressed as a percentage of the total number of
elements in the tested letters. We are currently working with valid documents and hope to achieve
higher accuracy.
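The accuracy measure described above amounts to the following calculation. The function name and
the representation of the mark-up as parallel lists of tag names are assumptions, and the tags in
the usage note are made up rather than taken from the experiments.

```python
def markup_accuracy(predicted_tags, true_tags):
    """Percentage of elements whose predicted tag matches the reference tag."""
    if len(predicted_tags) != len(true_tags):
        raise ValueError("both lists must describe the same elements")
    if not true_tags:
        raise ValueError("at least one element is required")
    correct = sum(p == t for p, t in zip(predicted_tags, true_tags))
    return 100.0 * correct / len(true_tags)
```

For instance, if three of four elements receive the correct tag, the function returns 75.0.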
5. Conclusion
We have described a novel system and useful tool for automatically marking up documents in
XML. The system has a hybrid architecture that uses the WEBSOM method to arrange XML
marked-up documents on a Self-Organizing Map and an inductive learning algorithm to perform
automatic mark-up of text documents. The system automatically learns rules from the mark-up
information of the documents on the Self-Organizing Map. It then marks up the text document by
applying the learned rules. The system also learns from feedback and adjusts the mark-up to
improve results.
Acknowledgements
The system is part of the INTINN project, which is funded by Enterprise Ireland under the
Informatics Research Initiative.
Notes
1 An XML document consists of plain text and mark-up instructions, called tags. An XML element
is composed of a start tag, the content (either text or other elements) and an end tag. The whole
document is enclosed within one element, called the root; all other elements must be
completely enclosed in the root element. The DTD defines the structure of the document and
identifies which elements are allowed and how they should be nested.
References
Honkela, T., Kaski, S., Lagus, K. & Kohonen, T. (1996). Newsgroup Exploration with WEBSOM
method and browsing. Technical Report A32, Helsinki University of Technology, Laboratory of
Computer and Information Science, Espoo, Finland.
Kohonen, T. (1997a). Exploration of very large databases by self-organizing maps. In Proceedings
of ICNN'97, International Conference on Neural Networks. PL1-PL6. IEEE Service Center:
Piscataway, NJ.
Kohonen, T. (1997b). Self-Organizing Maps. Springer, Berlin, Heidelberg.
Quinlan, J. R. (1979). Discovering rules by induction from large collection of examples. In D.
Michie (ed.), Expert Systems in the Micro Electronic Age. Edinburgh University Press, Edinburgh,
UK.
Quinlan, J. R. (1993). C4.5: Programs For Machine Learning. Morgan Kaufmann Publishers, San
Mateo, Calif.
Quinlan, J. R. (2000). Data Mining Tools See5 and C5.0. [http://www.rulequest.com/see5-info.html]
Rees, Michael. (1999). ZDU XML Course. [http://comet.it.bond.edu.au:8000/xmlzdu/dtd/letter.dtd]
Schreibman, S. (1998). The MacGreevy Archive. [http://www.ucd.ie/~cosei/archive.htm]
Schreibman, S. (2000). The MacGreevy Archive. [http://jafferson.village.Virginia.edu/macgreevy]