Figure 4 - uploaded by John Dunnion
Automatic mark-up system 

Source publication
Article
We introduce a novel two-stage automatic XML mark-up system, which combines the WEBSOM approach to document categorisation with the C5 inductive learning algorithm. The WEBSOM method clusters the XML marked-up documents such that semantically similar documents lie close together on a Self-Organising Map (SOM). The C5 algorithm automa...

Context in source publication

Context 1
... basic idea of C5 is the same as that of ID3: it uses the concept of Gain to produce a classifier in the form of a decision tree according to the previously chosen classification. The Gain can be expressed as the effective decrease in entropy, which helps in choosing the attributes at the different levels of the tree. C5 also deals with unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, etc. Simplified decision trees are easily understood by human beings, but occasionally decision trees, even though accurate, grow to unwieldy proportions and become too complex to be understood. C5 generates a set of production rules from a decision tree. These rules express the classification model better than trees. The advantages of the C5 algorithm are that it is very fast, it is not sensitive to missing features, it can deal with a large number of features and it is incremental.

Our system combines the techniques of the SOM algorithm and an inductive learning algorithm for the adaptive automatic mark-up of documents in XML, as shown in Figure 2. The first phase of the system is the formation of the map, which is implemented using the WEBSOM method, described in section 2.1. The second phase of the system is also implemented as an independent automatic mark-up system, and is described in section 4. Eventually these two systems will be combined to form the hybrid system described below. Once a map of marked-up documents has been formed, an incoming document is automatically mapped into the cluster of documents most similar to it. Our system then captures the mark-up information from all the elements of the neighbouring marked-up documents, and learns classifiers by applying an inductive learning algorithm. The incoming document is then automatically marked up according to the learned classifiers. The system is adaptive in that it learns from its mark-up errors and makes changes in the mark-up. The detailed architecture of the hybrid system is shown in Figure 3.
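To make the Gain criterion from the C5 discussion above concrete, the following is a minimal sketch of entropy and information gain for discrete attributes, in the style of ID3. It is not the C5/See5 implementation, which additionally handles continuous attributes, unavailable values and pruning. The attribute names (`length`, `has_digits`), the tag names used as class labels and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, labels, attribute):
    """Decrease in entropy obtained by splitting the data on one discrete attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy instances: each row describes one text element,
# and each label is the tag name of that element.
rows = [
    {"length": "short", "has_digits": "yes"},
    {"length": "short", "has_digits": "no"},
    {"length": "long", "has_digits": "no"},
    {"length": "long", "has_digits": "yes"},
]
labels = ["date", "salutation", "paragraph", "paragraph"]

# The attribute with the highest gain would be tested at the root of the tree.
best = max(["length", "has_digits"], key=lambda a: gain(rows, labels, a))
```

On this toy data the split on `length` removes more entropy than the split on `has_digits`, so `length` would be chosen first.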
The rule-based system is fully automatic and comprises two main modules: a rule learner and a mark-up module (see Figure 4). In the rule learner, rules are learned automatically from a collection of XML marked-up documents using an inductive learning approach. The documents should have valid mark-up following the rules of a single Document Type Definition (DTD) and should be from a specific domain. The training examples for our system comprise all the elements containing text. Rules are learned for all the elements (having different tag names) from the set of training instances, which is pre-classified by the tag name of the element. Each instance corresponds to an element containing text from the collection of marked-up documents. The text enclosed between the start and end tags of each element is encoded as a fixed-width feature vector. We have used 22 numerical features, such as word count, character count, etc., to encode the instances in our experiments. The encoded instances are processed by the inductive learning algorithm to develop classifiers, which are used to discriminate the text as the content of different elements of the XML documents. For this purpose we have selected the C5/See5 learning algorithm. Sets of rules can be generated from collections of XML marked-up documents (from different domains) and saved in separate files. The appropriate rule set, generated from the marked-up documents of a specific domain, can be consulted when marking up an unmarked document from the same domain.

The second module marks up in XML a text document that has a structure similar to the marked-up documents used for generating the rules. The text of an unmarked document is divided into chunks using delimiters such as blank lines. These chunks of text are discriminated as the text of XML elements (specified by tag name) by applying the rules learned by the C5/See5 classifier.
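The chunking and encoding steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: only word count and character count are named in the text as features, so the other two features here (digit count and capitalised-word count) are assumed stand-ins for the remaining 20, and the sample letter text is invented.

```python
import re


def split_into_chunks(text):
    """Divide plain text into chunks, using blank lines as delimiters."""
    chunks = re.split(r"\n\s*\n", text.strip())
    return [c.strip() for c in chunks if c.strip()]


def encode(chunk):
    """Encode a chunk of text as a fixed-width vector of numerical features."""
    words = chunk.split()
    return [
        len(words),                           # word count (named in the text)
        len(chunk),                           # character count (named in the text)
        sum(ch.isdigit() for ch in chunk),    # digit count (assumed feature)
        sum(w[0].isupper() for w in words),   # capitalised words (assumed feature)
    ]


# Invented example of an unmarked letter.
letter = "15 May 1930\n\nDear Sir,\n\nThank you for your letter. I hope to see you soon."

chunks = split_into_chunks(letter)
vectors = [encode(c) for c in chunks]
```

Each vector has the same width regardless of the length of the chunk, which is what allows a learner such as C5/See5 to treat every element as one instance. In training, each vector would be paired with the tag name of the element it came from.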
Further, by applying the rules of the DTD (to which the collection of marked-up documents conforms), the text document is marked up in XML. The mark-up produced by the system should be valid according to the same DTD. Once a document is marked up, it can be validated to check the accuracy of the mark-up.

We have used documents from a few different domains (complying with the rules of simple DTDs) as an initial test bed for our experiments. For example, marked-up letters from the MacGreevy archive (Schreibman, 1998, 2000) are used by the system to learn classifiers for elements containing text. The letters are marked up in XML according to a simple DTD, letter.dtd (Rees, 1999), to generate training examples for our system. We have tested the system by marking up unmarked letters from the same domain (see Figure 5). In the earlier version of this system we worked with a set of well-formed documents comprising letters from the MacGreevy archive. The C5/See5 algorithm was applied to 200 training and 50 test cases, and 98% accuracy was achieved. Each case in our system represents one element (containing text) from the collection of marked-up letters. We have also tested it on the elements of about 20 letters and achieved an accuracy rate of 94%. The accuracy rate is calculated as the number of correctly marked-up elements expressed as a percentage of the total number of elements in the tested letters. We are currently working with valid documents and hope to achieve higher accuracy.

We have described a novel system and useful tool for automatically marking up documents in XML. The system has a hybrid architecture that uses the WEBSOM method to arrange the XML marked-up documents on a Self-Organising Map and an inductive learning algorithm to perform automatic mark-up of text documents. The system automatically learns rules from the mark-up information of the documents on the Self-Organising Map. It then marks up the text document by applying the learned rules.
The system also learns from feedback and makes changes in the mark-up to improve results. The system is part of the INTINN project, which is funded by Enterprise Ireland under the Informatics Research Initiative.

Honkela, T., Kaski, S., Lagus, K. & Kohonen, T. (1996). Newsgroup Exploration with WEBSOM method and browsing. Technical Report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo; ...
