Conference Paper

The Method of Semi-supervised Automatic Keyword Extraction for Web Documents using Transition Probability Distribution Generator


Abstract

In this paper, we introduce a semi-supervised method for automatic keyword extraction from web documents, based on an unconventional Markov chain whose conditional transition matrices for each distinct feature are produced by a Transition Probability Distribution Generator (TPDG). Since keywords are the set of the most appropriate and relevant words that define the context of a document precisely and concisely, many applications that derive high-quality information from text, such as text data mining, text analytics, and other natural language processing tasks, can take advantage of them. The conditional transition matrices for each distinct feature are the novel element of the model: they rely on the characteristics of keywords and the probability distribution of each feature over the state space in order to learn the sequential behavior of keywords across various web documents. The experimental results show that the proposed method outperforms the baseline keyword extraction methods both quantitatively and semantically.
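The TPDG itself is not specified in detail here, so the sketch below is only a minimal reading of "one conditional transition matrix per distinct feature": it estimates, per binary feature (the `in_title` flag is a hypothetical example, not a feature name from the paper), the transition probabilities between (feature value, keyword label) states observed along labeled token sequences. It is an illustration of the idea, not the authors' implementation.

```python
from collections import defaultdict

def transition_matrices(sequences):
    """Estimate one conditional transition matrix per feature.

    `sequences` is a list of token sequences; each token is a dict with a
    binary value per feature name (e.g. the hypothetical 'in_title') plus
    a 'keyword' label.  Returns {feature: {(prev_state, state): prob}},
    where a state is the (feature_value, keyword_label) pair observed at
    a position -- so each feature gets its own Markov transition matrix.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            for feat in cur:
                if feat == 'keyword':          # label, not a feature
                    continue
                prev_state = (prev[feat], prev['keyword'])
                cur_state = (cur[feat], cur['keyword'])
                counts[feat][(prev_state, cur_state)] += 1
                totals[feat][prev_state] += 1
    return {
        feat: {pair: n / totals[feat][pair[0]] for pair, n in pairs.items()}
        for feat, pairs in counts.items()
    }
```

At prediction time, such per-feature matrices could score how "keyword-like" the label sequence of a new document is; that scoring step is omitted here.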


... The details of its setup, along with the corpus utilized during the experiments, are given in Sections 4.1 and 4.2, respectively. The experimental results are compared with those of Rapid Automatic Keyword Extraction (RAKE) [48,49]; the details are given in Section 4.4. ...
Article
Full-text available
A keyphrase extraction technique endeavors to extract quality keyphrases from a given document, which provide a high-level summary of that document. Apart from statistical keyphrase extraction approaches, all other approaches are either domain-dependent or require a sufficient amount of training data, which is rare at present. Therefore, this paper proposes a new tree-based automatic keyphrase extraction technique that is domain-independent, employs only nominal statistical knowledge, and requires no training data. The proposed technique extracts a quality keyphrase by forming a tree from a candidate keyphrase, which is later expanded, shrunk, or left in the same state depending on other similar candidate keyphrases. At the end, keyphrases are extracted from the resultant trees based on a value µ, the Maturity Index (MI) of a node in the tree, which makes the process flexible: a small µ value yields many and/or lengthy keyphrases (greedy approach), whereas a large µ value yields fewer and/or shorter keyphrases (conservative approach). A user can therefore extract the desired level of keyphrases by tuning the µ value. The effectiveness of the proposed technique is evaluated on an actual corpus and compared with the Rapid Automatic Keyphrase Extraction (RAKE) technique.
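The tree construction itself is not reproduced here; as a heavily simplified toy (every name and the support-ratio rule are assumptions, not the paper's algorithm), the expand/shrink decision driven by a threshold µ can be caricatured as: keep a longer candidate phrase only when enough similar candidates support it, otherwise shrink to its head word.

```python
from collections import Counter

def extract_keyphrases(candidates, mu=0.5):
    """Toy sketch of the expand/shrink idea: group candidate phrases by
    head word; keep the most common full phrase in a group only when its
    support ratio reaches mu (playing the role of the Maturity Index),
    otherwise shrink the group to the head word alone."""
    by_head = {}
    for cand in candidates:
        by_head.setdefault(cand.split()[0], []).append(cand)
    keep = []
    for head, group in by_head.items():
        best, n = Counter(group).most_common(1)[0]
        keep.append(best if n / len(group) >= mu else head)
    return sorted(set(keep))
```

A small µ keeps long phrases easily (greedy); a large µ shrinks aggressively (conservative), mirroring the tuning behavior described above.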
... Instead of constructing a lexical chain from every noun phrase in the text, our experimental results show that the extracted candidate keywords deliver a more efficient summary with better performance. Thus, we built a system, Lynn et al. (2016), to extract promising keywords from the text based on the characteristics of manually assigned keywords. The system consists of a generator that produces the probability distributions of three distinct features of assigned keywords: the occurrence of a term in the title, the occurrence of a term in the first sentence, and an above-average Karen (1972) score of a term. ...
Article
Full-text available
Research has increasingly converged on automatic text summarization as the number of text documents grows constantly with the expansion of information diffusion. The objective of this proposal is to obtain the most reliable and substantial context, or the most relevant brief summary, of a text in an extractive manner. Extractive text summarization produces a short summary containing the most important information of the original text by extracting a set of sentences from the original document. This paper proposes an improved extractive text summarization method that enhances the conventional lexical chain method to capture more relevant information using three distinct features, or characteristics, of keywords in a text. The keywords of the document are labeled using our previous work, the transition probability distribution generator model, which learns the characteristics of keywords in a document and generates their probability distribution for each feature.
Chapter
Keyword indexing is the problem of assigning keywords to text documents. It is an important task as keywords play crucial roles in several information retrieval tasks. The problem is also challenging as the number of text documents is increasing, and such documents come in different forms (i.e., scientific papers, online news articles, and microblog posts). This chapter provides an overview of keyword indexing and elaborates on keyword extraction techniques. The authors provide the general motivations behind the supervised and the unsupervised keyword extraction and enumerate several pioneering and state-of-the-art techniques. Feature engineering, evaluation metrics, and benchmark datasets used to evaluate the performance of keyword extraction systems are also discussed.
Article
In recent years, the number of digital information resources offered on different topics has increased enormously. Most systems that provide access to these digital information resources focus on browsing, searching, and information retrieval tools. Digital libraries, electronic collections, and Web pages offer many new opportunities to improve information access and to build and organize document collections hierarchically according to different key criteria. By using software-based services to organize, index, and summarize documents accessible through information retrieval techniques, different search tools can offer more comprehensive document coverage. The technologies applied to search mechanisms in digital libraries have made it necessary to use different methods and technologies for managing document collections, extracting meaningful data, and determining document relationships. In particular, the relationships between documents cannot be explicitly defined by either their forms or their types. This work presents a comprehensive survey of the methods and techniques used for extracting metadata from document content, obtaining named entities, obtaining keywords, and establishing document similarities for digital libraries.
Article
Full-text available
In this paper, we introduce an unsupervised stochastic statistical approach for ranking key-phrases and identifying the salient sentences within a single document for generic extractive summaries. In particular, we propose a method to perceive the salient information of a text unit in relation to the corresponding title, and its leverage depending on the sentence position in the text. Furthermore, the proposed method not only improves computational time and speed but also retains the substantial information of a document. The experimental results suggest that the proposed method significantly outperforms the baseline approaches in both keyword extraction and summary sentence extraction.
Chapter
Full-text available
This paper introduces a novel and domain-independent method for automatically extracting keywords, as sequences of one or more words, from individual documents. We describe the method's configuration parameters and algorithm, and present an evaluation on a benchmark corpus of technical abstracts. We also present a method for generating lists of stop words for specific corpora and domains, and evaluate its ability to improve keyword extraction on the benchmark corpus. Finally, we apply our method of automatic keyword extraction to a corpus of news articles and define metrics for characterizing the exclusivity, essentiality, and generality of extracted keywords within a corpus.
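The method described above is RAKE; its core degree/frequency scoring can be sketched as follows (an unofficial sketch, not the authors' reference implementation, with a toy stop-word list purely for illustration):

```python
import re

STOPWORDS = {'a', 'an', 'the', 'of', 'and', 'or', 'is', 'are', 'for', 'to', 'in'}

def rake(text, stopwords=STOPWORDS):
    """Minimal RAKE-style scorer: candidates are maximal runs of
    non-stopwords; each word scores degree/frequency, and a phrase
    scores the sum of its word scores."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, cur = [], []
    for w in words:                    # split candidates on stop words
        if w in stopwords:
            if cur:
                phrases.append(cur)
            cur = []
        else:
            cur.append(w)
    if cur:
        phrases.append(cur)
    freq, degree = {}, {}
    for ph in phrases:
        for w in ph:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(ph)   # word degree incl. itself
    scored = {' '.join(ph): sum(degree[w] / freq[w] for w in ph) for ph in phrases}
    return sorted(scored, key=scored.get, reverse=True)
```

The degree/frequency ratio favors words that co-occur inside longer candidate phrases, which is why RAKE tends to surface multi-word keyphrases.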
Conference Paper
Full-text available
A common strategy to assign keywords to documents is to select the most appropriate words from the document text. One of the most important criteria for a word to be selected as keyword is its relevance for the text. The tf.idf score of a term is a widely used relevance measure. While easy to compute and giving quite satisfactory results, this measure does not take (semantic) relations between words into account. In this paper we study some alternative relevance measures that do use relations between words. They are computed by defining co-occurrence distributions for words and comparing these distributions with the document and the corpus distribution. We then evaluate keyword extraction algorithms defined by selecting different relevance measures. For two corpora of abstracts with manually assigned keywords, we compare manually extracted keywords with different automatically extracted ones. The results show that using word co-occurrence information can improve precision and recall over tf.idf.
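For context on the tf.idf baseline the study above compares against, a minimal tf.idf scorer over pre-tokenized documents might look like this (using the plain log(N/df) variant; real systems often smooth the idf term):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf.idf relevance scores per document; `docs` are lists of tokens.
    Returns one {word: score} dict per input document."""
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return out
```

Note that a word occurring in every document gets idf = log(1) = 0, which is exactly the weakness the co-occurrence-based relevance measures above try to improve on: tf.idf sees no relations between words at all.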
Article
Full-text available
Keywords can be considered as condensed versions of documents and short forms of their summaries. In this paper, the problem of automatic extraction of keywords from documents is treated as a supervised learning task. A lexical chain holds a set of semantically related words of a text and it can be said that a lexical chain represents the semantic content of a portion of the text. Although lexical chains have been extensively used in text summarization, their usage for keyword extraction problem has not been fully investigated. In this paper, a keyword extraction technique that uses lexical chains is described, and encouraging results are obtained.
Article
Full-text available
In this paper, experiments on automatic extraction of keywords from abstracts using a supervised machine learning algorithm are discussed. The main point of this paper is that by adding linguistic knowledge to the representation (such as syntactic features), rather than relying only on statistics (such as term frequency and n-grams), a better result is obtained as measured against keywords previously assigned by professional indexers. In more detail, extracting NP-chunks gives better precision than n-grams, and by adding the POS tag(s) assigned to the term as a feature, a dramatic improvement of the results is obtained, independent of the term selection approach applied.
Article
Full-text available
This paper explains a keyword extraction algorithm based solely on a single document. First, frequent terms are extracted. Co-occurrences of a term and the frequent terms are counted. If a term appears frequently with a particular subset of terms, the term is likely to have important meaning. The degree of bias of the co-occurrence distribution is measured by the χ²-measure. We show that our keyword extraction performs well without the need for a corpus. In this paper, a term is defined as a word or a word sequence; we do not intend to limit the meaning in a terminological sense. A word sequence is written as a phrase.
Article
Full-text available
Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited.
Article
Full-text available
We present a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first; then a set of co-occurrences between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. The co-occurrence distribution indicates the importance of a term in the document as follows: if the probability distribution of co-occurrence between term a and the frequent terms is biased toward a particular subset of the frequent terms, then term a is likely to be a keyword. The degree of bias of the distribution is measured by the χ²-measure. Our algorithm shows performance comparable to tf.idf without using a corpus.
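A hedged sketch of this χ²-bias idea, assuming sentence-level co-occurrence and a toy notion of "frequent terms" (the paper's refinements, such as clustering the frequent terms, are omitted):

```python
from collections import Counter, defaultdict

def chi2_scores(sentences, top_k=3):
    """Score terms by the chi^2 bias of their co-occurrence with the
    top_k most frequent terms; `sentences` are lists of tokens."""
    tf = Counter(w for s in sentences for w in set(s))
    frequent = {w for w, _ in tf.most_common(top_k)}
    cooc = defaultdict(Counter)
    for s in sentences:                       # sentence-level co-occurrence
        uniq = set(s)
        for w in uniq:
            for g in uniq & frequent:
                if g != w:
                    cooc[w][g] += 1
    total_g = sum(tf[g] for g in frequent)
    scores = {}
    for w, cnts in cooc.items():
        n_w = sum(cnts.values())
        score = 0.0
        for g in frequent:
            if g == w:
                continue
            expected = n_w * tf[g] / total_g  # co-occurrence under no bias
            if expected:
                score += (cnts[g] - expected) ** 2 / expected
        scores[w] = score
    return scores
```

A term whose co-occurrence concentrates on one frequent term scores high; a term that co-occurs with the frequent terms in proportion to their overall frequency scores near zero.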
Conference Paper
Considering the urgent need to promote Chinese information retrieval, this paper highlights the significance of keyword extraction using a new PAT-tree-based approach, which is efficient for automatic keyword extraction from a set of relevant Chinese documents. This approach has been successfully applied in several IR tasks, such as document classification, book indexing, and relevance feedback. Many Chinese language processing applications can therefore step ahead from the character level to the word/phrase level.
Article
A standard feature in cataloging documents is the list of keywords. When the source documents are web pages, we can attempt to aid the cataloger by analyzing the page and presenting relevant support material. Since the keywords that occur in a document generally occur in keyphrases, and keyphrases provide contextual material for reviewing candidate keywords, they are a natural aggregate to extract and present to the cataloger. This paper describes PhraseRate, which is an interactive aid for keyword extraction, designed to assist human classifiers in the Infomine Project (http://infomine.ucr.edu/). In particular, it introduces a novel keyphrase extraction heuristic for web pages which requires no training, but instead is based on the assumption that most well written webpages "suggest" keyphrases based on their internal structure. It is very fast, flexible, and its results compare favorably with the state of the art in keyphrase extraction.
Article
A good deal of work has been done over the years in an attempt to use statistical or probabilistic techniques as a basis for automatic indexing and content analysis. (1–10) Unfortunately, many of these methods are lacking in effectiveness, and the more refined procedures are computationally unattractive. A new technique, known as discrimination value analysis, ranks the text words in accordance with how well they are able to discriminate the documents of a collection from each other; that is, the value of a term depends on how much the average separation between individual documents changes when the given term is assigned for content identification. The best words are those which achieve the greatest separation. The discrimination value analysis is computationally simple, and it assigns a specific role in content analysis to single words, juxtaposed words and phrases, and word groups or thesaurus categories. Experimental results are given showing the effectiveness of the technique.
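The discrimination value of a term can be sketched as the change in document-space density (here taken as average cosine similarity to the centroid, which is one common reading rather than necessarily the paper's exact definition) when the term is removed from all documents:

```python
import math

def density(vectors):
    """Average cosine similarity of documents (sparse {term: count}
    dicts) to their centroid -- how tightly packed the collection is."""
    dims = {d for v in vectors for d in v}
    centroid = {d: sum(v.get(d, 0) for v in vectors) / len(vectors) for d in dims}

    def cos(a, b):
        dot = sum(a.get(d, 0) * b.get(d, 0) for d in dims)
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    return sum(cos(v, centroid) for v in vectors) / len(vectors)

def discrimination_value(term, docs):
    """DV = density without the term minus density with it: a positive
    value means the term spreads documents apart, i.e. it is a good
    discriminator; a negative value means it blurs them together."""
    without = [{w: c for w, c in d.items() if w != term} for d in docs]
    return density(without) - density(docs)
```

A term appearing in every document tends to get a negative value (removing it separates the documents), while a term confined to few documents gets a positive one, matching the ranking behavior described above.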
Article
Written communication of ideas is carried out on the basis of statistical probability in that a writer chooses that level of subject specificity and that combination of words which he feels will convey the most meaning. Since this process varies among individuals and since similar ideas are therefore relayed at different levels of specificity and by means of different words, the problem of literature searching by machines still presents major difficulties. A statistical approach to this problem will be outlined and the various steps of a system based on this approach will be described. Steps include the statistical analysis of a collection of documents in a field of interest, the establishment of a set of “notions” and the vocabulary by which they are expressed, the compilation of a thesaurus-type dictionary and index, the automatic encoding of documents by machine with the aid of such a dictionary, the encoding of topological notations (such as branched structures), the recording of the coded information, the establishment of a searching pattern for finding pertinent information, and the programming of appropriate machines to carry out a search.
Conference Paper
This paper is concerned with keyword extraction, by which we mean extracting a subset of words/phrases from a document that can describe the 'meaning' of the document. Keywords benefit many text mining applications; however, a large number of documents do not have keywords, so it is necessary to assign keywords before these benefits can be enjoyed. Several research efforts have addressed keyword extraction. These methods make use of 'global context information' only, which restricts extraction performance; a thorough and systematic investigation of the issue is thus needed. In this paper, we propose to make use not only of 'global context information' but also of 'local context information' for extracting keywords from documents. As far as we know, utilizing both 'global context information' and 'local context information' in keyword extraction has not been sufficiently investigated previously. Methods for performing the tasks on the basis of Support Vector Machines are also proposed, and the features of the model are defined. Experimental results indicate that the proposed SVM-based method can significantly outperform the baseline methods for keyword extraction. The proposed method has been applied to document classification, a typical text mining process, and experimental results show that classification accuracy can be significantly improved by using the keyword extraction method.
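The global/local distinction can be made concrete with a feature builder like the one below; the specific features chosen here are illustrative assumptions, not the paper's feature set, and the resulting vectors would be fed to a linear classifier such as an SVM:

```python
import math

def keyword_features(word, sent_idx, sentences, corpus_df, n_docs):
    """Combine global context (whole-document and corpus statistics)
    with local context (the containing sentence) into one feature
    vector for a word occurrence."""
    doc = [w for s in sentences for w in s]
    tf = doc.count(word) / len(doc)                        # global: term frequency
    idf = math.log(n_docs / (1 + corpus_df.get(word, 0)))  # global: corpus rarity
    first_sentence = 1.0 if sent_idx == 0 else 0.0         # local: in first sentence
    position = sent_idx / max(len(sentences) - 1, 1)       # local: relative position
    sent_tf = sentences[sent_idx].count(word) / len(sentences[sent_idx])  # local
    return [tf, idf, first_sentence, position, sent_tf]
```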
Conference Paper
Many conventional approaches to text analysis and information retrieval prove ineffective when large text collections must be processed in heterogeneous subject areas. An alternative text manipulation system is outlined, useful for the retrieval of large heterogeneous texts and for the recognition of content similarities between text excerpts, based on flexible text matching procedures carried out in several contexts of different scope. The methods are illustrated by search experiments...
Article
A method of drawing index terms from text is presented. The approach uses no stop list, stemmer, or other language- and domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, which the author calls “highlights,” are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Some experimental results are presented, showing operation in English, Spanish, German, Georgian, Russian, and Japanese. © 1995 John Wiley & Sons, Inc.
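The n-gram idea above can be sketched as scoring words by the corpus frequency of their character n-grams, which groups inflected forms much as a stemmer would without any language-specific rules; the exact scoring formula here is an assumption for illustration:

```python
from collections import Counter

def ngram_highlights(text, n=4, top=5):
    """Language-independent index terms via character n-gram counts:
    score each word by the summed counts of its n-grams, length-
    normalized, so word families sharing most n-grams reinforce
    each other (a stemmer-like effect with no stemmer)."""
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    words = set(text.lower().split())

    def score(w):
        s = f" {w} "                      # pad so word boundaries count
        return sum(grams[s[i:i + n]] for i in range(len(s) - n + 1)) / max(len(w), 1)

    return sorted(words, key=score, reverse=True)[:top]
```

Because only raw character statistics are used, the same code runs unmodified on the languages listed above, which is the point of the approach.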
Article
Keywords are a subset of words or phrases from a document that can describe its meaning, and many text mining applications can take advantage of them. Unfortunately, a large portion of documents still do not have keywords assigned, while manual assignment of high-quality keywords is expensive, time-consuming, and error-prone. Therefore, many algorithms and systems that help people perform automatic keyword extraction have been proposed. The Conditional Random Fields (CRF) model is a state-of-the-art sequence labeling method, which can use the features of documents more sufficiently and effectively; at the same time, keyword extraction can be treated as string labeling. In this paper, keyword extraction based on CRF is proposed and implemented. As far as we know, using the CRF model for keyword extraction has not been investigated previously. Experimental results show that the CRF model outperforms other machine learning methods, such as the support vector machine and the multiple linear regression model, in the task of keyword extraction.
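Casting keyword extraction as string labeling, as this paper does with a CRF, starts from a token-level label encoding such as BIO; the sketch below shows only that encoding step (training an actual CRF on these labels is out of scope here, and BIO is one common choice rather than necessarily the paper's):

```python
def bio_encode(tokens, keyphrases):
    """Mark each token B (begins a keyphrase), I (inside one), or
    O (outside) -- the label sequence a CRF would be trained to
    predict from token features."""
    labels = ['O'] * len(tokens)
    for phrase in keyphrases:
        parts = phrase.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                labels[i] = 'B'
                for j in range(i + 1, i + len(parts)):
                    labels[j] = 'I'
    return labels
```

At decoding time, runs of B followed by I are read back out of the predicted label sequence as the extracted keyphrases.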