Classification of Brand Names Based on n-Grams
SoCPaR'10: IEEE Conference on Soft Computing and Pattern Recognition
12/2010; DOI: 10.1109/SOCPAR.2010.5685842
Supervised classification has been extensively addressed in the literature because it has many applications, especially in text categorization and web content mining, where data are organized in a hierarchy. The automatic analysis of brand names can be viewed as a special case of text processing, although such names differ markedly from conventional text: they are often neologisms and cannot easily be handled by existing NLP tools. In our framework, we aim to analyze such names automatically and to determine to what extent they are related to concepts that are hierarchically organized. The system is based on character n-grams. The targeted system is meant to help, for instance, to determine automatically whether a name sounds related to ecology.
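To make the idea of character n-grams concrete, the following is a minimal sketch, not the system described in the paper. It assumes trigrams, lowercasing, and `_` padding at word boundaries, and it scores a name by the fraction of its n-grams already seen in names attached to a concept. The function names (`char_ngrams`, `concept_profile`, `overlap_score`) and the sample names are hypothetical.

```python
from collections import Counter


def char_ngrams(name, n=3):
    """Return the character n-grams of a name (lowercased, padded with '_')."""
    padded = "_" * (n - 1) + name.lower() + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def concept_profile(names, n=3):
    """Count the n-grams of all names known to belong to one concept."""
    counts = Counter()
    for name in names:
        counts.update(char_ngrams(name, n))
    return counts


def overlap_score(name, profile, n=3):
    """Fraction of the name's n-grams that also occur in the concept profile."""
    grams = char_ngrams(name, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


# Hypothetical names standing in for an "ecology" concept.
ecology = concept_profile(["ecolife", "greeneco"])
print(char_ngrams("eco"))
print(overlap_score("ecoland", ecology), overlap_score("zyxx", ecology))
```

An overlap ratio is only one possible scoring choice; a trained classifier over n-gram counts, applied at each level of the concept hierarchy, would be a natural alternative.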
Available from: Mathieu Lafourcade
- "Therefore, the approach is to use the existing identifier names with predefined concepts (via the names of their packages) to predict the concept of an ambiguous identifier. As proposed in our previous works, text classification models are useful to find the related concepts of a new word where the new word is a combination of existing words with predefined concepts."
ABSTRACT: Identifier names (e.g., packages, classes, methods, variables) are one of the most important sources for software comprehension. Identifier names need to be analyzed in order to support collaborative software engineering and source code reuse. Indeed, they convey the domain concepts of the software. For instance, "getMinimumSupport" would be associated with the association rule concept in data mining software, while others are difficult to recognize, as when parts of words are mixed (e.g., "initFeatSet"). We thus propose methods for assisting automatic software understanding by classifying identifier names into domain concept categories. An innovative solution based on data mining algorithms is proposed. Our approach aims to learn character patterns of identifier names. The main challenges are (1) to automatically split identifier names into relevant constituent subnames and (2) to build a model associating such a set of subnames with predefined domain concepts. For this purpose, we propose a novel way of splitting such identifiers into their constituent words and use N-gram-based text classification to predict the related domain concept. In this article, we report the theoretical method and the algorithms we propose, together with experiments run on real software source code that show the value of our approach.
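The two challenges above can be illustrated with a short sketch. It is not the novel splitting method the abstract refers to: it assumes a plain camelCase and underscore convention, which a regular expression can handle, and then extracts character trigrams from each subname as features for a classifier. The helper names `split_identifier` and `subname_ngrams` are hypothetical.

```python
import re


def split_identifier(name):
    """Split an identifier on camelCase, acronym, digit and underscore boundaries."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def subname_ngrams(name, n=3):
    """Collect character n-grams from each subname; short subnames are kept whole."""
    grams = []
    for sub in split_identifier(name):
        if len(sub) < n:
            grams.append(sub)
        else:
            grams.extend(sub[i:i + n] for i in range(len(sub) - n + 1))
    return grams


print(split_identifier("getMinimumSupport"))  # ['get', 'minimum', 'support']
print(split_identifier("initFeatSet"))        # ['init', 'feat', 'set']
print(subname_ngrams("initFeatSet"))          # ['ini', 'nit', 'fea', 'eat', 'set']
```

Note that this convention-based split leaves abbreviated subnames such as "feat" unresolved; mapping them to a domain concept is where character n-gram features are expected to help, since "feat" shares trigrams with "feature".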