Classification of Brand Names Based on n-Grams

SoCPaR'10: IEEE Conference on Soft Computing and Pattern Recognition 01/2010; DOI: 10.1109/SOCPAR.2010.5685842

ABSTRACT Supervised classification has been extensively addressed in the literature, as it has many applications, especially in text categorization and web content mining, where data are organized through a hierarchy. The automatic analysis of brand names can be viewed as a special case of text management, although such names differ greatly from classical textual data: they are often neologisms and cannot easily be handled by existing NLP tools. In our framework, we aim to analyze such names automatically and to determine to what extent they are related to concepts that are hierarchically organized. The system is based on character n-grams. It is meant to help, for instance, to determine automatically whether a name sounds related to ecology.
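The character n-gram idea behind the system can be illustrated with a minimal sketch. The function names, the example brand name "Ecovia", and the simple overlap score below are illustrative assumptions, not the paper's actual method: the point is only that a neologism shares character n-grams with the vocabulary of a concept even when it is not a dictionary word.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, with padding at word boundaries."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def overlap_score(name, concept_terms, n=3):
    """Fraction of the name's n-grams also found in a concept's vocabulary.

    This is a deliberately naive score for illustration; the paper's
    classifier is more elaborate.
    """
    name_grams = char_ngrams(name, n)
    concept_grams = Counter()
    for term in concept_terms:
        concept_grams.update(char_ngrams(term, n))
    shared = sum(c for g, c in name_grams.items() if g in concept_grams)
    total = sum(name_grams.values())
    return shared / total if total else 0.0

# A coined name like "Ecovia" shares n-grams ("eco", " ec", ...) with
# ecology-related terms, and almost none with unrelated vocabulary.
eco_score = overlap_score("Ecovia", ["ecology", "ecosystem", "green"])
fin_score = overlap_score("Ecovia", ["finance", "bank"])
```

A hierarchy of concepts could then be handled by computing such a score against the vocabulary attached to each node.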

Related publications:

  • Machine Learning. 01/1997; 29:103-130.
  • ABSTRACT: The paper presents an approach to classifying Web documents into a large topic ontology. The main emphasis is on a simple approach suited to handling a large ontology, enriched with additional information on the Web page context obtained from the link structure of the Web. The context is generated from the in-coming and out-going links of the Web document to be classified (the target document): a document is represented not only by its own text, but also by the text of the documents pointing to it and the text of the documents it points to. The idea is that the enriched data compensate for the simplicity of the approach while keeping it efficient and capable of handling a large topic ontology.
    CIT. 12/2005; 13:279-285.
  • ABSTRACT: N-gram-based representations of documents have several distinct advantages for document processing tasks. First, they provide a more robust representation in the face of grammatical and typographical errors in the documents. Second, N-gram representations require no linguistic preparation such as word stemming or stop-word removal, making them ideal in situations requiring multi-language operations. Vector processing retrieval models also have unique advantages for information retrieval tasks: they provide a simple, uniform representation for documents and queries, an intuitively appealing document similarity measure, and good retrieval performance. In this work, we combine these two ideas by using a vector processing model for documents and queries, with N-gram frequencies as the basis for the vector element values instead of the more traditional term frequencies. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without the need for any word stemming or stop-word removal. We have also begun testing the system on Spanish-language documents.
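The last abstract's combination of a vector space model with character n-gram frequencies can be sketched as follows. The documents, query, and helper names here are illustrative assumptions, not taken from the cited work; the sketch only shows how n-gram frequency vectors plug into the usual cosine similarity without any stemming or stop-word removal.

```python
import math
from collections import Counter

def ngram_vector(text, n=3):
    # Character n-gram frequencies stand in for word-term frequencies,
    # so no stemming or stop-word removal is required.
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(u, v):
    """Standard cosine similarity between two sparse frequency vectors."""
    dot = sum(c * v.get(g, 0) for g, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# Toy corpus and query (illustrative only).
docs = ["recycling and ecology news", "stock market finance report"]
query = "ecological recycling"
vecs = [ngram_vector(d) for d in docs]
scores = [cosine(ngram_vector(query), v) for v in vecs]
# The ecology document scores higher than the finance one for this query.
```

Ranking documents by these scores gives a retrieval model that behaves reasonably even for misspelled or morphologically varied queries, which is the robustness property the abstract emphasizes.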