GoldenBullet: Automated Classification of Product Data in E-commerce
Y. Ding, M. Korotkiy, B. Omelayenko, V. Kartseva, V. Zykov, M. Klein, E. Schulten,
and D. Fensel
Vrije Universiteit Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, NL, ying@cs.vu.nl
Withold Abramowicz (ed.), Business Information Systems, Proceedings of BIS 2002, Poznan, Poland
Abstract
Internet and Web technology has started to penetrate many aspects of our daily life. Its importance as a medium for business transactions will grow exponentially during the next years. In terms of market volume, the B2B area will be the most interesting one. It will also be the place where the new technology leads to drastic changes in established customer relationships and business models. B2B market places provide new kinds of services to their clients: simple 1-1 connections are being replaced by n-m relationships between customers and vendors. However, this new flexibility in electronic trading also generates serious challenges for the parties that want to realize it. The main problem is caused by the heterogeneity of the information descriptions used by vendors and customers. Intelligent solutions that help to mechanize the processes of structuring, classifying, aligning, and personalizing are a key requisite for successfully overcoming the current bottlenecks of B2B electronic commerce. In this paper, we describe a system called GoldenBullet that applies techniques from information retrieval and machine learning to the problem of product data classification. The system helps to mechanize an important and labor-intensive task of content management for B2B E-commerce.
1. Introduction
The World Wide Web (WWW) has drastically changed the on-line availability of information and the amount of electronically exchanged information. Meanwhile, the computer has mutated from a device for computation into an entrance portal to large information volumes, communication, and business transactions (cf. [Fensel, 2001]). It has started to change the commercial relationships between suppliers and customers. Currently, a large fraction of B2B transactions is still realized over traditional non-Internet networks, such as those conducted over EDI systems. In this traditional paradigm, direct 1-1 connections and mappings are programmed based on standards like EDIFACT (cf. [EDIFACT, 1999]). However, this traditional paradigm does not employ the full power of electronic commerce, and it is quite likely that it will soon be outpaced by more timely, Internet- and Web-based transaction types. Internet-based electronic commerce provides a much higher level of flexibility and openness that will help to optimize business relationships. Instead of implementing one link to each customer, a supplier is linked to a large number of potential customers when linked to a market place.
However, shielding their customers from the bottleneck of an exponentially growing number of implemented business connections confronts B2B market places with a serious problem: they have to deal with the heterogeneity of the product, catalogue, and document description standards of their customers. Effective and efficient management of different description styles becomes a key task for these market places.
Successful content management for B2B electronic commerce has to deal with various aspects: information extraction from rough sources, information classification to make product data maintainable and accessible, reclassification of product data, information personalization, and mappings between different information representations [Fensel et al., 2001]. All of these sub-tasks are hampered by the lack of proper standards (or, in other words, by the inflation and inconsistency of arising pseudo-standards). The paper focuses on these challenges for content management and discusses some potential solution paths.
The paper is organized as follows. In Section 2 we describe the overall content management problem that needs to be solved for effective E-commerce. Section 3 introduces our system GoldenBullet, which applies information retrieval and machine learning techniques to one of the important sub-tasks of content management: mechanizing the process of product classification. Section 4 provides an evaluation of our approach based on real-world data provided by B2B market places. Finally, Section 5 provides conclusions and discusses future directions.
2. Content Management in E-Commerce
B2B market places are an intermediate layer for business communication that provides a serious advantage to their clients: they can communicate with a large number of customers over a single communication channel to the market place. The market places reduce the number of mappings within their user community from n*m to n+m. However, in order to provide this service, they have to solve the significant mapping and normalization problem for their clients. A successful market place has to deal with various aspects. It has to integrate with various hardware and software platforms and has to provide a common protocol for information exchange. However, the real problem is the heterogeneity and openness of the exchanged content. Therefore, content management is one of the real challenges in successful B2B electronic commerce. It has to tackle a number of serious problems [Fensel et al., 2001]:
1. Product descriptions are unstructured.
2. Product descriptions are unclassified.
3. Product descriptions must be classified and described in various dimensions because no standard product classifications exist.
Product descriptions must be structured. Suppliers have product catalogues that describe their products to their potential clients. This information should be made available on-line by a B2B market place. One could think that this is a simple task because most product catalogues already exist electronically. However, these product catalogues are designed for the human reader. Extracting the actual product information and storing it in a structured format is therefore mainly a manual task. A content management solution provider like Content Europe (http://www.contenteurope.com) has several hundred employees working in content factories to manually structure the product information. In the worst case, they take printed copies of the product catalogues as input.

Product descriptions must be classified. At this stage in the content management process we can assume that our product information is structured in a tabular way. Each product corresponds to an entry in a table whose columns reflect the different attributes of the product. Similar products are grouped together in the same table.
Each supplier uses different structures and vocabularies to describe its products. This may not cause a problem in a 1-1 relationship, where the buyer may get used to the private terminology of his supplier. B2B market places that enable n-m commerce cannot rely on such an assumption. They must classify all products according to a standard classification schema that helps buyers and suppliers communicate their product information. A widely used classification schema in the US is UNSPSC (http://www.unspsc.org; for details, see the next section). Again, it is a difficult and mainly manual task to classify the products according to a classification schema like UNSPSC. It requires domain expertise and knowledge about the product domain.
Product descriptions must be re-classified. Bottlenecks in exchanging information have led to a plethora of different standards that should improve the situation. However, there are usually two problems. First, there are too many “standards”, i.e., none of them is an actual standard. Second, most standards lack important features for various application problems. Not surprisingly, both problems also appear in B2B electronic commerce. UNSPSC is a typical example of a horizontal standard that covers all possible product domains but is not very detailed in any of them. Another example of such a standard is the Universal Content Extended Classification (UCEC, http://www.ucec.org). It takes UNSPSC as a starting point and refines it with attributes. RosettaNet (http://www.rosettanet.org) is an example of a vertical standard describing products of the hardware and software industry in detail. Vertical standards describe a certain product domain in more detail than common horizontal ones. More examples of such “standards” can be found in [Fensel, 2001].

In the remainder of the paper we focus on one of these sub-tasks: we describe the solution we developed for product classification. However, we would also like to mention that we are currently evaluating similar techniques for product data structuring and re-classification.
3. GoldenBullet
Finding the right place for a product description in a
standard classification system such as UNSPSC is not at
all a trivial task. Each product must be mapped to the
corresponding product category in UNSPSC to create the product catalog. Product classification schemes contain a huge number of categories with far from sufficient definitions (e.g., over 12,000 classes for UNSPSC), and millions of products must be classified according to them. This requires tremendous labor: the product classification stage takes up to 25% of the time spent on content management. Because product classification is this expensive, complicated, time-consuming and error-prone, content management needs support in automating the product classification process and in automatically creating product classification rules.
GoldenBullet is a software environment targeted at supporting product classification according to certain content standards. It is currently designed to automatically classify products based on their original descriptions and existing classification standards (such as UNSPSC). It integrates different classification algorithms from the information retrieval and machine learning areas with natural language processing techniques to pre-process the data and index UNSPSC.
3.1. UNSPSC
The Universal Standard Products and Services Classification (UNSPSC) is an open global coding system that classifies products and services. It was first developed by Dun & Bradstreet and the United Nations Development Program. It is now maintained by the Electronic Commerce Code Management Association (ECCMA), a not-for-profit membership organization. The UNSPSC code covers almost any product or service that can be bought or sold: it includes about 12,000 codes covering 54 industry segments, from electronics to chemicals, medical services, educational services, automotive products, and fabrications. The UNSPSC is heavily deployed around the world in electronic catalogs, search engines, procurement application systems and accounting systems. It is a 10-digit hierarchical code that consists of 5 levels (see Figure 1).

Figure 1. The layered structure of UNSPSC.
- Segment (XX): the logical aggregation of families for analytical purposes.
- Family (XX): a commonly recognized group of inter-related commodity categories.
- Class (XX): a group of commodities sharing a common use or function.
- Commodity (XX): a group of substitutable products or services.
- Business Type (XX): the function performed in support of the commodity. This value is vendor specific and does not affect the base code for each commodity.
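The layering above can be made concrete with a small helper that splits a 10-digit code into its five 2-digit levels. This is a sketch for illustration; the code value used below is made up, not an actual UNSPSC entry.

```python
# Split a 10-digit UNSPSC code into its five 2-digit levels
# (Segment, Family, Class, Commodity, Business Type).

def split_unspsc(code: str) -> dict:
    """Return a mapping from level name to its 2-digit code fragment."""
    if len(code) != 10 or not code.isdigit():
        raise ValueError("UNSPSC codes are 10 decimal digits")
    levels = ["segment", "family", "class", "commodity", "business_type"]
    return {name: code[2 * i:2 * i + 2] for i, name in enumerate(levels)}

parts = split_unspsc("4321150110")  # illustrative code, not a real entry
print(parts["segment"])             # the first two digits
```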
3.2. Overall GoldenBullet Functionality
GoldenBullet as a software environment provides the following functions to achieve semi-automatic or fully automatic product classification: data input and export facilities; text processing techniques; classification of product data; and learning and enrichment of product classification information (see Figure 2).
3.2.1 Data Input, Output, and Validation
A wrapper factory gathers various wrappers to convert raw data descriptions from external formats into the internal format, and final results into the preferred output format or user-designed formats. Besides automatic importing and exporting of data, GoldenBullet also provides an editor for manually inputting data, which suits small and medium vendors well.

Figure 2. Overview of GoldenBullet: input data (raw product descriptions) passes through a data wrapper (input/output, manual or automatic, data validation), text pre-processing (stemming, tagging, indexing), the classifier (on-line and off-line learning), and an enriching step (UNSPSC and a classification rule base).
3.2.2 Text pre-processing
The validated product data is pre-processed before classification is performed. Several natural language processing algorithms have been implemented in GoldenBullet. The product data is stemmed (grouping different words with the same stems) and tagged (extracting noun phrases). Furthermore, UNSPSC itself is pre-processed (stemmed and tagged) to make sure that noisy words and information are screened out. A stop word list has been generated, updated and extended during the whole process. Currently, GoldenBullet can handle English and French product data.
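A toy sketch of this pre-processing step is shown below: lower-casing, stop-word removal, and crude suffix stripping as a stand-in for a real stemmer. The paper does not specify which stemming algorithm or stop word list GoldenBullet uses, so both are illustrative assumptions here.

```python
# Toy pre-processing: stop-word removal plus very rough suffix stripping.
# STOP_WORDS and crude_stem are illustrative stand-ins, not GoldenBullet's.

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "with"}

def crude_stem(word: str) -> str:
    """Strip a few common suffixes; a real system would use a proper stemmer."""
    for suffix in ("ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(description: str) -> list:
    tokens = description.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Cartridges for the laser printers"))
```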
3.2.3 Product Classification
Figure 3 shows the user interface of the classifier. The imported UNSPSC is browsable on the screen, which directs the end user to the right location in UNSPSC. The classifier classifies the pre-processed product data and proposes ranked solutions based on various weighting algorithms. The end user can pull down the proposed list and make the final choice. When he highlights one of the proposed solutions, the UNSPSC browse window above shows its exact location in UNSPSC with the details of each level.

Figure 3. A screenshot of GoldenBullet.
Performing the classification task is viewed as an information retrieval problem (see [Ribeiro-Neto & Baeza-Yates, 1999] for an introduction to the field). The problem of finding the right class is viewed as the problem of finding the right document as an answer to a query:
- A product description is viewed as a query, and UNSPSC is viewed as a document collection.
- Each of the commodities in UNSPSC is treated as a document, where each commodity description forms the text of the document.
- Assigning a proper category to a product is achieved by retrieving the corresponding UNSPSC commodity description.
The performance of such an approach is rather low (see the next sub-section for more details). Directly using UNSPSC as the document collection fails because the class descriptions are very short (i.e., we deal with very short documents), and the product descriptions are often very short too and use very specific vocabulary that cannot directly be matched with the more generic terms in UNSPSC. Evaluation with real-world data showed that less than 1% of actual product data could be classified correctly with such a naïve approach. Therefore, we employed various strategies, described in the next subsection, to achieve a more reasonable and workable result. Basically, we employed different retrieval strategies and we made use of large volumes of manually classified data to improve the performance.
3.3. The heart of GoldenBullet: Intelligence and
Knowledge
The essence of GoldenBullet is its ability to automatically classify product descriptions. This sub-section presents our classification approaches and shows how they can make use of pre-classified data as background knowledge. An evaluation of the approaches is given in Section 4.
3.3.1 The Vector Space Model (VSM)
A standard method in information retrieval is Salton's well-known Vector Space Model (VSM) (cf. [Salton et al., 1975]). It uses word vectors to represent documents and user queries, then applies the cosine similarity formula to calculate the similarity between document and query so as to retrieve the documents most relevant to the user's query. The same model has been applied to text categorization: [Gomez-Hidalgo & Rodriguez, 1997] used Salton's vector space model to represent documents (in our case product descriptions) and existing categories (in our case UNSPSC). A category (UNSPSC code) can then be assigned to a document (product) when the cosine similarity between them exceeds a certain threshold. The basic idea is to represent each document as a vector of weighted word frequencies.
We adopt VSM to find the match between UNSPSC commodities and product descriptions. In the following, we describe two strategies for how VSM can make use of pre-classified examples. Both strategies treat an unclassified product description as a query but differ in what they use as the document collection.

1. The first version takes each commodity as a document. The examples are used to enrich the commodity description: basically, we extract words from pre-classified product data and add them to the word list describing the commodity.
2. The second version takes each pre-classified product description as a document. We use VSM to retrieve the instance that best fits a new product description and infer the UNSPSC code of the latter from the known UNSPSC code of the former.

We describe the pros and cons of both approaches in Section 4.
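The second strategy can be sketched as follows. Each pre-classified description is a document; a new description is a query; the code of the most cosine-similar document is returned. The weights here are raw term frequencies for brevity; the paper does not state GoldenBullet's actual weighting scheme, and the example codes are invented.

```python
# Minimal VSM sketch: cosine similarity over raw term-frequency vectors.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(query: str, examples: list) -> str:
    """examples: (description, unspsc_code) pairs; returns the best code."""
    q = Counter(query.lower().split())
    best = max(examples,
               key=lambda ex: cosine(q, Counter(ex[0].lower().split())))
    return best[1]

examples = [
    ("laser printer toner cartridge", "44103103"),  # codes are illustrative
    ("notebook computer battery", "43191501"),
]
print(classify("toner cartridge for laser printer", examples))
```

In practice one would use tf-idf weights rather than raw counts, but the retrieval logic is the same.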
3.3.2 K-nearest neighbor
Another instance-based classifier we implemented is based on the k-Nearest Neighbor (KNN) method. Again, the algorithm uses the set of pre-classified examples directly to classify a new example: it passes over the whole set of training examples, searches for the most similar ones, and assigns the new example the class of the most similar one. KNN is computationally expensive and, depending on the number of pre-classified examples, requires a lot of memory to operate. Again, we can distinguish two modes with regard to whether the algorithm works directly on the pre-classified product data or on enriched class descriptions.
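The instance-based mode can be sketched as below, here with a majority vote over the k most similar examples. The similarity measure (token-set Jaccard overlap), the value of k, and the example codes are all illustrative assumptions; the paper does not specify them.

```python
# KNN sketch over pre-classified product data using Jaccard token overlap.
from collections import Counter

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(query: str, examples: list, k: int = 3) -> str:
    """examples: (description, code) pairs; majority vote over k neighbors."""
    q = set(query.lower().split())
    ranked = sorted(examples,
                    key=lambda ex: jaccard(q, set(ex[0].lower().split())),
                    reverse=True)
    votes = Counter(code for _, code in ranked[:k])
    return votes.most_common(1)[0][0]

examples = [
    ("ballpoint pen blue", "44121704"),   # illustrative codes
    ("ballpoint pen black", "44121704"),
    ("desk stapler", "44121615"),
]
print(knn_classify("blue ballpoint pen", examples, k=3))
```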
3.3.3 Naïve-Bayes classifier
The final paradigm we employed is the machine learning paradigm (see [Mitchell, 1997] for a good introduction and analysis of the field). This paradigm assumes the existence of a set of (manually) pre-classified products, called the training set, and a set of product descriptions to be classified by the system, called the test set. The Naïve-Bayes (NB) classifier [Burges, 1998] uses Bayes' theorem to predict the probability of each possible class, given a product description and the set of pre-classified training examples as input. The classifier assigns the commodity with the highest probability of being correct. Naïve-Bayes is a standard text classification algorithm with a long and successful application history [Koller & Sahami, 1997].
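A textbook multinomial Naïve-Bayes formulation with Laplace smoothing is sketched below. The paper does not detail GoldenBullet's exact model, so this is a standard formulation rather than the system's implementation, and the codes are invented.

```python
# Multinomial Naïve-Bayes sketch with add-one (Laplace) smoothing.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: (description, code) pairs -> (priors, word counts, vocab)."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, code in examples:
        class_counts[code] += 1
        for w in text.lower().split():
            word_counts[code][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(query, model):
    """Return the class with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_code, best_lp = None, -math.inf
    for code, n in class_counts.items():
        lp = math.log(n / total)                       # class prior
        denom = sum(word_counts[code].values()) + len(vocab)
        for w in query.lower().split():
            lp += math.log((word_counts[code][w] + 1) / denom)
        if lp > best_lp:
            best_code, best_lp = code, lp
    return best_code

model = train([
    ("laser printer", "43212105"),      # illustrative codes
    ("inkjet printer", "43212105"),
    ("office chair", "56101504"),
])
print(predict("color laser printer", model))
```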
3.3.4 Hierarchical Classification
UNSPSC provides a hierarchy of four levels for classifying products: Segment, Family, Class, and Commodity. Therefore, it is quite natural to employ a hierarchical classification approach for our task. In addition, we found that many of the pre-classified products we received for the evaluations reported in the next section are not classified down to the Commodity level, but only down to the Class or Family level. Therefore, we built a hierarchical classifier that actually consists of four classifiers, each working at the corresponding level (see also [Chakrabarti et al., 1997], [Koller & Sahami, 1997], [Dumais & Chen, 2000], and [Agrawal & Srikant, 2001]).
3.4. Implementation
GoldenBullet is designed to provide the widest possible access to its product description classification service, so we intended to make a kind of web service out of our tool. The current version of the prototype is oriented towards an “HTML-like” user interface. Our main goal concerning user interface design was to provide a fully functional and convenient interaction environment and, at the same time, not to put too many requirements on the user-side software. Currently, all a user needs to use the GoldenBullet prototype is an HTML browser that supports JavaScript 1.2.

The web service is provided by means of a client-server approach. The core of our tool is a server-side set of Java packages that implements the functionality and generates all interaction pages. The server-side module was implemented as a web application: we use Java Servlet technology and its extension, Java Server Pages, to generate all client-side user interface pages. All intelligent content of our prototype was implemented in Java 1.3. By separating the training-classification process into an “off-line” training step (computationally highly expensive) and an “on-line” classification step, we managed to achieve acceptable performance.
4. Evaluation
The evaluation we report is based on around 40,000 real product descriptions that were already classified manually. They come from various vendors and cover a large number of categories in UNSPSC. (The data were collected through a cooperation with Hubwoo, an MRO market place in France.) In the following we compare the performance of a number of algorithmic variants of our classifier.
4.1. Test Data Set
The data comprise 41,913 manually classified product descriptions in French. Table 1 summarizes the population of the UNSPSC categories in the data set according to UNSPSC v7.2.

The table can be understood as follows. The number of UNSPSC categories reports the number of Commodities, Classes, Families, and Segments of UNSPSC v7.2 that are populated by some instances found in the data set. Second, we analyzed the distribution of the data: we took around 5% of the covered Commodities, Classes, Families, and Segments and selected the ones with the highest coverage. It shows that 28,247 items of the data set fall into 21 commodities, i.e., around 60% of the data are in 21 of the around 15,000 commodities. Similarly, around 25% of all data are in a single Segment. In total, we worked with 41,913 product descriptions.
4.2. Accuracy Results for a Naïve Classifier
A large fraction of the product descriptions in this test data set are represented by product model names, such as “proliant”, “pentium”, “presario”, “carepaq”, or “RA3000”, and do not use any functional terms for the product itself. In this case, our naïve classifiers are not capable of securing high accuracy. In fact, the accuracy is extremely low and reaches at most 0.2%. Clearly, such a classifier is of no use; in the following we describe how training makes a different story.
4.3. Accuracy Results of the trained Algorithms
For training the algorithms we chose the following approach: a 60% random sample of the product description data set was used as the training set, and the remaining 40% as the test set. We repeated the test on several random splits of the data set. The results are reported in Table 2. We applied two quality measures:

- The total accuracy asks whether the commodity recommendation of GoldenBullet with the highest rating based on the product description matches the actual commodity of the product.
- The “First 10 Accuracy” asks whether one of the ten commodity recommendations of GoldenBullet with the highest ratings based on the product description matches the actual commodity of the product.

Table 1. Population of UNSPSC

Category    | Number of UNSPSC categories | 5% | Number of instances in the most frequent categories
Segments    | 28                          | 1  | 15233
Families    | 56                          | 3  | 27407
Classes     | 150                         | 7  | 24800
Commodities | 421                         | 21 | 28247
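The evaluation protocol can be sketched as follows: a random 60/40 split and the two accuracy measures computed over the test set. `classify_top10` stands in for any GoldenBullet classifier returning a best-first list of candidate codes; the dummy data and classifier below are invented for illustration.

```python
# Sketch of the 60/40 split and the two accuracy measures.
import random

def split_60_40(data, seed=0):
    """Return (training set, test set) from a random 60/40 split."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(0.6 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def evaluate(test_set, classify_top10):
    """Compute (total accuracy, first-10 accuracy) over the test set."""
    total = first10 = 0
    for description, true_code in test_set:
        ranked = classify_top10(description)   # best-first list of codes
        total += ranked[0] == true_code
        first10 += true_code in ranked[:10]
    n = len(test_set)
    return total / n, first10 / n

# Dummy data and classifier, purely to exercise the protocol.
data = [("laser printer", "A"), ("office chair", "B"), ("notebook", "A"),
        ("desk", "B"), ("toner", "A")]
train_set, test_set = split_60_40(data)
acc, acc10 = evaluate(test_set, lambda d: ["A", "B"])
```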
In addition, we distinguished two modes for all algorithms: either we treated the pre-classified product data, or the enriched class descriptions (based on the pre-classified data), as the documents to be retrieved. In general, the Bayesian classifier significantly outperforms all other classifiers (compare the similar results of [Agrawal & Srikant, 2001]). Working directly with the pre-classified data works best for all algorithms; only with regard to the “First 10 Accuracy” is there no difference for the Bayesian classifier. In general, an accuracy between 78% and 88% looks rather convincing and qualifies as equal to the quality of human classification. (Higher accuracy would just mean over-fitting to human classification decisions, which themselves have a significant error rate, as we encountered in labour-intensive manual checks.)
We repeated the experiments for hierarchical versions of the algorithms, i.e., first a classification decision is taken at the Segment level and then recursively at the Family, Class, and Commodity levels. Against our initial intuition, this led to a significant lowering of the accuracy of the classification algorithms: obviously, too many wrong decisions happen in the early steps of the classification process. The results are presented in Table 3.

Table 2. Accuracy of trained (non-hierarchical) algorithms

Algorithm | Total Accuracy | First 10 Accuracy
VSM-I     | 60%            | 78%
VSM-C     | 28%            | 69%
KNN-I     | 45%            | 84%
KNN-C     | 29%            | 56%
NB-I      | 78%            | 88%
NB-C      | 59%            | 88%

(I: the documents are pre-classified instances; C: the documents are enriched class descriptions.)
4.4. Summary
Untrained algorithms fail completely to provide any support for semi-automatic or automatic product classification. Trained algorithms can provide significant support: up to 90% accuracy can be achieved by learning from pre-classified examples. As always, the significant assumption of such an approach is the availability of representative pre-classified examples.
5. Conclusions and Future Works
Market places for B2B electronic commerce have a large economic potential. They provide openness and flexibility in establishing commercial relationships for their clients. In order to provide this service they have to tackle serious obstacles. The most prominent one is concerned with integrating the various styles used to describe the content and the structure of the exchanged information. Product catalogues correspond to large and complex domain ontologies and, in the case of horizontal standards, to upper-layer ontologies. Large modeling and mapping efforts are required to handle these descriptions. Content managers have to structure, classify, re-classify, and personalize large volumes of data to make product descriptions automatically accessible via B2B market places.

GoldenBullet aims at mechanizing the classification process of product data. Accuracy rates between 70% and 98% indicate that this process can be mechanized to a degree where severe cost reductions can be achieved, which is a pre-requisite for scalable E-commerce. The success of GoldenBullet is based on the combination of natural language processing, information retrieval, machine learning, and the use of large volumes of manually classified data. Future versions of GoldenBullet will provide more advanced features, as explained in the following.

Table 3. Accuracy of trained (hierarchical) algorithms

Algorithm | Total Accuracy | First 10 Accuracy
VSM-I     | 22%            | 41%
VSM-C     | 14%            | 27%
KNN-I     | 25%            | 22%
KNN-C     | 13%            | 15%
NB-I      | 38%            | 42%
NB-C      | 29%            | 42%
Recently, Support Vector Machines have been shown to be a very useful tool for text categorization (cf. [Burges, 1998], [Cortes & Vapnik, 1995], [Joachims, 1998], and [Dumais & Chen, 2000]). We want to explore the applicability of this algorithmic paradigm for our specific product classification task.
Multi-linguality (i.e. the product catalog and the
product classification standard are described in different
languages) is a severe requirement for E-commerce in
Europe. Currently, GoldenBullet supports English and
French. An extension to further languages is a pre-requisite for open and flexible E-commerce.
Multi-standard classification is an important issue for open and flexible E-commerce (cf. [Agrawal & Srikant, 2001], [Madhaven et al., 2001], and [Schulten et al., 2001]). Currently, market places establish one “standard” to make the data of their clients accessible. However, the openness and flexibility of Internet-enabled trading demands multi-standard classification. Besides UNSPSC there are other “standards”, like UCEC (http://www.ucec.org), ecl@ss (http://www.eclass.de), and RosettaNet (http://www.rosettanet.org), that are widely used in certain domain-specific or geographical areas of E-commerce. Providing mechanized support for reclassifying product data into an additional classification schema is a demanded feature for realizing the full potential of E-commerce.

GoldenBullet successfully applies information retrieval and machine learning techniques to the problem of automatically classifying product descriptions. GoldenBullet will also tackle other severe problems of B2B market places, such as mapping and reclassifying product descriptions according to different product code systems (see [Agrawal & Srikant, 2001] and [Navathe et al., 2001]) and personalizing views on product data for divergent customers.
References
[Agrawal & Srikant, 2001]
R. Agrawal and R. Srikant: On Integrating Catalogs. In
Proceedings of the Tenth International World Wide Web
Conference (WWW2001), Hong Kong, May 2001.
[Burges, 1998]
C. J. C. Burges: A Tutorial on Support Vector Machines for
Pattern Recognition, Data Mining and Knowledge
Discovery, 2(2):955-974, 1998.
[Chakrabarti et al., 1997]
S. Chakrabarti et al.: Using taxonomy, discriminants, and
signatures for navigating in text databases. In Proceedings
of the 23rd VLDB Conference, Greece, 1997.
[Cheeseman and Stutz, 1995]
P. Cheeseman and J. Stutz: Bayesian classification
(AutoClass): Theory and results. In U. M. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.),
Advances in knowledge discovery and data mining, The
AAAI press, Menlo Park, 1995.
[Cortes & Vapnik, 1995]
C. Cortes and V. Vapnik: Support-Vector Networks,
Machine Learning, 20:273-297, 1995.
[Dumais & Chen, 2000]
S. Dumais and H. Chen: Hierarchical Classification of Web
Content. In Proceedings of SIGIR-00, 23rd ACM
International Conference on Research and Development in
Information Retrieval, pp. 256-263, Athens, Greece,
August 2000.
[EDIFACT, 1999]
United Nations: UN/EDIFACT Directory.
http://www.unece.org/trade/untdid, 1999.
[Fensel, 2001]
D. Fensel: Ontologies: Silver Bullet for Knowledge
Management and Electronic Commerce, Springer-Verlag,
Berlin, 2001.
[Fensel et al., 2001]
D. Fensel, Y. Ding, E. Schulten, B. Omelayenko, G.
Botquin, M. Brown, and A. Flett: Product Data Integration
in B2B E-commerce, IEEE Intelligent Systems, 16(3), 2001.
[Gomez-Hidalgo & Rodriguez, 1997]
J. M. Gomez-Hidalgo and M. B. Rodriguez: Integrating a
lexical database and a training collection for text
categorization. In the Proceedings of ACL/EACL (the
Association for Computational Linguistics/European
Association for Computational Linguistics: Automatic
Information Extraction and Building of Lexical Semantic
Resources for NLP Applications), Madrid, Spain, July,
1997.
[Joachims, 1998]
T. Joachims: Text Categorization with Support Vector
Machines: Learning with Many Relevant Features. In C.
Nedellec and C. Rouveirol (eds.), Machine Learning:
ECML-98, Proceedings of the 10th European Conference
on Machine Learning, Lecture Notes in Artificial
Intelligence, LNAI 1398, pages 137-142, 1998.
[Koller & Sahami, 1997]
D. Koller and M. Sahami: Hierarchically classifying
documents using very few words. In Proceedings of the
International Conference on Machine Learning, vol 14,
Morgan-Kaufmann, July 1997.
[Madhaven et al., 2001]
J. Madhaven, P. A. Bernstein, and E. Rahm: Generic
Schema Matching with Cupid. In Proceedings of the 27th
VLDB Conference, Roma, Italy, 2001.
[Mitchell, 1997]
T. Mitchell, Machine Learning, McGraw Hill, 1997.
[Navathe et al., 2001]
S. Navathe, H. Thomas, M. Satitsamitpong, and A. Datta: A
Model to Support E-Catalog Integration. In Proceedings of
the 9th IFIP 2.6 Working Conference on Database (DS-9),
Semantic Issues in e-commerce Systems, Hong Kong, April
2001.
[Ribiero-Neto & Baeza-Yates, 1999]
B. Ribiero-Neto and R. Baeza-Yates, Modern Information
Retrieval, Addison Wesley, 1999.
[Salton et al., 1975]
G. Salton, A. Wong, and C. S. Yang: A vector space model
for automatic indexing, Communications of the ACM,
18(7):613-620, 1975.
[Schulten et al., 2001]
E. Schulten, H. Akkermans, G. Botquin, M. Dörr, N.
Guarino, N. Lopes, and N. Sadeh: The E-commerce product
classification challenge, IEEE Intelligent Systems, 16(4),
2001.