Figure 1: Histogram of the number of words per item description in the invoice dataset. The vertical dashed line represents the mean number of words per item description.
Source publication
Item categorization is a machine learning task that aims to classify e-commerce items, typically represented by textual attributes, into their most suitable category from a predefined set of categories. An accurate item categorization system is essential for improving both the user experience and the operational processes of the company. In this...
Contexts in source publication
Context 1
... fact, it contains a value only in 24% of the invoices in our dataset. Figure 1 presents a histogram of the number of words per item (after concatenating the item name and description fields). The seller's industry field represents a classification of the seller itself (i.e., not a specific invoice/item) into a single category out of 40 predefined categories (such as fashion, jewelry, accounting, etc.). ...
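A minimal sketch of how the word-count statistic behind Figure 1 could be computed, assuming a pandas DataFrame `invoices` with hypothetical `item_name` and `item_description` columns (the excerpt does not specify field names):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Concatenate the item name and description fields, then count words per item.
# `invoices`, `item_name`, and `item_description` are hypothetical placeholders.
text = (invoices["item_name"].fillna("") + " "
        + invoices["item_description"].fillna(""))
word_counts = text.str.split().str.len()

plt.hist(word_counts, bins=50)
plt.axvline(word_counts.mean(), linestyle="--")  # dashed line at the mean, as in Figure 1
plt.xlabel("Number of words per item description")
plt.ylabel("Number of items")
plt.show()
```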
Context 2
... TF-IDF: Our first baseline model is a regularized logistic regression model trained on TF-IDF features with unigrams and bigrams, similar to Kozareva (2015). The TF-IDF vocabulary was built using the 3500 most frequent terms in the invoice dataset (see Figure 12 in the appendix for more details on why 3500 was chosen). ...
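The description above maps directly onto a scikit-learn pipeline. The following is a hedged sketch; `train_texts`, `train_labels`, and `test_texts` are hypothetical placeholders for the invoice data, and the regularization strength is an assumption since the excerpt does not state it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Unigrams and bigrams, vocabulary capped at the 3500 most frequent terms,
# with an L2-regularized logistic regression on top.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=3500),
    LogisticRegression(C=1.0, max_iter=1000),  # C (regularization strength) is assumed
)
baseline.fit(train_texts, train_labels)
predictions = baseline.predict(test_texts)
```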
Context 3
... that all three LSTM deep neural networks used for extracting embeddings shared the same architecture. The architecture and hyperparameters were tuned using a grid search over the following hyperparameters: number of LSTM layers (1-3), hidden dimension of the LSTM layers (100, 200, 400), hidden dimension of the fully connected layer (10, 30, 50, 100), and dropout rate (0.1, 0.2, 0.3, 0.5). The chosen architecture and hyperparameters are shown in Table 3. ...
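A sketch of that grid search, assuming a PyTorch LSTM classifier; the embedding size, vocabulary size, and the `train_and_evaluate` helper are hypothetical stand-ins for details not given in the excerpt:

```python
import itertools
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, lstm_hidden, num_layers,
                 fc_hidden, dropout, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Inter-layer dropout only applies when there is more than one LSTM layer.
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, num_layers, batch_first=True,
                            dropout=dropout if num_layers > 1 else 0.0)
        self.fc = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(lstm_hidden, fc_hidden),
            nn.ReLU(),
            nn.Linear(fc_hidden, num_classes),
        )

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        return self.fc(hidden[-1])  # final hidden state of the last LSTM layer

# The hyperparameter grid reported in the excerpt.
grid = itertools.product(
    [1, 2, 3],             # number of LSTM layers
    [100, 200, 400],       # LSTM hidden dimension
    [10, 30, 50, 100],     # fully connected hidden dimension
    [0.1, 0.2, 0.3, 0.5],  # dropout rate
)

best_score, best_config = float("-inf"), None
for num_layers, lstm_hidden, fc_hidden, dropout in grid:
    model = LSTMClassifier(vocab_size=30_000, embed_dim=128,  # assumed sizes
                           lstm_hidden=lstm_hidden, num_layers=num_layers,
                           fc_hidden=fc_hidden, dropout=dropout, num_classes=40)
    score = train_and_evaluate(model)  # hypothetical: fit on train, score on validation
    if score > best_score:
        best_score, best_config = score, (num_layers, lstm_hidden, fc_hidden, dropout)
```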
Context 4
... we see that ETE, which ensembles several transferred embeddings, performs best in terms of classification accuracy and Weighted F1 score, and has low variance across folds compared to the other methods. While Figure 8 presents the results for each of the Autoencoder, Industry Embedding and eBay Embedding methods alone, as well as for the ensemble of all three combined, Figure 13 in the appendix also presents the results for ensembles based on all pair combinations. To further support our findings above, we performed statistical significance tests. ...
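One way to realize such an ensemble is stacking: train one base classifier per embedding, then combine their class-probability outputs with a logistic regression. This is a sketch under that assumption; the excerpt does not specify the exact combination scheme:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ete_ensemble(base_probas_train, y_train, base_probas_test):
    """Combine per-embedding base-model outputs with a logistic regression.

    base_probas_*: list of (n_samples, n_classes) probability arrays, one per
    embedding method (Autoencoder, Industry Embedding, eBay Embedding).
    """
    X_train = np.hstack(base_probas_train)
    X_test = np.hstack(base_probas_test)
    ensemble = LogisticRegression(max_iter=1000)
    ensemble.fit(X_train, y_train)
    return ensemble.predict(X_test)
```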
Context 5
... is especially true for rare categories, for which the model had a small number of instances to train with. In Figure 10, we demonstrate the impact on the accuracy of the compared methods of restricting the task to the top X most common categories (for X ranging between 5 and 40), ignoring all instances that belong to the remaining categories. As expected, we see that all methods performed considerably better when restricted to a smaller number of categories. ...
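The filtering step itself is straightforward; an illustrative version, assuming a pandas DataFrame `df` with a hypothetical `category` column:

```python
import pandas as pd

def restrict_to_top_x(df: pd.DataFrame, x: int) -> pd.DataFrame:
    """Keep only instances whose label is among the x most common categories."""
    top_categories = df["category"].value_counts().nlargest(x).index
    return df[df["category"].isin(top_categories)]
```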
Context 6
... we see that the proposed ETE method outperforms the other methods in all considered cases. Finally, we also tested the effect of using different machine learning classification algorithms for training the transferred models, as well as for training the ensemble model (see Figure 11). The box plot at the top of the figure shows the effect of using different classifiers for training the transferred models (while keeping the ensemble classifier fixed as Logistic Regression). ...
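A sketch of that experiment's outer loop. The candidate classifiers listed here are assumptions (the excerpt does not name them), and `evaluate_ete` is a hypothetical helper that trains the transferred models with the given classifier and scores the resulting ensemble:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "naive_bayes": GaussianNB(),
}

for name, clf in candidates.items():
    # Vary the classifier for the transferred models while keeping the
    # ensemble classifier fixed as Logistic Regression, as in Figure 11 (top).
    score = evaluate_ete(base_classifier=clf,
                         ensemble_classifier=LogisticRegression(max_iter=1000))
    print(name, score)
```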
Context 7
... methods can be integrated into the ETE framework in order to achieve better accuracy. Figure 13 presents a comparison of ensembles based on all combinations of the Autoencoder, Industry Embedding and eBay Embedding methods. As can be seen from the figure, the ETE method, which ensembles all three embedding methods, presents the highest median accuracy and the lowest interquartile range. ...
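Enumerating those combinations is straightforward with itertools; `embedding_features` and `evaluate_ensemble` are hypothetical stand-ins for the per-method feature matrices and the cross-validated evaluation:

```python
from itertools import combinations

import numpy as np

methods = ["autoencoder", "industry_embedding", "ebay_embedding"]
for r in (2, 3):  # all pairs, plus the full triple used by ETE
    for combo in combinations(methods, r):
        X = np.hstack([embedding_features[m] for m in combo])
        scores = evaluate_ensemble(X)  # hypothetical: per-fold accuracy, as in Figure 13
        print(combo, scores)
```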
Context 8
... it is also apparent that most of the classification power comes from the Industry Embedding method. Figure 14 presents a similar analysis to the one presented in Figure 10, but now the remaining categories are not ignored but rather joined into a single "other" category. As can be seen, the two figures present quite similar trends. ...
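The corresponding preprocessing differs from the top-X filter above only in relabeling rather than dropping instances; an illustrative version under the same hypothetical `df`/`category` assumption:

```python
import pandas as pd

def join_rare_into_other(df: pd.DataFrame, x: int) -> pd.DataFrame:
    """Relabel instances outside the x most common categories as "other"."""
    top = df["category"].value_counts().nlargest(x).index
    out = df.copy()
    out.loc[~out["category"].isin(top), "category"] = "other"
    return out
```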
Context 9
... can be seen, the two figures present quite similar trends. Nevertheless, as expected, the overall accuracy results in Figure 10 are slightly higher. Figure 15 presents the accuracy of the ETE model for different lengths (word counts) of item descriptions. ...
Context 10
... as expected, the overall accuracy results in Figure 10 are slightly higher. Figure 15 presents the accuracy of the ETE model for different lengths (word counts) of item descriptions. Examining the figure, it is difficult to identify a clear trend relating item description length to accuracy. ...
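A possible way to produce that breakdown: bucket test items by description length and compute per-bucket accuracy. `test_texts`, `test_labels`, and `predictions` are hypothetical placeholders, and the bin edges are illustrative:

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({
    "n_words": [len(t.split()) for t in test_texts],
    "correct": np.asarray(predictions) == np.asarray(test_labels),
})
bins = pd.cut(results["n_words"], bins=[0, 2, 4, 6, 8, 10, 15, 20, 100])
accuracy_by_length = results.groupby(bins, observed=True)["correct"].mean()
print(accuracy_by_length)
```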
Similar publications
Basic machine learning algorithms and transfer learning models work well for language categorization, but these models require a vast volume of annotated data. Because labeled data is scarce, a better model is needed to tackle the problem, and GAN-BERT may offer a solution. To classify Bengali text, we have developed a GAN-BERT based model,...