
Hierarchical waste detection with weakly supervised segmentation in images from recycling plants



This study describes a neural network-based approach for hierarchical waste detection with weakly supervised object segmentation. WaRP, a new open labeled dataset, was created to train and evaluate the proposed algorithms using industrial data from the conveyor of a waste recycling plant. The dataset contains 28 categories of recyclable goods (bottles, glasses, cardboard, cans, detergents, and canisters) that can overlap, be significantly distorted, or be poorly lit. On the WaRP dataset, we ran experiments with state-of-the-art neural network architectures and assessed their quality for waste detection with and without hierarchy representation, as well as for weakly supervised waste segmentation. Energy usage and environmental impact were assessed during model training. Both the proposed hierarchical technique and the WaRP dataset have shown great potential for industrial application.
Dmitry Yudin (a,b), Nikita Zakharenko (c), Artem Smetanin (d,e), Roman Filonov (b), Margarita Kichik (b), Vladislav Kuznetsov (b), Dmitry Larichev (b), Evgeny Gudov (d), Semen Budennyy (a,c), Aleksandr Panov (a,b)
(a) Artificial Intelligence Research Institute (AIRI), 32 Kutuzovsky Ave., Moscow, Russia
(b) Moscow Institute of Physics and Technology, 9 Institutsky per., Dolgoprudny, Russia
(c) Sber AI Lab, 32 Kutuzovsky Ave., Moscow, Russia
(d) Planetarium One, Naberezhnaya Obvodnogo kanala, 74c, Saint-Petersburg, Russia
(e) National Research University ITMO, Kronverkskiy 49, Saint-Petersburg, Russia
Keywords: object detection, weakly supervised segmentation, neural network,
hierarchical approach, waste recognition, image processing
Preprint submitted to Neurocomputing July 28, 2022
This preprint research paper has not been peer reviewed. Electronic copy available at:
Preprint not peer reviewed
1. Introduction
Neural network methods for computer vision have been shown to significantly improve the efficiency of monitoring technological processes at industrial enterprises. At waste processing plants, there is an acute need to automate the sorting of waste suitable for recycling. For this purpose, conveyor lines are commonly used. They are equipped with industrial manipulators and video cameras and are capable of localizing the desired categories of waste, capturing the objects and moving them [1].
The development of such systems requires algorithms and software that reliably recognize images by detecting bounding boxes, classifying objects and segmenting them [2]. Accurate detection and segmentation are needed to determine the object location for capture by the actuator, which is usually a pneumatic suction gripper [3].
Such tasks are most effectively solved by deep neural networks [4, 5]. This article proposes a novel hierarchical neural network architecture (see Figure 1) that improves the quality of state-of-the-art object detection methods thanks to a joint learning algorithm with an additional classifier and the possibility of weakly supervised segmentation. Particular attention is paid to models with low response time, which are suitable for real-time operation on the equipment of processing plants.
To train the neural network, a dedicated open dataset, WaRP, was developed. It is the largest existing dataset covering 28 categories of recyclable objects that can be found on the conveyor belts of recycling plants. The dataset includes subcategories of bottles, glasses, cardboard, cans, detergents and canisters that can overlap, be heavily deformed, or be poorly lit.
The results presented in the article were integrated into the IT landscape of waste processing plants and could significantly reduce losses during waste sorting. We also evaluated the power consumption and environmental impact of training the proposed model.
Figure 1: Variants of the proposed hierarchical detector scheme. (a) A detector predicts bounding boxes for five object groups: bottle, cardboard, detergent, cans (single class) and canister (single class); batches of cropped bottle, cardboard and detergent images are passed to separate CNN classifiers (ResNet18, ConvNeXT, etc.) for the 20 subcategories of bottle and glass, 2 subcategories of cardboard and 4 subcategories of detergent, each followed by CAM generation for segmentation. (b) A detector predicts bounding boxes for all 28 object categories; a batch of cropped images is passed to a single CNN classifier for 28 object categories, followed by CAM generation for segmentation. Class activation map generation methods: CCAM, Grad-CAM, Layer CAM, CAMERAS, etc.
2. Related work
Waste detection. According to a report published in Nature [6], the problem of garbage pollution is reaching dangerous proportions. It is predicted that by the end of the 21st century the amount of garbage produced will reach 11 million tons per day. The main danger of garbage accumulation is the decreasing share of harmless organic waste and the increasing share of chemically active products in waste. Plastic garbage has radically changed the situation because it does not decompose. It can be recycled, but there is no adequate system for its storage. To solve the garbage problem most effectively, waste must be sorted.
Waste sorting includes not only the separate collection of garbage at the household level, but also the use of sorting complexes, where useful fractions are isolated for further utilization. With special equipment at waste sorting complexes, useful fractions such as paper, metals, plastic, glass and organic components can be separated from the total mass of garbage.
The sorting line process is divided into four main stages. At the first stage, household waste is weighed and its radiation level is measured. At the second stage, the garbage enters special units that tear packages open; the contents are released and then move along conveyor belts. The third stage is manual and automatic sorting and operation of the press. At this stage, a wide variety of equipment can be used, such as separators, mechanical seeders, magnetic separation systems and screens. At the fourth stage, the resulting fractions are distributed according to their intended purpose. Robots are often used as separators at the third stage. They determine the type of garbage using a machine vision system based on artificial intelligence. The focus of this article is an effective method for detecting and classifying garbage on a conveyor belt for separator robots.
Garbage detection is an important task in the modern world. The conveyor belt is not the only place where garbage needs to be detected: there is also garbage in home environments [7], a variety of garbage in urban areas [8], garbage trucks [9], garbage in the oceans [10] and more. Neural networks are among the most effective methods for object detection. In general, the problem of object detection can be decomposed into two smaller problems. The first is a detection task: finding an arbitrary number of objects. The second is to classify every single object and to estimate its size with a bounding box. Accordingly, neural network detectors can be divided into one-stage and two-stage object detectors.
A two-stage object detector solves the detection and classification tasks separately and has many inference steps per image, which is time-consuming and may not be suitable for real-time applications. The most common two-stage object detectors are the R-CNN (Regions with Convolutional Neural Network features) series [11]. For example, Fast R-CNN [12] feeds the whole input image to the CNN to generate a convolutional feature map instead of feeding each region proposal to the CNN; Faster R-CNN [13] uses a separate network to predict the region proposals instead of a selective search algorithm; R-FCN [14] is a region-based fully convolutional network; and Libra R-CNN addresses imbalance at the feature, sample and objective levels [15]. Two-stage object detectors have been continuously improved in recent years, but the training process itself is far from ideal. There is an inconsistency between the fixed network settings and the dynamic training procedure, which greatly affects performance. For example, a fixed label assignment strategy and regression loss function cannot fit the changing distribution of proposals and thus hinder the training of high-quality detectors. Based on proposal statistics during training, Dynamic R-CNN [16] automatically adjusts the label assignment criterion (IoU threshold) and the shape of the regression loss function (parameters of the SmoothL1 loss). This dynamic design makes better use of the training samples and pushes the detector to fit more high-quality samples.
As for one-stage object detectors, the most common models are YOLO [17, 18, 19], SSD [20] and RetinaNet [21]. The YOLO object detection algorithm differs significantly from the region-based algorithms. It looks at the parts of the image that have a high probability of containing an object, and a single convolutional network predicts the bounding boxes and the class probabilities for them. According to [22], this model keeps a good balance between speed, which is important for real-time tasks, and classification quality. A waste recognition system based on the YOLOv5 model is partially tested in [23]. Compared to YOLO, the SSD model [20] adds several feature extraction layers to the end of the backbone network, such as VGG-16 [24], which predict the offsets to default boxes of different scales and aspect ratios. One-stage object detection is commonly implemented by optimizing two sub-tasks, object classification and localization, using heads with two parallel branches, which can lead to a certain level of spatial misalignment between the predictions of the two tasks. Task-aligned One-stage Object Detection (TOOD) [25] explicitly aligns the two tasks in a learning-based manner.
The models described above have a disadvantage caused by an additional post-processing step, NMS (non-maximum suppression), and by anchor-based detection, which generates many mispredictions: YOLOv3, for example, predicts more than 7K boxes for each image. In contrast, the anchor-free CenterNet [26] architecture is based on the insight that box predictions can be sorted by relevance via the location of their centers rather than their overlap with the object. DETR (DEtection TRansformer) [27] is a transformer-based architecture. Its main constituents are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. DETR also does not require any additional hand-crafted components such as NMS or anchors.
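To make the NMS post-processing step concrete, the following is a minimal sketch of greedy non-maximum suppression in plain Python. The function names, the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are assumptions chosen for illustration and are not taken from any of the cited detectors.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes are kept. The quadratic loop is acceptable for a sketch; practical detectors use vectorized implementations.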
The essential difference between anchor-based and anchor-free detection is actually how positive and negative training samples are defined, which leads to the performance gap between them. If they adopt the same definition of positive and negative samples during training, there is no obvious difference in the final performance, whether regressing from a box or from a point. This shows that the selection of positive and negative training samples is important for current object detectors. ATSS [28] automatically selects positive and negative samples according to the statistical characteristics of objects. It significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them.
Waste classification. A number of papers consider waste recognition in images only as a classification problem. For example, [29] investigates the ResNet50, MobileNet, Inception-ResNetV2, DenseNet121 and Xception models. They demonstrate acceptable quality on a dataset of images from 6 garbage categories taken in good lighting conditions and without object overlap.
In [30], the authors address the problems of data imbalance, uniform backgrounds and small image size using transfer learning with the DenseNet169 model.
In a recent work [31], the authors used the DenseNet121 model with image augmentation and a genetic algorithm to select hyperparameters. These methods allowed them to achieve good results on the TrashNet dataset [32].
Binary classification of plastic waste using Capsule neural networks is marginally superior to simple convolutional neural networks under similarly good imaging conditions [33].
There are hybrid approaches based on convolutional neural networks and multilayer perceptrons, which use information from extra sensors in addition to the camera [34]. This improves quality but is not always technically feasible in practice.
In a recent paper [35], a ResNet-18-like convolutional model for waste classification was proposed. It demonstrated high-quality recognition of the cardboard, glass, metal, plastic and trash categories in good imaging conditions.
A good improvement in waste recognition was achieved by the authors of [36], who used a hybrid classification model composed of models with different architectures. The disadvantage of this work is that the data contained images with a uniform background, which is rarely seen in the industrial environment of recycling plants.
Waste segmentation. Simultaneous detection and segmentation of waste objects on the conveyor can be carried out using the Mask R-CNN model [3], which is trained in a supervised manner and requires a large set of target objects labeled for the segmentation task.
It is worth studying the possibility of unsupervised or weakly supervised waste segmentation, which does not require segmentation masks in the dataset, only information about which category the whole image belongs to. This significantly saves dataset labeling resources and makes it possible to quickly adapt recognition algorithms to a new domain (for example, one associated with new camera installation locations).
In [37], the authors considered various neural network methods for deformable object segmentation in cluttered scenes. They studied fully, semi- and weakly supervised learning for garbage segmentation and demonstrated a significant superiority of fully supervised approaches based on DeepLabv3+ and poor results for weakly supervised methods (CAM, PuzzleCAM, EPS).
In our article, we show that using weakly supervised segmentation methods as part of the hierarchical detector achieves a sufficiently high segmentation quality without ground truth mask annotation.
Table 1: Modern datasets of waste images for recognition
Dataset | Categories | Images | Task | Description
TACO [2020] [38] | 28 (60 subcategories) | 1 500 | Segmentation | Contains 420 annotations per bottle category and 230 annotations per can
UAVVaste [2021] [39] | 1 | 772 | Segmentation | Drone dataset, contains 772 pictures with different rubbish
Trashnet [2017] [32] | 6 | 2 527 | Classification | Contains 501 annotations for the glass category and 482 for plastic
WaDaBa [2021] [40], [41] | 8 | 4 000 | Classification | All images contain objects made of different types of plastic
[2017] [42] | 7 (136 subcategories) | 2 000 | Classification | 144 pictures with bottles and 158 pictures with cans inside
Waste Classification data v2 [2019] [43] | 3 | 27 500 | Classification | Organic, recyclable and non-recyclable categories, pictures scraped from Google
Waste Images from Sushi Restaurant [2020] | 16 | 500 | Classification | Contains 61 annotations per plastic cup category, 37 per plastic utensil category and other items
Open litter map [2017] | 11 (187 subcategories) | > 100k | Multilabel | This is a website that collects a dataset from images with garbage from all over the world
Litter [2020] [46] | 24 | 14 000 | Detection | This is a website with limited access to
Drinking Waste Classification [2020] [47] | 4 | 9 640 | Detection | Contains 1000 images with cans, 1200 pictures with glass bottles, 2500 pictures with plastic bottles
waste_pictures [2019] | 34 | 24 000 | Classification | There are 209 cans, 201 glass bottles and 160 plastic bottles in this dataset, all of them were scraped from Google
spotgarbage [2018] [49] | 3 | 2 400 | Classification | Pictures with garbage on the streets, all of them were scraped from Bing search
MJU-Waste [2020] [50] | 1 | 2 475 | Segmentation | Contains photographs of people holding different types of garbage in their hands (one category garbage)
Domestic Trash Dataset [2021] [51] | 10 | > 9 000 | Classification, | Waste in the wild, paid license, 250 images for free
Wade-ai [2019] [52] | 1 | 1 500+ | Instance segmentation | Outdoor images with different garbage
Google Open Images v6 [2020] [53] | 3 | 14 226 | Detection, | Outdoor and indoor images (bottles, plastic bags, tin cans)
 | 5 | 20 000 | Classification | Images with heterogeneous background
ZeroWaste Dataset [2022] [37] | 4 | 12 125 | Detection, | Conveyor images, contain cardboard, metal and plastic objects
WaRP [2022] | 5 (28 subcategories) | | | Images from the conveyor of recycling plant with categories of bottles, cardboards, detergents, canisters and cans
Datasets for waste recognition. A great number of datasets with various waste images have appeared recently. Some of them contain photographs of littered nature or urban infrastructure (UAVVaste [39]); others are collected from photographs of various packaging items against a neutral background (TACO [38], Trashnet [32]). The most popular modern datasets are listed in Table 1. Each of the mentioned datasets contains waste categories that we are interested in, such as plastic bottles or cans, but the environment of these objects in the photographs does not resemble a conveyor belt.
Among other datasets, the ZeroWaste dataset [37] stands out: it contains photos of a conveyor line at a paper recycling plant. Its annotation includes the categories metal, cardboard and plastic. The ZeroWaste dataset is designed to solve the problem of paper waste segregation, while our recognition task includes different types of drink and household packaging.
To meet the specific challenges of recycling plants, this paper considers the
creation of a new dataset, which is added to Table 1 and called WaRP. It is
described in detail in the next Section.
3. WaRP Dataset
Waste recycling plants need to automatically select and sort recyclable items on the conveyor. In our case, these objects should fall into several main categories: plastic and glass bottles, cardboard, detergents, canisters and cans. For the first three categories, it is desirable to know their color and intended use, since recycling technologies differ. There are no open datasets containing all the required object categories for such an application. Therefore, we needed to develop our own dataset in order to train and test methods for detecting, classifying and segmenting waste.
Our dataset, called WaRP (an abbreviation of Waste Recycling Plant), consists of labeled pictures of an industrial conveyor. We selected 28 recyclable waste categories. Objects in the dataset are divided into the following groups (see Table 2): plastic bottles of 17 categories (class names with the bottle- prefix), glass bottles of three types (the glass- prefix), cardboard of two categories, detergents of four categories, canisters and cans. The -full suffix means that the bottle is filled with air, i.e. not flat. This is important for the correct operation of the manipulator on the conveyor.
Figure 2: Parts of the developed WaRP dataset: WaRP-D - images with bounding box labeling for the detection task, WaRP-C - cropped images with class labels, WaRP-S - cropped images with labeling for weakly supervised segmentation
Examples of instances of each category of the WaRP Dataset are presented
in Figure 3. An important difference from other datasets is that objects can
overlap, be heavily deformed, or be in poor lighting conditions.
The dataset has three parts (see Figure 2): WaRP-D, WaRP-C, and WaRP-S. The first two parts are intended for training and objective quality assessment of the detection (WaRP-D) and classification (WaRP-C) tasks, and the third, WaRP-S, is for validation of weakly supervised segmentation methods. The full statistics of the dataset parts are given in Table 2.
The main dataset part, WaRP-D, contains 2452 images in the training sample and 522 images in the validation sample. The images have Full HD resolution of 1920 × 1080 pixels.
WaRP-C consists of image areas cut out of the WaRP-D set, with class labels. This part includes 8823 images for training and 1583 for testing. The images range in size from 40 to 703 pixels wide and 35 to 668 pixels high. The dataset is unbalanced because of the real conditions of an industrial enterprise. The rarest
Figure 3: Example labeled images (for classes of 'bottle', 'cans', 'cardboard', 'canister', 'detergent') in the WaRP dataset. Panel labels: bottle-blue, bottle-blue5l, bottle-green, bottle-transp, bottle-transp-full, bottle-yogurt, glass-dark, glass-green, glass-transp, canister, cans, juice-cardboard, milk-cardboard, detergent-box, detergent-color, detergent-transparent, detergent-white.
Table 2: WaRP dataset statistics
Category | WaRP-D train | WaRP-D val | WaRP-C train | WaRP-C test | WaRP-C total | WaRP-S
bottle-blue | 535 | 87 | 634 | 106 | 740 | 4
bottle-green | 403 | 65 | 466 | 75 | 541 | 4
bottle-dark | 451 | 80 | 533 | 96 | 629 | 4
bottle-milk | 324 | 54 | 347 | 60 | 407 | 4
bottle-transp | 947 | 164 | 1432 | 235 | 1667 | 4
bottle-multicolor | 125 | 28 | 127 | 31 | 158 | 4
bottle-yogurt | 261 | 41 | 277 | 42 | 319 | 4
bottle-blue-full | 263 | 40 | 285 | 45 | 330 | 4
bottle-transp-full | 457 | 79 | 528 | 93 | 621 | 4
bottle-dark-full | 173 | 31 | 185 | 36 | 221 | 4
bottle-green-full | 229 | 33 | 238 | 35 | 273 | 4
bottle-multicolorv-full | 105 | 20 | 107 | 22 | 129 | 4
bottle-milk-full | 110 | 21 | 110 | 21 | 131 | 4
bottle-oil | 254 | 46 | 276 | 48 | 324 | 4
bottle-oil-full | 23 | 8 | 24 | 8 | 32 | 4
bottle-blue5l | 345 | 60 | 413 | 75 | 488 | 4
bottle-blue5l-full | 87 | 23 | 89 | 24 | 113 | 4
glass-transp | 165 | 34 | 177 | 37 | 214 | 4
glass-dark | 132 | 24 | 136 | 25 | 161 | 4
glass-green | 131 | 23 | 135 | 25 | 160 | 4
juice-cardboard | 251 | 63 | 260 | 71 | 331 | 4
milk-cardboard | 358 | 85 | 390 | 96 | 486 | 4
detergent-white | 300 | 42 | 319 | 44 | 363 | 4
detergent-color | 277 | 43 | 296 | 44 | 340 | 4
detergent-transparent | 245 | 39 | 262 | 42 | 304 | 4
detergent-box | 66 | 17 | 66 | 17 | 83 | 4
canister | 144 | 28 | 149 | 30 | 179 | 4
cans | 495 | 88 | 562 | 100 | 662 | 4
Total | 2452 | 522 | 8823 | 1583 | 10406 | 112
Figure 4: Visualization of Hierarchical detector results. Annotated elements: ground truth, detected bounding box (bottle-dark), false positive (milk-cardboard), omission, false positive + omission.
class is the bottle-oil-full category (air-filled plastic sunflower oil bottles), which includes only 32 crops. The most common category is bottle-transp (transparent bottles), with 1667 cropped images.
WaRP-S contains a total of 112 images ranging in size from 100 × 96 pixels to 412 × 510 pixels; each category has 4 images with significantly deformed recyclable objects.
4. Neural network for hierarchical waste detection with weakly supervised object segmentation
On complex datasets containing images of objects with overlaps and deformations, state-of-the-art detection methods are usually imperfect: they generate false positives and miss objects (see Figure 4). The proposed WaRP dataset is one such dataset.
A promising way to improve the quality of a pre-trained detection neural network is to add classification and segmentation modules. On the one hand, this does not require intervention in the architecture of the detector; on the other hand, it can refine the labels assigned to the detected bounding boxes. Adding semantic segmentation of objects without resource-intensive supervised learning is also beneficial.
This article explores two main variants of the hierarchical detector scheme, which are shown in Figure 1.
In the first option (Figure 1,a), the neural network detects object bounding boxes belonging not to all 28 categories of the WaRP-D dataset but to 5 "supercategories": bottle (including glass), cardboard, detergent, cans and canisters. The first three supercategories include 20, 2 and 4 subcategories, respectively, and independent classifiers are trained for them. Their feature maps can be used to generate class activation maps and then segmentation masks without additional model training or supervision.
In the second option (Figure 1,b), the neural network detects objects of all 28 categories at once, and the predicted classes are then refined by an additional classifier. Its class activation maps can also be used for weakly supervised segmentation. The second option is closer to the industrial application of neural networks, where the modularity of the solution is important.
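The grouping used by the first option can be sketched as a mapping from the 28 class names of Table 2 to the 5 supercategories, followed by a dispatch of each detected crop to the classifier of its group. This is a minimal sketch: the helper names supercategory and refine, and the representation of classifiers as plain callables, are hypothetical and only illustrate the data flow, not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

# The 28 class names as listed in Table 2.
WARP_CLASSES = [
    "bottle-blue", "bottle-green", "bottle-dark", "bottle-milk",
    "bottle-transp", "bottle-multicolor", "bottle-yogurt",
    "bottle-blue-full", "bottle-transp-full", "bottle-dark-full",
    "bottle-green-full", "bottle-multicolorv-full", "bottle-milk-full",
    "bottle-oil", "bottle-oil-full", "bottle-blue5l", "bottle-blue5l-full",
    "glass-transp", "glass-dark", "glass-green",
    "juice-cardboard", "milk-cardboard",
    "detergent-white", "detergent-color", "detergent-transparent",
    "detergent-box", "canister", "cans",
]


def supercategory(name: str) -> str:
    """Map a fine-grained class name to one of the 5 supercategories."""
    if name.startswith(("bottle-", "glass-")):
        return "bottle"  # the bottle group includes glass
    if name.endswith("-cardboard"):
        return "cardboard"
    if name.startswith("detergent-"):
        return "detergent"
    return name  # "cans" and "canister" are single-class groups


GROUPS: Dict[str, List[str]] = {}
for cls in WARP_CLASSES:
    GROUPS.setdefault(supercategory(cls), []).append(cls)


def refine(detections: List[Tuple[str, object]],
           classifiers: Dict[str, Callable[[object], str]]) -> List[str]:
    """Send each (group, crop) pair to the classifier of its group.
    Groups without a classifier keep the detector's label."""
    labels = []
    for group, crop in detections:
        classifier = classifiers.get(group)
        labels.append(classifier(crop) if classifier else group)
    return labels


# Hypothetical stand-in classifier that always answers "bottle-dark".
refined = refine([("bottle", "crop0"), ("cans", "crop1")],
                 {"bottle": lambda crop: "bottle-dark"})
```

With this mapping the bottle, cardboard and detergent groups contain 20, 2 and 4 subcategories, matching the counts given above, while cans and canister pass through unchanged.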
The detector is trained separately in a supervised manner on the WaRP-D dataset. As base models, we investigate the fast one-stage models YOLOv3 [19], YOLOv5 [23], YOLOX [54], CenterNet [26] and TOOD [25], the two-stage approaches Faster R-CNN [13], Dynamic R-CNN [16] and Sparse R-CNN [55], and the transformer architecture D-DETR [56].
As base classification models, we study both architectures that have become classical (ResNet [57], DenseNet [58], MobileNet [59], EfficientNet [60], ResNeXt [61]) and more modern neural networks: ConvNeXt [62], Vision Transformer [63], Data-Efficient Image Transformers [64], Swin Transformer [65], ReXNet [66] and RepVGG [67].
The article also explores two training cases for the proposed hierarchical detector. In the first case, the base detector and the classifier are trained independently of each other: the base detector is trained on images with ground truth (GT) annotation from WaRP-D, and the classifier is trained on crops from these images included in the WaRP-C sample. In the second case, the classifier is trained on crops obtained by running the base detector on the WaRP-D training set, and the class labels for the crops are assigned based on the intersection over union (IoU) with the boxes from the original GT annotation.
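The label assignment in the second training case can be illustrated with a short sketch: each predicted box receives the label of the ground truth box with which it has the highest IoU. The function names, the (x1, y1, x2, y2) box format and the minimum IoU of 0.5 are assumptions made for this example; the text does not state which threshold was used.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def assign_labels(pred_boxes: Sequence[Box], gt_boxes: Sequence[Box],
                  gt_labels: Sequence[str],
                  min_iou: float = 0.5) -> List[Optional[str]]:
    """Give each predicted box the label of its best-overlapping GT box.
    Predictions whose best IoU is below min_iou get None (no label)."""
    assigned: List[Optional[str]] = []
    for pred in pred_boxes:
        best_iou, best_label = 0.0, None
        for gt, label in zip(gt_boxes, gt_labels):
            overlap = iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_label = overlap, label
        assigned.append(best_label if best_iou >= min_iou else None)
    return assigned


# One prediction matches the GT box, the other overlaps nothing.
crop_labels = assign_labels(
    pred_boxes=[(0, 0, 10, 10), (50, 50, 60, 60)],
    gt_boxes=[(1, 1, 11, 11)],
    gt_labels=["bottle-dark"],
)
```

Crops that receive no label could be discarded or treated as background when building the classifier's training set; that choice is also an assumption of this sketch.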
As for weakly supervised segmentation, we explore popular methods for constructing class activation maps: Grad-CAM [68] and its modification mGrad-CAM [69], which does not use average pooling, Layer CAM [70] and CAMERAS [71], as well as a new unsupervised approach, CCAM [72], which uses contrastive learning. To move from class activation maps to segmentation masks, we use the algorithm proposed by the authors of the current article in [69].
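As an illustration of how a class activation map can be turned into a mask, the sketch below computes a standard Grad-CAM map from the feature maps of a layer and the gradients of the class score with respect to them, and then binarizes it. The thresholding step is a simple stand-in chosen for this example and is not the algorithm of [69]; the function names and the 0.5 threshold are likewise assumptions.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Standard Grad-CAM: weight each channel by the spatial mean of its
    gradient, sum the weighted channels and apply ReLU."""
    channels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(channels)
    ]
    return [
        [
            max(0.0, sum(weights[k] * activations[k][i][j]
                         for k in range(channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]


def cam_to_mask(cam: Map2D, threshold: float = 0.5) -> List[List[int]]:
    """Binarize a CAM: pixels at or above threshold * max become 1.
    This thresholding is an assumed stand-in, not the method of [69]."""
    peak = max(max(row) for row in cam)
    if peak <= 0:
        return [[0 for _ in row] for row in cam]
    return [[1 if value / peak >= threshold else 0 for value in row]
            for row in cam]


# Two 2x2 channels: the first has a positive mean gradient, the second
# a negative one, so only the first channel contributes after ReLU.
activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(activations, gradients)
mask = cam_to_mask(cam)
```

In practice the map is computed on a low-resolution feature map and upsampled to the crop size before thresholding; the mGrad-CAM variant mentioned above differs in that it omits the averaging of gradients.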
5. Experiments
5.1. Waste detection
Our approach shows competitive results compared to other popular detection architectures such as Faster R-CNN [13], Deformable DETR [56] and TOOD [25]. We performed many experiments to achieve the best results and to find the most effective hyperparameters and suitable data augmentations. Many of the experiments were carried out with the state-of-the-art real-time models YOLOX and YOLOv5.
We performed experiments on the WaRP-D dataset. Each image was annotated with bounding boxes. There was a significant overfitting problem when training our YOLO models, which was solved by using an effective set of augmentations. The highest impact was obtained with mosaic augmentation [73]. MixUp was set to 50%, and we also used 90-degree rotation and resizing to 448 × 832 (height × width). Keeping mosaic until about the middle of training and then turning it off gave a huge leap in metrics: the images become easier for the model to perceive, and as a consequence mAP50 (mean average precision for boxes with an intersection over union above 50%) immediately increases by 15%.
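For reference, the average precision behind mAP50 can be sketched as follows. Detections of one class are sorted by confidence, each is marked as a true positive if it matches a ground truth box with IoU above 0.5, and the area under the resulting precision-recall curve is computed. The sketch uses all-point interpolation, which is one common variant; the evaluation protocol actually used for the tables is not stated in the text, and the function names are assumptions.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float],
                      is_true_positive: Sequence[bool],
                      num_gt: int) -> float:
    """Area under the precision-recall curve (all-point interpolation)
    for one class. is_true_positive[i] says whether detection i matched
    a ground truth box at the chosen IoU threshold (0.5 for AP50)."""
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions: List[float] = []
    recalls: List[float] = []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap: Sequence[float]) -> float:
    """mAP is the mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Three detections, two ground truth objects: hit, miss, hit.
ap50 = average_precision([0.9, 0.8, 0.7], [True, False, True], num_gt=2)
```

For this toy example the precision envelope is 1, 2/3, 2/3 at recalls 0.5, 0.5, 1.0, which gives an AP of 5/6. The mAP50..95 metric averages the same quantity over IoU thresholds from 0.5 to 0.95.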
We trained YOLOv5 using SGD with Nesterov momentum, setting the initial learning rate to 10^-2, weight decay to 5e-4 and initial momentum to 0.937. A linear scheduler was used, with warmup for 3 epochs. Training was performed on a Tesla V100 32GB.
Figure 4 presents a visualization of the results of the proposed hierarchical detector. We used colored translucent rectangles to show ground truth bounding boxes and colored frames to show detections. A list of ground truth categories is placed in the upper right corner. The right part of Figure 4 contains the most common "mistakes": a false positive (milk-cardboard), an omission (glass-dark in the upper right corner) and a combination of these two errors (bottle-dark-full instead of glass-dark).
Figure 5 contains several more examples of waste detection on the test sample of the proposed WaRP-D dataset. The two images in the bottom row of the figure illustrate cases with errors.
Referring to Table 4, we see that the best mAP50 result on the 28 classes of the WaRP-D dataset is obtained by TOOD, which is slightly inferior to the YOLOv5 model in mAP50..95. The most problematic classes, with the lowest metrics, are bottle-oil-full, juice-cardboard and bottle-multicolor. We can see that two-stage detector architectures behave unpredictably: some of them achieve quite high accuracy on a given category while others perform very poorly on it, unlike one-stage models, which perform well on every class. The chosen YOLOv5 model shows the best inference speed and a correspondingly good detection quality. The transformer detector D-DETR is much slower than the YOLO models, although it outperforms the other models in terms of the detection metric mAP50 on 5 categories (see Table 3).
It should be noted that the fast YOLOX-m model also shows consistently
high quality indicators, and is able to recognize objects of 5 categories with the
best quality in terms of the mAP50..95 metric.
Figure 5: Examples of hierarchical detector work on the test sample of the WaRP-D Dataset

Table 3: Detection quality on 5 joined categories of WaRP-D Dataset for state-of-the-art detectors

| Detector | AP50 bottle, % | AP50 cardboard, % | AP50 detergent, % | AP50 canister, % | AP50 cans, % | mAP50, % | mAP50..95, % |
|---|---|---|---|---|---|---|---|
| YOLOV5x | 79,6 | 46,4 | 48,8 | 56,8 | 55,7 | 57,5 | 43,8 |
| YOLOX-m | 80,9 | 54,6 | 48,6 | 46,6 | 62,9 | 58,3 | 44,5 |
| YOLOX-l | 80,9 | 52,3 | 47,7 | 44,8 | 59,5 | 57,0 | 43,5 |
| D-DETR | 83,0 | 48,7 | 50,9 | 57,0 | 54,7 | 58,9 | 42,8 |
| Dynamic-RCNN | 77,2 | 40,9 | 51,2 | 50,3 | 55,2 | 55,0 | 40,4 |
| Faster-RCNN | 75,1 | 39,5 | 47,6 | 36,3 | 47,6 | 48,0 | 31,8 |
| TOOD | 78,5 | 41,9 | 51,0 | 46,2 | 57,5 | 55,0 | 41,4 |
| YOLOV3 | 75,0 | 37,6 | 43,4 | 44,4 | 49,8 | 50,0 | 32,5 |
| CenterNet | 76,2 | 36,3 | 37,7 | 38,3 | 52,1 | 48,1 | 34,7 |
| ATSS | 79,0 | 41,6 | 48,9 | 51,5 | 48,4 | 53,9 | 40,6 |
| Sparse-RCNN | 75,0 | 37,3 | 40,3 | 35,7 | 51,9 | 48,1 | 33,0 |

Table 4: Detection quality on 28 categories of WaRP-D Dataset for state-of-the-art detectors

| Detector | mAP50 bottle, % | mAP50 cardb., % | mAP50 deterg., % | AP50 canister, % | AP50 cans, % | mAP50, % | mAP50..95, % | FPS |
|---|---|---|---|---|---|---|---|---|
| YOLOV3 | 44,8 | 12,9 | 21,1 | 27,3 | 20,6 | 37,6 | 26,0 | 52,9 |
| YOLOV5-x | 62,6 | 41,1 | 38,6 | 32,0 | 58,1 | 56,4 | 46,6 | 66,6 |
| YOLOX-m | 63,5 | 39,7 | 45,0 | 54,0 | 59,3 | 58,6 | 45,7 | 64,9 |
| YOLOX-l | 51,8 | 27,9 | 27,7 | 28,1 | 45,2 | 45,6 | 34,6 | 52,4 |
| D-DETR | 60,1 | 41,3 | 44,3 | 43,5 | 55,3 | 55,7 | 40,3 | 13,9 |
| Dynamic-RCNN | 61,6 | 35,8 | 44,9 | 39,7 | 55,2 | 56,4 | 38,3 | 33,8 |
| Faster-RCNN | 41,6 | 24,3 | 31,0 | 24,5 | 33,4 | 56,4 | 38,0 | 35,3 |
| TOOD | 65,8 | 34,5 | 47,2 | 52,7 | 61,5 | 60,2 | 46,5 | 28,9 |
| CenterNet | 56,0 | 24,4 | 36,4 | 30,5 | 56,2 | 50,1 | 37,6 | 9,1 |
| ATSS | 62,6 | 32,3 | 41,9 | 52,9 | 51,5 | 56,7 | 43,0 | 38,5 |
| Sparse-RCNN | 50,9 | 24,8 | 30,2 | 32,2 | 45,0 | 45,2 | 32,0 | 27,3 |

5.2. Classification
As a part of the experiment, several types of classifiers were trained on the
WaRP-C dataset. Architecture types and training results are shown in Table 5.
To improve model quality, image augmentation was applied. The following
augmentations were used: resizing the image with padding to preserve the
original aspect ratio; adding a partially covering mask that hides 20% of the
image (to help the CAM method localize as many pixels of the object as
possible); random shifts, scaling and rotations with 80% probability (shift
limit 0.2, scale limit 0.2, rotate limit 90 degrees); random changes in
brightness and contrast with 50% probability (brightness limit 0.1); random
color shifts for each RGB channel with 50% probability (color shift limit 15);
random vertical and horizontal flips with 50% probability.
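The first augmentation, resizing with padding so that the aspect ratio is preserved, can be sketched as a small geometry helper. The square target size of 224 pixels and the symmetric placement of the padding are assumptions made for this illustration; the text does not state them.

```python
def letterbox_geometry(width, height, target=224):
    """Return (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).

    The image is scaled so that its longer side equals `target`, and the
    shorter side is padded symmetrically up to `target` (an assumed layout).
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_w, pad_h = target - new_w, target - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2
    return new_w, new_h, pad_left, pad_top, pad_w - pad_left, pad_h - pad_top
```

For a 400 × 200 crop, for example, the helper keeps the 2:1 ratio by resizing to 224 × 112 and adding 56 rows of padding above and below.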
Table 5: Classifier module quality on WaRP-C Dataset

| Classifier | Mean recall bottle, % | Mean recall cardboard, % | Mean recall detergent, % | Recall canister, % | Recall cans, % | Accuracy, % | FPS |
|---|---|---|---|---|---|---|---|
| ViT (bs8) | 49,1 | 54,4 | 60,8 | 46,7 | 70,4 | 51,5 | 100 |
| DEiT (bs8) | 64,6 | 70,1 | 59,9 | 90,0 | 84,7 | 68,0 | 120 |
| RepVGG (bs8) | 55,0 | 70,0 | 51,9 | 83,3 | 92,9 | 59,1 | 98 |
| SWiN (bs8) | 58,8 | 66,3 | 42,7 | 43,3 | 93,9 | 64,8 | 77 |
| ResNet18 (bs8) | 63,84 | 74,18 | 53,44 | 58,96 | 91,57 | 67,59 | 230 |
| ResNet18 (bs32) | 67,6 | 78,9 | 65,15 | 60,0 | 93,0 | 74,2 | 230 |
| MobileNetv3 (bs8) | 70,7 | 79,3 | 69,9 | 66,7 | 94,9 | 72,8 | 107 |
| MobileNetv3 (bs32) | 75,7 | 82,2 | 70,3 | 73,3 | 92,9 | 77,4 | 107 |
| DenseNet121 (bs8) | 75,5 | 79,6 | 71,0 | 90,0 | 96,9 | 78,3 | 42 |
| DenseNet121 (bs32) | 76,8 | 84,1 | 72,9 | 63,3 | 82,7 | 76,6 | 42 |
| RexNet (bs8) | 74,8 | 78,3 | 72,4 | 90,0 | 94,9 | 76,8 | 74 |
| RexNet (bs32) | 79,5 | 84,0 | 74,5 | 80,0 | 93,9 | 80,1 | 74 |
| EfficientNet-B5 (bs8) | 78,8 | 83,2 | 67,2 | 96,7 | 95,9 | 79,8 | 38 |
| EfficientNet-B5 (bs32) | 79,6 | 83,3 | 76,3 | 86,7 | 95,9 | 81,9 | 38 |
| ResNeXT (bs8) | 78,8 | 80,9 | 72,7 | 86,7 | 93,9 | 79,0 | 82 |
| ResNeXT (bs32) | 76,9 | 80,9 | 77,4 | 83,3 | 95,9 | 79,5 | 82 |
| ConvNeXT(28) (bs8) | 73,7 | 84,1 | 69,1 | 73,3 | 95,9 | 78,8 | 48 |
| ConvNeXT(28) (bs32) | 77,0 | 83,0 | 75,0 | 86,7 | 98,0 | 81,8 | 48 |
| ConvNeXT(20) (bottle) | 75,7 | - | - | - | - | 75,4 | 48 |
| ConvNeXT(2) (cardboard) | - | 90,5 | - | - | - | 90,1 | 48 |
| ConvNeXT(4) (detergent) | - | - | 91,7 | - | - | 92,4 | 48 |

The following models from the timm deep learning library [74] were selected for
the experiments: ResNet [57] with 12M params and 71 layers; ConvNeXT_tiny [62]
with 28M params and 202 layers; DenseNet121 [58] with 8M params and 433 layers;
EfficientNet-B5 [60] with 30M params and 551 layers; ResNeXT50_32x4d [61] with
25M params and 177 layers; the transformers ViT_small_resnet50d_s16_224 [63]
with 57M params and 277 layers and DEiT_tiny_patch16_224 [64] with 6M params and
188 layers; RepVGG_a2 [67] with 28M params and 306 layers; SWiN_tiny [65] with
28M params and 217 layers; MobileNetv3_large_100 [59] with 5M params and 195
layers; ReXNet_100 [66] with 5M params and 313 layers.
Each model was initialized with pre-trained weights, which were fine-tuned
during the experiments. Cross Entropy Loss was chosen as the loss function. The
models were trained for 40 epochs with an initial learning rate of 0.001, which
was decreased during training whenever the quality metric on the validation
data did not improve over several epochs.
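The learning rate policy described above (reduce the rate when the validation metric stalls) can be sketched in plain Python as follows. Only the initial rate of 0.001 comes from the text; the decay factor and the patience are hypothetical, since the exact values are not given.

```python
class PlateauSchedule:
    """Lower the learning rate when a validation metric stops improving."""

    def __init__(self, lr=1e-3, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor      # hypothetical decay multiplier
        self.patience = patience  # hypothetical number of stalled epochs tolerated
        self.best = float("-inf")
        self.stalled = 0

    def step(self, metric):
        """Update with the latest validation metric (higher is better)."""
        if metric > self.best:
            self.best = metric
            self.stalled = 0
        else:
            self.stalled += 1
            if self.stalled > self.patience:
                # No improvement for more than `patience` epochs: decay the rate.
                self.lr *= self.factor
                self.stalled = 0
        return self.lr
```

Calling `step` once per epoch with the validation accuracy returns the learning rate to use for the next epoch.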
The training and the test datasets had a similar unbalanced class distribution,
so balancing methods such as equal inter-class sampling and Weighted Cross
Entropy Loss did not significantly improve the results compared to the
conventional Cross Entropy Loss.
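For reference, class weights for a weighted cross-entropy loss are often derived from inverse class frequencies. The sketch below shows one common choice; the weighting formula and the function name are assumptions, because the text does not specify how the weights were computed.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    # Rare classes receive weights above 1, frequent classes below 1.
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}
```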
From the quality metrics obtained on the WaRP-C dataset (Table 5), we can see
that the highest quality scores are achieved by the ConvNeXt and
EfficientNet-B5 models, while the ResNet-18 model is the fastest one (230 FPS).
ConvNeXt is also faster than EfficientNet-B5 (48 vs. 38 FPS). So, ConvNeXt-tiny
is the most promising for use as a part of the hierarchical detector. ResNet-18
can also be used if we need the best possible detector speed.
5.3. Quality of hierarchical waste detection
The quality indicators of various options for implementing a hierarchical
approach to waste detection were analyzed. The results are shown in Table 6.
The table shows that independent training of YOLOX-m on the 5 detected classes
and of three ConvNeXt-tiny models for bottles (20 categories), detergents (4
categories) and cardboards (2 categories) does not improve the mAP50 metric
(the first scheme of the approach, demonstrated in Figure 1,a): mAP50 drops
from 58,6% for the plain YOLOX-m(28) detector to 52,7%. So, training three
independent classifiers leads to a significant deterioration in the quality
metrics, and such an implementation of the hierarchical detector is
inappropriate.
At the same time, the mAP50 and mAP50..95 metrics improve (to 59,6% and 46,7%,
respectively) for the option shown in Figure 1,b, in which the YOLOX-m detector
is combined with a ConvNeXt-tiny model trained on all 28 categories.
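The routing logic of the first scheme (a detector over 5 classes followed by separate classifiers for bottles, detergents and cardboards) can be sketched as follows. The `detector` and `classifiers` callables and their signatures are hypothetical placeholders for trained models; the sketch only illustrates how coarse detections could be refined, not the authors' implementation.

```python
def hierarchical_detect(image, detector, classifiers):
    """Refine coarse detections with a per-supercategory classifier.

    detector(image) -> list of (box, supercategory, score)
    classifiers[supercategory](image, box) -> fine-grained category name
    Both callables are hypothetical placeholders for trained models.
    """
    results = []
    for box, supercategory, score in detector(image):
        classify = classifiers.get(supercategory)
        # Supercategories without a classifier (cans, canister) keep their label.
        label = classify(image, box) if classify else supercategory
        results.append((box, label, score))
    return results
```

In the second scheme, the dictionary of three classifiers would be replaced by a single classifier trained on all 28 categories.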
5.4. Weakly supervised waste segmentation
After classification, CAM (Class Activation Map) methods are applied, which
allow one to build classifier attention maps for a given image with respect to
a certain class. These maps are subsequently converted into binary segmentation
maps using the algorithm described in [69]. To assess the quality of the
methods, a standard semantic segmentation metric, mIoU (mean Intersection
over Union), was used. The fastest model, ResNet-18, was chosen as the base
model for evaluating these methods. The obtained quality scores
are listed in Table 7. The visualization of the generated class activation maps
and binary object masks is shown in Figure 6.
The best quality is shown by the unsupervised approach CCAM(5) based on
contrastive learning, which is trained for 5 different "supercategories".
CCAM(28), trained on the combined 28 categories, is slightly inferior to it.
Among the rest of the methods, the best approach is CAMERAS, which uses a
classifier directly trained on 28 classes.
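The two steps described above, binarizing an activation map and scoring the mask with IoU, can be illustrated with a minimal pure-Python sketch. The fixed relative threshold is an assumption used only for illustration; the binarization algorithm of [69] is not reproduced here.

```python
def binarize_cam(cam, relative_threshold=0.5):
    """Turn a 2D activation map (list of rows) into a 0/1 mask.

    Pixels at or above relative_threshold * max(cam) become foreground.
    The threshold value is a hypothetical choice.
    """
    peak = max(max(row) for row in cam)
    cutoff = relative_threshold * peak
    return [[1 if peak > 0 and value >= cutoff else 0 for value in row]
            for row in cam]


def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    return inter / union if union else 1.0
```

The mIoU values of the kind reported in Table 7 are then averages of such per-image or per-class IoU scores.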
5.5. Energy consumption and environmental impact of model training
Table 8 shows the energy consumption and equivalent CO2 emissions
estimated during training of the neural network models. It includes several
variants of the proposed hierarchical waste detector. The number in brackets
after a model's name indicates the number of waste categories predicted at the
output of the neural network. We determined the carbon emissions using the
open-source library eco2AI [75]. The results show that the total equivalent
carbon emission while training the highest quality hierarchical detector,
YOLOX-m(28) + ConvNeXt(28) + CCAM, is only about 0.15 kg, with an energy
consumption of about 0.48 kWh.
Thus, the developed solution, on the one hand, allows achieving high quality
metrics, and, on the other hand, demonstrates low energy consumption and low
negative environmental impact, which is essential for its industrial application.
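The totals quoted above can be reproduced by summing the corresponding rows of Table 8. The short check below assumes that the CCAM component corresponds to the CCAM (28) row, the only CCAM entry in the table.

```python
# Training cost per component, copied from Table 8: (energy in kWh, CO2 in kg).
TRAINING_COST = {
    "YOLOX-m (28)": (0.2612, 0.0810),
    "ConvNext (28)": (0.2106, 0.065),
    "CCAM (28)": (0.0126, 0.0039),  # assumed to be the CCAM variant meant in the text
}


def total_cost(components):
    """Sum energy and emissions over the listed components."""
    energy = sum(TRAINING_COST[name][0] for name in components)
    co2 = sum(TRAINING_COST[name][1] for name in components)
    return energy, co2


energy_kwh, co2_kg = total_cost(TRAINING_COST)
print(f"{energy_kwh:.2f} kWh, {co2_kg:.2f} kg CO2")  # prints "0.48 kWh, 0.15 kg CO2"
```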
Table 6: Hierarchical detector metrics

| Per category AP, % | YOLOX-m(28) | Hierarchical detector: YOLOX-m(5) + ConvNeXt(20-4-2) | Hierarchical detector: YOLOX-m(28) + ConvNeXt(28) |
|---|---|---|---|
| bottle-blue-full | 66,8 | 51,6 | 66,8 |
| bottle-transp-full | 70,8 | 64,4 | 70,6 |
| bottle-dark-full | 85,5 | 77,7 | 79,9 |
| bottle-green-full | 85,3 | 83,7 | 85,6 |
| bottle-multicolorv-full | 77,4 | 60,9 | 78,5 |
| bottle-blue5l-full | 84,5 | 67,3 | 84,5 |
| bottle-milk-full | 88,6 | 81,3 | 93,9 |
| bottle-oil-full | 44,5 | 35,1 | 58,9 |
| glass-transp | 53,4 | 54,7 | 55,3 |
| glass-dark | 74,7 | 73,7 | 74,7 |
| glass-green | 69,2 | 58,8 | 72,3 |
| bottle-blue5l | 64,2 | 51,9 | 64,5 |
| bottle-blue | 59,3 | 48,5 | 58,3 |
| bottle-green | 74,5 | 63,7 | 74,1 |
| bottle-dark | 73,1 | 72,0 | 74,1 |
| bottle-milk | 46,6 | 41,8 | 46,4 |
| bottle-transp | 54,6 | 40,4 | 53,0 |
| bottle-multicolor | 36,0 | 30,4 | 31,8 |
| bottle-oil | 22,2 | 35,1 | 23,6 |
| bottle-yogurt | 37,9 | 31,0 | 40,5 |
| juice-cardboard | 35,0 | 40,0 | 35,9 |
| milk-cardboard | 44,5 | 38,7 | 44,2 |
| cans | 59,3 | 52,9 | 61,2 |
| canister | 54,0 | 42,6 | 55,1 |
| detergent-color | 43,0 | 34,8 | 43,1 |
| detergent-transparent | 37,0 | 34,4 | 36,8 |
| detergent-box | 53,3 | 68,0 | 59,4 |
| detergent-white | 46,7 | 47,6 | 46,6 |
| mAP50 | 58,6 | 52,7 | 59,6 |
| mAP50..95 | 45,7 | 40,4 | 46,7 |
Figure 6: The results of various weakly supervised waste segmentation approaches. For each
image, the generated class activation maps and binarized masks based on them are shown.
Table 7: Weakly supervised segmentation quality, %

| Method | mIoU bottle | mIoU cardb. | IoU cans | IoU canister | mIoU detergent | mIoU all |
|---|---|---|---|---|---|---|
| CCAM(5) | 64,78 | 71,73 | 63,01 | 65,28 | 69,31 | 65,88 |
| CCAM(28) | 62,48 | 66,54 | 69,18 | 69,11 | 65,25 | 63,64 |
| CAMERAS | 55,63 | 59,40 | 57,83 | 61,00 | 60,51 | 56,87 |
| GradCAM | 55,01 | 60,39 | 38,20 | 59,31 | 58,22 | 55,41 |
| LayerCAM | 60,19 | 63,71 | 44,76 | 66,77 | 60,30 | 60,14 |
| mGradCAM | 52,87 | 59,40 | 39,73 | 60,29 | 55,31 | 53,48 |

Table 8: Power consumption and CO2 emissions for neural network training on WaRP Dataset
using Tesla V100-SXM3-32GB

| Model | Power consumption, kWh | CO2 emissions, kg | Training duration, s |
|---|---|---|---|
| YOLOX-m (28) | 0.2612 | 0.0810 | 4111.37 |
| YOLOX-m (5) | 0.2557 | 0.0793 | 4050.88 |
| YOLOv5-X (28) | 0.3800 | 0.1178 | 4392.13 |
| ConvNext (28) | 0.2106 | 0.065 | 2363.79 |
| ConvNext (20) | 0.1580 | 0.049 | 1792.71 |
| ConvNext (4) | 0.0150 | 0.004 | 209.04 |
| ConvNext (2) | 0.0220 | 0.007 | 292.03 |
| ResNet18 (28) | 0.0492 | 0.0152 | 1394.42 |
| CCAM (28) | 0.0126 | 0.0039 | 247.41 |
6. Results of model integration
After the model integration, experiments on detecting useful waste fractions
were carried out. In each experiment, a short fragment of video from the camera
was recorded. The video was then viewed by an expert, who counted the correctly
and incorrectly recognized objects as well as the missed objects. In total, 3
experiments were made on different days. Each measurement lasted 2 minutes,
which corresponds to 1830 frames, so 5490 frames were analyzed in total.
These statistics were calculated by grouping the 28 classes into 5 more general
classes. Because canisters were not detected during the experiment, this class
is not included in Table 9. It is important to note that the analyzed data
differ considerably from the training dataset, because the camera is located
at the end of the pipeline. Thus, the images captured by this camera have a
different background, lighting conditions, viewing angle and composition of
moving objects. Despite this, according to Table 9 the model shows good
detection and classification results (the F1-score varies from 63% to 81%
across classes), which indicates its high generalization ability.

Table 9: Results of industrial experiment on waste processing complex RT Invest Recycle

| Category | True Detection | False Detection | Missed | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| bottle | 218 | 61 | 55 | 0.78 | 0.80 | 0.79 |
| cardboard | 54 | 21 | 19 | 0.72 | 0.74 | 0.73 |
| detergent | 19 | 13 | 9 | 0.59 | 0.68 | 0.63 |
| cans | 86 | 21 | 19 | 0.80 | 0.82 | 0.81 |
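The precision, recall and F1 values in Table 9 follow directly from the counted true detections, false detections and missed objects. The short script below recomputes them; the dictionary and function names are chosen only for this illustration.

```python
# (true detections, false detections, missed objects) per category, from Table 9.
COUNTS = {
    "bottle": (218, 61, 55),
    "cardboard": (54, 21, 19),
    "detergent": (19, 13, 9),
    "cans": (86, 21, 19),
}


def detection_scores(true_pos, false_pos, missed):
    """Precision, recall and F1 from raw detection counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for category, counts in COUNTS.items():
    p, r, f1 = detection_scores(*counts)
    print(f"{category}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Rounded to two decimals, the output matches the Precision, Recall and F1 columns of Table 9.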
7. Conclusion
In this study we showed that the problem of waste recognition on the conveyor
of recycling plants can be successfully tackled with various architectures of
deep neural networks, even when they are integrated into in-plant operating
processes.
At the same time, we found that there were no suitable open datasets containing
the required categories of recyclable waste. The created specialized WaRP
dataset is a unique and diverse tool that makes it possible to train and test
neural network methods for detection (WaRP-D set), classification (WaRP-C set)
and segmentation (WaRP-S set) of recyclable waste under poor lighting
conditions, overlapping and deformations.
The proposed hierarchical approach to waste detection made it possible to
improve the quality of the basic pre-trained models, and also to carry out
additional weakly supervised object segmentation with acceptable accuracy. Such
a solution is practically useful, since industrial applications require
constantly expanding and re-labeling the existing dataset in order to provide
the best recognition quality for new domains (different conveyors and plants).
For weakly supervised segmentation in the formulation considered, it is
sufficient to label objects in images with bounding boxes and object categories.
Moreover, the CCAM algorithm based on contrastive learning does not need
category information at all. Such labeling is much easier, faster and cheaper
to produce than pixel-wise mask annotation.
We also examined the energy consumption and environmental impact of the
proposed modular hierarchical detector and noted its low CO2 emissions. This
indicates the environmental friendliness of the developed solution.
The experiment with the developed approach at the waste processing complex
RT Invest Recycle confirmed its applicability at the conveyor site after
the manual waste sorting. The neural network detector of recyclable objects
(bottles, cardboards, detergents, cans) missed by people showed acceptable
precision and recall of recognition. This indicates its superiority over manual
conveyor monitoring, which is monotonous and harmful to human health.
Promising topics for further development of the study are the integration of
few-shot learning methods for working with rare categories of objects and
improving the quality of waste detection by using a video sequence rather than
single images.
For providing exclusive data for research, assistance in annotation, prompt
consultation on the specifics of the technological process of sorting waste and
assistance in integrating the model into production, the authors express their
gratitude to the waste processing complex RT Invest Recycle and its director
Evgeny Komarov, and to the Planetarium One company and its employees: Natalia
Kashirina, Vladislav Makarovsky, Evgeny Yakovlev, Konstantin Roslyakov, Mikhail
Shimusyuk, Valeria Kuznetsova, Yankovskiy Nikita.
Competing interests
The authors declare no competing interests.
[1] C. Zhihong, Z. Hebin, W. Yanbo, L. Binyan, L. Yu, A vision-based robotic
grasping system using deep learning for garbage sorting, in: 2017 36th
Chinese control conference (CCC), IEEE, 2017, pp. 11223–11226.
[2] D. Ni, Z. Xiao, M. K. Lim, Machine learning in recycling business: an
investigation of its practicality, benefits and future trends, Soft Computing
25 (12) (2021) 7907–7927.
[3] M. Koskinopoulou, F. Raptopoulos, G. Papadopoulos, N. Mavrakis,
M. Maniadakis, Robotic waste sorting technology: Toward a vision-based
categorization system for the industrial robotic separation of recyclable waste,
IEEE Robotics & Automation Magazine 28 (2) (2021) 50–60.
[4] Z.-Q. Zhao, P. Zheng, S.-t. Xu, X. Wu, Object detection with deep
learning: A review, IEEE transactions on neural networks and learning systems
30 (11) (2019) 3212–3232.
[5] J. Bobulski, M. Kubanek, Deep learning for plastic waste classification
system, Applied Computational Intelligence and Soft Computing 2021.
[6] K. C. Hoornweg Daniel, Bhada-Tata Perinaz, Environment: Waste production
must peak this century, Nature. doi:10.1038/502615a.
[7] Y. Wu, X. Shen, Q. Liu, F. Xiao, C. Li, A garbage detection and classifica-
tion method based on visual scene understanding in the home environment,
Complexity 2021 (2021) 1–14. doi:10.1155/2021/1055604.
[8] Y. Wang, X. Zhang, Autonomous garbage detection for intelligent urban
management, MATEC Web of Conferences 232 (2018) 01056. doi:10.
[9] X. Zhang, Y. Gao, G. Xiao, B. Feng, W. Chen, A real-time garbage truck
supervision and data statistics method based on object detection, Wireless
Communications and Mobile Computing 2020 (2020) 1–9. doi:10.1155/
[10] H. Deng, D. Ergu, F. Liu, B. Ma, Y. Cai, An embeddable algorithm for au-
tomatic garbage detection based on complex marine environment, Sensors
21 (2021) 6391. doi:10.3390/s21196391.
[11] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies
for accurate object detection and semantic segmentation, Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern
[12] R. B. Girshick, Fast r-cnn, 2015 IEEE International Conference on Com-
puter Vision (ICCV) (2015) 1440–1448.
[13] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object
detection with region proposal networks, IEEE Transactions on Pattern
Analysis and Machine Intelligence 39. doi:10.1109/TPAMI.2016.2577031.
[14] J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully
convolutional networks, 2016.
[15] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, D. Lin, Libra r-cnn: Towards
balanced learning for object detection, 2019, pp. 821–830. doi:10.1109/
[16] H. Zhang, H. Chang, B. Ma, N. Wang, X. Chen, Dynamic r-cnn: Towards
high quality object detection via dynamic training, in: European conference
on computer vision, Springer, 2020, pp. 260–275.
[17] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Uni-
fied, real-time object detection, 2016, pp. 779–788. doi:10.1109/CVPR.
[18] J. Redmon, A. Farhadi, Yolo9000: Better, faster, stronger, 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
[19] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, ArXiv
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, A. C.
Berg, Ssd: Single shot multibox detector, in: ECCV, 2016.
[21] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense
object detection, 2017 IEEE International Conference on Computer Vision
(ICCV) (2017) 2999–3007.
[22] A. Bochkovskiy, C.-Y. Wang, H.-y. Liao, Yolov4: Optimal speed and
accuracy of object detection.
[23] G. Yang, J. Jin, Q. Lei, Y. Wang, J. Zhou, Z. Sun, X. Li, W. Wang,
Garbage classification system with yolov5 based on image recognition, in:
2021 IEEE 6th International Conference on Signal and Image Processing
(ICSIP), IEEE, 2021, pp. 11–18.
[24] S. Tammina, Transfer learning using vgg-16 with deep convolutional
neural network for classifying images, International Journal of Scientific and
Research Publications (IJSRP) 9 (2019) p9420. doi:10.29322/IJSRP.9.
[25] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, W. Huang, Tood: Task-aligned
one-stage object detection, in: 2021 IEEE/CVF International Conference
on Computer Vision (ICCV), IEEE Computer Society, 2021, pp. 3490–3499.
[26] X. Zhou, D. Wang, P. Krähenbühl, Objects as points, arXiv preprint
[27] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko,
End-to-end object detection with transformers, in: European conference
on computer vision, Springer, 2020, pp. 213–229.
[28] S. Zhang, C. Chi, Y. Yao, Z. Lei, S. Z. Li, Bridging the gap between
anchor-based and anchor-free detection via adaptive training sample selec-
tion, arXiv preprint arXiv:1912.02424.
[29] C. Bircanoğlu, M. Atay, F. Beşer, Ö. Genç, M. A. Kızrak, Recyclenet:
Intelligent waste sorting using deep neural networks, in: 2018 Innovations
in intelligent systems and applications (INISTA), IEEE, 2018, pp. 1–7.
[30] Q. Zhang, Q. Yang, X. Zhang, Q. Bao, J. Su, X. Liu, Waste image
classification based on transfer learning and convolutional neural network, Waste
Management 135 (2021) 150–157.
[31] W.-L. Mao, W.-C. Chen, C.-T. Wang, Y.-H. Lin, Recycling waste classifi-
cation using optimized convolutional neural network, Resources, Conserva-
tion and Recycling 164 (2021) 105132.
[32] Trashnet dataset,, accessed:
[33] K. Sreelakshmi, S. Akarsh, R. Vinayakumar, K. Soman, Capsule neural net-
works and visualization for segregation of plastic and non-plastic wastes,
in: 2019 5th international conference on advanced computing & communi-
cation systems (ICACCS), IEEE, 2019, pp. 631–636.
[34] Y. Chu, C. Huang, X. Xie, B. Tan, S. Kamal, X. Xiong, Multilayer hybrid
deep-learning method for waste classification and recycling, Computational
Intelligence and Neuroscience 2018.
[35] Q. Zhang, X. Zhang, X. Mu, Z. Wang, R. Tian, X. Wang, X. Liu, Recyclable
waste image recognition based on deep learning, Resources, Conservation
and Recycling 171 (2021) 105636.
[36] K. Ahmad, K. Khan, A. Al-Fuqaha, Intelligent fusion of deep features for
improved waste classification, IEEE access 8 (2020) 96495–96504.
[37] D. Bashkirova, M. Abdelfattah, Z. Zhu, J. Akl, F. Alladkani, P. Hu,
V. Ablavsky, B. Calli, S. A. Bargal, K. Saenko, Zerowaste dataset: Towards
deformable object segmentation in cluttered scenes, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2022, pp. 21147–21157.
[38] P. F. Proença, P. Simões, Taco: Trash annotations in context for litter
detection, arXiv preprint arXiv:2003.06975.
[39] M. Kraft, M. Piechocki, B. Ptak, K. Walas, Autonomous, onboard
vision-based trash and litter detection in low altitude aerial images collected
by an unmanned aerial vehicle, Remote Sensing 13 (5). doi:10.3390/
[40] J. Bobulski, J. Piatkowski, Pet waste classification method and plastic
waste database-wadaba, in: International conference on image processing
and communications, Springer, 2017, pp. 57–64.
[41] K. M. Bobulski J., Apet waste classification method and plastic waste
database wadaba, Deep Learning for Plastic Waste Classification System,
Applied Computational Intelligence and Soft Computing. doi:10.1155/
[42] F. O. Joan Sosa-Garcia, Glassense-vision dataset, http://www.slipguru., accessed: 2022-06-20.
[43] Waste classification data v2,
sapal6/waste-classification-data-v2, accessed: 2022-06-20.
[44] Waste images from sushi restaurant,
datasets/arthurcen/waste-images-from-sushi-restaurant, accessed: 2022-06-20.
[45] Open litter map,, accessed: 2022-06-20.
[46] Litter dataset,, accessed: 2022-06-20.
[47] Drinking waste classification,
arkadiyhacks/drinking-waste-classification, accessed: 2022-06-20.
[48] waste_pictures,
waste-pictures, accessed: 2022-06-20.
[49] Garbage in images (gini) dataset,
spotgarbage-GINI, accessed: 2022-06-20.
[50] T. Wang, Y. Cai, L. Liang, D. Ye, A multi-level approach to waste object
segmentation, Sensors.
[51] Domestic trash dataset,
Domestic-Trash-Dataset, accessed: 2022-06-20.
[52] Wade dataset,, accessed:
[53] Open images dataset v6,
openimages/web/index.html, accessed: 2022-06-20.
[54] Z. Ge, S. Liu, F. Wang, Z. Li, J. Sun, Yolox: Exceeding yolo series in 2021,
arXiv preprint arXiv:2107.08430.
[55] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka,
L. Li, Z. Yuan, C. Wang, et al., Sparse r-cnn: End-to-end object detection
with learnable proposals, in: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2021, pp. 14454–14463.
[56] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: Deformable
transformers for end-to-end object detection, in: International Conference
on Learning Representations, 2021.
[57] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image
recognition, in: Proceedings of the IEEE conference on computer vision and
pattern recognition, 2016, pp. 770–778.
[58] G. Huang, Z. Liu, G. Pleiss, L. Van Der Maaten, K. Weinberger, Convo-
lutional networks with dense connectivity, IEEE transactions on pattern
analysis and machine intelligence.
[59] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, H. Adam, Mobilenets: Efficient convolutional neural
networks for mobile vision applications, arXiv preprint arXiv:1704.04861.
[60] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional
neural networks, in: International conference on machine learning, PMLR,
2019, pp. 6105–6114.
[61] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual
transformations for deep neural networks, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, 2017, pp. 1492–1500.
[62] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet
for the 2020s, arXiv preprint arXiv:2201.03545.
[63] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image
is worth 16x16 words: Transformers for image recognition at scale, arXiv
preprint arXiv:2010.11929.
[64] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou,
Training data-efficient image transformers & distillation through attention,
in: International Conference on Machine Learning, PMLR, 2021, pp.
[65] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin
transformer: Hierarchical vision transformer using shifted windows, in:
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021,
pp. 10012–10022.
[66] D. Han, S. Yun, B. Heo, Y. Yoo, Rethinking channel dimensions for efficient
model design, in: Proceedings of the IEEE/CVF conference on Computer
Vision and Pattern Recognition, 2021, pp. 732–741.
[67] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun, Repvgg: Making
vgg-style convnets great again, in: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2021, pp. 13733–13742.
[68] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra,
Grad-cam: Visual explanations from deep networks via gradient-based
localization, in: Proceedings of the IEEE international conference on
computer vision, 2017, pp. 618–626.
[69] V. I. Kuznetsov, D. A. Yudin, Neural networks for classification and
unsupervised segmentation of visibility artifacts on monocular camera image,
Optical Memory and Neural Networks (Information Optics).
[70] P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, Y. Wei, Layercam:
Exploring hierarchical class activation maps for localization, IEEE
Transactions on Image Processing 30 (2021) 5875–5888.
[71] M. A. Jalwana, N. Akhtar, M. Bennamoun, A. Mian, Cameras: Enhanced
resolution and sanity preserving class activation mapping for image
saliency, in: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2021, pp. 16327–16336.
[72] J. Xie, J. Xiang, J. Chen, X. Hou, X. Zhao, L. Shen, C2am: Contrastive
learning of class-agnostic activation map for weakly supervised object
localization and semantic segmentation, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2022, pp. 989–
[73] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V.
Le, B. Zoph, Simple copy-paste is a strong data augmentation method for
instance segmentation, arXiv preprint arXiv:2012.07177.
[74] R. Wightman, Pytorch image models,
pytorch-image-models (2019). doi:10.5281/zenodo.4414861.
[75] S. Budennyy, N. Zakharenko, O. Plosskaya, I. Barsola, I. Egorov,
A. Kosterina, V. Lazarev, A. Korovin, L. Zhukov, V. Arkhipkin, I. Oseledets,
D. Dimitrov, Eco2ai: carbon emissions tracking of machine learning models
as the first step towards sustainable ai (2022).
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
For computer vision systems of autonomous vehicles, an important task is to ensure high reliability of visual information coming from on-board cameras. Frequent problems are contamination of the camera lens, its defocusing due to mechanical damage, image motion blur in low light conditions. In our work, we propose a novel neural network approach to the classification and unsupervised segmentation of visibility artifacts on monocular camera images. It is based on the compact classification deep neural network with an integrated modification of the gradient method for class activation map and segmentation mask generating. We present a new dataset named Visibility Artifacts containing over 22300 images including six common artifacts: complete loss of camera visibility, strong or partial contamination, rain or snow drops, motion blur, defocus. To check the quality of artifact localization, a small test set with ground truth masks is additionally labeled. It allowed us to objectively quantitatively compare various methods for constructing class activation maps (CAMERAS, FullGrad, original and modified Grad-CAM, Layer-CAM), which demonstrated image segmentation quality above 54% mIoU without any supervision. This is a promising result. Experiments with the developed dataset demonstrated the superiority of the neural network classification method ResNet-18_U (with test accuracy of 99.37%), compared to more complex convolutional (ResNet-34, ResNeXt-50, EfficientNet-B0) and transformer (ViT-Ti, DeiT-Ti) neural networks. The code of the proposed method and the dataset are publicly available at
Full-text available
Garbage classification is a social issue related to people’s livelihood and sustainable development, so enabling service robots to classify garbage autonomously has important research significance. To address data transmission delays and slow responses between data sources and the cloud service center in complex systems, and to support the perception, storage, and analysis of massive multisource heterogeneous data, a garbage detection and classification method based on visual scene understanding is proposed. This method uses knowledge graphs to store and model items in the scene in multimodal form, including images, videos, and text. The ESA attention mechanism is added to the backbone of the YOLOv5 network to improve its feature extraction ability. The network is combined with the constructed multimodal knowledge graph to form the YOLOv5-Attention-KG model, which is deployed on the service robot to perceive the items in the scene in real time. Finally, collaborative training is carried out on the cloud server side, and the model is deployed on the edge device side to reason about and analyze the data in real time. The test results show that, compared with the original YOLOv5 model, the proposed model achieves higher detection and classification accuracy, and its real-time performance meets practical requirements. The model proposed in this paper can support intelligent decision-making for garbage classification on large volumes of scene data in a complex system and shows potential for practical deployment.
With the continuous development of artificial intelligence, embedding object detection algorithms into autonomous underwater detectors for marine garbage cleanup has become an emerging application area. Considering the complexity of the marine environment and the low resolution of the images taken by underwater detectors, this paper proposes an improved algorithm based on Mask R-CNN, with the aim of achieving high-accuracy marine garbage detection and instance segmentation. First, dilated convolution is introduced in the Feature Pyramid Network to enhance feature extraction for small objects. Second, a spatial-channel attention mechanism is used so that features are learned adaptively and attention is focused on the objects to be detected. Third, a re-scoring branch is added to improve the accuracy of instance segmentation by scoring the predicted masks with a method based on Generalized Intersection over Union. Finally, we train the proposed algorithm on the Transcan dataset, evaluate its effectiveness with various metrics, and compare it with existing algorithms. The experimental results show that, compared to the baseline provided with the Transcan dataset, the proposed algorithm improves the mAP on garbage detection and instance segmentation by 9.6 and 5.0, respectively. Thus, it is better suited to the marine environment and can achieve high-precision object detection and instance segmentation.
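Generalized Intersection over Union, mentioned above as the basis of the re-scoring branch, extends IoU with a penalty based on the smallest region enclosing both shapes. The sketch below computes it for axis-aligned boxes only. It is an illustration of the metric itself, not the paper's mask re-scoring branch, and the function name `giou` and the `(x1, y1, x2, y2)` box format are assumptions.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumption: both boxes have positive area (x2 > x1 and y2 > y1).
    The result lies in (-1, 1]; it equals 1 only for identical boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area, clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union
    # Penalize the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


if __name__ == "__main__":
    print(giou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping boxes
    print(giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes give a negative value
```

Unlike plain IoU, which is zero for every pair of disjoint boxes, this score keeps decreasing as the boxes move apart, which is what makes it usable as a ranking or training signal.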
Rapid economic and social development has led to a sharp increase in the output of domestic waste. Realizing waste classification through intelligent methods has become a key factor in achieving sustainable development. Traditional waste classification technology has low efficiency and low accuracy. To improve the efficiency and accuracy of waste classification, this paper proposes a DenseNet169 waste image classification model based on transfer learning. Because existing public waste datasets suffer from uneven data distribution, uniform backgrounds, overly obvious features, and small sample sizes, the waste image dataset NWNU-TRASH is constructed. The dataset has a balanced distribution, high diversity, and rich backgrounds, which is more in line with real needs. 70% of the dataset is used as the training set and 30% as the test set. Starting from the pre-trained DenseNet169 network, a DenseNet169 model suited to this experimental dataset is obtained. The experimental results show that the classification accuracy of the DenseNet169 model after transfer learning is over 82%, which is better than that of other image classification algorithms.