Fashion-MNIST: a Novel Image Dataset for
Benchmarking Machine Learning Algorithms
Han Xiao
Zalando Research
Mühlenstraße 25, 10243 Berlin
han.xiao@zalando.de
Kashif Rasul
Zalando Research
Mühlenstraße 25, 10243 Berlin
kashif.rasul@zalando.de
Roland Vollgraf
Zalando Research
Mühlenstraße 25, 10243 Berlin
roland.vollgraf@zalando.de
Abstract
We present Fashion-MNIST, a new dataset comprising 28 × 28 grayscale
images of 70,000 fashion products from 10 categories, with 7,000 images
per category. The training set has 60,000 images and the test set has
10,000 images. Fashion-MNIST is intended to serve as a direct drop-
in replacement for the original MNIST dataset for benchmarking machine
learning algorithms, as it shares the same image size, data format and the
structure of training and testing splits. The dataset is freely available at
https://github.com/zalandoresearch/fashion-mnist.
1 Introduction
The MNIST dataset, comprising 10 classes of handwritten digits, was first introduced by LeCun et al.
[1998] in 1998. At that time one could not have foreseen the stellar rise of deep learning techniques
and their performance. Although deep learning can now do far more, the simple MNIST dataset has
become the most widely used testbed in the field, surpassing CIFAR-10 [Krizhevsky and Hinton, 2009]
and ImageNet [Deng et al., 2009] in popularity according to Google Trends¹. Despite its simplicity,
its usage does not seem to be decreasing, even though the deep learning community has repeatedly
called for moving on from it.
MNIST owes much of its popularity to its small size, which lets deep learning researchers quickly
check and prototype their algorithms. This is complemented by the fact that virtually all machine
learning libraries (e.g. scikit-learn) and deep learning frameworks (e.g. TensorFlow, PyTorch) provide
helper functions and convenient examples that use MNIST out of the box, as the sketch below illustrates.
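For example, a minimal sketch using torchvision's built-in FashionMNIST dataset class (assuming torch and torchvision are installed; the helper mirrors the one provided for MNIST):

import torch
from torchvision import datasets, transforms

# Download the training and test splits and convert images to [0, 1] float tensors.
train_set = datasets.FashionMNIST(root="./data", train=True, download=True,
                                  transform=transforms.ToTensor())
test_set = datasets.FashionMNIST(root="./data", train=False, download=True,
                                 transform=transforms.ToTensor())

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([64, 1, 28, 28])
print(labels[:10])   # class indices in the range 0..9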
Our aim with this work is to create a benchmark dataset that retains all the accessibility of MNIST,
namely its small size, straightforward encoding and permissive license. We keep the same 10 classes
and 70,000 grayscale images of size 28 × 28 as in the original MNIST. In fact, the only change needed
to use this dataset is to swap the URL from which the MNIST files are fetched. Moreover, Fashion-MNIST
poses a more challenging classification task than the simple MNIST digits, for which accuracies above
99.7% have been reported by Wan et al. [2013] and Ciregan et al. [2012].
We also considered the EMNIST dataset provided by Cohen et al. [2017], a variant of MNIST that
extends the number of classes by introducing uppercase and lowercase letters. However, using it
seamlessly requires not only extending the deep learning framework's MNIST helpers, but also
changing the underlying deep neural network to classify the extra classes.

¹ https://trends.google.com/trends/explore?date=all&q=mnist,CIFAR,ImageNet
2 Fashion-MNIST Dataset
Fashion-MNIST is based on the assortment on Zalando's website². Every fashion product on Zalando
has a set of pictures shot by professional photographers, showing different aspects of the product,
e.g. front and back looks, details, and looks with a model and in an outfit. The original picture has a
light-gray background (hexadecimal color: #fdfdfd) and is stored as a 762 × 1000 JPEG. To serve
different frontend components efficiently, the original picture is resampled at multiple resolutions,
e.g. large, medium, small, thumbnail and tiny.

² Zalando is Europe's largest online fashion platform. http://www.zalando.com

We use the front-look thumbnail images of 70,000 unique products to build Fashion-MNIST. The
products come from different gender groups: men, women, kids and neutral. White products are
excluded from the dataset because they have low contrast against the background. The thumbnails
(51 × 73 pixels) are then fed into the following conversion pipeline, which is visualized in Figure 1
and sketched in code after the figure caption.
1. Converting the input to a PNG image.
2. Trimming any edges that are close to the color of the corner pixels. "Closeness" is defined as
   being within 5% of the maximum possible intensity in RGB space.
3. Resizing the longest edge of the image to 28 pixels by subsampling, i.e. some rows and
   columns are skipped over.
4. Sharpening pixels using a Gaussian operator with a radius and standard deviation of 1.0,
   with increasing effect near outlines.
5. Extending the shortest edge to 28 pixels and placing the image at the center of the canvas.
6. Negating the intensities of the image.
7. Converting the image to 8-bit grayscale pixels.
Figure 1: Diagram of the conversion process used to generate the Fashion-MNIST dataset. Two
examples from the dress and sandal categories are depicted. Each column represents a step
described in Section 2.
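The following is a rough, hypothetical re-implementation of the seven steps using Pillow; it approximates the published description rather than reproducing the authors' exact pipeline, and the tolerance and filter settings simply follow the values stated above:

from PIL import Image, ImageChops, ImageFilter, ImageOps

def to_fashion_mnist(path):
    img = Image.open(path).convert("RGB")          # step 1: decode the input image

    # Step 2: trim borders whose color is within 5% of the corner-pixel color.
    corner = Image.new("RGB", img.size, img.getpixel((0, 0)))
    diff = ImageChops.difference(img, corner)
    bbox = diff.point(lambda p: 255 if p > 0.05 * 255 else 0).getbbox()
    if bbox:
        img = img.crop(bbox)

    # Step 3: resize the longest edge to 28 by subsampling (nearest neighbour).
    w, h = img.size
    scale = 28 / max(w, h)
    img = img.resize((max(1, round(w * scale)), max(1, round(h * scale))),
                     Image.NEAREST)

    # Step 4: sharpen with a Gaussian-based unsharp mask of radius 1.0.
    img = img.filter(ImageFilter.UnsharpMask(radius=1.0))

    # Step 5: pad the shortest edge to 28 and center the image on the canvas
    # (the canvas approximates the light background with plain white).
    canvas = Image.new("RGB", (28, 28), (255, 255, 255))
    canvas.paste(img, ((28 - img.size[0]) // 2, (28 - img.size[1]) // 2))

    # Steps 6 and 7: negate intensities and convert to 8-bit grayscale.
    return ImageOps.invert(canvas).convert("L")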
Table 1: Files contained in the Fashion-MNIST dataset.
Name Description # Examples Size
train-images-idx3-ubyte.gz Training set images 60,000 25 MBytes
train-labels-idx1-ubyte.gz Training set labels 60,000 140 Bytes
t10k-images-idx3-ubyte.gz Test set images 10,000 4.2 MBytes
t10k-labels-idx1-ubyte.gz Test set labels 10,000 92 Bytes
For the class labels, we use the silhouette code of the product. The silhouette code is manually
labeled by in-house fashion experts and reviewed by a separate team at Zalando. Each product
carries exactly one silhouette code. Table 2 summarizes all class labels in Fashion-MNIST with
examples for each class.
Finally, the dataset is divided into a training and a test set. The training set receives 6,000 randomly
selected examples from each class. Images and labels are stored in the same file format as the
MNIST dataset, which is designed for storing vectors and multidimensional matrices. The resulting
files are listed in Table 1. Examples are sorted by label before storing, which yields smaller label
files after compression than MNIST and makes it easier to retrieve examples of a particular class;
shuffling the data is therefore left to the algorithm developer.
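Because the files use the MNIST IDX format, they can be read directly with NumPy. A minimal reader sketch (file names follow Table 1; the paths are assumed to point at the downloaded archives):

import gzip
import numpy as np

def load_idx_images(path):
    with gzip.open(path, "rb") as f:
        data = f.read()
    # Header: magic number, item count, rows, cols as big-endian 32-bit integers.
    n, rows, cols = np.frombuffer(data, dtype=">i4", count=4)[1:4]
    return np.frombuffer(data, dtype=np.uint8, offset=16).reshape(n, rows, cols)

def load_idx_labels(path):
    with gzip.open(path, "rb") as f:
        data = f.read()
    return np.frombuffer(data, dtype=np.uint8, offset=8)  # skip magic and count

train_images = load_idx_images("train-images-idx3-ubyte.gz")
train_labels = load_idx_labels("train-labels-idx1-ubyte.gz")
print(train_images.shape, train_labels.shape)  # (60000, 28, 28) (60000,)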
Table 2: Class names and example images in Fashion-MNIST dataset.
Label Description Examples
0 T-Shirt/Top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandals
6 Shirt
7 Sneaker
8 Bag
9 Ankle boots
3 Experiments
We provide some classification results in Table 3 to form a benchmark on this dataset. Each
algorithm is run 5 times, shuffling the training data before each run, and the average accuracy on
the test set is reported. The benchmark on the MNIST dataset is also included for a side-by-side
comparison. A more comprehensive table with explanations of the algorithms can be found at
https://github.com/zalandoresearch/fashion-mnist.
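A single entry of Table 3 can be reproduced in spirit with scikit-learn. A hedged sketch of one such run, reusing the IDX loaders from the previous section and the hyperparameters of one Random Forest row (exact scores will vary slightly with library versions and seeds):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

# Flatten each 28x28 image into a 784-dimensional feature vector.
X_train = load_idx_images("train-images-idx3-ubyte.gz").reshape(-1, 784)
y_train = load_idx_labels("train-labels-idx1-ubyte.gz")
X_test = load_idx_images("t10k-images-idx3-ubyte.gz").reshape(-1, 784)
y_test = load_idx_labels("t10k-labels-idx1-ubyte.gz")

scores = []
for seed in range(5):
    Xs, ys = shuffle(X_train, y_train, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                 max_depth=100, random_state=seed)
    clf.fit(Xs, ys)
    scores.append(clf.score(X_test, y_test))

# Table 3 reports 0.873 for this configuration on Fashion-MNIST.
print(np.mean(scores))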
Table 3: Benchmark on Fashion-MNIST (Fashion) and MNIST.
Test Accuracy
Classifier Parameter Fashion MNIST
DecisionTreeClassifier criterion=entropy max_depth=10 splitter=best 0.798 0.873
criterion=entropy max_depth=10 splitter=random 0.792 0.861
criterion=entropy max_depth=50 splitter=best 0.789 0.886
criterion=entropy max_depth=100 splitter=best 0.789 0.886
criterion=gini max_depth=10 splitter=best 0.788 0.866
criterion=entropy max_depth=50 splitter=random 0.787 0.883
criterion=entropy max_depth=100 splitter=random 0.787 0.881
criterion=gini max_depth=100 splitter=best 0.785 0.879
criterion=gini max_depth=50 splitter=best 0.783 0.877
criterion=gini max_depth=10 splitter=random 0.783 0.853
criterion=gini max_depth=50 splitter=random 0.779 0.873
criterion=gini max_depth=100 splitter=random 0.777 0.875
ExtraTreeClassifier criterion=gini max_depth=10 splitter=best 0.775 0.806
criterion=entropy max_depth=100 splitter=best 0.775 0.847
criterion=entropy max_depth=10 splitter=best 0.772 0.810
criterion=entropy max_depth=50 splitter=best 0.772 0.847
criterion=gini max_depth=100 splitter=best 0.769 0.843
criterion=gini max_depth=50 splitter=best 0.768 0.845
criterion=entropy max_depth=50 splitter=random 0.752 0.826
criterion=entropy max_depth=100 splitter=random 0.752 0.828
criterion=gini max_depth=50 splitter=random 0.748 0.824
criterion=gini max_depth=100 splitter=random 0.745 0.820
criterion=gini max_depth=10 splitter=random 0.739 0.737
criterion=entropy max_depth=10 splitter=random 0.737 0.745
GaussianNB priors=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] 0.511 0.524
GradientBoostingClassifier n_estimators=100 loss=deviance max_depth=10 0.880 0.969
n_estimators=50 loss=deviance max_depth=10 0.872 0.964
n_estimators=100 loss=deviance max_depth=3 0.862 0.949
n_estimators=10 loss=deviance max_depth=10 0.849 0.933
n_estimators=50 loss=deviance max_depth=3 0.840 0.926
n_estimators=10 loss=deviance max_depth=50 0.795 0.888
n_estimators=10 loss=deviance max_depth=3 0.782 0.846
KNeighborsClassifier weights=distance n_neighbors=5 p=1 0.854 0.959
weights=distance n_neighbors=9 p=1 0.854 0.955
weights=uniform n_neighbors=9 p=1 0.853 0.955
weights=uniform n_neighbors=5 p=1 0.852 0.957
weights=distance n_neighbors=5 p=2 0.852 0.945
weights=distance n_neighbors=9 p=2 0.849 0.944
weights=uniform n_neighbors=5 p=2 0.849 0.944
weights=uniform n_neighbors=9 p=2 0.847 0.943
weights=distance n_neighbors=1 p=2 0.839 0.943
weights=uniform n_neighbors=1 p=2 0.839 0.943
weights=uniform n_neighbors=1 p=1 0.838 0.955
weights=distance n_neighbors=1 p=1 0.838 0.955
LinearSVC loss=hinge C=1 multi_class=ovr penalty=l2 0.836 0.917
loss=hinge C=1 multi_class=crammer_singer penalty=l2 0.835 0.919
loss=squared_hinge C=1 multi_class=crammer_singer penalty=l2 0.834 0.919
loss=squared_hinge C=1 multi_class=crammer_singer penalty=l1 0.833 0.919
loss=hinge C=1 multi_class=crammer_singer penalty=l1 0.833 0.919
loss=squared_hinge C=1 multi_class=ovr penalty=l2 0.820 0.912
loss=squared_hinge C=10 multi_class=ovr penalty=l2 0.779 0.885
loss=squared_hinge C=100 multi_class=ovr penalty=l2 0.776 0.873
loss=hinge C=10 multi_class=ovr penalty=l2 0.764 0.879
loss=hinge C=100 multi_class=ovr penalty=l2 0.758 0.872
loss=hinge C=10 multi_class=crammer_singer penalty=l1 0.751 0.783
loss=hinge C=10 multi_class=crammer_singer penalty=l2 0.749 0.816
loss=squared_hinge C=10 multi_class=crammer_singer penalty=l2 0.748 0.829
loss=squared_hinge C=10 multi_class=crammer_singer penalty=l1 0.736 0.829
loss=hinge C=100 multi_class=crammer_singer penalty=l1 0.516 0.759
loss=hinge C=100 multi_class=crammer_singer penalty=l2 0.496 0.753
loss=squared_hinge C=100 multi_class=crammer_singer penalty=l1 0.492 0.746
loss=squared_hinge C=100 multi_class=crammer_singer penalty=l2 0.484 0.737
LogisticRegression C=1 multi_class=ovr penalty=l1 0.842 0.917
C=1 multi_class=ovr penalty=l2 0.841 0.917
C=10 multi_class=ovr penalty=l2 0.839 0.916
C=10 multi_class=ovr penalty=l1 0.839 0.909
C=100 multi_class=ovr penalty=l2 0.836 0.916
MLPClassifier activation=relu hidden_layer_sizes=[100] 0.871 0.972
activation=relu hidden_layer_sizes=[100, 10] 0.870 0.972
activation=tanh hidden_layer_sizes=[100] 0.868 0.962
activation=tanh hidden_layer_sizes=[100, 10] 0.863 0.957
activation=relu hidden_layer_sizes=[10, 10] 0.850 0.936
activation=relu hidden_layer_sizes=[10] 0.848 0.933
activation=tanh hidden_layer_sizes=[10, 10] 0.841 0.921
activation=tanh hidden_layer_sizes=[10] 0.840 0.921
PassiveAggressiveClassifier C=1 0.776 0.877
C=100 0.775 0.875
C=10 0.773 0.880
Perceptron penalty=l1 0.782 0.887
penalty=l2 0.754 0.845
penalty=elasticnet 0.726 0.845
RandomForestClassifier n_estimators=100 criterion=entropy max_depth=100 0.873 0.970
n_estimators=100 criterion=gini max_depth=100 0.872 0.970
n_estimators=50 criterion=entropy max_depth=100 0.872 0.968
n_estimators=100 criterion=entropy max_depth=50 0.872 0.969
n_estimators=50 criterion=entropy max_depth=50 0.871 0.967
n_estimators=100 criterion=gini max_depth=50 0.871 0.971
n_estimators=50 criterion=gini max_depth=50 0.870 0.968
n_estimators=50 criterion=gini max_depth=100 0.869 0.967
n_estimators=10 criterion=entropy max_depth=50 0.853 0.949
n_estimators=10 criterion=entropy max_depth=100 0.852 0.949
n_estimators=10 criterion=gini max_depth=50 0.848 0.948
n_estimators=10 criterion=gini max_depth=100 0.847 0.948
n_estimators=50 criterion=entropy max_depth=10 0.838 0.947
n_estimators=100 criterion=entropy max_depth=10 0.838 0.950
n_estimators=100 criterion=gini max_depth=10 0.835 0.949
n_estimators=50 criterion=gini max_depth=10 0.834 0.945
n_estimators=10 criterion=entropy max_depth=10 0.828 0.933
n_estimators=10 criterion=gini max_depth=10 0.825 0.930
SGDClassifier loss=hinge penalty=l2 0.819 0.914
loss=perceptron penalty=l1 0.818 0.912
loss=modified_huber penalty=l1 0.817 0.910
loss=modified_huber penalty=l2 0.816 0.913
loss=log penalty=elasticnet 0.816 0.912
loss=hinge penalty=elasticnet 0.816 0.913
loss=squared_hinge penalty=elasticnet 0.815 0.914
loss=hinge penalty=l1 0.815 0.911
loss=log penalty=l1 0.815 0.910
loss=perceptron penalty=l2 0.814 0.913
loss=perceptron penalty=elasticnet 0.814 0.912
loss=squared_hinge penalty=l2 0.814 0.912
loss=modified_huber penalty=elasticnet 0.813 0.914
loss=log penalty=l2 0.813 0.913
loss=squared_hinge penalty=l1 0.813 0.911
SVC C=10 kernel=rbf 0.897 0.973
C=10 kernel=poly 0.891 0.976
C=100 kernel=poly 0.890 0.978
C=100 kernel=rbf 0.890 0.972
C=1 kernel=rbf 0.879 0.966
C=1 kernel=poly 0.873 0.957
C=1 kernel=linear 0.839 0.929
C=10 kernel=linear 0.829 0.927
C=100 kernel=linear 0.827 0.926
C=1 kernel=sigmoid 0.678 0.898
C=10 kernel=sigmoid 0.671 0.873
C=100 kernel=sigmoid 0.664 0.868
4 Conclusions
This paper introduced Fashion-MNIST, a fashion product image dataset intended to be a drop-in
replacement for MNIST while providing a more challenging alternative for benchmarking machine
learning algorithms. The images in Fashion-MNIST are converted to a format that matches that of
the MNIST dataset, making it immediately compatible with any machine learning package capable
of working with the original MNIST dataset.
References
D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.
G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. EMNIST: An extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.