Tab2Visual: Overcoming Limited Data in Tabular Data
Classification Using Deep Learning with Visual
Representations
Ahmed Mamdouh (a), Moumen El-Melegy (a,b), Samia Ali (a), Ron Kikinis (b)
(a) Electrical Engineering Department, Assiut University, Assiut, 71516, Egypt
(b) Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
Email addresses: ahmed.mamdouh@aun.edu.eg (Ahmed Mamdouh), melmelegy@bwh.harvard.edu (Moumen El-Melegy), samia_fattah@aun.edu.eg (Samia Ali), kikinis@bwh.harvard.edu (Ron Kikinis)
arXiv:2502.07181v1 [cs.LG] 11 Feb 2025
Abstract
This research addresses the challenge of limited data in tabular data classi-
fication, particularly prevalent in domains with constraints like healthcare.
We propose Tab2Visual, a novel approach that transforms heterogeneous
tabular data into visual representations, enabling the application of power-
ful deep learning models. Tab2Visual effectively addresses data scarcity by
incorporating novel image augmentation techniques and facilitating transfer
learning. We extensively evaluate the proposed approach on diverse tabular
datasets, comparing its performance against a wide range of machine learn-
ing algorithms, including classical methods, tree-based ensembles, and state-
of-the-art deep learning models specifically designed for tabular data. We
also perform an in-depth analysis of factors influencing Tab2Visual’s perfor-
mance. Our experimental results demonstrate that Tab2Visual outperforms
other methods in classification problems with limited tabular data.
Keywords: Tabular Data, Limited Data, Deep Learning, Machine
Learning, Data Augmentation, Transfer Learning
1. Introduction
Recent years have witnessed significant advancements in deep learning,
with Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs)
revolutionizing image classification tasks [1]. CNNs excel at automatically
extracting hierarchical features from raw data, thus being exceptionally effec-
tive in handling the complexities of image data [2]. ViTs have further pushed
the boundaries by capturing long-range dependencies and surpassing CNNs
in several benchmarks, demonstrating their powerful capabilities [3, 4]. Both
CNNs and ViTs have exceeded human performance in numerous tasks [5],
solidifying their status as state-of-the-art techniques in the field.
However, deep learning has not been as successful on tabular data [6, 7].
Tabular data, commonly represented as a two-dimensional matrix of rows
(samples or observations) and columns (attributes or features), constitute the
most prevalent data format in data science [8]. Classification of these data
has broad applications across finance, healthcare, cybersecurity, anomaly de-
tection, and social science [7]. The inherent characteristics of tabular data
pose unique challenges for machine learning algorithms. Unlike image, lan-
guage, and speech data, which often exhibit inherent structure and homo-
geneity, tabular data typically comprises diverse feature types (numerical,
categorical, Boolean, and ordinal) with varying scales and sources, leading
to increased complexity. Furthermore, the interdependencies among features
in tabular data are generally weaker compared to those observed in image
or speech data, where spatial or semantic relationships often establish strong
correlations between data points.
Due to these challenges, deep neural networks frequently underperform
on tabular data compared to other machine learning methods, such as tree-
based ensembles. To address these challenges, researchers have explored
various approaches, broadly categorized into three groups [7]: specialized
architectures, regularization techniques, and data transformation methods.
The first group focuses on developing specialized architectures specifi-
cally designed for tabular data. Notable examples within this category in-
clude NODE [9], TabNet [10], TabTransformer [11], and TabPFN [12, 13].
The Neural Oblivious Decision Ensembles (NODE) [9] are characterized by
a deep, layer-wise structure composed of an ensemble of differentiable obliv-
ious trees. Gradient-based optimization is employed for end-to-end training
of this architecture. The other three architectures, TabNet, TabTransformer,
and TabPFN, are inspired by the success of transformer models. TabNet [10]
is a pioneering transformer-based architecture for tabular data, employing a
sequential attention mechanism. Similarly, the TabTransformer [11] employs
self-attention mechanisms within a transformer architecture to generate con-
textual embeddings for categorical features. These enriched embeddings,
along with the original numerical features, are then fed into a multilayer per-
ceptron for the final classification. The more recent Tabular Prior-data Fit-
ted Network (TabPFN) [12, 13] relies on in-context learning that integrates
approximate Bayesian inference and structural causal modeling within a two-
way attention mechanism. Notably, TabPFN undergoes a single pre-training
phase on a vast collection of synthetic datasets encompassing diverse predic-
tion tasks. During inference, given a new dataset with both labeled training
and unlabeled test samples, the model simultaneously trains and predicts
within a single forward pass of the neural network. Due to its significant
memory requirements, TabPFN’s current application is restricted to small
datasets [13].
The second group of tabular data methods attributes the moderate per-
formance of deep learning models to their inherent nonlinearity and high
model complexity. As such, they enforce strong regularization during model
training. This often involves employing specialized loss functions, such as
the learned regularization scheme proposed by Shavitt et al. [14] and the
regularization cocktails (combinations) introduced by Kadra et al. [15].
The final category focuses on data transformation methods, aiming to
convert heterogeneous tabular inputs into a homogeneous format more suit-
able for deep learning. These methods typically focus on data preprocess-
ing techniques rather than requiring the development of entirely new deep
architectures. One pioneering approach is DeepInsight [16], which trans-
forms high-dimensional tabular data into a spatial representation suitable for
CNNs. DeepInsight projects the data into a 2D space using t-SNE [17], a non-
linear dimensionality reduction technique that preserves local similarities.
DeepInsight then constructs an image from this 2D projection using convex
hull analysis with proper translation, rotation, quantization, and normaliza-
tion operations. This method relies on t-SNE, which is non-deterministic and
sensitive to parameter choices (e.g., perplexity [17]), and suffers from the "crowding problem," where many points may cluster together in the low-dimensional
space. Furthermore, for datasets with few features, the resulting images may
contain isolated islands, reducing the efficiency of CNNs in learning mean-
ingful patterns.
The REFINED (REpresentation of Features as Images with NEighbor-
hood Dependencies) approach [18] also projects data into a 2D space, but
replaces t-SNE with Bayesian Metric Multidimensional Scaling to preserve
pairwise distances in the low-dimensional representation. Despite these mod-
ifications, REFINED shares similar limitations to DeepInsight. The Su-
perTML method [19] adopts another approach by converting tabular data
into text representations visualized as 2D binary images. This transforms
the tabular classification problem into a text classification task, which can
then be addressed by CNNs. However, the method's reported evaluation is
limited to only three datasets.
Buturovic et al. [20] introduced the TAbular Convolution (TAC) method,
which arranges data samples into zero-mean square matrices (kernels) of odd
integer dimensions. These kernels are then convolved with a fixed "base image," and the resulting images are subsequently fed to a CNN for classi-
fication. While applied successfully to classify gene expression data in their
study, the method’s reliance on an arbitrarily chosen base image presents a
significant limitation. The impact of this base image on model performance
remains unclear, and the authors themselves acknowledge its inconclusive in-
fluence. Furthermore, the necessary padding or trimming of data samples to
achieve the required kernel size can potentially introduce bias or information
loss, negatively impacting the method’s overall performance. Zhu et al. [21]
proposed the Image Generator for Tabular Data (IGTD) approach. IGTD
generates an image for each data sample, where pixel intensities directly
represent feature values. Consequently, the resulting image has a size that
corresponds to the number of features in the original data. The algorithm
employs an iterative optimization process to assign features to pixels, prior-
itizing the placement of similar features in close proximity. This approach
leads to more compact image representations, reducing memory consump-
tion and accelerating CNN training. However, IGTD faces limitations when
applied to datasets with a limited number of features. In such cases, the gen-
erated images may lack sufficient detail, hindering effective CNN training.
Despite all those research efforts to apply deep learning to tabular data,
a recent comprehensive review [7] concludes that tree-ensemble models (e.g.,
XGBoost, LightGBM, CatBoost) continue to demonstrate superior perfor-
mance in classification tasks. This observation has been corroborated by
numerous independent studies (e.g., see [6, 22]). The underperformance of
deep learning is particularly pronounced on small tabular datasets [7, 22].
In other areas of machine learning, such as image classification and natu-
ral language processing, these challenges are effectively addressed through
techniques like data augmentation [23] to increase the dataset size and/or
transfer learning [24] to leverage knowledge learned from data-rich domains.
Figure 1: Distribution of tabular dataset sizes on the OpenML platform (n denotes the dataset size). The majority of datasets fall within the smaller size ranges, highlighting the significance of addressing challenges associated with limited data.
While a few data augmentation techniques have been proposed for tabular data—such as SMOTE variants (with SMOTE-NC being the only variant
that can handle discrete features) [25] and the latent space interpolation
method by Darabi and Elor [26]—their primary focus remains minority class
oversampling through linear syntheses of existing samples. These methods,
however, often yield only marginal gains in classification performance [26],
underscoring a critical gap in the field. Consequently, the development of ef-
fective augmentation strategies tailored to tabular data remains an open re-
search challenge [7]. Furthermore, despite its success in other domains, trans-
fer learning has proven challenging to effectively apply to tabular data [7].
The current research is driven by the challenges of limited data encoun-
tered in our ongoing investigation of prostate cancer prediction using clinical
biomarkers and patient-supplied questionnaires [27, 28, 29]. Our dataset,
with fewer than 100 records and only 9 features, exemplifies the typical data
scarcity encountered in healthcare applications, often constrained by factors
such as cost, privacy concerns, and limited access to resources. Our motiva-
tion is further reinforced by the prevalence of small tabular datasets across
various domains. Notably, OpenML, a prominent platform for sharing ma-
chine learning datasets, reveals that over 71% of available tabular datasets
contain fewer than 10K records at the time of writing (see Fig. 1).
To overcome these challenges, we introduce Tab2Visual, a novel data
transformation strategy that converts heterogeneous tabular data into visual
representations, thereby enabling the utilization of powerful deep learning
models including CNNs and ViTs. This approach enables effective cap-
ture of complex patterns without specialized architectures. Furthermore,
Tab2Visual offers potential solutions to open research questions regarding
transfer learning and data augmentation for tabular data. Specifically, it
leverages image augmentation to enhance model generalization and effec-
tively increase dataset size without additional data collection. We also in-
troduce a new set of efficient and semantically meaningful image augmen-
tation techniques tailored for Tab2Visual-generated images. Additionally,
Tab2Visual facilitates transfer learning, enabling fine-tuning of pre-trained
models, even from different domains, for new tabular classification tasks.
This reduces reliance on large labeled datasets and allows for efficient knowl-
edge transfer between tabular datasets. Critically, unlike the aforementioned
tabular-to-image conversion methods [16, 18, 19, 20, 21], which typically
require feature-rich datasets and struggle with generating meaningful aug-
mentations, Tab2Visual effectively handles data with limited features and
offers numerous augmentation strategies. Consequently, Tab2Visual effec-
tively mitigates the challenges posed by limited data in tabular data classi-
fication.
It is interesting to note that the concept of representing tabular informa-
tion as images, upon which Tab2Visual is founded, aligns with established
principles in cognitive psychology. The Picture Superiority Effect [30, 31]
demonstrates that people tend to learn and remember information presented
visually more effectively than information presented in textual or numerical
formats. This phenomenon is explained by the fact that images trigger a
richer set of cognitive representations and associations compared to words,
drawing upon broader knowledge of the world [32]. Furthermore, the distinc-
tive visual features within images contribute to enhanced memorability [31].
This work significantly extends the preliminary, exploratory ideas drafted
in our prior work [29], which focused on a specific clinical context. We make
several new contributions, advancing the state-of-the-art across scope, al-
gorithmic development, presentation, and experimental work. We address
the wider context of general tabular data classification, tackling its open
and challenging research problems. In addition, we propose comprehensive
and generalized algorithms for Tab2Visual modeling and image augmenta-
tion. To demonstrate the effectiveness of Tab2Visual, we evaluate its per-
formance using two state-of-the-art backbones: the CNN architecture, Effi-
cientNet [33], and the Vision Transformer, EfficientViT [34]. We conduct a
comprehensive empirical study across 10 diverse datasets from the UCI Ma-
chine Learning Repository, spanning various application domains including
medical, e-commerce, and engineering. While Tab2Visual is particularly in-
tended for small datasets, we assess its performance on a range of datasets,
from small to medium and large scale, encompassing thousands of records
and tens of features. Furthermore, we conduct a rigorous comparative anal-
ysis of Tab2Visual in terms of accuracy and speed against a diverse set of
machine learning algorithms, including classical methods (support vector ma-
chines, logistic regression, shallow neural networks), established tree-based
ensembles (random forests, extra trees, gradient boosting machines), and
state-of-the-art deep learning architectures specifically designed for tabular
data, such as TabNet [10] and TabPFN [12, 13]. This comprehensive evalu-
ation aims to benchmark Tab2Visual against key existing approaches in the
context of tabular data classification. Finally, we perform an in-depth analy-
sis of key factors influencing Tab2Visual’s performance, including the impact
of augmentation strategies, the choice between transfer learning and training
from scratch, the selection of the backbone model, and the arrangement of
features within the generated image representations.
The rest of this paper is structured as follows: Section 2 introduces the
Tab2Visual modeling approach. Section 3 presents our comprehensive ex-
perimental setup and results, followed by thorough discussion and analysis
in Section 4. Lastly, Section 5 provides a summary of our findings and con-
cluding remarks.
2. Tab2Visual: Visual Representation of Tabular Data
Deep learning has achieved remarkable success in image analysis by effec-
tively exploiting the spatial structure and local correlations inherent within
image data. However, the application of deep learning to heterogeneous tab-
ular data has been less successful. Tab2Visual addresses this limitation by
transforming tabular data into visual representations. This transformation
enables powerful vision models, such as CNNs and ViTs, to effectively extract
meaningful features from the tabular data, leading to improved classification
performance. This section details the proposed Tab2Visual methodology.
We begin by formally defining the problem.
Problem Statement: Given a tabular dataset D = {(x_i, y_i)}_{i=1}^{n}, where n is the number of samples, each sample x_i = [x_i^(1), x_i^(2), ..., x_i^(m)]^T ∈ R^m consists of m features, and y_i ∈ {1, ..., C} is the label associated with sample x_i, where C is the number of unique labels or classes, the objective is to transform each sample x_i into an image representation I_i of dimensions H × W, with H being the image height and W the image width, such that a deep learning model M (e.g., CNN or ViT) can be trained on the set of images {(I_i, y_i)}_{i=1}^{n} for the classification task.
Figure 2: The Tab2Visual approach: Tabular data undergoes normalization and is subsequently encoded as images where each feature is represented by a bar with varying width. To enhance data diversity, optional data augmentation techniques can be applied. Finally, these image representations are fed to a deep learning model for training or inference.
Fig. 2 illustrates the Tab2Visual modeling steps, which are further elab-
orated below:
•Data Preparation and Normalization: First, categorical features
are one-hot encoded, and ordinal variables are encoded numerically
based on their inherent order. Following this, min-max normalization is
applied to the entire dataset D, such that x_i^(j) ∈ [0, 1], ∀i = 1, ..., n, j = 1, ..., m. This standardization ensures consistent feature comparisons
and prepares the data for visual representation.
•Image Preparation: The next step involves converting the normal-
ized data into visual representations suitable for deep learning models.
For each sample, an image is created encoding the sample’s features.
The image size is a user-supplied parameter, and should balance the
desired image details with computational efficiency. It should also be
compatible with the input requirements of the deep models to be used.
Generally, the image will consist of bars of widths proportional to the
feature values, arranged in multiple rows and columns. The user speci-
fies the desired number of rows, r. Accordingly, the number of bars per
row becomes c = ⌈m/r⌉. By construction, the image space is divided equally among all features; the maximum bar width is calculated as b = W/c, and each bar has a height of h = H/r. For example, when r = 1, a dataset with 9 features will be represented as an image with 9 bars arranged in a single row, as illustrated in Fig. 3. For datasets with a larger number of features, to ensure the bars have sufficient image support, r should be increased. As illustrated in Fig. 4, as the number of rows increases, the width of each bar correspondingly expands, accompanied by a reduction in the bar's height. In our experiments, the impact of the bar arrangements on classification performance is investigated.
Figure 3: Visual representation of a sample from a 9-feature tabular dataset. The bar widths are proportional to feature values. Note that the 8th feature has a value of zero, resulting in a bar with no width against the white background.
Figure 4: Tab2Visual representations of a sample with 40 features arranged in different configurations: (a) one row, (b) two rows, (c) four rows, (d) eight rows.
•Feature Encoding: Each feature x_i^(j) of the i-th sample is represented as a vertical bar in the image I_i having a height h and a width w_j proportional to the normalized feature value. That is, w_j = x_i^(j) × b. A feature with a normalized value of 1 will be represented by a bar occupying the full bar support h × b within the image. This representation maintains the relative feature magnitudes. Moreover, each feature is assigned a color, which enhances the visual differentiation between features and facilitates image interpretation; see Fig. 3. A minimal code sketch of these encoding steps is given after this list.
•Image Augmentation: This step is optional but highly recommended
for smaller datasets. In such cases, a common practice in deep learn-
ing is to employ augmentation techniques to increase the dataset size
9
(a) One row (b) Two rows
(c) Four rows (d) Eight rows
Figure 4: Tab2Visual representations of a sample with 40 features arranged in different
configurations.
and diversity without collecting additional data samples. Data aug-
mentation involves transforming the data, thus changing the data fea-
tures, while preserving its correspondence to the label/class. Apply-
ing augmentation methods to heterogeneous tabular data presents sig-
nificant challenges [7]. One direct advantage of representing data as
images is the availability of numerous effective image augmentation
techniques (e.g., mirroring, rotation, translation, cropping, intensity
changing) [23, 35]. Nevertheless, we propose a set of image augmenta-
tion techniques specifically tailored for Tab2Visual-generated represen-
tations. Our approach incorporates elastic distortions and morphologi-
cal operations, including dilation, erosion, opening, and closing, applied
at varying scales. These operations are carefully selected to introduce
meaningful variations to the image representations.
More specifically, the proposed augmentation methods are implemented
using the Albumentations library [35], a fast and flexible open-source
library for image augmentations. Initially, an image undergoes elas-
tic distortion. The extent of elastic distortion is controlled by two
parameters: the scaling factor, α, which determines the intensity of the displacement field, and σ, the standard deviation of the Gaussian filter applied to this field. Higher α values result in greater distortion, while σ controls the nature of the distortion: lower σ values produce small, localized ripples, while higher σ values lead to larger, smoother wavy distortions.
Subsequently, the image undergoes random morphological operations:
dilation and erosion. These operations are applied with probabilities Pd and Pe, respectively, and with randomly sized structuring elements SEe and SEd. A structuring element is a matrix that defines the neighborhood used to process each pixel. As the size of the structuring element increases, the extent of the dilation or erosion effect becomes more pronounced. To introduce further variability, the order of dilation and erosion is randomized, leading to images that are either dilated only, eroded only, morphologically closed (dilated followed by eroded), or morphologically opened (eroded followed by dilated).
As the critical information in the proposed image representation re-
sides primarily in bar widths, these augmentations subtly and non-
uniformly alter bar boundaries and widths. This effectively generates
synthetic samples that closely resemble the original data distribution,
enhancing the model’s generalization ability. Fig. 5 illustrates the im-
age representations of three samples of a dataset, each accompanied
by two augmented versions. Note the slight to moderate nonuniform
adjustments introduced by the augmentation operations along the im-
age edges, thereby generating new samples that closely resemble the
original distribution.
Figure 5: Example of data augmentation applied to Tab2Visual image representations of the Juice dataset. Three original samples (A, B, C) are shown on the left, each accompanied by two augmented versions.
•Model Selection and Transfer Learning: After image prepara-
tion and optional augmentation, the images are utilized to train a deep
learning model. The proposed approach offers flexibility in model selec-
tion, allowing for the use of either CNNs or ViTs depending on the spe-
cific task demands. CNNs, with their hierarchical structure, are well-
suited for extracting both local and global features, while ViTs excel in
capturing long-range dependencies through self-attention mechanisms.
Training can be performed from scratch, learning all parameters from
the dataset, or through transfer learning, where a pre-trained model is
adapted to the specific task.
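To make the encoding steps above concrete, the following is a minimal Python sketch of the image-preparation and feature-encoding stage. It is an illustration rather than the implementation used in this work: the function name tab2visual, the random per-feature colors, and the use of scikit-learn's MinMaxScaler are choices made only for the example.

```python
import numpy as np
import pandas as pd
from math import ceil
from sklearn.preprocessing import MinMaxScaler

def tab2visual(X, H=224, W=224, r=1, colors=None):
    """Encode each row of a numeric feature matrix X as an H x W RGB bar image,
    following the bar layout described above (r rows of bars)."""
    n, m = X.shape
    c = ceil(m / r)                      # bars per row
    h, b = H // r, W // c                # bar height and maximum bar width
    rng = np.random.default_rng(0)
    if colors is None:                   # one (arbitrary) color per feature
        colors = rng.integers(0, 256, size=(m, 3))
    X01 = MinMaxScaler().fit_transform(X)                 # min-max normalization to [0, 1]
    images = np.full((n, H, W, 3), 255, dtype=np.uint8)   # white background
    for i in range(n):
        for j in range(m):
            rj, cj = j // c, j % c                         # row and column of bar j
            wj = int(round(X01[i, j] * b))                 # bar width proportional to value
            y0, x0 = rj * h, cj * b
            images[i, y0:y0 + h, x0:x0 + wj] = colors[j]
    return images

# Example usage: categorical features are one-hot encoded before the transform.
df = pd.DataFrame({"age": [45, 60, 30], "stage": ["A", "B", "A"], "psa": [2.1, 7.5, 1.3]})
X = pd.get_dummies(df, columns=["stage"]).to_numpy(dtype=float)
imgs = tab2visual(X, r=1)
print(imgs.shape)   # (3, 224, 224, 3)
```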
In summary, Algorithm 1 outlines all the steps of the Tab2Visual ap-
proach for training a deep model M on a tabular dataset D, while Algo-
rithm 2 details the augmentation methods of the approach. The required
layout of the bars in the produced images is defined by the input parameter
r, whereas the desired scale of augmentation is determined by K.
Algorithm 1 The Tab2Visual approach for training a deep model on a tabular dataset
Require: Tabular dataset D = {(x_i, y_i)}_{i=1}^{n} with m features; Desired image dimensions W and H; Selected number of rows r; Scale of augmentation K; Deep learning model M.
Ensure: Trained model M.
1: Data Preparation and Normalization:
2: Encode categorical and ordinal features
3: Apply 0-1 normalization such that x_i^(j) ∈ [0, 1], ∀i = 1, ..., n, ∀j = 1, ..., m
4: Image Preparation and Feature Encoding:
5: Output dataset I ← ∅
6: Number of columns c ← ⌈m/r⌉
7: Bar height h ← H/r
8: Maximum bar width w ← W/c
9: for i = 1 to n do
10:   Initialize image I_i of size H × W to background color
11:   for j = 1 to m do
12:     Row index r_j ← ⌈j/c⌉
13:     Column index c_j ← j − (r_j − 1)c
14:     Bar width w_j ← w × x_i^(j)
15:     Bar position:
16:       x_start ← (c_j − 1)w
17:       y_start ← (r_j − 1)h
18:     Assign color color_j for feature j
19:     Draw rectangle in I_i at (x_start, y_start) with width w_j, height h, and color color_j
20:   end for
21:   I ← I ∪ {(I_i, y_i)}
22:   Optional Image Augmentation:
23:   k ← K
24:   while k > 0 do
25:     I′ ← AugmentImage(I_i)  ▷ Apply Algorithm 2
26:     I ← I ∪ {(I′, y_i)}
27:     k ← k − 1
28:   end while
29: end for
30: Model Training:
31: Train or fine-tune model M on the output dataset I
Algorithm 2 Image Augmentation for Tab2Visual (AugmentImage)
Require: Image I; Elastic distortion parameters α and σ; Morphological operation probabilities Pd and Pe; Structuring element sizes Se and Sd
Ensure: Augmented image I′.
1: Elastic Distortion:
2: I′ ← ApplyElasticDistortion(I, σ, α)
3: Morphological Operations:
4: Choose two random numbers, u and v, drawn uniformly from the interval [0, 1]
5: Generate structuring element SEd randomly with size Sd or smaller
6: Generate structuring element SEe randomly with size Se or smaller
7: if (u < Pd) and (v < Pe) then
8:   Choose a random number a drawn uniformly from the interval [0, 1]
9:   if a < 0.5 then
10:     I′ ← Erode(Dilate(I′, SEd), SEe)  ▷ Morphological closing
11:   else
12:     I′ ← Dilate(Erode(I′, SEe), SEd)  ▷ Morphological opening
13:   end if
14: else if u < Pd then
15:   I′ ← Dilate(I′, SEd)
16: else if v < Pe then
17:   I′ ← Erode(I′, SEe)
18: else
19:   I′ remains unchanged
20: end if
21: Output:
22: Return augmented image I′

3. Experimental Results
This section presents a detailed experimental evaluation of Tab2Visual.
We begin with an overview of the datasets used in our experiments, followed
by a description of the machine learning methods employed for comparison.
Next, we discuss the configuration and parameters of Tab2Visual, including
the backbone deep models used for training. Finally, we report the exper-
imental results. A detailed discussion of these results and our findings is
presented in Section 4.
3.1. Datasets
Our study incorporates ten diverse datasets from the UCI Machine Learn-
ing Repository, a renowned public resource. These datasets span a range
of domains, including medical, e-commerce, and engineering, providing a
comprehensive evaluation ground for Tab2Visual across various application
scenarios. Table 1 summarizes these datasets, which include both small-
scale datasets (e.g., Heart Failure, Juice, Diabetes, Breast Cancer, Glass,
US Presidential Election Results) with fewer than 1000 samples and larger-
scale datasets (e.g., Satellite, Employee, Electrical Grid, Telescope) with
6000 or more samples. This diverse selection enables a thorough evaluation
of Tab2Visual’s generalization capabilities across different data scales.
Table 1: Summary of Datasets Used to Evaluate Tab2Visual.
Dataset                           Size (n)   Attributes (m)   Number of Classes (C)
Heart Failure (HRT)               200        16               2
Juice (JU)                        1070       19               2
Diabetes (DIA)                    768        9                2
Breast Cancer (BC)                683        10               2
Glass (GL)                        214        10               6
US Presidential Election (PE)     497        7                2
Satellite (SAT)                   6435       37               6
Employees (EMP)                   14,999     10               2
Electrical Grid (EG)              10,000     13               2
Telescope (TEL)                   19,020     11               2
3.2. Competing Machine Learning Methods
To rigorously evaluate Tab2Visual, we compare its performance against
a diverse set of classification algorithms:
•Conventional Methods: Logistic Regression, Support Vector Machines
(SVM), and a shallow Multi-Layer Perceptron (MLP), which are com-
monly used for tabular data classification [7].
•Tree-based Ensembles: Random Forest [36], ExtraTrees [37], XGBoost [38],
LightGBM [39], and CatBoost [40], renowned for their robustness and
ability to capture complex feature interactions. They serve as strong
baselines for tabular data classification [7, 6, 22].
•Deep Learning Methods: State-of-the-art models specifically designed for tabular data, such as TabNet [10] (implementation: https://github.com/dreamquark-ai/tabnet) and TabPFN [12, 13] (implementation: https://github.com/PriorLabs/TabPFN; Version 1 of the software was used for this study, while Version 2 became available during the manuscript submission process), which leverage advanced techniques like attention mechanisms and probabilistic forecasting to automatically extract meaningful patterns in the data.
This comprehensive comparison provides valuable insights into the strengths
and limitations of Tab2Visual relative to established and cutting-edge ap-
proaches for tabular data classification.
3.3. Tab2Visual Configuration
Our experiments utilize EfficientNet [33] as the initial backbone model
for Tab2Visual. The EfficientNet series is a family of CNN architectures
known for achieving state-of-the-art accuracy while maintaining high effi-
ciency across various classification tasks [33]. This series employs a compound
scaling method to uniformly scale network width, depth, and resolution, lead-
ing to a family of models (EfficientNet-B0 to B7) with varying complexity.
The foundational EfficientNet-B0 model was developed through a synergistic
approach that combined AutoML and the mobile neural architecture search
framework.
EfficientNetV2 [41] builds on its predecessor by integrating training-aware
neural architecture search with scaling strategies, resulting in enhanced ef-
ficiency and accuracy. Notable architectural modifications include the re-
moval of depthwise convolutions and squeeze-and-excitation blocks in the
early layers, along with the use of smaller kernel sizes and contraction ratios
in mobile blocks. Additionally, an improved progressive learning technique
accelerates training while boosting accuracy. Empirical studies [41, 42, 43]
have highlighted EfficientNetV2’s strong performance in transfer learning
tasks, achieving high accuracy with fewer parameters than competing mod-
els. These characteristics make it particularly suitable for transfer learning,
especially in scenarios with limited data. To minimize overfitting and com-
putational costs, we opt for the smallest variant, EfficientNetV2-B0, which
is pre-trained on ImageNet-1K [44], a labeled dataset comprising 1.2 million
images spanning 1,000 categories.
In the Tab2Visual experiments, we set the various parameters as follows. We select r = 1 for all datasets, arranging the bars in one row in the produced images. We set the image height H and width W to 224 pixels each. This choice aligns with the input size requirement of EfficientNetV2, which is designed to accept images of 224 × 224 pixels. For the smaller datasets, namely Heart, Glass, Diabetes, Breast Cancer, US Election, and Juice, we apply the proposed set of augmentation methods (see Algorithm 2) to the training partition of each dataset. We experiment with different control parameter settings. For the elastic distortion part, we use α ∈ [40, 60] and σ ∈ [3, 5]. In the morphological operation part, we employ Pd ∈ [0.6, 0.8] and Pe ∈ [0.6, 0.8]. Both structuring elements SEe and SEd are randomly set to size (2, 5) or smaller.
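For illustration, one such augmentation pass could be sketched as follows, combining Albumentations' ElasticTransform with OpenCV morphological operations. The helper name augment_image, the uniform sampling of α and σ within the stated ranges, and the exact kernel-size sampling are assumptions made for the example rather than the precise implementation used in this work.

```python
import cv2
import numpy as np
import albumentations as A

def augment_image(img, alpha=(40, 60), sigma=(3, 5), p_d=0.7, p_e=0.7, max_se=5):
    """One augmentation pass in the spirit of Algorithm 2: elastic distortion
    followed by randomly ordered dilation/erosion with random kernel sizes."""
    a = np.random.uniform(*alpha)
    s = np.random.uniform(*sigma)
    out = A.ElasticTransform(alpha=a, sigma=s, p=1.0)(image=img)["image"]

    k_d = np.random.randint(2, max_se + 1)            # random structuring-element sizes
    k_e = np.random.randint(2, max_se + 1)
    se_d = cv2.getStructuringElement(cv2.MORPH_RECT, (k_d, k_d))
    se_e = cv2.getStructuringElement(cv2.MORPH_RECT, (k_e, k_e))

    u, v = np.random.rand(2)
    if u < p_d and v < p_e:                           # both drawn: random closing or opening
        if np.random.rand() < 0.5:
            out = cv2.erode(cv2.dilate(out, se_d), se_e)   # morphological closing
        else:
            out = cv2.dilate(cv2.erode(out, se_e), se_d)   # morphological opening
    elif u < p_d:
        out = cv2.dilate(out, se_d)
    elif v < p_e:
        out = cv2.erode(out, se_e)
    return out
```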
To mitigate overfitting, particularly when dealing with small datasets, we
implement a multi-faceted strategy during backbone model training. Firstly,
we apply the proposed image augmentation methods to small datasets to
expand the effective dataset size. Secondly, we utilize EfficientNetV2-B0,
the smallest and least complex model in the EfficientNetV2 family, which
inherently minimizes the risk of overfitting owing to its smaller number of
parameters. Thirdly, we employ transfer learning, in which the pre-trained
weights of the deep model are frozen, except for the weights of the final clas-
sification layer. This allows the model to adapt to the specific classification
task while preserving learned feature representations. Finally, we incorporate
a combination of regularization techniques:
•Batch Normalization: This technique stabilizes training and reduces
internal covariate shift, allowing the model to learn more effectively
across different mini-batches.
•Strong Weight Decay (L2 Regularization): By penalizing large weights,
this technique promotes simpler models that are less prone to overfit-
ting.
•Dropout: Randomly deactivating neurons during training reduces re-
liance on individual neurons and encourages more robust feature rep-
resentations.
This multi-faceted approach effectively mitigates overfitting, enabling our
model to generalize well to unseen data.
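As a concrete illustration of this setup, the sketch below loads an ImageNet-pre-trained EfficientNetV2-B0 through the timm library, replaces the classification head with a small MLP (one hidden layer of 128 units with dropout, mirroring the Tab2Visual hyperparameters in Table 3), and freezes all remaining weights. The checkpoint name, the use of timm, and the AdamW optimizer are assumptions made for the example.

```python
import timm
import torch
import torch.nn as nn

num_classes = 2  # e.g., a binary tabular classification task

# ImageNet-1K pre-trained EfficientNetV2-B0 (timm checkpoint name assumed)
model = timm.create_model("tf_efficientnetv2_b0", pretrained=True)

# Replace the classifier with a one-hidden-layer MLP head (128 units, dropout 0.5)
model.classifier = nn.Sequential(
    nn.Linear(model.classifier.in_features, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, num_classes),
)

# Freeze all pre-trained weights except the new classification head
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

# Only the head is optimized; weight decay adds the L2 regularization described above
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-4, weight_decay=5e-3,
)
```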
3.4. Experimental Setup and Results
All methods mentioned in Section 3.2, in addition to Tab2Visual, are
implemented in Python and evaluated using 5-fold cross-validation. Two
metrics are used to assess a method’s performance: the macro average F1-
score as well as the Area Under the Receiver Operating Characteristic Curve
(AUC). The former metric measures the harmonic mean of precision and
recall of a method, providing a balanced assessment, while the latter offers
a measure of a method’s average performance across different trade-offs be-
tween true positive and false positive rates at various thresholds.
When augmentation is used, the data is first divided into non-overlapping train-
ing and testing partitions. Augmentation is then applied exclusively to the
training partition to prevent data leakage. This ensures that augmented data
from the training partition do not inadvertently influence the evaluation on
the held-out test partition, maintaining the integrity and generalizability of
the evaluation process. Different augmentation scales are investigated (see
Table 2) to evaluate the influence of augmentation on model performance.
Table 2: Augmentation Scales.
K    Augmented Data    Description
0    A0                No augmentations applied
1    A1                Each image augmented once
2    A2                Each image augmented twice
3    A3                Each image augmented three times
4    A4                Each image augmented four times
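A minimal sketch of this leakage-free protocol is given below: each training fold is expanded with K augmented copies per image while the test fold is left untouched, and macro F1 and AUC are computed per fold. The helper names (make_model, augment_fn) and the scikit-learn-style fit/predict_proba interface assumed for the wrapped model are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score

def cross_validate(make_model, images, y, augment_fn=None, K=0, n_splits=5, seed=0):
    """Leakage-free evaluation: augmentation is applied to the training fold only,
    and macro F1 / ROC AUC are computed on the untouched test fold."""
    f1s, aucs = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in skf.split(np.arange(len(y)), y):
        X_tr, y_tr = list(images[tr]), list(y[tr])
        if augment_fn is not None:
            for idx in tr:                        # K augmented copies per training image
                for _ in range(K):
                    X_tr.append(augment_fn(images[idx]))
                    y_tr.append(y[idx])
        model = make_model()                      # fresh, untrained model for each fold
        model.fit(np.stack(X_tr), np.array(y_tr))
        proba = model.predict_proba(images[te])
        f1s.append(f1_score(y[te], proba.argmax(axis=1), average="macro"))
        aucs.append(roc_auc_score(y[te], proba[:, 1]))   # binary-task form of AUC
    return float(np.mean(f1s)), float(np.mean(aucs))
```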
To ensure fair comparison, we perform hyperparameter optimization for
all methods using the Optuna [45] framework with 100 iterations. Each
hyperparameter configuration is cross-validated with the 5 folds. Table 3
summarizes the hyperparameters and their search spaces for each method,
including the best-performing hyperparameter settings for the US Presiden-
tial Election Results dataset. For details on hyperparameter abbreviations,
please refer to the scikit-learn documentation [46]. All experiments are per-
formed on a workstation with an Intel Core i9 CPU (3 GHz, 18 cores), 256
GB RAM, and an NVIDIA Quadro RTX 5000 GPU (12 GB VRAM).
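As an illustration of this tuning procedure, the sketch below optimizes the XGBoost search space of Table 3 with Optuna over 100 trials; the use of cross-validated macro F1 as the objective and the variables X and y (one dataset's preprocessed features and labels) are assumptions made for the example.

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# X, y: the preprocessed feature matrix and labels of one dataset (assumed loaded)
def objective(trial):
    # Search space mirroring the XGBoost rows of Table 3
    params = dict(
        n_estimators=trial.suggest_int("n_estimators", 1, 500),
        max_depth=trial.suggest_int("max_depth", 1, 40),
        gamma=trial.suggest_float("gamma", 0.0, 1.0),
        learning_rate=trial.suggest_float("learning_rate", 1e-3, 1.0, log=True),
        reg_alpha=trial.suggest_float("reg_alpha", 0.0, 2.0),
        reg_lambda=trial.suggest_float("reg_lambda", 0.0, 2.0),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.5, 1.0),
    )
    model = XGBClassifier(**params, eval_metric="logloss")
    return cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)   # 100 iterations, as in the text
print(study.best_params)
```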
Table 4 summarizes average F1-score and AUC results for all meth-
ods across all datasets. Note that image augmentation is applied exclusively to
datasets with fewer than 1000 samples. Therefore, Tab2Visual results are not
reported for augmented data (Ak, for k > 0) for larger datasets. In contrast,
there are 5 variants of the Tab2Visual model on smaller datasets (one for
each augmentation level). Additionally, due to the method’s limitations as
noted in the original paper [12], TabPFN results are not reported for datasets
larger than 1000 samples.
4. Discussion
Here, we present a detailed comparative analysis of the accuracy and
speed of Tab2Visual against other methods. We evaluate the performance
of Tab2Visual across various datasets, examining its strengths and limita-
tions in different scenarios. Furthermore, we perform an in-depth analysis of
each key component of the Tab2Visual approach, including the impact of im-
age augmentation, the benefit from transfer learning, the choice of backbone
model, and the influence of different image feature arrangements.
4.1. Performance Comparison on Different Datasets
Our experimental evaluation reveals varying performance across datasets,
with no universally superior method. On smaller datasets like Heart, Juice,
Diabetes, Breast Cancer, Glass, and US Election, Tab2Visual model variants
(particularly from A4 and A3) demonstrate strong performance, achieving
high F1-scores and AUC values. More specifically, Tab2Visual-A4 achieves
the highest F1-score in the Heart and Diabetes datasets, indicating good
balance between precision and recall on both datasets. It also achieves the
highest AUC in the Heart, Juice and USA Election datasets. Tab2Visual-A3
stands out with superior F1-score and AUC metrics in the Breast Cancer
dataset.
Advanced deep learning models also demonstrate strong performance.
TabNet excels in the Juice and USA Election datasets, achieving the high-
est F1-scores in both. TabNet exhibits the highest F1-score and AUC in
Table 3: Optimizing hyperparameters for all classification models.
Model Hyperparameter Search Range Best Configuration
Logistic Regression penalty [l1, l2] l2
C 0.01 to 1 0.833
SVM
C 0.001 to 100 38.81
kernel [linear, poly, rbf] rbf
gamma 0.001 to 100 0.01
Random Forest
n estimators 1 to 500 100
max depth 1 to 40 9
min samples split 2 to 14 7
min samples leaf 1 to 14 4
max features [auto, sqrt, log2] auto
Extra Trees
n estimators 1 to 500 150
max depth 1 to 40 5
min samples split 2 to 14 10
min samples leaf 1 to 14 5
max features [auto, sqrt, log2] auto
XGBoost
n estimators 1 to 500 300
max depth 1 to 40 10
gamma 0 to 1 0.011
learning rate 0.001 to 1 0.05
reg alpha 0 to 2 0.814
reg lambda 0 to 2 1.478
subsample 0.5 to 1 0.614
colsample bytree 0.5 to 1 0.7
LightGBM
n estimators 1 to 500 260
max depth 1 to 20 7
num leaves 2 to 256 70
learning rate 0.01 to 1 0.1
reg alpha 0 to 2 0.11
reg lambda 0 to 2 1.03
subsample 0.5 to 1 0.92
colsample bytree 0.5 to 1 0.81
CatBoost
iterations 50 to 300 100
learning rate 0.01 to 0.3 0.12
depth 2 to 12 6
l2 leaf reg 1 to 10 3.46
MLP
n hidden layers [1, 2, 3] 2
n neurons hidden layer [32, 64, 128, 256] [128, 64]
learning rate 0.0001 to 0.1 0.001
batch size [16, 32, 64, 128] 16
weight decay 0.00001 to 0.01 0.002
drop prob 0.1 to 0.7 0.4
TabNet
batch size [8, 16, 32, 64] 8
mask type [entmax, sparsemax] entmax
n d 8 to 64 (step 4) 8
n a 8 to 64 (step 4) 8
n steps 1 to 8 (step 1) 2
gamma 1.0 to 1.4 (step 0.2) 1.291
n shared 1 to 3 1
lambda sparse 0.0001 to 1 0.003
patienceScheduler 3 to 10 6
learning rate 0.001 to 1 0.023
TabPFN n ensemble configurations [1, 32] 4
Tab2Visual
n hidden layers [1, 2, 3] 1
n neurons hidden layer [32, 64, 128, 256] 128
learning rate 0.000001 to 1 0.0005
batch size [16, 32, 64, 128] 32
weight decay 0.00001 to 0.5 0.005
drop prob 0.1 to 0.7 0.5
Table 4: Performance comparison of Tab2Visual against other machine learning methods
on different datasets.
HRT JU DIA BC GL PE SAT EG EMP TEL
Logistic Regression F1
AUC
0.5131
0.7216
0.7555
0.8907
0.6174
0.8017
0.9545
0.9914
0.4010
0.8011
0.9261
0.9811
0.7336
0.9600
0.8599
0.8922
0.7560
0.8382
0.6711
0.8328
SVM F1
AUC
0.4762
0.6512
0.6814
0.8147
0.5861
0.6836
0.9432
0.9872
0.3396
0.7803
0.9173
0.9514
0.7137
0.9432
0.8140
0.8805
0.7369
0.8157
0.7260
0.7887
Random Forest F1
AUC
0.4444
0.7327
0.7105
0.8666
0.5722
0.7938
0.9423
0.9937
0.6470
0.9209
0.9238
0.9763
0.8863
0.9905
0.9361
0.9774
0.9748
0.9913
0.8091
0.9307
Extra Trees F1
AUC
0.4328
0.7287
0.6930
0.8426
0.5592
0.7782
0.9465
0.9947
0.6874
0.9329
0.9094
0.9628
0.8914
0.9914
0.9391
0.9813
0.9665
0.9899
0.7976
0.9289
XGBoost F1
AUC
0.4796
0.7333
0.7160
0.8648
0.5765
0.7639
0.9318
0.9916
0.6515
0.8901
0.9204
0.9698
0.8875
0.9911
0.9544
0.9876
0.9694
0.9926
0.8194
0.9353
LightGBM F1
AUC
0.4444
0.6982
0.7228
0.8681
0.5553
0.7665
0.9434
0.9948
0.5740
0.9086
0.9113
0.9690
0.8904
0.9913
0.9497
0.9857
0.9697
0.9932
0.8150
0.9320
CatBoost F1
AUC
0.4991
0.7754
0.7322
0.8839
0.5722
0.7918
0.9435
0.9943
0.6577
0.9290
0.9212
0.9790
0.8942
0.9925
0.9599
0.9906
0.9628
0.9919
0.8131
0.9353
MLP F1
AUC
0.5049
0.6544
0.7707
0.8919
0.6368
0.8158
0.9615
0.9936
0.5178
0.7947
0.9381
0.9767
0.8612
0.9828
0.9336
0.9714
0.9079
0.9750
0.7858
0.9148
TabNet F1
AUC
0.4976
0.6778
0.7837
0.8887
0.6205
0.7995
0.9609
0.9949
0.5156
0.7834
0.9409
0.9808
0.8837
0.9863
0.9724
0.9950
0.9447
0.9844
0.8066
0.9302
TabPFN F1
AUC
0.4287
0.7193
0.7658
0.8944
0.6321
0.8216
0.9614
0.9925
0.5349
0.8264
0.9333
0.9829
— — — —
Tab2Visual A0F1
AUC
0.5246
0.7346
0.7510
0.8672
0.6262
0.7790
0.9561
0.9884
0.6014
0.8604
0.9331
0.9655
0.8761
0.9826
0.9353
0.9742
0.9513
0.9830
0.7823
0.9100
Tab2Visual A1F1
AUC
0.5014
0.7908
0.7428
0.8288
0.6227
0.7693
0.9591
0.9872
0.6182
0.8817
0.9204
0.9704
— — — —
Tab2Visual A2F1
AUC
0.5179
0.7374
0.7402
0.8710
0.6211
0.7809
0.9501
0.9880
0.6119
0.8822
0.9148
0.9759
— — — —
Tab2Visual A3F1
AUC
0.4962
0.7601
0.7488
0.8596
0.6291
0.7891
0.9647
0.9954
0.6469
0.8892
0.9260
0.9512
— — — —
Tab2Visual A4F1
AUC
0.5390
0.7755
0.7675
0.8957
0.6440
0.7873
0.9562
0.9906
0.6396
0.8927
0.9270
0.9841
— — — —
the Electrical Grid dataset, while TabPFN achieves the highest AUC in the
Diabetes dataset. Tree-based ensembles, such as CatBoost, ExtraTrees, XG-
Boost, and LightGBM, consistently exhibit robust performance. ExtraTrees
achieves the highest F1-score and AUC in the Glass dataset. In the Satellite
dataset, CatBoost demonstrates superior performance, while XGBoost excels
in the Telescope dataset. While traditional models like Logistic Regression,
MLP, and SVM show commendable results, they are often outperformed by
the other methods across the various datasets.
While no single method consistently outperforms all others across all
datasets, distinct performance trends emerge when comparing results on
smaller (≤1000 samples) and larger (≥6000 samples) datasets. Figures 6
and 7 illustrate the average F1-score and AUC performance of each method
on these two dataset categories, respectively. Furthermore, Figures 8 and 9
depict the average ranks of the competing methods on smaller and larger
datasets based on F1-score and AUC. These figures clearly demonstrate the
varying performance trends of different methods across different dataset sizes.
Our analysis reveals a clear advantage for Tab2Visual on smaller datasets.
Tab2Visual models, particularly A4 and A3, consistently achieve high average
F1-scores and AUC values on these datasets. In fact, Tab2Visual-A4 achieves
the highest average F1-score of 74.6% in Figure 6, indicating strong performance in balancing precision and recall. It surpasses the second-place methods
(CatBoost, MLP, and TabPFN) by at least 2%, with a notable 4% improvement over TabPFN. CatBoost and Tab2Visual-A4 achieve the highest average AUC on small datasets, with CatBoost exhibiting a slight performance
edge (89.2% vs. 88.7%). Significantly, Tab2Visual-A4 attains the best average rank among all methods, as demonstrated in Figure 8. Conversely, on
larger datasets, tree-based ensembles show very close performance, achieving
top positions in terms of average AUC (about 97%), F1-score (about 90%),
and rank (see Figures 7 and 9). TabNet, followed by Tab2Visual, comes next in performance on larger datasets. SVM shows the weakest performance on both smaller and larger datasets.
These results demonstrate that Tab2Visual exhibits a distinct advan-
tage in handling limited data, consistently outperforming other methods on
smaller datasets. This performance gain can be attributed to several fac-
tors. Tab2Visual’s image augmentation techniques effectively enhance data
diversity and size, particularly beneficial for small datasets. Moreover, the
EfficientNet architecture within Tab2Visual, combined with transfer learning,
leverages knowledge learned from larger, pre-trained datasets, compensating
for the limited availability of training data. Conversely, on larger datasets,
the benefits of data augmentation may be less pronounced due to the inherent
diversity within the data itself. This allows tree-based ensembles, which ex-
cel at capturing complex relationships within larger datasets, to demonstrate
superior performance.
Figure 6: Average F1-score and AUC of different classification algorithms on smaller datasets (≤1000 samples) as evaluated in our experiments. Higher values indicate better performance.
Figure 7: Average F1-score and AUC of different classification algorithms on larger datasets (≥6000 samples) as evaluated in our experiments. Higher values indicate better performance.
Figure 8: Average rank of classification algorithms based on F1-score and AUC across smaller datasets (≤1000 samples) as evaluated in our experiments. Lower ranks indicate better overall performance.
Figure 9: Average rank of classification algorithms based on F1-score and AUC across larger datasets (≥6000 samples) as evaluated in our experiments. Lower ranks indicate better overall performance.
4.2. Augmentation Effect
Tab2Visual incorporates image augmentation techniques to enhance data
diversity and effectively increase the dataset size. Figure 10 illustrates the
impact of varying augmentation levels from A0 to A4 on Tab2Visual's average
performance across smaller datasets. A consistent improvement is observed
in both average F1-score and AUC with increasing augmentation levels. For
instance, the average AUC increases from 86.6% at A0 to 88.7% at A4, repre-
senting a 2.1% gain when each training sample is augmented four times using
the methods outlined in Algorithm 2. Similarly, an approximately 1.4% gain
in F1-score is observed at augmentation level A4. These results unequiv-
ocally demonstrate that augmentation significantly enhances Tab2Visual’s
generalization ability and improves its predictive accuracy.
Figure 10: Impact of augmentation levels (from A0 to A4) on Tab2Visual's average performance on smaller datasets. Performance is measured in terms of F1-score and AUC. Higher values indicate better performance.
4.3. Training from Scratch
Here, we investigate the benefits of transfer learning by comparing the
performance of EfficientNetV2-B0 trained from scratch (without ImageNet
pre-training) against the pre-trained model. All other experimental settings
are matched to those described in Section 3.4. Experiments are performed
on the original, unaugmented datasets. The results in Table 5 clearly demon-
strate the advantages of transfer learning. On smaller datasets, pre-training
yields substantial improvements, with average AUC gains of 7.5% (peaking
at 11% for the glass dataset) and average F1-score gains of 6.7%. Smaller
improvements (around 1% for both AUC and F1-score) are also observed on
larger datasets. These results suggest that the abundance of data can reduce
the necessity for extensive pre-training.
Table 5: Training Tab2Visual from Scratch vs. Transfer Learning on Different Datasets.
Dataset    From-Scratch F1-Score    From-Scratch AUC    Transfer F1-Score    Transfer AUC
HRT 0.4439 0.6933 0.5246 0.7346
JU 0.6690 0.8016 0.7510 0.8672
DIA 0.5484 0.6889 0.6262 0.7790
BC 0.8825 0.8984 0.9561 0.9884
GL 0.5474 0.7504 0.6014 0.8604
PE 0.8983 0.9141 0.9331 0.9655
SAT 0.8672 0.9661 0.8761 0.9826
EMP 0.9389 0.9751 0.9353 0.9742
EG 0.9291 0.9567 0.9513 0.9830
TEL 0.7664 0.9053 0.7823 0.9100
4.4. CNNs vs ViTs
The proposed Tab2Visual methodology is general and can be used with
several deep network architectures as a backbone. In our earlier experiments,
we have demonstrated its performance using a CNN backbone from the Ef-
ficientNet family [33]. To further investigate the impact of the backbone
model, in another series of experiments, we replace EfficientNetV2-B0 with
EfficientViT [34], a memory-efficient ViT optimized for both accuracy and
efficiency. EfficientViT leverages Cascaded Group Attention (CGA) to signif-
icantly reduce computational overhead while effectively capturing local and
global image features. This enables us to directly compare the performance
of ViTs, which excel at capturing long-range dependencies through atten-
tion mechanisms, against CNNs like EfficientNetV2 within the Tab2Visual
framework.
We compare the results of EfficientViT against those obtained using Ef-
ficientNetV2 as the backbone model. We utilize EfficientViT pre-trained on
ImageNet-1k [44] and fine-tune it on each dataset, following the experimental
setup described in Section 3.4. The performance of EfficientViT is then com-
pared with that of EfficientNetV2 in terms of average AUC and F1-score on
smaller datasets for various augmentation levels as well as on larger datasets
without augmentation, see Table 6.
Table 6: Comparison of EfficientNetV2 and EfficientViT Backbones for Tab2Visual: Average Performance on Various Datasets.
Dataset Type       Augmentation    CNN F1-score    CNN AUC    ViT F1-score    ViT AUC
Smaller Datasets   A0              0.73            0.87       0.71            0.84
Smaller Datasets   A1              0.73            0.87       0.71            0.85
Smaller Datasets   A2              0.73            0.87       0.72            0.85
Smaller Datasets   A3              0.74            0.87       0.72            0.85
Smaller Datasets   A4              0.75            0.89       0.73            0.87
Larger Datasets    A0              0.89            0.96       0.90            0.97
EfficientNetV2 demonstrates superior performance on smaller datasets (≤
1000 samples) compared to EfficientViT, achieving higher F1-scores (0.73-
0.75 vs. 0.71-0.73) and AUC values (0.87-0.89 vs. 0.84-0.87). Interestingly,
the performance trend reverses on larger datasets (≥6000 samples), with
EfficientViT achieving slightly higher F1-scores (0.90 vs. 0.89) and AUC
values (0.97 vs. 0.96) than EfficientNetV2. Consistent with our previous
findings in Section 4.2, increasing data augmentation continues to enhance
the performance of both models across all dataset sizes.
Our findings suggest that EfficientNetV2’s powerful feature extraction ca-
pabilities make it a better choice for smaller datasets within our Tab2Visual
framework. However, on larger datasets, EfficientViT offers a competitive
advantage, likely owing to its ability to effectively model long-range depen-
dencies and complex feature interactions through its attention mechanisms.
EfficientViT’s optimized architecture, incorporating Cascaded Group Atten-
tion [34], enables it to scale effectively and maintain high accuracy with
increased data.
4.5. Different Feature Arrangements
We then explore how feature arrangement within the Tab2Visual image
representation affects performance. Using the Satellite dataset, which has
the largest number of features (37) among our datasets, we vary the number
of rows (r in Algorithm 1) used to arrange the features (refer to Fig. 4).
EfficientNetV2 is fine-tuned for each arrangement, following the experimental
setup in Sections 3.3 and 3.4. Figure 11 presents the results.
Figure 11: Impact of feature arrangement on Tab2Visual's performance: F1-score and AUC for different row configurations on the Satellite Dataset.
Our results in Fig. 11 show a strong influence of feature arrangement
(number of rows) on EfficientNetV2’s performance on the Satellite dataset.
We observe optimal performance with 1 or 2 rows, with negligible differ-
ences between these two. Increasing the number of rows results in a gradual
decline in both F1-score and AUC, with approximately 10% drop in both
metrics at r = 16. This suggests that compact arrangements tend to be
more effective for preserving feature relationships and maximizing general-
ization in Tab2Visual. Therefore, careful feature arrangement is essential for
optimizing model performance on a given dataset.
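As a usage note, this sweep can be reproduced with the tab2visual sketch from Section 2 simply by varying r; the variable X_satellite (the preprocessed Satellite feature matrix) is assumed.

```python
# Re-encode the Satellite features (m = 37) with different bar layouts
for r in (1, 2, 4, 8, 16):
    imgs = tab2visual(X_satellite, H=224, W=224, r=r)  # tab2visual: the sketch in Section 2
    print(r, imgs.shape)
```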
4.6. Time Performance
Fig. 12 illustrates the average 5-fold training time for each classifier across
all datasets, measured on a workstation with the specifications outlined in
Section 3.4. Traditional methods (Logistic Regression, SVM) exhibit the
shortest training times, typically under 10 seconds, making them suitable for
resource-constrained environments. Tree-based methods demonstrate moder-
ate training times, generally completing within a minute for smaller datasets
and under an hour for larger datasets. TabPFN stands out for its excep-
tional training speed on smaller datasets, consistently completing within 5
seconds. However, its applicability is limited to datasets smaller than 1000
samples [12] (the recently released Version 2 extends its applicability to datasets with up to 10,000 samples [13]). TabNet exhibits longer training times, ranging from several
minutes on smaller datasets to approximately 25 minutes on larger datasets,
reflecting the complexity of its architecture. Tab2Visual, even without aug-
mentation (i.e., A0), generally exhibits the longest training times, especially
on larger datasets like Satellite and Electrical Grid. It is worth noting that
Tab2Visual’s training time includes the time required to convert tabular data
into images. These results underscore the usual trade-off between a model’s
complexity and its training time, highlighting the importance of selecting
appropriate methods based on computational resources and task-specific re-
quirements.
Figure 12: Computational cost analysis: Heatmap of average 5-fold training times (in seconds) for different classifiers on various datasets.
During the test phase, all evaluated methods, including Tab2Visual, ex-
hibit rapid inference times, typically completing within a fraction of a second
(see Table 7). Notably, the inference time for Tab2Visual includes the time
required for the initial data-to-image transformation. These results demon-
strate the potential of all evaluated methods, including Tab2Visual, for near
real-time applications.
Table 7: Average Inference Times for All Methods Across Various Datasets.
Method                 Inference Time (ms)
SVM                    0.2
Logistic Regression    1.2
Extra Trees            1.0
Random Forest          1.7
LightGBM               1.5
XGBoost                2.0
CatBoost               2.4
MLP                    10.0
TabNet                 7.0
TabPFN                 101.0
Tab2Visual A0          289.0
5. Conclusions
This research addresses the critical challenge of limited data in tabular
data classification, a prevalent issue exacerbated by the high proportion of
small datasets observed in real-world applications. Deep learning models
often struggle with such data, and traditional data augmentation and trans-
fer learning methods are difficult to apply. To address this, we introduce
Tab2Visual, a novel data transformation strategy that converts heteroge-
neous tabular data into visual representations, unlocking the power of deep
learning models, such as CNNs and ViTs. Tab2Visual provides solutions for
applying transfer learning and data augmentation to tabular data. It lever-
ages image augmentation to enhance model generalization and effectively
increase training dataset size without additional data collection. We also
introduce a new set of efficient and semantically meaningful image augmen-
tation techniques tailored for Tab2Visual-generated images. Furthermore,
Tab2Visual capitalizes on transfer learning, allowing for the fine-tuning of
pre-trained models on specific tabular data classification tasks, thereby re-
ducing the reliance on large labeled datasets and allowing for knowledge shar-
ing across different tabular data domains. Unlike existing tabular-to-image
methods, Tab2Visual effectively handles data with limited features and of-
fers tailored augmentation strategies, making it particularly well-suited for
applications with small tabular datasets.
We have comprehensively evaluated our approach on ten diverse datasets,
benchmarking its performance against a broad spectrum of machine learn-
ing algorithms, encompassing classical methods, tree-based ensembles, and
state-of-the-art deep learning models specifically developed for handling tab-
ular data. Our experimental results demonstrate Tab2Visual’s effectiveness
in classification problems with limited tabular data. On smaller datasets
(≤1000 samples), Tab2Visual has outperformed all compared methods, in-
cluding specialized deep learning models like TabNet and TabPFN. Our experiments
also demonstrate that tree-based ensembles achieve somewhat better performance on
larger datasets.
Furthermore, our study provides valuable insights into the key factors
influencing Tab2Visual’s performance:
• Impact of data augmentation: Our experiments demonstrate that augmentation significantly enhances Tab2Visual’s generalization ability and improves its predictive accuracy, especially on smaller datasets (a generic augmentation sketch follows this list).
• Effectiveness of transfer learning: Transfer learning with pre-trained models significantly outperforms training from scratch, especially in data-limited scenarios.
• Backbone model selection: Within the Tab2Visual framework, the CNN-based EfficientNetV2 model has outperformed EfficientViT in our experiments on smaller tabular datasets (≤1000 samples), while EfficientViT demonstrates superior performance on larger datasets.
• Influence of feature arrangement in the visual representations: Our empirical analysis shows that arranging the features in more compact representations tends to be more effective for maximizing generalization in Tab2Visual.
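To make the augmentation point above concrete, the following is a generic Albumentations [35] pipeline applied to a Tab2Visual-style image. The operations shown are stock transforms chosen purely for illustration; they are not the tailored, semantically meaningful augmentations proposed in this work.

import albumentations as A
import numpy as np

# Generic augmentation pipeline (illustrative stock transforms only).
augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.1, p=0.5),
    A.GaussNoise(p=0.3),
])

# Stand-in for an image produced by the tabular-to-image transformation.
image = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)
augmented = augment(image=image)["image"]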
Several avenues for future research are currently being explored to fur-
ther enhance Tab2Visual. Firstly, we plan to investigate a broader range
of image augmentation techniques beyond the set proposed in this work.
Recognizing that a fixed augmentation policy may not be optimal for all
datasets, we aim to adopt data-adaptive augmentation strategies [47, 48]
to tailor augmentation policies specifically to each dataset, optimizing for
improved generalization. Secondly, we will explore alternative feature ar-
rangement strategies within the image representations, potentially by con-
sidering the similarity or correlation between features. Thirdly, to further
boost Tab2Visual’s performance on larger datasets, we plan to investigate
a wider range of deep learning backbones. Finally, we are actively working
towards applying Tab2Visual to address real-world challenges, particularly
in the domain of healthcare. We aim to utilize Tab2Visual to develop AI-
driven prostate cancer diagnostic tools using limited tabular data comprising
clinical biomarkers and symptom variables.
Acknowledgment
This work has been funded in whole or in part with Federal funds from
the National Cancer Institute, National Institutes of Health, under Task
Order No. HHSN26110071 under Contract No. HHSN261201500003l; and
also by NIH NIBIB P41EB015902 and NIH NIBIB P41EB028741. El-Melegy
is supported through the Arab Fund Fellowship Program, Kuwait.
References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural
networks, in: Advances in Neural Information Processing Systems, Vol. 25, 2012.
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image
is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929
(2020).
[4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection
with transformers, in: European Conference on Computer Vision, Springer, 2020, pp. 213–229.
[5] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, et al., Mastering the
game of Go without human knowledge, Nature 550 (7676) (2017) 354–359.
[6] L. Grinsztajn, E. Oyallon, G. Varoquaux, Why do tree-based models still outperform deep learning
on typical tabular data?, in: Thirty-sixth Conference on Neural Information Processing Systems
Datasets and Benchmarks Track, 2022.
[7] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, G. Kasneci, Deep neural networks and
tabular data: A survey, IEEE Transactions on Neural Networks and Learning Systems 35 (6) (2024)
7499–7519. doi:10.1109/TNNLS.2022.3229161.
[8] B. Van Breugel, M. Van Der Schaar, Position: Why tabular foundation models should be a research
priority, in: Proceedings of the 41st International Conference on Machine Learning, ICML’24, 2024.
[9] S. Popov, S. Morozov, A. Babenko, Neural oblivious decision ensembles for deep learning on tabular
data, in: International Conference on Learning Representations, 2020.
[10] S. Ö. Arik, T. Pfister, TabNet: Attentive interpretable tabular learning, in: Proceedings of the AAAI
Conference on Artificial Intelligence, Vol. 35, 2021, pp. 6679–6687.
[11] X. Huang, A. Khetan, M. Cvitkovic, Z. Karnin, TabTransformer: Tabular data modeling using
contextual embeddings, arXiv preprint arXiv:2012.06678 (2020).
[12] N. Hollmann, S. Müller, K. Eggensperger, F. Hutter, TabPFN: A transformer that solves small
tabular classification problems in a second, in: Proceedings of The Eleventh International Conference
on Learning Representations, 2023.
[13] N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, S. Hoo, R. Schirrmeister, F. Hut-
ter, Accurate predictions on small data with a tabular foundation model, Nature 637 (2025) 319–326.
doi:10.1038/s41586-024-08328-6.
[14] I. Shavitt, E. Segal, Regularization learning networks: deep learning for tabular datasets, in: Pro-
ceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18,
Curran Associates Inc., Red Hook, NY, USA, 2018, p. 1386–1396.
[15] A. Kadra, M. Lindauer, F. Hutter, J. Grabocka, Well-tuned simple nets excel on tabular datasets, in:
Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS
’21, Curran Associates Inc., Red Hook, NY, USA, 2021.
[16] A. Sharma, E. Vans, D. Shigemizu, K. A. Boroevich, T. Tsunoda, Deepinsight: A methodology
to transform a non-image data to an image for convolution neural network architecture, Scientific
reports 9 (1) (2019) 11399.
[17] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research
9 (86) (2008) 2579–2605.
[18] O. Bazgir, R. Zhang, S. R. Dhruba, R. Rahman, S. Ghosh, R. Pal, Representation of features as
images with neighborhood dependencies for compatibility with convolutional neural networks, Nature
Communications 11 (2020). doi:10.1038/s41467-020-18197-y.
[19] B. Sun, L. Yang, W. Zhang, M. Lin, P. Dong, C. Young, J. Dong, SuperTML: Two-dimensional
word embedding for the precognition on structured tabular data, in: 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 2973–2981. doi:10.1109/CVPRW.2019.00360.
[20] L. Buturović, D. Miljković, A novel method for classification of tabular data using convolutional
neural networks, BioRxiv (2020) 2020–05.
[21] T. Nguyen, H. Tran, Image generator for tabular data (IGTD), Pattern Recognition Letters 145
(2021) 71–78.
[22] R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning is not all you need, Information Fusion 81
(2022) 84–90. doi:10.1016/j.inffus.2021.11.011.
[23] M. Xu, S. Yoon, A. Fuentes, D. S. Park, A comprehensive survey of image augmentation techniques
for deep learning, Pattern Recognition 137 (2023) 109347. doi:10.1016/j.patcog.2023.109347.
[24] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, C. Liu, A survey on deep transfer learning, in:
V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, I. Maglogiannis (Eds.), Artificial Neural Net-
works and Machine Learning – ICANN 2018, Springer International Publishing, Cham, 2018, pp.
270–279.
[25] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority over-
sampling technique, Journal of Artificial Intelligence Research 16 (1) (2002) 321–357.
[26] S. Darabi, Y. Elor, Synthesising multi-modal minority samples for tabular data, ArXiv
abs/2105.08204 (2021).
[27] A. Mamdouh, M. T. El-Melegy, S. A. Ali, A. S. El-Baz, Prediction of the gleason group of prostate
cancer from clinical biomarkers: Machine and deep learning from tabular data, in: 2022 International
Joint Conference on Neural Networks (IJCNN), IEEE, 2022, pp. 1–8.
[28] M. T. El-Melegy, A. Mamdouh, S. A. Ali, A. S. El-Baz, Prostate cancer diagnosis from structured
clinical biomarkers with deep learning, in: 2022 International Conference on Digital Image Comput-
ing: Techniques and Applications (DICTA), IEEE, 2022, pp. 1–8.
[29] M. El-Melegy, A. Mamdouh, S. Ali, M. Badawy, M. A. El-Ghar, N. S. Alghamdi, A. El-Baz, Prostate
cancer diagnosis via visual representation of tabular data and deep transfer learning, Bioengineering
11 (7) (2024) 635.
[30] A. Paivio, K. Csapo, Picture superiority in free recall: Imagery or dual coding?, Cognitive Psychology
5 (2) (1973) 176–206. doi:10.1016/0010-0285(73)90032-7.
[31] M. Z. Mintzer, J. G. Snodgrass, The picture superiority effect: Support for the distinctiveness model,
The American Journal of Psychology 112 (1) (1999) 113–146.
[32] C. L. Grady, A. R. McIntosh, M. N. Rajah, F. I. M. Craik, Neural correlates of the episodic encoding
of pictures and words, Proceedings of the National Academy of Sciences 95 (5) (1998) 2703–2708.
doi:10.1073/pnas.95.5.2703.
[33] M. Tan, Q. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: Inter-
national conference on machine learning, PMLR, 2019, pp. 6105–6114.
[34] X. Liu, H. Peng, N. Zheng, Y. Yang, H. Hu, Y. Yuan, EfficientViT: Memory efficient vision trans-
former with cascaded group attention, in: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2023, pp. 14420–14430.
[35] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, A. A. Kalinin, Albumenta-
tions: Fast and flexible image augmentations, Information 11 (2) (2020).
[36] T. K. Ho, Random decision forests, in: Proceedings of 3rd international conference on document
analysis and recognition, Vol. 1, IEEE, 1995, pp. 278–282.
[37] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine learning 63 (2006) 3–42.
[38] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM
sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
[39] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient
gradient boosting decision tree, Advances in neural information processing systems 30 (2017).
[40] A. V. Dorogush, V. Ershov, A. Gulin, CatBoost: Gradient boosting with categorical features support,
arXiv preprint arXiv:1810.11363 (2018).
[41] M. Tan, Q. Le, EfficientNetV2: Smaller models and faster training, in: International conference on
machine learning, PMLR, 2021, pp. 10096–10106.
[42] L. Deng, H. Suo, D. Li, et al., Deepfake video detection based on EfficientNet-V2 network, Compu-
tational Intelligence and Neuroscience 2022 (2022).
[43] S. Banerjee, M. K. H. Monir, CEIMVEN: An approach of cutting edge implementation of modi-
fied versions of efficientnet (v1-v2) architecture for breast cancer detection and classification from
ultrasound images, arXiv preprint arXiv:2308.13356 (2023).
[44] R. Wightman, Pytorch image models, https://github.com/huggingface/pytorch-image-models
(2019). doi:10.5281/zenodo.4414861.
[45] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter
optimization framework, in: Proceedings of the 25th ACM SIGKDD international conference on
knowledge discovery & data mining, 2019, pp. 2623–2631.
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret-
tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,
E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12
(2011) 2825–2830.
[47] C. Hou, J. Zhang, T. Zhou, When to learn what: Model-adaptive data augmentation curriculum,
in: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 1717–1728.
doi:10.1109/ICCV51070.2023.00165.
[48] T.-H. Cheung, D.-Y. Yeung, AdaAug: Learning class- and instance-adaptive data augmentation
policies, in: International Conference on Learning Representations, 2022.