Joon Yoo’s research while affiliated with Gachon University and other places


Publications (42)



Review of malicious code detection in data mining applications: challenges, algorithms, and future direction
  • Article

January 2025

·

48 Reads

Cluster Computing

·

·

Joon Yoo

·

[...]

·

In an era where machine learning critically underpins business operations, detecting vulnerabilities introduced by malicious code has become increasingly essential. Although prior research has extensively explored malicious code within machine learning algorithms, a targeted analysis specifically designed to identify and address these threats remains necessary. This paper presents an exhaustive literature review, focusing on the key processes of insertion, recognition, decision-making, and selection of malicious codes. We aim to uncover architectural weaknesses in data mining applications that amplify system vulnerabilities. Leveraging an integrative review covering publications from 2008 to 2024, we synthesize insights from a diverse array of academic and digital sources, examining 167 pertinent articles. This rigorous approach reveals the nuanced effects of malicious code on feature selection algorithms, crucial for maintaining data integrity. Our findings indicate that malicious code can significantly disrupt various sectors, including industrial, telecommunications, and biological data mining, adversely affecting clustering, classification, and regression algorithms. However, an encouraging outcome is observed in advanced feature selection algorithms that demonstrate resilience by effectively filtering out irrelevant data inputs. The paper concludes with a strong call for the development of sophisticated detection methods, which are vital for mitigating the growing risks associated with malicious code. It stresses the importance of proactive algorithm identification and classification to preserve the efficacy of data mining. Current challenges in accurately classifying machine learning algorithms raise concerns about data privacy, security, and potential biases. 
Ongoing research is crucial for improving data interoperability and algorithm transparency, thereby strengthening the defense mechanisms of machine learning applications against the complex and evolving landscape of cyber threats.


LightweightUNet: Multimodal Deep Learning with GAN-Augmented Imaging Data for Efficient Breast Cancer Detection

January 2025

·

14 Reads

Breast cancer ranks as the second most prevalent cancer globally and is the most frequently diagnosed cancer among women; therefore, early, automated, and precise detection is essential. Most AI-based techniques for breast cancer detection are complex and have high computational costs. Hence, to overcome this challenge, we have presented the innovative LightweightUNet hybrid deep learning (DL) classifier for the accurate classification of breast cancer. The proposed model boasts a low computational cost due to its smaller number of layers in its architecture, and its adaptive nature stems from its use of depth-wise separable convolution. We have employed a multimodal approach to validate the model’s performance, using 13,000 images from two distinct modalities: mammogram imaging (MGI) and ultrasound imaging (USI). We collected the multimodal imaging datasets from seven different sources, including the benchmark datasets DDSM, MIAS, INbreast, BrEaST, BUSI, Thammasat, and HMSS. Since the datasets are from various sources, we have resized them to the uniform size of 256 × 256 pixels and normalized them using the Box-Cox transformation technique. Since the USI dataset is smaller, we have applied the StyleGAN3 model to generate 10,000 synthetic ultrasound images. In this work, we have performed two separate experiments: the first on a real dataset without augmentation and the second on a real + GAN-augmented dataset using our proposed method. During the experiments, we used a 5-fold cross-validation method, and our proposed model obtained good results on the real dataset (87.16% precision, 86.87% recall, 86.84% F1-score, and 86.87% accuracy) without adding any extra data. Similarly, the second experiment provides better performance on the real + GAN-augmented dataset (96.36% precision, 96.35% recall, 96.35% F1-score, and 96.35% accuracy). 
This multimodal approach, which utilizes LightweightUNet, improves precision by 9.20, recall by 9.48, F1-score by 9.51, and accuracy by 9.48 percentage points on the combined dataset. The proposed LightweightUNet model performs well thanks to its network design, the addition of synthetic images to the training data, and a multimodal training method. These results suggest that the model has considerable potential for use in clinical settings.
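The abstract above attributes the model's low cost partly to depth-wise separable convolution. As a rough illustration of why that choice reduces the parameter count, the sketch below compares the weights of a standard convolution with those of a depth-wise plus point-wise pair. The function names and the layer sizes (3 × 3 kernel, 64 input channels, 128 output channels) are assumptions chosen for illustration; they are not taken from the paper.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 2D convolution (bias terms ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise convolution (one kernel x kernel filter per
    input channel) followed by a 1 x 1 point-wise convolution that mixes
    channels. Bias terms are ignored."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, used only to show the size of the saving.
standard = conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
print(f"standard: {standard}, separable: {separable}, "
      f"ratio: {standard / separable:.2f}")
```

For these assumed sizes the separable layer needs roughly an eighth of the weights; the actual saving in LightweightUNet depends on its real layer configuration, which the abstract does not give.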


Transformative Advances in AI for Precise Cancer Detection: A Comprehensive Review of Non-Invasive Techniques

January 2025

·

23 Reads

Archives of Computational Methods in Engineering

Cancer continues to be a primary cause of death worldwide, highlighting the critical need for early diagnosis methods. Automated, quick, and efficient technologies are critical to this endeavor, yet considerable gaps remain in this field. A comprehensive review was undertaken to examine seven cancer types characterized by elevated prevalence and mortality: lung, prostate, brain, skin, breast, leukemia, and colorectal cancer. The study aimed to reveal gaps in the existing research and to compare traditional machine learning (TML) with deep learning (DL) methodologies, since such comparisons have been little explored. A total of 320 publications were carefully chosen for study, including 150 that focused on TML methods and 170 that addressed DL techniques for the classification of cancer. Diverse parameters were used to assess these investigations, encompassing publication year, employed databases, data sample, classifier, modalities, and evaluation metrics. Separate evaluations were conducted for each cancer type and methodology, yielding 14 unique review tables. The assessment of each cancer type using ML/DL independently relied on four standard criteria: high performance (> 99%), limited performance (< 85%), key findings, and key challenges. These assessments were accompanied by a brief descriptive outline of the features, classifiers, public databases, and evaluation metrics that were utilized in the review process. The study concluded by offering general conclusions that highlighted the overall findings and the overall challenges observed during the investigation. This thorough review seeks to improve clinical applications and guide future research initiatives in the persistent fight against cancer.



The Improved Network Intrusion Detection Techniques Using the Feature Engineering Approach with Boosting Classifiers

December 2024

·

55 Reads

·

1 Citation

In the domain of cybersecurity, threats targeting network devices are a critical concern. Because of the exponential growth of wireless devices, such as smartphones and portable devices, cyber risks are becoming increasingly frequent as new types of threats emerge. This makes the automatic and accurate detection of network-based intrusion essential. In this work, we propose a network-based intrusion detection system utilizing a comprehensive feature engineering approach combined with boosting machine-learning (ML) models. A TCP/IP-based dataset with 25,192 data samples from different protocols has been utilized in our work. To improve the dataset, we used preprocessing methods such as label encoding, correlation analysis, custom label encoding, and iterative label encoding. To improve the model’s prediction accuracy, we then used a unique feature engineering methodology that included novel feature scaling and random forest-based feature selection techniques. We used three conventional models (NB, LR, and SVC) and four boosting classifiers (CatBoostGBM, LightGBM, HistGradientBoosting, and XGBoost) for classification. Ten-fold cross-validation was employed to train each model. After an assessment using numerous metrics, XGBoost emerged as the best-performing model, with mean metric values of 99.54 ± 0.0007 for accuracy, 99.53 ± 0.0013 for precision, 99.54 ± 0.001 for recall, and 99.53 ± 0.0014 for F1-score. Additionally, we showed the ROC curve for evaluating the model, which demonstrated that all boosting classifiers obtained a perfect AUC value of one. Our suggested methodologies show effectiveness and accuracy in detecting network intrusions, setting the stage for the model to be used in real time. Our method provides a strong defensive measure against malicious intrusions into network infrastructures as cyber threats keep evolving.
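The abstract above reports results as a mean ± spread over 10-fold cross-validation. The sketch below shows one minimal way such folds and summaries can be produced using only the Python standard library. It is an illustration under stated assumptions, not the authors' pipeline: the helper names `k_fold_indices` and `summarize` are hypothetical, the split is a plain shuffled k-fold (the abstract does not say whether folds were stratified), and no classifier is trained here.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV.

    Plain (non-stratified) split; every sample lands in exactly one test fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)


# 25,192 is the dataset size stated in the abstract; k=10 matches its protocol.
fold_sizes = [len(test) for _, test in k_fold_indices(25192, k=10)]
print(fold_sizes)
```

In a real run, each `(train, test)` pair would be used to fit and score one model, and `summarize` would be applied to the ten resulting scores per metric.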



BERT-Based Deceptive Review Detection in Social Media: Introducing DeceptiveBERT

December 2024

·

50 Reads

·

3 Citations

IEEE Transactions on Computational Social Systems

In recent years, the Internet has facilitated the emergence of social media platforms as significant channels for individuals to express their thoughts and engage in instantaneous interactions. However, the reliance on online reviews has also given rise to deceptive practices, where anonymous spammers generate fake reviews to manipulate the perception of a product. Ensuring the integrity of the online review system requires identifying and mitigating fake reviews. While existing machine learning (ML)- and neural network (NN)-based sentiment analysis methods can detect deceptive reviews, they often suffer from long training times, high computational resource requirements, and memory constraints. This study aims to overcome these limitations by introducing a transformer-based “deceptive bidirectional encoder representations from transformers (DeceptiveBERT) model.” This model utilizes contextual representations to enhance the precision of deceptive review identification. Transfer learning is employed to leverage knowledge from a pre-existing BERT base-uncased word embedding model, enabling efficient feature extraction. The proposed model incorporates a combination of classification layers to categorize reviews into two distinct categories: deceptive and truthful. Additionally, the study addresses the challenge of imbalanced datasets by utilizing three separate datasets and implementing appropriate methodologies for dataset curation. The effectiveness of the DeceptiveBERT model was evaluated through experimentation. The results demonstrate its efficacy, with the model achieving accuracy rates of 75%, 84.79%, and 81.08% on the Ott, YelpNYC, and YelpZip datasets, respectively.


MeDi‐TODER: Medical Domain‐Incremental Task‐Oriented Dialogue Generator Using Experience Replay

November 2024

·

10 Reads

Expert Systems

Artificial intelligence (AI) technology has brought groundbreaking changes to the healthcare domain. Specifically, the integration of a medical dialogue system (MDS) has facilitated interactions with patients, identifying meaningful information such as symptoms and medications from their dialogue history to generate appropriate responses. However, shortcomings arise when an MDS lacks access to the patient's cumulative history or prior domain knowledge, resulting in the generation of inaccurate responses. To address this challenge, we propose a medical domain‐incremental task‐oriented dialogue generator using experience replay (MeDi‐TODER) that applies the continual learning technique to the medical task‐oriented dialogue generator. By strategically sampling and storing exemplars from previous domains and rehearsing them as it learns, the model effectively retains knowledge and can respond to novel domains. Extensive experiments demonstrated that MeDi‐TODER significantly outperforms other models that lack continual learning in both natural language generation and natural language understanding. Notably, our proposed method achieves the highest scores, surpassing the upper‐class benchmarks.
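The experience replay idea described in the abstract above (store a few exemplars from each earlier domain and rehearse them while learning a new one) can be sketched as follows. This is a generic illustration, not MeDi‐TODER itself: the class `ReplayBuffer`, the function `training_stream`, the domain names, and the use of uniform random sampling to pick exemplars are all assumptions, since the abstract does not specify the sampling strategy.

```python
import random


class ReplayBuffer:
    """Keeps a small number of exemplars per previously seen domain."""

    def __init__(self, per_domain=2, seed=0):
        self.per_domain = per_domain
        self.memory = {}  # domain name -> list of stored exemplars
        self.rng = random.Random(seed)

    def store(self, domain, examples):
        # Assumption: exemplars are picked uniformly at random.
        k = min(self.per_domain, len(examples))
        self.memory[domain] = self.rng.sample(examples, k)

    def replay(self):
        """Return all stored exemplars from every earlier domain."""
        return [ex for exemplars in self.memory.values() for ex in exemplars]


def training_stream(domains, buffer):
    """Yield (domain, batch) pairs, where each batch mixes the new domain's
    data with exemplars replayed from the domains seen before it."""
    for name, examples in domains:
        batch = list(examples) + buffer.replay()
        buffer.store(name, examples)
        yield name, batch


# Hypothetical domains and placeholder dialogue identifiers.
domains = [
    ("cardiology", ["c1", "c2", "c3"]),
    ("dermatology", ["d1", "d2", "d3"]),
]
for name, batch in training_stream(domains, ReplayBuffer(per_domain=2)):
    print(name, len(batch))
```

The design choice worth noting is that the buffer size grows with the number of domains rather than with the amount of data, which keeps rehearsal cheap while still exposing the model to earlier domains.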


Next-Generation Diagnostics: The Impact of Synthetic Data Generation on the Detection of Breast Cancer from Ultrasound Imaging

September 2024

·

45 Reads

·

4 Citations

Breast cancer is one of the most lethal and widespread diseases affecting women worldwide. As a result, it is necessary to diagnose breast cancer accurately and efficiently utilizing the most cost-effective and widely used methods. In this research, we demonstrated that synthetically created high-quality ultrasound data outperformed conventional augmentation strategies for efficiently diagnosing breast cancer using deep learning. We trained a deep-learning model using the EfficientNet-B7 architecture and a large dataset of 3186 ultrasound images acquired from multiple publicly available sources, as well as 10,000 synthetically generated images using generative adversarial networks (StyleGAN3). The model was trained using five-fold cross-validation techniques and validated using four metrics: accuracy, recall, precision, and the F1 score measure. The results showed that integrating synthetically produced data into the training set increased the classification accuracy from 88.72% to 92.01% based on the F1 score, demonstrating the power of generative models to expand and improve the quality of training datasets in medical-imaging applications. This demonstrated that training the model using a larger set of data comprising synthetic images significantly improved its performance by more than 3% over the genuine dataset with common augmentation. Various data augmentation procedures were also investigated to improve the training set’s diversity and representativeness. This research emphasizes the relevance of using modern artificial intelligence and machine-learning technologies in medical imaging by providing an effective strategy for categorizing ultrasound images, which may lead to increased diagnostic accuracy and optimal treatment options. The proposed techniques are highly promising and have strong potential for future clinical application in the diagnosis of breast cancer.
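The four metrics named in the abstract above (accuracy, recall, precision, and F1 score) follow directly from confusion-matrix counts. The sketch below shows the standard binary definitions; the function name `binary_metrics` is hypothetical, and the paper may have used multi-class or averaged variants that the abstract does not describe.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, and accuracy from confusion counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Illustrative counts only; these are not results from the paper.
print(binary_metrics(tp=8, fp=2, fn=2, tn=8))
```

Because F1 is the harmonic mean of precision and recall, it drops sharply when either one is low, which is why it is often preferred over accuracy on imbalanced medical datasets.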


GAN-SkipNet: A Solution for Data Imbalance in Cardiac Arrhythmia Detection Using Electrocardiogram Signals from a Benchmark Dataset

August 2024

·

33 Reads

·

7 Citations

Electrocardiography (ECG) plays a pivotal role in monitoring cardiac health, yet the manual analysis of ECG signals is challenging due to the complex task of identifying and categorizing various waveforms and morphologies within the data. Additionally, ECG datasets often suffer from a significant class imbalance issue, which can lead to inaccuracies in detecting minority class samples. To address these challenges and enhance the effectiveness and efficiency of cardiac arrhythmia detection from imbalanced ECG datasets, this study proposes a novel approach. This research leverages the MIT-BIH arrhythmia dataset, encompassing a total of 109,446 ECG beats distributed across five classes following the Association for the Advancement of Medical Instrumentation (AAMI) standard. Given the dataset’s inherent class imbalance, a 1D generative adversarial network (GAN) model is introduced, incorporating the Bi-LSTM model to synthetically generate the two minority signal classes, which represent a mere 0.73% fusion (F) and 2.54% supraventricular (S) of the data. The generated signals are rigorously evaluated for similarity to real ECG data using three key metrics: mean squared error (MSE), structural similarity index (SSIM), and Pearson correlation coefficient (r). In addition to addressing data imbalance, the work presents three deep learning models tailored for ECG classification: SkipCNN (a convolutional neural network with skip connections), SkipCNN+LSTM, and SkipCNN+LSTM+Attention mechanisms. To further enhance efficiency and accuracy, the test dataset is rigorously assessed using an ensemble model, which consistently outperforms the individual models. The performance evaluation employs standard metrics such as precision, recall, and F1-score, along with their average, macro average, and weighted average counterparts. 
Notably, the SkipCNN+LSTM model emerges as the most promising, achieving remarkable precision, recall, and F1-scores of 99.3%, which were further elevated to an impressive 99.60% through ensemble techniques. Consequently, with this innovative combination of data balancing techniques, the GAN-SkipNet model not only resolves the challenges posed by imbalanced data but also provides a robust and reliable solution for cardiac arrhythmia detection. This model stands poised for clinical applications, offering the potential to be deployed in hospitals for real-time cardiac arrhythmia detection, thereby benefiting patients and healthcare practitioners alike.
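Two of the three similarity metrics named in the abstract above, mean squared error and the Pearson correlation coefficient, are simple to compute for a pair of 1D signals. The sketch below uses only the standard library; the function names and the toy waveforms are assumptions for illustration and are not ECG data. SSIM is omitted because it needs windowed statistics that are better taken from a dedicated library.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    if len(a) != len(b) or not a:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length signals."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("signals must have equal length of at least 2")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy waveforms: a sine wave and a slightly shifted, scaled copy.
real = [math.sin(2 * math.pi * t / 50) for t in range(200)]
generated = [0.95 * x + 0.02 for x in real]
print(f"MSE: {mse(real, generated):.5f}, r: {pearson_r(real, generated):.5f}")
```

A low MSE together with a correlation close to 1 indicates that a generated signal tracks both the amplitude and the shape of the real one, which is the kind of check the abstract describes for the synthesized minority-class beats.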


Citations (33)


... In contrast to these offline and federated learning models, our study introduces an online learning-based ARF approach, which continuously updates its model in response to evolving network traffic patterns. This adaptability enables real-time threat detection, a crucial advantage over static offline-trained models [53][54][55][56]. Our system demonstrated superior classification performance while maintaining computational efficiency. ...

Reference:

Online Machine Learning for Intrusion Detection in Electric Vehicle Charging Systems
The Improved Network Intrusion Detection Techniques Using the Feature Engineering Approach with Boosting Classifiers

... Swin-SFTNet [53] 0.241 U-Net SE-RestNet-101 [54] 0.750 AM-MSP-cGAN [55] 0.845 TrEnD [56] 0.895 SegNest [52] 0.901 DeepLabV3+ (Our Approach) 0.789 ...

Next-Generation Diagnostics: The Impact of Synthetic Data Generation on the Detection of Breast Cancer from Ultrasound Imaging

... In the study of electrocardiogram (ECG) arrhythmia classification, the issue of data imbalance remains a significant challenge. Data imbalance hampers the model's capacity to identify minority class arrhythmias, thereby affecting overall diagnostic accuracy [10]. Table 1 provides a summary of the related work reviewed in this section. ...

GAN-SkipNet: A Solution for Data Imbalance in Cardiac Arrhythmia Detection Using Electrocardiogram Signals from a Benchmark Dataset

... The measured parameter is used as the input variable (Figure 2a). At the same time, the scaling coefficient K scal serves as the output [24,25] (Figure 2b). To construct the fuzzy rule base for fuzzy inference systems, input and output variable values are determined as fuzzy numbers. ...

Quantum-Fuzzy Expert Timeframe Predictor for Post-TAVR Monitoring

... In this study, a set of seven independent machine-learning classifiers was carefully applied to precisely preprocessed and feature-engineered network intrusion detection datasets [46,47]. The following section discusses numerous ML classifiers, emphasizing their abilities, limitations, and applicability for various intrusion detection scenarios [48]. ...

HRIDM: Hybrid Residual/Inception-Based Deeper Model for Arrhythmia Detection from Large Sets of 12-Lead ECG Recordings

... This model uses a refined algorithm tailored for precipitation data, improving its ability to classify and assess the impact of malicious code on weather forecasting accuracy. Furthermore, a secure KNN searching algorithm has been developed to guard against adversarial code insertions, effectively calculating nearest neighbor distances to pinpoint data discrepancies [91]. Tested on public and synthetic datasets, this method has proven successful in detecting and mitigating threats posed by malicious code. ...

Comparative analysis of machine learning and deep learning models for improved cancer detection: A comprehensive review of recent advancements in diagnostic techniques
  • Citing Article
  • July 2024

Expert Systems with Applications

... Recent research has focused on addressing these challenges using NLP and Large Language Models (LLMs). In [39], the authors address deceptive review detection by leveraging the power of BERT for contextual embeddings. The model is fine-tuned to recognize patterns in deceptive reviews on Twitter, achieving high classification accuracy. ...

BERT-Based Deceptive Review Detection in Social Media: Introducing DeceptiveBERT
  • Citing Article
  • December 2024

IEEE Transactions on Computational Social Systems

... RD involves extracting high-dimensional quantitative features from medical images, revealing tumor characteristics not visible to the naked eye. By contrast, DL architectures can discern complex patterns, facilitating predictions of molecular features, such as methylation status, that would otherwise require invasive testing [17][18][19][20]. ...

Two-headed UNetEfficientNets for parallel execution of segmentation and classification of brain tumors: incorporating postprocessing techniques with connected component labelling

Journal of Cancer Research and Clinical Oncology

... AI has also been fundamental in optimizing business resources, improving decisionmaking processes, and reducing overall costs [13]. In the area of financial innovations, advanced technologies like blockchain have been integrated to improve information management, regulatory compliance, and operational efficiency [14]. ...

Designing an Intelligent Scoring System for Crediting Manufacturers and Importers of Goods in Industry 4.0

... 5 Within smart greenhouses, the functionality of fog servers pose certain constraints on data preprocessing and system performance. 6 In the area of sustainable smart building management, there is a pronounced need for modularization and standardization to facilitate integration and testing in larger-scale buildings. 7 Additionally, for smart home environments, further attention must be devoted to customization, complexity of implementation, scalability or potential unforeseen problems with cross-domain applicability. ...

Digital twin framework for smart greenhouse management using next-gen mobile networks and machine learning
  • Citing Article
  • March 2024

Future Generation Computer Systems