Chapter

Multi-compound Transformer for Accurate Biomedical Image Segmentation

Authors: Yuanfeng Ji et al.

Abstract

The recent vision transformer (i.e., for image classification) learns non-local attentive interaction of different patch tokens. However, prior arts miss learning the cross-scale dependencies of different pixels, the semantic correspondence of different labels, and the consistency of the feature representations and semantic embeddings, which are critical for biomedical segmentation. In this paper, we tackle the above issues by proposing a unified transformer network, termed Multi-Compound Transformer (MCTrans), which incorporates rich feature learning and semantic structure mining into a unified framework. Specifically, MCTrans embeds the multi-scale convolutional features as a sequence of tokens and performs intra- and inter-scale self-attention, rather than the single-scale attention of previous works. In addition, a learnable proxy embedding is introduced to model semantic relationships and perform feature enhancement using self-attention and cross-attention, respectively. MCTrans can be easily plugged into a UNet-like network and attains significant improvements over state-of-the-art methods for biomedical image segmentation on six standard benchmarks. For example, MCTrans outperforms UNet by 3.64%, 3.71%, 4.34%, 2.8%, 1.88%, and 1.57% on the PanNuke, CVC-ClinicDB, CVC-ColonDB, ETIS, Kvasir, and ISIC 2018 datasets, respectively. Code is available at https://github.com/JiYuanFeng/MCTrans.
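A minimal PyTorch sketch of the core idea (not the authors' implementation): multi-scale feature maps are flattened into one token sequence so that self-attention runs both within and across scales, and learnable proxy embeddings interact with those tokens via cross-attention. Module names, dimensions, and the proxy usage are illustrative assumptions.

```python
# Illustrative sketch only, not the MCTrans code.
import torch
import torch.nn as nn

class MCTransSketch(nn.Module):
    def __init__(self, channels=(64, 128, 256), dim=256, num_classes=6, heads=8):
        super().__init__()
        # project each scale to a common token dimension
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, kernel_size=1) for c in channels])
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(enc_layer, num_layers=2)  # intra- and inter-scale
        self.proxy = nn.Parameter(torch.randn(1, num_classes, dim))       # learnable proxy embeddings
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                       # feats: list of (B, C_i, H_i, W_i)
        tokens = []
        for f, p in zip(feats, self.proj):
            tokens.append(p(f).flatten(2).transpose(1, 2))   # (B, H_i*W_i, dim)
        seq = torch.cat(tokens, dim=1)                        # one sequence over all scales
        seq = self.self_attn(seq)                             # cross-scale pixel context
        proxy = self.proxy.expand(seq.size(0), -1, -1)
        enhanced, _ = self.cross_attn(seq, proxy, proxy)      # feature enhancement from proxies
        return enhanced

feats = [torch.randn(2, 64, 32, 32), torch.randn(2, 128, 16, 16), torch.randn(2, 256, 8, 8)]
print(MCTransSketch()(feats).shape)   # (2, 32*32 + 16*16 + 8*8, 256)
```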


... Ji et al. [34] proposed MCTrans, a model embedding multi-scale convolutional features into token sequences. By leveraging a self-attention mechanism, MCTrans facilitates cross-scale pixel-level context modeling and employs learnable proxy embeddings to capture class dependencies. ...
... However, traditional image segmentation algorithms [11, 20–23] can only extract low-level features, making them inadequate for diverse datasets and complex segmentation tasks. With advancements in computational power and data availability, CNN-based [14, 24–29] and transformer-based [6, 16, 30–34] image segmentation models have gradually replaced traditional methods and are widely used in complex segmentation tasks. CNN-based segmentation models demonstrate strong feature extraction and representation capabilities, but due to the local receptive field of convolutional kernels, they are typically limited to processing local regions and struggle to capture long-range dependencies between pixels in an image. ...
Article
Full-text available
Breast cancer is one of the most prevalent cancers among women, with early detection playing a critical role in improving survival rates. This study introduces a novel transformer-based explainable model for breast cancer lesion segmentation (TEBLS), aimed at enhancing the accuracy and interpretability of breast cancer lesion segmentation in medical imaging. TEBLS integrates a multi-scale information fusion approach with a hierarchical vision transformer, capturing both local and global features by leveraging the self-attention mechanism. This model addresses the limitations of existing segmentation methods, such as the inability to effectively capture long-range dependencies and fine-grained semantic information. Additionally, TEBLS incorporates visualization techniques to provide insights into the segmentation process, enhancing the model’s interpretability for clinical use. Experiments demonstrate that TEBLS outperforms traditional and existing deep learning-based methods in segmenting complex breast cancer lesions with variations in size, shape, and texture, achieving a mean DSC of 81.86% and a mean AUC of 97.72% on the CBIS-DDSM test set. Our model not only improves segmentation accuracy but also offers a more explainable framework, which has the potential to be used in clinical settings.
... These features are upsampled and combined with features extracted at multiple scales in the encoding path using skip connections. Ji et al. (2021) proposed the Multi-Compound Transformer (MCTrans), which embeds the multi-scale convolutional features as a sequence of tokens that perform intra- and inter-scale self-attention. The MCTrans model was implemented in a UNet network architecture. ...
... Seven baseline models based on the UNet, TransUNet, and MCTrans architectures were used for comparison. Their parameters were defined as reported in Ji et al. (2021). UNet_VGG uses a modified VGG (Simonyan and Zisserman 2014) network as a backbone in the encoder path to capture hierarchical features. ...
Article
Full-text available
Segmentation of medical images is a critical step in assisting doctors in making accurate diagnoses and planning appropriate treatments. Deep learning architectures often serve as the basis for computer models used for this task. However, a common challenge faced by segmentation models is class imbalance, which leads to a bias towards classes with a larger number of pixels, resulting in reduced accuracy for the minority-class regions. To address this problem, the α-balanced variant of the focal loss function introduces an α modulation factor that reduces the weight assigned to majority classes and gives greater weight to minority classes. This study proposes the use of a fuzzy inference system to automatically adjust the α factor, rather than maintaining a fixed value as commonly implemented. The adaptive fuzzy focal loss (AFFL) achieves an appropriate adjustment in α by employing fifteen fuzzy rules. To evaluate the effectiveness of AFFL, we implement an encoder-decoder segmentation model based on the UNet and Transformer architectures (AFFL-TransUNet) using the CHAOS dataset. We compare the performance of seven segmentation models implemented using the same data partition and hardware equipment. A statistical analysis, considering the DICE coefficient metric, demonstrates that AFFL-TransUNet outperforms four baseline models and performs comparably to the remaining models. Remarkably, AFFL-TransUNet achieves this high performance while significantly reducing training processing time by 66.31–72.39%. This reduction is attributed to the fuzzy system that effectively adapts the α value of the loss function, stabilizing the model within just a few epochs.
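For reference, a hedged sketch of the standard α-balanced focal loss that AFFL adapts; the fuzzy inference controller itself is not reproduced here, and `alpha` is simply a parameter that such a controller could update between epochs.

```python
# Standard alpha-balanced focal loss for dense segmentation (sketch, not the AFFL code).
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """logits: (B, C, H, W); target: (B, H, W) integer class map."""
    log_p = F.log_softmax(logits, dim=1)                        # (B, C, H, W)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)    # log-prob of the true class
    pt = log_pt.exp()
    loss = -alpha * (1.0 - pt) ** gamma * log_pt                # down-weights easy pixels
    return loss.mean()

logits = torch.randn(2, 4, 64, 64, requires_grad=True)
target = torch.randint(0, 4, (2, 64, 64))
print(focal_loss(logits, target).item())
```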
... Liu et al. [26] introduced the TransFusion model, incorporating two new modules in a transformer model to address complex medical image segmentation tasks by constructing semantic dependencies across diverse scales and views. Furthermore, Galazis et al. [27] proposed Tempera, a feature pyramid with a geometric spatial transformer for multiple passes. They employed a spatial transformer to establish connections between various views of the heart, facilitating smooth ...
Article
Full-text available
Accurate segmentation of cardiac structures in magnetic resonance imaging (MRI) is essential for reliable diagnosis and management of cardiovascular disease. Although numerous robust models have been proposed, no single segmentation model consistently outperforms others across all cases, and models that excel on one dataset may not achieve similar accuracy on others or when the same dataset is expanded. This study introduces FCTransNet, an ensemble-based computer-aided diagnosis system that leverages the complementary strengths of Vision Transformer (ViT) models (specifically TransUNet, SwinUNet, and SegFormer) to address these challenges. To achieve this, we propose a novel pixel-level fusion technique, the Intelligent Weighted Summation Technique (IWST), which reconstructs the final segmentation mask by integrating the outputs of the ViT models and accounting for their diversity. First, a dedicated U-Net module isolates the region of interest (ROI) from cine MRI images, which is then processed by each ViT to generate preliminary segmentation masks. The IWST subsequently fuses these masks to produce a refined final segmentation. By using a local window around each pixel, IWST captures specific neighborhood details while incorporating global context to enhance segmentation accuracy. Experimental validation on the ACDC dataset shows that FCTransNet significantly outperforms individual ViTs and other deep learning-based methods, achieving a Dice Score (DSC) of 0.985 and a mean Intersection over Union (IoU) of 0.914 in the end-diastolic phase. In addition, FCTransNet maintains high accuracy in the end-systolic phase with a DSC of 0.989 and an IoU of 0.908. These results underscore FCTransNet’s ability to improve cardiac MRI segmentation accuracy.
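Purely as an illustration of the pixel-level fusion idea (not the actual IWST, whose local-window weighting and global-context terms are not reproduced), a weighted sum of several models' probability maps might look like the sketch below; the function name and the fixed scalar weights are assumptions.

```python
# Simplified weighted fusion of segmentation probability maps (sketch only).
import numpy as np

def fuse_probability_maps(prob_maps, weights):
    """prob_maps: list of (H, W) arrays in [0, 1]; weights: list of floats."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                    # normalise so the fusion stays in [0, 1]
    fused = sum(wi * p for wi, p in zip(w, prob_maps))
    return (fused > 0.5).astype(np.uint8)              # final binary mask

# e.g. outputs of three different segmentation models on the same image
maps = [np.random.rand(128, 128) for _ in range(3)]
mask = fuse_probability_maps(maps, weights=[0.4, 0.35, 0.25])
print(mask.shape, mask.dtype)
```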
... (You et al., 2022) incorporate a Transformer and a Generative Adversarial Network for medical segmentation. MCTrans (Ji et al., 2021) combines rich feature learning and semantic structure mining into a unified Multi-Compound Transformer framework by mapping multi-scale convolutional features to token sequences and performing intra- and inter-scale self-attention, rather than single-scale attention. RTNet (Huang et al., 2022) designs a transformer with two capabilities: a relation self-attention transformer captures global dependencies among lesion features, while a cross-attention transformer allows interactions between lesion and vessel features, integrating valuable vascular information to alleviate ambiguity in lesion detection caused by complex fundus structures. ...
Preprint
Vision Transformer shows great superiority in medical image segmentation due to the ability in learning long-range dependency. For medical image segmentation from 3D data, such as computed tomography (CT), existing methods can be broadly classified into 2D-based and 3D-based methods. One key limitation in 2D-based methods is that the intra-slice information is ignored, while the limitation in 3D-based methods is the high computation cost and memory consumption, resulting in a limited feature representation for inner-slice information. During the clinical examination, radiologists primarily use the axial plane and then routinely review both axial and coronal planes to form a 3D understanding of anatomy. Motivated by this fact, our key insight is to design a hybrid model which can first learn fine-grained inner-slice information and then generate a 3D understanding of anatomy by incorporating 3D information. We present a novel Hybrid Residual transFormer (HResFormer) for 3D medical image segmentation. Building upon standard 2D and 3D Transformer backbones, HResFormer involves two novel key designs: (1) a Hybrid Local-Global fusion Module (HLGM) to effectively and adaptively fuse inner-slice information from 2D Transformer and intra-slice information from 3D volumes for 3D Transformer with local fine-grained and global long-range representation. (2) a residual learning of the hybrid model, which can effectively leverage the inner-slice and intra-slice information for better 3D understanding of anatomy. Experiments show that our HResFormer outperforms prior art on widely-used medical image segmentation benchmarks. This paper sheds light on an important but neglected way to design Transformers for 3D medical image segmentation.
... The batch size is set to 16 in the Hippocampus dataset and the LiTS dataset. For the CPCGEA dataset, the batch size is set to 8. The batch size for both the COVID-19 Lung CT and MoNuSeg datasets is established at 4 [15]. The patch size P for all seven datasets is set to 16. ...
Article
Full-text available
Recent methods often introduce attention mechanisms into the skip connections of U-shaped networks to capture features. However, these methods usually overlook spatial information extraction in skip connections and exhibit inefficiency in capturing spatial and channel information. This issue prompts us to reevaluate the design of the skip-connection mechanism and propose a new deep-learning network called the Fusing Spatial and Channel Attention Network, abbreviated as FSCA-Net. FSCA-Net is a novel U-shaped network architecture that utilizes the Parallel Attention Transformer (PAT) to enhance the extraction of spatial and channel features in the skip-connection mechanism, further compensating for downsampling losses. We design the Cross-Attention Bridge Layer (CAB) to mitigate excessive feature and resolution loss when downsampling to the lowest level, ensuring meaningful information fusion during upsampling at the lowest level. Finally, we construct the Dual-Path Channel Attention (DPCA) module to guide channel and spatial information filtering for Transformer features, eliminating ambiguities with decoder features and better concatenating features with semantic inconsistencies between the Transformer and the U-Net decoder. FSCA-Net is designed explicitly for fine-grained segmentation tasks of multiple organs and regions. Our approach achieves over 48% reduction in FLOPs and over 32% reduction in parameters compared to the state-of-the-art method. Moreover, FSCA-Net outperforms existing segmentation methods on seven public datasets, demonstrating exceptional performance. The code has been made available on GitHub: https://github.com/Henry991115/FSCA-Net .
... Swin-UNet (Cao et al. 2022) changed vanilla transformers into Swin transformers, and designed a symmetric Swin transformer-based decoder and a patch extension layer to perform up-sampling operations. MCTrans (Ji et al. 2021) incorporated rich contextual dependencies and semantic relations for accurate biomedical segmentation within a unified transformer network. ...
Article
Despite the great potential in capturing long-range dependency, one rarely-explored underlying issue of transformer in medical image segmentation is attention collapse, making it often degenerate into a bypass module in CNN-Transformer hybrid architectures. This is due to the high computational complexity of vision transformers requiring extensive training data while well-annotated medical image data is relatively limited, resulting in poor convergence. In this paper, we propose a plug-n-play transformer block with dynamic token merging, named DTMFormer, to avoid building long-range dependency on redundant and duplicated tokens and thus pursue better convergence. Specifically, DTMFormer consists of an attention-guided token merging (ATM) module to adaptively cluster tokens into fewer semantic tokens based on feature and dependency similarity and a light token reconstruction module to fuse ordinary and semantic tokens. In this way, as self-attention in ATM is calculated based on fewer tokens, DTMFormer is of lower complexity and more friendly to converge. Extensive experiments on publicly-available datasets demonstrate the effectiveness of DTMFormer working as a plug-n-play module for simultaneous complexity reduction and performance improvement. We believe it will inspire future work on rethinking transformers in medical image segmentation. Code: https://github.com/iam-nacl/DTMFormer.
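A rough, illustrative sketch (not the DTMFormer code) of one way token merging of this flavor can be done: tokens are assigned to a small set of centroids by cosine similarity and each cluster is mean-pooled, so any subsequent self-attention operates on far fewer tokens. The centroid construction and cluster count here are assumptions.

```python
# Similarity-based token merging (sketch only).
import torch
import torch.nn.functional as F

def merge_tokens(tokens, num_clusters=32):
    """tokens: (B, N, D) -> merged semantic tokens: (B, num_clusters, D)."""
    B, N, D = tokens.shape
    # crude centroids: strided average pooling over the token sequence
    centroids = F.adaptive_avg_pool1d(tokens.transpose(1, 2), num_clusters).transpose(1, 2)
    sim = F.normalize(tokens, dim=-1) @ F.normalize(centroids, dim=-1).transpose(1, 2)  # (B, N, K)
    assign = sim.argmax(dim=-1)                                     # hard assignment per token
    merged = torch.zeros(B, num_clusters, D, device=tokens.device)
    counts = torch.zeros(B, num_clusters, 1, device=tokens.device)
    merged.scatter_add_(1, assign.unsqueeze(-1).expand(-1, -1, D), tokens)
    counts.scatter_add_(1, assign.unsqueeze(-1), torch.ones(B, N, 1, device=tokens.device))
    return merged / counts.clamp_min(1.0)                           # mean-pool each cluster

x = torch.randn(2, 1024, 256)
print(merge_tokens(x).shape)   # (2, 32, 256): attention over 32 tokens instead of 1024
```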
... As a result, taking into account global context information and multi-scale information is a very effective method. Yuanfeng Ji et al. [49] proposed MCTrans, which comprises a self-attention transformer module and a cross-attention transformer module. The self-attention transformer module performs pixel-level context modeling at multiple scales. ...
Article
Full-text available
Purpose Convolution operator-based neural networks have shown great success in medical image segmentation over the past decade. The U-shaped network with a codec structure is one of the most widely used models. Transformer, a technology used in natural language processing, can capture long-distance dependencies and has been applied in Vision Transformer to achieve state-of-the-art performance on image classification tasks. Recently, researchers have extended the transformer to medical image segmentation tasks with promising results. Methods This review comprises publications selected through a Web of Science search. We focused on papers published since 2018 that applied the transformer architecture to medical image segmentation. We conducted a systematic analysis of these studies and summarized the results. Results To better comprehend the benefits of convolutional neural networks and transformers, the construction of the codec and transformer modules is first explained. Second, the medical image segmentation models based on transformers are summarized. The typically used assessment metrics for medical image segmentation tasks are then listed. Finally, a large number of medical segmentation datasets are described. Conclusion Even for pure transformer models without any convolution operator, the limited sample size in medical image segmentation still restricts the growth of the transformer, although this can be relieved by pretrained models. More often than not, researchers still design models that combine transformer and convolution operators.
... Due to its ability to mine rich contextual and global information, several Transformer-based methods have also been proposed for medical image segmentation [27-29]. For example, Medical-Transformer constructed a two-branch Transformer architecture using patch-wise input for images and image-wise input to fully exploit global and local information [28]. Unfortunately, while FCNs need large-scale annotation data for training, Transformer-based methods are more dependent on data scale. ...
Article
Full-text available
Purpose Segmentation of orbital tumors in CT images is of great significance for orbital tumor diagnosis, which is one of the most prevalent diseases of the eye. However, the large variety of tumor sizes and shapes makes the segmentation task very challenging, especially when the available annotation data is limited. Methods To this end, in this paper, we propose a multi‐scale consistent self‐training network (MSCINet) for semi‐supervised orbital tumor segmentation. Specifically, we exploit the semantic‐invariance features by enforcing the consistency between the predictions of different scales of the same image to make the model more robust to size variation. Moreover, we incorporate a new self‐training strategy, which adopts iterative training with an uncertainty filtering mechanism to filter the pseudo‐labels generated by the model, to eliminate the accumulation of pseudo‐label error predictions and increase the generalization of the model. Results For evaluation, we have built two datasets, the orbital tumor binary segmentation dataset (Orbtum‐B) and the orbital multi‐organ segmentation dataset (Orbtum‐M). Experimental results show that our proposed method achieves state‐of‐the‐art performance on both datasets. In our datasets, there are a total of 55 patients containing 602 2D images. Conclusion In this paper, we develop a new semi‐supervised segmentation method for orbital tumors, which is designed for the characteristics of orbital tumors and exhibits excellent performance compared to previous semi‐supervised algorithms.
Article
Transformer-based technology has attracted widespread attention in medical image segmentation. Due to the diversity of organs, effective modelling of multi-scale information and establishing long-range dependencies between pixels are crucial for successful medical image segmentation. However, most studies rely on a fixed single-scale window for modeling, which ignores the potential impact of window size on performance. This limitation can hinder window-based models’ ability to fully explore multi-scale and long-range relationships within medical images. To address this issue, we propose a multi-scale reconfiguration self-attention (MSR-SA) module that accurately models multi-scale information and long-range dependencies in medical images. The MSR-SA module first divides the attention heads into multiple groups, each assigned an ascending dilation rate. These groups are then uniformly split into several non-overlapping local windows. Using dilated sampling, we gather the same number of keys to obtain both long-range and multi-scale information. Finally, dynamic information fusion is achieved by integrating features from the sampling points at corresponding positions across different windows. Based on the MSR-SA module, we propose a multi-scale reconfiguration U-Net (MSR-UNet) framework for medical image segmentation. Experiments on the Synapse and automated cardiac diagnosis challenge (ACDC) datasets show that MSR-UNet can achieve satisfactory segmentation results. The code is available at https://github.com/davidsmithwj/MSR-UNet (DOI: 10.5281/zenodo.13969855 ).
Article
Automated and accurate classification of pneumonia plays a crucial role in improving the performance of computer-aided diagnosis systems for chest X-ray images. Nevertheless, it is a challenging task due to the difficulty of learning the complex structure information of lung abnormality from chest X-ray images. In this paper, we propose a multi-view aggregation network with Transformer (TransMVAN) for pneumonia classification in chest X-ray images. Specifically, we propose to incorporate the knowledge from glance and focus views to enrich the feature representation of lung abnormality. Moreover, to capture the complex relationships among different lung regions, we propose a bi-directional multi-scale vision Transformer (biMSVT), with which the informative messages between different lung regions are propagated through two directions. In addition, we also propose a gated multi-view aggregation (GMVA) to adaptively select the feature information from glance and focus views for further performance enhancement of pneumonia diagnosis. Our proposed method achieves AUCs of 0.9645 and 0.9550 for pneumonia classification on two different chest X-ray image datasets. In addition, it achieves an AUC of 0.9761 for evaluating positive and negative polymerase chain reaction (PCR). Furthermore, our proposed method also attains an AUC of 0.9741 for classifying non-COVID-19 pneumonia, COVID-19 pneumonia, and normal cases. Experimental results demonstrate the effectiveness of our method over other methods used for comparison in pneumonia diagnosis from chest X-ray images.
Conference Paper
In this paper, we propose a novel convolutional neural network called MAGNet that employs multi-scale and global attention mechanisms for the task of medical image segmentation. This network is shown to effectively handle the segmentation task for an image of a given modality, provided the network is suitably trained using a training set of the same modality. Experiments are performed to train the proposed network using three different training sets of images (CT, colonoscopy, and non-mydriatic 3CCD images), each acquired from a different imaging technique, resulting in three different trained models. The three trained models are tested on the respective test sets. Each model is shown to significantly outperform the state-of-the-art networks in terms of intersection over union, dice coefficient, and accuracy.
Article
Accurate skin lesion segmentation from dermoscopic images is of great importance for skin cancer diagnosis. However, automatic segmentation of melanoma remains a challenging task because it is difficult to incorporate useful texture representations into the learning process. Texture representations are not only related to the local structural information learned by CNN, but also include the global statistical texture information of the input image. In this paper, we propose a transFormer network (SkinFormer) that efficiently extracts and fuses statistical texture representation for Skin lesion segmentation. Specifically, to quantify the statistical texture of input features, a Kurtosis-guided Statistical Counting Operator is designed. We propose Statistical Texture Fusion Transformer and Statistical Texture Enhance Transformer with the help of Kurtosis-guided Statistical Counting Operator by utilizing the transformer's global attention mechanism. The former fuses structural texture information and statistical texture information, and the latter enhances the statistical texture of multi-scale features. Extensive experiments on three publicly available skin lesion datasets validate that our SkinFormer outperforms other SOTA methods, and our method achieves 93.2% Dice score on ISIC 2018. SkinFormer can easily be extended to segment 3D images in the future. Our code is available at https://github.com/Rongtao-Xu/SkinFormer.
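As a simple point of reference for the statistical-texture idea, the sketch below computes a per-channel kurtosis statistic of a feature map; this is only one plausible ingredient and not the paper's Kurtosis-guided Statistical Counting Operator, which is more involved.

```python
# Per-channel excess kurtosis of a feature map (illustrative sketch).
import torch

def channel_kurtosis(feat, eps=1e-6):
    """feat: (B, C, H, W) -> excess kurtosis per channel, shape (B, C)."""
    x = feat.flatten(2)                               # (B, C, H*W)
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    z = (x - mean) / (var + eps).sqrt()
    return (z ** 4).mean(dim=-1) - 3.0                # heavy-tailed channels score high

feat = torch.randn(2, 64, 56, 56)
print(channel_kurtosis(feat).shape)   # (2, 64)
```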
Preprint
Full-text available
Convolutional Neural Networks (CNNs) and Transformer-based self-attention models have become standard for medical image segmentation. This paper demonstrates that convolution and self-attention, while widely used, are not the only effective methods for segmentation. Breaking with convention, we present a Convolution and self-Attention Free Mamba-based semantic Segmentation Network named CAF-MambaSegNet. Specifically, we design a Mamba-based Channel Aggregator and Spatial Aggregator, which are applied independently in each encoder-decoder stage. The Channel Aggregator extracts information across different channels, and the Spatial Aggregator learns features across different spatial locations. We also propose a Linearly Interconnected Factorized Mamba (LIFM) Block to reduce the computational complexity of a Mamba and to enhance its decision function by introducing a non-linearity between two factorized Mamba blocks. Our goal is not to outperform state-of-the-art results but to show how this innovative, convolution and self-attention-free method can inspire further research beyond well-established CNNs and Transformers, achieving linear complexity and reducing the number of parameters. Source code and pre-trained models will be publicly available.
Article
Full-text available
With Artificial Intelligence (AI) increasingly permeating various aspects of society, including healthcare, the adoption of the Transformer neural network architecture is rapidly changing many applications. Transformer is a type of deep learning architecture initially developed to solve general-purpose Natural Language Processing (NLP) tasks and has subsequently been adapted in many fields, including healthcare. In this survey paper, we provide an overview of how this architecture has been adopted to analyze various forms of healthcare data, including clinical NLP, medical imaging, structured Electronic Health Records (EHR), social media, bio-physiological signals, and biomolecular sequences. Furthermore, we also include articles that used the transformer architecture for generating surgical instructions and predicting adverse outcomes after surgery under the umbrella of critical care. Under diverse settings, these models have been used for clinical diagnosis, report generation, data reconstruction, and drug/protein synthesis. Finally, we also discuss the benefits and limitations of using transformers in healthcare and examine issues such as computational cost, model interpretability, fairness, alignment with human values, ethical implications, and environmental impact.
Article
The Transformer has been successfully used in medical image segmentation due to its excellent long-range modeling capabilities. However, patch segmentation is necessary when building a Transformer class model. This process ignores the tissue structure features within patch, resulting in the loss of shallow representation information. In this study, we propose a Heterogeneous Swin Transformer with Multi-Receptive Field (HST-MRF) model that fuses patch information from different receptive fields to solve the problem of loss of feature information caused by patch segmentation. The heterogeneous Swin Transformer (HST) is the core module, which achieves the interaction of multi-receptive field patch information through heterogeneous attention and passes it to the next stage for progressive learning, thus complementing the patch structure information. We also designed a two-stage fusion module, multimodal bilinear pooling (MBP), to assist HST in further fusing multi-receptive field information and combining low-level and high-level semantic information for accurate localization of lesion regions. In addition, we developed adaptive patch embedding (APE) and soft channel attention (SCA) modules to retain more valuable information when acquiring patch embedding and filtering channel features, respectively, thereby improving model segmentation quality. We evaluated HST-MRF on multiple datasets for polyp, skin lesion and breast ultrasound segmentation tasks. Experimental results show that our proposed method outperforms state-of-the-art models and can achieve superior performance. Furthermore, we verified the effectiveness of each module and the benefits of multi-receptive field segmentation in reducing the loss of structural information through ablation experiments and qualitative analysis.
Article
Low-contrast medical image segmentation is a challenging task that requires full use of local details and global context. However, existing convolutional neural networks (CNNs) cannot fully exploit global information due to limited receptive fields and local weight sharing. On the other hand, the transformer effectively establishes long-range dependencies but lacks desirable properties for modeling local details. This paper proposes a Transformer-embedded Boundary perception Network (TBNet) that combines the advantages of transformer and convolution for low-contrast medical image segmentation. Firstly, the transformer-embedded module uses convolution at the low-level layer to model local details and uses the Enhanced TRansformer (ETR) to capture long-range dependencies at the high-level layer. This module can extract robust features with semantic contexts to infer the possible target location and basic structure in low-contrast conditions. Secondly, we utilize the decoupled body-edge branch to promote general feature learning and perceive precise boundary locations. The ETR establishes long-range dependencies across the whole feature map range and is enhanced by introducing local information. We implement it in a parallel mode, i.e., the group of self-attention with multi-head captures the global relationship, and the group of convolution retains local details. We compare TBNet with other state-of-the-art (SOTA) methods on the cornea endothelial cell, ciliary body, and kidney segmentation tasks. The TBNet improves segmentation performance, proving its effectiveness and robustness.
Article
Existing Magnetic resonance imaging (MRI) translation models rely on Generative Adversarial Networks, primarily employing simple convolutional neural networks. Unfortunately, these networks struggle to capture global representations and contextual relationships within MRI images. While the advent of Transformers enables capturing long-range feature dependencies, they often compromise the preservation of local feature details. To address these limitations and enhance both local and global representations, we introduce a novel Dual-Branch Generative Adversarial Network (DBGAN). In this framework, the Transformer branch comprises sparse attention blocks and dense self-attention blocks, allowing for a wider receptive field while simultaneously capturing local and global information. The CNN branch, built with integrated residual convolutional layers, enhances local modeling capabilities. Additionally, we propose a fusion module that cleverly integrates features extracted from both branches. Extensive experimentation on two public datasets and one clinical dataset validates significant performance improvements with DBGAN. On Brats2018, it achieves a 10% improvement in MAE, 3.2% in PSNR, and 4.8% in SSIM for image generation tasks compared to RegGAN. Notably, the generated MRIs receive positive feedback from radiologists, underscoring the potential of our proposed method as a valuable tool in clinical settings.
Article
Full-text available
Breast tumor is a common female physiological disease, and the malignant tumor is one of the main fatal diseases of women. Accurate examination and assessment of tumor shape can facilitate subsequent treatment and improve the cure rate. With the development of deep learning, automatic detection systems are designed to assist doctors in diagnosis. However, the blurry edges, poor visual quality, and irregular shapes of breast tumors pose significant challenges to designing a highly efficient detection system. In addition, the lack of publicly available labeled data is a major obstacle in developing highly accurate and robust deep learning models for breast tumor detection. To overcome the aforementioned issues, we propose SRU-PMT+, a pseudo-label reusing Mean-Teacher architecture based on squeeze-and-excitation residual (SE-Res) attention. We utilize the proposed segmentation network, SRU-Net++, to generate pseudo-labels for unlabeled data, and guide the learning of the student model using the generated pseudo-labels and ground truth, improving the accuracy and robustness of the model. Our proposed semi-supervised method has been rigorously evaluated on the available labeled dataset, i.e., the Breast Ultrasound Images (BUSI) dataset. Results show that our proposed method outperforms current segmentation methods and has good performance. Importantly, our strategy of reusing pseudo-labels improves the performance of breast tumor segmentation.
Article
Full-text available
Accurate medical image segmentation is critical for various clinical applications, and convolutional neural networks (CNNs) have demonstrated promising results in this field. The performance of CNN models for segmenting specific organs or lesion areas from medical images heavily depends on the feature extraction ability of the backbone network. In this study, we aim to explore the deep features of the backbone network and propose a novel network that can accurately capture multi-scale image features for medical image segmentation. To achieve this goal, we built upon the widely used U-Net framework and evaluated the feature extraction performance of different backbone networks for medical images. Then, we introduced a novel backbone network called the ResX block, which utilizes rectangular and dilated convolutions to capture multi-scale features. To validate our conclusions, we conducted experiments on four benchmark datasets, including Lits2017, 3Dircadb, LIDC, and LCTSC. Our results demonstrate that the proposed ResX block outperforms mainstream feature extraction blocks in terms of accuracy and robustness. Our study confirms the importance of accurate multi-scale feature extraction for improving the performance of CNNs in medical image segmentation. Furthermore, we have verified the potential of rectangular and dilated convolutions for capturing multi-scale features in medical images. Finally, we proposed a novel backbone network, the ResX block, which can be seamlessly integrated into any CNN used for medical image segmentation. Our study provides valuable insights for developing more accurate and efficient CNN models for medical image analysis.
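To make the rectangular-plus-dilated-convolution idea concrete, here is an illustrative block, not the paper's exact ResX design: parallel 1×k, k×1, and dilated 3×3 convolutions are summed and combined with a residual path, so a single block sees several effective receptive fields. Kernel size, dilation rate, and the summation fusion are assumptions.

```python
# Multi-receptive-field convolutional block (sketch only).
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    def __init__(self, channels, k=5, dilation=2):
        super().__init__()
        self.branch_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))   # rectangular
        self.branch_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))   # rectangular
        self.branch_d = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.branch_h(x) + self.branch_v(x) + self.branch_d(x)
        return self.act(self.bn(out) + x)             # residual connection

x = torch.randn(2, 64, 64, 64)
print(MultiScaleConvBlock(64)(x).shape)   # (2, 64, 64, 64)
```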
Article
Full-text available
Colonoscopy is considered the gold standard for detection of colorectal cancer and its precursors. Existing examination methods are, however, hampered by high overall miss-rate, and many abnormalities are left undetected. Computer-Aided Diagnosis systems based on advanced machine learning algorithms are touted as a game-changer that can identify regions in the colon overlooked by the physicians during endoscopic examinations, and help detect and characterize lesions. In previous work, we have proposed the ResUNet++ architecture and demonstrated that it produces more efficient results compared with its counterparts U-Net and ResUNet. In this paper, we demonstrate that further improvements to the overall prediction performance of the ResUNet++ architecture can be achieved by using CRF and TTA. We have performed extensive evaluations and validated the improvements using six publicly available datasets: Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS-Larib Polyp DB, ASU-Mayo Clinic Colonoscopy Video Database, and CVC-VideoClinicDB. Moreover, we compare our proposed architecture and resulting model with other State-of-the-art methods. To explore the generalization capability of ResUNet++ on different publicly available polyp datasets, so that it could be used in a real-world setting, we performed an extensive cross-dataset evaluation. The experimental results show that applying CRF and TTA improves the performance on various polyp segmentation datasets both on the same dataset and cross-dataset. To check the model's performance on difficult to detect polyps, we selected, with the help of an expert gastroenterologist, 196 sessile or flat polyps that are less than ten millimeters in size. This additional data has been made available as a subset of Kvasir-SEG. Our approaches showed good results for flat or sessile and smaller polyps, which are known to be one of the major reasons for high polyp miss-rates. This is one of the significant strengths of our work and indicates that our methods should be investigated further for use in clinical practice.
Chapter
Full-text available
The detection of curvilinear structures in medical images, e.g., blood vessels or nerve fibers, is important in aiding management of many diseases. In this work, we propose a general unifying curvilinear structure segmentation network that works on different medical imaging modalities: optical coherence tomography angiography (OCT-A), color fundus image, and corneal confocal microscopy (CCM). Instead of the U-Net based convolutional neural network, we propose a novel network (CS-Net) which includes a self-attention mechanism in the encoder and decoder. Two types of attention modules are utilized - spatial attention and channel attention, to further integrate local features with their global dependencies adaptively. The proposed network has been validated on five datasets: two color fundus datasets, two corneal nerve datasets and one OCT-A dataset. Experimental results show that our method outperforms state-of-the-art methods, for example, sensitivities of corneal nerve fiber segmentation were at least 2% higher than the competitors. As a complementary output, we made manual annotations of two corneal nerve datasets which have been released for public access.
Preprint
Full-text available
Medical image segmentation is an important step in medical image analysis. With the rapid development of convolutional neural network in image processing, deep learning has been used for medical image segmentation, such as optic disc segmentation, blood vessel detection, lung segmentation, cell segmentation, etc. Previously, U-net based approaches have been proposed. However, the consecutive pooling and strided convolutional operations lead to the loss of some spatial information. In this paper, we propose a context encoder network (referred to as CE-Net) to capture more high-level information and preserve spatial information for 2D medical image segmentation. CE-Net mainly contains three major components: a feature encoder module, a context extractor and a feature decoder module. We use pretrained ResNet block as the fixed feature extractor. The context extractor module is formed by a newly proposed dense atrous convolution (DAC) block and residual multi-kernel pooling (RMP) block. We applied the proposed CE-Net to different 2D medical image segmentation tasks. Comprehensive results show that the proposed method outperforms the original U-Net method and other state-of-the-art methods for optic disc segmentation, vessel detection, lung segmentation, cell contour segmentation and retinal optical coherence tomography layer segmentation.
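In the spirit of the dense atrous convolution idea described above, a minimal multi-rate atrous context module could look like the sketch below; the dilation rates, branch count, and residual fusion are assumptions rather than CE-Net's exact DAC/RMP design.

```python
# Multi-rate atrous context module (illustrative sketch).
import torch
import torch.nn as nn

class AtrousContextBlock(nn.Module):
    def __init__(self, channels, rates=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        ctx = torch.cat([b(x) for b in self.branches], dim=1)   # contexts at several rates
        return x + self.fuse(ctx)                               # enrich features, keep resolution

x = torch.randn(1, 128, 32, 32)
print(AtrousContextBlock(128)(x).shape)   # (1, 128, 32, 32)
```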
Conference Paper
Full-text available
In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ in comparison with U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in the low-dose CT scans of chest, nuclei segmentation in the microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.
Article
Full-text available
In this article, we describe the design and implementation of a publicly accessible dermatology image analysis benchmark challenge. The goal of the challenge is to support research and development of algorithms for automated diagnosis of melanoma, a lethal form of skin cancer, from dermoscopic images. The challenge was divided into sub-challenges for each task involved in image analysis, including lesion segmentation, dermoscopic feature detection within a lesion, and classification of melanoma. Training data included 900 images. A separate test dataset of 379 images was provided to measure resultant performance of systems developed with the training data. Ground truth for both training and test sets was generated by a panel of dermoscopic experts. In total, there were 79 submissions from a group of 38 participants, making this the largest standardized and comparative study for melanoma diagnosis in dermoscopic images to date. While the official challenge duration and ranking of participants has concluded, the datasets remain available for further research and development.
Article
Full-text available
Purpose: Wireless capsule endoscopy (WCE) is commonly used for noninvasive gastrointestinal tract evaluation, including the detection of mucosal polyps. A new embeddable method for polyp detection in wireless capsule endoscopic images was developed and tested. Methods: First, possible polyps within the image were extracted using geometric shape features. Next, the candidate regions of interest were evaluated with a boosting based method using textural features. Each step was carefully chosen to accommodate hardware implementation constraints. The method's performance was evaluated on WCE datasets including 300 images with polyps and 1,200 images without polyps. Hardware implementation of the proposed approach was evaluated to quantitatively demonstrate the feasibility of such integration into the WCE itself. Results: The boosting based polyp classification demonstrated a sensitivity of 91.0 %, a specificity of 95.2 % and a false detection rate of 4.8 %. This performance is close to that reported recently in systems developed for an online analysis of video colonoscopy images. Conclusion: A new method for polyp detection in videoendoscopic WCE examinations was developed using boosting based approach. This method achieved good classification performance and can be implemented in situ with embedded hardware.
Chapter
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.
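The bipartite matching at the heart of this set-prediction loss can be sketched as below: a cost matrix between predicted queries and ground-truth objects is solved with the Hungarian algorithm so each target is matched to exactly one query. The cost terms used here (negative class probability plus an L1 box distance) are a simplified stand-in for DETR's full matching cost.

```python
# Hungarian matching between predictions and targets (simplified sketch).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    """pred_logits: (Q, C), pred_boxes: (Q, 4), tgt_labels: (T,), tgt_boxes: (T, 4)."""
    prob = pred_logits.softmax(-1)                       # (Q, C)
    cost_class = -prob[:, tgt_labels]                    # (Q, T): prefer confident correct classes
    cost_box = torch.cdist(pred_boxes, tgt_boxes, p=1)   # (Q, T): L1 distance between boxes
    cost = (cost_class + cost_box).detach().cpu().numpy()
    rows, cols = linear_sum_assignment(cost)             # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))       # (query_idx, target_idx) pairs

matches = hungarian_match(torch.randn(100, 92), torch.rand(100, 4),
                          torch.tensor([3, 17]), torch.rand(2, 4))
print(matches)
```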
Chapter
Aggregating multi-level feature representation plays a critical role in achieving robust volumetric medical image segmentation, which is important for the auxiliary diagnosis and treatment. Unlike the recent neural architecture search (NAS) methods that typically searched the optimal operators in each network layer, but missed a good strategy to search for feature aggregations, this paper proposes a novel NAS method for 3D medical image segmentation, named UXNet, which searches both the scale-wise feature aggregation strategies as well as the block-wise operators in the encoder-decoder network. UXNet has several appealing benefits. (1) It significantly improves flexibility of the classical UNet architecture, which only aggregates feature representations of encoder and decoder in equivalent resolution. (2) A continuous relaxation of UXNet is carefully designed, enabling its searching scheme performed in an efficient differentiable manner. (3) Extensive experiments demonstrate the effectiveness of UXNet compared with recent NAS methods for medical image segmentation. The architecture discovered by UXNet outperforms existing state-of-the-art models in terms of Dice on several public 3D medical image segmentation benchmarks, especially for the boundary locations and tiny tissues. The searching computational complexity of UXNet is cheap, enabling to search a network with best performance less than 1.5 days on two TitanXP GPUs.
Chapter
In this work we present an experimental setup to semi automatically obtain exhaustive nuclei labels across 19 different tissue types, and therefore construct a large pan-cancer dataset for nuclei instance segmentation and classification, with minimal sampling bias. The dataset consists of 455 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total the dataset contains 216.4K labeled nuclei, each with an instance segmentation mask. We independently pursue three separate streams to create the dataset: detection, classification, and instance segmentation by ensembling in total 34 models from already existing, public datasets, therefore showing that the learnt knowledge can be efficiently transferred to create new datasets. All three streams are either validated on existing public benchmarks or validated by expert pathologists, and finally merged and validated once again to create a large, comprehensive pan-cancer nuclei segmentation and detection dataset PanNuke.
Article
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
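A minimal sketch of a non-local block of the kind described above: the response at each position is a weighted sum of features at all positions, here in the embedded-Gaussian form with a residual connection. Channel reduction and layer choices are illustrative.

```python
# Non-local block: long-range dependencies via position-pairwise weighting (sketch).
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)
        self.phi = nn.Conv2d(channels, reduced, 1)
        self.g = nn.Conv2d(channels, reduced, 1)
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)     # (B, HW, C')
        k = self.phi(x).flatten(2)                       # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)         # (B, HW, C')
        attn = (q @ k).softmax(dim=-1)                   # weights over all positions
        y = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return x + self.out(y)                           # residual connection

x = torch.randn(2, 64, 28, 28)
print(NonLocalBlock(64)(x).shape)   # (2, 64, 28, 28)
```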
Conference Paper
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
Article
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
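To illustrate the atrous spatial pyramid pooling idea, the sketch below applies parallel atrous convolutions at several rates plus a global-pooling branch, concatenates them, and projects back; the rates follow the common (6, 12, 18) choice and the layer details are assumptions, not the paper's exact configuration.

```python
# ASPP-style module: multi-rate atrous convolutions plus image-level context (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.atrous = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.pool_proj = nn.Conv2d(in_ch, out_ch, 1)              # image-level context branch
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.pool_proj(F.adaptive_avg_pool2d(x, 1)),
                               size=(h, w), mode="bilinear", align_corners=False)
        feats = [self.conv1x1(x)] + [a(x) for a in self.atrous] + [pooled]
        return self.project(torch.cat(feats, dim=1))              # fuse all fields of view

x = torch.randn(1, 256, 33, 33)
print(ASPP(256, 256)(x).shape)   # (1, 256, 33, 33)
```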
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
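A sketch of the basic residual unit this abstract describes: two 3×3 convolutions whose output is added to the identity input, so the stack learns a residual function F(x) rather than an unreferenced mapping. The downsampling/projection variants used in deeper stages are omitted.

```python
# Basic residual block: y = ReLU(F(x) + x) (sketch of the identity-shortcut case).
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                  # learn the residual, add the identity

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)   # (1, 64, 56, 56)
```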
Article
We introduce in this paper a novel polyp localization method for colonoscopy videos. Our method is based on a model of appearance for polyps which defines polyp boundaries in terms of valley information. We propose the integration of valley information in a robust way fostering complete, concave and continuous boundaries typically associated to polyps. This integration is done by using a window of radial sectors which accumulate valley information to create WM-DOVA (Window Median Depth of Valleys Accumulation) energy maps related with the likelihood of polyp presence. We perform a double validation of our maps, which include the introduction of two new databases, including the first, up to our knowledge, fully annotated database with clinical metadata associated. First we assess that the highest value corresponds with the location of the polyp in the image. Second, we show that WM-DOVA energy maps can be comparable with saliency maps obtained from physicians' fixations obtained via an eye-tracker. Finally, we prove that our method outperforms state-of-the-art computational saliency results. Our method shows good performance, particularly for small polyps which are reported to be the main sources of polyp miss-rate, which indicates the potential applicability of our method in clinical practice. Copyright © 2015 Elsevier Ltd. All rights reserved.
TransUNet: transformers make strong encoders for medical image segmentation
  • J Chen
Towards automatic polyp detection with a polyp appearance model
  • J Bernal
  • J Sánchez
  • F Vilarino