Chapter

Multi-compound Transformer for Accurate Biomedical Image Segmentation


Abstract

The recent vision transformer (originally designed for image classification) learns non-local attentive interactions among different patch tokens. However, prior works miss learning the cross-scale dependencies of different pixels, the semantic correspondence of different labels, and the consistency of the feature representations and semantic embeddings, which are critical for biomedical segmentation. In this paper, we tackle these issues by proposing a unified transformer network, termed Multi-Compound Transformer (MCTrans), which incorporates rich feature learning and semantic structure mining into a unified framework. Specifically, MCTrans embeds the multi-scale convolutional features as a sequence of tokens and performs intra- and inter-scale self-attention, rather than the single-scale attention of previous works. In addition, a learnable proxy embedding is introduced to model semantic relationships and enhance features, using self-attention and cross-attention, respectively. MCTrans can be easily plugged into a UNet-like network, and attains significant improvements over state-of-the-art methods in biomedical image segmentation on six standard benchmarks. For example, MCTrans outperforms UNet by 3.64%, 3.71%, 4.34%, 2.8%, 1.88%, and 1.57% on the PanNuke, CVC-Clinic, CVC-Colon, ETIS, Kvasir, and ISIC2018 datasets, respectively. Code is available at https://github.com/JiYuanFeng/MCTrans.
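
To make the abstract's mechanism concrete, the sketch below (PyTorch, illustrative names and sizes, not the released MCTrans code) flattens multi-scale encoder features into one token sequence for joint intra- and inter-scale self-attention, and lets a learnable proxy embedding enhance the tokens via cross-attention.

```python
import torch
import torch.nn as nn

class MultiScaleTokenAttention(nn.Module):
    """Toy sketch of MCTrans-style joint attention over multi-scale tokens.

    Feature maps from several encoder stages are projected to a common
    channel width, flattened, and concatenated into one token sequence so
    that self-attention mixes pixels both within and across scales. A set
    of learnable "proxy" embeddings then serves as key/value for a
    cross-attention step that enhances the tokens with semantic context.
    Names and sizes are illustrative, not the published implementation."""

    def __init__(self, in_channels=(64, 128, 256), dim=128, num_proxies=8, heads=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, kernel_size=1) for c in in_channels])
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proxies = nn.Parameter(torch.randn(1, num_proxies, dim))

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from different encoder stages
        tokens, shapes = [], []
        for f, p in zip(feats, self.proj):
            f = p(f)                                      # (B, dim, H, W)
            shapes.append(f.shape[-2:])
            tokens.append(f.flatten(2).transpose(1, 2))   # (B, H*W, dim)
        seq = torch.cat(tokens, dim=1)                    # intra- + inter-scale sequence
        seq, _ = self.self_attn(seq, seq, seq)            # joint self-attention
        proxies = self.proxies.expand(seq.size(0), -1, -1)
        enhanced, _ = self.cross_attn(seq, proxies, proxies)  # proxy-based enhancement
        seq = seq + enhanced
        # split the sequence back into per-scale feature maps
        out, start = [], 0
        for (h, w) in shapes:
            chunk = seq[:, start:start + h * w]
            out.append(chunk.transpose(1, 2).reshape(-1, chunk.size(-1), h, w))
            start += h * w
        return out

if __name__ == "__main__":
    feats = [torch.randn(2, 64, 32, 32), torch.randn(2, 128, 16, 16), torch.randn(2, 256, 8, 8)]
    print([o.shape for o in MultiScaleTokenAttention()(feats)])
```
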

... Reflections datasets used in international biomedical segmentation competitions [39]. Finally, with the growing interest in attention-based transformers in recent years, networks such as UNETR [32], Swin-Unet [15], MCTrans [45], TransUNet [45], etc., which still have an architecture similar to the U-Net, have been proposed for medical image segmentation. ...
... TransFuse [29] combines Transformers and CNN in a parallel style to improve efficiency for modeling global context information. Similarly, MCTrans [30] utilizes Transformer to incorporate rich context modeling and semantic relationship mining for accurate biomedical image segmentation. Besides, MedT [31] proposes a Gated Axial-Attention model that utilizes the Transformer based gated position-sensitive axial attention mechanism for medical image segmentation. ...
... • Multi-scale context approaches: Meanwhile, several multi-scale context models are used as the major contenders, including Unet++ [8], R2U-Net [39] ResUNet [41], ResUNet++ [42], BCDU-Net [43], KiU-Net [38], and DoubleU-Net [37]. • Transformer-based approaches: Moreover, two stateof-the-art Transformer-based models, i.e., MedT [31], MCTrans [30] are also considered as important baselines. 2) Implementation Details: To make a fair comparison with the existing works, the input images from Clean-CC-CCII, JSRT, Montgomery, NIH are resized into 512 × 512 for training and test, while the images of ISIC-2018 and Bowl are resized into 256 × 256. ...
Preprint
With the development of deep encoder-decoder architectures and large-scale annotated medical datasets, great progress has been achieved in automatic medical image segmentation. Due to the stacking of convolution layers and the consecutive sampling operations, existing standard models inevitably encounter the information recession problem of feature representations, which prevents them from fully modeling global contextual feature dependencies. To overcome these challenges, this paper proposes a novel Transformer-based medical image semantic segmentation framework called TransAttUnet, in which multi-level guided attention and multi-scale skip connections are jointly designed to effectively enhance the functionality and flexibility of the traditional U-shaped architecture. Inspired by the Transformer, a novel self-aware attention (SAA) module with both Transformer Self Attention (TSA) and Global Spatial Attention (GSA) is incorporated into TransAttUnet to effectively learn the non-local interactions between encoder features. In particular, we also establish additional multi-scale skip connections between decoder blocks to aggregate upsampled features of different semantic scales. In this way, the representation ability of multi-scale context information is strengthened to generate discriminative features. Benefiting from these complementary components, the proposed TransAttUnet can effectively alleviate the loss of fine details caused by the information recession problem, improving the diagnostic sensitivity and segmentation quality of medical image analysis. Extensive experiments on multiple medical image segmentation datasets from different imaging modalities demonstrate that our method consistently outperforms the state-of-the-art baselines.
... Inspired by the Swin Transformer [9], Cao proposed a U-Net-like pure Transformer based segmentation model which uses hierarchical Swin Transformer as the encoder and a symmetric Swin Transformer with patch expanding layer as the decoder [10]. Other Transformer-based networks [11][12][13] also mark the success of Transformer in medical image segmentation and reconstruction. ...
Article
Full-text available
Four-dimensional flow magnetic resonance imaging (4D Flow MRI) enables visualization of intra-cardiac blood flow and quantification of cardiac function using time-resolved, three-directional velocity data. Segmentation of cardiac 4D flow data is a major challenge due to the extremely poor contrast between the blood pool and myocardium. The magnitude and velocity images from a 4D flow acquisition provide complementary information, but how to extract and fuse these features efficiently is unknown. Automated cardiac segmentation methods for 4D flow MRI have not been fully investigated yet. In this paper, we take the velocity and magnitude images as the inputs of two separate branches, then propose a Transformer-based cross- and self-fusion layer to explore the inter-relationship between the two modalities and model the intra-relationship within the same modality. A large in-house dataset of 104 subjects (91,182 2D images) was used to train and evaluate our model using several metrics including the Dice, Average Surface Distance (ASD), end-diastolic volume (EDV), end-systolic volume (ESV), Left Ventricle Ejection Fraction (LVEF) and Kinetic Energy (KE). Our method achieved a mean Dice of 86.52% and an ASD of 2.51 mm. Evaluation on the clinical parameters demonstrated competitive results, yielding Pearson correlation coefficients of 83.26%, 97.4%, 96.97% and 98.92% for LVEF, EDV, ESV and KE, respectively. Code is available at github.com/xsunn/4DFlowLVSeg. Keywords: LV segmentation, 4D Flow MRI, Feature fusion, Transformer
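
As a rough illustration of the cross- and self-fusion idea described above (not the paper's implementation), the following PyTorch sketch lets each modality's token stream attend to itself and then query the other stream with cross-attention; dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CrossSelfFusion(nn.Module):
    """Illustrative fusion layer for two aligned modalities (e.g. 4D-flow
    magnitude and velocity tokens): each stream first attends to itself,
    then queries the other stream with cross-attention. A sketch of the
    general idea, not the paper's exact layer."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_ba = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tok_a, tok_b):
        # tok_a, tok_b: (B, N, dim) token sequences from the two modalities
        a, _ = self.self_a(tok_a, tok_a, tok_a)   # intra-modality relations
        b, _ = self.self_b(tok_b, tok_b, tok_b)
        a2, _ = self.cross_ab(a, b, b)            # modality A queries B
        b2, _ = self.cross_ba(b, a, a)            # modality B queries A
        return a + a2, b + b2

if __name__ == "__main__":
    a, b = torch.randn(2, 256, 128), torch.randn(2, 256, 128)
    fa, fb = CrossSelfFusion()(a, b)
    print(fa.shape, fb.shape)
```
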
... To address this, Lin et al. [20] adopted a double-scale encoder to restore local information loss between different-scaled Swin Transformers. Deformable self-attention methods directly reduce computational complexity by only building dependency within a subset of patches [14], [21]. These two types of approaches ease the computation burden of transformers at the risk of losing important long-range dependency, which may result in sub-optimal feature expression. ...
Preprint
Full-text available
Vision transformers have recently set off a new wave in the field of medical image analysis due to their remarkable performance on various computer vision tasks. However, recent hybrid-/transformer-based approaches mainly focus on the benefits of transformers in capturing long-range dependency while ignoring the issues of their daunting computational complexity, high training costs, and redundant dependency. In this paper, we propose to employ adaptive pruning to transformers for medical image segmentation and propose a lightweight and effective hybrid network APFormer. To our best knowledge, this is the first work on transformer pruning for medical image analysis tasks. The key features of APFormer mainly are self-supervised self-attention (SSA) to improve the convergence of dependency establishment, Gaussian-prior relative position embedding (GRPE) to foster the learning of position information, and adaptive pruning to eliminate redundant computations and perception information. Specifically, SSA and GRPE consider the well-converged dependency distribution and the Gaussian heatmap distribution separately as the prior knowledge of self-attention and position embedding to ease the training of transformers and lay a solid foundation for the following pruning operation. Then, adaptive transformer pruning, both query-wise and dependency-wise, is performed by adjusting the gate control parameters for both complexity reduction and performance improvement. Extensive experiments on two widely-used datasets demonstrate the prominent segmentation performance of APFormer against the state-of-the-art methods with much fewer parameters and lower GFLOPs. More importantly, we prove, through ablation studies, that adaptive pruning can work as a plug-n-play module for performance improvement on other hybrid-/transformer-based methods. Code is available at https://github.com/xianlin7/APFormer.
... Recently, the Transformer-based architecture has shown excellent success (Dosovitskiy et al., 2021). A commonly adopted strategy for image segmentation is to take a hybrid CNN-Transformer-based architecture (Xie et al., 2021; Ji et al., 2021). Chen et al. (2021) proposed the TransUNet structure, which embeds a Transformer in the encoder to enhance long-distance dependencies in features for 2D image segmentation tasks. ...
Preprint
In recent years, 3D convolutional neural networks have become the dominant approach for volumetric medical image segmentation. However, compared to their 2D counterparts, 3D networks introduce substantially more training parameters and higher requirement for the GPU memory. This has become a major limiting factor for designing and training 3D networks for high-resolution volumetric images. In this work, we propose a novel memory-efficient network architecture for 3D high-resolution image segmentation. The network incorporates both global and local features via a two-stage U-net-based cascaded framework and at the first stage, a memory-efficient U-net (meU-net) is developed. The features learnt at the two stages are connected via post-concatenation, which further improves the information flow. The proposed segmentation method is evaluated on an ultra high-resolution microCT dataset with typically 250 million voxels per volume. Experiments show that it outperforms state-of-the-art 3D segmentation methods in terms of both segmentation accuracy and memory efficiency.
Article
Full-text available
Skin lesion segmentation has become an essential recent direction in machine learning for medical applications. In a deep learning segmentation network, the convolutional neural network (CNN) uses convolution to capture local information for modeling. However, it ignores the relationships between pixels and still cannot meet the precise segmentation requirements of some complex, low-contrast datasets. Transformers perform well in modeling global feature information, but their ability to extract fine-grained local feature patterns is weak. In this work, TC-Net, a dual-encoding fusion network architecture combining Transformer and CNN, is proposed to more accurately combine local and global feature information and improve the segmentation performance on skin images. The results of this work demonstrate that the combination of CNN and Transformer brings a very significant improvement in global segmentation performance and outperforms pure single-network models. The experimental results and visual analysis on three datasets quantitatively and qualitatively illustrate the robustness of TC-Net. Compared with Swin UNet, on the ISIC2018 dataset the Dice index increased by 2.46% and the JA index by about 4%. On the ISBI2017 dataset, the Dice and JA indices rose by about 4%.
Article
Existing transformer-based medical image segmentation networks have limitations in token position encoding and in image decoding: the position encoding module cannot encode position information adequately, and the serial decoder cannot utilize contextual information efficiently. In this paper, we propose APT-Net, a new CNN-transformer hybrid medical image segmentation network based on the encoder-decoder architecture. The network introduces an adaptive position encoding module that fuses position information from multiple receptive fields to provide more adequate position information for the token sequences in the transformer. In addition, the basic and guide information paths of the dual-path parallel decoder simultaneously process multiscale feature maps to efficiently utilize contextual information. We conducted extensive experiments and report a number of important metrics from multiple perspectives on seven datasets containing skin lesions, polyps, and glands. The IoU reached 0.783 and 0.851 on the ISIC2017 and Glas datasets, respectively. To the best of our knowledge, APT-Net achieves state-of-the-art performance on the Glas dataset and polyp segmentation tasks. Ablation experiments validate the effectiveness of the proposed adaptive position encoding module and the dual-path parallel decoder. Comparative experiments with state-of-the-art methods demonstrate the high accuracy and portability of APT-Net.
Article
Accurate and reliable segmentation of colorectal polyps is important for the diagnosis and treatment of colorectal cancer. Most existing polyp segmentation methods combine CNN with Transformer. Because of the single combination approach, these methods have limitations in establishing connections between local feature information and in utilizing the global contextual information captured by the Transformer, and therefore do not fully solve the problems of polyp segmentation. In this paper, we propose a Dual-Branch Multiscale Feature fusion network (DBMF) to achieve accurate polyp segmentation. DBMF uses CNN and Transformer branches in parallel to extract multi-scale local information and global contextual information, respectively, drawing on information from different regions and levels to make the network more accurate in identifying polyps and their surrounding tissues. A Feature Super Decoder (FSD) fuses the multi-level local features and global contextual information of the dual branches to fully exploit the potential of combining CNN and Transformer, improving the network's ability to parse complex scenes and its detection rate for tiny polyps. The FSD generates an initial segmentation map to guide the second parallel decoder (SPD) to refine the segmentation boundary layer by layer. The SPD consists of a multi-scale feature aggregation module (MFA), parallel polarized self-attention (PSA), and reverse attention fusion modules (RAF). MFA aggregates the multi-level local feature information extracted by the CNN branch to find consensus regions between multiple scales and improve the network's ability to identify polyp regions. PSA uses dual attention to enhance the fine-grained nature of segmented regions and reduce the redundancy and interference introduced by MFA. RAF mines boundary cues and establishes relationships between regions and boundary cues; the three RAFs guide the network to explore lost targets and boundaries in a bottom-up manner. We used the CVC-ClinicDB, Kvasir, CVC-300, CVC-ColonDB, and ETIS datasets to conduct comparison and ablation experiments between DBMF and mainstream polyp segmentation networks. The results show that DBMF outperforms current mainstream networks on all five benchmark datasets.
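
The reverse-attention step mentioned above is a known general technique; the following PyTorch sketch (illustrative channel counts, not the DBMF code) inverts a coarse prediction so the refinement convolution focuses on regions the coarse map missed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseAttentionFusion(nn.Module):
    """Minimal reverse-attention sketch: a coarse segmentation logit map is
    inverted (1 - sigmoid) so the network focuses on regions and boundaries
    the coarse prediction missed, and the residual is added back to refine
    the logits. Channel counts are illustrative."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, feat, coarse_logits):
        # feat: (B, C, H, W) decoder features; coarse_logits: (B, 1, h, w)
        coarse = F.interpolate(coarse_logits, size=feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        reverse = 1.0 - torch.sigmoid(coarse)    # highlight missed areas
        refined = self.conv(feat * reverse)      # residual correction
        return coarse + refined                  # refined logits

if __name__ == "__main__":
    out = ReverseAttentionFusion()(torch.randn(2, 64, 88, 88), torch.randn(2, 1, 44, 44))
    print(out.shape)  # (2, 1, 88, 88)
```
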
Article
Accurate and efficient pancreas segmentation is the basis for subsequent diagnosis and qualitative treatment of pancreatic cancer. Segmenting the pancreas from abdominal CT images is a challenging task because the morphology of the pancreas varies greatly among different individuals and may be affected by problems such as category imbalance and blurred boundaries. This paper proposes a two-stage Trans-Deformer network to solve these problems of pancreas segmentation. To be specific, we first use a 2D Unet for coarse segmentation to generate candidate regions of the pancreas. In the fine segmentation stage, we propose to integrate deformable convolution into the Vision Transformer (VIT) for solving the deformation problem of the pancreas. For the problem of blurred boundaries caused by low contrast in the pancreas, a multi-input module based on wavelet decomposition is proposed to make our network pay more attention to high-frequency texture information. In addition, we propose using the Scale Inter-active Fusion (SIF) module to merge local features and global features. Our method was evaluated on the public NIH dataset including 82 abdominal contrast-enhanced CT volumes and the public MSD dataset including 281 abdominal contrast-enhanced CT volumes via four-fold cross-validation. We achieved average Dice Similarity Coefficient (DSC) values of 89.89±1.82% on the NIH dataset and 91.22±1.37% on the MSD dataset, outperforming other existing state-of-the-art pancreas segmentation methods.
Chapter
Severe cage whirling is the main failure mode of space bearings. However, it is difficult to capture the whirl motion without changing the cage structure. This paper aims to capture the whirl motion of the cage precisely with high-speed imaging and semantic segmentation algorithms. An improved MultiResUNet for multimodal biomedical image segmentation is trained on 20 high-speed images of the rotating cage (5 of which form the validation set), and 1000 cage mask images during rotation are then obtained. The cage mass center of the 1000 mask images is calculated with a Center of Mass (CoM) operation for whirl analysis. To verify the effectiveness of our model, the whirl orbit and Y-displacement results are compared between MultiResUNet and the improved MultiResUNet. Additionally, our model is also compared with TEMA Motion, a commercial tracking software package. The results show that our model correctly predicts the whirl trend (the whirl frequency is consistent with the cage rotation frequency and the whirl radius is a fixed value) and captures the trajectory with high precision, with a maximum deviation of 0.0118 mm between the real and predicted cage mass centers. Keywords: Space bearing, Cage, Whirl motion, Improved MultiResUNet
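
The Center of Mass (CoM) step is straightforward to reproduce; a minimal NumPy sketch (with random masks as stand-ins for the predicted cage masks) is shown below.

```python
import numpy as np

def mask_center_of_mass(mask):
    """Centroid (row, col) of a binary cage mask, i.e. the CoM step
    described above; a simple NumPy version, not the authors' code."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        raise ValueError("empty mask")
    return ys.mean(), xs.mean()

# Example: track the centroid over a sequence of predicted masks to obtain
# the whirl orbit (random masks used here as placeholders).
masks = (np.random.rand(5, 128, 128) > 0.5).astype(np.uint8)
orbit = np.array([mask_center_of_mass(m) for m in masks])
print(orbit.shape)  # (5, 2) -> (row, col) per frame
```
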
Article
Full-text available
The Transformer model is known to rely on a self-attention mechanism to model distant dependencies, which focuses on modeling the dependencies of global elements. However, its sensitivity to the local details of foreground information is limited. Local detail features help to identify the blurred boundaries in medical images more accurately. To make up for these shortcomings of the Transformer and capture richer local information, this paper proposes an attention and MLP hybrid-encoder architecture combining the Efficient Attention Module (EAM) with a Dual-channel Shift MLP module (DS-MLP), called HEA-Net. Specifically, we effectively connect the convolution block with the Transformer through EAM to enhance the foreground and suppress invalid background information in medical images. Meanwhile, DS-MLP further enhances the foreground information via channel and spatial shift operations. Extensive experiments on public datasets confirm the excellent performance of our proposed HEA-Net. In particular, on the GlaS and MoNuSeg datasets, the Dice reached 90.56% and 80.80%, respectively, and the IoU reached 83.62% and 68.26%, respectively.
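
A hedged sketch of the channel/spatial shift idea behind a shift-MLP module is given below; the shift pattern and sizes are assumptions for illustration, not the HEA-Net DS-MLP implementation.

```python
import torch
import torch.nn as nn

class DualShiftMLP(nn.Module):
    """Rough sketch of a dual-channel shift-MLP idea: channel groups are
    spatially rolled in different directions before a pointwise MLP, so the
    MLP mixes information from shifted neighbourhoods. The shift pattern
    and sizes are illustrative assumptions."""

    def __init__(self, channels=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.GELU(),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        # x: (B, C, H, W); split channels into 4 groups and shift each one
        g = x.chunk(4, dim=1)
        shifted = torch.cat([
            torch.roll(g[0], 1, dims=2),    # shift down
            torch.roll(g[1], -1, dims=2),   # shift up
            torch.roll(g[2], 1, dims=3),    # shift right
            torch.roll(g[3], -1, dims=3),   # shift left
        ], dim=1)
        return x + self.mlp(shifted)

if __name__ == "__main__":
    print(DualShiftMLP()(torch.randn(2, 64, 32, 32)).shape)
```
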
Chapter
Combining information from multi-view images is crucial to improve the performance and robustness of automated methods for disease diagnosis. However, due to the non-alignment characteristics of multi-view images, building correlation and data fusion across views largely remain an open problem. In this study, we present TransFusion, a Transformer-based architecture to merge divergent multi-view imaging information using convolutional layers and powerful attention mechanisms. In particular, the Divergent Fusion Attention (DiFA) module is proposed for rich cross-view context modeling and semantic dependency mining, addressing the critical issue of capturing long-range correlations between unaligned data from different image views. We further propose the Multi-Scale Attention (MSA) to collect global correspondence of multi-scale feature representations. We evaluate TransFusion on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) challenge cohort. TransFusion demonstrates leading performance against the state-of-the-art methods and opens up new perspectives for multi-view imaging integration towards robust medical image segmentation. Keywords: Cardiac MRI, Segmentation, Deep learning
Chapter
The segmentation of corneal nerves in corneal confocal microscopy (CCM) is of great importance to the quantification of clinical parameters in the diagnosis of eye-related diseases and systemic diseases. Existing works mainly use convolutional neural networks to improve the segmentation accuracy, while further improvement is needed to mitigate nerve discontinuity and noise interference. In this paper, we propose a novel corneal nerve segmentation network, named NerveFormer, to resolve the above-mentioned limitations. The proposed NerveFormer includes a Deformable and External Attention Module (DEAM), which exploits the Transformer-based Deformable Attention (TDA) and External Attention (TEA) mechanisms. TDA is introduced to explore the local internal nerve features in a single CCM image, while TEA is proposed to model global external nerve features across different CCM images. Specifically, to efficiently fuse the internal and external nerve features, TDA obtains the query set required by TEA, thereby strengthening the characterization ability of TEA. Therefore, the proposed model aggregates the learned features from both single-sample and cross-sample views, allowing for better extraction of corneal nerve features across the whole dataset. Experimental results on two public CCM datasets show that our proposed method achieves state-of-the-art performance, especially in terms of segmentation continuity and noise discrimination. Keywords: Corneal nerve segmentation, Transformer, Cross-sample
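
External attention, as referenced by the TEA mechanism above, replaces sample-specific keys and values with a small learnable memory shared across the dataset; the PyTorch sketch below shows the generic form with illustrative sizes (not the NerveFormer code).

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Sketch of external attention for "cross-sample" context: queries
    attend to a small learnable memory shared across the whole dataset
    instead of the current sample's keys/values (illustrative sizes)."""

    def __init__(self, dim=128, mem=64):
        super().__init__()
        self.mk = nn.Linear(dim, mem, bias=False)   # memory keys
        self.mv = nn.Linear(mem, dim, bias=False)   # memory values

    def forward(self, queries):
        # queries: (B, N, dim), e.g. produced by a deformable-attention stage
        attn = torch.softmax(self.mk(queries), dim=1)          # (B, N, mem)
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-6)   # double normalization
        return self.mv(attn)                                   # (B, N, dim)

if __name__ == "__main__":
    print(ExternalAttention()(torch.randn(2, 196, 128)).shape)
```
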
Chapter
Over the past few years, convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant architectures in medical image segmentation. Although CNNs can efficiently capture local representations, they have difficulty establishing long-distance dependencies. Comparably, ViTs achieve impressive success owing to their powerful global context modeling capabilities, but they may not generalize well on insufficient datasets due to the lack of the inductive biases inherent to CNNs. To inherit the merits of these two different design paradigms while avoiding their respective limitations, we propose a concurrent structure termed ConTrans, which can couple detailed localization information with global contexts to the maximum extent. ConTrans consists of two parallel encoders, i.e., a Swin Transformer encoder and a CNN encoder. Specifically, the CNN encoder is progressively stacked from the novel Depthwise Attention Block (DAB), with the aim of providing the precise local features we need. Furthermore, a well-designed Spatial-Reduction-Cross-Attention (SRCA) module is embedded in the decoder to form a comprehensive fusion of these two distinct feature representations and eliminate the semantic divergence between them. This allows the network to obtain accurate semantic information and ensure semantic consistency of the up-sampled features in a hierarchical manner. Extensive experiments across four typical tasks show that ConTrans significantly outperforms state-of-the-art methods on ten famous benchmarks. Keywords: Medical image segmentation, Transformer, Convolutional Neural Network, Cross-attention
Chapter
The classification of colorectal polyps is a critical clinical examination. To improve the classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps using Narrow-Band Imaging (NBI). However, NBI is often underutilized in real clinical scenarios, since acquiring these specific images requires manually switching the light mode once polyps have been detected in White-Light (WL) images. To avoid this situation, we propose a novel method to directly achieve accurate white-light colonoscopy image classification by enforcing structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e. NBI and WL, are fed into a shared Transformer to extract hierarchical feature representations. Then a newly designed Spatial Attention Module (SAM) is adopted to calculate the similarities between the class token and patch tokens for a specific modality image. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer achieves the ability to keep both global and local representation consistency for the two modalities. Extensive experimental results illustrate that the proposed method outperforms recent studies by a clear margin, realizing multi-modal prediction with a single Transformer while greatly improving the classification accuracy when only WL images are available. Code is available at https://github.com/WeijieMax/CPC-Trans.
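
A toy version of computing a spatial attention map from class-token/patch-token similarity (the general idea behind the SAM module above, not the paper's exact formulation) could look like this:

```python
import torch

def class_patch_attention(tokens, h, w):
    """Toy spatial attention map from the similarity between the class
    token and the patch tokens of one modality. `tokens` is
    (B, 1 + h*w, dim) with the class token first; output is (B, h, w).
    An assumption-level sketch, not the published SAM module."""
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]          # (B,1,D), (B,hw,D)
    sim = torch.einsum("bod,bnd->bn", cls_tok, patches)      # dot-product similarity
    sim = sim / patches.size(-1) ** 0.5                      # scale by sqrt(dim)
    return torch.softmax(sim, dim=-1).reshape(-1, h, w)

if __name__ == "__main__":
    attn = class_patch_attention(torch.randn(2, 1 + 14 * 14, 384), 14, 14)
    print(attn.shape)  # (2, 14, 14)
```
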
Article
Transformers have dominated the field of natural language processing and have recently made an impact in the area of computer vision. In the field of medical image analysis, transformers have also been successfully used in full-stack clinical applications, including image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. This paper aims to promote awareness of the applications of transformers in medical image analysis. Specifically, we first provide an overview of the core concepts of the attention mechanism built into transformers and other basic components. Second, we review various transformer architectures tailored for medical image applications and discuss their limitations. Within this review, we investigate key challenges including the use of transformers in different learning paradigms, improving model efficiency, and coupling with other techniques. We hope this review will provide a comprehensive picture of transformers to readers with an interest in medical image analysis.
Article
Automatic segmentation of skin lesions is beneficial for improving the accuracy and efficiency of melanoma diagnosis. However, due to variation in the size and shape of the lesion areas and the low contrast between the edges of the lesion and the normal skin tissue, this task is very challenging. The traditional convolutional neural network based on codec structure lacks the capability of multi-scale context information modeling and cannot realize information interaction of skip connections at the various levels, which limits the segmentation performance. Therefore, a new codec structure of skin lesion Transformer network (SLT-Net) was proposed and applied to skin lesion segmentation in this study. Specifically, SLT-Net used CSwinUnet as the codec to model the long-distance dependence between features and used the multi-scale context Transformer (MCT) as the skip connection to realize information interaction between skip connections across levels in the channel dimension. We have performed extensive experiments to verify the effectiveness and superiority of our proposed method on three public skin lesion datasets, including the ISIC-2016, ISIC-2017, and ISIC-2018. The DSC values on the three data sets reached 90.45%, 79.87% and 82.85% respectively, higher than most of the state-of-the-art methods. The excellent performance of SLT-Net on these three datasets proved that it could improve the accuracy of skin lesion segmentation, providing a new benchmark reference for skin lesion segmentation tasks. The code is available at https://github.com/FengKaili-fkl/SLT-Net.git.
Article
Osteosarcoma is a malignant bone tumor commonly found in adolescents or children, with high incidence and poor prognosis. Magnetic resonance imaging (MRI), a common diagnostic method for osteosarcoma, produces a very large number of output images with sparse valid data that may be hard to read due to brightness and contrast problems, which in turn makes manual diagnosis of osteosarcoma MRI images difficult and increases the rate of misdiagnosis. Current image segmentation models for osteosarcoma mostly focus on convolution, whose segmentation performance is limited due to the neglect of global features. In this paper, we propose an intelligent assisted diagnosis system for osteosarcoma, which can reduce the burden on doctors diagnosing osteosarcoma in three ways. First, we construct a classification-image enhancement module consisting of ResNet18 and DeepUPE to remove redundant images and improve image clarity, which facilitates doctors' observation. Then, we experimentally compare the performance of serial, parallel, and hybrid fusion of transformer and convolution, and propose a Double U-shaped visual transformer with convolution (DUconViT) for automatic segmentation of osteosarcoma to assist doctors' diagnosis. This experiment utilizes more than 80,000 osteosarcoma MRI images from three hospitals in China. The results show that DUconViT can better segment osteosarcoma, with DSC 2.6% and 1.8% higher than Unet and Unet++, respectively. Finally, we propose the pixel point quantification method to calculate the area of osteosarcoma, which provides an additional reference for doctors' diagnosis. Code: https://github.com/lingziqiang/DUconViT.
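
Pixel-counting area estimation of the kind described above reduces to counting foreground pixels and multiplying by the physical pixel area; a minimal NumPy sketch with placeholder spacing values follows.

```python
import numpy as np

def lesion_area_mm2(mask, pixel_spacing_mm=(1.0, 1.0)):
    """Pixel-counting area estimate in the spirit of the "pixel point
    quantification" described above: count foreground pixels in the
    predicted mask and multiply by the physical area of one pixel.
    The spacing values are placeholders taken from the image metadata."""
    n_pixels = int(np.count_nonzero(mask))
    return n_pixels * pixel_spacing_mm[0] * pixel_spacing_mm[1]

mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:150, 80:140] = 1                   # 50 x 60 = 3000 foreground pixels
print(lesion_area_mm2(mask, (0.5, 0.5)))    # 3000 * 0.25 = 750.0 mm^2
```
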
Article
Transformers have demonstrated impressive expressiveness and transfer capability in computer vision fields. Dense prediction is a fundamental problem in computer vision that is more challenging to solve than general image-level prediction tasks. The inherent properties of transformers enable them to process feature representations with stable and relatively high resolution, which precisely satisfies the demands of dense prediction tasks for finer-grained and more globally coherent predictions. Furthermore, compared to convolutional networks, transformer methods require minimal inductive bias and permit long-range information interaction. These strengths have contributed to exciting advancements in dense prediction tasks that apply transformer networks. This survey aims to provide a comprehensive overview of transformer models with a specific focus on dense prediction. In this survey, we provide a well-rounded view of state-of-the-art transformer-based approaches, explicitly emphasizing pixel-level prediction tasks. We generally consider transformer variants from the network architecture perspective. We further propose a novel taxonomy to organize these models according to their constructions. Subsequently, we examine various specific optimization strategies to tackle certain bottleneck problems in dense prediction tasks. We explore the commonalities and differences among these works and provide multiple horizontal comparisons from the experimental point of view. Finally, we summarize several stubborn problems that continue to impact visual transformers and outline some possible development directions.
Article
The prediction of urban crowds is crucial not only to traffic management but also to studies on the city-level social phenomena, such as energy consumption, urban growth, city planning, and epidemic prevention. The challenges of accurately predicting crowd flow come from the non-linear spatial-temporal dependence of crowd flow data, periodic laws, such as daily and weekly periodicity, and external factors, such as weather and holidays. It is even more challenging for most existing short-term prediction models to make an accurate long-term prediction. In this paper, we propose a novel patched Transformer-based sequence-to-sequence model, called MultiSize Patched Spatial-Temporal Transformer Network (MSP-STTN), to incorporate rich and unified context modeling via a self-attention mechanism and global memory learning via a cross-attention mechanism for short- and long-term grid-based crowd flow prediction. In particular, a multisize patched spatial-temporal self-attention Transformer is designed to capture cross-space-time and cross-size contextual dependence of crowd data. The same structured cross-attention Transformer is developed to adaptively learn a global memory for long-term prediction in a responding-to-a-query style without error accumulation. In addition, a categorized space-time expectation is proposed as a unified regional encoding with temporal and external factors and is used as a base prediction for stable training. Furthermore, auxiliary tasks are introduced for promoting feature encoding and leveraging feature consistency to assist in the main prediction task. The experimental results reveal that MSP-STTN is competitive with the state of the art for one-step and multi-step short-term prediction within several hours and achieves practical long-term crowd flow prediction within one day on real-world grid-based crowd data sets TaxiBJ, BikeNYC, and CrowdDensityBJ. Our code and data are available at https://github.com/xieyulai/MSP-STTN .
Article
In this work, we consider transferring global information from a Transformer to a Convolutional Neural Network (CNN) for medical semantic segmentation tasks. Previous network models for medical semantic segmentation often suffer from difficulties in modeling global information or from oversized model parameters. Here, to design a compact network with both global and local information, we distill the global information modeling capability of the Transformer into the CNN and successfully apply it to medical semantic segmentation tasks, a process we call Global Information Distillation. In addition, the following two contributions are proposed to improve the effectiveness of distillation: i) we present an Information Transfer Module, which is based on a convolutional layer to prevent over-regularization and a Transformer layer to transfer global information; ii) for the purpose of better transferring the teacher's soft targets, a Shrinking Result-Pixel distillation method is proposed in this paper. The effectiveness of our knowledge distillation approach is demonstrated by experiments on multi-organ and cardiac segmentation tasks.
Article
Automatic medical image segmentation has made great progress owing to powerful deep representation learning. Inspired by the success of self-attention mechanism in transformer, considerable efforts are devoted to designing the robust variants of the encoder–decoder architecture with transformer. However, the patch division used in the existing transformer-based models usually ignores the pixel-level intrinsic structural features inside each patch. In this article, we propose a novel deep medical image segmentation framework called dual swin transformer U-Net (DS-TransUNet), which aims to incorporate the hierarchical swin transformer into both the encoder and the decoder of the standard U-shaped architecture. Our DS-TransUNet benefits from the self-attention computation in swin transformer and the designed dual-scale encoding, which can effectively model the non-local dependencies and multiscale contexts for enhancing the semantic segmentation quality of varying medical images. Unlike many prior transformer-based solutions, the proposed DS-TransUNet adopts a well-established dual-scale encoding mechanism that uses dual-scale encoders based on swin transformer to extract the coarse and fine-grained feature representations of different semantic scales. Meanwhile, a well-designed transformer interactive fusion (TIF) module is proposed to effectively perform multiscale information fusion through the self-attention mechanism. Furthermore, we introduce the swin transformer block into the decoder to further explore the long-range contextual information during the up-sampling process. Extensive experiments across four typical tasks for medical image segmentation demonstrate the effectiveness of DS-TransUNet, and our approach significantly outperforms the state-of-the-art methods.
Article
Accurate skin lesion segmentation in dermoscopic images is crucial to the early diagnosis of skin cancers. However, it remains a challenging task due to fuzzy lesion boundaries, irregular lesion shapes, and the existence of various interference factors. In this paper, a novel Attention Synergy Network (AS-Net) is developed to enhance the discriminative ability for skin lesion segmentation by combining both spatial and channel attention mechanisms. The spatial attention path captures lesion-related features in the spatial dimension while the channel attention path selectively emphasizes discriminative features in the channel dimension. The synergy module is designed to optimally integrate both spatial and channel information, and a weighted binary cross-entropy loss function is introduced to emphasize the foreground lesion region. Comprehensive experiments indicate that our proposed model achieves the state-of-the-art performance with the highest overall score in the ISIC2017 challenge, and outperforms several popular deep neural networks on both ISIC2018 and PH2 datasets.
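
The weighted binary cross-entropy used to emphasize the foreground lesion region is easy to express directly; below is a generic PyTorch version with an illustrative foreground weight (not necessarily the value used in AS-Net).

```python
import torch
import torch.nn.functional as F

def weighted_bce(logits, target, fg_weight=3.0):
    """Weighted binary cross-entropy that up-weights the (typically small)
    foreground lesion region; the weight value is illustrative."""
    weight = torch.where(target > 0.5,
                         torch.full_like(target, fg_weight),
                         torch.ones_like(target))
    return F.binary_cross_entropy_with_logits(logits, target, weight=weight)

if __name__ == "__main__":
    logits = torch.randn(2, 1, 64, 64)
    target = (torch.rand(2, 1, 64, 64) > 0.9).float()   # sparse foreground
    print(weighted_bce(logits, target).item())
```
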
Article
Full-text available
Colonoscopy is considered the gold standard for detection of colorectal cancer and its precursors. Existing examination methods are, however, hampered by high overall miss-rate, and many abnormalities are left undetected. Computer-Aided Diagnosis systems based on advanced machine learning algorithms are touted as a game-changer that can identify regions in the colon overlooked by the physicians during endoscopic examinations, and help detect and characterize lesions. In previous work, we have proposed the ResUNet++ architecture and demonstrated that it produces more efficient results compared with its counterparts U-Net and ResUNet. In this paper, we demonstrate that further improvements to the overall prediction performance of the ResUNet++ architecture can be achieved by using CRF and TTA. We have performed extensive evaluations and validated the improvements using six publicly available datasets: Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS-Larib Polyp DB, ASU-Mayo Clinic Colonoscopy Video Database, and CVC-VideoClinicDB. Moreover, we compare our proposed architecture and resulting model with other State-of-the-art methods. To explore the generalization capability of ResUNet++ on different publicly available polyp datasets, so that it could be used in a real-world setting, we performed an extensive cross-dataset evaluation. The experimental results show that applying CRF and TTA improves the performance on various polyp segmentation datasets both on the same dataset and cross-dataset. To check the model's performance on difficult to detect polyps, we selected, with the help of an expert gastroenterologist, 196 sessile or flat polyps that are less than ten millimeters in size. This additional data has been made available as a subset of Kvasir-SEG. Our approaches showed good results for flat or sessile and smaller polyps, which are known to be one of the major reasons for high polyp miss-rates. This is one of the significant strengths of our work and indicates that our methods should be investigated further for use in clinical practice.
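
Test-time augmentation (TTA) for segmentation typically means averaging predictions over simple augmentations such as flips; the sketch below shows the generic flip-averaging form, not the exact ResUNet++ pipeline.

```python
import torch

@torch.no_grad()
def tta_predict(model, image):
    """Simple flip-based test-time augmentation for a segmentation model:
    average the probabilities predicted for the original image and its
    horizontal/vertical flips. A generic sketch of the idea."""
    flips = [
        (lambda x: x, lambda y: y),                                        # identity
        (lambda x: torch.flip(x, dims=[-1]), lambda y: torch.flip(y, dims=[-1])),  # h-flip
        (lambda x: torch.flip(x, dims=[-2]), lambda y: torch.flip(y, dims=[-2])),  # v-flip
    ]
    probs = [undo(torch.sigmoid(model(fwd(image)))) for fwd, undo in flips]
    return torch.stack(probs).mean(dim=0)

if __name__ == "__main__":
    model = torch.nn.Conv2d(3, 1, 3, padding=1)    # stand-in "segmenter"
    print(tta_predict(model, torch.randn(1, 3, 128, 128)).shape)
```
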
Chapter
Full-text available
The detection of curvilinear structures in medical images, e.g., blood vessels or nerve fibers, is important in aiding management of many diseases. In this work, we propose a general unifying curvilinear structure segmentation network that works on different medical imaging modalities: optical coherence tomography angiography (OCT-A), color fundus image, and corneal confocal microscopy (CCM). Instead of the U-Net based convolutional neural network, we propose a novel network (CS-Net) which includes a self-attention mechanism in the encoder and decoder. Two types of attention modules are utilized - spatial attention and channel attention, to further integrate local features with their global dependencies adaptively. The proposed network has been validated on five datasets: two color fundus datasets, two corneal nerve datasets and one OCT-A dataset. Experimental results show that our method outperforms state-of-the-art methods, for example, sensitivities of corneal nerve fiber segmentation were at least 2% higher than the competitors. As a complementary output, we made manual annotations of two corneal nerve datasets which have been released for public access.
Preprint
Full-text available
Medical image segmentation is an important step in medical image analysis. With the rapid development of convolutional neural network in image processing, deep learning has been used for medical image segmentation, such as optic disc segmentation, blood vessel detection, lung segmentation, cell segmentation, etc. Previously, U-net based approaches have been proposed. However, the consecutive pooling and strided convolutional operations lead to the loss of some spatial information. In this paper, we propose a context encoder network (referred to as CE-Net) to capture more high-level information and preserve spatial information for 2D medical image segmentation. CE-Net mainly contains three major components: a feature encoder module, a context extractor and a feature decoder module. We use pretrained ResNet block as the fixed feature extractor. The context extractor module is formed by a newly proposed dense atrous convolution (DAC) block and residual multi-kernel pooling (RMP) block. We applied the proposed CE-Net to different 2D medical image segmentation tasks. Comprehensive results show that the proposed method outperforms the original U-Net method and other state-of-the-art methods for optic disc segmentation, vessel detection, lung segmentation, cell contour segmentation and retinal optical coherence tomography layer segmentation.
Conference Paper
Full-text available
In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ in comparison with U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in the low-dose CT scans of chest, nuclei segmentation in the microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.
Article
Full-text available
In this article, we describe the design and implementation of a publicly accessible dermatology image analysis benchmark challenge. The goal of the challenge is to support research and development of algorithms for automated diagnosis of melanoma, a lethal form of skin cancer, from dermoscopic images. The challenge was divided into sub-challenges for each task involved in image analysis, including lesion segmentation, dermoscopic feature detection within a lesion, and classification of melanoma. Training data included 900 images. A separate test dataset of 379 images was provided to measure resultant performance of systems developed with the training data. Ground truth for both training and test sets was generated by a panel of dermoscopic experts. In total, there were 79 submissions from a group of 38 participants, making this the largest standardized and comparative study for melanoma diagnosis in dermoscopic images to date. While the official challenge duration and ranking of participants has concluded, the datasets remain available for further research and development.
Article
Full-text available
Purpose: Wireless capsule endoscopy (WCE) is commonly used for noninvasive gastrointestinal tract evaluation, including the detection of mucosal polyps. A new embeddable method for polyp detection in wireless capsule endoscopic images was developed and tested. Methods: First, possible polyps within the image were extracted using geometric shape features. Next, the candidate regions of interest were evaluated with a boosting based method using textural features. Each step was carefully chosen to accommodate hardware implementation constraints. The method's performance was evaluated on WCE datasets including 300 images with polyps and 1,200 images without polyps. Hardware implementation of the proposed approach was evaluated to quantitatively demonstrate the feasibility of such integration into the WCE itself. Results: The boosting based polyp classification demonstrated a sensitivity of 91.0 %, a specificity of 95.2 % and a false detection rate of 4.8 %. This performance is close to that reported recently in systems developed for an online analysis of video colonoscopy images. Conclusion: A new method for polyp detection in videoendoscopic WCE examinations was developed using boosting based approach. This method achieved good classification performance and can be implemented in situ with embedded hardware.
Chapter
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.
Chapter
Aggregating multi-level feature representation plays a critical role in achieving robust volumetric medical image segmentation, which is important for the auxiliary diagnosis and treatment. Unlike the recent neural architecture search (NAS) methods that typically searched the optimal operators in each network layer, but missed a good strategy to search for feature aggregations, this paper proposes a novel NAS method for 3D medical image segmentation, named UXNet, which searches both the scale-wise feature aggregation strategies as well as the block-wise operators in the encoder-decoder network. UXNet has several appealing benefits. (1) It significantly improves flexibility of the classical UNet architecture, which only aggregates feature representations of encoder and decoder in equivalent resolution. (2) A continuous relaxation of UXNet is carefully designed, enabling its searching scheme performed in an efficient differentiable manner. (3) Extensive experiments demonstrate the effectiveness of UXNet compared with recent NAS methods for medical image segmentation. The architecture discovered by UXNet outperforms existing state-of-the-art models in terms of Dice on several public 3D medical image segmentation benchmarks, especially for the boundary locations and tiny tissues. The searching computational complexity of UXNet is cheap, enabling to search a network with best performance less than 1.5 days on two TitanXP GPUs.
Chapter
In this work we present an experimental setup to semi automatically obtain exhaustive nuclei labels across 19 different tissue types, and therefore construct a large pan-cancer dataset for nuclei instance segmentation and classification, with minimal sampling bias. The dataset consists of 455 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total the dataset contains 216.4K labeled nuclei, each with an instance segmentation mask. We independently pursue three separate streams to create the dataset: detection, classification, and instance segmentation by ensembling in total 34 models from already existing, public datasets, therefore showing that the learnt knowledge can be efficiently transferred to create new datasets. All three streams are either validated on existing public benchmarks or validated by expert pathologists, and finally merged and validated once again to create a large, comprehensive pan-cancer nuclei segmentation and detection dataset PanNuke.
Article
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
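
The non-local operation described above is commonly implemented as an embedded-Gaussian attention block with a residual connection; a compact PyTorch version follows (simplified relative to the paper).

```python
import torch
import torch.nn as nn

class NonLocalBlock2D(nn.Module):
    """Compact embedded-Gaussian non-local block: the response at each
    position is a weighted sum of features at all positions (weights from
    pairwise dot-product similarity), added back as a residual."""

    def __init__(self, channels=64):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise similarities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

if __name__ == "__main__":
    print(NonLocalBlock2D()(torch.randn(2, 64, 28, 28)).shape)
```
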
Conference Paper
There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
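
A teaching-sized sketch of the U-Net idea, a contracting path, an expanding path, and a concatenating skip connection, is shown below; it is far smaller than the original architecture.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net: a contracting path for context, an expanding path
    for localization, and a skip connection that concatenates encoder
    features into the decoder. A teaching-sized sketch only."""

    def __init__(self, in_ch=1, classes=2):
        super().__init__()
        self.enc1, self.enc2 = conv_block(in_ch, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)        # 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                     # high-resolution features
        e2 = self.enc2(self.pool(e1))         # context at half resolution
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))   # skip connection
        return self.head(d1)

if __name__ == "__main__":
    print(TinyUNet()(torch.randn(1, 1, 128, 128)).shape)  # (1, 2, 128, 128)
```
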
Article
In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-the-art on the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
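
Atrous spatial pyramid pooling can be sketched as parallel dilated convolutions over the same feature map; the PyTorch example below uses illustrative rates and widths.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal atrous spatial pyramid pooling: parallel 3x3 convolutions
    with different dilation rates probe the same feature map at several
    effective fields of view, and their outputs are fused with a 1x1
    convolution. Rates and widths are illustrative."""

    def __init__(self, channels=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates
        ])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

if __name__ == "__main__":
    print(ASPP()(torch.randn(1, 256, 32, 32)).shape)
```
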
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
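
The core residual-learning idea, learning F(x) and outputting F(x) + x so the identity mapping is the easy default, can be shown in a few lines; the block below is a simplified basic block, not the full ResNet.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Simplified residual block: the body learns a residual function F(x)
    and the forward pass returns relu(F(x) + x), so deeper stacks can fall
    back to identity mappings. Downsampling variants are omitted."""

    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # residual shortcut

if __name__ == "__main__":
    print(BasicResidualBlock()(torch.randn(2, 64, 56, 56)).shape)
```
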
Article
We introduce in this paper a novel polyp localization method for colonoscopy videos. Our method is based on a model of appearance for polyps which defines polyp boundaries in terms of valley information. We propose the integration of valley information in a robust way, fostering the complete, concave and continuous boundaries typically associated with polyps. This integration is done by using a window of radial sectors which accumulate valley information to create WM-DOVA (Window Median Depth of Valleys Accumulation) energy maps related to the likelihood of polyp presence. We perform a double validation of our maps, which includes the introduction of two new databases, among them, to our knowledge, the first fully annotated database with associated clinical metadata. First, we assess that the highest value corresponds to the location of the polyp in the image. Second, we show that WM-DOVA energy maps are comparable with saliency maps obtained from physicians' fixations recorded via an eye-tracker. Finally, we prove that our method outperforms state-of-the-art computational saliency results. Our method shows good performance, particularly for small polyps, which are reported to be the main source of polyp miss-rates, indicating the potential applicability of our method in clinical practice.
TransUNet: Transformers make strong encoders for medical image segmentation
  • J. Chen
Towards automatic polyp detection with a polyp appearance model
  • J. Bernal
  • J. Sánchez
  • F. Vilarino