Conference Paper

U-Net: Convolutional Networks for Biomedical Image Segmentation

Authors: Olaf Ronneberger, Philipp Fischer, Thomas Brox
... The network architecture comprises three key components: (1) an image encoder, (2) a multi-task decoder, and (3) TCL blocks (see Fig. 6). Given its efficiency and suitability for small datasets (Hu et al., 2021; Wang et al., 2023), as well as its proven effectiveness in crop yield prediction in a previous study, the U-Net (Ronneberger et al., 2015) is employed as the encoder-decoder framework of the proposed MT-CYP-Net. This network utilizes a shared encoder and simultaneously generates both crop yield predictions and classification results via a regression decoder and a segmentation decoder. ...
... Both decoders follow the architecture of the Unet decoder (Ronneberger et al., 2015), which progressively restores the spatial resolution of the feature map downsampled by the image encoder, ultimately generating an output matching the input image size. Each decoder consists of upsampling and convolutional layers and performs feature fusion with corresponding encoder layers through skip connections. ...
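To make the shared-encoder, dual-decoder pattern described above concrete, here is a minimal PyTorch sketch; it is an illustration only, not the authors' MT-CYP-Net code, and all layer sizes, channel counts, and names are assumptions.

```python
# Illustrative sketch: a shared U-Net-style encoder feeding two task-specific
# decoders, one regressing a per-pixel yield map and one predicting crop-type logits.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TwoHeadUNet(nn.Module):
    def __init__(self, in_ch=4, n_classes=4):  # channel counts are assumptions
        super().__init__()
        self.enc1, self.enc2 = conv_block(in_ch, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # one decoder per task; both fuse encoder features via a skip connection
        self.dec_reg, self.dec_seg = conv_block(64 + 32, 32), conv_block(64 + 32, 32)
        self.head_reg = nn.Conv2d(32, 1, 1)          # continuous yield map
        self.head_seg = nn.Conv2d(32, n_classes, 1)  # crop-type logits

    def forward(self, x):                 # assumes even H and W
        s1 = self.enc1(x)                 # shared encoder, full resolution
        s2 = self.enc2(self.pool(s1))     # shared encoder, half resolution
        feat = torch.cat([self.up(s2), s1], dim=1)   # skip connection
        return self.head_reg(self.dec_reg(feat)), self.head_seg(self.dec_seg(feat))
```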
... To comprehensively evaluate the performance of our proposed MT-CYP-Net, we implement three classical machine learning models (Random Forest (Breiman, 2001), XGBoost (Chen and Guestrin, 2016), and LightGBM (Ke et al., 2017)) and two advanced deep learning models (FPN-DenseNet161 (Baghdasaryan et al., 2022) and Unet (Ronneberger et al., 2015)), all of which are commonly used in crop yield prediction tasks. The machine learning models can predict the continuous output of crop yield based on input features such as satellite imagery reflectance values and vegetation indices. ...
Preprint
Accurate and fine-grained crop yield prediction plays a crucial role in advancing global agriculture. However, the accuracy of pixel-level yield estimation based on satellite remote sensing data has been constrained by the scarcity of ground truth data. To address this challenge, we propose a novel approach called the Multi-Task Crop Yield Prediction Network (MT-CYP-Net). This framework introduces an effective multi-task feature-sharing strategy, where features extracted from a shared backbone network are simultaneously utilized by both crop yield prediction decoders and crop classification decoders with the ability to fuse information between them. This design allows MT-CYP-Net to be trained with extremely sparse crop yield point labels and crop type labels, while still generating detailed pixel-level crop yield maps. Concretely, we collected 1,859 yield point labels along with corresponding crop type labels and satellite images from eight farms in Heilongjiang Province, China, in 2023, covering soybean, maize, and rice crops, and constructed a sparse crop yield label dataset. MT-CYP-Net is compared with three classical machine learning and deep learning benchmark methods in this dataset. Experimental results not only indicate the superiority of MT-CYP-Net compared to previous methods on multiple types of crops but also demonstrate the potential of deep networks on precise pixel-level crop yield prediction, especially with limited data labels.
... Image segmentation is crucial in the analysis of medical images, typically serving as the preliminary step for examining anatomical structures and surgical planning [1]-[4]. In recent years, Convolutional Neural Networks (CNNs) [5] and, in particular, U-shaped Fully Convolutional Neural Networks (FCNNs) have garnered widespread acceptance within the research community [6]. Their success can be attributed to their local receptive field, which allows them to capture substantial contextual information while maintaining relatively low GPU memory consumption. ...
... Convolutional Neural Networks (CNNs) [5] have been the dominant solution for both 2D and 3D medical image segmentation for years. Among these, U-Net [6], characterized by its U-shaped symmetric encoder-decoder structure with skip connections, represents an effective architecture that subsequent models have continued to adopt until the present day. Following U-Net, several variants have been introduced, including Res-U-Net [27], Dense-U-Net [28], V-Net [29], 3D U-Net and its state-of-the-art ecosystem nnU-Net [30], each proposing enhancements to the original framework. ...
... SegMambaSkip. One of the universally recognized strengths of the U-Net architecture lies in its skip connections [6], which allow the decoding part of the network to access fine-grained details coming from the encoder. Indeed, as the network compresses the image to extract high-level features, it loses some fine-grained details. ...
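As a point of reference for the skip-connection mechanism discussed in these excerpts, here is a minimal PyTorch sketch of a single U-Net decoder stage that upsamples, concatenates the matching encoder feature map, and convolves; it is an illustration only, not code from any cited work.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One U-Net decoder stage: upsample, concatenate the skip feature, convolve."""
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(c_out + c_skip, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # restore spatial resolution
        x = torch.cat([x, skip], dim=1)  # reinject fine-grained encoder details
        return self.conv(x)

# Example shapes: UpBlock(256, 128, 128)(torch.randn(1, 256, 16, 16),
#                                        torch.randn(1, 128, 32, 32)) -> (1, 128, 32, 32)
```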
Article
Recently, the field of 3D medical segmentation has been dominated by deep learning models employing Convolutional Neural Networks (CNNs) and Transformer-based architectures, each with its distinctive strengths and limitations. CNNs are constrained by a local receptive field, whereas Transformers are hindered by their substantial memory requirements as well as their data hunger, making them not ideal for processing 3D medical volumes at a fine-grained level. For these reasons, fully convolutional neural networks, such as nnU-Net, still dominate the scene when segmenting medical structures in large 3D medical volumes. Despite numerous advancements toward developing transformer variants with subquadratic time and memory complexity, these models still fall short in content-based reasoning. A recent breakthrough is Mamba, a Recurrent Neural Network (RNN) based on State Space Models (SSMs), outperforming Transformers in many long-context tasks (million-length sequences) on famous natural language processing and genomic benchmarks while keeping a linear complexity. In this paper, we evaluate the effectiveness of Mamba-based architectures in comparison to state-of-the-art convolutional and Transformer-based models for 3D medical image segmentation across three well-established datasets: Synapse Abdomen, MSD BrainTumor, and ACDC. Additionally, we address the primary limitations of existing Mamba-based architectures by proposing alternative architectural designs, hence improving segmentation performances. The source code is publicly available to ensure reproducibility and facilitate further research: https://github.com/LucaLumetti/TamingMambas.
... One pattern is to feed various maps together into tandem hourglass networks (Cheng et al., 2018; Chen et al., 2019; Cheng et al., 2020; Park et al., 2020; Xu et al., 2020; Yan et al., 2023a). For example, S2D (Ma et al., 2019) directly feeds the concatenation into a simple UNet (Ronneberger et al., 2015). CSPN (Cheng et al., 2018) studies the affinity matrix to refine coarse depth with a spatial propagation network (SPN). ...
... Compared with DRHN_1, DRHN_i (i > 1) has the same or a more lightweight architecture, which is used to extract clearer image guidance semantics (Zeiler & Fergus, 2014). In addition, we adopt the skip connection strategy (Ronneberger et al., 2015; Chen et al., 2019) to utilize low-level and high-level features simultaneously in DRHN_i. Between the multiple DRHN subnetworks, a dense-connection strategy (see Fig. 3) is applied. ...
Article
Depth completion aims to recover dense depth maps from sparse ones, where color images are often used to facilitate this task. Recent depth methods primarily focus on image guided learning frameworks. However, blurry guidance in the image and unclear structure in the depth still impede their performance. To tackle these challenges, we explore a repetitive design in our image guided network to gradually and sufficiently recover depth values. Specifically, the repetition is embodied in both the image guidance branch and depth generation branch. In the former branch, we design a dense repetitive hourglass network (DRHN) to extract discriminative image features of complex environments, which can provide powerful contextual instruction for depth prediction. In the latter branch, we present a repetitive guidance (RG) module based on dynamic convolution, in which an efficient convolution factorization is proposed to reduce the complexity while modeling high-frequency structures progressively. Furthermore, in the semantic guidance branch, we utilize the well-known large vision model, i.e., segment anything (SAM), to supply RG with semantic prior. In addition, we propose a region-aware spatial propagation network (RASPN) for further depth refinement based on the semantic prior constraint. Finally, we collect a new dataset termed TOFDC for the depth completion task, which is acquired by the time-of-flight (TOF) sensor and the color camera on smartphones. Extensive experiments demonstrate that our method achieves state-of-the-art performance on KITTI, NYUv2, Matterport3D, 3D60, VKITTI, and our TOFDC.
... The overall architecture of our VD-Mamba is shown in Fig. 4. It consists of two distinct branches, one for spatial feature extraction and the other for temporal feature fusion. Both branches follow a U-Net [41] architecture, comprising four levels of feature transformation through upsampling and downsampling. The encoders and decoders are connected through skip connections using residual learning. ...
... Network Structure. We design three additional model variants: (1) replacing both S3ML and TSML with residual blocks to form a pure dual-branch U-Net [41], (2) replacing S3ML alone, and (3) replacing TSML alone. The results in the first row of Tab. 3 show that both S3ML and TSML contribute to improving the deraining performance. ...
Preprint
Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.
... To tackle this issue, recent advances (Gou et al., 2023; Morelli et al., 2023; Xu et al., 2024a) take inspiration from diffusion models (Avrahami et al., 2022; Hertz et al., 2023; Kawar et al., 2023; Ruiz et al., 2023; Rombach et al., 2022), with their enhanced training stability and scalability in content creation, and present a new diffusion-based direction for the VTON task. In general, the whole garment image, as appearance reference, is encoded via a VAE encoder and a UNet (Kingma & Welling, 2014; Ronneberger et al., 2015), and is further integrated into the diffusion model for conditional image generation. Despite improving the quality of the synthesized person image with few artifacts, such VTON results still fail to preserve sufficient garment details (Figure 1 (c-d)) due to the stochastic denoising process in the diffusion model. ...
... In the reverse denoising process, LDM learns to predict the added noise ϵ and removes it. The typical training objective of the LDM denoiser (generally implemented as a UNet (Ronneberger et al., 2015)), parameterized by θ, can be simply formulated as ...
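The excerpt ends before the formula. For reference, the standard latent-diffusion noise-prediction objective is commonly written as follows; the exact notation and loss weighting in the cited preprint may differ.

```latex
\mathcal{L}_{\mathrm{LDM}}(\theta)
  = \mathbb{E}_{z_0,\, p,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t}
    \Big[ \big\lVert \epsilon - \epsilon_\theta(z_t, t, p) \big\rVert_2^2 \Big],
\qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon .
```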
Preprint
Diffusion models have shown preliminary success in virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for VTON task. Nevertheless, the problem remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion model. To alleviate this issue, we novelly propose to explicitly capitalize on visual correspondence as the prior to tame diffusion process instead of simply feeding the whole garment into UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in garment to the ones over target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with depth/normal map of target person. The correspondence mimics the way of putting clothing on human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take the advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets. Code is publicly available at: https://github.com/HiDream-ai/SPM-Diff.
... We report unconditional generation results on CIFAR-10 [25] (32×32) in Tab. 3. FID-50K is reported with 1-NFE sampling. All entries use the same U-Net [38] developed from [44] (∼55M), applied directly in the pixel space. All other competitors use the EDM-style pre-conditioner [22], and ours has no pre-conditioner. ...
... The input to the model is 32×32×3 in the pixel space. The network is a U-Net [38] developed from [44] (∼55M), which is commonly used by the other baselines we compare against. We apply positional embedding on the two time variables (here, (t, t − r)) and concatenate them for conditioning. ...
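A hedged Python sketch of the conditioning described above: each of the two time variables (t and t − r) gets a standard sinusoidal embedding, and the two embeddings are concatenated. The function names and embedding size are assumptions, not details from the cited preprint.

```python
import math
import torch

def sinusoidal_embedding(t, dim=128):
    """Standard sinusoidal embedding of a scalar time variable: [B] -> [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def condition_embedding(t, r, dim=128):
    """Embed the two time variables (t, t - r) separately, then concatenate."""
    return torch.cat([sinusoidal_embedding(t, dim), sinusoidal_embedding(t - r, dim)], dim=-1)

# condition_embedding(torch.rand(8), torch.rand(8)).shape -> torch.Size([8, 256])
```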
Preprint
We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
... most important subtasks in semantic segmentation of remote sensing images, and it helps researchers in urban planning, cartography, and intelligent navigation [1]-[3], among others. There are numerous current semantic segmentation methods for natural targets [4]-[6] and other targets in remote sensing scenarios [7]-[9]. However, performance degrades significantly when these methods are applied directly to segmenting roads in remote sensing images. ...
... We compare our method with other current popular road segmentation models for remote sensing images on the datasets mentioned above. These methods include UNet [4], DeepLab v3+ [48], D-LinkNet [49], CoANet [50], SGCN [51], Seg-Road [52], VNet [11], RCFS-Net [22], and OARENet [53]. Fig. 7 shows the qualitative segmentation results of some of the aforementioned methods on the three datasets: DeepGlobe, CHN6-CUG, and the Ottawa Road Dataset. ...
Article
Using very high-resolution optical remote sensing images for road segmentation is a challenging and important interpretation task. Different from other segmentation tasks, road segmentation typically faces unpredictable structure, irregular distribution, and complex background interference. Thus, establishing an effective and stable semantic description for road segmentation becomes a challenge. In this article, a novel architecture called Boundary Decoupling based W-shape Network (BD-WNet) is proposed for achieving road segmentation from optical remote sensing imagery. First, a novel W-shaped double encoder-decoder architecture network is designed to provide a more stable semantic description, which can be used for road body extraction. Second, the unstable semantic features within the initial stage of the double encoder-decoder architecture are considered for road boundary description. For decoupling unstable boundary information from the output feature of the first encoder-decoder, a boundary separation module is designed, called the Boundary-Body Decoupling (BBD) module. This module utilizes the flow field mechanism to compare image features before and after passing through the encoder-decoder. The stable features among the overall features are decoupled into the main body of the road, while the dynamic features are decoupled into the boundary of the road. Third, a boundary and body weighting fusion model is also designed to fuse stable road body and unstable road boundary information for supervised learning. Extensive experiments have been carried out on remote sensing road segmentation datasets, and our method achieves impressive performance. Specifically, the proposed BD-WNet achieves an 82.5% F1 score and 69.8% IoU on the DeepGlobe dataset, a 75.5% F1 score and 60.7% IoU on the CHN6-CUG dataset, and a 93.8% F1 score and 88.3% IoU on the Ottawa Road Dataset.
... Early diffusion models (e.g., ADM [11] and Stable Diffusion [61]) primarily employed hybrid U-Net [62] architectures that integrate convolutional layers with self-attention. More recently, Transformers [77] have emerged as a more powerful and scalable backbone [56,3], prompting a shift toward fully Transformer-based designs. ...
... Architecture of Diffusion Models. Early diffusion models commonly employ U-Net [62] as the foundational architecture [11,28,61]. More recently, a growing body of research has explored Vision Transformers (ViTs) [15] as alternative backbones for diffusion models, yielding remarkable results [56,3,52,88,59,48]. ...
Preprint
Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns-highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet benchmarks, DiCo outperforms previous diffusion models in both image quality and generation speed. Notably, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, our largest model, DiCo-H, scaled to 1B parameters, reaches an FID of 1.90 on ImageNet 256x256-without any additional supervision during training. Code: https://github.com/shallowdream204/DiCo.
... Deep learning approaches, particularly Convolutional Neural Networks (CNN), have gained prominence in medical image segmentation in recent years. One of the foundational architectures in this domain, U-Net [7], has become the gold standard for biomedical image segmentation due to its efficient encoder-decoder structure and the capability to leverage limited labelled data through the skip connections mechanism. Several extensions were proposed starting from the U-Net architecture. ...
... In this section, we describe and discuss the experimental activities carried out to assess the effectiveness of our proposed methodology. One of the most prominent deep learning approaches for medical image segmentation is U-Net [7], a convolutional neural network characterized by an encoder-decoder structure with skip connections. This architecture allows the model to capture both contextual information and fine details, with the encoder down-sampling the input while the decoder up-samples it to produce a segmentation map. ...
Conference Paper
Gliomas are among the most aggressive and heterogeneous brain tumours, and their characteristics make their precise segmentation very difficult, with negative consequences in diagnosis and treatment planning. Classical pixel-based segmentation techniques often struggle with the variability and complexity of glioma occurrence. In this paper, we propose a novel graph-based segmentation method utilizing Graph Neural Networks to enhance the accuracy of glioma segmentation in MRI images. Representing MRI scans with graphs helps to capture the spatial structure and contextual information about the tumour. We evaluate our method on a standard glioma dataset and compare it with U-Net-based segmentation techniques, demonstrating that our approach outperforms traditional models across multiple metrics. The results suggest that graph-based segmentation offers a powerful alternative for medical image analysis, potentially improving clinical outcomes in brain tumour management.
... To perform denoising, we incorporate the DRUNET [23] as the Gaussian denoiser within the proposed GGSM-PnP algorithm. DRUNET integrates the structures of UNet [38] and residual blocks [39], allowing it to handle images with varying noise levels. This characteristic makes it suitable for situations where the noise level η_β changes during different iterations of GGSM-PnP. ...
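DRUNet-style denoisers are typically conditioned on the noise level by concatenating a constant noise-level map as an extra input channel, which is what allows a single network to cover varying noise levels. The sketch below illustrates that input construction; it is an assumption-based example, not the GGSM-PnP implementation.

```python
import torch

def add_noise_level_channel(noisy, sigma):
    """Append a constant noise-level map so one denoiser can handle many noise levels."""
    b, _, h, w = noisy.shape
    level_map = torch.full((b, 1, h, w), float(sigma), device=noisy.device, dtype=noisy.dtype)
    return torch.cat([noisy, level_map], dim=1)  # [B, C+1, H, W]

# x_in = add_noise_level_channel(torch.randn(2, 3, 64, 64), sigma=25 / 255)  # -> (2, 4, 64, 64)
```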
Article
Restoring images corrupted by a combination of additive white Gaussian noise (AWGN) and salt-and-pepper impulse noise (SPIN) poses a significant challenge, primarily stemming from the complexities involved in accurately modeling the distributions of the mixed noise. Traditional methods for mixed noise removal, such as filters and optimization models, often exhibit high computational complexity and limited performance. Inspired by the strong performance of deep learning-based approaches, we propose a novel mixed noise removal method that leverages an implicit deep image prior, called GGSM-PnP. Specifically, drawing inspiration from empirical distributions, we model the mixed noise using the generalized Gaussian distribution and establish the model within a maximum a posteriori framework. Subsequently, we employ the alternating direction multiplier method to derive the algorithm for solving the proposed model. Within the deep prior involved sub-problem, we integrate an offline-trained Gaussian denoiser into the plug-and-play framework. Experimental results on synthetic noisy images demonstrate the superior performance of the proposed method compared to existing techniques for removing mixed AWGN+SPIN noise.
... Mountainous regions present distinct challenges and opportunities for environmental analysis and understanding due to their rugged terrain and diverse ecosystems. Remote sensing technology emerges as a powerful tool for studying mountains, providing insight into critical aspects such as vegetation health, snow accumulation, glacier movement, and natural hazards. It also facilitates the detection of underground targets, such as pipelines, within mountains, thereby aiding urban management and the development of underground spaces [4]. ...
Article
In mountainous landscapes, integrating topographical information is crucial for effective analysis and understanding. Remote sensing becomes indispensable in studying mountains, enabling the monitoring of critical aspects such as grassland degradation, snow depth, and glacier dynamics. The intricate nature of mountain ecosystems requires a thorough understanding of their dynamics, given their vital role in providing habitats for unique species, influencing hydrology patterns, and indicating the impact of climate change. However, the persistent challenge of cloud cover obstructs surface observations, limiting the effectiveness of optical sensors. To overcome this obstacle, various cloud removal techniques have been developed, although they tend to struggle in such landscapes. In this study, we introduce CRT-UNet, a UNet-based cloud removal model that incorporates topographical information from Digital Elevation Model (DEM) data as input to enhance performance in mountainous regions. By integrating Synthetic Aperture Radar Sentinel-1 (S1) data, DEM information, and topographic insights into Sentinel-2 (S2) data, our model aims to enhance cloud removal capabilities, particularly in challenging terrains characterized by thick cloud coverage and significant elevation variations. This integration of topographical information enriches the cloud removal process, enabling more accurate restoration of obscured terrain features. The results demonstrate the superior performance of CRT-UNet in cloud removal across varying cloud cover levels and complex terrain features, such as rugged peaks and deep valleys. The proposed model outperforms state-of-the-art cloud removal models quantitatively and qualitatively. This underscores the importance of incorporating topographical information in remote sensing applications, particularly in mountainous regions, to improve data accuracy.
... Large-scale text-to-image (T2I) diffusion models have rapidly become the backbone of generative AI. Building on latent diffusion, Stable Diffusion [2,3] popularized an open-source U-Net [56] conditioned on CLIP [57], capable of efficient generation by operating in a compressed latent space. Meanwhile, the PixArt series [4,5] demonstrates that decomposed training stages, latent consistency modules, and weak-to-strong paradigms can reduce training cost by over 90%, while supporting 4K output and 2-4-step sampling for sub-second inference. ...
Preprint
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
... z_t represents the latent vector at denoising step t, and p denotes the textual prompt utilized for conditioning the generation process. The denoising function ϵ_θ, parameterized by θ, typically corresponds to a UNet architecture [39]. ...
Preprint
Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 X 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., 4,096 X 4,096 and 8,192 X 8,192) with higher visual fidelity and more creative regional details.
... The U-Net architecture [24], which is the most widely used architecture for semantic segmentation, consists of an encoding phase to capture context and a symmetric decoding phase for precise localization. In this study, the encoder had four layers, each comprising a convolutional layer, followed by batch normalization, ReLU activation, and a pooling layer. ...
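A minimal PyTorch sketch of one encoder stage with the layer ordering listed above (convolution, batch normalization, ReLU, pooling); the channel counts are illustrative assumptions, not the cited study's configuration.

```python
import torch.nn as nn

def encoder_stage(c_in, c_out):
    """One encoder stage: convolution -> batch normalization -> ReLU -> pooling."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

# A four-stage encoder as described: encoder_stage(3, 64), encoder_stage(64, 128),
# encoder_stage(128, 256), encoder_stage(256, 512), chained with nn.Sequential.
```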
... While often overlooked, it is important to note that these approaches can be divided into two families that leverage multiscale concepts. The first is to learn parameters for each scale, and a separate set of parameters that mix scales, as in UNet [23]. The second, called multiscale training, enables the approximation of fine-scale parameters using coarse-scale samples [17,30,8,14]. ...
Article
Analysing learned concepts for PDE-based parameter identification problems requires input from different research areas such as inverse problems, partial differential equations, statistics, and the mathematical foundations of deep learning. This workshop brought together a critical mass of experts in these various fields. A thorough mathematical theory for PDE-based inverse problems using learned concepts is within reach in the coming few years, and the inspiration of this Oberwolfach meeting will substantially influence this development.
... We adopted the U-Net architecture [43], following the design used in [1,10], as our diffusion model backbone. Models were trained with the AdamW optimizer [44] and used an exponential moving average (EMA) to stabilize training by averaging model weights over time, using a decay rate of 0.9999 for gradual updates. ...
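The EMA described above amounts to a running average of the weights with decay 0.9999. A hedged PyTorch sketch of such an update follows; it is illustrative, not the authors' code.

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.9999):
    """Exponential moving average of weights: ema <- decay * ema + (1 - decay) * current."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Typical usage: create ema_model = copy.deepcopy(model) once, then call
# ema_update(ema_model, model) after every optimizer step; evaluate with ema_model.
```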
Preprint
Diffusion models are widely used in applications ranging from image generation to inverse problems. However, training diffusion models typically requires clean ground-truth images, which are unavailable in many applications. We introduce the Measurement Score-based diffusion Model (MSM), a novel framework that learns partial measurement scores using only noisy and subsampled measurements. MSM models the distribution of full measurements as an expectation over partial scores induced by randomized subsampling. To make the MSM representation computationally efficient, we also develop a stochastic sampling algorithm that generates full images by using a randomly selected subset of partial scores at each step. We additionally propose a new posterior sampling method for solving inverse problems that reconstructs images using these partial scores. We provide a theoretical analysis that bounds the Kullback-Leibler divergence between the distributions induced by full and stochastic sampling, establishing the accuracy of the proposed algorithm. We demonstrate the effectiveness of MSM on natural images and multi-coil MRI, showing that it can generate high-quality images and solve inverse problems -- all without access to clean training data. Code is available at https://github.com/wustl-cig/MSM.
... For brain extraction, we used a 2.5D deep learning-based segmentation method. This method combined adjacent slices into three channels and employed a U-Net architecture with an EfficientNetB5 backbone (Ronneberger et al., 2015; Tan and Le, 2019). Encoders pre-trained on ImageNet were used for feature extraction (Deng et al., 2009). ...
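A small NumPy sketch of the 2.5D construction described above, stacking each slice with its two neighbors into a three-channel input for a 2D segmentation network; the slice ordering and edge handling here are assumptions, not the cited study's exact procedure.

```python
import numpy as np

def make_25d_inputs(volume):
    """Turn a volume [S, H, W] into 2.5D inputs [S, 3, H, W]: each slice is stacked
    with its previous and next neighbor (edge slices are repeated)."""
    padded = np.concatenate([volume[:1], volume, volume[-1:]], axis=0)
    return np.stack([padded[:-2], padded[1:-1], padded[2:]], axis=1)

# make_25d_inputs(np.random.rand(40, 256, 256)).shape -> (40, 3, 256, 256)
```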
Article
Introduction In the developmental field, sex differences can alter brain growth and development. Across the literature, sex differences have been reported in overall brain volume, white matter, gray matter and numerous other regions and tracts captured through non-invasive neuroimaging. Growing evidence suggests that sex differences appear at birth and continue through childhood. However, limited work has been completed in translational animal models, such as the domestic pig. Additionally, when using neuroimaging, uncertainties remain about which method best depicts microstructural changes, such as myelination. Materials and methods To address this gap, the present study utilized a total of 24 pigs (11 intact males or boars; 13 females or gilts) that underwent neuroimaging at postnatal day (PND) 29 or 30 to assess overall brain structural anatomy (MPRAGE), microstructural differences using diffusion (DTI), and an estimation of myelin content via myelin water fraction (MWF). On PND 32, brains were collected from all pigs, with the left hippocampus isolated, sectioned, and stained using the Gallyas silver impregnation method to quantify myelin density. Results Minimal sex differences were observed across neuroimaging modalities, with only myelin content exhibiting sex differences in the hippocampus (P = 0.022). In the left hippocampus (P = 0.038), females had a higher MWF value compared with males. This was supported by histologically derived myelin density as assessed by positive pixel percentage, but differences were isolated to one anatomical plane of the hippocampus (P = 0.024) and not the combined mean value (P = 0.333). Further regression analysis determined that axial (P = 0.01) and mean (P = 0.048) diffusivity measures, but not fractional anisotropy or MWF, were positively correlated with histologically derived myelin density in the left hippocampus, independent of sex. Discussion These findings suggest that at 4 weeks of age, axial and mean diffusivity may better reflect myelin density. Further investigation is required to confirm underlying mechanisms. Overall, minimal sex differences were observed in 4-week-old domestic pigs, indicating similar brain structure at this early stage of development.
... 2. LoRA-based fine-tuning: We fine-tune only the attention layers of the UNet (Ronneberger et al., 2015) of SDXL (Podell et al., 2023) with LoRA on 4-5 subject images, such that the model overfits on the new object, and store the LoRA safetensors separately. ...
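A hedged PyTorch sketch of the LoRA idea referenced above: the base attention projection stays frozen while a trainable low-rank update is added on top, so only the adapter weights need to be stored separately. The wrapper and the attn.to_q usage line are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: y = W x + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)       # base weights stay frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)     # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

# e.g. wrapping one attention projection: attn.to_q = LoRALinear(attn.to_q, r=8)
```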
Preprint
Recent advances in text-to-image diffusion models, particularly Stable Diffusion, have enabled the generation of highly detailed and semantically rich images. However, personalizing these models to represent novel subjects based on a few reference images remains challenging. This often leads to catastrophic forgetting, overfitting, or large computational overhead. We propose a two-stage pipeline that addresses these limitations by leveraging LoRA-based fine-tuning on the attention weights within the U-Net of the Stable Diffusion XL (SDXL) model. First, we use the unmodified SDXL to generate a generic scene by replacing the subject with its class label. Then, we selectively insert the personalized subject through a segmentation-driven image-to-image (Img2Img) pipeline that uses the trained LoRA weights. This framework isolates the subject encoding from the overall composition, thus preserving SDXL's broader generative capabilities while integrating the new subject in a high-fidelity manner. Our method achieves a DINO similarity score of 0.789 on SDXL, outperforming existing personalized text-to-image approaches.
... The growing integration of Artificial Intelligence (AI) into medical imaging has offered promising solutions to these challenges. Deep learning, particularly through Convolutional Neural Networks (CNNs), has shown remarkable success in automating diagnostic tasks across various imaging modalities [6], [7]. Among the CNN-based models, the U-Net architecture has emerged as a cornerstone in biomedical image segmentation due to its encoder-decoder structure, which captures both spatial context and fine-grained detail [8]. ...
Article
Accurate segmentation of brain tumors from magnetic resonance imaging (MRI) is a critical step in clinical diagnosis and treatment planning. While U-Net and its derivatives have shown strong performance in medical image segmentation, they often struggle with capturing long-range dependencies and handling complex tumor boundaries. In this study, we propose an enhanced U-Net architecture that integrates Swin Transformer blocks as the encoder backbone, combined with dual attention mechanisms and a multi-scale feature fusion strategy to improve segmentation precision. The model is trained and evaluated on the BraTS 2020 dataset, which includes multi-modal MRI scans (T1, T1CE, T2, FLAIR) of glioma patients. Extensive experiments demonstrate the superiority of our model compared to baseline architectures. Our method achieved Dice scores of 0.91 for Whole Tumor (WT), 0.85 for Tumor Core (TC), and 0.82 for Enhancing Tumor (ET), outperforming conventional U-Net and ResUNet variants. Qualitative results further validate the model's ability to capture fine-grained tumor details and maintain spatial continuity. The proposed framework effectively leverages the global contextual awareness of transformers and the localization strength of CNNs, offering a robust solution for automated brain tumor segmentation in clinical settings.
Article
Generative Artificial Intelligence (GenAI) is revolutionizing ophthalmology imaging by addressing critical limitations in data availability, annotation costs, and clinical workflow automation. This review provides a comprehensive analysis of GenAI’s technical innovations, clinical applications, and persistent challenges within the ophthalmic imaging domain. We first survey the evolution of generative architectures, from Generative Adversarial Networks to Diffusion Models and vision-language frameworks. These innovations enable novel applications including counterfactual pathology synthesis, longitudinal disease progression modeling, and post-treatment outcome visualization, which enhance diagnostic precision and patient engagement. We then systematically review methodological advancements in GenAI, with a focused analysis on key clinical application categories: image generation, cross-modal domain transfer, image enhancement, post-treatment prediction, image segmentation, and vision-language tasks. Finally, we critically evaluate generative models, evaluation methods, and persistent challenges, such as the need for standardized evaluation frameworks, anatomical fidelity validation, and equitable integration into global healthcare systems. By addressing these barriers, GenAI holds transformative potential to improve diagnostic accuracy, streamline personalized treatment workflows, and democratize access to high-quality ophthalmic care.
Chapter
Single-photon emission computed tomography (SPECT)/computed tomography (CT) and positron emission tomography (PET)/CT are crucial for improving the quality of diagnostic imaging in nuclear medicine. These hybrid imaging modalities can simultaneously provide both functional and anatomical information by combining nuclear medicine images (SPECT or PET) with anatomical images (CT). However, the use of CT in SPECT/CT and PET/CT comes with the challenge of increased radiation exposure to patients. Therefore, from the perspective of radiation protection, it is necessary to optimize not only the dose from radiopharmaceuticals but also the CT dose, depending on the clinical application. Recent advances, such as the introduction of low-dose CT techniques and adaptive filtering technologies, have enabled the reduction of radiation exposure while maintaining diagnostic quality. CT in SPECT/CT and PET/CT are indispensable tools for improving diagnostic accuracy, enhancing the precision of quantitative evaluations, and optimizing treatment planning. Appropriate dose management and technological advancements will continue to be critical issues in balancing diagnostic efficacy with patient safety. In this chapter, we described not only some especially important technical points of SPECT/CT and PET/CT, but also MIRD method as a fundamental estimation method of radiation exposure dose, CTDI, various correction methods within reconstruction methods, and the newest technologies for nuclear medicine imaging.
Article
Anime content, such as Japanese-style illustrations, manga, and animation, is popular worldwide among diverse audiences. However, editing and repurposing this content for an enhanced viewing experience is complex and relies heavily on manual processes, due to the challenge of automatically identifying individual character instances. Therefore, automated and precise segmentation of these elements is essential to enable various anime editing applications such as visual style editing, motion decomposition and transfer, and depth estimation. Most state-of-the-art segmentation methods are designed for natural photographs and do not capture the intricate aesthetics of anime-style characters, which reduces segmentation quality. The primary challenges are the lack of high-quality anime-dedicated datasets and the absence of competent models for high-resolution instance extraction on anime. To address these issues, we introduce a high-quality dataset of over 100k paired high-resolution anime-style images and their instance labeling masks. We also present an instance-aware image segmentation model that generates accurate, high-resolution segmentation masks for characters in a wide variety of anime-style images. Furthermore, we show that our approach supports segmentation-dependent editing applications such as 3D Ken Burns effects, text-guided style editing, and puppet animation from illustrations and manga.
Article
Accurate traffic flow prediction is a critical component of intelligent transportation systems and smart cities, playing an essential role in traffic control, transportation planning, and infrastructure development. Numerous recent research studies highlight the need to enhance prediction accuracy by addressing complex temporal and spatial dependencies. However, due to the complexity of these spatio-temporal patterns, achieving accurate traffic predictions is still a main challenge in long-term scenarios. In this context, we first provide a comprehensive overview of the traffic forecasting to locate where research is going on. Then, we develop a Long-Term Spatio-Temporal Graph Attention Network (LSTGAN) architecture designed to analyze long-term historical data to address the above issue. This architecture encodes several previous time steps and extracts temporal patterns using convolutional layers. These features are then combined with the spatial features captured by a spatial attention module and a graph convolution layer to be processed by a temporal attention decoder responsible for making predictions. Experiments on METR-LA and PEMS-BAY datasets show that our proposed architecture outperforms most existing state-of-the-art baselines.
Article
Background This study aimed to approximate the level of extravascular lung water (EVLW) in patients with severe COVID-19 pneumonia using quantitative imaging techniques. The elevation of EVLW is known to correlate with the degree of diffuse alveolar damage and is linked with the mortality of critically ill patients. Transpulmonary thermodilution (TPTD) is the gold standard technique to estimate the total amount of EVLW, but it is invasive and requires specialized equipment and trained personnel. Methods The study included patients with severe COVID-19 who required chest CT scanning within the first 48 h of Intensive Care Unit (ICU) admission and had TPTD monitoring. Using in-house software tools for automatic semantic segmentation, lung masks were obtained for estimating the EVLW content. The results were compared with the TPTD measurements. Results The results demonstrate a significant correlation between EVLW-TPTD measured by thermodilution and EVLW-CT estimated from the patient's CT image (r = 0.629, p = 0.0014). Conclusion The study showed that quantitative imaging techniques using chest CT scans could be used as a convenient and low-cost option for ICUs without TPTD equipment for the assessment of EVLW in severe COVID-19 pneumonia.
Article
The so-called Deep Image Prior approach is an unsupervised deep learning methodology which has gained great interest in recent years due to its effectiveness in tackling imaging problems. However, a well known drawback of Deep Image Prior is the need for a proper early stopping technique to prevent undesired corruptions in the reconstructed images. Determining the optimal number of iterations depends on both the specific application and the features of the images being processed. As a consequence, several numerical trials are typically required to decide when to stop the Deep Image Prior procedure, resulting in significant computational costs and time requirements. This paper aims to introduce two early stopping techniques for Deep Image Prior, based on different approaches and different perspectives. The first approach relies on the neural architecture search (NAS) strategy. The aim is to equip the neural network used in Deep Image Prior with hyperparameter configurations that produce high-quality reconstructed images that are (i) comparable to those obtained with the optimally stopped, originally configured Deep Image Prior, and (ii) achieved with significantly fewer iterations. The second proposed early stopping strategy is based on a modified version of the BRISQUE metric, a no-reference image quality measure, and it aims to track the behaviour of the PSNR curve, obtained by applying Deep Image Prior, without knowing the ground truth image. While the NAS-based early stopping technique is particularly suited in those situations where the computational time is limited, this latter one is also relevant when a larger number of iterations is allowed. Several numerical experiments on different denoising applications show a promising performance of Deep Image Prior combined with the suggested early stopping procedures.
Poster
Abstract - Introduction: Early melanoma detection is critical, particularly in skin of color (SOC), where diagnosis is often delayed due to atypical presentations. Despite advances in dermoscopy, the lack of SOC representation in training datasets hinders the development of reliable diagnostic tools. Machine learning models, predominantly trained on lighter skin types, may yield biased results when applied to SOC, exacerbating health disparities in melanoma outcomes. Methods: We used a U-Net deep learning model to segment features from dermoscopic images of melanocytic lesions. The widely cited ISIC dataset includes only 3% SOC cases, underscoring the underrepresentation of SOC in current datasets. The model, originally trained on only skin types 1–3, was applied to SOC images without retraining. Data augmentation techniques, such as rotation and contrast variation, were used to simulate diverse optical dermoscopic parameters. Results: The Dice similarity coefficient for SOC images was lower than the 0.82 obtained for lighter skin types, indicating a reduction in segmentation accuracy. This performance decline highlights the limitations of using models trained on lighter skin types for SOC populations. Subjective review of the segmentation on skin types 4–6 suggests flaws in the latent space of the original models that underrepresent acral lentiginous melanoma. Discussion: Our findings highlight the need for transfer-learning of current models with more diverse datasets. Retraining with SOC-specific data could improve segmentation accuracy and promote equitable melanoma detection. Conclusions: Future work should apply medical domain knowledge to improve melanoma detection aids, ensuring that algorithms are equally effective across all skin types. Enhancing feature recognition for SOC, particularly for melanoma subtypes like acral lentiginous melanoma, is critical for equitable outcomes.
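For reference, the Dice similarity coefficient reported above is 2|A ∩ B| / (|A| + |B|); the minimal NumPy sketch below computes it for binary masks (illustrative only, not the poster's evaluation code).

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient for binary masks: 2 * |A ∩ B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# dice_coefficient(np.array([[1, 0], [1, 1]]), np.array([[1, 0], [0, 1]]))  # -> 0.8
```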
Article
Video anomaly detection plays a crucial role in intelligent transportation systems by enhancing urban mobility and safety. This review provides a comprehensive analysis of recent advancements in artificial intelligence methods applied to traffic anomaly detection, including convolutional and recurrent neural networks (CNNs and RNNs), autoencoders, Transformers, generative adversarial networks (GANs), and multimodal large language models (MLLMs). We compare their performance across real-world applications, highlighting patterns such as the superiority of Transformer-based models in temporal context understanding and the growing use of multimodal inputs for robust detection. Key challenges identified include dependence on large labeled datasets, high computational costs, and limited model interpretability. The review outlines how recent research is addressing these issues through semi-supervised learning, model compression techniques, and explainable AI. We conclude with future directions focusing on scalable, real-time, and interpretable solutions for practical deployment.
Article
Diffusion-based approaches have recently emerged as powerful alternatives to GAN-based virtual try-on methods, offering improved detail preservation and visual realism. Despite their advantages, the substantial number of parameters and intensive computational requirements pose significant barriers to deployment on low-resource platforms. To tackle these limitations, we propose a diffusion-based virtual try-on framework optimized through feature-level knowledge compression. Our method introduces MP-VTON, an enhanced inpainting pipeline based on Stable Diffusion, which incorporates improved Masking techniques and Pose-conditioned enhancement to alleviate garment boundary artifacts. To reduce model size while maintaining performance, we adopt an attention-guided distillation strategy that transfers semantic and structural knowledge from MP-VTON to a lightweight model, LiteMP-VTON. Experiments demonstrate that LiteMP-VTON achieves nearly a 3× reduction in parameter count and close to 2× speedup in inference, making it well suited for deployment in resource-limited environments without significantly compromising generation quality.