Examples of feature maps extracted from context-aware feature pyramid networks. Columns 1-10 show the original RGB image and the outputs of DConv1_x, DConv2_x, DConv3_x, DConv4_x, DConv5_x, P2, P3, P4, and P5, respectively; rows 1-3 show three tongue samples from three different datasets. (The feature maps at each level are randomly selected for visualization.)

Source publication
Article
Full-text available
Tongue diagnosis is an important way of monitoring human health status in traditional Chinese medicine. As a key step toward automatic tongue diagnosis, robust and accurate segmentation and identification of the tongue body in tongue images is challenging mainly because of the large variations in tongue appearance, e.g., tongue texture and tongue...

Context in source publication

Context 1
... the feature extraction stage, a context-aware feature pyramid network is utilized to extract multi-scale features of tongue images, as described in section III-A.3. In this stage, the feature maps of the tongue image at different pyramid levels are extracted for tongue localization in the subsequent region proposal stage. As shown in Fig. 6, the feature maps extracted at different levels are discriminative for the tongue body compared to the face and other ...


Citations

... Teeth analysis is almost absent in the field, except as a minor part of more general face image research [47,48], practically without reliable numerical assessment. Tongue segmentation is often conducted in a highly specific area, where the tongue's behavior hardly retains its motion during ordinary speech, e.g., traditional Chinese medicine (TCM) [49][50][51][52]. In such cases, a large part of the tongue has to be clearly visible in the central part of the picture, sticking out of the mouth, and the segmentation IoU may reach 0.97 [50,52]. ...
... Tongue segmentation is often conducted in a highly specific area, where the tongue's behavior hardly retains its motion during ordinary speech, e.g., traditional Chinese medicine (TCM) [49][50][51][52]. In such cases, a large part of the tongue has to be clearly visible in the central part of the picture, sticking out of the mouth, and the segmentation IoU may reach 0.97 [50,52]. Instead, the tongue is mostly hidden behind the teeth or lips during pronunciation. ...
... In general, the teeth and tongue are the most difficult to detect and segment for several reasons mentioned before (teeth: multiple tiny gray objects; tongue: variable shape and covering by teeth or lips; both: illumination issues). Therefore, we are satisfied with the median segmentation Dice index exceeding 0.83 in both cases with a fully-automated method, as it reaches the accuracy level obtained in our previous study involving better-illuminated and established tongue segmentation [55], being hardly comparable to significantly different TCM approaches to the tongue [49][50][51][52], and having no teeth segmentation benchmarks. Not surprisingly, the tongue requires special attention as it is not always detected correctly, particularly in small, covered, or poorly illuminated appearances. ...
Article
Full-text available
This paper describes a multistage framework for face image analysis in computer-aided speech diagnosis and therapy. Multimodal data processing frameworks have become a significant factor in supporting speech disorders' treatment. Synchronous and asynchronous remote speech therapy approaches can use audio and video analysis of articulation to deliver robust indicators of disordered speech. Accurate segmentation of articulators in video frames is a vital step in this agenda. We use a dedicated data acquisition system to capture the stereovision stream during speech therapy examination in children. Our goal is to detect and accurately segment four objects in the mouth area (lips, teeth, tongue, and whole mouth) during relaxed speech and speech therapy exercises. Our database contains 17,913 frames from 76 preschool children. We apply a sequence of procedures employing artificial intelligence. For detection, we train the YOLOv6 (you only look once) model to catch each of the three objects under consideration. Then, we prepare the DeepLab v3+ segmentation model in a semi-supervised training mode. As preparation of reliable expert annotations is exhausting in video labeling, we first train the network using weak labels produced by initial segmentation based on the distance-regularized level set evolution over fuzzified images. Next, we fine-tune the model using a portion of manual ground-truth delineations. Each stage is thoroughly assessed using the independent test subset. The lips are detected almost perfectly (average precision and F1 score of 0.999), whereas the segmentation Dice index exceeds 0.83 in each articulator, with a top result of 0.95 in the whole mouth.
... The main aim of this study was to embed SFP and VRE into TransUNet through design and experiments and conduct in-depth analysis and quantitative evaluation of their capacity to improve tongue coating segmentation tasks. Currently, several models have been developed for accurate tongue body segmentation [16,[42][43][44]. However, to the best of our knowledge, few studies have investigated tongue coating segmentation models. ...
Article
Full-text available
Background: As an important part of the tongue, the tongue coating is closely associated with different disorders and has major diagnostic benefits. This study aims to construct a neural network model that can perform complex tongue coating segmentation. This addresses the issue of tongue coating segmentation in intelligent tongue diagnosis automation. Method: This work proposes an improved TransUNet to segment the tongue coating. We introduced a transformer as a self-attention mechanism to capture the semantic information in the high-level features of the encoder. At the same time, the subtraction feature pyramid (SFP) and visual regional enhancer (VRE) were constructed to minimize the redundant information transmitted by skip connections and improve the spatial detail information in the low-level features of the encoder. Results: Comparative and ablation experimental findings indicate that our model has an accuracy of 96.36%, a precision of 96.26%, a dice of 96.76%, a recall of 97.43%, and an IoU of 93.81%. Unlike the reference model, our model achieves the best segmentation effect. Conclusion: The improved TransUNet proposed here can achieve precise segmentation of complex tongue images. This provides an effective technique for the automatic extraction in images of the tongue coating, contributing to the automation and accuracy of tongue diagnosis.
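The Dice and IoU figures quoted in these abstracts are overlap measures between a predicted mask and a ground-truth mask. As a minimal sketch (not taken from any of the cited papers), the function below computes both for binary masks given as nested lists of 0/1. The name `dice_and_iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two binary masks of equal shape (nested lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # pixel marked as tongue in both masks
            elif p:
                fp += 1      # predicted tongue, but background in the ground truth
            elif t:
                fn += 1      # tongue pixel missed by the prediction
    if tp + fp + fn == 0:
        # Both masks are empty: treated here as perfect agreement (an assumed convention).
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou


# Toy 2x3 masks (hypothetical data, for illustration only).
pred = [[0, 1, 1],
        [0, 1, 0]]
truth = [[0, 1, 0],
         [0, 1, 1]]
dice, iou = dice_and_iou(pred, truth)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

For a single pair of masks the two measures are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Values averaged over a dataset need not satisfy this identity exactly.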
... Trajanovski et al. [8] employed the Unet network and integrated different color spaces for tongue image segmentation. Zhou et al. [9] improved the Atrous Spatial Pyramid Pooling (ASPP) method by utilizing four parallel convolutional layers, enabling multi-scale feature extraction and contextual information capture. Lin et al. [10] proposed an end-to-end trainable tongue image segmentation method using a deep convolutional neural network based on ResNet. ...
Article
Full-text available
Automated tongue segmentation plays a crucial role in the realm of computer-aided tongue diagnosis. The challenge lies in developing algorithms that achieve higher segmentation accuracy and maintain less memory space and swift inference capabilities. To relieve this issue, we propose a novel Pool-unet integrating Pool-former and Multi-task mask learning for tongue image segmentation. First of all, we collected 756 tongue images taken in various shooting environments and from different angles and accurately labeled the tongue under the guidance of a medical professional. Second, we propose the Pool-unet model, combining a hierarchical Pool-former module and a U-shaped symmetric encoder-decoder with skip-connections, which utilizes a patch expanding layer for up-sampling and a patch embedding layer for down-sampling to maintain spatial resolution, to effectively capture global and local information using fewer parameters and faster inference. Finally, a Multi-task mask learning strategy is designed, which improves the generalization and anti-interference ability of the model through the Multi-task pre-training and self-supervised fine-tuning stages. Experimental results on the tongue dataset show that compared to the state-of-the-art method (OET-NET), our method has 25% fewer model parameters, achieves 22% faster inference times, and exhibits 0.91% and 0.55% improvements in Mean Intersection Over Union (MIOU), and Mean Pixel Accuracy (MPA), respectively.
... 9 Some scholars also improved the segmentation performance of the model by improving the loss function. 10,11 Some scholars also use residual networks or extended residual networks combined with feature pyramids to extract information at multiple scales, 12,13 which also has a good performance in tongue image segmentation. ...
Article
Full-text available
Objective: Tongue segmentation is the basis for automated tongue recognition studies in Chinese medicine, but existing approaches suffer from defects such as network degradation and the inability to obtain global features, which seriously affect the segmentation result. This article proposes an improved model, RTC_TongueNet, based on DeepLabV3, which combines an improved residual structure with a transformer and integrates the ECA (Efficient Channel Attention) mechanism with multiscale atrous convolution to improve tongue image segmentation. Methods: We improve the DeepLabV3 backbone network by incorporating the transformer structure and an improved residual structure. The residual module is divided into two structures, and different residual structures are used under different conditions to increase how often shallow information is mapped to the deep network, which extracts the underlying features of the tongue image more effectively. The ECA attention mechanism is introduced after the concat operation in the ASPP (Atrous Spatial Pyramid Pooling) structure to strengthen information interaction and fusion, extract local and global features effectively, and let the model focus more on difficult-to-separate areas such as the tongue edge, yielding better segmentation. Results: The RTC_TongueNet network model was compared with the FCN (Fully Convolutional Networks), UNet, LRASPP (Lite Reduced ASPP), and DeepLabV3 models on two datasets. On both datasets, the MIOU (Mean Intersection over Union) and MPA (Mean Pixel Accuracy) values of the classic DeepLabV3 model were higher than those of the FCN, UNet, and LRASPP models. Compared with the DeepLabV3 model, the RTC_TongueNet network model increased MIOU by 0.9% and MPA by 0.3% on the first dataset, and increased MIOU by 1.0% and MPA by 1.1% on the second dataset. The RTC_TongueNet model performed best on both datasets.
Conclusion: In this study, based on DeepLabV3, we apply the improved residual structure and transformer as a backbone to fully extract image features locally and globally. The ECA attention module is combined to enhance channel attention, strengthen useful information, and weaken the interference of useless information. The RTC_TongueNet model can effectively segment tongue images. This study has practical application value and reference value for tongue image segmentation.
... After several studies based on CNNs for MRI and ultrasound images of the vocal tract for understanding speech production [10][11][12][13][14], more studies on using semantic segmentation based on deep CNNs for tongue segmentation have been recently conducted, and the effect is better than most of the traditional image segmentation methods [15,16]. However, there are still limitations in those methods, e.g., including image preprocessing such as image enhancement [17] making the whole segmentation process more complex or brightness discrimination [18] reducing the ability of generalization as a deep learning-based model. ...
... Accurate delineation of the tongue from low-contrast medical MR images of soft tissue remains a challenge, due to the lack of definitive boundary features separating many of the adjacent soft tissues [25]. Compared with conventional segmentation tasks in natural scenes, tongue segmentation is more challenging because of the following issues: (1) large variations of tongue appearance across patients combined with a higher precision requirement; (2) data imbalance, e.g., the foreground region (tongue body) is small compared with the background region; and (3) hard sample mining, e.g., lip pixels are hard samples that are difficult to separate from tongue pixels because of their similar appearance and close contact [15]. Recent studies demonstrated the applicability of AI methods in tongue segmentation [14,26]. ...
Article
Full-text available
Purpose Motor neuron disease (MND) causes damage to the upper and lower motor neurons including the motor cranial nerves, the latter resulting in bulbar involvement with atrophy of the tongue muscle. To measure tongue atrophy, an operator independent automatic segmentation of the tongue is crucial. The aim of this study was to apply convolutional neural network (CNN) to MRI data in order to determine the volume of the tongue. Methods A single triplanar CNN of U-Net architecture trained on axial, coronal, and sagittal planes was used for the segmentation of the tongue in MRI scans of the head. The 3D volumes were processed slice-wise across the three orientations and the predictions were merged using different voting strategies. This approach was developed using MRI datasets from 20 patients with ‘classical’ spinal amyotrophic lateral sclerosis (ALS) and 20 healthy controls and, in a pilot study, applied to the tongue volume quantification to 19 controls and 19 ALS patients with the variant progressive bulbar palsy (PBP). Results Consensus models with softmax averaging and majority voting achieved highest segmentation accuracy and outperformed predictions on single orientations and consensus models with union and unanimous voting. At the group level, reduction in tongue volume was not observed in classical spinal ALS, but was significant in the PBP group, as compared to controls. Conclusion Utilizing single U-Net trained on three orthogonal orientations with consequent merging of respective orientations in an optimized consensus model reduces the number of erroneous detections and improves the segmentation of the tongue. The CNN-based automatic segmentation allows for accurate quantification of the tongue volumes in all subjects. The application to the ALS variant PBP showed significant reduction of the tongue volume in these patients and opens the way for unbiased future longitudinal studies in diseases affecting tongue volume.
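The consensus strategies named in this abstract (softmax averaging, majority voting, union, and unanimous voting) can be illustrated with a small sketch. It is not the authors' implementation. It assumes each orientation (axial, coronal, sagittal) has already produced a per-voxel foreground probability, flattened into a list, and that a voxel counts as a foreground vote when its probability is at least 0.5. All function names and the toy probabilities are hypothetical.

```python
def softmax_average(prob_maps, threshold=0.5):
    """Average the per-orientation probabilities, then threshold the mean."""
    n = len(prob_maps)
    return [1 if sum(voxel) / n >= threshold else 0 for voxel in zip(*prob_maps)]


def majority_vote(prob_maps, threshold=0.5):
    """Threshold each orientation first; foreground if more than half agree."""
    n = len(prob_maps)
    return [1 if 2 * sum(p >= threshold for p in voxel) > n else 0
            for voxel in zip(*prob_maps)]


def union_vote(prob_maps, threshold=0.5):
    """Foreground if any orientation votes foreground."""
    return [1 if any(p >= threshold for p in voxel) else 0 for voxel in zip(*prob_maps)]


def unanimous_vote(prob_maps, threshold=0.5):
    """Foreground only if every orientation votes foreground."""
    return [1 if all(p >= threshold for p in voxel) else 0 for voxel in zip(*prob_maps)]


# Hypothetical foreground probabilities for four voxels from three orientations.
axial = [0.9, 0.6, 0.2, 0.40]
coronal = [0.8, 0.4, 0.1, 0.45]
sagittal = [0.7, 0.1, 0.6, 0.95]
maps = [axial, coronal, sagittal]

print("softmax averaging:", softmax_average(maps))
print("majority voting:  ", majority_vote(maps))
print("union:            ", union_vote(maps))
print("unanimous:        ", unanimous_vote(maps))
```

In this toy input the last voxel shows how the strategies can disagree: one very confident orientation pulls the averaged probability above the threshold, while only one of the three orientations votes foreground on its own. Union accepts every voxel that any orientation marks, and unanimous voting keeps only the voxel on which all three agree.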
... The region of interest (ROI) of the feature maps was also used for finer localization and segmentation. [15], in a small-scale tongue dataset, they also applied U-Net for fast tongue segmentation, and achieved the highest accuracy of 98.45% and consumed 0.267 s per picture on average, [16]. There is no tongue coating and the tongue is dark red or purplish red ① There may be something wrong with the circulatory system. ...
Article
The Internet is an important development of the information age. With the rise of the Internet, "Internet plus health" will become a trend of the new era. The Chinese Traditional Medicine (CTM) tongue image consulting system based on deep learning technology creates a more complete and intelligent personal health management system by connecting smart devices to mobile platforms. The system serves customers through health food therapy, medical consultation, etc.
... Gu et al. [5] used threshold-based and improved level set models and achieved good results in tongue image segmentation. Changen Zhou et al. [6] proposed an end-to-end model for tongue localization and segmentation, called TongueNet, which introduced a feature pyramid network of context-aware residual blocks to extract tongue features. Better performance in tongue body segmentation is achieved in terms of robustness and accuracy. ...
Article
Full-text available
In response to the application scenarios of modernized Traditional Chinese Medicine (TCM) diagnostic and treatment equipment moving towards the user end, an effort has been made to enhance the user-friendliness of TCM diagnostic and treatment devices. This involves introducing the concept of edge computing into the mobile tongue diagnosis instrument, shifting the tasks of tongue image acquisition and analysis to portable auxiliary diagnostic devices. To improve the efficiency of edge computing devices in handling tongue image analysis tasks, a multi-task network model based on a lightweight network backbone is proposed. The model utilizes the lightweight feature extraction backbone of MobileNet to provide feature encoding for both the semantic segmentation branch and the multi-label classification branch. The semantic segmentation branch adopts a skip-layer connection structure with multi-scale feature maps, and an attention mechanism is incorporated into the classification branch to fuse the feature maps from the segmentation branch. This achieves tongue segmentation and multi-label classification tasks in a computationally efficient environment. The model achieves a pixel accuracy of 85.3% in semantic segmentation and an accuracy of 95.6% in multi-label classification. The network’s forward propagation speed on edge computing platforms reaches 7 frames per second (FPS). The proposed lightweight network backbone multi-task network model ensures a significant improvement in processing efficiency while maintaining the accuracy of segmentation and classification tasks. Additionally, the model exhibits advantages in terms of quantity and scale, saving both storage and computational resources. It not only enhances the accuracy and efficiency of tongue image real-time analysis in edge computing scenarios but also reduces the processing time, providing excellent precision and inference speed.
... The task of tongue segmentation presents certain difficulties due to the rich morphological and color features of tongues, as well as the similarity in color between tongues and nearby regions (lips, chin). Several studies have employed deep neural network (DNN) based methods to address tongue segmentation problems and achieved remarkable experimental results [2]-[6]. However, these methods exhibit limited performance on datasets with distributions different from the training set [7]. ...
... With the rapid development of deep neural networks, a growing number of studies have focused on leveraging deep semantic segmentation methods for tongue segmentation. [2] presented an end-to-end model called TongueNet, which employed multitask learning for tongue segmentation. This model effectively leverages pixel-level prior information to achieve superior performance in tongue segmentation. ...
Preprint
Tongue segmentation serves as the primary step in automated TCM tongue diagnosis and plays a significant role in the diagnostic results. Currently, numerous deep learning based methods have achieved promising results. However, most of these methods exhibit mediocre performance on tongues that differ from the training set. To address this issue, this paper proposes a universal tongue segmentation model named TongueSAM based on SAM (Segment Anything Model). SAM is a large-scale pretrained interactive segmentation model known for its powerful zero-shot generalization capability. Applying SAM to tongue segmentation enables zero-shot segmentation of various types of tongue images. In this study, a Prompt Generator based on object detection is integrated into SAM to enable an end-to-end automated tongue segmentation method. Experiments demonstrate that TongueSAM achieves exceptional performance across various tongue segmentation datasets, particularly under zero-shot conditions. TongueSAM can be directly applied to other datasets without fine-tuning. As far as we know, this is the first application of a large-scale pretrained model to tongue segmentation. The project and pretrained model of TongueSAM are publicly available at https://github.com/cshan-github/TongueSAM.
... Supervised learning-based tongue body segmentation approaches are also reported using support vector machine [29], AdaBoost algorithm [30], cascaded convolutional neural networks (CNN) [31], ResNet [32], SegNet [33], fully convolutional network [34,35], U-Net [36][37][38], iterative transfer learning [39], feature pyramid network [40], and patch-driven sparse representation [41]. These supervised learning approaches showed promising tongue segmentation performance by emphasizing producing more accurate segmentation. ...
... We performed segmentation of the cytoplasm and nucleus for Datasets 1 and 2, whereas Dataset 3 has only cytoplasm annotations (for basophils), so we performed segmentation of the cytoplasm only. The evaluation metrics used for MIF-Net include precision (PRE) [25], misclassification error (ME) [26], the dice coefficient (DI) [27], mean intersection over union (mIoU) [28], false-positive rate (FPR) [29], and false-negative rate (FNR) [11]. Mathematical expressions of the evaluation measures are shown in equations (4)-(9). ...
Research
Full-text available
Exploiting the multi-scale Information Fusion Capabilities for Leukemia Diagnosis Through White Blood Cells Assessment