a Corresponding author: zhangdaqing_925@163.com
Research on Traffic Acoustic Event Detection Algorithm Based on Sparse
Autoencoder
Xiaodan Zhang1,a, Yongsheng Chen1 and Guichen Tang2
1Research Institute of Highway, Ministry of Transport, Beijing 100088, China
2School of Communication Engineering, Nanjing Institute of Technology, Nanjing 211167, Jiangsu, China
Abstract. Road traffic monitoring is very important for intelligent transportation. The detection of traffic state based on acoustic information is a new research direction. A vehicle acoustic event classification algorithm based on a sparse autoencoder is proposed to analyze the traffic state. First, multidimensional Mel-cepstrum features and energy features are extracted to form a 110-dimensional feature vector. Second, a five-layer autoencoder is trained on the computed features. Finally, vehicle audio samples are collected and the trained autoencoder is tested. The experimental results show that the detection rate of traffic acoustic events reaches 94.9%, which is 12.3% higher than that of the traditional Convolutional Neural Network (CNN) algorithm.
1 Introduction
To support research on intelligent transportation and traffic safety, an autoencoder based on acoustic features is proposed for traffic event detection. To capture the dynamic characteristics of the sound, the algorithm extracts multidimensional Mel-cepstrum features and energy features and forms a 110-dimensional feature vector. A five-layer autoencoder is then trained on the computed features to improve robustness. The experimental results show that the detection rate of traffic acoustic events reaches 94.9%, and the recognition rate of collision sounds reaches 97.9%. The proposed audio surveillance system may be used to monitor traffic accidents and save valuable time in rescue missions. In addition, the system can be embedded in an automatic driving system, enabling a self-driving car to respond to the traffic state in time and greatly improving safety.
The detection of traffic state based on acoustic information has become an important research direction for intelligent transportation. Compared to existing monitoring techniques, acoustic signal processing and classification techniques have the advantage of being low cost and unaffected by lighting conditions. Especially under insufficient light or intermediate obstructions, acoustic signals provide higher information coverage. They are therefore an important supplement to existing monitoring methods. However, compared with the laboratory environment, the real traffic environment is complex. For example, a tunnel is a special traffic environment that differs greatly from the open road. How to effectively process traffic acoustic data remains a challenge. In 1998, Henryk Maciejewski et al. [1] designed a classification system based on wavelets and neural networks; a recognition model based on the sound signal was constructed for four different vehicles, and the recognition accuracy was 73.68%. Audi Ovox et al. applied sound recognition technology to the field of intelligent transportation [2] and used voice recognition technology in car phones. Xianglong Luo et al. [3] used empirical mode decomposition (EMD) and support vector machines (SVM) to identify the vehicle state. In recent years, some scholars have applied convolutional neural networks (CNN) to recognize sound events [4]. Compared with traditional classifiers, convolutional neural networks greatly improve the recognition rate and recognition speed. The ConvNet model [5] improved accuracy by nearly 20% on the ESC-50 database. The LSTM+CNN model proposed by Bae et al. achieved an 84.1% accuracy rate in the DCASE2016 competition [6]. However, there are still problems when the traditional CNN model is applied to sound event recognition. The CNN adopts a serial stack structure: the convolutional layers transform low-level feature maps into high-level feature maps layer by layer. Such a network structure causes low-level feature information to be lost from the final extracted features.
To address the above problems, a sparse autoencoder based on acoustic features is proposed to classify the vehicle state. The original acoustic features are multidimensional Mel-cepstrum features and energy features. The autoencoder generates encoded features from the original acoustic features to classify the vehicle state, which improves the robustness of the algorithm. To verify the performance of the proposed algorithm, we collected a total of 829 samples covering three types of events: engine running (normal driving), braking and crashing. Compared with four baseline algorithms, the recognition rate of the proposed method reaches 94.9%, which is 11.45% higher than that of the traditional classification algorithms and 12.3% higher than that of the traditional CNN algorithm.
2 Acoustic Features
For traffic acoustic event detection, three types of vehicle states are important: the engine running (normal driving) state, the brake state and the crash state. The waveforms and spectrograms of the three kinds of signals are shown in Figure 1. The three states show some obvious differences. For example, the waveform of the running signal is flat and its frequency content is below 800 Hz. In addition, the waveform and the spectrogram of the crash signal change abruptly. Based on this analysis, valid features should be selected to reflect these differences.
Figure 1. The representation of the three types of signals
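As an illustration of this inspection, the short sketch below plots the waveform and spectrogram of a single recording in the manner of Figure 1. It is a minimal sketch assuming Python with librosa and matplotlib; the file name crash_example.wav is hypothetical.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load one recording at the 16 kHz rate used in the paper.
y, sr = librosa.load("crash_example.wav", sr=16000)  # hypothetical file

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(np.arange(len(y)) / sr, y)
ax1.set(xlabel="Time (s)", ylabel="Amplitude", title="Waveform")

# 512-point STFT magnitude in dB, matching the FFT size used later.
S = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=512)), ref=np.max)
img = librosa.display.specshow(S, sr=sr, hop_length=512 // 4,
                               x_axis="time", y_axis="hz", ax=ax2)
ax2.set(title="Spectrogram")
fig.colorbar(img, ax=ax2, format="%+2.0f dB")
plt.tight_layout()
plt.show()
```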
Among many acoustic features, the Mel-cepstral feature is widely used in sound event classification and speaker recognition. The Mel-cepstrum is a spectral feature calculated based on the non-linear relationship between the human ear's auditory characteristics and signal frequency. However, the standard Mel-cepstral coefficients only reflect the static characteristics of the signal. Its dynamic characteristics can be described by the differential spectrum of these features: the extracted features are differentiated to obtain information such as the type and speed changes of the acoustic event. Combining the Mel-cepstral features with their differences improves the recognition performance of the system.
In addition, different types of sounds have different energy values and trends, so short-time energy and its statistical features are also computed. Table 1 describes the adopted 110 features.
Table 1. List of Acoustic Features.
No.    | Features
1-95   | Mel-Frequency Cepstral Coefficients (MFCC-0 to MFCC-12), averages of their 1st and 2nd order differences, their maximums, minimums, ranges, and standard deviations.
96-110 | Short-term energy, averages of its 1st and 2nd order differences, its maximum, minimum, range, and standard deviation.
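A hedged sketch of this feature computation follows, assuming Python with librosa and NumPy. Table 1 does not fully specify how its 95 MFCC statistics and 15 energy statistics are grouped, so the exact dimensionality of this sketch differs slightly from 110.

```python
import numpy as np
import librosa

def acoustic_features(y, sr=16000, n_fft=512):
    """Compute MFCC and energy statistics in the spirit of Table 1.
    The grouping into 95 + 15 entries is not fully specified in the
    paper, so the dimensionality here is only approximate."""
    hop = n_fft // 2                               # 50% frame overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)      # 1st-order differences
    d2 = librosa.feature.delta(mfcc, order=2)      # 2nd-order differences
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    energy = np.sum(frames.astype(float) ** 2, axis=0, keepdims=True)

    def stats(m):                                  # per-row time statistics
        return np.concatenate([m.mean(axis=1), m.max(axis=1),
                               m.min(axis=1), np.ptp(m, axis=1),
                               m.std(axis=1)])

    return np.concatenate([stats(mfcc), d1.mean(axis=1), d2.mean(axis=1),
                           stats(energy),
                           np.diff(energy, n=1).mean(axis=1),
                           np.diff(energy, n=2).mean(axis=1)])
```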
3 Autoencoder for traffic acoustic event
detection
The autoencoder is a neural network composed of several hidden layers and is trained with an unsupervised learning algorithm. By setting the target value equal to the input, a general representation of the data is obtained [7]. As shown in Figure 2, the autoencoder includes two parts: an encoder and a decoder. The hidden layers on the left of the graph form the encoder; the hidden layers on the right form the decoder, whose output has the same dimension as the encoder's input. The output of the middle layer in Figure 2 is the coded feature. In the process of reconstructing the input, the data distribution is learned through the encoding and decoding process. Finally, the data is compressed and coded, so more compact features [8] are obtained.
Figure 2. The architecture of the autoencoder
For an autoencoder, $x_i$ is the input, $y_i$ is the output and $\theta$ represents the parameters. The input $x_i$ is the 110-dimensional audio feature vector we extracted, and the output $y_i$ is the optimized feature after the sparse autoencoder, which is also 110-dimensional. During training, the goal of parameter optimization is:

$$\min_{\theta} \sum_{i=1}^{N} \left\| x_i - y_i \right\|^2 \qquad (1)$$
By limiting the expected activation of the hidden units to be sparse, a regularization term [9] is added to penalize deviations between the expected activation degree of the hidden units and the target sparsity $\rho$. Equation (1) then becomes:

$$\min_{\theta} \sum_{i=1}^{N} \left\| x_i - y_i \right\|^2 + \beta \sum_{j=1}^{m} \mathrm{sp}(\rho \,\|\, \hat{\rho}_j) \qquad (2)$$
where $\mathrm{sp}(\rho \,\|\, \hat{\rho}_j)$ is the sparsity penalty term, computed as:

$$\mathrm{sp}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \qquad (3)$$
Here, $\hat{\rho}_j$ is the average activation degree of the $j$-th hidden neuron over the training data, $\rho$ is the target sparsity, $\beta$ is the penalty coefficient, and $m$ is the number of hidden neurons. With the sparsity limitation [10], most of the neurons in the autoencoder are suppressed and only a small number are active, which reduces the redundancy [11] of the network and increases the robustness of the model.
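Equations (2) and (3) translate directly into code. The sketch below assumes PyTorch; the values of rho and beta are illustrative since the paper does not report them, and the average activation is clamped because ReLU activations, unlike sigmoid units, are not confined to (0, 1).

```python
import torch

def sparsity_penalty(h, rho=0.05, eps=1e-8):
    # rho_hat_j: average activation of hidden unit j over the batch.
    # Clamping keeps the logarithms defined under ReLU activations.
    rho_hat = h.mean(dim=0).clamp(eps, 1.0 - eps)
    # Eq. (3): KL divergence between the target sparsity rho and rho_hat.
    kl = rho * torch.log(rho / rho_hat) \
         + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

def sae_loss(x, y, h, beta=1e-3, rho=0.05):
    # Eq. (2): reconstruction error plus the weighted sparsity penalty.
    recon = ((x - y) ** 2).sum(dim=1).mean()
    return recon + beta * sparsity_penalty(h, rho)
```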
Table 2. Parameter Settings.
Component | No. | Layer                 | #Neurons | Activation
Encoder   | 1   | Fully Connected Layer | 110      | ReLU
          | 2   | Fully Connected Layer | 96       | ReLU
          | 3   | Fully Connected Layer | 64       | ReLU
Decoder   | 4   | Fully Connected Layer | 96       | ReLU
          | 5   | Fully Connected Layer | 110      | -
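One possible PyTorch reading of Table 2 is sketched below. Treating layer 1 as a 110-unit fully connected layer applied to the 110-dimensional input is an assumption; the table could also be read as simply listing the input dimension.

```python
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Five fully connected layers per Table 2 (110-96-64-96-110)."""
    def __init__(self, dim=110):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),   # layer 1: 110 units (assumed)
            nn.Linear(dim, 96), nn.ReLU(),    # layer 2: 96 units
            nn.Linear(96, 64), nn.ReLU(),     # layer 3: 64 units (the code)
        )
        self.decoder = nn.Sequential(
            nn.Linear(64, 96), nn.ReLU(),     # layer 4: 96 units
            nn.Linear(96, dim),               # layer 5: 110 units, no activation
        )

    def forward(self, x):
        h = self.encoder(x)    # latent code, later fed to the classifier
        return self.decoder(h), h
```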
The five-layer autoencoder is constructed to learn the latent feature representation of traffic acoustic events; its parameter settings are shown in Table 2. The workflow of the proposed autoencoder is shown in Fig. 3. The training process is shown in Fig. 3(a). First, the autoencoder is trained to learn the latent representation of the original acoustic features from the training set. The learning strategy is to reconstruct the input signals to obtain the latent representation, using gradient descent to minimize the error between the reconstructed signal and the input signal. When the deviation falls below the set threshold, the trained autoencoder is obtained. Then, a softmax classifier is added and the deviation between its output and the true labels is computed. The gradient of this deviation is calculated and the back-propagation algorithm is used to fine-tune the parameters of each layer. After training is finished, the performance of the trained autoencoder can be estimated on the test data. The test flow is shown in Fig. 3(b).
(a) The training process
(b) The test process
Figure 3. The workflow of the proposed model
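The two-stage procedure of Fig. 3(a) can be sketched as follows, reusing SparseAutoencoder and sae_loss from the earlier sketches. The paper trains with Caffe; PyTorch stands in here, train_loader is a hypothetical DataLoader of (feature, label) batches, and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

model = SparseAutoencoder()                    # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: unsupervised reconstruction with the sparse loss (Fig. 3a).
for epoch in range(100):
    for x, _ in train_loader:                  # labels unused in this stage
        opt.zero_grad()
        y, h = model(x)
        sae_loss(x, y, h).backward()
        opt.step()

# Stage 2: attach a softmax classifier to the 64-d code and fine-tune.
clf = nn.Linear(64, 3)                         # running / brake / crash
criterion = nn.CrossEntropyLoss()              # softmax + NLL in one step
opt = torch.optim.Adam(
    list(model.encoder.parameters()) + list(clf.parameters()), lr=1e-4)
for epoch in range(50):
    for x, labels in train_loader:
        opt.zero_grad()
        criterion(clf(model.encoder(x)), labels).backward()
        opt.step()
```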
4 Experimental analysis
4.1 Experimental data and parameter setting
In this experiment, a total of 829 samples were collected for three types of sound: engine running (normal driving), braking and crashing. Among them, there are 442 engine running (normal driving) sounds, 176 brake sounds and 211 crash sounds.
To improve the effectiveness of the algorithm, voice activity detection is first applied to each sample to remove silent segments. The sample is then resampled at 16000 Hz. Next, the sample is framed and the FFT is computed; the number of FFT points is 512 and the frame overlap rate is 50%.
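A minimal sketch of this preprocessing chain, assuming librosa; the energy-based librosa.effects.split is a stand-in, since the paper does not specify which voice activity detector it uses.

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, n_fft=512):
    y, _ = librosa.load(path, sr=sr)            # resample to 16 kHz
    # Energy-based silence removal (a stand-in for the paper's VAD).
    intervals = librosa.effects.split(y, top_db=30)
    y = np.concatenate([y[s:e] for s, e in intervals])
    # Frame with 50% overlap and compute the 512-point FFT magnitude.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2))
    return spec
```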
The widely used deep learning framework Caffe is used to build and train the network. Because the choice of hyperparameters in neural networks has a great impact on training and on the convergence of the network, the final network hyperparameters were determined through multiple experiments and comparisons.
The comparison classifiers and their parameters are: 1) random forest [12], with a maximum depth of 6 and 100 base estimators; 2) ConvNet [13], with two convolutional layers using kernels of (110, 6) and (1, 3) and a fully connected layer of 1,000 neurons; 3) k-nearest neighbors (KNN); 4) support vector machine (SVM).
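For reference, a hedged scikit-learn configuration of the traditional baselines follows. The random-forest settings are those stated above; the KNN and SVM parameters are not reported in the paper, so library defaults stand in, and the CNN/ConvNet baselines are omitted here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Baseline classifiers; only the random-forest settings come from the paper.
baselines = {
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=6),
    "knn": KNeighborsClassifier(),   # defaults: parameters not reported
    "svm": SVC(),                    # defaults: parameters not reported
}
```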
4.2 Comparison with state-of-the-art recognition algorithms
The experiments evaluate the algorithm's performance using five-fold cross-validation. For a more valid assessment, Random Forest [14], KNN [15], CNN [16] and ConvNet were run through the same experiments and comparisons on the same dataset.
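A minimal sketch of the five-fold cross-validation harness, assuming scikit-learn, a feature matrix X of shape (829, 110), and integer labels y for the three classes; evaluate is a hypothetical helper name.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(clf, X, y, seed=0):
    # Stratified folds keep the class proportions of the 829 samples.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return np.mean(cross_val_score(clf, X, y, cv=cv))

# Example usage with the baselines defined above:
# for name, clf in baselines.items():
#     print(name, evaluate(clf, X, y))
```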
Table 3 shows the experimental results for the five classifiers on the data set. Compared with random forest and KNN, the autoencoder improves the accuracy by 12.2% and 17.4%, respectively. The performance of the autoencoder is thus better than that of the traditional classifiers. One reason is that the data samples are comprehensive, with recordings from various environments, while the generalization ability of the traditional classifiers is poor. In addition, data analysis shows that the traditional classifiers have the lowest recognition rate for the brake event category and the highest for the driving event category, followed by the collision category. This may also be caused by the small number of brake samples; more data needs to be collected for further verification. Compared with the traditional CNN and ConvNet, the recognition accuracy is improved by 12.3% and 3.9%, respectively. A possible reason is that the autoencoder uses encoded features to classify the state, and these encoded features are more robust than the original features.
Table 3. Recognition rate of five models on the data set.
Classifier      | Overall recognition rate | Collision recognition rate
Random Forests  | 82.7%                    | 84.8%
DNN             | 77.5%                    | 85.6%
CNN             | 82.6%                    | 90.1%
ConvNet         | 91.0%                    | 92.3%
Proposed method | 94.9%                    | 97.9%
In addition, Table 3 compares the recognition rates of the various algorithms for collision sounds. The experimental results show that the proposed algorithm's recognition rate for collision sounds is 3% higher than its overall recognition rate, reaching 97.9%. This shows that the algorithm can be effectively applied in a traffic incident warning system to raise timely alarms for traffic accidents based on traffic state detection.
5 Conclusions
To support research on intelligent transportation and traffic safety, an autoencoder based on acoustic features has been proposed for traffic event detection. To capture the dynamic characteristics of the sound, the algorithm extracts multidimensional Mel-cepstrum features and energy features and forms a 110-dimensional feature vector. A five-layer autoencoder is then trained on the computed features to improve robustness. The experimental results show that the detection rate of traffic acoustic events reaches 94.9%, and the recognition rate of collision sounds reaches 97.9%. The proposed audio surveillance system may be used to monitor traffic accidents and save valuable time in rescue missions. In addition, the system can be embedded in an automatic driving system, enabling a self-driving car to respond to the traffic state in time and greatly improving safety.
Acknowledgments
The work was supported by Science and Technology
Innovation Project of Research Institute of Highway,
Ministry of Transport (2018-E0021).
References
1. Maciejewski Henryk, Mazurkiewicz Jacek, Skowron
Krzysztof, Walkowiak Tomasz, Neural Networks for
Vehicle Recognition, in Proc. of the 6th International
Conference on Microelectronics for Neural Networks,
Evolutionary and Fuzzy Systems, 1998, pp.292–296.
2. Zhang Dianye, Jin Jian, Guo Zhizheng. Exploration into Road Traffic Accident Prevention Research System. China Safety Science Journal, Vol.17, No.7, 2007, pp.132-138.
3. Luo Xiang Long, Niu. Vehicle recognition by
acoustic signals based on EMD and SVM. Applied
Acoustics, Vol.29, No.3, 2010, pp.178-183.
4. Salamon Justin, Bello Juan. Deep Convolutional
Neural Networks and Data Augmentation for
Environmental Sound Classification. IEEE Signal
Processing Letters, Vol. 99, 2016, pp.1-4.
5. Piczak Karol J. Environmental sound classification
with convolutional neural networks. IEEE
International Workshop on Machine Learning for
Signal Processing, 2015, pp.1-4.
6. Bae S H, Choi I, Kim N S. Acoustic scene classification using parallel combination of LSTM and CNN, Proceedings of the Detection and Classification of Acoustic Scenes and Events, 2016, pp.11-15.
7. Xu J, Xiang L, Liu Q, et al. Stacked Sparse
Autoencoder (SSAE) for Nuclei Detection on Breast
Cancer Histopathology Images, IEEE Transactions
on Medical Imaging, Vol.35, No.1, 2016, pp.119-130.
8. Chandar A P S, Lauly S, Larochelle H, et al. An
autoencoder approach to learning bilingual word
representations, International Conference on Neural
Information Processing Systems. MIT Press, 2014,
pp.1853-1861.
9. Goodfellow I J, Le Q V, Saxe A M, et al. Measuring
invariances in deep networks, International
Conference on Neural Information Processing
Systems, 2009, pp.646-654.
10. Mairal J, Bach F, Ponce J, et al. Online Learning for
Matrix Factorization and Sparse Coding, Journal of
Machine Learning Research, Vol.11, No.1, 2009,
pp.19-60.
11. Hinton G E, Salakhutdinov R R. Reducing the
Dimensionality of Data with Neural Networks,
Science, 313(5786), 2006, pp.504-507.
12. Phan Huy, Maaß Marco, Mazur Radoslaw, Mertins Alfred. Random Regression Forests for Acoustic Event Detection and Classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.23, No.1, 2015, pp.20-31.
13. Garcia-Pedrajas N, Hervas-Martinez C, Munoz-Perez
J. COVNET: a cooperative coevolutionary model for
evolving artificial neural networks. IEEE
Transactions on Neural Networks, Vol.14, No.3,
2003, pp.575-596.
14. Pal, M. Random forest classifier for remote sensing
classification. International Journal of Remote
Sensing, 2005, 26(1):217-222.
15. Zhang M L, Zhou Z H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 2007, 40(7):2038-2048.
16. Wei Y, Zhao Y, Lu C, et al. Cross-Modal Retrieval With CNN Visual Features: A New Baseline. IEEE Transactions on Cybernetics, 2017, 47(2):449-460.