Semantic Event Detection Using Ensemble Deep Learning
Samira Pouyanfar and Shu-Ching Chen
School of Computing and Information Sciences
Florida International University
Miami, FL 33199, USA
{spouy001, chens}@cs.fiu.edu
Abstract—Numerous deep learning architectures have been
designed for a variety of tasks in the past few years. However,
it is almost impossible for one model to work well for all
kinds of scenarios and datasets. Therefore, we present an
ensemble deep learning framework in this paper, which not
only decreases the information loss and over-fitting problems
caused by single models, but also overcomes the imbalanced
data issue in multimedia big data. First, a suite of deep learning algorithms is utilized for deep feature extraction.
Thereafter, an enhanced ensemble algorithm is developed
based on the performance of each single Support Vector
Machine classifier on each deep feature set. We evaluate our
proposed ensemble deep learning framework on a large and
highly imbalanced video dataset containing natural disaster
events. Experimental results demonstrate the effectiveness of
the proposed framework for semantic event detection, and
show how it outperforms several state-of-the-art deep learning
architectures, as well as handcrafted features integrated with
ensemble and non-ensemble algorithms.
Keywords-Deep learning; Ensemble learning; Imbalanced
data; Semantic event detection; Multimedia big data.
I. INTRODUCTION
Over the last decade, social networks and multimedia
sources such as Twitter, YouTube, and Facebook have gener-
ated a significant amount of digital data. For example, over
one hundred hours of video are uploaded to YouTube every minute. Owing to this multimedia data explosion, as well as its rich and significant content, multimedia is considered a valuable source of data in many research studies [1],
[2]. Video semantic event detection is one of the main
applications of multimedia management systems. Recently,
many researchers have tried to detect the most interesting
events and concepts from videos [3], [4]. Criminal event
detection from video and audio data, natural disaster retrieval
from video data, and interesting event detection in a sport
game are a few examples of video semantic event detection.
However, several challenges need to be addressed in multimedia semantic analysis, including how to analyze such a huge volume and variety of data in an efficient manner, and how to handle data with a non-uniform distribution. The latter is known as the imbalanced data
problem, which has been commonly seen in video event
detection scenarios. For example, suppose one is looking for
video shots containing natural disaster information among
thousands of videos on YouTube, where metadata and textual information may not be reliable and accurate. This example shows the skewed distribution between the major class (non-disaster video shots) and the minor or interesting class (shots containing disaster information). This rareness of interesting
events in videos makes the detection task more challenging.
Currently, the class imbalance issue has been studied by
many researchers in the literature [5], [6], [7]. Nevertheless,
conventional learning approaches are still biased toward the
majority classes.
In the past few years, deep learning has attracted lots of
attention in both academia and industry [8], [9], [10]. It
is one of the significant breakthrough techniques in data
mining and machine learning [11]. Using a cascade of layers in a deep graph architecture, composed of multiple linear and non-linear transformations, deep learning intends to model very high-level data abstractions.
The recent explosion of deep learning studies has led to
significant advances and improvements in multimedia man-
agement systems. Although deep learning techniques have
been applied to lots of research studies in recent years,
there is still limited work focusing on the imbalanced data
problem in multimedia data.
Multi-classifier fusion is another hot topic in data mining
and machine learning because a single classifier can hardly be applied to all scenarios and usually cannot handle imbalanced and big multimedia data due to over-fitting,
information loss, and additional bias [12]. Inspired by the
fast progress and achievements of deep learning, this paper
leverages deep learning techniques for video feature analysis
with the application to semantic event detection. In addition,
due to the great success of ensemble learning techniques in
machine learning, an enhanced ensemble deep learning framework is proposed in this paper to improve event detection in imbalanced multimedia data.
The remainder of the paper is organized as follows. In Section II, an overview of the state-of-the-art research in imbalanced multimedia analysis is provided. Section III discusses the details of the proposed ensemble deep learning framework. In Section IV, a comprehensive experimental analysis is presented. Lastly, we conclude the paper in Section V.
II. RELATED WORK
Imbalanced data has been widely seen in many real
world applications [5], such as activity recognition, cancer
prediction, banking fraud detection, and video mining [1],
[13], to name a few. The imbalanced data solutions can be
grouped into three categories [14]: (1) Sampling methods
which modify data distributions in a way to generate more
balanced data, (2) Kernel-based and active learning methods, which utilize robust classification techniques to naturally handle imbalanced learning, and (3) Cost-sensitive methods, which apply a penalty for misclassifying instances from one class to another. Among them, the integration of ensemble learning techniques with imbalanced data solutions has shown significant success in recent years [12], [15].
Another hot topic in multimedia analysis is how to employ several feature extraction techniques to improve the
final detection results. Ha et al. [16] presented a multi-
modality fusion technique for multimedia semantic retrieval.
Specifically, the correlation between feature pairs is calcu-
lated to reduce the feature space by eliminating features
with low correlation toward others. Thereafter, features are
grouped using Hidden Coherent Feature Groups (HCFGs)
technique [17]. Finally, multiple classifiers are trained for all
feature groups and the scores generated by each classifier
are fused for final event retrieval. In another work, Liu
et al. [18] proposed a new feature representation method
by integrating spatial and temporal information from video
sequences. Using the optical flow field and Harris3D corner
detector, as well as a boosting ensemble algorithm based on
two classifier models (sparse representation and Hamming
distance classifiers), the authors successfully improved the
performance of human action recognition.
Deep learning is not a new topic and has a long history
in artificial intelligence [11]. Convolutional Neural Networks
(CNNs) [19], for instance, improved traditional feed-forward neural networks in the 1990s, especially in image pro-
cessing, by constraining the complexity of networks using
local weight sharing topology. Traditional neural network
techniques are difficult to interpret due to their black-box
nature and they are also very prone to over-fitting [9].
In contrast, new deep learning algorithms are more inter-
pretable because of their strong local modeling. In addition,
as new ideas, algorithms, and network architectures have
been designed in the last few years, deep learning has shown
significant advances mainly in image recognition and object
detection.
As a single classifier may not be able to handle large
datasets with multiple feature sources, ensemble algorithms
have attracted lots of attention in the literature, which can
be utilized to enhance the classification performance by taking advantage of multiple classifiers. A positive enhanced ensemble algorithm which handles imbalanced data in video event retrieval is presented in [12].

[Figure 1: The proposed ensemble deep learning framework]

Their proposed framework combines a sampling-based method with a classifier fusion
algorithm to enhance the detection of interesting events
(minor classes) in an imbalanced sport video dataset. An
ensemble neural network is proposed in [20]. Using a boot-
strapped sampling approach along with a group of neural
networks, the rare event issue is alleviated. The framework
was also evaluated using a large set of soccer videos with
the purpose of corner event detection.
III. ENSEMBLE DEEP LEARNING FRAMEWORK
In this paper, we propose a mixture of deep learning
feature extractors integrated with an enhanced ensemble
algorithm. The framework is shown in Figure 1, which
not only improves the performance of event detection from
videos, but also avoids over-fitting and information loss.
The proposed framework is divided into three main mod-
ules: (1) preprocessing, (2) deep feature extraction, and (3)
classification including training, validation, and testing.
A. Preprocessing
As the preprocessing module is domain specific, different
routines may be needed for different applications, such as
audio, image, video, and textual analysis. In this study, we
apply an automatic shot boundary detection approach [21] on
the raw video. This unsupervised algorithm is mainly based
on object tracking and image segmentation techniques.
After all shots are obtained from the raw videos, the first
frame of each shot is selected as a keyframe because it is
the most distinctive one.
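To make this step concrete, the following is a minimal Python sketch of the keyframe selection described above, assuming the shot boundaries have already been produced by the detector in [21] as a list of starting frame indices; OpenCV is used here purely for illustration.

import cv2

def extract_keyframes(video_path, shot_start_indices):
    # Select the first frame of each detected shot as its keyframe.
    # shot_start_indices is assumed to come from the shot boundary
    # detector [21]; the detector itself is not reproduced here.
    cap = cv2.VideoCapture(video_path)
    keyframes = []
    for idx in shot_start_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the shot start
        ok, frame = cap.read()
        if ok:
            keyframes.append(frame)
    cap.release()
    return keyframes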
B. Deep Feature Extraction
Deep learning is an emerging research topic that has advanced tremendously during the last five years. One of the main applications of deep learning is generating useful and discriminative features from raw data.
In the last decade, researchers have developed various hand-
crafted features for visual recognition tasks [22]. HOG [23],
CEDD [24], and SIFT [25] are a few examples of powerful features that have been widely used in computer vision. However, progress on handcrafted features slowed down during 2010-2012, while new deep learning architectures have greatly raised performance levels [26]. There-
fore, in this paper, we apply various rich and deep feature
extraction models based on the CNN algorithm.
CNNs [19] are variations of MultiLayer Perceptron (MLP)
networks with the difference in their local connections.
The main idea is to have a locally connected network, inspired by the localized biological neurons in the animal visual cortex, which contains a complex set of cells that locally filter input data to extract rich and deep spatially-local correlations in images. A convolutional network generally
includes three main layers: (a) stacked convolutional layers,
(b) sub-sampling or pooling layers, and (c) fully connected
layers [27] as shown in the deep feature extraction module
in Figure 1. In the convolutional layer, a number of feature
maps are generated by iteratively applying a function across local regions of the whole input. In other words, the input data is convolved with linear filters followed by nonlinear activation functions. The $k$th feature map at a given layer is denoted as $x_{ij}^{k}$ (given in Equation 1), where $i$ and $j$ are the input dimensions, $x_{ij}^{k-1}$ is the input data from the previous layer, $f$ is an activation function (e.g., sigmoid, tanh, etc.), and the filters of the $k$th layer are determined by $W_{ij}^{k}$ (weights) and $b_{j}^{k}$ (bias). A pooling layer is a nonlinear down-sampling, and is located after each convolutional layer. It reduces the number of feature maps by introducing sparseness and provides additional robustness to the network. This layer takes a small block from the previous convolutional layer and produces a single output as shown in Equation 2, where $down(\cdot)$ is a subsampling function (e.g., max, average, etc.) and $\beta_{ij}^{k}$ is a multiplicative bias.

$$x_{ij}^{k} = f\big( (W_{ij}^{k} * x_{ij}^{k-1}) + b_{j}^{k} \big); \qquad (1)$$

$$x_{ij}^{k} = f\big( \beta_{ij}^{k} \, down(x_{ij}^{k-1}) + b_{j}^{k} \big). \qquad (2)$$
The last layer of CNNs is called fully-connected layer
which is responsible for the high-level reasoning in the
network. Similar to regular neural networks, in this layer,
all activations in the previous layer are fully connected to
a single neuron. The set of all feature maps at the last
convolutional-subsampling layers are the input to the first
fully connected layer.
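As an illustration of Equations 1 and 2, the following NumPy sketch computes a single feature map for a single-channel input; the filter W, the biases, and the 2x2 max-pooling block are hypothetical choices, not the configuration of any particular network in this paper.

import numpy as np

def conv_feature_map(x_prev, W, b, f=np.tanh):
    # Equation 1: slide filter W over the previous layer's map,
    # add bias b, and apply the nonlinearity f elementwise.
    h = x_prev.shape[0] - W.shape[0] + 1
    w = x_prev.shape[1] - W.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = x_prev[i:i + W.shape[0], j:j + W.shape[1]]
            out[i, j] = f(np.sum(W * patch) + b)
    return out

def pool_feature_map(x_prev, beta, b, block=2, f=np.tanh):
    # Equation 2: down-sample with max as down(.), scale by the
    # multiplicative bias beta, add b, and apply f.
    h, w = x_prev.shape[0] // block, x_prev.shape[1] // block
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            blk = x_prev[i * block:(i + 1) * block,
                         j * block:(j + 1) * block]
            out[i, j] = f(beta * blk.max() + b)
    return out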
In this paper, we utilize four advanced and successful
deep learning architectures for visual feature extraction as
explained below.
AlexNet [8]: the first work that made CNNs popular in image processing; it outperformed the runner-up by over 10% in ILSVRC 2012. The AlexNet architecture is very similar to traditional CNNs, but with larger, deeper, and stacked convolutional layers followed by pooling layers.
CaffeNet [28]: a replication of AlexNet with some improvements, developed and trained by the Berkeley Vision and Learning Center (BVLC). It is trained without the relighting data augmentation, and pooling is performed before normalization. This reference model is trained on the ImageNet dataset as explained in [29].
R-CNN [26]: mainly used for object detection tasks; it improved performance by over 30% compared to the previous best results on PASCAL VOC 2012. First, it generates candidate regions by leveraging bounding-box segmentation with low-level features, and then applies CNN classifiers to detect objects at those specific locations.
GoogleNet [10]: a deeper and wider network than AlexNet, developed by a Google team. It contains a 22-layer deep network which utilizes the extra sparsity of layers. This framework, also known as the “Inception architecture”, attempts to find an optimal local construction and repeat it spatially. It showed promising performance in ILSVRC 2014 on the ImageNet dataset by winning first place in both the object detection and classification tasks.
C. Classification
As ensemble methods alleviate the over-fitting problem
and increase the performance results, an enhanced ensemble
deep learning algorithm is proposed in this paper. After fea-
ture extraction, we analyze the extracted deep features and
measure the importance of each feature set. In addition, how to optimally integrate the trained models in an effective way is a key issue. For this purpose, we employ the proposed enhanced ensemble method to adjust the weight coefficients for the classification module (shown in Figure 1), which depicts the multi-layer architecture of our learning method. In this module, there are $k$ models, each trained on a feature set, and the performance of each model is considered to adjust the weights of the weak classifiers. It contains two
main steps: deep ensemble learning and testing.
1) Deep Ensemble Learning: Algorithm 1 illustrates the
training procedure of the proposed deep ensemble learn-
ing step. First, the dataset is split into three categories: training $T$, validation $V$, and testing $T'$. $T$ is defined as $T = \{(t_1, c_1), (t_2, c_2), \ldots, (t_N, c_N)\}$, where $t_i$ is the $i$th training instance, $N$ is the total number of training instances, and $c_i \in \{0, 1\}$ is the class label for a binary classification task. In addition, the feature sets extracted from all deep learning algorithms are stored in $Fr$, which is another input of the training algorithm.
The proposed ensemble learning is basically constructed based on a set of weak learners or models $M = \{M_j, j = 1, 2, \ldots, k\}$, as shown in Lines 1-3 of Algorithm 1, where $k$ is the total number of weak learners. In this paper, we utilize the linear Support Vector Machine (SVM) as the weak learner, which has been widely used for deep learning classification [30]. Each classifier model $M_j$ is trained using the training instances. After that, we evaluate each model using the validation set $V$, as shown in Lines 4-7 of Algorithm 1.
For this purpose, we utilize the F1 measure (the weighted
average value of precision and recall) which is a number
between 0 (the worst case) and 1 (the best case). Afterward,
using the $F1_j$ measure for each trained model $M_j$, the weight of each model is calculated using Equation 3.

$$W_j = \frac{F1_j}{\sum_{j=1}^{k} F1_j}. \qquad (3)$$
This weighting gives higher weights to the models that are more accurate on the validation set. Finally, the weight factors $W_j$ and the trained models $M_j$ are output for further classification analysis.
Algorithm 1 Training of Ensemble Deep Learning
Input: Training instances $T = \{(t_i, c_i), i = 1, 2, \ldots, N\}$, Validation instances $V = \{(v_i, c_i), i = 1, 2, \ldots, N_2\}$, Feature set $Fr = \{F_j, j = 1, 2, \ldots, k\}$.
Output: Weight matrix $W_j$, Trained models $M_j$.
1: for all $F_j \in Fr$ $(j = 1, \ldots, k)$ do
2:     $M_j \leftarrow$ SVM$(T, F_j)$;
3: end for
4: for all $F_j \in Fr$ $(j = 1, \ldots, k)$ do
5:     $F1_j \leftarrow$ VALIDATE$(V, F_j)$;
6:     $W_j = F1_j / \sum_{j=1}^{k} F1_j$;
7: end for
8: return $W_j$, $M_j$
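A minimal Python sketch of Algorithm 1 follows, using scikit-learn's LinearSVC as the weak learner; the per-model feature matrices and the train/validation split are assumed to be given, and the function names are ours, introduced only for illustration.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def train_edl(train_feats, y_train, val_feats, y_val):
    # train_feats / val_feats: lists of k feature matrices,
    # one per deep feature extractor (Lines 1-3 of Algorithm 1).
    models, f1s = [], []
    for F_train, F_val in zip(train_feats, val_feats):
        m = LinearSVC().fit(F_train, y_train)          # M_j <- SVM(T, F_j)
        models.append(m)
        f1s.append(f1_score(y_val, m.predict(F_val)))  # F1_j <- VALIDATE(V, F_j)
    f1s = np.asarray(f1s)
    weights = f1s / f1s.sum()                          # Equation 3 (Line 6)
    return models, weights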
2) Testing: In the testing step, a weighted sum of the weak learner results from the $k$ trained models is used to predict the label of each testing instance (as illustrated in Algorithm 2). The inputs of this step include the testing data $T'$ and the corresponding features $Fr$, as well as all the trained models $M_j$ and their assigned weights $W_j$. In Lines 2-4 of Algorithm 2, the labels $L_j$ $(j = 1, \ldots, k)$ generated by the $j$th weak learner are calculated for each testing instance. Then, the final predicted label $PL_i$ is generated as shown in Line 5 of Algorithm 2.
Algorithm 2 Testing of Ensemble Deep Learning
Input: Testing instances $T' = \{(t'_i), i = 1, 2, \ldots, N_3\}$, Feature set $Fr = \{F_j, j = 1, 2, \ldots, k\}$, Trained models $M_j$, Weight matrix $W_j$.
Output: Predicted labels $PL_i$.
1: for all $t'_i \in T'$ $(i = 1, \ldots, N_3)$ do
2:     for all $F_j \in Fr$ $(j = 1, \ldots, k)$ do
3:         $L_j \leftarrow M_j(t'_i, F_j)$;
4:     end for
5:     $PL_i = 1$ if $\sum_{j=1}^{k} L_j W_j \geq \frac{1}{2}$; $0$ otherwise
6: end for
7: return $PL_i$
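Continuing the sketch above, the testing step of Algorithm 2 reduces to a weighted vote thresholded at 1/2:

def predict_edl(models, weights, test_feats):
    # test_feats: list of k feature matrices for the same test instances.
    votes = np.stack([m.predict(F) for m, F in zip(models, test_feats)])  # L_j
    scores = weights @ votes            # sum_j W_j * L_j, per instance
    return (scores >= 0.5).astype(int)  # Line 5: PL_i = 1 iff score >= 1/2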
IV. EXPERIMENTAL ANALYSIS
A. Experimental Setup
In this paper, we evaluate our proposed framework using
the dataset described in [31] which contains about 80
YouTube videos. Seven different natural disaster events, including flood, damage, fire, mud-rock, tornado, lightning, and snow, are selected. Our purpose is to detect these events from a large set of video shots (almost 7000 shots), where the average ratio of positive to negative instances (P/N ratio) is 0.051, which shows the imbalanced data distribution in this dataset.
When the data is imbalanced, how to evaluate the framework is critical, because accuracy and similar criteria that aggregate the performance of both negative and positive classes are not reliable. Thus, we evaluate our framework using common metrics for imbalanced data: precision, recall, and the F1 measure.
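For reference, given ground-truth labels y_true and predicted labels y_pred (both hypothetical arrays here), these metrics can be computed with scikit-learn, treating the minority (positive) class as the target:

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two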
Caffe [28] is a convolutional framework for state-of-the-art deep learning approaches. In addition, it includes a set of pre-trained reference models, such as R-CNN, GoogleNet, and AlexNet, to name a few. In this paper, we extract several sets of semantic features from images using the well-known Caffe reference models. More specifically, four pre-trained deep learning reference models are utilized for feature analysis as explained in Section III-B. We extract features from the last fully-connected layer of each model. For example, layer “fc8” of CaffeNet and AlexNet, “loss3” of GoogleNet, and “fc-rcnn” of R-CNN are used as the output layers of our feature extractors. R-CNN generates 200-dimensional feature vectors, while the other three reference models each generate 1000-dimensional feature vectors.
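A hedged sketch of this extraction step using Caffe's Python interface is given below; the prototxt and caffemodel paths are placeholders, the images are assumed to be already preprocessed to the network's input shape, and the layer name must match the deployed model (e.g., “fc8” for CaffeNet).

import caffe
import numpy as np

def extract_deep_features(deploy_proto, weights, images, layer):
    # Forward each preprocessed image and read out the activations
    # of the requested fully-connected layer.
    net = caffe.Net(deploy_proto, weights, caffe.TEST)
    feats = []
    for img in images:  # img shape: (channels, height, width)
        net.blobs['data'].reshape(1, *img.shape)
        net.blobs['data'].data[0] = img
        net.forward()
        feats.append(net.blobs[layer].data[0].copy())
    return np.vstack(feats)

# e.g., 1000-D CaffeNet features (paths below are placeholders):
# X = extract_deep_features('deploy.prototxt',
#                           'bvlc_reference_caffenet.caffemodel',
#                           keyframe_arrays, 'fc8')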
B. Experimental Results
Our proposed Ensemble Deep Learning (EDL) framework
is compared with two sets of algorithms. The first group
uses the handcrafted features, such as HOG, CEDD, color
histogram, texture, and wavelet. In total, 707 visual features
are extracted from each keyframe. The second group uses
features generated by deep learning. We apply several clas-
sifiers such as Decision Tree (DT), Multiple Correspondence
Analysis (MCA) [31], and an ensemble algorithm called
Boosting for handcrafted features. We also use the SVM
classifier for the second group as it has shown a promising
performance when integrated with deep learning techniques. All the classifiers and deep learning approaches are tuned to reach their best results on our dataset, and they are evaluated through 3-fold cross-validation.
The average performance (precision, recall, and F1-score) of various feature sets integrated with different classifiers is shown in Table I. As can be inferred from the table, the proposed EDL not only improves the classification performance compared to all the well-known deep learning algorithms, but also beats all existing classifiers that utilize engineered features. In the first group (handcrafted features), the ensemble algorithm (Boosting) and SVM show the highest results in terms of F1-score, while in the second group (deep learning features), AlexNet has the highest F1-score. Although SVM has the highest precision compared to the other algorithms, its low recall decreases its overall performance. Therefore, we utilize this classifier, together with the proposed ensemble method, to improve the overall performance using deep feature sets.
A visualized performance comparison is also shown in
Figure 2. In this plot, the F1 score of each deep learning
algorithm on each disaster event is depicted. As can be
seen from the figure, the proposed EDL framework outper-
forms all the state-of-the-art deep learning techniques for
all disaster events. R-CNN has the lowest performance for
almost all events, which may be due to its architecture being designed for region-based object detection rather than frame-based semantic event detection. CaffeNet and GoogleNet have very close average performances: CaffeNet has much higher F1 scores for damage and tornado, while GoogleNet significantly outperforms CaffeNet on the fire and snow events. Among the four algorithms, AlexNet has almost the best results on all events except lightning and snow. Since
our proposed method leverages the four algorithms in an
intelligent manner, it successfully improves the performance
for all semantic events.
In summary, the experimental results demonstrate the superiority and effectiveness of our proposed framework compared to various state-of-the-art data mining algorithms.
V. CONCLUSION
In this paper, a novel ensemble deep classifier is proposed
which fuses the results from several weak learners and differ-
ent deep feature sets. The proposed framework is designed to
handle the imbalanced data problem in multimedia systems,
which is very common and unavoidable in current real
world applications. Specifically, it is applied to the detection
of semantic events from videos. Several experiments have been conducted to evaluate the performance of the proposed framework by comparing it to several state-of-the-art deep learning and existing machine learning algorithms. The experimental results demonstrate the effectiveness of the proposed framework for video event detection.

Table I: Average performance of various feature sets and classifiers on the disaster dataset

Features        Classifier  Precision  Recall  F1-score
handcrafted     DT          0.816      0.823   0.819
handcrafted     MCA         0.894      0.720   0.782
handcrafted     Boosting    0.910      0.841   0.867
handcrafted     SVM         0.957      0.802   0.868
R-CNN           SVM         0.930      0.722   0.794
GoogleNet       SVM         0.918      0.840   0.875
CaffeNet        SVM         0.919      0.840   0.876
AlexNet         SVM         0.924      0.859   0.888
deep features   EDL         0.949      0.883   0.913

[Figure 2: Performance evaluation for different concepts on the disaster dataset]
ACKNOWLEDGMENT
For Shu-Ching Chen, this research is partially supported
by DHS's VACCINE Center under Award Number 2009-ST-
061-CI0001 and NSF HRD-0833093, HRD-1547798, CNS-
1126619, and CNS-1461926.
REFERENCES
[1] M.-L. Shyu, Z. Xie, M. Chen, and S.-C. Chen, “Video semantic event/concept detection using a subspace-based multimedia data mining framework,” IEEE Transactions on Multimedia, vol. 10, no. 2, pp. 252–259, 2008.
[2] S.-C. Chen, M.-L. Shyu, and C. Zhang, “An intelligent framework for spatio-temporal vehicle tracking,” in Proceedings of the 4th International IEEE Conference on Intelligent Transportation Systems. IEEE, 2001, pp. 213–218.
[3] L. Lin, G. Ravitz, M.-L. Shyu, and S.-C. Chen, “Video semantic concept discovery using multimodal-based association classification,” in 2007 IEEE International Conference on Multimedia and Expo. IEEE, 2007, pp. 859–862.
[4] X. Chen, C. Zhang, S.-C. Chen, and S. Rubin, “A human-centered multiple instance learning framework for semantic video retrieval,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 39, no. 2, pp. 228–233, 2009.
[5] B. Krawczyk, “Learning from imbalanced data: open chal-
lenges and future directions,” Progress in Artificial Intelli-
gence, pp. 1–12, 2016.
[6] L. Lin, G. Ravitz, M.-L. Shyu, and S.-C. Chen, “Effective
feature space reduction with imbalanced data for semantic
concept detection,” in IEEE International Conference on
Sensor Networks, Ubiquitous and Trustworthy Computing
(SUTC). IEEE, 2008, pp. 262–269.
[7] H.-Y. Ha, Y. Yang, S. Pouyanfar, H. Tian, and S.-C. Chen,
“Correlation-based deep learning for multimedia semantic
concept detection,” in International Conference on Web Infor-
mation Systems Engineering. Springer, 2015, pp. 473–487.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[10] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[11] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, “Deep learning for content-based image retrieval: A comprehensive study,” in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 157–166.
[12] Y. Yang and S.-C. Chen, “Ensemble learning from imbalanced data set for video event detection,” in IEEE International Conference on Information Reuse and Integration (IRI). IEEE, 2015, pp. 82–89.
[13] X. Chen, C. Zhang, S.-C. Chen, and M. Chen, “A latent semantic indexing based method for solving multiple instance learning problem in region-based image retrieval,” in Seventh IEEE International Symposium on Multimedia (ISM’05). IEEE, 2005, 8 pp.
[14] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
[15] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory undersam-
pling for class-imbalance learning,” IEEE Transactions on
Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39,
no. 2, pp. 539–550, 2009.
[16] H.-Y. Ha, Y. Yang, F. C. Fleites, and S.-C. Chen, “Correlation-based feature analysis and multi-modality fusion framework for multimedia semantic retrieval,” in 2013 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2013, pp. 1–6.
[17] Y. Yang, “Exploring hidden coherent feature groups and
temporal semantics for multimedia big data analysis,” Ph.D.
dissertation, Florida International University, 2015.
[18] D. Liu, Y. Yan, M.-L. Shyu, G. Zhao, and M. Chen, “Spatio-temporal analysis for human action detection and recognition in uncontrolled environments,” International Journal of Multimedia Data Engineering and Management (IJMDEM), vol. 6, no. 1, pp. 1–18, 2015.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[20] M. Chen, C. Zhang, and S.-C. Chen, “Semantic event ex-
traction using neural network ensembles,” in International
Conference on Semantic Computing (ICSC 2007). IEEE,
2007, pp. 575–580.
[21] S.-C. Chen, M.-L. Shyu, and C. Zhang, “Innovative shot boundary detection for video indexing,” in Video Data Management and Information Retrieval, 2005, pp. 217–236.
[22] X. Li, S.-C. Chen, M.-L. Shyu, and B. Furht, “Image retrieval by color, texture, and spatial information,” in Proceedings of the 8th International Conference on Distributed Multimedia Systems (DMS’2002), 2002, pp. 152–159.
[23] N. Dalal and B. Triggs, “Histograms of oriented gradi-
ents for human detection,” in 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 1. IEEE, 2005, pp. 886–893.
[24] S. A. Chatzichristofis and Y. S. Boutalis, “CEDD: Color and edge directivity descriptor: A compact descriptor for image indexing and retrieval,” in International Conference on Computer Vision Systems. Springer, 2008, pp. 312–322.
[25] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[27] Y. Yan, M. Chen, M.-L. Shyu, and S.-C. Chen, “Deep
learning for imbalanced multimedia data classification,” in
IEEE International Symposium on Multimedia (ISM), 2015,
pp. 483–488.
[28] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-
shick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional
architecture for fast feature embedding,” in Proceedings of the
22nd ACM international conference on Multimedia. ACM,
2014, pp. 675–678.
[29] “Brewing ImageNet,” retrieved 2016-08-09. [Online]. Available: http://caffe.berkeleyvision.org/gathered/examples/imagenet.html
[30] Z. Ge, C. McCool, and P. Corke, “Content specific feature
learning for fine-grained plant classification,” in Working
notes of CLEF 2015 conference, 2015.
[31] S. Pouyanfar and S.-C. Chen, “Semantic concept detec-
tion using weighted discretization multiple correspondence
analysis for disaster information management,” in The 17th
IEEE International Conference on Information Reuse and
Integration (IRI). IEEE, 2016, pp. 556–564.