On the effect of data set size on bias and variance in classification learning
Damien Brain
Geoffrey I. Webb
School of Computing and Mathematics
Deakin University
Geelong Vic 3217
Pre-publication draft of a paper which appeared in the Proceedings of the Fourth Australian Knowledge Acquisition Workshop (AKAW '99), pages 117-128.
Abstract
With the advent of data mining, machine learning has come of age and is now a critical technology
in many businesses. However, machine learning evolved in a different research context to that in
which it now finds itself employed. A particularly important problem in the data mining world is
working effectively with large data sets. However, most machine learning research has been
conducted in the context of learning from very small data sets. To date most approaches to scaling
up machine learning to large data sets have attempted to modify existing algorithms to deal with
large data sets in a more computationally efficient and effective manner. But is this necessarily the
best method? This paper explores the possibility of designing algorithms specifically for large data
sets. Specifically, the paper looks at how increasing data set size affects bias and variance error
decompositions for classification algorithms. Preliminary results of experiments to determine these
effects are presented, showing that, as hypothesised, variance can be expected to decrease as
training set size increases. No clear effect of training set size on bias was observed. These results
have profound implications for data mining from large data sets, indicating that developing
effective learning algorithms for large data sets is not simply a matter of finding computationally
efficient variants of existing learning algorithms.
Introduction
The amount of data being stored by organisations is increasing at a rapid rate, and this trend is
likely to continue for the foreseeable future. Therefore, as time passes, we can expect machine
learning algorithms to be applied to increasingly large data sets - much larger than those on
which they were originally developed. Hence, machine learning algorithms
will be required to perform well on very large data sets. This paper addresses the impact of this
trend on classification learning algorithms.
Classification learning algorithms aim to learn a model that maps a multivalued input X into a
single valued categorical output Y. Thus, classification algorithms can be used to predict the output
Y of an unseen X.
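As a minimal illustrative sketch of this setting (not code from this paper; it assumes scikit-learn is available, and the synthetic data and tree learner are arbitrary choices):

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data: X is a multivalued input, y a categorical output.
    X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=0)
    model = DecisionTreeClassifier().fit(X[:400], y[:400])  # learn a model from training cases
    print(model.predict(X[400:405]))                        # predict Y for previously unseen X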
Although much work has been done on evaluating the performance of classification algorithms,
these evaluations have generally used relatively small data sets. Therefore, there is little evidence to
support the notion that "standard" versions of common classification algorithms perform well on
very large data sets. In fact, there is a large body of literature on attempts to "scale up" algorithms
to handle large data sets [1, 2, 3]. This body of work primarily addresses the issue of how to reduce
the high computational costs of traditional learning algorithms so as to make tractable their
application to large data sets.
However, this raises the question of whether machine learning algorithms developed for small data
sets are inherently suitable for large data sets. Is it really just a question of making existing
algorithms more efficient, or are the demands of effective learning from large data sets
fundamentally different from those of effective learning from small data sets?
This paper argues for the latter position. We argue that whereas the major problem confronting
classification learning from small data sets is the management of error resulting from learning
variance, as data set sizes increase the impact of variance can be expected to decrease. Hence,
fundamentally different types of learning algorithm are appropriate for effective learning from
small and large data sets.
The paper is organised as follows. First we describe the concepts of learning bias and learning
variance. Next we outline the reasoning that led us to hypothesise that variance can be expected to
decrease as data set sizes increase. Then we present some preliminary experimental results that
lend support to this hypothesis. Finally we present our conclusions and outline our proposed
directions for future research.
Learning Bias and Learning Variance
A number of recent studies have shown that the decomposition of a learner’s error into bias and
variance terms can provide considerable insight into the prediction performance of the learner.
This decomposition originates from analyses of regression, the learning of models with numeric
outputs [4]. Squared bias measures the contribution to error of the central tendency of the learner
when trained on different data. Variance is a measure of the contribution to error of deviations
from the central tendency. Bias and variance are evaluated with respect to a distribution of training
sets, such as a distribution containing all possible training sets of a specified size for a specified
domain.
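For the regression setting in which this decomposition originates, it can be written as follows (standard notation, not reproduced from this paper), where f is the target function, \hat{f}_D is the model learned from training set D, and \sigma^2 is the variance of the noise \varepsilon in the observed output y = f(x) + \varepsilon:

    E_{D,\varepsilon}\!\left[(y - \hat{f}_D(x))^2\right]
      = \underbrace{\left(f(x) - E_D[\hat{f}_D(x)]\right)^2}_{\text{(squared) bias}}
      + \underbrace{E_D\!\left[\left(\hat{f}_D(x) - E_D[\hat{f}_D(x)]\right)^2\right]}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{irreducible error}}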
Analysing learning systems in this way highlights the following issues. If a learning system learns
different classifiers from different training sets, then the degree to which the predictions of those
classifiers differ provides a lower limit on the average error of those classifiers when applied to
subsequent test data. If the predictions from different classifiers differ then not all can be correct!
However, inhibiting such variations between the classifiers will not necessarily eliminate prediction
error. The degree to which the correct answer for an object can differ from that for other objects
with identical descriptions (“irreducible error”) and the accuracy of the learning bias also affect
prediction error. Errors will also be caused by predictions from different classifiers that are
identical but incorrect!
Unfortunately, the definitions of bias and variance that have been developed for numeric regression
do not directly apply to classification learning. In numeric regression a prediction is not simply
right or wrong; there are varying degrees of error. In consequence, a number of alternative
formulations of bias and variance for classification learning have emerged [5, 6, 7, 8, 9]. Each of
these definitions is able to offer valuable insight into different aspects of a learner's performance.
For this research we use Kohavi and Wolpert’s definition [6], as it is the most widely employed of
those available. Following Kohavi and Wolpert, irreducible error is aggregated into bias². In
consequence, bias² and variance sum to error.
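A rough sketch of how this decomposition can be estimated from the predictions of classifiers trained on different samples (illustrative code only, not the implementation used here; it assumes NumPy and follows the usual Kohavi-Wolpert formulation with the observed label treated as the target, so that the per-case terms sum to the expected 0-1 error):

    import numpy as np

    def kohavi_wolpert_decomposition(predictions, y_true, n_classes):
        # predictions: shape (n_models, n_cases), class labels predicted by
        #   classifiers learned from different training samples of the same size.
        # y_true: shape (n_cases,), observed labels (irreducible error is folded
        #   into the bias term, as described above).
        n_models, n_cases = predictions.shape
        p_hat = np.zeros((n_cases, n_classes))          # empirical P(prediction = y | x)
        for m in range(n_models):
            p_hat[np.arange(n_cases), predictions[m]] += 1.0 / n_models
        p_true = np.zeros((n_cases, n_classes))         # point mass on the observed label
        p_true[np.arange(n_cases), y_true] = 1.0
        bias2 = 0.5 * np.sum((p_true - p_hat) ** 2, axis=1)
        variance = 0.5 * (1.0 - np.sum(p_hat ** 2, axis=1))
        # For each case, bias2 + variance equals the expected 0-1 error of a
        # classifier drawn at random from the set.
        return bias2.mean(), variance.mean()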
Different learning algorithms may have quite different bias/variance profiles.
An example of a high bias classification algorithm is the Naïve-Bayes classifier [10, 11]. Naïve-
Bayes classifiers fit a simple parametric model to the data. They are limited in the extent to which
they can be adjusted to different data sets. Such adjustments are restricted to changes in a relatively
small number of conditional probability estimates that underlie a Naïve-Bayes classifier. In
consequence, the predictions of a Naïve-Bayes classifier will be little affected by small changes in
the training data.
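As an illustration of how few adjustable quantities such a classifier has, the following sketch (our own, assuming categorical attributes encoded as integers and NumPy available; n_values is the hypothetical list of distinct values per attribute) estimates the class priors and per-attribute conditional probabilities that fully determine a Naïve-Bayes classifier:

    import numpy as np

    def train_naive_bayes(X, y, n_values, n_classes, alpha=1.0):
        # X: integer attribute values, shape (n_cases, n_attrs); y: labels 0..n_classes-1.
        priors = np.bincount(y, minlength=n_classes).astype(float)
        priors /= priors.sum()
        cond = []  # cond[a][c, v] = P(attribute a = v | class c), Laplace-smoothed
        for a in range(X.shape[1]):
            table = np.full((n_classes, n_values[a]), alpha)
            for xi, yi in zip(X[:, a], y):
                table[yi, xi] += 1.0
            cond.append(table / table.sum(axis=1, keepdims=True))
        return priors, cond

    def nb_predict(x, priors, cond):
        # Predict the class maximising P(c) * prod_a P(x_a | c), computed in log space.
        log_post = np.log(priors).copy()
        for a, v in enumerate(x):
            log_post += np.log([cond[a][c, v] for c in range(len(priors))])
        return int(np.argmax(log_post))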
Low bias algorithms, such as boosted decision trees [12], are more flexible. They can not only
describe a wider range of concepts, but are usually more adaptive in dealing with the training set.
Boosting repeatedly applies a base learning algorithm to a training set, resampling or reweighting
the data each time so as to force the learning system toward correctly classifying all the training
cases. This process has been shown to result in a general major decrease in learning bias,
accompanied by a smaller decrease in learning variance when applied with a standard decision tree
learner as the base learning algorithm [13].
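For illustration only, a minimal sketch of boosting a tree base learner (this is plain AdaBoost via scikit-learn, not the MultiBoost algorithm used later in this paper, which additionally incorporates wagging [9]):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    # Each boosting round reweights the data toward the cases the committee still misclassifies.
    booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
    booster.fit(X, y)
    print(booster.score(X, y))   # training accuracy of the boosted committee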
Our hypothesis
Variance measures the degree to which the predictions of the classifiers developed by a learning
algorithm differ from training sample to training sample. When sample sizes are small, the relative
impact of sampling on the general composition of a sample can be expected to be large. For
example, if 50% of a population exhibit some characteristic, in a sample of size 10 there is a 0.38
probability that only 40% or less of the sample will exhibit the characteristic and a 0.17 probability
that only 30% or less of the sample will exhibit the characteristic. In other words, a small sample is
likely to be quite unrepresentative of the population as a whole. In contrast, in a random sample of
size 1,000,000, if 50% of the population exhibits the characteristic then the probability that 40% or
less of the sample will exhibit the characteristic is less than 10^-22. The probability that 30% or less
of the sample will exhibit the characteristic is less than 10^-26. If 1% of a population exhibits a
characteristic, there is a probability of 0.37 that it will not be represented at all in a random sample
of size 100. The probability that it will not be represented at all in a sample of size 1,000,000 is
less than 10^-17. Clearly, alternative small samples will differ from one another in composition far
more than alternative large samples will.
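These figures can be checked with a few lines of code (a sketch assuming SciPy is available; the large-sample tail probabilities fall far below double precision, consistent with the quoted upper bounds):

    from scipy.stats import binom

    # Sample of 10 from a population in which 50% exhibit the characteristic.
    print(binom.cdf(4, 10, 0.5))   # P(40% or less exhibit it)  ~ 0.38
    print(binom.cdf(3, 10, 0.5))   # P(30% or less exhibit it)  ~ 0.17

    # Characteristic held by 1% of the population: probability it is entirely
    # absent from a random sample of 100 versus 1,000,000 cases.
    print(binom.pmf(0, 100, 0.01))            # ~ 0.37
    print(binom.logpmf(0, 1_000_000, 0.01))   # ~ -10050 on the log scale, i.e. far below 10^-17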
In consequence, it is to be expected that classifiers learned from alternative small samples will
differ more substantially than classifiers learned from alternative large samples. It follows that
their predictions are likely to differ more and hence that their variance will be higher.
On the basis of this reasoning, we predict that there will be a tendency for standard learning
algorithms to exhibit decreasing levels of variance as training set sizes increase.
Experiments
Experiments were performed to investigate whether such a trend towards lower variance as training
set size increases exists in practice. Three different classification algorithms were used, each with
different bias and variance profiles. The high bias / low variance Naïve-Bayes classifier [10, 11],
the machine learning exemplar C4.5 [14], and the bias- and (to a lesser extent) variance-reducing
MultiBoost [9] were chosen. The latter combines the well-known AdaBoost [12] and Bagging [15]
algorithms, coupling most of the superior bias reduction of the former with most of the superior
variance reduction of the latter. Training set sizes started from 125 cases and increased up to
32,000, doubling at each step. The four data sets used (adult, shuttle, connect-4, and cover type)
were the only real-world data sets available from the UCI Machine Learning Repository [16] that
were suitable for classification learning and contained at least 32,000 cases.
For each data set, training set size, and classification algorithm, a bias-variance analysis was
performed using 10 times 3-fold cross-validation. The Kohavi-Wolpert bias-variance
decomposition was measured, along with the average number of nodes for each induced classifier
(note that Naïve-Bayes classifiers do not have a variable number of nodes, hence this measure is not
presented for Naïve-Bayes). While our hypothesis relates only to variance, we
believed that it was also interesting to examine bias, to see whether it was subject to any clear
trends with respect to increasing data set size.
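A rough sketch of this procedure (our own illustrative code, not the authors' implementation; it assumes a scikit-learn-style learner, reuses the hypothetical kohavi_wolpert_decomposition helper sketched earlier, and omits the subsampling needed to vary the training set size):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def bias_variance_experiment(learner_factory, X, y, n_classes, repeats=10, folds=3, seed=0):
        # Collect, for every case, one prediction per repeat from a classifier
        # that did not see that case during training.
        rng = np.random.RandomState(seed)
        per_case_predictions = [[] for _ in range(len(y))]
        for _ in range(repeats):
            cv = StratifiedKFold(n_splits=folds, shuffle=True,
                                 random_state=int(rng.randint(1 << 30)))
            for train_idx, test_idx in cv.split(X, y):
                model = learner_factory().fit(X[train_idx], y[train_idx])
                for i, pred in zip(test_idx, model.predict(X[test_idx])):
                    per_case_predictions[i].append(pred)
        predictions = np.array(per_case_predictions).T   # shape: (repeats, n_cases)
        return kohavi_wolpert_decomposition(predictions, y, n_classes)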
MultiBoost
Graphs of bias/variance decompositions of error for experiments using MultiBoost are presented in
Figures 1 – 4. In these and all subsequent bias/variance decomposition graphs, bias is represented
as the upper part of each vertical shaded bar, and variance the remainder. The total height of each
bar represents the total average error for that sample size. The figures show a clear trend toward
lower total error as sample size increases. The figures also show a clear trend toward lower
variance with increasing sample size. The only exception to this is a slight increase in variance
when moving from a sample size of 250 to 500 on the Connect-4 data set. Bias also has a clear
downward trend in the first three figures; however, this is not apparent for the Adult data set.
Figure 1. Bias and variance of MultiBoost on the Connect-4 data set.
Figure 2. Bias and variance of MultiBoost on the Shuttle data set.
Figure 3. Bias and variance of MultiBoost on the Cover Type data set.
Figure 4. Bias and variance of MultiBoost on the Adult data set.
C4.5
The results of C4.5 experiments are contained in Figures 5 – 8. It is clear in Figures 6 and 7 that the
trend of lower bias, variance, and total error as sample size increases continues. However, in Figure
5, there is an increase in variance from sample sizes of 125 to 250, and 500 to 1,000. Bias decreases
in every case, though. Results for the Adult data set (Figure 8) show many increases in bias – 250
to 500, 1,000 to 2,000, 4,000 to 8,000, and 16,000 to 32,000. There is also a very slight increase in
variance when moving from 2,000 to 4,000 cases.
Figure 5. Bias and variance of C4.5 on the Connect-4 data set.
Figure 6. Bias and variance of C4.5 on the Shuttle data set.
Figure 7. Bias and variance of C4.5 on the Cover Type data set.
Figure 8. Bias and variance of C4.5 on the Adult data set.
Naïve-Bayes
Figures 9 – 12 show the results of experiments using a Naïve-Bayes classifier. In every case
variance decreases as sample size increases. However, this is not true for bias in many situations. In
fact, for the Adult data set (Figure 12), bias increases as sample size increases. This is also true for
Figures 9 and 11, except for increasing sample size from 125 to 250 cases, where bias slightly
decreases. The trend for bias on the Shuttle data set is more like those for other classifiers.
Apart from a small increase from 8,000 to 16,000 cases, bias decreases as sample size increases.
It is interesting to note that the only situation in all of the experiments in which increasing sample
size corresponds to an increase in total error is in Figure 12, moving from 8,000 to 16,000 cases.
This is undoubtedly due to the increased bias of the Naïve-Bayes classifier.
Figure 9. Bias and variance of NB on the Connect-4 data set.
Figure 10. Bias and variance of NB on the Shuttle data set.
Figure 11. Bias and variance of NB on the Cover Type data set.
Figure 12. Bias and variance of NB on the Adult data set.
Statistical Significance
Table 1 shows the statistical significance of the above results. Significance was measured by
applying a binomial probability test to a count of the number of increases and decreases in bias or
variance when moving from one sample size to the next (i.e. sample sizes of 125 were not
compared to 500 or more). Because prior predictions were made for variance, one-tailed tests are
applied for the variance outcomes. As no predictions were made with respect to bias, two-tailed
tests are applied for the bias outcomes¹. Results were considered significant if the outcome of
the binomial test is less than 0.05.
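The quoted p-values can be reproduced with an exact binomial test (a sketch assuming scipy.stats.binomtest, available in SciPy 1.7 or later); counts are decreases:increases over the eight consecutive steps in training set size:

    from scipy.stats import binomtest

    # One-tailed tests for variance (a decrease was predicted in advance).
    print(binomtest(8, n=8, p=0.5, alternative='greater').pvalue)    # 8:0 -> 0.0039
    print(binomtest(7, n=8, p=0.5, alternative='greater').pvalue)    # 7:1 -> 0.0352

    # Two-tailed tests for bias (no prior prediction was made).
    print(binomtest(8, n=8, p=0.5, alternative='two-sided').pvalue)  # 8:0 -> 0.0078
    print(binomtest(6, n=8, p=0.5, alternative='two-sided').pvalue)  # 6:2 -> 0.2891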
As can be seen in Table 1, variance is shown to have a statistically significant reduction due to
increased sample size in all but one instance (Connect-4 data using C4.5). This result is supportive
of our hypothesis that variance will tend to decrease as training set sizes increase. These results
suggest a particularly powerful effect given the very low power of the analyses performed, having
only eight training set size steps available for each analysis.
Bias, on the other hand, has a significant decrease in four instances, and a significant increase in
one instance. This suggests that bias is not influenced by training set size in the straightforward
manner that variance appears to be.
              MultiBoost                   C4.5                        Naïve-Bayes
              Bias          Variance       Bias          Variance      Bias          Variance
Adult         6:2 (0.2891)  8:0 (0.0039)   4:4 (0.6367)  7:1 (0.0352)  0:8 (0.0078)  8:0 (0.0039)
Connect-4     8:0 (0.0078)  7:1 (0.0352)   8:0 (0.0078)  6:2 (0.1445)  1:7 (0.0703)  8:0 (0.0039)
Cover Type    8:0 (0.0078)  8:0 (0.0039)   8:0 (0.0078)  8:0 (0.0039)  1:7 (0.0703)  8:0 (0.0039)
Shuttle       8:0 (0.0078)  8:0 (0.0039)   8:0 (0.0078)  8:0 (0.0039)  7:1 (0.0703)  8:0 (0.0039)
Table 1. Statistical significance of reductions in bias and variance due to an increase in sample size.
The ratio of decreases in bias or variance to increases is shown, followed by the outcome of a binomial
probability test. Statistically significant results are shown in bold if positive, italics if negative.
Learned Theory Complexity
Another interesting aspect of the effect of larger training set sizes is the complexity of learned
theories. It should be expected that as training sets increase there will be more information
available to the classification algorithm, and this will allow a more complex theory to be produced.
Figures 13 and 14 present the average numbers of nodes induced by MultiBoost and C4.5
respectively. The results given are the number of nodes induced for a training set size divided by
the maximum number of nodes induced on all training sizes for a particular data set. As can be
expected, the number of nodes increases dramatically as the training set moves from smaller to
larger sizes.
¹ Note that using one-tailed tests for bias would convert the three non-significant results for Naïve-Bayes
to significant, but would have no impact on the significance or otherwise of the results for MultiBoost or
C4.5. Using two-tailed tests for variance would convert the two 7:1 results to non-significant, but would
have no other impact on the assessments of significance.
Figure 13. Comparison of the average number of nodes used by MultiBoost. The maximum numbers
of nodes were: Connect-4: 64343.8; Shuttle: 340.5; Cover Type: 38261.7; Adult: 43375.4.
Figure 14. Comparison of the average number of nodes used by C4.5. The maximum numbers of
nodes were: Connect-4: 3477.1; Shuttle: 45.3; Cover Type: 4066.4; Adult: 607.6.
Conclusions and Future Work
These preliminary results show statistically significant evidence to support the hypothesis that
variance can be expected to decrease as training set size increases. This may not be a surprising
result. However, the fact that the results do not show a similar decrease in bias is not entirely
expected. The algorithms used represent both high and low bias/variance profiles; thus the
results do not seem specific to a particular type of algorithm.
This is further supported by the huge increases in complexity of learned theories as training size
increases. If the presented results are extrapolated to millions of training examples then the
complexity of learned models can be expected to be orders of magnitude higher than that for the
relatively small training sizes from which models are normally developed. However, it may be that
this increase in complexity is exactly what causes the noted decreases in variance.
It is important to note the limitations of the current study. Only four data sets were employed and
the largest data set sizes considered were very modest in data mining terms. Further experiments
must be performed on more and larger data sets. If such experiments confirm the above results,
then there are important implications. If the hypothesis is confirmed, some of the most well known
and widely used classification algorithms may be shown to be less suitable for large data sets than
for small. Our preliminary results suggest that while variance management is a critical property for
good generalisation performance with small data sets, bias management is far more critical for
good generalisation performance with large data sets.
References
1. Provost, F.J. and Aronis, J.M. (1996) “Scaling Up Inductive Learning with Massive
Parallelism,” Machine Learning, volume 23, number 1, pages 33-46. Kluwer Academic
Publishers.
2. Provost, F.J. and Kolluri, V. (1997) “Scaling Up Inductive Algorithms: An Overview,”
Proceedings of the Third International Conference on Knowledge Discovery and Data Mining,
Newport Beach, CA, pages 239-242. AAAI Press.
3. Catlett, J. (1992) “Peepholing: Choosing Attributes Efficiently for Megainduction,”
Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, Scotland,
pages 49-54, Morgan Kaufmann.
4. Geman, S. and Bienenstock, E. (1992) “Neural Networks and the Bias/Variance Dilemma,”
Neural Computation, volume 4, pages 1-48.
5. Breiman, L. (1996) “Bias, Variance, and Arcing Classifiers,” Technical Report 486, Statistics
Department, University of California, Berkeley, CA.
6. Kohavi, R. and Wolpert, D.H. (1996) “Bias Plus Variance Decomposition for Zero-One Loss
Functions,” Proceedings of the 13th International Conference on Machine Learning, Bari, Italy,
pages 275-283. Morgan Kaufmann.
7. Kong, E.B. and Dietterich, T.G. (1995) “Error-correcting Output Coding Corrects Bias and
Variance,” Proceedings of the 12th International Conference on Machine Learning, Tahoe City,
CA, pages 313-321. Morgan Kaufmann.
8. Friedman, J.H. (1997) “On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality,” Data
Mining and Knowledge Discovery, volume 1, number 1, pages 55-77. Kluwer Academic
Publishers.
9. Webb, G.I. (in press) “MultiBoosting: A Technique for Combining Boosting and Wagging,”
Machine Learning.
10. Kononenko, I. (1990) “Comparison of Inductive and Naïve Bayesian Learning Approaches to
Automatic Knowledge Acquisition,” In B. Wielinga et al. (eds.), Current Trends in Knowledge
Acquisition. IOS Press.
11. Langley, P. and Iba, W.F. and Thompson, K. (1992) “An Analysis of Bayesian Classifiers,”
Proceedings of the Tenth National Conference on Artificial Intelligence, pages 223-228. AAAI
Press.
12. Freund, Y. and Schapire, R.E. (1996) “Experiments with a New Boosting Algorithm,”
Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, pages
148-156. Morgan Kaufmann.
13. Schapire, R.E. and Freund, Y. and Bartlett, P. and Lee, W.S. (1998) “Boosting the Margin: A
New Explanation for the Effectiveness of Voting Methods,” The Annals of Statistics, 26, 1651-
1686.
14. Quinlan, J.R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann.
15. Breiman, L. (1996) “Bagging Predictors,” Machine Learning, 24, pages 123-140.
16. Blake, C.L. and Merz, C.J. (1998) UCI Repository of machine learning databases
[http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California,
Department of Information and Computer Science.