# Scikit-learn: Machine Learning in Python

## Abstract

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
arXiv:1201.0490v1 [cs.LG] 2 Jan 2012
Journal of Machine Learning Research 12 (2011) 2825-2830 Submitted 3/11; Revised 8/11; Published 10/11
Fabian Pedregosa fabian.pedregosa@inria.fr
Gaël Varoquaux gael.varoquaux@normalesup.org
Alexandre Gramfort alexandre.gramfort@inria.fr
Vincent Michel vincent.michel@logilab.fr
Bertrand Thirion bertrand.thirion@inria.fr
Parietal, INRIA Saclay
Neurospin, Bât 145, CEA Saclay
91191 Gif sur Yvette – France
Olivier Grisel olivier.grisel@ensta.fr
Nuxeo
20 rue Soleillet
75 020 Paris – France
Mathieu Blondel mblondel@ai.cs.kobe-u.ac.jp
Kobe University
Kobe 657-8501 – Japan
Peter Prettenhofer peter.prettenhofer@gmail.com
Bauhaus-Universität Weimar
Bauhausstr. 11
99421 Weimar – Germany
Ron Weiss ronweiss@gmail.com
76 Ninth Avenue
New York, NY 10011 – USA
Vincent Dubourg vincent.dubourg@gmail.com
Clermont Université, IFMA, EA 3867, LaMI
BP 10448, 63000 Clermont-Ferrand – France
Jake Vanderplas vanderplas@astro.washington.edu
Astronomy Department
University of Washington, Box 351580
Seattle, WA 98195 – USA
Alexandre Passos alexandre.tp@gmail.com
IESL Lab
UMass Amherst
Amherst MA 01002 – USA
David Cournapeau cournape@gmail.com
Enthought
21 J.J. Thompson Avenue
Cambridge, CB3 0FA – UK
© 2011 Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot and Édouard Duchesnay
Matthieu Brucher matthieu.brucher@gmail.com
Total SA, CSTJF
avenue Larribau
64000 Pau – France
Matthieu Perrot matthieu.perrot@cea.fr
Édouard Duchesnay edouard.duchesnay@cea.fr
LNAO
Neurospin, Bât 145, CEA Saclay
91191 Gif sur Yvette – France
Editor: Mikio Braun
Keywords: Python, supervised learning, unsupervised learning, model selection
1. Introduction
The Python programming language is establishing itself as one of the most popular languages for scientific computing. Thanks to its high-level interactive nature and its maturing ecosystem of scientific libraries, it is an appealing choice for algorithmic development and exploratory data analysis (Dubois, 2007; Milmann and Avaizis, 2011). Yet, as a general-purpose language, it is increasingly used not only in academic settings but also in industry.
Scikit-learn harnesses this rich environment to provide state-of-the-art implementations of many well-known machine learning algorithms, while maintaining an easy-to-use interface tightly integrated with the Python language. This answers the growing need for statistical data analysis by non-specialists in the software and web industries, as well as in fields outside of computer science, such as biology or physics. Scikit-learn differs from other machine learning toolboxes in Python for various reasons: i) it is distributed under the BSD license, ii) it incorporates compiled code for efficiency, unlike MDP (Zito et al., 2008) and pybrain (Schaul et al., 2010), iii) it depends only on numpy and scipy to facilitate easy distribution, unlike pymvpa (Hanke et al., 2009), which has optional dependencies such as R and shogun, and iv) it focuses on imperative programming, unlike pybrain, which uses a data-flow framework. While the package is mostly written in Python, it incorporates the C++ libraries LibSVM (Chang and Lin, 2001) and LibLinear (Fan et al., 2008), which provide reference implementations of SVMs and generalized linear models with compatible licenses. Binary packages are available on a rich set of platforms including Windows and any POSIX platform. Furthermore, thanks to its liberal license, it has been widely distributed as part of major free software distributions such as Ubuntu, Debian, Mandriva, NetBSD and Macports, and in commercial distributions such as the “Enthought Python Distribution”.
2. Project Vision
Code quality. Rather than providing as many features as possible, the project’s goal has been to provide solid implementations. Code quality is ensured with unit tests (as of release 0.8, test coverage is 81%) and the use of static analysis tools such as pyflakes and pep8. Finally, we strive to use consistent naming for the functions and parameters used throughout, in strict adherence to the Python coding guidelines and numpy style documentation.
BSD licensing. Most of the Python ecosystem is licensed with non-copyleft licenses. While such a policy is beneficial for adoption of these tools by commercial projects, it does impose some restrictions: we are unable to use some existing scientific code, such as the GSL.
Bare-bone design and API. To lower the barrier of entry, we avoid framework code and keep the number of different objects to a minimum, relying on numpy arrays for data containers.
Community-driven development. We base our development on collaborative tools such as
git, github and public mailing lists. External contributions are welcome and encouraged.
Documentation. Scikit-learn provides a 300-page user guide including narrative documentation, class references, a tutorial, installation instructions, as well as more than 60 examples, some featuring real-world applications. We try to minimize the use of machine-learning jargon, while maintaining precision with regard to the algorithms employed.
3. Underlying Technologies
Numpy: the base data structure used for data and model parameters. Input data is presented as numpy arrays, thus integrating seamlessly with other scientific Python libraries. Numpy’s view-based memory model limits copies, even when binding with compiled code (Van der Walt et al., 2011). It also provides basic arithmetic operations.
Scipy: efficient algorithms for linear algebra, sparse matrix representation, special functions and basic statistical functions. Scipy has bindings for many Fortran-based standard numerical packages, such as LAPACK. This is important for ease of installation and portability, as providing libraries around Fortran code can prove challenging on various platforms.
Cython: a language for combining C and Python. Cython makes it easy to reach the performance of compiled languages with Python-like syntax and high-level operations. It is also used to bind compiled libraries, eliminating the boilerplate code of Python/C extensions.
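The view-based memory model and sparse storage mentioned above can be illustrated with a minimal sketch. The array shapes and variable names here are illustrative assumptions, not taken from the paper; only standard numpy and scipy calls are used.

```python
import numpy as np
from scipy import sparse

X = np.arange(12, dtype=np.float64).reshape(4, 3)

# Basic slicing returns a view: no data is copied.
first_two_columns = X[:, :2]
shares = np.shares_memory(X, first_two_columns)

# Writing through the view changes the original array.
first_two_columns[0, 0] = -1.0

# Integer-array ("fancy") indexing, by contrast, makes a copy.
selected_rows = X[[0, 2]]
copied = not np.shares_memory(X, selected_rows)

# scipy.sparse stores only the non-zero entries of a mostly-zero matrix.
dense = np.zeros((1000, 1000))
dense[0, 1] = 3.0
dense[5, 7] = 4.0
X_sparse = sparse.csr_matrix(dense)
print(shares, copied, X_sparse.nnz)
```

Because slices are views, passing a sub-array to a routine does not by itself duplicate the data, which is the property the paragraph on Numpy relies on.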
4. Code Design

| | scikit-learn | mlpy | pybrain | pymvpa | mdp | shogun |
|---|---|---|---|---|---|---|
| Support Vector Classification | 5.2 | 9.47 | 17.5 | 11.52 | 40.48 | 5.63 |
| Lasso (LARS) | 1.17 | 105.3 | - | 37.35 | - | - |
| Elastic Net | 0.52 | 73.7 | - | 1.44 | - | - |
| k-Nearest Neighbors | 0.57 | 1.41 | - | 0.56 | 0.58 | 1.36 |
| PCA (9 components) | 0.18 | - | - | 8.93 | 0.47 | 0.33 |
| k-Means (9 clusters) | 1.34 | 0.79 | ⋆ | - | 35.75 | 0.68 |
| License | BSD | GPL | BSD | BSD | BSD | GPL |

-: Not implemented. ⋆: Does not converge within 1 hour.

Table 1: Time in seconds on the Madelon data set for various machine learning libraries exposed in Python: MLPy (Albanese et al., 2008), PyBrain (Schaul et al., 2010), pymvpa (Hanke et al., 2009), MDP (Zito et al., 2008) and Shogun (Sonnenburg et al., 2010). For more benchmarks see http://github.com/scikit-learn.

Objects specified by interface, not by inheritance. To facilitate the use of external objects with scikit-learn, inheritance is not enforced; instead, code conventions provide a consistent interface. The central object is an estimator, which implements a fit method, accepting as arguments an input data array and, optionally, an array of labels for supervised problems. Supervised estimators, such as SVM classifiers, can implement a predict method. Some estimators, which we call transformers, for example PCA, implement a transform method, returning modified input data. Estimators may also provide a score method, which is an increasing evaluation of goodness of fit: a log-likelihood, or a negated loss function. The other important object is the cross-validation iterator, which provides pairs of train and test indices to split input data, for example K-fold, leave-one-out, or stratified cross-validation.
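The estimator conventions and the cross-validation iterator described above can be sketched as follows. `NearestCentroidSketch` is a hypothetical class written for this illustration; it inherits from nothing and only follows the fit/predict/score naming convention. The example assumes a recent scikit-learn release, where `KFold` lives in `sklearn.model_selection`; module paths in the release described in the paper may differ.

```python
import numpy as np
from sklearn.model_selection import KFold


class NearestCentroidSketch:
    """Hypothetical estimator following the fit/predict/score convention
    without inheriting from any scikit-learn base class."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array(
            [X[y == label].mean(axis=0) for label in self.classes_]
        )
        return self

    def predict(self, X):
        # Distance from every sample to every class centroid.
        distances = np.linalg.norm(
            X[:, np.newaxis, :] - self.centroids_[np.newaxis, :, :], axis=2
        )
        return self.classes_[distances.argmin(axis=1)]

    def score(self, X, y):
        # Higher is better: mean accuracy on the given data.
        return float(np.mean(self.predict(X) == y))


# Two well-separated synthetic classes (illustrative data only).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 3.0, rng.randn(50, 2) + 3.0])
y = np.array([0] * 50 + [1] * 50)

# The cross-validation iterator yields (train indices, test indices) pairs.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X):
    model = NearestCentroidSketch().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

mean_score = float(np.mean(scores))
print(mean_score)
```

The loop is written by hand to show that the iterator only produces index arrays; the estimator itself never needs to know how the data was split.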
Model selection. Scikit-learn can evaluate an estimator’s performance or select parameters using cross-validation, optionally distributing the computation to several cores. This is accomplished by wrapping an estimator in a GridSearchCV object, where the “CV” stands for “cross-validated”. During the call to fit, it selects the parameters on a specified parameter grid, maximizing a score (the score method of the underlying estimator). predict, score, or transform are then delegated to the tuned estimator. This object can therefore be used transparently as any other estimator. Cross-validation can be made more efficient for certain estimators by exploiting specific properties, such as warm restarts or regularization paths (Friedman et al., 2010). This is supported through special objects, such as the LassoCV.
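As a minimal sketch of the model selection workflow described above, the following wraps an SVM classifier in `GridSearchCV` and then fits a `LassoCV`. The data sets, grid values, and fold counts are arbitrary assumptions chosen for illustration, and the import paths assume a recent scikit-learn release rather than the version described in the paper.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Wrap an estimator; fit() cross-validates every point of the grid.
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]},
    cv=5,
    n_jobs=1,  # a value of -1 would distribute the work over all cores
)
search.fit(X, y)
print(search.best_params_)

# predict and score are delegated to the refitted, tuned estimator.
predictions = search.predict(X)
train_accuracy = search.score(X, y)

# LassoCV chooses its regularization strength along a path of alphas,
# rather than refitting from scratch for every grid point.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=1.0, random_state=0
)
lasso = LassoCV(cv=5).fit(X_reg, y_reg)
print(lasso.alpha_)
```

After fitting, `search` can be passed anywhere a plain classifier would be accepted, which is the sense in which it is used "transparently".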
Finally, a Pipeline object can combine several transformers and an estimator to create a combined estimator to, for example, apply dimension reduction before fitting. It behaves as a standard estimator, and GridSearchCV can therefore tune the parameters of all steps.
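The combination of a Pipeline with a grid search might look like the sketch below. The step names `reduce` and `classify`, the data, and the grid values are assumptions made for this example; the `step__parameter` naming is how current scikit-learn releases address the parameters of individual steps.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)

# Dimension reduction followed by a classifier, exposed as one estimator.
pipeline = Pipeline([
    ("reduce", PCA()),
    ("classify", SVC(kernel="linear")),
])

# Parameters of each step are addressed as <step name>__<parameter>.
param_grid = {
    "reduce__n_components": [2, 5, 10],
    "classify__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
search.fit(X, y)

best = search.best_params_
predictions = search.predict(X)
print(best)
```

Because the transformer is refit inside each cross-validation fold, the PCA projection never sees the held-out samples of that fold.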
5. High-level yet Efficient: Some Trade Offs
While scikit-learn focuses on ease of use, and is mostly written in a high-level language, care has been taken to maximize computational efficiency. In Table 1, we compare computation time for a few algorithms implemented in the major machine learning toolkits accessible in Python. We use the Madelon data set (Guyon et al., 2004), with 4400 instances and 500 attributes. The data set is quite large, but small enough for most algorithms to run.
SVM. While all of the packages compared call libsvm in the background, the performance of scikit-learn can be explained by two factors. First, our bindings avoid memory copies and have up to 40% less overhead than the original libsvm Python bindings. Second, we patch libsvm to improve efficiency on dense data, use a smaller memory footprint, and better use memory alignment and pipelining capabilities of modern processors. This patched version also provides unique features, such as setting weights for individual samples.
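Per-sample weights, mentioned above, are exposed in current scikit-learn releases through the `sample_weight` argument of `SVC.fit`. The sketch below uses synthetic overlapping classes and an arbitrary weight of 10 for one class; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping synthetic classes.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 0.5, rng.randn(100, 2) + 0.5])
y = np.array([0] * 100 + [1] * 100)

unweighted = SVC(kernel="linear", C=1.0).fit(X, y)

# Give every sample of class 1 ten times the weight of a class 0 sample.
weights = np.where(y == 1, 10.0, 1.0)
weighted = SVC(kernel="linear", C=1.0).fit(X, y, sample_weight=weights)

# Count how many training points each model assigns to class 1.
n_positive_unweighted = int(unweighted.predict(X).sum())
n_positive_weighted = int(weighted.predict(X).sum())
print(n_positive_unweighted, n_positive_weighted)
```

A weight scales the penalty for misclassifying that sample, so heavily weighted samples pull the decision boundary toward classifying them correctly.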
LARS. Iteratively refining the residuals instead of recomputing them gives performance gains of 2–10 times over the reference R implementation (Hastie and Efron, 2004). Pymvpa uses this implementation via the Rpy R bindings and pays a heavy price in memory copies.
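For readers who want to try the LARS solver, a minimal usage sketch follows. It calls `lars_path` from a recent scikit-learn release on synthetic data; the data set and its dimensions are assumptions, and the example shows only the public interface, not the internal residual updates discussed above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

X, y = make_regression(
    n_samples=100, n_features=20, n_informative=4, noise=1.0, random_state=0
)

# method="lasso" computes the full Lasso regularization path with LARS.
alphas, active, coefs = lars_path(X, y, method="lasso")

# coefs has one column per breakpoint of the piecewise-linear path;
# the path starts from the all-zero solution at the largest alpha.
n_nonzero_at_end = int(np.count_nonzero(coefs[:, -1]))
print(len(alphas), n_nonzero_at_end)
```

Each column of `coefs` is the coefficient vector at the corresponding entry of `alphas`, so the whole path is obtained from a single call.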
Elastic Net. We benchmarked the scikit-learn coordinate descent implementation of Elastic Net. It achieves the same order of performance as the highly optimized Fortran version glmnet (Friedman et al., 2010) on medium-scale problems, but performance on very large problems is limited since we do not use the KKT conditions to define an active set.
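To make the coordinate descent idea concrete, here is a simplified NumPy sketch. It is not the scikit-learn implementation (which is compiled); it assumes no intercept, a fixed number of sweeps instead of a convergence test, and the objective documented for `ElasticNet` in current releases. The function names are hypothetical. Like the description above, it cycles over all features rather than restricting updates to an active set.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def soft_threshold(value, threshold):
    """Shrink a scalar toward zero by `threshold`."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def elastic_net_cd(X, y, alpha, l1_ratio, n_sweeps=500):
    """Cyclic coordinate descent (illustrative sketch, no intercept) for
    1/(2n) ||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||_2^2."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y.astype(float).copy()  # always equals y - X @ w
    col_sq_norms = (X ** 2).sum(axis=0) / n_samples
    l1_penalty = alpha * l1_ratio
    l2_penalty = alpha * (1.0 - l1_ratio)
    for _ in range(n_sweeps):
        for j in range(n_features):
            if col_sq_norms[j] == 0.0:
                continue
            old = w[j]
            # Correlation of feature j with the residual that excludes it.
            rho = X[:, j] @ residual / n_samples + col_sq_norms[j] * old
            w[j] = soft_threshold(rho, l1_penalty) / (col_sq_norms[j] + l2_penalty)
            if w[j] != old:
                # Refine the residual in place instead of recomputing it.
                residual -= X[:, j] * (w[j] - old)
    return w


rng = np.random.RandomState(0)
X = rng.randn(100, 10)
true_w = np.zeros(10)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.randn(100)

w_sketch = elastic_net_cd(X, y, alpha=0.1, l1_ratio=0.5)
reference = ElasticNet(
    alpha=0.1, l1_ratio=0.5, fit_intercept=False, tol=1e-10, max_iter=10000
).fit(X, y)
max_difference = float(np.max(np.abs(w_sketch - reference.coef_)))
print(max_difference)
```

Each coordinate update has a closed form, which is why the method needs no step size; the cost per sweep is dominated by the residual updates.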
kNN. The k-nearest neighbors classifier implementation constructs a ball tree (Omohundro, 1989) of the samples, but uses a more efficient brute force search in large dimensions.
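The two search strategies can be selected explicitly in current scikit-learn releases through the `algorithm` parameter of `KNeighborsClassifier`. The following sketch, on synthetic data chosen only for illustration, checks that both strategies lead to essentially the same predictions; only the way neighbors are found differs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

# Same classifier, two neighbor-search strategies.
tree_knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree")
brute_knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
tree_knn.fit(X_train, y_train)
brute_knn.fit(X_train, y_train)

# Fraction of test points on which the two strategies agree.
agreement = float(
    np.mean(tree_knn.predict(X_test) == brute_knn.predict(X_test))
)
print(agreement)
```

Leaving `algorithm` at its default lets the library pick a strategy from the shape of the training data.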
PCA. For medium to large data sets, scikit-learn provides an implementation of a truncated
PCA based on random projections (Rokhlin et al., 2009).
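A truncated, randomized PCA can be requested in current scikit-learn releases with `svd_solver="randomized"`. The sketch below builds a synthetic, nearly rank-9 data set (an assumption made so that 9 components capture almost all of the variance) and reduces it to 9 components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: a rank-9 signal plus a small amount of noise.
rng = np.random.RandomState(0)
latent = rng.randn(500, 9)
mixing = rng.randn(9, 100)
X = latent @ mixing + 0.01 * rng.randn(500, 100)

# Only the leading 9 components are computed, using random projections.
pca = PCA(n_components=9, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)

explained = float(pca.explained_variance_ratio_.sum())
print(X_reduced.shape, explained)
```

The randomized solver avoids a full decomposition of the data matrix, which is where the savings come from when only a few components are needed.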
k-means. Scikit-learn’s k-means algorithm is implemented in pure Python. Its performance is limited by the fact that numpy’s array operations take multiple passes over data.
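The remark about multiple passes can be seen in a simplified NumPy sketch of one Lloyd iteration. This is an illustrative reimplementation under stated assumptions (fixed iteration count, initial centers picked by hand, the hypothetical name `lloyd_step`), not the scikit-learn code.

```python
import numpy as np


def lloyd_step(X, centers):
    """One k-means (Lloyd) iteration written with whole-array operations."""
    # Pass 1: squared distances, shape (n_samples, n_clusters).
    sq_dist = ((X[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2)
    # Pass 2: index of the nearest center for each sample.
    labels = sq_dist.argmin(axis=1)
    # Pass 3 (one per cluster): recompute each center as the mean of its members.
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = X[labels == k]
        if len(members) > 0:  # keep the old center if a cluster is empty
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers


# Two well-separated synthetic blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) - 5.0, rng.randn(100, 2) + 5.0])

# Start from one sample of each blob.
centers = X[[0, 100]].copy()
for _ in range(10):
    labels, centers = lloyd_step(X, centers)

print(centers)
```

Each vectorized expression traverses the data (or a temporary of the same size) separately, whereas a compiled loop could compute distances, assignments, and running sums in a single pass.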
6. Conclusion
Scikit-learn exposes a wide variety of machine learning algorithms, both supervised and unsupervised, using a consistent, task-oriented interface, thus enabling easy comparison of methods for a given application. Since it relies on the scientific Python ecosystem, it can easily be integrated into applications outside the traditional range of statistical data analysis. Importantly, the algorithms, implemented in a high-level language, can be used as building blocks for approaches specific to a use case, for example, in medical imaging (Michel et al., 2011). Future work includes online learning, to scale to large data sets.
References
D. Albanese, S. Merler, G. Jurman, and R. Visintainer. MLPy: high-performance Python package for predictive modeling. In NIPS, MLOSS workshop, 2008.
C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.
P.F. Dubois, editor. Python: batteries included, volume 9 of Computing in Science & Engineering. IEEE/AIP, May 2007.
R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
I. Guyon, S.R. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge, 2004.
M. Hanke, Y.O. Halchenko, P.B. Sederberg, S.J. Hanson, J.V. Haxby, and S. Pollmann. PyMVPA: A Python toolbox for multivariate pattern analysis of fMRI data. Neuroinformatics, 7(1):37–53, 2009.
T. Hastie and B. Efron. Least Angle Regression, Lasso and Forward Stagewise. http://cran.r-project.org/web/packages/lars/lars.pdf, 2004.
V. Michel, A. Gramfort, G. Varoquaux, E. Eger, C. Keribin, and B. Thirion. A supervised clustering approach for fMRI-based inference of brain states. Patt Rec, page epub ahead of print, April 2011. doi: 10.1016/j.patcog.2011.04.006.
K.J. Milmann and M. Avaizis, editors. Scientific Python, volume 11 of Computing in Science & Engineering. IEEE/AIP, March 2011.
S.M. Omohundro. Five balltree construction algorithms. ICSI Technical Report TR-89-063, 1989.
V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009.
T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. The Journal of Machine Learning Research, 11:743–746, 2010.
S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11:1799–1802, 2010.
S. Van der Walt, S.C. Colbert, and G. Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science and Engineering, 11, 2011.
T. Zito, N. Wilbert, L. Wiskott, and P. Berkes. Modular toolkit for Data Processing (MDP): a Python data processing framework. Frontiers in Neuroinformatics, 2, 2008.
2830
... Classification We performed multiclass classification [39]. Input for classifiers were features and targets. ...
... Here, the features were the same as the input for the clustering algorithm; the targets were the clusters assigned by k-means. The dataset was split into a training dataset with 75% of the data and a testing dataset with 25% of the data using sklearn [39]. The Gradient Boosting classifier XGBoost [40] was used for classification. ...
... All other parameters were left at default. We used mlogloss, which returns the logistic loss in a multiclass dataset [39,41] as evaluation metric. The performance of the trained model on the testing dataset was evaluated using the overall accuracy as the performance metric. ...
Article
Full-text available
Background The Progress Test Medizin (PTM) is a 200-question formative test that is administered to approximately 11,000 students at medical universities (Germany, Austria, Switzerland) each term. Students receive feedback on their knowledge (development) mostly in comparison to their own cohort. In this study, we use the data of the PTM to find groups with similar response patterns. Methods We performed k-means clustering with a dataset of 5,444 students, selected cluster number k = 5, and answers as features. Subsequently, the data was passed to XGBoost with the cluster assignment as target enabling the identification of cluster-relevant questions for each cluster with SHAP. Clusters were examined by total scores, response patterns, and confidence level. Relevant questions were evaluated for difficulty index, discriminatory index, and competence levels. Results Three of the five clusters can be seen as "performance" clusters: cluster 0 (n = 761) consisted predominantly of students close to graduation. Relevant questions tend to be difficult, but students answered confidently and correctly. Students in cluster 1 (n = 1,357) were advanced, cluster 3 (n = 1,453) consisted mainly of beginners. Relevant questions for these clusters were rather easy. The number of guessed answers increased. There were two "drop-out" clusters: students in cluster 2 (n = 384) dropped out of the test about halfway through after initially performing well; cluster 4 (n = 1,489) included students from the first semesters as well as "non-serious" students both with mostly incorrect guesses or no answers. Conclusion Clusters placed performance in the context of participating universities. Relevant questions served as good cluster separators and further supported our "performance" cluster groupings.
... The features are split into training (60%), validation (20%), and testing (20%) sets using scikit-learn's [29] stratified train_test_split method and k-folds methods with the number of folds set to 5. The stratified methods are chosen because they ensure that each of the splits has the same distribution of normal and faulty data. Scikit's min-max scaler is used to scale the data to a range of {0, 1}. ...
... As a baseline, we use the SVM-based classification algorithm of [29], while the nearest-neighbor algorithm is based on the Ward minimum variance method [30]. To prioritize recent data points over previous ones, both methods make use of sliding windows. ...
... Feature selection algorithms such as sequential feature selection can be used to determine the optimal features. For instance, using the features shown in Table III derived from Sklearn's sequential forward feature selection [29] results in an average lead time increase of 0.1s over training a binary classification with (3). However, to truly take advantage of the multi-class classification algorithm more investigation into optimal feature selection is needed to determine whether the additional average lead time gained can overcome the fault identifier delay. ...
Preprint
For legged robots to operate in complex terrains, they must be robust to the disturbances and uncertainties they encounter. This paper contributes to enhancing robustness through the design of fall detection/prediction algorithms that will provide sufficient lead time for corrective motions to be taken. Falls can be caused by abrupt (fast-acting), incipient (slow-acting), or intermittent (non-continuous) faults. Early fall detection is a challenging task due to the masking effects of controllers (through their disturbance attenuation actions), the inverse relationship between lead time and false positive rates, and the temporal behavior of the faults/underlying factors. In this paper, we propose a fall detection algorithm that is capable of detecting both incipient and abrupt faults while maximizing lead time and meeting desired thresholds on the false positive and negative rates.
... Features recorded against temperature include date and time of the measurement, peak center, peak height, full width at half max, area, kurtosis, sensor/grating type, coating, vendor (composite variable standing in for fabrication process variability), time, laser power and experiment type (experiments where number of consecutive scans is 100 or greater are referred to "annealing" as these experiments were designed to detect any slow relaxation process that might be occurring following temperature step). Data exploration was carried out using standard python [21] libraries (pandas [22], seaborn [23], matplotlib [24]) while sklearn sci-kit [25] and statmodels [26] libraries were used for data modeling. A brief discussion of the exploratory data analysis and methodology employed for data modeling is included in the supplemental. ...
... Further-22 more, as shown inFig SP 3and SP 4, changes in spectra-derived features23 are slow and cumulative, rising above the prevailing measurement noise only 24 over long time periods. As shown inFig SP 5, the Allen deviation (ADEV)25 plots for sensor S5 and S7 show that the measurement uncertainty at any26 given temperature is dominated by 1/f noise as evidenced by a linear de-27 crease in variance with integration time. Based on these results we conclude28 that peak center drift due to hysteresis either occurs outside the observation29 times used in this study 1 or that the impact of these changes on measurement 30 variance is smaller than the impact of other processes over the observation 31 times used for each temperature measurement. ...
Preprint
Full-text available
In recent years there has been considerable interest in using photonic thermometers such as Fiber Bragg grating (FBG) and silicon ring resonators as an alternative technology to resistance-based legacy thermometers. Although FBG thermometers have been commercially available for decades their metrological performance remains poorly understood, hindered in part by complex behavior at elevated temperatures. In this study we systematically examine the temporal evolution of the temperature response of 14 sensors that were repeatedly cycled between 233 K and 393 K. Data exploration and modelling indicate the need to account for serial-correlation in model selection. Utilizing the coupled-mode theory treatment of FBG to guide feature selection we evaluate various calibration models. Our results indicates that a dynamic regression model can effectively reduce measurement uncertainty due to hysteresis by up to $\approx 70 \% ... For the baseline, in unsupervised learning setting, the stateof-the-art to the author's knowledge is the K-means based method [53]. We use an off-the-shelf K-means implementation [54], with input features formed from cascading the 200 × 2 bits traffic states and the 72 × 2 real value CSI (real and imaginary number as two independent channels). The configurations of the hyperparameters are set to the default values as in [54,KMeans]. ... ... We use an off-the-shelf K-means implementation [54], with input features formed from cascading the 200 × 2 bits traffic states and the 72 × 2 real value CSI (real and imaginary number as two independent channels). The configurations of the hyperparameters are set to the default values as in [54,KMeans]. The evaluation of the testing phase performance follows the same label matching procedures as adopted in the W-VAE. ... Preprint Wireless fingerprinting refers to a device identification method leveraging hardware imperfections and wireless channel variations as signatures. 
Beyond physical layer characteristics, recent studies demonstrated that user behaviours could be identified through network traffic, e.g., packet length, without decryption of the payload. Inspired by these results, we propose a multi-layer fingerprinting framework that jointly considers the multi-layer signatures for improved identification performance. In contrast to previous works, by leveraging the recent multi-view machine learning paradigm, i.e., data with multiple forms, our method can cluster the device information shared among the multi-layer features without supervision. Our information-theoretic approach can be extended to supervised and semi-supervised settings with straightforward derivations. In solving the formulated problem, we obtain a tight surrogate bound using variational inference for efficient optimization. In extracting the shared device information, we develop an algorithm based on the Wyner common information method, enjoying reduced computation complexity as compared to existing approaches. The algorithm can be applied to data distributions belonging to the exponential family class. Empirically, we evaluate the algorithm in a synthetic dataset with real-world video traffic and simulated physical layer characteristics. Our empirical results show that the proposed method outperforms the state-of-the-art baselines in both supervised and unsupervised settings. ... We focus on the hadronic final states (l + l − jj) where the dijets forming a fat-jet signature, from the heavy neutrinos with masses, 100 GeV < m N < 1 TeV. For the ML method, we use Gradient Boosted Decision Tree (GBDT) [56] or Multi-Layer Perceptron (MLP) [57] technique in the Scikit-learn framework [58]. The multi-variate are taken from the observables including the fat-jet system. ... ... We use GBDT provided by Sklearn [58] package under all the default parameters. The MLP model is built by Keras [74], a high level interface of the machine learning framework Tensorflow [75]. 
... Preprint Full-text available We explore the potential to use machine learning methods to search for heavy neutrinos, from their hadronic final states including a fat-jet signal, via the processes$pp \rightarrow W^{\pm *}\rightarrow \mu^{\pm} N \rightarrow \mu^{\pm} \mu^{\mp} W^{\pm} \rightarrow \mu^{\pm} \mu^{\mp} J$at hadron colliders. We use either the Gradient Boosted Decision Tree or Multi-Layer Perceptron methods to analyse the observables incorporating the jet substructure information, which is performed at hadron colliders with$\sqrt{s}=$13, 27, 100 TeV. It is found that, among the observables, the invariant masses of variable system and the observables from the leptons are the most powerful ones to distinguish the signal from the background. With the help of machine learning techniques, the limits on the active-sterile mixing have been improved by about one magnitude comparing to the cut-based analyses, with$V_{\mu N}^2 \lesssim 10^{-4}$for the heavy neutrinos with masses, 100 GeV$~<m_N<~\$1 TeV.
... φ here represents the WResNet101 [48] backbone pre-trained on ImageNet [16] without fine-tuning. The OPTICS algorithm we used is re-implemented by scikitlearn [33]. Following the settings in [25], we rescale all the images into 256×256 in our experiments. ...
Preprint
Anomaly detectors are widely used in industrial production to detect and localize unknown defects in query images. These detectors are trained on nominal images and have shown success in distinguishing anomalies from most normal samples. However, hard-nominal examples are scattered and far apart from most normalities, they are often mistaken for anomalies by existing anomaly detectors. To address this problem, we propose a simple yet efficient method: \textbf{H}ard Nominal \textbf{E}xample-aware \textbf{T}emplate \textbf{M}utual \textbf{M}atching (HETMM). Specifically, \textit{HETMM} aims to construct a robust prototype-based decision boundary, which can precisely distinguish between hard-nominal examples and anomalies, yielding fewer false-positive and missed-detection rates. Moreover, \textit{HETMM} mutually explores the anomalies in two directions between queries and the template set, and thus it is capable to capture the logical anomalies. This is a significant advantage over most anomaly detectors that frequently fail to detect logical anomalies. Additionally, to meet the speed-accuracy demands, we further propose \textbf{P}ixel-level \textbf{T}emplate \textbf{S}election (PTS) to streamline the original template set. \textit{PTS} selects cluster centres and hard-nominal examples to form a tiny set, maintaining the original decision boundaries. Comprehensive experiments on five real-world datasets demonstrate that our methods yield outperformance than existing advances under the real-time inference speed. Furthermore, \textit{HETMM} can be hot-updated by inserting novel samples, which may promptly address some incremental learning issues.
... The Scikit-learn package (Pedregosa et al., 2011) was used to build the RF models in this study. The MSE is used as the loss function because the training time is extremely long with the MAE loss function. ...
Preprint
Railway operations involve different types of entities (stations, trains, etc.), making the existing graph/network models with homogenous nodes (i.e., the same kind of nodes) incapable of capturing the interactions between the entities. This paper aims to develop a heterogeneous graph neural network (HetGNN) model, which can address different types of nodes (i.e., heterogeneous nodes), to investigate the train delay evolution on railway networks. To this end, a graph architecture combining the HetGNN model and the GraphSAGE homogeneous GNN (HomoGNN), called SAGE-Het, is proposed. The aim is to capture the interactions between trains, trains and stations, and stations and other stations on delay evolution based on different edges. In contrast to the traditional methods that require the inputs to have constant dimensions (e.g., in rectangular or grid-like arrays) or only allow homogeneous nodes in the graph, SAGE-Het allows for flexible inputs and heterogeneous nodes. The data from two sub-networks of the China railway network are applied to test the performance and robustness of the proposed SAGE-Het model. The experimental results show that SAGE-Het exhibits better performance than the existing delay prediction methods and some advanced HetGNNs used for other prediction tasks; the predictive performances of SAGE-Het under different prediction time horizons (10/20/30 min ahead) all outperform other baseline methods; Specifically, the influences of train interactions on delay propagation are investigated based on the proposed model. The results show that train interactions become subtle when the train headways increase . This finding directly contributes to decision-making in the situation where conflict-resolution or train-canceling actions are needed.
... Individual data points are represented by the black dots in Figure 3, and the clouds around each cluster's center display the distribution of the data. Similar background colors on different parts of All methods evaluated in this study are implemented in the scikit-learn machine learning library for Python [29]. This library was chosen because it is widely available and easy to implement. ...
Preprint
Modeling parameters are essential to the fidelity of nonlinear models of concrete structures subjected to earthquake ground motions, especially when simulating seismic events strong enough to cause collapse. This paper addresses two of the most significant barriers to improving nonlinear modeling provisions in seismic evaluation standards using experimental data sets: identifying the most likely mode of failure of structural components, and implementing data fitting techniques capable of recognizing interdependencies between input parameters and nonlinear relationships between input parameters and model outputs. Machine learning tools in the Scikit-learn and Pytorch libraries were used to calibrate equations and black-box numerical models for nonlinear modeling parameters (MP) a and b of reinforced concrete columns defined in the ASCE 41 and ACI 369.1 standards, and to estimate their most likely mode of failure. It was found that machine learning regression models and machine learning black-boxes were more accurate than current provisions in the ACI 369.1/ASCE 41 Standards. Among the regression models, Regularized Linear Regression was the most accurate for estimating MP a, and Polynomial Regression was the most accurate for estimating MP b. The two black-box models evaluated, namely the Gaussian Process Regression and the Neural Network (NN), provided the most accurate estimates of MPs a and b. The NN model was the most accurate machine learning tool of all evaluated. A multi-class classification tool from the Scikit-learn machine learning library correctly identified column mode of failure with 79% accuracy for rectangular columns and with 81% accuracy for circular columns, a substantial improvement over the classification rules in ASCE 41-13.
... I use scikit-learn's SVM implementation (Pedregosa et al. 2011) ...
Preprint
Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for training an event detection system, and a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier.
... For the client local training process, we used GNU Parallel [57] to coordinate and execute all the jobs in parallel. We implemented the clustering algorithms, validation metrics, dimensionality reduction, etc. with scikit-learn [58]. ...
Preprint
There is a growing trend of cyberattacks against Internet of Things (IoT) devices; moreover, the sophistication and motivation of those attacks are increasing. The vast scale of IoT, its diverse hardware and software, and the typical placement of devices in uncontrolled environments make traditional IT security mechanisms such as signature-based intrusion detection and prevention systems challenging to integrate. They also struggle to cope with the rapidly evolving IoT threat landscape due to long delays between the analysis and publication of the detection rules. Machine learning methods have shown faster response to emerging threats; however, model training architectures like cloud or edge computing face multiple drawbacks in IoT settings, including network overhead and data isolation arising from the large scale and heterogeneity that characterize these networks. This work presents an architecture for training unsupervised models for network intrusion detection in large, distributed IoT and Industrial IoT (IIoT) deployments. We leverage Federated Learning (FL) to collaboratively train between peers and reduce isolation and network overhead problems. We build upon it to include an unsupervised device clustering algorithm fully integrated into the FL pipeline to address the heterogeneity issues that arise in FL settings. The architecture is implemented and evaluated using a testbed that includes various emulated IoT/IIoT devices and attackers interacting in a complex network topology comprising 100 emulated devices, 30 switches and 10 routers. The anomaly detection models are evaluated on real attacks performed by the testbed's threat actors, including the entire Mirai malware lifecycle, an additional botnet based on the Merlin command and control server and other red-teaming tools performing scanning activities and multiple attacks targeting the emulated devices.
Article
In the Python world, NumPy arrays are the standard representation for numerical data and enable efficient implementation of numerical computations in a high-level language. As this effort shows, NumPy performance can be improved through three techniques: vectorizing calculations, avoiding copying data in memory, and minimizing operation counts.
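The first two techniques named in this abstract can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and chosen only for illustration; they are not taken from the cited article.

```python
import numpy as np


def squared_distances_loop(points, center):
    """Reference version: one Python-level iteration per point."""
    out = np.empty(len(points))
    for i, p in enumerate(points):
        out[i] = sum((p[j] - center[j]) ** 2 for j in range(len(center)))
    return out


def squared_distances_vectorized(points, center):
    """Vectorized version: the loops run inside compiled NumPy code."""
    diff = points - center  # broadcasting: (n, d) minus (d,)
    # Row-wise sum of squares, computed without materializing diff ** 2.
    return np.einsum("ij,ij->i", diff, diff)


def scale_in_place(values, factor):
    """Avoid a copy by writing the result into the existing buffer."""
    np.multiply(values, factor, out=values)
    return values


# Toy data (an assumption for the demo): 1000 random 3-D points.
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))
center = np.zeros(3)

slow = squared_distances_loop(points, center)
fast = squared_distances_vectorized(points, center)

# Basic slicing returns a view onto the same memory, not a copy.
first_column = points[:, 0]
```

Both distance functions return the same values; the vectorized one simply moves the per-element work out of the interpreter, and `scale_in_place` shows how the `out=` argument reuses an existing array instead of allocating a new one.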
Article
LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
Article
We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms are able to deal with several different data classes, including dense and sparse vectors and sequences using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some of them coming with no less than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB, R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.
Article
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
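As a rough illustration of cyclical coordinate descent along a regularization path, the sketch below handles only the linear-regression case with an elastic-net penalty. It is not the cited authors' implementation: the function names, the fixed number of sweeps, the objective scaling (1/(2n) times the squared error plus lam * (alpha * ℓ1 + (1 - alpha)/2 * squared ℓ2)), and the toy data are all assumptions made for this example.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; the closed-form lasso update."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def elastic_net_cd(X, y, lam, alpha=1.0, beta=None, n_sweeps=200):
    """Cyclical coordinate descent for elastic-net linear regression.

    Assumes no column of X is identically zero.
    """
    n, d = X.shape
    beta = np.zeros(d) if beta is None else beta.copy()
    residual = y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(d):
            # Correlation of feature j with the partial residual
            # (the residual with feature j's contribution added back).
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            # Keep the residual consistent with the updated coefficient.
            residual -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta


def elastic_net_path(X, y, lambdas, alpha=1.0):
    """Solve for a decreasing sequence of penalties, warm-starting each fit."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        beta = elastic_net_cd(X, y, lam, alpha, beta=beta)
        path.append(beta)
    return np.array(path)


# Toy problem (an assumption for the demo): 5 features, 2 of them relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=50)

# Smallest lasso penalty at which every coefficient is zero.
lam_max = np.abs(X.T @ y).max() / len(y)
path = elastic_net_path(X, y, lam_max * np.array([1.0, 0.3, 0.1, 0.01]))
```

Each row of `path` holds the coefficients for one penalty value. Starting from the largest penalty and reusing the previous solution as the starting point is what makes computing the whole path cheap in this scheme.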
Article
Decoding patterns of neural activity onto cognitive states is one of the central goals of functional brain imaging. Standard univariate fMRI analysis methods, which correlate cognitive and perceptual function with the blood oxygenation-level dependent (BOLD) signal, have proven successful in identifying anatomical regions based on signal increases during cognitive and perceptual tasks. Recently, researchers have begun to explore new multivariate techniques that have proven to be more flexible, more reliable, and more sensitive than standard univariate analysis. Drawing on the field of statistical learning theory, these new classifier-based analysis techniques possess explanatory power that could provide new insights into the functional properties of the brain. However, unlike the wealth of software packages for univariate analyses, there are few packages that facilitate multivariate pattern classification analyses of fMRI data. Here we introduce a Python-based, cross-platform, and open-source software toolbox, called PyMVPA, for the application of classifier-based analysis techniques to fMRI datasets. PyMVPA makes use of Python's ability to access libraries written in a large variety of programming languages and computing environments to interface with the wealth of existing machine learning packages. We present the framework in this paper and provide illustrative examples on its usage, features, and programmability.
Article
Modular toolkit for Data Processing (MDP) is a data processing framework written in Python. From the user's perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures. Computations are performed efficiently in terms of speed and memory requirements. From the scientific developer's perspective, MDP is a modular framework, which can easily be expanded. The implementation of new algorithms is easy and intuitive. The new implemented units are then automatically integrated with the rest of the library. MDP has been written in the context of theoretical research in neuroscience, but it has been designed to be helpful in any context where trainable data processing algorithms are used. Its simplicity on the user's side, the variety of readily available algorithms, and the reusability of the implemented units make it also a useful educational tool.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.