Multivariate Analysis of the Vector Boson Fusion Higgs Boson
Brendan Marsh University of Missouri August 8, 2016
Ph.D. Student Supervisor: Antonio De Maria
Supervisor: Prof. Dr. Arnulf Quadt
Abstract
A multivariate analysis is presented for the study of the vector boson
fusion (VBF) Higgs boson decaying to a pair of tau leptons. While the VBF
production mechanism of the Higgs is roughly an order of magnitude lower
in cross section than the dominant gluon-gluon fusion mechanism, it is
shown that VBF produces a distinctive signature that is well suited for
detection by multivariate analyses. A number of discriminant variables are
explored in addition to a direct comparison of different machine learning
toolkits. Ultimately, a statistical significance of 7.9 is achieved for detection
of the VBF Higgs boson in this truth level study.
Contents
1. Motivation and Background
1.1 The Higgs Boson
1.2 Vector Boson Fusion
1.3 Fully Hadronic Decay Mode
1.4 Background Processes
2. Multivariate Analysis
2.1 Monte Carlo Samples
2.2 Preselection Cuts
2.3 Cut Based Analysis
2.4 Decision Trees
2.5 Adaptive Boosting
2.6 Discriminant Variables
2.6.1 Collinear Approximation
2.6.2 Tau Centrality Product
2.6.3 $\eta$ Variables
2.6.4 Tau-Jet Angular Correlations
2.6.5 Fox-Wolfram Moments
2.6.6 MVA Variables
2.7 TMVA Multivariate Analysis
2.8 Scikit Learn Multivariate Analysis
3. Conclusions
3.1 Outlook for VBF Higgs Analysis
3.2 Suggestions for Future Studies
3.3 Thanks!
References
1. Motivation and Background
1.1 The Higgs Boson
Within the context of the Standard Model (SM),
the Higgs mechanism is necessary for the mass
generation of the W and Z gauge bosons. By
invoking a break in electroweak symmetry, the
Higgs mechanism implies the existence of a spin
zero, neutral particle; we know this particle as the
Higgs boson.
For many years, the Higgs remained elusive in particle detectors. It was not until July 4, 2012
that CERN announced that both the CMS and ATLAS experiments at the Large Hadron Collider (LHC)
had met the $5\sigma$ discovery benchmark for a new boson with a mass of roughly 125 GeV that was
consistent with a Higgs boson. It seems the Higgs has finally been found!
Many studies of the Higgs boson are ongoing as Run II of the LHC is currently approaching an
online integrated luminosity of 20 inverse femtobarns. As our studies of the Higgs progress, the vector
boson fusion production mechanism becomes increasingly important as a detection pathway, in CP
violation studies [1], and in other areas.
1.2 Vector Boson Fusion
A Standard Model Higgs boson may be produced via one of four main production mechanisms at the
LHC. In the vector boson fusion (VBF) mechanism, two incoming quarks each radiate a W or Z
(vector) boson. This pair of vector bosons then fuses to produce a low mass Higgs boson, while the
two quarks scatter into the final state.
Figure 1 The elementary particles of the Standard Model, labelled with their mass, charge, and spin.
Figure 2 Left: Feynman diagrams of the four Higgs production mechanisms at the LHC, with vector boson
fusion highlighted in red. Right: Corresponding cross section for Higgs production mechanisms.
One can see from the cross section that the gluon-gluon fusion mechanism is roughly an order of
magnitude greater in cross section than the VBF mechanism for a Higgs of mass 125 GeV [2]. However,
the addition of the two quarks into the final state, visible as highly energetic jets, produces a
distinctive signature that is lacking in gluon-gluon fusion. In terms of measurable quantities,
VBF events may be recognized by the following characteristics:
- Jets highly separated in $\eta$
- Jets in opposite hemispheres
- High invariant mass of jets
- No central jets above a certain $p_T$
1.3 Fully Hadronic Decay Mode
The 125 GeV Higgs boson most often decays into a $b\bar{b}$ pair, however this decay mode is not easily
recovered in a sea of $t\bar{t}$ background [3]. The Higgs additionally may decay into a $\tau^+\tau^-$ pair;
this is the decay mode studied in this analysis. Specifically, I investigate the “fully hadronic” decay
mode in which both tau leptons subsequently decay into a tau neutrino and a number of pions, which
accounts for roughly 41% of $\tau^+\tau^-$ decays [2]. A Feynman diagram of the signal process is given below.
Figure 3 The Feynman diagram of the signal process of this study: Higgs boson production via vector boson
fusion with a subsequent decay into tau leptons, each of which decays into a tau neutrino and pions.
1.4 Background Processes
A bit like searching for a needle in a haystack, the VBF Higgs process is a rare event that is drowned
out by background processes with similar event characteristics and much higher cross sections. To
detect a small signal in a sea of background, one’s goal is to remove as much of the background as
possible while retaining as many signal events as possible. Thus, it is as important to understand
the background processes competing with the signal process as it is to understand the signal process
itself. The main background processes relevant to this study are the $Z \to \tau\tau$ and $t\bar{t}$ processes.
$Z \to \tau\tau$ + jets
According to the Particle Data Group [2], the Z boson decays into a pair of tau leptons with a
branching ratio of roughly 3.4%. As Z bosons are produced in abundance at the LHC, this channel
introduces a large background with the same final state, a pair of tau leptons. Fortunately, there do
exist features of VBF that we expect to differ in the case of $Z \to \tau\tau$. Foremost, the invariant mass
of the reconstructed taus should reflect the mass of the particle from which they came, although mass
reconstruction can be difficult (section 2.6.1). For VBF taus we expect to see the mass of the Higgs,
roughly 125 GeV, while for the $Z \to \tau\tau$ channel we expect a peak around 91 GeV. Additionally, the
distinctive jet topology of VBF is not expected in the $Z \to \tau\tau$ channel.
$t\bar{t}$
Top quarks almost always decay into a W boson and a b quark, and the W boson may then decay into a
tau lepton and a neutrino. Thus, given two top quarks it is possible to have two taus in the final
state. Therefore the $t\bar{t}$ process, also produced in abundance at the LHC, poses another background.
However, there exist a number of features of the $t\bar{t}$ background that make it quite easy to
eliminate. Very often in the final state of the $t\bar{t}$ background there exist jets originating from
b quarks, while this is rare for VBF final states. Fortunately, there exist “b-tagging” algorithms
capable of labelling jets in the detector that most likely arise from b quarks. Thus, we may cut out
events with b jets, leaving $Z \to \tau\tau$ as irreducible background. Additionally, we do not expect to
find any correlations between the tau decay products and the missing transverse energy, unlike VBF in
which they are heavily correlated.
2. Multivariate Analysis
The basic goal of any multivariate analysis (MVA) is to separate signal events from background
events, with as high an efficiency as possible, given some input variables for each event. Most
MVAs take a number of input variables and return a single measure of “signal-likeness”, which must
exceed a certain threshold for the event to be considered a signal event.
Before diving into the multivariate techniques used for this analysis, the training samples used to
develop and test the analysis will be described, along with the traditional cut based analysis for VBF
and reasons why it can be improved using a multivariate analysis.
2.1 Monte Carlo Samples
Monte Carlo simulations provide a powerful tool for studying stochastic processes. Here, Powheg
and Pythia 8 Monte Carlo generators were used to simulate truth level events for both VBF and the
relevant background processes at a centre of mass energy of $\sqrt{s} = 13$ TeV. Using these simulated
events, one may train a multivariate analysis method to be applied to real data. The Monte Carlo
samples used for this study are given below.
It is important to note that this was a truth level study only; no reconstruction or trigger level
effects have been incorporated. These effects are non-negligible and should be incorporated in future
studies.
2.2 Preselection Cuts
A number of cuts may be applied to the events before any classifier is used. Some of these cuts
correspond to limitations of the ATLAS detector (corresponding to events that would not be well
reconstructed in practice) while others are made specifically to remove background events. The
preselection cuts used for this analysis are given below. If any event does not fulfill the criteria, it is
discarded from the analysis.
$p_T^{\tau_{vis}} > 20$ GeV: The transverse momentum of both tau leptons must be at least 20 GeV
to be detected and reconstructed by tau reconstruction algorithms.
$|\eta_\tau| < 2.5$: The absolute value of $\eta$, the pseudorapidity, of each tau lepton must be
less than 2.5 for good reconstruction in the tracker.
MET > 20 GeV: The missing transverse energy should be greater than 20 GeV, as we
expect missing energy from neutrinos in the final state.
$p_T^{lead}, p_T^{sublead} > 20$ GeV: The transverse momentum of the leading and subleading jet
should be greater than 20 GeV to be detected.
No b-tagged jets: B-tagging algorithms can identify jets originating from b quarks, thus b-tagged
jets can be cut to eliminate $t\bar{t}$ background. In truth level studies,
one uses the PDG (Particle Data Group) ID to identify and cut b-jets.
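The preselection above can be sketched in code as a single pass/fail test per event. This is only an illustrative sketch: the `Event` record and its field names are hypothetical and are not taken from the analysis code, and the b-jet count is assumed to have been derived beforehand from truth information.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Event:
    """Hypothetical truth-level event record (field names are illustrative)."""
    tau_pt: Tuple[float, float]    # visible tau transverse momenta [GeV]
    tau_eta: Tuple[float, float]   # tau pseudorapidities
    met: float                     # missing transverse energy [GeV]
    jet_pt: Tuple[float, float]    # leading and subleading jet pT [GeV]
    n_bjets: int                   # number of jets flagged as b-jets


def passes_preselection(ev: Event) -> bool:
    """Return True only if all five preselection requirements hold."""
    return (
        all(pt > 20.0 for pt in ev.tau_pt)             # tau pT > 20 GeV
        and all(abs(eta) < 2.5 for eta in ev.tau_eta)  # |eta_tau| < 2.5
        and ev.met > 20.0                              # MET > 20 GeV
        and all(pt > 20.0 for pt in ev.jet_pt)         # both jets above 20 GeV
        and ev.n_bjets == 0                            # no b-tagged jets
    )


def preselection_efficiency(events: Iterable[Event]) -> float:
    """Fraction of events surviving the preselection."""
    events = list(events)
    if not events:
        return 0.0
    return sum(passes_preselection(ev) for ev in events) / len(events)
```

The efficiency returned by `preselection_efficiency` is the per-sample quantity that is later used to scale the expected number of events after preselection.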
2.3 Cut Based Analysis
The most basic form of classifier, and the one that is often used due to its simplicity and physical
motivation, is a simple cut based analysis. This entails requiring a candidate event to pass a series of
univariate “cuts” which are motivated by knowledge of the signal process. The traditional cuts used to
identify VBF events over background events are given below [4].
$p_T^{lead} > 40$ GeV: VBF produces highly energetic quark jets in the final state, so we expect
to see a leading jet with high transverse momentum.
$p_T^{sublead} > 30$ GeV: There are two quark jets in the final state, thus the subleading jet
should also have high transverse momentum.
$|\eta_{lead} - \eta_{sublead}| > 3$: The jets of VBF have characteristically high separation in
pseudorapidity.
$\eta_{lead} \cdot \eta_{sublead} < 0$: The VBF topology exhibits jets in opposite hemispheres of
the detector.
$m_{jj} > 300$ GeV: The highly energetic jets show a high invariant mass.
Jets-taus $\eta$ centrality: The tau leptons should be detected in the central part of the detector
in comparison to the jets. Explicitly, the pseudorapidity of the taus should lie within the range
spanned by the jets.
The cut based analysis has its advantages; it is very simple to implement, requires no “training” like
the multivariate methods, and the rationale for each of the cuts is grounded in physics. However, while
it excels in its understandability, it often lacks the classification power required to recover rare
processes like the VBF Higgs.
The weakness of the cut based analysis lies in the assumption that each variable can be cut upon
independently of the others when, in fact, the best cut to make on one variable may depend on another,
or even many others. That is, correlations cannot be accounted for. This issue is addressed by
multivariate classification methods like decision trees.
2.4 Decision Trees
Decision trees, like cut based analyses, split events into groups by setting a threshold on some
variable. However, while the cut based analysis only makes a single round of cuts, decision trees
continue to further subdivide groups, separating signal from background more and more at each step
by making the most efficient cut possible. Additionally, the most efficient cuts are calculated
algorithmically from a set of data used to “train” the decision tree.
Figure 4 A simple decision tree. Here orange represents VBF events while blue represents background events.
At each stage, groups become more purely signal or background by splitting on some variable.
The metric that is normally minimized for each split is the Gini impurity of the current group of
events. It is defined as the probability of incorrectly labelling a random event in the group based on
the known distribution of signal and background within the group. For a binary classification problem,
the Gini impurity for a group of events is given by the following formula:
$$I_G = p_{sig} \cdot (1 - p_{sig}) + p_{bg} \cdot (1 - p_{bg})$$
Unlike a cut based analysis, which can only form rectangular signal regions in the variable phase
space, decision trees can be grown to approximate arbitrarily complex decision functions. However,
decision trees, too, are not without their flaws. The intuition of a cut based analysis is lost since the
splits are generated algorithmically. Additionally, it is very easy to grow a tree that is too deep that
begins to train itself to recognize individual points in the training data, becoming artificially complex.
This phenomenon is well known in the field of machine learning, and is commonly known as
overtraining. To address this issue, a technique known as boosting is performed as opposed to older
“pruning” methods which grow full decision trees then backtrack and discard unimportant splits.
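To make the splitting criterion of this section concrete, the following is a minimal sketch of how a single tree node might choose its cut: it evaluates the Gini impurity of both daughter groups for every candidate threshold on one variable and keeps the threshold with the lowest weighted impurity. The function names and the brute-force scan are illustrative assumptions; TMVA and Scikit Learn use their own, more efficient implementations.

```python
def gini(n_sig, n_bkg):
    """Gini impurity of a group holding n_sig signal and n_bkg background events."""
    total = n_sig + n_bkg
    if total == 0:
        return 0.0
    p_sig = n_sig / total
    p_bkg = n_bkg / total
    return p_sig * (1.0 - p_sig) + p_bkg * (1.0 - p_bkg)


def best_split(values, labels, weights=None):
    """Scan thresholds on one variable; labels are 1 (signal) or 0 (background).

    Returns (threshold, impurity), where impurity is the event-weighted
    average Gini impurity of the two daughter groups.
    """
    if weights is None:
        weights = [1.0] * len(values)
    total = sum(weights)
    points = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds sit halfway between neighbouring distinct values.
    for low, high in zip(points, points[1:]):
        threshold = 0.5 * (low + high)
        left = [0.0, 0.0]    # [background weight, signal weight] below the cut
        right = [0.0, 0.0]   # [background weight, signal weight] above the cut
        for value, label, weight in zip(values, labels, weights):
            side = left if value < threshold else right
            side[label] += weight
        impurity = (
            sum(left) * gini(left[1], left[0])
            + sum(right) * gini(right[1], right[0])
        ) / total
        if impurity < best[1]:
            best = (threshold, impurity)
    return best
```

A full tree would apply `best_split` to every input variable, split on the best one, and recurse on the two daughter groups until a depth limit is reached.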
2.5 Adaptive Boosting
Adaptive boosting, or AdaBoost, is a general method that can be applied to a number of
classifiers, such as decision trees, to improve reliability, performance, and resistance to
overtraining. In the context of adaptive boosting of decision trees, the single decision tree is replaced
by a “forest” consisting of hundreds of decision trees which are restricted to only a few levels, such
as the one above. As a whole, this forest of decision trees is called a boosted decision tree (BDT),
and the output of the BDT is a weighted sum of the outputs of each individual tree.
Each individual decision tree is called a “weak learner” in the sense that it is only one of many
classifiers in the forest. Here is where the adaptive boosting comes in; each weak learner is trained
iteratively to improve upon the previous one. The first weak learner is trained as a normal decision
tree from the training data. However, the results of the first weak learner are then used to weight the
importance of the training data for the next weak learner; points that were classified correctly receive
small weights while incorrectly classified points receive large weights. In this way, the next weak
learner is trained focusing on points that have not been classified well by the previous weak learner.
This process continues such that each weak learner focuses on correcting mistakes of the last,
improving at each step. The process is visualised below.
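The reweighting loop described above can be sketched with one-level trees ("stumps") on a single variable. This is the textbook discrete AdaBoost update, shown only to illustrate the idea; it is an assumption that this matches the text qualitatively, and the exact boost weights used by TMVA and Scikit Learn differ in detail. Labels are +1 for signal and -1 for background.

```python
import math


def train_stump(xs, ys, ws):
    """Find the single cut with the lowest weighted error.

    Returns (error, threshold, sign): the stump predicts `sign` for
    x > threshold and `-sign` otherwise.
    """
    best = (float("inf"), None, 1)
    points = sorted(set(xs))
    # One cut below all points, plus one between each pair of neighbours.
    cuts = [points[0] - 1.0] + [0.5 * (a + b) for a, b in zip(points, points[1:])]
    for threshold in cuts:
        for sign in (+1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, ws)
                if (sign if x > threshold else -sign) != y
            )
            if error < best[0]:
                best = (error, threshold, sign)
    return best


def adaboost(xs, ys, n_rounds=10):
    """Train a list of (alpha, threshold, sign) weak learners."""
    n = len(xs)
    ws = [1.0 / n] * n                      # first learner sees unweighted data
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, sign = train_stump(xs, ys, ws)
        error = min(max(error, 1e-12), 1.0 - 1e-12)   # guard the logarithm
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, sign))
        # Misclassified points gain weight, correctly classified points lose it.
        ws = [
            w * math.exp(-alpha * y * (sign if x > threshold else -sign))
            for x, y, w in zip(xs, ys, ws)
        ]
        norm = sum(ws)
        ws = [w / norm for w in ws]
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted sum of the weak learner outputs."""
    score = sum(
        alpha * (sign if x > threshold else -sign)
        for alpha, threshold, sign in ensemble
    )
    return 1 if score > 0 else -1
```

In a toy sample where the signal sits in the middle of the variable range, no single cut separates the classes, but a weighted sum of three stumps already does, which is the behaviour the boosting procedure is designed to produce.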
Figure 5 Training of an AdaBoost classifier. The first classifier trains on unweighted data, then
reweights the data for the next and so on to produce the final classifier.
Figure 6 A view of the transverse plane depicting the collinear approximation. The tau neutrinos
travel collinearly with the tau leptons such that their sum matches the missing transverse energy.
2.6 Discriminant Variables
When training a BDT, a balance should be found between the number of variable inputs to the
BDT and the performance of the BDT. Additionally, while BDTs are known to handle correlated
variables quite well, it is superfluous to include two strongly correlated variables, only one of which
adds discriminatory power to the classification.
Much of my work this summer was spent investigating variables, both common and newly
devised, to search for new discriminating variables for use in a multivariate analysis. The most
important in the analysis was the ditau mass, calculated via the collinear approximation.
2.6.1 Collinear Approximation
In the case of VBF, the mass of the ditau should correspond to the mass of the Higgs, for Z*''
the mass of the Z boson, and for && we expect no clear peak. Thus, there are good physical motivations
for the use of the ditau mass in our MVA. However, in order to fully reconstruct the ditau one needs
the missing neutrinos. The collinear approximation accounts for the missing neutrinos by making the
following assumptions.
1. The tau neutrinos are perfectly collinear with their associated tau lepton.
2. The missing transverse energy is entirely due to the tau neutrinos.
Under these approximations, the magnitude of the neutrino momenta becomes completely determined by
the missing transverse energy. One is then left with a simple matter of constructing the neutrinos
collinearly with the taus such that the sum of the neutrinos is precisely the missing transverse
energy.
The collinear approximation is not always applicable; when the tau leptons are emitted back to back
in $\phi$, the two tau directions no longer span the transverse plane and it is impossible to
decompose the missing transverse energy uniquely along them. This leads to a simple constraint
between the taus:
$$\cos\Delta\phi > -0.99$$
Historically, the collinear approximation has relied upon using the charged decay products of the
tau leptons, be it either 1-prong or 3-prong decays. However, the decay products may also include a
neutral pion. Recently, tau substructure algorithms have become available that allow for reconstruction
of the entire visible (charged + neutral) tau [5]. One of my first studies was on the marked improvement
in the collinear approximation as a result of using the entire visible tau.
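The collinear reconstruction described in this section can be sketched as follows. Under the two assumptions, the missing transverse energy is written as a sum of two vectors pointing along the visible taus, which gives a 2x2 linear system for the neutrino transverse momenta. This is a minimal sketch that treats the visible taus as massless and rejects events with a negative solution; both choices, as well as the function names, are assumptions made for illustration.

```python
import math


def four_vector(pt, eta, phi):
    """Massless four-vector (E, px, py, pz) from pT, eta and phi."""
    px, py, pz = pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)
    return (math.sqrt(px * px + py * py + pz * pz), px, py, pz)


def collinear_mass(tau1, tau2, met_x, met_y, min_cos_dphi=-0.99):
    """Ditau mass in the collinear approximation.

    tau1, tau2: (pT, eta, phi) of the visible taus.
    Returns None when the approximation is not applicable.
    """
    (pt1, eta1, phi1), (pt2, eta2, phi2) = tau1, tau2
    if math.cos(phi1 - phi2) <= min_cos_dphi:
        return None                       # back-to-back taus
    det = math.sin(phi2 - phi1)
    if abs(det) < 1e-9:
        return None                       # parallel taus: system is singular
    # Solve MET = nu1 * (cos phi1, sin phi1) + nu2 * (cos phi2, sin phi2).
    nu1 = (met_x * math.sin(phi2) - met_y * math.cos(phi2)) / det
    nu2 = (met_y * math.cos(phi1) - met_x * math.sin(phi1)) / det
    if nu1 < 0.0 or nu2 < 0.0:
        return None                       # unphysical neutrino momentum
    # Collinear neutrinos share eta and phi with their tau, so pT simply adds.
    full1 = four_vector(pt1 + nu1, eta1, phi1)
    full2 = four_vector(pt2 + nu2, eta2, phi2)
    e, px, py, pz = (a + b for a, b in zip(full1, full2))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))
```

Passing the full visible tau (charged plus neutral components) as `tau1` and `tau2`, rather than the charged decay products alone, corresponds to the improvement discussed above.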
Figure 7 The collinear approximation using the charged tau leptons (left) and the full visible tau leptons (right).
The blue histograms represent VBF and red represents combined backgrounds scaled appropriately. All
distributions normalized to unity, and units are in GeV.
As you can see, there is a remarkable improvement using tau substructure techniques to
reconstruct the visible tau. In future studies, I suggest applying smearing of the transverse momentum
or otherwise modelling imprecision in the detector to see if the collinear approximation remains as
robust as it is in this truth study. Needless to say, this variable made it to the final MVA.
2.6.2 Tau Centrality Product
In the context of VBF topology, centrality has been used as a flag indicating whether or not a tau
lepton is centrally located in the detector with respect to the jets. Explicitly, a tau lepton is central if
its pseudorapidity lies in the range spanned by the leading and subleading jet. To generalize this
binary variable to a continuous variable, which is more powerful in multivariate analyses, the
following definition has been suggested [6].
$$C_\tau = \exp\left[-\left(\frac{\eta_\tau - \eta_{avg}}{\Delta\eta}\right)^2\right] \quad \text{where} \quad \eta_{avg} = \frac{\eta_{lead} + \eta_{sublead}}{2}, \quad \Delta\eta = \eta_{lead} - \eta_{sublead}$$
A perfectly central tau lepton (with exactly the average $\eta$ of the jets) will have a centrality of
one, while a tau lepton far from the average $\eta$ of the jets will have centrality close to zero. Note
that if the jets are not well separated in $\eta$, the centrality also approaches zero.
The authors of this continuous centrality variable used the centrality of the two taus as independent
variables. However, I found the two variables to have an 88% positive correlation for VBF. By taking
the product of the two tau centralities, a single uncorrelated variable is achieved with greater
separation power than either of the individual centralities.
$$C_{prod} = C_{\tau_1} \cdot C_{\tau_2} = \exp\left[-\left(\frac{\eta_{\tau_1} - \eta_{avg}}{\Delta\eta}\right)^2 - \left(\frac{\eta_{\tau_2} - \eta_{avg}}{\Delta\eta}\right)^2\right]$$
Figure 8 The centrality of the individual tau leptons (left and centre) vs. the product of tau centrality (right).
Given the redundancy of the correlated variables and increased separation power of the product
variable, it was the centrality product variable that made it to the final multivariate analysis.
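As a small worked example, the continuous centrality and its product can be computed directly from the pseudorapidities of the taus and jets. The sketch assumes the Gaussian form $C_\tau = \exp[-((\eta_\tau - \eta_{avg})/\Delta\eta)^2]$ with $\eta_{avg}$ the mean jet pseudorapidity and $\Delta\eta$ the jet separation; returning zero for jets with identical $\eta$ is a convention chosen here to match the limiting behaviour noted in the text.

```python
import math


def tau_centrality(eta_tau, eta_lead, eta_sublead):
    """Continuous centrality of one tau with respect to the two jets."""
    eta_avg = 0.5 * (eta_lead + eta_sublead)
    delta_eta = eta_lead - eta_sublead
    if delta_eta == 0.0:
        return 0.0          # jets not separated in eta: centrality taken as zero
    return math.exp(-((eta_tau - eta_avg) / delta_eta) ** 2)


def centrality_product(eta_tau1, eta_tau2, eta_lead, eta_sublead):
    """Product of the two tau centralities, used as a single MVA input."""
    return (
        tau_centrality(eta_tau1, eta_lead, eta_sublead)
        * tau_centrality(eta_tau2, eta_lead, eta_sublead)
    )
```

With this form, a tau lying exactly on one of the jets has a centrality of $e^{-1/4} \approx 0.78$, so the variable falls off smoothly rather than switching between 0 and 1 like the binary flag.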
2.6.3 $\eta$ Variables
Variables explicitly related to the pseudorapidity of the leading and subleading jets are common in
analyses of the VBF Higgs, including the cut based analysis already presented. On the surface, these
variables seem well suited to multivariate analysis as well given their separation power. However, I
found that these traditional VBF variables are highly correlated with the invariant mass of the jets.
Figure 9 $\Delta\eta$ (centre) and $\eta_{lead} \cdot \eta_{sublead}$ (right) of the leading and subleading
jets, along with their correlations to the invariant mass of the jets (left).
Given the strong correlations within this group of variables, I was not surprised to find that
eliminating $\Delta\eta$ and $\eta_{lead} \cdot \eta_{sublead}$ from the MVA led to no decrease in performance
of the BDT. The invariant mass of the jets displayed the greatest separation power (see figure 11),
thus, despite their prevalence in traditional VBF studies, I have chosen to exclude $\Delta\eta$ and
$\eta_{lead} \cdot \eta_{sublead}$ from the final analysis.
2.6.4 Tau-Jet Angular Correlations
The Higgs boson is a spin 0 particle; Z bosons are spin 1 particles. My Ph.D. supervisor and I were
interested in whether or not this difference in spin quantum number manifests itself in angular
correlations between the tau leptons themselves or between tau leptons and the leading and
subleading jet. A number of variables were investigated, boosted into different reference frames,
probing any angular correlations.
[Plots of selected angular variables: Taus dR, Tau 1 Phi Centrality, Jets Plane / Taus Plane Angle,
Jets Plane Eta; y-axis: Events.]
Selected Angular Variables
Taus $\Delta R$: The $\Delta R$ separation of the two tau leptons.
Taus $\phi$ Centrality: The same as the continuous tau centrality variable, but in $\phi$ instead of $\eta$.
Jets-Taus Plane Angle: Total angle between the two planes formed by the tau leptons and the jets.
Jets Plane $\eta$: $\eta$ of the normal vector to the plane formed by the two jets.
The angular relationships amongst the tau leptons and jets, beyond the expected VBF jet topology,
seem to be subtle if existent at all. While the $\Delta R$ of the taus above shows modest separation,
inclusion in the MVA yielded no improvement, and unfortunately the angle between the tau plane and jet
plane seems indistinguishable between VBF and background. Boosting to various centre of mass reference
frames generally had little effect on separation power.
2.6.5 Fox-Wolfram Moments
The Fox-Wolfram moments are a set of event descriptors that are currently under investigation as
more advanced replacements for traditional cuts [7]. The moments arise from superpositions of
spherical harmonics, defined as follows.
$$H_l^x = \sum_{i,j} w_{i,j}^x \, P_l(\cos\Omega_{i,j})$$
Above, the sum goes over any number of objects in the event (such as the leading and subleading
jet for the VBF topology), $\Omega_{i,j}$ corresponds to the total angle between the i’th and j’th
objects, and $P_l$ are the Legendre polynomials. The weight term $w_{i,j}^x$ may take many forms, such as
a unit weight or a transverse momentum weight.
A preliminary study of the Fox-Wolfram moments in the analysis of VBF has shown that the
moments display considerable separation power; however, including them in the multivariate analysis
has not improved the classification efficiency. Included below are plots of two sets of Fox-Wolfram
moments. On the left, only the leading and subleading jets were considered, and the best weight was
found to be the unit weight. On the right, both tau leptons are also included as objects into the moment
calculations, for which the transverse momentum weighting scheme was found to be best.
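For illustration, a Fox-Wolfram moment of order $l$ can be computed from a list of objects as sketched below. The normalisation of the weights, $w_{i,j} = x_i x_j / (\sum_k x_k)^2$ with $x = 1$ (unit weight) or $x = p_T$ (transverse momentum weight), is an assumption taken from the common convention and may not match the exact weights used in this study; the function names are likewise illustrative.

```python
import math


def legendre(order, x):
    """Legendre polynomial P_order(x) via the Bonnet recursion."""
    if order == 0:
        return 1.0
    previous, current = 1.0, x
    for n in range(1, order):
        previous, current = current, ((2 * n + 1) * x * current - n * previous) / (n + 1)
    return current


def momentum(pt, eta, phi):
    """Three-momentum (px, py, pz) from pT, eta and phi."""
    return (pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta))


def cos_angle(a, b):
    """Cosine of the total angle between two three-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return max(-1.0, min(1.0, dot / norm))


def fox_wolfram(objects, order, weight="unit"):
    """Fox-Wolfram moment of the given order.

    objects: list of (pT, eta, phi); weight: "unit" or "pt".
    """
    vectors = [momentum(*obj) for obj in objects]
    xs = [1.0 if weight == "unit" else obj[0] for obj in objects]
    norm = sum(xs) ** 2
    total = 0.0
    for x_i, v_i in zip(xs, vectors):
        for x_j, v_j in zip(xs, vectors):
            total += x_i * x_j * legendre(order, cos_angle(v_i, v_j))
    return total / norm
```

With this normalisation the zeroth moment is always one, and for two back-to-back objects of equal weight the odd moments vanish while the even moments equal one.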
Figure 10 The first four Fox-Wolfram moments considering only jets, with a unit weighting (left). The first four
Fox-Wolfram moments considering jets and tau leptons, with transverse momentum weight (right).
While only the first four moments are displayed here for brevity, the odd and even moments
are highly correlated though distinct. Unfortunately, my time has run short to fully investigate
the Fox-Wolfram moments as potentially useful discriminating variables in the multivariate
analysis. For future studies, I would suggest exploring the “modified” Fox-Wolfram moments,
which are invariant under Lorentz boosts, and any correlations that may exist between the
moments and the MVA variables already in use.
2.6.6 MVA Variables
The final list of variables for use in the multivariate analysis was pruned down starting with roughly
ten variables that showed the strongest separation power. After identifying correlations and removing
variables that led to no improvement in classification efficiency, the following variables remain in the
final analysis.
$m_{\tau\tau}$: The invariant mass of the ditau, reconstructed via the collinear approximation using
the full visible tau leptons.
$m_{jj}$: The invariant mass of the leading and subleading jets.
$C_{\tau_1} \cdot C_{\tau_2}$: The product of the centrality of the two tau leptons.
$p_T^{lead} + p_T^{sublead}$: The scalar sum of the transverse momenta of the leading and subleading jets.
$p_T^{lead+sublead}$: The transverse momentum of the vector sum of the leading and subleading jets.
[Correlation matrices (signal and background) of the variables ditauMass, mjj, sumPT, PTsum and
tausCentrality; linear correlation coefficients in %.]
Figure 11 Discriminatory variables for the multivariate analysis. The blue histograms represent VBF and red
represents combined backgrounds scaled appropriately. All distributions normalized to unity, masses and
momenta are in units of GeV.
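The three jet-based inputs can be computed from the leading and subleading jet kinematics as sketched below. The jets are treated as massless, and the dictionary keys follow the plot labels; mapping `sumPT` to the scalar sum and `PTsum` to the transverse momentum of the vector sum is an assumption based on the names alone.

```python
import math


def four_vector(pt, eta, phi):
    """Massless four-vector (E, px, py, pz) from pT, eta and phi."""
    px, py, pz = pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)
    return (math.sqrt(px * px + py * py + pz * pz), px, py, pz)


def jet_variables(lead, sublead):
    """Jet inputs to the MVA from two jets given as (pT, eta, phi)."""
    e1, px1, py1, pz1 = four_vector(*lead)
    e2, px2, py2, pz2 = four_vector(*sublead)
    e, px, py, pz = e1 + e2, px1 + px2, py1 + py2, pz1 + pz2
    return {
        # invariant mass of the dijet system
        "mjj": math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0)),
        # scalar sum of the two transverse momenta
        "sumPT": lead[0] + sublead[0],
        # transverse momentum of the vector sum of the two jets
        "PTsum": math.hypot(px, py),
    }
```

For two jets that are well separated in $\eta$ and back to back in $\phi$, `mjj` and `sumPT` are large while `PTsum` is small, which is why the two momentum sums carry different information.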
2.7 TMVA Multivariate Analysis
This multivariate analysis was performed at a centre-of-mass energy of $\sqrt{s} = 13$ TeV and at an
integrated luminosity of 20 inverse femtobarns, corresponding roughly to current Run II conditions at
the LHC. The ROOT analysis framework (or my preference, the Python adaptation PyROOT) provides
a toolkit for multivariate analysis known as TMVA [8]. This toolkit was utilized to train a boosted
decision tree using the discriminant variables presented in section 2.6.6. I was interested in comparing
the performance of TMVA with the well-known Python machine learning library Scikit Learn. To this
end, a boosted decision tree was optimized in TMVA and compared with an identically parameterized
boosted decision tree trained in Scikit Learn.
Optimization of the BDT parameters in TMVA was carried out by scanning single parameters, such as
the number of trees or the tree depth, one at a time. A full multivariate sweep over parameter
settings and variables was simply too computationally expensive and out of the scope of this
project. Should one like to take this analysis to the next step, I would recommend performing such a
multivariate sweep over BDT parameter settings. The final configuration of the BDT parameters that
were found to be important is given to the left.
When training and testing any multivariate method, one must be careful to weight the training data
correctly; while we have a similar amount of training data for both the VBF and background processes,
in reality the number of background events is much larger than the number of signal events. Thus a
weight needs to be applied to events from each process to correct for their relative abundance.
$$w = \frac{N_{exp.}}{N_{MC}} = \frac{\sigma \, \mathcal{L}_{exp.}}{N_{MC}}$$
where $N_{MC}$ is provided by the Monte Carlo sample.
Cross sections were determined for each Monte Carlo sample from the TWiki cross section
summaries of the MC15 samples for Run II analyses. Given these cross sections and an integrated
luminosity of 20 inverse femtobarns, the expected number of events may be calculated. Additionally,
the percentage of events that pass the preselection criteria presented earlier may be calculated per
sample, and then applied to determine the expected number of events after preselection.
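Process | Cross Section [pb] | Events at $\mathcal{L} = 20\ \mathrm{fb}^{-1}$ | Events (Preselected)
VBF | $9.993941 \times 10^{-2}$ | 1,999 | 398
$Z \to \tau\tau$ | $1.950632 \times 10^{3}$ | 39,012,642 | 1,148,098
$t\bar{t}$ | $4.515915 \times 10^{2}$ | 9,031,830 | 26,394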
As was expected from eliminating b-tagged jets, the $t\bar{t}$ background is more than decimated, leaving
$Z \to \tau\tau$ as the main background. Roughly speaking, the signal to combined background ratio is a
staggering 1:3000!
The metric for defining the optimal cut value of the classifier is the statistical significance defined
as follows, where “s” is the number of signal events and “b” the number of background events. For a
Poisson random variable, the standard deviation is defined as the square root of the total number of
events, $\sqrt{s + b}$. Then, the following statistical significance measures the number of signal
events relative to one standard deviation.
$$\text{Statistical Significance} \equiv \frac{s}{\sqrt{s + b}} \approx \frac{s}{\sqrt{b}} \quad \text{for } b \gg s$$
Thus, this definition of the statistical significance can either be interpreted as the number of signal
events relative to one standard deviation or, if b is much larger than s, as is usual, the number of signal
events over the background fluctuation level.
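The search for the optimal cut can be sketched in plain Python as follows. This is an illustrative scan under stated assumptions, not the TMVA implementation: the function names are hypothetical, every observed classifier score is tried as a candidate cut, and events with a score at or above the cut are kept.

```python
import math


def significance(s, b):
    """Return s / sqrt(s + b), or 0.0 when no events pass the cut."""
    total = s + b
    return s / math.sqrt(total) if total > 0 else 0.0


def best_cut(sig_scores, sig_weights, bkg_scores, bkg_weights):
    """Scan candidate cuts and return (max significance, cut value).

    s and b are the weighted numbers of signal and background events
    with classifier score >= cut.
    """
    best = (0.0, None)
    for cut in sorted(set(sig_scores) | set(bkg_scores)):
        s = sum(w for x, w in zip(sig_scores, sig_weights) if x >= cut)
        b = sum(w for x, w in zip(bkg_scores, bkg_weights) if x >= cut)
        z = significance(s, b)
        if z > best[0]:
            best = (z, cut)
    return best


# Toy scores and weights, invented purely for illustration.
toy_sig_scores = [0.20, 0.10, 0.15, -0.05]
toy_sig_weights = [100.0] * 4
toy_bkg_scores = [-0.10, 0.00, 0.05, 0.12]
toy_bkg_weights = [1000.0] * 4

z_max, cut = best_cut(toy_sig_scores, toy_sig_weights,
                      toy_bkg_scores, toy_bkg_weights)
print(f"max significance {z_max:.2f} at cut {cut}")

# Significance of the preselected sample before any classifier cut.
print(f"no cut: {significance(398, 1174098):.3f}")
```

The quadratic scan is fine for a toy example; for large samples one would sort the scores once and accumulate the weighted sums.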
The TMVA output classifier along with the optimal cut value after training a boosted decision tree
using the parameters given above is shown below.
[Figure: TMVA overtraining check for classifier BDT. Normalized distributions, (1/N) dN/dx, of the BDT response for signal and background in the test and training samples. Kolmogorov-Smirnov test: signal (background) probability = 0.008 (0.016).]
[Figure: Cut efficiencies and optimal cut value. Signal efficiency, background efficiency, signal purity, signal efficiency*purity and significance S/sqrt(S+B) versus the cut value applied on the BDT output. For 398 signal and 1174098 background events the maximum S/sqrt(S+B) is 7.9024 when cutting at 0.1453.]
The final statistical significance of the classifier reaches 7.9, although the significance curve becomes
noisy, most likely because of statistical fluctuations among such heavily weighted background events. By any
interpretation, the statistical significance is roughly 6 at minimum. The full interpretation
of the outcome will be discussed in the conclusion.
2.8 Scikit Learn Multivariate Analysis
Scikit Learn (SKL) is a free, general machine learning library for Python [9]. Given its popularity
and ease of use, I was interested to see how SKL compares to TMVA in terms of final classifier
efficiency, ease of use, and configurability.
SKL supports all of the machine learning methods implemented by TMVA and many more, and in
the case of boosted decision trees supports many of the same configuration options. However,
unlike TMVA, SKL does not directly provide the user with plots (classifier output distributions,
optimum cuts, correlation matrices) via a nice GUI. Code had to be written to randomize the training and
test samples, to view the output classifier distribution, to calculate the maximum statistical
significance, and to perform other tasks.
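A minimal sketch of the bookkeeping described above is shown here. It runs on toy Gaussian data rather than the samples of this study, and the choice of AdaBoost, the hyperparameters (`max_depth=3`, `n_estimators=100`, `learning_rate=0.5`) and all variable names are assumptions for illustration, not the configuration used for the comparison.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_per_class = 2000
n_features = 5  # stand-ins for discriminant variables; toy values only

# Toy "signal" and "background" samples drawn from shifted Gaussians.
X = np.vstack([
    rng.normal(1.5, 1.0, size=(n_per_class, n_features)),
    rng.normal(0.0, 1.0, size=(n_per_class, n_features)),
])
y = np.concatenate([
    np.ones(n_per_class, dtype=int),
    np.zeros(n_per_class, dtype=int),
])

# Randomize and split the events: half for training, half for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=True, stratify=y, random_state=0
)

# Boosted decision tree: AdaBoost over shallow trees (assumed settings).
bdt = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
bdt.fit(X_train, y_train)

# Classifier output on the test sample, histogrammed per class.
scores = bdt.decision_function(X_test)
edges = np.linspace(scores.min(), scores.max(), 21)
sig_hist, _ = np.histogram(scores[y_test == 1], bins=edges, density=True)
bkg_hist, _ = np.histogram(scores[y_test == 0], bins=edges, density=True)

test_accuracy = bdt.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The two histograms play the role of the classifier output distributions that TMVA plots automatically. Physical event weights are left out of the training here to keep the sketch simple; they can be supplied through the `sample_weight` argument of `fit`.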
For a direct comparison of TMVA and SKL, a boosted decision tree was trained in SKL with
the same parameters as were used for TMVA. The resulting output classifier is given below.
[Figure: SKL output classifier distributions for signal and background. Max. Statistical Significance: 3.5]
SKL performed worse in many regards. As
can be seen from the shape of the output
classifier distributions, there is much more overlap
between signal and background even when
trained identically to TMVA. The resulting maximum
statistical significance (shown as the green line,
not to scale) is roughly only half of that achieved
by TMVA. Additionally, SKL took almost five
times longer to train the BDT.
3. Conclusions
3.1 Outlook for VBF Higgs Analysis
Overall, the development of a multivariate analysis for the detection of a VBF Higgs boson
decaying to a pair of tau leptons with subsequent hadronic decays was quite successful. A theoretical
basis was developed to understand the signal process and main backgrounds at play. With only a few
basic preselection cuts, the vast majority of the $t\bar{t}$ background was eliminated, leaving the $Z \to \tau\tau$ process
as the main background. From knowledge of the underlying physics, a number of candidate
discriminant variables were explored for use in the multivariate analysis. Deserving of special attention
is the reconstructed ditau mass using the collinear approximation, which has shown very promising
improvements in mass resolution with the introduction of a tau substructure reconstruction algorithm.
Some of the variables typically associated with vector boson fusion, such as the distinctively large
separation in pseudorapidity of the leading and subleading jet, were found to be highly correlated and
did not make it into the final analysis. Both TMVA and Scikit Learn were used to train boosted decision
trees; TMVA provided faster results with better classification power, and a convenient interface for
producing plots. The final statistical significance of the VBF signal reached 7.9.
Many aspects of the study, including the final statistical significance, must be kept in context. First
and foremost, the entire study was performed purely at truth level: no trigger-level effects were
accounted for, no detector effects beyond simple preselection cuts on pseudorapidity ranges were
included, and no reconstruction-level effects were considered. These effects may be important
and should be taken into account in further analyses. Additionally, every algorithm, in particular
the b-tagging, tau ID and tau substructure algorithms, has an associated efficiency. At truth level, these
efficiencies are not modelled, and they will further decrease performance at reconstruction level.
Nevertheless, I hope that this multivariate analysis serves as a useful proof of concept for a full scale
multivariate analysis in which all of the above issues are addressed. Finally, I hope this study has
provided insight into the nature of the vector boson fusion production pathway of the Higgs and into
associated variables that may be used in the analysis.
3.2 Suggestions for Future Studies
The collinear approximation performed surprisingly, perhaps suspiciously, well once the entire
visible tau was used as opposed to only the charged tau products. It is possible that the collinear
approximation is in fact valid much of the time; however, I strongly suspect that
it will not work as well on reconstructed data. One way this could be studied within a truth-level study is
by “smearing” the transverse momentum of all objects in the event (adding zero-mean Gaussian noise)
to simulate reconstruction inaccuracy and observing how well the collinear approximation holds
up. Additionally, one could test explicitly how collinear the neutrinos are with their respective tau leptons
by studying the angular separation between the neutrino and the tau at truth level.
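Both checks suggested above can be sketched with the Python standard library. The function names, the 10% relative resolution and the use of $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$ as the measure of angular separation are assumptions made for illustration; a realistic smearing would use object- and momentum-dependent resolutions.

```python
import math
import random


def smear_pt(pt, relative_resolution, rng):
    """Add zero-mean Gaussian noise with sigma = relative_resolution * pt.

    The result is clamped at zero so that a large downward fluctuation
    cannot produce a negative transverse momentum.
    """
    noise = rng.gauss(0.0, relative_resolution * pt)
    return max(0.0, pt + noise)


def delta_r(eta1, phi1, eta2, phi2):
    """Angular separation in the (eta, phi) plane, with phi wrapped to [-pi, pi]."""
    dphi = math.remainder(phi1 - phi2, 2.0 * math.pi)
    return math.hypot(eta1 - eta2, dphi)


rng = random.Random(42)

# Smear a 50 GeV object many times with an assumed 10% resolution.
smeared = [smear_pt(50.0, 0.10, rng) for _ in range(10_000)]
mean_pt = sum(smeared) / len(smeared)
print(f"mean smeared pT: {mean_pt:.2f} GeV")

# Separation between a hypothetical truth tau and its neutrino.
print(f"Delta R: {delta_r(1.20, 0.50, 1.22, 0.53):.3f}")
```

Rerunning the mass reconstruction on smeared inputs for several resolution values would show how quickly the collinear mass degrades.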
While there were over 750,000 $Z \to \tau\tau$ events and over 6,000,000 $t\bar{t}$ events in the Monte Carlo
samples, only about 40,000 background events in total survived the preselection cuts, and only half of those
were used to train the boosted decision tree, while the other half was used for testing. In
comparison, over 300,000 VBF events made it past preselection to the multivariate analysis stage.
Although the initial number of background events is very large, far more could have been
used in training the BDT. For further Monte Carlo studies, I would suggest increasing the
statistics, at least for the $Z \to \tau\tau$ background, to at least a couple of million events to ensure that enough
events make it past preselection to the BDT training.
The Fox-Wolfram moments have shown promising separation power, and may be very powerful
given a correct tuning to the VBF topology. In this study, moments calculated using just the leading
and subleading jet were experimented with in addition to a few studies using both the jets and the two
tau leptons. Further analyses may explore different combinations of objects to use in the moments,
perhaps even a third jet or no jets at all, in addition to finding the optimal weighting term to use.
Additionally, there exist modified Fox-Wolfram moments that are invariant under Lorentz boosts, which
may provide clearer results. In any case, it will need to be demonstrated that the Fox-Wolfram moments
provide new information about the event that is not contained in the five variables presented for the
analysis in this study if they are to be useful in a multivariate analysis.
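For orientation, a Fox-Wolfram moment of order $l$ can be written as $H_l = \sum_{i,j} W_{ij} P_l(\cos\Omega_{ij})$, where $P_l$ is a Legendre polynomial and $\Omega_{ij}$ is the angle between objects $i$ and $j$. The sketch below uses the momentum-magnitude weight $W_{ij} = |p_i||p_j| / (\sum_k |p_k|)^2$, which is one conventional choice and not necessarily the weighting used in this study; the function names are hypothetical and all momenta are assumed to be non-zero.

```python
import math


def legendre(l, x):
    """Legendre polynomial P_l(x) computed with Bonnet's recurrence."""
    if l == 0:
        return 1.0
    p_prev, p = 1.0, x
    for n in range(1, l):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p


def fox_wolfram(momenta, l):
    """H_l for a list of (px, py, pz) three-momenta.

    Uses W_ij = |p_i||p_j| / (sum_k |p_k|)^2, an assumed weighting;
    the double sum includes the i == j terms.
    """
    norms = [math.sqrt(px * px + py * py + pz * pz) for px, py, pz in momenta]
    total = sum(norms)
    h = 0.0
    for p_i, n_i in zip(momenta, norms):
        for p_j, n_j in zip(momenta, norms):
            cos_omega = sum(a * b for a, b in zip(p_i, p_j)) / (n_i * n_j)
            cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
            h += n_i * n_j / total ** 2 * legendre(l, cos_omega)
    return h


# Two back-to-back objects of equal momentum, e.g. idealized forward jets.
back_to_back = [(0.0, 0.0, 10.0), (0.0, 0.0, -10.0)]
for order in range(5):
    print(order, round(fox_wolfram(back_to_back, order), 6))
```

For this back-to-back configuration the odd moments vanish and the even moments equal one, which gives a quick sanity check before moving to other object combinations or weights.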
3.3 Thanks!
I can’t express my gratitude enough for the opportunity to study here in Göttingen for the
summer; it has been an eye-opening and truly enjoyable experience to live abroad and get a taste of
particle physics. To everyone within the institute, thank you for your kindness and help over the
summer; you’re all brilliant physicists and even better people. Finally, I have to thank my Ph.D.
student supervisor Antonio De Maria for organizing a great project for me to work on, for his help
whenever it was needed, and his fantastic taste in music.
References
[1] “Test of CP Invariance in vector-boson fusion production of the Higgs boson using the Optimal
Observable method in the ditau decay channel with the ATLAS detector”.
arXiv:1602.04516v1
[2] K.A. Olive et al. (Particle Data Group), Chin. Phys. C, 38, 090001 (2014).
[3] “Search for the $b\bar{b}$ decay of the Standard Model Higgs boson in associated (W/Z)H
production with the ATLAS detector”. arXiv:1409.6212v2
[4] “Prospects for the Search for a Standard Model Higgs Boson in ATLAS using Vector Boson
Fusion”. arXiv:hep-ph/0402254v1
[5] “Reconstruction of hadronic decay products of tau leptons with the ATLAS experiment”.
arXiv:1512.05955
[6] “Evidence for the Higgs-boson Yukawa coupling to tau leptons with the ATLAS detector”.
arXiv:1501.04943
[7] “Fox-Wolfram Moments in Higgs Physics”. arXiv:1212.4436
[8] A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. von Toerne, and H. Voss, TMVA -
Toolkit for Multivariate Data Analysis, PoS ACAT 040 (2007), arXiv:physics/0703039
[9] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.