Chapter 5: First evidence of a Higgs boson decay to a pair of muons
5.6 VBF category
5.6.3 The final Deep Neural Network for the VBF category
Bkg efficiency   Bkg yield (DY + VBF-Z)   Signal yield (VBF + ggH)
                                          BDT             DNN
0.1              339.36 ± 5.98            8.21 ± 0.05     8.59 ± 0.05
0.01             33.96 ± 1.65             3.03 ± 0.03     3.31 ± 0.03

Table 5.2: The expected signal and background yields for different background-efficiency working points for the preliminary BDT and the DNN.
Figure 5.18: Scheme of the 4-fold training, validation and evaluation procedure.
such that the relative weight of any process in the background class is preserved (i.e., 𝑡𝑡¯ is still rarer than Drell-Yan). Similarly, the signal weights are modified such that the sum of weights of the signal events equals the total sum of weights of all background processes. This re-weighting is done so that the network roughly learns the relative "strength" (cross section) of each process, without giving too much importance to the background over the signal (since the signal weighted to its cross section is ∼100 times smaller than the background). Additionally, all input features to the DNN are standardized: the sample mean (calculated from the training set) is subtracted from each value (per feature) and the result is divided by the sample standard deviation (per feature, also calculated from the training set). As a result, the new input variable distributions (for the train/validation/test subsets) have a mean of 0 and a standard deviation of 1.
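As an illustration, a minimal sketch of this re-weighting and standardization is given below, assuming per-event weights `w`, binary labels `y` (1 for signal, 0 for background), and feature arrays `x_train`/`x_test`; the variable names are illustrative and not taken from the analysis code.

```python
import numpy as np

def reweight_signal(w, y):
    """Scale the signal weights so that their sum equals the total background
    weight; relative weights of processes within each class are preserved."""
    w = w.copy()
    sum_bkg = w[y == 0].sum()
    sum_sig = w[y == 1].sum()
    w[y == 1] *= sum_bkg / sum_sig
    return w

def standardize(x_train, x_test):
    """Standardize each feature with the mean and standard deviation computed
    on the training set only, then apply the same transformation elsewhere."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    return (x_train - mean) / std, (x_test - mean) / std
```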
The input variables used for the DNN are (for descriptions, see Section 5.6.1):
• m(μμ), Δm(μμ), Δm(μμ)/m(μμ)
• m(jj), log m(jj)
• R(pT)
• z*
• cos(θ_CS), cos(φ_CS)
• Δη(jj)
• N_5^soft
• H_T2^soft
• min(|η(j1) − η(μμ)|, |η(j2) − η(μμ)|)
• pT(μμ), log pT(μμ), η(μμ)
• pT(j1), pT(j2), η(j1), η(j2), φ(j1), φ(j2)
• qgl(j1), qgl(j2)
• year: the data-taking period.
The signal region distributions for these variables are shown in Figs. 5.19 to 5.26.
Figure 5.19: Transverse momentum distribution of the dimuon system after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
Figure 5.20: Dimuon mass uncertainty Δm(μμ) after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
Figure 5.21: Transverse momentum distributions for leading (top) and subleading (bottom) jets after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
The neural network training is performed in multiple steps. Four networks are first optimized independently, with different inputs and for different tasks. These independent networks target different backgrounds or event topologies, and are described by the following goals (a schematic sketch of one such sub-network follows the list):
1. signal vs electroweak Z (VBF-Z): The VBF-Z process is the most signal-like background, and a dedicated network with 3 hidden layers is trained to separate the signal from this background.
2. signal vs DY: A network with 3 hidden layers is dedicated to separating the signal from DY events.
3. mass-independent signal vs background: A network with 3 hidden layers is trained with the 22 input variables that are uncorrelated with the dimuon mass. This network is trained against all three backgrounds.
4. mass and mass-resolution signal vs background: A network with two hidden layers is trained using only three input variables: the dimuon mass and its
Figure 5.22: Pseudorapidity distributions for leading (top) and subleading (bottom) jets after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
Figure 5.23: Dijet invariant mass distributions after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
absolute and relative resolutions. This network is trained against all three backgrounds.
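A minimal sketch of how one such task-specific sub-network could be defined is given below, using Keras; the layer widths and names are illustrative assumptions, not the exact configuration used in the analysis.

```python
from tensorflow.keras import layers, Model

def build_subnet(n_inputs, n_hidden=3, n_nodes=64, dropout=0.2, name="subnet"):
    """One task-specific sub-network: n_hidden dense layers with dropout,
    followed by a sigmoid output used for its individual pre-training."""
    inputs = layers.Input(shape=(n_inputs,), name=f"{name}_in")
    x = inputs
    for i in range(n_hidden):
        x = layers.Dense(n_nodes, activation="relu", name=f"{name}_dense{i}")(x)
        x = layers.Dropout(dropout, name=f"{name}_drop{i}")(x)
    out = layers.Dense(1, activation="sigmoid", name=f"{name}_out")(x)
    return Model(inputs, out, name=name)

# For example, the mass-independent network (3) and the mass + resolution
# network (4); networks 1 and 2 would be built analogously with their own inputs.
net_massindep = build_subnet(22, n_hidden=3, name="massindep")
net_massres = build_subnet(3, n_hidden=2, name="massres")
```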
The last hidden layers of all four networks are then merged and combined to build the final classifier. A final, fifth network with 2 hidden layers is fine-tuned on top of this combination in two steps:
Figure 5.24: Distributions of the Zeppenfeld variable z* after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
Figure 5.25: Distributions of the transverse momentum balance R(pT) after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
• Step 1: the node weights of networks 1, 2, and 3 are fixed to the values obtained in their individual pre-training, while the weights of network 4 are free to float.
• Step 2: the node weights of network 3 are also released and free to float.
This architecture is visually represented in Fig. 5.27. The orange blocks denote the input features, the grey blocks indicate the DNNs optimized for their specific tasks, with their outputs in blue. The last hidden layers of the four networks are merged into a single vector, which is used as the input to the combination (5th) network, whose output is shown in red.
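A schematic Keras sketch of this merging is shown below, assuming the four pre-trained sub-networks `net_vbfz`, `net_dy`, `net_massindep`, and `net_massres` built as in the previous sketch; the last hidden (dropout) layer of each is concatenated and fed to the two-hidden-layer combination network.

```python
from tensorflow.keras import layers, Model

def build_combined(subnets, n_nodes=64, dropout=0.2):
    """Merge the last hidden layers of the pre-trained sub-networks and add
    the fifth (combination) network with two hidden layers on top."""
    inputs = [net.input for net in subnets]
    # layers[-2] is the dropout after the last hidden layer of each sub-network
    hidden = [net.layers[-2].output for net in subnets]
    x = layers.Concatenate(name="merged_hidden")(hidden)
    for i in range(2):
        x = layers.Dense(n_nodes, activation="relu", name=f"comb_dense{i}")(x)
        x = layers.Dropout(dropout, name=f"comb_drop{i}")(x)
    out = layers.Dense(1, activation="sigmoid", name="dnn_score")(x)
    return Model(inputs, out, name="vbf_dnn")

combined = build_combined([net_vbfz, net_dy, net_massindep, net_massres])
```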
Each network has a few dozen nodes in each hidden layer and employs a 20% dropout rate to regularize the model. A batch size of 1024 events is used, except for Step 2 of the final network training, where a batch size of 10240 events is used. Based on the validation
Figure 5.26: QGL output distributions for leading (top) and subleading (bottom) jets after the event selection in the Signal Region for 2016 (left), 2017 (center), and 2018 (right) [150].
loss, the learning rate is gradually decreased over the training epochs from a starting value of 0.05. The loss function used at each step is the binary cross-entropy or log-loss, which corresponds to the maximum likelihood estimator for binary classification problems.
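The two training steps can then be sketched as below; the optimizer choice, number of epochs, and the learning-rate schedule parameters are assumptions, with only the initial learning rate of 0.05, the batch sizes, and the binary cross-entropy loss taken from the description above.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

def set_trainable(net, flag):
    for layer in net.layers:
        layer.trainable = flag

# decrease the learning rate when the validation loss stops improving
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3)

# x_train / x_val are lists with one feature array per sub-network input;
# w_train / w_val are the re-weighted event weights.

# Step 1: networks 1-3 are frozen; network 4 and the combination layers float.
for net in (net_vbfz, net_dy, net_massindep):
    set_trainable(net, False)
combined.compile(optimizer=Adam(learning_rate=0.05), loss="binary_crossentropy")
combined.fit(x_train, y_train, sample_weight=w_train,
             validation_data=(x_val, y_val, w_val),
             batch_size=1024, epochs=50, callbacks=[reduce_lr])

# Step 2: network 3 is also released, and training continues with a larger batch.
set_trainable(net_massindep, True)
combined.compile(optimizer=Adam(learning_rate=0.05), loss="binary_crossentropy")
combined.fit(x_train, y_train, sample_weight=w_train,
             validation_data=(x_val, y_val, w_val),
             batch_size=10240, epochs=50, callbacks=[reduce_lr])
```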
The best trained model is chosen using the estimated Asimov significance [151], which is computed for both the training and validation folds. The minimum of these significances is used to pick the best model. The Asimov significance is given by
Asimov significance = \sqrt{ \sum_{i=1}^{N} 2 \left[ (S_i + B_i) \log\left(1 + \frac{S_i}{B_i}\right) - S_i \right] }   (5.3)

where S_i and B_i are the expected numbers of signal and background events in the i-th bin.
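A direct implementation of Eq. (5.3) on binned expected yields could look like the following sketch.

```python
import numpy as np

def asimov_significance(S, B):
    """Asimov significance summed over bins; S and B are arrays of expected
    signal and background yields per bin."""
    S = np.asarray(S, dtype=float)
    B = np.asarray(B, dtype=float)
    terms = 2.0 * ((S + B) * np.log(1.0 + S / B) - S)
    return float(np.sqrt(terms.sum()))

# e.g. asimov_significance([3.0, 1.5], [10.0, 40.0])
```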
The binning of the DNN output (for the significance calculation and the final template fitting) is constructed as follows, starting from high output scores and moving towards lower scores (an illustrative sketch of the procedure follows the list):
• the bin has to contain at least 0.5 expected signal events,
Figure 5.27: Schematic representation of the DNN architecture: the training procedure involves optimizing for individual tasks, combining the network outputs, and fine-tuning the final model by unfreezing upstream weights.
• the background yield in the bin must have a relative statistical uncertainty smaller than 30%,
• use the smallest bin width that satisfies the two previous conditions.
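An illustrative sketch of this bin construction is given below; the scan step and variable names (`sig_score`, `sig_w`, `bkg_score`, `bkg_w` for the DNN outputs and event weights) are assumptions.

```python
import numpy as np

def build_bins(sig_score, sig_w, bkg_score, bkg_w,
               min_sig=0.5, max_rel_err=0.3, step=0.001):
    """Scan the DNN score from high to low values, closing a bin as soon as it
    contains at least min_sig expected signal events and the background yield
    has a relative statistical error below max_rel_err."""
    edges = [1.0]
    lower = 1.0
    while lower > 0.0:
        lower = round(lower - step, 6)
        s_mask = (sig_score > lower) & (sig_score <= edges[-1])
        b_mask = (bkg_score > lower) & (bkg_score <= edges[-1])
        s_yield = sig_w[s_mask].sum()
        b_yield = bkg_w[b_mask].sum()
        b_err = np.sqrt((bkg_w[b_mask] ** 2).sum())
        if s_yield >= min_sig and b_yield > 0 and b_err / b_yield < max_rel_err:
            edges.append(lower)   # smallest width satisfying both conditions
    if edges[-1] != 0.0:
        edges.append(0.0)         # remaining low-score events form the last bin
    return edges[::-1]            # ascending bin edges
```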
The signal and background DNN output distributions are shown in Fig. 5.28.
Figure 5.28: Plot of the normalized signal and background distributions of the DNN output score. The simulated samples of 2016, 2017, and 2018 are used together.
In an alternative approach, Appendix C describes a mass-agnostic machine-learning strategy for the VBF category. To perform a data-driven fit to the m(μμ) distribution (unlike the template-based fit to the DNN output score chosen for the VBF category), one must develop an m(μμ)-independent discriminator to avoid sculpting the distributions of the backgrounds. However, a simple DNN classifier can easily learn the dimuon mass from the input kinematic variables, even if the mass is not explicitly given as an input to the training. Appendix C describes an adversarial training technique, based on Ref. [152], which was used to develop a mass-agnostic neural network.