Picking a discriminator – BDT vs DNN

Chapter 5: First evidence of a Higgs boson decay to a pair of muons

5.6 VBF category

5.6.2 Picking a discriminator – BDT vs DNN

Figure 5.12: An illustration of the Collins-Soper frame.

Figure 5.13: Variable comparisons between VBF Higgs signal (blue) and Drell-Yan + Electro-weak Z backgrounds (orange): cos(𝜃_{𝐶 𝑆}).

Each variable described in this section has been studied in detail in the Z control region for the three data-taking periods, and any observed differences between the data and the MC were corrected for according to Sections 5.3 and 5.4 (see Figs. 5.14, 5.15, and 5.16). Some residual mis-modelling of the data by the simulation is observed, but it is known to be covered by the uncertainties due to jet energy scale and resolution.

Figure 5.14: Data/MC comparisons for some of the VBF discriminating variables in 2016. (Top) 𝑚(𝑗 𝑗) (left), 𝑝_𝑇(𝑗 𝑗) (center) and 𝜂(𝑗 𝑗) (right). (Bottom) 𝑝_𝑇(𝜇 𝜇) (left) and 𝑀(𝜇 𝜇) (right).

discriminators are described in this section. Based on the performance comparison of these discriminators, the final choice to use a DNN for the VBF category was made. The final version of the DNN discriminator used in this analysis is described later in Section 5.6.3.

5.6.2.1 BDT architecture

A preliminary BDT was trained using simulated samples for the three years all mixed together. The simulated samples used in the training are:

• Signal: VBF 𝐻 → 𝜇 𝜇 and ggH 𝐻 → 𝜇 𝜇 with 𝑚_𝐻 = 125.0 GeV

• Background: Drell-Yan Z and Electro-weak Z

The following input variables were used to train the model (for description, see Section 5.6.1):

Figure 5.15: Data/MC comparisons for some of the VBF discriminating variables in 2017. (Top) 𝑚(𝑗 𝑗) (left), 𝑝_𝑇(𝑗 𝑗) (center) and 𝜂(𝑗 𝑗) (right). (Bottom) 𝑝_𝑇(𝜇 𝜇) (left) and 𝑀(𝜇 𝜇) (right).

• 𝑀_{𝜇 𝜇}, 𝑝_𝑇(𝜇 𝜇) and rapidity 𝑦_{𝜇 𝜇} of the dimuon pair

• 𝑚(𝑗 𝑗)

• 𝑅_𝑝

• 𝑝_𝑇 centrality = (𝑝_𝑇^{𝜇 𝜇} − |\vec{𝑝}_𝑇^{𝑗1} + \vec{𝑝}_𝑇^{𝑗2}|/2) / |\vec{𝑝}_𝑇^{𝑗1} − \vec{𝑝}_𝑇^{𝑗2}|

• Single muon variables: 𝑝_𝑇^{𝜇1}/𝑚_{𝜇 𝜇}, 𝑝_𝑇^{𝜇2}/𝑚_{𝜇 𝜇}, 𝜂^{𝜇1}, 𝜂^{𝜇2}

• Δ𝜂(𝑗 𝑗), Δ𝜙(𝑗 𝑗)

• 𝑧^∗

• 𝑝_𝑇, 𝜂 for the two leading jets

• cos(𝜃_{𝐶 𝑆}), cos(𝜙_{𝐶 𝑆})

• min(|𝜂(𝑗_1) − 𝜂(𝜇 𝜇)|, |𝜂(𝑗_2) − 𝜂(𝜇 𝜇)|), min(|𝜙(𝑗_1) − 𝜙(𝜇 𝜇)|, |𝜙(𝑗_2) − 𝜙(𝜇 𝜇)|)

• Jet multiplicity: 𝑁_{𝑗𝑒𝑡𝑠}

Figure 5.16: Data/MC comparisons for some of the VBF discriminating variables in 2018. (Top) 𝑚(𝑗 𝑗) (left), 𝑝_𝑇(𝑗 𝑗) (center) and 𝜂(𝑗 𝑗) (right). (Bottom) 𝑝_𝑇(𝜇 𝜇) (left) and 𝑀(𝜇 𝜇) (right).

• 𝐻_{𝑇,2}^{soft}

• 𝑁^soft
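A few of the kinematic inputs listed above can be written down directly from their expressions. The following is a minimal sketch, assuming NumPy and hypothetical function names; jet transverse momenta are passed as (p_x, p_y) pairs, and wrapping the azimuthal difference into [0, π] is an assumption that the list itself does not state.

```python
import numpy as np

def pt_centrality(pt_mumu, jet1_pxy, jet2_pxy):
    """p_T centrality as written in the list above:
    (pT(mumu) - |pT_vec(j1) + pT_vec(j2)| / 2) / |pT_vec(j1) - pT_vec(j2)|.
    jet*_pxy hold the (px, py) components of each jet."""
    j1 = np.asarray(jet1_pxy, dtype=float)
    j2 = np.asarray(jet2_pxy, dtype=float)
    numerator = pt_mumu - np.hypot(*(j1 + j2)) / 2.0
    denominator = np.hypot(*(j1 - j2))
    return numerator / denominator

def delta_phi(phi_a, phi_b):
    """Azimuthal separation wrapped into [0, pi] (assumed convention)."""
    dphi = np.abs(np.asarray(phi_a) - np.asarray(phi_b)) % (2.0 * np.pi)
    return np.where(dphi > np.pi, 2.0 * np.pi - dphi, dphi)

def min_delta_eta(eta_j1, eta_j2, eta_mumu):
    """min(|eta(j1) - eta(mumu)|, |eta(j2) - eta(mumu)|)."""
    return np.minimum(np.abs(eta_j1 - eta_mumu), np.abs(eta_j2 - eta_mumu))

def min_delta_phi(phi_j1, phi_j2, phi_mumu):
    """min(|phi(j1) - phi(mumu)|, |phi(j2) - phi(mumu)|), with wrapping."""
    return np.minimum(delta_phi(phi_j1, phi_mumu), delta_phi(phi_j2, phi_mumu))
```

For example, a dimuon system with 𝑝_𝑇 = 100 GeV and two jets with transverse momenta (60, 0) and (−20, 0) GeV gives a centrality of (100 − 40/2)/80 = 1.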

The BDT training is made aware of the dimuon mass resolution by weighting the signal events proportionally to 1/𝜎_{𝜇 𝜇}, where 𝜎_{𝜇 𝜇} is the calibrated per-event dimuon mass resolution. The 𝜎_{𝜇 𝜇} value is not used as an input to the MVA, only as a weighting factor in the training, and the weight is not applied when the MVA score is evaluated. The BDT is trained using the Gradient Boost method. The training is done on 50% of the available simulated events, and the remaining 50% is used for testing. The parameters used for the BDT training are as follows:

• Number of trees = 1000

• Minimum node size = 3%

• Shrinkage = 0.10

• Bagged sample fraction = 0.5

• Number of cuts = 30

• Maximum depth = 4

• Transformation for inputs: (I, N)

• Separation type: Cross-Entropy

5.6.2.2 DNN architecture

A preliminary DNN was trained using simulated samples for the three years all mixed together. The simulated samples used in the training are:

• Signal: VBF 𝐻 → 𝜇 𝜇 and ggH 𝐻 → 𝜇 𝜇 with 𝑚_𝐻 = 125.0 GeV

• Background: Drell-Yan Z and Electro-weak Z

The following 26 input variables were used to train the model (for description, see Section 5.6.1):

• 𝜂(𝜇 𝜇), 𝑀_{𝜇 𝜇}, 𝛿𝑀(𝜇 𝜇), 𝛿𝑀(𝜇 𝜇)/𝑀(𝜇 𝜇), 𝑝_𝑇(𝜇 𝜇) of the dimuon pair

• 𝑚(𝑗 𝑗), 𝜂, 𝜙 and 𝑝_𝑇 of the leading dijet pair

• Δ𝜂(𝑗 𝑗)

• Mass, 𝜂 and 𝜙 of the leading dijet+dimuon pair

• 𝑧^∗

• 𝑝_𝑇, 𝜂 and QGL for the two leading jets

• cos(𝜃_{𝐶 𝑆})

• Δ𝜂(𝜇 𝜇, 𝑗_1), Δ𝜂(𝜇 𝜇, 𝑗_2), Δ𝜙(𝜇 𝜇, 𝑗_1) and Δ𝜙(𝜇 𝜇, 𝑗_2)

• 𝐻_{𝑇,5}^{soft}

Figure 5.17: BDT vs DNN performance. (Left) A ROC curve comparison of the performances of the preliminary BDT and DNN. (Right) A zoomed-in version of the ROC curve. The DNN performs slightly better than the BDT.

The network was trained with 3 hidden layers and 100 nodes per layer. The Adam optimizer was used with a learning rate of 10⁻⁵. The tanh activation function was used for the inner layers and the sigmoid activation was used for the final output. The loss function used was the binary cross-entropy. A 20% drop-out rate and batch normalization were also used to regularize the training. 60% of the events were used for training and validation, and the remaining 40% were used for testing. The events were weighted during training according to their cross-section.
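The layer structure described here can be illustrated with a plain NumPy forward pass. This is only a sketch under stated assumptions: the function names and the weight initialization are hypothetical, and drop-out, batch normalization and the Adam update are deliberately left out, so this is not the training code used in the analysis.

```python
import numpy as np

N_INPUTS, N_HIDDEN, N_LAYERS = 26, 100, 3  # sizes quoted in the text

def init_params(rng):
    """Random parameters for 3 hidden layers of 100 nodes and 1 output node.
    The 1/sqrt(n_in) scaling is an assumed initialization."""
    sizes = [N_INPUTS] + [N_HIDDEN] * N_LAYERS + [1]
    return [
        (rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """tanh on the hidden layers, sigmoid on the single output node."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    logit = (h @ w + b)[:, 0]
    return 1.0 / (1.0 + np.exp(-logit))

def weighted_bce(score, label, weight, eps=1e-12):
    """Binary cross-entropy averaged with per-event weights."""
    score = np.clip(score, eps, 1.0 - eps)
    per_event = -(label * np.log(score) + (1.0 - label) * np.log(1.0 - score))
    return np.sum(weight * per_event) / np.sum(weight)

rng = np.random.default_rng(0)
params = init_params(rng)
scores = forward(params, rng.normal(size=(8, N_INPUTS)))  # 8 toy events
n_params = sum(w.size + b.size for w, b in params)
```

With 26 inputs, this layout has 26·100 + 100 + 2·(100·100 + 100) + 100 + 1 = 23001 trainable parameters, not counting any batch-normalization parameters.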

The weights for background events were rescaled to 𝒪(1) in such a way that the relative weight of each process in the background class was preserved (i.e., Electro-weak Z is still rarer than Drell-Yan). Similarly, the signal weights were rescaled such that the sum of weights of signal events was equal to the sum of weights of all background processes. Additionally, all input features to the DNN were standardized to have a mean of 0 and a standard deviation of 1. Finally, the training was done on events with 𝑚_{𝜇 𝜇} ∈ [115, 135] GeV.
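The preprocessing steps in this paragraph can be sketched as follows. The function names are hypothetical, and two details are assumptions the text does not specify: the background weights are brought to order one by dividing by their mean, and the standardization uses unweighted means and standard deviations.

```python
import numpy as np

def rescale_weights(weights, is_signal):
    """Rescale the background weights by one common factor (so relative process
    weights are preserved), then scale the signal weights so that their sum
    equals the background sum."""
    w = np.asarray(weights, dtype=float).copy()
    sig = np.asarray(is_signal, dtype=bool)
    w[~sig] /= w[~sig].mean()               # assumed way of reaching O(1)
    w[sig] *= w[~sig].sum() / w[sig].sum()  # balance signal against background
    return w

def standardize(features):
    """Shift and scale each input column to mean 0 and standard deviation 1."""
    x = np.asarray(features, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def in_mass_window(m_mumu, low=115.0, high=135.0):
    """Mask selecting events with the dimuon mass in [115, 135] GeV."""
    m = np.asarray(m_mumu, dtype=float)
    return (m >= low) & (m <= high)

# Toy example: three background events (two common, one rare) and two signal events.
toy_w = rescale_weights([2000.0, 2000.0, 20.0, 0.001, 0.003],
                        [False, False, False, True, True])
```

In the toy example the ratio between the common and rare background weights stays at 100 after rescaling, while the signal and background sums of weights become equal.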

5.6.2.3 Performance comparison

The input variables for the preliminary BDT (Sec. 5.6.2.1) and DNN (Sec. 5.6.2.2) were slightly different. However, the major inputs that contribute most of the discrimination power were the same (𝑀(𝜇 𝜇), 𝑝_𝑇(𝜇 𝜇), Δ𝜂(𝑗 𝑗), etc.). A comparison of the ROC curves of the preliminary BDT and DNN showed that the signal yield of the DNN was higher by ∼5% at 0.1 background efficiency, and by ∼9% at 0.01 background efficiency (see Fig. 5.17 and Table 5.2). Thus, it was concluded that the DNN separates the signal from the background better, and it was chosen as the final discriminator. While it is difficult to quantify exactly why the DNN outperforms the BDT, a simple explanation is that the BDT is too "shallow" to capture all the features in the high-dimensional VBF phase space.
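Comparing two discriminators at a fixed background efficiency amounts to choosing, for each one, the score threshold that keeps the requested fraction of the weighted background and then counting the weighted signal above it. The sketch below illustrates that procedure with a hypothetical function name and toy inputs; it is not the code used to produce the quoted numbers.

```python
import numpy as np

def signal_yield_at_bkg_eff(sig_score, sig_w, bkg_score, bkg_w, bkg_eff):
    """Return (cut, signal yield) for the loosest score cut that keeps at least
    a fraction bkg_eff of the weighted background."""
    sig_score, sig_w = np.asarray(sig_score), np.asarray(sig_w, dtype=float)
    bkg_score, bkg_w = np.asarray(bkg_score), np.asarray(bkg_w, dtype=float)
    order = np.argsort(bkg_score)[::-1]             # highest scores first
    kept = np.cumsum(bkg_w[order]) / np.sum(bkg_w)  # background fraction kept
    idx = min(np.searchsorted(kept, bkg_eff), len(order) - 1)
    cut = bkg_score[order][idx]
    return cut, np.sum(sig_w[sig_score >= cut])

# Toy inputs: ten equally weighted background events, four weighted signal events.
bkg_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
cut, sig_yield = signal_yield_at_bkg_eff(
    [0.95, 0.85, 0.92, 0.50], [1.0, 2.0, 3.0, 4.0],
    bkg_scores, np.ones(10), bkg_eff=0.25,
)
```

Running the same function on the BDT and DNN scores at the same `bkg_eff` gives directly comparable signal yields. With a finite sample the kept background fraction can only be matched to the nearest event, as in the toy case where a request of 0.25 keeps 0.3.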

| Bkg efficiency | Bkg yield (DY + VBF-Z) | Signal yield (VBF + ggH), BDT | Signal yield (VBF + ggH), DNN |
|---|---|---|---|
| 0.1 | 339.36 ± 5.98 | 8.21 ± 0.05 | 8.59 ± 0.05 |
| 0.01 | 33.96 ± 1.65 | 3.03 ± 0.03 | 3.31 ± 0.03 |

Table 5.2: The expected signal and background yields for different background efficiency working points for the preliminary BDT and DNN.
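As a quick arithmetic check, the relative improvements quoted in the text follow from the signal yields in Table 5.2. The short script below only reproduces that division; the variable names are illustrative.

```python
# Signal yields (VBF + ggH) from Table 5.2, keyed by background efficiency.
signal_yields = {
    0.1: {"BDT": 8.21, "DNN": 8.59},
    0.01: {"BDT": 3.03, "DNN": 3.31},
}

def relative_gain(bdt, dnn):
    """Fractional increase in signal yield of the DNN over the BDT."""
    return (dnn - bdt) / bdt

gains = {eff: relative_gain(y["BDT"], y["DNN"]) for eff, y in signal_yields.items()}
for eff, gain in gains.items():
    print(f"bkg efficiency {eff}: DNN gain = {100 * gain:.1f}%")
# prints 4.6% at 0.1 and 9.2% at 0.01
```

These values are consistent with the ∼5% and ∼9% improvements stated in the performance comparison.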
