Background estimation - The 𝑔𝑔 𝐻 𝐻 analysis

Chapter 6: Nonresonant pair production of highly energetic Higgs bosons

6.4 The 𝑔𝑔 𝐻 𝐻 analysis

6.4.2 Background estimation

60 80 100 120 140 160 180 200 220 240 260 280 [GeV]

mSD

0.0 0.5 1.0 1.5

Data / pred.

0 500 1000 1500 2000 2500 3000

Events / 10 GeV

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

CMSSupplementary CR

t t

(13 TeV) 138 fb-1

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.0 0.5 1.0 1.5

Data / pred.

0 500 1000 1500 2000 2500 3000 3500 4000

Events / 0.05

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

CMSSupplementary CR

t t

(13 TeV) 138 fb-1

800 1000 1200 1400 1600 1800

[GeV]

mHH

0.0 0.5 1.0 1.5

Data / pred.

0 100 200 300 400 500 600 700 800 900

Events / 30 GeV

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

CMSSupplementary CR

t t

(13 TeV) 138 fb-1

0.2 0.3 0.4 0.5 0.6 0.7

/mHH T j1

p 0.0

0.5 1.0 1.5

Data / pred.

0 500 1000 1500 2000 2500

Events / 0.025

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

Data tt+jets V+jets, VV

QCD Bkgd. stat. unc.

CMSSupplementary CR

t t

(13 TeV) 138 fb-1

Figure 6.13: The data and expected background distributions from simulation in the𝑡 𝑡 control region (Sec.6.4.2.1) for some of the most discriminating BDT input variables, including the Jet 1𝑚

SD(upper left) and𝑇

Xbb(upper right), 𝑚_{𝐻 𝐻} (lower left), and 𝑝

T j₁/𝑚_{𝐻 𝐻} (lower right). The lower panel shows the ratio of the data and the total background prediction, with its statistical uncertainty represented by the shaded band. The error bars on the data points represent the statistical uncertainties.

as a control region for QCD background estimation, and will be discussed later in Sec.6.4.2.2.

BDT Score Jet2 TXbb Tagger

0.43 0.980

0.11 0.950

0.03

Bin3

Bin2

Bin2 Bin1

Fail

Figure 6.14: The event categories in the𝑔𝑔 𝐻 𝐻analysis are visually represented in a 2D grid of the BDT score and the jet 2𝑇

Xbbscore.

The QCD multijet background will be estimated from the data in control regions using the parametric alphabet fit method [222], and is described in Sec.6.4.2.2. The QCD background prediction from this method also predicts the contribution from gluon fusion and VBF single Higgs boson production (see Sec.6.4.2.2).

The top quark background is predicted using simulation, with several data-driven corrections derived from dedicated control regions and is described in Sec.6.4.2.1.

Finally, the remaining backgrounds like 𝑡 𝑡 𝐻, 𝑉 𝐻, VV, and V+jets are predicted using the simulation.

6.4.2.1 𝑡 𝑡 backgrounds

In this subsection, we describe all corrections applied to the simulation prediction of the top quark backgrounds and their associated uncertainties.

𝑇_Xbbshape correction The distribution of the𝑇

Xbbscore for the𝑡 𝑡background may not be modeled perfectly in the simulation. To correct for this discrepancy, we measured its shape in a control region dominated by semi-leptonic𝑡 𝑡 events. The events in this control region were selected using online single muon or electron triggers. Additionally, events were required to satisfy the following conditions:

• At least one selected electron (Sec.6.3.2) or muon (Sec.6.3.1)

• At least one AK8 jet with 𝑝

T >300 GeV,𝜏

32 <0.46,𝑚

SD > 140 GeV

• Δ𝑅(lepton, AK8 jet) > 1.0

• 𝑝^miss

T > 100 GeV.

The requirements on jet 𝜏

32, jet SD mass (𝑚

SD), and 𝑝miss

T are applied in order to suppress the QCD multijet and W+jets events. The𝑡 𝑡 events make up for 90% of this control region, as can be seen in Figure6.15.

0 50 100 150 200 250 300

soft drop mass (GeV) j1

0.6 0.8 1 1.2

data/mc

0 50 100 150 200 250 300

0 500 1000 1500 2000 2500

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 35 fb^-1 (13 TeV)

0 50 100 150 200 250 300

soft drop mass (GeV) j1

0.6 0.8 1 1.2

data/mc

0 50 100 150 200 250 300

0 500 1000 1500 2000 2500 3000

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 41 fb^-1 (13 TeV)

0 50 100 150 200 250 300

soft drop mass (GeV) j1

0.6 0.8 1 1.2

data/mc

0 50 100 150 200 250 300

0 500 1000 1500 2000 2500 3000 3500 4000 4500

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 59 fb^-1 (13 TeV)

Figure 6.15: The distribution of the AK8 jet SD mass in the semi-leptonic𝑡 𝑡control region after imposing all requirements. The 2016, 2017, and 2018 datasets are shown in the left, center, and right plots, respectively.

In Figure 6.16, we show the distribution of the ParticleNet𝑇

Xbb tagger. The first bin shown here includes an in-situ normalization correction (between 0.0 and 0.8) to offset any discrepancies that might occur due to mismodeling of the lepton trig- ger/identification/isolation efficiencies or 𝜏

32 top tagging efficiencies. Figure 6.17 is a zoomed in version of this plot showing the most important bins. The remaining bins (between 0.8 and 1.0) are used to derive the correction to𝑇

Xbb. We subtract the simulation prediction from the data for all the non top-quark backgrounds and compare the shape of the residual data to the top quark simulation prediction. We then derive correction factors as residual data to 𝑡 𝑡 simulation ratios in different 𝑇Xbbbins. The size of these corrections in different𝑇

Xbbbins varies between a few percent to∼20%, with uncertainties below 2%.

The impact of these corrections on the distribution of the Jet 2𝑇

Xbb score and the event BDT score for the 𝑡 𝑡 background, for the different event categories in the signal region, is shown in Fig.6.18and6.19. These corrections tend to shift the top background towards lower𝑇

Xbband BDT scores, which is desirable.

𝑡 𝑡recoil correction

The recoil of the 𝑡 𝑡 system is mis-modeled by the simulation in the phase space relevant for this analysis. Therefore, we will measure the recoil in-situ using an

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

PNet Xbb tagger j1

0.6 0.8 1 1.2

data/mc

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 1000 2000 3000 4000 5000 6000

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 35 fb^-1 (13 TeV)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

PNet Xbb tagger j1

0.6 0.8 1 1.2

data/mc

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 1000 2000 3000 4000 5000 6000 7000 8000

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 41 fb^-1 (13 TeV)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

PNet Xbb tagger j1

0.6 0.8 1 1.2

data/mc

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 2000 4000 6000 8000 10000 12000

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 59 fb^-1 (13 TeV)

Figure 6.16: The distribution of the jet𝑇

Xbbscore in the 1ℓ+1j𝑡 𝑡control region for 2016 (left), 2017 (middle), and 2018 (right); the first bin includes underflow events in each plot.

0.9 0.92 0.94 0.96 0.98 1

PNet Xbb tagger j1

0.6 0.8 1 1.2

data/mc

0.9 0.92 0.94 0.96 0.98 1

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 35 fb^-1 (13 TeV)

0.9 0.92 0.94 0.96 0.98 1

PNet Xbb tagger j1

0.6 0.8 1 1.2

data/mc

0.9 0.92 0.94 0.96 0.98 1

0 2000 4000 6000 8000 10000

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 41 fb^-1 (13 TeV)

0.9 0.92 0.94 0.96 0.98 1

PNet Xbb tagger j1

0.6 0.8 1 1.2

data/mc

0.9 0.92 0.94 0.96 0.98 1

0 2000 4000 6000 8000 10000 12000 14000 16000 18000

Events

V+jets,VV QCD tt+jets 2L tt+jets 1L Data

CMSPreliminary 59 fb^-1 (13 TeV)

Figure 6.17: The distribution of the jet𝑇

Xbbscore in the 1ℓ+1j𝑡 𝑡control region for 2016 (left), 2017 (middle), and 2018 (right) showing the most important bins.

all-hadronic𝑡 𝑡control region. For this control region, we will require events to have two AK8 jets with each jet satisfying the following conditions:

• Jet 𝑝

T > 450 GeV,

• Jet𝑚

SD > 50 GeV,

• Jet𝑇

Xbb >0.1 (looser than signal region to increase statistics),

• Jet𝜏

32 < 0.46

• Jet contains an AK4 sub-jet that passes the loose WP of the DeepCSV b- tagging algorithm (described in Sec.5.3.6).

These selections result in a pool of𝑡 𝑡events with a purity of about 90%. Figure6.20 shows the 𝑝

Tof the𝑡 𝑡system in this control region.

We subtract all non-𝑡 𝑡backgrounds from the data, and then divide the residual data by the 𝑡 𝑡 simulation prediction to obtain ratios in bins of 𝑝^jj

T. These ratios linearly

Figure 6.18: The distribution of the second jet𝑇

Xbbscore in the signal region for the 𝑡𝑡¯background, for event category 1 (top left), category 2 (top right), and category 3 (bottom). The histograms are normalized to unit area.

increase until about 300 GeV, and then decrease linearly above 300 GeV. We fit these ratios with a “ramp-model“ defined as follows:

• Fit a linear function with a positive slope 𝑝1 and constant 𝑝

0, for 𝑝^jj

T ∈ [0, 300] GeV as: 𝑝

0+𝑝

1× 𝑝

• Fit another linear function with a negative slope 𝑝3 and constant 𝑝

2, for𝑝^jj

T ∈ [300, 1000] GeV as: 𝑝

2+𝑝

3× 𝑝

• Assume a flat correction for 𝑝^jj

T > 1000 GeV :𝑝

2+ 𝑝

3×1000.

The results from this ramp-model fit can be found in Figure6.21. The ramp-model is used to derive correction factors for top quark simulation in various bins of 𝑝^jj

T. The fit uncertainties (shown with the dotted blue line) are propagated as systematic uncertainties for this correction.

Figure 6.22 shows the distribution of the 𝑝

T of the 𝑡 𝑡 system after applying the

“ramp-model” recoil correction and we observe a good level of agreement between the simulation prediction and data. Furthermore, we also check the modeling

Figure 6.19: The distribution of the event BDT score in the signal region for the𝑡𝑡¯ background, for event category 1 (top left), category 2 (top right), and category 3 (bottom). The histograms are normalized to unit area.

0 100 200 300 400 500 600 700 800 900 1000

(GeV)

pT 0

0.5 1 1.5 2

data/mc

0 100 200 300 400 500 600 700 800 900 1000

0 200 400 600 800 1000 1200 1400

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

0 100 200 300 400 500 600 700 800 900 1000

(GeV)

pT 0

0.5 1 1.5 2

data/mc

0 100 200 300 400 500 600 700 800 900 1000

0 200 400 600 800 1000 1200 1400 1600 1800

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 41 fb^-1 (13 TeV)

0 100 200 300 400 500 600 700 800 900 1000

(GeV)

pT 0

0.5 1 1.5 2

data/mc

0 100 200 300 400 500 600 700 800 900 1000

0 500 1000 1500 2000 2500

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 59 fb^-1 (13 TeV)

Figure 6.20: The distribution of the 𝑝

T of the𝑡 𝑡 system is shown for events in the all hadronic top control region, for 2016 (left), 2017 (middle), and 2018 (right) datasets.

of the 𝑝

T of the two AK8 jets individually and the mass of the dijet system in Figures 6.23, 6.24, and 6.25. All distributions show a good level of agreement between the simulation prediction and data.

For the ramp-model correction, we assumed that our QCD simulation is modelled well enough in the hadronic𝑡 𝑡 control region. To support this statement, we per- formed a check. When looking at the 80–120 GeV region in Jet 1𝑚

SD(Figure6.26), the relative contribution of QCD and 𝑡 𝑡 is similar and the data matches the MC

Figure 6.21: This figure shows the results of the “ramp-model“ fit (solid blue line) in the hadronic 𝑡 𝑡 control region, for 2016 (top left), 2017 (top right), and 2018 (bottom) datasets. The black points here represent the residual data (data subtracted with non-top backgrounds) divided by the 𝑡 𝑡 simulation in different bins of 𝑝^jj

T. The fit uncertainties (shown with the dotted blue line) are propagated as systematic uncertainties for this correction in the final signal extraction fit.

prediction relatively well. Thus, this region shows that an arbitrary increase or decrease (let us say by a factor of 2) to the QCD background yield from simulation would most likely be incompatible with the observed data.

Nevertheless, we tried changing the QCD background yield per bin by a factor of 2 and 0.5 and then re-derived the ramp-model corrections. We observed that the impact of varying the QCD background in this manner is already covered by the total uncertainty on the 𝑡 𝑡 recoil correction considered in the nominal fit. Considering these two pieces of evidence together, we can rule that the impact of the QCD background is negligible for this correction.

Top jet mass scale

We would also like to ensure that the top mass scale as modelled by the regressed mass for top jets is well simulated in the hadronic 𝑡 𝑡 control region, defined in Section6.4.2.1. The regressed mass distribution is shown in Figure6.27for different bins of the BDT score. The mass distributions are then fitted to a Gaussian, and the

0 100 200 300 400 500 600 700 800 900 1000

(GeV)

pT 0

0.5 1 1.5 2

data/mc

0 100 200 300 400 500 600 700 800 900 1000

0 200 400 600 800 1000 1200 1400

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

0 100 200 300 400 500 600 700 800 900 1000

(GeV)

pT 0

0.5 1 1.5 2

data/mc

0 100 200 300 400 500 600 700 800 900 1000

0 200 400 600 800 1000 1200 1400

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

0 100 200 300 400 500 600 700 800 900 1000

(GeV)

pT 0

0.5 1 1.5 2

data/mc

0 100 200 300 400 500 600 700 800 900 1000

0 200 400 600 800 1000 1200 1400

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

Figure 6.22: The distribution of the 𝑝

T of the𝑡 𝑡 system is shown for events in the hadronic𝑡 𝑡control region after applying the ramp-model recoil correction, for 2016 (left), 2017 (middle), and 2018 (right) datasets separately.

400 600 800 1000 1200 1400 1600 1800 2000 2200

(GeV)

pT 0

0.5 1 1.5 2

data/mc

400 600 800 10001200 14001600 1800 20002200

0 200 400 600 800 1000

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

400 600 800 1000 1200 1400 1600 1800 2000 2200

(GeV)

pT 0

0.5 1 1.5 2

data/mc

400 600 800 1000 12001400 16001800 2000 2200

0 200 400 600 800 1000

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

400 600 800 1000 1200 1400 1600 1800 2000 2200

(GeV)

pT 0

0.5 1 1.5 2

data/mc

400 600 800 100012001400 1600180020002200

0 200 400 600 800 1000

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

Figure 6.23: The distribution of the𝑝

Tof the AK8 Jet 1 (jet with highest𝑇

Xbbscore) is shown for events in the hadronic𝑡 𝑡 control region after applying the ramp-model recoil correction, for 2016 (left), 2017 (middle), and 2018 (right) datasets separately.

location of the fitted peak is plotted as a function of the event BDT and shown in Figure6.28. We observe that the simulation agrees with the data, and predicts the mass scale very well in all BDT bins.

Modeling of second jet regressed mass in the𝑡 𝑡 control region

As mentioned previously, the variable used for the final signal extraction is the regressed mass of the second jet. We would like to ensure that this variable is well modelled by the simulation prediction in the hadronic𝑡 𝑡 control region. We apply the following corrections and check the distribution of the Jet 2 regressed mass:

• corrections for the𝑇

Xbbscore shape,

• corrections for the recoil of the𝑡 𝑡system, and

400 600 800 1000 1200 1400 1600 1800 2000 2200

(GeV)

pT 0

0.5 1 1.5 2

data/mc

400 600 800 10001200 14001600 1800 20002200

0 200 400 600 800 1000

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

400 600 800 1000 1200 1400 1600 1800 2000 2200

(GeV)

pT 0

0.5 1 1.5 2

data/mc

400 600 800 1000 12001400 16001800 2000 2200

0 200 400 600 800 1000

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

400 600 800 1000 1200 1400 1600 1800 2000 2200

(GeV)

pT 0

0.5 1 1.5 2

data/mc

400 600 800 100012001400 1600180020002200

0 200 400 600 800 1000

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 35 fb^-1 (13 TeV)

Figure 6.24: The distribution of the 𝑝

T of the AK8 Jet 2 (jet with second highest 𝑇Xbb score) is shown for events in the hadronic 𝑡 𝑡 control region after applying the ramp-model recoil correction, for 2016 (left), 2017 (middle), and 2018 (right) datasets separately.

0 500 1000 1500 2000 2500

(GeV) mjj

0 0.5 1 1.5 2

data/mc

0 500 1000 1500 2000 2500

Events

V+jets,VV QCD tt+jets Data CMS Preliminary 137 fb^-1 (13 TeV)

Figure 6.25: The distribution of the dijet mass is shown for events in the top control region using the full Run 2 dataset corresponding to 137 fb⁻¹.

• corrections to the regressed mass scale and mass resolution, as as recom- mended centrally for all CMS data analyses [139].

We observe good agreement between the pre-fit predicted shape and the data, as shown on the left in Fig.6.29. Furthermore, we verified that the residual disagree- ment observed is covered by the jet mass resolution uncertainty, by performing a fit in which we allowed that nuisance parameter to float within its constraints. The post-fit distribution of the jet 2 regressed mass is shown on the right of Figure6.29, where we observe improved shape agreement with data.

Figure 6.26: The Jet 1 𝑚

SD in the 𝑡 𝑡 hadronic control region. The 80–120 GeV region shows that a factor of two increase or decrease to the QCD background would be incompatible with the observed data.

BDT shape uncertainty

Lastly, we also check the event BDT score distribution in the𝑡 𝑡all-hadronic control region. For this, we tighten the selection on the𝜏

32of both AK8 jets to𝜏

32 < 0.39, in order to achieve very high𝑡 𝑡purity in all BDT bins. Figure6.30shows the event BDT shape in the control region, after applying the𝑇

Xbbshape corrections and the 𝑡 𝑡 recoil corrections derived in the previous sections. The left plot shows the BDT distribution in the𝑡 𝑡all-hadronic control region with the nominal binning. The right plot shows the same BDT distribution, but with the last bin split into three bins, to confirm that variations in bin size do not alter the data to simulation agreement.

From the left plot of Figure6.30, we subtract all non-𝑡 𝑡 background from the data and measure a small difference between the residual data and the 𝑡 𝑡 simulation prediction. This small difference is propagated as a shape systematic uncertainty for the𝑡 𝑡background prediction in the final signal extraction fit.

6.4.2.2 QCD

Since QCD is a relatively more difficult process to model accurately with simulation, we would like to estimate it using a data-driven approach from a signal-depleted region. For this purpose, we will use the Fail region, mentioned previously in Sec.6.4.1.3, defined as the phase space of events where Jet 2𝑇

Xbb< 0.95 and BDT

50 100 150 200 250 300 350 400 450 500

regressed mass (GeV) j2

0 0.5 1 1.5 2

data/mc

50 100 150 200 250 300 350 400 450 500

0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 137 fb^-1 (13 TeV)

50 100 150 200 250 300 350 400 450 500

regressed mass (GeV) j2

0 0.5 1 1.5 2

data/mc

50 100 150 200 250 300 350 400 450 500

0 50 100 150 200 250 300 350

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 137 fb^-1 (13 TeV)

50 100 150 200 250 300 350 400 450 500

regressed mass (GeV) j2

0 0.5 1 1.5 2

data/mc

50 100 150 200 250 300 350 400 450 500

0 20 40 60 80 100 120 140

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 137 fb^-1 (13 TeV)

50 100 150 200 250 300 350 400 450 500

regressed mass (GeV) j2

0 0.5 1 1.5 2

data/mc

50 100 150 200 250 300 350 400 450 500

0 20 40 60 80 100 120 140

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 137 fb^-1 (13 TeV)

Figure 6.27: The top mass in different BDT bins (BDT range of top left: 0 - 0.00008;

top right: 0.00008 - 0.0002; bottom left: 0.0002 - 0.0004; bottom right: 0.0004 - 1.0) for the full Run2 dataset. Here, the top recoil correction and the 𝑇

Xbb shape correction are already applied.

score > 0.03 (see Fig.6.14). The fail region is enriched in backgrounds and can be used to extrapolate the shape of the QCD in the signal region as :

𝑝

𝑖 𝑝 𝑎 𝑠 𝑠

QCD (Jet 2𝑚

reg) =𝑇 𝐹^𝑖(Jet 2𝑚

reg) ∗ 𝑝

𝑓 𝑎𝑖𝑙

QCD(Jet 2𝑚

reg) (6.9) where 𝑝

𝑖 𝑝 𝑎 𝑠 𝑠

𝑄𝐶 𝐷 (Jet 2𝑚

reg) and 𝑝

𝑓 𝑎𝑖𝑙

𝑄𝐶 𝐷(Jet 2𝑚

reg) are the QCD regressed mass distributions of Jet 2 in the 𝑖-th pass region (signal event category 1, 2 and 3) and fail regions, respectively. 𝑇 𝐹^𝑖(Jet 2𝑚

reg) is a𝑛-th order polynomial transfer factor for the𝑖-th pass region, as a function of the Jet 2𝑚

reg. This is known as the parametric alphabet fit method. The gluon fusion and VBF single Higgs boson production background exhibits the same shape as the QCD multijet background, as the second jet in those events originate from initial state radiation (ISR), same as the second

0 0.0002 0.0004 0.0006 0.0008 0.001 Event BDT score 150

155 160 165 170 175 180 185 190

Fitted top mass [GeV]

MC Data

Figure 6.28: The fitted top mass (from a Gaussian fit to Figure 6.27) is shown as a function of the event BDT. Here, the top recoil correction and the 𝑇

Xbb shape corrections are already applied.

60 80 100 120 140 160 180 200 220

regressed mass (GeV) j2

0 0.5 1 1.5 2

data/mc

60 80 100 120 140 160 180 200 220

0 500 1000 1500 2000 2500

Events

others QCD tt+jets

Total unc Data

CMSPreliminary 137 fb^-1 (13 TeV)

60 80 100 120 140 160 180 200 220

regressed mass (GeV) j2

0 0.5 1 1.5 2

data/mc

60 80 100 120 140 160 180 200 220

0 500 1000 1500 2000 2500

Events

others QCD tt+jets

Total unc Data

CMSPreliminary 137 fb^-1 (13 TeV)

Figure 6.29: The distribution of the jet 2 regressed mass is shown for events in the top all-hadronic control region, for the full Run 2 dataset corresponding to 137 fb⁻¹. The pre-fit result is shown on the left, and on the right we show a post-fit result where the jet mass resolution uncertainty nuisance parameter was allowed to float within its constraints.

jet in QCD multijet production. Therefore, the QCD background prediction from this method also predicts the contribution from gluon fusion and VBF single Higgs Boson background.

The mass shape of the QCD multijet background in the fail region (𝑝^{𝑓 𝑎𝑖𝑙}

𝑄𝐶 𝐷(Jet 2𝑚

reg)) is defined as the data subtracted with all non-QCD backgrounds as predicted by the simulation (the 𝑡 𝑡 background prediction used here includes all the corrections

0 0.0010.0020.0030.0040.0050.0060.0070.0080.009 0.01

Event BDT

0 0.5 1 1.5 2

data/mc

0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01 1

10 102

103

104

105

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 137 fb^-1 (13 TeV)

0 0.0010.0020.0030.0040.0050.0060.0070.0080.009 0.01

Event BDT

0 0.5 1 1.5 2

data/mc

0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01 1

10 102

103

104

105

Events

V+jets,VV QCD tt+jets Data CMSPreliminary 137 fb^-1 (13 TeV)

Figure 6.30: The event BDT distribution in the 𝑡 𝑡 all-hadronic control region is shown for the data and simulation prediction, for the full Run2 dataset. The binning edges are [0.000-0.00008], [0.00008-0.0002], [0.0002-0.0004], [>0.0004] for the left plot, and [0.000-0.00008], [0.00008-0.0002], [0.0002-0.0004], [0.0004-0.0005], [0.0005-0.0007], [>0.0007] for the right plot. The left plot is used to derive uncertainties for the BDT shape modeling in the𝑡 𝑡simulation.

described in Section6.4.2.1). The Jet 2 regressed mass shape of the QCD multijet background in the different event categories (𝑝

𝑖 𝑝 𝑎 𝑠 𝑠

𝑄𝐶 𝐷 (Jet 2 𝑚

reg)) is then estimated from a simultaneous likelihood fit to the pass and fail regions. The order of the transfer factor for each individual event category is determined by performing a F-test and a goodness-of-fit (GOF) test. We will first describe these tests and then explain the procedure to derive the polynomial order of the transfer factors.

F-Test and goodness-of-fit (GOF) Test

We perform the Fisher test (F-Test) to determine the minimum number of parameters that can describe the transfer factors well. For two fit models, 1 and 2, where model 1 with𝑝

1parameters is “nested” within model 2 with𝑝

2parameters (p2 > p1), F-test determines whether model 2 gives a significantly better fit to the data. For example, a polynomial function of zero-th order is nested within a 1st order polynomial. A random variable X will follow the F-distribution with parameters (𝑑

1, 𝑑

2) 𝑋 = 𝑈

1/𝑑

𝑈2/𝑑

(6.10) where𝑈

1 and𝑈

2 are independent variables that follow chi-squared distributions, with𝑑

1and𝑑

2degrees of freedom, respectively.

For data that is distributed like a Gaussian, we have the chi-squared distribution defined as

𝜒²=∑︁

𝑗

(𝑓_𝑗−𝑑_𝑗)² 𝜎²

𝑗

(6.11) where 𝑑_𝑖 ±𝜎_𝑖 is the i-th measured data point with rms deviation 𝜎_𝑖, and 𝑓_𝑖 is the model prediction. For the same data and model, the likelihood is given by:

L =Ö

𝑖

√︃

2𝜋 𝜎2 𝑖

𝑒⁻⁽^𝑑^𝑖⁻^𝑓^𝑖⁾²^/²^𝜎^𝑖². (6.12)

The goodness-of-fit (GOF) test is a test of the null hypothesis in the absence of an alternate hypothesis. In a situation where we only have the data and a model to fit it with, one can always come up with an alternative model where 𝑓_𝑖 = 𝑑_𝑖 at every measured value. Such a model is called a saturated model. In Eqn.6.12, if we set

𝑓_𝑖 =𝑑_𝑖, the saturated model [79,223,224] becomes:

L𝑠𝑎𝑡 𝑢𝑟 𝑎𝑡 𝑒 𝑑 =Ö

𝑖

√︃

2𝜋 𝜎²

𝑖

. (6.13)

The ratio of the two likelihoods (Eqn.6.12and6.13),𝜆, is then 𝜆=Ö

𝑖

𝑒⁻⁽^𝑑^𝑖⁻^𝑓^𝑖⁾

2/2𝜎2

𝑖, (6.14)

and more importantly,

𝜒²=−2𝑙 𝑛𝜆. (6.15)

Thus, the likelihood ratio asymptotically follows a chi-squared distribution and is a suitable candidate for a GOF test. In case of Poissonian distribution, one can write the GOF test statistic as

−2𝑙 𝑛𝜆=2

𝑛𝑏𝑖 𝑛𝑠

∑︁

𝑗

(𝑓_𝑗 −𝑑_𝑗+𝑑_𝑗×𝑙 𝑛(𝑑_𝑗/𝑓_𝑗)). (6.16)

Eqn.6.16asymptotically follows a chi-squared distribution with𝑛_{𝑏𝑖𝑛𝑠}− 𝑝_𝑖 degrees of freedom, where 𝑝_𝑖 is the polynomial order. Therefore, we can construct the F-statistic (from Eqn.6.10) as,

𝐹 =

−2 log𝜆

1+2 log𝜆

𝑝2−𝑝

−2 log𝜆

𝑛bins−𝑝

. (6.17)

Dalam dokumen Detector Studies for HL-LHC CMS Upgrade (Halaman 185-200)