Chapter VI: Topological HLT development at √
7.4 Background fit results and signal injection
and Z MmaxR
MminR
Z R2max
R2min f(MR, R2)dR2dMR =n(bn)−n
×
Γ
n,bn(M0R−MmaxR )1/n(R20−R2max)1/n
−Γn,bn(MminR −MR0)1/n(R2max−R20)1/n
−Γ
n,bn(MmaxR −MR0)1/n(R2min−R20)1/n +Γ
n,bn(MminR −MR0)1/n(R2min−R20)1/n , (7.5) respectively, whereΓ(a,x)is the incomplete gamma function:
Γ(a,x) = Z ∞
x ta−1e−tdt. (7.6)
where PSM(MR, R2) is the empirical function in Eqn. (7.1) normalized to unity, NSM is the corresponding normalization factor, and θ is the set of background shape and normalization parameters, and the product runs over the N events in the data set. This form of the likelihood is for one background process. The same form of the likelihood is used for the other boxes, for each b-tagged jet multiplicity. The total likelihood in these boxes is computed as the product of the likelihood functions for each b-tagged jet multiplicity.
The fits are performed independently for each box and simultaneously across the b-tagged jet multiplicity bins. Common background shape parameters (b, MR0, R20, and n) are used for the 2 b-tag and ≥3 b-tag bins, since no substantial difference between the two distributions is observed on large samples of tt and V+jets MC events. A difference is observed between 1 b- tag and≥2 b-tag samples, due to the observed dependence of the b-tagging efficiency on the jet pT. Consequently, the shape parameters for the 1 b-tag bins are allowed to differ from the corresponding parameters for the ≥2 b- tag bins. The background normalization parameters for each b-tagged jet multiplicity bin are also treated as independent parameters.
The background shape parameters are estimated from the events in the two sidebands (Section7.2). This shape is then used to derive a background pre- diction in the signal-sensitive region: 30 000 alternative sets of background shape parameters are generated from the covariance matrix returned by the fit. An ensemble of pseudo-experiment data sets is created, generating random (MR, R2) pairs distributed according to each of these alternative shapes. For each bin of the signal-sensitive region, the distribution of the predicted yields in each pseudo-experiment is compared to the observed yield in data in order to quantify the agreement between the background model and the observation. The agreement, described as a two-sided p- value, is then translated into the corresponding number of standard devia- tions for a normal distribution. The p-value is computed using the proba- bility density as the ordering principle. The observed numbers of standard deviations in the two-lepton boxes are shown in Fig. 7.7, as a function of MRand R2. Positive and negative significance correspond to regions where the observed yield is respectively larger and smaller than the predicted one.
Light gray areas correspond to empty bins with less than one event expected
on average. Similar results for the one-lepton and hadronic boxes are shown in Figs. 7.8–7.10. Figures7.11–7.14 illustrate the extrapolation of the fit re- sults to the full (MR, R2) plane, projected onto R2andMRand summed over the b-tagged jet multiplicity bins. No significant deviation of the data from the SM background predictions is observed.
A goodness-of-fit (GOF) p-value may be computed for each fit using a mixed- sample method [185, 186]. In this method, a pseudo-experiment data set is generated with a factor of 10 more events than in data,nMC =10ndata. The two data sets are mixed and for the mixed data set, a test statistic T, is cal- culated by finding the 10 nearest neighbors to each point in the mixed set and then counting how many of these neighbors are from the same sample:
T = 1
nk(ndata+nMC)
ndata+nMC i
∑
=1nk k
∑
=1I(i,k), (7.8) where I(i,k) = 1 if the ith event and its kth nearest neighbor belong to the same sample and I(i,k) = 0 otherwise, andnk = 10 is the number of nearest-neighbor events being considered. A normalized Euclidean metric is used to quantify distances,
|~xi−~xj|2=
∑
D νxνi −xνj σν
!
, (7.9)
where Dis the dimensionality of the dataset,~xi = (x1i, . . . ,xiD) is a random point, andσνis the root mean square of the νthvariate in the dataset. If the two samples, the pseudo-experiment data set and the real data set, have the same parent distribution, then there is a quantity (T−µT)/σT that has a limiting standard normal distribution. The quantities µT and σT are given by
µT = ndata(ndata−1) +nMC(nMC−1)
n(n−1) , (7.10)
n,nlimk,D→∞σT2 = 1 nnk
ndatanMC
n2 +4n2datan2MC n4
!
, (7.11)
wheren =ndata+nMC. The convergence to this limit forσT is fast enough that it is acceptable to use even forD =2,nk =10, andnMC =10ndata. The values of the modified test statistic (T−µT)/σT and the correspond- ing one-sided p-value for each of the sideband fits are tabulated in Tab.7.2.
Table 7.2: Goodness-of-fit test statistic and one-sided p-values for each of sideband fits. The mixed-sample method [185, 186] used to derive these measures is described in text. To evalute the test statistic, all of the data in the sidebands and the signal-sensitive region is used. Therefore, these measures quantify both the agreement between the sideband data and the fit and the agreement between the signal-sensitive-region data and the ex- trapolation.
Box b-tags (T−µT)/σT p-value Two-lepton boxes
MuEle ≥1 0.74 23%
MuMu ≥1 -0.34 63%
EleEle ≥1 -0.73 77%
Single-lepton boxes
MuMultiJet 1 0.43 33%
2 1.61 5%
≥3 -0.62 73%
MuJet 1 0.82 21%
2 -0.26 60%
≥3 1.04 15%
EleMultiJet 1 -1.39 92%
2 0.46 32%
≥3 0.49 31%
EleJet 1 0.25 40%
2 0.55 29%
≥3 1.05 15%
Hadronic boxes
2b-Jet 2 1.96 2%
≥3 -0.60 73%
MultiJet 1 0.68 25%
2 -0.77 78%
≥3 -0.47 68%
To evalute the test statistic, all of the data in the sidebands and the signal- sensitive region is used. Therefore, these measures quantify both the agree- ment between the sideband data and the fit and the agreement between the signal-sensitive-region data and the extrapolation. We find GOF p-values that are consistent with a uniform distribution and suggest a high-level of agreement between the data and the fit plus the extrapolation.
To demonstrate the discovery potential of this analysis, we apply the background- prediction procedure to a simulated signal-plus-background MC sample.
Fig.7.15shows the MR and R2distributions of SM background events and
T1bbbb events (Section 3.8). The gluino and LSP masses are set respec- tively to 1.3 TeV and 50 GeV, representing a new-physics scenario near the expected sensitivity of the analysis. A signal-plus-background sample is obtained by adding the two distributions of Fig. 7.15, assuming an inte- grated luminosity of 19.3 fb−1and a gluino-gluino production cross section of 0.02 pb, corresponding to 78 expected signal events in the signal-sensitive region. The agreement between the background prediction from the side- band fit and the yield of the signal-plus-background pseudo-experiments is displayed in Fig. 7.16-7.18 for different values of the gluino-gluino pro- duction cross section, including 0.003, 0.01, and 0.02 pb. The contribution of signal events to the sideband region has a negligible impact on the de- termination of the background shape, while a disagreement is observed in the signal-sensitive region, characterized as an excess of events clustered around MR ≈ 1.3 TeV. The excess indicates the presence of a signal in this simulated MC sample, and the position of the excess in the MR variable provides information about the underlying SUSY mass spectrum.