4.4 Parametric Fiducialization Methods
4.4.2 Dimensionality Reduction
As we have seen, our goal is to systematically change a single parameter and observe how that restricts or relaxes our final signal region. This ability to order a proposed set of fiducializing cut positions is important for a few reasons.
First, it allows human analyzers to manually explore and enumerate these cuts by hand, both to build an intuition for how various choices of cut behave and to understand the results of our optimizer. As was previously mentioned, construction of a background model in more than a single dimension quickly runs into problems with the sparse nature of our data, especially the γ-model dataset. This is even more of a problem once we move to optimizing the cut locations. With a single fiducializing parameter per detector, our SAE, which we are trying to maximize, and our misidentified background, which serves as a constraint on that maximization, are both ten-dimensional. In the most general case (assuming rectilinear cuts in all dimensions), using all seven discriminating parameters described above, we would have to maximize a seventy-dimensional function (seven parameters for each of the ten detectors) subject to a seventy-dimensional constraint. Even assuming our model datasets contained the extremely high statistics required to construct these functions, actually performing the optimization would be practically intractable. To solve this we turn to a machine learning method called a “decision tree”.
Classifiers take a number of attributes or “input features” of a thing and attempt to assign it to one of a finite number of categories or classes. DTs are also supervised learners, which means that they need to be shown many example elements of each class in order to train themselves. In our case we have two classes: signal and background. Our DT uses the seven discriminating parameters of interest described in section 4.4.1 as input features. The output of our DT is a “score”: a number in [−1, 1]. A very signal-like event would have a score of −1, a very background-like event would have a score of 1, and an event that has an equal probability of being signal and background would be scored at 0. In this sense, our DT acts as our dimensionality reducer: it takes the seven parameters that each have some ability to separate signal and background and produces a single score which is ideally better than any one of the input features at discriminating between signal and background.
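As an illustration of this reduction step, the short sketch below assumes a scikit-learn gradient-boosted classifier (not necessarily the software used in this analysis, and jumping ahead to the boosted ensemble introduced later in this section) with placeholder arrays X and y; the only point is the mapping from seven input features to a single score in [−1, 1].

```python
# Minimal sketch, assuming scikit-learn (not the analysis code itself):
# a tree-based classifier as a dimensionality reducer that maps the seven
# discriminating parameters onto a single score in [-1, 1].
from sklearn.ensemble import GradientBoostingClassifier

def train_reducer(X_train, y_train):
    # X_train: (n_events, 7) array of the seven discriminating parameters
    # y_train: 0 for signal-model events, 1 for gamma-model (background) events
    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train)
    return clf

def bdt_score(clf, X):
    # Map P(background) in [0, 1] onto the [-1, +1] convention of the text:
    # -1 is very signal-like, +1 is very background-like, 0 is ambiguous.
    return 2.0 * clf.predict_proba(X)[:, 1] - 1.0
```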
Although a complete description of decision trees is outside the scope of this thesis, a very brief outline of the method is given below.
4.4.3.1 Single Tree Overview
Decision trees are simple enough that it is conceivable to construct one by hand, given enough time. This is because decision trees are simple collections of single-variable, rectilinear splittings or decisions. As discussed above, our tree is a classifier with two classes $K$: one for signal, $K_S$, and one for background, $K_B$. Given $N$ model events as training data, $N_B$ background and $N_S$ signal, the tree trains itself as follows. For each of the seven inputs discussed in section 4.4.1 the tree proposes an initial splitting cut to separate background and signal. At each proposed splitting it can calculate the probability that an event on each side of the split is a signal or a background event by examining the ratio of events in the training sample:
\[
P_K = \frac{N_K}{N} \qquad (4.23)
\]
where the subscript $K$ can represent signal ($S$) or background ($B$). Using this we can define a metric of the impurity (called the Gini impurity) of each of these samples as:
\[
f_{\mathrm{Gini}} = \sum_{K \in \{S,B\}} P_K \left(1 - P_K\right) = 1 - \sum_{K \in \{S,B\}} P_K^2 \qquad (4.24)
\]
This impurity is calculated for the total sample, $f_{\mathrm{Gini}}^{\mathrm{Total}}$, before the proposed splitting, as well as for the populations on each side of the splitting (which we will call the Left and Right). Finally, the information gain $\Delta G$ is calculated for each of these proposed splittings, where
\[
\Delta G = N f_{\mathrm{Gini}}^{\mathrm{Total}} - N_{\mathrm{Left}} f_{\mathrm{Gini}}^{\mathrm{Left}} - N_{\mathrm{Right}} f_{\mathrm{Gini}}^{\mathrm{Right}} \qquad (4.25)
\]
The splitting that produces the largest information gain is chosen as the first node in our decision tree. The procedure can then be recursively repeated on the Left and Right sub-populations.
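The calculations of equations 4.23 through 4.25 are simple enough to write down directly. The following Python sketch is illustrative only (not the analysis code) and assumes integer labels with 0 for signal and 1 for background.

```python
# Illustrative implementation of Eqs. 4.23-4.25 (a sketch, not the analysis
# code): the Gini impurity of a sample and the information gain of a proposed
# single-variable, rectilinear cut. Labels: 0 = signal, 1 = background.
import numpy as np

def gini(y):
    # f_Gini = 1 - P_S^2 - P_B^2, with P_K = N_K / N (Eqs. 4.23 and 4.24)
    n = len(y)
    if n == 0:
        return 0.0
    p_sig = np.count_nonzero(y == 0) / n
    p_bkg = 1.0 - p_sig
    return 1.0 - (p_sig**2 + p_bkg**2)

def information_gain(x, y, cut):
    # Delta_G = N*f_Gini^Total - N_Left*f_Gini^Left - N_Right*f_Gini^Right (Eq. 4.25)
    left, right = y[x < cut], y[x >= cut]
    return (len(y) * gini(y)
            - len(left) * gini(left)
            - len(right) * gini(right))

# The tree node is the (input feature, cut position) pair that maximizes
# information_gain over all proposed splittings.
```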
The recursion will continue until the sub-populations are all pure signal or pure background. The problem with this is that the tree will be fitting individual events from our models, and will not perform well when scoring events that were not used in the training dataset (these reserved events are usually referred to as “testing events”). To prevent this over-training we can artificially limit the growth of the tree in a process called “pruning”. Two of these pruning methods were used in this analysis. The first is to limit the maximum recursion depth of the tree and is called the “maximum tree depth”. The second is to limit the minimum number of events in a sample (as a percentage of the entire sample) that the DT will perform a split on and is called the “minimum leaf size”. To differentiate these DT-tuning parameters from the normal input parameters of the DT they are usually referred to as “metaparameters”. There are two complications which arise from this pruning.
First, we need a way to choose good values for these metaparameters. Second, because the DT can only perform rectilinear cuts, a single pruned tree will typically produce decision boundaries which are very blocky and coarse. We will address both in the next section.
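For concreteness, the two pruning rules can be expressed as stopping conditions on the recursive growth. The sketch below is purely illustrative; its default values simply echo the metaparameter choices quoted at the end of this section.

```python
# A sketch of the two pruning rules described above, written as stopping
# conditions for the recursive tree growth (illustrative, not the analysis code).
def should_stop(depth, n_node, n_total, max_depth=5, min_leaf_fraction=0.01):
    if depth >= max_depth:                    # "maximum tree depth" reached
        return True
    if n_node < min_leaf_fraction * n_total:  # node below the "minimum leaf size"
        return True
    return False
```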
The solution to both problems is to grow an ensemble of many trees, each trained under slightly different conditions from one another. Each of these trees can “vote” on the classification of an event, and that vote is averaged into a final score. There are many methods that are used to grow this ensemble, but the one used here is a type of boosting called
“gradient boosting”. As a result, we typically refer to our machine learning method as a “boosted decision tree” or BDT. In essence, the first tree of our ensemble is grown as described in the previous section. The next tree is trained slightly differently. All training events that the first tree misclassified are re-weighted to be more important (essentially counting as more than a single event) as the next tree is grown. This allows the next tree to “try harder” to classify them correctly. The procedure is repeated (with the third tree weighted to “try harder” than the second) to construct the entire ensemble.
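The re-weighting picture described above can be sketched schematically as follows. This is a simplified adaptive re-weighting loop rather than the full gradient-boosting machinery; scikit-learn decision trees and the boost_factor constant are illustrative assumptions, not the actual update rule used in the analysis.

```python
# Schematic of the boosting loop described above (illustrative only):
# events misclassified by the current tree are up-weighted before the next
# tree is grown, and the trees' votes are averaged into a final score.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_ensemble(X, y, n_trees=50, max_depth=5, boost_factor=2.0):
    weights = np.ones(len(y))
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=weights)
        misclassified = tree.predict(X) != y
        weights[misclassified] *= boost_factor  # let the next tree "try harder"
        trees.append(tree)
    return trees

def ensemble_score(trees, X):
    # Each tree "votes"; the average vote is the ensemble's score.
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```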
This leaves us with tuning our metaparameters: the maximum tree depth, the minimum leaf size, and the number of trees in our ensemble. This was done by repeatedly splitting our data into training and testing sets, and measuring the classification performance of the resulting BDT at a variety of metaparameter combinations. The result of this is shown in figure 4.10. The values that were settled on were a maximum tree depth of 5, a minimum leaf size of 1%, and 50 trees in the ensemble.
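A metaparameter scan of this kind could be set up as in the sketch below. The use of scikit-learn, the grid values, and the ROC-area scoring metric are illustrative assumptions rather than the exact procedure used to produce figure 4.10.

```python
# Minimal sketch of a metaparameter scan with repeated train/test splits
# (assumed scikit-learn; placeholder grid values).
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [0.005, 0.01, 0.05],  # minimum leaf size as a fraction
    "n_estimators": [25, 50, 100],            # number of trees in the ensemble
}
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    cv=ShuffleSplit(n_splits=10, test_size=0.3),  # repeated train/test splits
    scoring="roc_auc",
)
# search.fit(X, y) would then report the best-performing combination,
# analogous to the scan summarized in figure 4.10.
```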