4.4 Parametric Fiducialization Methods
4.4.4 Bootstrapping
conditions from one another. Each of these trees can “vote” on the classification of an event, and that vote is averaged into a final score. There are many methods used to grow this ensemble, but the one used here is a type of boosting called
“gradient boosting”. As a result, we typically refer to our machine learning method as a “boosted decision tree” or BDT. In essence, the first tree of our ensemble is grown as described in the previous section. The next tree is trained slightly differently: all training events that the first tree misclassified are re-weighted to be more important (essentially counting as more than a single event) as the next tree is grown. This allows the next tree to “try harder” to classify them correctly. The procedure is repeated (with the third tree weighted to “try harder” than the second) to construct the entire ensemble.
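To make the reweighting intuition concrete, the following is a minimal sketch, not the analysis code: it grows an ensemble of scikit-learn decision trees, up-weighting the events each tree misclassifies before the next tree is fit. The feature matrix X, labels y, and the up-weighting factor are illustrative placeholders, and the sketch shows only the reweighting idea rather than the exact gradient-boosting update.

```python
# Minimal sketch of the boosting idea described above: each new tree is fit with
# the events the previous tree misclassified counted more heavily, so it
# "tries harder" on them. X, y, and upweight are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_boosted_ensemble(X, y, n_trees=50, max_depth=5, upweight=2.0):
    weights = np.ones(len(y))                 # every training event starts with equal weight
    ensemble = []
    for _ in range(n_trees):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=weights)
        ensemble.append(tree)
        misclassified = tree.predict(X) != y  # events this tree got wrong...
        weights[misclassified] *= upweight    # ...count as "more than a single event" next time
    return ensemble

def ensemble_score(ensemble, X):
    # each tree "votes" and the votes are averaged into a final score
    return np.mean([t.predict_proba(X)[:, 1] for t in ensemble], axis=0)
```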
This leaves us with tuning our metaparameters: the maximum tree depth, the minimum leaf size, and the number of trees in our ensemble. This was done by repeatedly splitting our data into training and testing sets and measuring the classification performance of the resulting BDT at a variety of metaparameter combinations. The result of this is shown in figure 4.9. The values that were settled on were max depth: 5, min leaf weight: 1%, number of trees: 50.
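The scan itself can be summarized by the sketch below; it is illustrative only, with scikit-learn's GradientBoostingClassifier standing in for the actual BDT implementation and the grid values, X, and y chosen as placeholders around the settled-on values.

```python
# Sketch of the metaparameter scan: repeatedly split the model data into
# training and testing halves and record the test accuracy for each
# (max depth, min leaf weight, ensemble size) combination.
from itertools import product

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def scan_metaparameters(X, y, n_splits=10):
    results = {}
    grid = product([3, 5, 7],            # maximum tree depth
                   [0.005, 0.01, 0.02],  # minimum leaf weight, as a fraction of total weight
                   [25, 50, 100])        # number of trees in the ensemble
    for depth, leaf_frac, n_trees in grid:
        accuracies = []
        for seed in range(n_splits):
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=seed)
            bdt = GradientBoostingClassifier(max_depth=depth,
                                             min_weight_fraction_leaf=leaf_frac,
                                             n_estimators=n_trees)
            bdt.fit(X_tr, y_tr)
            accuracies.append(bdt.score(X_te, y_te))  # classification accuracy on the test half
        results[(depth, leaf_frac, n_trees)] = np.mean(accuracies)
    return results
```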
Figure 4.9: Accuracy of our BDT at classifying an event at various metaparameter values. For our two pruning methods (a) and (b) we would like to be in a region where small changes in the metaparameter value do not strongly affect the resulting accuracy. For the ensemble size (c), more should always be better, but eventually memory usage becomes a constraint [82].
section 4.4.3.2. In this context there arises a slight chicken-and-egg problem. We need to use some subset of our signal and background model datasets to train our BDT. However, we would like to be able to produce scores for all events from all datasets, including the subsets of our calibration and WIMP search data used to train the BDT itself. This has the potential to introduce bias in the scoring of those particular events; after all, as previously discussed, a dataset independent of the training set is required to understand our BDT’s behavior. There is a related problem, and that is one of low statistics. Our background models, especially the one describing our expected γ-sourced interactions, contain a number of very rare, but very important events⁵. As a result, any subset of data that we pick has the potential to miss a number of these very rare, but very important, low-radius γ events. As with the solution to most chicken-and-egg problems, the solution to this is referred to as bootstrapping⁶. Our requirements for this process are pretty simple: we want to utilize all of our data to train our method, but ideally, when scoring a particular event, we use a BDT that was not trained on that particular event. The process also needs to be deterministic. A single interaction⁷ should always produce the same BDT score regardless of the context in which it is used: as part of the testing set used to characterize the trees during their training, as part of a signal or background modeling dataset, or even as just a normal calibration or WIMP-search event used by a third party. So while randomness is going to be an important component, it needs to be tracked. Finally, as a matter of bookkeeping, it would be ideal if the final fiducializing parameter produced by this method ranged over [0,1], with 0 being most signal-like and 1 most background-like. This is for consistency with our definition of qrpart_zhalf; ideally we can re-use the same optimization machinery for both parameters. This process is divided into two parts: training, which uses our model datasets to bootstrap an ensemble of BDTs that are saved for later, and scoring, which uses the saved BDTs to produce a BDT score for any event in the SuperCDMS dataset.

Figure 4.10: The relative number of times a particular input feature is chosen for a splitting, or the “feature importance”, plotted for a single BDT trained on the 50 GeV/c² model datasets for detector IT1Z1. In this example the recoil energy of the events is the most important discriminating parameter, followed by the ionization position estimators. Given that this model is entirely inside of the NR-band, this is consistent with our expectations and acts as a nice cross-check of our BDT behavior.

⁵ Important in the sense that they have a high weight, and are an important contribution to the total expected background.
⁶ To start a computer, the CPU needs to load code from disk. But knowing exactly where to go to get this code, and how to execute it once it is loaded, requires some control logic or code, which is, of course, itself located on disk. The process devised to solve this problem is also referred to as bootstrapping, or simply booting.
⁷ Defined by the unique combination of EventNumber and SeriesNumber.
4.4.4.2 Training
The basic idea with bootstrapping is this: instead of training a single BDT on a random subset of our model datasets and using the remainder of the data to test that BDT’s performance, we repeat this process many times, constructing a number of BDTs, and then average the results. The basic training procedure is as follows (a schematic sketch appears after the list):
1. For each BDT in our bootstrapping ensemble, construct a boolean mask where each entry is set as true or false with a probability of 0.5. This mask is the same size as our entire model dataset and is indexed by a unique event ID consisting of EventNumber and SeriesNumber. The mask is then saved to disk in HDF5 format in such a way that it is unambiguous which BDT it is associated with.
2. Using this mask, our model datasets are split into a training set (a mask value of true) and a testing set (a mask value of false), which are used to train the BDT and to test its performance, respectively.
3. For each individual BDT, the minimum and maximum scores obtained from scoring the testing set are saved.
4. Finally, the BDTs themselves are serialized to disk. It should be noted that the native Python serialization methods found in the standard library’s pickle and cPickle were, as of this analysis, incapable of serializing our decision tree objects. For this we utilized Michael McKerns’ excellent scientific serialization library dill, described here [83].
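The sketch below walks through these four steps under stated assumptions: dataset, feature, and path names are illustrative, `events` is assumed to be a structured array carrying SeriesNumber and EventNumber, and scikit-learn’s GradientBoostingClassifier stands in for the actual BDT implementation.

```python
# Hedged sketch of the bootstrap-training loop above, not the production code.
import os

import dill
import h5py
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_bootstrap_ensemble(events, features, labels, n_bdts=25, outdir="bdt_ensemble"):
    os.makedirs(outdir, exist_ok=True)
    rng = np.random.default_rng(seed=0)           # seed fixed so the masks are reproducible
    for i in range(n_bdts):
        # Step 1: a 50/50 boolean mask, saved to HDF5 keyed by the unique event ID
        # and tied (here via the filename) to the BDT it belongs to.
        mask = rng.random(len(events)) < 0.5      # True -> training set, False -> testing set
        with h5py.File(f"{outdir}/mask_{i:02d}.h5", "w") as f:
            f["SeriesNumber"] = events["SeriesNumber"]
            f["EventNumber"] = events["EventNumber"]
            f["mask"] = mask

        # Step 2: train on the masked-in events, test on the remainder.
        bdt = GradientBoostingClassifier(n_estimators=50, max_depth=5,
                                         min_weight_fraction_leaf=0.01)
        bdt.fit(features[mask], labels[mask])

        # Step 3: record the score range seen on the testing set for later normalization.
        test_scores = bdt.decision_function(features[~mask])
        bounds = (float(test_scores.min()), float(test_scores.max()))

        # Step 4: serialize the BDT (and its bounds) with dill rather than pickle.
        with open(f"{outdir}/bdt_{i:02d}.pkl", "wb") as f:
            dill.dump({"bdt": bdt, "bounds": bounds}, f)
```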
The above training procedure may seem needlessly complex, but all components are required to construct a repeatable, unbiased BDT score. The scoring for a particular interaction, ε, is done as follows (a sketch appears after the list):
1. Load each of the 25 trees, masks, and bounds into memory.
2. For each tree, t, look up the mask value associated with EventNumberε and SeriesNumberε. If the value is true, the score is set to zero and we skip to step 4.
3. If the mask value is false, or the event is not found in the mask (indicating it was not a part of one of our model datasets), the event is given a raw score $s_{\mathrm{raw}}$. This score is then normalized using the max and min bounds found in step 3 of the training as
\[
  s_{\mathrm{norm}} = \frac{s_{\mathrm{raw}} - b_t^{\mathrm{min}}}{b_t^{\mathrm{max}} - b_t^{\mathrm{min}}} \qquad (4.26)
\]
4. The resulting scores from the BDTs that actually scored the event (and were not manually set to zero) are then averaged to produce our final multivariate fiducializing parameter.
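A hedged sketch of this per-interaction scoring loop is given below. The file layout and names follow the training sketch above and are illustrative only; `features` is the single event’s feature vector.

```python
# Sketch of scoring one interaction against the saved bootstrap ensemble.
import dill
import h5py
import numpy as np

def score_event(features, series_number, event_number, n_bdts=25, outdir="bdt_ensemble"):
    scores = []
    for i in range(n_bdts):
        # Step 2: look up this event's mask value for BDT i.
        with h5py.File(f"{outdir}/mask_{i:02d}.h5", "r") as f:
            match = (f["SeriesNumber"][...] == series_number) & \
                    (f["EventNumber"][...] == event_number)
            in_training = bool(f["mask"][...][match].any())
        if in_training:
            continue  # this BDT saw the event during training, so it does not vote

        # Step 3: raw score, normalized with the bounds saved at training time (eq. 4.26).
        with open(f"{outdir}/bdt_{i:02d}.pkl", "rb") as fp:
            saved = dill.load(fp)
        s_raw = saved["bdt"].decision_function(np.asarray(features).reshape(1, -1))[0]
        b_min, b_max = saved["bounds"]
        scores.append((s_raw - b_min) / (b_max - b_min))

    # Step 4: average the votes of the BDTs that actually scored the event.
    return float(np.mean(scores)) if scores else 0.0
```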