To search for the best set of hyperparameters, we formulate an optimization problem. First, we construct a Bayes net structure over the set of hyperparameters in a complex hyperparameter space, and then use d-separation methods to decompose the large space into smaller subsets of hyperparameters that can be optimized independently, thus making the optimization problem feasible in terms of time and memory requirements.
We formulate the hyperparameter optimization problem and discuss the set of steps to reduce the complexity of the tuning process.
5.1.1 Problem Formulation
Let us assume that an ensemble $A$ of models (not limited to machine learning models) is used to solve a problem associated with a complex CPS, $D$. The ensemble has an associated set of hyperparameters $H = (H_1, H_2, \ldots, H_j)$. The performance of the ensemble associated with $D$ is defined by multiple metrics, $L = (L_1, L_2, \ldots, L_k)$. Our task is to identify the values for the set of hyperparameters, $H^*$, that provide the best performance for the ensemble in terms of the metrics in $L$. This is mathematically formulated as the hyperparameter optimization of the ensemble model $A$ with respect to the metrics in $L$:
$$H^* = \operatorname*{argmin}_{h \in H} \mathcal{L}(A, D; h), \qquad (5.1)$$
where $\mathcal{L}$ can be a single scalar evaluated as a weighted combination of the metrics in $L$, or a set of individual equations if the elements in $L$ depend on different subsets of $H$. If the cardinality of $H$, i.e., $|H|$, is large, the joint hyperparameter optimization problem can become computationally very complex and sometimes intractable. We develop an approach for systematically decomposing the hyperparameter space into independent subsets of hyperparameters, so that each subset can be optimized individually.
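For intuition, consider a purely illustrative case (not part of the formulation above) in which $H$ splits into two disjoint subsets $H_1$ and $H_2$ and the objective separates into two terms, each depending on only one subset; the joint search then factors into two smaller, independent searches:

$$H^* = \operatorname*{argmin}_{h_1 \in H_1,\, h_2 \in H_2} \big[\mathcal{L}_1(A, D; h_1) + \mathcal{L}_2(A, D; h_2)\big] = \Big(\operatorname*{argmin}_{h_1 \in H_1} \mathcal{L}_1(A, D; h_1),\ \operatorname*{argmin}_{h_2 \in H_2} \mathcal{L}_2(A, D; h_2)\Big).$$

The rest of this section develops the machinery (a Bayes net over $H \cup L$, d-separation, and the local/global split) used to identify when such a factorization is justified.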
5.1.2 Approach
We adopt a Bayes Net for characterizing the hyperparameter space as a graphical structure, where the links in the graph capture the directed relations between hyperparameters and the evaluation metrics they influence. We then use the concept of d-separation to determine the conditionally independent subsets of hyperparameters that can be optimized independently.
5.1.2.1 Creation of the Bayes Net
Using our knowledge of the ensemble of models and the relations between the performance metrics in the application domain, we derive a Bayes Net that captures the relations between the model hyperparameters and their corresponding evaluation metrics.
Adopting the method discussed in [142], we construct a Bayes Net using a topological order (causes precede effects) to link all $H_i \in H$ to the corresponding performance variables, $L_k \in L$, such that the order is consistent with the directed graph structure. This helps establish the Markov condition, i.e., $P(X_i \mid X_1, X_2, \ldots, X_{i-1}) = P(X_i \mid \mathrm{Parents}(X_i))$, where $X = \{X_1, X_2, \ldots, X_n\} = H \cup L$.
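As a concrete illustration of this step, the sketch below encodes such a Bayes Net structure as a directed acyclic graph with the networkx library; the hyperparameter and metric names are hypothetical placeholders, not drawn from the text, and only the linking pattern and the topological ordering matter.

import networkx as nx

# Hypothetical hyperparameters (H) and evaluation metrics (L); all names are placeholders.
G = nx.DiGraph()
G.add_edges_from([
    ("lr_model1",        "loss_model1"),     # local hyperparameters -> local metrics
    ("depth_model1",     "loss_model1"),
    ("lr_model2",        "loss_model2"),
    ("ensemble_weights", "ensemble_score"),  # a global hyperparameter -> a global metric
    ("loss_model1",      "ensemble_score"),  # local metrics feed the global metric
    ("loss_model2",      "ensemble_score"),
])

# The structure must be a DAG so that a topological order (cause precedes effect) exists.
assert nx.is_directed_acyclic_graph(G)
print(list(nx.topological_sort(G)))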
5.1.2.2 Decomposition of Hyperparameter Space
The hyperparameter optimization problem is further simplified by classifying the hyperparameters in the ensemble models into two primary groups: (1) local and (2) global, based on which evaluation metrics they are linked to. This grouping starts with characterizing local and global metrics.
Definition 3 (Local and Global Metrics) For an ensemble of data-driven models $A$ associated with a system $D$, the evaluation metrics in $L$ may be decomposed into two classes. The metrics that are affected only by the performance of an individual data-driven model are called local metrics, $L_l$. The metrics that are affected by multiple models are called global metrics, $L_h$.
Accordingly, we now divide the hyperparameters based on which metrics they influence, decomposing the hyperparameter space in correspondence with the global and local metrics.
Definition 4 (Decomposition of Hyperparameter Space) For a data-driven ensemble of models, $A$, applied to a system $D$, the existing hyperparameter set $H$ can be decomposed into two main subsets, $H_h$ and $H_l$, using the causal structure [143] of the derived Bayes net. Hyperparameters that are linked to at least one global metric, $L_h$, are termed global hyperparameters, $H_h$. Hyperparameters that are only linked to the local metrics, $L_l$, are called local hyperparameters, $H_l$.
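Under the same illustrative graph, the split of Definition 4 can be computed mechanically from the causal structure: a hyperparameter is global if it is directly linked to at least one global metric, and local otherwise. The function below is a minimal sketch; the node names and the set of global metrics are assumptions carried over from the previous example.

import networkx as nx

def split_hyperparameters(G: nx.DiGraph, hyperparams, global_metrics):
    """Split H into global (H_h) and local (H_l) hyperparameters per Definition 4:
    a hyperparameter is global if it has a direct link to at least one global metric."""
    H_h = {h for h in hyperparams if set(G.successors(h)) & set(global_metrics)}
    H_l = set(hyperparams) - H_h
    return H_h, H_l

# Hypothetical usage, continuing the illustrative graph G from the previous sketch:
# H_h, H_l = split_hyperparameters(
#     G, ["lr_model1", "depth_model1", "lr_model2", "ensemble_weights"],
#     global_metrics={"ensemble_score"})
# -> H_h = {"ensemble_weights"}, H_l = {"lr_model1", "depth_model1", "lr_model2"}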
5.1.2.3 Separation of Hyperparameters: Connected Components and D-separation
Given our Bayes Net that captures the relations between hyperparameters and the metrics, we apply the concepts of connected components and d-separation [144] to identify independent sets of hyperparameters.
5.1.2.3.1 Connected Components
A graph, $G$, is said to be connected if every vertex of the graph is reachable from every other vertex. For our Bayes Net, which is a directed acyclic graph (DAG), we can use a simple Breadth-First Search to first identify the connected components in the Bayes Net. Once we identify the connected components, we use the concept of d-separation by converting all directed links to undirected links and then establishing whether a set of variables $X$ in the graph is independent of a second set of variables $Y$, given a third set $Z$.
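A minimal sketch of this step, again on hypothetical node names: networkx's weakly connected components, which amount to a breadth-first traversal of the graph with edge directions ignored, return the sub-graphs that are then examined with d-separation.

import networkx as nx

# Hypothetical Bayes Net fragment; node names are placeholders.
G = nx.DiGraph([
    ("lr_model1",   "loss_model1"),
    ("lr_model2",   "loss_model2"),
    ("loss_model1", "ensemble_score"),
    ("loss_model2", "ensemble_score"),
    ("lr_model3",   "loss_model3"),   # a model whose metric is purely local
])

# Weakly connected components: reachability ignoring edge direction.
for component in nx.weakly_connected_components(G):
    print(sorted(component))
# One component contains models 1 and 2 together with the global metric;
# model 3 forms its own component and can be tuned in isolation.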
5.1.2.3.2 D-separation
D-separation encompasses a set of well-established rules that determine whether two sets of nodes $X$ and $Y$ are mutually independent when conditioned on a third set of nodes, $Z$. There are four primary rules that can be used to test the independence relation (a sketch of an automated check follows the list below).
1. Indirect Causal Effect $X \rightarrow Z \rightarrow Y$: $X$ can influence the evidence $Y$ if $Z$ has not been observed.
2. Indirect Evidential Effect $Y \rightarrow Z \rightarrow X$: evidence $X$ can influence $Y$ if $Z$ has not been observed.
3. Common Cause $X \leftarrow Z \rightarrow Y$: $X$ can influence $Y$ if $Z$ has not been observed.
4. Common Effect $X \rightarrow Z \leftarrow Y$: $X$ can influence $Y$ only if $Z$ has been observed. Otherwise, they are independent.
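The rules above can also be checked mechanically. The sketch below implements the textbook d-separation test on a networkx DAG (restrict to the ancestral sub-graph, "marry" co-parents and drop edge directions, remove the conditioning set $Z$, then test reachability); it is illustrative only, and the choice of node sets is left to the caller.

import networkx as nx

def d_separated(G: nx.DiGraph, X: set, Y: set, Z: set) -> bool:
    """Return True if X and Y are d-separated given Z in the DAG G."""
    # 1. Keep only X, Y, Z and their ancestors (the ancestral graph).
    relevant = set(X) | set(Y) | set(Z)
    keep = set(relevant)
    for v in relevant:
        keep |= nx.ancestors(G, v)
    A = G.subgraph(keep)

    # 2. Moralize: "marry" parents that share a child, then drop edge directions.
    M = A.to_undirected()
    for child in A.nodes:
        parents = list(A.predecessors(child))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                M.add_edge(parents[i], parents[j])

    # 3. Remove the conditioning set and test for any remaining path between X and Y.
    M.remove_nodes_from(Z)
    return not any(x in M and y in M and nx.has_path(M, x, y)
                   for x in X for y in Y)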
In our formulation of Ensemble models, we frequently encounter the following examples based on the above rules.
1. The evaluation of local metrics blocks the effect of the local hyperparameters on global metrics (Rule 1). Therefore, any two models affecting the same global metric will not have their hyperparameters related via a common effect (Rule 4).
2. Assume that $Z$ is the set of global hyperparameters, while $X$ and $Y$ represent local hyperparameter sets associated with individual models. If they belong to the same connected component (sub-graph), we would have to jointly optimize the hyperparameters of the individual models, thus increasing the computational complexity of the optimization process. Instead, we can consider a two-level optimization process: first, we choose candidate values for the global hyperparameters in $Z$; then, given the values of the variables in $Z$, we can independently optimize the hyperparameters in the connected components $X$ and $Y$, thus decomposing the hyperparameter space further by applying Rule 3.
3. In many cases, all the global hyperparameters are linked to all the global metrics, and then they have to be jointly optimized in accordance with Rule 4.
Overall, applying the rules to the ensemble models and their performance metrics helps us optimize the hyperparameters of most of the models independently. In certain cases, as we demonstrate later, the global hyperparameters may have to be optimized jointly.
During each trial of the hyperparameter optimization process, we perform the following steps (a code sketch follows the list):
1. Choose candidate values of the global hyperparameters $H_h$.
2. With the chosen values of $H_h$, optimize the hyperparameters in $H_l$ associated with each model in the ensemble by using the corresponding model performance as the objective function.
3. Evaluate the performance of the subsets in $H_h$ on $L_h$ using the set of equations in (5.1).
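The sketch below outlines this two-level loop. The callables optimize_local and evaluate_global are hypothetical placeholders for the per-model tuning routine and the evaluation of the global metrics in Equation 5.1; neither is specified in the text.

def two_level_search(global_candidates, local_spaces, optimize_local, evaluate_global):
    """Sketch of the two-level hyperparameter search.

    global_candidates : iterable of candidate settings for H_h
    local_spaces      : dict mapping each model to its local search space (subset of H_l)
    optimize_local    : callable(model, space, h_global) -> best local setting
    evaluate_global   : callable(h_global, h_locals) -> scalar score on L_h
    """
    best = None
    for h_global in global_candidates:                               # Step 1: choose H_h
        h_locals = {model: optimize_local(model, space, h_global)    # Step 2: tune each H_l
                    for model, space in local_spaces.items()}        #         independently
        score = evaluate_global(h_global, h_locals)                  # Step 3: evaluate on L_h
        if best is None or score < best[0]:                          # minimization, as in (5.1)
            best = (score, h_global, h_locals)
    return best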
5.1.2.4 Bayesian Optimization for Hyperparameter Tuning
Once we have obtained the independent subsets of the hyperparameters, we optimize them using the performance metrics that they influence. In this work, we consider a model-based black-box optimization approach called Bayesian Optimization or Sequential Model-Based Optimization (SMBO) [145]. We chose this approach because of several advantages over gradient-based and other derivative-free optimization processes.
Unlike gradient-based optimization, SMBO can optimize over non-convex surfaces. This becomes a deciding factor in choosing SMBO over gradient-based optimization when the hyperparameter space includes discrete variables. Also, unlike other derivative-free methods, SMBO keeps track of the history of past evaluations of candidate hyperparameter settings and uses that information to update a prior, which it then uses to choose hyperparameter values with potential improvements over the best choice found up to that point.
The Bayesian Optimization process starts with the initialization of a surrogate probability model. Let us denote this by $P(score \mid h)$. Here, $score$ represents the result of evaluating the fitness metric $\mathcal{L}(A, D; h)$ defined in Equation 5.1 for a given choice of the hyperparameters $h \in H$. This model is continuously updated to create a better estimate of the mapping from hyperparameter choices to scores across the search space and to make informed decisions when choosing candidate hyperparameter values. For our work, we apply Bayes rule to the surrogate probability model and then use the Tree Parzen Estimator (TPE) [145] to create a generative model for the hyperparameter distribution conditioned on a threshold value of the score, $score^*$:

[Figure 5.1: Schematic of Hyperparameter Optimization for our Approach based on Bayesian Optimization. The black box evaluates $\mathcal{L}(A, D; h_{candidate})$ and returns a score; the surrogate probability model and the selection function propose the next $h_{candidate}$; the loop iterates until the termination criterion is achieved.]
$$P(\theta \mid score) = \begin{cases} l(\theta), & \text{if } score \ge score^* \\ g(\theta), & \text{otherwise,} \end{cases}$$
where $l(\theta)$ and $g(\theta)$ are each a combination of Gaussian kernel density functions [146], which gradually estimate the true probability model as more evaluations of the hyperparameters become available.
We also initialize a Selection Function (also called the Acquisition Function), which decides the next candidate hyperparameter instantiation $\theta$ to choose for a potential improvement. We use the Expected Improvement metric [145],
$$EI_{score^*}(\theta) = \int_{-\infty}^{score^*} (score^* - score)\, p(score \mid \theta)\, d(score),$$
evaluated using the surrogate probability model to calculate the candidate improvements and choose the most promising candidate hyperparameter instantiation. The aim is to maximize the Expected Improvement with respect to $\theta$.
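To make the TPE mechanics concrete, the sketch below fits $l(\theta)$ and $g(\theta)$ as Gaussian kernel density estimates over a history of $(\theta, score)$ trials and ranks a batch of candidates by the density ratio $l(\theta)/g(\theta)$, the standard TPE proxy for Expected Improvement [145]. The trial data, the threshold quantile, and the minimization convention (trials with scores below $score^*$ fit $l(\theta)$, consistent with Equation 5.1 and the integral above, although conventions differ across presentations) are illustrative assumptions.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical history of trials: a 1-D hyperparameter theta and its observed score (loss).
thetas = rng.uniform(0.0, 1.0, size=50)
scores = (thetas - 0.3) ** 2 + 0.05 * rng.normal(size=50)

# Split the history at a threshold score* (here, the 25th percentile of observed scores).
score_star = np.quantile(scores, 0.25)
good = thetas[scores <= score_star]   # trials used to fit l(theta)  (minimization convention)
bad = thetas[scores > score_star]     # trials used to fit g(theta)

l, g = gaussian_kde(good), gaussian_kde(bad)

# Rank a batch of candidate theta values by the density ratio l/g,
# the standard TPE surrogate for ranking candidates by Expected Improvement.
candidates = rng.uniform(0.0, 1.0, size=200)
best_candidate = candidates[np.argmax(l(candidates) / (g(candidates) + 1e-12))]
print(best_candidate)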
The optimization process itself then includes three iterative steps that we run until some termination criterion is achieved. These are enumerated below, schematically presented in Figure 5.1, and sketched in code after the list:
• Using the surrogate probability model and the selection function, choose the most promising candidate hyperparameter values.
• Apply the hyperparameters to the fitness function $\mathcal{L}$.
• Update the surrogate model incorporating the new trial results.
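The text does not name a specific implementation of this loop; as one possibility, the hyperopt library packages SMBO with TPE in exactly this form: the Trials object stores the history of past evaluations, tpe.suggest plays the role of the surrogate model plus selection function, and fmin iterates until max_evals is reached. The search space and objective below are hypothetical placeholders for one independent subset of hyperparameters and the metric it influences.

from hyperopt import fmin, hp, tpe, Trials

# Hypothetical search space for one independent subset of hyperparameters.
space = {
    "learning_rate": hp.loguniform("learning_rate", -7, 0),   # e^-7 .. e^0
    "num_layers": hp.choice("num_layers", [1, 2, 3]),
}

def objective(h):
    """Placeholder for evaluating L(A, D; h) on the metric this subset influences."""
    return (h["learning_rate"] - 0.01) ** 2 + 0.1 * h["num_layers"]

trials = Trials()                       # keeps the history of past evaluations
best = fmin(fn=objective,               # black-box fitness function
            space=space,
            algo=tpe.suggest,           # TPE surrogate + selection by expected improvement
            max_evals=100,              # termination criterion
            trials=trials)
print(best)

In our setting, one such call would be issued per independent subset identified by the decomposition, with the global hyperparameters handled by the outer loop described earlier.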
Once this process terminates, we obtain a locally optimal estimate of the hyperparameters that maximize the performance metric under consideration and use these values for evaluation of our approach when performing benchmarking experiments or for deployment.