I recommend the book "Elements of Statistical Learning" by Hastie, Tibshirani and Friedman (2009)⁴ and Andrew Ng's online course "Machine Learning"⁵ on the online learning platform coursera.com to get started with machine learning. Model-agnostic methods work by changing the input of the machine learning model and measuring the changes in the prediction output.
Fermi’s Paperclips
Her tax affiliation was eventually integrated into the Civic Trust Score System, but she was never told. A dozen small drones repositioned themselves in front of the children and began projecting the video directly into their eyes.
What Is Machine Learning?
A major disadvantage of using machine learning is that insights about the data and the task the machine solves are hidden in increasingly complex models. Just look at interviews with winners on the kaggle.com machine learning competition platform⁸: the winning models were usually ensembles of models or very complex models such as boosted trees or deep neural networks.
Terminology
Model-agnostic methods for interpretability treat machine learning models as black boxes, even when they are not. The prediction is the machine learning model's "guess" of what the target value should be, based on the given features.
Interpretability
Importance of Interpretability
The vet's explanation reconciles the dog owner's contradiction: "The dog was under stress and bit." The more a machine's decision affects a person's life, the more important it is for the machine to explain its behavior. The following scenarios illustrate when we do not need, or even do not want, interpretability of machine learning models.
Taxonomy of Interpretability Methods
Model-specific or model-agnostic? Model-specific interpretation tools are limited to certain classes of models. The interpretation of regression weights in a linear model is a model-specific interpretation, since, by definition, the interpretation of intrinsically interpretable models is always model-specific.
Scope of Interpretability
Algorithm Transparency
Global, Holistic Model Interpretability
Global Model Interpretability on a Modular Level
Local Interpretability for a Single Prediction
Local Interpretability for a Group of Predictions
Evaluation of Interpretability
Properties of Explanations
High accuracy is especially important if the explanation is used to make predictions in place of the machine learning model. Degree of importance: How well the explanation reflects the importance of features or of parts of the explanation.
Human-friendly Explanations
What Is an Explanation?
What Is a Good Explanation?
What this means for interpretable machine learning: Be aware of the social environment of your machine learning application and the target audience. A good example is "The house is expensive because it is big", which is a very general, good explanation of why houses are expensive or cheap.
Datasets
Bike Rentals (Regression)
YouTube Spam Comments (Text Classification)
Risk Factors for Cervical Cancer (Classification)
However, this is not a book on missing data imputation, so mode imputation will suffice for the examples. To reproduce this book's examples with this dataset, find the preprocessing R script³³ and the final RData file³⁴ in the book's Github repository.
Interpretable Models
Linear Regression
The variance of the error terms is assumed to be constant over the entire feature space. You do not want strongly correlated features, because they make the estimation of the weights unreliable.
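As a minimal, hedged sketch (simulated data, not the bike data from the book), the following R snippet shows what "spoiled" weight estimates look like: two strongly correlated features blow up the standard errors of the individual weights.

```r
# Two almost identical features: the model cannot decide how to split the
# weight between them, so the individual estimates become unstable.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly a copy of x1
y  <- 3 * x1 + rnorm(n)

summary(lm(y ~ x1 + x2))$coefficients  # large standard errors for x1 and x2
summary(lm(y ~ x1))$coefficients       # without x2, the weight of x1 is estimated reliably
```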
Interpretation
The interpretation of features in a linear regression model can be automated using the following text templates. Interpretation of a numerical feature (temperature): An increase in temperature of 1 degree Celsius increases the predicted number of bicycles by 110.7, when all other features are held constant.
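A minimal sketch of where such a statement comes from, assuming a preprocessed data frame `bike` with a count column `cnt`, a numerical column `temp` and a categorical column `weather` (these names are illustrative and may differ from the book's data):

```r
mod <- lm(cnt ~ temp + weather, data = bike)
coef(mod)["temp"]  # predicted change in bike rentals per +1 degree Celsius,
                   # with all other features held constant
```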
Visual Interpretation
Interpretation of a categorical feature (“weather”): The estimated number of bicycles is 1,901.5 lower when it is raining, snowing or stormy, compared to good weather, again assuming all other features do not change. On the downside, the interpretation ignores the joint distribution of the features.
Explain Individual Predictions
The effect plot for a single instance shows the distribution of the effects and highlights the effects of the instance of interest. In comparison, the prediction for the 6th instance is small, as only 1,571 bike rentals are predicted.
Encoding of Categorical Features
For the reference category A, −(β1+β2) is the difference from the overall mean, and β0−(β1+β2) is the estimated mean of y in category A. The β per category is the estimated mean value of y for that category (provided that all other feature values are zero or at the reference category).
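A small sketch of effect coding on simulated data; with R's `contr.sum`, the category coded as -1 in all contrast columns plays the role of the reference category A, so A is placed last in the factor levels here:

```r
set.seed(1)
df <- data.frame(cat = factor(rep(c("A", "B", "C"), each = 30),
                              levels = c("B", "C", "A")))   # A last = reference
df$y <- c(A = 10, B = 14, C = 7)[as.character(df$cat)] + rnorm(nrow(df))

contrasts(df$cat) <- contr.sum(3)   # sum-to-zero (effect) coding
b <- coef(lm(y ~ cat, data = df))

b[1]                  # beta0: the overall mean
-(b[2] + b[3])        # difference of reference category A from the overall mean
b[1] - (b[2] + b[3])  # estimated mean of y in category A
```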
Do Linear Models Create Good Explanations?
Linear models create truthful explanations, as long as the linear equation is an appropriate model for the relationship between the features and the outcome. The linear nature of the model is, I believe, the main reason why people use linear models to explain relationships.
Sparse Linear Models
Features already “in” the model are penalized less and may receive a larger absolute weight. Continue until some criterion is reached, such as the maximum number of features in the model.
Advantages
I recommend using the Lasso, because it can be automated, considers all features simultaneously, and can be controlled via lambda.
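A minimal sketch with the `glmnet` package (not necessarily the exact code used in the book), again assuming a data frame `bike` with the outcome `cnt`:

```r
library(glmnet)

X <- model.matrix(cnt ~ . - 1, data = bike)  # numeric feature matrix
y <- bike$cnt

fit <- glmnet(X, y, alpha = 1)    # alpha = 1 selects the Lasso penalty
coef(fit, s = 1)                  # weights for a chosen lambda; many are exactly zero

cv <- cv.glmnet(X, y, alpha = 1)  # or let cross-validation choose lambda
coef(cv, s = "lambda.min")
```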
Disadvantages
Example: You have a model to predict the value of a house and features such as the number of rooms and the size of the house. The size of the house and the number of rooms are strongly related: the bigger the house, the more rooms it has.
Logistic Regression
What is Wrong with Linear Regression for Classification?
Theory
For classification, we prefer probabilities between 0 and 1, so we wrap the right side of the equation into the logistic function. Binary categorical feature: One of the two values of the feature is the reference category (in some languages the one encoded as 0).
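A minimal sketch of fitting and interpreting a logistic regression in R, assuming a data frame `cervical` with a binary outcome column `Biopsy` (the column names are illustrative):

```r
mod <- glm(Biopsy ~ ., data = cervical, family = binomial())

# The weights live on the log-odds scale; exponentiating turns them into
# odds ratios: the multiplicative change in the odds per unit increase of a
# numerical feature, or relative to the reference category for a categorical one.
exp(coef(mod))
```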
Advantages and Disadvantages
Interpretation of a categorical feature (“Hormonal contraceptives y/n”): For women using hormonal contraceptives, the odds of cancer versus no cancer change by the factor given by the exponentiated weight, compared to women who do not use hormonal contraceptives. As in the linear model, the interpretations always come with the clause that “all other features remain the same”.
Software
On the other hand, the logistic regression model is not only a classification model but also gives you probabilities.
GLM, GAM and more
Three assumptions of the linear model (left side): Gaussian distribution of the outcome given the features, additivity (= no interactions), and a linear relationship. But since you're already here, I've made a small summary of the problems plus solutions for linear model extensions, which you can find at the end of the chapter.
Non-Gaussian Outcomes - GLMs
In the linear model, the link function links the weighted sum of the features to the mean of the Gaussian distribution. The GLM framework makes it possible to choose the link function independently of the distribution.
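A minimal sketch of a GLM for the bike counts, swapping the Gaussian distribution and identity link for a Poisson distribution with a log link (column names assumed as before):

```r
mod <- glm(cnt ~ temp + hum + windspeed, data = bike,
           family = poisson(link = "log"))

exp(coef(mod))  # multiplicative change in the expected count per unit increase of a feature
```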
Interactions
Effect (including the interaction) of temperature and working day on the predicted number of bicycles for a linear model. Effectively, we obtain two slopes for temperature, one for each category of the working day feature.
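A minimal sketch of how such an interaction is specified, assuming columns `cnt`, `temp` and `workingday` in the bike data frame:

```r
mod <- lm(cnt ~ temp * workingday, data = bike)  # main effects plus interaction
coef(mod)
# temperature slope on a non-working day: the `temp` coefficient
# temperature slope on a working day:     `temp` + the interaction coefficient
```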
Nonlinear Effects - GAMs
Some statistical programs also allow you to specify transformations directly in the linear model call. GAM feature effect of temperature for predicting the number of rented bicycles (temperature used as the only feature).
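A minimal sketch with the `mgcv` package: wrapping temperature in `s()` replaces the linear term with a smooth spline (column names assumed as before):

```r
library(mgcv)

mod <- gam(cnt ~ s(temp) + hum + windspeed, data = bike)
plot(mod, select = 1)  # estimated smooth (nonlinear) effect of temperature
```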
Further Extensions
Unsurpassed by any other analytics language, R is home to every possible extension of the linear regression model. The medicine has a direct effect on some blood value, and this blood value in turn affects the outcome.
Decision Tree
The number of leaf nodes increases rapidly with depth. The more leaf nodes and the deeper the tree, the more difficult it becomes to understand the decision rules of a tree. The maximum number of leaf nodes in a tree is 2 to the power of the depth.
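A minimal sketch with the `rpart` package; limiting the depth caps the number of leaf nodes at 2^depth and keeps the rules readable (column names assumed as before):

```r
library(rpart)

tree <- rpart(cnt ~ ., data = bike,
              control = rpart.control(maxdepth = 3))  # at most 2^3 = 8 leaf nodes
print(tree)             # the learned decision rules, one line per node
plot(tree); text(tree)  # quick visualization of the tree
```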
Decision Rules
No Rules Apply: What if I want to predict the value of a house and none of the rules apply? If the condition of the first rule is true for an instance, we use the prediction of the first rule.
Learn Rules from a Single Feature (OneR)
For each feature value, create a rule that predicts the most common class of the instances that have that particular feature value (which can be read from the cross table). For each feature, we calculate the total error rate of the generated rules, which is the sum of the errors.
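A minimal, self-contained sketch of OneR for categorical features: build one rule set per feature from the cross table, then keep the feature whose rules make the fewest errors (the discretization of the iris features below is only for illustration):

```r
one_r <- function(df, target) {
  y <- df[[target]]
  features <- setdiff(names(df), target)

  # Total error per feature: everything that deviates from the majority class
  # within each feature value (read off the cross table).
  errors <- sapply(features, function(f) {
    tab <- table(df[[f]], y)
    sum(tab) - sum(apply(tab, 1, max))
  })

  best  <- names(which.min(errors))
  rules <- apply(table(df[[best]], y), 1, function(row) names(which.max(row)))
  list(feature = best, rules = rules, errors = errors)
}

# Example: discretize the numerical iris features into 3 bins each.
iris_cat <- data.frame(lapply(iris[1:4], cut, breaks = 3), Species = iris$Species)
one_r(iris_cat, "Species")
```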
Sequential Covering
RIPPER (Repeated Incremental Pruning to Produce Error Reduction) by Cohen (1995)⁴⁵ is a variant of the sequential covering algorithm. When a condition matches, the right-hand side of the rule is the prediction for this instance.
Bayesian Rule Lists
At each step, the algorithm estimates the posterior probability of the decision list (a combination of accuracy and brevity). The BRL algorithm selects the decision list with the highest posterior probability from the sampled lists.
Software and Alternatives
RuleFit
ELSE 0
RuleFit also introduces partial dependence plots to show the average change in the prediction when a feature is changed. The partial dependence plot is a model-agnostic method that can be used with any model and is explained in the book chapter on partial dependence plots.
Interpretation and Example
The feature importance measure includes the importance of the raw feature term and of all decision rules in which the feature appears. The result is a linear model that has linear effects for all of the original features and for the rules.
Software and Alternative
If the weather is good and the temperature is above 15 degrees, the temperature is automatically higher than 10. The interpretation of the estimated weight for the second rule is: "Assuming all other features remain constant, the predicted number of bicycles increases by β2 when the weather is good and the temperature is above 15 degrees."
Other Interpretable Models
Naive Bayes Classifier
Nearest Neighbors
The k-nearest neighbor method assigns the most common class of the nearest neighbors of an instance. The k-nearest neighbor model differs from the other interpretable models presented in this book because it is an instance-based learning algorithm.
Model-Agnostic Methods
By fitting machine learning models based on the data layer, we get the Black Box Model layer. Above the Black Box Model layer is the Interpretability Methods layer, which helps us deal with the opacity of machine learning models.
Partial Dependence Plot (PDP)
To illustrate a partial dependence plot with a categorical feature, we examine the effect of the season feature on the predicted bicycle rentals. It is assumed that the feature(s) for which the partial dependence is computed are not correlated with the other features.
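A minimal sketch of the partial dependence computation itself, written model-agnostically against a generic `predict` method (simulated data; for a categorical feature such as season, the grid would simply be the set of categories):

```r
partial_dependence <- function(mod, dat, feature, grid) {
  sapply(grid, function(v) {
    dat_mod <- dat
    dat_mod[[feature]] <- v                 # force the feature to the grid value for all rows
    mean(predict(mod, newdata = dat_mod))   # average prediction = partial dependence
  })
}

# Example with a linear model on simulated data:
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- 2 * dat$x1 + dat$x2 + rnorm(200, sd = 0.1)
mod  <- lm(y ~ ., data = dat)

grid <- seq(0, 1, length.out = 20)
plot(grid, partial_dependence(mod, dat, "x1", grid), type = "l",
     xlab = "x1", ylab = "partial dependence")
```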
Individual Conditional Expectation (ICE)
A simple solution is to center the curves at a certain point of the feature range and display only the difference in the prediction to this point. This can be useful if we don't want to see the absolute change of a predicted value, but the difference in the prediction compared to a fixed point of the feature range.
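A minimal sketch of centered ICE curves, reusing the `mod`, `dat` and `grid` objects from the partial dependence sketch above and anchoring each curve at the first grid point:

```r
centered_ice <- function(mod, dat, feature, grid) {
  curves <- sapply(grid, function(v) {
    dat_mod <- dat
    dat_mod[[feature]] <- v
    predict(mod, newdata = dat_mod)  # one prediction per instance
  })                                 # rows = instances, columns = grid values
  curves - curves[, 1]               # subtract each curve's value at the anchor point
}

ice <- centered_ice(mod, dat, "x1", grid)
matplot(grid, t(ice), type = "l", lty = 1, col = "grey",
        xlab = "x1", ylab = "prediction difference to first grid point")
```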
Accumulated Local Effects (ALE) Plot
Motivation and Intuition
We can average over the conditional distribution of the feature, which means that at a grid value of x1, we average the predictions of instances with a similar x1 value. We accumulate the local gradients over the range of the features in set S, which gives us the effect of the feature on the prediction.
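A minimal sketch of a first-order ALE estimate for one numerical feature (reusing `mod` and `dat` from the sketches above): split the feature into quantile intervals, average the prediction difference between the upper and lower interval bound for the instances inside each interval, accumulate, and center. The centering here is a simplified, unweighted version of the full definition.

```r
ale_1d <- function(mod, dat, feature, n_intervals = 20) {
  x <- dat[[feature]]
  bounds <- unname(quantile(x, probs = seq(0, 1, length.out = n_intervals + 1)))
  bin <- cut(x, breaks = bounds, include.lowest = TRUE)

  local_effects <- sapply(seq_along(levels(bin)), function(k) {
    idx <- which(bin == levels(bin)[k])
    if (length(idx) == 0) return(0)
    lower <- dat[idx, , drop = FALSE]; lower[[feature]] <- bounds[k]
    upper <- dat[idx, , drop = FALSE]; upper[[feature]] <- bounds[k + 1]
    mean(predict(mod, newdata = upper) - predict(mod, newdata = lower))
  })

  ale <- cumsum(local_effects)
  ale - mean(ale)  # the full ALE definition uses a count-weighted mean here
}

ale_1d(mod, dat, "x1")
```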
Estimation
Remember that the second-order effect is the additional effect of the interaction of the two features and does not include the main effects. PDP of the total effect of temperature and humidity on the predicted number of bicycles.
Implementation and Alternatives
It is not the total effect of the two features, but only the additional effect of the interaction. Because if they have a very strong correlation, it only makes sense to analyze the effect of changing both features together and not in isolation.
Feature Interaction
Feature Interaction?
For this table we need an additional interaction term: +100,000 if the house is large and in a good location. One way to assess the strength of an interaction is to measure how much of the prediction variance is due to the interaction of the features.
Theory: Friedman’s H-statistic
The amount of variance explained by the interaction (the difference between the observed and the no-interaction partial dependence) is used as a statistic for the interaction strength. In general, the interaction effects between the features are very weak (less than 10% of variance explained per feature).
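For reference, the two-way H-statistic can be written out as follows (with PD denoting the centered partial dependence functions and n the number of data points); the numerator measures the variance of the two-dimensional partial dependence that is not captured by the two one-dimensional partial dependence functions, and the denominator is its total variance:

$$
H^2_{jk} = \frac{\sum_{i=1}^n \left[ PD_{jk}\left(x_j^{(i)}, x_k^{(i)}\right) - PD_j\left(x_j^{(i)}\right) - PD_k\left(x_k^{(i)}\right) \right]^2}{\sum_{i=1}^n PD_{jk}^2\left(x_j^{(i)}, x_k^{(i)}\right)}
$$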
Implementations
Alternatives
Feature Importance
Fisher, Rudin and Dominici (2018) suggest in their paper splitting the dataset in half and swapping the values of feature j between the two halves instead of permuting feature j. If you want a more accurate estimate, you can estimate the error of permuting feature j by pairing each instance with the value of feature j of every other instance (except itself).
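A minimal, self-contained sketch of the basic permutation scheme (simple shuffling, not the split-half variant described above) for a regression model, using mean squared error as the loss:

```r
permutation_importance <- function(mod, dat, target, n_repeats = 5) {
  mse <- function(d) mean((d[[target]] - predict(mod, newdata = d))^2)
  base_error <- mse(dat)
  features <- setdiff(names(dat), target)

  sapply(features, function(f) {
    mean(replicate(n_repeats, {
      d <- dat
      d[[f]] <- sample(d[[f]])  # permute the feature, breaking its link to the outcome
      mse(d) - base_error       # importance = increase in error
    }))
  })
}

# Example on simulated data: x1 matters most, `noise` should come out near zero.
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200), noise = runif(200))
dat$y <- 2 * dat$x1 + dat$x2 + rnorm(200, sd = 0.1)
mod  <- lm(y ~ ., data = dat)

permutation_importance(mod, dat, "y")
```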
Compute Importance on Training or Test Data?
The SVM overfits the data: the feature importance based on the training data shows many important features. PDP of feature X42, which is the most important feature according to the feature importance based on the training data.