Models, algorithms and black boxes - Big Data Demystified

2. diagnostic analytics might reveal that it was caused by a shortage of key inventory;

3. predictive analytics could forecast future supply and demand; and

4. prescriptive analytics could optimize pricing, based on the balance of supply and demand, as well as on the price elasticity of the customer base.

Keep in mind

You can realize tremendous value simply by organizing your data and making it available across the organization. More advanced projects can come later.

intuitively, particularly for a non-technical audience. An example of a transparent model would be one that computes insurance risk based on age and geography, and that is calibrated using historical insurance claims. It’s easy to understand the factors that influence such insurance premiums. A model that is completely non- transparent is called a black-box model, because the end user will not be able to understand its inner workings.

Choose simple, intuitive models whenever possible. A simple model, such as a basic statistical model, is easier to develop, to fit to the data, and to explain to end users than is a more complicated model, such as non-linear support vector machines or neural networks. In addition, transparent models allow end users to apply intuition and suggest modelling improvements, as we saw previously in the University of Washington study, where model improvements suggested by non- technical staff raised accuracy by nearly 50 per cent.

Model transparency is particularly important for applications where outcomes must be explained to healthcare patients, government regulators or customers (e.g. rejection of a loan application).

Black box models from big data

In our daily lives, we draw certain conclusions from concrete facts and others from intuition, an invaluable skill, which is nearly impossible to explain. A

thermometer gives a clear indication of fever, but we can simply look at someone we know and perceive when they aren’t feeling well. Through years of

experience looking at many healthy people and sick people, combined with our knowledge of what this person typically looks like, we ‘just know’ that the person is not feeling well. Remarkably, models such as neural networks work in a very similar way, trained by millions of sample data points to recognize

patterns. Big data enables much stronger computer ‘intuition’ by providing a massive training set for such machine learning models.

These models, which improve by consuming masses of training data, have recently become significantly better than any transparent algorithm for certain applications. This makes us increasingly dependent on black-box models. Even for classic business applications such as churn prediction and lead scoring, for which we already have reasonably effective and transparent models, data scientists are increasingly testing the effectiveness of black-box models, particularly neural networks (deep learning).

Although an ML method such as neural networks can recognize patterns through extensive training, it is not able to ‘explain’ its pattern recognition skills. When such a model wrongly labels a dog as an ostrich (as illustrated in Chapter 2), it is very difficult to detect and explain what went wrong.

This lack of transparency is a significant problem in areas such as insurance, law enforcement and medicine. Consider an article published in Science magazine in April 2017,⁶¹ describing recent results in the prediction of cardiovascular disease (heart attacks, strokes, etc.). Currently, many doctors evaluate patient risk in this area using eight risk factors, including age, cholesterol level and blood pressure.

Researchers at the University of Nottingham recently trained a neural network to detect cardiovascular events, using as inputs the medical records of nearly

400,000 patients. Their model achieved both a higher detection rate (+7.6 per cent) and a lower false alarm rate (−1.6 per cent) than the conventional eight- factor method.

Such a model can potentially save millions of lives every year, but being a neural network it is a black box and hence raises several concerns.

If patients ask the reason why they were labelled as high risk, the black-box model will not provide an answer. The patients will have difficulty knowing how to reduce their risk.

If insurance companies use this more accurate method to calculate insurance premiums, they will be unable to justify premiums. At this point, we also start to run into legal implications, as some countries are introducing so- called ‘right to explanation’ legislation, mandating that customers be given insight into the decision processes that impact them.

There has been recent progress in adding transparency to black box models. The University of Washington has recently developed a tool they call LIME – Local Interpretable Model-Agnostic Explanations, which they describe as ‘an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.’⁶² Their tool works by fitting local linear approximations, which are easy to understand, and it can be used for any model with a scoring function (including neural networks).

Fitting the model to the data

After a trained analyst has selected a model (or several candidate models), they’ll fit the model to your data, which involves:

1. choosing the exact structure of the model; and 2. calibrating the parameters of the model.

Choosing the structure of the model involves deciding which features in your data are most important. You’ll need to decide how to create binned categories (where numbers are grouped into ranges) and whether an engineered feature such as ‘average purchase price’ or ‘days since last purchase’ should be included. You might start with hundreds of potential features but use only a half dozen in your final model.

When using neural networks (including deep learning), you won’t need to walk through this variable selection or feature engineering process, but you will need to find a network architecture that works for your problem (some sample

architectures are illustrated in Chapter 2). You’ll follow a trial-and-error

approach to choosing the types and arrangements of nodes and layers within the network.

As you calibrate the model parameters, you’ll tune the structured model to best fit the training data. Each model will have one or more associated algorithms for adjusting parameters to maximize some target score, which itself describes how well the model fits the training data. Take care to avoid over-fitting the training data, by using a method called cross-validation or by assessing a goodness-of- fit test.

During the modelling process, try several possible model structures or

architectures, using specialized tools and programs to optimize the parameters for each. Assess the effectiveness of each structure or architecture and select one that seems best.

Deploying the model

Develop your models in a programming language and environment suitable for rapid prototyping, using a limited and cleaned data set. Bring in more data after you’ve demonstrated the effectiveness of the model. Don’t spend time making it fast and efficient until it’s proven its worth, perhaps even after it has already been deployed in a limited capacity to production.

Many data scientists prototype in the Python or R languages, testing on laptops or a company server. When the code is ready to be deployed to production, you may want to completely re-write it in a language such as C++ or Java and deploy it to a different production system. You’ll connect it to production data systems, requiring additional safeguards and governance.

Work with your IT department in selecting:

hardware for deployment, including memory and processor (typically central processing unit (CPU), but GPU or tensor processing unit (TPU) for neural networks);

architecture that will satisfy requirements for speed and fault-tolerance;

choice of programming language and/or third-party tool (such as SAS or SPSS);

scheduling of the computational workload, such as whether it should be run centrally or locally, possibly even at the point of data input (recall our discussion of fog computing in Chapter 5).

Work with IT to cover the normal operational processes: backups, software updates, performance monitoring, etc.

Work with your company’s privacy officer to make sure you are satisfying requirements for data governance, security and privacy. Your model may be accessing a database with personal information even though the model results do not themselves contain personal information, and this may put you at risk. A government regulation such as Europe’s General Data Protection Regulation (GDPR) may allow use of personal data for one analytic purpose but not for another, depending on the permissions granted by each individual. I’ll talk more about this in Chapter 11, when I discuss governance and legal compliance.

Dalam dokumen Big Data Demystified - BooksFree (Halaman 110-114)