
Deliverable: Software that automatically selects a specified number of customers from the database to which the shipment should be sent, running time max. Organizational requirements had to be adapted in later stages as problems with the data became apparent. Representativeness: If conclusions are to be drawn about a specific target group, there must be a sufficiently large number of cases from this group.

Informativeness: To cover all aspects of the model, most of the influencing factors (identified in the cognitive map) should be represented by attributes in the database. Good data quality: The relevant data must be correct, complete, up-to-date and unambiguous thanks to the available documentation. If the goal of the analysis is a report outlining possible explanations for a particular situation, the ultimate goal is to understand the provided model.

If the analysis is performed more than once, we can achieve similar performance - but not necessarily similar patterns. If the problem domain is complex, the model learned from the data must also be complex to be successful.

Determine analysis goals (3/3)

If restrictive runtime requirements are given (either for model building or model application), this may preclude some computationally expensive approaches. The more an expert already knows, the more challenging it is to surprise them with new findings. So if there is a possibility to include any kind of prior knowledge, it can significantly facilitate the search for the best model and can prevent us from rediscovering many known artifacts.

CRISP-DM

Cross-Industry Standard Process for Data Mining

Data Understanding I

Agenda

Checking the assumptions made during the project understanding phase (representativeness, informativeness, data quality, presence/absence of external factors, dependencies, ...).

Goals of data understanding

Attribute understanding

Types of attributes

Scales for numerical attributes

The most fine-grained level provides the most detailed information, but will not help discover general associations such as "Wine and cheese are often bought together". The analysis of such data will be biased towards values (example: products) that have been in the domain for a long time. Low accuracy of numeric attributes can result from noisy measurements, limited precision, incorrect measurements, or transposed digits when values are entered manually.

Data quality

Syntactic accuracy is violated if an entry does not belong to the domain of the attribute.

Data quality – syntactic accuracy

Semantic accuracy is violated if an entry is not correct even though it belongs to the domain of the attribute. Example: The entry female for the categorical attribute gender in a record with the name entry John Smith is within the domain of the attribute gender, but obviously incorrect, assuming the name is correct.

Data quality – semantic accuracy

Note that missing values are not always explicitly marked as missing, for example when a default entry is stored instead. Example 1: Three years ago a new system was introduced and not all customer data was transferred to the new system.

Data quality – completeness

Example 2: The data set is biased, e.g. the bank may have rejected customers without income, but did not record it. Unbalanced data, example: On a production line for goods with quality control, defective goods will make up only a very small part of all records.

Data quality – unbalanced data and timeliness

Bar charts

The range of numerical attributes is discretized into a fixed number of intervals (“bins”), usually of equal length. For each interval, the (absolute) frequency of values falling within it is indicated by the height of a bar.
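The binning step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a plotting routine: the function name `equal_width_histogram` and the sample values are made up for the example, and it assumes the data contain at least two distinct values.

```python
def equal_width_histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just outside the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Made-up sample; each '#' stands for one value in the bin (absolute frequency).
values = [1.0, 1.5, 2.0, 2.5, 3.0, 4.5, 5.0, 7.0, 9.0]
edges, counts = equal_width_histogram(values, n_bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {'#' * count}")
```

Changing `n_bins` changes the shape of the picture considerably, which is why the choice of the number of bins matters.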

Histograms (1/4)

Histograms (2/4)

Histograms (3/4)

Sturges' rule is suitable for data from normal distributions and for data sets of moderate size.
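The slide does not spell out the formula. Sturges' rule is commonly stated as k = ceil(log2(n)) + 1 bins for n values, and the sketch below assumes that form. The function name `sturges_bins` is chosen for this example only.

```python
import math


def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' rule for n data points."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(math.log2(n)) + 1


# The suggested number of bins grows only logarithmically with the data size.
for n in (64, 150, 1000):
    print(n, sturges_bins(n))
```

Because the bin count grows so slowly, the rule tends to produce too few bins for very large or strongly skewed data sets, which is consistent with the restriction to moderate sizes and roughly normal data.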

Histograms (4/4)

Reminder: median, quantiles, quartiles, interquartile range

Iris data set: boxplots

Scatter plots

For large data sets, points are plotted on top of each other and density information is lost.
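One way to recover the lost density information is to count points per grid cell and show the counts (for example as color intensity) instead of individual points. The sketch below only computes the counts. The synthetic data, the cell size and the function name `grid_density` are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def grid_density(points, cell_size):
    """Count points per square grid cell; the counts serve as a density estimate."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


random.seed(0)
# Synthetic data: 10,000 points clustered around the origin.
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]
density = grid_density(points, cell_size=0.5)

# A cell near the center holds far more points than a cell far out,
# a difference that is invisible once markers overlap in a plain scatter plot.
print(density[(0, 0)], density[(5, 5)])
```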

Scatter plots: density

Scatter plots can be enriched with additional information: color or different symbols include a third attribute in the scatter plot.

Scatter plots: further elaboration

3D scatter plots

Data Preparation

Feature extraction

But such automatic feature extraction methods usually result in features that can no longer be interpreted in a meaningful way.
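To see why interpretability suffers, consider principal component analysis, used here only as one familiar example of automatic feature extraction (the source does not name a specific method). The sketch handles the two-attribute case, where the first principal direction has a closed form. The attribute names and values are made up.

```python
import math


def first_principal_direction(xs, ys):
    """Unit vector (wx, wy) of the first principal component of 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # For a symmetric 2x2 covariance matrix, the direction of maximal
    # variance has the angle 0.5 * atan2(2 * sxy, sxx - syy).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


# Made-up measurements of two attributes.
height_cm = [160, 165, 170, 175, 180]
weight_kg = [55, 60, 68, 72, 80]
wx, wy = first_principal_direction(height_cm, weight_kg)

# The extracted feature is wx * (height - mean) + wy * (weight - mean):
# a weighted mix of both attributes with no natural unit or name.
print(round(wx, 3), round(wy, 3))
```

Both weights are non-zero, so the new feature is neither "height" nor "weight", which is the interpretability problem the slide refers to.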

Dimensionality reduction for feature extraction

Feature selection refers to techniques that select a subset of the features (attributes) that is as small as possible, yet sufficient for the data analysis.

Feature extraction and selection

For the removal of irrelevant features, a performance measure is needed that indicates how well a feature or subset of features performs with respect to the data analysis task. For the removal of redundant features, either a performance measure for subsets of features or a correlation measure is needed.

Removing irrelevant/redundant features

It measures the deviation of the sample marginal distributions from the marginal distribution that would be obtained assuming that the attribute under consideration and the target variable were independent. Train the model with different subsets of features and select the features that lead to the model with the best performance.
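The first sentence reads like a description of a chi-squared test of independence between an attribute and the target. That identification is an assumption, since the measure is not named here. Under that assumption, the statistic compares observed cell counts with the counts expected under independence:

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Chi-squared statistic for two categorical sequences of equal length."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    feature_counts = Counter(feature)
    target_counts = Counter(target)
    chi2 = 0.0
    for f_value, f_count in feature_counts.items():
        for t_value, t_count in target_counts.items():
            expected = f_count * t_count / n  # count expected under independence
            observed = joint[(f_value, t_value)]
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Made-up data: one attribute copies the target, the other is unrelated to it.
target = [1, 1, 0, 0, 1, 1, 0, 0]
informative = [1, 1, 0, 0, 1, 1, 0, 0]
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]

score_informative = chi_square_statistic(informative, target)
score_unrelated = chi_square_statistic(unrelated, target)
print(score_informative, score_unrelated)
```

A larger statistic indicates a stronger deviation from independence, so attributes can be ranked by it. Turning the statistic into a p-value would additionally require the chi-squared distribution, which is omitted here.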

Feature selection techniques (1/2)

Feature selection techniques (2/2)

Feature selection – example (1/2)

Evaluating the performance of isolated attributes usually does not provide accurate information about their performance in combination.
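A small constructed example makes this concrete: if the target is the exclusive-or of two binary attributes, each attribute alone predicts no better than guessing, while the pair predicts perfectly. The data and the helper `best_rule_accuracy` are made up for this illustration, and accuracy is measured on the same data used to build the rule.

```python
from collections import Counter, defaultdict
from itertools import product


def best_rule_accuracy(rows, feature_idx, target):
    """Accuracy when predicting the majority class for each value combination."""
    groups = defaultdict(Counter)
    for row, t in zip(rows, target):
        key = tuple(row[i] for i in feature_idx)
        groups[key][t] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


# All four combinations of two binary attributes, repeated five times.
rows = list(product([0, 1], repeat=2)) * 5
target = [a ^ b for a, b in rows]  # exclusive-or of the two attributes

acc_first = best_rule_accuracy(rows, [0], target)
acc_second = best_rule_accuracy(rows, [1], target)
acc_both = best_rule_accuracy(rows, [0, 1], target)
print(acc_first, acc_second, acc_both)
```

A selection method that scores attributes one at a time would discard both attributes here, although together they determine the target completely.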

Feature selection – example (2/2)

If data has been collected over a long period of time, some of the older data may be useless or even misleading for the data analysis task. When we have information about the distribution of the population, we can draw a representative subsample from our database.
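Drawing a representative subsample can be sketched as stratified sampling: each group contributes records in proportion to its known share of the population. The group names, shares and the function `stratified_sample` below are hypothetical, and the sketch assumes every group has enough records in the database.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, population_share, sample_size, seed=0):
    """Draw a sample whose group proportions follow population_share."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for record in records:
        by_group[key(record)].append(record)
    sample = []
    for group, share in population_share.items():
        k = round(share * sample_size)
        # rng.sample raises ValueError if a group has fewer than k records.
        sample.extend(rng.sample(by_group[group], k))
    return sample


# Hypothetical database that over-represents urban customers (80 vs. 20),
# while the population is assumed to be split evenly.
records = [{"id": i, "region": "urban"} for i in range(80)]
records += [{"id": i, "region": "rural"} for i in range(80, 100)]
sample = stratified_sample(
    records,
    key=lambda r: r["region"],
    population_share={"urban": 0.5, "rural": 0.5},
    sample_size=20,
)
print(Counter(r["region"] for r in sample))
```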

Record selection

Data clean(s)ing (1/2)

Data clean(s)ing (2/2)

Use dictionaries that contain all possible attribute values to ensure that all values are consistent with domain knowledge.
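A dictionary check of this kind can be as simple as a set lookup per attribute. The attribute name, the allowed values and the function `find_invalid_entries` are made up for this sketch. Missing entries (`None`) are deliberately skipped, since they are a separate issue.

```python
def find_invalid_entries(records, dictionaries):
    """Return (record index, attribute, value) for values outside the dictionary."""
    invalid = []
    for i, record in enumerate(records):
        for attribute, allowed in dictionaries.items():
            value = record.get(attribute)
            if value is not None and value not in allowed:
                invalid.append((i, attribute, value))
    return invalid


# Hypothetical dictionary of all admissible values for one attribute.
dictionaries = {"country_code": {"DE", "FR", "IT"}}
records = [
    {"name": "A", "country_code": "DE"},
    {"name": "B", "country_code": "Germany"},  # inconsistent spelling
    {"name": "C", "country_code": None},       # missing, not checked here
    {"name": "D", "country_code": "fr"},       # wrong case
]
invalid = find_invalid_entries(records, dictionaries)
print(invalid)
```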

Missing values

Types of missing values (1/3)

Missing completely at random (MCAR): The probability that a value for X is missing depends neither on the actual value of X nor on other variables. Example: The maintenance personnel sometimes forget to replace the batteries of a sensor, so that the sensor sometimes does not provide measurements. Missing at random (MAR): The probability that a value for X is missing does not depend on the true value of X, but may depend on other variables.

Types of missing values (2/3)

For MCAR, it can be assumed that the missing values follow the same distribution as the observed values of X. However, it is possible, at least in principle and when the data set is large enough, to derive reasonable imputations for the missing values from the values of the other attributes.

For missing values that cannot be ignored, it is impossible to give sensible estimates for the missing values.

Types of missing values (3/3)

If domain knowledge does not indicate which type of missing values to expect, the following strategy can be used. Convert the considered attribute X to a binary attribute by replacing all measured values with "yes" and all missing values with "no". Build a classifier with the now binary attribute X as the target attribute and use all the other attributes to predict the "yes" and "no" class values.

Determine the percentage of misclassifications. This is the percentage of data objects that are not assigned the correct class by the classifier.

How to determine the type of missing values (1/2)

For MCAR, the other attributes should not provide any information about whether a value is missing or not. If there are 10% missing values for attribute X, the misclassification rate of the classifier should not be much less than 10%. If the misclassification rate is significantly better than pure guessing, this is an indication that there is a correlation between the missing values for X and the values of the other attributes.
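The whole strategy can be sketched with a deliberately simple classifier that predicts the majority label for each value of one other attribute. The sensor data, the attribute names and the function `missingness_error_rate` are made up for illustration. For brevity the error is measured on the training data, whereas a real analysis would use held-out data or cross-validation and a proper classifier over all other attributes.

```python
from collections import Counter, defaultdict


def missingness_error_rate(records, target_attr, predictor_attr):
    """Return (misclassification rate, share of missing values)."""
    # Binary target: "yes" = value measured, "no" = value missing.
    labels = ["no" if r[target_attr] is None else "yes" for r in records]
    # Simple classifier: majority label per value of the predictor attribute.
    votes = defaultdict(Counter)
    for record, label in zip(records, labels):
        votes[record[predictor_attr]][label] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    errors = sum(
        rule[record[predictor_attr]] != label
        for record, label in zip(records, labels)
    )
    return errors / len(records), labels.count("no") / len(records)


def make(sensor, x, count):
    return [{"sensor": sensor, "x": x} for _ in range(count)]


# Case 1: 10% missing, spread evenly over both sensors (MCAR-like).
even = make("A", None, 5) + make("A", 1.0, 45) + make("B", None, 5) + make("B", 1.0, 45)
# Case 2: 10% missing, almost all of it from sensor C (depends on another attribute).
skewed = (make("C", None, 9) + make("C", 1.0, 1) + make("A", None, 1)
          + make("A", 1.0, 44) + make("B", 1.0, 45))

err_even, base_even = missingness_error_rate(even, "x", "sensor")
err_skewed, base_skewed = missingness_error_rate(skewed, "x", "sensor")
print(err_even, base_even)
print(err_skewed, base_skewed)
```

In the first case the error rate equals the share of missing values, so the other attribute carries no information about missingness. In the second case the error rate is far below that share, which speaks against MCAR.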

How to determine the type of missing values (2/2)

If only a few records have missing values, and the values can be assumed to be MCAR, these records can be deleted for the next data analysis step.
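Deleting incomplete records can be sketched as follows. The 5% threshold for "only a few records" and the function name `drop_incomplete_records` are hypothetical choices for this example. The code cannot verify the MCAR assumption itself, which has to be justified separately.

```python
def drop_incomplete_records(records, max_fraction=0.05):
    """Remove records with any missing value if they are only a small fraction."""
    incomplete = [r for r in records if any(v is None for v in r.values())]
    fraction = len(incomplete) / len(records)
    if fraction > max_fraction:
        # Too many records affected: deletion would discard too much data.
        raise ValueError(f"{fraction:.0%} of records are incomplete")
    return [r for r in records if all(v is not None for v in r.values())]


# Made-up data: 2 of 100 records have a missing value.
records = [{"age": 30, "income": 40_000} for _ in range(98)]
records += [{"age": None, "income": 35_000}, {"age": 41, "income": None}]
cleaned = drop_incomplete_records(records)
print(len(records), len(cleaned))
```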

How to handle missing values

Data transformation

Discretization (1/2)

Discretization (2/2)

Normalization / Standardization (1/2)

Normalization / Standardization (2/2)

Data integration (1/2)

Inner join: Creates a row in the output table if at least one entry can be found in the left and right tables with a matching ID. Outer join: Creates at least one row for each row in the left and right tables.
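The difference between the two join types can be illustrated with plain lists of dictionaries. The tables, column names and function names below are made up. The outer join is implemented as a full outer join that fills unmatched columns with `None`.

```python
def inner_join(left, right, key):
    """One output row per pair of left/right rows with a matching key."""
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if lrow[key] == rrow[key]
    ]


def full_outer_join(left, right, key):
    """Inner join plus every unmatched row, padded with None."""
    left_cols = {c for row in left for c in row}
    right_cols = {c for row in right for c in row}
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    rows = inner_join(left, right, key)
    for lrow in left:
        if lrow[key] not in right_keys:
            rows.append({**dict.fromkeys(right_cols), **lrow})
    for rrow in right:
        if rrow[key] not in left_keys:
            rows.append({**dict.fromkeys(left_cols), **rrow})
    return rows


customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [
    {"id": 2, "item": "wine"},
    {"id": 2, "item": "cheese"},
    {"id": 3, "item": "bread"},
]

inner = inner_join(customers, orders, "id")
outer = full_outer_join(customers, orders, "id")
print(len(inner), len(outer))
```

The inner join keeps only the customer with orders (one row per matching order), while the outer join also keeps the customer without orders and the order without a known customer.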

Data integration (2/2)

Data integration – example

Conclusion