
de Jonge & van der Loo, Introduction to data cleaning with R


Data cleaning is the process of converting raw data into consistent data that can be analyzed. The R statistical environment is well suited to reproducible data cleaning, as all cleaning actions can be scripted and therefore reproduced.

Figure 1: Statistical analysis value chain

Some general background in R

Variable types and indexing techniques

In summary, a statistical analysis can be divided into five stages, from raw data to formatted output, where the quality of the data is improved at each step towards the final result. Elements of a vector can be named, either by passing named arguments to the c() function or afterwards with the names function.

Special values

If one of the indices is omitted, no selection is made in that dimension, so all elements along it are returned. Calculations involving NaN always result in NaN, as the examples below illustrate.
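
A minimal sketch with R's special values (the values themselves are made up):

    NA + 1                 # NA (missing) propagates through calculations
    0/0                    # NaN: not a number
    NaN + 1                # calculations involving NaN give NaN
    1/0                    # Inf
    is.na(c(1, NA, NaN))   # TRUE for both NA and NaN
    is.nan(c(1, NA, NaN))  # TRUE only for NaN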

Reading text data into an R data.frame

If the column names are stored on the first line of the file, they can be automatically mapped to the names attribute of the result. With the exception of read.table and read.fwf, each of the above functions assumes by default that the first line in the text file contains column headings.
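
As a small illustration of these defaults (the file name and contents below are made up, not taken from the text):

    writeLines(c("name,age", "Alice,34", "Bob,28"), "persons.csv")
    dat1 <- read.csv("persons.csv")             # header = TRUE by default: first line gives column names
    dat2 <- read.table("persons.csv", sep = ",",
                       header = TRUE)           # read.table needs header = TRUE to be set explicitly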

Reading data with readLines

The easiest way to standardize rows is to write a function that takes a single character vector as input and assigns the values in the correct order. String normalization is the subject of Section 2.4.1, and type conversion is discussed in more detail in the next section.
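
A minimal sketch of such a standardizing function; the example lines, the field layout and the 1890 cut-off year are assumptions made for illustration only:

    txt <- c("%% a raw file with a comment line",
             "Gratt,1861,1892", "1892,Bob", "1871,Emmet,1937")
    txt <- txt[!grepl("^%", txt)]            # drop comment lines
    fields <- strsplit(txt, split = ",")     # split every line into its fields

    # Put the fields of one record into a fixed order: name, birth year, death year.
    assignFields <- function(x) {
      out <- character(3)
      num <- suppressWarnings(as.numeric(x))
      out[1] <- x[is.na(num)][1]             # the non-numeric field is the name
      out[2] <- x[which(num < 1890)][1]      # a year before 1890: year of birth
      out[3] <- x[which(num > 1890)][1]      # a year after 1890: year of death
      out
    }
    standardized <- lapply(fields, assignFields)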

Table 1: Steps to take when converting lines in a raw text file to a data.frame with correctly typed columns.

Type conversion

Introduction to R's typing system

Tip. A quick way to retrieve the classes of all columns in a data.frame dat is to apply the class function to each column. The type of storage used for a basic object can be found with the typeof function. Briefly, we can think of the class of an object as its type from the user's point of view, while the type of an object is the way R looks at it internally.

It is important to realize that R's coercion functions are essentially functions that change the underlying type of an object, and that changes to the class follow from changes to the type. Confusingly, objects also have a mode (and a storage.mode) that can be retrieved or set using functions of the same name. The user is therefore advised to avoid using these functions to retrieve or change an object's type.
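
The distinction can be seen interactively (a small sketch; dat stands for any data.frame, here iris is used as an example):

    sapply(iris, class)      # class of every column of a data.frame
    class(1:3)               # "integer": the class as seen by the user
    typeof(1:3)              # "integer": the underlying storage type
    y <- as.numeric(1:3)     # coercion changes the storage type ...
    typeof(y)                # "double"
    class(y)                 # ... and with it the class: "numeric"
    as.numeric("7")          # character coerced to numeric
    as.numeric("seven")      # cannot be coerced: NA with a warning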

Recoding factors

Converting dates

Here, Sys.time uses the time zone stored in the locale of the machine running R. In addition, the name of the month (or weekday) is language dependent, with the language being defined in the regional settings of the operating system. The function dmy assumes that dates are given in day-month-year order and tries to extract valid dates.
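
A short sketch with lubridate's dmy (the dates are made up; parsing of month names depends on the locale):

    library(lubridate)
    dmy("15 August 1945")    # day-month-year order assumed
    dmy("15-08-1945")
    dmy("15/08/45")          # two-digit years are expanded according to lubridate's rules
    Sys.time()               # current date-time in the machine's locale time zone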

Here, the (abbreviated) weekday or month names searched for in the text depend on the locale settings of the machine running R. If you know the exact text format used to describe a date in the input, you may prefer to use R's core functionality to convert from text to POSIXct. In the format string, date and time fields are indicated by a letter preceded by a percent sign.

Strings that are not in the exact format specified by the format argument (such as the third string in the example above) are not converted by as.POSIXct. Impossible dates, such as the leap day in the fourth date above, are also not converted.
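
For illustration, a few conversions with an explicit format string (the dates are made up):

    as.POSIXct("2023-10-15", format = "%Y-%m-%d")
    as.POSIXct("15/10/23",   format = "%d/%m/%y")
    as.POSIXct("15 October 2023", format = "%d %B %Y")   # %B: full month name (locale dependent)
    as.POSIXct("2023-15-10", format = "%Y-%m-%d")        # impossible month: returns NA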

Table 2: Day, month and year formats recognized by R.

String normalization

Approximate string matching

Converting strings to full upper or lower case can be done with R's built-in functions toupper and tolower. If you need to search for characters that have a special meaning in regular expressions, you can use the fixed = TRUE option to match them literally. A concise description of the regular expressions supported by R's built-in string processing functions can be found by typing ?regex at the R command line.
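
A few basic normalization steps (a minimal sketch; the strings are made up):

    x <- c("  Smith ", "SMITH", "smith")
    trimws(x)                      # strip leading and trailing whitespace
    toupper(x)                     # full upper case
    tolower(x)                     # full lower case
    grepl(".", x, fixed = TRUE)    # fixed = TRUE: match a literal '.', not the regex wildcard
    gsub("^\\s+|\\s+$", "", x)     # the same trimming written as a regular expression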

The books by Fitzgerald or Friedl provide a thorough introduction to the subject of regular expressions. If you often deal with ``messy'' text variables, learning regular expressions is a worthwhile investment. We now turn our attention to the second method of approximate matching, namely string distances.

A string distance is a function or algorithm that measures how much two strings differ from each other. At the end of this subsection, we show how such code can be simplified with the stringdist package.
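
For example, edit distances can be computed with base R's adist or, more conveniently, with the stringdist package (assumed to be installed):

    adist("abc", "bac")                       # generalized Levenshtein (edit) distance: 2
    library(stringdist)
    stringdist("abc", "bac")                  # default "osa" method counts the transposition as 1
    amatch("gouda", c("goud", "gauda", "cheese"),
           maxDist = 2)                       # index of the closest match within distance 2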

Character encoding issues

  • Missing values
  • Special values
  • Outliers
  • Obvious inconsistencies
  • Error localization

The ASCII characters include the uppercase and lowercase letters of the Latin alphabet (a-z, A-Z), the Arabic numerals (0-9), a number of punctuation marks, and a number of invisible so-called control characters such as newline and carriage return. Depending on the operating system, R either uses the conversion service provided by the operating system or a third-party conversion library shipped with R; functions such as iconv allow users to translate between character representations, although because of these operating-system dependencies the available conversions may differ between platforms. As an exercise, the columns of the data from the previous exercise can be coerced to obtain a correctly typed dataset.
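
A small sketch of such a translation with iconv (the string and encodings are chosen for illustration, assuming a UTF-8 session):

    x <- "Smörgåsbord"                        # text in the session's native encoding
    y <- iconv(x, from = "UTF-8", to = "latin1")
    Encoding(y)                               # "latin1"
    iconv(y, from = "latin1", to = "UTF-8")   # and back again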

If ``unknown'' is indeed a category, it should be added as a factor level so that it can be analyzed appropriately. Although more precise definitions of an outlier exist (see e.g. the book by Hawkins), this definition is sufficient for the present tutorial. In the box-and-whisker method, an observation is an outlier when it lies beyond the so-called ``whiskers'' of the set of observations.

Here, 20 and 50 are detected as outliers because they are above the upper whisker. The method can be visualized with a box-and-whisker plot, where the box indicates the interquartile range and the median, the whiskers are drawn as line segments extending from the ends of the box, and outliers are shown as separate points above or below the whiskers. In addition, the editrules package can check which rules are violated and which are not, and allows finding the minimal set of variables that need to be adjusted so that all rules can be satisfied.
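
The box-and-whisker rule is readily applied with base R (a sketch; the vector below is made up except for the outlying values 20 and 50 mentioned above):

    x <- c(1:10, 20, 50)          # 20 and 50 lie above the upper whisker
    boxplot.stats(x)$out          # values flagged as outliers
    boxplot(x)                    # the corresponding box-and-whisker plot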

Because rules can relate to multiple variables, and variables can appear in multiple rules (for example, the age variable in the current example), there is a dependency between rules and variables.
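
A minimal sketch with editrules, assuming its editmatrix, violatedEdits and localizeErrors functions (the rules and data are made up):

    library(editrules)
    E <- editmatrix(c("age >= 0", "age <= 150", "age >= yearsmarried"))
    dat <- data.frame(age = c(21, 2, -3), yearsmarried = c(0, 10, 1))
    violatedEdits(E, dat)         # which record violates which rule
    localizeErrors(E, dat)$adapt  # minimal set of fields to adapt per record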

Figure 2: A box-and-whisker plot, produced with the boxplot function.

Correction

Simple transformation rules

When such scripts are neatly written and annotated, they can be treated almost like a log of the actions performed by the analyst. However, as scripts get longer, it is better to store the transformation rules separately and log which rule is executed on which record. As an example, consider the following (fictitious) dataset of the height of some brothers.

With deducorrect we can read these rules, apply them to the data, and obtain a log of all actual changes as follows. So with just two commands the data are processed and all actions are recorded in a data.frame that can be saved or analyzed. The rules that can be applied with deducorrect are rules that can be executed record by record.

When correctionRules reads the rules, it checks whether any symbols occur that are not in the list of allowed symbols, and returns an error message when such a symbol is found. Finally, it is currently not possible to add new variables using correction rules, although such a feature will likely be added in the future.
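
A sketch of this workflow, assuming deducorrect's correctionRules/correctWithRules interface (the rule and data are made up):

    library(deducorrect)
    writeLines("if (height > 3) height <- height / 100", "rules.txt")   # height mistakenly recorded in cm
    u <- correctionRules("rules.txt")
    dat <- data.frame(height = c(1.80, 175))
    cor <- correctWithRules(u, dat)
    cor$corrected      # the data after applying the rule
    cor$corrections    # log of which rule changed which field in which record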

Deductive correction

One approach tries combinations of variable swaps among the variables that occur in violated edits; another computes candidate solutions and checks whether these candidates are within a certain string distance (see Section 2.4.2) of the original value. When printed, the returned object does not show its full content (the corrected data and the logging information) but a summary of what happened to the data.
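
A sketch of deductive correction with deducorrect's correctTypos and correctSigns (the edit rule and data are made up):

    library(editrules)
    library(deducorrect)
    E <- editmatrix("x + y == z")
    dat <- data.frame(x = c(100, 100), y = c(50, 50), z = c(150, 105))
    correctTypos(E, dat)   # tries candidates within a small typographic distance of the original value
    correctSigns(E, dat)   # tries sign flips and value swaps among variables in violated edits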

Deterministic imputation

In general, if one of the variables is missing, its value can be deduced unambiguously by solving the first rule for it (provided the solution does not violate the last rule). Assuming the observed values are correct, the only possible values for the other two variables then follow. Note that deduImpute only imputes values that can be derived with absolute certainty (uniquely) from the rules.

In the example there are multiple possible solutions for imputing the last record, so deduImpute leaves it untouched. Here, deduImpute uses automated logic to infer from the conditional rules that if someone has a driver's license, he must be an adult.
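
A sketch of deductive imputation with deduImpute (the rules and data are made up):

    library(editrules)
    library(deducorrect)
    E <- editmatrix(c("x + y == z", "x >= 0", "y >= 0"))
    dat <- data.frame(x = c(10, NA), y = c(NA, 20), z = c(30, 50))
    d <- deduImpute(E, dat)
    d$corrected            # only values that follow uniquely from the rules are filled in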

Imputation

Basic numeric imputation models

The methods are ultimately based on some form of regression, but are more involved than simple linear regression. Ratio imputation has the property that the imputed value satisfies x̂ = 0 when y = 0, which is not generally guaranteed in linear regression. There is no package that directly implements ratio imputation, unless one considers it a special case of regression imputation.

Below we assume that x and y are numerical vectors of equal length, where x contains missing values and y is complete. Unfortunately, it is not possible to simply wrap the above in a function and pass it to Hmisc's impute function. It should be noted that while Hmisc, VIM, mi and mice all implement imputation methods that ultimately use some form of regression, they do not offer a simple interface for the case described above.
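
Ratio imputation is easily written out by hand under these assumptions (the numbers below are made up):

    x <- c(15, NA, 22, NA, 30)
    y <- c(10, 12, 15, 18, 20)
    m <- is.na(x)
    R_hat <- sum(x[!m]) / sum(y[!m])   # ratio estimated from the complete pairs
    x[m] <- R_hat * y[m]               # imputed values: x_hat = R_hat * y
    x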

The more advanced methods in these packages aim to be more accurate and/or more robust to outliers than standard (generalized) linear regression. Furthermore, both mice and mi implement multiple imputation, which allows the imputation variance to be estimated.

Table 3: An overview of imputation functionality offered by some R packages. reg: regression, rand:

Hot deck imputation

In sequential hot deck imputation, the vector containing missing values is sorted by one or more auxiliary variables, so that records with similar auxiliary values appear consecutively in the data.frame. Each missing value is then imputed with the value from the next record that has an observed value. Alternatively, the value x̂ can be estimated with a prediction model (often some form of regression), both for the record with the missing value and for possible donor records k ≠ j in which the value is not missing.
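
A hand-written sketch of the sequential hot deck idea (the data are made up; the VIM package also offers a packaged hotdeck function):

    dat <- data.frame(region = c("A", "A", "B", "B"),
                      income = c(NA, 30, 25, NA))
    dat <- dat[order(dat$region), ]          # sort by the auxiliary variable
    imputeSeq <- function(x) {
      for (i in which(is.na(x))) {
        donors <- which(!is.na(x) & seq_along(x) > i)   # next record with an observed value
        if (length(donors) > 0) x[i] <- x[donors[1]]
      }
      x                                      # trailing records without a donor stay NA
    }
    dat$income <- imputeSeq(dat$income)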

A missing value is then imputed by first finding the k records closest to the record with one or more missing values. When the imputed value is chosen from among these nearest neighbours, kNN imputation is a form of hot deck imputation. For categorical variables, the per-variable distance is d_k(i, j) = 0 when the value of the kth variable is the same in records i and j, and 1 otherwise.

For numerical variables the median of the nearest neighbours is used as the imputation value; for categorical variables the category that occurs most frequently among the nearest neighbours is used. The rspa package can take a numerical record x and replace it with an adjusted record such that the weighted Euclidean distance between the original and the adjusted record is minimized, subject to a set of restrictions.
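
The kNN approach described above is available, for example, in the VIM package (a sketch; the data are made up, and VIM's documented defaults, the median for numeric variables and the most frequent category for categorical ones, match the description above):

    library(VIM)
    dat <- data.frame(height = c(180, NA, 170, 165, 175),
                      weight = c(80, 72, NA, 60, 70),
                      sex    = factor(c("M", "M", "F", "F", NA)))
    imp <- kNN(dat, k = 3)   # returns imputed data plus indicator columns marking what was imputed
    imp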

