126 CHAPTER 3. PROBABILITY DISTRIBUTIONS - DRAFT
Listing 3.34: Bivariate normal data
1 using Distributions, Plots; pyplot()
2
3 include("../data/mvParams.jl")
4 biNorm = MvNormal(meanVect,covMat)
5
6 N = 10^3
7 points = rand(MvNormal(meanVect,covMat),N)
8
9 support = 15:0.5:40
10 z = [ pdf(biNorm,[x,y]) for y in support, x in support ]
11
12 p1 = scatter(points[1,:], points[2,:], ms=0.5, c=:black, legend=:none)
13 p1 = contour!(support, support, z,
14     levels=[0.001, 0.005, 0.02], c=[:blue, :red, :green],
15     xlims=(15,40), ylims=(15,40), ratio=:equal, legend=:none,
16     xlabel="x", ylabel="y")
17 p2 = surface(support, support, z, lw=0.1, c=cgrad([:blue, :red]),
18     legend=:none, xlabel="x", ylabel="y",camera=(-35,20))
19
20 plot(p1, p2, size=(800, 400))
In line 3 we include another Julia file defining meanVect and covMat. This file is generated in Listing 4.12 of Chapter 4. In line 4 we create an MvNormal distribution object representing the bivariate distribution. In line 7 we use rand(), with a method provided by the Distributions package, to generate random points. The rest of the code plots Figure 3.26. Notice the call to contour!() in lines 13-16, with specified levels. In lines 17-18 the parameters supplied via camera are the horizontal rotation and the vertical rotation, in degrees.
[Figure 3.26 panels: a contour plot (left) and a surface plot (right), each with axes x and y ranging from 15 to 40.]
Figure 3.26: Contour lines and a surface plot for a bivariate normal distribution with randomly generated points on the contour plot.
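Since mvParams.jl is not reproduced here, the following is a minimal sketch of the non-plotting part of the listing. The values of meanVect and covMat below are assumptions chosen only for illustration; they are not the contents of the book's data file. The sketch shows the shape of the output of rand() and the layout of the density grid that is passed to the plotting functions.

```julia
using Distributions

# Hypothetical parameters (assumed for illustration; NOT the contents of mvParams.jl)
meanVect = [27.0, 26.0]
covMat   = [16.0 13.0; 13.0 12.0]   # symmetric and positive definite

biNorm = MvNormal(meanVect, covMat)

# rand() returns a 2×N matrix: one column per generated point
N = 10^3
points = rand(biNorm, N)
@show size(points)

# Evaluate the density on a grid; rows index y and columns index x
support = 15:0.5:40
z = [pdf(biNorm, [x, y]) for y in support, x in support]
@show size(z)

# The density at the mean vector is an upper bound for every grid value
@show pdf(biNorm, meanVect) >= maximum(z)
```

The first row of points holds the x coordinates and the second row holds the y coordinates, which is why the listing passes points[1,:] and points[2,:] to scatter().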
Chapter 4
Processing and Summarizing Data - DRAFT
In this chapter we introduce methods and techniques for processing and summarizing data.
In statistics nomenclature, the act of summarizing data is known as descriptive statistics. In data-science nomenclature such activities take the names of analytics and dash-boarding, while the process of manipulating and pre-processing data is sometimes called data cleansing, or data cleaning.
The statistical techniques and tools that we introduce include summary statistics and methods for data visualization, sometimes called Exploratory Data Analysis (EDA). We introduce several Julia tools for this, including the DataFrames package, which allows for the storage of datasets that contain non-homogeneous data and includes support for missing entries. We also use the Statistics and StatsBase packages, which contain useful functions for summarizing data.
In practice, statisticians and data-scientists often collect data in various ways, including experimental studies, observational studies, longitudinal studies, survey sampling, and data scraping. Then, to gain insight from the data, one may consider different data configurations such as:
Single sample: A case where all observations are considered to represent items from a homogeneous population. The configuration of the data takes the form: $x_1, x_2, \ldots, x_n$.
Single sample over time (time-series): The configuration of the data takes the form: $x_{t_1}, x_{t_2}, \ldots, x_{t_n}$ with time points $t_1 < t_2 < \ldots < t_n$.
Two samples: Similar to the single sample case, only now there are two populations (x's and y's). The configuration of the data takes the form: $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$.
Generalizations from two samples to $k$ samples (each of potentially different sample size, $n_1, \ldots, n_k$).
Observations in pairs (2-tuples): In this case, although similar to the two sample case, each observation is a tuple of points, $(x, y)$. Hence the configuration of data is $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Generalizations from pairs to vectors of observations: $(x_{11}, \ldots, x_{1p}), \ldots, (x_{n1}, \ldots, x_{np})$.
Other configurations including relationship data (graphs of connections), images, and many more.
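As a rough illustration of the configurations listed above, the following sketch shows one way each might be held in plain Julia containers. All variable names and numerical values are made up for illustration and do not come from any dataset used in the book.

```julia
# Illustrative (made-up) values for each data configuration

# Single sample: x_1, ..., x_n
singleSample = [2.3, 1.9, 3.1, 2.8]

# Time series: observations paired with increasing time points t_1 < ... < t_n
times  = [0.0, 0.5, 1.5, 4.0]
series = [10.2, 10.8, 9.7, 11.1]

# Two samples of potentially different sizes n and m
xSample = [5.1, 4.8, 5.6]
ySample = [6.0, 6.3, 5.9, 6.1, 6.4]

# k samples, each with its own size n_1, ..., n_k
kSamples = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

# Observations in pairs: (x_1, y_1), ..., (x_n, y_n)
pairedObs = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Vectors of observations: an n × p matrix with one row per observation
vectorObs = [1.0 0.5 2.0;
             1.2 0.7 1.9]

@show length(singleSample)
@show issorted(times)
@show length.(kSamples)
@show size(vectorObs)
```

In practice, tabular configurations such as pairs and vectors of observations are often stored in a data frame rather than in raw arrays.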
128 CHAPTER 4. PROCESSING AND SUMMARIZING DATA - DRAFT
This chapter is structured as follows. In Section 4.1 we see how to manipulate tabular data via data frames in Julia. In Section 4.2 we deal with methods of summarizing data, including basic elements of descriptive statistics. We then move on to plotting: in Section 4.3 we present a variety of methods for plotting single sample data, and in Section 4.4 we present plots for comparing samples. Section 4.5 presents plots for multivariate and high-dimensional data. We then present simpler business-style plots in Section 4.6. The chapter closes with Section 4.7, where we show several ways of handling files using Julia, as well as how to interact with a server-side database.
For readers who wish to better understand the concepts of copies and mutability used in Section 4.1, the subsection below provides an optional overview. It can be skipped on a first reading.
Mutability, References, Shallow Copies and Deep Copies in Julia
When using any programming language, it is useful to have a basic understanding of how data is organized and referenced in memory. For this reason we now briefly overview the differences between mutable types, immutable types, reference copying, shallow copies and deep copies in Julia. We also introduce the basic programming concepts of ‘call by value’ and ‘call by reference’. This basic understanding is important in its own right; however, it may also help readers better understand certain aspects of Julia’s DataFrames package, described in the sequel.
As a starting point, we review the difference between two mechanisms for passing variables to functions. Assume you have a variable x and a function f(), and you then execute f(x). One can envision two general mechanisms by which this can take place. The first is named call by value and describes a situation where the code implementing f() gets a copy of the variable x. As f() executes, even if its code appears to modify x, it is actually modifying a local copy. The second mechanism is named call by reference and describes a situation where f() obtains a memory reference (or pointer) to x. In such a case, as f() executes, if it modifies x, then it actually modifies values in the original memory location of x.
In Julia, both behaviors arise under a single convention called pass by sharing. This means that values are not copied when passed to functions. What the caller observes then depends on the mutability of the argument's type. If the type is immutable, the function has no way of altering the caller's value, so the behavior is that of ‘call by value’. If the type is mutable, the called function works with the original object rather than a copy, so it can modify the contents of that object as in ‘call by reference’. Note that assigning a new value to the argument name inside the function only rebinds the local name and never affects the caller. Hence the variable’s property, mutable or immutable, determines which function calling mechanism is exhibited.
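The distinction between rebinding an argument name and mutating the object it refers to can be sketched as follows. The function names rebind() and mutate!() are hypothetical and used only for this illustration.

```julia
# Rebinding: assigning to the argument name only changes the local binding
function rebind(v)
    v = [0, 0, 0]
    return nothing
end

# Mutation: writing into the array changes the object shared with the caller
function mutate!(v)
    v[1] = 0
    return nothing
end

x = [1, 2, 3]
rebind(x)
@show x        # x = [1, 2, 3]

mutate!(x)
@show x        # x = [0, 2, 3]
```

The trailing ! in mutate!() follows the common Julia naming convention for functions that modify their arguments.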
As a general rule, primitive types such as Int64 or Float32 are immutable. The same goes for composite types defined using the struct keyword. An exception to this is for composite types that are explicitly defined as mutable struct. Note that the code examples in this book seldom define types; however, many of the types we use from packages are composite types. While not often used, if you wish to programmatically check if the type of a variable is immutable or not, you can use the isimmutable() function (in Julia 1.5 and later, ismutable() is the preferred form).
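As a small sketch of the difference between struct and mutable struct, consider the following. The type names FixedPoint and MovingPoint are hypothetical and are defined here only for illustration.

```julia
# Hypothetical types for illustration
struct FixedPoint
    x::Int
    y::Int
end

mutable struct MovingPoint
    x::Int
    y::Int
end

p = FixedPoint(1, 2)
q = MovingPoint(1, 2)

@show isimmutable(p)   # true
@show isimmutable(q)   # false

q.x = 10               # allowed: fields of a mutable struct can be reassigned
@show q.x

# Assigning to a field of an immutable struct throws an error
fieldError = try
    p.x = 10
    false
catch
    true
end
@show fieldError
```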
Importantly, arrays are mutable. Listing 4.1 implements two different methods for the function f(). The first method is for Int, a primitive type (immutable), and the second is for Array{Int} (mutable). It then demonstrates the ‘call by value’ behavior exhibited for the primitive type, while the ‘call by reference’ behavior is exhibited for the array.
Listing 4.1: Call by value vs. call by reference
1 f(z::Int) = begin z = 0 end
2 f(z::Array{Int}) = begin z[1] = 0 end
3
4 x = 1
5 @show typeof(x)
6 @show isimmutable(x)
7 println("Before call by value: ", x)
8 f(x)
9 println("After call by value: ", x,"\n")
10
11 x = [1]
12 @show typeof(x)
13 @show isimmutable(x)
14 println("Before call by reference: ", x)
15 f(x)
16 println("After call by reference: ", x)

typeof(x) = Int64
isimmutable(x) = true
Before call by value: 1
After call by value: 1

typeof(x) = Array{Int64,1}
isimmutable(x) = false
Before call by reference: [1]
After call by reference: [0]
In line 1 we implement a method of f() for integer types. The code z = 0 only affects the local variable z. In line 2 we implement a method of f() for arrays. Here the code z[1] = 0 modifies the first entry of the input argument z. Lines 4-9 use the first method, passing the variable x into f(). As can be seen from the output, the operation of the function f() does not modify x. Also note the use of the @show macro, useful for debugging or understanding code. Lines 11-16 invoke the method of f() for arrays of integers (this is multiple dispatch). The key point is that f(x) in line 15 modifies the original x from global scope.
Ideally, for performance reasons, the level of actual copying of memory should be kept to a minimum. This is the underlying motivation for having a default ‘pass by reference’ mechanism when working with arrays, as you can give functions references to huge data arrays without any memory duplication. However, this entails some level of danger because function calls may modify variables that are passed to them as arguments. For this reason, Julia offers explicit functions for creating copies of variables, namely copy() and deepcopy(). The former creates a ‘shallow copy’ of the variable: it creates a new outer container whose entries refer to the same objects as the original, without copying recursively. The latter copies recursively until a completely independent copy of the variable is created.
We demonstrate the different types of copies and their interaction with mutability in Listing 4.2. The basic example on which we apply a deep copy is a doubly nested array, e.g. [[10]]. In this case, copy() does not copy the inner array [10], whereas deepcopy() recursively copies all mutable entries.
Listing 4.2: Deep copy and shallow copy
1 println("Immutable:")
2 a = 10
3 b = a
4 b = 20
5 @show a
6
7 println("\nNo copy:")
8 a = [10]
9 b = a
10 b[1] = 20
11 @show a
12
13 println("\nCopy:")
14 a = [10]
15 b = copy(a)
16 b[1] = 20
17 @show a
18
19 println("\nShallow copy:")
20 a = [[10]]
21 b = copy(a)
22 b[1][1] = 20
23 @show a
24
25 println("\nDeep copy:")
26 a = [[10]]
27 b = deepcopy(a)
28 b[1][1] = 20
29 @show a;

Immutable:
a = 10

No copy:
a = [20]

Copy:
a = [10]

Shallow copy:
a = Array{Int64,1}[[20]]

Deep copy:
a = Array{Int64,1}[[10]]
Lines 1-5 exhibit no surprise due to immutability. The value of the Int64 a is assigned to b, and b is then reassigned in line 4. Since integers are immutable, the reassignment simply binds b to a new value and leaves a unchanged. Lines 7-11 demonstrate different behavior. The array a is mutable, and hence after b is set to refer to a in line 9, the modification of b in line 10 also modifies a. Lines 13-17 show a case where a copy() of a is created. In this case the modification of b in line 16 does not alter a. Lines 19-23 are similar; however, in this case the fact that copy() is only a shallow copy matters. The variable b has a new outer array, but the inner array is still shared with a. Hence the modification in line 22 modifies the inner array of a as well. Finally, in lines 25-29 this is resolved by creating a deepcopy().
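The sharing described above can also be checked directly with the identity operator ===, which for mutable objects such as arrays tests whether two names refer to the same object in memory. The following minimal sketch uses the hypothetical variable names alias, shallow and deep.

```julia
a = [[10]]

alias   = a             # no copy: both names refer to the same array
shallow = copy(a)       # new outer array, shared inner array
deep    = deepcopy(a)   # new outer array and new inner array

@show alias === a            # true
@show shallow === a          # false
@show shallow[1] === a[1]    # true
@show deep[1] === a[1]       # false
@show deep == a              # true: equal contents, different objects
```

This is why modifying shallow[1][1] would also change a, whereas modifying deep[1][1] would not.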