WORKING WITH FILES AND REMOTE SERVERS

Searching for Files in a Directory

In Listing 4.31 we present our next example, a function that searches a directory for all filenames containing a particular string and then saves the list of matching filenames to the file fileList.txt. Note that the function is not recursive: it searches only the directory given, not its subdirectories.

Listing 4.31: Searching files in a directory

1 function directorySearch(directory, searchString)
2     outfile = open("fileList.txt","w")
3     fileList = filter(x->occursin(searchString, x), readdir(directory))
4
5     for file in fileList
6         println(outfile, file)
7     end
8     close(outfile)
9 end
10
11 directorySearch(pwd(),".jl")

In lines 1-9 we define the function directorySearch. As arguments, it takes a directory to search through and a searchString. Line 2 uses open with "w" to create the output file fileList.txt, to which we will write. In line 3 we create a string array of all filenames in the specified directory that contain searchString, and store it in the variable fileList. The readdir() function lists all files in the given directory, and filter() is used together with occursin() to keep only those elements that contain searchString. Lines 5-7 loop through the elements of fileList and print each one to the output file outfile. Line 8 closes the IOStream outfile. Line 11 gives an example of the use of directorySearch, where we obtain a list of all files in the current working directory, i.e. pwd(), whose names contain ".jl".
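Since directorySearch looks only at the directory given, a reader may want a variant that also descends into subdirectories. The following is a minimal sketch of one way to do this, not part of Listing 4.31. It assumes Julia's built-in walkdir(), and the function names recursiveSearch and saveMatches, as well as the temporary demonstration directory, are hypothetical names introduced here for illustration.

```julia
# Hypothetical recursive variant of directorySearch (not from Listing 4.31).
# walkdir yields a (root, dirs, files) tuple for the directory and each subdirectory.
function recursiveSearch(directory, searchString)
    matches = String[]
    for (root, dirs, files) in walkdir(directory)
        for file in files
            if occursin(searchString, file)
                push!(matches, joinpath(root, file))  # keep the full path
            end
        end
    end
    return matches
end

# Write one path per line; the do-block closes the stream even if an error occurs.
function saveMatches(matches, outPath)
    open(outPath, "w") do outfile
        for path in matches
            println(outfile, path)
        end
    end
end

# Small demonstration in a temporary directory (file names are made up).
demoDir = mktempdir()
mkpath(joinpath(demoDir, "sub"))
touch(joinpath(demoDir, "a.jl"))
touch(joinpath(demoDir, "notes.txt"))
touch(joinpath(demoDir, "sub", "b.jl"))

found = recursiveSearch(demoDir, ".jl")      # finds a.jl and sub/b.jl
saveMatches(found, joinpath(demoDir, "fileList.txt"))
println(length(found), " matching files")
```

Note that occursin() matches the string anywhere in the filename. If only files with the extension .jl are wanted, endswith(file, ".jl") could be used in its place.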

Connecting to a Remote Server

One does not always work with data stored locally on one's own machine or network. For example, a dataset is sometimes too large to be stored on a workstation and must therefore be stored remotely, in a datacentre or on a remote server. In this scenario one must first connect to the server before working with the data. A typical workflow involves connecting to the remote database, submitting a query, and then saving the result locally. There are many types of databases, including Oracle, MySQL, PostgreSQL, MongoDB, and others. Several Julia packages support working with databases, including LibPQ.jl, which is a wrapper for the PostgreSQL libpq C library, as well as SQLite.jl, ODBC.jl, and several others. Once a connection is established, one typically submits a so-called SQL query to the server. SQL stands for structured query language, and is a common syntax used to query databases in order to extract a subset of the data they hold.

In this section we do not expand on the details of databases, nor on the syntax of SQL queries. Instead, in Listing 4.32 we present a simple pseudocode example of how a user may connect to a remote PostgreSQL database, submit a SQL query, and then save the results.

Listing 4.32: Pseudocode for a remote database query

1 using LibPQ, DataFrames, CSV
2
3 host = "remoteHost"
4 dbname = "db1"
5 user = "username"
6 password = "userPwd"
7 port = "1111"
8
9 conStr = "host=" * host *
10     " port=" * port *
11     " dbname=" * dbname *
12     " user=" * user *
13     " password=" * password
14 conn = LibPQ.Connection(conStr)
15
16 df = DataFrame(execute(conn, "SELECT * FROM S1.T1"))
17 close(conn)
18
19 CSV.write("example.csv", df);

In line 1 the LibPQ package is included, along with DataFrames and CSV. LibPQ is a wrapper for the libpq PostgreSQL library, and contains methods to connect remotely to PostgreSQL servers and submit queries. In lines 3-7 the details of the connection are specified and stored as strings. They are the host name, database name, username, password, and the port to connect on. Lines 9-13 concatenate these details into the string conStr. In line 14 a connection to the remote server is established via the Connection() function from the LibPQ package, using the details in the string conStr. Note that if the password is omitted from the connection string, it must be supplied by other means, for example through the PGPASSWORD environment variable or a password file, both of which libpq consults.

In line 16 a SQL query is submitted to the server via the execute() function. It takes two arguments: the first is the connection to the server, and the second is the SQL query. The query submitted here is simple: SELECT * is used to select all columns FROM the table T1, in the schema S1, of the database db1. The results are stored as the DataFrame df. The connection to the server is closed in line 17 via close(). In line 19 the data in df is written to the CSV file example.csv in the current working directory.
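The string concatenation used to build the connection string can also be written more compactly. The following is a minimal sketch that runs without a database server. The helper buildConnString and the environment variable name DB_PASSWORD are hypothetical names introduced here for illustration and are not part of LibPQ. The connection values are the same placeholders used in Listing 4.32.

```julia
# Hypothetical helper (not part of LibPQ): turns keyword arguments into a
# "key=value key=value ..." string, in the order the keywords are given.
function buildConnString(; kwargs...)
    return join(("$(k)=$(v)" for (k, v) in kwargs), " ")
end

# Read the password from an environment variable rather than hard-coding it.
# The variable name DB_PASSWORD is an assumption; "userPwd" is the placeholder
# fallback used when the variable is not set.
password = get(ENV, "DB_PASSWORD", "userPwd")

conStr = buildConnString(host = "remoteHost", port = 1111, dbname = "db1",
                         user = "username", password = password)
println(conStr)
```

The resulting string could then be passed to LibPQ.Connection() as in the listing. This sketch does not quote or escape values, so a value containing spaces would need additional handling before being used in a connection string.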

Chapter 5

Statistical Inference Concepts - DRAFT

This chapter introduces statistical inference concepts, with the goal of establishing a theoretical footing for the key concepts that follow in later chapters. The approach is that of classical statistics, as opposed to machine learning, which is covered in Chapter 9. Statistical inference involves using mathematical techniques to draw conclusions about unknown population parameters based on collected data. The field employs a variety of stochastic models, both to analyze such techniques and to put forward efficient methods for carrying them out.

Broadly speaking, the analysis and methods of statistical inference can be categorized as either frequentist (also known as classical) or Bayesian. The former is based on the assumption that the population parameters of some underlying distribution, or probability law, exist and are fixed, but are as yet unknown. The process of statistical inference then deals with drawing conclusions about these parameters based on sampled data. In the latter, Bayesian, case it is only assumed that there is a prior distribution of the parameters. Here the key process deals with analyzing a posterior distribution of the parameters, which is an outcome of the inference process. In this book we focus almost solely on the classical frequentist approach, with the exception of Section 5.7, where we briefly explore Bayesian statistics.

In general, a statistical inference process involves data, a model, and analysis. The data is assumed to be comprised of random samples from the model. The goal of the analysis is then to make informed statements about population parameters of the model based on the data. Such statements typically take one of the following forms:

Point estimation - Determination of a single value (or vector of values) representing a best estimate of the parameter or parameters. In this case, the notion of “best” can be defined in different ways.

Confidence intervals - Determination of a range of values intended to contain the parameter. Under the model and the statistical process used, it is guaranteed that the range contains the parameter with a pre-specified probability.

Hypothesis tests - The process of determining if the parameter lies in a given region, in the complement of that region, or fails to take on a specific value. Such tests often represent a scientific hypothesis in a very natural way.
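To make the three forms concrete, the following is a minimal sketch using simulated data, and is not taken from the chapters that follow. It assumes normally distributed observations with arbitrarily chosen "true" parameters, and it uses the large-sample normal approximation with the constant 1.96 in place of an exact quantile. All variable names, the seed, and the function coverage are hypothetical and introduced here for illustration.

```julia
using Random, Statistics

Random.seed!(0)                  # arbitrary seed, for reproducibility
mu, sigma, n = 10.0, 2.0, 100    # assumed "true" parameters and sample size
z = 1.96                         # approximate 0.975 quantile of the standard normal

data = mu .+ sigma .* randn(n)   # simulated random sample

# Point estimation: the sample mean as an estimate of mu.
muHat = mean(data)

# Confidence interval: approximate 95% interval for mu (normal approximation).
halfWidth = z * std(data) / sqrt(n)
ci = (muHat - halfWidth, muHat + halfWidth)

# Hypothesis test: H0: mu = 10 against a two-sided alternative.
testStat = (muHat - 10.0) / (std(data) / sqrt(n))
rejectH0 = abs(testStat) > z

println("Point estimate: ", muHat)
println("Approximate 95% confidence interval: ", ci)
println("Reject H0: ", rejectH0)

# Estimate how often the interval contains mu over many repeated samples.
function coverage(mu, sigma, n, z, reps)
    hits = 0
    for _ in 1:reps
        x = mu .+ sigma .* randn(n)
        hw = z * std(x) / sqrt(n)
        hits += (mean(x) - hw <= mu <= mean(x) + hw)
    end
    return hits / reps
end

covered = coverage(mu, sigma, n, z, 10_000)
println("Estimated coverage: ", covered)
```

Under these assumptions the estimated coverage should be close to 0.95. This illustrates that the pre-specified probability refers to the interval-generating procedure over repeated samples, rather than to any single computed interval.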

Most of the point estimation, confidence intervals and hypothesis tests that we introduce and carry out in this book are elementary. Chapter 6 is devoted to covering elementary confidence intervals in detail, and Chapter 7 is devoted to covering elementary hypothesis tests in detail. We now begin to explore key ideas and concepts of statistical inference.

This chapter is structured as follows: In Section 5.1 we present the concept of a random sample together with the distribution of statistics, such as the distribution of the sample mean and the sample variance. In Section 5.2 we focus on random samples of normal random variables. In this common case, certain statistics have well-known distributions that play a central role in statistics. In Section 5.3 we explore the central limit theorem, providing justification for the ubiquity of the normal distribution. In Section 5.4 we explore the basics of point estimation. In Section 5.5 we explore the concept of a confidence interval. In Section 5.6 we explore concepts of hypothesis testing. Finally, in Section 5.7 we explore the basics of Bayesian statistics.
