R Data Import/Export - The Comprehensive R Archive Network

Nguyễn Gia Hào

Academic year: 2023


Full text

The section on relational databases in this guide is based in part on an earlier guide by Douglas Bates and Saikat DebRoy. Unless otherwise stated, everything described in this guide is (at least in principle) available on all platforms running R. For example, Therneau & Grambsch (2000) noted that they preferred to manipulate data in SAS and then use package survival (https://CRAN.R-project.org/package=survival) in S for the analysis. Database manipulation systems are often well suited for manipulating and extracting data: several packages for interacting with DBMSs are discussed here.

Imports

Encodings

Unless the file to be imported from is entirely in ASCII, it is usually necessary to know how it was encoded. For text files, a good way to find out something about a file's structure is the file command-line tool (for Windows, included in Rtools). All too often we have been reduced to looking at the file with the command-line tool od or a hex editor to work out its encoding.
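The following is a minimal sketch of the same kind of inspection done from within R rather than with od. The file contents and the variable names are made up for illustration; it assumes a file known to be in Latin-1.

```r
# Write five bytes: "caf", then e-acute as the single Latin-1 byte 0xE9, then newline.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), path)

# Inspect the raw bytes, much as od or a hex editor would show them.
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)

# Declare the encoding on the connection so the text is re-encoded on input.
con <- file(path, open = "r", encoding = "latin1")
txt <- readLines(con)
close(con)
```

A lone byte 0xE9 is not valid UTF-8, so seeing it in the dump is a strong hint that the file uses a single-byte encoding such as Latin-1.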

Export to text files

Such files are known as CSV (Comma Separated Values) files, and the write.csv wrapper function provides the appropriate defaults. All of these functions can write to a connection that can specify an encoding for the file, and write.table has a fileEncoding argument to make this easier. The write.matrix function in package MASS (https://CRAN.R-project.org/package=MASS) provides a specialized interface for writing matrices, with the ability to write them in blocks and thereby reduce memory consumption.
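As a small illustration, the sketch below writes a data frame with write.csv, passing fileEncoding through to write.table, and reads it back. The data frame and file name are invented for the example.

```r
# A small data frame to export (illustrative data only).
df <- data.frame(id = 1:3,
                 name = c("alpha", "beta", "gamma"),
                 stringsAsFactors = FALSE)

# write.csv supplies the CSV defaults; fileEncoding is passed on to write.table.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back, declaring the same encoding.
back <- read.csv(csv_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
```

Setting row.names = FALSE avoids writing an unnamed first column that other software may not expect.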

XML

Using UTF-8 is a good way to create portable files that won't be easily confused with other encodings, but even macOS applications (where UTF-8 is the system encoding) may not recognize them, and Windows applications most likely will not. It is possible to use sink to redirect the standard R output to a file, capturing the output of (possibly implicit) print statements. The write.foreign function in package foreign (https://CRAN.R-project.org/package=foreign) uses write.table to produce a text file and also writes a code file that will read this text file in another statistics package.
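A brief sketch of sink in use follows; the file name and the values printed are arbitrary.

```r
# Divert standard output to a temporary file.
log_path <- tempfile(fileext = ".txt")
sink(log_path)
print(summary(c(1, 2, 3, 10)))
# Restore output to the console; every sink() that opens a diversion needs a matching close.
sink()

# The printed summary is now in the file.
captured <- readLines(log_path)
```

Forgetting the closing sink() leaves later output going to the file, so it is worth pairing the two calls closely.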

Variations on read.table

Unless you do something special, read.table reads all the columns as character vectors and then tries to select an appropriate class for each variable in the data frame. By default read.table uses '#' as a comment character and if this is encountered (except in quoted strings) the rest of the line is ignored. If the read.table options are specified incorrectly, the error message will usually report either that a line did not have the expected number of elements or that there are more columns than column names.
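The comment-character and class-selection behaviour can be seen in a small sketch like the one below. The input lines are invented, and text = is used only so that the example needs no file.

```r
# Four input lines: a header, a row with a trailing comment,
# a comment-only line, and a row with '#' inside a quoted string.
lines <- c("x y z",
           "1 2.5 a   # trailing comment",
           "# a full-line comment",
           "2 3.5 'b # not a comment'")

df <- read.table(text = lines, header = TRUE, stringsAsFactors = FALSE)

# Classes chosen for each column after reading everything as character.
col_classes <- sapply(df, class)
```

Here the '#' inside the quoted string is kept as data, while the other two comments are dropped.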

Fixed-width-format files

Some people argue that UTF-8 files should never have BOMs, but some software (apparently including Excel:mac) uses them, and many Unix-like operating systems do not accept them. The convenience functions read.csv and read.delim provide arguments to read.table suitable for CSV files and tab-delimited files exported from spreadsheets in English-speaking locales. The read.csv2 and read.delim2 variants are suitable for use in those locales where a comma is used for the decimal point and (for read.csv2) for spreadsheets that use semicolons to separate fields.
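For instance, a semicolon-separated export with decimal commas could be read along these lines (the data are invented for illustration):

```r
# Semicolon-separated fields with a comma as the decimal point.
export <- c("name;value",
            "a;1,5",
            "b;2,25")

# read.csv2 sets sep = ";" and dec = "," on top of read.table.
df2 <- read.csv2(text = export, stringsAsFactors = FALSE)
```

Reading the same text with read.csv would instead treat each line as a single field split at the commas, which is usually the first sign that the wrong variant was chosen.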

This may provide enough information to locate the problem, but the helper function count.fields may be useful for further investigation. For large files it helps to specify comment.char = "", to give colClasses as one of the atomic vector types (logical, integer, numeric, complex, character, or perhaps raw) for each column, and to give nrows, the number of rows to be read (a mild overestimate is better than not specifying it at all).
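Both suggestions can be sketched as follows, assuming two small temporary files created just for the example: one with a short line, to be diagnosed with count.fields, and one well-formed, to be read with the faster settings.

```r
# A small file in which the second line is one field short.
ragged <- tempfile(fileext = ".txt")
writeLines(c("1 2.5 a", "2 3.5", "3 4.5 c"), ragged)

# count.fields reports the number of fields found on each line.
n_fields <- count.fields(ragged)
bad_lines <- which(n_fields != n_fields[1])

# A well-formed file, read with comments disabled, classes declared,
# and a mild overestimate of the number of rows.
clean <- tempfile(fileext = ".txt")
writeLines(c("1 2.5 a", "2 3.5 b", "3 4.5 c"), clean)
df <- read.table(clean,
                 comment.char = "",
                 colClasses = c("integer", "numeric", "character"),
                 nrows = 10)
```

Declaring colClasses spares read.table the work of reading every column as character and converting afterwards, which matters mainly for large files.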

Data Interchange Format (DIF)

On Windows, spreadsheet programs often save spreadsheet data copied to the clipboard in this format; read.DIF("clipboard") can read it directly from there.

Using scan directly

Re-shaping data

Flat contingency tables

In all cases, the facilities described were written for data files from specific versions of the other system (often in the early 2000s), and have not necessarily been updated for the latest versions of the other system.

EpiInfo, Minitab, S-PLUS, SAS, SPSS, Stata, Systat

Octave

Why use a database?

Overview of RDBMSs

SQL queries

SELECT State, Murder FROM USArrests WHERE Rape > 30 ORDER BY Murder

SELECT t.sch, c.means, t.sex, t.achieve FROM student as t, school as c WHERE t.sch = c.id

SELECT gender, COUNT(*) FROM student GROUP BY gender

The five aggregation functions are COUNT(*) and SUM, MAX, MIN, and AVG, each of which is applied to a single column.

SELECT queries use FROM to select the table, WHERE to specify a condition for inclusion (or more than one condition separated by AND or OR), and ORDER BY to sort the result. Unlike data frames, rows in RDBMS tables are best thought of as unordered, and without an ORDER BY clause, the order is indeterminate. You can sort (in lexicographic order) on more than one column by separating them with commas.

If more than one column (separated by commas) is specified in a GROUP BY clause, multiway cross-classifications can be summarized with one of the five aggregation functions. If the SELECT statement contains an ORDER BY clause that produces a unique order, a LIMIT clause can be added to select (by number) a contiguous block of output rows. (It may not be reliable unless the order is unique, as the LIMIT clause can be used to optimize the query.)
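For readers who think in terms of data frames, the clauses described here have rough base R counterparts. The sketch below is only an analogy, not how a DBMS evaluates a query; the student data frame is hypothetical, and USArrests is the dataset shipped with R.

```r
# WHERE Rape > 30, SELECT State and Murder, ORDER BY Murder.
arrests <- data.frame(State = rownames(USArrests), USArrests,
                      row.names = NULL, stringsAsFactors = FALSE)
hits <- arrests[arrests$Rape > 30, c("State", "Murder")]
hits <- hits[order(hits$Murder), ]

# LIMIT 3: a contiguous block from the top of the ordered result.
top3 <- head(hits, 3)

# GROUP BY gender with COUNT(*), on a hypothetical student table.
student <- data.frame(gender = c("F", "M", "F", "M", "F"),
                      stringsAsFactors = FALSE)
counts <- as.data.frame(table(gender = student$gender), responseName = "n")
```

Unlike the database table, the data frame keeps its row order, which is why the explicit order() step is the only place ordering enters.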

Kline and Kline (2001) discuss the details of the implementation of SQL in Microsoft SQL Server 2000, Oracle, MySQL, and PostgreSQL.

Data types

R interface packages

Packages using DBI

Package RMySQL (https://CRAN.R-project.org/package=RMySQL) on CRAN provides an interface to the MySQL database system (see https://www.mysql.com and Dubois, 2000) or its fork MariaDB (see https://mariadb.org/). The current version requires the DBI package (https://CRAN.R-project.org/package=DBI), and this description applies with minor changes to all the other DBI back-ends. The call to dbDriver("MySQL") returns a database connection manager object, and a call to dbConnect then opens a database connection, which can later be closed by a call to the generic function dbDisconnect.

Use dbDriver("Oracle"), dbDriver("PostgreSQL") or dbDriver("SQLite") with those DBMSs and packages ROracle (https://cran.r-project.org/package=ROracle), RPostgreSQL (https://CRAN.R-project.org/package=RPostgreSQL) or RSQLite (https://CRAN.R-project.org/package=RSQLite) respectively. The fetch function is used to retrieve some or all rows in the query result as a list. The dbHasCompleted function indicates whether all rows have been retrieved and dbGetRowCount returns the number of rows in the result.
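The workflow can be sketched with an in-memory SQLite database, assuming packages DBI and RSQLite are installed (the function returns NULL if they are not). The wrapper function query_arrests is a name invented for this example. It uses dbFetch, the name current DBI versions use for fetch, and obtains the driver object from RSQLite::SQLite() rather than dbDriver("SQLite").

```r
query_arrests <- function() {
  # Skip quietly when the assumed packages are not available.
  if (!requireNamespace("DBI", quietly = TRUE) ||
      !requireNamespace("RSQLite", quietly = TRUE)) {
    return(NULL)
  }
  # Open a connection to a throwaway in-memory database.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Copy the USArrests data frame into a table, with State as a column.
  arrests <- data.frame(State = rownames(USArrests), USArrests,
                        row.names = NULL, stringsAsFactors = FALSE)
  DBI::dbWriteTable(con, "USArrests", arrests)

  # Send the query, then retrieve the result in two blocks.
  res <- DBI::dbSendQuery(
    con,
    "SELECT State, Murder FROM USArrests WHERE Rape > 30 ORDER BY Murder")
  first <- DBI::dbFetch(res, n = 3)    # first three rows
  rest <- DBI::dbFetch(res, n = -1)    # all remaining rows
  fetched <- DBI::dbGetRowCount(res)   # rows retrieved so far
  done <- DBI::dbHasCompleted(res)     # TRUE once nothing is left
  DBI::dbClearResult(res)

  list(rows = rbind(first, rest), fetched = fetched, done = done)
}

result <- query_arrests()
```

Fetching in blocks is mainly useful when the result is too large to hold in memory at once; for small results a single dbGetQuery call does the send, fetch and clear steps together.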

Package RODBC

On Windows, DSNs are configured in the ODBC applet in Control Panel ("Data Sources (ODBC)" in the "Administrative Tools" section).

Binary data formats

A particular class of binary files are files that represent images, and a common request is to read such a file into R as an array. ImageMagick and GraphicsMagick provide the convert and gm convert command-line tools to convert images from one format to another: which formats they can handle is determined when they are compiled, and the supported formats can be listed by e.g. convert -list format. Package pixmap (https://CRAN.R-project.org/package=pixmap) has a function read.pnm to read 'portable anymap' images in PBM (black and white), PGM (grey) and PPM (RGB color) formats.

Packages bmp (https://CRAN.R-project.org/package=bmp), jpeg (https://CRAN.R-project.org/package=jpeg) and png (https://CRAN.R-project.org/package=png) read the formats they are named after. Other packages that handle images include biOps (https://CRAN.R-project.org/package=biOps) and Momocs (https://CRAN.R-project.org/package=Momocs), and Bioconductor package EBImage. TIFF is more of a meta-format, a wrapper within which a very large variety of image formats can be embedded.

The rtiff (https://CRAN.R-project.org/package=rtiff) and tiff (https://CRAN.R-project.org/package=tiff) packages can read some subformats (depending on the external libtiff software against which they are compiled). Raster files are common in the geographic sciences, and the rgdal package (https://CRAN.R-project.org/package=rgdal) provides an interface to GDAL, which has some facilities of its own for reading raster files and links to many others. It can be useful for uncommon formats such as JPEG 2000 (which is a different format from JPEG and is not currently supported in the rgdal binaries for macOS or Windows).

Connections are used in R in the sense of Chambers (1998) and Ripley (2001), a set of functions to replace the use of filenames with a flexible interface to file-like objects.
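As a small sketch of what this buys, the same writeLines and readLines calls below work on a compressed file and on an in-memory character vector, with only the connection constructor changing. The file name and contents are arbitrary.

```r
# Output to a gzip-compressed file through a connection.
gz_path <- tempfile(fileext = ".gz")
out <- gzfile(gz_path, open = "w")
writeLines(c("alpha", "beta"), out)
close(out)

# Input from the same compressed file.
inp <- gzfile(gz_path, open = "r")
from_gz <- readLines(inp)
close(inp)

# A text connection reads from a character vector as if it were a file.
tc <- textConnection(c("gamma", "delta"))
from_text <- readLines(tc)
close(tc)
```

Closing each connection explicitly releases the underlying file and avoids warnings about unused connections being closed later.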

Types of connections

Output to connections

Input from connections

Pushback

R connections take the idea of pushing input back onto a stream (familiar to C programmers from ungetc) and make it more powerful, in that an (essentially) arbitrary number of lines of text can be pushed back onto a connection via a call to pushBack. Pushbacks work like a stack, so a read request first uses each line of the most recently pushed text, then those from previous pushbacks, and finally reads from the connection itself.
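The stacking order can be checked with a short sketch using a text connection; the line contents are placeholders.

```r
# A connection holding two lines of its own.
con <- textConnection(c("line 1", "line 2"))

# Two pushbacks: the second call sits on top of the first.
pushBack(c("pushed A", "pushed B"), con)
pushBack("pushed C", con)
pending <- pushBackLength(con)   # number of pushed-back lines waiting

# Reading drains the most recent pushback first, then the earlier one,
# and only then the connection itself.
all_lines <- readLines(con)
close(con)
```

Within a single pushBack call the lines are read in the order given; it is only separate calls that are stacked.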

Listing and manipulating connections

Binary connections

Special values

The missing value for R logical and integer types is INT_MIN, the smallest representable int defined in the C header limits.h, normally corresponding to the bit pattern 0x80000000. The representation of the special values for R numeric and complex types is machine dependent and may also be compiler dependent. The easiest way to use them is to link an external application against the standalone Rmath library, which exports the double constants NA_REAL, R_PosInf and R_NegInf, and to include the Rmath.h header, which defines the ISNAN and R_FINITE macros.
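The integer and logical case can be observed from R itself. The sketch below writes NA values to a raw vector in big-endian order, so the bytes read left to right as the bit pattern quoted above; it assumes the usual 4-byte int.

```r
# Serialize the integer and logical missing values as big-endian bytes.
na_int_bytes <- writeBin(NA_integer_, raw(), endian = "big")
na_lgl_bytes <- writeBin(NA, raw(), endian = "big")
print(na_int_bytes)

# Reading the same bit pattern back as an integer gives NA.
round_trip <- readBin(as.raw(c(0x80, 0x00, 0x00, 0x00)),
                      what = "integer", endian = "big")
```

No such check is shown for doubles, since their special-value representation is described above as machine dependent.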

Reading from sockets

Using download.file

Note that most of the tips are for pre-Excel 2007 spreadsheets and not the later .xlsx format. You may need to use read.delim2 or read.csv2 in a locale that uses a comma as the decimal point. If you don't have Excel, many other programs can read such spreadsheets and export to a text format on both Windows and Unix, for example Gnumeric (http://www.gnumeric.org) and OpenOffice (https://www.openoffice.org).

You can also cut and paste between the display of a spreadsheet in such a program and R: read.table will read from the R console or, on Windows, from the clipboard (via file = "clipboard" or readClipboard). Windows users (of 32-bit R) can use odbcConnectExcel in package RODBC (https://CRAN.R-project.org/package=RODBC). This can select rows and columns from any of the sheets in an Excel spreadsheet file (at least from Excel 97–2003, depending on your ODBC drivers: calling odbcConnect directly can read versions back to Excel 3.0).

The odbcConnectExcel2007 version will read Excel 2007 as well as earlier formats (provided the drivers are installed, including with 64-bit Windows R: see the section "Package RODBC"). The packages dataframes2xls (https://CRAN.R-project.org/package=dataframes2xls) and WriteXLS (https://CRAN.R-project.org/package=WriteXLS) each contain a function to write one or more data frames into an .xls file, using Python and Perl respectively. The xlsx package (https://CRAN.R-project.org/package=xlsx) can read and manipulate Excel 2007 and later spreadsheets; it requires Java.

Package XLConnect (https://CRAN.R-project.org/package=XLConnect) can read, write and manipulate both Excel 97–2003 and Excel 2007/10 spreadsheets using Java.
