

In the document pandas: powerful Python data analysis toolkit (pages 163-175)


3.5 Comparison with other tools

3.5.1 Comparison with R / R libraries

Transforming

R                                pandas
select(df, col_one = col1)       df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1)       df.rename(columns={'col1': 'col_one'})
mutate(df, c = a - b)            df.assign(c=df.a - df.b)
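The pandas expressions in the table can be run end to end; a minimal sketch, using a small made-up frame so the results are easy to check:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'a': [10, 20], 'b': [3, 4]})

# select(df, col_one = col1): rename, then keep only the renamed column
selected = df.rename(columns={'col1': 'col_one'})['col_one']

# rename(df, col_one = col1): rename returns a new frame
renamed = df.rename(columns={'col1': 'col_one'})

# mutate(df, c = a - b): assign() returns a copy with the new column
mutated = df.assign(c=df.a - df.b)

print(list(renamed.columns))   # ['col_one', 'a', 'b']
print(mutated['c'].tolist())   # [7, 16]
```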

Grouping and summarizing

R                                            pandas
summary(df)                                  df.describe()
gdf <- group_by(df, col1)                    gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE))   df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1))              df.groupby('col1').sum()
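A runnable sketch of the summarise equivalents, using a small hypothetical frame so the aggregates are deterministic:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'x', 'y'], 'val': [1.0, 3.0, 5.0]})

# summarise(gdf, avg = mean(val)): per-group mean via agg
avg = df.groupby('col1').agg({'val': 'mean'})

# summarise(gdf, total = sum(val)): per-group sum
total = df.groupby('col1')['val'].sum()

print(avg.loc['x', 'val'])   # 2.0
print(total['y'])            # 5.0
```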

Base R

Slicing with R's c

R makes it easy to access data.frame columns by name

df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]

or by integer location

df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward

In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

In [2]: df[['a', 'c']]
Out[2]:
          a         c
0 -0.675786 -0.705430
1 -1.312106 -0.458568
2  0.760743  0.794540
3 -1.993663 -0.638385
4  0.174879  0.456440
5  0.021686 -1.546643
6  1.189064  0.034015
7 -0.212239  1.375358
8  1.488271 -0.044956
9 -0.209352 -0.531272

In [3]: df.loc[:, ['a', 'c']]
Out[3]:
          a         c
0 -0.675786 -0.705430
1 -1.312106 -0.458568
2  0.760743  0.794540
3 -1.993663 -0.638385
4  0.174879  0.456440
5  0.021686 -1.546643
6  1.189064  0.034015
7 -0.212239  1.375358
8  1.488271 -0.044956
9 -0.209352 -0.531272

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.

In [4]: named = list('abcdefg')

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]:
           a         b         c         d         e         f         g         7         8         9        24        25        26        27        28        29
0   0.845186  1.477387 -1.101084  0.045371 -0.152402  1.193802 -0.641544  0.727758  0.166301 -0.262495  0.865643 -0.583461  0.611756  0.043941 -0.180218 -0.234422
1  -0.091082 -0.241438  0.888987  0.950624  1.486497  0.816097 -0.303517 -0.185567  1.005420  0.528180  1.150292  0.366950 -0.302928 -1.782282 -1.060319  0.926426
2  -0.120281 -1.355382  1.316443  1.035293  1.832802  0.840813  1.224717  1.189179  0.708129 -1.088905 -2.090951 -0.249315 -0.255315  1.062796  0.120178 -2.567354
3   0.363030  0.337972  0.562713 -1.958954  1.343284  0.294966 -0.838727  0.537231  0.471620 -0.241791 -0.229986 -0.042663  1.067309  1.733714  0.187126  1.714257
4  -0.924266 -0.529364  0.900472 -2.246269  1.311554 -0.346161 -1.563239  0.821391  0.257079  2.025030 -0.997635 -1.166100 -0.194001  1.539780  1.141356 -0.671260
5   0.365321  2.220535 -0.862720  0.500629 -0.463077  0.646191 -0.586978 -1.123404  0.755633 -0.171989 -0.934973  0.407911 -0.049064 -0.501261  1.166429 -1.639652
6  -0.779156  0.872386  1.498847  0.802301  1.120817  0.682671  0.387492  1.676350  1.250301  0.511174  1.725580  0.736709 -0.075654 -0.433720 -0.531258  0.163096
7  -1.092626  2.100230  0.809544 -0.357548 -0.141715 -1.231609  0.940635  1.341796  0.219083  0.134680 -0.052375 -0.313744 -1.509039  0.369685 -0.491988  0.634582
8  -0.770382 -0.148715  0.995860 -0.497789 -0.337062 -0.811544 -0.136611 -0.159909 -3.307796  1.151714 -1.823490 -0.584232 -0.360641  0.343535  0.655504  0.944038
9  -0.212662 -1.868740  0.405428 -0.203476 -0.556169  0.388479 -0.078858 -0.665643 -0.083556  0.355772  0.105630  1.056475  0.441954 -1.677874 -2.292959 -0.015734
10  1.029785 -1.357948  0.736661 -0.373011  0.010686  0.891226  1.708380 -0.621027 -1.361612  0.876017 -0.747836 -0.101313  0.153197  1.522210 -0.543723 -1.056901
11  0.344063  0.844541 -0.203043 -1.072639 -0.339559 -1.805151 -0.803211 -1.194427 -0.186082 -0.398677 -0.022149  0.467412 -0.062829 -1.303249  1.008333  0.484176
12  0.714130 -1.809731  0.500489 -0.964478 -1.151533 -0.081869  0.364836 -0.478761  0.175269 -1.043484 -0.700437 -0.446162  0.262790 -0.195336 -0.610571  1.284843
13  0.590426 -1.066204  0.647251  0.490382 -0.342527  2.300159  0.017046 -1.059625  2.150886  0.251143 -0.743321 -0.286809  0.930005  0.109723 -0.088245  0.058119
14 -3.048565 -1.434827  1.417011 -0.044549 -1.436573 -2.093045  0.844530  0.896037 -1.068920 -0.517810 -0.519054 -0.359963  0.739228  0.516530  0.398125  0.898604
15 -1.816139  0.784040 -0.616144  0.528060 -1.467376  1.490376 -1.339519  0.690586  2.210548 -0.244964  1.224123  0.466515 -0.567060 -0.122205 -1.724897 -1.719786
16 -1.013236 -1.192154  1.601083  0.210545 -1.479283  2.875824  1.156338  0.018656 -0.370621  1.413207 -1.062860  1.021655 -1.014025 -0.492616 -0.437969  0.325006
17  0.341647  0.423006 -0.079704  0.193268  1.065203  2.303325  0.672584  0.975313  0.048812 -0.362334  0.474592 -1.214451  0.260621 -0.065847 -0.370592  0.734567
18  0.548488  2.030576 -1.737444  0.424595  0.888119  0.811904 -2.663094 -0.214405 -1.553917 -0.326102  0.129004  0.376030  0.772534  1.201531  0.153554 -1.852295
19  0.386408  1.627682  0.042071 -1.738630  2.002411  1.014131 -0.363660 -2.089632 -0.575169 -0.846599 -1.584384 -0.487075  0.776227  0.234607 -0.482598 -0.325331
20  1.150351  0.062999  0.000056 -1.650372 -0.573644 -1.277146 -0.758280 -0.655414 -0.513040  2.952950 -0.840759  0.308092  0.757281  1.328255  0.023158  0.527266
21 -0.135547  0.221105  0.169215 -0.669499 -1.341411 -1.056065 -0.243079  1.261740  1.361390  2.318099  0.756522  1.248659 -0.089996  3.271387 -0.387008 -1.385124
22  0.681371 -0.683013  0.187458  0.163973  0.487176  0.383422  0.846315  0.648417  0.414951 -1.188660  0.592502  1.145729 -0.922409 -1.195030 -0.585130  0.337075
23 -1.322151  1.388533  0.326662  0.849451  1.564112 -0.351552 -0.256252 -0.711547  1.330510 -0.123016 -1.009203 -0.089746 -0.114824  0.435513 -1.723209 -1.155514
24  0.900846 -0.254314  0.089641 -0.390505 -0.365457 -1.873269  2.119430  0.911508 -0.773568  0.174730 -0.248129  1.116340  0.847847 -0.561636  1.146456 -1.669738
25  0.750372 -0.441270  0.692703 -0.450323 -0.310025 -0.897737  0.597373 -0.965110  0.780583 -0.580247  0.205643  1.521676  0.008620 -0.201451  2.674895 -1.269927
26  1.583249  0.635870 -0.219556  0.673535 -0.255959  0.689100 -0.364517  0.477642  0.006587  1.591417  0.376625 -1.035038  2.026937  0.802101 -0.135218  2.809340
27  0.704517  0.054065  0.364239  0.487663 -1.491507 -0.423214  0.961798 -1.934210 -0.841410  0.215011  0.654141  0.088707 -1.851563 -0.310688  0.834693 -0.778363
28  0.704446  0.521850 -0.069554 -1.498666 -1.234736 -1.064675  1.485964 -1.049791  0.479308 -0.442094  0.355866 -1.864982 -1.614115 -1.301276 -1.038078 -1.122105
29  0.516201 -0.687797 -0.261853 -0.950345 -0.759850 -0.148198 -2.312528  0.577857 -0.306088  0.825769  0.215203 -0.135859  0.023179 -0.991331  0.082219  0.071864
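Because the example above is built from random numbers, here is a deterministic sketch of the same iloc/numpy.r_ pattern (the zero-filled frame is my choice, only the selected column labels matter):

```python
import numpy as np
import pandas as pd

named = list('abcdefg')
n = 30
# letter labels for the first 7 columns, integer labels for the rest
columns = named + np.arange(len(named), n).tolist()
df = pd.DataFrame(np.zeros((n, n)), columns=columns)

# np.r_ concatenates slices into one position array: 0..9 and 24..29
subset = df.iloc[:, np.r_[:10, 24:30]]
print(list(subset.columns))
# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 7, 8, 9, 24, 25, 26, 27, 28, 29]
```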

aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:

df <- data.frame(
  v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
  by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.

In [9]: df = pd.DataFrame(
   ...:     {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
   ...:      'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
   ...:      'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
   ...:      'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
   ...:              np.nan]})
   ...:

In [10]: g = df.groupby(['by1', 'by2'])

In [11]: g[['v1', 'v2']].mean()
Out[11]:
            v1    v2
by1  by2
1    95    5.0  55.0
     99    5.0  55.0
2    95    7.0  77.0
     99    NaN   NaN
big  damp  3.0  33.0
blue dry   3.0  33.0
red  red   4.0  44.0
     wet   1.0  11.0

For more details and examples see the groupby documentation.
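If you want the result column named avg, as in the summarise call, newer pandas (0.25 and later) also supports named aggregation; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'by1': ['red', 'blue', 'red'],
                   'v1': [1.0, 3.0, np.nan]})

# keyword form: result column 'avg' holds the mean of v1;
# like mean(..., na.rm=TRUE) in R, pandas skips NaN by default
out = df.groupby('by1').agg(avg=('v1', 'mean'))
print(out.loc['red', 'avg'])   # 1.0
```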

match / %in%

A common way to select data in R is using %in%, which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:

s <- 0:4
s %in% c(2,4)

The isin() method is similar to the R %in% operator:

In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])
Out[13]:
0    False
1    False
2     True
3    False
4     True
dtype: bool

The match function returns a vector of the positions of matches of its first argument in its second:

s <- 0:4

match(s, c(2,4))
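For the positional behaviour of match itself, one close pandas analogue (my suggestion, not something this page states) is Index.get_indexer, which returns 0-based positions and -1 where R would return NA:

```python
import pandas as pd

s = pd.Series(range(5))       # 0, 1, 2, 3, 4
table = pd.Index([2, 4])

# position of each element of s within `table`; -1 means no match
# (R's match is 1-based and uses NA instead of -1)
positions = table.get_indexer(s)
print(positions.tolist())   # [-1, -1, 0, -1, 1]
```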

For more details and examples see the reshaping documentation.

tapply

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular.

Using a data.frame called baseball, and retrieving information based on the array team:

baseball <-
  data.frame(team = gl(5, 5,
                       labels = paste("Team", LETTERS[1:5])),
             player = sample(letters, 25),
             batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team, max)

In pandas we may use the pivot_table() method to handle this:

In [14]: import random

In [15]: import string

In [16]: baseball = pd.DataFrame(
   ....:     {'team': ["team %d" % (x + 1) for x in range(5)] * 5,
   ....:      'player': random.sample(list(string.ascii_lowercase), 25),
   ....:      'batting avg': np.random.uniform(.200, .400, 25)})
   ....:

In [17]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
Out[17]:
team           team 1    team 2    team 3    team 4   team 5
batting avg  0.376302  0.382559  0.364652  0.358454  0.39535

For more details and examples see the reshaping documentation.

subset

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column's values are less than another column's values:

df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,]  # note the comma

In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice, as well as standard boolean indexing:

In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [19]: df.query('a <= b')
Out[19]:
          a         b
5 -0.963807 -0.875940
8 -1.318729  0.725287
9  1.144169  1.691462

In [20]: df[df.a <= df.b]
Out[20]:
          a         b
5 -0.963807 -0.875940
8 -1.318729  0.725287
9  1.144169  1.691462

In [21]: df.loc[df.a <= df.b]
Out[21]:
          a         b
5 -0.963807 -0.875940
8 -1.318729  0.725287
9  1.144169  1.691462

For more details and examples see the query documentation.

with

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b  # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [23]: df.eval('a + b')
Out[23]:
0    0.197930
1    0.704779
2   -1.138875
3    2.270749
4   -0.006389
5   -0.797885
6   -1.848427
7   -1.365816
8   -0.516445
9   -0.500554
dtype: float64

In [24]: df.a + df.b  # same as the previous expression
Out[24]:
0    0.197930
1    0.704779
2   -1.138875
3    2.270749
4   -0.006389
5   -0.797885
6   -1.848427
7   -1.365816
8   -0.516445
9   -0.500554
dtype: float64

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.

plyr

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.

R           Python
array       list
lists       dictionary or list of objects
data.frame  dataframe
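The mapping in the table can be made concrete; for example, an R list of named vectors corresponds naturally to a Python dictionary of lists, which pd.DataFrame accepts directly (the data here is made up for illustration):

```python
import pandas as pd

# R: list(a = c(1, 2), b = c(3, 4))  ->  Python dict of lists
data = {'a': [1, 2], 'b': [3, 4]}

# R: data.frame(...)  ->  pd.DataFrame(...)
df = pd.DataFrame(data)
print(df.shape)   # (2, 2)
```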

ddply

An expression using a data.frame called df in R where you want to summarize x by month:

require(plyr)
df <- data.frame(
  x = runif(120, 1, 168),
  y = runif(120, 7, 334),
  z = runif(120, 1.7, 20.7),
  month = rep(c(5,6,7,8), 30),
  week = sample(1:4, 120, TRUE)
)

ddply(df, .(month, week), summarize,
      mean = round(mean(x), 2),
      sd = round(sd(x), 2))

In pandas the equivalent expression, using the groupby() method, would be:

In [25]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 120),
   ....:                    'y': np.random.uniform(7., 334., 120),
   ....:                    'z': np.random.uniform(1.7, 20.7, 120),
   ....:                    'month': [5, 6, 7, 8] * 30,
   ....:                    'week': np.random.randint(1, 4, 120)})
   ....:

In [26]: grouped = df.groupby(['month', 'week'])

In [27]: grouped['x'].agg([np.mean, np.std])
Out[27]:
                  mean        std
month week
5     1      97.816995  42.303852
      2      84.334809  64.534765
      3      87.306702  44.926170
6     1     100.570938  46.104949
      2      66.712939  41.685774
      3      96.207214  43.942976
7     1      88.447249  49.705294
      2      93.931880  41.423754
      3      79.571586  44.195428
8     1     115.128923  47.347190
      2      71.987975  61.081189
      3     100.213814  51.661295

For more details and examples see the groupby documentation.

reshape / reshape2

melt.array

An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:

a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))

In Python, a here is a NumPy array, so you can iterate over its entries with numpy.ndenumerate in a list comprehension.

In [28]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)

In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
Out[29]:
    0  1  2     3
0   0  0  0   1.0
1   0  0  1   2.0
2   0  0  2   3.0
3   0  0  3   4.0
4   0  1  0   5.0
5   0  1  1   6.0
6   0  1  2   7.0
7   0  1  3   8.0
8   0  2  0   9.0
9   0  2  1  10.0
10  0  2  2  11.0
11  0  2  3  12.0
12  1  0  0  13.0
13  1  0  1  14.0
14  1  0  2  15.0
15  1  0  3  16.0
16  1  1  0  17.0
17  1  1  1  18.0
18  1  1  2  19.0
19  1  1  3  20.0
20  1  2  0  21.0
21  1  2  1  22.0
22  1  2  2  23.0
23  1  2  3   NaN

melt.list

An expression using a list called a in R where you want to melt it into a data.frame:

a <- as.list(c(1:4, NA))
data.frame(melt(a))

In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a dataframe as required.

In [30]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))

In [31]: pd.DataFrame(a)
Out[31]:
   0    1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  NaN

For more details and examples see the Intro to Data Structures documentation.

melt.data.frame

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

cheese <- data.frame(
  first = c('John', 'Mary'),
  last = c('Doe', 'Bo'),
  height = c(5.5, 6.0),
  weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))

In Python, the melt() method is the R equivalent:

In [32]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
   ....:                        'last': ['Doe', 'Bo'],
   ....:                        'height': [5.5, 6.0],
   ....:                        'weight': [130, 150]})
   ....:

In [33]: pd.melt(cheese, id_vars=['first', 'last'])
Out[33]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [34]: cheese.set_index(['first', 'last']).stack()  # alternative way
Out[34]:
first  last
John   Doe   height      5.5
             weight    130.0
Mary   Bo    height      6.0
             weight    150.0
dtype: float64
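The stack() alternative returns a Series with a MultiIndex; to get a frame shaped like the melt() output you can reset the index and rename the columns (the column names below are my choice, not pandas defaults):

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})

stacked = cheese.set_index(['first', 'last']).stack()

# back to a flat frame comparable to pd.melt's output
tidy = stacked.reset_index()
tidy.columns = ['first', 'last', 'variable', 'value']
print(tidy.shape)   # (4, 4)
```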

For more details and examples seethe reshaping documentation.

cast

In R, acast is an expression using a data.frame called df to cast into a higher dimensional array:

df <- data.frame(
  x = runif(12, 1, 168),
  y = runif(12, 7, 334),
  z = runif(12, 1.7, 20.7),
  month = rep(c(5,6,7), 4),
  week = rep(c(1,2), 6)
)

mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of pivot_table():

In [35]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 12),
   ....:                    'y': np.random.uniform(7., 334., 12),
   ....:                    'z': np.random.uniform(1.7, 20.7, 12),
   ....:                    'month': [5, 6, 7] * 4,
   ....:                    'week': [1, 2] * 6})
   ....:

In [36]: mdf = pd.melt(df, id_vars=['month', 'week'])

In [37]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
   ....:                columns=['month'], aggfunc=np.mean)
   ....:
Out[37]:
month                  5           6           7
variable week
x        1     52.316747  135.360079   85.058099
         2     53.321560   73.355744  115.364655
y        1    249.154194  166.180787   79.850345
         2    206.716451  238.948519  236.880722
z        1     10.998118   16.935719   15.090803
         2     13.049358   16.288492   10.394538

Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:

df <- data.frame(
  Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
             'Animal2', 'Animal3'),
  FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
  Amount = c(10, 7, 4, 2, 5, 6, 2)
)

dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))

Python can approach this in two different ways. Firstly, similar to above using pivot_table():

In [38]: df = pd.DataFrame({
   ....:     'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
   ....:                'Animal2', 'Animal3'],
   ....:     'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
   ....:     'Amount': [10, 7, 4, 2, 5, 6, 2],
   ....: })
   ....:

In [39]: df.pivot_table(values='Amount', index='Animal', columns='FeedType',
   ....:                aggfunc='sum')
   ....:
Out[39]:
FeedType     A     B
Animal
Animal1   10.0   5.0
Animal2    2.0  13.0
Animal3    6.0   NaN

The second approach is to use the groupby() method:

In [40]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()
Out[40]:
Animal   FeedType
Animal1  A           10
         B            5
Animal2  A            2
         B           13
Animal3  A            6
Name: Amount, dtype: int64

For more details and examples see the reshaping documentation or the groupby documentation.
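To reshape the groupby result above into the same wide layout as the pivot_table output, you can unstack() the FeedType level; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
               'Animal2', 'Animal3'],
    'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
    'Amount': [10, 7, 4, 2, 5, 6, 2],
})

# groupby + unstack gives the same table as the pivot_table call;
# the missing Animal3/B combination becomes NaN
wide = df.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()
print(wide.loc['Animal1', 'A'])
```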

factor

pandas has a data type for categorical data.

cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))

In pandas this is accomplished with pd.cut and astype("category"):

In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
Out[41]:
0    (0.995, 2.667]
1    (0.995, 2.667]
2    (2.667, 4.333]
3    (2.667, 4.333]
4      (4.333, 6.0]
5      (4.333, 6.0]
dtype: category
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]

In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
Out[42]:
0    1
1    2
2    3
3    2
4    2
5    3
dtype: category
Categories (3, int64): [1, 2, 3]

For more details and examples see the categorical introduction and the API documentation. There is also documentation regarding the differences to R's factor.
