3.5 Comparison with other tools
3.5.1 Comparison with R / R libraries
Transforming

R                                            pandas
select(df, col_one = col1)                   df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1)                   df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b)                            df.assign(c=df.a-df.b)
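The pandas column of this table can be run as-is; the short sketch below (with an invented frame whose columns col1, a and b exist only for illustration) shows each call next to the R expression it mirrors.

# Illustrative frame; the column names are not from a real dataset.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'a': [4, 5, 6], 'b': [1, 1, 2]})

df.rename(columns={'col1': 'col_one'})['col_one']   # select(df, col_one = col1)
df.rename(columns={'col1': 'col_one'})              # rename(df, col_one = col1)
df.assign(c=df.a - df.b)                            # mutate(df, c = a - b)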
Grouping and summarizing

R                                            pandas
summary(df)                                  df.describe()
gdf <- group_by(df, col1)                    gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE))   df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1))              df.groupby('col1').sum()
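The same mappings in runnable form, as a sketch on an invented two-column frame; it deviates slightly from the table in that col1 is used purely as the grouping key and a separate numeric column col2 is aggregated.

# Toy frame; col1 is the grouping key, col2 carries the values (both invented).
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1.0, 2.0, np.nan]})

df.describe()                 # summary(df)
gdf = df.groupby('col1')      # gdf <- group_by(df, col1)
gdf.agg({'col2': 'mean'})     # summarise(gdf, avg=mean(col2, na.rm=TRUE))
gdf.sum()                     # summarise(gdf, total=sum(col2))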
Base R
Slicing with R's c
R makes it easy to access data.frame columns by name
df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]
or by integer location
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
Selecting multiple columns by name in pandas is straightforward
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

In [2]: df[['a', 'c']]
Out[2]:
          a         c
0 -0.675786 -0.705430
1 -1.312106 -0.458568
2  0.760743  0.794540
3 -1.993663 -0.638385
4  0.174879  0.456440
5  0.021686 -1.546643
6  1.189064  0.034015
7 -0.212239  1.375358
8  1.488271 -0.044956
9 -0.209352 -0.531272
In [3]: df.loc[:, ['a', 'c']]
Out[3]:
          a         c
0 -0.675786 -0.705430
1 -1.312106 -0.458568
2  0.760743  0.794540
3 -1.993663 -0.638385
4  0.174879  0.456440
5  0.021686 -1.546643
6  1.189064  0.034015
7 -0.212239  1.375358
8  1.488271 -0.044956
9 -0.209352 -0.531272
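A third, equivalent way to select the same columns (not shown above) is DataFrame.filter(), which selects by column label:

# Label-based selection with filter(); uses the same df as the example above.
df.filter(items=['a', 'c'])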
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.
In [4]: named = list('abcdefg')

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]:
            a         b         c  ...        27        28        29
0    0.845186  1.477387 -1.101084  ...  0.043941 -0.180218 -0.234422
1   -0.091082 -0.241438  0.888987  ... -1.782282 -1.060319  0.926426
2   -0.120281 -1.355382  1.316443  ...  1.062796  0.120178 -2.567354
3    0.363030  0.337972  0.562713  ...  1.733714  0.187126  1.714257
4   -0.924266 -0.529364  0.900472  ...  1.539780  1.141356 -0.671260
..        ...       ...       ...  ...       ...       ...       ...
25   0.750372 -0.441270  0.692703  ... -0.201451  2.674895 -1.269927
26   1.583249  0.635870 -0.219556  ...  0.802101 -0.135218  2.809340
27   0.704517  0.054065  0.364239  ... -0.310688  0.834693 -0.778363
28   0.704446  0.521850 -0.069554  ... -1.301276 -1.038078 -1.122105
29   0.516201 -0.687797 -0.261853  ... -0.991331  0.082219  0.071864

[30 rows x 16 columns]
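np.r_ simply concatenates the slice notation into a single flat array of integer positions, which is exactly what iloc consumes; a quick way to see this:

# np.r_ translates the slices into one array of column positions.
import numpy as np
np.r_[:10, 24:30]
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 24, 25, 26, 27, 28, 29])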
aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:
df <- data.frame(
  v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
  by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))

aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.
In [9]: df = pd.DataFrame(
   ...:     {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
   ...:      'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
   ...:      'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
   ...:      'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
   ...:              np.nan]})
   ...:

In [10]: g = df.groupby(['by1', 'by2'])

In [11]: g[['v1', 'v2']].mean()
Out[11]:
            v1    v2
by1  by2
1    95    5.0  55.0
     99    5.0  55.0
2    95    7.0  77.0
     99    NaN   NaN
big  damp  3.0  33.0
blue dry   3.0  33.0
red  red   4.0  44.0
     wet   1.0  11.0
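If you want R-style named result columns, a sketch using pandas named aggregation (assuming pandas 0.25 or later) would be:

# Named aggregation, mirroring summarise(gdf, avg=..., total=...); assumes pandas >= 0.25.
g.agg(avg_v1=('v1', 'mean'), total_v1=('v1', 'sum'))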
For more details and examples see the groupby documentation.
match/%in%
A common way to select data in R is using %in%, which is defined using the function match. The operator %in% is used to return a logical vector indicating if there is a match or not:
s <- 0:4
s %in% c(2,4)
The isin() method is similar to the R %in% operator:
In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])
Out[13]:
0    False
1    False
2     True
3    False
4     True
dtype: bool
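The negation, R's !(s %in% c(2, 4)), is spelled with the ~ operator in pandas:

# Boolean negation of isin(), analogous to !(s %in% c(2, 4)) in R.
~s.isin([2, 4])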
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
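pandas has no direct match function, but a rough analogue can be sketched with Index.get_indexer(), which returns -1 where R would return NA; this is an illustrative equivalent using the Series s defined above, not an API mentioned in the original text, and the positions are 0-based.

# Positions of each element of s in the lookup values [2, 4]; -1 stands in for R's NA.
pd.Index([2, 4]).get_indexer(s)
# array([-1, -1,  0, -1,  1])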
For more details and examples see the reshaping documentation.
tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular.
Using a data.frame called baseball, and retrieving information based on the array team:
baseball <-
    data.frame(team = gl(5, 5,
                         labels = paste("Team", LETTERS[1:5])),
               player = sample(letters, 25),
               batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team,
       max)
In pandas we may use the pivot_table() method to handle this:
In [14]: import random

In [15]: import string

In [16]: baseball = pd.DataFrame(
   ....:     {'team': ["team %d" % (x + 1) for x in range(5)] * 5,
   ....:      'player': random.sample(list(string.ascii_lowercase), 25),
   ....:      'batting avg': np.random.uniform(.200, .400, 25)})
   ....:
In [17]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
Out[17]:
team           team 1    team 2    team 3    team 4   team 5
batting avg  0.376302  0.382559  0.364652  0.358454  0.39535

For more details and examples see the reshaping documentation.
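Should you prefer a plain groupby to pivot_table() for this, the per-team maxima can also be computed directly (a sketch using the same baseball frame; the result is a Series rather than a one-row frame):

# Per-team maximum via groupby, analogous to tapply(..., max).
baseball.groupby('team')['batting avg'].max()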
subset
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column's values are less than another column's values:
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,] # note the comma
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice, as well as standard boolean indexing:
In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [19]: df.query('a <= b')
Out[19]:
          a         b
5 -0.963807 -0.875940
8 -1.318729  0.725287
9  1.144169  1.691462

In [20]: df[df.a <= df.b]
Out[20]:
          a         b
5 -0.963807 -0.875940
8 -1.318729  0.725287
9  1.144169  1.691462

In [21]: df.loc[df.a <= df.b]
Out[21]:
          a         b
5 -0.963807 -0.875940
8 -1.318729  0.725287
9  1.144169  1.691462
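query() also accepts compound conditions, which avoids the parentheses-heavy boolean form; a small sketch, where the second condition b > 0 is invented for illustration:

# Compound condition: query string vs. the equivalent boolean indexing.
df.query('a <= b and b > 0')
df[(df.a <= df.b) & (df.b > 0)]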
For more details and examples see the query documentation.
with
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
In pandas the equivalent expression, using the eval() method, would be:
In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [23]: df.eval('a + b')
Out[23]:
0    0.197930
1    0.704779
2   -1.138875
3    2.270749
4   -0.006389
5   -0.797885
6   -1.848427
7   -1.365816
8   -0.516445
9   -0.500554
dtype: float64

In [24]: df.a + df.b  # same as the previous expression
Out[24]:
0    0.197930
1    0.704779
2   -1.138875
3    2.270749
4   -0.006389
5   -0.797885
6   -1.848427
7   -1.365816
8   -0.516445
9   -0.500554
dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
plyr
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data structures could be mapped in Python.
R            Python
array        list
lists        dictionary or list of objects
data.frame   dataframe
ddply
An expression using a data.frame called df in R where you want to summarize x by month:
require(plyr)
df <- data.frame(
  x = runif(120, 1, 168),
  y = runif(120, 7, 334),
  z = runif(120, 1.7, 20.7),
  month = rep(c(5,6,7,8), 30),
  week = sample(1:4, 120, TRUE)
)

ddply(df, .(month, week), summarize,
      mean = round(mean(x), 2),
      sd = round(sd(x), 2))
In pandas the equivalent expression, using the groupby() method, would be:
In [25]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 120),
   ....:                    'y': np.random.uniform(7., 334., 120),
   ....:                    'z': np.random.uniform(1.7, 20.7, 120),
   ....:                    'month': [5, 6, 7, 8] * 30,
   ....:                    'week': np.random.randint(1, 4, 120)})
   ....:

In [26]: grouped = df.groupby(['month', 'week'])

In [27]: grouped['x'].agg([np.mean, np.std])
Out[27]:
                   mean        std
month week
5     1       97.816995  42.303852
      2       84.334809  64.534765
      3       87.306702  44.926170
6     1      100.570938  46.104949
      2       66.712939  41.685774
      3       96.207214  43.942976
7     1       88.447249  49.705294
      2       93.931880  41.423754
      3       79.571586  44.195428
8     1      115.128923  47.347190
      2       71.987975  61.081189
      3      100.213814  51.661295
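To match the round(..., 2) in the ddply call exactly, the same aggregation can be rounded (a small sketch on the grouped object defined above):

# Round the aggregated statistics to two decimals, as the ddply call does.
grouped['x'].agg([np.mean, np.std]).round(2)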
For more details and examples see the groupby documentation.
reshape / reshape2

melt.array
An expression using a 3-dimensional array called a in R where you want to melt it into a data.frame:
a <- array(c(1:23, NA), c(2,3,4))
data.frame(melt(a))
In Python, since a here is a NumPy array, a list comprehension over np.ndenumerate produces the equivalent melted DataFrame.
In [28]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)
In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
Out[29]:
    0  1  2     3
0   0  0  0   1.0
1   0  0  1   2.0
2   0  0  2   3.0
3   0  0  3   4.0
4   0  1  0   5.0
5   0  1  1   6.0
6   0  1  2   7.0
7   0  1  3   8.0
8   0  2  0   9.0
9   0  2  1  10.0
10  0  2  2  11.0
11  0  2  3  12.0
12  1  0  0  13.0
13  1  0  1  14.0
14  1  0  2  15.0
15  1  0  3  16.0
16  1  1  0  17.0
17  1  1  1  18.0
18  1  1  2  19.0
19  1  1  3  20.0
20  1  2  0  21.0
21  1  2  1  22.0
22  1  2  2  23.0
23  1  2  3   NaN
melt.list
An expression using a list called a in R where you want to melt it into a data.frame:
a <- as.list(c(1:4, NA))
data.frame(melt(a))
In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as required.
In [30]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))
In [31]: pd.DataFrame(a)
Out[31]:
   0    1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  NaN
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
cheese <- data.frame(
  first = c('John', 'Mary'),
  last = c('Doe', 'Bo'),
  height = c(5.5, 6.0),
  weight = c(130, 150)
)
melt(cheese, id=c("first", "last"))
In Python, the melt() function is the equivalent:
In [32]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
....: 'last': ['Doe', 'Bo'],
....: 'height': [5.5, 6.0],
....: 'weight': [130, 150]})
....:
In [33]: pd.melt(cheese, id_vars=['first', 'last'])
Out[33]:
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0
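The same reshape is also available as a DataFrame method (assuming pandas 0.20 or later), which some find reads closer to the reshape2 style:

# Method form of melt(); equivalent to the pd.melt() call above. Assumes pandas >= 0.20.
cheese.melt(id_vars=['first', 'last'])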
In [34]: cheese.set_index(['first', 'last']).stack() # alternative way
Out[34]:
first  last
John   Doe   height      5.5
             weight    130.0
Mary   Bo    height      6.0
             weight    150.0
dtype: float64
For more details and examples see the reshaping documentation.
cast
In R, acast is an expression using a data.frame called df to cast it into a higher-dimensional array:
df <- data.frame(
  x = runif(12, 1, 168),
  y = runif(12, 7, 334),
  z = runif(12, 1.7, 20.7),
  month = rep(c(5,6,7), 4),
  week = rep(c(1,2), 6)
)

mdf <- melt(df, id=c("month", "week"))
acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of pivot_table():
In [35]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 12),
   ....:                    'y': np.random.uniform(7., 334., 12),
   ....:                    'z': np.random.uniform(1.7, 20.7, 12),
   ....:                    'month': [5, 6, 7] * 4,
   ....:                    'week': [1, 2] * 6})
   ....:

In [36]: mdf = pd.melt(df, id_vars=['month', 'week'])

In [37]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
   ....:                columns=['month'], aggfunc=np.mean)
   ....:
Out[37]:
month                  5           6           7
variable week
x        1     52.316747  135.360079   85.058099
         2     53.321560   73.355744  115.364655
y        1    249.154194  166.180787   79.850345
         2    206.716451  238.948519  236.880722
z        1     10.998118   16.935719   15.090803
         2     13.049358   16.288492   10.394538
Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:
df <- data.frame(
  Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
             'Animal2', 'Animal3'),
  FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
  Amount = c(10, 7, 4, 2, 5, 6, 2)
)
dcast(df, Animal ~ FeedType, sum, fill=NaN)
# Alternative method using base R
with(df, tapply(Amount, list(Animal, FeedType), sum))
Python can approach this in two different ways. Firstly, similar to above using pivot_table():
In [38]: df = pd.DataFrame({
   ....:     'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
   ....:                'Animal2', 'Animal3'],
   ....:     'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
   ....:     'Amount': [10, 7, 4, 2, 5, 6, 2],
   ....: })
   ....:

In [39]: df.pivot_table(values='Amount', index='Animal', columns='FeedType',
   ....:                aggfunc='sum')
   ....:
Out[39]:
FeedType     A     B
Animal
Animal1   10.0   5.0
Animal2    2.0  13.0
Animal3    6.0   NaN
The second approach is to use the groupby() method:
In [40]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()
Out[40]:
Animal   FeedType
Animal1  A           10
         B            5
Animal2  A            2
         B           13
Animal3  A            6
Name: Amount, dtype: int64
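If you want the groupby result in the same wide Animal by FeedType layout that dcast produces, unstacking the inner index level gets you there (a sketch on the same frame):

# Pivot the FeedType level into columns, matching the dcast / pivot_table shape.
df.groupby(['Animal', 'FeedType'])['Amount'].sum().unstack()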
For more details and examples see the reshaping documentation or the groupby documentation.
factor
pandas has a data type for categorical data.
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
In pandas this is accomplished with pd.cut and astype("category"):
In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
Out[41]:
0 (0.995, 2.667]
1 (0.995, 2.667]
2 (2.667, 4.333]
3 (2.667, 4.333]
4 (4.333, 6.0]
5 (4.333, 6.0]
dtype: category
Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]
In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
Out[42]:
0 1
1 2
2 3
3 2
4 2
5 3
dtype: category
Categories (3, int64): [1, 2, 3]
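R's factor levels and labels map onto explicit categories in pandas; a hedged sketch with labelled bins and an ordered categorical, where the label names are invented for illustration:

# Labelled bins, like cut(..., labels=...) in R; label names are illustrative.
pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3, labels=["small", "medium", "large"])

# An ordered categorical, comparable to factor(..., ordered=TRUE) in R.
pd.Categorical([1, 2, 3, 2, 2, 3], categories=[1, 2, 3], ordered=True)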
For more details and examples see the categorical introduction and the API documentation. There is also documentation on the differences to R's factor.