pandas: Basic Pandas for Data Science#
Package Import#
import pandas as pd
import numpy as np
Dataset Import#
The dataset used in this notebook is from Kaggle - Pokemon.
data = pd.read_csv('data/Pokemon.csv')
Manually Create a DataFrame#
From a Dictionary
The column order follows the order in which the keys were inserted:
df = pd.DataFrame({'Column 1': [100,200], 'Column 2': [300,400]})
df
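If we want a different order, we can pass the column names explicitly. A minimal sketch using the same dictionary:
# explicitly choose the column order; a name missing from the dict would become a NaN column
pd.DataFrame({'Column 1': [100, 200], 'Column 2': [300, 400]}, columns=['Column 2', 'Column 1'])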
From a NumPy array of random values, with column names:
pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
From a dictionary including Series:
pd.DataFrame({'col1': [0,1,2,3], 'col2': pd.Series([2,3], index=[2,3])}, index=[0,1,2,3])
From numpy ndarray:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
df
From a numpy ndarray that has labeled columns:
d = np.array([(1,2,3), (4,5,6), (7,8,9)], dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
df = pd.DataFrame(data=d, columns=['c', 'a'])
df
From Series/DataFrame:
ser = pd.Series([1,2,3], index=['a','b','c'])
df = pd.DataFrame(data=ser, index=['c', 'a'], columns=['hehe'])
df
When constructing from an existing DataFrame, columns that exist in the original keep their values; any requested column that is not in the original is filled with NaN.
df1 = pd.DataFrame([1,2,3], index=['a','b','c'], columns=['x'])
df2 = pd.DataFrame(data=df1, index=['c', 'a'])
df3 = pd.DataFrame(data=df1, index=['c', 'a'], columns=['z'])
print(df2, '\n', df3)
Reverse Row/Column Order#
See also iloc here.
data.head()
Row#
data.iloc[::-1].head()
See also reset_index here.
data.iloc[::-1].reset_index().head()
data.iloc[::-1].reset_index(drop=True).head()
Column#
data.iloc[:, ::-1].head()
Select column by data type#
Check the column data types:
data.dtypes
Select all columns that are int or float:
data.select_dtypes(include='number').head()
Select multiple types by passing them in a list:
data.select_dtypes(include=['number', 'object']).head()
Or exclude types by using the exclude parameter:
data.select_dtypes(exclude=['number', 'object']).head()
Convert strings to numbers#
df = pd.DataFrame({'col1': ['1.1', '2.2', '3.3'], 'col2': ['4.4', '5.5', '6.6'], 'col3': ['7.7', '8.8', '-']})
df, df.dtypes
df.astype() can convert multiple columns at once. Use errors='ignore' to skip conversion errors.
df.astype({'col1': 'float', 'col2': 'float'}, errors='raise').dtypes
df.astype({'col1': 'float', 'col2': 'float', 'col3': 'float'}, errors='ignore').dtypes
A better way to convert strings to numbers is to use pd.to_numeric() with errors='coerce', which converts values that cannot be parsed into NaN.
pd.to_numeric(df.col3, errors='coerce')
pd.to_numeric(df.col3, errors='coerce').fillna(0)
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df, df.dtypes
Deal with large datasets#
Check memory usage#
data.info(memory_usage='deep')
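For a per-column breakdown, DataFrame.memory_usage() reports the bytes used by each column; a quick sketch:
# deep=True counts the actual strings stored in object columns, not just the pointers
data.memory_usage(deep=True)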
Load specific columns#
We can load only the columns we need by using the usecols parameter of pd.read_csv().
small_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'])
small_data.head()
And it indeed saves memory:
small_data.info(memory_usage='deep')
If we know a column has only a few unique values, we can load it as the category dtype to save memory:
smaller_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'], dtype={'Type 1': 'category'})
smaller_data.info(memory_usage='deep')
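If the data is already loaded, we can get the same saving afterwards by converting the column with astype('category'). A minimal sketch (cat_data is just an illustrative copy):
cat_data = small_data.copy()
cat_data['Type 1'] = cat_data['Type 1'].astype('category')  # convert an existing object column to category
cat_data.info(memory_usage='deep')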
Load in chunks#
next() is used here to get the first chunk (a DataFrame with 100 rows) from the chunked CSV reader, so we can call .head() on it.
chunks_df = pd.read_csv('data/Pokemon.csv', chunksize=100)
next(chunks_df) # first 100 rows
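Chunking is most useful when we aggregate as we go instead of holding everything in memory. A small sketch that counts the total rows chunk by chunk (total_rows is an illustrative name):
total_rows = 0
for chunk in pd.read_csv('data/Pokemon.csv', chunksize=100):
    total_rows += len(chunk)  # each chunk is a regular 100-row DataFrame
total_rows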
We can save each chunk to separate data files for later usage:
for i, df in enumerate(pd.read_csv('data/Pokemon.csv', chunksize=100)):
    df.to_csv(f'data/Pokemon_{i}.csv', index=False)
df0 = pd.read_csv('data/Pokemon_0.csv')
df0.head()
df1 = pd.read_csv('data/Pokemon_1.csv')
df1.head()
But how do we combine the data back? See below.
Build a DataFrame from multiple files (row-wise)#
Suppose we want to join Pokemon_0.csv and Pokemon_1.csv row-wise:
Use pd.concat and pass in the list of DataFrames we want to join:
df0 = pd.read_csv('data/Pokemon_0.csv')
df1 = pd.read_csv('data/Pokemon_1.csv')
df = pd.concat([df0, df1])
df.head()
But here we see some unexpected indices: they are not consecutive, because the index restarts at the first row of the second DataFrame:
df.iloc[95:105]
We can simply pass ignore_index=True to work around this:
df = pd.concat([df0, df1], ignore_index=True)
df.iloc[95:105]
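If there are many chunk files, we do not have to list them by hand; a sketch using glob (assuming the Pokemon_0.csv, Pokemon_1.csv, ... files written above):
from glob import glob

files = sorted(glob('data/Pokemon_[0-9]*.csv'))  # the numbered chunk files
df_all = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
len(df_all)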
Build a DataFrame from multiple files (column-wise)#
data.to_csv('data/Pokemon_first_6_cols.csv', columns=data.columns[:6], index=False)
data.to_csv('data/Pokemon_second_7_cols.csv', columns=data.columns[6:], index=False)
df0 = pd.read_csv('data/Pokemon_first_6_cols.csv')
df0.head()
df1 = pd.read_csv('data/Pokemon_second_7_cols.csv')
df1.head()
And combine them column-wise by using axis='columns':
df = pd.concat([df0, df1], axis='columns')
df.head()
Split a DataFrame into 2 random subsets#
We sample 75% of our DataFrame into data_1:
data = pd.read_csv('data/Pokemon.csv')
len(data)
data_1 = data.sample(frac=0.75, random_state=1234)
np.sort(data_1.index)
Get data_2 by simply dropping data_1.index:
data_2 = data.drop(data_1.index)
len(data_2), np.sort(data_2.index)
Do a quick check that the split adds up:
len(data_1) + len(data_2)
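We can also verify that no row ended up in both subsets; a quick sanity check:
len(data_1.index.intersection(data_2.index))  # should be 0: the two index sets are disjoint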
Handle missing values#
DataFrame.isna() returns a boolean object of the same shape:
data.isna().head()
Use isna().sum() to check the total number of missing values in each column:
data.isna().sum()
Use isna().mean() to check the proportion of missing values in each column:
We can choose to delete rows/columns that have missing values:
data.dropna(axis='index').head() # Drop rows
We can use thresh= to keep only the rows/columns that have at least thresh non-missing values:
data.dropna(axis='columns', thresh=len(data)*0.6).head()
data.dropna(axis='columns', thresh=len(data)*0.4).head()
Use DataFrame.fillna() to fill missing values:
data.fillna('hehehehe').head()
Or use DataFrame.interpolate() to fill missing values based on the surrounding values:
data.infer_objects(copy=False).interpolate().head()
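The effect is easier to see on a tiny numeric example; a minimal sketch:
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s.interpolate()  # linear interpolation fills the gaps with 2.0 and 3.0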
Split a string into multiple columns#
This is extremely useful when we have a Name column that we want to split into First, Middle, and Last:
df = pd.DataFrame(dict(name=['John Arthur Doe', 'Jane Ann Smith'], location=['Los Angeles, CA', 'Washington, DC']))
df
df[['first','middle','last']] = df.name.str.split(expand=True)  # with no delimiter specified, split on whitespace
df
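The same pattern works with an explicit delimiter; for example, a sketch that splits the location column into city and state on ', ':
df[['city', 'state']] = df.location.str.split(', ', expand=True)
df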
Reshape a MultiIndexed Series#
data.groupby(['Type 1', 'Legendary'])['#'].count()
What if we want to convert the above Series into a DataFrame, with Type 1 as rows, Legendary as columns, and the counts as values?
Use unstack() after groupby() and count():
data.groupby(['Type 1', 'Legendary'])['#'].count().unstack()
But there’s a better way to do this with pivot_table(), which can also handle aggregation, fill missing values, and add margins, things unstack() cannot do. See below.
Create a pivot table#
data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean')
The advantage of a pivot table over groupby + unstack is the margins parameter, which adds a row/column giving the totals (or other aggregate function) for each row/column:
data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean', margins=True)
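pivot_table() can also fill the combinations that have no data via the fill_value parameter; a small sketch:
# fill_value replaces the NaN cells (Type 1 / Legendary combinations with no Pokemon) with 0
data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean', fill_value=0)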
Reshape a DataFrame from wide format to long format#
What if we want ‘HP’ and ‘Attack’ to be the values of the column named ‘Stats’?
f = data[['Name', 'HP', 'Attack']].iloc[:5]
f
f.melt(id_vars='Name', var_name='Stats', value_name='Points')
Convert continuous data to categorical data#
What if we want Attack to be categorized (0-50: 'Weak', 50-100: 'Normal', 100-150: 'Strong', 150-200: 'nani?!')?
Use pd.cut(<column>, bins=<bins>, labels=<labels>) to convert continuous data to categorical data. Here, we convert Attack into 4 categories: 'Weak', 'Normal', 'Strong', 'nani?!'.
df = data.copy()
df['Attack'] = pd.cut(df['Attack'], bins=[0, 50, 100, 150, 200], labels=['Weak', 'Normal', 'Strong', 'nani?!'])
df
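A quick way to see how many Pokemon fall into each bucket:
df['Attack'].value_counts()  # count per category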
Profile a DataFrame#
from ydata_profiling import ProfileReport
report = ProfileReport(data)
report.to_notebook_iframe()