pandas: Basic Pandas for Data Science#

Package Import#

import pandas as pd
import numpy as np

Dataset Import#

The dataset used in this notebook is from Kaggle - Pokemon.

data = pd.read_csv('data/Pokemon.csv')

Manually Create a DataFrame#

From a dictionary:
The column order follows the order in which the keys were inserted:

df = pd.DataFrame({'Column 1': [100,200], 'Column 2': [300,400]})
df

Hide code cell output

Column 1 Column 2
0 100 300
1 200 400

From an ndarray of random values, with column names:

pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

Hide code cell output

a b c d e f g h
0 0.860898 0.424386 0.160110 0.202603 0.837460 0.356385 0.044539 0.339014
1 0.837671 0.601876 0.443424 0.088542 0.683828 0.725811 0.567350 0.979437
2 0.550181 0.050053 0.744824 0.208462 0.808865 0.437747 0.976386 0.227302
3 0.145377 0.023350 0.707948 0.594533 0.290170 0.690900 0.856202 0.618862

From a dictionary containing a Series (Series values align on the index; positions missing from the Series become NaN):

pd.DataFrame({'col1': [0,1,2,3], 'col2': pd.Series([2,3], index=[2,3])}, index=[0,1,2,3])

Hide code cell output

col1 col2
0 0 NaN
1 1 NaN
2 2 2.0
3 3 3.0

From a numpy ndarray:

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df

Hide code cell output

a b c
0 1 2 3
1 4 5 6
2 7 8 9

From a structured numpy ndarray with labeled fields (we can select a subset of fields, in any order):

d = np.array([(1,2,3), (4,5,6), (7,8,9)], dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
df = pd.DataFrame(data=d, columns=['c', 'a'])
df

Hide code cell output

c a
0 3 1
1 6 4
2 9 7

From Series/DataFrame:

ser = pd.Series([1,2,3], index=['a','b','c'])
df = pd.DataFrame(data=ser, index=['c', 'a'], columns=['hehe'])
df

Hide code cell output

hehe
c 3
a 1

When constructing from another DataFrame, the requested columns should be a subset of the original columns; any requested column not present in the original is filled with NaN.

df1 = pd.DataFrame([1,2,3], index=['a','b','c'], columns=['x'])
df2 = pd.DataFrame(data=df1, index=['c', 'a'])
df3 = pd.DataFrame(data=df1, index=['c', 'a'], columns=['z'])
print(df2, '\n',df3)

Hide code cell output

   x
c  3
a  1 
     z
c NaN
a NaN

Reverse Row/Column Order#

See also iloc.

data.head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

Row#

data.iloc[::-1].head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True

See also reset_index.

data.iloc[::-1].reset_index().head()

Hide code cell output

index # Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True
1 798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
2 797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
3 796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
4 795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
data.iloc[::-1].reset_index(drop=True).head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True
1 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
2 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
3 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
4 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True

Column#

data.iloc[:, ::-1].head()

Hide code cell output

Legendary Generation Speed Sp. Def Sp. Atk Defense Attack HP Total Type 2 Type 1 Name #
0 False 1 45 65 65 49 49 45 318 Poison Grass Bulbasaur 1
1 False 1 60 80 80 63 62 60 405 Poison Grass Ivysaur 2
2 False 1 80 100 100 83 82 80 525 Poison Grass Venusaur 3
3 False 1 80 120 122 123 100 80 625 Poison Grass VenusaurMega Venusaur 3
4 False 1 65 50 60 43 52 39 309 NaN Fire Charmander 4

Select column by data type#

Check columns data types:

data.dtypes

Hide code cell output

#              int64
Name          object
Type 1        object
Type 2        object
Total          int64
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

Select all columns that are int or float:

data.select_dtypes(include='number').head()

Hide code cell output

# Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
0 1 318 45 49 49 65 65 45 1
1 2 405 60 62 63 80 80 60 1
2 3 525 80 82 83 100 100 80 1
3 3 625 80 100 123 122 120 80 1
4 4 309 39 52 43 60 50 65 1

Select multiple types by passing as a list:

data.select_dtypes(include=['number', 'object']).head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1

Or exclude types by using exclude parameter:

data.select_dtypes(exclude=['number', 'object']).head()

Hide code cell output

Legendary
0 False
1 False
2 False
3 False
4 False

Convert strings to numbers#

df = pd.DataFrame({'col1': ['1.1', '2.2', '3.3'], 'col2': ['4.4', '5.5', '6.6'], 'col3': ['7.7', '8.8', '-']})
df, df.dtypes

Hide code cell output

(  col1 col2 col3
 0  1.1  4.4  7.7
 1  2.2  5.5  8.8
 2  3.3  6.6    -,
 col1    object
 col2    object
 col3    object
 dtype: object)

df.astype() can convert multiple columns at once. With errors='ignore', columns that fail to convert are returned unchanged instead of raising.

df.astype({'col1': 'float', 'col2': 'float'}, errors='raise').dtypes

Hide code cell output

col1    float64
col2    float64
col3     object
dtype: object
df.astype({'col1': 'float', 'col2': 'float', 'col3': 'float'}, errors='ignore').dtypes

Hide code cell output

col1    float64
col2    float64
col3     object
dtype: object

A better way to convert strings to numbers is pd.to_numeric() with errors='coerce', which turns values that fail to parse into NaN.

pd.to_numeric(df.col3, errors='coerce')

Hide code cell output

0    7.7
1    8.8
2    NaN
Name: col3, dtype: float64
pd.to_numeric(df.col3, errors='coerce').fillna(0)

Hide code cell output

0    7.7
1    8.8
2    0.0
Name: col3, dtype: float64
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df, df.dtypes

Hide code cell output

(   col1  col2  col3
 0   1.1   4.4   7.7
 1   2.2   5.5   8.8
 2   3.3   6.6   0.0,
 col1    float64
 col2    float64
 col3    float64
 dtype: object)

Deal with large datasets#

Check memory usage#

data.info(memory_usage='deep')

Hide code cell output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 179.0 KB
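DataFrame.info() reports one total; for a per-column breakdown use DataFrame.memory_usage(deep=True). A minimal sketch on a tiny inline frame standing in for the Pokemon data (hypothetical values):

```python
import pandas as pd

# A tiny frame standing in for the Pokemon data (hypothetical values)
df = pd.DataFrame({'Name': ['Bulbasaur', 'Ivysaur'], 'HP': [45, 60]})

# Per-column memory in bytes; deep=True measures the actual string payloads
usage = df.memory_usage(deep=True)
print(usage)
```

Object (string) columns typically dominate, which is what the column-selection and category tricks below exploit.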

Load specific columns#

We can load only the columns we need by using the usecols parameter of pd.read_csv().

small_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'])
small_data.head()

Hide code cell output

Name Type 1
0 Bulbasaur Grass
1 Ivysaur Grass
2 Venusaur Grass
3 VenusaurMega Venusaur Grass
4 Charmander Fire

And it indeed saves memory:

small_data.info(memory_usage='deep')

Hide code cell output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    800 non-null    object
 1   Type 1  800 non-null    object
dtypes: object(2)
memory usage: 87.8 KB

If we know a column has only a few unique values, we can load it as category type to save memory:

smaller_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'], dtype={'Type 1': 'category'})
smaller_data.info(memory_usage='deep')

Hide code cell output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Name    800 non-null    object  
 1   Type 1  800 non-null    category
dtypes: category(1), object(1)
memory usage: 47.7 KB
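The same conversion also works after loading, via astype('category'). A sketch with inline data (not the actual Pokemon file):

```python
import pandas as pd

# Repetitive string column: few unique values, many rows
df = pd.DataFrame({'Type 1': ['Grass', 'Fire', 'Water', 'Grass'] * 50})

before = df.memory_usage(deep=True)['Type 1']
df['Type 1'] = df['Type 1'].astype('category')
after = df.memory_usage(deep=True)['Type 1']

# category stores each unique string once, plus small integer codes
print(before, after)
```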

Load in chunks#

next() retrieves the first chunk (a DataFrame with 100 rows) from the chunked CSV reader so we can inspect it:

chunks_df = pd.read_csv('data/Pokemon.csv', chunksize=100)
next(chunks_df) # first 100 rows

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 88 Grimer Poison NaN 325 80 80 50 40 50 25 1 False
96 89 Muk Poison NaN 500 105 105 75 65 100 50 1 False
97 90 Shellder Water NaN 305 30 65 100 45 25 40 1 False
98 91 Cloyster Water Ice 525 50 95 180 85 45 70 1 False
99 92 Gastly Ghost Poison 310 30 35 30 100 35 80 1 False

100 rows × 13 columns
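Chunking also lets us aggregate a file too large for memory, by combining per-chunk results. A sketch using an in-memory CSV (io.StringIO) in place of a real file:

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file
csv = io.StringIO('x\n' + '\n'.join(str(i) for i in range(10)))

# Sum the column chunk by chunk, never holding all rows at once
total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk['x'].sum()

print(total)  # 0 + 1 + ... + 9 = 45
```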

We can save each chunk to a separate file for later use:

for i, df in enumerate(pd.read_csv('data/Pokemon.csv', chunksize=100)):
    df.to_csv(f'data/Pokemon_{i}.csv', index=False)
df0 = pd.read_csv('data/Pokemon_0.csv')
df0.head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
df1 = pd.read_csv('data/Pokemon_1.csv')
df1.head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 93 Haunter Ghost Poison 405 45 50 45 115 55 95 1 False
1 94 Gengar Ghost Poison 500 60 65 60 130 75 110 1 False
2 94 GengarMega Gengar Ghost Poison 600 60 65 80 170 95 130 1 False
3 95 Onix Rock Ground 385 35 45 160 30 45 70 1 False
4 96 Drowzee Psychic NaN 328 60 48 45 43 90 42 1 False

But how do we combine the data back? See below.

Build a DataFrame from multiple files (row-wise)#

Suppose we want to join Pokemon_0.csv and Pokemon_1.csv row-wise:

  • Use pd.concat and pass in the list of dataframes we want to join

df0 = pd.read_csv('data/Pokemon_0.csv')
df1 = pd.read_csv('data/Pokemon_1.csv')
df = pd.concat([df0, df1])
df.head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

But here the indices are unexpected: they are not consecutive, because the index restarts at 0 at the first row of the second dataframe:

df.iloc[95:105]

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
95 88 Grimer Poison NaN 325 80 80 50 40 50 25 1 False
96 89 Muk Poison NaN 500 105 105 75 65 100 50 1 False
97 90 Shellder Water NaN 305 30 65 100 45 25 40 1 False
98 91 Cloyster Water Ice 525 50 95 180 85 45 70 1 False
99 92 Gastly Ghost Poison 310 30 35 30 100 35 80 1 False
0 93 Haunter Ghost Poison 405 45 50 45 115 55 95 1 False
1 94 Gengar Ghost Poison 500 60 65 60 130 75 110 1 False
2 94 GengarMega Gengar Ghost Poison 600 60 65 80 170 95 130 1 False
3 95 Onix Rock Ground 385 35 45 160 30 45 70 1 False
4 96 Drowzee Psychic NaN 328 60 48 45 43 90 42 1 False

Passing ignore_index=True fixes this:

df = pd.concat([df0, df1], ignore_index=True)
df.iloc[95:105]

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
95 88 Grimer Poison NaN 325 80 80 50 40 50 25 1 False
96 89 Muk Poison NaN 500 105 105 75 65 100 50 1 False
97 90 Shellder Water NaN 305 30 65 100 45 25 40 1 False
98 91 Cloyster Water Ice 525 50 95 180 85 45 70 1 False
99 92 Gastly Ghost Poison 310 30 35 30 100 35 80 1 False
100 93 Haunter Ghost Poison 405 45 50 45 115 55 95 1 False
101 94 Gengar Ghost Poison 500 60 65 60 130 75 110 1 False
102 94 GengarMega Gengar Ghost Poison 600 60 65 80 170 95 130 1 False
103 95 Onix Rock Ground 385 35 45 160 30 45 70 1 False
104 96 Drowzee Psychic NaN 328 60 48 45 43 90 42 1 False
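Instead of discarding the origin, pd.concat can record which source each row came from via the keys parameter, which builds a MultiIndex. A small sketch:

```python
import pandas as pd

df0 = pd.DataFrame({'x': [1, 2]})
df1 = pd.DataFrame({'x': [3, 4]})

# keys= labels each block, so rows stay traceable to their source file
combined = pd.concat([df0, df1], keys=['file0', 'file1'])
print(combined)
```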

Build a DataFrame from multiple files (column-wise)#

data.to_csv('data/Pokemon_first_6_cols.csv', columns=data.columns[:6], index=False)
data.to_csv('data/Pokemon_second_7_cols.csv', columns=data.columns[6:], index=False)
df0 = pd.read_csv('data/Pokemon_first_6_cols.csv')
df0.head()

Hide code cell output

# Name Type 1 Type 2 Total HP
0 1 Bulbasaur Grass Poison 318 45
1 2 Ivysaur Grass Poison 405 60
2 3 Venusaur Grass Poison 525 80
3 3 VenusaurMega Venusaur Grass Poison 625 80
4 4 Charmander Fire NaN 309 39
df1 = pd.read_csv('data/Pokemon_second_7_cols.csv')
df1.head()

Hide code cell output

Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 49 49 65 65 45 1 False
1 62 63 80 80 60 1 False
2 82 83 100 100 80 1 False
3 100 123 122 120 80 1 False
4 52 43 60 50 65 1 False

And combine them column-wise using axis='columns':

df = pd.concat([df0, df1], axis='columns')
df.head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

Split a DataFrame into 2 random subsets#

We sample 75% of our dataframe into data_1:

data = pd.read_csv('data/Pokemon.csv')
len(data)

Hide code cell output

800
data_1 = data.sample(frac=0.75, random_state=1234)
np.sort(data_1.index)

Hide code cell output

array([  0,   2,   5,   6,   7,   8,   9,  11,  13,  16,  17,  19,  20,
        21,  22,  23,  24,  25,  27,  28,  29,  31,  33,  34,  35,  36,
        37,  38,  39,  40,  42,  43,  44,  47,  49,  50,  51,  52,  54,
        55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,
        69,  70,  71,  72,  73,  74,  77,  78,  79,  80,  83,  85,  88,
        89,  90,  91,  92,  93,  94,  95,  96,  97,  99, 100, 101, 102,
       104, 105, 106, 108, 109, 110, 111, 112, 113, 114, 115, 118, 121,
       122, 123, 124, 125, 126, 127, 128, 129, 131, 132, 133, 134, 137,
       138, 140, 141, 143, 144, 145, 146, 147, 148, 149, 151, 152, 153,
       155, 156, 157, 159, 160, 161, 162, 163, 166, 167, 168, 169, 170,
       171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 184,
       185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 198, 199, 200,
       201, 202, 203, 205, 206, 207, 213, 214, 215, 216, 217, 218, 219,
       220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232,
       234, 237, 238, 239, 240, 241, 242, 245, 247, 248, 249, 251, 252,
       253, 254, 255, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
       270, 271, 272, 273, 274, 276, 277, 278, 280, 281, 282, 285, 287,
       288, 289, 290, 296, 297, 298, 299, 300, 301, 302, 304, 305, 306,
       307, 308, 309, 310, 315, 319, 320, 321, 322, 323, 324, 326, 327,
       328, 329, 330, 331, 333, 335, 337, 338, 339, 341, 342, 344, 347,
       348, 349, 351, 352, 353, 355, 356, 357, 358, 359, 360, 361, 362,
       363, 364, 366, 367, 368, 369, 370, 373, 375, 376, 378, 380, 381,
       382, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395,
       397, 398, 399, 401, 402, 403, 405, 406, 407, 409, 410, 411, 412,
       413, 414, 415, 416, 418, 419, 420, 421, 422, 423, 424, 426, 427,
       428, 430, 432, 433, 435, 437, 438, 441, 443, 444, 445, 446, 449,
       450, 451, 452, 453, 454, 456, 457, 458, 461, 462, 463, 464, 466,
       467, 468, 470, 472, 473, 474, 475, 477, 478, 479, 480, 481, 483,
       484, 485, 486, 488, 489, 491, 492, 494, 495, 496, 498, 499, 500,
       501, 502, 503, 504, 505, 507, 508, 511, 512, 513, 515, 516, 517,
       518, 519, 521, 522, 523, 524, 525, 526, 529, 530, 532, 533, 534,
       535, 536, 537, 538, 539, 541, 542, 543, 544, 545, 548, 549, 552,
       554, 555, 556, 557, 558, 560, 563, 564, 565, 566, 567, 569, 570,
       571, 572, 575, 576, 578, 579, 581, 582, 583, 584, 585, 586, 587,
       588, 589, 590, 592, 594, 598, 599, 602, 603, 604, 605, 606, 607,
       609, 611, 612, 613, 614, 615, 616, 618, 620, 621, 622, 623, 624,
       627, 628, 629, 630, 631, 632, 634, 635, 636, 637, 639, 641, 642,
       643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 656, 657,
       658, 659, 660, 662, 665, 666, 667, 668, 669, 670, 672, 673, 674,
       675, 676, 677, 678, 679, 680, 681, 682, 684, 688, 690, 691, 692,
       693, 694, 696, 697, 698, 699, 700, 701, 703, 704, 705, 708, 709,
       710, 711, 712, 713, 714, 715, 716, 718, 719, 720, 721, 722, 724,
       725, 726, 727, 728, 729, 732, 733, 734, 735, 737, 738, 741, 742,
       744, 746, 747, 749, 750, 753, 754, 755, 758, 759, 760, 761, 762,
       764, 766, 767, 769, 770, 771, 772, 773, 774, 776, 777, 778, 780,
       781, 782, 783, 785, 786, 787, 788, 789, 791, 793, 794, 796, 797,
       798, 799])

Get data_2 by simply dropping data_1.index:

data_2 = data.drop(data_1.index)
len(data_2), np.sort(data_2.index)

Hide code cell output

(200,
 array([  1,   3,   4,  10,  12,  14,  15,  18,  26,  30,  32,  41,  45,
         46,  48,  53,  68,  75,  76,  81,  82,  84,  86,  87,  98, 103,
        107, 116, 117, 119, 120, 130, 135, 136, 139, 142, 150, 154, 158,
        164, 165, 183, 195, 196, 197, 204, 208, 209, 210, 211, 212, 233,
        235, 236, 243, 244, 246, 250, 256, 257, 258, 259, 275, 279, 283,
        284, 286, 291, 292, 293, 294, 295, 303, 311, 312, 313, 314, 316,
        317, 318, 325, 332, 334, 336, 340, 343, 345, 346, 350, 354, 365,
        371, 372, 374, 377, 379, 383, 396, 400, 404, 408, 417, 425, 429,
        431, 434, 436, 439, 440, 442, 447, 448, 455, 459, 460, 465, 469,
        471, 476, 482, 487, 490, 493, 497, 506, 509, 510, 514, 520, 527,
        528, 531, 540, 546, 547, 550, 551, 553, 559, 561, 562, 568, 573,
        574, 577, 580, 591, 593, 595, 596, 597, 600, 601, 608, 610, 617,
        619, 625, 626, 633, 638, 640, 654, 655, 661, 663, 664, 671, 683,
        685, 686, 687, 689, 695, 702, 706, 707, 717, 723, 730, 731, 736,
        739, 740, 743, 745, 748, 751, 752, 756, 757, 763, 765, 768, 775,
        779, 784, 790, 792, 795]))

A quick check that the split was successful:

len(data_1) + len(data_2)

Hide code cell output

800
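A stricter check than comparing lengths: the two index sets should be disjoint and together cover the original index. A sketch on a small inline frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.arange(100)})
part_1 = df.sample(frac=0.75, random_state=1234)
part_2 = df.drop(part_1.index)

# No row appears in both parts...
assert part_1.index.intersection(part_2.index).empty
# ...and every original row appears in exactly one part
assert part_1.index.union(part_2.index).equals(df.index)
print(len(part_1), len(part_2))
```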

Handle missing values#

DataFrame.isna() returns a Boolean object of the same shape:

data.isna().head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 False False False False False False False False False False False False False
1 False False False False False False False False False False False False False
2 False False False False False False False False False False False False False
3 False False False False False False False False False False False False False
4 False False False True False False False False False False False False False

isna().sum() to check total missing values for each column:

data.isna().sum()

Hide code cell output

#               0
Name            0
Type 1          0
Type 2        386
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

isna().mean() to check the proportion of missing values:

data.isna().mean()

Hide code cell output

#             0.0000
Name          0.0000
Type 1        0.0000
Type 2        0.4825
Total         0.0000
HP            0.0000
Attack        0.0000
Defense       0.0000
Sp. Atk       0.0000
Sp. Def       0.0000
Speed         0.0000
Generation    0.0000
Legendary     0.0000
dtype: float64

We can choose to delete rows/columns that have missing values:

data.dropna(axis='index').head() # Drop rows

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
6 6 Charizard Fire Flying 534 78 84 78 109 85 100 1 False
  • We can use thresh= to keep the rows/columns that have at least thresh non-missing values:

data.dropna(axis='columns', thresh=len(data)*0.6).head()

Hide code cell output

# Name Type 1 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire 309 39 52 43 60 50 65 1 False
data.dropna(axis='columns', thresh=len(data)*0.4).head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

DataFrame.fillna() to fill missing values:

data.fillna('hehehehe').head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire hehehehe 309 39 52 43 60 50 65 1 False
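fillna() also accepts a dict mapping column names to fill values, so each column gets an appropriate default. A sketch with inline data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Type 2': ['Poison', np.nan], 'HP': [45.0, np.nan]})

# Different fill value per column in a single call
filled = df.fillna({'Type 2': 'None', 'HP': 0})
print(filled)
```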

Or DataFrame.interpolate() to fill based on surrounding values:

data.infer_objects(copy=False).interpolate().head()

Hide code cell output

/tmp/ipykernel_11297/3876796089.py:1: FutureWarning: DataFrame.interpolate with object dtype is deprecated and will raise in a future version. Call obj.infer_objects(copy=False) before interpolating instead.
  data.infer_objects(copy=False).interpolate().head()
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

Split a string into multiple columns#

This is extremely useful when we have a full name column and want to split it into first, middle, and last names:

df = pd.DataFrame(dict(name=['John Arthur Doe', 'Jane Ann Smith'], location=['Los Angeles, CA', 'Washington, DC']))
df

Hide code cell output

name location
0 John Arthur Doe Los Angeles, CA
1 Jane Ann Smith Washington, DC
df[['first','middle','last']] = df.name.str.split(expand=True) # with no delimiter specified, split on whitespace
df

Hide code cell output

name location first middle last
0 John Arthur Doe Los Angeles, CA John Arthur Doe
1 Jane Ann Smith Washington, DC Jane Ann Smith
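The same pattern works with an explicit delimiter; splitting the location column on ', ' gives city and state:

```python
import pandas as pd

df = pd.DataFrame({'location': ['Los Angeles, CA', 'Washington, DC']})

# expand=True returns one column per split part
df[['city', 'state']] = df['location'].str.split(', ', expand=True)
print(df)
```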

Reshape a MultiIndexed Series#

data.groupby(['Type 1', 'Legendary'])['#'].count()

Hide code cell output

Type 1    Legendary
Bug       False         69
Dark      False         29
          True           2
Dragon    False         20
          True          12
Electric  False         40
          True           4
Fairy     False         16
          True           1
Fighting  False         27
Fire      False         47
          True           5
Flying    False          2
          True           2
Ghost     False         30
          True           2
Grass     False         67
          True           3
Ground    False         28
          True           4
Ice       False         22
          True           2
Normal    False         96
          True           2
Poison    False         28
Psychic   False         43
          True          14
Rock      False         40
          True           4
Steel     False         23
          True           4
Water     False        108
          True           4
Name: #, dtype: int64

What if we want to convert the above Series into a DataFrame, with Type 1 as rows, Legendary as columns, and the counts as values?

  • Use unstack() after groupby() and count():

data.groupby(['Type 1', 'Legendary'])['#'].count().unstack()

Hide code cell output

Legendary False True
Type 1
Bug 69.0 NaN
Dark 29.0 2.0
Dragon 20.0 12.0
Electric 40.0 4.0
Fairy 16.0 1.0
Fighting 27.0 NaN
Fire 47.0 5.0
Flying 2.0 2.0
Ghost 30.0 2.0
Grass 67.0 3.0
Ground 28.0 4.0
Ice 22.0 2.0
Normal 96.0 2.0
Poison 28.0 NaN
Psychic 43.0 14.0
Rock 40.0 4.0
Steel 23.0 4.0
Water 108.0 4.0

But there’s a better way: pivot_table(), which additionally supports choosing the aggregation function, filling missing values, and adding margins, none of which unstack() can do. See below.

Create a pivot table#

data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean')

Hide code cell output

Legendary False True
Type 1
Bug 70.971014 NaN
Dark 86.862069 110.500000
Dragon 103.400000 126.666667
Electric 66.125000 98.750000
Fairy 57.187500 131.000000
Fighting 96.777778 NaN
Fire 82.191489 109.000000
Flying 50.000000 107.500000
Ghost 71.366667 110.000000
Grass 72.119403 97.666667
Ground 88.000000 150.000000
Ice 73.227273 67.500000
Normal 72.083333 140.000000
Poison 74.678571 NaN
Psychic 54.953488 122.142857
Rock 89.925000 122.250000
Steel 92.086957 96.250000
Water 72.777778 111.250000
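The NaN cells above (types with no legendary Pokemon) can be filled in the same call with the fill_value parameter. A sketch on a tiny frame with the same shape as the real data:

```python
import pandas as pd

df = pd.DataFrame({
    'Type 1':    ['Bug', 'Dark', 'Dark'],
    'Legendary': [False, False, True],
    'Attack':    [70, 86, 110],
})

# Bug has no Legendary=True rows; fill_value=0 replaces the resulting NaN
pt = df.pivot_table(index='Type 1', columns='Legendary',
                    values='Attack', aggfunc='mean', fill_value=0)
print(pt)
```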

Another advantage of pivot_table over groupby + unstack is the margins parameter, which adds an All row and column giving the aggregate (here, the mean) across each row/column:

data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean', margins=True)

Hide code cell output

Legendary False True All
Type 1
Bug 70.971014 NaN 70.971014
Dark 86.862069 110.500000 88.387097
Dragon 103.400000 126.666667 112.125000
Electric 66.125000 98.750000 69.090909
Fairy 57.187500 131.000000 61.529412
Fighting 96.777778 NaN 96.777778
Fire 82.191489 109.000000 84.769231
Flying 50.000000 107.500000 78.750000
Ghost 71.366667 110.000000 73.781250
Grass 72.119403 97.666667 73.214286
Ground 88.000000 150.000000 95.750000
Ice 73.227273 67.500000 72.750000
Normal 72.083333 140.000000 73.469388
Poison 74.678571 NaN 74.678571
Psychic 54.953488 122.142857 71.456140
Rock 89.925000 122.250000 92.863636
Steel 92.086957 96.250000 92.703704
Water 72.777778 111.250000 74.151786
All 75.669388 116.676923 79.001250

Reshape a DataFrame from wide format to long format#

What if we want ‘HP’ and ‘Attack’ to be the values of the column named ‘Stats’?

f = data[['Name', 'HP', 'Attack']].iloc[:5]
f

Hide code cell output

Name HP Attack
0 Bulbasaur 45 49
1 Ivysaur 60 62
2 Venusaur 80 82
3 VenusaurMega Venusaur 80 100
4 Charmander 39 52
f.melt(id_vars='Name', var_name='Stats', value_name='Points')

Hide code cell output

Name Stats Points
0 Bulbasaur HP 45
1 Ivysaur HP 60
2 Venusaur HP 80
3 VenusaurMega Venusaur HP 80
4 Charmander HP 39
5 Bulbasaur Attack 49
6 Ivysaur Attack 62
7 Venusaur Attack 82
8 VenusaurMega Venusaur Attack 100
9 Charmander Attack 52
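melt() can be undone with DataFrame.pivot(), which spreads the long format back out to one column per stat. A sketch on the same shape of data:

```python
import pandas as pd

long = pd.DataFrame({
    'Name':   ['Bulbasaur', 'Ivysaur', 'Bulbasaur', 'Ivysaur'],
    'Stats':  ['HP', 'HP', 'Attack', 'Attack'],
    'Points': [45, 60, 49, 62],
})

# pivot is the inverse of melt: one row per Name, one column per stat
wide = long.pivot(index='Name', columns='Stats', values='Points')
print(wide)
```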

Convert continuous data to categorical data#

What if we want Attack to be categorized (< 50: ‘weak’, 50-100: ‘normal’, 100-150: ‘strong’, >150: ‘nani?!’)

Use pd.cut(<column>, <bins>, <labels>) to convert continuous data to categorical data. Here, we convert 'Attack' into 4 categories: 'Weak', 'Normal', 'Strong', 'nani?!'.

df = data.copy()
df['Attack'] = pd.cut(df['Attack'], bins=[0, 50, 100, 150, 200], labels=['Weak', 'Normal', 'Strong', 'nani?!'])
df

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 Weak 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 Normal 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 Normal 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 Normal 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 Normal 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 600 50 Normal 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 nani?! 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 Strong 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 nani?! 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 Strong 120 130 90 70 6 True

800 rows × 13 columns
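If we would rather have equal-sized groups than fixed cut points, pd.qcut() bins by quantiles instead of by value. A sketch on inline attack values:

```python
import pandas as pd

attack = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Four quantile bins: each label ends up with the same number of values
tiers = pd.qcut(attack, q=4, labels=['Weak', 'Normal', 'Strong', 'nani?!'])
print(tiers.value_counts())
```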

Profile a DataFrame#

from ydata_profiling import ProfileReport

report = ProfileReport(data)
report.to_notebook_iframe()

Hide code cell output


100%|██████████| 13/13 [00:00<00:00, 83.11it/s]