pandas: Deal with Large Datasets#

Package Import#

import pandas as pd
import numpy as np

Dataset Import#

The dataset used in this notebook is the Pokemon dataset from Kaggle.

data = pd.read_csv('data/Pokemon.csv')
data

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
| 799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |

800 rows × 13 columns

Deal with large datasets#

Check memory usage#

data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 179.0 KB
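
For a per-column breakdown, memory_usage(deep=True) reports the size of each column in bytes, which makes it easy to spot the expensive ones; a minimal sketch, continuing with the same data:

# Bytes per column, largest first. deep=True sizes object (string)
# columns by their actual contents, not just the pointer array.
data.memory_usage(deep=True).sort_values(ascending=False)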

Load specific columns#

We can load only the columns we need by using the usecols parameter of pd.read_csv().

small_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'])
small_data.head()

| | Name | Type 1 |
|---|---|---|
| 0 | Bulbasaur | Grass |
| 1 | Ivysaur | Grass |
| 2 | Venusaur | Grass |
| 3 | VenusaurMega Venusaur | Grass |
| 4 | Charmander | Fire |

And it indeed saves memory:

small_data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    800 non-null    object
 1   Type 1  800 non-null    object
dtypes: object(2)
memory usage: 87.8 KB
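
usecols also accepts a callable that is evaluated against each column name, which is handy when the wanted columns follow a pattern rather than a fixed list; a sketch (the 'Sp.' prefix filter is just an illustrative choice):

# Keep only the columns whose names start with 'Sp.' ('Sp. Atk', 'Sp. Def')
sp_data = pd.read_csv('data/Pokemon.csv', usecols=lambda col: col.startswith('Sp.'))
sp_data.head()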

If we know a column has only a few unique values, we can load it as the category dtype to save memory:

smaller_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'], dtype={'Type 1': 'category'})
smaller_data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Name    800 non-null    object  
 1   Type 1  800 non-null    category
dtypes: category(1), object(1)
memory usage: 47.7 KB
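
The same dtype trick can shrink numeric columns. The stat values in this dataset fit comfortably in int16, so paying for int64 is wasteful; this is a sketch under that assumption, and you should check the value range before downcasting your own data:

# int16 covers -32768..32767, more than enough for Pokemon stats
tiny_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1', 'HP'],
                        dtype={'Type 1': 'category', 'HP': 'int16'})
tiny_data.info(memory_usage='deep')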

Load in chunks#

Passing chunksize to pd.read_csv() returns a TextFileReader instead of a DataFrame: an iterator that yields DataFrames of at most chunksize rows. next() pulls the first chunk (here, 100 rows) from that reader so we can inspect it.

chunks_df = pd.read_csv('data/Pokemon.csv', chunksize=100)
next(chunks_df) # first 100 rows

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 88 | Grimer | Poison | NaN | 325 | 80 | 80 | 50 | 40 | 50 | 25 | 1 | False |
| 96 | 89 | Muk | Poison | NaN | 500 | 105 | 105 | 75 | 65 | 100 | 50 | 1 | False |
| 97 | 90 | Shellder | Water | NaN | 305 | 30 | 65 | 100 | 45 | 25 | 40 | 1 | False |
| 98 | 91 | Cloyster | Water | Ice | 525 | 50 | 95 | 180 | 85 | 45 | 70 | 1 | False |
| 99 | 92 | Gastly | Ghost | Poison | 310 | 30 | 35 | 30 | 100 | 35 | 80 | 1 | False |

100 rows × 13 columns
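
Because each chunk is an ordinary DataFrame, any aggregate that can be updated incrementally never needs the whole file in memory at once; a minimal sketch that sums one column chunk by chunk:

total_hp = 0
for chunk in pd.read_csv('data/Pokemon.csv', chunksize=100):
    total_hp += chunk['HP'].sum()  # accumulate the running sum per chunk
total_hp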

We can also save each chunk to a separate data file for later use:

for i, df in enumerate(pd.read_csv('data/Pokemon.csv', chunksize=100)):
    df.to_csv(f'data/Pokemon_{i}.csv', index=False)  # one file per 100-row chunk
df0 = pd.read_csv('data/Pokemon_0.csv')
df0.head()

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |

df1 = pd.read_csv('data/Pokemon_1.csv')
df1.head()

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 93 | Haunter | Ghost | Poison | 405 | 45 | 50 | 45 | 115 | 55 | 95 | 1 | False |
| 1 | 94 | Gengar | Ghost | Poison | 500 | 60 | 65 | 60 | 130 | 75 | 110 | 1 | False |
| 2 | 94 | GengarMega Gengar | Ghost | Poison | 600 | 60 | 65 | 80 | 170 | 95 | 130 | 1 | False |
| 3 | 95 | Onix | Rock | Ground | 385 | 35 | 45 | 160 | 30 | 45 | 70 | 1 | False |
| 4 | 96 | Drowzee | Psychic | NaN | 328 | 60 | 48 | 45 | 43 | 90 | 42 | 1 | False |

But how do we combine the data back? See this article for more details.
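
A straightforward way is pd.concat over the saved files; a minimal sketch, assuming the chunk files written above are the only files matching the pattern:

import glob

files = sorted(glob.glob('data/Pokemon_*.csv'))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.shape

ignore_index=True rebuilds a clean RangeIndex instead of repeating each chunk's local 0-99 index.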