pandas: Deal with Large Datasets#
Package Import#
import pandas as pd
import numpy as np
Dataset Import#
The dataset used in this notebook is from Kaggle - Pokemon.
data = pd.read_csv('data/Pokemon.csv')
data
Deal with large datasets#
Check memory usage#
data.info(memory_usage='deep')
Load specific columns#
We can load only the columns we need by using the usecols parameter of pd.read_csv().
small_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'])
small_data.head()
And it indeed saves memory:
small_data.info(memory_usage='deep')
If we know a column has only a few unique values, we can load it as the category dtype to save memory:
smaller_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'], dtype={'Type 1': 'category'})
smaller_data.info(memory_usage='deep')
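Most of the saving comes from the Type 1 column itself. As a rough sketch (reusing the file and column above), we can compare the per-column footprint of the object dtype against the category dtype directly:
type1_object = pd.read_csv('data/Pokemon.csv', usecols=['Type 1'])['Type 1']
type1_category = type1_object.astype('category')
print(type1_object.memory_usage(deep=True))    # bytes used when stored as object strings
print(type1_category.memory_usage(deep=True))  # bytes used when stored as category codes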
Load in chunks#
pd.read_csv() with chunksize returns an iterator rather than a single DataFrame. next() is used here to get the first chunk (a DataFrame with 100 rows) from the chunked CSV reader so we can inspect it.
chunks_df = pd.read_csv('data/Pokemon.csv', chunksize=100)
next(chunks_df) # first 100 rows
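Since each chunk is an ordinary DataFrame, we can also iterate over the reader and aggregate results chunk by chunk, without ever holding the whole file in memory. A minimal sketch (the Attack column is an assumption here, taken from the Kaggle version of this dataset; any numeric column works):
total_attack = 0
for chunk in pd.read_csv('data/Pokemon.csv', chunksize=100):
    total_attack += chunk['Attack'].sum()  # aggregate each 100-row chunk separately
total_attack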
We can save each chunk to a separate file for later use:
for i, df in enumerate(pd.read_csv('data/Pokemon.csv', chunksize=100)):
    df.to_csv(f'data/Pokemon_{i}.csv', index=False)
df0 = pd.read_csv('data/Pokemon_0.csv')
df0.head()
df1 = pd.read_csv('data/Pokemon_1.csv')
df1.head()
But how do we combine the chunks back into one DataFrame? See this article for more details.
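As a minimal sketch (assuming the chunk files written above), we can read each piece back and stack them with pd.concat:
import glob

# collect the chunk files; lexicographic order is fine for single-digit indices
paths = sorted(glob.glob('data/Pokemon_*.csv'))
combined = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)
combined.info(memory_usage='deep')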