pandas: Deal with Large Datasets#

Package Import#

import pandas as pd
import numpy as np

Dataset Import#

The dataset used in this notebook is the Pokemon dataset from Kaggle.

data = pd.read_csv('data/Pokemon.csv')
data

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
| 799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |

800 rows × 13 columns

Deal with large datasets#

Check memory usage#

data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    int64 
 11  Generation  800 non-null    int64 
 12  Legendary   800 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 179.0 KB
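
For a per-column breakdown, memory_usage(deep=True) reports the size of each column in bytes, which makes it easy to spot the expensive ones; a minimal sketch, continuing with the same data:

# Bytes per column, largest first. deep=True sizes object (string)
# columns by their actual contents, not just the pointer array.
data.memory_usage(deep=True).sort_values(ascending=False)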

Load specific columns#

We can load only the columns we need by using the usecols parameter of pd.read_csv().

small_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'])
small_data.head()

| | Name | Type 1 |
|---|---|---|
| 0 | Bulbasaur | Grass |
| 1 | Ivysaur | Grass |
| 2 | Venusaur | Grass |
| 3 | VenusaurMega Venusaur | Grass |
| 4 | Charmander | Fire |

And it indeed saves memory:

small_data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    800 non-null    object
 1   Type 1  800 non-null    object
dtypes: object(2)
memory usage: 87.8 KB
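
usecols also accepts a callable that is evaluated against each column name, which is handy when the wanted columns follow a pattern rather than a fixed list; a sketch (the 'Sp.' prefix filter is just an illustrative choice):

# Keep only the columns whose names start with 'Sp.' ('Sp. Atk', 'Sp. Def')
sp_data = pd.read_csv('data/Pokemon.csv', usecols=lambda col: col.startswith('Sp.'))
sp_data.head()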

If we know a column has only a few unique values, we can load it as the category dtype to save memory:

smaller_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'], dtype={'Type 1': 'category'})
smaller_data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Name    800 non-null    object  
 1   Type 1  800 non-null    category
dtypes: category(1), object(1)
memory usage: 47.7 KB
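
The same dtype trick can shrink numeric columns. The stat values in this dataset fit comfortably in int16, so paying for int64 is wasteful; this is a sketch under that assumption, and you should check the value range before downcasting your own data:

# int16 covers -32768..32767, more than enough for Pokemon stats
tiny_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1', 'HP'],
                        dtype={'Type 1': 'category', 'HP': 'int16'})
tiny_data.info(memory_usage='deep')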

Load in chunks#

Passing chunksize to pd.read_csv() returns a TextFileReader instead of a DataFrame: an iterator that yields DataFrames of at most chunksize rows. next() pulls the first chunk (here, 100 rows) from that reader so we can inspect it.

chunks_df = pd.read_csv('data/Pokemon.csv', chunksize=100)
next(chunks_df) # first 100 rows

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 88 | Grimer | Poison | NaN | 325 | 80 | 80 | 50 | 40 | 50 | 25 | 1 | False |
| 96 | 89 | Muk | Poison | NaN | 500 | 105 | 105 | 75 | 65 | 100 | 50 | 1 | False |
| 97 | 90 | Shellder | Water | NaN | 305 | 30 | 65 | 100 | 45 | 25 | 40 | 1 | False |
| 98 | 91 | Cloyster | Water | Ice | 525 | 50 | 95 | 180 | 85 | 45 | 70 | 1 | False |
| 99 | 92 | Gastly | Ghost | Poison | 310 | 30 | 35 | 30 | 100 | 35 | 80 | 1 | False |

100 rows × 13 columns
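
Because each chunk is an ordinary DataFrame, any aggregate that can be updated incrementally never needs the whole file in memory at once; a minimal sketch that sums one column chunk by chunk:

total_hp = 0
for chunk in pd.read_csv('data/Pokemon.csv', chunksize=100):
    total_hp += chunk['HP'].sum()  # accumulate the running sum per chunk
total_hp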

We can also save each chunk to a separate data file for later use:

for i, df in enumerate(pd.read_csv('data/Pokemon.csv', chunksize=100)):
    df.to_csv(f'data/Pokemon_{i}.csv', index=False)  # one file per 100-row chunk
df0 = pd.read_csv('data/Pokemon_0.csv')
df0.head()

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |

df1 = pd.read_csv('data/Pokemon_1.csv')
df1.head()

| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 93 | Haunter | Ghost | Poison | 405 | 45 | 50 | 45 | 115 | 55 | 95 | 1 | False |
| 1 | 94 | Gengar | Ghost | Poison | 500 | 60 | 65 | 60 | 130 | 75 | 110 | 1 | False |
| 2 | 94 | GengarMega Gengar | Ghost | Poison | 600 | 60 | 65 | 80 | 170 | 95 | 130 | 1 | False |
| 3 | 95 | Onix | Rock | Ground | 385 | 35 | 45 | 160 | 30 | 45 | 70 | 1 | False |
| 4 | 96 | Drowzee | Psychic | NaN | 328 | 60 | 48 | 45 | 43 | 90 | 42 | 1 | False |

But how do we combine the data back? See this article for more details.
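
A straightforward way is pd.concat over the saved files; a minimal sketch, assuming the chunk files written above are the only files matching the pattern:

import glob

files = sorted(glob.glob('data/Pokemon_*.csv'))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.shape

ignore_index=True rebuilds a clean RangeIndex instead of repeating each chunk's local 0-99 index.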