pandas: Build DataFrame from Multiple Files (Row-wise)#

Package Import#

import pandas as pd
import numpy as np

Dataset Import#

The dataset used in this notebook is from Kaggle - Pokemon.

data = pd.read_csv('data/Pokemon.csv')
data

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 700 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 680 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

800 rows × 13 columns

Build a DataFrame from multiple files (row-wise)#

Suppose we want to join Pokemon_0.csv and Pokemon_1.csv row-wise:

  • Use pd.concat and pass in the list of dataframes we want to join

df0 = pd.read_csv('data/Pokemon_0.csv')
df1 = pd.read_csv('data/Pokemon_1.csv')
df = pd.concat([df0, df1])
df.head()

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 120 80 1 False
4 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False

But here, we’ll see some unexpected indices, as they are not consecutive (reset at the first row of 2nd dataframe):

df.iloc[95:105]

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
95 88 Grimer Poison NaN 325 80 80 50 40 50 25 1 False
96 89 Muk Poison NaN 500 105 105 75 65 100 50 1 False
97 90 Shellder Water NaN 305 30 65 100 45 25 40 1 False
98 91 Cloyster Water Ice 525 50 95 180 85 45 70 1 False
99 92 Gastly Ghost Poison 310 30 35 30 100 35 80 1 False
0 93 Haunter Ghost Poison 405 45 50 45 115 55 95 1 False
1 94 Gengar Ghost Poison 500 60 65 60 130 75 110 1 False
2 94 GengarMega Gengar Ghost Poison 600 60 65 80 170 95 130 1 False
3 95 Onix Rock Ground 385 35 45 160 30 45 70 1 False
4 96 Drowzee Psychic NaN 328 60 48 45 43 90 42 1 False

And, we just use ignore_index=True to workaround:

df = pd.concat([df0, df1], ignore_index=True)
df.iloc[95:105]

Hide code cell output

# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
95 88 Grimer Poison NaN 325 80 80 50 40 50 25 1 False
96 89 Muk Poison NaN 500 105 105 75 65 100 50 1 False
97 90 Shellder Water NaN 305 30 65 100 45 25 40 1 False
98 91 Cloyster Water Ice 525 50 95 180 85 45 70 1 False
99 92 Gastly Ghost Poison 310 30 35 30 100 35 80 1 False
100 93 Haunter Ghost Poison 405 45 50 45 115 55 95 1 False
101 94 Gengar Ghost Poison 500 60 65 60 130 75 110 1 False
102 94 GengarMega Gengar Ghost Poison 600 60 65 80 170 95 130 1 False
103 95 Onix Rock Ground 385 35 45 160 30 45 70 1 False
104 96 Drowzee Psychic NaN 328 60 48 45 43 90 42 1 False