pandas: Create DataFrame

Package Import

import pandas as pd
import numpy as np

Dataset Import

The dataset used in this notebook is the Pokemon dataset from Kaggle.

data = pd.read_csv('data/Pokemon.csv')
data


       #                   Name   Type 1  Type 2  Total   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0      1              Bulbasaur    Grass  Poison    318   45      49       49       65       65     45           1      False
1      2                Ivysaur    Grass  Poison    405   60      62       63       80       80     60           1      False
2      3               Venusaur    Grass  Poison    525   80      82       83      100      100     80           1      False
3      3  VenusaurMega Venusaur    Grass  Poison    625   80     100      123      122      120     80           1      False
4      4             Charmander     Fire     NaN    309   39      52       43       60       50     65           1      False
...  ...                    ...      ...     ...    ...  ...     ...      ...      ...      ...    ...         ...        ...
795  719                Diancie     Rock   Fairy    600   50     100      150      100      150     50           6       True
796  719    DiancieMega Diancie     Rock   Fairy    700   50     160      110      160      110    110           6       True
797  720    HoopaHoopa Confined  Psychic   Ghost    600   80     110       60      150      130     70           6       True
798  720     HoopaHoopa Unbound  Psychic    Dark    680   80     160       60      170      130     80           6       True
799  721              Volcanion     Fire   Water    600   80     110      120      130       90     70           6       True

800 rows × 13 columns

Manually Create a DataFrame

From a dictionary (the column order follows the insertion order of the keys):

df = pd.DataFrame({'Column 1': [100,200], 'Column 2': [300,400]})
df


   Column 1  Column 2
0       100       300
1       200       400
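A small sketch of the ordering rule, using made-up keys 'b' and 'a' (not from the notebook). It also shows that the columns argument can select and reorder the keys:

```python
import pandas as pd

# Keys are inserted as 'b' then 'a', so the columns appear in that order.
df_order = pd.DataFrame({'b': [1, 2], 'a': [3, 4]})
print(list(df_order.columns))  # ['b', 'a']

# Passing columns= selects the dictionary keys in the requested order.
df_sorted = pd.DataFrame({'b': [1, 2], 'a': [3, 4]}, columns=['a', 'b'])
print(list(df_sorted.columns))  # ['a', 'b']
```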

From a NumPy array of random values, with column names:

pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))


          a         b         c         d         e         f         g         h
0  0.155808  0.426032  0.828060  0.220790  0.990389  0.747130  0.048883  0.967204
1  0.459662  0.909124  0.942143  0.585818  0.165209  0.396012  0.585996  0.012020
2  0.451281  0.454347  0.902485  0.801921  0.007451  0.954774  0.510324  0.733164
3  0.027769  0.490229  0.210495  0.879985  0.370400  0.412179  0.689901  0.277350
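The values above change on every run. If reproducible numbers are wanted, one option is a seeded NumPy Generator. This is a sketch only; the seed value 0 is an arbitrary assumption, not something the notebook uses:

```python
import numpy as np
import pandas as pd

# A seeded generator yields the same random values on every run.
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for illustration
df_rand = pd.DataFrame(rng.random((4, 8)), columns=list('abcdefgh'))
print(df_rand.shape)  # (4, 8)
```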

From a dictionary that includes a Series (the Series is aligned on the index, and labels it lacks become NaN):

pd.DataFrame({'col1': [0,1,2,3], 'col2': pd.Series([2,3], index=[2,3])}, index=[0,1,2,3])


   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0
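The NaN entries come from index alignment: the Series only has labels 2 and 3, so rows 0 and 1 have no value. A short sketch that checks this, and also shows that the integer Series is stored as float once NaN is introduced (the variable names s and df_aligned are illustrative):

```python
import pandas as pd

s = pd.Series([2, 3], index=[2, 3])
df_aligned = pd.DataFrame({'col1': [0, 1, 2, 3], 'col2': s}, index=[0, 1, 2, 3])

# Rows whose labels are missing from the Series are NaN.
print(df_aligned['col2'].isna().tolist())  # [True, True, False, False]

# NaN forces the integer values to be stored as floats.
print(df_aligned['col2'].dtype)  # float64
```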

From a NumPy ndarray:

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df


   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
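With a plain 2-D array, each inner row becomes a DataFrame row, and the number of column names must match the array width. A minimal sketch with a hypothetical 2x3 array and made-up row labels 'r1' and 'r2':

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6]])
df_arr = pd.DataFrame(arr, index=['r1', 'r2'], columns=['a', 'b', 'c'])
print(df_arr.loc['r2', 'b'])  # 5

# Two column names for a three-column array raises a ValueError.
mismatch_error = None
try:
    pd.DataFrame(arr, columns=['a', 'b'])
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)  # ValueError
```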

From a NumPy structured array (an ndarray with named fields), selecting and reordering fields with columns:

d = np.array([(1,2,3), (4,5,6), (7,8,9)], dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
df = pd.DataFrame(data=d, columns=['c', 'a'])
df


   c  a
0  3  1
1  6  4
2  9  7
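To make the role of the field names explicit, here is a sketch with a smaller two-row array. Without columns, every field becomes a column; with columns, only the named fields are kept, in the order given:

```python
import numpy as np
import pandas as pd

d = np.array([(1, 2, 3), (4, 5, 6)],
             dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'i4')])
print(d.dtype.names)  # ('a', 'b', 'c')

# Without columns=, every field becomes a column in field order.
df_all = pd.DataFrame(d)
print(list(df_all.columns))  # ['a', 'b', 'c']

# With columns=, only the named fields are kept, in the requested order.
df_sel = pd.DataFrame(d, columns=['c', 'a'])
print(df_sel.values.tolist())  # [[3, 1], [6, 4]]
```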

From a Series or DataFrame:

ser = pd.Series([1,2,3], index=['a','b','c'])
df = pd.DataFrame(data=ser, index=['c', 'a'], columns=['hehe'])
df


   hehe
c     3
a     1
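An alternative route to a similar result, shown here only as a sketch and not as the notebook's method, is Series.to_frame followed by label selection. The column name 'value' is a placeholder chosen for this example:

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Convert the Series to a one-column DataFrame, then pick rows by label.
df_frame = ser.to_frame(name='value').loc[['c', 'a']]
print(df_frame['value'].tolist())  # [3, 1]
```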

When constructing from an existing DataFrame, the requested index and columns are selected from the original. Any requested column that does not exist in the original is filled with NaN.

df1 = pd.DataFrame([1,2,3], index=['a','b','c'], columns=['x'])
df2 = pd.DataFrame(data=df1, index=['c', 'a'])
df3 = pd.DataFrame(data=df1, index=['c', 'a'], columns=['z'])
print(df2, df3, sep='\n')

   x
c  3
a  1
    z
c NaN
a NaN
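For this simple case, the same selection can be sketched with DataFrame.reindex, which makes the "select existing labels, fill missing ones with NaN" behavior explicit. The names same_cols and new_col are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['x'])

# Existing labels are selected in the requested order.
same_cols = df1.reindex(index=['c', 'a'])
print(same_cols['x'].tolist())  # [3, 1]

# A column that does not exist in df1 is created and filled with NaN.
new_col = df1.reindex(index=['c', 'a'], columns=['z'])
print(new_col['z'].isna().all())  # True
```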