pandas: Create DataFrame

Package Import

import pandas as pd
import numpy as np

Dataset Import

The dataset used in this notebook is the Pokemon dataset from Kaggle.

data = pd.read_csv('data/Pokemon.csv')

data

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...
795	719	Diancie	Rock	Fairy	600	50	100	150	100	150	50	6	True
796	719	DiancieMega Diancie	Rock	Fairy	700	50	160	110	160	110	110	6	True
797	720	HoopaHoopa Confined	Psychic	Ghost	600	80	110	60	150	130	70	6	True
798	720	HoopaHoopa Unbound	Psychic	Dark	680	80	160	60	170	130	80	6	True
799	721	Volcanion	Fire	Water	600	80	110	120	130	90	70	6	True

800 rows × 13 columns

Manually Create a DataFrame

From a dictionary
The columns appear in the order in which the keys were inserted:

df = pd.DataFrame({'Column 1': [100,200], 'Column 2': [300,400]})
df

	Column 1	Column 2
0	100	300
1	200	400
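
The key-insertion ordering can be checked directly, and the columns argument can be used to request a different order. The following is a minimal sketch; the dictionary `scores` and its keys are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the key 'b' is inserted before 'a'
scores = {'b': [1, 2], 'a': [3, 4]}

# Default: columns follow the key insertion order
df_default = pd.DataFrame(scores)
print(list(df_default.columns))

# Passing columns= selects the keys in the requested order
df_reordered = pd.DataFrame(scores, columns=['a', 'b'])
print(list(df_reordered.columns))
```

Only the column order changes here; the values stay attached to their keys.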

From a NumPy array of random values, with column names:

pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

	a	b	c	d	e	f	g	h
0	0.155808	0.426032	0.828060	0.220790	0.990389	0.747130	0.048883	0.967204
1	0.459662	0.909124	0.942143	0.585818	0.165209	0.396012	0.585996	0.012020
2	0.451281	0.454347	0.902485	0.801921	0.007451	0.954774	0.510324	0.733164
3	0.027769	0.490229	0.210495	0.879985	0.370400	0.412179	0.689901	0.277350
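
Because the values are random, rerunning the cell gives different numbers each time. If reproducible output is wanted, one option is a seeded NumPy generator. This is a sketch only; the seed value 0 is an arbitrary choice and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same values on every run (seed 0 is arbitrary)
rng = np.random.default_rng(seed=0)

# 4 rows x 8 columns of floats drawn uniformly from [0, 1)
random_df = pd.DataFrame(rng.random((4, 8)), columns=list('abcdefgh'))
print(random_df.shape)
```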

From a dictionary including Series:

pd.DataFrame({'col1': [0,1,2,3], 'col2': pd.Series([2,3], index=[2,3])}, index=[0,1,2,3])

	col1	col2
0	0	NaN
1	1	NaN
2	2	2.0
3	3	3.0
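
The NaN values above come from index alignment: the Series only has labels 2 and 3, so rows 0 and 1 have nothing to align with. A small sketch that makes this visible (the variable name `partial` is only for illustration):

```python
import pandas as pd

# A Series that covers only index labels 2 and 3
partial = pd.Series([2, 3], index=[2, 3])

df = pd.DataFrame({'col1': [0, 1, 2, 3], 'col2': partial}, index=[0, 1, 2, 3])

# Rows without a matching label in the Series are missing
print(df['col2'].isna().tolist())

# The missing values force col2 to a float dtype, which is why 2 and 3 show as 2.0 and 3.0
print(df.dtypes)
```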

From a NumPy ndarray:

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df

	a	b	c
0	1	2	3
1	4	5	6
2	7	8	9

From a NumPy structured array (an ndarray with named fields):

d = np.array([(1,2,3), (4,5,6), (7,8,9)], dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
df = pd.DataFrame(data=d, columns=['c', 'a'])
df

	c	a
0	3	1
1	6	4
2	9	7
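
The column labels in this case come from the field names stored in the array's dtype, and the columns argument selects and orders a subset of them. A short sketch, using the hypothetical name `records` for the array:

```python
import numpy as np
import pandas as pd

records = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
                   dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])

# The field names live on the dtype
print(records.dtype.names)

# Without columns=, every field becomes a column
df_all = pd.DataFrame(records)
print(list(df_all.columns))

# With columns=, only the requested fields are kept, in the requested order
df_subset = pd.DataFrame(records, columns=['c', 'a'])
print(list(df_subset.columns))
```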

From a Series or DataFrame:

ser = pd.Series([1,2,3], index=['a','b','c'])
df = pd.DataFrame(data=ser, index=['c', 'a'], columns=['hehe'])
df

	hehe
c	3
a	1

When we construct from an existing DataFrame, the index and columns arguments select rows and columns from the original by label. A requested column that does not exist in the original is filled with NaN.

df1 = pd.DataFrame([1,2,3], index=['a','b','c'], columns=['x'])
df2 = pd.DataFrame(data=df1, index=['c', 'a'])
df3 = pd.DataFrame(data=df1, index=['c', 'a'], columns=['z'])
print(df2, df3, sep='\n')
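
The cell above shows no output, so here is a small self-contained sketch that checks both cases. The column name 'z' is deliberately one that does not exist in df1.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['x'])

# Only index is given: rows 'c' and 'a' are picked from df1, column 'x' is kept
df2 = pd.DataFrame(data=df1, index=['c', 'a'])
print(df2['x'].tolist())

# 'z' is not a column of df1, so the new column contains only NaN
df3 = pd.DataFrame(data=df1, index=['c', 'a'], columns=['z'])
print(df3['z'].isna().all())
```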
