pandas: Basic Pandas for Data Science#
Package Import#
import pandas as pd
import numpy as np
Dataset Import#
The dataset used in this notebook is from Kaggle - Pokemon.
data = pd.read_csv('data/Pokemon.csv')
Manually Create a DataFrame#
From a Dictionary
The column order follows the order in which the keys were inserted:
df = pd.DataFrame({'Column 1': [100,200], 'Column 2': [300,400]})
df
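If we want a different order, we can pass the column names explicitly. A minimal sketch using the same dictionary:
# explicitly choose the column order; a name missing from the dict would become a NaN column
pd.DataFrame({'Column 1': [100, 200], 'Column 2': [300, 400]}, columns=['Column 2', 'Column 1'])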
From a NumPy array of random values, with column names:
pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
From a dictionary including Series:
pd.DataFrame({'col1': [0,1,2,3], 'col2': pd.Series([2,3], index=[2,3])}, index=[0,1,2,3])
From numpy ndarray:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
df
From a numpy ndarray that has labeled columns:
d = np.array([(1,2,3), (4,5,6), (7,8,9)], dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
df = pd.DataFrame(data=d, columns=['c', 'a'])
df
From Series/DataFrame:
ser = pd.Series([1,2,3], index=['a','b','c'])
df = pd.DataFrame(data=ser, index=['c', 'a'], columns=['hehe'])
df
When constructing from an existing DataFrame, columns that exist in the original keep their values; any requested column that is not in the original is filled with NaN.
df1 = pd.DataFrame([1,2,3], index=['a','b','c'], columns=['x'])
df2 = pd.DataFrame(data=df1, index=['c', 'a'])
df3 = pd.DataFrame(data=df1, index=['c', 'a'], columns=['z'])
print(df2, '\n', df3)
Reverse Row/Column Order#
See also iloc here.
data.head()
Row#
data.iloc[::-1].head()
See also reset_index here.
data.iloc[::-1].reset_index().head()
data.iloc[::-1].reset_index(drop=True).head()
Column#
data.iloc[:, ::-1].head()
Select column by data type#
Check the column data types:
data.dtypes
Select all columns that are int or float:
data.select_dtypes(include='number').head()
Select multiple types by passing them in a list:
data.select_dtypes(include=['number', 'object']).head()
Or exclude types by using the exclude parameter:
data.select_dtypes(exclude=['number', 'object']).head()
Convert strings to numbers#
df = pd.DataFrame({'col1': ['1.1', '2.2', '3.3'], 'col2': ['4.4', '5.5', '6.6'], 'col3': ['7.7', '8.8', '-']})
df, df.dtypes
df.astype() can convert multiple columns at once. Use errors='ignore' to skip conversion errors.
df.astype({'col1': 'float', 'col2': 'float'}, errors='raise').dtypes
df.astype({'col1': 'float', 'col2': 'float', 'col3': 'float'}, errors='ignore').dtypes
A better way to convert strings to numbers is to use pd.to_numeric() with errors='coerce', which converts values that cannot be parsed into NaN.
pd.to_numeric(df.col3, errors='coerce')
pd.to_numeric(df.col3, errors='coerce').fillna(0)
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df, df.dtypes
Deal with large datasets#
Check memory usage#
data.info(memory_usage='deep')
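For a per-column breakdown, DataFrame.memory_usage() reports the bytes used by each column; a quick sketch:
# deep=True counts the actual strings stored in object columns, not just the pointers
data.memory_usage(deep=True)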
Load specific columns#
We can load only the columns we need by using the usecols parameter of pd.read_csv().
small_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'])
small_data.head()
And it indeed saves memory:
small_data.info(memory_usage='deep')
If we know a column has only a few unique values, we can load it as the category dtype to save memory:
smaller_data = pd.read_csv('data/Pokemon.csv', usecols=['Name', 'Type 1'], dtype={'Type 1': 'category'})
smaller_data.info(memory_usage='deep')
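If the data is already loaded, we can get the same saving afterwards by converting the column with astype('category'). A minimal sketch (cat_data is just an illustrative copy):
cat_data = small_data.copy()
cat_data['Type 1'] = cat_data['Type 1'].astype('category')  # convert an existing object column to category
cat_data.info(memory_usage='deep')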
Load in chunks#
next() is used here to get the first chunk (a DataFrame with 100 rows) from the chunked CSV reader, so we can call .head() on it.
chunks_df = pd.read_csv('data/Pokemon.csv', chunksize=100)
next(chunks_df) # first 100 rows
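Chunking is most useful when we aggregate as we go instead of holding everything in memory. A small sketch that counts the total rows chunk by chunk (total_rows is an illustrative name):
total_rows = 0
for chunk in pd.read_csv('data/Pokemon.csv', chunksize=100):
    total_rows += len(chunk)  # each chunk is a regular 100-row DataFrame
total_rows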
We can save each chunk to separate data files for later usage:
for i, df in enumerate(pd.read_csv('data/Pokemon.csv', chunksize=100)):
    df.to_csv(f'data/Pokemon_{i}.csv', index=False)
df0 = pd.read_csv('data/Pokemon_0.csv')
df0.head()
df1 = pd.read_csv('data/Pokemon_1.csv')
df1.head()
But how do we combine the data back? See below.
Build a DataFrame from multiple files (row-wise)#
Suppose we want to join Pokemon_0.csv and Pokemon_1.csv row-wise:
Use pd.concat and pass in the list of DataFrames we want to join:
df0 = pd.read_csv('data/Pokemon_0.csv')
df1 = pd.read_csv('data/Pokemon_1.csv')
df = pd.concat([df0, df1])
df.head()
But here we see some unexpected indices: they are not consecutive, because the index restarts at the first row of the second DataFrame:
df.iloc[95:105]
We can simply pass ignore_index=True to work around this:
df = pd.concat([df0, df1], ignore_index=True)
df.iloc[95:105]
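If there are many chunk files, we do not have to list them by hand; a sketch using glob (assuming the Pokemon_0.csv, Pokemon_1.csv, ... files written above):
from glob import glob

files = sorted(glob('data/Pokemon_[0-9]*.csv'))  # the numbered chunk files
df_all = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
len(df_all)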
Build a DataFrame from multiple files (column-wise)#
data.to_csv('data/Pokemon_first_6_cols.csv', columns=data.columns[:6], index=False)
data.to_csv('data/Pokemon_second_7_cols.csv', columns=data.columns[6:], index=False)
df0 = pd.read_csv('data/Pokemon_first_6_cols.csv')
df0.head()
df1 = pd.read_csv('data/Pokemon_second_7_cols.csv')
df1.head()
And combine them column-wise by using axis='columns':
df = pd.concat([df0, df1], axis='columns')
df.head()
Split a DataFrame into 2 random subsets#
We sample 75% of our DataFrame into data_1:
data = pd.read_csv('data/Pokemon.csv')
len(data)
data_1 = data.sample(frac=0.75, random_state=1234)
np.sort(data_1.index)
Get data_2 by simply dropping data_1.index:
data_2 = data.drop(data_1.index)
len(data_2), np.sort(data_2.index)
Do a quick check that the split adds up:
len(data_1) + len(data_2)
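We can also verify that no row ended up in both subsets; a quick sanity check:
len(data_1.index.intersection(data_2.index))  # should be 0: the two index sets are disjoint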
Handle missing values#
DataFrame.isna() returns a boolean object of the same shape:
data.isna().head()
Use isna().sum() to check the total number of missing values in each column:
data.isna().sum()
Use isna().mean() to check the proportion of missing values in each column:
We can choose to delete rows/columns that have missing values:
data.dropna(axis='index').head() # Drop rows
We can use thresh= to keep only the rows/columns that have at least thresh non-missing values:
data.dropna(axis='columns', thresh=len(data)*0.6).head()
data.dropna(axis='columns', thresh=len(data)*0.4).head()
Use DataFrame.fillna() to fill missing values:
data.fillna('hehehehe').head()
Or use DataFrame.interpolate() to fill missing values based on the surrounding values:
data.infer_objects(copy=False).interpolate().head()
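The effect is easier to see on a tiny numeric example; a minimal sketch:
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s.interpolate()  # linear interpolation fills the gaps with 2.0 and 3.0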
Split a string into multiple columns#
This is extremely useful when we have a Name column that we want to split into First, Middle, and Last:
df = pd.DataFrame(dict(name=['John Arthur Doe', 'Jane Ann Smith'], location=['Los Angeles, CA', 'Washington, DC']))
df
df[['first','middle','last']] = df.name.str.split(expand=True)  # with no delimiter specified, split on whitespace
df
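The same pattern works with an explicit delimiter; for example, a sketch that splits the location column into city and state on ', ':
df[['city', 'state']] = df.location.str.split(', ', expand=True)
df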
Reshape a MultiIndexed Series#
data.groupby(['Type 1', 'Legendary'])['#'].count()
What if we want to convert the above Series into a DataFrame, with Type 1 as rows, Legendary as columns, and the counts as values?
Use unstack() after groupby() and count():
data.groupby(['Type 1', 'Legendary'])['#'].count().unstack()
But there’s a better way to do this with pivot_table(), which can also handle aggregation, fill missing values, and add margins, things unstack() cannot do. See below.
Create a pivot table#
data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean')
The advantage of a pivot table over groupby + unstack is the margins parameter, which adds a row/column giving the totals (or other aggregate function) for each row/column:
data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean', margins=True)
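pivot_table() can also fill the combinations that have no data via the fill_value parameter; a small sketch:
# fill_value replaces the NaN cells (Type 1 / Legendary combinations with no Pokemon) with 0
data.pivot_table(index='Type 1', columns='Legendary', values='Attack', aggfunc='mean', fill_value=0)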
Reshape a DataFrame from wide format to long format#
What if we want ‘HP’ and ‘Attack’ to be the values of the column named ‘Stats’?
f = data[['Name', 'HP', 'Attack']].iloc[:5]
f
f.melt(id_vars='Name', var_name='Stats', value_name='Points')
Convert continuous data to categorical data#
What if we want Attack to be categorized (0-50: 'Weak', 50-100: 'Normal', 100-150: 'Strong', 150-200: 'nani?!')?
Use pd.cut(<column>, bins=<bins>, labels=<labels>) to convert continuous data to categorical data. Here, we convert Attack into 4 categories: 'Weak', 'Normal', 'Strong', 'nani?!'.
df = data.copy()
df['Attack'] = pd.cut(df['Attack'], bins=[0, 50, 100, 150, 200], labels=['Weak', 'Normal', 'Strong', 'nani?!'])
df
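A quick way to see how many Pokemon fall into each bucket:
df['Attack'].value_counts()  # count per category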
Profile a DataFrame#
from ydata_profiling import ProfileReport
report = ProfileReport(data)
report.to_notebook_iframe()