pandas: Pandas Basics Exploration#
Loading data into Pandas DataFrame#
The dataset is from Kaggle - Pokemon.
import pandas as pd
data = pd.read_csv('data/Pokemon.csv')
data.shape, data.ndim
Get headers#
data.columns
First and last rows#
data.head(1), data.tail(1)
High level description of the data#
data.describe()
Get specific columns#
multiple_columns = data[['Name', 'HP', 'Legendary']]
print(multiple_columns.head())
Get specific rows#
# Get the first row, which contains the feature names
middle_row = data.iloc[0]
print(middle_row)
Get specific coordinates item in the DataFrame#
# Get the name of the first 3 rows
item = data.iloc[[0, 1, 2], 1]
print(item)
Remove a column#
Here, if you do not use the
columns
parameter, you need to specify the axis (0 for rows and 1 for columns).
# Here, 'Total' column is the sum of HP,Attack,Defense, SP. Atk, SP. Def, Speed.
dropped_data = data.drop('Total', axis=1)
dropped_data.head(3)
Alternatively, you can use the
columns
parameter to specify the column(s) to drop without needing to specify the axis.
dropped_data = data.drop(columns='Total')
dropped_data.head(3)
inplace=True
modifies the original DataFrame without needing to reassign it.
data.drop(columns='Total', inplace=True)
data.head(3)
Add a column#
# So, in order to add the 'Total' column again, we do the summation
data['Total'] = data.iloc[:, 4:10].sum(axis=1)
data.head(3)
iloc
and loc
in Pandas#
iloc
#
# Read single row
data.iloc[3]
# Read multiple rows
data.iloc[[3,6]]
# Read a range of rows
data.iloc[3:7]
loc
#
The loc
property in Pandas is used to access a group of rows and columns by labels or a boolean array. Unlike iloc
, which uses integer-based indexing, loc
uses the actual labels of the index and columns. This makes it very useful for selecting data based on meaningful row or column names.
data.loc[data['Legendary'] == True].head()
If you have a condition to filter rows, then you must pass the columns that you want to select after the comma as a sequence. e.g., [condition, ['col1', 'col2']]
data.loc[data['Attack'] == 49, ['Name']]
Filtering data#
We can filter with multiple conditions at ease.
data.loc[(data['Type 1'] == 'Grass') & (data['Type 2'] == 'Poison')].head(3)
data.loc[(data['Type 1'] == 'Grass') | (data['Type 2'] == 'Poison')][0::10].loc[data['HP'] > 50].head(3)
Regex Filtering#
import re
Explicit filtering#
data.loc[data['Name'].str.contains('Mega')].head(3)
Regex filtering, case sensitive by default#
data.loc[data['Type 1'].str.contains('Fire|grass', regex=True)].head(3)
Regex filtering, case insensitive#
data.loc[data['Type 1'].str.contains('Fire|grass', flags=re.IGNORECASE, regex=True)].head(10)
Using regex#
data.loc[data['Name'].str.contains('pi[a-z]*', flags=re.IGNORECASE, regex=True)].head(3)
data.loc[data['Name'].str.contains('^pi[a-z]*', flags=re.IGNORECASE, regex=True)].head(3)
data.loc[data['Name'].str.contains('pi.*', flags=re.IGNORECASE, regex=True)].head(10)
Sorting#
Sort by 1 column, ascending#
data.sort_values('Type 1').head(3)
Sort by 1 column, descending#
data.sort_values('Type 1', ascending=False).head(3)
Sort by multiple columns, 1 = ascending, 0 = descending#
data.sort_values(['Type 1', 'HP'], ascending=[1, 0]).head(3)
Reset index#
In the cell below, you can see the index is not consecutive (the leftmost column). We can reset it with the reset_index()
method. By default, it adds the old index as a new column. If you do not want that, use the drop=True
parameter.
new_data = data.loc[(data['Type 1'] == 'Grass') | (data['Type 2'] == 'Poison')][0::10].loc[data['HP'] > 50].head(3)
new_data
new_data.reset_index(drop=True, inplace=True)
new_data
Group by/Aggregate using aggregate function#
Aggregate functions like mean()
, sum()
, min()
, and max()
in pandas only work with numerical data. Non-numeric columns (such as strings or booleans) cannot be averaged or summed, so you must select only numeric columns when using these functions to avoid errors and get meaningful results.
Group by 1 column#
# Select only numeric columns for aggregation
numeric_cols = data.select_dtypes(include='number').columns
data.groupby(['Type 1'])[numeric_cols].mean().sort_values('Type 1').head(10)
Group by hierarchical columns#
data.groupby(['Type 1', 'Type 2'])[numeric_cols].count().head()
Quizzes#
1. Import the pandas library as pd.#
import pandas as pd
2. Read CSV into df#
df = pd.read_csv('data/Pokemon.csv')
3. Show first 5 rows#
df.head(5)
4. Show last 3 rows#
df.tail(3)
5. Show all column names#
df.columns
6. Select rows index 10 to 15#
df.iloc[10:16]
7. Select row index 50#
df.iloc[50]
8. Filter rows Type 1 == Grass#
df.loc[df['Type 1'] == 'Grass'].head()
9. Filter rows HP > 100#
df.loc[df['HP'] > 100].head()
10. Filter rows Name contains chu#
df.loc[df['Name'].str.contains('chu')].head()
11. Sort by Attack descending#
df.sort_values('Attack', ascending=False).head()
12. Reset index#
new_df = df.reset_index(drop=True)
new_df.head()
13. Select columns Name, Type 1, HP#
df[['Name', 'Type 1', 'HP']].head()
14. Filter Type 1 == Grass and Type 2 == Poison#
df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison')].head()
15. Mean HP of Legendary#
df.loc[df['Legendary'] == True]['HP'].mean()
16. Rows with even indices#
even_indices_df = df.iloc[::2]
even_indices_df.head()
17. Save even indices to filtered_pokemon.csv#
even_indices_df = df.iloc[::2]
even_indices_df.to_csv('data/filtered_pokemon.csv', index=True)
even_indices_df = pd.read_csv('data/filtered_pokemon.csv', index_col=0)
even_indices_df.head()
18. First 10 rows Name, Type 1, Attack#
df[['Name', 'Type 1', 'Attack']].head(10)
19. Max Defense#
df['Defense'].max()
20. Rows Speed above mean#
mean_speed = df['Speed'].mean()
new_df = df.loc[df['Speed'] > mean_speed]
print('mean speed: ', mean_speed)
new_df.head()
21. First 3 rows Name and Attack#
df[['Name', 'Attack']].iloc[:3]
22. Count per Type 1#
df.groupby(['Type 1'])['#'].count().sort_index()
23. Rows Generation 3 or 5#
df.loc[(df['Generation'] == 3) | (df['Generation'] == 5)].head()
24. Replace Fire with Flame in Type 1#
df_copy = df.copy()
df_copy['Type 1'] = df_copy['Type 1'].str.replace('Fire', 'Flame')
df_copy.head()
25. Rows Name startswith P#
df.loc[df['Name'].str.startswith('P')].head()
26. Mean Attack per Type 1#
df.groupby('Type 1')['Attack'].mean().sort_index()
27. Sort by Type 1 asc, Attack desc#
df.sort_values(['Type 1', 'Attack'], ascending=[1, 0]).head()
28. Mean Defense per Type 1#
df.groupby('Type 1')['Defense'].mean().sort_index()
29. Legendary with Attack > 100#
df.loc[(df['Legendary'] == True) & (df['Attack'] > 100)].head()
30. New DataFrame Name and Total renamed Overall#
new_df = df[['Name', 'Total']].rename(columns={'Total': 'Overall'})
new_df.head()
31. Filter Type 1 == Water and Attack >= 80#
df.loc[(df['Type 1'] == 'Water') & (df['Attack'] >= 80)].head()
32. Generation 4 Name and Defense#
df.loc[df['Generation'] == 4][['Name', 'Defense']].head()
33. Mean Total per (Type 1, Type 2)#
df.groupby(['Type 1', 'Type 2'])['Total'].mean().head(10)
34. Filter Total between 400 and 500#
df.loc[(df['Total'] >= 400) & (df['Total'] <= 500)].head()