Gen AI Developer Week 2 — Day 4
Data preparation is arguably the most important step in making a dataset ready for machine learning models. The main objectives are to handle missing data carefully so as to preserve data integrity, to convert categorical variables into numerical form through mechanisms such as one-hot encoding and label encoding, and to scale features with tools such as StandardScaler and MinMaxScaler so that they share a comparable range. More generally, this improves not only model performance but also the quality of the insights extracted from the data. The nuances of one-hot encoding and label encoding are discussed, as well as strategies for managing this aspect of modeling.
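To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The `countries` series and the variable names are made up for illustration only; `LabelEncoder` is generally intended for target labels, so treat its use on a feature here as a demonstration rather than a recommendation.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, used only for illustration
countries = pd.Series(['USA', 'Canada', 'Mexico', 'USA'], name='Country')

# Label encoding: one integer per category, assigned in sorted order
label_encoder = LabelEncoder()
country_labels = label_encoder.fit_transform(countries)

# One-hot encoding: one 0/1 column per category
country_onehot = pd.get_dummies(countries, prefix='Country', dtype=int)

print(country_labels)   # integers 0..2, one per row
print(country_onehot)   # three indicator columns
```

Label encoding keeps a single column but implies an order (Canada < Mexico < USA) that may not mean anything, while one-hot encoding avoids that implied order at the cost of extra columns.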
Tasks
- Handle Missing Data
Use the given dataset to fill in missing values with appropriate strategies.
- One-Hot Encode Categorical Data
Convert categorical columns into numerical ones.
- Scale Features
Apply StandardScaler and MinMaxScaler to a numerical dataset.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Dataset
data = {
    'Age': [25, 30, np.nan, 40, 50],
    'Salary': [50000, 60000, 70000, np.nan, 90000],
    'Country': ['USA', 'Canada', 'Mexico', 'USA', np.nan]
}
df = pd.DataFrame(data)
# Task 1: Handle Missing Data
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Fill Age with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].median()) # Fill Salary with median
df['Country'] = df['Country'].fillna(df['Country'].mode()[0]) # Fill Country with mode
print("After Handling Missing Data:")
print(df)
# Task 2: One-Hot Encode 'Country'
df_encoded = pd.get_dummies(df, columns=['Country'], drop_first=True)
print("\nAfter One-Hot Encoding:")
print(df_encoded)
# Task 3: Scale Features
scaler_standard = StandardScaler()
scaler_minmax = MinMaxScaler()
# Scale a copy so the unscaled values in df_encoded stay available
df_standard = df_encoded.copy()
df_standard[['Age', 'Salary']] = scaler_standard.fit_transform(df_encoded[['Age', 'Salary']])
print("\nAfter Standard Scaling:")
print(df_standard)
# Fit MinMaxScaler on the unscaled values, not on the standardized output
df_minmax = df_encoded.copy()
df_minmax[['Age', 'Salary']] = scaler_minmax.fit_transform(df_encoded[['Age', 'Salary']])
print("\nAfter Min-Max Scaling:")
print(df_minmax)
Practice Task — 1
Filling with zeros or custom values.
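One possible way to approach this task is sketched below. It rebuilds the same small dataset under the hypothetical name `df_practice`, and the custom fill values (35, 65000, 'Unknown') are arbitrary placeholders, not recommended defaults.

```python
import numpy as np
import pandas as pd

# Same toy dataset as above, rebuilt so this snippet runs on its own
df_practice = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 50],
    'Salary': [50000, 60000, 70000, np.nan, 90000],
    'Country': ['USA', 'Canada', 'Mexico', 'USA', np.nan],
})

# Option A: fill the numeric columns with zeros
df_zero = df_practice.copy()
df_zero[['Age', 'Salary']] = df_zero[['Age', 'Salary']].fillna(0)

# Option B: fill each column with its own custom value (arbitrary choices)
custom_values = {'Age': 35, 'Salary': 65000, 'Country': 'Unknown'}
df_custom = df_practice.fillna(value=custom_values)

print(df_zero)
print(df_custom)
```

Filling with zeros is simple, but a zero Age or Salary can distort means and scaling, so it is worth comparing the result against the mean and median fills used earlier.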
Practice Task — 2
Dropping rows with missing data.
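A minimal sketch of this task, again on a rebuilt copy of the toy dataset (the name `df_practice` is hypothetical). It shows dropping every row with any missing value, and dropping only rows where selected columns are missing.

```python
import numpy as np
import pandas as pd

# Same toy dataset as above, rebuilt so this snippet runs on its own
df_practice = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 50],
    'Salary': [50000, 60000, 70000, np.nan, 90000],
    'Country': ['USA', 'Canada', 'Mexico', 'USA', np.nan],
})

# Drop every row that has at least one missing value
df_dropped_any = df_practice.dropna()

# Drop only rows where Age or Salary is missing; a missing Country is kept
df_dropped_subset = df_practice.dropna(subset=['Age', 'Salary'])

print(len(df_practice), len(df_dropped_any), len(df_dropped_subset))
```

On a dataset this small, dropping rows discards most of the data, which is one reason filling is often preferred when missing values are spread across many rows.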
Deliverables for the day:
- Submit a well-preprocessed dataset where:
Missing values are handled.
Categorical data is encoded.
Numerical features are scaled.
- Experiment with both StandardScaler and MinMaxScaler and describe the differences.
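As a starting point for describing the differences, the sketch below compares each scaler with its formula written out by hand. The `ages` array is a made-up example; the formulas shown are the default behaviour of the two scalers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical single-feature column, used only for illustration
ages = np.array([[25.0], [30.0], [40.0], [50.0]])

standard = StandardScaler().fit_transform(ages)
minmax = MinMaxScaler().fit_transform(ages)

# StandardScaler: subtract the mean, divide by the population std (ddof=0)
manual_standard = (ages - ages.mean()) / ages.std()

# MinMaxScaler: map the minimum to 0 and the maximum to 1
manual_minmax = (ages - ages.min()) / (ages.max() - ages.min())

print(standard.ravel())  # centered on 0, unit variance
print(minmax.ravel())    # squeezed into [0, 1]
```

StandardScaler output has no fixed bounds, while MinMaxScaler output is bounded by the observed minimum and maximum, which makes it more sensitive to outliers.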
Happy Learning! 😊 For any questions or support, feel free to message me on LinkedIn.