Gen AI Developer Week 2 — Day 4

Sai Chinmay Tripurari
2 min read · Jan 16, 2025

Data preparation is one of the most important steps in getting a dataset ready for machine learning models. The main objectives are to handle missing data carefully so that data integrity is preserved, to convert categorical variables into numbers through techniques such as one-hot encoding and label encoding, and to scale features with tools such as StandardScaler and MinMaxScaler so that they sit on comparable ranges. More generally, good preparation improves not only model performance but also the quality of the insights extracted from the data. The nuances of one-hot encoding and label encoding are discussed, as well as strategies for managing this part of the modeling workflow.
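
To make the contrast between the two encodings concrete, here is a minimal sketch. The countries series is a hypothetical toy column, and LabelEncoder is used only for illustration: scikit-learn documents it as meant for target labels, with OrdinalEncoder as the counterpart for input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, used only to illustrate the two encodings
countries = pd.Series(['USA', 'Canada', 'Mexico', 'USA'], name='Country')

# Label encoding: each category becomes one integer (classes are sorted alphabetically)
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(countries)
print(list(label_encoder.classes_))  # ['Canada', 'Mexico', 'USA']
print(labels.tolist())               # [2, 0, 1, 2]

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(countries, prefix='Country', dtype=int)
print(one_hot)
```

Label encoding keeps a single column but implies an order (Canada < Mexico < USA) that some models may treat as meaningful. One-hot encoding avoids that at the cost of one extra column per category.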

Tasks

  1. Handle Missing Data
    Use the given dataset to fill in missing values with appropriate strategies.
  2. One-Hot Encode Categorical Data
    Convert categorical columns into numerical ones.
  3. Scale Features
    Apply StandardScaler and MinMaxScaler to a numerical dataset.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Dataset
data = {
    'Age': [25, 30, np.nan, 40, 50],
    'Salary': [50000, 60000, 70000, np.nan, 90000],
    'Country': ['USA', 'Canada', 'Mexico', 'USA', np.nan]
}
df = pd.DataFrame(data)

# Task 1: Handle Missing Data
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Fill Age with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].median()) # Fill Salary with median
df['Country'] = df['Country'].fillna(df['Country'].mode()[0]) # Fill Country with mode

print("After Handling Missing Data:")
print(df)

# Task 2: One-Hot Encode 'Country'
# drop_first=True drops the first category (Canada here) to avoid a redundant column
df_encoded = pd.get_dummies(df, columns=['Country'], drop_first=True)
print("\nAfter One-Hot Encoding:")
print(df_encoded)

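# Task 3: Scale Features
scaler_standard = StandardScaler()
scaler_minmax = MinMaxScaler()
numeric_cols = ['Age', 'Salary']

# Standard scaling: zero mean, unit variance (done on a copy so df_encoded stays unscaled)
df_standard = df_encoded.copy()
df_standard[numeric_cols] = scaler_standard.fit_transform(df_encoded[numeric_cols])
print("\nAfter Standard Scaling:")
print(df_standard)

# Min-max scaling: rescales each column to [0, 1], starting from the unscaled values
df_minmax = df_encoded.copy()
df_minmax[numeric_cols] = scaler_minmax.fit_transform(df_encoded[numeric_cols])
print("\nAfter Min-Max Scaling:")
print(df_minmax)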

Practice Task — 1
Filling with zeros or custom values.
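
One possible way to approach this task is sketched below. It rebuilds the toy dataset so the snippet runs on its own; the names df_practice, df_zeros, df_custom and the placeholder values (0, 'Unknown') are assumptions chosen for illustration, not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Same toy dataset as in the main example, rebuilt so this snippet is self-contained
df_practice = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 50],
    'Salary': [50000, 60000, 70000, np.nan, 90000],
    'Country': ['USA', 'Canada', 'Mexico', 'USA', np.nan],
})

# Option A: fill missing numeric values with zero (Country is left untouched)
df_zeros = df_practice.copy()
df_zeros[['Age', 'Salary']] = df_zeros[['Age', 'Salary']].fillna(0)

# Option B: fill each column with its own custom value (placeholders are arbitrary)
custom_values = {'Age': 0, 'Salary': 0, 'Country': 'Unknown'}
df_custom = df_practice.fillna(value=custom_values)

print(df_zeros)
print(df_custom)
```

Filling with zero is simple, but it can distort statistics such as the mean, so it tends to fit best when zero is a meaningful value for the column.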

Practice Task — 2
Dropping rows with missing data.
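
A minimal sketch of this task, again on a rebuilt copy of the toy dataset. The names df_practice, df_drop_any and df_drop_subset are hypothetical, and the choice of which columns to check is an assumption.

```python
import numpy as np
import pandas as pd

# Same toy dataset as in the main example, rebuilt so this snippet is self-contained
df_practice = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 50],
    'Salary': [50000, 60000, 70000, np.nan, 90000],
    'Country': ['USA', 'Canada', 'Mexico', 'USA', np.nan],
})

# Drop every row that has at least one missing value
df_drop_any = df_practice.dropna()

# Drop rows only when the listed columns are missing
df_drop_subset = df_practice.dropna(subset=['Age', 'Salary'])

print(len(df_practice), len(df_drop_any), len(df_drop_subset))  # 5 2 3
```

On this tiny dataset, dropping rows discards three of the five records, which shows why filling is often preferred when data is scarce.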

Deliverables for the day:

  1. Submit a well-preprocessed dataset where:
    Missing values are handled.
    Categorical data is encoded.
    Numerical features are scaled.
  2. Experiment with both StandardScaler and MinMaxScaler and describe the differences.
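
As a starting point for describing the differences, the sketch below checks each scaler against its formula on a hypothetical single-column array (the ages values are made up for easy arithmetic). StandardScaler computes (x - mean) / std using the population standard deviation, while MinMaxScaler computes (x - min) / (max - min) for its default [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical single-feature column, chosen for easy arithmetic
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])

standard = StandardScaler().fit_transform(ages)
minmax = MinMaxScaler().fit_transform(ages)

# StandardScaler: z = (x - mean) / std, with population std (ddof=0)
manual_standard = (ages - ages.mean()) / ages.std()

# MinMaxScaler: x' = (x - min) / (max - min), default feature_range is (0, 1)
manual_minmax = (ages - ages.min()) / (ages.max() - ages.min())

print(np.allclose(standard, manual_standard))  # True
print(np.allclose(minmax, manual_minmax))      # True
print(standard.ravel())
print(minmax.ravel())
```

Standardized values are centered on zero and unbounded, whereas min-max values always fall inside [0, 1] and depend entirely on the observed minimum and maximum, which makes them more sensitive to outliers.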

Happy Learning! 😊 For any questions or support, feel free to message me on LinkedIn.

Written by Sai Chinmay Tripurari

Software Developer | ReactJS & React Native Expert | AI & Cloud Enthusiast | Building intuitive apps, scalable APIs, and exploring AI-driven solutions.