EXERCISE 2.1: DATA CLEANING AND TRANSFORMATION

Objective: Clean and transform a dataset to prepare it for analysis.

Tasks:

  1. Handle missing values (NaN) in the “data_cleaning_transfomation.csv” dataset.
  2. Convert the ‘Last Login Date’ from a string to a datetime object.
  3. Create a new feature, ‘Monthly Spend per Day’, by dividing ‘Monthly Spend’ by ‘Subscription Length’.
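Tasks 2 and 3 can be sketched on a toy DataFrame as follows. This is only an illustration under stated assumptions: the sample rows, the `%Y-%m-%d` date format, and the name `df` are invented here, and the column is written as 'Monthly Spend ($)' to match the code in the steps below.

```python
import pandas as pd

# Toy data; the values and the date format are assumptions for illustration.
df = pd.DataFrame({
    "Last Login Date": ["2023-01-15", "2023-02-01", "not a date"],
    "Monthly Spend ($)": [120.0, 80.0, 60.0],
    "Subscription Length": [12, 8, 6],
})

# Task 2: parse strings into datetimes; unparseable entries become NaT.
df["Last Login Date"] = pd.to_datetime(
    df["Last Login Date"], format="%Y-%m-%d", errors="coerce"
)

# Task 3: derive the new feature by element-wise division.
df["Monthly Spend per Day"] = df["Monthly Spend ($)"] / df["Subscription Length"]

print(df.dtypes)
print(df)
```

Note that a 'Subscription Length' of zero would produce an infinite value in the new column, so it is worth checking for zeros before dividing.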

Steps:

  1. Importing Required Libraries:
    import pandas as pd
    • pandas is used for data manipulation and analysis.
  2. Loading the Data:
    data_exercise_1 = pd.read_csv('path_to_csv_file')

    This line reads the CSV file into a pandas DataFrame so that we can work with the data in Python.
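Before filling anything, it helps to see how many values are actually missing in each column. The sketch below uses a small in-memory CSV as a hypothetical stand-in for the exercise file; the rows are invented, and only the column names come from the exercise.

```python
import io
import pandas as pd

# Hypothetical stand-in for the exercise file (invented rows).
csv_text = """Age,Monthly Spend ($)
34,120.5
,80.0
45,
"""

data_exercise_1 = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN; count them per column.
missing_counts = data_exercise_1.isna().sum()
print(missing_counts)
```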

  3. Handling Missing Values:
    • Filling Missing ‘Age’ Values:
    mean_age = data_exercise_1['Age'].mean()
    data_exercise_1['Age'] = data_exercise_1['Age'].fillna(mean_age)

    Here, we calculate the mean of the ‘Age’ column and use it to fill the missing values (NaN) in that column. The mean is a reasonable estimate when the ages are roughly symmetric with no extreme outliers; for a heavily skewed column, the median is the safer choice.
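A minimal sketch of mean imputation on a toy column (the values and the name `ages` are invented for illustration). It assigns the filled column back instead of calling `fillna` with `inplace=True` on a column selection, because recent pandas versions warn that the in-place form may not update the original DataFrame.

```python
import numpy as np
import pandas as pd

# Toy data with two missing ages.
ages = pd.DataFrame({"Age": [22.0, 30.0, np.nan, 38.0, np.nan]})

# mean() skips NaN by default: (22 + 30 + 38) / 3 = 30.0
mean_age = ages["Age"].mean()

# Assign the result back so the DataFrame is reliably updated.
ages["Age"] = ages["Age"].fillna(mean_age)

print(ages)
```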

    • Filling Missing ‘Monthly Spend’ Values:
    median_monthly_spend = data_exercise_1['Monthly Spend ($)'].median()
    data_exercise_1['Monthly Spend ($)'] = data_exercise_1['Monthly Spend ($)'].fillna(median_monthly_spend)

    We fill missing values in ‘Monthly Spend ($)’ with the median because spending data often contains outliers, and the median is less sensitive to them than the mean.
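To see why the median is the more robust choice here, consider a toy spend column with one extreme value. The numbers and the name `spend` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Three typical customers, one very large spender, one missing value.
spend = pd.Series([40.0, 50.0, 60.0, 1000.0, np.nan], name="Monthly Spend ($)")

mean_spend = spend.mean()      # pulled upward by the 1000.0 outlier
median_spend = spend.median()  # middle of the observed values

# Fill the gap with the median and keep the result.
spend_filled = spend.fillna(median_spend)

print(mean_spend, median_spend)
print(spend_filled)
```

In this toy case the mean is 287.5 while the median is 55.0, so the median-filled value sits much closer to what a typical customer spends.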

    • Filling Missing ‘Feedback Score’ Values:
    mode_feedback ...
