Missing values

1- Importing required libraries

In [1]:
import numpy as np
import pandas as pd

2- Let's generate a simulated dataset.

In [3]:
# Generate a simulated dataset with missing values
np.random.seed(42)
data = pd.DataFrame({
    'Age': np.random.randint(18, 65, size=100),
    'Sex': np.random.choice(['Male', 'Female'], size=100),
    'Salary': np.random.normal(50000, 10000, size=100),
    'UniversityDegree': np.random.choice([True, False], size=100)
})

# Cast the boolean column to object first so it can hold NaN without
# triggering pandas' incompatible-dtype FutureWarning
data['UniversityDegree'] = data['UniversityDegree'].astype(object)

# Introduce missing values: 10 rows per column, in non-overlapping blocks
data.iloc[10:20, 0] = np.nan
data.iloc[30:40, 1] = np.nan
data.iloc[50:60, 2] = np.nan
data.iloc[70:80, 3] = np.nan

print(data.head())
data.info()  # info() prints directly and returns None
    Age     Sex        Salary UniversityDegree
0  56.0  Female  57240.832515            False
1  46.0  Female  47442.353630             True
2  32.0    Male  58499.212041            False
3  60.0    Male  36886.757742             True
4  25.0    Male  41296.950454            False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Age               90 non-null     float64
 1   Sex               90 non-null     object 
 2   Salary            90 non-null     float64
 3   UniversityDegree  90 non-null     object 
dtypes: float64(2), object(2)
memory usage: 3.2+ KB

3- Are there any missing values? Calculate the percentage of missing values in each column.

In [4]:
percent_missing = (data.isna().sum() * 100 / len(data)).round(3)
missing_value_df = pd.DataFrame({'Missing_Percentage': percent_missing})
missing_value_df.sort_values(by='Missing_Percentage', ascending=False).head(5)
Out[4]:
                  Missing_Percentage
Age                             10.0
Sex                             10.0
Salary                          10.0
UniversityDegree                10.0
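
The percentage check above can be wrapped in a small helper that also reports raw counts. This is only a minimal sketch: `missing_summary` and the toy frame `toy` are hypothetical names introduced for illustration and are not part of the exercise.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    percents = (df.isna().mean() * 100).round(3)
    return pd.DataFrame({'Missing_Count': counts, 'Missing_Percentage': percents})


# Hypothetical toy frame, used only to illustrate the helper
toy = pd.DataFrame({
    'Age': [25.0, np.nan, 40.0, 31.0],
    'Sex': ['Male', 'Female', None, None],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the simulated dataset should give the same percentages as the cell above, with the counts alongside.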

4- Drop rows with missing values, and store the new dataframe in data_dropna.

In [5]:
data_dropna = data.dropna()
print("After dropping rows with missing values:")
data_dropna.info()  # info() prints directly, so no print() wrapper is needed
After dropping rows with missing values:
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 99
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Age               60 non-null     float64
 1   Sex               60 non-null     object 
 2   Salary            60 non-null     float64
 3   UniversityDegree  60 non-null     object 
dtypes: float64(2), object(2)
memory usage: 2.3+ KB
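
Dropping every row that has any missing value removed 40 of the 100 rows here. Where that is too aggressive, `dropna` can be narrowed with its `subset` and `thresh` parameters. The sketch below uses a hypothetical toy frame, so the row counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the dropna options
toy = pd.DataFrame({
    'Age': [25.0, np.nan, 40.0, np.nan],
    'Salary': [50000.0, 42000.0, np.nan, np.nan],
    'Sex': ['Male', 'Female', 'Male', None],
})

# Drop a row only when 'Salary' is missing
drop_by_salary = toy.dropna(subset=['Salary'])

# Keep rows that have at least two non-missing values
drop_sparse = toy.dropna(thresh=2)

print(len(toy.dropna()), len(drop_by_salary), len(drop_sparse))
```

Which variant is appropriate depends on which columns the later analysis actually needs.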

5- Fill missing values with the median for Age and Salary, and with the mode for UniversityDegree and Sex. Store the result in data_filled.

In [11]:
data_filled = data.copy()

# Fill missing values in the 'Sex' column with the mode; assigning the result
# back avoids the chained-assignment inplace pattern
data_filled['Sex'] = data_filled['Sex'].fillna(data_filled['Sex'].mode()[0])

# Fill missing values in the 'UniversityDegree' column with the mode, then
# cast explicitly to bool now that no NaN remains
degree_mode = data_filled['UniversityDegree'].mode()[0]
data_filled.loc[data_filled['UniversityDegree'].isna(), 'UniversityDegree'] = degree_mode
data_filled['UniversityDegree'] = data_filled['UniversityDegree'].astype(bool)

# Fill missing values in numerical columns with the median
numeric_cols = data_filled.select_dtypes(include='number').columns
data_filled[numeric_cols] = data_filled[numeric_cols].fillna(data_filled[numeric_cols].median())
data_filled.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Age               100 non-null    float64
 1   Sex               100 non-null    object 
 2   Salary            100 non-null    float64
 3   UniversityDegree  100 non-null    bool   
dtypes: bool(1), float64(2), object(1)
memory usage: 2.6+ KB
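
An alternative to filling column by column is to pass `DataFrame.fillna` a dictionary with one fill value per column. The sketch below shows that pattern on a hypothetical toy frame; the names `toy`, `fill_values` and `toy_filled` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate dictionary-based filling
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 30.0, 100.0],
    'Sex': ['Male', 'Female', None, 'Female'],
})

# One fill value per column: median for the numeric column, mode for the label
fill_values = {
    'Age': toy['Age'].median(),
    'Sex': toy['Sex'].mode()[0],
}

# fillna returns a new frame; the original toy frame is left unchanged
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```

Keeping the fill values in one dictionary also makes it easy to reuse the same values on another dataset later.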

6- Analyze the impact of filling or dropping missing values by comparing summary statistics before and after. Any shift can carry over into later analysis or model performance.

In [12]:
# Mean salary before handling missing values (mean() skips NaN by default)
mean_salary_before = data['Salary'].mean()

# Mean salary in data_filled, where Salary was filled with the median
mean_salary_filled = data_filled['Salary'].mean()

# Same median fill applied directly to the original column, as a cross-check
mean_salary_after_median = data['Salary'].fillna(data['Salary'].median()).mean()

# Note: filling with the mean instead would leave the mean unchanged
print("Mean salary before handling missing values:", mean_salary_before)
print("Mean salary in data_filled (median fill):", mean_salary_filled)
print("Mean salary after filling missing values with median:", mean_salary_after_median)
Mean salary before handling missing values: 50842.18789025037
Mean salary in data_filled (median fill): 50744.31997140812
Mean salary after filling missing values with median: 50744.31997140812
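
The two filled means above are identical because both use a median fill. To see how a mean fill differs, the sketch below compares the two strategies on a small hypothetical salary series (values chosen for illustration, not taken from the dataset). A mean fill leaves the mean where it was but shrinks the standard deviation, while a median fill pulls the mean toward the median.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed salaries with two missing entries
salary = pd.Series([30000.0, 40000.0, 50000.0, 120000.0, np.nan, np.nan])

filled_mean = salary.fillna(salary.mean())
filled_median = salary.fillna(salary.median())

# Side-by-side comparison of mean and standard deviation
comparison = pd.DataFrame(
    {
        'mean': [salary.mean(), filled_mean.mean(), filled_median.mean()],
        'std': [salary.std(), filled_mean.std(), filled_median.std()],
    },
    index=['observed only', 'mean fill', 'median fill'],
)
print(comparison)
```

Both fills reduce the spread, since every imputed value sits at a single central point. That is one reason to check the variance as well as the mean when judging the impact of imputation.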