import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
df = pd.read_csv('california_cities.csv')
df.head()
city | latd | longd | elevation_m | elevation_ft | population_total | area_total_sq_mi | area_land_sq_mi | area_water_sq_mi | area_total_km2 | area_land_km2 | area_water_km2 | area_water_percent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Adelanto | 34.576111 | -117.432778 | 875.0 | 2871.0 | 31765.0 | 56.027 | 56.009 | 0.018 | 145.107 | 145.062 | 0.046 | 0.03 |
1 | AgouraHills | 34.153333 | -118.761667 | 281.0 | 922.0 | 20330.0 | 7.822 | 7.793 | 0.029 | 20.260 | 20.184 | 0.076 | 0.37 |
2 | Alameda | 37.756111 | -122.274444 | NaN | 33.0 | 75467.0 | 22.960 | 10.611 | 12.349 | 59.465 | 27.482 | 31.983 | NaN |
3 | Albany | 37.886944 | -122.297778 | NaN | 43.0 | 18969.0 | 5.465 | 1.788 | 3.677 | 14.155 | 4.632 | 9.524 | NaN |
4 | Alhambra | 34.081944 | -118.135000 | 150.0 | 492.0 | 83089.0 | 7.632 | 7.631 | 0.001 | 19.766 | 19.763 | 0.003 | NaN |
df.columns
Index(['city', 'latd', 'longd', 'elevation_m', 'elevation_ft', 'population_total', 'area_total_sq_mi', 'area_land_sq_mi', 'area_water_sq_mi', 'area_total_km2', 'area_land_km2', 'area_water_km2', 'area_water_percent'], dtype='object')
Lets Explore¶
Part - I Idetification and Clearning Missing Values:¶
1. Find out the missing values in each column and store them in the variable col_missing_values
col_missing_values = df.isna().sum()
col_missing_values
city 0 latd 12 longd 15 elevation_m 50 elevation_ft 14 population_total 2 area_total_sq_mi 6 area_land_sq_mi 3 area_water_sq_mi 5 area_total_km2 7 area_land_km2 6 area_water_km2 6 area_water_percent 62 dtype: int64
2. Which Column has the most number of missing values?
# you can check from the result of question 1 above
3. Which Column has the minimum missing values?
df.isnull().sum()
city 0 latd 12 longd 15 elevation_m 50 elevation_ft 14 population_total 2 area_total_sq_mi 6 area_land_sq_mi 3 area_water_sq_mi 5 area_total_km2 7 area_land_km2 6 area_water_km2 6 area_water_percent 62 dtype: int64
4. How many values of the column elevation_m
are missing?
5. Find the total missing values in the whole dataset and store the number in df_missing_values
df_missing_values = df.isna().sum().sum()
df_missing_values.sum()
188
6. Drop the column area_water_percent
as it has the most of its values missing
Note: Drop it permanently by using inplace=True
df = df.drop(["area_water_percent"], axis=1)
df.columns
Index(['city', 'latd', 'longd', 'elevation_m', 'elevation_ft', 'population_total', 'area_total_sq_mi', 'area_land_sq_mi', 'area_water_sq_mi', 'area_total_km2', 'area_land_km2', 'area_water_km2'], dtype='object')
7. Drop the rows having missing values and store your resulting DataFrame in the variable df_narows_dropped
df_narows_dropped = df.dropna(axis=0, how="any")
df_narows_dropped.shape
(408, 12)
8. Drop Rows with More Than 5 Missing Values
df_rows_dropped = df.dropna(axis=0, thresh=5)
df_rows_dropped.shape
(480, 12)
9. Remove Columns with Missing Values
df_nacols_dropped = df.dropna(axis=1, how="any")
df_nacols_dropped.shape
(482, 1)
10. Drop colomns with more than 10 missing values and store the resulting DataFrame in the variable df_cols_dropped
df_cols_dropped = df.dropna(axis=1, thresh=len(df) - 10)
df_cols_dropped.shape
(482, 8)
Part - II Data Imputation Methods:¶
df.isna().sum()
city 0 latd 12 longd 15 elevation_m 50 elevation_ft 14 population_total 2 area_total_sq_mi 6 area_land_sq_mi 3 area_water_sq_mi 5 area_total_km2 7 area_land_km2 6 area_water_km2 6 dtype: int64
11. Fill the 50 missing values in elevation_m
with -999. Store your result in the variable filled_elevation_m
filled_elevation_m = df["elevation_m"].fillna(-999)
filled_elevation_m
0 875.0 1 281.0 2 -999.0 3 -999.0 4 150.0 ... 477 30.0 478 787.0 479 18.0 480 798.0 481 1027.0 Name: elevation_m, Length: 482, dtype: float64
12. Fill the 7 missing values in area_total_km2
with the value 0 permanently, store your result in the variable filled_area_total
df["area_total_km2"].fillna(0, inplace=True)
filled_area_total = df["area_total_km2"]
filled_area_total
0 145.107 1 20.260 2 59.465 3 14.155 4 19.766 ... 477 3.966 478 26.036 479 37.959 480 72.244 481 103.639 Name: area_total_km2, Length: 482, dtype: float64
13. Fill the missing values of the column latd
using backward filling method and store your result in the variable bfill_latd
bfill_latd = df["latd"].bfill()
bfill_latd
0 34.576111 1 34.153333 2 37.756111 3 37.886944 4 34.081944 ... 477 38.403056 478 41.726667 479 39.134722 480 34.030278 481 34.133333 Name: latd, Length: 482, dtype: float64
14. Fill the 15 missing values of the column longd
using forwarding filling method and store your result in the variable ffill_longd
ffill_longd = df["longd"].ffill()
ffill_longd
0 -117.432778 1 -118.761667 2 -122.274444 3 -122.297778 4 -118.135000 ... 477 -122.362222 478 -122.637500 479 -121.626111 480 -117.048611 481 -116.416667 Name: longd, Length: 482, dtype: float64
15. Fill the 2 missing values of the column population_total
with the mean of the column and store your result in the variable mean_total_population
mean_total_population = df["population_total"].fillna(df["population_total"].mean())
mean_total_population
0 31765.0 1 20330.0 2 75467.0 3 18969.0 4 83089.0 ... 477 2933.0 478 7765.0 479 64925.0 480 51367.0 481 20700.0 Name: population_total, Length: 482, dtype: float64
16. Fill the 5 missing values of the column area_water_sq_mi
with the median value of the column and store your result in the variable median_fill
median_fill = df["area_water_sq_mi"].fillna(df["area_water_sq_mi"].median())
median_fill
0 0.018 1 0.029 2 12.349 3 3.677 4 0.001 ... 477 0.000 478 0.073 479 0.078 480 0.005 481 0.000 Name: area_water_sq_mi, Length: 482, dtype: float64
17. Fill the 6 missing values of the column area_land_km2
with the mode value of the column and store your result in the variable mode_fill
mode_fill = df["area_land_km2"].fillna(df["area_land_km2"].mode()[0])
mode_fill
0 145.062 1 20.184 2 27.482 3 4.632 4 19.763 ... 477 3.966 478 25.847 479 37.758 480 72.231 481 103.639 Name: area_land_km2, Length: 482, dtype: float64
18. Which of the following code will fill the missing values in the DataFrame with zeros and store your result in a variable filled_df
19. Which of the following is/are the general structure for filling a column values with any value (mean, mode, or median)