Project: K-nearest neighbors (KNN)¶

Construction of a KNN manually¶

Importing required libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

Generate Your Sample Dataset¶

In [2]:
X, y = make_blobs(n_samples=1000, centers=2,
                  random_state=0, cluster_std=1.3)
In [3]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='bwr')
plt.colorbar()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
[Figure: scatter plot of the generated dataset, x1 against x2, colored by class]

1- Calculate the Euclidean distance between two vectors: the square root of the sum of the squared differences between their components.

These cells are intentionally left blank for you to practice

In [4]:
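# Calculate the Euclidean distance between two rows.
# The last element of each row is assumed to be the class label and is skipped;
# for a label-free row such as X[0], this means the last feature is ignored.
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    # Take the square root once, after all squared differences are summed
    return distance**0.5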
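
As a cross-check, the same distance can be computed with NumPy. This is a minimal sketch, assuming (like the loop above) that the last element of each row is a class label and is excluded; the name `euclidean_distance_np` is illustrative and not part of the notebook.

```python
import numpy as np

def euclidean_distance_np(row1, row2):
    """Euclidean distance over all but the last element (assumed to be the label)."""
    a = np.asarray(row1[:-1], dtype=float)
    b = np.asarray(row2[:-1], dtype=float)
    return float(np.linalg.norm(a - b))

# Two rows of the form [x1, x2, label]
p = [0.0, 0.0, 0]
q = [3.0, 4.0, 1]
print(euclidean_distance_np(p, q))  # 5.0
```

The 3-4-5 triangle makes the result easy to verify by hand: sqrt(3² + 4²) = 5.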

2- Let's test the distance function.

In [5]:
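# Compare two individual rows (passing the whole array X as row2 would not give a single distance)
distance = euclidean_distance(X[0], X[1])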

3- Define the function get_neighbors, which returns the num_neighbors rows of train that are closest to test_row.

In [6]:
# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    # Sort once, after every distance has been computed
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors
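
The neighbor search can also be written without explicit loops. The sketch below is one possible vectorized variant, not the notebook's function: it treats every column as a feature (no label column is assumed) and returns row indices rather than the rows themselves. The name `get_neighbor_indices` is hypothetical.

```python
import numpy as np

def get_neighbor_indices(train, test_row, num_neighbors):
    """Return the indices of the num_neighbors rows of train closest to test_row.

    All columns are treated as features (no label column is assumed).
    """
    train = np.asarray(train, dtype=float)
    diffs = train - np.asarray(test_row, dtype=float)
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # A stable sort keeps the original row order among equal distances
    return np.argsort(dists, kind="stable")[:num_neighbors]

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
idx = get_neighbor_indices(points, points[0], 3)
print(idx)          # [0 1 3]
print(points[idx])  # the three closest rows, nearest first
```

Returning indices makes it easy to look up the matching labels afterwards, for example `y[idx]`.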

4- Use get_neighbors to find the 3 records in the dataset that are most similar to the first record, in order of similarity. Because the first record is itself part of the dataset, it is returned as its own closest neighbor.

These cells are intentionally left blank for you to practice

In [7]:
neighbors = get_neighbors(X, X[0], 3)
neighbors
Out[7]:
[array([0.31372224, 3.73429074]),
 array([0.3122234 , 2.76896549]),
 array([0.3164972 , 2.93634319])]

5- Use the following function to make predictions using 3 neighbors for the first row.

In [8]:
# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

# Append the labels as the last column so that each row is [x1, x2, label]
Z = np.concatenate([X, y.reshape(-1, 1)], axis=1)
prediction = predict_classification(Z, Z[0], 3)
print('Expected %d, Got %d.' % (y[0], prediction))
Expected 0, Got 0.
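
The prediction step is a majority vote over the labels of the nearest neighbors. As a small illustration of that vote in isolation, the sketch below uses `collections.Counter`; the helper name `majority_vote` is hypothetical, and its tie-breaking rule (the label seen first wins) is a choice made here, not something the notebook specifies.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; among tied labels, the one seen first wins."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote([0, 1, 1]))        # 1
print(majority_vote([0.0, 0.0, 1.0]))  # 0.0
```

With two classes, choosing an odd number of neighbors such as 3 avoids ties altogether.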

KNN with sklearn¶

Training and testing dataset¶

6- Split the dataset into 70% training and 30% test.

In [9]:
from sklearn.model_selection import train_test_split
# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
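
It can help to confirm the size and class balance of each subset after splitting. The sketch below regenerates the same blobs and additionally passes `stratify=y`, which is an addition for illustration and not part of the call above; it asks `train_test_split` to keep the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=1000, centers=2,
                  random_state=0, cluster_std=1.3)

# stratify=y is an optional extra, not used in the notebook's own split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)                # (700, 2) (300, 2)
print(np.bincount(y_train), np.bincount(y_test))  # samples per class in each subset
```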

7- Let's build a KNN classifier model using sklearn. First, import the KNeighborsClassifier class and create a KNN classifier with n_neighbors=3. Then fit the model on the training set using fit() and use it to make predictions.

These cells are intentionally left blank for you to practice

In [10]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)

# Train the model using the training sets
model.fit(X_train,y_train)
Out[10]:
KNeighborsClassifier(n_neighbors=3)

8- Evaluate the KNN model on the test dataset.

In [11]:
from sklearn import metrics
# Predict the labels of the test set
predicted = model.predict(X_test)
# Model accuracy: the fraction of test samples classified correctly
print("Accuracy:", metrics.accuracy_score(y_test, predicted))
Accuracy: 0.9033333333333333
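
Accuracy is a single number; a confusion matrix shows which class the errors fall in. The following sketch rebuilds the same data, split, and model so that it runs on its own, then checks that the accuracy equals the diagonal of the confusion matrix divided by the total count. The variable name `cm` is illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=1000, centers=2,
                  random_state=0, cluster_std=1.3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predicted = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, predicted)
print(cm)

# Correct predictions sit on the diagonal
print(cm.trace() / cm.sum(), accuracy_score(y_test, predicted))
```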

9- Decision boundaries.

In the code below we plot the decision boundary of the previously trained model with k=3.

Let's practice using k=1, k=10, and k=100.

In [14]:
from mlxtend.plotting import plot_decision_regions

# Plotting decision regions
plot_decision_regions(X, y, clf=model, legend=2)

plt.title('K=3')
plt.show()
[Figure: decision regions of the KNN model, titled K=3]
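
To compare k=1, k=10, and k=100 numerically as well as visually, one option is to refit the classifier for each value and record the training and test accuracy. This is a sketch on the same synthetic blobs; the names `results` and `model_k` are illustrative. Each fitted `model_k` could also be passed to `plot_decision_regions` as `clf` to draw its boundary.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=1000, centers=2,
                  random_state=0, cluster_std=1.3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for k in (1, 10, 100):
    model_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data
    results[k] = (model_k.score(X_train, y_train),
                  model_k.score(X_test, y_test))
    print(f"k={k}: train={results[k][0]:.3f}, test={results[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect and the boundary is the most jagged; larger values of k average over more points and give a smoother boundary.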

Check your knowledge: predict the winner of the presidential election¶

10- For this task:

  • Read the datafile county_election.csv into a Pandas data frame.

  • Separate the target and the features into two variables, and create the response variable from the columns trump and clinton.

      ```python
      X=df[['minority','bachelor']]
      y=np.where(df.trump>df.clinton,1,0)
      ```
  • Split the dataset into 70% training and 30% testing, with random_state=0.

  • Use StandardScaler() to standardize the data: fit the scaler on the training set, then apply it to both the training and test sets.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler=scaler.fit(X_train)
    # standardization 
    X_train_scaler = scaler.transform(X_train) 
    X_test_scaler = scaler.transform(X_test)
    
  • Initialize a KNN classifier (name this variable clf) and fit it on the standardized training data with k=3.

  • Calculate the accuracy score on the training dataset.

  • Finally, compute the accuracy on the test dataset.
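
The scaling and fitting steps listed above can be sketched on synthetic data. The example below does not use county_election.csv: it generates two blobs as a stand-in and inflates the second feature by an arbitrary factor of 100, so that the effect of standardization is visible. Only the overall flow (fit the scaler on the training set, transform both sets, fit `clf`, score both sets) mirrors the task.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the election features (not the real county data)
X, y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=1.3)
X[:, 1] *= 100  # put the second feature on a much larger scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Learn the mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaler = scaler.transform(X_train)
X_test_scaler = scaler.transform(X_test)

# After scaling, each training column has mean close to 0 and std close to 1
print(X_train_scaler.mean(axis=0).round(6))
print(X_train_scaler.std(axis=0).round(6))

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaler, y_train)
print("Train accuracy:", clf.score(X_train_scaler, y_train))
print("Test accuracy:", clf.score(X_test_scaler, y_test))
```

Fitting the scaler on the training set alone keeps information about the test set out of the model, which is why the test set is only transformed.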

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: