import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
Generate Your Sample Dataset
X, y = make_blobs(n_samples=1000, centers=2,
                  random_state=0, cluster_std=1.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='bwr')
plt.colorbar()
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
1- Calculate Euclidean distance: write a function that returns the Euclidean distance between two vectors.
These cells are intentionally left blank for you to practice.
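# Calculate the Euclidean distance between two rows.
# Note: the loop stops at len(row1)-1, so the last element of each row is
# skipped. This assumes the last column holds the class label; for rows
# without a label column (such as the rows of X), x2 is ignored.
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return (distance)**0.5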
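As a cross-check on the distance computation, here is a minimal NumPy sketch of the same idea. Unlike the loop above, it uses every element of both vectors, so it assumes the rows carry no label column. The name `euclidean_distance_np` is made up for this illustration.

```python
import numpy as np

def euclidean_distance_np(a, b):
    """Euclidean distance over ALL elements of a and b (no label column assumed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the element-wise differences, sum them, take the square root.
    return float(np.sqrt(np.sum((a - b) ** 2)))

# A 3-4-5 right triangle makes the result easy to verify by hand.
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
d = euclidean_distance_np(p, q)
print(d)  # 5.0
```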
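2- Let's test the distance function.
# Compare two individual rows; passing the whole array X as the second
# argument would index rows of X instead of the features of a single row.
distance = euclidean_distance(X[0], X[1])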
3- Run the function get_neighbors.
# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    # Sort by distance so the closest rows come first
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors
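The neighbor search can also be sketched with vectorized NumPy operations instead of a Python loop. This is only an illustration under stated assumptions: `get_neighbors_np` is a hypothetical name, it measures distance over all columns (no label column), and it returns the distances alongside the rows.

```python
import numpy as np

def get_neighbors_np(train, test_row, num_neighbors):
    """Return the num_neighbors closest rows of train and their distances."""
    train = np.asarray(train, dtype=float)
    test_row = np.asarray(test_row, dtype=float)
    # Broadcasting subtracts test_row from every row of train at once.
    dists = np.sqrt(((train - test_row) ** 2).sum(axis=1))
    # argsort gives the row indices ordered from nearest to farthest.
    order = np.argsort(dists, kind="stable")[:num_neighbors]
    return train[order], dists[order]

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
nbrs, nbr_dists = get_neighbors_np(points, points[0], 3)
print(nbrs)
print(nbr_dists)
```

When the query row is itself part of the training data, as here, it comes back as its own nearest neighbor at distance 0.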
4- For this task, use the function get_neighbors to find the 3 records in the dataset most similar to the first record, in order of similarity.
These cells are intentionally left blank for you to practice.
neighbors = get_neighbors(X, X[0], 3)
neighbors
[array([0.31372224, 3.73429074]), array([0.3122234 , 2.76896549]), array([0.3164972 , 2.93634319])]
5- Use the following function to make predictions using 3 neighbors for the first row.
# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    # The class label is stored in the last column of each row
    output_values = [row[-1] for row in neighbors]
    # Majority vote: pick the label that occurs most often
    prediction = max(set(output_values), key=output_values.count)
    return prediction
# Append the labels y as the last column so each row is [x1, x2, label]
Z = np.concatenate([X, y.reshape(-1, 1)], axis=1)
prediction = predict_classification(Z, Z[0], 3)
print('Expected %d, Got %d.' % (y[0], prediction))
Expected 0, Got 0.
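The majority vote inside predict_classification can be looked at in isolation. The sketch below uses `collections.Counter` from the standard library as one possible way to write it; the helper name `majority_vote` is made up for this example and is not part of the notebook.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label that appears most often in labels."""
    counts = Counter(labels)
    # most_common(1) returns a list with one (label, count) pair.
    return counts.most_common(1)[0][0]

# Labels of three hypothetical neighbors: one of class 0, two of class 1.
vote = majority_vote([0, 1, 1])
print(vote)  # 1
```

With two classes, an odd number of neighbors such as k=3 cannot produce a tie, which is one reason odd values of k are a common choice for binary problems.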
6- Split the dataset into a training set and a test set.
from sklearn.model_selection import train_test_split
# Split the dataset into the training set and test set (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
7- Let's build a KNN classifier using sklearn. First, import KNeighborsClassifier and create a classifier with n_neighbors=3. Then fit the model on the training set using fit() and use it to make predictions.
These cells are intentionally left blank for you to practice.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
# Train the model using the training sets
model.fit(X_train,y_train)
KNeighborsClassifier(n_neighbors=3)
8- Evaluate the KNN model on the test dataset.
from sklearn import metrics
# Predict the output for the test set
predicted = model.predict(X_test)
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, predicted))
Accuracy: 0.9033333333333333
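Accuracy is a single number, so it can help to see which kinds of errors sit behind it. The sketch below applies `confusion_matrix` from sklearn.metrics to a tiny hand-made example. The label arrays are invented for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels, only to show how the two metrics relate.
y_true_toy = np.array([0, 0, 1, 1, 1])
y_pred_toy = np.array([0, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

print(cm)
print(acc)
# Accuracy equals the diagonal (correct predictions) divided by the total.
print(np.trace(cm) / cm.sum())
```

The same two calls can be applied to y_test and predicted from the step above.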
9- Decision boundaries.
In the code below we plot the decision boundary of the previously trained model with k=3.
Let's practice using k=1, k=10 and k=100.
from mlxtend.plotting import plot_decision_regions
# Plotting decision regions
plot_decision_regions(X, y, clf=model, legend=2)
plt.title('K=3')
plt.show()
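Alongside the plots, the effect of k can be compared numerically. The following sketch regenerates the same blobs and split as earlier in the notebook and records training and test accuracy for k=1, 10 and 100. It is one possible way to organize the comparison, and the `results` dictionary is a name introduced here.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split parameters as used earlier in the notebook.
X, y = make_blobs(n_samples=1000, centers=2, random_state=0, cluster_std=1.3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k}: train={results[k][0]:.3f}, test={results[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect by construction. The gap between training and test accuracy is what to watch as k grows.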
Check your knowledge: predict the winner of the presidential election
10- For this task:
Read the data file county_election.csv into a pandas DataFrame. Separate the features and the target into two variables, and create the response variable from the columns trump and clinton.
```python
X = df[['minority', 'bachelor']]
y = np.where(df.trump > df.clinton, 1, 0)
```
Split the dataset into 70% training and 30% testing with random_state=0.
Use StandardScaler() to standardize the data, fitting the scaler on the training set only.
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(X_train)
# standardization
X_train_scaler = scaler.transform(X_train)
X_test_scaler = scaler.transform(X_test)
```
Initialize a KNN classifier (name this variable clf) with k=3 and fit it on the standardized training data.
Calculate the accuracy score on the training set.
Finally, compute the accuracy on the test set.
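The steps of this task can be sketched end to end as follows. Because county_election.csv is not available here, the sketch builds a small synthetic DataFrame with the same column names (minority, bachelor, trump, clinton). The values are random placeholders, so the printed accuracies say nothing about the real data. Wrapping the scaler and the classifier in a pipeline is an assumption of this sketch, chosen so that the scaler is fitted on the training set only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for county_election.csv (random values, NOT real data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "minority": rng.uniform(0, 100, n),
    "bachelor": rng.uniform(5, 60, n),
    "trump": rng.integers(100, 10000, n),
    "clinton": rng.integers(100, 10000, n),
})

X = df[["minority", "bachelor"]]
y = np.where(df.trump > df.clinton, 1, 0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline standardizes with training statistics, then applies KNN (k=3).
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

To run this on the real data, replace the synthetic DataFrame with the result of reading county_election.csv; the remaining steps stay the same.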