The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems.
The algorithm is based on the principle that similar data points (i.e., points that are nearby in feature space) tend to have similar labels (i.e., they tend to belong to the same class). The KNN algorithm therefore predicts the label of a new data point by looking at the labels of the data points nearest to it.
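For instance, here is a minimal sketch of that idea using scikit-learn's KNeighborsClassifier on made-up toy data (the values are purely illustrative):

from sklearn.neighbors import KNeighborsClassifier

# two tight clusters of toy points, one per class
X_train = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]]
y_train = [0, 0, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)          # "fitting" just stores the training data
print(clf.predict([[1.1, 1.0]]))   # nearest points are class 0 -> [0]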
Features of the KNN algorithm:
- KNN is a simple, easy-to-understand algorithm.
- KNN can be used for both classification and regression problems.
- KNN is a versatile algorithm that can be applied to data of varying dimensionality.
- KNN is a non-parametric algorithm, which means it makes no assumptions about the underlying data distribution.
- KNN is an instance-based learning algorithm, which means it memorizes the training instances rather than learning an explicit model from them.
- KNN is a lazy algorithm, which means it has no separate training phase; all computation is deferred to prediction time.
- KNN is a distance-based algorithm, which means it uses a distance metric to measure how far apart two data points are (see the sketch after this list).
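As the last point notes, KNN depends on a distance metric; here is a minimal sketch comparing Euclidean and Manhattan distance with NumPy (toy vectors):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean (L2): straight-line distance
print(np.linalg.norm(a - b))     # sqrt(9 + 4 + 0) ≈ 3.61

# Manhattan (L1): sum of absolute coordinate differences
print(np.sum(np.abs(a - b)))     # 3 + 2 + 0 = 5.0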
Examples of the KNN algorithm:
1. KNN can be used for classification, regression, and outlier detection. The function below performs majority-vote classification for a single test point:
import numpy as np

def knn(k, X_train, y_train, X_test):
    # calculate the distance between the test point and all the training points
    distances = []
    for i in range(len(X_train)):
        dist = np.linalg.norm(X_test - X_train[i])
        distances.append((dist, y_train[i]))
    # sort the (distance, label) pairs by distance
    distances.sort(key=lambda x: x[0])
    # collect the labels of the k nearest neighbors
    neighbors = [distances[i][1] for i in range(k)]
    # return the most common class among the neighbors
    # (np.bincount assumes the labels are non-negative integers)
    counts = np.bincount(neighbors)
    return np.argmax(counts)
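The function above covers classification only. As a rough sketch of the regression case (not part of the original example), one can average the neighbors' target values instead of voting:

import numpy as np

def knn_regress(k, X_train, y_train, X_test):
    # distances from the test point to every training point
    distances = np.array([np.linalg.norm(X_test - x) for x in X_train])
    # indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # predict the mean of the k nearest target values
    return np.mean(np.asarray(y_train)[nearest])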
2. It can be used to predict the creditworthiness of customers from financial data.
def euclidean_distance(a, b):
    # straight-line (L2) distance between two feature vectors
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def knn(k, X_train, y_train, X_test):
    # calculate the distance between the test point and all the training points
    distances = []
    for i in range(len(X_train)):
        dist = euclidean_distance(X_test, X_train[i])
        distances.append((X_train[i], y_train[i], dist))
    # sort the tuples by distance (the third element)
    distances = sorted(distances, key=lambda x: x[2])
    # collect the labels of the k nearest neighbors
    neighbors = [distances[i][1] for i in range(k)]
    # return the most common class among the neighbors
    return max(set(neighbors), key=neighbors.count)
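A quick usage sketch with made-up financial features (annual income and number of past defaults; all values are hypothetical):

import numpy as np

# columns: annual income, number of past defaults (hypothetical features)
X_train = np.array([[30_000, 2], [32_000, 3], [90_000, 0], [95_000, 0]])
y_train = ["risky", "risky", "good", "good"]
print(knn(3, X_train, y_train, np.array([88_000, 0])))  # -> "good"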
3. It can be used to identify customer segments from demographic data; the variant below returns the labels of the k most similar customers:
def knn(k, X_train, y_train, X_test):
    # create lists for distances and targets
    distances = []
    targets = []
    # loop over the rows of X_train
    for ix in range(len(X_train)):
        # get the current training row
        row = X_train[ix, :]
        # compute the Euclidean distance between the row and the test point
        distance = np.sqrt(np.sum((row - X_test) ** 2))
        # add the distance and target to the lists
        distances.append(distance)
        targets.append(y_train[ix])
    # indices of the training points, sorted by distance
    indices = np.argsort(distances)
    # collect the targets of the k nearest neighbors
    KNN_target = [targets[i] for i in indices[:k]]
    # return the k nearest targets as an array
    return np.array(KNN_target)
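Unlike the earlier examples, this variant returns the raw labels of the k nearest neighbors rather than a single prediction, so the caller decides how to aggregate them; a hypothetical usage with toy demographic rows (age, income):

from collections import Counter
import numpy as np

X_train = np.array([[25, 40_000], [27, 42_000], [60, 80_000], [62, 85_000]])
y_train = np.array(["young-urban", "young-urban", "senior", "senior"])
labels = knn(3, X_train, y_train, np.array([26, 41_000]))
print(Counter(labels).most_common(1)[0][0])  # majority segment -> "young-urban"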
4. It can be used to detect fraudulent activities in transaction data. The variant below computes all distances at once with NumPy broadcasting:
def knn(k, X_train, y_train, X_test):
    # compute the distance from the test point to every training point
    # in one vectorized expression (X_train is an array of rows)
    distances = np.sqrt(np.sum((X_train - X_test) ** 2, axis=1))
    # indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # collect the labels of the k nearest neighbors
    neighbors = [y_train[i] for i in nearest]
    # return the most common class among the neighbors
    return max(set(neighbors), key=neighbors.count)
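Because transaction datasets are typically large, this last variant computes all distances in a single vectorized NumPy expression instead of a Python loop. For truly large datasets, brute-force distance computation becomes the bottleneck, and tree-based or approximate nearest-neighbor indexes (such as scikit-learn's KDTree) are the usual next step.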
Conclusion
The algorithm is relatively simple and can be applied to a wide variety of data sets. However, it is sensitive to the choice of k and to the scale of the features, so a data set that is not carefully preprocessed (for example, one with features on very different numeric ranges) can lead to poor or overfit predictions.
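One common preprocessing step is feature scaling, because KNN's distances are otherwise dominated by the features with the largest numeric range. A minimal sketch with scikit-learn (hypothetical income/age data):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# toy data: income (large scale) and age (small scale); values are made up
X_train = np.array([[30_000, 25], [32_000, 27], [90_000, 60], [95_000, 62]])
y_train = np.array([0, 0, 1, 1])

scaler = StandardScaler().fit(X_train)  # learn per-feature mean and std
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(scaler.transform(X_train), y_train)
print(clf.predict(scaler.transform([[33_000, 26]])))  # -> [0]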