K-means clustering is an unsupervised machine learning algorithm that groups data points into k clusters.
What are common uses of k-means clustering?
There are a few common uses for k-means clustering:
- As a preprocessing step for other algorithms
- To simplify data for further analysis
- To find unusual data points
- To group similar data points together
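One of the uses above, finding unusual data points, works by flagging points that lie far from every cluster centroid. A minimal sketch of that idea; the centroids, points, and threshold here are made up for illustration:

```python
import math

# made-up centroids and data for illustration
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.1, 0.2), (9.8, 10.1), (5.0, 5.0)]

def nearest_centroid_distance(p, centroids):
    # distance from p to its closest centroid
    return min(math.dist(p, c) for c in centroids)

# flag points whose nearest-centroid distance exceeds a chosen threshold
threshold = 2.0
outliers = [p for p in points if nearest_centroid_distance(p, centroids) > threshold]
# (5.0, 5.0) sits far from both centroids, so it is flagged
```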
Features of k-means clustering:
- K-means clustering is a method of vector quantization that can be used for cluster analysis in data mining.
- K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
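The nearest-mean assignment in the second bullet can be sketched with NumPy; the points and centroids below are made up for illustration:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 10.0]])
centroids = np.array([[0.0, 0.5], [9.5, 10.0]])

# pairwise distances: entry (i, j) is the distance from point i to centroid j
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
# each observation belongs to the cluster with the nearest mean
labels = np.argmin(distances, axis=1)
# the first two points fall in cluster 0, the last two in cluster 1
```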
Examples of k-means clustering:
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
import numpy as np

def k_mean_clustering(X, k, iterations=1000):
    """
    X: numpy.ndarray of shape (n, d) containing the dataset
       n: number of data points
       d: number of dimensions for each data point
    k: number of clusters
    iterations: maximum number of iterations to execute
    Returns: a numpy.ndarray of shape (k, d) containing the centroids
    """
    if not isinstance(X, np.ndarray):
        raise TypeError("X must be a numpy.ndarray")
    if X.ndim != 2:
        raise TypeError("X must be a 2D numpy.ndarray")
    if not isinstance(k, int):
        raise TypeError("k must be an integer")
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not isinstance(iterations, int):
        raise TypeError("iterations must be an integer")
    if iterations <= 0:
        raise ValueError("iterations must be a positive integer")
    # initialize the centroids with k distinct data points
    centroids = X[np.random.choice(X.shape[0], k, replace=False)].astype(float)
    for _ in range(iterations):
        # distances[j, i] is the distance from point i to centroid j
        distances = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
        # assign each point to its nearest centroid
        labels = np.argmin(distances, axis=0)
        # move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids
Or, in pure Python:
The helpers euclidean_distance, compute_centroids, and converged were left undefined in the original sketch, so minimal versions are filled in here:

import math
import random

def euclidean_distance(a, b):
    # straight-line distance between two points
    return math.dist(a, b)

def compute_centroids(clusters, old_centroids):
    # per-dimension mean of each cluster; an empty cluster keeps its old centroid
    return [
        tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
        for cluster, old in zip(clusters, old_centroids)
    ]

def converged(old_centroids, new_centroids, tolerance):
    # converged once no centroid moves farther than the tolerance
    if old_centroids is None:
        return False
    return all(math.dist(o, n) <= tolerance
               for o, n in zip(old_centroids, new_centroids))

def k_means_clustering(X, k, tolerance=0.0001, max_iterations=500):
    # initialize the centroids with k distinct observations
    centroids = random.sample(list(X), k)
    old_centroids = None
    for _ in range(max_iterations):
        # reset the clusters at the start of every pass
        clusters = [[] for _ in range(k)]
        # assign each observation to its closest centroid
        for x in X:
            min_distance = float('inf')
            closest_cluster = None
            for j in range(k):
                distance = euclidean_distance(x, centroids[j])
                if distance < min_distance:
                    min_distance = distance
                    closest_cluster = j
            clusters[closest_cluster].append(x)
        # recompute the centroids from the new assignment
        new_centroids = compute_centroids(clusters, centroids)
        # stop once the centroids have converged
        if converged(old_centroids, new_centroids, tolerance):
            centroids = new_centroids
            break
        old_centroids = centroids = new_centroids
    # return the clusters and the final centroids
    return clusters, centroids
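The centroid update between iterations is just a per-dimension mean over each cluster's members. A minimal illustration of that step; the `cluster_means` name and the points are made up:

```python
def cluster_means(clusters):
    # per-dimension mean of each cluster's points
    return [
        tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
        for cluster in clusters
    ]

clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0)]]
means = cluster_means(clusters)
# → [(0.0, 1.0), (10.0, 10.0)]
```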
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
import math

def voronoi_cells(points, centroids):
    # partition the points into the Voronoi cells induced by the centroids:
    # cell j holds every point whose nearest centroid is centroids[j]
    cells = [[] for _ in range(len(centroids))]
    for point in points:
        # find the centroid closest to this point
        closest = 0
        closest_distance = float('inf')
        for j, centroid in enumerate(centroids):
            distance = math.dist(point, centroid)
            if distance < closest_distance:
                closest = j
                closest_distance = distance
        # the point belongs to the cell of its nearest centroid
        cells[closest].append(point)
    # return one cell (list of points) per centroid
    return cells
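As a small self-contained check of the Voronoi view (the values are made up): in one dimension, the boundary between the cells of two cluster means a and b sits at their midpoint, so points below (a + b) / 2 belong to a's cell and points above it to b's.

```python
# two 1-D cluster means, made up for illustration
a, b = 2.0, 8.0

def cell(x):
    # a point belongs to the cell of its nearest mean
    return 'a' if abs(x - a) <= abs(x - b) else 'b'

# the boundary between the two cells is the midpoint (a + b) / 2 = 5.0
assignments = [cell(x) for x in [1.0, 4.9, 5.1, 9.0]]
# → ['a', 'a', 'b', 'b']
```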