K-means clustering is an unsupervised machine learning algorithm that groups data points into k clusters.
What are common uses of k-means clustering?
There are a few common uses for k-means clustering:
- As a preprocessing step for other algorithms
- To simplify data for further analysis
- To find unusual data points
- To group similar data points together
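One of the uses above, finding unusual data points, works by flagging points that lie far from every cluster centroid. A minimal sketch of that idea; the centroids, points, and threshold here are made up for illustration:

```python
import math

# made-up centroids and data for illustration
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.1, 0.2), (9.8, 10.1), (5.0, 5.0)]

def nearest_centroid_distance(p, centroids):
    # distance from p to its closest centroid
    return min(math.dist(p, c) for c in centroids)

# flag points whose nearest-centroid distance exceeds a chosen threshold
threshold = 2.0
outliers = [p for p in points if nearest_centroid_distance(p, centroids) > threshold]
# (5.0, 5.0) sits far from both centroids, so it is flagged
```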
Features of k-means clustering:
- K-means clustering is a method of vector quantization that can be used for cluster analysis in data mining.
- K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
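The nearest-mean assignment in the second bullet can be sketched with NumPy; the points and centroids below are made up for illustration:

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 10.0]])
centroids = np.array([[0.0, 0.5], [9.5, 10.0]])

# pairwise distances: entry (i, j) is the distance from point i to centroid j
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
# each observation belongs to the cluster with the nearest mean
labels = np.argmin(distances, axis=1)
# the first two points fall in cluster 0, the last two in cluster 1
```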
Examples of k-means clustering:
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
import numpy as np

def k_mean_clustering(X, k, iterations=1000):
    """
    X: numpy.ndarray of shape (n, d) containing the dataset
       n: number of data points
       d: number of dimensions for each data point
    k: number of clusters
    iterations: maximum number of iterations to execute
    Returns: a numpy.ndarray of shape (k, d) containing the centroids
    """
    if not isinstance(X, np.ndarray):
        raise TypeError("X must be a numpy.ndarray")
    if X.ndim != 2:
        raise TypeError("X must be a 2D numpy.ndarray")
    if not isinstance(k, int):
        raise TypeError("k must be an integer")
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not isinstance(iterations, int):
        raise TypeError("iterations must be an integer")
    if iterations <= 0:
        raise ValueError("iterations must be a positive integer")
    # initialize the centroids with k distinct data points
    centroids = X[np.random.choice(X.shape[0], k, replace=False)].astype(float)
    for _ in range(iterations):
        # distances[j, i] is the distance from point i to centroid j
        distances = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
        # assign each point to its nearest centroid
        labels = np.argmin(distances, axis=0)
        # move each centroid to the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids
Or, in pure Python:
The helpers euclidean_distance, compute_centroids, and converged were left undefined in the original sketch, so minimal versions are filled in here:

import math
import random

def euclidean_distance(a, b):
    # straight-line distance between two points
    return math.dist(a, b)

def compute_centroids(clusters, old_centroids):
    # per-dimension mean of each cluster; an empty cluster keeps its old centroid
    return [
        tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
        for cluster, old in zip(clusters, old_centroids)
    ]

def converged(old_centroids, new_centroids, tolerance):
    # converged once no centroid moves farther than the tolerance
    if old_centroids is None:
        return False
    return all(math.dist(o, n) <= tolerance
               for o, n in zip(old_centroids, new_centroids))

def k_means_clustering(X, k, tolerance=0.0001, max_iterations=500):
    # initialize the centroids with k distinct observations
    centroids = random.sample(list(X), k)
    old_centroids = None
    for _ in range(max_iterations):
        # reset the clusters at the start of every pass
        clusters = [[] for _ in range(k)]
        # assign each observation to its closest centroid
        for x in X:
            min_distance = float('inf')
            closest_cluster = None
            for j in range(k):
                distance = euclidean_distance(x, centroids[j])
                if distance < min_distance:
                    min_distance = distance
                    closest_cluster = j
            clusters[closest_cluster].append(x)
        # recompute the centroids from the new assignment
        new_centroids = compute_centroids(clusters, centroids)
        # stop once the centroids have converged
        if converged(old_centroids, new_centroids, tolerance):
            centroids = new_centroids
            break
        old_centroids = centroids = new_centroids
    # return the clusters and the final centroids
    return clusters, centroids
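The centroid update between iterations is just a per-dimension mean over each cluster's members. A minimal illustration of that step; the `cluster_means` name and the points are made up:

```python
def cluster_means(clusters):
    # per-dimension mean of each cluster's points
    return [
        tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
        for cluster in clusters
    ]

clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0)]]
means = cluster_means(clusters)
# → [(0.0, 1.0), (10.0, 10.0)]
```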
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
import math

def voronoi_cells(points, centroids):
    # partition the points into the Voronoi cells induced by the centroids:
    # cell j holds every point whose nearest centroid is centroids[j]
    cells = [[] for _ in range(len(centroids))]
    for point in points:
        # find the centroid closest to this point
        closest = 0
        closest_distance = float('inf')
        for j, centroid in enumerate(centroids):
            distance = math.dist(point, centroid)
            if distance < closest_distance:
                closest = j
                closest_distance = distance
        # the point belongs to the cell of its nearest centroid
        cells[closest].append(point)
    # return one cell (list of points) per centroid
    return cells
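As a small self-contained check of the Voronoi view (the values are made up): in one dimension, the boundary between the cells of two cluster means a and b sits at their midpoint, so points below (a + b) / 2 belong to a's cell and points above it to b's.

```python
# two 1-D cluster means, made up for illustration
a, b = 2.0, 8.0

def cell(x):
    # a point belongs to the cell of its nearest mean
    return 'a' if abs(x - a) <= abs(x - b) else 'b'

# the boundary between the two cells is the midpoint (a + b) / 2 = 5.0
assignments = [cell(x) for x in [1.0, 4.9, 5.1, 9.0]]
# → ['a', 'a', 'b', 'b']
```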