K-means example

Sven Halvorson, 2019-07-24

I think I'm going to try to replicate this tutorial on the k-means algorithm in Python. This will give me some more Python practice as well as a more in-depth look at the algorithm.

K-means is an unsupervised algorithm (the user does not supply examples of correct classification) that tries to assign each observation to one of k groups. The number of groups, k, is a 'hyperparameter' that must be specified by the user. Each group has a centroid, which acts as the center of the group. The algorithm works in any number of dimensions, but it's easiest to visualize in two or three.

The author of the post states that k-means will do these two things:

  • Make the distances between the observations and their respective centroids as small as possible (sketched in code right after this list)
  • Make the distances between the centroids as large as possible
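
The first goal is usually written as the within-cluster sum of squared distances: each point's squared distance to its nearest centroid, summed over all points. That framing is mine rather than the post's; here is a minimal, self-contained sketch with made-up points and centroids:

import numpy as np

# Hypothetical toy example: five 2-D points and k = 2 candidate centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids = np.array([[1.0, 1.5], [8.5, 8.5]])

# Squared distance from every point to every centroid, shape (n_points, k)
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis = 2)

# Each point joins its nearest centroid; the objective adds up those nearest distances
assignments = sq_dists.argmin(axis = 1)
within_cluster_ss = sq_dists.min(axis = 1).sum()
print(assignments, within_cluster_ss)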

The algorithm works like this (condensed into a single function right after the list):

  • Assign groups to all observations at random
  • Pick initial centroids (here we'll just use the first three observations)
  • Iterate over these two steps until no changes are made:
    1. Assign each data point to the nearest centroid (expectation step)
    2. Create new centroids as the average x and y coordinates of every member of their group (maximization step)
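
Before building that out cell by cell, here is the whole loop condensed into one function. This is my own compact sketch (the names kmeans and points are just for illustration), not code from the post; it skips the random initial group assignment since the first expectation step overwrites it anyway:

import numpy as np

def kmeans(points, k, max_iter = 100):
    # points is an (n, 2) array; start the centroids at the first k observations
    centroids = points[:k].astype(float)
    labels = None
    for _ in range(max_iter):
        # Expectation step: distance from every point to every centroid, keep the nearest
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis = 2)
        new_labels = dists.argmin(axis = 1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed, so we're done
        labels = new_labels
        # Maximization step: move each centroid to the mean of its current members
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis = 0)
    return labels, centroids

Once the data below exists, it could be called as kmeans(samples[['x', 'y']].to_numpy(), 3) and should land on essentially the same centroids as the step-by-step version.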

Let's create some fake data. Here we're drawing from three bivariate uniform distributions. This will make the classification very obvious when it succeeds, but in practice it will never be this clean.

In [1]:
import numpy as np
import pandas as pd
import math, seaborn

# Seed numpy's generator, since np.random does all the sampling below
np.random.seed(30)
k = 3

# Sample 3 different uniform distributions (100 draws each, reshaped below into 50 (x, y) pairs):
samples = list(map(np.random.uniform, [1, 7, 20], [5, 12, 25], [100, 100, 100]))
for i in range(k):
    samples[i].shape = (50, 2)
samples = [pd.DataFrame(samp) for samp in samples]
samples = pd.concat(samples).reset_index(drop = True)
samples = samples.rename(columns = {0 : 'x', 1 : 'y'})
samples['type'] = 'point'

# Group names
_groups = ['A', 'B', 'C']
In [2]:
# Make the centroids the first three observations:
centroids = samples.loc[0:(k-1), :].copy()
centroids.loc[:, 'group'] = np.array(_groups)
centroids.loc[:,'type'] = 'centroid'

# Add random groups to all the observations
samples['group'] = np.random.choice(a = _groups, size = len(samples))

# Save a copy for later:
samples_copy = samples.copy()
In [3]:
centroids
Out[3]:
          x         y      type group
0  2.317466  3.741545  centroid     A
1  4.703789  4.168763  centroid     B
2  2.573627  3.891529  centroid     C
In [4]:
samples.head()
Out[4]:
          x         y   type group
0  2.317466  3.741545  point     B
1  4.703789  4.168763  point     C
2  2.573627  3.891529  point     C
3  2.004588  3.788092  point     C
4  4.666476  3.254417  point     B

We can take a peek at the data. I had to look up how to do multiple colors on the same plot and found the seaborn module. Seems really good:

In [5]:
def plot_iteration(pts, cents, title):
    dat = pd.concat([pts, cents], sort=False)
    # White edges for the points, black edges for the centroids so they stand out
    borders = ['#FFFFFF'] * len(pts) + ['#000000'] * len(cents)
    fg = seaborn.relplot(x = "x", edgecolor = borders, y = "y",
                         hue = "group", hue_order = _groups, data = dat,
                         style = 'type', aspect = 1.5, height = 5,
                         size = "type", sizes = (200, 40))
    fg.fig.suptitle(title)
    return(fg)

plot_iteration(samples, centroids, 'Initialization')
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x1e05d4d1748>

Now we'll define a function to get the distances between each point and each centroid.

In [6]:
def euclidean(x1, x2, y1, y2):
    return(math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2))

def get_distances(pts, cents):
    # Loop over the k centroids and compute each one's distance to every point
    dmat = []
    for i in range(k):
        distances = map(euclidean, [cents.iloc[i, 0] for r in range(len(pts))], pts.iloc[:, 0],
                        [cents.iloc[i, 1] for r in range(len(pts))], pts.iloc[:, 1])
        dmat.append(list(distances))
    return(dmat)
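
As an aside that isn't in the original post, the same distance matrix can be built without the explicit loop by letting numpy broadcast over both axes. A sketch that should return the same numbers as get_distances, just as a (k, n) array instead of a list of lists:

def get_distances_vectorized(pts, cents):
    # Broadcast the (k, 1, 2) centroids against the (1, n, 2) points to get a (k, n) distance matrix
    diffs = cents[['x', 'y']].to_numpy()[:, None, :] - pts[['x', 'y']].to_numpy()[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis = 2))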

Now let's do one iteration of the algorithm to see if we can get it going.

In [7]:
# Compute the distances to each centroid, then figure out which is closest to each point:
distances = get_distances(samples, centroids)
distances = np.array(distances)
nearest_centroids = np.array([_groups[i] for i in np.argmin(distances, axis = 0)])

# Assign new groups and plot
samples.loc[:, 'group'] = nearest_centroids
plot_iteration(samples, centroids, 'Expectation 1')
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x1e05dc2d088>

Next we compute the average x and y coordinate for each group and assign these as the new centroids.

In [8]:
# Group means come back in alphabetical order (A, B, C), matching the order of the centroid rows
new_centroids = samples.groupby('group').agg({'x': 'mean', 'y': 'mean'}).reset_index(drop = True)
centroids.iloc[0:k, 0:2] = new_centroids.iloc[0:k, 0:2]
plot_iteration(samples, centroids, 'Maximization 1')
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x1e05dd28e88>

Continue to run this until no changes are made:

In [9]:
iteration = 2
while True:
    # Get distances and nearest centroids
    distances = np.array(get_distances(samples, centroids))
    nearest_centroids = np.array([_groups[i] for i in np.argmin(distances, axis = 0)])

    # Stop once the assignments are identical to the previous iteration:
    if (nearest_centroids == samples['group']).all():
        break

    # Display the expectation step
    samples.loc[:, 'group'] = nearest_centroids
    plot_iteration(samples, centroids, 'Expectation ' + str(iteration))

    # Display the maximization step
    new_centroids = samples.groupby('group').agg({'x': 'mean', 'y': 'mean'}).reset_index(drop = True)
    centroids.iloc[0:k, 0:2] = new_centroids.iloc[0:k, 0:2]
    plot_iteration(samples, centroids, 'Maximization ' + str(iteration))
    iteration += 1

Voila!

From my perspective, it's more intuitive to say the goal of the algorithm is to find centroids that fit the groups well, rather than centroids that are as far apart as possible, although it does accomplish that too.
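
One way to put a number on that 'fit' (my addition, not something from the post) is the total distance from every point to its nearest, i.e. assigned, centroid, which each iteration above has been shrinking:

total_within = np.array(get_distances(samples, centroids)).min(axis = 0).sum()
total_within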

How close did we get to the centers of the bivariate distributions? The distributions we drew from had means of 3, 9.5, and 22.5.

In [10]:
centroids
Out[10]:
           x          y      type group
0   3.154379   3.210471  centroid     A
1  22.835079  22.607440  centroid     B
2   9.709459   9.419262  centroid     C

Very nice!

The last question is: how do we do this with a module?

In [11]:
from sklearn.cluster import KMeans
from_module = KMeans(n_clusters = 3, random_state = 0).fit_predict(samples_copy[['x', 'y']].to_numpy())
from_module
Out[11]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

So we can see that it put the first 50 observations in one cluster (because they're the smallest in both x and y), followed by the medium and large clusters. We also get the same number of observations in each group:

In [12]:
pd.DataFrame(from_module)[0].value_counts()
Out[12]:
2    50
1    50
0    50
Name: 0, dtype: int64
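
As a final sanity check that isn't in the original write-up, we can cross-tabulate sklearn's integer labels against the letter groups our hand-rolled loop settled on; the clusters should match exactly, just under different names:

# Rows are sklearn's integer labels, columns are the groups from our own loop
pd.crosstab(from_module, samples['group'])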