K-means example

Sven Halvorson, 2019-07-24

I think I'm going to try to replicate this tutorial on the k-means algorithm in Python. This will give me some more Python practice as well as a more in-depth look at the algorithm.

K-means is an unsupervised algorithm (the user does not supply examples of correct classification) that tries to assign each observation to one of k groups. The number of groups, k, is a 'hyperparameter' that must be specified by the user. Each group has a centroid, which acts as the center of the group. The algorithm works in any number of dimensions, but it's easiest to visualize in two or three.

The author of the post states that k-means will do these two things:

  • Make the distances between the observations and their respective centroids as small as possible (sketched in code right after this list)
  • Make the distances between the centroids as large as possible
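
The first goal is usually written as the within-cluster sum of squared distances: each point's squared distance to its nearest centroid, summed over all points. That framing is mine rather than the post's; here is a minimal, self-contained sketch with made-up points and centroids:

import numpy as np

# Hypothetical toy example: five 2-D points and k = 2 candidate centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids = np.array([[1.0, 1.5], [8.5, 8.5]])

# Squared distance from every point to every centroid, shape (n_points, k)
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis = 2)

# Each point joins its nearest centroid; the objective adds up those nearest distances
assignments = sq_dists.argmin(axis = 1)
within_cluster_ss = sq_dists.min(axis = 1).sum()
print(assignments, within_cluster_ss)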

The algorithm works like this (condensed into a single function right after the list):

  • Assign groups to all observations at random
  • Pick initial centroids (here we'll just use the first three observations)
  • Iterate over these two steps until no changes are made:
    1. Assign each data point to the nearest centroid (expectation step)
    2. Create new centroids as the average x and y coordinates of every member of their group (maximization step)
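
Before building that out cell by cell, here is the whole loop condensed into one function. This is my own compact sketch (the names kmeans and points are just for illustration), not code from the post; it skips the random initial group assignment since the first expectation step overwrites it anyway:

import numpy as np

def kmeans(points, k, max_iter = 100):
    # points is an (n, 2) array; start the centroids at the first k observations
    centroids = points[:k].astype(float)
    labels = None
    for _ in range(max_iter):
        # Expectation step: distance from every point to every centroid, keep the nearest
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis = 2)
        new_labels = dists.argmin(axis = 1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed, so we're done
        labels = new_labels
        # Maximization step: move each centroid to the mean of its current members
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis = 0)
    return labels, centroids

Once the data below exists, it could be called as kmeans(samples[['x', 'y']].to_numpy(), 3) and should land on essentially the same centroids as the step-by-step version.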

Let's create some fake data. Here we're drawing from three bivariate uniform distributions. This will make the classification very obvious when it succeeds, but in practice it will never be this clean.

In [1]:
import numpy as np
import pandas as pd
import math, seaborn

# Seed numpy's generator, since np.random does all the sampling below
np.random.seed(30)
k = 3

# Sample 3 different uniform distributions (100 draws each, reshaped below into 50 (x, y) pairs):
samples = list(map(np.random.uniform, [1, 7, 20], [5, 12, 25], [100, 100, 100]))
for i in range(k):
    samples[i].shape = (50, 2)
samples = [pd.DataFrame(samp) for samp in samples]
samples = pd.concat(samples).reset_index(drop = True)
samples = samples.rename(columns = {0 : 'x', 1 : 'y'})
samples['type'] = 'point'

# Group names
_groups = ['A', 'B', 'C']
In [2]:
# Make the centroids the first three observations:
centroids = samples.loc[0:(k-1), :].copy()
centroids.loc[:, 'group'] = np.array(_groups)
centroids.loc[:,'type'] = 'centroid'

# Add random groups to all the observations
samples['group'] = np.random.choice(a = _groups, size = len(samples))

# Save a copy for later:
samples_copy = samples.copy()
In [3]:
centroids
Out[3]:
          x         y      type group
0  2.317466  3.741545  centroid     A
1  4.703789  4.168763  centroid     B
2  2.573627  3.891529  centroid     C
In [4]:
samples.head()
Out[4]:
          x         y   type group
0  2.317466  3.741545  point     B
1  4.703789  4.168763  point     C
2  2.573627  3.891529  point     C
3  2.004588  3.788092  point     C
4  4.666476  3.254417  point     B

We can take a peek at the data. I had to look up how to do multiple colors on the same plot and found the seaborn module. Seems really good:

In [5]:
def plot_iteration(pts, cents, title):
    dat = pd.concat([pts, cents], sort=False)
    # White edges for the points, black edges for the centroids so they stand out
    borders = ['#FFFFFF'] * len(pts) + ['#000000'] * len(cents)
    fg = seaborn.relplot(x = "x", edgecolor = borders, y = "y",
                         hue = "group", hue_order = _groups, data = dat,
                         style = 'type', aspect = 1.5, height = 5,
                         size = "type", sizes = (200, 40))
    fg.fig.suptitle(title)
    return(fg)

plot_iteration(samples, centroids, 'Initialization')
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x1e05d4d1748>

Now we'll define a function to get the distances between each point and each centroid.

In [6]:
def euclidean(x1, x2, y1, y2):
    return(math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2))

def get_distances(pts, cents):
    # Loop over the k centroids and compute each one's distance to every point
    dmat = []
    for i in range(k):
        distances = map(euclidean, [cents.iloc[i, 0] for r in range(len(pts))], pts.iloc[:, 0],
                        [cents.iloc[i, 1] for r in range(len(pts))], pts.iloc[:, 1])
        dmat.append(list(distances))
    return(dmat)
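
As an aside that isn't in the original post, the same distance matrix can be built without the explicit loop by letting numpy broadcast over both axes. A sketch that should return the same numbers as get_distances, just as a (k, n) array instead of a list of lists:

def get_distances_vectorized(pts, cents):
    # Broadcast the (k, 1, 2) centroids against the (1, n, 2) points to get a (k, n) distance matrix
    diffs = cents[['x', 'y']].to_numpy()[:, None, :] - pts[['x', 'y']].to_numpy()[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis = 2))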

Now let's do one iteration of the algorithm to see if we can get it going.

In [7]:
# Compute the distances to each centroid, then figure out which is closest to each point:
distances = get_distances(samples, centroids)
distances = np.array(distances)
nearest_centroids = np.array([_groups[i] for i in np.argmin(distances, axis = 0)])

# Assign new groups and plot
samples.loc[:, 'group'] = nearest_centroids
plot_iteration(samples, centroids, 'Expectation 1')
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x1e05dc2d088>

Next we compute the average x and y coordinate for each group and assign these as the new centroids.

In [8]:
# Group means come back in alphabetical order (A, B, C), matching the order of the centroid rows
new_centroids = samples.groupby('group').agg({'x': 'mean', 'y': 'mean'}).reset_index(drop = True)
centroids.iloc[0:k, 0:2] = new_centroids.iloc[0:k, 0:2]
plot_iteration(samples, centroids, 'Maximization 1')
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x1e05dd28e88>

Continue to run this until no changes are made:

In [9]:
iteration = 2
while True:
    # Get distances and nearest centroids
    distances = np.array(get_distances(samples, centroids))
    nearest_centroids = np.array([_groups[i] for i in np.argmin(distances, axis = 0)])

    # Stop once the assignments are identical to the previous iteration:
    if (nearest_centroids == samples['group']).all():
        break

    # Display the expectation step
    samples.loc[:, 'group'] = nearest_centroids
    plot_iteration(samples, centroids, 'Expectation ' + str(iteration))

    # Display the maximization step
    new_centroids = samples.groupby('group').agg({'x': 'mean', 'y': 'mean'}).reset_index(drop = True)
    centroids.iloc[0:k, 0:2] = new_centroids.iloc[0:k, 0:2]
    plot_iteration(samples, centroids, 'Maximization ' + str(iteration))
    iteration += 1

Voila!

From my perspective, it's more intuitive to say the goal of the algorithm is to find centroids that fit the groups well, rather than centroids that are as far apart as possible, although it does accomplish that too.
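
One way to put a number on that 'fit' (my addition, not something from the post) is the total distance from every point to its nearest, i.e. assigned, centroid, which each iteration above has been shrinking:

total_within = np.array(get_distances(samples, centroids)).min(axis = 0).sum()
total_within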

How close did we get to the centers of the bivariate distributions? The distributions we drew from had means of 3, 9.5, and 22.5.

In [10]:
centroids
Out[10]:
           x          y      type group
0   3.154379   3.210471  centroid     A
1  22.835079  22.607440  centroid     B
2   9.709459   9.419262  centroid     C

Very nice!

The last question is: how do we do this with a module?

In [11]:
from sklearn.cluster import KMeans
from_module = KMeans(n_clusters = 3, random_state = 0).fit_predict(samples_copy[['x', 'y']].to_numpy())
from_module
Out[11]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

So we can see that it put the first 50 observations in one cluster (because they're the smallest in both x and y), followed by the medium and large clusters. We also get the same number of observations in each group:

In [12]:
pd.DataFrame(from_module)[0].value_counts()
Out[12]:
2    50
1    50
0    50
Name: 0, dtype: int64
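
As a final sanity check that isn't in the original write-up, we can cross-tabulate sklearn's integer labels against the letter groups our hand-rolled loop settled on; the clusters should match exactly, just under different names:

# Rows are sklearn's integer labels, columns are the groups from our own loop
pd.crosstab(from_module, samples['group'])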