I think I'm going to try to replicate this tutorial on the k-means algorithm in Python. This will give me some more Python practice as well as a more in-depth look at the algorithm.
K-means is an unsupervised algorithm (the user does not supply examples of correct classification) that tries to assign each observation to one of k groups. The number of groups, k, is called a 'hyperparameter' and must be specified by the user. Each group has a centroid, which acts as the center of the group. While the algorithm works in any number of dimensions, it's easiest to visualize in two or three.
The author of the post states that k-means boils down to alternating between two steps:
1. Expectation: assign each observation to its nearest centroid.
2. Maximization: move each centroid to the mean of the observations assigned to it.
Starting from some initial centroids, the algorithm repeats these two steps until the assignments stop changing.
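As a minimal sketch of those two steps (using made-up one-dimensional data and starting centroids, not the data from the tutorial):

```python
import numpy as np

points = np.array([1.0, 2.0, 9.0, 10.0])
centroids = np.array([0.0, 8.0])  # made-up starting centroids

# Expectation: assign each point to its nearest centroid
dist = np.abs(points[:, None] - centroids[None, :])  # shape (4 points, 2 centroids)
labels = np.argmin(dist, axis=1)

# Maximization: move each centroid to the mean of its assigned points
centroids = np.array([points[labels == j].mean() for j in range(2)])
print(labels)     # -> [0 0 1 1]
print(centroids)  # -> [1.5 9.5]
```

On this toy data a single pass already converges; with messier data the two steps repeat.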
Let's create some fake data. Here we're creating data from three bivariate uniform distributions. This will make the classification very obvious when it succeeds, but in practice it will never be this clean.
import numpy as np
import pandas as pd
import math, seaborn
np.random.seed(30)  # seed NumPy's generator; random.seed() would not affect np.random
k = 3
# Sample 3 different uniform distributions:
samples = list(map(np.random.uniform, [1, 7, 20], [5, 12, 25], [100, 100, 100]))
for i in range(k):
    samples[i].shape = (50, 2)
samples = [pd.DataFrame(samp) for samp in samples]
samples = pd.concat(samples).reset_index(drop = True)
samples = samples.rename(columns = {0 : 'x', 1 : 'y'})
samples['type'] = 'point'
# Group names
_groups = ['A', 'B', 'C']
## Make the centroids the first three observations:
centroids = samples.loc[0:(k-1), :].copy()
centroids.loc[:, 'group'] = np.array(_groups)
centroids.loc[:,'type'] = 'centroid'
# Add random groups to all the observations
samples['group'] = np.random.choice(a = _groups, size = len(samples))
# Save a copy for later:
samples_copy = samples.copy()
centroids
samples.head()
We can take a peek at the data. I had to look up how to do the multiple colors on the same plot and found this seaborn module. Seems real good:
def plot_iteration(pts, cents, title):
    dat = pd.concat([pts, cents], sort=False)
    # White borders for the points, black borders for the centroids
    borders = ['#FFFFFF' for i in range(len(pts))] + ['#000000' for i in range(len(cents))]
    fg = seaborn.relplot(x="x", edgecolor=borders, y="y",
                         hue="group", hue_order=_groups, data=dat,
                         style='type', aspect=1.5, height=5,
                         size="type", sizes=(200, 40))
    fg.fig.suptitle(title)
    return fg
plot_iteration(samples, centroids, 'Initialization')
Now we'll define a function to get the distance between each point and each centroid.
def euclidean(x1, x2, y1, y2):
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
def get_distances(pts, cents):
    # Loop through each centroid and compute its distance to every point
    dmat = []
    for i in range(k):
        distances = map(euclidean, [cents.iloc[i, 0] for r in range(len(pts))], pts.iloc[:, 0],
                        [cents.iloc[i, 1] for r in range(len(pts))], pts.iloc[:, 1])
        dmat.append(list(distances))
    return dmat
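As an aside, not from the original tutorial: the same distance matrix can be computed without the explicit loop using NumPy broadcasting, assuming the points and centroids are plain (n, 2) and (k, 2) arrays of x/y coordinates:

```python
import numpy as np

def get_distances_vectorized(pts_xy, cents_xy):
    """Distance from every centroid (rows) to every point (columns)."""
    diff = cents_xy[:, None, :] - pts_xy[None, :, :]  # shape (k, n, 2)
    return np.sqrt((diff ** 2).sum(axis=2))           # shape (k, n)

pts = np.array([[0.0, 0.0], [3.0, 4.0]])
cents = np.array([[0.0, 0.0]])
print(get_distances_vectorized(pts, cents))  # -> [[0. 5.]]
```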
Now let's do one iteration of the algorithm to see if we can get it going.
# Compute the distances to each centroid, then figure out which is closest to each point:
distances = get_distances(samples, centroids)
distances = np.array(distances)
nearest_centroids = np.array([_groups[i] for i in np.argmin(distances, axis = 0)])
# Assign new groups and plot
samples.loc[:, 'group'] = nearest_centroids
plot_iteration(samples, centroids, 'Expectation 1')
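As a quick sanity check on the argmin step: with a hypothetical 3x4 distance matrix (rows = centroids, columns = points, values made up), `axis=0` picks the index of the closest centroid for each point:

```python
import numpy as np

dmat = np.array([
    [1.0, 9.0, 5.0, 2.0],  # distances from centroid 0 to the 4 points
    [4.0, 2.0, 6.0, 8.0],  # centroid 1
    [7.0, 3.0, 0.5, 9.0],  # centroid 2
])
nearest = np.argmin(dmat, axis=0)  # smallest value down each column
print(nearest)  # -> [0 1 2 0]
```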
Next we compute the average x and y coordinate for each group and assign these as the new centroids.
new_centroids = samples.groupby('group').agg({'x': 'mean', 'y': 'mean'}).reset_index(drop = True)
centroids.iloc[0:k, 0:2] = new_centroids.iloc[0:k, 0:2]
plot_iteration(samples, centroids, 'Maximization 1')
Continue to run this until no changes are made:
i = 2
while True:
    # Get distances and nearest centroids
    distances = np.array(get_distances(samples, centroids))
    nearest_centroids = np.array([_groups[j] for j in np.argmin(distances, axis=0)])
    # Stop once the assignments match the previous iteration:
    if (nearest_centroids == samples['group']).all():
        break
    # Display the expectation step
    samples.loc[:, 'group'] = nearest_centroids
    plot_iteration(samples, centroids, 'Expectation ' + str(i))
    # Display the maximization step:
    new_centroids = samples.groupby('group').agg({'x': 'mean', 'y': 'mean'}).reset_index(drop=True)
    centroids.iloc[0:k, 0:2] = new_centroids.iloc[0:k, 0:2]
    plot_iteration(samples, centroids, 'Maximization ' + str(i))
    i += 1
Voila!
From my perspective, it's more intuitive to say that the goal of the algorithm is to find centroids that fit the groups well, rather than centroids that are as far apart as possible (although that tends to happen too).
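One way to make that goal concrete: k-means minimizes the within-cluster sum of squared distances (often called inertia). A sketch, using toy points and centroids rather than the data above:

```python
import numpy as np

def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances: the quantity k-means minimizes."""
    return sum(((points[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

pts = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
cents = np.array([[1.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
print(inertia(pts, cents, labels))  # -> 2.0
```

Each maximization step can only lower this number, which is why the loop above is guaranteed to settle.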
How close did we get to the centers of the bivariate distributions? The distributions we drew from had means of 3, 9.5, and 22.5.
centroids
Very nice!
The last question is: how do we do this with a module?
from sklearn.cluster import KMeans
from_module = KMeans(n_clusters = 3, random_state = 0).fit_predict(samples_copy[['x', 'y']].to_numpy())
from_module
So we can see that it assigned all of the first points to one group (they're the smallest in both x and y), followed by the medium and large groups. We also get the same number of points in each group:
pd.DataFrame(from_module)[0].value_counts()
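Since `fit_predict` returns arbitrary integer labels, another way to compare against the true means is to fit the model and inspect its `cluster_centers_` attribute. A sketch with stand-in data sampled the same way as above (the exact values will differ from my run since the seed and generator differ):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: three well-separated 2-D uniform blobs, 50 points each
rng = np.random.default_rng(0)
data = np.vstack([rng.uniform(lo, hi, size=(50, 2))
                  for lo, hi in [(1, 5), (7, 12), (20, 25)]])

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(data)
# Row order of the centers is arbitrary, so sort for readability
print(np.sort(km.cluster_centers_[:, 0]))  # roughly [3, 9.5, 22.5]
```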