Color Clustering in Python

Apr 2, 2021

Take a look at the MARVEL logo below. You can probably figure out the most dominant colors in a second. We clearly see black, red, and white.

It’s pretty easy for the human mind to pick these out. But what if we wanted to write an automated script that extracts the dominant colors from any image? Sounds cool, right?

So that's what we're going to do today.

We will be using an algorithm known as the K-Means algorithm to automatically detect the dominant colors in the image.

What is the K-Means algorithm?

Our goal is to partition N data points into K clusters. Each of the N data points is assigned to the cluster with the nearest mean. The mean of each cluster is called its centroid, or center.

Applying k-means yields K separate clusters of the original N data points. Data points inside a particular cluster are considered to be more similar to each other than to data points that belong to other clusters.

In our case, we will be clustering the pixel intensities of an RGB image. Given an MxN image, we have MxN pixels, each consisting of three components: red, green, and blue.
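To make the assign-then-average idea concrete, here is a minimal sketch of k-means in plain NumPy. It is not the implementation used later in this article; the function name `kmeans`, the fixed iteration count, and the random initialization from existing data points are all assumptions made for illustration.

```python
import numpy as np


def kmeans(points, k, n_iters=20, seed=0):
    """Cluster `points` (shape: n x d) into `k` groups (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign(current_centroids):
        # Distance from every point to every centroid, then pick the nearest.
        distances = np.linalg.norm(
            points[:, None, :] - current_centroids[None, :, :], axis=2
        )
        return distances.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(centroids)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)

    # Final assignment against the final centroids.
    return centroids, assign(centroids)


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

Each pass repeats the two steps described above: assign every point to its nearest centroid, then recompute each centroid as the mean of its points.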

We will treat these MxN pixels as our data points and cluster them using k-means.

Pixels that belong to a given cluster will be more similar in color than pixels belonging to a separate cluster.

The only thing we need to keep in mind is that we must provide the number of clusters, K, ahead of time.

The code

I'll try to explain the steps as we go through the code.

We have imported all the dependencies needed for our project today.

This is just the run timer so that we can check the efficiency at the end. As you can see, it shows 0% as of now.

We have added a command-line (CLI) argument parser. Now we can pass the arguments directly to the script when running it.

This is completely optional and I’ll add the values inline for the sake of this article.
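A parser along these lines could be sketched with the standard library's `argparse`. The argument names `--image` and `--clusters`, the default of 3, and the file name `marvel.png` are illustrative assumptions, not necessarily the ones used in the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line options (argument names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Find the dominant colors in an image."
    )
    parser.add_argument("-i", "--image", required=True,
                        help="path to the input image")
    parser.add_argument("-c", "--clusters", type=int, default=3,
                        help="number of clusters K")
    return parser.parse_args(argv)


# Passing a list stands in for typing the arguments on the command line.
args = parse_args(["--image", "marvel.png", "--clusters", "3"])
print(args.image, args.clusters)
```

Calling `parse_args()` with no list makes `argparse` read the real command-line arguments instead.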

Here, we have fetched the image and converted it to the RGB channel order. An important point to note here is that OpenCV loads images in BGR channel order, not RGB, so we convert the image before handing the pixels to anything that expects RGB.
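The channel swap is easy to see on a tiny array. The sketch below uses plain NumPy rather than OpenCV, and assumes the image is an array of shape (height, width, 3) in BGR order, which is what `cv2.imread` returns. Calling `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` on a real image gives the same reordering.

```python
import numpy as np

# A 1x2 "image" in BGR order: one pure-blue pixel and one pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels, giving RGB order.
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # the blue pixel, now in RGB order
print(rgb[0, 1])  # the red pixel, now in RGB order
```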

Now, an image is basically an MxN matrix of pixels. We simply reshape our NumPy array into a list of RGB pixels so that we can run our algorithm on the image.
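As a small sketch of that reshape, assume a stand-in array in place of a real image (the 4x6 size is arbitrary):

```python
import numpy as np

# Stand-in for a loaded image: M=4 rows, N=6 columns, 3 color channels.
image = np.arange(4 * 6 * 3).reshape((4, 6, 3))

# Flatten the M x N grid into M*N rows, keeping each pixel's 3 channels together.
pixels = image.reshape((image.shape[0] * image.shape[1], 3))

print(pixels.shape)  # (24, 3)
```

Each row of `pixels` is one pixel, which is the layout a clustering routine expects: one data point per row, one feature per column.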

We are using the scikit-learn implementation of the K-Means algorithm, which takes care of most of the heavy lifting. Calling fit clusters the data points; the resulting cluster centers are the dominant color intensities in this case.
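A minimal sketch of this step, assuming a hand-made pixel list instead of a real image. The values chosen for `n_init` and `random_state` are assumptions, set here only to keep repeated runs consistent.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in pixel list: 50 black, 30 red and 20 white RGB pixels.
pixels = np.array(
    [[0, 0, 0]] * 50 + [[255, 0, 0]] * 30 + [[255, 255, 255]] * 20
)

# K must be chosen up front; here we ask for 3 clusters.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(pixels)

print(model.cluster_centers_)  # one RGB centroid per cluster
print(model.labels_[:5])       # cluster index of the first five pixels
```

After fitting, `cluster_centers_` holds the centroid colors and `labels_` holds the cluster index assigned to each pixel.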

But we do need to define some helper functions to display the dominant colors of the image. So, let's open up a new file and name it helper.py.

In the first function, we take the number of clusters and create a histogram of the number of pixels assigned to each cluster.
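One way such a helper might look is sketched below. The name `centroid_histogram` and its signature (cluster labels plus the cluster count) are assumptions for illustration.

```python
import numpy as np


def centroid_histogram(labels, n_clusters):
    """Return the fraction of pixels assigned to each cluster."""
    # One bin per cluster label: bin edges are 0, 1, ..., n_clusters.
    hist, _ = np.histogram(labels, bins=np.arange(0, n_clusters + 1))
    # Normalize so the fractions sum to 1.
    return hist.astype(float) / hist.sum()


labels = np.array([0, 0, 0, 1, 1, 2])
hist = centroid_histogram(labels, 3)
print(hist)
```

Normalizing the counts means the result does not depend on the image size, only on each cluster's share of the pixels.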

In the second function, we pass two parameters: the histogram returned by the first function and the list of centroids returned by the K-Means algorithm. We calculate the percentage of pixels assigned to each color. This gives us a percentage breakdown of the dominant colors.
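A sketch of how the second helper could turn that breakdown into something viewable follows. It builds the color bar with plain NumPy; the name `plot_colors`, the bar dimensions, and the sample inputs are assumptions for illustration.

```python
import numpy as np


def plot_colors(hist, centroids, width=300, height=50):
    """Build a bar image whose segments are sized by each color's share."""
    bar = np.zeros((height, width, 3), dtype=np.uint8)
    start = 0.0
    for fraction, color in zip(hist, centroids):
        end = start + fraction * width
        # Fill this color's slice; its width is proportional to `fraction`.
        bar[:, int(start):int(end)] = np.asarray(color, dtype=np.uint8)
        start = end
    return bar


hist = [0.5, 0.3, 0.2]
centroids = [[0, 0, 0], [255, 0, 0], [255, 255, 255]]
bar = plot_colors(hist, centroids)
print(bar.shape)  # (50, 300, 3)
```

The returned array is an ordinary RGB image, so it can be shown with, for example, Matplotlib's `imshow`.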

Now that the helper functions are done, we can glue everything together.