What is Cluster Analysis?
Machine Learning: unsupervised edition - intro to clustering
Cluster analysis is an unsupervised machine learning method used to organise data.
The approach used is to call a distance function which calculates how closely associated examples are to one another. Unlike many statistical models, cluster analysis is used without assumptions being made on the relationships in the dataset.
It provides information about where associations and patterns in data exist, but not what those might be or what they mean.
In Data Science and Machine Learning problems often the first step is to apply cluster analysis - to gain a better understanding of the underlying relationships in the dataset.
Uses of Clustering?
Some common applications include:
- Anomaly Detection
- Signal Processing
- Search Grouping
- Medical Imaging
- Image Segmentation
- Market Segmentation & Analysis
- Social Network Analysis
There are various methods, with Python in scikit learn there are ten base methods as detailed here:
We can see how these methods perform when applied to test datasets in here:
Images from Scikit Learn
kMeans divides a set of samples into clusters which are described by a sample mean, and are commonly referred to centroids.
The algorithm aims to minimise inertia - within cluster sum of squares criterion:
Mini batch kMeans uses the mini batches to reduce computation time, with the same objective function. Mini batches are subsets of the input data, randomly sampled in each iteration.
The algorithm iterates between two major steps.
- First - samples are drawn randomly and assigned to the closest centroid.
- Second - centroids are updated on a sample basis.
For each sample the centroid is updated by with the average of the sample + all samples assigned to that centroid previously. This has the effect of decreasing the rate of change for a centroid over time.
These steps are performed until convergence or a predetermined number of iterations is reached.
Agglomerative clustering is a hierarchical method that uses a bottom up approach. Each observation starts in its own cluster, and clusters are successively merged together.
It recursively merges a pair of clusters that minimally increases a given linkage distance .The linkage determines the metric used for merging:
- Ward - minimises the sum of squared differences within all clusters - a variance minimising approach.
- Maximum / complete - minimises the maximum distance between observations of pairs of clusters.
- Average - minimises the average of the distances between all observations of pairs of clusters.
- Single - minimises the distance between the closest observations of pairs of clusters.
Agglomerative Clustering can scale to large numbers of samples when it is used with a connectivity matrix - this can be computationally expensive.
DBSCAN: Density Based Spatial Clustering of Applications with Noise
DBSCAN views clusters as areas of high density separated by areas of low density.
Clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped.
There are two parameters to the algorithm:
- min_samples - primarily controls how tolerant the algorithm is towards noise.
- eps - is the maximum distance between two samples to be considered as in the neighborhood of another.
Higher min_samples / lower eps indicate a higher density - necessary to form a cluster.
The central component to DBSCAN is the core sample.
- A cluster is therefore a set of core samples, as defined by a distance measure, as well as a set of non-core samples.
- A core sample is a sample for which there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample.
OPTICS: Ordering Points To Identify the Clustering Structure
OPTICS shares many similarities with DBSCAN. It can be considered a generalised version of DBSCAN which relaxes the distance (eps) requirement from a single value to a range.
The key difference between is that this algorithm builds a reachability graph which assigns each sample:
- A reachability distance,
- A spot within the cluster ordering_
These attributes are assigned when the model is fitted and used to determine cluster membership. OPTICS is better suited for large datasets compared to the current implementation of DBSCAN.
Spectral clustering is a technique with roots in graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them.
It is related to DBSCAN clustering and can also be used to partition graphs (via their spectral embeddings).
the algorithm performs a low-dimension embedding of the affinity matrix between samples, followed by clustering of components. It's computationally efficient when the affinity matrix is sparse and the amg solver is used.
In practice it's useful when the individual clusters structures are non-convex, and when centre point & spread are not suitable descriptions of the cluster, such as nested circles on a 2D plane.
That's all for this quick intro to cluster analysis. Thanks for reading, I hope this was of use 🙏
Now, there's lots of data out there just waiting for you to put it in boxes 🗳