Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses. In simple terms, it finds hidden patterns in raw data without any external guidance, uncovering structures and relationships within the data.
Sometimes it is not practical to tell a computer system in advance what kinds of data it will receive, so supervised learning is a poor fit where systems must regularly handle new types of data. For example, hackers attacking financial systems or bank servers tend to change their patterns and techniques regularly. Unsupervised learning is better suited to such scenarios, because the system must learn quickly from attacks and anticipate the ones to come.
Unsupervised learning methods are used in bioinformatics for sequence analysis and genetic clustering; in data mining for sequence and pattern mining; in medical imaging for image segmentation; and in computer vision for object recognition. With the help of unsupervised learning we can make our data more organized and readable.
A familiar example is Facebook's image-recognition feature. Whenever we upload a picture, Facebook suggests the names of the people appearing in it so that they can be tagged. It does this by matching the faces in the new picture against pictures of those people it has seen before.
Cluster analysis is the most common method in unsupervised learning. It is the process of grouping similar entities together: the goal of this technique is to find similarities among data points and place similar data points in the same group.
It is used for exploratory data analysis and for finding hidden patterns within the data. We model these clusters using a measure of similarity, such as Euclidean or probabilistic distance.
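As a concrete illustration of the similarity measure mentioned above, the Euclidean distance between two points can be computed with NumPy (the two points here are made-up values):

```python
import numpy as np

# Two data points in a 2-D feature space (hypothetical values).
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences.
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0
```

The smaller this distance, the more similar the two points are considered to be by the clustering algorithms below.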
The common clustering algorithms are as follows:
· Hierarchical clustering: In this, a multilevel hierarchy of clusters is built by creating a cluster tree.
· K-means clustering: In this, the data is partitioned into k distinct clusters based on the distance from each cluster's centroid. It can also be used to uncover sub-groups within a given dataset.
· Gaussian mixture model: In this, the clusters are modeled as a mixture of multivariate normal density components.
· Self-organizing maps: In this, neural networks learn the topology and distribution of the data.
· Hidden Markov models: In this, observed data is used to recover the sequence of hidden states.
Why should we use Clustering?
When we group similar entities together, we can outline the characteristics of each group, which gives us insight into the underlying patterns of the different groups. Grouping unlabeled data has many applications: for example, we can identify the various groups or segments of customers in order to increase profits, or group together documents that belong to the same subject.
Clustering is also used to reduce the dimensionality of the data when you are dealing with a large number of variables.
How do clustering algorithms work?
Many algorithms have been developed for clustering, but let us discuss the following two in this post:
1. K-mean Clustering
2. Hierarchical Clustering
K-means Clustering: The prediction in K-means clustering depends on the number of cluster centers and on the nearest mean value (by Euclidean distance) between observations. We use K-means in customer segmentation, insurance fraud detection, and cost modeling. K-means works as follows:
1. It starts with K as the input. K is the number of clusters we want to form: if K=3 there will be three clusters, and if K=4 there will be four clusters.
2. Using the Euclidean distance between data points and centroids (the center points of the clusters), we assign each data point to its closest cluster.
3. We then recalculate the cluster centers as the mean of their assigned points.
4. We repeat steps 2 and 3 until no further changes take place.
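The four steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration on made-up data, not a production implementation: for simplicity it seeds the centroids with the first K points, whereas real implementations use random or k-means++ initialization.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Minimal K-means sketch: assign points to the nearest centroid,
    recompute centroids, repeat until nothing changes."""
    # Simplified initialization: the first k points seed the centroids.
    centroids = X[:k].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of points (hypothetical data).
X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)  # [0 0 1 1]
```

The two tight groups end up in separate clusters, with each centroid sitting at the mean of its group.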
The advantages of K-means are:
· It is quite easy to understand and is robust.
· It gives the best results when the clusters in the dataset are distinct.
Hierarchical Clustering: We use this method of unsupervised learning to find subgroups within data. We do so by computing the distance between every data point and its nearest neighbor and linking the closest points together. Unlike K-means clustering, hierarchical clustering starts by assigning each data point to its own cluster; in each subsequent step, it finds the two closest clusters and merges them into one.
1. First, we assign every data point to its own cluster.
2. Then, using the Euclidean distance, we find the closest pair of clusters and merge them into a single cluster.
3. We keep computing the distance between the two nearest clusters and joining them until all points belong to a single cluster.
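The three steps above can be sketched directly in Python with NumPy. This is a naive single-linkage illustration on made-up data, chosen for readability rather than efficiency:

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Bottom-up clustering sketch: start with one cluster per point and
    repeatedly merge the closest pair (single linkage, Euclidean distance)."""
    # Step 1: every data point is its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # Step 2: find the closest pair of clusters.
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Step 3: merge the pair into a single cluster and repeat.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight groups far apart (hypothetical data).
X = np.array([[0.0, 0.0], [0.4, 0.0], [9.0, 9.0], [9.4, 9.0]])
print(agglomerative(X, n_clusters=2))  # [[0, 1], [2, 3]]
```

Here the loop stops early at two clusters; letting it run until one cluster remains produces the full hierarchy that a dendrogram depicts.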
In this technique, we decide the best possible number of clusters by noticing which horizontal line can cut the vertical lines of the dendrogram without intersecting a cluster while covering the maximum distance. We can find the sub-groups with the help of a dendrogram.
Dendrogram: A dendrogram is a tree diagram (often used in biology to show clustering among genes or samples) that depicts a hierarchical clustering, that is, the relationships between similar sets of data. It can be drawn as a row or a column graph.
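Assuming SciPy is available, `scipy.cluster.hierarchy` can build the cluster tree and cut it at a chosen distance, which is the programmatic equivalent of drawing a horizontal line across the dendrogram (the data and the cut threshold here are made-up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical data: two tight groups far apart.
X = np.array([[0.0], [0.3], [8.0], [8.3]])

# Build the cluster tree (single linkage, Euclidean distances).
Z = linkage(X, method="single")

# Cutting the tree at distance 1.0 separates the short within-group
# merges from the long between-group merge, yielding two clusters.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)  # two distinct labels: one shared by points 0-1, one by 2-3

# dendrogram(Z) would draw the tree itself (requires matplotlib).
```

Raising the threshold `t` above the largest merge distance would put everything in one cluster; lowering it below 0.3 would leave every point in its own cluster.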