What are anomalies?
Anomalies are items, events, or observations that do not conform to an expected pattern or to other items in a data set. Anomalies can be found in any domain. In network data, attacks can be categorized as anomalies. On a factory production line, faulty products can be categorized as anomalies. In commercial transactions, frauds can be categorized as anomalies. These are a few examples of anomalies.
Why is anomaly detection critical?
Most real-world anomaly detection scenarios are very critical. Anomalies are rare events that may have great significance but are difficult to identify. So it is very important to identify anomalies and take the necessary steps to stop or reduce the bad outcome. For example, anomaly detection can be used to identify possible illnesses such as cancers or heart attacks before they grow significantly. You can also block fraudulent transactions before they go through. Using real-time anomaly detection techniques, you can detect unexpected network attacks and identify unexpected weather changes before they cause significant damage.
Why don't general machine learning techniques suit anomaly detection?
Most machine learning algorithms work by training a model on the available data. To achieve good accuracy, those algorithms need a fairly large amount of data to train and build their models. But in anomaly detection, anomalous data is very rare. For example, the ratio of fraudulent to normal transactions could be 1:10,000. So there is always an imbalance between normal and anomalous data. Because of that, we can't apply most common machine learning algorithms to the anomaly detection scenario; even if we do, the accuracy will be very low.
Why is clustering good at anomaly detection?
Clustering is an unsupervised machine learning technique. Simply put, clustering groups the data by considering the similarities between data points. Once we cluster the data, we get clusters with homogeneous data inside each cluster. The data distribution within the clusters is very important for the anomaly detection scenario. Due to the rarity of anomalous data, those points will appear more like outliers within the clusters rather than forming a separate cluster. What we mainly consider here is detecting unknown anomalies, which can't be identified using a few simple rules or the knowledge of a domain expert. Those anomalies can arise in unexpected ways, and they won't behave the same way every time. Another advantage is that clustering can be applied even when the data set is very high dimensional.
How does the anomaly detection algorithm work?
The first step is clustering the data set. For that we use the K-means algorithm, which clusters the data based on the distance between data points. After running the algorithm we get K clusters. Since K-means works only with numerical data, if there are any categorical features we have to drop them before applying K-means.
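The clustering step above can be sketched with scikit-learn's KMeans. The data here is synthetic and the choice of K is an illustrative assumption (in practice it could be picked with a method such as the elbow method):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic numeric data standing in for a real data set;
# any categorical columns would already have been dropped.
X = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

K = 3  # number of clusters (illustrative choice)
kmeans = KMeans(n_clusters=K, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # cluster index for every data point

print(labels.shape)                   # one label per data point
print(kmeans.cluster_centers_.shape)  # K centers, one per cluster
```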
The next step is to identify the cluster boundaries. For that, we use a percentile distance value rather than the maximum distance of any point in a cluster from its cluster center. The reason is the anomalous data: we assume anomalous data can also end up inside the clusters, but because of their deviation from normal behavior, anomalies will mostly lie farther from their cluster centers than normal data does. Those anomalies can still be near the cluster boundaries. If we took the maximum distance as the cluster boundary, those anomalies would be absorbed into the clusters. To avoid that, we use a percentile distance value (e.g., the 95th percentile of all distances between data points and their respective cluster centers, computed for each cluster).
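The boundary step can be sketched as follows: for each cluster, take the 95th percentile of member-to-center distances as that cluster's boundary radius. The data and K value are illustrative assumptions, as before:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # synthetic stand-in data

K = 3
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

boundaries = np.empty(K)
for k in range(K):
    members = X[labels == k]
    # Euclidean distance of each member to its cluster center
    dists = np.linalg.norm(members - centers[k], axis=1)
    # 95th-percentile radius serves as the cluster boundary
    boundaries[k] = np.percentile(dists, 95)

print(boundaries)
```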
After determining the boundary of each cluster, we can make predictions for new data. When a new data point arrives, we find its closest cluster by calculating the distance between the new point and each cluster center. Then we compare the distance from the point to that center with the cluster's boundary. If the distance is greater than the boundary, we consider the point anomalous, since it lies outside the cluster. If it is less than the boundary, we consider it normal, since it lies inside the cluster.
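The prediction step can be sketched as a small function; the two clusters and their unit-radius boundaries below are a toy setup, not values from a real model:

```python
import numpy as np

def is_anomaly(point, centers, boundaries):
    """Return True if `point` falls outside its closest cluster's boundary."""
    dists = np.linalg.norm(centers - point, axis=1)
    k = int(np.argmin(dists))        # index of the closest cluster
    return dists[k] > boundaries[k]  # outside the boundary -> anomaly

# Toy example: two clusters around (0,0) and (10,10), boundary radius 1.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
boundaries = np.array([1.0, 1.0])

print(is_anomaly(np.array([0.5, 0.5]), centers, boundaries))  # inside -> False
print(is_anomaly(np.array([5.0, 5.0]), centers, boundaries))  # far from both -> True
```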
Since we are dealing with anomaly detection, a true positive is a case where a true anomaly is detected as an anomaly by the model.
As I said, anomaly detection is a special scenario: there will always be an imbalanced distribution between anomalous and normal data. Therefore we need to focus on detecting anomalies. Rather than calculating a general prediction accuracy over all data points, we should give higher priority to true positives than to true negatives. Because of that, we can't use the more general accuracy measures. There are a few measures well suited to this scenario:
- Sensitivity (recall) – gives the True Positive Rate: TP / (TP + FN)
- Precision – gives the probability that a positive prediction is a true positive: TP / (TP + FP)
- PR curve – the Precision-Recall (Sensitivity) curve, which plots Precision vs. Recall
- F1 score – the harmonic mean of Precision and Sensitivity (recall): 2TP / (2TP + FP + FN)
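These measures can be computed with scikit-learn; the labels below are made-up illustrative values (1 = anomaly, 0 = normal):

```python
from sklearn.metrics import recall_score, precision_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 3 true anomalies in 10 points
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # 2 caught, 1 missed, 1 false alarm

recall = recall_score(y_true, y_pred)        # TP/(TP+FN) = 2/3
precision = precision_score(y_true, y_pred)  # TP/(TP+FP) = 2/3
f1 = f1_score(y_true, y_pred)                # 2TP/(2TP+FP+FN) = 4/6 = 2/3

print(recall, precision, f1)
```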
So Precision and Sensitivity are the most suitable measures for a model where positive instances are very rare. The PR curve and F1 score combine both Sensitivity and Precision, so they can be used to tell how good the model is overall. We can also look at Sensitivity and Precision separately, depending on the exact scenario.
Precision or Recall?
This mostly depends on the domain and how you look at the particular problem. For the anomaly detection case there can be two scenarios:
- Detecting every true anomaly is the most important thing. In other words, true positives matter most. For example, in a critical system like a space shuttle, the system should be highly reliable.
- Not flagging normal behavior as an anomaly is the most important thing.
As an example, let's consider cancer diagnosis. It is very important to identify a cancer in its early stages, so identifying all positives matters most. In that case we should treat recall as the important accuracy measure.
On the other hand, even if we identify most positives correctly, we also need to focus on reducing false positives. Getting true positives in the results is important, but a large number of false positives can also be a problem: many people who do not have cancer would be identified as cancer patients. A recent study showed that in breast cancer screening only 10% of positive results are true positives, so 90% of the results are false positives. That can lead to a lot of unnecessary problems.
So if the most important thing is reducing false positives, we should consider precision the most relevant accuracy measure.
If the most important thing is detecting all positives as positives, we should consider recall (Sensitivity) the most relevant accuracy measure.