The Importance of F1 Score

At CereLabs, we are building various image classification systems. When building any kind of classification system, one of the main challenges is testing the trained models. One useful measure for this is accuracy, which is the proportion of correct results (both true positives and true negatives) among the total number of images examined. Accuracy thus tells us how often the model arrives at the correct result. In the case of an image classification system, accuracy is the fraction of the test image dataset that the trained model classifies correctly. If we are trying to detect apples in images, accuracy measures how often the classifier correctly decides whether or not an image contains an apple.

Consider the following confusion matrix.

True Positive (TP): Actual image contains an apple, and is correctly classified as an apple.
False Negative (FN): Actual image contains an apple but is not classified as an apple.
False Positive (FP): Actual image does not contain an apple but is classified as an apple.
True Negative (TN): Actual image does not contain an apple and is correctly classified as not an apple.
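
The four cells defined above can be counted directly from a list of true labels and a list of predictions. The following is a minimal sketch rather than the code we use in production; the function name `confusion_counts` and the toy data are hypothetical, and labels are assumed to be booleans where True means "contains an apple".

```python
def confusion_counts(actual, predicted):
    """Count (TP, FN, FP, TN) for a binary apple / not-apple task.

    `actual` and `predicted` are equal-length sequences of booleans,
    where True means "contains an apple" (an assumption of this sketch).
    """
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # apple, classified as apple
        elif a and not p:
            fn += 1      # apple, missed
        elif not a and p:
            fp += 1      # no apple, but classified as apple
        else:
            tn += 1      # no apple, correctly rejected
    return tp, fn, fp, tn


# Hypothetical toy data: six images, three of which contain an apple.
actual = [True, True, True, False, False, False]
predicted = [True, False, True, True, False, False]

counts = confusion_counts(actual, predicted)
print(counts)  # (2, 1, 1, 2)
```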

Suppose we use a model that is able to detect apples in an image. We test the model on 100000 images, of which 1000 contain apples and 99000 do not. After testing we get the following confusion matrix: TP = 445, FN = 555, FP = 1024, TN = 97976.

The formula to calculate accuracy is as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = (445 + 97976) / (445 + 97976 + 1024 + 555)
         = 0.98
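
The same arithmetic can be checked in a few lines of Python. This is a small sketch; the helper name `accuracy` is our own, and returning 0.0 for an empty test set is an assumed convention.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all examined images that were classified correctly."""
    total = tp + tn + fp + fn
    # Assumption: report 0.0 when there is nothing to evaluate.
    return (tp + tn) / total if total else 0.0


# Counts taken from the worked example above.
acc = accuracy(tp=445, tn=97976, fp=1024, fn=555)
print(round(acc, 4))  # 0.9842
```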

Although accuracy can be a useful tool to test any model, it can fail when the data is skewed, especially in classification models. In such a case, a high accuracy does not guarantee that the classifier is doing well, only that it does well on the class that dominates the data. In the case of detecting apples, if images with apples are rare, a model can reach a high accuracy simply by labelling almost every image as not containing an apple. On the test set above, a model that never predicts an apple would score 99000/100000 = 0.99 without finding a single apple. So we cannot tell from accuracy alone whether our model does well when it is presented with an image that does contain an apple, even if the accuracy is above 98%.

Thus the disadvantages of using accuracy are as follows:
  • It cannot distinguish between false positives and false negatives
  • It is not a useful metric if the data is skewed
  • Accuracy Paradox: Predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.
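
To make the skew problem and the accuracy paradox concrete, here is a small sketch using the class sizes from the example (1000 images with apples, 99000 without). The "always say no apple" baseline is a hypothetical model introduced only for illustration.

```python
# Class sizes from the example test set.
n_apples, n_no_apples = 1000, 99000

# Hypothetical baseline that labels every image "not an apple":
# it never produces a positive, so TP = FP = 0.
tp, fn = 0, n_apples
fp, tn = 0, n_no_apples

baseline_accuracy = (tp + tn) / (tp + tn + fp + fn)
apples_found = tp / (tp + fn)  # fraction of real apples that were detected

print(baseline_accuracy)  # 0.99
print(apples_found)       # 0.0
```

The baseline beats the 0.98 accuracy of the real model while detecting no apples at all, which is exactly the situation accuracy cannot reveal.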

These disadvantages may turn out to be catastrophic if we use accuracy to judge the correctness of a model in cases such as fraud detection or cancer detection. We want to catch the cases where a malignant tumour is classified as benign. Just imagine a model predicting that a person does not have a disease when the person actually has it. The accuracy score does not single out such cases, so it cannot tell us how good a model is at detecting tumours or any other disease.

In such a scenario, the F1 score comes to our rescue. To calculate the F1 score we first need to calculate Precision and Recall.

Precision (P): Of all the images classified as an apple, what fraction actually contain an apple. In other words, how many of the positive predictions were correct.

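P = TP / (TP + FP)
  = 445 / (445 + 1024)
  = 0.30
Here you can see that Precision is far lower than accuracy because of the skewed data: only about 1% of the 99000 apple-free images are misclassified, but that is still 1024 false positives, more than twice the 445 true positives.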

Recall (R): Of all the images that contain an apple, what fraction were correctly classified as an apple.
R = TP / (TP + FN)
  = 445 / (445 + 555)
  ≈ 0.44 (0.445 exactly)
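
Both measures are easy to compute from the confusion matrix counts. The sketch below uses our own helper names `precision` and `recall`; returning 0.0 when the denominator is zero is an assumed convention, not something the definitions above prescribe.

```python
def precision(tp, fp):
    """Of everything classified as an apple, the fraction that really is one."""
    predicted_positive = tp + fp
    # Assumption: 0.0 when the model made no positive predictions.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Of all images that contain an apple, the fraction that was detected."""
    actual_positive = tp + fn
    # Assumption: 0.0 when the test set has no positive examples.
    return tp / actual_positive if actual_positive else 0.0


# Counts taken from the worked example above.
p = precision(tp=445, fp=1024)
r = recall(tp=445, fn=555)
print(round(p, 3), round(r, 3))  # 0.303 0.445
```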

Now both Precision and Recall are important measures. The F1 score gives us a single value that incorporates both: it is the harmonic mean of Precision and Recall.
F1 = 2 * (P * R) / (P + R)
   = 2 * (0.30 * 0.44) / (0.30 + 0.44)
   = 0.36
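
As a cross-check, the F1 formula can also be written directly in terms of the counts: substituting P = TP / (TP + FP) and R = TP / (TP + FN) gives F1 = 2TP / (2TP + FP + FN). The sketch below computes it both ways; the function names are hypothetical and the 0.0 fallback for a zero denominator is an assumption.

```python
import math


def f1_from_pr(p, r):
    """F1 from precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def f1_from_counts(tp, fp, fn):
    """Equivalent form: F1 = 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Counts taken from the worked example above.
tp, fp, fn = 445, 1024, 555
p = tp / (tp + fp)
r = tp / (tp + fn)

f1_a = f1_from_pr(p, r)
f1_b = f1_from_counts(tp, fp, fn)
print(round(f1_a, 2), round(f1_b, 2))  # 0.36 0.36
print(math.isclose(f1_a, f1_b))        # True
```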

Thus the F1 score captures the balance between Precision and Recall, and a good F1 score is solid evidence of a model's strength in classifying data. If Precision and Recall are imbalanced, or if both are low, the F1 score will penalize the classifier. The F1 score above shows that the model has failed miserably, which the accuracy score was not able to identify.
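
The penalty for imbalance comes from F1 being a harmonic mean rather than a plain average. The sketch below compares the two on a hypothetical classifier whose every positive prediction is correct but which finds only 10% of the apples; the numbers are made up for illustration.

```python
def arithmetic_mean(p, r):
    """Plain average of precision and recall, shown only for comparison."""
    return (p + r) / 2


def f1(p, r):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical imbalanced classifier: perfect precision, poor recall.
p, r = 1.0, 0.1
print(round(arithmetic_mean(p, r), 2))  # 0.55
print(round(f1(p, r), 2))               # 0.18

# Hypothetical balanced classifier with the same plain average.
print(round(f1(0.55, 0.55), 2))         # 0.55
```

The plain average rates both classifiers at 0.55, while F1 drops to 0.18 for the imbalanced one.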

We hope you are able to use the F1 score in your own classification project; the higher the score, the better. If you have any queries, please mention them in the comments.
