The Art of Choosing The Right Metric in Machine Learning (1)
A performance metric is a key tool for measuring impact and success. In machine learning development, performance metrics are essential for measuring and evaluating whether a model is ready for use. In this article we will discuss performance metrics for classification tasks.
There are many performance metrics to choose from, and the right one depends on the task the model performs. In classification, accuracy is the most common metric, and it is often applied to any classification model by default. However, treating accuracy as a silver bullet can hurt the business, because the costs of a false positive (FP) and a false negative (FN) are usually different.
In some cases, a false positive is more acceptable than a false negative. For example, in churn modeling, we can tolerate re-contacting a few retained users, but we cannot afford to leave churning users undetected. For that case, accuracy cannot be our main metric. Now, let's take a look at the list of performance metrics and their use cases.
Accuracy
definition
As I mentioned before, accuracy is the most common metric, and it is the most intuitive one. Accuracy measures the percentage of correct predictions out of all predictions. In short, we can compute accuracy by dividing the number of correct predictions by the total number of predictions.
when to use
The use case of accuracy is a bit tricky: accuracy works well when the label classes are balanced. For example, suppose you have 100 news texts to analyze, 50 with negative sentiment and 50 with positive sentiment, and you want to measure your model's performance. If the model predicts 90 out of the 100 texts correctly, the accuracy is 90/100 = 90%.
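As a quick illustration, here is a minimal sketch using scikit-learn; the label arrangement is made up to match the 90-out-of-100 example above.

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground truth: 50 negative (0) and 50 positive (1) news texts
y_true = [0] * 50 + [1] * 50
# Hypothetical predictions: 45 of each class are correct, 90 correct overall
y_pred = [0] * 45 + [1] * 5 + [1] * 45 + [0] * 5

print(accuracy_score(y_true, y_pred))  # 0.9 -> 90 out of 100 correct
```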
When the labels are imbalanced, however, relying on accuracy can lead us to failure. For example, suppose we want to evaluate a fraud detection model on data consisting of 990 non-fraud and 10 fraud cases. The model correctly predicts every non-fraud case but fails to detect a single fraud. Measured by accuracy, the model scores 99 percent, and by that number alone it looks outstanding. Yet a model that cannot detect any fraud could seriously harm the company and bring us a big loss.
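The same pitfall can be reproduced in a few lines; this sketch assumes a hypothetical model that always predicts "not fraud" (label 0).

```python
from sklearn.metrics import accuracy_score

# 990 non-fraud (0) and 10 fraud (1) transactions
y_true = [0] * 990 + [1] * 10
# A "model" that predicts non-fraud for everything
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))  # 0.99, yet zero frauds are detected
```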
Precision
definition
Precision measures how many of the positive predictions were actually correct. To calculate precision, divide the number of true positives (TP) by the total number of positive predictions (TP + FP), where FP is the number of false positives.
when to use
You can use precision when you have to avoid false positives but can tolerate false negatives. For example, consider a spam detector applied to 100 emails, 90 non-spam and 10 spam. If the model classifies 3 non-spam emails as spam and 8 spam emails as spam, the precision is 8/11. This calculation also shows that precision is not distorted by imbalanced labels.
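Here is a minimal sketch of the same calculation with scikit-learn, assuming labels arranged to match the counts above (1 = spam, 0 = non-spam).

```python
from sklearn.metrics import precision_score

# 90 non-spam (0) and 10 spam (1) emails
y_true = [0] * 90 + [1] * 10
# 3 non-spam emails wrongly flagged as spam, 8 spam emails correctly flagged
y_pred = [1] * 3 + [0] * 87 + [1] * 8 + [0] * 2

print(precision_score(y_true, y_pred))  # 8 / (8 + 3) ≈ 0.727
```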
In this case, we can live with the model classifying a spam email as non-spam, but there is no tolerance for putting a non-spam email in the spam box. Imagine if a non-spam email classified as spam turned out to be an important one; we could suffer a massive loss.
Recall
definition
Recall is the counterpart of precision. Recall measures how many of the actual positive cases the model predicts as positive. We calculate recall by dividing the number of true positives (TP) by the total number of actual positives (TP + FN), where FN is the number of false negatives.
when to use
If we want to maximize the number of correct positive predictions, we should use recall. For example, in fraud detection, every real fraud we fail to detect costs the company money, so the loss comes from the frauds we were unable to catch.
Suppose we have 900 non-fraud cases and 100 fraud cases, and our model predicts 90 of the frauds as fraud and 10 as non-fraud. The recall is then 90/100. From this example we can conclude that recall also handles imbalanced labels well, because it only concerns the positive class.
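The corresponding sketch, again assuming scikit-learn and a label layout made up to match the counts above (1 = fraud, 0 = non-fraud):

```python
from sklearn.metrics import recall_score

# 900 non-fraud (0) and 100 fraud (1) transactions
y_true = [0] * 900 + [1] * 100
# The model catches 90 of the 100 frauds and misses 10
y_pred = [0] * 900 + [1] * 90 + [0] * 10

print(recall_score(y_true, y_pred))  # 90 / (90 + 10) = 0.9
```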
F1-Score
definition
F1-score is defined as the harmonic mean of precision and recall. For multi-class classification there are three variants: weighted, macro, and micro F1. In this article we will only discuss the macro F1-score, which is the mean of the per-class F1-scores.
For binary classification, the F1-score is 2 divided by the sum of the reciprocals of recall and precision, i.e. F1 = 2 / (1/recall + 1/precision), which is the same as 2 × precision × recall / (precision + recall). For multi-class problems, the macro F1-score is the average of the F1-scores computed per class.
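As a tiny sketch of that formula, using the precision (8/11) and recall (8/10) from the spam example above:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)  # same as 2 * P * R / (P + R)

# Precision and recall from the spam detector example
print(f1(precision=8 / 11, recall=8 / 10))  # ≈ 0.76
```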
when to use
If we have to deal with imbalanced data and want an unbiased measure of performance, as if the labels were balanced, this is the metric to use.
For example, suppose we have 900 AI-generated texts and 100 human-written texts. Our model classifies 100 of the AI-generated texts as human-written and 40 of the human-written texts as AI-generated. We can calculate the F1-score as below.
AI Class
Precision = 800 / (800 + 40) ≈ 0.95, Recall = 800 / (800 + 100) ≈ 0.89, F1 ≈ 0.92
Human Class
Precision = 60 / (60 + 100) ≈ 0.38, Recall = 60 / (60 + 40) = 0.60, F1 ≈ 0.46
Macro
Macro F1 = (0.92 + 0.46) / 2 ≈ 0.69
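To double-check these numbers, here is a hedged sketch with scikit-learn, labelling AI-generated text as 1 and human-written text as 0; the exact label arrangement is made up to match the counts above.

```python
from sklearn.metrics import f1_score

# 900 AI-generated (1) and 100 human-written (0) texts
y_true = [1] * 900 + [0] * 100
# 100 AI texts misclassified as human, 40 human texts misclassified as AI
y_pred = [1] * 800 + [0] * 100 + [0] * 60 + [1] * 40

print(f1_score(y_true, y_pred, average=None))     # per-class F1: [human ≈ 0.46, AI ≈ 0.92]
print(f1_score(y_true, y_pred, average="macro"))  # ≈ 0.69
```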
Key Takeaways
After understanding the performance metrics for classification tasks, we can adjust our model to align with the business objective, which leads to a better impact on business metrics. For example, you can tune the classification threshold to trade off precision against recall.
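As an illustration of the threshold idea, here is a minimal sketch assuming a scikit-learn style classifier with predict_proba; the dataset and model are placeholders standing in for a real fraud problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Placeholder imbalanced dataset standing in for a fraud problem (~5% positives)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]  # estimated probability of "fraud"

# Lowering the threshold flags more cases: recall goes up, precision typically goes down
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold:.1f} "
          f"precision={precision_score(y, pred, zero_division=0):.2f} "
          f"recall={recall_score(y, pred):.2f}")
```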
Now, my question is this: in fraud detection, in order to avoid losses from undetected fraud, what should we do with the classification threshold: increase it or decrease it?