We express common performance metrics for classification tasks, such as recall and precision, using probabilities, then examine the F1 score and simplify it to a ratio that is easier to understand.

The F1 score \(F_1\) is usually defined as the harmonic mean of precision and recall:

\[ F_1 = \frac{2}{\text{precision}^{-1} + \text{recall}^{-1}}. \]

We will see that the F1 score can be equivalently expressed via the following metric, denoted by \(F_1'\):

\[ F_1' = \frac{\text{# errors}}{\text{# true positives}}, \]

which is less mysterious. Note that \(F_1'\) does not yield the same score values, but it captures the same relationships as the F1 score, up to a non-linear transformation.

Performance Metrics for Classification

Metrics like recall and precision are often defined using concepts like the true positive rate or the positive predictive value. I find the terminology confusing and hard to remember, so let’s have a quick look at a different way of expressing them.

We are looking at a binary classification task, where a sample can be either positive or negative, and we are generally more interested in finding the positive samples among many negative ones (think COVID tests). Our classification model makes a prediction for each sample, which is likewise positive or negative, and that prediction can be either correct (true) or incorrect (false).

A true positive (prediction) is then a correct prediction of a positive sample, while a true negative is a correct prediction of a negative sample. So far, so simple.

Before we go further, let’s construct a simple probabilistic model, where we have two events and their complements:

  • \(P\) is the event of a positive prediction;
  • \(A\) is the event of an actually positive sample.

The complements \(\bar{P}\) and \(\bar{A}\) then denote negative predictions and actually negative samples.
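
To make the events concrete, here is a minimal Python sketch (the helper name `joint_probabilities` and the toy data are my own, not from the text) that estimates the joint probabilities of \(A\) and \(P\) from ground-truth labels and predictions:

```python
import numpy as np

def joint_probabilities(actual, predicted):
    """Estimate the joint distribution of A (actually positive) and
    P (positive prediction) from boolean label/prediction arrays."""
    actual = np.asarray(actual, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    return {
        ("A", "P"): np.mean(actual & predicted),            # true positives
        ("A", "not P"): np.mean(actual & ~predicted),        # false negatives
        ("not A", "P"): np.mean(~actual & predicted),        # false positives
        ("not A", "not P"): np.mean(~actual & ~predicted),   # true negatives
    }

# Toy data: 10 samples.
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
probs = joint_probabilities(actual, predicted)
print(probs)  # p(A, P) = 0.3, p(A, not P) = 0.1, p(not A, P) = 0.1, p(not A, not P) = 0.5
```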

We can then introduce well-known performance metrics as follows:

Performance Metrics
1. Precision or positive predictive value: \[ \text{precision/}PPV = p(A \mid P) = p(\text{actually positive} \mid \text{positive prediction}).\]
2. Recall, sensitivity, or true positive rate: \[ \text{recall/} TPR = p(P \mid A) = p(\text{positive prediction} \mid \text{actually positive}).\]
3. Miss rate, false negative rate: \[ \text{miss rate/} FNR = 1 - TPR = p(\bar{P} \mid A) = p(\text{negative prediction} \mid \text{actually positive}).\]
4. Specificity, selectivity, or true negative rate: \[ \text{specificity/} TNR = p(\bar{P} \mid \bar{A}) = p(\text{negative prediction} \mid \text{actually negative}).\]
5. Fall-out, false positive rate: \[ \text{fall-out/} FPR = 1 - TNR = p(P \mid \bar{A}) = p(\text{positive prediction} \mid \text{actually negative}).\]
6. Accuracy: \[ \text{accuracy/} ACC = p(A \leftrightarrow P) = p(A, P) + p(\bar{A}, \bar{P}) = p(\text{prediction is correct}).\]
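
Continuing the sketch above, the listed metrics fall out of those joint probabilities directly (again a hypothetical helper, not code from the text):

```python
def classification_metrics(probs):
    """Compute the metrics listed above from the joint probabilities of A and P."""
    p_tp = probs[("A", "P")]             # true positive
    p_fn = probs[("A", "not P")]         # false negative
    p_fp = probs[("not A", "P")]         # false positive
    p_tn = probs[("not A", "not P")]     # true negative

    p_a = p_tp + p_fn        # p(A): actually positive
    p_p = p_tp + p_fp        # p(P): positive prediction

    return {
        "precision/PPV":   p_tp / p_p,        # p(A | P)
        "recall/TPR":      p_tp / p_a,        # p(P | A)
        "miss rate/FNR":   p_fn / p_a,        # p(not P | A)
        "specificity/TNR": p_tn / (1 - p_a),  # p(not P | not A)
        "fall-out/FPR":    p_fp / (1 - p_a),  # p(P | not A)
        "accuracy/ACC":    p_tp + p_tn,       # p(prediction is correct)
    }

print(classification_metrics(probs))  # precision = recall = 0.75, accuracy = 0.8, ...
```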

For multi-class classification, we can examine per-class events \(P_y\) and \(A_y\) separately for each class and define these quantities accordingly.

Type-1 and Type-2 Error Probability

The probability of false positives is \(p(\bar{A}, P)\). This is also the type-1 error probability because false positives are type-1 errors. Likewise, the probability of false negatives is \(p(A, \bar{P})\), which is also the type-2 error probability as false negatives are type-2 errors.

\(F_1\) Score

The F1 score is the harmonic mean between precision and recall: \[ F_1 = \frac{2}{\text{recall}^{-1} + \text{precision}^{-1}} = \frac{2}{TPR^{-1} + PPV^{-1}}.\] Note that this is equivalent to the more common product form \(F_1 = \frac{2 \, TPR \cdot PPV}{TPR + PPV}\): to see this, multiply both numerator and denominator by \(TPR \cdot PPV\). Yet this formulation does not make it easier to develop an intuition for what the score values mean.

Let’s express the definition using probabilities: \[ F_1 = \frac{2}{\frac{1}{p(P \mid A)} + \frac{1}{p(A \mid P)}} = \frac{2}{\frac{p(A)}{p(A, P)} + \frac{p(P)}{p(A, P)}} = \frac{2}{\frac{p(A) + p(P)}{p(A, P)}}. \] To simplify \(\tfrac{p(A) + p(P)}{p(A, P)}\), we can decompose \(P\) into \(P=(A \cap P) \cup (\bar{A} \cap P)\) and \(A\) into \(A=(A \cap P) \cup (A \cap \bar{P})\). Then, we have \(p(P) = p(A, P) + p(\bar{A}, P)\) and \(p(A) = p(A, P) + p(A, \bar{P})\). Substituting this, we obtain: \[\begin{aligned}\frac{p(A) + p(P)}{p(A, P)} &= \frac{p(A, P) + p(A, \bar{P}) + p(A, P) + p(\bar{A}, P)}{p(A, P)} \\&=2+\frac{p(A, \bar{P}) + p(\bar{A}, P)}{p(A, P)}.\end{aligned}\]
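
A quick numeric sanity check of this derivation (toy probabilities of my choosing) confirms that the harmonic-mean form and the \(2/(2+\cdot)\) form agree:

```python
# Toy joint probabilities: p(A, P), p(A, not P), p(not A, P).
p_tp, p_fn, p_fp = 0.3, 0.1, 0.1
p_a = p_tp + p_fn   # p(A)
p_p = p_tp + p_fp   # p(P)

recall = p_tp / p_a      # p(P | A)
precision = p_tp / p_p   # p(A | P)

f1_harmonic = 2 / (1 / recall + 1 / precision)
f1_rewritten = 2 / (2 + (p_fn + p_fp) / p_tp)

assert abs(f1_harmonic - f1_rewritten) < 1e-12
print(f1_harmonic)  # 0.75
```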

Let’s call the right-hand term the \(F_1'\) score: \[F_1' = \frac{p(A, \bar{P}) + p(\bar{A}, P)}{p(A, P)}.\]

\(p(A, \bar{P}) + p(\bar{A}, P)\) is exactly the probability that the prediction is incorrect, i.e. \(1-ACC = p(A \triangle P)\). And \(p(A, P)\) is the probability of positive samples with positive predictions, i.e. the probability of a true positive.

When we have an empirical dataset, this translates to the ratio between the number of incorrect predictions and the number of true positives: \[F_1' = \frac{\text{# errors}}{\text{# true positives}}.\]
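
With counts, this becomes a one-liner; a small sketch (the function names are my own) also shows that it recovers the usual F1 score via \(F_1 = 2/(2 + F_1')\):

```python
def f1_prime(n_errors, n_true_positives):
    """F1' = #errors / #true positives."""
    return n_errors / n_true_positives

def f1_from_counts(tp, fp, fn):
    """Standard F1 score from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn = 30, 10, 10
score = f1_prime(n_errors=fp + fn, n_true_positives=tp)
print(score)                       # 0.666...
print(2 / (2 + score))             # 0.75
print(f1_from_counts(tp, fp, fn))  # 0.75, identical
```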

\(F_1'\) can easily be computed with mental arithmetic if you know how many mistakes your model made and how many positive samples it correctly identified.

The \(F_1'\) score is \(0\) when the model makes no errors and goes \(\to \infty\) as it makes more errors. It trades off correct positive predictions against false negatives \(p(A, \bar{P})\) and false positives \(p(\bar{A}, P)\): \(F_1'\) is 1 exactly when, for every correct positive prediction, there is one false negative or one false positive. Moreover, additional errors have a linear effect on \(F_1'\): if your model makes half the errors, its \(F_1'\) score is halved; if it makes twice the errors, its \(F_1'\) score doubles.

The F1 score \(F_1\) is a non-linear transformation of the \(F_1'\) score: \[F_1=\frac{2}{2+F_1'}.\] However, this non-linear \(\frac{2}{2+\cdot}\) transformation makes it harder to gauge changes and to reason intuitively about the score values.

Converting \(F_1\) scores into \(F_1'\) scores might help. The transformation is straightforward: \[ F_1' = \frac{2}{F_1} - 2.\]
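
Both conversions as a tiny sketch:

```python
def f1_to_f1_prime(f1):
    """Convert an F1 score into the corresponding F1' score."""
    return 2 / f1 - 2

def f1_prime_to_f1(f1_prime):
    """Convert an F1' score back into the F1 score."""
    return 2 / (2 + f1_prime)

print(f1_to_f1_prime(0.75))   # 0.666...
print(f1_prime_to_f1(2 / 3))  # 0.75 (round trip)
```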

I hope this provides some intuition about the \(F_1\) score and encourages others to use the \(F_1'\) score directly.

Thanks for reading!


Appendix

The above treatment can also be applied to the more general \(F_\beta\) score.

\(F_\beta\) Score

The trade-off between false negatives \(p(A, \bar{P})\) and false positives \(p(\bar{A}, P)\) is fixed in the F1 score. To allow for a more flexible trade-off, the \(F_\beta\) scores were introduced: \[ F_\beta = (1+\beta^2) \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \beta^2 \, \text{precision}}.\] Using similar transformations as above, we can come up with a simpler \(F_\alpha'\) score: \[ F_\alpha' = \frac{\alpha \, p(A, \bar{P}) + p(\bar{A}, P)}{p(A, P)},\] where \(\alpha = \beta^2\).

It behaves similarly to the \(F_1'\) score: it is 0 when the model makes no errors, and it is 1 when, for every correct positive prediction, there are \(\tfrac{1}{\alpha}\) false negatives or one false positive.
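
From confusion counts, \(F_\alpha'\) is again easy to compute; here is a minimal sketch (the function name and the use of counts instead of probabilities are my own choices):

```python
def f_alpha_prime(tp, fp, fn, alpha=1.0):
    """F_alpha' = (alpha * #false negatives + #false positives) / #true positives."""
    return (alpha * fn + fp) / tp

# alpha > 1 penalizes false negatives more heavily; alpha < 1 penalizes false positives more.
print(f_alpha_prime(tp=30, fp=10, fn=10, alpha=1.0))  # 0.666..., equals F1'
print(f_alpha_prime(tp=30, fp=10, fn=10, alpha=4.0))  # 1.666..., i.e. beta = 2
```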


Let’s derive \(F_\alpha'\). \[ \begin{aligned} F_\beta &= (1+\beta^2) \frac{TPR \cdot PPV}{TPR + \beta^2 PPV} \\ &= (1+\beta^2) \left (\frac{\beta^2}{TPR} + \frac{1}{PPV} \right )^{-1} \\ &= (1+\beta^2) \left (\frac{\beta^2}{p(P \mid A)} + \frac{1}{p(A \mid P)} \right )^{-1} \\ &= (1+\beta^2) \left (\frac{\beta^2}{\frac{p(A, P)}{p(A)}} + \frac{1}{\frac{p(A, P)}{p(P)}} \right )^{-1}. \end{aligned} \] For the right-hand expression \(\frac{\beta^2}{\frac{p(A, P)}{p(A)}} + \frac{1}{\frac{p(A, P)}{p(P)}}\), we can again use the decomposition of probabilities: \[\begin{aligned} & \frac{\beta^2}{\frac{p(A, P)}{p(A)}} + \frac{1}{\frac{p(A, P)}{p(P)}} = \\ & \quad = \frac{\beta^2 \, p(A) + p(P)}{p(A, P)} \\ & \quad = \frac{\beta^2 \, (p(A, P) + p(A, \bar{P})) + p(A, P) + p(\bar{A}, P)}{p(A, P)} \\ & \quad = (\beta^2 + 1) + \frac{ \beta^2 p(A, \bar{P}) + p(\bar{A}, P)}{p(A, P)} \\ & \quad = (\beta^2 + 1) + \frac{ \alpha p(A, \bar{P}) + p(\bar{A}, P)}{p(A, P)}. \end{aligned}\] Hence, we define \(F_\alpha'\) as: \[ F_\alpha' = \frac{ \alpha \, p(A, \bar{P}) + p(\bar{A}, P)}{p(A, P)},\] and obtain \[ F_\beta = \frac{1+\beta^2}{1+\beta^2 + F_\alpha'}. \] As can easily be verified, this is consistent with the definitions of \(F_1\) and \(F_1'\) for \(\beta = 1\).
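
As a final numeric check (again with toy counts of my choosing), \(F_\beta\) computed from precision and recall matches the value obtained via \(F_\alpha'\):

```python
def f_beta(precision, recall, beta):
    """F_beta as the weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

def f_beta_via_f_alpha_prime(tp, fp, fn, beta):
    """F_beta = (1 + beta^2) / (1 + beta^2 + F_alpha') with alpha = beta^2."""
    f_alpha = (beta**2 * fn + fp) / tp
    return (1 + beta**2) / (1 + beta**2 + f_alpha)

tp, fp, fn = 30, 10, 10
precision, recall = tp / (tp + fp), tp / (tp + fn)
for beta in (0.5, 1.0, 2.0):
    assert abs(f_beta(precision, recall, beta)
               - f_beta_via_f_alpha_prime(tp, fp, fn, beta)) < 1e-12
print("consistent")
```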