<aside> 📌 This is part 1 of a multi-part series on evaluating forecasting methods.
</aside>
Measures of forecasting accuracy should be indicative of how well models perform on unseen (future) data. The metrics covered in this document are generally referred to as point forecast error measures (PFEMs). Such measures are usually defined in terms of the signed forecast error $e_t = \hat{y}_t - y_t$ — where $t$ is a future time-step $t \in H$ and $H$ is the forecasting horizon — and then aggregated over $H$ (usually averaged).
Many different PFEMs have been proposed, discussed, and used in both academia and industry. This can make it difficult for newcomers and practitioners to navigate the landscape and know which one(s) to choose for a given application. To a large extent, this difficulty can be attributed to the large number of degrees of freedom one has when picking or defining a custom metric (more on this below). To make matters worse, there doesn’t seem to be consensus on which measures to use for different applications, and there are even multiple incompatible definitions for the same named metric (e.g., sMAPE).
In this section, we will define the different PFEMs that will be discussed and mentioned in the rest of the document. As mentioned above, all measures of error (i.e., deviation from the actual values) are defined as a function of $e_t$:
Name | Definition |
---|---|
Absolute Error | $\text{AE}_t = \lvert e_t \rvert$ |
Squared Error | $\text{SE}_t = e_t^2$ |
Absolute Percent Error | $\text{APE}_t = \frac{100 \lvert e_t \rvert}{\lvert y_t \rvert}$ |
Symmetric Error | $\text{sE}_t = \frac{200 \lvert e_t \rvert}{\lvert y_t \rvert + \lvert \hat{y}_t \rvert}$ |
Scaled Absolute Error | $\text{SAE}_t = \frac{\lvert e_t \rvert}{\mathbb{E}[\text{AE}_t]_{\text{in-sample}}}$ |
Scaled Squared Error | $\text{SSE}_t = \frac{e_t^2}{\mathbb{E}[\text{SE}_t]_{\text{in-sample}}}$ |
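The sketch below shows how these per-step error terms can be computed with NumPy. The function name `point_errors` and its arguments are hypothetical (not from any particular library), and the in-sample errors for the scaled variants are assumed to be supplied by the caller:

```python
import numpy as np

def point_errors(y, y_hat, in_sample_ae=None, in_sample_se=None):
    """Per-step point forecast errors over a horizon.

    y, y_hat: actuals and forecasts over the horizon H.
    in_sample_ae, in_sample_se: in-sample benchmark errors used by the
    scaled variants (e.g. from a naive forecaster); optional.
    """
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    e = y_hat - y                                                # signed error e_t
    errors = {
        "AE": np.abs(e),                                         # absolute error
        "SE": e ** 2,                                            # squared error
        "APE": 100 * np.abs(e) / np.abs(y),                      # absolute percent error
        "sE": 200 * np.abs(e) / (np.abs(y) + np.abs(y_hat)),     # symmetric error
    }
    if in_sample_ae is not None:
        errors["SAE"] = np.abs(e) / np.mean(in_sample_ae)        # scaled absolute error
    if in_sample_se is not None:
        errors["SSE"] = e ** 2 / np.mean(in_sample_se)           # scaled squared error
    return errors
```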
The errors above are then aggregated over $H$ to compute the final measure of accuracy for a given forecast. Aggregation functions like the median and the geometric mean can be used. However, in this document, we will only consider the mean, which is the most popular and recommended option for many use cases. The aggregated version of the errors above can therefore be written with an $M$ prefix. For instance, the Mean Absolute Error $(\text{MAE})$ and Mean Squared Error $(\text{MSE})$ are defined as:
$$ \text{MAE} = \mathbb{E}[\text{AE}_t] = \frac{1}{H} \sum_{t=1}^{H} |e_t| \; , $$
and
$$ \text{MSE} = \mathbb{E}[\text{SE}_t] = \frac{1}{H} \sum_{t=1}^{H} e_t^2 \; . $$
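Continuing the hypothetical `point_errors` helper from the sketch above, the aggregation step is just a mean over the horizon (the numbers here are made up for illustration):

```python
# Hypothetical actuals and forecasts over a horizon of H = 4 steps.
y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_pred = np.array([11.0, 11.0, 12.0, 15.0])

errs = point_errors(y_true, y_pred)
mae = errs["AE"].mean()   # Mean Absolute Error
mse = errs["SE"].mean()   # Mean Squared Error
```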
It is also common to take the square root of the squared errors, as this transforms the final value back into the original units of measurement. For instance, the Root Mean Scaled Squared Error $(\text{RMSSE})$ is defined as:
$$ \text{RMSSE} = \sqrt{\mathbb{E}[\text{SSE}_t]} = \sqrt{\frac{\frac{1}{H}\sum_{t=1}^{H} e_t^2}{\mathbb{E}[\text{SE}_t]_{\text{in-sample}}}} = \sqrt{\frac{\frac{1}{H}\sum_{t=1}^{H} e_t^2}{\frac{1}{S}\sum_{t=1}^{S} e_t^2}} \; , $$
where $\mathbb{E}[\text{SE}_t]_{\text{in-sample}}$ is the in-sample $\text{MSE}$ for a benchmark model. The most common benchmark used for this metric is the one-step naive forecaster $(\hat{y}_t = y_{t-1})$.
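A minimal sketch of this computation, assuming the one-step naive benchmark described above is used for the in-sample denominator (the function name `rmsse` is mine, not a library call):

```python
def rmsse(y_train, y_true, y_pred):
    """RMSSE sketch: out-of-sample MSE scaled by the in-sample MSE of the
    one-step naive forecaster (y_hat_t = y_{t-1})."""
    y_train = np.asarray(y_train, dtype=float)
    naive_se = (y_train[1:] - y_train[:-1]) ** 2     # in-sample SE of the naive benchmark
    mse = np.mean((np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)) ** 2)
    return np.sqrt(mse / np.mean(naive_se))
```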
The PFEMs that we will discuss here serve two main purposes: 1) model selection, and 2) evaluation. In both instances, the chosen PFEM should provide an unbiased estimate of model performance on unseen data.
Another place where such metrics are likely to be used is during model fitting, where they serve as loss functions for learning the underlying parameters of the statistical model that best fit the training data.
An important thing to remember is that when a model is fit/tuned/selected using a specific error measure, it comes at the cost of performing worse on other PFEMs. The reasons for this differ across pairs of measures but, as we’ll discuss below, central tendency and error asymmetry play an important role here.
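A toy illustration of this trade-off, using made-up numbers: for a skewed target, the constant forecast that minimizes MAE (the median) is not the one that minimizes MSE (the mean), so optimizing one measure degrades the other.

```python
# Skewed toy series: median = 2.0, mean = 3.2.
y = np.array([1.0, 1.0, 2.0, 2.0, 10.0])

median_fc = np.full_like(y, np.median(y))   # constant forecast minimizing MAE
mean_fc = np.full_like(y, np.mean(y))       # constant forecast minimizing MSE

print("MAE:", np.abs(median_fc - y).mean(), "vs", np.abs(mean_fc - y).mean())   # 2.0 vs 2.72
print("MSE:", ((median_fc - y) ** 2).mean(), "vs", ((mean_fc - y) ** 2).mean()) # 13.2 vs 11.76
```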