
How to Conduct Time-series Forecasting on Real-world Data

Introduction

In many real-world problems, the objective is to model past data and use it to predict the future. Often we are given a series of data points recorded at regular intervals, and we want to forecast upcoming values from them.

  1. One example is predicting hourly water and electricity consumption in a building based on its consumption over the past few hours.
  2. Another example is predicting the temperature of a place for the next few hours based on the temperature over the last few hours.

Such problems can be modeled as a classic time-series prediction problem, where the goal at time T is to predict k steps into the future (i.e., Y_T = y_1, y_2, …, y_k) based on the past n values, where n is the window size (i.e., X_T = x_1, x_2, …, x_n).
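The windowing described above can be sketched in a few lines of Python. This is only an illustration: the function name make_windows and the toy series are assumptions made for this example, not part of the original problem setup.

```python
def make_windows(series, n, k):
    """Split a series into (past window, future target) pairs.

    Each element of X holds n consecutive past values; the matching
    element of Y holds the k values that immediately follow them.
    """
    num_samples = len(series) - n - k + 1
    if num_samples <= 0:
        raise ValueError("series is too short for the requested n and k")
    X = [list(series[i:i + n]) for i in range(num_samples)]
    Y = [list(series[i + n:i + n + k]) for i in range(num_samples)]
    return X, Y


# Toy series (made-up values), window size n = 3, horizon k = 2.
X, Y = make_windows([1, 2, 3, 4, 5, 6, 7], n=3, k=2)
print(X[0], Y[0])    # [1, 2, 3] [4, 5]
print(X[-1], Y[-1])  # [3, 4, 5] [6, 7]
```

Each (X, Y) pair can then serve as one training or test example for whichever forecasting model is chosen.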

Data

Let us first discuss what time-series data looks like. In a univariate time series we have one data point per time instance, whereas in a multivariate time series we have multiple data points per time instance. Consider the water consumption prediction problem. Figure 1 shows the amount of water consumed in a building for 48 consecutive hours (i.e., over a period of two days). In Figure 1, hour 1 and hour 25 correspond to the time between 12 am and 1 am on two consecutive days. We observe from the figure that water consumption changes over time, which makes this an interesting forecasting problem.

Figure 1: Water Consumption in a Building per Hour

Models

Once we have established that the problem under consideration is amenable to time-series forecasting, the next step is to select an appropriate model for the task at hand. Many models exist, and the right choice depends on properties of the data such as stationarity and seasonality. One can adopt classic statistical models such as linear regression, ARIMA, or Seasonal ARIMA, or machine learning models such as deep learning or structured regression models. We explain two of the statistical models here.

Linear Regression – A statistical model that fits the straight line that best matches the data, typically by minimizing the sum of squared errors.
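As a rough illustration of fitting a best-fit line, the sketch below uses the closed-form ordinary least squares solution for a single input variable. The function name fit_line and the hourly readings are made up for this example; treating the hour index as the only input is an assumption, not the setup used for the results in this article.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up readings for hours 1 to 5.
hours = [1, 2, 3, 4, 5]
readings = [2.0, 4.1, 5.9, 8.1, 9.9]

intercept, slope = fit_line(hours, readings)
next_hour = intercept + slope * 6  # extrapolate the line to hour 6
print(round(intercept, 2), round(slope, 2), round(next_hour, 2))  # 0.06 1.98 11.94
```

A straight line cannot follow the daily rise and fall seen in consumption data, which is one reason to compare it against other models.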

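ARIMA – Auto-Regressive Integrated Moving Average, popularly known as ARIMA, is a statistical model that comprises three terms. The first is the autoregressive term (AR), which models the current value using past values. The second is the differencing term (I), which helps make the series stationary. The third is the moving average term (MA), which models the current value using past forecast errors.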
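To make the AR and I terms concrete, here is a minimal sketch of the special case ARIMA(1, 1, 0): one round of differencing, one autoregressive coefficient, no MA term and no constant. The coefficient is estimated by simple least squares, which is a simplification chosen for this example. All function names and the toy series are assumptions; a real project would more likely rely on a library implementation such as the ARIMA class in statsmodels.

```python
def difference(series):
    """The I step: replace each value by its change from the previous value."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs):
    """The AR step: least-squares estimate of phi in d[t] = phi * d[t-1] + noise."""
    num = sum(prev * cur for prev, cur in zip(diffs, diffs[1:]))
    den = sum(prev * prev for prev in diffs[:-1])
    return num / den


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with ARIMA(1, 1, 0) (no MA term, no constant)."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    last_value, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = phi * last_diff          # predicted next change
        last_value = last_value + last_diff  # undo the differencing
        forecasts.append(last_value)
    return forecasts


# Made-up series whose increments halve at each step: 8, 4, 2, 1.
series = [0.0, 8.0, 12.0, 14.0, 15.0]
print(forecast_arima_110(series, steps=2))  # [15.5, 15.75]
```

On this toy series the estimated coefficient is 0.5, so each forecast adds half of the previous change.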

If you want to know more about how to design such models for particular problems (e.g., COVID-19 case prediction), please take a look here. For any real-world problem, it is best to evaluate a suite of models, because it is difficult to determine beforehand which model will perform best on a specific dataset.

Evaluation

As mentioned above, it is unlikely that a single model will perform well in all scenarios. To identify which model provides the best prediction performance for the task at hand, we need specific evaluation metrics. For a time-series prediction problem, two widely used metrics are 1) Root Mean Squared Error (RMSE) and 2) Mean Absolute Error (MAE).

RMSE – It is the square root of the mean of the squared differences between the actual and predicted values.

MAE – It is the mean of the absolute differences between the actual and predicted values.

Both RMSE and MAE tell us how much error there is in the predictions compared to the actual values. When we design multiple models for the same time-series prediction task, the better model is the one with the lower RMSE and MAE.
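Both metrics follow directly from their definitions. The sketch below is a plain Python version; the function names and the example values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error: mean of the absolute differences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Made-up values: three small errors and one large error of 4.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 20.0]

print(mae(actual, predicted))             # 1.5
print(round(rmse(actual, predicted), 2))  # 2.12
```

Because the errors are squared before averaging, RMSE penalizes the single large error more heavily than MAE does, so RMSE is never smaller than MAE on the same predictions.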

Figure 2: RMSE

Figure 2 shows the RMSE of multiple models on the water consumption prediction problem. In Figure 2, the models predict multiple steps into the future based on past data. As we can see, the quality of all predictions degrades as the models try to predict further into the future. This is expected, as predicting further into the future is significantly harder than predicting the near future. In Figure 2, we observe that GCRF (a structured regression model) provides the best performance (lowest RMSE), followed by an LSTM-based deep learning model, then ARIMA, and finally linear regression.
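A per-step comparison like the one in Figure 2 can be produced by computing RMSE separately for each forecast step. The sketch below shows one way to do this. The function name rmse_per_step and the numbers are hypothetical and are not the data behind Figure 2.

```python
import math


def rmse_per_step(actual_windows, predicted_windows):
    """Return one RMSE per forecast step (step 1, step 2, ..., step k)."""
    k = len(actual_windows[0])
    scores = []
    for step in range(k):
        squared = [(a[step] - p[step]) ** 2
                   for a, p in zip(actual_windows, predicted_windows)]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores


# Made-up targets and forecasts for k = 3 steps ahead (two test windows).
actual = [[5.0, 6.0, 7.0], [6.0, 7.0, 8.0]]
predicted = [[5.0, 5.0, 5.0], [6.0, 6.0, 6.0]]

print(rmse_per_step(actual, predicted))  # [0.0, 1.0, 2.0]
```

Running this for each candidate model and plotting the resulting lists against the step number gives one curve per model.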