Ensemble Regression Models for Short-term Prediction of Confirmed COVID-19 Cases

Introduction

COVID-19 is a major global pandemic that has impacted the lives of people around the world. In spite of severe lockdowns in many countries to curb its spread, more than 4 million people worldwide had tested positive for the virus by May 15, 2020. As the virus spreads unabated, a large number of individuals continue to get infected globally every day. For example, in the USA, starting from a handful of cases in early March, the number of confirmed cases had exceeded 1.4 million by May 15, 2020. Making accurate short-term predictions of the number of COVID-19 cases is critical for scaling up scarce resources such as hospital beds and ventilators as well as procuring vital medicines, particularly in developing countries. Additionally, as countries start to reopen, accurate short-term predictions are necessary to quickly identify new clusters of cases and take appropriate measures.

Therefore, in this article, our goal is to develop a regression-based ensemble model comprising Linear regression, Ridge, Lasso, ARIMA, and SVR to predict the number of COVID-19 cases in the short-term future using the number of confirmed cases in the past 14 days. The ensemble model selects the best-performing model among these five for the particular dataset under consideration. We consider the 50 countries with the highest number of confirmed cases between January 21, 2020 and April 30, 2020 and run our ensemble model on their data.

Dataset and Problem Statement

We use the data collected and distributed by the Johns Hopkins COVID-19 GitHub data repository, which provides an overview of COVID-19 cases (confirmed, deaths, and recovered) for countries around the world. The data in the repository is updated daily. For the purpose of this study, we selected the 50 countries with the highest number of confirmed COVID-19 cases. Figure 1 shows the daily number of confirmed cases for the US between January 21, 2020 and April 30, 2020.

Figure 1: US COVID-19 Cases

The confirmed COVID-19 case prediction problem can be cast as a time-series prediction problem in which we consider data for the past n time steps and predict k time steps into the future. Statistical regression models are well suited to such time-series prediction problems. Therefore, in this work, we develop a predictive modeling system that uses an ensemble of statistical regression models to predict future confirmed COVID-19 cases based on past confirmed cases.
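The framing above, with n past time steps as input and k future time steps as the target, can be illustrated with a small sliding-window helper. This is only a sketch: the function name `make_windows` and the sample counts are hypothetical and are not taken from the study.

```python
def make_windows(series, n, k):
    """Split a series into (past n values, next k values) pairs."""
    pairs = []
    # The last usable start index leaves room for n inputs and k targets.
    for start in range(len(series) - n - k + 1):
        past = series[start:start + n]
        future = series[start + n:start + n + k]
        pairs.append((past, future))
    return pairs


# Made-up case counts, for illustration only.
cases = [1, 2, 4, 7, 11, 16, 22, 29]
windows = make_windows(cases, n=3, k=2)
```

In the setting described in this article, n would be 14 (the past 14 days), and k would be the short prediction horizon, for example 3 days.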

Ensemble Regression Models

Figure 2: Ensemble model for prediction

In this section, we provide an overview of our COVID-19 confirmed case prediction system. It takes the data collected and distributed by the Johns Hopkins COVID-19 GitHub data repository as input and outputs the predicted number of confirmed cases in the future. It comprises two main components: i) the data pre-processing component, which prepares the raw data, and ii) the prediction component, which consists of five different models (Linear regression, Ridge, Lasso, ARIMA, and SVR) that take the pre-processed data and generate predictions. The ensemble layer then selects the best model for the particular dataset under consideration. Figure 2 shows the architecture of our prediction system.
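The selection step of the ensemble layer might look roughly like the sketch below. It makes several assumptions that the article does not spell out: "best" is taken to mean the lowest average relative percentage error on the last k held-out days, and two simple stand-in forecasters (`persistence` and `linear_trend`, both hypothetical) replace the five actual models so that the example stays self-contained.

```python
def persistence(history, k):
    """Stand-in model: repeat the last observed value k times."""
    return [history[-1]] * k


def linear_trend(history, k):
    """Stand-in model: least-squares line through the history, extrapolated k steps.

    Needs at least two history points.
    """
    m = len(history)
    x_mean = (m - 1) / 2
    y_mean = sum(history) / m
    sxx = sum((x - x_mean) ** 2 for x in range(m))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (m + step) for step in range(k)]


def select_best(models, history, k):
    """Score each model on the last k points and return the best name plus all scores."""
    train, held_out = history[:-k], history[-k:]
    scores = {}
    for name, forecast in models.items():
        predicted = forecast(train, k)
        # Average relative percentage error; assumes held-out actuals are non-zero.
        errors = [abs(a - p) / a * 100.0 for a, p in zip(held_out, predicted)]
        scores[name] = sum(errors) / k
    best = min(scores, key=scores.get)
    return best, scores


# Made-up, steadily increasing counts, for illustration only.
history = [10, 20, 30, 40, 50, 60, 70, 80]
models = {"persistence": persistence, "linear_trend": linear_trend}
best, scores = select_best(models, history, k=3)
```

Any forecaster that accepts a history and a horizon could be added to the `models` dictionary, so the five models named above would plug in the same way once wrapped in that interface.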

Performance Evaluation

Table 1: Three-day-ahead average RPE comparison across various countries

We use the relative percentage error (RPE) as our performance metric. Our experiments show that, for some countries, the number of cases during the first few days of the spread is low, which throws off the prediction performance of the model. Therefore, we run the models on 95%, 90%, and 85% of the data for each country and investigate the prediction performance. For example, at 95%, if the total number of cases over the entire study period is 100, we keep the days that account for at least 95 of those cases and leave out the earliest days entirely. The reason is that, for most countries, the number of cases during the first few days is small and sometimes even 0. This makes those days hard to predict: if the actual number of cases on a particular day is 0 and our model predicts 1, the relative percentage error is infinite.
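One possible reading of the trimming rule above is sketched below: drop the earliest days for as long as the remaining days still account for the chosen fraction of all cases. The function name `trim_early_days`, the use of daily new-case counts, and the sample numbers are assumptions made for illustration; the article does not give the exact procedure.

```python
def trim_early_days(daily_cases, keep_fraction):
    """Drop leading days while the remaining days still hold keep_fraction of all cases."""
    total = sum(daily_cases)
    threshold = keep_fraction * total
    remaining = total
    start = 0
    # Drop the earliest day only if what is left still meets the threshold.
    while start < len(daily_cases) - 1 and remaining - daily_cases[start] >= threshold:
        remaining -= daily_cases[start]
        start += 1
    return daily_cases[start:]


# Made-up daily new-case counts that sum to 100, for illustration only.
daily = [0, 1, 1, 2, 10, 26, 60]
trimmed = trim_early_days(daily, 0.95)
```

With these made-up numbers, the early days with 0, 1, 1, and 2 cases are dropped, and the remaining days still cover at least 95 of the 100 cases.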

Table 1 shows the RPE averaged over the 3-day prediction horizon for the various countries. The table reports the RPE as well as the percentage of data points used for generating the prediction. We observe from the table that 6 countries, including the USA, have less than 10% prediction error, while 27 countries have less than 40% prediction error. Additionally, we observe that Linear regression, Ridge, and Lasso provide similar prediction performance. This is expected because of the underlying similarity of these models and the relatively simple nature of the data.
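The article does not state the RPE formula, so the sketch below assumes the common definition |actual - predicted| / |actual| x 100 and averages it over a three-day horizon. The function names and the sample values are hypothetical. The zero-actual case returns infinity, which matches the behavior described in the text.

```python
import math


def relative_percentage_error(actual, predicted):
    """Assumed RPE definition: absolute error as a percentage of the actual value."""
    if actual == 0:
        # A non-zero prediction against an actual of 0 gives an unbounded error.
        return 0.0 if predicted == 0 else math.inf
    return abs(actual - predicted) / abs(actual) * 100.0


def mean_rpe(actuals, predictions):
    """Average the per-day RPE over the prediction horizon."""
    errors = [relative_percentage_error(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)


# Made-up actual and predicted counts for a three-day horizon.
three_day_rpe = mean_rpe([100, 200, 400], [110, 180, 400])
```

A single day with an actual count of 0 and a non-zero prediction makes the whole average infinite, which is one way to see why the earliest low-count days are left out.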

Discussion of Results

We now discuss some observations about how the number of COVID-19 cases evolved in different countries, which help explain the prediction performance. Our model provides good prediction performance for the USA and Italy, the countries with the first and third highest infection rates, respectively, but it does not perform well (RPE around 50%) for Spain, the country with the second highest infection rate. This can be attributed to the high variation in the number of cases in Spain from the middle to the end of April. In contrast, the data shows that the number of cases in the UAE is continuously increasing, which is the primary reason for the strong performance of our ensemble model for that country. In fact, if we leave out the first 20% of the data, the RPE for the UAE is only 2.7. While the government there is taking preventive action, the lockdowns have not been too severe, so the number of cases has been increasing steadily. The UAE started with a night lockdown (i.e., a 10-hour lockdown) and then imposed a complete lockdown in early April. However, the restrictions were soon eased and the country reverted to 10-hour lockdowns. India has been under a nationwide lockdown since mid-March, and parts of the country are still in lockdown. However, due to the high population density, it has seen a steady increase in the number of new cases, which results in the model providing good prediction performance. Looking at the absolute numbers, we believe that the number of people infected without the lockdown would have been significantly higher than the current number of cases.

We observe from our experiments that a number of countries, including Spain, Australia, Norway, Greece, Pakistan, Thailand, and others, have an RPE worse than 40%. A closer look shows that this set consists of both developed and developing countries, spread across the economic spectrum. While it is unclear why these countries have worse performance than others, we hypothesize that the difference in prediction performance across countries is due to the following reasons: i) the difference in testing capability of the different countries, ii) the severity of the lockdown and the social distancing measures, and iii) the veracity of the number of cases being reported by some countries.

For example, Australia saw its peak in mid-March. The strict measures taken by its government to prevent the spread of COVID-19 resulted in a significant decrease in the number of new cases by April, and thus our predictive model gave a higher RPE for those days. A similar situation is seen in the data for Japan, Norway, and Germany. All these countries reached their peak by the end of March or early April, and strict lockdown measures resulted in a drastic decline in the number of new cases. Once again, because the rate of decrease was high and could not be accurately captured from the number of cases in the last 14 days, our predictions for those days were less accurate, leading to an overall higher RPE. In comparison to these countries, Sweden asked its people to self-isolate but did not enforce a nationwide lockdown. As a consequence, COVID-19 has spread significantly and the number of cases continues to increase. Due to this combination of self-isolation and an open economy, we see oscillations in the number of cases, which makes prediction difficult.

Developing countries such as Pakistan, South Africa, and Peru have all resorted to severe lockdowns, but the number of cases is still increasing on average. One reason could be the inability to maintain social distancing while performing everyday activities in these countries because of their socio-economic situation. Additionally, we see sudden variations in the number of confirmed cases, with some days having a significantly larger number of confirmed cases than others. This could be the result of skewed testing or reporting, and such drastic variations cause our model to provide poor overall prediction performance for these countries.
