SWaP: Probabilistic Graphical and Deep Learning Models for Water Consumption Prediction

Introduction

With climate change exacerbating extreme weather conditions including droughts and famines, understanding and predicting human water consumption is critical for ensuring a sustainable future. For example, the state of California, USA experienced one of its longest droughts from December 2011 to March 2019. Similarly, the city of Cape Town, South Africa recently faced a severe water crisis in which it nearly ran out of drinking water for its citizens. As a result, predicting future water consumption in residential and commercial buildings has become an extremely important problem, particularly for efficiently monitoring water consumption, identifying possible leaks, minimizing wastage, and matching supply to demand.

In this article, we design SWaP, a Smart Water Prediction system that predicts future hourly water consumption from historical data. Water consumption prediction can be viewed as a classic time series prediction problem, which makes it amenable to statistical methods such as ARIMA as well as to more recently developed machine learning methods. To enable SWaP to make effective predictions, we explore two classes of discriminative machine learning models that have been shown to be effective for multiple time-series prediction problems: probabilistic graphical models and deep learning models. We design a structured regression graphical model, a Gaussian conditional random field (GCRF), to encode dependencies between historical and future water consumption. Specifically, we leverage and adapt a recently developed sparse and computationally efficient variant of GCRFs. We also design a Long Short-Term Memory (LSTM) based recurrent neural network (RNN) model that captures the underlying patterns in water consumption data.

Problem Statement

In this article, our goal is to design a system that predicts hourly water consumption from real-world data collected from multiple buildings on a university campus. This can be modeled as a classic time series prediction problem: at time T, the goal is to predict water consumption for k steps into the future, Y_T = (y_1, y_2, ..., y_k), based on the n most recent water consumption values, X_T = (x_1, x_2, ..., x_n).
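The windowing described above can be sketched in a few lines of Python. This is only an illustration of the problem setup, not the authors' code: the function name `make_windows` and the toy series are assumptions, while n and k follow the notation in the text.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], n: int, k: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Slice an hourly series into (past n values, next k values) pairs."""
    if n <= 0 or k <= 0:
        raise ValueError("n and k must be positive")
    inputs: List[List[float]] = []
    targets: List[List[float]] = []
    # Each forecast origin t needs n values before it and k values from it onward.
    for t in range(n, len(series) - k + 1):
        inputs.append(list(series[t - n:t]))   # X_T = (x_1, ..., x_n)
        targets.append(list(series[t:t + k]))  # Y_T = (y_1, ..., y_k)
    return inputs, targets


# Toy series standing in for one building's hourly readings (hypothetical data).
hourly = [float(v) for v in range(10)]
X, Y = make_windows(hourly, n=4, k=2)
print(len(X), X[0], Y[0])
```

Every (X_T, Y_T) pair produced this way is one training example; a longer input window gives the model more history per example but leaves fewer examples overall.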

Data

We collect hourly water consumption data for 14 buildings on a university campus. These buildings fall into 4 categories: academic building, dining hall, gym, and residence hall. The dataset comprises 6 academic buildings, 1 dining hall, 1 gym, and 6 residence halls. We collect data for approximately 4.5 months while the university is in session, from August 1, 2018 to December 8, 2018 (i.e., the Fall 2018 semester). This gives approximately 3000 data points for each building.

The figures below show the hourly water usage over 48 hours (September 6 and 7), where hour 1 and hour 25 correspond to the time between 12 am and 1 am on two consecutive days. We observe that the gym (Figure 1a) and the dining hall (Figure 1b) have their highest water usage from 9 am to 9 pm, which approximately corresponds to the hours during which these facilities are open. In comparison, water consumption in residence halls drops for around 5 hours at night, when most students are asleep. Academic buildings, in contrast, have water consumption in the same range throughout the day. We hypothesize that the long and late working hours of graduate students and the cooling needs of equipment are the main reasons for this behavior. We note that most utilities, including water and electricity, are shut down during Thanksgiving week for all campus buildings. As water consumption values are mostly zero during this week, we remove the Thanksgiving week values to keep them from distorting the model. Additionally, the dataset has around 0.3% missing values, which we fill in using linear regression.
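The two cleaning steps above, dropping a date range and filling gaps, might look like the following sketch. The article does not say which regressors the linear regression uses, so fitting a least-squares line over the hour index of nearby observed readings is an assumption, as are the helper names `drop_range` and `fill_missing`, the `window` parameter, and the placeholder dates.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Reading = Tuple[datetime, Optional[float]]  # None marks a missing value


def drop_range(readings: List[Reading], start: datetime, end: datetime) -> List[Reading]:
    """Remove readings whose timestamp falls in [start, end)."""
    return [(ts, v) for ts, v in readings if not (start <= ts < end)]


def fill_missing(values: List[Optional[float]], window: int = 3) -> List[Optional[float]]:
    """Fill each gap with a least-squares line fitted to nearby observed points."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Observed (index, value) pairs within `window` positions of the gap.
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        pts = [(j, values[j]) for j in range(lo, hi) if values[j] is not None]
        if not pts:
            raise ValueError(f"no observed neighbours near index {i}")
        mean_x = sum(j for j, _ in pts) / len(pts)
        mean_y = sum(y for _, y in pts) / len(pts)
        sxx = sum((j - mean_x) ** 2 for j, _ in pts)
        sxy = sum((j - mean_x) * (y - mean_y) for j, y in pts)
        slope = sxy / sxx if sxx > 0 else 0.0  # a single neighbour gives a flat line
        filled[i] = mean_y + slope * (i - mean_x)
    return filled


# Placeholder dates and values, chosen only to exercise the helpers.
first_hour = datetime(2018, 11, 18)
readings = [(first_hour + timedelta(hours=h), float(h)) for h in range(72)]
kept = drop_range(readings, datetime(2018, 11, 19), datetime(2018, 11, 20))
gap_filled = fill_missing([10.0, 12.0, None, 16.0, 18.0])
print(len(kept), gap_filled)
```

Fitting only on nearby readings keeps the fill local to the gap. Whether that matches the regression actually used in the study is not stated in the article.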

SWaP System

In this section, we provide an overview of SWaP, a Smart Water Prediction system that takes historical water consumption data as input and outputs future water consumption predictions. Figure 2 shows the components of our system. SWaP comprises a data pre-processing component, which pre-processes the water consumption data, and a prediction component, which applies the proposed models to the pre-processed data to generate the desired predictions. We design two models for the prediction component: a discriminative probabilistic graphical model and a deep learning model. Specifically, we design i) sparse Gaussian Conditional Random Fields (GCRFs) and ii) Long Short-Term Memory (LSTM) based deep Recurrent Neural Network (RNN) models to encode dependencies in the water consumption data.

The GCRF model is parsimonious: it captures the underlying dependencies between the input variables (i.e., the past water consumption data) and the output variables (i.e., the future water consumption predictions), as well as those among the output variables. Because we construct a sparse GCRF, the model learns only those dependencies among the input and output variables that help the prediction. In comparison, the deep learning model consists of an encoder and a decoder, each of which is an RNN. The encoder takes past water consumption data and computes a state vector that encodes the underlying dependencies in the data. The decoder then uses this state vector to generate water consumption predictions.
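For readers unfamiliar with GCRFs, one common way to write a sparse GCRF is sketched below. The article does not spell out its parameterization, so the symbols are assumptions taken from a standard formulation: x is the window of n past values, y is the vector of k future values, \Lambda couples the outputs with each other, \Theta couples inputs with outputs, S_{xx}, S_{yx} and S_{yy} are empirical second-moment matrices over the m training pairs, and \lambda is the L1 penalty weight.

```latex
% Conditional density of the k future values given the n past values
p(y \mid x) = \frac{1}{Z(x)} \exp\left( -\tfrac{1}{2}\, y^\top \Lambda y - x^\top \Theta y \right),
\qquad \Lambda \in \mathbb{R}^{k \times k},\ \Lambda \succ 0, \quad \Theta \in \mathbb{R}^{n \times k}

% The prediction is the conditional mean
\hat{y} = \mathbb{E}[y \mid x] = -\Lambda^{-1} \Theta^\top x

% L1-regularized maximum likelihood, with
% S_{yy} = \tfrac{1}{m} \sum_i y_i y_i^\top, \quad
% S_{yx} = \tfrac{1}{m} \sum_i y_i x_i^\top, \quad
% S_{xx} = \tfrac{1}{m} \sum_i x_i x_i^\top
\min_{\Lambda \succ 0,\ \Theta} \;
  -\log\det\Lambda
  + \operatorname{tr}\left( S_{yy}\Lambda + 2\, S_{yx}\Theta + \Lambda^{-1}\Theta^\top S_{xx}\Theta \right)
  + \lambda \left( \lVert \Lambda \rVert_1 + \lVert \Theta \rVert_1 \right)
```

Under this form, a zero entry in \Theta means the corresponding past hour has no direct influence on the corresponding future hour, and a zero entry in \Lambda means two future hours are conditionally independent given everything else. The L1 penalty drives such entries to zero, which is what makes the learned model sparse.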

Performance Evaluation

To evaluate the performance of SWaP, we use the hourly water consumption data collected for 14 buildings on a university campus during the Fall 2018 semester (approximately 4.5 months). We classify these buildings into 4 categories: academic building, dining hall, gym, and residence hall. The dataset comprises 6 academic buildings, 1 dining hall, 1 gym, and 6 residence halls. We compare SWaP with linear regression and ARIMA baselines in terms of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), and demonstrate that SWaP significantly outperforms the baselines.
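As a reminder of what the two metrics measure, here is a minimal sketch using their standard definitions. The helper `improvement_pct` shows one common way to express a gain over a baseline as a relative error reduction; the article does not state how its percentages were computed, so that helper and the toy numbers are assumptions.

```python
import math
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> None:
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and equally long")


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error; large misses are penalized more heavily."""
    _check(actual, predicted)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error; every miss counts in proportion to its size."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def improvement_pct(baseline_error: float, model_error: float) -> float:
    """Relative error reduction of a model over a baseline, in percent."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical hourly readings and predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]
print(rmse(actual, predicted), mae(actual, predicted))
print(improvement_pct(baseline_error=2.0, model_error=1.5))
```

RMSE and MAE are in the same units as the readings, so they can be compared across models for one building but not directly across buildings with very different usage levels.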

Table 1 shows the RMSE averaged over the 12 predicted hours for all buildings. For every building, GCRF and LSTM outperform the baselines. The performance improvement of GCRF over ARIMA and linear regression ranges from 14% to 65%, while the gains of LSTM over ARIMA and linear regression range from 7% to 62%. We also see that GCRF performs better than LSTM for most buildings. We believe that the sparsity of the L1-regularized GCRF model helps it learn the dependencies that improve prediction performance while excluding those that do not matter, yielding a model that is better suited to the data. Additionally, we observe that augmenting our models with temporal features such as time of day and day of week can improve the overall average prediction performance.
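One way to add such temporal features to a window of past readings is sketched below. The article does not describe how the features are encoded, so the cyclic sine/cosine encoding, the function name `add_temporal_features`, and the sample values are assumptions made for illustration.

```python
import math
from datetime import datetime
from typing import List, Sequence


def add_temporal_features(window: Sequence[float], forecast_start: datetime) -> List[float]:
    """Append hour-of-day and day-of-week encodings to a window of past readings."""
    # Map hour and weekday onto a circle so that 23:00 sits next to 00:00
    # and Sunday sits next to Monday.
    hour_angle = 2 * math.pi * forecast_start.hour / 24
    day_angle = 2 * math.pi * forecast_start.weekday() / 7  # Monday == 0
    return list(window) + [
        math.sin(hour_angle), math.cos(hour_angle),
        math.sin(day_angle), math.cos(day_angle),
    ]


# Hypothetical three-hour window; the forecast starts at 9 am on September 6, 2018.
features = add_temporal_features([1.2, 0.8, 1.5], datetime(2018, 9, 6, 9))
print(len(features))
```

The augmented vector can be used as the model input in place of the raw window, which lets the model separate, for example, a weekday morning from a weekend morning with similar recent usage.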

Conclusion

The above experiments demonstrate that the GCRF-based SWaP outperforms the LSTM-based SWaP overall. We therefore recommend the GCRF-based SWaP for its superior prediction performance. It also gives the system greater interpretability: because GCRF is a probabilistic graphical model, it is easy to see which inputs and past outputs are instrumental in arriving at the predictions.