Predictive Analytics for Smart Water Management in Developing Regions

Introduction

Water availability and management are pressing problems in many developing and under-developed countries. Geographic, political, management-related, and environmental factors all affect the availability of water in these regions. In this paper, we develop an ensemble-learning-based predictive-analytics framework for smart water management to predict: i) water pump operation status (e.g., functional, non-functional), ii) water quality, and iii) water quantity. In this framework, we first perform feature engineering to select relevant features, then use them to develop XGBoost and Random Forest ensemble learning models, and finally perform extensive feature analysis to identify the most predictive features for each of these prediction problems.

We evaluate our framework on two publicly available smart water management datasets pertaining to Tanzania and Nigeria and show that our proposed models outperform several baseline approaches, including logistic regression, SVMs, and multi-layer perceptrons, in terms of precision, recall, and F1 score. We also demonstrate that our models achieve superior performance in predicting water pump operation status across different water extraction methods. We conduct a detailed feature analysis to investigate the importance of the various feature groups (e.g., geographic, management) for predicting water pump operation status, water quality, and water quantity. We then perform a fine-grained feature analysis to identify how individual features, not just feature groups, impact performance. Among individual features, we find that location (x, y, z coordinates) has the greatest impact on performance. Our analysis helps in understanding the types of data that should be collected in the future to accurately predict the different water problems.

Dataset

In this section, we describe the two water management datasets from Tanzania and Nigeria used in this work. The datasets for Tanzania and Nigeria have been made publicly available by Taarifa and the Tanzanian and Nigerian Ministry of Water, respectively. The Tanzania dataset was collected using hand-held sensors, paper reports, and feedback from people using cellular phones. It has 59,401 instances and contains information such as the pump operation status, water quality, water quantity, pump location, source type, extraction technique, and population demographics in the region where the pump is installed. The Nigeria dataset has 132,542 instances and has features similar to, but fewer than, those of the Tanzania dataset.

Figure 1a: Tanzania: Pump operation status

The primary difference between the datasets is that the Nigeria dataset does not contain information regarding water quality and quantity. In the Tanzania dataset, the pump operation status takes one of three values, namely functional, functional needs repair, and non-functional, while in the Nigeria dataset it takes one of two values, functional and non-functional. The water quality in the Tanzania dataset takes the values good, milky, salty, colored, and fluoride, and the water quantity takes the values dry, enough, insufficient, and seasonal.

Figure 2: Nigeria: Comparison of pump operation status for different states

Figures 1a, 1b, and 1c show the normalized distributions of the water pump operation status, water quality, and water quantity for the different regions of Tanzania, and Figure 2 shows the normalized distribution of the water pump operation status for Nigeria. The width of each bar denotes the number of instances that correspond to a particular region or state. We make several important observations from these figures: i) the total number of recorded data points varies with region/state, ii) a significant portion of pumps are non-functional in almost all regions, with some regions such as LIN, MTW, and RUK having more than 50% non-functional pumps, and iii) the values for water quality and quantity are unevenly distributed. For example, for water quality, a large fraction of instances have the value good. For the same regions, however, a considerable fraction of pumps do not have enough water.
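
As an illustration of how a normalized per-region distribution of this kind can be computed, the following minimal Python sketch counts the status values within each region and divides by the region's total. The function name, the region names, and the five records are hypothetical and are not drawn from either dataset; the per-region totals correspond to what the bar widths encode.

```python
from collections import Counter, defaultdict

def normalized_status_by_region(records):
    """Return (totals, fractions) for an iterable of (region, status) pairs."""
    counts = defaultdict(Counter)
    for region, status in records:
        counts[region][status] += 1
    # Total number of instances per region (the bar width in the figures).
    totals = {region: sum(c.values()) for region, c in counts.items()}
    # Share of each status value within a region (the normalized bar segments).
    fractions = {
        region: {status: n / totals[region] for status, n in c.items()}
        for region, c in counts.items()
    }
    return totals, fractions

# Hypothetical toy records, for illustration only.
records = [
    ("region_a", "functional"),
    ("region_a", "non-functional"),
    ("region_a", "functional"),
    ("region_a", "non-functional"),
    ("region_b", "functional"),
]
totals, fractions = normalized_status_by_region(records)
```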

Predictive Models

We leverage two ensemble learning models, namely Random Forest and a recently developed ensemble model, XGBoost, to address the smart water management problem. Ensemble learning methods combine multiple learning algorithms to obtain better prediction performance than any of the individual algorithms in the ensemble could achieve alone. They have been shown to be effective in a number of applications, particularly in problems that involve class-imbalanced data.

Random Forest: Random Forest constructs multiple decision trees during the training phase, using bootstrapping and random attribute selection. During the test phase, each tree predicts a class, and the algorithm combines the results from the different trees into a single output, typically by majority vote. Random Forest reduces overfitting by randomly selecting a subset of attributes for constructing the trees instead of considering all the available attributes.
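
The two sources of randomness described above, bootstrapping and random attribute selection, can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not the implementation used in our experiments: each "tree" is a single-split stump, each tree sees one randomly chosen feature, and the function names, toy rows, and parameter values (n_trees, n_features, seed) are all assumptions made for the example.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """One-split 'tree' restricted to the given candidate features."""
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # Every candidate feature is constant in this sample: predict the majority class.
        label = majority(labels)
        return (feature_ids[0], 0.0, label, label)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrapping: draw training rows with replacement.
        picks = [rng.randrange(len(rows)) for _ in rows]
        # Random attribute selection: each tree sees only a subset of features.
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in picks],
                                [labels[i] for i in picks],
                                feature_ids))
    return forest

def predict(forest, row):
    # Each tree votes for a class; the forest returns the most common vote.
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Hypothetical toy data: two numeric features per pump.
rows = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (8.0, 30.0), (9.0, 32.0), (10.0, 31.0)]
labels = ["functional"] * 3 + ["non-functional"] * 3
forest = fit_forest(rows, labels)
```

Calling predict(forest, (2.0, 11.0)) returns the majority vote of the 25 stumps for that row. A full Random Forest grows deeper trees and re-samples the candidate attributes at every split, which this sketch omits for brevity.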

XGBoost: In contrast to Random Forest, XGBoost builds smaller decision trees that depend on one another. It uses a gradient boosting algorithm in which each new tree is trained to correct the errors of the trees built before it. The final output is obtained by combining the results from all the trees; in gradient boosting, this combination is a sum of the trees' outputs rather than a vote.
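
To make the idea of dependent trees concrete, the following minimal Python sketch implements plain gradient boosting with squared-error loss on a single numeric feature. It is not XGBoost itself, which additionally uses regularization and second-order gradient information, and it is not the code used in our experiments. The function names, toy data, and parameter values (n_rounds, learning_rate) are assumptions made for the example. In this simplified form, each small tree is fit to the residuals left by the trees before it, and the model's score is the base value plus the sum of the scaled tree outputs.

```python
def fit_regression_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    values = sorted(set(xs))
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from the mean target
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # The new tree nudges the running prediction toward the targets.
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict_score(model, x):
    base, learning_rate, stumps = model
    # Additive model: base value plus the scaled output of every stump.
    return base + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)

# Hypothetical toy data: one numeric feature, target 1 = non-functional.
xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
ys = [0, 0, 0, 1, 1, 1]
model = fit_boosted(xs, ys)
low_score = predict_score(model, 2.0)
high_score = predict_score(model, 9.0)
```

Thresholding the score at 0.5 turns it into a class prediction. On this toy data the score for x = 2.0 is close to 0 and the score for x = 9.0 is close to 1.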

Experimental Results

In all our experiments, we use 5-fold cross-validation: we divide the data into 5 partitions and, in each iteration, train on four partitions and evaluate the prediction performance on the remaining partition. We report the standard performance metrics of precision, recall, and F1 score for all the models.
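
The partitioning step can be sketched as follows in plain Python. This is a minimal illustration, not our experimental code: the function name is hypothetical, and shuffling with a fixed seed and leaving the folds unstratified are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Each of the 23 toy sample indices lands in exactly one test fold.
folds = list(k_fold_indices(n_samples=23, k=5))
```

A model is then trained on each train_idx and scored on the matching test_idx, and the per-fold scores are aggregated.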

Precision is defined as the ratio of the true positives to the sum of the true positives and false positives.

Recall is the ratio of the true positives to the actual positive instances in the dataset (i.e., the sum of the true positives and the false negatives).

F1 score is calculated as the harmonic mean of the precision and recall.
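
Using the standard definitions of these metrics, the per-class scores can be computed from counts of true positives, false positives, and false negatives, treating one class value as the positive class. The following Python sketch is illustrative only: the function name and the five toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for the example.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels, for illustration only.
y_true = ["functional", "functional", "non-functional", "non-functional", "functional"]
y_pred = ["functional", "non-functional", "non-functional", "functional", "functional"]
precision, recall, f1 = per_class_metrics(y_true, y_pred, positive="functional")
```

In this toy example there are 2 true positives, 1 false positive, and 1 false negative for the functional class, so precision, recall, and F1 all equal 2/3.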

We compare our models with several classic machine learning approaches such as Support Vector Machines (SVM), Logistic Regression, Multilayer Perceptrons, and Naive Bayes. Among these baselines, we report results only for SVM, which performs best on our dataset. Statistically significant differences, evaluated at a rejection threshold of p = 0.05, are shown in bold in all the tables below. We measure statistical significance between XGBoost and Random Forest, wherever relevant, to show which of these models is a better fit for the prediction problem. For scores where we cannot establish statistical significance between XGBoost and Random Forest, we report statistical significance with respect to SVM. We note that both our ensemble models perform statistically significantly better than SVM across all prediction tasks and on all performance metrics.

We report performance results for pump operation status for Tanzania and Nigeria. Table 1 gives the results for Tanzania. We observe that Random Forest and XGBoost perform better than SVM across all performance metrics. Our models achieve an improvement in F1 score over SVM of 78% for the non-functional class, 79% for the functional needs repair class, and 13% for the functional class. Looking closely at the results for individual class values, we observe from Table 1 that the proposed models perform better on the functional and non-functional classes than on the functional needs repair class for the Tanzania dataset. The main reason for the lower performance on the functional needs repair class is the scarcity of instances of this class in our dataset (as shown in Figure 1a).

Similarly, from Table 2, we observe that XGBoost and Random Forest perform better than SVM on the Nigeria dataset. For all three models, the F1 score is higher for the non-functional class than for the functional class. Our proposed models achieve a performance improvement of 39% on the functional class when compared to SVM.