Predicting
Water Point Functionality
in Tanzania

1) Introduction

As I wrap up Module 3 of Flatiron’s Data Science bootcamp, I’ll be tackling a DrivenData competition, Pump It Up: Data Mining the Water Table.

Follow along below, or take a look at the Jupyter notebook and repo on GitHub.

The competition provides a dataset of water points in Tanzania and their associated characteristics. It is our job to predict, using the supplied training labels, whether a pump is functional, non-functional, or functional but in need of repair.

Below, I’ll build a model to predict water point status for the competition’s test dataset. Let’s get started!

2) Define the relevant classes

I’ll be taking an object-oriented approach to this project and will begin by defining the classes and constants I’ll need.

2A) Accessing the data

Paths. We’ll define our path strings within a simple dictionary for easy loading:
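
A minimal sketch of that dictionary (the directory and file names are my assumptions, based on the standard DrivenData downloads):

```python
# Path strings for the raw competition files. File names here are
# assumptions; substitute whatever you named the DrivenData downloads.
PATHS = {
    "train_values": "data/train_values.csv",
    "train_labels": "data/train_labels.csv",
    "test_values": "data/test_values.csv",
}
```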

Data Loader. The Data Loader will load the appropriate CSVs. This helper class includes an option (run_type_dev) to downsample our dataset as needed:
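
A sketch of what that might look like; run_type_dev matches the option named above, while dev_frac and the seed are illustrative parameters of my own:

```python
import pandas as pd

class DataLoader:
    """Loads the competition CSVs; can downsample for quick dev iterations."""

    def __init__(self, paths, run_type_dev=False, dev_frac=0.1, seed=42):
        self.paths = paths
        self.run_type_dev = run_type_dev
        self.dev_frac = dev_frac
        self.seed = seed

    def load(self, key):
        df = pd.read_csv(self.paths[key])
        if self.run_type_dev:
            # Keep a fraction of rows for speed; the fixed seed means the
            # values and labels files sample the same rows.
            df = df.sample(frac=self.dev_frac, random_state=self.seed)
        return df
```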

2B) Cleaning the data

VizHelper. The VizHelper will output relevant visualizations to inform iterative cleaning and analysis:
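
A sketch of the kinds of plots it might produce; the three methods below map to the exploratory findings in Section 3 (status_group is the competition’s actual label column):

```python
import matplotlib.pyplot as plt
import seaborn as sns

class VizHelper:
    """Produces the exploratory plots discussed in Section 3."""

    def plot_class_balance(self, labels):
        # Bar chart of the outcome variable: reveals the class imbalance.
        labels["status_group"].value_counts().plot(kind="bar")
        plt.title("Water point status counts")
        plt.tight_layout()
        plt.show()

    def plot_correlations(self, df):
        # Heatmap of pairwise correlations among the numeric predictors.
        sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm", center=0)
        plt.show()

    def plot_boxplots(self, df):
        # Boxplots of the numeric predictors to eyeball outliers.
        df.select_dtypes("number").plot(kind="box", subplots=True,
                                        layout=(-1, 4), figsize=(14, 8))
        plt.tight_layout()
        plt.show()
```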

Cleaner. Our Cleaner will perform basic cleaning tasks on our raw data (eliminating impossible 0 values and correcting column data types):
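
A sketch under the assumption that zeros in columns like longitude and construction_year encode missing data (longitude 0 is outside Tanzania, and construction year 0 is meaningless); the exact column set is illustrative:

```python
import numpy as np
import pandas as pd

class Cleaner:
    """Basic cleaning: impossible zeros become NaN, dtypes are corrected."""

    # Columns where 0 is implausible and really means "missing"
    # (an assumed set, for illustration).
    ZERO_MEANS_MISSING = ["longitude", "construction_year", "population"]

    def clean(self, df):
        df = df.copy()
        for col in self.ZERO_MEANS_MISSING:
            df[col] = df[col].replace(0, np.nan)
        # date_recorded arrives as a string; parse it into a real datetime.
        df["date_recorded"] = pd.to_datetime(df["date_recorded"])
        return df
```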

2C) Build the pipeline

Splits Manager. The Splits Manager will give us easy access to our train-test datasets (without the key-typo risk that plain dictionary access invites):
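
One way to get that safety is a small dataclass, so a mistyped name fails fast as an AttributeError and editors can autocomplete the fields (a sketch):

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class SplitsManager:
    """Named fields instead of dict keys for the four split pieces."""
    X_train: pd.DataFrame
    X_test: pd.DataFrame
    y_train: pd.Series
    y_test: pd.Series
```

Since sklearn’s train_test_split returns its pieces in exactly this field order, the object can be built with SplitsManager(*train_test_split(X, y, stratify=y)).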

Pre-Processor. The Pre-Processor defines the transformations our pipeline will use. Since our dataset is imbalanced (a histogram of our outcome variable shows that ‘functional’ is vastly over-represented relative to the other two classes), we will oversample the minority classes to achieve a more balanced training set:
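
A sketch of those transformations using imbalanced-learn’s RandomOverSampler for the oversampling step; the imputation and encoding choices here are my assumptions:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(numeric_cols, categorical_cols):
    """Impute + scale numeric columns; impute + one-hot encode categoricals."""
    return ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         categorical_cols),
    ])

# Randomly duplicates minority-class rows until all three classes are equal.
oversampler = RandomOverSampler(random_state=42)
```

Because RandomOverSampler is a sampler rather than a transformer, it has to live in an imblearn Pipeline instead of a plain sklearn one; the wiring appears in the kickoff snippet in Section 3.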

Classifiers. We can store our classifiers in a dictionary so that we can easily iterate over them during analysis. Notice that SVM has been commented out of the list of classifiers in circulation because my local computing resources couldn’t handle the workload:
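
A sketch of that dictionary, with the commented-out SVM mirroring the note above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
# from sklearn.svm import SVC

CLASSIFIERS = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=42),
    "xgboost": XGBClassifier(random_state=42),
    # "svm": SVC(),  # disabled: training never finished on local hardware
}
```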

Param Grids. We’ll also store the corresponding param grid for each classifier in a dictionary. Note that, again, entries have been iteratively disabled (XGBoost’s grid this time) due to computing constraints:
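
A sketch with illustrative grid values (not the exact values I searched); the clf__ prefix targets the classifier step of the pipeline, and the empty XGBoost grid reflects the disabling noted above:

```python
PARAM_GRIDS = {
    "decision_tree": {"clf__max_depth": [5, 10, 20]},
    "knn": {"clf__n_neighbors": [5, 9, 15]},
    "random_forest": {"clf__n_estimators": [100, 200],
                      "clf__max_depth": [10, 20, None]},
    "xgboost": {},  # grid disabled for compute reasons; defaults only
    # "svm": {"clf__C": [0.1, 1, 10]},
}
```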

Results Manager. The Results Manager will store and organize the results:
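
A sketch of a minimal results store (the field names are my own):

```python
import pandas as pd

class ResultsManager:
    """Accumulates one row of results per classifier for easy comparison."""

    def __init__(self):
        self.rows = []

    def record(self, name, accuracy, best_params):
        self.rows.append({"classifier": name, "accuracy": accuracy,
                          "best_params": best_params})

    def summary(self):
        # One row per classifier, best accuracy first.
        return pd.DataFrame(self.rows).sort_values("accuracy", ascending=False)
```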

Report Manager. The Report Manager will use the shared keys in these two dictionaries to iterate over each classifier and score its performance:
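
A sketch of that loop using GridSearchCV; pipeline_factory is a hypothetical helper that wraps a classifier in the pre-processing + oversampling pipeline from Section 2C:

```python
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import GridSearchCV

class ReportManager:
    """Fits and scores every classifier, keyed identically to its param grid."""

    def __init__(self, classifiers, param_grids, results):
        self.classifiers = classifiers
        self.param_grids = param_grids
        self.results = results

    def run(self, pipeline_factory, splits):
        for name, clf in self.classifiers.items():  # shared keys drive the loop
            pipe = pipeline_factory(clf)
            search = GridSearchCV(pipe, self.param_grids[name], cv=3, n_jobs=-1)
            search.fit(splits.X_train, splits.y_train)
            preds = search.predict(splits.X_test)
            # Confusion matrix per classifier; accuracy into the results store.
            ConfusionMatrixDisplay.from_predictions(splits.y_test, preds)
            self.results.record(name, accuracy_score(splits.y_test, preds),
                                search.best_params_)
```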

2D) Tying it all together

Data Manager. The Data Manager is responsible for coordinating the different helper classes responsible for cleaning, pipeline-creation, visualization, and analysis:
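
A sketch of that coordination; splits_factory and pipeline_factory are hypothetical glue functions, defined in the kickoff snippet in Section 3:

```python
class DataManager:
    """Coordinates loading, cleaning, visualization, and modeling."""

    def __init__(self, loader, cleaner, viz, report, splits_factory):
        self.loader = loader
        self.cleaner = cleaner
        self.viz = viz
        self.report = report
        self.splits_factory = splits_factory

    def run(self, pipeline_factory):
        X = self.cleaner.clean(self.loader.load("train_values"))
        labels = self.loader.load("train_labels")
        self.viz.plot_class_balance(labels)  # exploratory plots first
        splits = self.splits_factory(X, labels["status_group"])
        self.report.run(pipeline_factory, splits)
```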

3) Analyze Results & Takeaways

Now that we’ve defined our classes, we can kick off the analysis:
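
Wiring it all together might look like the following; the column lists and split parameters are illustrative, and the label encoding accommodates newer XGBoost releases, which require integer targets:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import train_test_split

def make_splits(X, y):
    # Newer XGBoost releases require integer-encoded targets.
    y = y.map({"functional": 0, "functional needs repair": 1,
               "non functional": 2})
    return SplitsManager(*train_test_split(X, y, test_size=0.25,
                                           random_state=42, stratify=y))

def make_pipeline_for(clf):
    numeric = ["amount_tsh", "gps_height", "population", "construction_year"]
    categorical = ["basin", "extraction_type", "waterpoint_type"]
    return ImbPipeline([
        ("prep", build_preprocessor(numeric, categorical)),
        ("oversample", RandomOverSampler(random_state=42)),
        ("clf", clf),  # the step the clf__ param-grid keys point at
    ])

loader = DataLoader(PATHS)
results = ResultsManager()
report = ReportManager(CLASSIFIERS, PARAM_GRIDS, results)
DataManager(loader, Cleaner(), VizHelper(), report, make_splits).run(make_pipeline_for)
print(results.summary())
```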

Let’s first consider the implications of our exploratory visualizations:

1) Our predictors are not highly correlated. We do not need to drop features out of concern for multicollinearity.

2) Our outcome variable is extremely imbalanced. We will compensate for this by oversampling the underrepresented classes in our pipeline.

3) Our boxplots indicate that our predictors are fairly free of outliers.

Turning to our confusion matrices, let’s review the relative performance of our different classifiers:

1) Both our Decision Tree and K-Nearest Neighbors classifiers performed well (73.7% and 75.1% accuracy, respectively).

2) Random Forest and XGBoost performed best, with XGBoost (77.6%) taking a slight lead over Random Forest (77.0%).

3) The Support Vector Machine could not be trained locally with the available computing resources. In future projects, I will train my models on cloud-based servers.

Additional avenues for exploration

In the future, it would be exciting to consider how the frequency of conflict incidents related to water-resource usage might impact the model’s performance!
