Exploring King County Housing Data
1) Introduction
As I wrap up Module 2 of Flatiron School’s Data Science bootcamp, I will be conducting multiple linear regression on a subset of the King County Housing Sale Price dataset. I’ve referenced the King County Realtor Glossary to interpret the feature names included in the dataset. Follow along below, or take a look at the Jupyter notebook.
To guide my analysis, I began by asking three questions that were of interest to me as I read through the King County Realtor Glossary:
1) What is the effect on sale price of classification as a low grade* property?
2) What is the effect on sale price of classification as a high condition* property?
3) What is the effect on sale price of having a waterfront?
*as defined by the King County Assessor’s Office
I’ll try to answer these questions using multiple linear regression — let’s get started!
2) Load the data
Import the relevant libraries:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from statsmodels.formula.api import ols
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error, mean_squared_log_error
Read in and preview the data:
df = pd.read_csv('kc_house_data.csv')
df.head()
Examining the data, we see that we have 21,597 rows:
df.shape
# (21597, 21)
We can call describe() on the dataframe to get an overview of the descriptive statistics corresponding to each attribute:
df.describe()
3) Prepare the data
Reviewing the data types, we see that both date and sqft_basement are stored as strings. We will need to address these:
df.dtypes
# id int64
# date object
# price float64
# bedrooms int64
# bathrooms float64
# sqft_living int64
# sqft_lot int64
# floors float64
# waterfront float64
# view float64
# condition int64
# grade int64
# sqft_above int64
# sqft_basement object
# yr_built int64
# yr_renovated float64
# zipcode int64
# lat float64
# long float64
# sqft_living15 int64
# sqft_lot15 int64
# dtype: object
Dropping irrelevant columns
We don’t need ‘id’ for our analysis — we can drop this column:
df.drop(['id'], axis=1, inplace=True)
df.reset_index(inplace=True, drop=True)
Identify null values
Our dataset consists of 21,597 rows. The total number of missing values (2376 + 63 + 454 + 3842 = 6735, or roughly 31%; the 454 here counts the ‘?’ placeholders in sqft_basement) is far greater than 5%, meaning that simply dropping the affected records would likely have a significant impact on our analyses.
Let’s examine each attribute individually to determine the appropriate course of action.
df.isna().sum()
# date 0
# price 0
# bedrooms 0
# bathrooms 0
# sqft_living 0
# sqft_lot 0
# floors 0
# waterfront 2376
# view 63
# condition 0
# grade 0
# sqft_above 0
# sqft_basement 0
# yr_built 0
# yr_renovated 3842
# zipcode 0
# lat 0
# long 0
# sqft_living15 0
# sqft_lot15 0
# dtype: int64
We need to decide how to handle null values in sqft_basement, waterfront, view, and yr_renovated.
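As a quick sanity check on the 31% figure above, we can tally the affected values ourselves (a minimal sketch; it counts the ‘?’ placeholders in sqft_basement, which we’ll deal with shortly, alongside the true nulls):
# Sketch: tally true nulls plus the '?' placeholders hiding in sqft_basement
n_null = df[['waterfront', 'view', 'yr_renovated']].isna().sum().sum()
n_question = (df['sqft_basement'] == '?').sum()
print(n_null + n_question, 'affected values out of', len(df), 'rows')
# Note: a row can be affected more than once, so this is an upper bound on affected rows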
Clean the categorical features
Identify and isolate the categorical features:
cat_features = ['waterfront', 'view', 'condition', 'grade', 'zipcode', 'date']
df_cat = df[cat_features]
Date is currently stored as a string. If we inspect the unique years, we see that only 2014 and 2015 are present in the data, so we can simplify date to reflect only the relevant year:
pd.DatetimeIndex(pd.to_datetime(df_cat['date'])).year.unique()
df_cat['date'] = pd.DatetimeIndex(pd.to_datetime(df['date'])).year
Waterfront and View are both categorical features. Waterfront has 2376 missing values (roughly 11% of all records) while view has only 63. The great news is that we don’t have much heavy lifting to do here: all we have to do is tell get_dummies to create an indicator column for NaN values (dummy_na=True) when we one-hot encode:
df_cat = df_cat.astype('str')
df_ohe = pd.get_dummies(df_cat, prefix=cat_features, dummy_na=True, drop_first=True)
Clean the continuous features
Our continuous features are those that are not categorical:
cont_features = [feat for feat in df.columns.tolist() if feat not in cat_features]
df_cont = df[cont_features]
In addition to missing values, yr_renovated contains a problematic value: 0, which does not make logical sense as a year. Let’s replace both the 0s and the missing values with the median:
def impute_median(df, col):
    df_one_col = df[[col]]
    df_one_col.fillna(df_one_col.median(), inplace=True)
    df[col] = df_one_col[col]

df_cont['yr_renovated'] = df_cont.apply(
    lambda row: np.nan if row['yr_renovated'] == 0 else row['yr_renovated'],
    axis=1)
impute_median(df_cont, 'yr_renovated')
Values for Basement square footage are currently stored as strings — and these include question marks! We’ll convert the entire column to numeric values, replacing nulls and question marks (missing values) with the median.
df_cont['sqft_basement'] = df_cont.apply(
    lambda row: np.nan if row['sqft_basement'] == '?' else float(row['sqft_basement']),
    axis=1)
impute_median(df_cont, 'sqft_basement')
Before we can log-transform our continuous values, we need to apply an offset to handle any zeros or negatives:
def handle_zeros_pre_log(df, col):
    no_zeros = df.loc[df[col] > 0]
    col_min = no_zeros[col].min()
    offset = col_min / 2
    df[col] = df.apply(
        lambda row: row[col] + offset,
        axis=1)

def handle_negatives_pre_log(df, col):
    col_min = abs(df[col].min()) + 1
    df[col] = df.apply(
        lambda row: row[col] + col_min,
        axis=1)

handle_zeros_pre_log(df_cont, 'sqft_basement')
handle_negatives_pre_log(df_cont, 'long')
Now log-transform and normalize:
df_log = np.log(df_cont)
cont_features = [f'{column}_log' for column in df_cont.columns]
df_log.columns = cont_features

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

df_log_norm = df_log.apply(normalize)
Recombine features
Now we’ll ‘glue’ our continuous and categorical features back together in a single dataframe:
df_clean = pd.concat([df_log_norm, df_ohe], axis=1)
df_clean = df_clean.astype('float')
df_clean.head()
Let’s verify that our log transformation has not introduced any nan values:
df_clean.isna().sum().max()
# 0
We’ll generate a correlation matrix to return pairwise correlations and identify highly correlated predictors. We will want to be aware of these correlations as we narrow in on our primary predictors. We are interested in pairs whose correlation has an absolute value roughly between 0.75 (strong correlation) and 1 (perfect correlation).
(df_clean[cont_features].corr()).head() # Snapshot of correlation test results
With this many predictor variables, it can be easier to visualize the correlations with a heat map. In the heat map below, we can identify relationships to explore in greater detail.
Note the high positive correlation in pairs for which both variables relate to square footage: for example, sqft_above and sqft_living.
def generate_heat_map(df, features):
    plt.figure(figsize=(7, 6))
    sns.heatmap(df[features].corr(), center=0)

generate_heat_map(df_clean, cont_features)
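If we want the exact pairs behind the darkest cells in the heat map, we can filter the correlation matrix programmatically (a small sketch reusing the 0.75 threshold mentioned above):
# Sketch: list feature pairs whose absolute correlation exceeds 0.75
corr = df_clean[cont_features].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack()
print(pairs[pairs > 0.75].sort_values(ascending=False))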
The correlation matrix indicates a very strong (~0.91) correlation between sqft_lot15_log and sqft_lot_log. This makes sense, since we would expect lots within the same neighborhood to be of similar size. Let’s remove sqft_lot15_log and run the heat map analysis again:
def remove_col(df, col, features):
df.drop([col], axis=1, inplace=True)
features.remove(col)
remove_col(df_clean, 'sqft_lot15_log', cont_features)
generate_heat_map(df_clean, cont_features)
Much better.
We also can get a sense of the outliers in our data using boxplots.
def plot_boxplot(df, cols):
    fig, axes = plt.subplots(4, 4, figsize=(25, 15))
    axe = axes.ravel()
    for i, xcol in enumerate(cols):
        sns.boxplot(x=df[xcol], ax=axe[i])

plot_boxplot(df_clean, cont_features)
After some experimentation with removing outliers beyond three standard deviations, I’ve decided to keep them. Housing prices vary widely, and outliers can provide meaningful information about luxury properties as well as dilapidated ones.
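For reference, the three-standard-deviation filter I experimented with looked roughly like this (a sketch; because the continuous features are already normalized, the cutoff is simply an absolute value of 3):
# Sketch: drop rows where any normalized continuous feature lies beyond 3 standard deviations
within_3_sd = (df_clean[cont_features].abs() <= 3).all(axis=1)
df_no_outliers = df_clean[within_3_sd]
print(len(df_clean) - len(df_no_outliers), 'rows would be removed')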
4) Preview the relationships
We’ll get a feel for the general trends in our data by previewing the relationship between each continuous predictor and the outcome variable (price) in a scatter plot.
def plot_scatterplot(df, cols, outcome):
    fig, axes = plt.subplots(3, 4, figsize=(25, 15))
    axe = axes.ravel()
    for i, xcol in enumerate(cols):
        df.plot(kind='scatter', x=xcol, y=outcome, alpha=0.4,
                color='b', ax=axe[i])

outcome = 'price_log'
preds = [i for i in cont_features if i != outcome]
plot_scatterplot(df_clean, preds, outcome)
In the plots below, we can already visually detect a strong positive correlation between price and sqft_living, bathrooms, sqft_above, sqft_basement and sqft_living15. Evidently, larger houses command higher prices.
A weaker positive correlation is also visible between price and bedrooms. We can also see a clear indication in sqft_basement and yr_renovated of where we replaced missing values with the median.
5) Transform the data
Define the functions we’ll use with the training and test data subsets:
def get_test_and_training(df, preds, outcome):
    # Create X and y
    y = df[outcome]
    X = df[preds]
    # Split data into training and test sets
    return train_test_split(X, y, test_size=0.2, random_state=10)

def get_ols(X, y):
    results = sm.OLS(y, X).fit()
    print(results.summary())
    return results
Set those values and perform OLS regression:
preds = [i for i in df_clean.columns if i != outcome]
X_train, X_test, y_train, y_test = get_test_and_training(df_clean, preds, outcome)
result = get_ols(X_train, y_train)
Reviewing the output, we now want to remove all features associated with a p-value greater than our alpha of 0.05. Let’s isolate these feature names in a list called exclude and then re-run the analysis with an updated predictor list:
exclude = [
    'zipcode_98178',
    'zipcode_98155',
    'zipcode_98148',
    'zipcode_98133',
    'zipcode_98055',
    'zipcode_98030',
    'zipcode_98031',
    'zipcode_98028',
    'zipcode_98019',
    'zipcode_98011',
    'zipcode_98002',
    'grade_3',
    'view_nan',
    'condition_nan',
    'waterfront_nan',
    'condition_4',
    'zipcode_nan',
    'date_nan'
]

preds = [i for i in df_clean.columns if i not in exclude and i != outcome]
X_train, X_test, y_train, y_test = get_test_and_training(df_clean, preds, outcome)
result = get_ols(X_train, y_train)
Great! We’ve eliminated non-significant predictors from our model *and* we’ve got an R-squared value that’s not too shabby.
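As an aside, rather than typing those feature names out by hand, we could pull them straight from the fitted results object that get_ols returns (a minimal sketch using the pvalues attribute of the statsmodels results and our alpha of 0.05):
# Sketch: collect the names of predictors whose p-values exceed our alpha of 0.05
alpha = 0.05
high_p = result.pvalues[result.pvalues > alpha].index.tolist()
print(high_p)  # candidates for the 'exclude' list above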
6) Build a linear regression model
Now we’ll fit the model with scikit-learn’s LinearRegression to establish a baseline R-squared value and evaluate the mean squared errors:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Print R2 and MSE for training and test sets
print('Training r²:', linreg.score(X_train, y_train))
print('Test r²:', linreg.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, linreg.predict(X_train)))
print('Test MSE:', mean_squared_error(y_test, linreg.predict(X_test)))

# Training r²: 0.8872475048229136
# Test r²: 0.8807761768707825
# Training MSE: 0.1129412222332609
# Test MSE: 0.11839812810173139
Neat! Not only are our r-squared values fairly high (0.887 and 0.881), but our mean squared errors (0.113 and 0.118) are nearly identical, which suggests the model is not overfitting.
7) Answering our questions
Each coefficient in the earlier OLS output reflects the effect of the corresponding feature on price when all other variables are held constant. (Note that the documentation does not indicate what scaling has been applied to the price column, so “sale price units” is used in place of specific dollar amounts.)
Now that the linear regression output has given us reason to feel confident in our model, we can finally tackle the questions we posed at the start of this notebook.
For example, we see that sqft_living has a significant (but low to moderate) coefficient of 0.1785. We interpret this as: “each one-unit increase in living space (after log transformation and normalization) yields a 0.1785 increase in sale price units.”
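If you’d rather read a coefficient programmatically than scan the summary table, the fitted results object exposes them by name (a small sketch using the result returned by get_ols above):
# Sketch: look up a single coefficient by feature name from the fitted OLS results
print(result.params['sqft_living_log'])  # ~0.1785, matching the summary output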
Question 1: What is the effect on sale price of classification as a low grade property?
The lowest grade in our dataset is grade 4 (*”Generally older low quality construction. Does not meet code.”*). It’s possible that grades 0–3 are not present in our dataset because such structures do not meet the minimum standards for sale in King County.
- Grade 4 (coeff = -0.7861): Having a grade code of 4 yields a 0.7861 decrease in sale price units.
I am surprised that the effect of a grade 4 classification is lower than that of a grade 5 (“Lower construction costs and workmanship. Small, simple design”) classification.
Question 2: What is the effect on sale price of classification as a high condition property?
The highest condition included in our dataset and in the range of possible values is condition 5 (*Very good: “Excellent maintenance and updating on home. Not a total renovation.”*).
- Condition 5 (coeff = 0.1305): Having a condition code of 5 yields a 0.1305 increase in sale price units.
Note that the lower conditions (condition 3, coeff = -0.0751; condition 2, coeff = -0.3391) predictably have the effect of decreasing the sale price.
Question 3: What is the effect on sale price of having a waterfront?
It should come as no surprise that properties with a waterfront view can command significantly higher sale prices:
- Waterfront_1 (coeff = 0.8910): Having a waterfront view yields a 0.8910 increase in sale price units.
8) Conclusion
Overall, the patterns that emerge in the above analyses do not deviate from what a layman might expect to find in the housing market:
> Bigger, higher quality houses with attractive views tend to command higher sale prices.
> Looking to increase your property’s market value? Invest in structural integrity and protect against degradation over time.
Sources
* Outliers: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba, https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-data-frame
* Log transformations: https://discuss.analyticsvidhya.com/t/methods-to-deal-with-zero-values-while-performing-log-transformation-of-variable/2431/2
* Stepwise regression: https://towardsdatascience.com/stepwise-regression-tutorial-in-python-ebf7c782c922
* Interpreting coefficients: https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/