1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and Relevant Packages

To better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load only the training data set; the others will be loaded and used later.


Use the code block below to load any necessary packages.
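A minimal sketch of such a block, assuming the packages needed are dplyr, ggplot2, and BAS. BAS is the package named in this report's output; dplyr and ggplot2 are inferred from the `%>%`, `filter()`, and `ggplot()` calls used in the analysis. The names `pkgs` and `attached` are hypothetical.

```r
# Assumed package list, inferred from the calls used in this report.
pkgs <- c("dplyr", "ggplot2", "BAS")

# Attach each package if it is installed; report the ones that are missing
# instead of stopping with an error.
attached <- vapply(
  pkgs,
  function(p) suppressWarnings(require(p, character.only = TRUE, quietly = TRUE)),
  logical(1)
)

if (any(!attached)) {
  message("Not installed: ", paste(pkgs[!attached], collapse = ", "))
}
```

In an interactive session, plain `library(dplyr)`, `library(ggplot2)`, and `library(BAS)` calls do the same job; the loop above only makes a missing package easier to spot.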

## Warning: package 'BAS' was built under R version 3.4.2

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).

The first step was to convert the MS.SubClass variable from an integer to a factor, since its values are dwelling-type codes rather than quantities. With that complete, I could continue with the EDA. Next, I restricted the analysis to normal sales, so that all of my models and their diagnostics were based on normal sales rather than partial, abnormal, or other sale conditions. From there, I picked a set of variables that I expected to correlate with the price of a home (Chunk 1) and noted their correlations, since strongly correlated variables may make good covariates in the model. Lastly, I plotted the most promising covariates in Figures 2-4, which show the relationship between the log price of the home and the log area, the overall quality, and the year the home was built.

Chunk 1

ames_train$MS.SubClass <- as.factor(ames_train$MS.SubClass)

# build a new df with only normal sales so abnormal or partial sales don't corrupt the EDA
normal.sales <- ames_train %>%
  filter(Sale.Condition == 'Normal')

# find correlations for variables that should affect the price of the home 
# log transform non-normal variables

cor <- data_frame(
  var = c('log.area','log.lot.area','overall.quality','overall.condition','year.built','year.sold','bedrm','rooms','full.bath'),
  cor = c(

## # A tibble: 9 x 2
##                 var         cor
##               <chr>       <dbl>
## 1          log.area  0.75795218
## 2      log.lot.area  0.39562223
## 3   overall.quality  0.82061753
## 4 overall.condition -0.07574616
## 5        year.built  0.60597666
## 6         year.sold  0.01187609
## 7             bedrm  0.27207018
## 8             rooms  0.54218041
## 9         full.bath  0.59047853
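The correlation step in Chunk 1 could be sketched as follows. This is an illustration under stated assumptions, not the original code: it assumes each variable is correlated against `log(price)` (matching the figures), it uses base R rather than a tibble so that it runs without extra packages, and it runs on a small made-up data frame `toy` in place of `normal.sales`. The column name `Lot.Area` and the names `toy`, `predictors`, and `cor_table` are assumptions; the result is called `cor_table` rather than `cor` so that it does not share a name with the `cor()` function.

```r
# Toy stand-in for normal.sales; every value here is made up for illustration.
set.seed(1)
n <- 50
toy <- data.frame(
  area         = round(runif(n, 600, 3000)),
  Lot.Area     = round(runif(n, 3000, 20000)),   # assumed column name
  Overall.Qual = sample(1:10, n, replace = TRUE),
  Year.Built   = sample(1900:2010, n, replace = TRUE)
)
# Simulated price on the log scale, so it is always positive.
toy$price <- exp(6 + 0.8 * log(toy$area) + 0.1 * toy$Overall.Qual +
                   rnorm(n, 0, 0.2))

# Candidate predictors, log-transforming the right-skewed size variables.
predictors <- list(
  log.area        = log(toy$area),
  log.lot.area    = log(toy$Lot.Area),
  overall.quality = toy$Overall.Qual,
  year.built      = toy$Year.Built
)

# Assumption: each correlation is taken against log(price).
cor_table <- data.frame(
  var = names(predictors),
  cor = vapply(predictors, function(x) cor(x, log(toy$price)), numeric(1)),
  row.names = NULL
)

# Strongest relationships first.
print(cor_table[order(-abs(cor_table$cor)), ])
```

Replacing `toy` with `normal.sales` and extending `predictors` with the remaining variables would produce a table with the same shape as the one above.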

Figure 2

ggplot(normal.sales, aes(x = log(area), y = log(price))) +
  geom_point() +
  xlab('Log of Area of Home') +
  ylab('Log of Home Price')

Figure 3

ggplot(normal.sales, aes(x = Overall.Qual, y = log(price))) +
  geom_point() +
  xlab('Overall Quality') +
  ylab('Log of Home Price')

Figure 4

ggplot(normal.sales, aes(x = Year.Built, y = log(price))) +
  geom_point() +
  xlab('Year Built') +
  ylab('Log of Home Price')